IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
1) Exit early if the list is empty.
2) splice the list into a local list,
so that we block hard irqs only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231003181920.3280453-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of hb timer. As the default value is 30s, users
have to wait up to 30s to wait its hb_interval update to work.
This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.
Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent(from the timer)
mentioned in:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.
In the collision scenario below:
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]
when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().
However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.
This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().
This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer() where it updates the transport state if there is
already a transport with the same ipaddr exists in the old asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.
The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:
(1) If an app reads data when > 1*MSS is unacknowledged, then
tcp_cleanup_rbuf() ACKs immediately because of:
tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
(2) If an app reads all received data, and the packets were < 1*MSS,
and either (a) the app is not ping-pong or (b) we received two
packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
of:
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
!inet_csk_in_pingpong_mode(sk))) &&
(3) *However*: if an app reads exactly 1*MSS of data,
tcp_cleanup_rbuf() does not send an immediate ACK. This is true
even if the app is not ping-pong and the 1*MSS of data had the PSH
bit set, suggesting the sending application completed an
application write.
Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.
The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.
When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.
And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.
The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
Fixes: fc6415bcb0 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUdcDQNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AOf7EACipnPx/532GUk1pECg+iWGTfhFOu1jdHjAILzy
+Ft/kfTLvd8kfZg6DuKIb6KYfj7w7uQ/xcD6wfqV8HBcss0SOyilx2ZUgH8ThwDv
tSIsUsx1M1gOGkXK713GrD6h/PR5BBv3vVFymvr+MliYH4C2mmsGOGWk5D+s8IqU
q3XDMMMlsZpfqCA8QGKK7TkFhnvnHdeoHGhZhw9ywXik733Qa4OUbJ5tkxztDKrr
DKF/FhpYxWPKHURtPXaQpWuni7xbMjg+3lHYlWTRZkQRQOoPWidBuTumqJxvwT3Z
FYwlS7T7OBMiFByy4spBnBs0uGiA6rR3sZ2/Gn98o9HpYlCllxpZm53Ay0u8sZTL
RBhMkacOXTWN5n1fbIqHIZc6vs7Tm1crvT2V/CseAuhe9TDiD5cHkaz7hJUQif6h
dmF48QHCHuSgWGtyPmVbTDSZ02YF++R398zHuBM2TXkFz8B9vI5DRpbXw0yX4ktg
LZSKnBALOPN5Ye27+W+itNfNaMC3+Elto3Cv9IvpTaXWl8WpF8hnNagLObEXxJ1Q
3dLRKpSHDKJe7BLQoqm9ESFUE80bZr+S1Xleukz0z7AamCrM/rxQGKBwbTJs2NoE
1YezWzhw68+aQ7BY8eWigDAQKmtn1Oju3v5u5IekGKQVvXd5x97VGlJQRVxQvr2Z
jDHNFw==
=YLDi
-----END PGP SIGNATURE-----
Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter patches for net
First patch resolves a regression with vlan header matching, this was
broken since 6.5 release. From myself.
Second patch fixes an ancient problem with sctp connection tracking in
case INIT_ACK packets are delayed. This comes with a selftest, both
patches from Xin Long.
Patch 4 extends the existing nftables audit selftest, from
Phil Sutter.
Patch 5, also from Phil, avoids a situation where nftables
would emit an audit record twice. This was broken since 5.13 days.
Patch 6, from myself, avoids spurious insertion failure if we encounter an
overlapping but expired range during element insertion with the
'nft_set_rbtree' backend. This problem exists since 6.2.
* tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
selftests: netfilter: Extend nft_audit.sh
selftests: netfilter: test for sctp collision processing in nf_conntrack
netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
netfilter: nft_payload: rebuild vlan header on h_proto access
====================
Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUVjwUNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AKneEACzrKtIC0j0DyhgVW4Kb57T8Y7cD5wQCv7oz1Cx
8A3UJ1pSLYhRnz94zY453GIenK+zx/KKIetDhyWnjA9gjk95HkUN+OwuuiKnUAgu
7KPGbIYat7hERwoZpR88nrbTYXcDZfcZGTqWA++3yL2vn4Lu4lsuowqXYKBf/axk
5gEwEtwn2mVsdo0qTVJcXkHqnf5CCdqd26ixF4yB1rz/P6kISi4I9q7ul43paFJW
+/ifacdG+7raQkGlUlYiDNMVd0uO01HHaAcWfYa+FOMK+GSn+89zzTs906CU0g2O
GRJSWjNTgfDtM2AHN7peUnf/G9XHSK2Y7Re8FzauKzwWSl5N9w5610nbQnT+ME5O
uOZE1P/lhnidOwCEV8zU4yhs6fBrCMCHz+S5Yh8C8PCUhi12IEEYRHyGCoUVMOwY
1LINjdn4HddL57QUGumy0VqVBlxQru8VXnlzm0eIyhsbZ3/mVXQWIHX4u1G36UUQ
zSkm4/qP4kna/tV86mETNX1MUcJsQ1vQ842abcUbxudKei/uT9av6YHlz/aBOcQZ
NDMrGVO6mjh7/HnYUr7+zbQfhLZdg424SpGEoiuS7dDcTpGlcT3pnWBJDGEHsy+4
0VnLI8/GPT1/jQCCYTVLu+tn0XmfZF18j2bvGhz1hM9J/HXaRpuqjGF6thLgYl63
CZf5Yg==
=ALU2
-----END PGP SIGNATURE-----
Merge tag 'nf-next-23-09-28' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
First patch, from myself, is a bug fix. The issue (connect timeout) is
ancient, so I think its safe to give this more soak time given the esoteric
conditions needed to trigger this.
Also updates the existing selftest to cover this.
Add netlink extacks when an update references a non-existent
table/chain/set. This allows userspace to provide much better
errors to the user, from Pablo Neira Ayuso.
Last patch adds more policy checks to nf_tables as a better
alternative to the existing runtime checks, from Phil Sutter.
* tag 'nf-next-23-09-28' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Utilize NLA_POLICY_NESTED_ARRAY
netfilter: nf_tables: missing extended netlink error in lookup functions
selftests: netfilter: test nat source port clash resolution interaction with tcp early demux
netfilter: nf_nat: undo erroneous tcp edemux lookup after port clash
====================
Link: https://lore.kernel.org/r/20230928144916.18339-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It seems that tipc_crypto_key_revoke() could be be invoked by
wokequeue tipc_crypto_work_rx() under process context and
timer/rx callback under softirq context, thus the lock acquisition
on &tx->lock seems better use spin_lock_bh() to prevent possible
deadlock.
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.
tipc_crypto_work_rx() <workqueue>
--> tipc_crypto_key_distr()
--> tipc_bcast_xmit()
--> tipc_bcbase_xmit()
--> tipc_bearer_bc_xmit()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock)
<timer interrupt>
--> tipc_disc_timeout()
--> tipc_bearer_xmit_skb()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock) <deadlock here>
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to a small omission, the offload_failed flag is missing from ipv4
fibmatch results. Make sure it is set correctly.
The issue can be witnessed using the following commands:
echo "1 1" > /sys/bus/netdevsim/new_device
ip link add dummy1 up type dummy
ip route add 192.0.2.0/24 dev dummy1
echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
ip route add 198.51.100.0/24 dev dummy1
ip route
# 192.168.15.0/24 has rt_trap
# 198.51.100.0/24 has rt_offload_failed
ip route get 192.168.15.1 fibmatch
# Result has rt_trap
ip route get 198.51.100.1 fibmatch
# Result differs from the route shown by `ip route`, it is missing
# rt_offload_failed
ip link del dev dummy1
echo 1 > /sys/bus/netdevsim/del_device
Fixes: 36c5100e85 ("IPv4: Add "offload failed" indication to routes")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
to list individually. Many stack fixes, even rfkill
(found by simulation and the new eevdf scheduler)!
Also a bigger maintainers file cleanup, to remove old
and redundant information.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmUT/FoACgkQ10qiO8sP
aAAeqg/+Iaim4AFPPDeWUvSARyqcMIKqmDtaFqkXE+OY8oahbvvbnGuLM/v/7V3r
NG5pzoi0gqFasF+DC1lNWFUnmtWdjzjhhY9FInoWgiNO04V48c3NPI/YF/2Yvy0x
biJxSsiaY+buY/p2QOXvlXHetnaftXNikFPZaD1mVGG/GIGZAwqqUO/EkIdliZf5
q/RBt5jzMF/nXTRIGc53kq7tKT97gGnDDYMych0U130PlyyqAZ+iAsR9UbiPa+ww
Z1JB8qZO72Hx9iN0WMjXNBX8vxC961Dj+fkttgJqZALTk0UqQitwcsIJgM29JjTG
VHsxHMdZAROFGaEHdMs1eHkqHkqyOxA4Jhr5LAci/chzgKkRw2I2LAXndOeg5TLH
cWdhNjRW/ZJ64czr5hALQQxK4nj8m7SPFHY/6UuBnNHEXCjr7vUcuhcoK63Kb6Np
6Sg4jtyhakaXemqjhcmIYNF1dG5CDbUmQXFkj9Z9EEyHjAzGJ7ASmdhHwlBQnJuH
39ESEky2zQINAJbisaw9R2zj+V9Ia/mFSbi2q30kX5J4xTHGIURNo9OPkLAQWDdw
6u/d5VZLigliIK+Qj8kVtn41wmUEwB3W5Aq4CI7xB91vKRHCUZQvZ5xBJfHtk2pD
7VIzZscReWCsQP9T0hv38jeaw4m/P+mZO1iOC2qxveJvrQogZMk=
=OL/S
-----END PGP SIGNATURE-----
Merge tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Quite a collection of fixes this time, really too many
to list individually. Many stack fixes, even rfkill
(found by simulation and the new eevdf scheduler)!
Also a bigger maintainers file cleanup, to remove old
and redundant information.
* tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits)
wifi: iwlwifi: mvm: Fix incorrect usage of scan API
wifi: mac80211: Create resources for disabled links
wifi: cfg80211: avoid leaking stack data into trace
wifi: mac80211: allow transmitting EAPOL frames with tainted key
wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
wifi: cfg80211: Fix 6GHz scan configuration
wifi: mac80211: fix potential key leak
wifi: mac80211: fix potential key use-after-free
wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling
wifi: brcmfmac: Replace 1-element arrays with flexible arrays
wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM
rfkill: sync before userspace visibility/changes
wifi: mac80211: fix mesh id corruption on 32 bit systems
wifi: cfg80211: add missing kernel-doc for cqm_rssi_work
wifi: cfg80211: fix cqm_config access race
wifi: iwlwifi: mvm: Fix a memory corruption issue
wifi: iwlwifi: Ensure ack flag is properly cleared.
wifi: iwlwifi: dbg_ini: fix structure packing
iwlwifi: mvm: handle PS changes in vif_cfg_changed
...
====================
Link: https://lore.kernel.org/r/20230927095835.25803-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZRqk1wAKCRDbK58LschI
g8GRAQC4E0bw6BTFRl0b3MxvpZES6lU0BUtX2gKVK4tLZdXw/wEAmTlBXQqNzF3b
BkCQknVbFTSw/8l8pzUW123Fb46wUAQ=
=E3hd
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-10-02
We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).
The main changes are:
1) Fix BPF verifier to reset backtrack_state masks on global function
exit as otherwise subsequent precision tracking would reuse them,
from Andrii Nakryiko.
2) Several sockmap fixes for available bytes accounting,
from John Fastabend.
3) Reject sk_msg egress redirects to non-TCP sockets given this
is only supported for TCP sockets today, from Jakub Sitnicki.
4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
limits with BPF_F_BEFORE directive, from Daniel Borkmann
and Nikolay Aleksandrov.
5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
size_index for selecting a bpf_mem_cache, from Hou Tao.
6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
from Song Liu.
7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
from Leon Hwang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use kmalloc_size_roundup() to adjust size_index
selftest/bpf: Add various selftests for program limits
bpf, mprog: Fix maximum program check on mprog attachment
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
bpf, sockmap: Add tests for MSG_F_PEEK
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf: unconditionally reset backtrack_state masks on global func exit
bpf: Fix tr dereferencing
selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================
Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nft_rbtree_gc_elem() walks back and removes the end interval element that
comes before the expired element.
There is a small chance that we've cached this element as 'rbe_ge'.
If this happens, we hold and test a pointer that has been queued for
freeing.
It also causes spurious insertion failures:
$ cat test-testcases-sets-0044interval_overlap_0.1/testout.log
Error: Could not process rule: File exists
add element t s { 0 - 2 }
^^^^^^
Failed to insert 0 - 2 given:
table ip t {
set s {
type inet_service
flags interval,timeout
timeout 2s
gc-interval 2s
}
}
The set (rbtree) is empty. The 'failure' doesn't happen on next attempt.
Reason is that when we try to insert, the tree may hold an expired
element that collides with the range we're adding.
While we do evict/erase this element, we can trip over this check:
if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
return -ENOTEMPTY;
rbe_ge was erased by the synchronous gc, we should not have done this
check. Next attempt won't find it, so retry results in successful
insertion.
Restart in-kernel to avoid such spurious errors.
Such restart are rare, unless userspace intentionally adds very large
numbers of elements with very short timeouts while setting a huge
gc interval.
Even in this case, this cannot loop forever, on each retry an existing
element has been removed.
As the caller is holding the transaction mutex, its impossible
for a second entity to add more expiring elements to the tree.
After this it also becomes feasible to remove the async gc worker
and perform all garbage collection from the commit path.
Fixes: c9e6978e27 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>
When adding/updating an object, the transaction handler emits suitable
audit log entries already, the one in nft_obj_notify() is redundant. To
fix that (and retain the audit logging from objects' 'update' callback),
Introduce an "audit log free" variant for internal use.
Fixes: c520292f29 ("audit: log nftables configuration change events once per table")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>
In Scenario A and B below, as the delayed INIT_ACK always changes the peer
vtag, SCTP ct with the incorrect vtag may cause packet loss.
Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK
192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: [COOKIE ACK]
Scenario B: INIT_ACK is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
This patch fixes it as below:
In SCTP_CID_INIT processing:
- clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] &&
ct->proto.sctp.init[!dir]. (Scenario E)
- set ct->proto.sctp.init[dir].
In SCTP_CID_INIT_ACK processing:
- drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
- drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)
In SCTP_CID_COOKIE_ACK processing:
- clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir].
(Scenario D)
Also, it's important to allow the ct state to move forward with cookie_echo
and cookie_ack from the opposite dir for the collision scenarios.
There are also other Scenarios where it should allow the packet through,
addressed by the processing above:
Scenario C: new CT is created by INIT_ACK.
Scenario D: start INIT on the existing ESTABLISHED ct.
Scenario E: start INIT after the old collision on the existing ESTABLISHED
ct.
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
(both side are stopped, then start new connection again in hours)
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
nft can perform merging of adjacent payload requests.
This means that:
ether saddr 00:11 ... ether type 8021ad ...
is a single payload expression, for 8 bytes, starting at the
ethernet source offset.
Check that offset+length is fully within the source/destination mac
addersses.
This bug prevents 'ether type' from matching the correct h_proto in case
vlan tag got stripped.
Fixes: de6843be30 ("netfilter: nft_payload: rebuild vlan header when needed")
Reported-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
The call netdev_{put, hold} of dev_{put, hold} will check NULL, so there
is no need to check before using dev_{put, hold}, remove it to silence
the warning:
./net/can/raw.c:497:2-9: WARNING: NULL check before dev_{put, hold} functions is not needed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6231
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reported-by: Simon Horman <horms@kernel.org>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230825064656.87751-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid
lifetime derivation mechanism.
RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router
Advertisement PIO shall be ignored if it less than 2 hours and to reset
the lifetime of the corresponding address to 2 hours. An in-progress
6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently
looking to remove this mechanism. While this draft has not been moving
particularly quickly for other reasons, there is widespread consensus on
section 4.2 which updates RFC4862 section 5.5.3e.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Jen Linkova <furry@google.com>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230925214711.959704-1-prohr@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ETHTOOL_A_PLCA_ENABLED data type is u8. But while parsing the
value from the attribute, nla_get_u32() is used in the plca_update_sint()
function instead of nla_get_u8(). So plca_cfg.enabled variable is updated
with some garbage value instead of 0 or 1 and always enables plca even
though plca is disabled through ethtool application. This bug has been
fixed by parsing the values based on the attributes type in the policy.
Fixes: 8580e16c28 ("net/ethtool: add netlink interface for the PLCA RS")
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230908044548.5878-1-Parthiban.Veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The KSZ9477 has support for HSR (High-Availability Seamless Redundancy).
One of its offloading (i.e. performed in the switch IC hardware) features
is to duplicate received frame to both HSR aware switch ports.
To achieve this goal - the tail TAG needs to be modified. To be more
specific, both ports must be marked as destination (egress) ones.
The NETIF_F_HW_HSR_DUP flag indicates that the device supports HSR and
assures (in HSR core code) that frame is sent only once from HOST to
switch with tail tag indicating both ports.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In some cases, drivers may need to veto the changing of a MAC address on
a user port. Such is the case with KSZ9477 when it offloads a HSR device,
because it programs the MAC address of multiple ports to a shared
hardware register. Those ports need to have equal MAC addresses for the
lifetime of the HSR offload.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Drivers can provide meaningful error messages which state a reason why
they can't perform an offload, and dsa_slave_changeupper() already has
the infrastructure to propagate these over netlink rather than printing
to the kernel log. So pass the extack argument and modify the xrs700x
driver's port_hsr_join() prototype.
Also take the opportunity and use the extack for the 2 -EOPNOTSUPP cases
from xrs700x_hsr_join().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The function doesn't modify the addresses passed as input, mark them
as 'const' to make that clear.
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20230924153014.786962-1-b.galvani@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
'n_tables' is small, UDP_TUNNEL_NIC_MAX_TABLES = 4 as a maximum. So there
is no real point to allocate the 'entries' pointers array with a dedicate
memory allocation.
Using a flexible array for struct udp_tunnel_nic->entries avoids the
overhead of an additional memory allocation.
This also saves an indirection when the array is accessed.
Finally, __counted_by() can be used for run-time bounds checking if
configured and supported by the compiler.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/4a096ba9cf981a588aa87235bb91e933ee162b3d.1695542544.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This is inspired by several syzbot reports where
tcp_metrics_flush_all() was seen in the traces.
We can avoid acquiring tcp_metrics_lock for empty buckets,
and we should add one cond_resched() to break potential long loops.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tcp_init_metrics() only wants to get metrics if they were
previously stored in the cache. Creating an entry is adding
useless costs, especially when tcp_no_metrics_save is set.
Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to set tp->snd_ssthresh to TCP_INFINITE_SSTHRESH
in the case tcp_get_metrics() fails for some reason.
Fixes: 9ad7c049f0 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When removing an item from RCU protected list, we must prevent
store-tearing, using rcu_assign_pointer() or WRITE_ONCE().
Fixes: 04f721c671 ("tcp_metrics: Rewrite tcp_metrics_flush_all")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit b0e214d212 ("netfilter: keep conntrack reference until
IPsecv6 policy checks are done") is a direct copy of the old
commit b59c270104 ("[NETFILTER]: Keep conntrack reference until
IPsec policy checks are done") but for IPv6. However, it also
copies a bug that this old commit had. That is: when the third
packet of 3WHS connection establishment contains payload, it is
added into socket receive queue without the XFRM check and the
drop of connection tracking context.
That leads to nf_conntrack module being impossible to unload as
it waits for all the conntrack references to be dropped while
the packet release is deferred in per-cpu cache indefinitely, if
not consumed by the application.
The issue for IPv4 was fixed in commit 6f0012e351 ("tcp: add a
missing nf_reset_ct() in 3WHS handling") by adding a missing XFRM
check and correctly dropping the conntrack context. However, the
issue was introduced to IPv6 code afterwards. Fixing it the
same way for IPv6 now.
Fixes: b0e214d212 ("netfilter: keep conntrack reference until IPsecv6 policy checks are done")
Link: https://lore.kernel.org/netdev/d589a999-d4dd-2768-b2d5-89dec64a4a42@ovn.org/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230922210530.2045146-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
After deleting an interface address in fib_del_ifaddr(), the function
scans the fib_info list for stray entries and calls fib_flush() and
fib_table_flush(). Then the stray entries will be deleted silently and no
RTM_DELROUTE notification will be sent.
This lack of notification can make routing daemons, or monitor like
`ip monitor route` miss the routing changes. e.g.
+ ip link add dummy1 type dummy
+ ip link add dummy2 type dummy
+ ip link set dummy1 up
+ ip link set dummy2 up
+ ip addr add 192.168.5.5/24 dev dummy1
+ ip route add 7.7.7.0/24 dev dummy2 src 192.168.5.5
+ ip -4 route
7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
As Ido reminded, fib_table_flush() isn't only called when an address is
deleted, but also when an interface is deleted or put down. The lack of
notification in these cases is deliberate. And commit 7c6bb7d2fa
("net/ipv6: Add knob to skip DELROUTE message on device down") introduced
a sysctl to make IPv6 behave like IPv4 in this regard. So we can't send
the route delete notify blindly in fib_table_flush().
To fix this issue, let's add a new flag in "struct fib_info" to track the
deleted prefer source address routes, and only send notify for them.
After update:
+ ip monitor route
+ ip addr del 192.168.5.5/24 dev dummy1
Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
Deleted 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
Suggested-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230922075508.848925-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
key_serial_t fields are signed integers. Use nla_get/put_s32 for
those to avoid implicit signed conversion in the netlink protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/169530167716.8905.645746457741372879.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Socket file descriptors are signed integers. Use nla_get/put_s32 for
those to avoid implicit signed conversion in the netlink protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/169530165057.8905.8650469415145814828.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct dp_meter.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230922172858.3822653-12-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct dp_meter_instance.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230922172858.3822653-10-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add missing annotations to inet->mc_index and inet->mc_addr
to fix data-races.
getsockopt(IP_MULTICAST_IF) can be lockless.
setsockopt() side is left for later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation is already lockless, because the socket
lock is released before reading socket fields.
Add missing READ_ONCE() annotations.
Note that corresponding WRITE_ONCE() are needed, the order
of the patches do not really matter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing READ_ONCE() annotations when reading inet->uc_index
Implementing getsockopt(IP_UNICAST_IF) locklessly seems possible,
the setsockopt() part might not be possible at the moment.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_dst_get() does not require socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet->inet_opt being RCU protected, we can use RCU instead
of locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some reads of inet->tos are racy.
Add needed READ_ONCE() annotations and convert IP_TOS option lockless.
v2: missing changes in include/net/route.h (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet->pmtudisc can be read locklessly.
Implement proper lockless reads and writes to inet->pmtudisc
ip_sock_set_mtu_discover() can now be called from arbitrary
contexts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet->mc_ttl can be read locklessly.
Implement proper lockless reads and writes to inet->mc_ttl
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the change in commit 0bdf399342c5("net: Avoid address
overwrite in kernel_connect"), BPF hooks run on bind may rewrite the
address passed to kernel_bind(). This change
1) Makes a copy of the bind address in kernel_bind() to insulate
callers.
2) Replaces direct calls to sock->ops->bind() in net with kernel_bind()
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: 4fbac77d2d ("bpf: Hooks for sys_bind")
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of sock_sendmsg(), and similarly kernel_sendmsg(), in kernel
space may observe their value of msg_name change in cases where BPF
sendmsg hooks rewrite the send address. This has been confirmed to break
NFS mounts running in UDP mode and has the potential to break other
systems.
This patch:
1) Creates a new function called __sock_sendmsg() with same logic as the
old sock_sendmsg() function.
2) Replaces calls to sock_sendmsg() made by __sys_sendto() and
__sys_sendmsg() with __sock_sendmsg() to avoid an unnecessary copy,
as these system calls are already protected.
3) Modifies sock_sendmsg() so that it makes a copy of msg_name if
present before passing it down the stack to insulate callers from
changes to the send address.
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0bdf399342 ("net: Avoid address overwrite in kernel_connect")
ensured that kernel_connect() will not overwrite the address parameter
in cases where BPF connect hooks perform an address rewrite. This change
replaces direct calls to sock->ops->connect() in net with kernel_connect()
to make these call safe.
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: d74bad4e74 ("bpf: Hooks for sys_connect")
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This field can be read or written without socket lock being held.
Add annotations to avoid load-store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_txrehash readers are already safe against
concurrent change of this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SO_MAX_PACING_RATE setsockopt() does not need to hold
the socket lock, because sk->sk_pacing_rate readers
can run fine if the value is changed by other threads,
after adding READ_ONCE() accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting sk->sk_ll_usec, sk_prefer_busy_poll and sk_busy_poll_budget
do not require the socket lock, readers are lockless anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This options can not be set and return -ENOPROTOOPT,
no need to acqure socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock->flags are atomic, no need to hold the socket lock
in sk_setsockopt() for SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup of 8bf43be799 ("net: annotate data-races
around sk->sk_priority").
sk->sk_priority can be read and written without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
do_execute_actions() function can be called recursively multiple
times while executing actions that require pipeline forking or
recirculations. It may also be re-entered multiple times if the packet
leaves openvswitch module and re-enters it through a different port.
Currently, there is a 256-byte array allocated on stack in this
function that is supposed to hold NSH header. Compilers tend to
pre-allocate that space right at the beginning of the function:
a88: 48 81 ec b0 01 00 00 sub $0x1b0,%rsp
NSH is not a very common protocol, but the space is allocated on every
recursive call or re-entry multiplying the wasted stack space.
Move the stack allocation to push_nsh() function that is only used
if NSH actions are actually present. push_nsh() is also a simple
function without a possibility for re-entry, so the stack is returned
right away.
With this change the preallocated space is reduced by 256 B per call:
b18: 48 81 ec b0 00 00 00 sub $0xb0,%rsp
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eelco Chaudron echaudro@redhat.com
Reviewed-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Including the transhdrlen in length is a problem when the packet is
partially filled (e.g. something like send(MSG_MORE) happened previously)
when appending to an IPv4 or IPv6 packet as we don't want to repeat the
transport header or account for it twice. This can happen under some
circumstances, such as splicing into an L2TP socket.
The symptom observed is a warning in __ip6_append_data():
WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
that occurs when MSG_SPLICE_PAGES is used to append more data to an already
partially occupied skbuff. The warning occurs when 'copy' is larger than
the amount of data in the message iterator. This is because the requested
length includes the transport header length when it shouldn't. This can be
triggered by, for example:
sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
bind(sfd, ...); // ::1
connect(sfd, ...); // ::1 port 7
send(sfd, buffer, 4100, MSG_MORE);
sendfile(sfd, dfd, NULL, 1024);
Fix this by only adding transhdrlen into the length if the write queue is
empty in l2tp_ip6_sendmsg(), analogously to how UDP does things.
l2tp_ip_sendmsg() looks like it won't suffer from this problem as it builds
the UDP packet itself.
Fixes: a32e0eec70 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Reported-by: syzbot+62cbf263225ae13ff153@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
cc: syzkaller-bugs@googlegroups.com
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
n->output field can be read locklessly, while a writer
might change the pointer concurrently.
Add missing annotations to prevent load-store tearing.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Core networking has opt-in atomic variant of dev->stats,
simply use DEV_STATS_INC(), DEV_STATS_ADD() and DEV_STATS_READ().
v2: removed @priv local var in l2tp_eth_dev_recv() (Simon)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at a related syzbot report involving neigh_periodic_work(),
I found that I forgot to add an annotation when deleting an
RCU protected item from a list.
Readers use rcu_deference(*np), we need to use either
rcu_assign_pointer() or WRITE_ONCE() on writer side
to prevent store tearing.
I use rcu_assign_pointer() to have lockdep support,
this was the choice made in neigh_flush_dev().
Fixes: 767e97e1e0 ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
FQ performs garbage collection at enqueue time, and only
if number of flows is above a given threshold, which
is hit after the qdisc has been used a bit.
Since an RB-tree traversal is needed to locate a flow,
it makes sense to perform gc all the time, to keep
rb-trees smaller.
This reduces by 50 % average storage costs in FQ,
and avoids 1 cache line miss at enqueue time when
fast path added in prior patch can not be used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCQ_F_CAN_BYPASS can be used by few qdiscs.
Idea is that if we queue a packet to an empty qdisc,
following dequeue() would pick it immediately.
FQ can not use the generic TCQ_F_CAN_BYPASS code,
because some additional checks need to be performed.
This patch adds a similar fast path to FQ.
Most of the time, qdisc is not throttled,
and many packets can avoid bringing/touching
at least four cache lines, and consuming 128bytes
of memory to store the state of a flow.
After this patch, netperf can send UDP packets about 13 % faster,
and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC.
TCP traffic is also improved, thanks to a reduction of cache line misses.
I have measured a 5 % increase of throughput on a tcp_rr intensive workload.
tc -s -d qd sh dev eth1
...
qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024
orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit
refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
flows 122 (inactive 122 throttled 0)
gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when one fq qdisc has no more packets to send, it can still
have some flows stored in its RR lists (q->new_flows & q->old_flows)
This was a design choice, but what is a bit disturbing is that
the inactive_flows counter does not include the count of empty flows
in RR lists.
As next patch needs to know better if there are active flows,
this change makes inactive_flows exact.
Before the patch, following command on an empty qdisc could have returned:
lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
flows 1322 (inactive 1316 throttled 0)
flows 1330 (inactive 1325 throttled 0)
flows 1193 (inactive 1190 throttled 0)
flows 1208 (inactive 1202 throttled 0)
After the patch, we now have:
lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
flows 1322 (inactive 1322 throttled 0)
flows 1330 (inactive 1330 throttled 0)
flows 1193 (inactive 1193 throttled 0)
flows 1208 (inactive 1208 throttled 0)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While BPF allows to set icsk->->icsk_delack_max
and/or icsk->icsk_rto_min, we have an ip route
attribute (RTAX_RTO_MIN) to be able to tune rto_min,
but nothing to consequently adjust max delayed ack,
which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}).
This makes RTAX_RTO_MIN of almost no practical use,
unless customers are in big trouble.
Modern days datacenter communications want to set
rto_min to ~5 ms, and the max delayed ack one jiffie
smaller to avoid spurious retransmits.
After this patch, an "rto_min 5" route attribute will
effectively lower max delayed ack timers to 4 ms.
Note in the following ss output, "rto:6 ... ato:4"
$ ss -temoi dst XXXXXX
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597
ino:255134 sk:1001 <->
skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
bytes_received:54823120 segs_out:1370582 segs_in:1370580
data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536
While we could argue this patch fixes a bug with RTAX_RTO_MIN,
I do not add a Fixes: tag, so that we can soak it a bit before
asking backports to stable branches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looks like I lost that by accident, add it back.
Fixes: 076fc8775d ("wifi: cfg80211: remove wdev mutex")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Somehow, I managed to botch this and pretty much completely break
wifi. My original patch did contain these changes, but I seem to
have lost them before sending to the list. Fix it now.
Reported-and-tested-by: Kalle Valo <kvalo@kernel.org>
Fixes: 6c02fab724 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With a SOCKMAP/SOCKHASH map and an sk_msg program user can steer messages
sent from one TCP socket (s1) to actually egress from another TCP
socket (s2):
tcp_bpf_sendmsg(s1) // = sk_prot->sendmsg
tcp_bpf_send_verdict(s1) // __SK_REDIRECT case
tcp_bpf_sendmsg_redir(s2)
tcp_bpf_push_locked(s2)
tcp_bpf_push(s2)
tcp_rate_check_app_limited(s2) // expects tcp_sock
tcp_sendmsg_locked(s2) // ditto
There is a hard-coded assumption in the call-chain, that the egress
socket (s2) is a TCP socket.
However in commit 122e6c79ef ("sock_map: Update sock type checks for
UDP") we have enabled redirects to non-TCP sockets. This was done for the
sake of BPF sk_skb programs. There was no indention to support sk_msg
send-to-egress use case.
As a result, attempts to send-to-egress through a non-TCP socket lead to a
crash due to invalid downcast from sock to tcp_sock:
BUG: kernel NULL pointer dereference, address: 000000000000002f
...
Call Trace:
<TASK>
? show_regs+0x60/0x70
? __die+0x1f/0x70
? page_fault_oops+0x80/0x160
? do_user_addr_fault+0x2d7/0x800
? rcu_is_watching+0x11/0x50
? exc_page_fault+0x70/0x1c0
? asm_exc_page_fault+0x27/0x30
? tcp_tso_segs+0x14/0xa0
tcp_write_xmit+0x67/0xce0
__tcp_push_pending_frames+0x32/0xf0
tcp_push+0x107/0x140
tcp_sendmsg_locked+0x99f/0xbb0
tcp_bpf_push+0x19d/0x3a0
tcp_bpf_sendmsg_redir+0x55/0xd0
tcp_bpf_send_verdict+0x407/0x550
tcp_bpf_sendmsg+0x1a1/0x390
inet_sendmsg+0x6a/0x70
sock_sendmsg+0x9d/0xc0
? sockfd_lookup_light+0x12/0x80
__sys_sendto+0x10e/0x160
? syscall_enter_from_user_mode+0x20/0x60
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0x82/0x110
__x64_sys_sendto+0x1f/0x30
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reject selecting a non-TCP sockets as redirect target from a BPF sk_msg
program to prevent the crash. When attempted, user will receive an EACCES
error from send/sendto/sendmsg() syscall.
Fixes: 122e6c79ef ("sock_map: Update sock type checks for UDP")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230920102055.42662-1-jakub@cloudflare.com
When data is peek'd off the receive queue we shouldn't considered it
copied from tcp_sock side. When we increment copied_seq this will confuse
tcp_data_ready() because copied_seq can be arbitrarily increased. From
application side it results in poll() operations not waking up when
expected.
Notice tcp stack without BPF recvmsg programs also does not increment
copied_seq.
We broke this when we moved copied_seq into recvmsg to only update when
actual copy was happening. But, it wasn't working correctly either before
because the tcp_data_ready() tried to use the copied_seq value to see
if data was read by user yet. See fixes tags.
Fixes: e5c6de5fa0 ("bpf, sockmap: Incorrectly handling copied_seq")
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230926035300.135096-3-john.fastabend@gmail.com
Before fix e5c6de5fa0 tcp_read_skb() would increment the tp->copied-seq
value. This (as described in the commit) would cause an error for apps
because once that is incremented the application might believe there is no
data to be read. Then some apps would stall or abort believing no data is
available.
However, the fix is incomplete because it introduces another issue in
the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume
as many skbs as possible. The problem is the call is ...
tcp_recv_skb(sk, seq, &offset)
... where 'seq' is:
u32 seq = tp->copied_seq;
Now we can hit a case where we've yet incremented copied_seq from BPF side,
but then tcp_recv_skb() fails this test ...
if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
... so that instead of returning the skb we call tcp_eat_recv_skb() which
frees the skb. This is because the routine believes the SKB has been collapsed
per comment:
/* This looks weird, but this can happen if TCP collapsing
* splitted a fat GRO packet, while we released socket lock
* in skb_splice_bits()
*/
This can't happen here we've unlinked the full SKB and orphaned it. Anyways
it would confuse any BPF programs if the data were suddenly moved underneath
it.
To fix this situation do simpler operation and just skb_peek() the data
of the queue followed by the unlink. It shouldn't need to check this
condition and tcp_read_skb() reads entire skbs so there is no need to
handle the 'offset!=0' case as we would see in tcp_read_sock().
Fixes: e5c6de5fa0 ("bpf, sockmap: Incorrectly handling copied_seq")
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230926035300.135096-2-john.fastabend@gmail.com
Mark attributes which are supposed to be arrays of nested attributes
with known content as such. Originally suggested for
NFTA_RULE_EXPRESSIONS only, but does apply to others as well.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
Set netlink extended error reporting for several lookup functions which
allows userspace to infer what is the error cause.
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
In commit 03a3ca37e4 ("netfilter: nf_nat: undo erroneous tcp edemux lookup")
I fixed a problem with source port clash resolution and DNAT.
A very similar issue exists with REDIRECT (DNAT to local address) and
port rewrites.
Consider two port redirections done at prerouting hook:
-p tcp --port 1111 -j REDIRECT --to-ports 80
-p tcp --port 1112 -j REDIRECT --to-ports 80
Its possible, however unlikely, that we get two connections sharing
the same source port, i.e.
saddr:12345 -> daddr:1111
saddr:12345 -> daddr:1112
This works on sender side because destination address is
different.
After prerouting, nat will change first syn packet to
saddr:12345 -> daddr:80, stack will send a syn-ack back and 3whs
completes.
The second syn however will result in a source port clash:
after dnat rewrite, new syn has
saddr:12345 -> daddr:80
This collides with the reply direction of the first connection.
The NAT engine will handle this in the input nat hook by
also altering the source port, so we get for example
saddr:13535 -> daddr:80
This allows the stack to send back a syn-ack to that address.
Reverse NAT during POSTROUTING will rewrite the packet to
daddr:1112 -> saddr:12345 again. Tuple will be unique on-wire
and peer can process it normally.
Problem is when ACK packet comes in:
After prerouting, packet payload is mangled to saddr:12345 -> daddr:80.
Early demux will assign the 3whs-completing ACK skb to the first
connections' established socket.
This will then elicit a challenge ack from the first connections'
socket rather than complete the connection of the second.
The second connection can never complete.
Detect this condition by checking if the associated sockets port
matches the conntrack entries reply tuple.
If it doesn't, then input source address translation mangled
payload after early demux and the found sk is incorrect.
Discard this sk and let TCP stack do another lookup.
Signed-off-by: Florian Westphal <fw@strlen.de>
Currently, skbs generated by pktgen always have their reference count
incremented before transmission, causing their reference count to be
always greater than 1, leading to two issues:
1. Only the code paths for shared skbs can be tested.
2. In certain situations, skbs can only be released by pktgen.
To enhance testing comprehensiveness, we are introducing the "SHARED"
flag to indicate whether an SKB is shared. This flag is enabled by
default, aligning with the current behavior. However, disabling this
flag allows skbs with a reference count of 1 to be transmitted.
So we can test non-shared skbs and code paths where skbs are released
within the stack.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-2-liangchen.linux@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When specifying an unknown flag, it will print all available flags.
Currently, these flags are provided as fixed strings, which requires
manual updates when flags change. Replacing it with automated flag
enumeration.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-1-liangchen.linux@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit 88428cc4ae.
The problem this commit is intended to fix was comprehensively fixed
in commit 7de62bc09f ("SUNRPC dont update timeout value on connection
reset").
Since then, this commit has been preventing the correct timeout of soft
mounted requests.
Cc: stable@vger.kernel.org # 5.9.x: 09252177d5: SUNRPC: Handle major timeout in xprt_adjust_timeout()
Cc: stable@vger.kernel.org # 5.9.x: 7de62bc09f: SUNRPC dont update timeout value on connection reset
Cc: stable@vger.kernel.org # 5.9.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcauth_checkverf() should return a distinct error code when a
server recognizes the AUTH_TLS probe but does not support TLS so
that the client's header decoder can respond appropriately and
quickly. No retries are necessary is in this case, since the server
has already affirmatively answered "TLS is unsupported".
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make __ieee80211_data_to_8023() return more individual drop
reasons instead of just doing RX_DROP_U_INVALID_8023.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This has many different reasons, split the return value into
the individual reasons for better traceability. Also, since
symbolic tracing doesn't work for these, add a few comments
for the numbering.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Convert all instances of RX_DROP_UNUSABLE to indicate a
better reason, and then remove RX_DROP_UNUSABLE.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we just check "result & RX_DROP_UNUSABLE", this really only works
by accident, because SKB_DROP_REASON_SUBSYS_MAC80211_UNUSABLE got to
have the value 1, and SKB_DROP_REASON_SUBSYS_MAC80211_MONITOR is 2.
Fix this to really check the entire subsys mask for the value, so it
doesn't matter what the subsystem value is.
Fixes: 7f4e09700b ("wifi: mac80211: report all unusable beacon frames")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When associating to an MLD AP, links may be disabled. Create all
resources associated with a disabled link so that we can later enable it
without having to create these resources on the fly.
Fixes: 6d543b34db ("wifi: mac80211: Support disabled links during association")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://lore.kernel.org/r/20230925173028.f9afdb26f6c7.I4e6e199aaefc1bf017362d64f3869645fa6830b5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the structure is not initialized then boolean types might be copied
into the tracing data without being initialised. This causes data from
the stack to leak into the trace and also triggers a UBSAN failure which
can easily be avoided here.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://lore.kernel.org/r/20230925171855.a9271ef53b05.I8180bae663984c91a3e036b87f36a640ba409817@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Lower layer device driver stop/wake TX by calling ieee80211_stop_queue()/
ieee80211_wake_queue() while hw scan. Sometimes hw scan and PTK rekey are
running in parallel, when M4 sent from wpa_supplicant arrive while the TX
queue is stopped, then the M4 will pending send, and then new key install
from wpa_supplicant. After TX queue wake up by lower layer device driver,
the M4 will be dropped by below call stack.
When key install started, the current key flag is set KEY_FLAG_TAINTED in
ieee80211_pairwise_rekey(), and then mac80211 wait key install complete by
lower layer device driver. Meanwhile ieee80211_tx_h_select_key() will return
TX_DROP for the M4 in step 12 below, and then ieee80211_free_txskb() called
by ieee80211_tx_dequeue(), so the M4 will not send and free, then the rekey
process failed becaue AP not receive M4. Please see details in steps below.
There are a interval between KEY_FLAG_TAINTED set for current key flag and
install key complete by lower layer device driver, the KEY_FLAG_TAINTED is
set in this interval, all packet including M4 will be dropped in this
interval, the interval is step 8~13 as below.
issue steps:
TX thread install key thread
1. stop_queue -idle-
2. sending M4 -idle-
3. M4 pending -idle-
4. -idle- starting install key from wpa_supplicant
5. -idle- =>ieee80211_key_replace()
6. -idle- =>ieee80211_pairwise_rekey() and set
currently key->flags |= KEY_FLAG_TAINTED
7. -idle- =>ieee80211_key_enable_hw_accel()
8. -idle- =>drv_set_key() and waiting key install
complete from lower layer device driver
9. wake_queue -waiting state-
10. re-sending M4 -waiting state-
11. =>ieee80211_tx_h_select_key() -waiting state-
12. drop M4 by KEY_FLAG_TAINTED -waiting state-
13. -idle- install key complete with success/fail
success: clear flag KEY_FLAG_TAINTED
fail: start disconnect
Hence add check in step 11 above to allow the EAPOL send out in the
interval. If lower layer device driver use the old key/cipher to encrypt
the M4, then AP received/decrypt M4 correctly, after M4 send out, lower
layer device driver install the new key/cipher to hardware and return
success.
If lower layer device driver use new key/cipher to send the M4, then AP
will/should drop the M4, then it is same result with this issue, AP will/
should kick out station as well as this issue.
issue log:
kworker/u16:4-5238 [000] 6456.108926: stop_queue: phy1 queue:0, reason:0
wpa_supplicant-961 [003] 6456.119737: rdev_tx_control_port: wiphy_name=phy1 name=wlan0 ifindex=6 dest=ARRAY[9e, 05, 31, 20, 9b, d0] proto=36488 unencrypted=0
wpa_supplicant-961 [003] 6456.119839: rdev_return_int_cookie: phy1, returned 0, cookie: 504
wpa_supplicant-961 [003] 6456.120287: rdev_add_key: phy1, netdev:wlan0(6), key_index: 0, mode: 0, pairwise: true, mac addr: 9e:05:31:20:9b:d0
wpa_supplicant-961 [003] 6456.120453: drv_set_key: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 cipher:0xfac04, flags=0x9, keyidx=0, hw_key_idx=0
kworker/u16:9-3829 [001] 6456.168240: wake_queue: phy1 queue:0, reason:0
kworker/u16:9-3829 [001] 6456.168255: drv_wake_tx_queue: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 ac:0 tid:7
kworker/u16:9-3829 [001] 6456.168305: cfg80211_control_port_tx_status: wdev(1), cookie: 504, ack: false
wpa_supplicant-961 [003] 6459.167982: drv_return_int: phy1 - -110
issue call stack:
nl80211_frame_tx_status+0x230/0x340 [cfg80211]
cfg80211_control_port_tx_status+0x1c/0x28 [cfg80211]
ieee80211_report_used_skb+0x374/0x3e8 [mac80211]
ieee80211_free_txskb+0x24/0x40 [mac80211]
ieee80211_tx_dequeue+0x644/0x954 [mac80211]
ath10k_mac_tx_push_txq+0xac/0x238 [ath10k_core]
ath10k_mac_op_wake_tx_queue+0xac/0xe0 [ath10k_core]
drv_wake_tx_queue+0x80/0x168 [mac80211]
__ieee80211_wake_txqs+0xe8/0x1c8 [mac80211]
_ieee80211_wake_txqs+0xb4/0x120 [mac80211]
ieee80211_wake_txqs+0x48/0x80 [mac80211]
tasklet_action_common+0xa8/0x254
tasklet_action+0x2c/0x38
__do_softirq+0xdc/0x384
Signed-off-by: Wen Gong <quic_wgong@quicinc.com>
Link: https://lore.kernel.org/r/20230801064751.25803-1-quic_wgong@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reject configuring a channel for MLO if either EHT is not supported or
the BSS does not have the correct ML element. This avoids trying to do
a multi-link association with a misconfigured AP.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.80c3b8e5a344.Iaa2d466ee6280994537e1ae7ab9256a27934806f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With this cfg80211 can report the link that caused the error to
userspace which is then able to react to it by e.g. removing the link
from the association and retrying.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.275fc7f5c426.I8086c0fdbbf92537d6a8b8e80b33387fcfd5553d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When one of the links (other than the assoc_link) is misconfigured
and cannot work the association will fail. However, userspace was not
able to tell that the operation only failed because of a problem with
one of the links. Fix this, by allowing the driver to set a per-link
error code and reporting the (first) offending link by setting the
bad_attr accordingly.
This only allows us to report the first error, but that is sufficient
for userspace to e.g. remove the offending link and retry.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.ebe63c0bd513.I40799998f02bf987acee1501a2522dc98bb6eb5a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support antenna control for injection by parsing the antenna
radiotap field (which may be presented multiple times) and
telling the driver about the resulting antenna bitmap. Of
course there's no guarantee the driver will actually honour
this, just like any other injection control.
If misconfigured, i.e. the injected HT/VHT MCS needs more
chains than antennas are configured, the bitmap is reset to
zero, indicating no selection.
For now this is only set up for two anntenas so we keep more
free bits, but that can be trivially extended if any driver
implements support for it that can deal with hardware with
more antennas.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.f71001aa4da9.I00ccb762a806ea62bc3d728fa3a0d29f4f285eeb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support handling of advertised TID-to-link mapping elements received
in a beacon.
These elements are used by AP MLD to disable specific links and force
all clients to stop using these links.
By default if no TID-to-link mapping is advertised, all TIDs shall be
mapped to all links.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.623c4b692ff9.Iab0a6f561d85b8ab6efe541590985a2b6e9e74aa@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Connection to an AP that is running a CSA flow may end up with a
failure as the AP might change its channel during the connection
flow while we do not track the channel change yet.
Avoid that by rejecting a connection to such an AP.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.e5001a762a4a.I9745c695f3403b259ad000ce94110588a836c04a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
rx_chains was set only upon registration and it we rely on it for the
active chains upon SMPS configuration after association.
When we use the set_antenna() API to limit the rx_chains from 2 to 1,
this caused issues with iwlwifi since we still had 2 active_chains
requested.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230920211508.2dde4da246b2.I904223c868c77cf2ba132a3088fe6506fcbb443b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Using the full struct cfg80211_ap_settings for an update is
misleading, since most settings cannot be updated. Split the
update case off into a new struct cfg80211_ap_update.
Change-Id: I3ba4dd9280938ab41252f145227a7005edf327e4
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers should really be able to rely on the wiphy mutex
being held all the time, unless otherwise documented. For
ethtool, that wasn't quite right. Fix and clarify this in
both code and documentation.
Reported-by: syzbot+c12a771b218dcbba32e1@syzkaller.appspotmail.com
Fixes: 0e8185ce1d ("wifi: mac80211: check wiphy mutex in ops")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cisco AP module 9115 with FW 17.3 has a bug and sends a too
large maximum MPDU length in the association response
(indicating 12k) that it cannot actually process.
Work around that by taking the minimum between what's in the
association response and the BSS elements (from beacon or
probe response).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230918140607.d1966a9a532e.I090225babb7cd4d1081ee9acd40e7de7e41c15ae@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the scan request includes a non broadcast BSSID, when adding the
scan parameters for 6GHz collocated scanning, do not include entries
that do not match the given BSSID.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230918140607.6d31d2a96baf.I6c4e3e3075d1d1878ee41f45190fdc6b86f18708@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When returning from ieee80211_key_link(), the key needs to
have been freed or successfully installed. This was missed
in a number of error paths, fix it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When ieee80211_key_link() is called by ieee80211_gtk_rekey_add()
but returns 0 due to KRACK protection (identical key reinstall),
ieee80211_gtk_rekey_add() will still return a pointer into the
key, in a potential use-after-free. This normally doesn't happen
since it's only called by iwlwifi in case of WoWLAN rekey offload
which has its own KRACK protection, but still better to fix, do
that by returning an error code and converting that to success on
the cfg80211 boundary only, leaving the error for bad callers of
ieee80211_gtk_rekey_add().
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: fdf7cb4185 ("mac80211: accept key reinstall without changing anything")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmUMFG8SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOksHAP+QE2eNf5yxo86dIS+3RQnOQ8kFBnNbEn
04lrheGnzG7PpNnGoCoTZna+xYQPYVLgbmmip2/CFnQvnQIsKyLQfCui85sfV2V9
KjUeE/kTgeC+jUQOWNDyz3zDP/MPC2LmiK8Gwyggvm9vFYn5tVZXC36aPZBZ7Vok
/DUW6iXyl31SeVGOOEKakcwn0GIYJSABhVFNsjrDe4tV+leUwvf8obAq3ZWxOGaU
D94ez28lSXgfOSWfQQ/l1rHI/yC0fr8HYyWJ60dNG2uS3fNEqT8LyqZfAUK24kVz
XbAGZa+GA7CDq3cVsU7vCWNWbB5fO+kXtmGOwPtuKtJQM5LPo4X77CuSHlpzdyvq
TuW0vxeVfdzAYVb3Zg+2QgWxDJjY0B8ujwdDWrnnKTPu4Ylhn6HLISXIlkMBoGwT
1/47TCnmn9t+lGagkMADppRRnJotHWObQG5wkzksqVa2CUB0HTESgbrm4rsxe6Ku
JiZhHbTiiPWy7LgY6EFtj/YGPvLs0CSltvh4QUsd+QtDTM/EN7y3HcHqkv88ropG
bSvJIh6WXdEJkwfSUdA0LECXSC6dizzZW2Y1glnT+7FMlhE1jVY4gruNJ37mCYMb
0gh9Zr76c2KYLA5vljGp6uo3j3A7wARJTdLfRFVcaFoz6NQmuFf9ZdBfDNDcymxs
AGvO3j55JAZf
=AoVg
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
Current release - regressions:
- bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
- netfilter: fix entries val in rule reset audit log
- eth: stmmac: fix incorrect rxq|txq_stats reference
Previous releases - regressions:
- ipv4: fix null-deref in ipv4_link_failure
- netfilter:
- fix several GC related issues
- fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
- eth: team: fix null-ptr-deref when team device type is changed
- eth: i40e: fix VF VLAN offloading when port VLAN is configured
- eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
Previous releases - always broken:
- core: fix ETH_P_1588 flow dissector
- mptcp: fix several connection hang-up conditions
- bpf:
- avoid deadlock when using queue and stack maps from NMI
- add override check to kprobe multi link attach
- hsr: properly parse HSRv1 supervisor frames.
- eth: igc: fix infinite initialization loop with early XDP redirect
- eth: octeon_ep: fix tx dma unmap len values in SG
- eth: hns3: fix GRE checksum offload issue"
* tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
sfc: handle error pointers returned by rhashtable_lookup_get_insert_fast()
igc: Expose tx-usecs coalesce setting to user
octeontx2-pf: Do xdp_do_flush() after redirects.
bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
net: ena: Flush XDP packets on error.
net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()
net: hinic: Fix warning-hinic_set_vlan_fliter() warn: variable dereferenced before check 'hwdev'
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
vxlan: Add missing entries to vxlan_get_size()
net: rds: Fix possible NULL-pointer dereference
team: fix null-ptr-deref when team device type is changed
net: bridge: use DEV_STATS_INC()
net: hns3: add 5ms delay before clear firmware reset irq source
net: hns3: fix fail to delete tc flower rules during reset issue
net: hns3: only enable unicast promisc when mac table full
net: hns3: fix GRE checksum offload issue
net: hns3: add cmdq check for vf periodic service task
net: stmmac: fix incorrect rxq|txq_stats reference
...
This adds handling of MSG_ZEROCOPY flag on transmission path:
1) If this flag is set and zerocopy transmission is possible (enabled
in socket options and transport allows zerocopy), then non-linear
skb will be created and filled with the pages of user's buffer.
Pages of user's buffer are locked in memory by 'get_user_pages()'.
2) Replaces way of skb owning: instead of 'skb_set_owner_sk_safe()' it
calls 'skb_set_owner_w()'. Reason of this change is that
'__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc' of socket, so
to decrease this field correctly, proper skb destructor is needed:
'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
3) Adds new callback to 'struct virtio_transport': 'can_msgzerocopy'.
If this callback is set, then transport needs extra check to be able
to send provided number of buffers in zerocopy mode. Currently, the
only transport that needs this callback set is virtio, because this
transport adds new buffers to the virtio queue and we need to check,
that number of these buffers is less than size of the queue (it is
required by virtio spec). vhost and loopback transports don't need
this check.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For tap device new skb is created and data from the current skb is
copied to it. This adds copying data from non-linear skb to new
the skb.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For non-linear skb use its pages from fragment array as buffers in
virtio tx queue. These pages are already pinned by 'get_user_pages()'
during such skb creation.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This is preparation patch for MSG_ZEROCOPY support. It adds handling of
non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
'skb_copy_datagram_iter()'. Main advantage of the second one is that it
can handle paged part of the skb by using 'kmap()' on each page, but if
there are no pages in the skb, it behaves like simple copying to iov
iterator. This patch also adds new field to the control block of skb -
this value shows current offset in the skb to read next portion of data
(it doesn't matter linear it or not). Idea behind this field is that
'skb_copy_datagram_iter()' handles both types of skb internally - it
just needs an offset from which to copy data from the given skb. This
offset is incremented on each read from skb. This approach allows to
simplify handling of both linear and non-linear skbs, because for
linear skb we need to call 'skb_pull()' after reading data from it,
while in non-linear case we need to update 'data_len'.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUKru0NHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AOLvEACEDiOEB9yhlnx4ZENgNRHMG48qLXKyRs+0cVL/
PLmc+jNtq/doMLUXOkZUnoTKdr6W5JuTVHMK/BJOWu5QjHCCybeQScx7uIjG8Yl1
F6zufiZ1g+nzyQsMHh/xQazqb75qnc7XvhzeW2vCiasPIDAaFpzmscX5nDy1rx7Q
HIpJCippnQ7Ami0X+qdH3SDgP4bK0+YXtLhtXLHGskb1yyP1QynjzUbDdZNHca88
sLn6OB36DCuANGDtgAFNo1dptkgT2aaE6XBGIVlpwtfo5KjJtzvGRIohxnbTcnSY
n562GokPQXn+yeJD2ZKW5XpvXTEDSLimzpHm4wNTRjTyrG3kIy+JxBjBM5inZam+
c7AJjuAZlfgrxbk2J6/laiV2WjqazswHuvoqs4wND2rCgi7Cdw9cBJ7iXy6ijN0J
oQS8Or0FnOU280hWEJLvp8jbdRzhksvRI5PHYaX5mvOy/iLKB9m6LIlHcjTMx0dh
NyLVPLx7C8UNOsqNIBe5545FqiRi5z/VxCDdL8AbCIy8cLja8A3D4t2Otggiktx7
skQCrLSBXcxh8sdmDvUUDfjeRmthqDnIbICCPJlWAZIR8e4rxbh8HxJHXutO7eEs
QH5CXJsW2psM/AcBrypqb2TQvsOAyGfL7XUt92GGOkHUjr333Qx2E4pGAsS7I+lz
XSguQA==
=dTwV
-----END PGP SIGNATURE-----
Merge tag 'nf-23-09-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter updates for net
The following three patches fix regressions in the netfilter subsystem:
1. Reject attempts to repeatedly toggle the 'dormant' flag in a single
transaction. Doing so makes nf_tables lose track of the real state
vs. the desired state. This ends with an attempt to unregister hooks
that were never registered in the first place, which yields a splat.
2. Fix element counting in the new nftables garbage collection infra
that came with 6.5: More than 255 expired elements wraps a counter
which results in memory leak.
3. Since 6.4 ipset can BUG when a set is renamed while a CREATE command
is in progress, fix from Jozsef Kadlecsik.
* tag 'nf-23-09-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
netfilter: nf_tables: fix memleak when more than 255 elements expired
netfilter: nf_tables: disable toggling dormant table state more than once
====================
Link: https://lore.kernel.org/r/20230920084156.4192-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This fixes the following warnings:
net/bluetooth/hci_core.c: In function ‘hci_register_dev’:
net/bluetooth/hci_core.c:2620:54: warning: ‘%d’ directive output may
be truncated writing between 1 and 10 bytes into a region of size 5
[-Wformat-truncation=]
2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id);
| ^~
net/bluetooth/hci_core.c:2620:50: note: directive argument in the range
[0, 2147483647]
2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id);
| ^~~~~~~
net/bluetooth/hci_core.c:2620:9: note: ‘snprintf’ output between 5 and
14 bytes into a destination of size 8
2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
While executing the Android 13 CTS Verifier Secure Server test on a
ChromeOS device, it was observed that the Bluetooth host initiates
authentication for an RFCOMM connection after SSP completes.
When this happens, some Intel Bluetooth controllers, like AC9560, would
disconnect with "Connection Rejected due to Security Reasons (0x0e)".
Historically, BlueZ did not mandate this authentication while an
authenticated combination key was already in use for the connection.
This behavior was changed since commit 7b5a9241b7
("Bluetooth: Introduce requirements for security level 4").
So, this patch addresses the aforementioned disconnection issue by
restoring the previous behavior.
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
iso_listen_cis shall only return -EADDRINUSE if the listening socket has
the destination set to BDADDR_ANY otherwise if the destination is set to
a specific address it is for broadcast which shall be ignored.
Fixes: f764a6c2c1 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Syzbot found a bug "BUG: sleeping function called from invalid context
at kernel/locking/mutex.c:580". It is because hci_link_tx_to holds an
RCU read lock and calls hci_disconnect which would hold a mutex lock
since the commit a13f316e90 ("Bluetooth: hci_conn: Consolidate code
for aborting connections"). Here's an example call trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xfc/0x174 lib/dump_stack.c:106
___might_sleep+0x4a9/0x4d3 kernel/sched/core.c:9663
__mutex_lock_common kernel/locking/mutex.c:576 [inline]
__mutex_lock+0xc7/0x6e7 kernel/locking/mutex.c:732
hci_cmd_sync_queue+0x3a/0x287 net/bluetooth/hci_sync.c:388
hci_abort_conn+0x2cd/0x2e4 net/bluetooth/hci_conn.c:1812
hci_disconnect+0x207/0x237 net/bluetooth/hci_conn.c:244
hci_link_tx_to net/bluetooth/hci_core.c:3254 [inline]
__check_timeout net/bluetooth/hci_core.c:3419 [inline]
__check_timeout+0x310/0x361 net/bluetooth/hci_core.c:3399
hci_sched_le net/bluetooth/hci_core.c:3602 [inline]
hci_tx_work+0xe8f/0x12d0 net/bluetooth/hci_core.c:3652
process_one_work+0x75c/0xba1 kernel/workqueue.c:2310
worker_thread+0x5b2/0x73a kernel/workqueue.c:2457
kthread+0x2f7/0x30b kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298
This patch releases RCU read lock before calling hci_disconnect and
reacquires it afterward to fix the bug.
Fixes: a13f316e90 ("Bluetooth: hci_conn: Consolidate code for aborting connections")
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When HCI_QUIRK_STRICT_DUPLICATE_FILTER is set LE scanning requires
periodic restarts of the scanning procedure as the controller would
consider device previously found as duplicated despite of RSSI changes,
but in order to set the scan timeout properly set le_scan_restart needs
to be synchronous so it shall not use hci_cmd_sync_queue which defers
the command processing to cmd_sync_work.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-bluetooth/578e6d7afd676129decafba846a933f5@agner.ch/#t
Fixes: 27d54b778a ("Bluetooth: Rework le_scan_restart for hci_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_req_prepare_suspend() has been deprecated in favor of
hci_suspend_sync().
Fixes: 182ee45da0 ("Bluetooth: hci_sync: Rework hci_suspend_notifier")
Signed-off-by: Yao Xiao <xiaoyao@rock-chips.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Don't populate the read-only array lanes on the stack, instead make
it static const.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kyle Zeng reported that there is a race between IPSET_CMD_ADD and IPSET_CMD_SWAP
in netfilter/ip_set, which can lead to the invocation of `__ip_set_put` on a
wrong `set`, triggering the `BUG_ON(set->ref == 0);` check in it.
The race is caused by using the wrong reference counter, i.e. the ref counter instead
of ref_netlink.
Fixes: 24e227896b ("netfilter: ipset: Add schedule point in call_ad().")
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/ZPZqetxOmH+w%2Fmyc@westworld/#r
Tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
When more than 255 elements expired we're supposed to switch to a new gc
container structure.
This never happens: u8 type will wrap before reaching the boundary
and nft_trans_gc_space() always returns true.
This means we recycle the initial gc container structure and
lose track of the elements that came before.
While at it, don't deref 'gc' after we've passed it to call_rcu.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
nft -f -<<EOF
add table ip t
add table ip t { flags dormant; }
add chain ip t c { type filter hook input priority 0; }
add table ip t
EOF
Triggers a splat from nf core on next table delete because we lose
track of right hook register state:
WARNING: CPU: 2 PID: 1597 at net/netfilter/core.c:501 __nf_unregister_net_hook
RIP: 0010:__nf_unregister_net_hook+0x41b/0x570
nf_unregister_net_hook+0xb4/0xf0
__nf_tables_unregister_hook+0x160/0x1d0
[..]
The above should have table in *active* state, but in fact no
hooks were registered.
Reject on/off/on games rather than attempting to fix this.
Fixes: 179d9ba555 ("netfilter: nf_tables: fix table flag updates")
Reported-by: "Lee, Cherie-Anne" <cherie.lee@starlabs.sg>
Cc: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Cc: info@starlabs.sg
Signed-off-by: Florian Westphal <fw@strlen.de>
In rds_rdma_cm_event_handler_cmn() check, if conn pointer exists
before dereferencing it as rdma_set_service_type() argument
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: fd261ce6a3 ("rds: rdma: update rdma transport for tos")
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have data-races while reading np->srcprefs
Switch the field to a plain byte, add READ_ONCE()
and WRITE_ONCE() annotations where needed,
and IPV6_ADDR_PREFERENCES setsockopt() can now be lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230918142321.1794107-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugfixes:
* Various O_DIRECT related fixes from Trond
* Error handling
* Locking issues
* Use the correct commit infor for joining page groups
* Fixes for rescheduling IO
* Sunrpc bad verifier fixes
* Report EINVAL errors from connect()
* Revalidate creds that the server has rejected
* Revert "SUNRPC: Fail faster on bad verifier"
* Fix pNFS session trunking when MDS=DS
* Fix zero-value filehandles for post-open getattr operations
* Fix compiler warning about tautological comparisons
* Revert "SUNRPC: clean up integer overflow check" before Trond's fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmUIjesACgkQ18tUv7Cl
QOv/MBAAxxZrW45qlTsEMvhnFXu1cvt08IdqivWURuSxK6oV31tARMqw00MxzhsF
ubNX4hynDNmsWYJLryHTTGwUJzaNQOhAGjU46AhwFgS3Lj2wefeuR3IYuEDgQcHF
8uXrrT/GWRIG4a/Uu1IxodfgMvgUPV+HGFc5BQFGkXGqPQdgbdxBDtOtyU1QlmaV
rqKnxPgUibEZwVoz8+PNsN+pwNjKSaSR3pH53KKS6HUBAM69sT4jJiJDO/UGSlzb
F9U1DHPeyltHXVAaQXUjHlsjh49NbIjzD5/T8xKWscvS7rbQFCcJkUl3MBHyvSEr
eixx7muqOMCEm4lNFJqHIY7XrEN+50wi92tKj05Bz9wrnp4L84cxSV/KNZrKNA3T
LCeIm4DGynPcLPopEUm9lqZMrabwvdV4wZSiybB6p/F/5ksVyaMnwTLzZZ1+NhkB
OglNhF4L6m18/Ai/bY6XOixzgzWfcmYfXRhq7YoeO/JSvixG9Sk2crC6ryjWadgo
xbitjjCYXHl3MUkH2TcaQoMGFhmvSShXg//5YXpxm3/0C3EVPa44T7/aFJqelL2p
kkVsLcjcevAOwBHxehYLofL6c3GhLqRnwmTV7yqgJqfZf8uGxeQhXCMqOmv57/c0
3iJJM4Llb14B+t3xl3T8MotT4WejzlHDnJKrscipfXJwRT7UDIM=
=M6aI
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Various O_DIRECT related fixes from Trond:
- Error handling
- Locking issues
- Use the correct commit info for joining page groups
- Fixes for rescheduling IO
Sunrpc bad verifier fixes:
- Report EINVAL errors from connect()
- Revalidate creds that the server has rejected
- Revert "SUNRPC: Fail faster on bad verifier"
Misc:
- Fix pNFS session trunking when MDS=DS
- Fix zero-value filehandles for post-open getattr operations
- Fix compiler warning about tautological comparisons
- Revert 'SUNRPC: clean up integer overflow check' before Trond's fix"
* tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Silence compiler complaints about tautological comparisons
Revert "SUNRPC: clean up integer overflow check"
NFSv4.1: fix zero value filehandle in post open getattr
NFSv4.1: fix pnfs MDS=DS session trunking
Revert "SUNRPC: Fail faster on bad verifier"
SUNRPC: Mark the cred for revalidation if the server rejects it
NFS/pNFS: Report EINVAL errors from connect() to the server
NFS: More fixes for nfs_direct_write_reschedule_io()
NFS: Use the correct commit info in nfs_join_page_group()
NFS: More O_DIRECT accounting fixes for error paths
NFS: Fix O_DIRECT locking issues
NFS: Fix error handling for O_DIRECT write scheduling
According to RFC 8684 section 3.3:
A connection is not closed unless [...] or an implementation-specific
connection-level send timeout.
Currently the MPTCP protocol does not implement such timeout, and
connection timing-out at the TCP-level never move to close state.
Introduces a catch-up condition at subflow close time to move the
MPTCP socket to close, too.
That additionally allows removing similar existing inside the worker.
Finally, allow some additional timeout for plain ESTABLISHED mptcp
sockets, as the protocol allows creating new subflows even at that
point and making the connection functional again.
This issue is actually present since the beginning, but it is basically
impossible to solve without a long chain of functional pre-requisites
topped by commit bbd49d114d ("mptcp: consolidate transition to
TCP_CLOSE in mptcp_do_fastclose()"). When backporting this current
patch, please also backport this other commit as well.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/430
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The msk socket uses to different timeout to track close related
events and retransmissions. The existing helpers do not indicate
clearly which timer they actually touch, making the related code
quite confusing.
Change the existing helpers name to avoid such confusion. No
functional change intended.
This patch is linked to the next one ("mptcp: fix dangling connection
hang-up"). The two patches are supposed to be backported together.
Cc: stable@vger.kernel.org # v5.11+
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
On incoming TCP reset, subflow closing could happen before error
propagation. That in turn could cause the socket error being ignored,
and a missing socket state transition, as reported by Daire-Byrne.
Address the issues explicitly checking for subflow socket error at
close time. To avoid code duplication, factor-out of __mptcp_error_report()
a new helper implementing the relevant bits.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/429
Fixes: 15cc104533 ("mptcp: deliver ssk errors to msk")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will simplify the next patch ("mptcp: process pending subflow error
on close").
No functional change intended.
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.
To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.
Address the issue feeding to the rcv wnd update the correct value.
Fixes: f3589be0c4 ("mptcp: never shrink offered window")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct ceph_monmap.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If, for any reason, the open-coded arithmetic causes a wraparound,
the protection that `struct_size()` adds against potential integer
overflows is defeated. Fix this by hardening call to `struct_size()`
with `size_add()`.
Fixes: e034c6d23b ("tipc: Use struct_size() helper")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If, for any reason, the open-coded arithmetic causes a wraparound,
the protection that `struct_size()` adds against potential integer
overflows is defeated. Fix this by hardening call to `struct_size()`
with `size_add()`.
Fixes: b89fec54fd ("tls: rx: wrap decrypt params in a struct")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 GHz regulatory domains introduces Power Spectral Density (PSD).
The PSD value of the regulatory rule should be taken into effect
for the ieee80211_channels falling into that particular regulatory
rule. Save the values in the channel which has PSD value and add
nl80211 attributes accordingly to handle it.
Co-developed-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Signed-off-by: Wen Gong <quic_wgong@quicinc.com>
Link: https://lore.kernel.org/r/20230914082026.3709-1-quic_wgong@quicinc.com
[use hole in chan flags, reword docs]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If userspace quickly opens /dev/rfkill after a new
instance was created, it might see the old state of
the instance from before the sync work runs and may
even _change_ the state, only to have the sync work
change it again.
Fix this by doing the sync inline where needed, not
just for /dev/rfkill but also for sysfs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Struct hsr_sup_tlv describes HW layout and therefore it needs a __packed
attribute to ensure the compiler does not add any padding.
Due to the size and __packed attribute of the structs that use
hsr_sup_tlv it has no functional impact.
Add __packed to struct hsr_sup_tlv.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
While adding support for parsing the redbox supervision frames, the
author added `pull_size' and `total_pull_size' to track the amount of
bytes that were pulled from the skb during while parsing the skb so it
can be reverted/ pushed back at the end.
In the process probably copy&paste error occurred and for the HSRv1 case
the ethhdr was used instead of the hsr_tag. Later the hsr_tag was used
instead of hsr_sup_tag. The later error didn't matter because both
structs have the size so HSRv0 was still working. It broke however HSRv1
parsing because struct ethhdr is larger than struct hsr_tag.
Reinstate the old pulling flow and pull first ethhdr, hsr_tag in v1 case
followed by hsr_sup_tag.
[bigeasy: commit message]
Fixes: eafaa88b3e ("net: hsr: Add support for redbox supervision frames")'
Suggested-by: Tristram.Ha@microchip.com
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the carrier/no-carrier state for the network interface
shared between the BMC and the passthrough channel. Without this
functionality the BMC is unable to reconfigure the NIC in the event
of a re-cabling to a different subnet.
Signed-off-by: Johnathan Mantey <johnathanx.mantey@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we assume the skb is associated with a device before calling
__ip_options_compile, which is not always the case if it is re-routed by
ipvs.
When skb->dev is NULL, dev_net(skb->dev) will become null-dereference.
This patch adds a check for the edge case and switch to use the net_device
from the rtable when skb->dev is NULL.
Fixes: ed0de45a10 ("ipv4: recompile ip options in ipv4_link_failure")
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Cc: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 79 files changed, 5275 insertions(+), 600 deletions(-).
The main changes are:
1) Basic BTF validation in libbpf, from Andrii Nakryiko.
2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi.
3) next_thread cleanups, from Oleg Nesterov.
4) Add mcpu=v4 support to arm32, from Puranjay Mohan.
5) Add support for __percpu pointers in bpf progs, from Yonghong Song.
6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang.
7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman",
Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle),
Song Liu, Stanislav Fomichev, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In mlx5, there is a devlink instance created for PCI device. Also, one
separate devlink instance is created for auxiliary device that
represents the netdev of uplink port. This relation is currently
invisible to the devlink user.
Benefit from the rel infrastructure and allow for nested devlink
instance to set the relationship for the nested-in devlink instance.
Note that there may be many nested instances, therefore use xarray to
hold the list of rel_indexes for individual nested instances.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benefit from the newly introduced rel infrastructure, treat the linecard
nested devlink instances in the same way as port function instances.
Convert the code to use the rel infrastructure.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new helper devl_port_fn_devlink_set() to be used by driver
assigning a devlink instance to the peer devlink port function.
Expose this to user over new netlink attribute nested under port
function nest to expose devlink handle related to the port function.
This is particularly helpful for user to understand the relationship
between devlink instances created for SFs and the port functions
they belong to.
Note that caller of devlink_port_notify() needs to hold devlink
instance lock, put the assertion to devl_port_fn_devlink_set() to make
this requirement explicit. Also note the limitations that only allow to
make this assignment for registered objects.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a bit tricky to maintain relationship between devlink objects and
nested devlink instances due to following aspects:
1) Locking. It is necessary to lock the devlink instance that contains
the object first, only after that to lock the nested instance.
2) Lifetimes. Objects (e.g devlink port) may be removed before
the nested devlink instance.
3) Notifications. If nested instance changes (e.g. gets
registered/unregistered) the nested-in object needs to send
appropriate notifications.
Resolve this by introducing an xarray that holds 1:1 relationships
between devlink object and related nested devlink instance.
Use that xarray index to get the object/nested devlink instance on
the other side.
Provide necessary helpers:
devlink_rel_nested_in_add/clear() to add and clear the relationship.
devlink_rel_nested_in_notify() to call the nested-in object to send
notifications during nested instance register/unregister/netns
change.
devlink_rel_devlink_handle_put() to be used by nested-in object fill
function to fill the nested handle.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the next patch is going to call this helper with need to fill another
type of nested attribute, pass it over function arg.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the next patch is going to call this helper out of the linecard.c,
move to netlink.c.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>