70252 Commits

Author SHA1 Message Date
Jiri Pirko
852e85a704 net: devlink: add unlocked variants of devling_trap*() functions
Add unlocked variants of devl_trap*() functions to be used in drivers
called-in with devlink->lock held.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 20:10:46 -07:00
Moshe Shemesh
e26fde2f5b net: devlink: avoid false DEADLOCK warning reported by lockdep
Add a lock_class_key per devlink instance to avoid DEADLOCK warning by
lockdep, while locking more than one devlink instance in driver code,
for example in opening VFs flow.

Kernel log:
[  101.433802] ============================================
[  101.433803] WARNING: possible recursive locking detected
[  101.433810] 5.19.0-rc1+ #35 Not tainted
[  101.433812] --------------------------------------------
[  101.433813] bash/892 is trying to acquire lock:
[  101.433815] ffff888127bfc2f8 (&devlink->lock){+.+.}-{3:3}, at: probe_one+0x3c/0x690 [mlx5_core]
[  101.433909]
               but task is already holding lock:
[  101.433910] ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
[  101.433989]
               other info that might help us debug this:
[  101.433990]  Possible unsafe locking scenario:

[  101.433991]        CPU0
[  101.433991]        ----
[  101.433992]   lock(&devlink->lock);
[  101.433993]   lock(&devlink->lock);
[  101.433995]
                *** DEADLOCK ***

[  101.433996]  May be due to missing lock nesting notation

[  101.433996] 6 locks held by bash/892:
[  101.433998]  #0: ffff88810eb50448 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0
[  101.434009]  #1: ffff888114777c88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x20d/0x520
[  101.434017]  #2: ffff888102b58660 (kn->active#231){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x230/0x520
[  101.434023]  #3: ffff888102d70198 (&dev->mutex){....}-{3:3}, at: sriov_numvfs_store+0x132/0x310
[  101.434031]  #4: ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
[  101.434108]  #5: ffff88812adce198 (&dev->mutex){....}-{3:3}, at: __device_attach+0x76/0x430
[  101.434116]
               stack backtrace:
[  101.434118] CPU: 5 PID: 892 Comm: bash Not tainted 5.19.0-rc1+ #35
[  101.434120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  101.434130] Call Trace:
[  101.434133]  <TASK>
[  101.434135]  dump_stack_lvl+0x57/0x7d
[  101.434145]  __lock_acquire.cold+0x1df/0x3e7
[  101.434151]  ? register_lock_class+0x1880/0x1880
[  101.434157]  lock_acquire+0x1c1/0x550
[  101.434160]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434229]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  101.434232]  ? __xa_alloc+0x1ed/0x2d0
[  101.434236]  ? ksys_write+0xf3/0x1d0
[  101.434239]  __mutex_lock+0x12c/0x14b0
[  101.434243]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434312]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434380]  ? devlink_alloc_ns+0x11b/0x910
[  101.434385]  ? mutex_lock_io_nested+0x1320/0x1320
[  101.434388]  ? lockdep_init_map_type+0x21a/0x7d0
[  101.434391]  ? lockdep_init_map_type+0x21a/0x7d0
[  101.434393]  ? __init_swait_queue_head+0x70/0xd0
[  101.434397]  probe_one+0x3c/0x690 [mlx5_core]
[  101.434467]  pci_device_probe+0x1b4/0x480
[  101.434471]  really_probe+0x1e0/0xaa0
[  101.434474]  __driver_probe_device+0x219/0x480
[  101.434478]  driver_probe_device+0x49/0x130
[  101.434481]  __device_attach_driver+0x1b8/0x280
[  101.434484]  ? driver_allows_async_probing+0x140/0x140
[  101.434487]  bus_for_each_drv+0x123/0x1a0
[  101.434489]  ? bus_for_each_dev+0x1a0/0x1a0
[  101.434491]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  101.434494]  ? trace_hardirqs_on+0x2d/0x100
[  101.434498]  __device_attach+0x1a3/0x430
[  101.434501]  ? device_driver_attach+0x1e0/0x1e0
[  101.434503]  ? pci_bridge_d3_possible+0x1e0/0x1e0
[  101.434506]  ? pci_create_resource_files+0xeb/0x190
[  101.434511]  pci_bus_add_device+0x6c/0xa0
[  101.434514]  pci_iov_add_virtfn+0x9e4/0xe00
[  101.434517]  ? trace_hardirqs_on+0x2d/0x100
[  101.434521]  sriov_enable+0x64a/0xca0
[  101.434524]  ? pcibios_sriov_disable+0x10/0x10
[  101.434528]  mlx5_core_sriov_configure+0xab/0x280 [mlx5_core]
[  101.434602]  sriov_numvfs_store+0x20a/0x310
[  101.434605]  ? sriov_totalvfs_show+0xc0/0xc0
[  101.434608]  ? sysfs_file_ops+0x170/0x170
[  101.434611]  ? sysfs_file_ops+0x117/0x170
[  101.434614]  ? sysfs_file_ops+0x170/0x170
[  101.434616]  kernfs_fop_write_iter+0x348/0x520
[  101.434619]  new_sync_write+0x2e5/0x520
[  101.434621]  ? new_sync_read+0x520/0x520
[  101.434624]  ? lock_acquire+0x1c1/0x550
[  101.434626]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  101.434630]  vfs_write+0x5cb/0x8d0
[  101.434633]  ksys_write+0xf3/0x1d0
[  101.434635]  ? __x64_sys_read+0xb0/0xb0
[  101.434638]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  101.434640]  ? syscall_enter_from_user_mode+0x1d/0x50
[  101.434643]  do_syscall_64+0x3d/0x90
[  101.434647]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  101.434650] RIP: 0033:0x7f5ff536b2f7
[  101.434658] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  101.434661] RSP: 002b:00007ffd9ea85d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  101.434664] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5ff536b2f7
[  101.434666] RDX: 0000000000000002 RSI: 000055c4c279e230 RDI: 0000000000000001
[  101.434668] RBP: 000055c4c279e230 R08: 000000000000000a R09: 0000000000000001
[  101.434669] R10: 000055c4c283cbf0 R11: 0000000000000246 R12: 0000000000000002
[  101.434670] R13: 00007f5ff543d500 R14: 0000000000000002 R15: 00007f5ff543d700
[  101.434673]  </TASK>

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 20:10:46 -07:00
Pavel Begunkov
2e07a521e1 skbuff: add SKBFL_DONT_ORPHAN flag
We don't want to list every single ubuf_info callback in
skb_orphan_frags(), add a flag controlling the behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 19:58:45 -07:00
Pavel Begunkov
1b4b2b09d4 skbuff: don't mix ubuf_info from different sources
We should not append MSG_ZEROCOPY requests to skbuff with non
MSG_ZEROCOPY ubuf_info, they might be not compatible.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 19:58:45 -07:00
Pavel Begunkov
773ba4fe91 ipv6: avoid partial copy for zc
Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such copies. And as a bonus we can allocate smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 19:58:29 -07:00
Pavel Begunkov
8eb77cc739 ipv4: avoid partial copy for zc
Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such copies. And as a bonus we can allocate smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18 19:58:03 -07:00
Tetsuo Handa
3598cb6e18 wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
lockdep complains use of uninitialized spinlock at ieee80211_do_stop() [1],
for commit f856373e2f31ffd3 ("wifi: mac80211: do not wake queues on a vif
that is being stopped") guards clear_bit() using fq.lock even before
fq_init() from ieee80211_txq_setup_flows() initializes this spinlock.

According to discussion [2], Toke was not happy with expanding usage of
fq.lock. Since __ieee80211_wake_txqs() is called under RCU read lock, we
can instead use synchronize_rcu() for flushing ieee80211_wake_txqs().

Link: https://syzkaller.appspot.com/bug?extid=eceab52db7c4b961e9d6 [1]
Link: https://lkml.kernel.org/r/874k0zowh2.fsf@toke.dk [2]
Reported-by: syzbot <syzbot+eceab52db7c4b961e9d6@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: f856373e2f31ffd3 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
Tested-by: syzbot <syzbot+eceab52db7c4b961e9d6@syzkaller.appspotmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@kernel.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/9cc9b81d-75a3-3925-b612-9d0ad3cab82b@I-love.SAKURA.ne.jp
2022-07-18 15:01:14 +03:00
Kuniyuki Iwashima
021266ec64 tcp: Fix data-races around sysctl_tcp_fastopen_blackhole_timeout.
While reading sysctl_tcp_fastopen_blackhole_timeout, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its readers.

Fixes: cf1ef3f0719b ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
5a54213318 tcp: Fix data-races around sysctl_tcp_fastopen.
While reading sysctl_tcp_fastopen, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 2100c8d2d9db ("net-tcp: Fast Open base")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
79539f3474 tcp: Fix data-races around sysctl_max_syn_backlog.
While reading sysctl_max_syn_backlog, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
cbfc649558 tcp: Fix a data-race around sysctl_tcp_tw_reuse.
While reading sysctl_tcp_tw_reuse, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
39e24435a7 tcp: Fix data-races around some timeout sysctl knobs.
While reading these sysctl knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.

  - tcp_retries1
  - tcp_retries2
  - tcp_orphan_retries
  - tcp_fin_timeout

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
46778cd16e tcp: Fix data-races around sysctl_tcp_reordering.
While reading sysctl_tcp_reordering, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
4177f54589 tcp: Fix data-races around sysctl_tcp_migrate_req.
While reading sysctl_tcp_migrate_req, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
f2e383b5bb tcp: Fix data-races around sysctl_tcp_syncookies.
While reading sysctl_tcp_syncookies, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
20a3b1c0f6 tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.
While reading sysctl_tcp_syn(ack)?_retries, they can be changed
concurrently.  Thus, we need to add READ_ONCE() to their readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
f2f316e287 tcp: Fix data-races around keepalive sysctl knobs.
While reading sysctl_tcp_keepalive_(time|probes|intvl), they can be changed
concurrently.  Thus, we need to add READ_ONCE() to their readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
8ebcc62c73 igmp: Fix data-races around sysctl_igmp_qrv.
While reading sysctl_igmp_qrv, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

This test can be packed into a helper, so such changes will be in the
follow-up series after net is merged into net-next.

  qrv ?: READ_ONCE(net->ipv4.sysctl_igmp_qrv);

Fixes: a9fe8e29945d ("ipv4: implement igmp_qrv sysctl to tune igmp robustness variable")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:54 +01:00
Kuniyuki Iwashima
6ae0f2e553 igmp: Fix data-races around sysctl_igmp_max_msf.
While reading sysctl_igmp_max_msf, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:53 +01:00
Kuniyuki Iwashima
6305d821e3 igmp: Fix a data-race around sysctl_igmp_max_memberships.
While reading sysctl_igmp_max_memberships, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:53 +01:00
Kuniyuki Iwashima
f6da2267e7 igmp: Fix data-races around sysctl_igmp_llm_reports.
While reading sysctl_igmp_llm_reports, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

This test can be packed into a helper, so such changes will be in the
follow-up series after net is merged into net-next.

  if (ipv4_is_local_multicast(pmc->multiaddr) &&
      !READ_ONCE(net->ipv4.sysctl_igmp_llm_reports))

Fixes: df2cf4a78e48 ("IGMP: Inhibit reports for local multicast groups")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 12:21:53 +01:00
Tariq Toukan
f08d8c1bb9 net/tls: Fix race in TLS device down flow
Socket destruction flow and tls_device_down function sync against each
other using tls_device_lock and the context refcount, to guarantee the
device resources are freed via tls_dev_del() by the end of
tls_device_down.

In the following unfortunate flow, this won't happen:
- refcount is decreased to zero in tls_device_sk_destruct.
- tls_device_down starts, skips the context as refcount is zero, going
  all the way until it flushes the gc work, and returns without freeing
  the device resources.
- only then, tls_device_queue_ctx_destruction is called, queues the gc
  work and frees the context's device resources.

Solve it by decreasing the refcount in the socket's destruction flow
under the tls_device_lock, for perfect synchronization.  This does not
slow down the common likely destructor flow, in which both the refcount
is decreased and the spinlock is acquired, anyway.

Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:33:07 +01:00
Jakub Kicinski
fd31f3996a tls: rx: decrypt into a fresh skb
We currently CoW Rx skbs whenever we can't decrypt to a user
space buffer. The skbs can be enormous (64kB) and CoW does
a linear alloc which has a strong chance of failing under
memory pressure. Or even without, skb_cow_data() assumes
GFP_ATOMIC.

Allocate a new frag'd skb and decrypt into it. We finally
take advantage of the decrypted skb getting returned via
darg.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
cbbdee9918 tls: rx: async: don't put async zc on the list
The "zero-copy" path in SW TLS will engage either for no skbs or
for all but last. If the recvmsg parameters are right and the
socket can do ZC we'll ZC until the iterator can't fit a full
record at which point we'll decrypt one more record and copy
over the necessary bits to fill up the request.

The only reason we hold onto the ZC skbs which went thru the async
path until the end of recvmsg() is to count bytes. We need an accurate
count of zc'ed bytes so that we can calculate how much of the non-zc'd
data to copy. To allow freeing input skbs on the ZC path count only
how much of the list we'll need to consume.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
c618db2afe tls: rx: async: hold onto the input skb
Async crypto currently benefits from the fact that we decrypt
in place. When we allow input and output to be different skbs
we will have to hang onto the input while we move to the next
record. Clone the inputs and keep them on a list.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
6ececdc513 tls: rx: async: adjust record geometry immediately
Async crypto TLS Rx currently waits for crypto to be done
in order to strip the TLS header and tailer. Simplify
the code by moving the pointers immediately, since only
TLS 1.2 is supported here there is no message padding.

This simplifies the decryption into a new skb in the next
patch as we don't have to worry about input vs output
skb in the decrypt_done() handler any more.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
6bd116c8c6 tls: rx: return the decrypted skb via darg
Instead of using ctx->recv_pkt after decryption read the skb
from darg.skb. This moves the decision of what the "output skb"
is to the decrypt handlers. For now after decrypt handler returns
successfully ctx->recv_pkt is simply moved to darg.skb, but it
will change soon.

Note that tls_decrypt_sg() cannot clear the ctx->recv_pkt
because it gets called to re-encrypt (i.e. by the device offload).
So we need an awkward temporary if() in tls_rx_one_record().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
541cc48be3 tls: rx: read the input skb from ctx->recv_pkt
Callers always pass ctx->recv_pkt into decrypt_skb_update(),
and it propagates it to its callees. This may give someone
the false impression that those functions can accept any valid
skb containing a TLS record. That's not the case, the record
sequence number is read from the context, and they can only
take the next record coming out of the strp.

Let the functions get the skb from the context instead of
passing it in. This will also make it cleaner to return
a different skb than ctx->recv_pkt as the decrypted one
later on.

Since we're touching the definition of decrypt_skb_update()
use this as an opportunity to rename it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
8a95873281 tls: rx: factor out device darg update
I already forgot to transform darg from input to output
semantics once on the NIC inline crypto fastpath. To
avoid this happening again create a device equivalent
of decrypt_internal(). A function responsible for decryption
and transforming darg.

While at it rename decrypt_internal() to a hopefully slightly
more meaningful name.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:11 +01:00
Jakub Kicinski
53d57999fe tls: rx: remove the message decrypted tracking
We no longer allow a decrypted skb to remain linked to ctx->recv_pkt.
Anything on the list is decrypted, anything on ctx->recv_pkt needs
to be decrypted.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:10 +01:00
Jakub Kicinski
abb47dc95d tls: rx: don't keep decrypted skbs on ctx->recv_pkt
Detach the skb from ctx->recv_pkt after decryption is done,
even if we can't consume it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:10 +01:00
Jakub Kicinski
008141de85 tls: rx: don't try to keep the skbs always on the list
I thought that having the skb either always on the ctx->rx_list
or ctx->recv_pkt will simplify the handling, as we would not
have to remember to flip it from one to the other on exit paths.

This became a little harder to justify with the fix for BPF
sockmaps. Subsequent changes will make the situation even worse.
Queue the skbs only when really needed.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:10 +01:00
Jakub Kicinski
4cbc325ed6 tls: rx: allow only one reader at a time
recvmsg() in TLS gets data from the skb list (rx_list) or fresh
skbs we read from TCP via strparser. The former holds skbs which were
already decrypted for peek or decrypted and partially consumed.

tls_wait_data() only notices appearance of fresh skbs coming out
of TCP (or psock). It is possible, if there is a concurrent call
to peek() and recv() that the peek() will move the data from input
to rx_list without recv() noticing. recv() will then read data out
of order or never wake up.

This is not a practical use case/concern, but it makes the self
tests less reliable. This patch solves the problem by allowing
only one reader in.

Because having multiple processes calling read()/peek() is not
normal avoid adding a lock and try to fast-path the single reader
case.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:24:10 +01:00
Wen Gu
ddefb2d205 net/smc: Extend SMC-R link group netlink attribute
Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
SMC-R link group.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:17 +01:00
Wen Gu
b8d199451c net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R
On long-running enterprise production servers, high-order contiguous
memory pages are usually very rare and in most cases we can only get
fragmented pages.

When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may result
in frequent memory compaction, which will cause unexpected hung issue
and further stability risks.

So this patch is aimed to allow SMC-R link group to use virtually
contiguous sndbufs and RMBs to avoid potential issues mentioned above.
Whether to use physically or virtually contiguous buffers can be set
by sysctl smcr_buf_type.

Note that using virtually contiguous buffers will bring an acceptable
performance regression, which can be mainly divided into two parts:

1) regression in data path, which is brought by additional address
   translation of sndbuf by RNIC in Tx. But in general, translating
   address through MTT is fast.

   Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
   latency and bandwidth test with physically and virtually contiguous
   buffers are as follows:

- client:
  smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
  -t 5 -vu tcp_{bw|lat}
- server:
  smc_run taskset -c <cpu> qperf

   [latency]
   msgsize              tcp            smcr        smcr-use-virt-buf
   1               11.17 us         7.56 us         7.51 us (-0.67%)
   2               10.65 us         7.74 us         7.56 us (-2.31%)
   4               11.11 us         7.52 us         7.59 us ( 0.84%)
   8               10.83 us         7.55 us         7.51 us (-0.48%)
   16              11.21 us         7.46 us         7.51 us ( 0.71%)
   32              10.65 us         7.53 us         7.58 us ( 0.61%)
   64              10.95 us         7.74 us         7.80 us ( 0.76%)
   128             11.14 us         7.83 us         7.87 us ( 0.47%)
   256             10.97 us         7.94 us         7.92 us (-0.28%)
   512             11.23 us         7.94 us         8.20 us ( 3.25%)
   1024            11.60 us         8.12 us         8.20 us ( 0.96%)
   2048            14.04 us         8.30 us         8.51 us ( 2.49%)
   4096            16.88 us         9.13 us         9.07 us (-0.64%)
   8192            22.50 us        10.56 us        11.22 us ( 6.26%)
   16384           28.99 us        12.88 us        13.83 us ( 7.37%)
   32768           40.13 us        16.76 us        16.95 us ( 1.16%)
   65536           68.70 us        24.68 us        24.85 us ( 0.68%)
   [bandwidth]
   msgsize                tcp              smcr          smcr-use-virt-buf
   1                1.65 MB/s         1.59 MB/s         1.53 MB/s (-3.88%)
   2                3.32 MB/s         3.17 MB/s         3.08 MB/s (-2.67%)
   4                6.66 MB/s         6.33 MB/s         6.09 MB/s (-3.85%)
   8               13.67 MB/s        13.45 MB/s        11.97 MB/s (-10.99%)
   16              25.36 MB/s        27.15 MB/s        24.16 MB/s (-11.01%)
   32              48.22 MB/s        54.24 MB/s        49.41 MB/s (-8.89%)
   64             106.79 MB/s       107.32 MB/s        99.05 MB/s (-7.71%)
   128            210.21 MB/s       202.46 MB/s       201.02 MB/s (-0.71%)
   256            400.81 MB/s       416.81 MB/s       393.52 MB/s (-5.59%)
   512            746.49 MB/s       834.12 MB/s       809.99 MB/s (-2.89%)
   1024          1292.33 MB/s      1641.96 MB/s      1571.82 MB/s (-4.27%)
   2048          2007.64 MB/s      2760.44 MB/s      2717.68 MB/s (-1.55%)
   4096          2665.17 MB/s      4157.44 MB/s      4070.76 MB/s (-2.09%)
   8192          3159.72 MB/s      4361.57 MB/s      4270.65 MB/s (-2.08%)
   16384         4186.70 MB/s      4574.13 MB/s      4501.17 MB/s (-1.60%)
   32768         4093.21 MB/s      4487.42 MB/s      4322.43 MB/s (-3.68%)
   65536         4057.14 MB/s      4735.61 MB/s      4555.17 MB/s (-3.81%)

2) regression in buffer initialization and destruction path, which is
   brought by additional MR operations of sndbufs. But thanks to link
   group buffer reuse mechanism, the impact of this kind of regression
   decreases as times of buffer reuse increases.

   Taking 256KB sndbuf and RMB as an example, latency of some key SMC-R
   buffer-related function obtained by bpftrace are as follows:

   Function                         Phys-bufs           Virt-bufs
   smcr_new_buf_create()             67154 ns            79164 ns
   smc_ib_buf_map_sg()                 525 ns              928 ns
   smc_ib_get_memory_region()       162294 ns           161191 ns
   smc_wr_reg_send()                  9957 ns             9635 ns
   smc_ib_put_memory_region()       203548 ns           198374 ns
   smc_ib_buf_unmap_sg()               508 ns             1158 ns

------------
Test environment notes:
1. Above tests run on 2 VMs within the same Host.
2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
   the each VM respectively.
3. VMs' vCPUs are binded to different physical CPUs, and the binded
   physical CPUs are isolated by `isolcpus=xxx` cmdline.
4. NICs' queue number are set to 1.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:17 +01:00
Wen Gu
b984f370ed net/smc: Use sysctl-specified types of buffers in new link group
This patch introduces a new SMC-R specific element buf_type
in struct smc_link_group, for recording the value of sysctl
smcr_buf_type when link group is created.

New created link group will create and reuse buffers of the
type specified by buf_type.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:17 +01:00
Wen Gu
4bc5008e43 net/smc: Introduce a sysctl for setting SMC-R buffer type
This patch introduces the sysctl smcr_buf_type for setting
the type of SMC-R sndbufs and RMBs.

Valid values includes:

- SMCR_PHYS_CONT_BUFS, which means use physically contiguous
  buffers for better performance and is the default value.

- SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
  buffers in case of physically contiguous memory is scarce.

- SMCR_MIXED_BUFS, which means first try to use physically
  contiguous buffers. If not available, then use virtually
  contiguous buffers.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:17 +01:00
Guangguan Wang
0ef69e7884 net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu
Some CPU, such as Xeon, can guarantee DMA cache coherency.
So it is no need to use dma sync APIs to flush cache on such CPUs.
In order to avoid calling dma sync APIs on the IO path, use the
dma_need_sync to check whether smc_buf_desc needs dma sync when
creating smc_buf_desc.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:17 +01:00
Guangguan Wang
6d52e2de64 net/smc: remove redundant dma sync ops
smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
consistency. Smc sndbufs are dma buffers, where CPU writes data to
it and PCIE device reads data from it. So for sndbufs,
smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
redundant as PCIE device will not write the buffers. Smc rmbs
are dma buffers, where PCIE device write data to it and CPU read
data from it. So for rmbs, smc_ib_sync_sg_for_cpu is needed and
smc_ib_sync_sg_for_device is redundant as CPU will not write the buffers.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18 11:19:16 +01:00
Jaehee Park
aaa5f515b1 net: ipv6: new accept_untracked_na option to accept na only if in-network
This patch adds a third knob, '2', which extends the
accept_untracked_na option to learn a neighbor only if the src ip is
in the same subnet as an address configured on the interface that
received the neighbor advertisement. This is similar to the arp_accept
configuration for ipv4.

Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
Suggested-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-15 18:55:50 -07:00
Jaehee Park
e68c5dcf0a net: ipv4: new arp_accept option to accept garp only if in-network
In many deployments, we want the option to not learn a neighbor from
garp if the src ip is not in the same subnet as an address configured
on the interface that received the garp message. net.ipv4.arp_accept
sysctl is currently used to control creation of a neigh from a
received garp packet. This patch adds a new option '2' to
net.ipv4.arp_accept which extends option '1' by including the subnet
check.

Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
Suggested-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-15 18:55:49 -07:00
Kuniyuki Iwashima
11052589cf tcp/udp: Make early_demux back namespacified.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis.  Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp").  However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.

We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic.  Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.

This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline.  So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.

Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-15 18:50:35 -07:00
Tyler Hicks
aa7aeee169 net/9p: Initialize the iounit field during fid creation
Ensure that the fid's iounit field is set to zero when a new fid is
created. Certain 9P operations, such as OPEN and CREATE, allow the
server to reply with an iounit size which the client code assigns to the
p9_fid struct shortly after the fid is created by p9_fid_create(). On
the other hand, an XATTRWALK operation doesn't allow for the server to
specify an iounit value. The iounit field of the newly allocated p9_fid
struct remained uninitialized in that case. Depending on allocation
patterns, the iounit value could have been something reasonable that was
carried over from previously freed fids or, in the worst case, could
have been arbitrary values from non-fid related usages of the memory
location.

The bug was detected in the Windows Subsystem for Linux 2 (WSL2) kernel
after the uninitialized iounit field resulted in the typical sequence of
two getxattr(2) syscalls, one to get the size of an xattr and another
after allocating a sufficiently sized buffer to fit the xattr value, to
hit an unexpected ERANGE error in the second call to getxattr(2). An
uninitialized iounit field would sometimes force rsize to be smaller
than the xattr value size in p9_client_read_once() and the 9P server in
WSL refused to chunk up the READ on the attr_fid and, instead, returned
ERANGE to the client. The virtfs server in QEMU seems happy to chunk up
the READ and this problem goes undetected there.

Link: https://lkml.kernel.org/r/20220710141402.803295-1-tyhicks@linux.microsoft.com
Fixes: ebf46264a004 ("fs/9p: Add support user. xattr")
Cc: stable@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-07-16 07:16:55 +09:00
Johannes Berg
bd363ee533 wifi: mac80211: mlme: set sta.mlo correctly
Due to some changes and rebasing between different patches
this fell through the cracks; we need to set sta.mlo if the
connection is using MLO.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-07-15 15:12:45 +02:00
Johannes Berg
8f5d9e68c9 wifi: mac80211: remove stray printk
Unfortunately, a printk snuck into a previous patch,
remove it.

Fixes: 81151ce462e5 ("wifi: mac80211: support MLO authentication/association with one link")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-07-15 15:05:35 +02:00
Hangyu Hua
4ac7573e1f net: 9p: fix refcount leak in p9_read_work() error handling
p9_req_put need to be called when m->rreq->rc.sdata is NULL to avoid
temporary refcount leak.

Link: https://lkml.kernel.org/r/20220712104438.30800-1-hbh25y@gmail.com
Fixes: 728356dedeff ("9p: Add refcount to p9_req_t")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
[Dominique: commit wording adjustments, p9_req_put argument fixes for rebase]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-07-15 20:22:21 +09:00
Dominique Martinet
67dd8e445e 9p: roll p9_tag_remove into p9_req_put
mempool prep commit removed the awkward kref usage which didn't
allow passing client pointer easily with the ref, so we no longer
need a separate function to remove the tag from idr.

This has the side benefit that it should be more robust in detecting
leaks: umount will now properly catch unfreed requests as they still
will be in the idr until the last ref is dropped

Link: https://lkml.kernel.org/r/20220712060801.2487140-1-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
2022-07-15 20:22:09 +09:00
Kuniyuki Iwashima
2a85388f1d tcp: Fix a data-race around sysctl_tcp_probe_interval.
While reading sysctl_tcp_probe_interval, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 05cbc0db03e8 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-15 11:49:56 +01:00
Kuniyuki Iwashima
92c0aa4175 tcp: Fix a data-race around sysctl_tcp_probe_threshold.
While reading sysctl_tcp_probe_threshold, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 6b58e0a5f32d ("ipv4: Use binary search to choose tcp PMTU probe_size")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-15 11:49:56 +01:00
Kuniyuki Iwashima
8e92d44236 tcp: Fix a data-race around sysctl_tcp_mtu_probe_floor.
While reading sysctl_tcp_mtu_probe_floor, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: c04b79b6cfd7 ("tcp: add new tcp_mtu_probe_floor sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-15 11:49:56 +01:00