69439 Commits

Author SHA1 Message Date
Luiz Angelo Daros de Luca
fe7324b932 net: dsa: OF-aware slave_mii_bus
If found, register the DSA internally allocated slave_mii_bus with an OF
"mdio" child object. It can save some drivers from creating their
custom internal MDIO bus.

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23 12:27:53 +01:00
liuyacan
8c3b8dc5cc net/smc: fix listen processing for SMC-Rv2
In the process of checking whether RDMAv2 is available, the current
implementation first sets ini->smcrv2.ib_dev_v2, and then allocates
smc buf desc, but the latter may fail. Unfortunately, the caller
only checks the former. In this case, a NULL pointer dereference
occurs in smc_clc_send_confirm_accept() when accessing
conn->rmb_desc.

This patch does two things:
1. Use the return code to determine whether V2 is available.
2. If the return code is NODEV, continue to check whether V1 is
available.

Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23 10:08:33 +01:00
liuyacan
75c1edf23b net/smc: postpone sk_refcnt increment in connect()
Same trigger condition as commit 86434744. When setsockopt() runs
in parallel with connect() and switches the socket into fallback
mode, the sk_refcnt is incremented in smc_connect() but the socket
state stays in SMC_INIT (not SMC_ACTIVE). As a result, the
corresponding sk_refcnt decrement in __smc_release() is never
performed.

Fixes: 86434744fedf ("net/smc: add fallback check to connect()")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23 09:58:47 +01:00
Jakub Kicinski
c304eddcec net: wrap the wireless pointers in struct net_device in an ifdef
Most protocol-specific pointers in struct net_device are under
a respective ifdef. Wireless is the notable exception. Since
there's a sizable number of custom-built kernels for datacenter
workloads which don't build wireless it seems reasonable to
ifdefy those pointers as well.

While at it move IPv4 and IPv6 pointers up, those are special
for obvious reasons.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:51:54 +01:00
David Howells
9a3dedcf18 rxrpc: Fix decision on when to generate an IDLE ACK
Fix the decision on when to generate an IDLE ACK by keeping a count of the
number of packets we've received, but not yet soft-ACK'd, and the number of
packets we've processed, but not yet hard-ACK'd, rather than trying to keep
track of which DATA sequence numbers correspond to those points.

We then generate an ACK when either counter exceeds 2.  The counters are
both cleared when we transcribe the information into any sort of ACK packet
for transmission.  IDLE and DELAY ACKs are skipped if both counters are 0
(ie. no change).
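
The counting scheme described above can be modelled in a few lines. This is
a toy Python sketch, not the kernel code: the class, method and attribute
names are invented for illustration, and only the "exceeds 2" threshold, the
clear-on-transcribe rule and the skip-if-both-zero rule come from this
message.

```python
class AckCounters:
    """Toy model of the two counters described in the commit message."""

    THRESHOLD = 2  # an ACK is generated when either counter exceeds this

    def __init__(self):
        self.unsoft_acked = 0  # packets received but not yet soft-ACK'd
        self.unhard_acked = 0  # packets processed but not yet hard-ACK'd

    def packet_received(self):
        self.unsoft_acked += 1

    def packet_processed(self):
        self.unhard_acked += 1

    def ack_due(self):
        """True once either counter exceeds the threshold."""
        return (self.unsoft_acked > self.THRESHOLD
                or self.unhard_acked > self.THRESHOLD)

    def idle_or_delay_ack_skipped(self):
        """IDLE and DELAY ACKs are skipped if both counters are 0."""
        return self.unsoft_acked == 0 and self.unhard_acked == 0

    def transcribe_into_ack(self):
        """Filling in any sort of ACK packet clears both counters."""
        self.unsoft_acked = 0
        self.unhard_acked = 0
```

In this model two received packets do not yet make an ACK due; the third
one does, and transcribing the ACK resets the state.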

Fixes: 805b21b929e2 ("rxrpc: Send an ACK after every few DATA packets we receive")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:30:53 +01:00
David Howells
81524b6312 rxrpc: Don't let ack.previousPacket regress
The previousPacket field in the rx ACK packet should never go backwards -
it's now the highest DATA sequence number received, not the last one
received (it used to be used for out of sequence detection).
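
As a minimal illustration of the rule (a Python sketch with a hypothetical
helper name, not the kernel implementation), the advertised value only ever
moves forward:

```python
def next_previous_packet(current, received_seq):
    """Return the value to advertise: the highest DATA sequence seen so far."""
    return max(current, received_seq)
```

Feeding it an out-of-order arrival, for example sequence 3 after sequence 5,
leaves the advertised value at 5.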

Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:30:53 +01:00
David Howells
8940ba3cfe rxrpc: Fix overlapping ACK accounting
Fix accidental overlapping of Rx-phase ACK accounting with Tx-phase ACK
accounting through variables shared between the two.  call->acks_* members
refer to ACKs received in the Tx phase and call->ackr_* members to ACKs
sent/to be sent during the Rx phase.

Fixes: 1a2391c30c0b ("rxrpc: Fix detection of out of order acks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:30:53 +01:00
David Howells
114af61f88 rxrpc: Don't try to resend the request if we're receiving the reply
rxrpc has a timer to trigger resending of unacked data packets in a call.
This is not cancelled when a client call switches to the receive phase on
the basis that most calls don't last long enough for it to ever expire.
However, if it *does* expire after we've started to receive the reply, we
shouldn't then go into trying to retransmit or pinging the server to find
out if an ack got lost.

Fix this by skipping the resend code if we're into receiving the reply to a
client call.

Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:30:53 +01:00
David Howells
88e2215975 rxrpc: Fix listen() setting the bar too high for the prealloc rings
AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().

This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty.  rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.

Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.
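
The dead slot can be seen in a small Python model of the ring arithmetic.
circ_cnt() mirrors the kernel's CIRC_CNT() macro; the fill() helper and the
RING_SIZE name are illustrative assumptions, not rxrpc code.

```python
RING_SIZE = 32  # slots in the ring; must be a power of two


def circ_cnt(head, tail, size=RING_SIZE):
    """Number of occupied slots, computed the way CIRC_CNT() does."""
    return (head - tail) & (size - 1)


def fill(n, size=RING_SIZE):
    """Push n entries into an empty ring and return the count circ_cnt() sees."""
    head = tail = 0
    for _ in range(n):
        head = (head + 1) & (size - 1)  # advance the producer index
    return circ_cnt(head, tail, size)
```

fill(31) reports 31 entries, but fill(32) reports 0: after 32 pushes the
head index has wrapped round to meet the tail, so a full ring is
indistinguishable from an empty one. That is why only RING_SIZE - 1 entries
can be used.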

 BUG: kernel NULL pointer dereference, address: 0000000000000086
 ...
 RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
 Call Trace:
  <TASK>
  ? __wake_up_common_lock+0x7a/0x90
  ? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
  ? rxrpc_abort_call+0x4c/0x60 [rxrpc]
  rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
  rxrpc_release+0xc9/0x1c0 [rxrpc]
  __sock_release+0x37/0xa0
  sock_close+0x11/0x20
  __fput+0x89/0x240
  task_work_run+0x59/0x90
  do_exit+0x319/0xaa0

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:30:53 +01:00
David Howells
adc9613ff6 afs: Adjust ACK interpretation to try and cope with NAT
If a client's address changes, say if it is NAT'd, this can disrupt an in
progress operation.  For most operations, this is not much of a problem,
but StoreData can be different as some servers modify the target file as
the data comes in, so if a store request is disrupted, the file can get
corrupted on the server.

The problem is that the server doesn't recognise packets that come after
the change of address as belonging to the original client and will bounce
them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if
the packet number falls within the initial sequence number window of a call
or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting
it.  In both cases, firstPacket will be 1 and previousPacket will be 0 in
the ACK information.

Fix this by the following means:

 (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1
     and previousPacket as 0, assume this indicates that the server saw the
     incoming packets from a different peer and thus as a different call.
     Fail the call with error -ENETRESET.

 (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the
     first packet has been hard-ACK'd.  If it hasn't been hard-ACK'd, the
     ACK packet will cause it to get retransmitted, so the call will just
     be repeated.

 (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of
     the operation.

 (4) Prioritise the error code over things like -ECONNRESET as the server
     did actually respond.

 (5) Make writeback treat -ENETRESET as a retryable error and make it
     redirty all the pages involved in a write so that the VM will retry.
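
Steps (1) and (2) amount to a small decision function. The following Python
sketch is only a model of that decision under the conditions stated above;
the function name, the string reason codes and the argument names are
hypothetical and do not correspond to kernel symbols.

```python
import errno


def interpret_ack(reason, first_packet, previous_packet, first_hard_acked):
    """Return -ENETRESET if the client call should be failed, else 0.

    reason is "EXCEEDS_WINDOW", "OUT_OF_SEQUENCE" or any other string.
    first_hard_acked says whether the first packet was already hard-ACK'd.
    """
    # Both cases require firstPacket == 1 and previousPacket == 0, i.e. the
    # server appears to be treating our packets as a brand new call.
    if first_packet != 1 or previous_packet != 0:
        return 0
    if reason == "EXCEEDS_WINDOW":
        return -errno.ENETRESET  # step (1)
    if reason == "OUT_OF_SEQUENCE" and first_hard_acked:
        return -errno.ENETRESET  # step (2)
    # Otherwise the ACK just causes a retransmission and the call repeats.
    return 0
```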

Note that there is still a circumstance that I can't easily deal with: if
the operation is fully received and processed by the server, but the reply
is lost due to address change.  There's no way to know if the op happened.
We can examine the server, but a conflicting change could have been made by
a third party - and we can't tell the difference.  In such a case, a
message like:

    kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a)

will be logged to dmesg on the next op to touch the file and the client
will reset the inode state, including invalidating clean parts of the
pagecache.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:02 +01:00
David Howells
de696c4784 rxrpc, afs: Fix selection of abort codes
The RX_USER_ABORT code should really only be used to indicate that the user
of the rxrpc service (ie. userspace) implicitly caused a call to be aborted
- for instance if the AF_RXRPC socket is closed whilst the call was in
progress.  (The user may also explicitly abort a call and specify the abort
code to use).

Change some of the points of generation to use other abort codes instead:

 (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see
     ENOMEM or EFAULT during received data delivery and abort with
     RX_CALL_DEAD in the default case.

 (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a
     reply.

 (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had
     heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't.

 (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not
     completed successfully or been aborted.
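
Points (1) and (3) can be summarised as a pair of lookups. This Python
sketch returns abort code names as strings; the function names are invented,
and it assumes that the SS variant applies to service (server-side) calls
and the CC variant to client calls.

```python
import errno


def delivery_abort_code(err, is_service_call):
    """Pick an abort code name for a failure during received data delivery."""
    if err in (-errno.ENOMEM, -errno.EFAULT):
        # Assumption: SS is used on the server side, CC on the client side.
        return "RXGEN_SS_UNMARSHAL" if is_service_call else "RXGEN_CC_UNMARSHAL"
    return "RX_CALL_DEAD"  # the default case


def peer_silence_abort_code(heard_from_peer):
    """Pick an abort code name when the peer stops responding."""
    return "RX_CALL_DEAD" if heard_from_peer else "RX_CALL_TIMEOUT"
```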

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:02 +01:00
David Howells
4ba68c5192 rxrpc: Return an error to sendmsg if call failed
If at the end of rxrpc sendmsg() or rxrpc_kernel_send_data() the call that
was being given data was aborted remotely or otherwise failed, return an
error rather than returning the amount of data buffered for transmission.

The call (presumably) did not complete, so there's not much point
continuing with it.  AF_RXRPC considers it "complete" and so will be
unwilling to do anything else with it - and won't send a notification for
it, deeming the return from sendmsg sufficient.

Not returning an error causes afs to incorrectly handle a StoreData
operation that gets interrupted by a change of address due to NAT
reconfiguration.

This doesn't normally affect most operations since their request parameters
tend to fit into a single UDP packet and afs_make_call() returns before the
server responds; StoreData is different as it involves transmission of a
lot of data.

This can be triggered on a client by doing something like:

	dd if=/dev/zero of=/afs/example.com/foo bs=1M count=512

at one prompt, and then changing the network address at another prompt,
e.g.:

	ifconfig enp6s0 inet 192.168.6.2 && route add 192.168.6.1 dev enp6s0

Tracing packets on an Auristor fileserver looks something like:

192.168.6.1 -> 192.168.6.3  RX 107 ACK Idle  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
192.168.6.3 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(64538) (64538)
192.168.6.3 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(64538) (64538)
192.168.6.1 -> 192.168.6.3  RX 107 ACK Idle  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
<ARP exchange for 192.168.6.2>
192.168.6.2 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(0) (0)
192.168.6.2 -> 192.168.6.1  AFS (RX) 1482 FS Request: Unknown(0) (0)
192.168.6.1 -> 192.168.6.2  RX 107 ACK Exceeds Window  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
192.168.6.1 -> 192.168.6.2  RX 74 ABORT  Seq: 0  Call: 4  Source Port: 7000  Destination Port: 7001
192.168.6.1 -> 192.168.6.2  RX 74 ABORT  Seq: 29321  Call: 4  Source Port: 7000  Destination Port: 7001

The Auristor fileserver logs code -453 (RXGEN_SS_UNMARSHAL), but the abort
code received by kafs is -5 (RX_PROTOCOL_ERROR) as the rx layer sees the
condition and generates an abort first and the unmarshal error is a
consequence of that at the application layer.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004810.html # v1
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:02 +01:00
David Howells
ad25f5cb39 rxrpc: Fix locking issue
There's a locking issue with the per-netns list of calls in rxrpc.  The
pieces of code that add and remove a call from the list use write_lock()
and the calls procfile uses read_lock() to access it.  However, the timer
callback function may trigger a removal by trying to queue a call for
processing and finding that it's already queued - at which point it has a
spare refcount that it has to do something with.  If it puts the call and
this reduces the refcount to 0, the call will be removed from the list.
Since the _bh variants of the locking functions aren't used, this can
deadlock.

================================
WARNING: inconsistent lock state
5.18.0-rc3-build4+ #10 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b
{SOFTIRQ-ON-W} state was registered at:
...
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&rxnet->call_lock);
  <Interrupt>
    lock(&rxnet->call_lock);

 *** DEADLOCK ***

1 lock held by ksoftirqd/2/25:
 #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d

Changes
=======
ver #2)
 - Changed to using list_next_rcu() rather than rcu_dereference() directly.

Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:01 +01:00
David Howells
a05754295e rxrpc: Use refcount_t rather than atomic_t
Move to using refcount_t rather than atomic_t for refcounts in rxrpc.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:01 +01:00
David Howells
33912c2639 rxrpc: Allow list of in-use local UDP endpoints to be viewed in /proc
Allow the list of in-use local UDP endpoints in the current network
namespace to be viewed in /proc.

To aid with this, the endpoint list is converted to an hlist and RCU-safe
manipulation is used so that the list can be read with only the RCU
read lock held.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22 21:03:01 +01:00
Joanne Koong
d5a42de8bd net: Add a second bind table hashed by port and address
We currently have one tcp bind table (bhash) which hashes by port
number only. In the socket bind path, we check for bind conflicts by
traversing the specified port's inet_bind2_bucket while holding the
bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).

In instances where there are tons of sockets hashed to the same port
at different addresses, checking for a bind conflict is time-intensive
and can cause softirq cpu lockups, as well as stopping new tcp connections,
since __inet_inherit_port() also contends for the spinlock.

This patch proposes adding a second bind table, bhash2, that hashes by
port and ip address. Searching the bhash2 table leads to significantly
faster conflict resolution and less time holding the spinlock.
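
The difference in lookup cost can be illustrated with a toy Python model.
It is a deliberate simplification and not the kernel data structure: the
class and method names are invented, and wildcard addresses and the
SO_REUSEADDR/SO_REUSEPORT rules are ignored.

```python
class BindTables:
    """Toy model: one table keyed by port, one keyed by (port, address)."""

    def __init__(self):
        self.bhash = {}   # port -> list of bound addresses
        self.bhash2 = {}  # (port, address) -> list of bound addresses

    def bind(self, port, addr):
        self.bhash.setdefault(port, []).append(addr)
        self.bhash2.setdefault((port, addr), []).append(addr)

    def conflict_bhash(self, port, addr):
        """Walk every socket on the port; return (conflict, entries_visited)."""
        visited = 0
        for other in self.bhash.get(port, []):
            visited += 1
            if other == addr:
                return True, visited
        return False, visited

    def conflict_bhash2(self, port, addr):
        """Look only at the bucket for this exact (port, address) pair."""
        bucket = self.bhash2.get((port, addr), [])
        return bool(bucket), len(bucket)
```

With 1000 sockets bound to one port at different addresses, checking a new
address walks all 1000 entries in the port-only table, whereas the
(port, address) table has nothing to walk.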

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20 18:16:24 -07:00
Xiu Jianfeng
29849a486a sctp: Use memset_after() to zero sctp_stream_out_ext
Use the memset_after() helper to simplify the code. There is no functional
change in this patch.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Link: https://lore.kernel.org/r/20220519062932.249926-1-xiujianfeng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20 17:42:53 -07:00
Jakub Kicinski
c09b0cd2cc net: avoid strange behavior with skb_defer_max == 1
When the user sets skb_defer_max to 1 the kick threshold is 0
(half of 1). If we increment queue length before the check
the kick will never happen, and the skb may get stranded.
This is likely harmless but can be avoided by moving the
increment after the check. This way skb_defer_max == 1
will always kick. Still a silly config to have, but
somehow that feels more correct.
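
The ordering problem is easy to reproduce in a toy Python model. The
function below is a sketch with invented names; it assumes the kick is
triggered when the queue length equals the threshold (half of
skb_defer_max).

```python
def kick_on_enqueue(defer_count, defer_max, increment_first):
    """Queue one skb; return (kick, new_defer_count)."""
    threshold = defer_max >> 1  # half of defer_max, rounded down
    if increment_first:
        # Old ordering: bump the queue length, then compare.
        defer_count += 1
        kick = defer_count == threshold
    else:
        # New ordering: compare first, then bump the queue length.
        kick = defer_count == threshold
        defer_count += 1
    return kick, defer_count
```

With defer_max == 1 and an empty queue, the old ordering never kicks because
the length is already 1 when it is compared against 0, while the new
ordering kicks on the first skb.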

While at it drop a comment which seems to be outdated
or confusing, and wrap the defer_count write with
a WRITE_ONCE() since it's read on the fast path
that avoids taking the lock.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220518185522.2038683-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20 17:05:36 -07:00
Geliang Tang
3bc253c2e6 bpf: Add bpf_skc_to_mptcp_sock_proto
This patch implements a new struct bpf_func_proto, named
bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP,
and a new helper bpf_skc_to_mptcp_sock(), which invokes another new
helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct
mptcp_sock from a given subflow socket.

v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c,
remove build check for CONFIG_BPF_JIT
v5: Drop EXPORT_SYMBOL (Martin)

Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
2022-05-20 15:29:00 -07:00
Linus Torvalds
b851c1f8e0 A fix for a nasty use-after-free, marked for stable.

Merge tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a nasty use-after-free, marked for stable"

* tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-client:
  libceph: fix misleading ceph_osdc_cancel_request() comment
  libceph: fix potential use-after-free on linger ping and resends
2022-05-20 08:15:40 -10:00
Jakub Kicinski
dc7769244e tcp_ipv6: set the drop_reason in the right place
Looks like the IPv6 version of the patch under Fixes was
a copy/paste of the IPv4 but hit the wrong spot.
It is tcp_v6_rcv() which uses drop_reason as a boolean, and
needs to be protected against reason == 0 before calling free.
tcp_v6_do_rcv() has a pretty straightforward flow.

The resulting warning looks like this:
  WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775
  Call Trace:
    tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767)
    ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438)
    ip6_input_finish (include/linux/rcupdate.h:726)
    ip6_input (include/linux/netfilter.h:307)

Fixes: f8319dfd1b3b ("net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20220520021347.2270207-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20 09:35:42 -07:00
Jakub Kicinski
805cb5aadc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next: misc
updates and fallout fixes from Florian's recent code rewrite (from the
last pull request):

1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi.

2) Avoid unnecessary GC with a timestamp in conncount, from William Tu
   and Yifeng Sun.

3) Remove TCP conntrack debugging, from Florian Westphal.

4) Fix compilation warning in ctnetlink, from Florian.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"
  netfilter: conntrack: remove pr_debug callsites from tcp tracker
  netfilter: nf_conncount: reduce unnecessary GC
  netfilter: Use l3mdev flow key when re-routing mangled packets
====================

Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 21:53:08 -07:00
Mat Martineau
d9fb797046 mptcp: Do not traverse the subflow connection list without lock
The MPTCP socket's conn_list (list of subflows) requires the socket lock
to access. The MP_FAIL timeout code added such an access, where it would
check the list of subflows both in timer context and (later) in workqueue
context where the socket lock is held.

Rather than check the list twice, remove the check in the timeout
handler and only depend on the check in the workqueue. Also remove the
MPTCP_FAIL_NO_RESPONSE flag, since mptcp_mp_fail_no_response() has
insignificant overhead and can be checked on each worker run.

Fixes: 49fa1919d6bc ("mptcp: reset subflow when MP_FAIL doesn't respond")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 20:05:07 -07:00
Mat Martineau
d42f9e4e23 mptcp: Check for orphaned subflow before handling MP_FAIL timer
MP_FAIL timeout (waiting for a peer to respond to an MP_FAIL with
another MP_FAIL) is implemented using the MPTCP socket's sk_timer. That
timer is also used at MPTCP socket close, so it's important to not have
the two timer users interfere with each other.

At MPTCP socket close, all subflows are orphaned before sk_timer is
manipulated. By checking the SOCK_DEAD flag on the subflows, each
subflow can determine if the timer is safe to alter without acquiring
any MPTCP-level lock. This replaces code that was using the
mptcp_data_lock and MPTCP-level socket state checks that did not
correctly protect the timer.

Fixes: 49fa1919d6bc ("mptcp: reset subflow when MP_FAIL doesn't respond")
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 20:05:07 -07:00
Paolo Abeni
7b16871f99 mptcp: stop using the mptcp_has_another_subflow() helper
The mentioned helper requires the msk socket lock, and the
current callers neither own it nor can acquire it, so the
access is racy.

All the current callers are really checking for infinite mapping
fallback, and the latter condition is explicitly tracked by
the relevant msk variable: we can safely remove the caller usage
- and the caller itself.

The issue has been present since the MP_FAIL implementation, but the
fix only applies since the infinite fallback support, hence the
somewhat unexpected Fixes tag.

Fixes: 0530020a7c8f ("mptcp: track and update contiguous data status")
Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 20:05:07 -07:00
Yuchung Cheng
9ad084d666 tcp: improve PRR loss recovery
This patch improves TCP PRR loss recovery behavior for a corner
case. Previously, in PRR conservation-bound mode, it strictly
sent an amount equal to the amount newly acked or SACKed.

The patch changes this so that PRR may send an additional amount that was
banked previously (e.g. while application-limited) in the conservation-bound
mode, similar to the slow-start mode. This unifies and simplifies the
algorithm further and may improve the recovery latency. This change
still follows the general packet conservation design principle and
always keeps inflight/cwnd below the slow start threshold set
by the congestion control module.
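
The effect on the per-ACK send budget can be sketched as follows. This is a
simplified Python model of the conservation-bound case only, not the kernel
code: the function name and the allow_banked switch are invented, while
prr_delivered and prr_out follow the variable names of RFC 6937, and delta
stands for the slow start threshold minus the data currently in flight.

```python
def crb_send_budget(delta, prr_delivered, prr_out, newly_acked, allow_banked):
    """Segments PRR may send for one ACK in conservation-bound mode.

    delta         -- ssthresh minus packets in flight (room left below ssthresh)
    prr_delivered -- segments delivered to the receiver during recovery
    prr_out       -- segments sent during recovery
    newly_acked   -- segments newly acked or SACKed by this ACK
    allow_banked  -- False models the old behavior, True the new one
    """
    budget = newly_acked
    if allow_banked:
        # Credit that was delivered but never used may be spent now.
        budget = max(prr_delivered - prr_out, newly_acked)
    # Never exceed the room left below the slow start threshold.
    return max(0, min(delta, budget))
```

For example, with 8 segments delivered, 3 sent and 2 newly acked, the old
rule allows 2 segments and the new rule allows 5, provided delta leaves that
much room.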

PRR is described in RFC 6937. We'll include this change in the
latest revision rfc6937-bis as well.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220519003410.2531936-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 18:49:50 -07:00
Jakub Kicinski
1c2133114d net: tls: fix messing up lists when bpf enabled
Artem points out that BPF may try to take over the skb and
queue it to its own list. Unlink the skb before calling out.

Fixes: b1a2c1786330 ("tls: rx: clear ctx->recv_pkt earlier")
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 17:55:06 -07:00
Jakub Kicinski
fcf8962827 linux-can-next-for-5.19-20220519

Merge tag 'linux-can-next-for-5.19-20220519' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2022-05-19

Oliver Hartkopp contributes a patch for the ISO-TP CAN protocol to
update the validation of address information during bind.

The next patch is by Jakub Kicinski and converts the CAN network
drivers from netif_napi_add() to the netif_napi_add_weight() function.

Another patch by Oliver Hartkopp removes obsolete CAN specific LED
support.

Vincent Mailhol's patch for the mcp251xfd driver fixes a
-Wunaligned-access warning by clang-14.

* tag 'linux-can-next-for-5.19-20220519' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: mcp251xfd: silence clang's -Wunaligned-access warning
  can: can-dev: remove obsolete CAN LED support
  can: can-dev: move to netif_napi_add_weight()
  can: isotp: isotp_bind(): do not validate unused address information
====================

Link: https://lore.kernel.org/r/20220519202308.1435903-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 16:49:05 -07:00
Oliver Hartkopp
b76b163f46 can: isotp: isotp_bind(): do not validate unused address information
With commit 2aa39889c463 ("can: isotp: isotp_bind(): return -EINVAL on
incorrect CAN ID formatting") the bind() syscall returns -EINVAL when
the given CAN ID needed to be sanitized. But in the case of an unconfirmed
broadcast mode the rx CAN ID is not needed and may be uninitialized from
the caller - which is ok.

This patch makes sure the result of an improper CAN ID format is only
provided when the address information is needed.

Link: https://lore.kernel.org/all/20220517145653.2556-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-19 22:11:28 +02:00
Jakub Kicinski
d353e1a3ba wireless-next patches for v5.19

Merge tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v5.19

Second set of patches for v5.19 and most likely the last one. rtw89
got support for 8852ce devices and mt76 now supports Wireless Ethernet
Dispatch.

Major changes:

cfg80211/mac80211
 - support disabling EHT mode

rtw89
 - add support for Realtek 8852ce devices

mt76
 - Wireless Ethernet Dispatch support for flow offload
 - non-standard VHT MCS10-11 support
 - mt7921 AP mode support
 - mt7921 ipv6 NS offload support

ath11k
 - enable keepalive during WoWLAN suspend
 - implement remain-on-channel support

* tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (135 commits)
  iwlwifi: mei: fix potential NULL-ptr deref
  iwlwifi: mei: clear the sap data header before sending
  iwlwifi: mvm: remove vif_count
  iwlwifi: mvm: always tell the firmware to accept MCAST frames in BSS
  iwlwifi: mvm: add OTP info in case of init failure
  iwlwifi: mvm: fix assert 1F04 upon reconfig
  iwlwifi: fw: init SAR GEO table only if data is present
  iwlwifi: mvm: clean up authorized condition
  iwlwifi: mvm: use NULL instead of ERR_PTR when parsing wowlan status
  iwlwifi: pcie: simplify MSI-X cause mapping
  rtw89: pci: only mask out INT indicator register for disable interrupt v1
  rtw89: convert rtw89_band to nl80211_band precisely
  rtw89: 8852c: update txpwr tables to HALRF_027_00_052
  rtw89: cfo: check mac_id to avoid out-of-bounds
  rtw89: 8852c: set TX antenna path
  rtw89: add ieee80211::sta_rc_update ops
  wireless: Fix Makefile to be in alphabetical order
  mac80211: refactor freeing the next_beacon
  cfg80211: fix kernel-doc for cfg80211_beacon_data
  mac80211: minstrel_ht: support ieee80211_rate_status
  ...
====================

Link: https://lore.kernel.org/r/20220519153334.8D051C385AA@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 13:01:08 -07:00
Jakub Kicinski
d7e6f58360 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/main.c
  b33886971dbc ("net/mlx5: Initialize flow steering during driver probe")
  40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support")
  f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table")
https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/
  16d42d313350 ("net/mlx5: Drain fw_reset when removing device")
  8324a02c342a ("net/mlx5: Add exit route when waiting for FW")
https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  e274f7154008 ("selftests: mptcp: add subflow limits test-cases")
  b6e074e171bc ("selftests: mptcp: add infinite map testcase")
  5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type")
https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/

net/mptcp/options.c
  ba2c89e0ea74 ("mptcp: fix checksum byte order")
  1e39e5a32ad7 ("mptcp: infinite mapping sending")
  ea66758c1795 ("tcp: allow MPTCP to update the announced window")
https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/

net/mptcp/pm.c
  95d686517884 ("mptcp: fix subflow accounting on close")
  4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/

net/mptcp/subflow.c
  ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure")
  0348c690ed37 ("mptcp: add the fallback check")
  f8d4bcacff3b ("mptcp: infinite mapping receiving")
https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19 11:23:59 -07:00
Luiz Augusto von Dentz
8f9ae5b3ae Bluetooth: eir: Add helpers for managing service data
This adds helpers for accessing and appending service data (0x16) ad
type.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-19 20:11:26 +02:00
Chuck Lever
2059b698a2 SUNRPC: Simplify synopsis of svc_pool_for_cpu()
Clean up: There is one caller. The @cpu argument can be made
implicit now that a get_cpu/put_cpu pair is no longer needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:40 -04:00
Sebastian Andrzej Siewior
586095d339 SUNRPC: Don't disable preemption while calling svc_pool_for_cpu().
svc_xprt_enqueue() disables preemption via get_cpu() and then asks
for a pool of a specific CPU (current) via svc_pool_for_cpu().
While preemption is disabled, svc_xprt_enqueue() acquires
svc_pool::sp_lock with bottom halves disabled, which can sleep on
PREEMPT_RT.

Disabling preemption is not required here. The pool is protected with a
lock so the following list access is safe even cross-CPU. The following
iteration through svc_pool::sp_all_threads is under RCU-readlock and
remaining operations within the loop are atomic and do not rely on
disabled-preemption.

Use raw_smp_processor_id() as the argument for the requested CPU in
svc_pool_for_cpu().

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:40 -04:00
Chuck Lever
983084b267 SUNRPC: Remove svc_rqst::rq_xprt_hlen
Clean up: This field is now always set to zero.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Chuck Lever
4af8b42e56 SUNRPC: Remove dead code in svc_tcp_release_rqst()
Clean up: svc_tcp_sendto() always sets rq_xprt_ctxt to NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:39 -04:00
Chuck Lever
37324e6bb1 SUNRPC: Cache deferral injection
Cache deferral injection stress-tests the cache deferral logic as
well as upper layer protocol deferred request handlers. This
facility is for developers and professional testers to ensure
coverage of the rqst deferral code paths. To date, we haven't
had an adequate way to ensure these code paths are covered
during testing, short of temporary code changes to force their
use.

A file called /sys/kernel/debug/fail_sunrpc/ignore-cache-wait
enables administrators to disable cache deferral injection while
allowing other types of sunrpc errors to be injected. The default
setting is that cache deferral injection is enabled (ignore=false).

To enable support for cache deferral injection,
CONFIG_FAULT_INJECTION, CONFIG_FAULT_INJECTION_DEBUG_FS, and
CONFIG_SUNRPC_DEBUG must all be set to "Y".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 12:25:38 -04:00
Linus Torvalds
d904c8cc03 Networking fixes for 5.18-rc8, including fixes from can, xfrm and
netfilter subtrees.
 
 Notably this reverts a recent TCP/DCCP netns-related change
 to address a possible UaF.
 
 Current release - regressions:
   - tcp: revert "tcp/dccp: get rid of inet_twsk_purge()"
 
   - xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown
 
 Previous releases - regressions:
   - netfilter: flowtable: fix TCP flow teardown
 
   - can: revert "can: m_can: pci: use custom bit timings for Elkhart Lake"
 
   - xfrm: check encryption module availability consistency
 
   - eth: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf()
 
   - eth: mlx5: initialize flow steering during driver probe
 
   - eth: ice: fix crash when writing timestamp on RX rings
 
 Previous releases - always broken:
   - mptcp: fix checksum byte order
 
   - eth: lan966x: fix assignment of the MAC address
 
   - eth: mlx5: remove HW-GRO from reported features
 
   - eth: ftgmac100: disable hardware checksum on AST2600
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'net-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can, xfrm and netfilter subtrees.

  Notably this reverts a recent TCP/DCCP netns-related change to address
  a possible UaF.

  Current release - regressions:

   - tcp: revert "tcp/dccp: get rid of inet_twsk_purge()"

   - xfrm: set dst dev to blackhole_netdev instead of loopback_dev in
     ifdown

  Previous releases - regressions:

   - netfilter: flowtable: fix TCP flow teardown

   - can: revert "can: m_can: pci: use custom bit timings for Elkhart
     Lake"

   - xfrm: check encryption module availability consistency

   - eth: vmxnet3: fix possible use-after-free bugs in
     vmxnet3_rq_alloc_rx_buf()

   - eth: mlx5: initialize flow steering during driver probe

   - eth: ice: fix crash when writing timestamp on RX rings

  Previous releases - always broken:

   - mptcp: fix checksum byte order

   - eth: lan966x: fix assignment of the MAC address

   - eth: mlx5: remove HW-GRO from reported features

   - eth: ftgmac100: disable hardware checksum on AST2600"

* tag 'net-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
  net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
  ptp: ocp: change sysfs attr group handling
  selftests: forwarding: fix missing backslash
  netfilter: nf_tables: disable expression reduction infra
  netfilter: flowtable: move dst_check to packet path
  netfilter: flowtable: fix TCP flow teardown
  net: ftgmac100: Disable hardware checksum on AST2600
  igb: skip phy status check where unavailable
  nfc: pn533: Fix buggy cleanup order
  mptcp: Do TCP fallback on early DSS checksum failure
  mptcp: fix checksum byte order
  net: af_key: check encryption module availability consistency
  net: af_key: add check for pfkey_broadcast in function pfkey_process
  net/mlx5: Drain fw_reset when removing device
  net/mlx5e: CT: Fix setting flow_source for smfs ct tuples
  net/mlx5e: CT: Fix support for GRE tuples
  net/mlx5e: Remove HW-GRO from reported features
  net/mlx5e: Properly block HW GRO when XDP is enabled
  net/mlx5e: Properly block LRO when XDP is enabled
  net/mlx5e: Block rx-gro-hw feature in switchdev mode
  ...
2022-05-19 05:50:29 -10:00
Boris Pismenny
c1318b39c7 tls: Add opt-in zerocopy mode of sendfile()
TLS device offload copies sendfile data to a bounce buffer before
transmitting. This keeps the MAC on TLS records valid when the file
contents change and part of a TLS record has to be retransmitted at the
TCP level.

In many common use cases (like serving static files over HTTPS) the file
contents are not changed on the fly. In many use cases breaking the
connection is totally acceptable if the file is changed during
transmission, because it would be received corrupted in any case.

This commit optimizes performance for such use cases by providing a new
optional mode of TLS sendfile(), in which the extra copy is skipped.
Removing this copy improves performance significantly, as TLS and TCP
sendfile perform the same operations, and the only overhead is TLS
header/trailer insertion.

The new mode can only be enabled with the new socket option named
TLS_TX_ZEROCOPY_SENDFILE on a per-socket basis. It preserves backwards
compatibility with existing applications that rely on the copying
behavior.

The new mode is safe, meaning that unsolicited modifications of the file
being sent can't break integrity of the kernel. The worst thing that can
happen is sending a corrupted TLS record, which is in any case not
forbidden when using regular TCP sockets.

Sockets other than TLS device offload are not affected by the new socket
option. The actual status of zerocopy sendfile can be queried with
sock_diag.
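
A minimal userspace sketch of how an application might opt in, assuming
the socket already has the TLS ULP and a TX key installed. The helper
name enable_zerocopy_sendfile() is hypothetical, and the fallback values
for SOL_TLS and TLS_TX_ZEROCOPY_SENDFILE are assumptions for systems
whose headers do not provide them; check the installed <linux/tls.h>
before relying on them.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Fallback definitions: assumed values, used only if the system
 * headers do not already define these names. */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TLS_TX_ZEROCOPY_SENDFILE
#define TLS_TX_ZEROCOPY_SENDFILE 3
#endif

/*
 * Hypothetical helper: ask the kernel to skip the bounce-buffer copy
 * for sendfile() on this TLS socket. Returns 0 on success, or -1 with
 * errno set, exactly as setsockopt() reports it.
 */
static int enable_zerocopy_sendfile(int fd)
{
	unsigned int on = 1;

	return setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_SENDFILE,
			  &on, sizeof(on));
}
```

The option is opt-in per socket, so a caller that gets -1 back can simply
keep using the default copying behavior.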

Performance numbers in a single-core test with 24 HTTPS streams on
nginx, under 100% CPU load:

* non-zerocopy: 33.6 Gbit/s
* zerocopy: 79.92 Gbit/s

CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-19 12:14:11 +02:00
Andrew Lunn
fbb3abdf22 net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:

       br1
     /    \
    /      \
   /        \
 br0.11    wlan0
   |
   br0
 /  |  \
p1  p2  p3

br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has a
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.

A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating that the switch has
dealt with forwarding the frame out ports p2 and p3 as needed. This
flag instructs the software bridge it does not need to pass the frame
back down again. However, the flag is not getting reset when the frame
is passed upwards. As a result br1 sees the flag, wrongly interprets
it, and fails to forward the frame to wlan0.

When passing a frame upwards, clear the flag. This is the Rx
equivalent of br_switchdev_frame_unmark() in br_dev_xmit().
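
The effect of the stale flag can be sketched with a toy model. Everything
below (struct frame, struct toy_bridge, pass_up()) is a hypothetical
userspace simplification for illustration, not the bridge code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sk_buff. */
struct frame {
	bool offload_fwd_mark; /* set when the switch already forwarded it */
};

/* Toy software bridge that counts frames it forwards in software. */
struct toy_bridge {
	int sw_forwarded;
};

/* A software bridge skips forwarding if hardware already did the work. */
static void toy_bridge_rx(struct toy_bridge *br, const struct frame *f)
{
	assert(br != NULL && f != NULL);
	if (!f->offload_fwd_mark)
		br->sw_forwarded++;
}

/*
 * Pass a frame received on p1 from the offloaded bridge (br0) up to the
 * stacked bridge (br1). clear_mark == false models the old behaviour.
 * Returns how many frames br1 forwarded towards wlan0.
 */
static int pass_up(bool clear_mark)
{
	struct frame f = { .offload_fwd_mark = true };
	struct toy_bridge br1 = { 0 };

	if (clear_mark)
		f.offload_fwd_mark = false; /* mark only applies to br0 ports */
	toy_bridge_rx(&br1, &f);
	return br1.sw_forwarded;
}
```

In this model pass_up(false) forwards nothing to wlan0, while
pass_up(true) forwards the frame once.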

Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-19 09:20:44 +02:00
Jakub Kicinski
7dc02d7f08 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Reduce number of hardware offload retries from flowtable datapath
   which might hog system with retries, from Felix Fietkau.

2) Skip neighbour lookup for PPPoE device, fill_forward_path() already
   provides this and set on destination address from fill_forward_path for
   PPPoE device, also from Felix.

4) When combining PPPoE on top of a VLAN device, set info->outdev to the
   PPPoE device so software offload works, from Felix.

5) Fix TCP teardown flowtable state, races with conntrack gc might result
   in resetting the state to ESTABLISHED and the time to one day. Joint
   work with Oz Shlomo and Sven Auhagen.

6) Call dst_check() from flowtable datapath to check if dst is stale
   instead of doing it from garbage collector path.

7) Disable register tracking infrastructure: either user-space or
   kernel needs to pre-fetch keys unconditionally, otherwise register
   tracking assumes data is already available in a register that might
   well not be there, leading to incorrect reductions.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: disable expression reduction infra
  netfilter: flowtable: move dst_check to packet path
  netfilter: flowtable: fix TCP flow teardown
  netfilter: nft_flow_offload: fix offload with pppoe + vlan
  net: fix dev_fill_forward_path with pppoe + bridge
  netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices
  netfilter: flowtable: fix excessive hw offload attempts after failure
====================

Link: https://lore.kernel.org/r/20220518213841.359653-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-18 19:34:26 -07:00
Ilya Dryomov
d0bb883c63 libceph: fix misleading ceph_osdc_cancel_request() comment
cancel_request() never guaranteed that after its return the OSD
client would be completely done with the OSD request.  The callback
(if specified) can still be invoked and a ref can still be held.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
2022-05-18 21:21:29 +02:00
Ilya Dryomov
75dbb685f4 libceph: fix potential use-after-free on linger ping and resends
request_reinit() is not only ugly as the comment rightfully suggests,
but also unsafe.  Even though it is called with osdc->lock held for
write in all cases, resetting the OSD request refcount can still race
with handle_reply() and result in use-after-free.  Taking linger ping
as an example:

    handle_timeout thread                     handle_reply thread

                                              down_read(&osdc->lock)
                                              req = lookup_request(...)
                                              ...
                                              finish_request(req)  # unregisters
                                              up_read(&osdc->lock)
                                              __complete_request(req)
                                                linger_ping_cb(req)

      # req->r_kref == 2 because handle_reply still holds its ref

    down_write(&osdc->lock)
    send_linger_ping(lreq)
      req = lreq->ping_req  # same req
      # cancel_linger_request is NOT
      # called - handle_reply already
      # unregistered
      request_reinit(req)
        WARN_ON(req->r_kref != 1)  # fires
        request_init(req)
          kref_init(req->r_kref)

                   # req->r_kref == 1 after kref_init

                                              ceph_osdc_put_request(req)
                                                kref_put(req->r_kref)

            # req->r_kref == 0 after kref_put, req is freed

        <further req initialization/use> !!!

This happens because send_linger_ping() always (re)uses the same OSD
request for watch ping requests, relying on cancel_linger_request() to
unregister it from the OSD client and rip its messages out from the
messenger.  send_linger() does the same for watch/notify registration
and watch reconnect requests.  Unfortunately cancel_request() doesn't
guarantee that after it returns the OSD client would be completely done
with the OSD request -- a ref could still be held and the callback (if
specified) could still be invoked too.

The original motivation for request_reinit() was the inability to deal with
allocation failures in send_linger() and send_linger_ping().  Switching
to using osdc->req_mempool (currently only used by CephFS) respects that
and allows us to get rid of request_reinit().
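
The interleaving above can be replayed deterministically with a plain
counter. This is a single-threaded toy model with hypothetical names
(struct toy_req, toy_init(), toy_get(), toy_put()); it only illustrates
why re-initializing a refcount in place is unsafe, assuming the ordering
shown in the diagram.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an OSD request; a plain int replaces kref. */
struct toy_req {
	int refs;
	bool freed;
};

static void toy_init(struct toy_req *r)
{
	r->refs = 1;
	r->freed = false;
}

static void toy_get(struct toy_req *r)
{
	assert(!r->freed);
	r->refs++;
}

static void toy_put(struct toy_req *r)
{
	assert(r->refs > 0);
	if (--r->refs == 0)
		r->freed = true;
}

/* Old scheme: the timeout path re-initializes the same request. */
static bool replay_reinit_race(void)
{
	struct toy_req req;

	toy_init(&req);   /* ref owned by the linger request */
	toy_get(&req);    /* handle_reply holds its own ref: refs == 2 */
	toy_init(&req);   /* request_reinit(): refs forced back to 1 */
	toy_put(&req);    /* handle_reply drops its ref: refs == 0 */
	return req.freed; /* timeout path would now use freed memory */
}

/* New scheme: the timeout path allocates a fresh request instead. */
static bool replay_with_fresh_request(void)
{
	struct toy_req old_req, new_req;

	toy_init(&old_req);
	toy_get(&old_req);    /* handle_reply's ref on the old request */
	toy_init(&new_req);   /* separate object for the new ping */
	toy_put(&old_req);    /* old_req.refs == 1, still alive */
	return new_req.freed; /* new request is untouched by the reply */
}
```

replay_reinit_race() ends with the request marked freed, whereas
replay_with_fresh_request() leaves the new request alive.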

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
2022-05-18 21:21:05 +02:00
Pablo Neira Ayuso
9e539c5b6d netfilter: nf_tables: disable expression reduction infra
Either userspace or kernelspace needs to pre-fetch keys unconditionally
before comparisons for this to work. Otherwise, register tracking data
is misleading and it might result in reducing expressions whose data is
not yet in registers.

First expression is also guaranteed to be evaluated always, however,
certain expressions break before writing data to registers, before
comparing the data, leaving the register in undetermined state.

This patch disables this infrastructure for now.

Fixes: b2d306542ff9 ("netfilter: nf_tables: do not reduce read-only expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 17:34:26 +02:00
Ritaro Takenaka
2738d9d963 netfilter: flowtable: move dst_check to packet path
Fixes sporadic IPv6 packet loss when flow offloading is enabled.

IPv6 route GC and flowtable GC are not synchronized.
When dst_cache becomes stale and a packet passes through the flow before
the flowtable GC tears it down, the packet can be dropped.
So, it is necessary to check dst every time in the packet path.

Fixes: 227e1e4d0d6c ("netfilter: nf_flowtable: skip device lookup from interface index")
Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 17:34:26 +02:00
Pablo Neira Ayuso
e5eaac2beb netfilter: flowtable: fix TCP flow teardown
This patch addresses three possible problems:

1. ct gc may race to undo the timeout adjustment of the packet path, leaving
   the conntrack entry in place with the internal offload timeout (one day).

2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
   timeout is reached before the flow offload del.

3. tcp ct is always set to ESTABLISHED with a very long timeout
   in flow offload teardown/delete even though the state might be already
   CLOSED. Also as a remark we cannot assume that the FIN or RST packet
   is hitting flow table teardown as the packet might get bumped to the
   slow path in nftables.

This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.

Moreover, return the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchronously free the flow table and
hardware offload entries.

Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows,
which is also misleading since the flow is back on the classic conntrack
path.

If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().

This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.

flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, since packets bumped to
the slow path might have already updated the state to CLOSE/FIN.

Joint work with Oz and Sven.

Fixes: 1e5b2471bcc4 ("netfilter: nf_flow_table: teardown flow timeout race")
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 17:34:26 +02:00
Jason A. Donenfeld
d4150779e6 random32: use real rng for non-deterministic randomness
random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.

The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.

Stop👏making👏bespoke👏random👏number👏generators👏.

Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further if need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.

The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
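
The wrapper idea can be sketched in userspace. The toy_ names below are
hypothetical, and getrandom(2) stands in for the kernel's
get_random_u32(); this is an illustration of thin wrappers delegating to
the real RNG, not the code this commit adds.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/random.h>
#include <sys/types.h>

/* Userspace stand-in for get_random_u32(), backed by getrandom(2). */
static uint32_t toy_get_random_u32(void)
{
	uint32_t v;
	ssize_t n = getrandom(&v, sizeof(v), 0);

	assert(n == (ssize_t)sizeof(v));
	(void)n;
	return v;
}

/* Thin wrapper: the old name simply forwards to the real RNG. */
static inline uint32_t toy_prandom_u32(void)
{
	return toy_get_random_u32();
}

/* Value in [0, ep_ro), using multiply-and-shift instead of modulo. */
static inline uint32_t toy_prandom_u32_max(uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)toy_prandom_u32() * ep_ro) >> 32);
}
```

Because callers keep using the old names, every existing user picks up
the stronger generator without being touched.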

There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 15:53:52 +02:00
Mat Martineau
ae66fb2ba6 mptcp: Do TCP fallback on early DSS checksum failure
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.

If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).

This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.

Fixes: dd8bcd1768ff ("mptcp: validate the data checksum")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18 13:05:42 +01:00
Paolo Abeni
ba2c89e0ea mptcp: fix checksum byte order
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.

As a result, the wire encoding for little endian hosts is
wrong, and that causes interoperability issues with other
implementations or hosts with different endianness.

Address the issue by writing the unmodified __sum16 value into the packet.

MPTCP checksum is disabled by default; interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
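
A userspace sketch of the byte order problem, assuming a checksum that
sums 16-bit words as they sit in memory (the usual Internet checksum
approach). toy_csum(), store_csum() and csum_ok() are hypothetical
helpers, not the MPTCP code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Ones'-complement checksum over buf, loading 16-bit words in host
 * memory order. len must be even. The result is already in wire byte
 * order when stored back to memory unmodified.
 */
static uint16_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
	}
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Fill the 2-byte checksum field at csum_off. extra_swap == true models
 * the old behaviour: treat the value as a host integer and htons() it.
 */
static void store_csum(uint8_t *pkt, size_t csum_off, size_t len,
		       bool extra_swap)
{
	uint16_t csum;

	memset(pkt + csum_off, 0, sizeof(csum));
	csum = toy_csum(pkt, len);
	if (extra_swap)
		csum = htons(csum);
	memcpy(pkt + csum_off, &csum, sizeof(csum));
}

/* A receiver accepts the packet if the checksum over it folds to zero. */
static bool csum_ok(const uint8_t *pkt, size_t len)
{
	return toy_csum(pkt, len) == 0;
}
```

On a big endian host the extra htons() is a no-op, which is why only
little endian senders produced the wrong wire encoding.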

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275
Fixes: c5b39e26d003 ("mptcp: send out checksum for DSS")
Fixes: 390b95a5fb84 ("mptcp: receive checksum for DSS")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18 13:05:42 +01:00
David S. Miller
089403a3f7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2022-05-18

1) Fix "disable_policy" flag use when arriving from different devices.
   From Eyal Birger.

2) Fix error handling of pfkey_broadcast in function pfkey_process.
   From Jiasheng Jiang.

3) Check the encryption module availability consistency in pfkey.
   From Thomas Bartschies.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18 12:47:36 +01:00