74599 Commits

Chuck Lever
789ce196a3 SUNRPC: Remove BUG_ON call sites
There is no need to take down the whole system for these assertions.

I'd rather not attempt a heroic save here, as some bug has occurred
that has left the transport data structures in an unknown state.
Just warn and then leak the left-over resources.

Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:10 -04:00
NeilBrown
15d39883ee SUNRPC: change the back-channel queue to lwq
This removes the need to store and update back-links in the list.
It also removes the need for the _bh version of spin_lock().

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:08 -04:00
NeilBrown
580a25756a SUNRPC: discard sp_lock
sp_lock is now only used to protect sp_all_threads.  This isn't needed
as sp_all_threads is only manipulated through svc_set_num_threads(),
which is already serialized.  Read access only requires rcu_read_lock().
So no more locking is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:08 -04:00
NeilBrown
2e8fc923fe SUNRPC: change sp_nrthreads to atomic_t
Using an atomic_t avoids the need to take a spinlock (which can soon be
removed).

Choosing a thread to kill requires care, as we cannot set the "die
now" bit atomically with the test on the count.  Instead we temporarily
increase the count.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown
9a0e6accc0 SUNRPC: use lwq for sp_sockets - renamed to sp_xprts
lwq avoids using back pointers in lists, and uses less locking.
This introduces a new spinlock, but the other one will be removed in a
future patch.

For svc_clean_up_xprts(), we now dequeue the entire queue, walk it to
remove and process the xprts that need cleaning up, then re-enqueue the
remaining queue.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown
5b80147e0c SUNRPC: only have one thread waking up at a time
Currently if several items of work become available in quick succession,
that number of threads (if available) will be woken.  By the time some
of them wake up another thread that was already cache-warm might have
come along and completed the work.  Anecdotal evidence suggests as many
as 15% of wakes find nothing to do once they get to the point of
looking.

This patch changes svc_pool_wake_idle_thread() to wake the first thread
on the queue but NOT remove it.  Subsequent calls will wake the same
thread.  Once that thread starts it will dequeue itself and after
dequeueing some work to do, it will wake the next thread if there is more
work ready.  This results in a more orderly increase in the number of
busy threads.

As a bonus, this allows us to reduce locking around the idle queue.
svc_pool_wake_idle_thread() no longer needs to take a lock (beyond
rcu_read_lock()) as it doesn't manipulate the queue, it just looks at
the first item.

The thread itself can avoid locking by using the new
llist_del_first_this() interface.  This will safely remove the thread
itself if it is the head.  If it isn't the head, it will do nothing.
If multiple threads call this concurrently only one will succeed.  The
others will do nothing, so no corruption can result.

If a thread wakes up and finds that it cannot dequeue itself that means
either
- that it wasn't woken because it was the head of the queue.  Maybe the
  freezer woke it.  In that case it can go back to sleep (after trying
  to freeze of course).
- some other thread found there was nothing to do very recently, and
  placed itself on the head of the queue in front of this thread.
  It must check again after placing itself there, so it can be deemed to
  be responsible for any pending work, and this thread can go back to
  sleep until woken.
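
A minimal sketch of that dequeue-self step, assuming the sp_idle_threads and
rq_idle field names from this series and an llist_del_first_this(head, node)
helper that removes and returns the node only when it really is the head
(both are assumptions here, not copied from the kernel source):

/* Hedged sketch, not the exact kernel code. */
if (llist_del_first_this(&pool->sp_idle_threads, &rqstp->rq_idle)) {
	/* We were the head: we own this wakeup.  Dequeue some work,
	 * and if more remains, wake the next idle thread. */
} else {
	/* Either a spurious wakeup (e.g. the freezer), or another thread
	 * queued itself ahead of us and now owns any pending work:
	 * go back to sleep until woken. */
}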

No code ever tests for busy threads any more.  Only each thread itself
cares if it is busy.  So svc_thread_busy() is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:07 -04:00
NeilBrown
d7926ee8b7 SUNRPC: rename some functions from rqst_ to svc_thread_
Functions which directly manipulate a 'struct rqst', such as
svc_rqst_alloc() or svc_rqst_release_pages(), can reasonably
have "rqst" in their name.
However, functions that act on the running thread, such as
XX_should_sleep() or XX_wait_for_work(), read more naturally
with a "svc_thread_" prefix.

So make those changes.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:06 -04:00
NeilBrown
9bd4161c59 SUNRPC: change service idle list to be an llist
With an llist we don't need to take a lock to add a thread to the list,
though we still need a lock to remove it.  That will go in the next
patch.

Unlike doubly-linked lists, a thread cannot reliably remove itself from
the list.  Only the first thread can be removed, and that can change
asynchronously.  So some care is needed.

We already check if there is pending work to do, so we are unlikely to
add ourselves to the idle list and then want to remove ourselves again.

If we DO find something needs to be done after adding ourselves to the
list, we simply wake up the first thread on the list.  If that was us,
we successfully removed ourselves and can continue.  If it was some
other thread, they will do the work that needs to be done.  We can
safely sleep until woken.

We also remove the test on freezing() from rqst_should_sleep().  Instead
we set TASK_FREEZABLE before scheduling.  This makes it safe to
schedule() when a freeze is pending.  As we now loop waiting to be
removed from the idle queue, this is a cleaner way to handle freezing.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:05 -04:00
NeilBrown
2b65a22684 SUNRPC: discard SP_CONGESTED
We can tell if a pool is congested by checking if the idle list is
empty.  We don't need a separate flag.
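
A minimal sketch of the idea, assuming the pool's idle llist is called
sp_idle_threads; the helper name is purely illustrative:

/* Hedged sketch: "congested" now simply means "no idle threads". */
static inline bool svc_pool_congested(struct svc_pool *pool)
{
	return llist_empty(&pool->sp_idle_threads);
}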

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:05 -04:00
NeilBrown
5ff817b235 SUNRPC: add list of idle threads
Rather than searching a list of threads to find an idle one, having a
list of idle threads allows an idle thread to be found immediately.

This adds some spin_lock calls which is not ideal, but as the hold-time
is tiny it is still faster than searching a list.  A future patch will
remove them using llist.h.  This involves some subtlety and so is left
to a separate patch.

This removes the need for the RQ_BUSY flag.  The rqst is "busy"
precisely when it is not on the "idle" list.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:04 -04:00
NeilBrown
fa341560ca SUNRPC: change how svc threads are asked to exit.
svc threads are currently stopped using kthread_stop().  This requires
identifying a specific thread.  However we don't care which thread
stops, just as long as one does.

So instead, set a flag in the svc_pool to say that a thread needs to
die, and have each thread check this flag instead of calling
kthread_should_stop().  The first thread to find and clear this flag
then moves towards exiting.

This removes an explicit dependency on sp_all_threads which will make a
future patch simpler.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:04 -04:00
NeilBrown
063ab935a4 SUNRPC: integrate back-channel processing with svc_recv()
Using svc_recv() for (NFSv4.1) back-channel handling means we have just
one mechanism for waking threads.

Also change kthread_freezable_should_stop() in nfs4_callback_svc() to
kthread_should_stop() as used elsewhere.
kthread_freezable_should_stop() effectively adds a try_to_freeze() call,
and svc_recv() already contains that at an appropriate place.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
Chuck Lever
6ed8cdf967 SUNRPC: Clean up bc_svc_process()
The test robot complained that, in some build configurations, the
@error variable in bc_svc_process's only caller is set but never
used. This happens because dprintk() is the only consumer of that
value.

 - Remove the dprintk() call sites in favor of the svc_process
   tracepoint
 - The @error variable and the return value of bc_svc_process() are
   now unused, so get rid of them.
 - The @serv parameter is set to rqstp->rq_serv by the only caller,
   and bc_svc_process() then uses it only to set rqstp->rq_serv. It
   can be removed.
 - Rename bc_svc_process() according to the convention that
   globally-visible RPC server functions have names that begin with
   "svc_"; and because it is globally-visible, give it a proper
   kdoc comment.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308121314.HA8Rq2XG-lkp@intel.com/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
NeilBrown
7b31f4daeb SUNRPC: rename and refactor svc_get_next_xprt()
svc_get_next_xprt() does a lot more than just get an xprt.  It also
decides if it needs to sleep, depending not only on the availability of
xprts but also on the need to exit or handle external work.

So rename it to svc_rqst_wait_for_work() and only do the testing and
waiting.  Move all the waiting-related code out of svc_recv() into the
new svc_rqst_wait_for_work().

Move the dequeueing code out of svc_get_next_xprt() into svc_recv().

Previously svc_xprt_dequeue() would be called twice, once before waiting
and possibly once after.  Now instead rqst_should_sleep() is called
twice: once to decide if waiting is needed, and once, after setting the
task state, to check whether we might have missed a wakeup.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
NeilBrown
e3274026e2 SUNRPC: move all of xprt handling into svc_xprt_handle()
svc_xprt_handle() does lots of things itself, but leaves some to the
caller - svc_recv().  This isn't elegant.

Move that code out of svc_recv() into svc_xprt_handle().

Move the calls to svc_xprt_release() from svc_send() and svc_drop()
(the two possible final steps in svc_process()) and from svc_recv() (in
the case where svc_process() wasn't called) into svc_xprt_handle().

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:02 -04:00
Linus Torvalds
a1ef447dee Fixes for an overreaching WARN_ON, two error paths and a switch to
kernel_connect(), which recently gained protection against someone using
 BPF to rewrite the address.  All but one marked for stable.

Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Fixes for an overreaching WARN_ON, two error paths and a switch to
  kernel_connect() which recently grown protection against someone using
  BPF to rewrite the address.

  All but one marked for stable"

* tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client:
  ceph: fix type promotion bug on 32bit systems
  libceph: use kernel_connect()
  ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()
  ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13 11:27:31 -07:00
Jeremy Cline
354a6e707e nfc: nci: assert requested protocol is valid
The protocol is used in a bit mask to determine if the protocol is
supported. Assert that the provided protocol is less than the maximum
defined, so it doesn't potentially perform a shift-out-of-bounds, and
provide a clearer error for undefined protocols vs unsupported ones.
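
A hedged sketch of the intended check; NFC_PROTO_MAX comes from the NFC uAPI
headers, while the supported_protocols variable and the error codes below are
illustrative only:

/* Reject protocol values outside the defined range before using them
 * as a shift amount; only then test against the supported mask. */
if (protocol >= NFC_PROTO_MAX)
	return -EINVAL;			/* undefined protocol */

if (!(supported_protocols & BIT(protocol)))
	return -EPROTO;			/* defined but unsupported */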

Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Reported-and-tested-by: syzbot+0839b78e119aae1fec78@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0839b78e119aae1fec78
Signed-off-by: Jeremy Cline <jeremy@jcline.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231009200054.82557-1-jeremy@jcline.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-12 09:32:10 +02:00
Kuniyuki Iwashima
e2bca4870f af_packet: Fix fortified memcpy() without flex array.
Sergei Trofimovich reported a regression [0] caused by commit a0ade8404c3b
("af_packet: Fix warning of fortified memcpy() in packet_getname().").

It introduced a flex array sll_addr_flex in struct sockaddr_ll as a
union-ed member with sll_addr to work around the fortified memcpy() check.

However, a userspace program uses a struct that has struct sockaddr_ll in
the middle, where a flexible array member is not allowed to exist.

  include/linux/if_packet.h:24:17: error: flexible array member 'sockaddr_ll::<unnamed union>::<unnamed struct>::sll_addr_flex' not at end of 'struct packet_info_t'
     24 |                 __DECLARE_FLEX_ARRAY(unsigned char, sll_addr_flex);
        |                 ^~~~~~~~~~~~~~~~~~~~

To fix the regression, let's go back to the first attempt [1] telling
memcpy() the actual size of the array.

Reported-by: Sergei Trofimovich <slyich@gmail.com>
Closes: https://github.com/NixOS/nixpkgs/pull/252587#issuecomment-1741733002 [0]
Link: https://lore.kernel.org/netdev/20230720004410.87588-3-kuniyu@amazon.com/ [1]
Fixes: a0ade8404c3b ("af_packet: Fix warning of fortified memcpy() in packet_getname().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20231009153151.75688-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-12 09:15:15 +02:00
Jakub Kicinski
71c299c711 net: tcp: fix crashes trying to free half-baked MTU probes
tcp_stream_alloc_skb() initializes the skb to use tcp_tsorted_anchor
which is a union with the destructor. We need to clean that
TCP-iness up before freeing.

Fixes: 736013292e3c ("tcp: let tcp_mtu_probe() build headless packets")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231010173651.3990234-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11 17:24:46 -07:00
Nils Hoppmann
a950a5921d net/smc: Fix pos miscalculation in statistics
SMC_STAT_PAYLOAD_SUB(_smc_stats, _tech, key, _len, _rc) will calculate
wrong bucket positions for payloads of exactly 4096 bytes and
(1 << (m + 12)) bytes, with m == SMC_BUF_MAX - 1.

Intended bucket distribution:
Assume l == size of payload, m == SMC_BUF_MAX - 1.

Bucket 0                : 0 < l <= 2^13
Bucket n, 1 <= n <= m-1 : 2^(n+12) < l <= 2^(n+13)
Bucket m                : l > 2^(m+12)

Current solution:
_pos = fls64((l) >> 13)
[...]
_pos = (_pos < m) ? ((l == 1 << (_pos + 12)) ? _pos - 1 : _pos) : m

For l == 4096, _pos == -1, but should be _pos == 0.
For l == (1 << (m + 12)), _pos == m, but should be _pos == m - 1.

In order to avoid special treatment of these corner cases, the
calculation is adjusted. The new solution first subtracts one from the
length, and then calculates the correct bucket by shifting accordingly,
i.e. _pos = fls64((l - 1) >> 13), l > 0.
This not only fixes the issues named above, but also makes the whole
bucket assignment easier to follow.

The same is done for SMC_STAT_RMB_SIZE_SUB(_smc_stats, _tech, k, _len),
where the calculation of the bucket position is similar to the one
named above.
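
A stand-alone sketch comparing the two formulas for the corner cases named
above; fls64() is emulated with a compiler builtin and SMC_BUF_MAX is set to
an arbitrary value purely for the demonstration:

#include <stdio.h>
#include <stdint.h>

#define SMC_BUF_MAX 5			/* arbitrary value for the demo */

static int fls64(uint64_t x)		/* index of the most significant set bit */
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

static int pos_old(uint64_t l, int m)	/* previous calculation */
{
	int pos = fls64(l >> 13);

	return (pos < m) ? ((l == (1ULL << (pos + 12))) ? pos - 1 : pos) : m;
}

static int pos_new(uint64_t l, int m)	/* _pos = fls64((l - 1) >> 13), clamped to m */
{
	int pos = fls64((l - 1) >> 13);

	return (pos < m) ? pos : m;
}

int main(void)
{
	int m = SMC_BUF_MAX - 1;
	uint64_t corner[] = { 4096, 8192, 8193, 1ULL << (m + 12) };

	for (unsigned int i = 0; i < sizeof(corner) / sizeof(corner[0]); i++)
		printf("len=%-8llu old=%2d new=%2d\n",
		       (unsigned long long)corner[i],
		       pos_old(corner[i], m), pos_new(corner[i], m));
	return 0;
}

For l == 4096 this prints old=-1 new=0, and for l == 1 << (m + 12) it prints
old == m and new == m - 1, matching the corner cases described above.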

Fixes: e0e4b8fa5338 ("net/smc: Add SMC statistics support")
Suggested-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:36:35 +01:00
Jakub Kicinski
ad98426a88 bpf-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-10-11

We've added 14 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 398 insertions(+), 104 deletions(-).

The main changes are:

1) Fix s390 JIT backchain issues in the trampoline code generation which
   previously clobbered the caller's backchain, from Ilya Leoshkevich.

2) Fix zero-size allocation warning in xsk sockets when the configured
   ring size was close to SIZE_MAX, from Andrew Kanner.

3) Fixes for bpf_mprog API that were found when implementing support
   in the ebpf-go library along with selftests, from Daniel Borkmann
   and Lorenz Bauer.

4) Fix riscv JIT to properly sign-extend the return register in programs.
   This fixes various test_progs selftests on riscv, from Björn Töpel.

5) Fix verifier log for async callback return values where the allowed
   range was displayed incorrectly, from David Vernet.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  s390/bpf: Fix unwinding past the trampoline
  s390/bpf: Fix clobbering the caller's backchain in the trampoline
  selftests/bpf: Add testcase for async callback return value failure
  bpf: Fix verifier log for async callback return values
  xdp: Fix zero-size allocation warning in xskq_create()
  riscv, bpf: Track both a0 (RISC-V ABI) and a5 (BPF) return values
  riscv, bpf: Sign-extend return values
  selftests/bpf: Make seen_tc* variable tests more robust
  selftests/bpf: Test query on empty mprog and pass revision into attach
  selftests/bpf: Adapt assert_mprog_count to always expect 0 count
  selftests/bpf: Test bpf_mprog query API via libbpf and raw syscall
  bpf: Refuse unused attributes in bpf_prog_{attach,detach}
  bpf: Handle bpf_mprog_query with NULL entry
  bpf: Fix BPF_PROG_QUERY last field check
====================

Link: https://lore.kernel.org/r/20231010223610.3984-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:59:49 -07:00
Kory Maincent
108a36d07c ethtool: Fix mod state of verbose no_mask bitset
A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. The commit
6699170376ab fixes this issue by clearing the whole target bitmap before we
start iterating. That solution, however, introduced an issue with the behavior
of the mod variable: as the bitset is always cleared, the old val will
always differ from the new val.

Fix it by adding a new temporary variable which saves the state of the old
bitmap.
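
A minimal sketch of that approach; the fixed size and variable names are
illustrative, not the actual ethnl code:

#define NBITS 64			/* illustrative size only */
DECLARE_BITMAP(old, NBITS);

bitmap_copy(old, target, NBITS);	/* snapshot before the unconditional clear */
bitmap_zero(target, NBITS);		/* no-mask _SET: exactly the requested bits */
/* ... set the bits named in the request ... */
*mod = !bitmap_equal(old, target, NBITS);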

Fixes: 6699170376ab ("ethtool: fix application of verbose no_mask bitset")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231009133645.44503-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:48:15 -07:00
Jakub Kicinski
b52acd02c1 linux-can-fixes-for-6.6-20231009

Merge tag 'linux-can-fixes-for-6.6-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2023-10-09

Lukas Magel's patch for the CAN ISO-TP protocol fixes the TX state
detection and wait behavior.

John Watts contributes a patch to only show the sun4i_can Kconfig
option on ARCH_SUNXI.

A patch by Miquel Raynal fixes the soft-reset workaround for Renesas
SoCs in the sja1000 driver.

Markus Schneider-Pargmann's patch for the tcan4x5x m_can glue driver
fixes the id2 register for the tcan4553.

2 patches by Haibo Chen fix the flexcan stop mode for the imx93 SoC.

* tag 'linux-can-fixes-for-6.6-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: tcan4x5x: Fix id2_register for tcan4553
  can: flexcan: remove the auto stop mode for IMX93
  can: sja1000: Always restart the Tx queue after an overrun
  arm64: dts: imx93: add the Flex-CAN stop mode by GPR
  can: sun4i_can: Only show Kconfig if ARCH_SUNXI is set
  can: isotp: isotp_sendmsg(): fix TX state detection and wait behavior
====================

Link: https://lore.kernel.org/r/20231009085256.693378-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:46:00 -07:00
Eric Dumazet
31c07dffaf net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
Sili Luo reported a race in nfc_llcp_sock_get(), leading to UAF.

Getting a reference on the socket found in a lookup while
holding a lock should happen before releasing the lock.

nfc_llcp_sock_get_sn() has a similar problem.

Finally nfc_llcp_recv_snl() needs to make sure the socket
found by nfc_llcp_sock_from_sn() does not disappear.
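
A hedged sketch of that ordering; the list and lock follow the LLCP local
socket list, while the match() predicate is illustrative:

struct sock *sk, *found = NULL;

read_lock(&local->sockets.lock);
sk_for_each(sk, &local->sockets.head) {
	if (match(sk)) {
		sock_hold(sk);		/* take the reference before unlocking */
		found = sk;
		break;
	}
}
read_unlock(&local->sockets.lock);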

Fixes: 8f50020ed9b8 ("NFC: LLCP late binding")
Reported-by: Sili Luo <rootlab@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20231009123110.3735515-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:44:44 -07:00
Jeremy Kerr
5093bbfc10 mctp: perform route lookups under a RCU read-side lock
Our current route lookups (mctp_route_lookup and mctp_route_lookup_null)
traverse the net's route list without the RCU read lock held. This means
the route lookup is subject to preemption, resulting in an potential
grace period expiry, and so an eventual kfree() while we still have the
route pointer.

Add the proper read-side critical section locks around the route
lookups, preventing preemption and a possible parallel kfree.

The remaining net->mctp.routes accesses are already under a
rcu_read_lock, or protected by the RTNL for updates.

Based on an analysis from Sili Luo <rootlab@huawei.com>, where
introducing a delay in the route lookup could cause a UAF on
simultaneous sendmsg() and route deletion.
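
A hedged sketch of the read-side critical section; the match() predicate is
illustrative, while the routes list and RCU iteration follow the text above:

struct mctp_route *rt;

rcu_read_lock();
list_for_each_entry_rcu(rt, &net->mctp.routes, list) {
	if (match(rt, dnet, daddr)) {
		/* rt is only guaranteed to stay alive inside this
		 * read-side critical section (or once refcounted). */
		break;
	}
}
rcu_read_unlock();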

Reported-by: Sili Luo <rootlab@huawei.com>
Fixes: 889b7da23abf ("mctp: Add initial routing framework")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/29c4b0e67dc1bf3571df3982de87df90cae9b631.1696837310.git.jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:43:22 -07:00
Gerd Bayer
a72178cfe8 net/smc: Fix dependency of SMC on ISM
When the SMC protocol is built into the kernel proper while ISM is
configured to be built as module, linking the kernel fails due to
unresolved dependencies out of net/smc/smc_ism.o to
ism_get_smcd_ops, ism_register_client, and ism_unregister_client
as reported via the linux-next test automation (see link).
This however is a bug introduced a while ago.

Correct the dependency list in ISM's and SMC's Kconfig to reflect the
dependencies that are actually inverted. With this you cannot build a
kernel with CONFIG_SMC=y and CONFIG_ISM=m. Either ISM needs to be 'y',
too - or a 'n'. That way, SMC can still be configured on non-s390
architectures that do not have (nor need) an ISM driver.

Fixes: 89e7d2ba61b7 ("net/ism: Add new API for client registration")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/linux-next/d53b5b50-d894-4df8-8969-fd39e63440ae@infradead.org/
Co-developed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231006125847.1517840-1-gbayer@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10 11:51:41 +02:00
Eric Dumazet
26c29961b1 net: refine debug info in skb_checksum_help()
syzbot uses panic_on_warn.

This means that the skb_dump() calls I added in the blamed commit are
not even called.

Rewrite this so that we get the needed skb dump before syzbot crashes.

Fixes: eeee4b77dc52 ("net: add more debug info in skb_checksum_help()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231006173355.2254983-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-09 19:46:41 -07:00
Andrew Kanner
a12bbb3ccc xdp: Fix zero-size allocation warning in xskq_create()
Syzkaller reported the following issue:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 2807 at mm/vmalloc.c:3247 __vmalloc_node_range (mm/vmalloc.c:3361)
  Modules linked in:
  CPU: 0 PID: 2807 Comm: repro Not tainted 6.6.0-rc2+ #12
  Hardware name: Generic DT based system
  unwind_backtrace from show_stack (arch/arm/kernel/traps.c:258)
  show_stack from dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
  dump_stack_lvl from __warn (kernel/panic.c:633 kernel/panic.c:680)
  __warn from warn_slowpath_fmt (./include/linux/context_tracking.h:153 kernel/panic.c:700)
  warn_slowpath_fmt from __vmalloc_node_range (mm/vmalloc.c:3361 (discriminator 3))
  __vmalloc_node_range from vmalloc_user (mm/vmalloc.c:3478)
  vmalloc_user from xskq_create (net/xdp/xsk_queue.c:40)
  xskq_create from xsk_setsockopt (net/xdp/xsk.c:953 net/xdp/xsk.c:1286)
  xsk_setsockopt from __sys_setsockopt (net/socket.c:2308)
  __sys_setsockopt from ret_fast_syscall (arch/arm/kernel/entry-common.S:68)

xskq_get_ring_size() uses the struct_size() macro to safely calculate the
size of struct xsk_queue plus q->nentries desc members. But the
syzkaller repro was able to set q->nentries, with the value initially
taken from copy_from_sockptr(), high enough for struct_size() to return
SIZE_MAX. The subsequent PAGE_ALIGN(size) in that case will overflow
the size_t value and set it to 0. This will trigger WARN_ON_ONCE in
vmalloc_user() -> __vmalloc_node_range().

The issue is reproducible on 32-bit arm kernel.

Fixes: 9f78bf330a66 ("xsk: support use vaddr as ring")
Reported-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c84b4705fb31741e@google.com/T/
Reported-by: syzbot+b132693e925cbbd89e26@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000e20df20606ebab4f@google.com/T/
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://syzkaller.appspot.com/bug?extid=fae676d3cf469331fc89
Link: https://lore.kernel.org/bpf/20231007075148.1759-1-andrew.kanner@gmail.com
2023-10-09 16:13:29 +02:00
Jordan Rife
7563cf17dc libceph: use kernel_connect()
Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().

This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.
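
A hedged sketch of the substitution; the argument values are illustrative,
not the exact libceph hunk:

/* Before: a direct method call lets BPF sock_addr hooks rewrite &ss. */
ret = sock->ops->connect(sock, (struct sockaddr *)&ss, sizeof(ss), O_NONBLOCK);

/* After: kernel_connect() works on a copy of the address, insulating
 * the caller from such rewrites. */
ret = kernel_connect(sock, (struct sockaddr *)&ss, sizeof(ss), O_NONBLOCK);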

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-10-09 13:35:24 +02:00
Moshe Shemesh
aba0e909dc devlink: Hold devlink lock on health reporter dump get
Devlink health dump get callback should take devlink lock as any other
devlink callback. Otherwise, since devlink_mutex was removed, this
callback is not protected from a race of the reporter being destroyed
while handling the callback.

Add devlink lock to the callback and to any call for
devlink_health_do_dump(). This should be safe, as none of the drivers'
dump callback implementations take the devlink lock.

As devlink lock is added to any callback of dump, the reporter dump_lock
is now redundant and can be removed.

Fixes: d3efc2a6a6d8 ("net: devlink: remove devlink_mutex")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/1696510216-189379-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-06 15:56:46 -07:00
Lukas Magel
d9c2ba65e6 can: isotp: isotp_sendmsg(): fix TX state detection and wait behavior
With patch [1], isotp_poll was updated to also queue the poller in the
so->wait queue, which is used for send state changes. Since the queue
now also contains polling tasks that are not interested in sending, the
queue fill state can no longer be used as an indication of send
readiness. As a consequence, nonblocking writes can lead to a race and
lock-up of the socket if there is a second task polling the socket in
parallel.

With this patch, isotp_sendmsg does not consult wq_has_sleepers but
instead tries to atomically set so->tx.state and waits on so->wait if it
is unable to do so. This behavior is in alignment with isotp_poll, which
also checks so->tx.state to determine send readiness.

V2:
- Revert direct exit to goto err_event_drop

[1] https://lore.kernel.org/all/20230331125511.372783-1-michal.sojka@cvut.cz
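
A hedged sketch of the claim-the-sender-slot logic described above; the state
names follow the ISO-TP code, but the exact loop and error handling in the
real patch differ:

/* Atomically move tx.state from IDLE to SENDING; a task that loses the
 * race either returns -EAGAIN (nonblocking) or sleeps on so->wait. */
while (cmpxchg(&so->tx.state, ISOTP_IDLE, ISOTP_SENDING) != ISOTP_IDLE) {
	if (msg->msg_flags & MSG_DONTWAIT)
		return -EAGAIN;

	if (wait_event_interruptible(so->wait, so->tx.state == ISOTP_IDLE))
		return -ERESTARTSYS;
}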

Reported-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Closes: https://lore.kernel.org/linux-can/11328958-453f-447f-9af8-3b5824dfb041@munic.io/
Signed-off-by: Lukas Magel <lukas.magel@posteo.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Fixes: 79e19fa79cb5 ("can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events")
Link: https://github.com/pylessard/python-udsoncan/issues/178#issuecomment-1743786590
Link: https://lore.kernel.org/all/20230827092205.7908-1-lukas.magel@posteo.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-10-06 12:54:33 +02:00
Gustavo A. R. Silva
c4d49196ce net: sched: cls_u32: Fix allocation size in u32_init()
commit d61491a51f7e ("net/sched: cls_u32: Replace one-element array
with flexible-array member") incorrecly replaced an instance of
`sizeof(*tp_c)` with `struct_size(tp_c, hlist->ht, 1)`. This results
in a an over-allocation of 8 bytes.

This change is wrong because `hlist` in `struct tc_u_common` is a
pointer:

net/sched/cls_u32.c:
struct tc_u_common {
        struct tc_u_hnode __rcu *hlist;
        void                    *ptr;
        int                     refcnt;
        struct idr              handle_idr;
        struct hlist_node       hnode;
        long                    knodes;
};

So, the use of `struct_size()` makes no sense: we don't need to allocate
any extra space for a flexible-array member. `sizeof(*tp_c)` is just fine.

So, `struct_size(tp_c, hlist->ht, 1)` translates to:

sizeof(*tp_c) + sizeof(tp_c->hlist->ht) ==
sizeof(struct tc_u_common) + sizeof(struct tc_u_knode *) ==
						144 + 8  == 0x98 (bytes)
						     ^^^
						      |
						unnecessary extra
						allocation size

$ pahole -C tc_u_common net/sched/cls_u32.o
struct tc_u_common {
	struct tc_u_hnode *        hlist;                /*     0     8 */
	void *                     ptr;                  /*     8     8 */
	int                        refcnt;               /*    16     4 */

	/* XXX 4 bytes hole, try to pack */

	struct idr                 handle_idr;           /*    24    96 */
	/* --- cacheline 1 boundary (64 bytes) was 56 bytes ago --- */
	struct hlist_node          hnode;                /*   120    16 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	long int                   knodes;               /*   136     8 */

	/* size: 144, cachelines: 3, members: 6 */
	/* sum members: 140, holes: 1, sum holes: 4 */
	/* last cacheline: 16 bytes */
};

And with `sizeof(*tp_c)`, we have:

	sizeof(*tp_c) == sizeof(struct tc_u_common) == 144 == 0x90 (bytes)

which is the correct and original allocation size.

Fix this issue by replacing `struct_size(tp_c, hlist->ht, 1)` with
`sizeof(*tp_c)`, and avoid allocating 8 too many bytes.

The following difference in binary output is expected and reflects the
desired change:

| net/sched/cls_u32.o
| @@ -6148,7 +6148,7 @@
| include/linux/slab.h:599
|     2cf5:      mov    0x0(%rip),%rdi        # 2cfc <u32_init+0xfc>
|                        2cf8: R_X86_64_PC32     kmalloc_caches+0xc
|-    2cfc:      mov    $0x98,%edx
|+    2cfc:      mov    $0x90,%edx

Reported-by: Alejandro Colomar <alx@kernel.org>
Closes: https://lore.kernel.org/lkml/09b4a2ce-da74-3a19-6961-67883f634d98@kernel.org/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-06 11:43:05 +01:00
Linus Torvalds
f291209eca Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot
 of 6.5 fixes here. WiFi fixes are most user-awaited.
 
 Current release - regressions:
 
  - Bluetooth: fix hci_link_tx_to RCU lock usage
 
 Current release - new code bugs:
 
  - bpf: mprog: fix maximum program check on mprog attachment
 
  - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
 
 Previous releases - regressions:
 
  - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
 
  - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(),
    it doesn't handle zero length like we expected
 
  - wifi:
    - cfg80211: fix cqm_config access race, fix crashes with brcmfmac
    - iwlwifi: mvm: handle PS changes in vif_cfg_changed
    - mac80211: fix mesh id corruption on 32 bit systems
    - mt76: mt76x02: fix MT76x0 external LNA gain handling
 
  - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
 
  - l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
 
  - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
 
  - eth: stmmac: fix the incorrect parameter after refactoring
 
 Previous releases - always broken:
 
  - net: replace calls to sock->ops->connect() with kernel_connect(),
    prevent address rewrite in kernel_bind(); otherwise BPF hooks may
    modify arguments, unexpectedly to the caller
 
  - tcp: fix delayed ACKs when reads and writes align with MSS
 
  - bpf:
    - verifier: unconditionally reset backtrack_state masks on global
      func exit
    - s390: let arch_prepare_bpf_trampoline return program size,
      fix struct_ops offsets
    - sockmap: fix accounting of available bytes in presence of PEEKs
    - sockmap: reject sk_msg egress redirects to non-TCP sockets
 
  - ipv4/fib: send netlink notify when delete source address routes
 
  - ethtool: plca: fix width of reads when parsing netlink commands
 
  - netfilter: nft_payload: rebuild vlan header on h_proto access
 
  - Bluetooth: hci_codec: fix leaking memory of local_codecs
 
  - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
 
  - eth: stmmac:
    - dwmac-stm32: fix resume on STM32 MCU
    - remove buggy and unneeded stmmac_poll_controller, depend on NAPI
 
  - ibmveth: always recompute TCP pseudo-header checksum, fix use
    of the driver with Open vSwitch
 
  - wifi:
    - rtw88: rtw8723d: fix MAC address offset in EEPROM
    - mt76: fix lock dependency problem for wed_lock
    - mwifiex: sanity check data reported by the device
    - iwlwifi: ensure ack flag is properly cleared
    - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
    - iwlwifi: mvm: fix incorrect usage of scan API
 
 Misc:
 
  - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from Bluetooth, netfilter, BPF and WiFi.

  I didn't collect precise data but feels like we've got a lot of 6.5
  fixes here. WiFi fixes are most user-awaited.

  Current release - regressions:

   - Bluetooth: fix hci_link_tx_to RCU lock usage

  Current release - new code bugs:

   - bpf: mprog: fix maximum program check on mprog attachment

   - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()

  Previous releases - regressions:

   - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling

   - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
     doesn't handle zero length like we expected

   - wifi:
      - cfg80211: fix cqm_config access race, fix crashes with brcmfmac
      - iwlwifi: mvm: handle PS changes in vif_cfg_changed
      - mac80211: fix mesh id corruption on 32 bit systems
      - mt76: mt76x02: fix MT76x0 external LNA gain handling

   - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER

   - l2tp: fix handling of transhdrlen in __ip{,6}_append_data()

   - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent

   - eth: stmmac: fix the incorrect parameter after refactoring

  Previous releases - always broken:

   - net: replace calls to sock->ops->connect() with kernel_connect(),
     prevent address rewrite in kernel_bind(); otherwise BPF hooks may
     modify arguments, unexpectedly to the caller

   - tcp: fix delayed ACKs when reads and writes align with MSS

   - bpf:
      - verifier: unconditionally reset backtrack_state masks on global
        func exit
      - s390: let arch_prepare_bpf_trampoline return program size, fix
        struct_ops offsets
      - sockmap: fix accounting of available bytes in presence of PEEKs
      - sockmap: reject sk_msg egress redirects to non-TCP sockets

   - ipv4/fib: send netlink notify when delete source address routes

   - ethtool: plca: fix width of reads when parsing netlink commands

   - netfilter: nft_payload: rebuild vlan header on h_proto access

   - Bluetooth: hci_codec: fix leaking memory of local_codecs

   - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids

   - eth: stmmac:
     - dwmac-stm32: fix resume on STM32 MCU
     - remove buggy and unneeded stmmac_poll_controller, depend on NAPI

   - ibmveth: always recompute TCP pseudo-header checksum, fix use of
     the driver with Open vSwitch

   - wifi:
      - rtw88: rtw8723d: fix MAC address offset in EEPROM
      - mt76: fix lock dependency problem for wed_lock
      - mwifiex: sanity check data reported by the device
      - iwlwifi: ensure ack flag is properly cleared
      - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
      - iwlwifi: mvm: fix incorrect usage of scan API

  Misc:

   - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"

* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  MAINTAINERS: update Matthieu's email address
  mptcp: userspace pm allow creating id 0 subflow
  mptcp: fix delegated action races
  net: stmmac: remove unneeded stmmac_poll_controller
  net: lan743x: also select PHYLIB
  net: ethernet: mediatek: disable irq before schedule napi
  net: mana: Fix oversized sge0 for GSO packets
  net: mana: Fix the tso_bytes calculation
  net: mana: Fix TX CQE error handling
  netlink: annotate data-races around sk->sk_err
  sctp: update hb timer immediately after users change hb_interval
  sctp: update transport state when processing a dupcook packet
  tcp: fix delayed ACKs for MSS boundary condition
  tcp: fix quick-ack counting to count actual ACKs of new data
  page_pool: fix documentation typos
  tipc: fix a potential deadlock on &tx->lock
  net: stmmac: dwmac-stm32: fix resume on STM32 MCU
  ipv4: Set offload_failed flag in fibmatch results
  netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
  netfilter: nf_tables: Deduplicate nft_register_obj audit logs
  ...
2023-10-05 11:29:21 -07:00
Geliang Tang
e5ed101a60 mptcp: userspace pm allow creating id 0 subflow
This patch drops the id 0 limitation in mptcp_nl_cmd_sf_create() to allow
creating additional subflows with the local addr ID 0.

There is no reason not to allow additional subflows from this local
address: we should be able to create new subflows from the initial
endpoint. This limitation was breaking fullmesh support from userspace.

Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/391
Cc: stable@vger.kernel.org
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-2-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 09:34:32 -07:00
Paolo Abeni
a5efdbcece mptcp: fix delegated action races
The delegated action infrastructure is prone to the following
race: different CPUs can try to schedule different delegated
actions on the same subflow at the same time.

Each of them will check different bits via mptcp_subflow_delegate(),
and will try to schedule the action on the related per-cpu napi
instance.

Depending on the timing, both can observe an empty delegated list
node, causing the same entry to be added simultaneously on two different
lists.

The root cause is that the delegated actions infra does not provide
a single synchronization point. Address the issue by reserving an additional
bit to mark the subflow as scheduled for delegation. Acquiring such a bit
guarantees that the caller owns the delegated list node and is able to
safely schedule the subflow.

Clear such a bit only when the subflow scheduling is completed, ensuring
a proper barrier is in place.

Additionally, swap the meaning of the delegated_action bitmask, to allow
the usage of the existing helper to set multiple bits at once.
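
A hedged sketch of the claim/release protocol; the bit and field names below
follow this description but are assumptions here, not copied code:

/* Only the CPU that wins this test_and_set_bit() owns the per-subflow
 * delegated list node and may queue it on its napi list. */
if (!test_and_set_bit(MPTCP_DELEGATE_SCHEDULED, &subflow->delegated_status)) {
	/* safe to add the subflow's node to this CPU's delegated list */
}

/* ... and only once the action has run and the node is off the list: */
clear_bit(MPTCP_DELEGATE_SCHEDULED, &subflow->delegated_status);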

Fixes: bcd97734318d ("mptcp: use delegate action to schedule 3rd ack retrans")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 09:34:31 -07:00
Eric Dumazet
d0f95894fd netlink: annotate data-races around sk->sk_err
syzbot caught another data-race in netlink when
setting sk->sk_err.

Annotate all of them for good measure.
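
A hedged sketch of the annotation pattern; the surrounding code is
illustrative, not the exact netlink hunks:

/* Lockless writer, e.g. recording an error on the socket: */
WRITE_ONCE(sk->sk_err, err);
sk_error_report(sk);

/* Lockless reader: */
err = READ_ONCE(sk->sk_err);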

BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg

write to 0xffff8881613bb220 of 4 bytes by task 28147 on cpu 0:
netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
sock_recvmsg_nosec net/socket.c:1027 [inline]
sock_recvmsg net/socket.c:1049 [inline]
__sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
__do_sys_recvfrom net/socket.c:2247 [inline]
__se_sys_recvfrom net/socket.c:2243 [inline]
__x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

write to 0xffff8881613bb220 of 4 bytes by task 28146 on cpu 1:
netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
sock_recvmsg_nosec net/socket.c:1027 [inline]
sock_recvmsg net/socket.c:1049 [inline]
__sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
__do_sys_recvfrom net/socket.c:2247 [inline]
__se_sys_recvfrom net/socket.c:2243 [inline]
__x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00000000 -> 0x00000016

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 28146 Comm: syz-executor.0 Not tainted 6.6.0-rc3-syzkaller-00055-g9ed22ae6be81 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231003183455.3410550-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 17:32:54 -07:00
Xin Long
1f4e803cd9 sctp: update hb timer immediately after users change hb_interval
Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of the hb timer. As the default value is 30s, users
may have to wait up to 30s for the hb_interval update to take effect.

This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.

Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent (from the timer),
as mentioned in:

  https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html
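
A hedged sketch of the change; sctp_transport_reset_hb_timer() is the existing
helper that re-arms the heartbeat timer, and the setsockopt plumbing around it
is omitted:

/* When userspace updates the interval, re-arm the timer immediately
 * instead of letting the old interval run out first. */
trans->hbinterval = msecs_to_jiffies(params->spp_hbinterval);
sctp_transport_reset_hb_timer(trans);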

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 17:29:58 -07:00
Xin Long
2222a78075 sctp: update transport state when processing a dupcook packet
During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.

In the collision scenario below:

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
    192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
    192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]

when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().

However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.

This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().

This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer(), where the transport state is updated if a transport
with the same ipaddr already exists in the old asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 17:29:44 -07:00
Neal Cardwell
4720852ed9 tcp: fix delayed ACKs for MSS boundary condition
This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:

(1) If an app reads data when > 1*MSS is unacknowledged, then
    tcp_cleanup_rbuf() ACKs immediately because of:

     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads all received data, and the packets were < 1*MSS,
    and either (a) the app is not ping-pong or (b) we received two
    packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
    of:

     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
       !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data,
    tcp_cleanup_rbuf() does not send an immediate ACK. This is true
    even if the app is not ping-pong and the 1*MSS of data had the PSH
    bit set, suggesting the sending application completed an
    application write.

Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.

The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
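
A hedged sketch of the receive-side heuristic; the exact placement and
comparison in the real patch may differ:

/* An skb of (at least) one MSS with PSH set likely ends an application
 * write; mark the ACK as "pushed" so tcp_cleanup_rbuf() ACKs as soon as
 * the application has read all of the data. */
if (skb->len >= icsk->icsk_ack.rcv_mss && tcp_hdr(skb)->psh)
	icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;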

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 15:34:18 -07:00
Neal Cardwell
059217c18b tcp: fix quick-ack counting to count actual ACKs of new data
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.

The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.

When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.

And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.

The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.

Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 15:34:18 -07:00
Jakub Kicinski
c56e67f3ff netfilter pull request 2023-10-04

Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
netfilter patches for net

First patch resolves a regression with vlan header matching, this was
broken since 6.5 release.  From myself.

Second patch fixes an ancient problem with sctp connection tracking in
case INIT_ACK packets are delayed.  This comes with a selftest, both
patches from Xin Long.

Patch 4 extends the existing nftables audit selftest, from
Phil Sutter.

Patch 5, also from Phil, avoids a situation where nftables
would emit an audit record twice. This was broken since 5.13 days.

Patch 6, from myself, avoids spurious insertion failure if we encounter an
overlapping but expired range during element insertion with the
'nft_set_rbtree' backend. This problem exists since 6.2.

* tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
  netfilter: nf_tables: Deduplicate nft_register_obj audit logs
  selftests: netfilter: Extend nft_audit.sh
  selftests: netfilter: test for sctp collision processing in nf_conntrack
  netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
  netfilter: nft_payload: rebuild vlan header on h_proto access
====================

Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 14:53:17 -07:00
Chengfeng Ye
08e50cf071 tipc: fix a potential deadlock on &tx->lock
tipc_crypto_key_revoke() can be invoked both from the workqueue handler
tipc_crypto_work_rx() in process context and from timer/rx callbacks in
softirq context, so the acquisition of &tx->lock should use
spin_lock_bh() to prevent a possible deadlock.

This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.

tipc_crypto_work_rx() <workqueue>
--> tipc_crypto_key_distr()
--> tipc_bcast_xmit()
--> tipc_bcbase_xmit()
--> tipc_bearer_bc_xmit()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock)
<timer interrupt>
   --> tipc_disc_timeout()
   --> tipc_bearer_xmit_skb()
   --> tipc_crypto_xmit()
   --> tipc_ehdr_build()
   --> tipc_crypto_key_revoke()
   --> spin_lock(&tx->lock) <deadlock here>
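
A minimal sketch of the resulting change (the signature and the
tipc_net() accessor are assumed from the call chain above; the
key-handling body is elided):

static void tipc_crypto_key_revoke(struct net *net, u8 tx_key)
{
	struct tipc_crypto *tx = tipc_net(net)->crypto_tx;	/* assumed accessor */

	spin_lock_bh(&tx->lock);	/* was spin_lock(&tx->lock) */
	/* ... detach and free the revoked TX key under the lock ... */
	spin_unlock_bh(&tx->lock);
}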

Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 13:24:12 -07:00
Benjamin Poirier
0add5c597f ipv4: Set offload_failed flag in fibmatch results
Due to a small omission, the offload_failed flag is missing from ipv4
fibmatch results. Make sure it is set correctly.

The issue can be witnessed using the following commands:
echo "1 1" > /sys/bus/netdevsim/new_device
ip link add dummy1 up type dummy
ip route add 192.0.2.0/24 dev dummy1
echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
ip route add 198.51.100.0/24 dev dummy1
ip route
	# 192.0.2.0/24 has rt_trap
	# 198.51.100.0/24 has rt_offload_failed
ip route get 192.0.2.1 fibmatch
	# Result has rt_trap
ip route get 198.51.100.1 fibmatch
	# Result differs from the route shown by `ip route`, it is missing
	# rt_offload_failed
ip link del dev dummy1
echo 1 > /sys/bus/netdevsim/del_device
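
A sketch of the kind of omission being fixed (field and variable names
are assumed from the description, not quoted from the patch): when the
fibmatch reply is built, offload_failed must be copied from the matching
fib_alias alongside the offload and trap flags.

/* fri describes the route in the netlink reply, fa is the matched alias */
fri.offload = fa->offload;
fri.trap = fa->trap;
fri.offload_failed = fa->offload_failed;	/* previously missing */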

Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 11:39:36 -07:00
Jakub Kicinski
72897b2959 Quite a collection of fixes this time, really too many
to list individually. Many stack fixes, even rfkill
 (found by simulation and the new eevdf scheduler)!
 
 Also a bigger maintainers file cleanup, to remove old
 and redundant information.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmUT/FoACgkQ10qiO8sP
 aAAeqg/+Iaim4AFPPDeWUvSARyqcMIKqmDtaFqkXE+OY8oahbvvbnGuLM/v/7V3r
 NG5pzoi0gqFasF+DC1lNWFUnmtWdjzjhhY9FInoWgiNO04V48c3NPI/YF/2Yvy0x
 biJxSsiaY+buY/p2QOXvlXHetnaftXNikFPZaD1mVGG/GIGZAwqqUO/EkIdliZf5
 q/RBt5jzMF/nXTRIGc53kq7tKT97gGnDDYMych0U130PlyyqAZ+iAsR9UbiPa+ww
 Z1JB8qZO72Hx9iN0WMjXNBX8vxC961Dj+fkttgJqZALTk0UqQitwcsIJgM29JjTG
 VHsxHMdZAROFGaEHdMs1eHkqHkqyOxA4Jhr5LAci/chzgKkRw2I2LAXndOeg5TLH
 cWdhNjRW/ZJ64czr5hALQQxK4nj8m7SPFHY/6UuBnNHEXCjr7vUcuhcoK63Kb6Np
 6Sg4jtyhakaXemqjhcmIYNF1dG5CDbUmQXFkj9Z9EEyHjAzGJ7ASmdhHwlBQnJuH
 39ESEky2zQINAJbisaw9R2zj+V9Ia/mFSbi2q30kX5J4xTHGIURNo9OPkLAQWDdw
 6u/d5VZLigliIK+Qj8kVtn41wmUEwB3W5Aq4CI7xB91vKRHCUZQvZ5xBJfHtk2pD
 7VIzZscReWCsQP9T0hv38jeaw4m/P+mZO1iOC2qxveJvrQogZMk=
 =OL/S
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================

Quite a collection of fixes this time, really too many
to list individually. Many stack fixes, even rfkill
(found by simulation and the new eevdf scheduler)!

Also a bigger maintainers file cleanup, to remove old
and redundant information.

* tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits)
  wifi: iwlwifi: mvm: Fix incorrect usage of scan API
  wifi: mac80211: Create resources for disabled links
  wifi: cfg80211: avoid leaking stack data into trace
  wifi: mac80211: allow transmitting EAPOL frames with tainted key
  wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
  wifi: cfg80211: Fix 6GHz scan configuration
  wifi: mac80211: fix potential key leak
  wifi: mac80211: fix potential key use-after-free
  wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling
  wifi: brcmfmac: Replace 1-element arrays with flexible arrays
  wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
  wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM
  rfkill: sync before userspace visibility/changes
  wifi: mac80211: fix mesh id corruption on 32 bit systems
  wifi: cfg80211: add missing kernel-doc for cqm_rssi_work
  wifi: cfg80211: fix cqm_config access race
  wifi: iwlwifi: mvm: Fix a memory corruption issue
  wifi: iwlwifi: Ensure ack flag is properly cleared.
  wifi: iwlwifi: dbg_ini: fix structure packing
  iwlwifi: mvm: handle PS changes in vif_cfg_changed
  ...
====================

Link: https://lore.kernel.org/r/20230927095835.25803-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 11:30:22 -07:00
Jakub Kicinski
1eb3dee16a bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZRqk1wAKCRDbK58LschI
 g8GRAQC4E0bw6BTFRl0b3MxvpZES6lU0BUtX2gKVK4tLZdXw/wEAmTlBXQqNzF3b
 BkCQknVbFTSw/8l8pzUW123Fb46wUAQ=
 =E3hd
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-10-02

We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).

The main changes are:

1) Fix BPF verifier to reset backtrack_state masks on global function
   exit as otherwise subsequent precision tracking would reuse them,
   from Andrii Nakryiko.

2) Several sockmap fixes for available bytes accounting,
   from John Fastabend.

3) Reject sk_msg egress redirects to non-TCP sockets given this
   is only supported for TCP sockets today, from Jakub Sitnicki.

4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
   limits with BPF_F_BEFORE directive, from Daniel Borkmann
   and Nikolay Aleksandrov.

5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
   size_index for selecting a bpf_mem_cache, from Hou Tao.

6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
   from Song Liu.

7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
   from Leon Hwang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use kmalloc_size_roundup() to adjust size_index
  selftest/bpf: Add various selftests for program limits
  bpf, mprog: Fix maximum program check on mprog attachment
  bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
  bpf, sockmap: Add tests for MSG_F_PEEK
  bpf, sockmap: Do not inc copied_seq when PEEK flag set
  bpf: tcp_read_skb needs to pop skb regardless of seq
  bpf: unconditionally reset backtrack_state masks on global func exit
  bpf: Fix tr dereferencing
  selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
  s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================

Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04 08:28:07 -07:00
Florian Westphal
087388278e netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
nft_rbtree_gc_elem() walks back and removes the end interval element that
comes before the expired element.

There is a small chance that we've cached this element as 'rbe_ge'.
If this happens, we hold and test a pointer that has been queued for
freeing.

It also causes spurious insertion failures:

$ cat test-testcases-sets-0044interval_overlap_0.1/testout.log
Error: Could not process rule: File exists
add element t s {  0 -  2 }
                   ^^^^^^
Failed to insert  0 -  2 given:
table ip t {
        set s {
                type inet_service
                flags interval,timeout
                timeout 2s
                gc-interval 2s
        }
}

The set (rbtree) is empty. The 'failure' doesn't happen on next attempt.

Reason is that when we try to insert, the tree may hold an expired
element that collides with the range we're adding.
While we do evict/erase this element, we can trip over this check:

if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
      return -ENOTEMPTY;

rbe_ge was erased by the synchronous gc; we should not have done this
check.  The next attempt won't find it, so the retry results in successful
insertion.

Restart in-kernel to avoid such spurious errors.

Such restarts are rare, unless userspace intentionally adds very large
numbers of elements with very short timeouts while setting a huge
gc interval.

Even in this case, this cannot loop forever, on each retry an existing
element has been removed.

As the caller is holding the transaction mutex, it's impossible
for a second entity to add more expiring elements to the tree.
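
A sketch of the retry loop this implies (helper name and error code
assumed): the caller repeats the insertion while the helper reports that
synchronous gc invalidated its cached elements.

do {
	err = __nft_rbtree_insert(net, set, rbe, ext);
	/* -EAGAIN: gc erased a cached element, walk the tree again */
} while (err == -EAGAIN);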

After this it also becomes feasible to remove the async gc worker
and perform all garbage collection from the commit path.

Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-04 15:57:28 +02:00
Phil Sutter
0d880dc6f0 netfilter: nf_tables: Deduplicate nft_register_obj audit logs
When adding or updating an object, the transaction handler already emits
suitable audit log entries, so the one in nft_obj_notify() is redundant.
To fix that (and retain the audit logging from objects' 'update' callback),
introduce an "audit log free" variant for internal use.
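
Roughly, the split looks like the following sketch (parameter lists
trimmed and bodies elided; these are illustrative, not the exact kernel
signatures): internal transaction paths call the variant that only sends
the netlink event, while other callers keep the audit record.

/* illustrative only: the internal variant skips the audit record */
static void __nft_obj_notify(struct net *net, struct nft_object *obj, int event)
{
	/* build and send the NFT_MSG_NEWOBJ/DELOBJ netlink notification */
}

void nft_obj_notify(struct net *net, struct nft_object *obj, int event)
{
	/* emit one audit record here, then notify as before */
	__nft_obj_notify(net, obj, event);
}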

Fixes: c520292f29b8 ("audit: log nftables configuration change events once per table")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-04 15:57:06 +02:00
Xin Long
8e56b063c8 netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
In Scenario A and B below, as the delayed INIT_ACK always changes the peer
vtag, SCTP ct with the incorrect vtag may cause packet loss.

Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK

  192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
    192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
    192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
  192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
  192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
    192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
    192.168.1.2 > 192.168.1.1: [COOKIE ACK]

Scenario B: INIT_ACK is delayed until the peer completes its own handshake

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
    192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
    192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *

This patch fixes it as below:

In SCTP_CID_INIT processing:
- clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] &&
  ct->proto.sctp.init[!dir]. (Scenario E)
- set ct->proto.sctp.init[dir].

In SCTP_CID_INIT_ACK processing:
- drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
  ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
- drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
  ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)

In SCTP_CID_COOKIE_ACK processing:
- clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir].
  (Scenario D)

Also, it's important to allow the ct state to move forward with cookie_echo
and cookie_ack from the opposite dir for the collision scenarios.

There are also other Scenarios where it should allow the packet through,
addressed by the processing above:

Scenario C: new CT is created by INIT_ACK.

Scenario D: start INIT on the existing ESTABLISHED ct.

Scenario E: start INIT after the old collision on the existing ESTABLISHED
ct.

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
  (both side are stopped, then start new connection again in hours)
  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]
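
A sketch of the INIT_ACK rules above as a helper (the helper name is
ours and the return convention is assumed; the ct field names are the
ones used in the description):

static bool sctp_init_ack_is_stale(const struct nf_conn *ct, int dir,
				   const struct sctp_inithdr *ih)
{
	/* Scenario B/C: no INIT seen from the other side, but its vtag is
	 * already set and disagrees with this INIT_ACK's init tag.
	 */
	if (!ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
	    ct->proto.sctp.vtag[!dir] != ih->init_tag)
		return true;

	/* Scenario A: both sides sent INIT and the tag has since changed. */
	if (ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
	    ct->proto.sctp.vtag[!dir] != ih->init_tag)
		return true;

	return false;
}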

Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-04 14:12:01 +02:00
Florian Westphal
af84f9e447 netfilter: nft_payload: rebuild vlan header on h_proto access
nft can perform merging of adjacent payload requests.
This means that:

ether saddr 00:11 ... ether type 8021ad ...

is a single payload expression, for 8 bytes, starting at the
ethernet source offset.

Check that offset+length is fully within the source/destination mac
addresses.

This bug prevents 'ether type' from matching the correct h_proto in case
the vlan tag got stripped.
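
A minimal sketch of the boundary test (the helper name is ours): only
when the merged load stays entirely inside the 12 address bytes can the
stripped vlan header be ignored; anything touching h_proto needs the
vlan header rebuilt.

/* dst + src MAC = 2 * ETH_ALEN = 12 bytes; h_proto starts right after */
static bool load_within_mac_addresses(u32 offset, u32 len)
{
	return offset + len <= 2 * ETH_ALEN;
}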

Fixes: de6843be3082 ("netfilter: nft_payload: rebuild vlan header when needed")
Reported-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-04 14:12:01 +02:00
Jeremy Cline
dfc7f7a988 net: nfc: llcp: Add lock when modifying device list
The device list needs its associated lock held when modifying it, or the
list could become corrupted, as syzbot discovered.
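
A sketch of the fixed registration path (the lock name is assumed; the
list and the local->list member follow the description): take the
devices lock around the insertion so readers never see a half-linked
entry.

spin_lock(&llcp_devices_lock);
list_add(&local->list, &llcp_devices);
spin_unlock(&llcp_devices_lock);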

Reported-and-tested-by: syzbot+c1d0a03d305972dbbe14@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c1d0a03d305972dbbe14
Signed-off-by: Jeremy Cline <jeremy@jcline.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 6709d4b7bc2e ("net: nfc: Fix use-after-free caused by nfc_llcp_find_local")
Link: https://lore.kernel.org/r/20230908235853.1319596-1-jeremy@jcline.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-03 07:39:31 -07:00