67162 Commits

Author SHA1 Message Date
Eric Dumazet
49ecc2e9c3 net: align static siphash keys
siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-16 19:07:54 -08:00
Eric Dumazet
b3cb764aa1 net: drop nopreempt requirement on sock_prot_inuse_add()
This is distracting really, let's make this simpler,
because many callers had to take care of this
by themselves, even if on x86 this adds more
code than really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
4199bae10c net: merge net->core.prot_inuse and net->core.sock_inuse
net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.

per cpu allocator tend to place them in very different places.

Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
d477eb9004 net: make sock_inuse_add() available
MPTCP hard codes it, let us instead provide this helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
2a12ae5d43 net: inline sock_prot_inuse_add()
sock_prot_inuse_add() is very small, we can inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
587652bbdd net: gro: populate net/core/gro.c
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.

gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
e456a18a39 net: gro: move skb_gro_receive into net/core/gro.c
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
0b935d7f8c net: gro: move skb_gro_receive_list to udp_offload.c
This helper is used once, no need to keep it in fat net/core/skbuff.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
4721031c35 net: move gro definitions to include/net/gro.h
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
29fbc26e6d tcp: do not call tcp_cleanup_rbuf() if we have a backlog
Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.

Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decrease the number of ACK packets,
with no impact on effective ACK clocking.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:35 +00:00
Eric Dumazet
8bd172b787 tcp: check local var (timeo) before socket fields in one test
Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.

Modern applications use non-blocking IO, while a socket is terminated
only once during its life time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:35 +00:00
Eric Dumazet
f35f821935 tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.

Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.

This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.

Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:35 +00:00
Eric Dumazet
3df684c1a3 tcp: avoid indirect calls to sock_rfree
TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call so sock_rfee(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.

Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:35 +00:00
Eric Dumazet
b96c51bd3b tcp: tp->urg_data is unlikely to be set
Use some unlikely() hints in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
7b6a893a59 tcp: annotate races around tp->urg_data
tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
owned.

Also, it is faster to first check tp->urg_data in tcp_poll(),
then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
located in a different/cold cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
0307a0b74b tcp: annotate data-races on tp->segs_in and tp->data_segs_in
tcp_segs_in() can be called from BH, while socket spinlock
is held but socket owned by user, eventually reading these
fields from tcp_get_info()

Found by code inspection, no need to backport this patch
to older kernels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
d2489c7b6d tcp: add RETPOLINE mitigation to sk_backlog_rcv
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
93afcfd1db tcp: small optimization in tcp recvmsg()
When reading large chunks of data, incoming packets might
be added to the backlog from BH.

tcp recvmsg() detects the backlog queue is not empty, and uses
a release_sock()/lock_sock() pair to process this backlog.

We now have __sk_flush_backlog() to perform this
a bit faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
91b6d32563 net: cache align tcp_memory_allocated, tcp_sockets_allocated
tcp_memory_allocated and tcp_sockets_allocated often share
a common cache line, source of false sharing.

Also take care of udp_memory_allocated and mptcp_sockets_allocated.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
aba546565b net: remove sk_route_nocaps
Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
d0d598ca86 net: remove sk_route_forced_caps
We were only using one bit, and we can replace it by sk_is_tcp()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
42f67eea3b net: use sk_is_tcp() in more places
Move sk_is_tcp() to include/net/sock.h and use it where we can.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
3735440200 tcp: small optimization in tcp_v6_send_check()
For TCP flows, inet6_sk(sk)->saddr has the same value
than sk->sk_v6_rcv_saddr.

Using sk->sk_v6_rcv_saddr increases data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
d519f35096 tcp: minor optimization in tcp_add_backlog()
If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
are not used. Defer their access to the point we need them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Jakub Kicinski
a5bdc36354 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-11-15

We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).

The main changes are:

1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
   BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.

2) Big batch of libbpf improvements including various fixes, future proofing APIs,
   and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.

3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
   programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.

4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
   ensure exception handling still works, from Russell King and Alan Maguire.

5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
   file such as shared library, from Song Liu.

6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
   updating documentation and bash-completion among others, from Quentin Monnet.

7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
   the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.

8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
   opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.

9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
  bpftool: Use libbpf_get_error() to check error
  bpftool: Fix mixed indentation in documentation
  bpftool: Update the lists of names for maps and prog-attach types
  bpftool: Fix indent in option lists in the documentation
  bpftool: Remove inclusion of utilities.mak from Makefiles
  bpftool: Fix memory leak in prog_dump()
  selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
  selftests/bpf: Fix an unused-but-set-variable compiler warning
  bpf: Introduce btf_tracing_ids
  bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
  bpftool: Enable libbpf's strict mode by default
  docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
  selftests/bpf: Clarify llvm dependency with btf_tag selftest
  selftests/bpf: Add a C test for btf_type_tag
  selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
  selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
  selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
  selftests/bpf: Test libbpf API function btf__add_type_tag()
  bpftool: Support BTF_KIND_TYPE_TAG
  libbpf: Support BTF_KIND_TYPE_TAG
  ...
====================

Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-15 08:49:23 -08:00
Christophe JAILLET
cc0be1ad68 net: bridge: Slightly optimize 'find_portno()'
The 'inuse' bitmap is local to this function. So we can use the
non-atomic '__set_bit()' to save a few cycles.

While at it, also remove some useless {}.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-15 14:07:06 +00:00
Harshit Mogalapalli
cb3ef7b000 net: sched: sch_netem: Refactor code in 4-state loss generator
Fixed comments to match description with variable names and
refactored code to match the convention as per [1].

To match the convention mapping is done as follows:
State 3 - LOST_IN_BURST_PERIOD
State 4 - LOST_IN_GAP_PERIOD

[1] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general
and intuitive loss model for packet networks and its implementation
in the Netem module in the Linux kernel"

Fixes: a6e2fe17eba4 ("sch_netem: replace magic numbers with enumerate")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-15 13:23:23 +00:00
luo penghao
1274a4eb31 ipv6: Remove duplicate statements
This statement is repeated with the initialization statement

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-14 12:20:44 +00:00
luo penghao
0de3521500 ipv4: Remove duplicate assignments
there is a same action when the variable is initialized

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-14 12:20:44 +00:00
luo penghao
ef14102914 ipv4: drop unused assignment
The assignment in the if statement will be overwritten by the
following statement

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-14 12:20:44 +00:00
Song Liu
9e2ad638ae bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
syzbot reported the following BUG w/o CONFIG_DEBUG_INFO_BTF

BUG: KASAN: global-out-of-bounds in task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661
Read of size 4 at addr ffffffff90297404 by task swapper/0/1

CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.15.0-syzkaller #0
Hardware name: ... Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0xf/0x309 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661
do_one_initcall+0x103/0x650 init/main.c:1295
do_initcall_level init/main.c:1368 [inline]
do_initcalls init/main.c:1384 [inline]
do_basic_setup init/main.c:1403 [inline]
kernel_init_freeable+0x6b1/0x73a init/main.c:1606
kernel_init+0x1a/0x1d0 init/main.c:1497
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

This is caused by hard-coded name[1] in BTF_ID_LIST_GLOBAL (w/o
CONFIG_DEBUG_INFO_BTF). Fix this by adding a parameter n to
BTF_ID_LIST_GLOBAL. This avoids ifdef CONFIG_DEBUG_INFO_BTF in btf.c and
filter.c.

Fixes: 7c7e3d31e785 ("bpf: Introduce helper bpf_find_vma")
Reported-by: syzbot+e0d81ec552a21d9071aa@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211112150243.1270987-2-songliubraving@fb.com
2021-11-12 10:19:09 -08:00
Linus Torvalds
f54ca91fe6 Networking fixes for 5.16-rc1, including fixes from bpf, can
and netfilter.
 
 Current release - regressions:
 
  - bpf: do not reject when the stack read size is different
    from the tracked scalar size
 
  - net: fix premature exit from NAPI state polling in napi_disable()
 
  - riscv, bpf: fix RV32 broken build, and silence RV64 warning
 
 Current release - new code bugs:
 
  - net: fix possible NULL deref in sock_reserve_memory
 
  - amt: fix error return code in amt_init(); fix stopping the workqueue
 
  - ax88796c: use the correct ioctl callback
 
 Previous releases - always broken:
 
  - bpf: stop caching subprog index in the bpf_pseudo_func insn
 
  - security: fixups for the security hooks in sctp
 
  - nfc: add necessary privilege flags in netlink layer, limit operations
    to admin only
 
  - vsock: prevent unnecessary refcnt inc for non-blocking connect
 
  - net/smc: fix sk_refcnt underflow on link down and fallback
 
  - nfnetlink_queue: fix OOB when mac header was cleared
 
  - can: j1939: ignore invalid messages per standard
 
  - bpf, sockmap:
    - fix race in ingress receive verdict with redirect to self
    - fix incorrect sk_skb data_end access when src_reg = dst_reg
    - strparser, and tls are reusing qdisc_skb_cb and colliding
 
  - ethtool: fix ethtool msg len calculation for pause stats
 
  - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
    to access an unregistering real_dev
 
  - udp6: make encap_rcv() bump the v6 not v4 stats
 
  - drv: prestera: add explicit padding to fix m68k build
 
  - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
 
  - drv: mvpp2: fix wrong SerDes reconfiguration order
 
 Misc & small latecomers:
 
  - ipvs: auto-load ipvs on genl access
 
  - mctp: sanity check the struct sockaddr_mctp padding fields
 
  - libfs: support RENAME_EXCHANGE in simple_rename()
 
  - avoid double accounting for pure zerocopy skbs
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
 IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
 fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
 5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
 xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
 cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
 AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
 DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
 JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
 mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
 dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
 KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
 =Ttgq
 -----END PGP SIGNATURE-----

Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and netfilter.

  Current release - regressions:

   - bpf: do not reject when the stack read size is different from the
     tracked scalar size

   - net: fix premature exit from NAPI state polling in napi_disable()

   - riscv, bpf: fix RV32 broken build, and silence RV64 warning

  Current release - new code bugs:

   - net: fix possible NULL deref in sock_reserve_memory

   - amt: fix error return code in amt_init(); fix stopping the
     workqueue

   - ax88796c: use the correct ioctl callback

  Previous releases - always broken:

   - bpf: stop caching subprog index in the bpf_pseudo_func insn

   - security: fixups for the security hooks in sctp

   - nfc: add necessary privilege flags in netlink layer, limit
     operations to admin only

   - vsock: prevent unnecessary refcnt inc for non-blocking connect

   - net/smc: fix sk_refcnt underflow on link down and fallback

   - nfnetlink_queue: fix OOB when mac header was cleared

   - can: j1939: ignore invalid messages per standard

   - bpf, sockmap:
      - fix race in ingress receive verdict with redirect to self
      - fix incorrect sk_skb data_end access when src_reg = dst_reg
      - strparser, and tls are reusing qdisc_skb_cb and colliding

   - ethtool: fix ethtool msg len calculation for pause stats

   - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
     access an unregistering real_dev

   - udp6: make encap_rcv() bump the v6 not v4 stats

   - drv: prestera: add explicit padding to fix m68k build

   - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge

   - drv: mvpp2: fix wrong SerDes reconfiguration order

  Misc & small latecomers:

   - ipvs: auto-load ipvs on genl access

   - mctp: sanity check the struct sockaddr_mctp padding fields

   - libfs: support RENAME_EXCHANGE in simple_rename()

   - avoid double accounting for pure zerocopy skbs"

* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
  selftests/net: udpgso_bench_rx: fix port argument
  net: wwan: iosm: fix compilation warning
  cxgb4: fix eeprom len when diagnostics not implemented
  net: fix premature exit from NAPI state polling in napi_disable()
  net/smc: fix sk_refcnt underflow on linkdown and fallback
  net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
  gve: fix unmatched u64_stats_update_end()
  net: ethernet: lantiq_etop: Fix compilation error
  selftests: forwarding: Fix packet matching in mirroring selftests
  vsock: prevent unnecessary refcnt inc for nonblocking connect
  net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
  net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
  net: stmmac: allow a tc-taprio base-time of zero
  selftests: net: test_vxlan_under_vrf: fix HV connectivity test
  net: hns3: allow configure ETS bandwidth of all TCs
  net: hns3: remove check VF uc mac exist when set by PF
  net: hns3: fix some mac statistics is always 0 in device version V2
  net: hns3: fix kernel crash when unload VF while it is being reset
  net: hns3: sync rx ring head in echo common pull
  net: hns3: fix pfc packet number incorrect after querying pfc parameters
  ...
2021-11-11 09:49:36 -08:00
Alexander Lobakin
0315a075f1 net: fix premature exit from NAPI state polling in napi_disable()
Commit 719c57197010 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.

Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.

Error path looks like:

napi_disable():
unsigned long val, new; /* new is uninitialized */

do {
	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
				      NAPIF_STATE_SCHED is set */
	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
		usleep_range(20, 200);
		continue; /* go straight to the condition check */
	}
	new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
						  writes garbage */

napi_enable():
do {
	val = READ_ONCE(n->state);
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>

while the typical BUG splat is like:

[  172.652461] ------------[ cut here ]------------
[  172.652462] kernel BUG at net/core/dev.c:6937!
[  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
[  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[  172.782403] Call Trace:
[  172.784857]  <TASK>
[  172.786963]  ice_up_complete+0x6f/0x210 [ice]
[  172.791349]  ice_xdp+0x136/0x320 [ice]
[  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
[  172.799648]  dev_xdp_install+0x61/0xe0
[  172.803401]  dev_xdp_attach+0x1e0/0x550
[  172.807240]  dev_change_xdp_fd+0x1e6/0x220
[  172.811338]  do_setlink+0xee8/0x1010
[  172.814917]  rtnl_setlink+0xe5/0x170
[  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[  172.823732]  ? security_capable+0x36/0x50
< snip >

Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.

From v1 [0]:
 - just use a for-loop to simplify both the fix and the existing
   code (Eric).

[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com

Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-10 17:45:15 -08:00
Linus Torvalds
38764c7340 A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
 xdr-related cleanup from Chuck.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
 XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
 x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
 lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
 Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
 29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
 MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
 K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
 pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
 Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
 0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
 6aiufVWKNNRQef9y
 =yKBl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
  support for a filehandle format deprecated 20 years ago, and further
  xdr-related cleanup from Chuck"

* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd4: remove obselete comment
  nfsd: document server-to-server-copy parameters
  NFSD:fix boolreturn.cocci warning
  nfsd: update create verifier comment
  SUNRPC: Change return value type of .pc_encode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
  NFSD: Save location of NFSv4 COMPOUND status
  SUNRPC: Change return value type of .pc_decode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
  SUNRPC: De-duplicate .pc_release() call sites
  SUNRPC: Simplify the SVC dispatch code path
  SUNRPC: Capture value of xdr_buf::page_base
  SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
  svcrdma: Split svcrmda_wc_{read,write} tracepoints
  svcrdma: Split the svcrdma_wc_send() tracepoint
  svcrdma: Split the svcrdma_wc_receive() tracepoint
  NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
  SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
  NFSD: Initialize pointer ni with NULL and not plain integer 0
  NFSD: simplify struct nfsfh
  ...
2021-11-10 16:45:54 -08:00
Linus Torvalds
2ec20f4895 NFS client updates for Linux 5.16
Highlights include:
 
 Features:
 - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
 - Optimisations for READDIR and the 'ls -l' style workload
 - Further replacements of dprintk() with tracepoints and other tracing
   improvements
 - Ensure we re-probe NFSv4 server capabilities when the user does a
   "mount -o remount"
 
 Bugfixes:
 - Fix an Oops in pnfs_mark_request_commit()
 - Fix up deadlocks in the commit code
 - Fix regressions in NFSv2/v3 attribute revalidation due to the
   change_attr_type optimisations
 - Fix some dentry verifier races
 - Fix some missing dentry verifier settings
 - Fix a performance regression in nfs_set_open_stateid_locked()
 - SUNRPC was sending multiple SYN calls when re-establishing a TCP
   connection.
 - Fix multiple NFSv4 issues due to missing sanity checking of server
   return values
 - Fix a potential Oops when FREE_STATEID races with an unmount
 
 Cleanups:
 - Clean up the labelled NFS code
 - Remove unused header <linux/pnfs_osd_xdr.h>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
 APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
 SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
 IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
 onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
 XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
 UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
 wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
 y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
 bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
 YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
 RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
 =n8lj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
   - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
   - Optimisations for READDIR and the 'ls -l' style workload
   - Further replacements of dprintk() with tracepoints and other
     tracing improvements
   - Ensure we re-probe NFSv4 server capabilities when the user does a
     "mount -o remount"

  Bugfixes:
   - Fix an Oops in pnfs_mark_request_commit()
   - Fix up deadlocks in the commit code
   - Fix regressions in NFSv2/v3 attribute revalidation due to the
     change_attr_type optimisations
   - Fix some dentry verifier races
   - Fix some missing dentry verifier settings
   - Fix a performance regression in nfs_set_open_stateid_locked()
   - SUNRPC was sending multiple SYN calls when re-establishing a TCP
     connection.
   - Fix multiple NFSv4 issues due to missing sanity checking of server
     return values
   - Fix a potential Oops when FREE_STATEID races with an unmount

  Cleanups:
   - Clean up the labelled NFS code
   - Remove unused header <linux/pnfs_osd_xdr.h>"

* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
  NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
  NFS: Remove the nfs4_label argument from decode_getattr_*() functions
  NFS: Remove the nfs4_label argument from nfs_setsecurity
  NFS: Remove the nfs4_label argument from nfs_fhget()
  NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
  NFS: Remove the nfs4_label argument from nfs_instantiate()
  NFS: Remove the nfs4_label from the nfs_setattrres
  NFS: Remove the nfs4_label from the nfs4_getattr_res
  NFS: Remove the f_label from the nfs4_opendata and nfs_openres
  NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
  NFS: Remove the label from the nfs4_lookup_res struct
  NFS: Remove the nfs4_label from the nfs4_link_res struct
  NFS: Remove the nfs4_label from the nfs4_create_res struct
  NFS: Remove the nfs4_label from the nfs_entry struct
  NFS: Create a new nfs_alloc_fattr_with_label() function
  NFS: Always initialise fattr->label in nfs_fattr_alloc()
  NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
  NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
  NFSv4: Remove unnecessary 'minor version' check
  NFSv4: Fix potential Oops in decode_op_map()
  ...
2021-11-10 16:32:46 -08:00
Mark Pashmfouroush
f89315650b bpf: Add ingress_ifindex to bpf_sk_lookup
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.

Add the ifindex of the arriving connection to the bpf_sk_lookup API.

Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
2021-11-10 16:29:58 -08:00
Linus Torvalds
5147da902e Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exit cleanups from Eric Biederman:
 "While looking at some issues related to the exit path in the kernel I
  found several instances where the code is not using the existing
  abstractions properly.

  This set of changes introduces force_fatal_sig a way of sending a
  signal and not allowing it to be caught, and corrects the misuse of
  the existing abstractions that I found.

  A lot of the misuse of the existing abstractions are silly things such
  as doing something after calling a no return function, rolling BUG by
  hand, doing more work than necessary to terminate a kernel thread, or
  calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).

  In the review a deficiency in force_fatal_sig and force_sig_seccomp
  where ptrace or sigaction could prevent the delivery of the signal was
  found. I have added a change that adds SA_IMMUTABLE to change that
  makes it impossible to interrupt the delivery of those signals, and
  allows backporting to fix force_sig_seccomp

  And Arnd found an issue where a function passed to kthread_run had the
  wrong prototype, and after my cleanup was failing to build."

* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
  soc: ti: fix wkup_m3_rproc_boot_thread return type
  signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
  signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
  exit/r8188eu: Replace the macro thread_exit with a simple return 0
  exit/rtl8712: Replace the macro thread_exit with a simple return 0
  exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
  signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
  signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
  signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
  exit/syscall_user_dispatch: Send ordinary signals on failure
  signal: Implement force_fatal_sig
  exit/kthread: Have kernel threads return instead of calling do_exit
  signal/s390: Use force_sigsegv in default_trap_handler
  signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
  signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
  signal/sparc: In setup_tsb_params convert open coded BUG into BUG
  signal/powerpc: On swapcontext failure force SIGSEGV
  signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
  signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
  signal/sparc32: Remove unreachable do_exit in do_sparc_fault
  ...
2021-11-10 16:15:54 -08:00
Dust Li
e5d5aadcf3 net/smc: fix sk_refcnt underflow on linkdown and fallback
We got the following WARNING when running ab/nginx
test with RDMA link flapping (up-down-up).
The reason is when smc_sock fallback and at linkdown
happens simultaneously, we may got the following situation:

__smc_lgr_terminate()
 --> smc_conn_kill()
    --> smc_close_active_abort()
           smc_sock->sk_state = SMC_CLOSED
           sock_put(smc_sock)

smc_sock was set to SMC_CLOSED and sock_put() been called
when terminate the link group. But later application call
close() on the socket, then we got:

__smc_release():
    if (smc_sock->fallback)
        smc_sock->sk_state = SMC_CLOSED
        sock_put(smc_sock)

Again we set the smc_sock to CLOSED through it's already
in CLOSED state, and double put the refcnt, so the following
warning happens:

refcount_t: underflow; use-after-free.
WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0
Modules linked in:
CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
RIP: 0010:refcount_warn_saturate+0x8d/0xf0
Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48

RSP: 0018:ffffc90000527e50 EFLAGS: 00010286
RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027
RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001
R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340
R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8
FS:  00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 smc_release+0x353/0x3f0
 __sock_release+0x3d/0xb0
 sock_close+0x11/0x20
 __fput+0x93/0x230
 task_work_run+0x65/0xa0
 exit_to_user_mode_prepare+0xf9/0x100
 syscall_exit_to_user_mode+0x27/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds check in __smc_release() to make
sure we won't do an extra sock_put() and set the
socket to CLOSED when its already in CLOSED state.

Fixes: 51f1de79ad8e (net/smc: replace sock_put worker by socket refcounting)
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-10 14:43:59 +00:00
Eiichi Tsukata
c7cd82b905 vsock: prevent unnecessary refcnt inc for nonblocking connect
Currently vosck_connect() increments sock refcount for nonblocking
socket each time it's called, which can lead to memory leak if
it's called multiple times because connect timeout function decrements
sock refcount only once.

Fixes it by making vsock_connect() return -EALREADY immediately when
sock state is already SS_CONNECTING.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-10 14:36:11 +00:00
Eric Dumazet
6dc25401cb net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_any
1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls
   ktime_mono_to_any() with out-of-bound value.

2) if q->tk_offset is changed in taprio_parse_clockid(),
   taprio_get_time() might also call ktime_mono_to_any()
   with out-of-bound value as sysbot found:

UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27
index 3 is out of range for type 'ktime_t *[3]'
CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
 ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908
 get_tcp_tstamp net/sched/sch_taprio.c:322 [inline]
 get_packet_txtime net/sched/sch_taprio.c:353 [inline]
 taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420
 taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485
 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785
 __dev_xmit_skb net/core/dev.c:3869 [inline]
 __dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194
 batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108
 batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
 batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
 batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701
 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
 kthread+0x405/0x4f0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Fixes: 7ede7b03484b ("taprio: make clock reference conversions easier")
Fixes: 54002066100b ("taprio: Adjust timestamps for TCP packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vedang Patel <vedang.patel@intel.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-09 19:16:23 -08:00
Jakub Kicinski
fceb07950a Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2021-11-09

We've added 7 non-merge commits during the last 3 day(s) which contain
a total of 10 files changed, 174 insertions(+), 48 deletions(-).

The main changes are:

1) Various sockmap fixes, from John and Jussi.

2) Fix out-of-bound issue with bpf_pseudo_func, from Martin.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
  bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
  bpf, sockmap: Fix race in ingress receive verdict with redirect to self
  bpf, sockmap: Remove unhash handler for BPF sockmap usage
  bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
  bpf: selftest: Trigger a DCE on the whole subprog
  bpf: Stop caching subprog index in the bpf_pseudo_func insn
====================

Link: https://lore.kernel.org/r/20211109215702.38350-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-09 15:44:48 -08:00
Linus Torvalds
f89ce84bc3 9p-for-5.16-rc1: fixes, netfs read support and checkpatch rewrite
- fix syzcaller uninitialized value usage after missing error check
 - add module autoloading based on transport name
 - convert cached reads to use netfs helpers
 - adjust readahead based on transport msize
 - and many, many checkpatch.pl warning fixes...
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmGJK4IACgkQq06b7GqY
 5nAKOQ/+NO41If4p93P65j78pF7EIuGolIRnG3OgLV9M92v3sGQVChMoX68GPzrf
 mOWbhAL7lHdvzJBdrYjxa5rGOq9zzjPHJ/toI/93pLOqFHZlw15GV/nq0wLZZoc+
 fvWCc1co6aL3suAeVUXM3bwUAZHRkCssDSKajbMrLFikIkifSoroM4tzAvktEXXJ
 4nM5xtTi6p3r8HLffTHNTqoEmA08VQm41hqRZ5XYLhieFUMH9p8b5cbWhycROn2D
 aNLgXcnXohvhJpbrkJ16fxdWGCUGaxAxK6gso8wMwBns+/7jq+uBeVypdIji00LL
 KDzQGs8VVThXyMwtB40rYzCTfYqXxJB2qoR+dJd6hh48lmvwdjDF/nGe5aFIYjJ/
 251xm2yojs3BQs4/v7lJOA4RT9K0r57ZjzhsnMqSNgnJVa6ufZel76wZPxsjwfR2
 KQeoNRd2ftPaIfduscaHl6Ay9vMj4oGah/T2wuAI6hpV8Y9iB6XTA27dbxTCfq1Y
 AC2yzWvno3jgSYaKVdJ4TfPoZ4HynTQveTuF8zJakmUPJ6k0oJTku0t8ij3+gF+C
 Y0KtOFMFw3FRZe4va4UJRHqHxqOhqT5BAP5BmaDKYv8myLcEqQWJmKshx+wxxsoF
 astAXY79zkS3+n90c5pn0aqdLJk0KWsIN+46gp7c/CYlLH1PCMI=
 =1g7l
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "Fixes, netfs read support and checkpatch rewrite:

   - fix syzcaller uninitialized value usage after missing error check

   - add module autoloading based on transport name

   - convert cached reads to use netfs helpers

   - adjust readahead based on transport msize

   - and many, many checkpatch.pl warning fixes..."

* tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux:
  9p: fix a bunch of checkpatch warnings
  9p: set readahead and io size according to maxsize
  9p p9mode2perm: remove useless strlcpy and check sscanf return code
  9p v9fs_parse_options: replace simple_strtoul with kstrtouint
  9p: fix file headers
  fs/9p: fix indentation and Add missing a blank line after declaration
  fs/9p: fix warnings found by checkpatch.pl
  9p: fix minor indentation and codestyle
  fs/9p: cleanup: opening brace at the beginning of the next line
  9p: Convert to using the netfs helper lib to do reads and caching
  fscache_cookie_enabled: check cookie is valid before accessing it
  net/9p: autoload transport modules
  9p/net: fix missing error check in p9_check_errors
2021-11-09 10:30:13 -08:00
Linus Torvalds
59a2ceeef6 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "87 patches.

  Subsystems affected by this patch series: mm (pagecache and hugetlb),
  procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
  init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
  sysvfs, kcov, gdb, resource, selftests, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
  ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
  ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
  selftests/kselftest/runner/run_one(): allow running non-executable files
  virtio-mem: disallow mapping virtio-mem memory via /dev/mem
  kernel/resource: disallow access to exclusive system RAM regions
  kernel/resource: clean up and optimize iomem_is_exclusive()
  scripts/gdb: handle split debug for vmlinux
  kcov: replace local_irq_save() with a local_lock_t
  kcov: avoid enable+disable interrupts if !in_task()
  kcov: allocate per-CPU memory on the relevant node
  Documentation/kcov: define `ip' in the example
  Documentation/kcov: include types.h in the example
  sysv: use BUILD_BUG_ON instead of runtime check
  kernel/fork.c: unshare(): use swap() to make code cleaner
  seq_file: fix passing wrong private data
  seq_file: move seq_escape() to a header
  signal: remove duplicate include in signal.h
  crash_dump: remove duplicate include in crash_dump.h
  crash_dump: fix boolreturn.cocci warning
  hfs/hfsplus: use WARN_ON for sanity check
  ...
2021-11-09 10:11:53 -08:00
Kefeng Wang
a20deb3a34 sections: move and rename core_kernel_data() to is_kernel_core_data()
Move core_kernel_data() into sections.h and rename it to
is_kernel_core_data(), also make it return bool value, then update all the
callers.

Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:02:50 -08:00
Jussi Maki
b2c4618162 bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
The current conversion of skb->data_end reads like this:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r2 +116)
   563: (1f) r1 -= r11

But similar to the case in 84f44df664e9 ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect skb->len
when src == dst. In this case we end up generating this xlated code:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r1 +116)
   563: (1f) r1 -= r11

... where line 560 is the reading 4B of (skb->data + 112) instead of the
intended skb->len Here the skb pointer in r1 gets set to skb->data and the
later deref for skb->len ends up following skb->data instead of skb.

This fixes the issue similarly to the patch mentioned above by creating an
additional temporary variable and using to store the register when dst_reg =
src_reg. We name the variable bpf_temp_reg and place it in the cb context for
sk_skb. Then we restore from the temp to ensure nothing is lost.

Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
2021-11-09 01:05:34 +01:00
John Fastabend
e0dc3b93bd bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.

But, more importantly strparser is using the following to access its metadata.

  (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

Where _strp_msg is defined as:

  struct _strp_msg {
        struct strp_msg            strp;                 /*     0     8 */
        int                        accum_len;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 2 */
        /* last cacheline: 12 bytes */
  };

So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.

In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
2021-11-09 01:05:28 +01:00
John Fastabend
c5d2177a72 bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:

 None, Tx, Rx, and TxRx

To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.

In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.

However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.

       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue

This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.

To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.

With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
2021-11-09 00:58:26 +01:00
John Fastabend
b8b8315e39 bpf, sockmap: Remove unhash handler for BPF sockmap usage
We do not need to handle unhash from BPF side we can simply wait for the
close to happen. The original concern was a socket could transition from
ESTABLISHED state to a new state while the BPF hook was still attached.
But, we convinced ourself this is no longer possible and we also improved
BPF sockmap to handle listen sockets so this is no longer a problem.

More importantly though there are cases where unhash is called when data is
in the receive queue. The BPF unhash logic will flush this data which is
wrong. To be correct it should keep the data in the receive queue and allow
a receiving application to continue reading the data. This may happen when
tcp_abort() is received for example. Instead of complicating the logic in
unhash simply moving all this to tcp_close() hook solves this.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com
2021-11-09 00:57:19 +01:00
John Fastabend
40a34121ac bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
In order to fix an issue with sockets in TCP sockmap redirect cases we plan
to allow CLOSE state sockets to exist in the sockmap. However, the check in
bpf_sk_lookup_assign() currently only invalidates sockets in the
TCP_ESTABLISHED case relying on the checks on sockmap insert to ensure we
never SOCK_CLOSE state sockets in the map.

To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
explicitly test for the accepted cases. Namely, a tcp socket in TCP_LISTEN
or a udp socket in TCP_CLOSE state. This also makes the code more resilent
to future changes.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com
2021-11-09 00:56:35 +01:00
Alexey Khoroshilov
e7ea51cd87 sctp: remove unreachable code from sctp_sf_violation_chunk()
sctp_sf_violation_chunk() is not called with asoc argument equal to NULL,
but if that happens it would lead to NULL pointer dereference
in sctp_vtag_verify().

The patch removes code that handles NULL asoc in sctp_sf_violation_chunk().

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Proposed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-07 19:32:39 +00:00