7980 Commits

Author SHA1 Message Date
Jakub Kicinski
a387ff8e5d dev_addr_list: put the first addr on the tree
Since all netdev->dev_addr modifications go via dev_addr_mod()
we can put it on the list. When address is change remove it
and add it back.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20 12:25:57 +00:00
Jakub Kicinski
d07b26f5bb dev_addr: add a modification check
netdev->dev_addr should only be modified via helpers,
but someone may be casting off the const. Add a runtime
check to catch abuses.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20 12:25:57 +00:00
Jakub Kicinski
5f0b692384 net: unexport dev_addr_init() & dev_addr_flush()
There are no module callers in-tree and it's hard to justify
why anyone would init or flush addresses of a netdev (note
the flush is more of a destructor, it frees netdev->dev_addr).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20 12:25:57 +00:00
Jakub Kicinski
adeef3e321 net: constify netdev->dev_addr
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. We converted all users to make modifications via appropriate
helpers, make netdev->dev_addr const.

The update helpers need to upcast from the buffer to
struct netdev_hw_addr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20 12:25:57 +00:00
John Fastabend
c0d95d3380 bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
When a sock is added to a sock map we evaluate what proto op hooks need to
be used. However, when the program is removed from the sock map we have not
been evaluating if that changes the required program layout.

Before the patch listed in the 'fixes' tag this was not causing failures
because the base program set handles all cases. Specifically, the case with
a stream parser and the case with out a stream parser are both handled. With
the fix below we identified a race when running with a proto op that attempts
to read skbs off both the stream parser and the skb->receive_queue. Namely,
that a race existed where when the stream parser is empty checking the
skb->receive_queue from recvmsg at the precies moment when the parser is
paused and the receive_queue is not empty could result in skipping the stream
parser. This may break a RX policy depending on the parser to run.

The fix tag then loads a specific proto ops that resolved this race. But, we
missed removing that proto ops recv hook when the sock is removed from the
sockmap. The result is the stream parser is stopped so no more skbs will be
aggregated there, but the hook and BPF program continues to be attached on
the psock. User space will then get an EBUSY when trying to read the socket
because the recvmsg() handler is now waiting on a stopped stream parser.

To fix we rerun the proto ops init() function which will look at the new set
of progs attached to the psock and rest the proto ops hook to the correct
handlers. And in the above case where we remove the sock from the sock map
the RX prog will no longer be listed so the proto ops is removed.

Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
2021-11-20 00:56:24 +01:00
John Fastabend
38207a5e81 bpf, sockmap: Attach map progs to psock early for feature probes
When a TCP socket is added to a sock map we look at the programs attached
to the map to determine what proto op hooks need to be changed. Before
the patch in the 'fixes' tag there were only two categories -- the empty
set of programs or a TX policy. In any case the base set handled the
receive case.

After the fix we have an optimized program for receive that closes a small,
but possible, race on receive. This program is loaded only when the map the
psock is being added to includes a RX policy. Otherwise, the race is not
possible so we don't need to handle the race condition.

In order for the call to sk_psock_init() to correctly evaluate the above
conditions all progs need to be set in the psock before the call. However,
in the current code this is not the case. We end up evaluating the
requirements on the old prog state. If your psock is attached to multiple
maps -- for example a tx map and rx map -- then the second update would pull
in the correct maps. But, the other pattern with a single rx enabled map
the correct receive hooks are not used. The result is the race fixed by the
patch in the fixes tag below may still be seen in this case.

To fix we simply set all psock->progs before doing the call into
sock_map_init(). With this the init() call gets the full list of programs
and chooses the correct proto ops on the first iteration instead of
requiring the second update to pull them in. This fixes the race case when
only a single map is used.

Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
2021-11-20 00:56:16 +01:00
Jakub Kicinski
50fc24944a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-18 13:13:16 -08:00
Leon Romanovsky
fec1faf221 devlink: Don't throw an error if flash notification sent before devlink visible
The mlxsw driver calls to various devlink flash routines even before
users can get any access to the devlink instance itself. For example,
mlxsw_core_fw_rev_validate() one of such functions.

__mlxsw_core_bus_device_register
 -> mlxsw_core_fw_rev_validate
  -> mlxsw_core_fw_flash
   -> mlxfw_firmware_flash
    -> mlxfw_status_notify
     -> devlink_flash_update_status_notify
      -> __devlink_flash_update_notify
       -> WARN_ON(...)

It causes to the WARN_ON to trigger warning about devlink not registered.

Fixes: cf530217408e ("devlink: Notify users when objects are accessible")
Reported-by: Danielle Ratson <danieller@nvidia.com>
Tested-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-18 11:34:17 +00:00
Yunsheng Lin
f915b75bff page_pool: Revert "page_pool: disable dma mapping support..."
This reverts commit d00e60ee54b12de945b8493cf18c1ada9e422514.

As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
in 32-bit systems, which breaks the bootup proceess when a
ethernet driver is using page pool with PP_FLAG_DMA_MAP flag.
As we were hoping we had no active consumers for such system
when we removed the dma mapping support, and LPAE seems like
a common feature for 32 bits system, so revert it.

1. https://www.spinics.net/lists/netdev/msg779890.html

Fixes: d00e60ee54b1 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Tested-by: "kernelci.org bot" <bot@kernelci.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-18 11:29:40 +00:00
Eric Dumazet
8160fb43d5 net: use an atomic_long_t for queue->trans_timeout
tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.

As we want to no longer disrupt transmits, we use an
atomic_long_t instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-17 14:56:16 +00:00
Eric Dumazet
49ecc2e9c3 net: align static siphash keys
siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-16 19:07:54 -08:00
Jakub Kicinski
f083ec3160 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-11-16

We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).

The main changes are:

1) Fix pruning regression where verifier went overly conservative rejecting
   previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.

2) Fix verifier TOCTOU bug when using read-only map's values as constant
   scalars during verification, from Daniel Borkmann.

3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.

4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.

5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
   programs due to deadlock possibilities, from Dmitrii Banshchikov.

6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.

7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.

8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  udp: Validate checksum in udp_read_sock()
  bpf: Fix toctou on read-only map's constant scalar tracking
  samples/bpf: Fix build error due to -isystem removal
  selftests/bpf: Add tests for restricted helpers
  bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
  libbpf: Perform map fd cleanup for gen_loader in case of error
  samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
  tools/runqslower: Fix cross-build
  samples/bpf: Fix summary per-sec stats in xdp_sample_user
  selftests/bpf: Check map in map pruning
  bpf: Fix inner map state pruning regression.
  xsk: Fix crash on double free in buffer pool
====================

Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-16 16:53:48 -08:00
Eric Dumazet
4199bae10c net: merge net->core.prot_inuse and net->core.sock_inuse
net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.

per cpu allocator tend to place them in very different places.

Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
d477eb9004 net: make sock_inuse_add() available
MPTCP hard codes it, let us instead provide this helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
2a12ae5d43 net: inline sock_prot_inuse_add()
sock_prot_inuse_add() is very small, we can inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:20:45 +00:00
Eric Dumazet
587652bbdd net: gro: populate net/core/gro.c
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.

gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
e456a18a39 net: gro: move skb_gro_receive into net/core/gro.c
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
0b935d7f8c net: gro: move skb_gro_receive_list to udp_offload.c
This helper is used once, no need to keep it in fat net/core/skbuff.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
4721031c35 net: move gro definitions to include/net/gro.h
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:16:54 +00:00
Eric Dumazet
d2489c7b6d tcp: add RETPOLINE mitigation to sk_backlog_rcv
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
aba546565b net: remove sk_route_nocaps
Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
d0d598ca86 net: remove sk_route_forced_caps
We were only using one bit, and we can replace it by sk_is_tcp()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet
42f67eea3b net: use sk_is_tcp() in more places
Move sk_is_tcp() to include/net/sock.h and use it where we can.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Dmitrii Banshchikov
5e0bc3082e bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
progs may result in locking issues.

bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
isn't safe for any context:
======================================================
WARNING: possible circular locking dependency detected
5.15.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.4/14877 is trying to acquire lock:
ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255

but task is already holding lock:
ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
       __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
       debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
       debug_init kernel/time/hrtimer.c:468 [inline]
       hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
       ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
       ntp_init+0xa1/0xad kernel/time/ntp.c:1095
       timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
       start_kernel+0x267/0x56e init/main.c:1030
       secondary_startup_64_no_verify+0xb1/0xbb

-> #0 (tk_core.seq.seqcount){----}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3051 [inline]
       check_prevs_add kernel/locking/lockdep.c:3174 [inline]
       validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
       __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
       seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
       ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
       ktime_get_coarse include/linux/timekeeping.h:120 [inline]
       ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
       ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
       bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
       bpf_prog_a99735ebafdda2f1+0x10/0xb50
       bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
       __bpf_prog_run include/linux/filter.h:626 [inline]
       bpf_prog_run include/linux/filter.h:633 [inline]
       BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
       trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
       perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
       perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
       trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
       lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
       __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
       _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
       debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
       debug_deactivate kernel/time/hrtimer.c:481 [inline]
       __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
       __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
       hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
       local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
       __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
       sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
       _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
       try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
       wake_up_process kernel/sched/core.c:4200 [inline]
       wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
       futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
       do_futex+0x367/0x560 kernel/futex/syscalls.c:127
       __do_sys_futex kernel/futex/syscalls.c:199 [inline]
       __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

There is a possible deadlock with bpf_timer_* set of helpers:
hrtimer_start()
  lock_base();
  trace_hrtimer...()
    perf_event()
      bpf_run()
        bpf_timer_start()
          hrtimer_start()
            lock_base()         <- DEADLOCK

Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.

Fixes: d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
2021-11-15 20:35:58 -08:00
Jakub Kicinski
a5bdc36354 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-11-15

We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).

The main changes are:

1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
   BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.

2) Big batch of libbpf improvements including various fixes, future proofing APIs,
   and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.

3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
   programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.

4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
   ensure exception handling still works, from Russell King and Alan Maguire.

5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
   file such as shared library, from Song Liu.

6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
   updating documentation and bash-completion among others, from Quentin Monnet.

7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
   the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.

8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
   opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.

9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
  bpftool: Use libbpf_get_error() to check error
  bpftool: Fix mixed indentation in documentation
  bpftool: Update the lists of names for maps and prog-attach types
  bpftool: Fix indent in option lists in the documentation
  bpftool: Remove inclusion of utilities.mak from Makefiles
  bpftool: Fix memory leak in prog_dump()
  selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
  selftests/bpf: Fix an unused-but-set-variable compiler warning
  bpf: Introduce btf_tracing_ids
  bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
  bpftool: Enable libbpf's strict mode by default
  docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
  selftests/bpf: Clarify llvm dependency with btf_tag selftest
  selftests/bpf: Add a C test for btf_type_tag
  selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
  selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
  selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
  selftests/bpf: Test libbpf API function btf__add_type_tag()
  bpftool: Support BTF_KIND_TYPE_TAG
  libbpf: Support BTF_KIND_TYPE_TAG
  ...
====================

Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-15 08:49:23 -08:00
Tetsuo Handa
938cca9e41 sock: fix /proc/net/sockstat underflow in sk_clone_lock()
sk_clone_lock() needs to call sock_inuse_add(1) before entering the
sk_free_unlock_clone() error path, for __sk_free() from sk_free() from
sk_free_unlock_clone() calls sock_inuse_add(-1).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 648845ab7e200993 ("sock: Move the socket inuse to namespace.")
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-15 14:27:39 +00:00
Song Liu
9e2ad638ae bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
syzbot reported the following BUG w/o CONFIG_DEBUG_INFO_BTF

BUG: KASAN: global-out-of-bounds in task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661
Read of size 4 at addr ffffffff90297404 by task swapper/0/1

CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.15.0-syzkaller #0
Hardware name: ... Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0xf/0x309 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661
do_one_initcall+0x103/0x650 init/main.c:1295
do_initcall_level init/main.c:1368 [inline]
do_initcalls init/main.c:1384 [inline]
do_basic_setup init/main.c:1403 [inline]
kernel_init_freeable+0x6b1/0x73a init/main.c:1606
kernel_init+0x1a/0x1d0 init/main.c:1497
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

This is caused by hard-coded name[1] in BTF_ID_LIST_GLOBAL (w/o
CONFIG_DEBUG_INFO_BTF). Fix this by adding a parameter n to
BTF_ID_LIST_GLOBAL. This avoids ifdef CONFIG_DEBUG_INFO_BTF in btf.c and
filter.c.

Fixes: 7c7e3d31e785 ("bpf: Introduce helper bpf_find_vma")
Reported-by: syzbot+e0d81ec552a21d9071aa@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211112150243.1270987-2-songliubraving@fb.com
2021-11-12 10:19:09 -08:00
Alexander Lobakin
0315a075f1 net: fix premature exit from NAPI state polling in napi_disable()
Commit 719c57197010 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.

Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.

Error path looks like:

napi_disable():
unsigned long val, new; /* new is uninitialized */

do {
	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
				      NAPIF_STATE_SCHED is set */
	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
		usleep_range(20, 200);
		continue; /* go straight to the condition check */
	}
	new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
						  writes garbage */

napi_enable():
do {
	val = READ_ONCE(n->state);
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>

while the typical BUG splat is like:

[  172.652461] ------------[ cut here ]------------
[  172.652462] kernel BUG at net/core/dev.c:6937!
[  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
[  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[  172.782403] Call Trace:
[  172.784857]  <TASK>
[  172.786963]  ice_up_complete+0x6f/0x210 [ice]
[  172.791349]  ice_xdp+0x136/0x320 [ice]
[  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
[  172.799648]  dev_xdp_install+0x61/0xe0
[  172.803401]  dev_xdp_attach+0x1e0/0x550
[  172.807240]  dev_change_xdp_fd+0x1e6/0x220
[  172.811338]  do_setlink+0xee8/0x1010
[  172.814917]  rtnl_setlink+0xe5/0x170
[  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[  172.823732]  ? security_capable+0x36/0x50
< snip >

Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.

From v1 [0]:
 - just use a for-loop to simplify both the fix and the existing
   code (Eric).

[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com

Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-10 17:45:15 -08:00
Mark Pashmfouroush
f89315650b bpf: Add ingress_ifindex to bpf_sk_lookup
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.

Add the ifindex of the arriving connection to the bpf_sk_lookup API.

Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
2021-11-10 16:29:58 -08:00
Jussi Maki
b2c4618162 bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
The current conversion of skb->data_end reads like this:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r2 +116)
   563: (1f) r1 -= r11

But similar to the case in 84f44df664e9 ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect skb->len
when src == dst. In this case we end up generating this xlated code:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
   560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r1 +116)
   563: (1f) r1 -= r11

... where line 560 is the reading 4B of (skb->data + 112) instead of the
intended skb->len Here the skb pointer in r1 gets set to skb->data and the
later deref for skb->len ends up following skb->data instead of skb.

This fixes the issue similarly to the patch mentioned above by creating an
additional temporary variable and using to store the register when dst_reg =
src_reg. We name the variable bpf_temp_reg and place it in the cb context for
sk_skb. Then we restore from the temp to ensure nothing is lost.

Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
2021-11-09 01:05:34 +01:00
John Fastabend
e0dc3b93bd bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.

But, more importantly strparser is using the following to access its metadata.

  (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

Where _strp_msg is defined as:

  struct _strp_msg {
        struct strp_msg            strp;                 /*     0     8 */
        int                        accum_len;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 2 */
        /* last cacheline: 12 bytes */
  };

So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.

In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
2021-11-09 01:05:28 +01:00
John Fastabend
40a34121ac bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
In order to fix an issue with sockets in TCP sockmap redirect cases we plan
to allow CLOSE state sockets to exist in the sockmap. However, the check in
bpf_sk_lookup_assign() currently only invalidates sockets in the
TCP_ESTABLISHED case relying on the checks on sockmap insert to ensure we
never SOCK_CLOSE state sockets in the map.

To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
explicitly test for the accepted cases. Namely, a tcp socket in TCP_LISTEN
or a udp socket in TCP_CLOSE state. This also makes the code more resilent
to future changes.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com
2021-11-09 00:56:35 +01:00
Guo Zhengkui
96d0c9be43 devlink: fix flexible_array.cocci warning
Fix following coccicheck warning:
./net/core/devlink.c:69:6-10: WARNING use flexible-array member instead

Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Link: https://lore.kernel.org/r/20211103121607.27490-1-guozhengkui@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-04 16:43:55 -07:00
Eric Dumazet
d00c8ee317 net: fix possible NULL deref in sock_reserve_memory
Sanity check in sock_reserve_memory() was not enough to prevent malicious
user to trigger a NULL deref.

In this case, the isse is that sk_prot->memory_allocated is NULL.

Use standard sk_has_account() helper to deal with this.

BUG: KASAN: null-ptr-deref in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
BUG: KASAN: null-ptr-deref in atomic_long_add_return include/linux/atomic/atomic-instrumented.h:1218 [inline]
BUG: KASAN: null-ptr-deref in sk_memory_allocated_add include/net/sock.h:1371 [inline]
BUG: KASAN: null-ptr-deref in sock_reserve_memory net/core/sock.c:994 [inline]
BUG: KASAN: null-ptr-deref in sock_setsockopt+0x22ab/0x2b30 net/core/sock.c:1443
Write of size 8 at addr 0000000000000000 by task syz-executor.0/11270

CPU: 1 PID: 11270 Comm: syz-executor.0 Not tainted 5.15.0-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 __kasan_report mm/kasan/report.c:446 [inline]
 kasan_report.cold+0x66/0xdf mm/kasan/report.c:459
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
 instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
 atomic_long_add_return include/linux/atomic/atomic-instrumented.h:1218 [inline]
 sk_memory_allocated_add include/net/sock.h:1371 [inline]
 sock_reserve_memory net/core/sock.c:994 [inline]
 sock_setsockopt+0x22ab/0x2b30 net/core/sock.c:1443
 __sys_setsockopt+0x4f8/0x610 net/socket.c:2172
 __do_sys_setsockopt net/socket.c:2187 [inline]
 __se_sys_setsockopt net/socket.c:2184 [inline]
 __x64_sys_setsockopt+0xba/0x150 net/socket.c:2184
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f56076d5ae9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5604c4b188 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f56077e8f60 RCX: 00007f56076d5ae9
RDX: 0000000000000049 RSI: 0000000000000001 RDI: 0000000000000003
RBP: 00007f560772ff25 R08: 000000000000fec7 R09: 0000000000000000
R10: 0000000020000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fffb61a100f R14: 00007f5604c4b300 R15: 0000000000022000
 </TASK>

Fixes: 2bb2f5fb21b0 ("net: add new socket option SO_RESERVE_MEM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-04 11:28:04 +00:00
Talal Ahmad
9b65b17db7 net: avoid double accounting for pure zerocopy skbs
Track skbs containing only zerocopy data and avoid charging them to
kernel memory to correctly account the memory utilization for
msg_zerocopy. All of the data in such skbs is held in user pages which
are already accounted to user. Before this change, they are charged
again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.

Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.

A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.

In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.

Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.

With this commit we don't see the warning we saw in previous commit
which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194.

Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-03 11:19:49 +00:00
Eric Dumazet
c4777efa75 net: add and use skb_unclone_keeptruesize() helper
While commit 097b9146c0e2 ("net: fix up truesize of cloned
skb in skb_prepare_for_shift()") fixed immediate issues found
when KFENCE was enabled/tested, there are still similar issues,
when tcp_trim_head() hits KFENCE while the master skb
is cloned.

This happens under heavy networking TX workloads,
when the TX completion might be delayed after incoming ACK.

This patch fixes the WARNING in sk_stream_kill_queues
when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.

Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-02 18:10:34 -07:00
Jakub Kicinski
84882cf72c Revert "net: avoid double accounting for pure zerocopy skbs"
This reverts commit f1a456f8f3fc5828d8abcad941860380ae147b1d.

  WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
  CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
  RIP: 0010:skb_try_coalesce+0x78b/0x7e0
  Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
  RSP: 0018:ffff88881f449688 EFLAGS: 00010282
  RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
  RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
  RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
  R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
  R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
  FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   tcp_try_coalesce+0xeb/0x290
   ? tcp_parse_options+0x610/0x610
   ? mark_held_locks+0x79/0xa0
   tcp_queue_rcv+0x69/0x2f0
   tcp_rcv_established+0xa49/0xd40
   ? tcp_data_queue+0x18a0/0x18a0
   tcp_v6_do_rcv+0x1c9/0x880
   ? rt6_mtu_change_route+0x100/0x100
   tcp_v6_rcv+0x1624/0x1830

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 22:26:08 -07:00
Jakub Kicinski
8a33dcc2f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in the fixes we had queued in case there was another -rc.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 20:05:14 -07:00
Jakub Kicinski
b7b98f8689 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-11-01

We've added 181 non-merge commits during the last 28 day(s) which contain
a total of 280 files changed, 11791 insertions(+), 5879 deletions(-).

The main changes are:

1) Fix bpf verifier propagation of 64-bit bounds, from Alexei.

2) Parallelize bpf test_progs, from Yucong and Andrii.

3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus.

4) Improve bpf selftests on s390, from Ilya.

5) bloomfilter bpf map type, from Joanne.

6) Big improvements to JIT tests especially on Mips, from Johan.

7) Support kernel module function calls from bpf, from Kumar.

8) Support typeless and weak ksym in light skeleton, from Kumar.

9) Disallow unprivileged bpf by default, from Pawan.

10) BTF_KIND_DECL_TAG support, from Yonghong.

11) Various bpftool cleanups, from Quentin.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits)
  libbpf: Deprecate AF_XDP support
  kbuild: Unify options for BTF generation for vmlinux and modules
  selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
  bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
  bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
  selftests/bpf: Fix also no-alu32 strobemeta selftest
  bpf: Add missing map_delete_elem method to bloom filter map
  selftests/bpf: Add bloom map success test for userspace calls
  bpf: Add alignment padding for "map_extra" + consolidate holes
  bpf: Bloom filter map naming fixups
  selftests/bpf: Add test cases for struct_ops prog
  bpf: Add dummy BPF STRUCT_OPS for test purpose
  bpf: Factor out helpers for ctx access checking
  bpf: Factor out a helper to prepare trampoline for struct_ops prog
  selftests, bpf: Fix broken riscv build
  riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h
  tools, build: Add RISC-V to HOSTARCH parsing
  riscv, bpf: Increase the maximum number of iterations
  selftests, bpf: Add one test for sockmap with strparser
  selftests, bpf: Fix test_txmsg_ingress_parser error
  ...
====================

Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 19:59:46 -07:00
Talal Ahmad
f1a456f8f3 net: avoid double accounting for pure zerocopy skbs
Track skbs with only zerocopy data and avoid charging them to kernel
memory to correctly account the memory utilization for msg_zerocopy.
All of the data in such skbs is held in user pages which are already
accounted to user. Before this change, they are charged again in
kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.

Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.

A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.

In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.

Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.

Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 16:33:27 -07:00
Liu Jian
7303524e04 skmsg: Lose offset info in sk_psock_skb_ingress
If sockmap enable strparser, there are lose offset info in
sk_psock_skb_ingress(). If the length determined by parse_msg function is not
skb->len, the skb will be converted to sk_msg multiple times, and userspace
app will get the data multiple times.

Fix this by get the offset and length from strp_msg. And as Cong suggested,
add one bit in skb->_sk_redir to distinguish enable or disable strparser.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com
2021-11-01 17:08:21 +01:00
Jakub Kicinski
1af0a0948e ethtool: don't drop the rtnl_lock half way thru the ioctl
devlink compat code needs to drop rtnl_lock to take
devlink->lock to ensure correct lock ordering.

This is problematic because we're not strictly guaranteed
that the netdev will not disappear after we re-lock.
It may open a possibility of nested ->begin / ->complete
calls.

Instead of calling into devlink under rtnl_lock take
a ref on the devlink instance and make the call after
we've dropped rtnl_lock.

We (continue to) assume that netdevs have an implicit
reference on the devlink returned from ndo_get_devlink_port

Note that ndo_get_devlink_port will now get called
under rtnl_lock. That should be fine since none of
the drivers seem to be taking serious locks inside
ndo_get_devlink_port.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01 13:26:07 +00:00
Jakub Kicinski
46db1b77cd devlink: expose get/put functions
Allow those who hold implicit reference on a devlink instance
to try to take a full ref on it. This will be used from netdev
code which has an implicit ref because of driver call ordering.

Note that after recent changes devlink_unregister() may happen
before netdev unregister, but devlink_free() should still happen
after, so we are safe to try, but we can't just refcount_inc()
and assume it's not zero.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01 13:26:07 +00:00
Jakub Kicinski
f49deaa64a ethtool: push the rtnl_lock into dev_ethtool()
Don't take the lock in net/core/dev_ioctl.c,
we'll have things to do outside rtnl_lock soon.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01 13:26:07 +00:00
Yoshiki Komachi
6de6e46d27 cls_flower: Fix inability to match GRE/IPIP packets
When a packet of a new flow arrives in openvswitch kernel module, it dissects
the packet and passes the extracted flow key to ovs-vswtichd daemon. If hw-
offload configuration is enabled, the daemon creates a new TC flower entry to
bypass openvswitch kernel module for the flow (TC flower can also offload flows
to NICs but this time that does not matter).

In this processing flow, I found the following issue in cases of GRE/IPIP
packets.

When ovs_flow_key_extract() in openvswitch module parses a packet of a new
GRE (or IPIP) flow received on non-tunneling vports, it extracts information
of the outer IP header for ip_proto/src_ip/dst_ip match keys.

This means ovs-vswitchd creates a TC flower entry with IP protocol/addresses
match keys whose values are those of the outer IP header. OTOH, TC flower,
which uses flow_dissector (different parser from openvswitch module), extracts
information of the inner IP header.

The following flow is an example to describe the issue in more detail.

   <----------- Outer IP -----------------> <---------- Inner IP ---------->
  +----------+--------------+--------------+----------+----------+----------+
  | ip_proto | src_ip       | dst_ip       | ip_proto | src_ip   | dst_ip   |
  | 47 (GRE) | 192.168.10.1 | 192.168.10.2 | 6 (TCP)  | 10.0.0.1 | 10.0.0.2 |
  +----------+--------------+--------------+----------+----------+----------+

In this case, TC flower entry and extracted information are shown as below:

  - ovs-vswitchd creates TC flower entry with:
      - ip_proto: 47
      - src_ip: 192.168.10.1
      - dst_ip: 192.168.10.2

  - TC flower extracts below for IP header matches:
      - ip_proto: 6
      - src_ip: 10.0.0.1
      - dst_ip: 10.0.0.2

Thus, GRE or IPIP packets never match the TC flower entry, as each
dissector behaves differently.

IMHO, the behavior of TC flower (flow dissector) does not look correct,
as ip_proto/src_ip/dst_ip in TC flower match means the outermost IP
header information except for GRE/IPIP cases. This patch adds a new
flow_dissector flag FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP which skips
dissection of the encapsulated inner GRE/IPIP header in TC flower
classifier.

Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 14:03:22 +01:00
Jakub Kicinski
c52ef04d59 devlink: make all symbols GPL-only
devlink_alloc() and devlink_register() are both GPL.
A non-GPL module won't get far, so for consistency
we can make all symbols GPL without risking any real
life breakage.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:46:26 +01:00
Jeremy Kerr
78476d315e mctp: Add flow extension to skb
This change adds a new skb extension for MCTP, to represent a
request/response flow.

The intention is to use this in a later change to allow i2c controllers
to correctly configure a multiplexer over a flow.

Since we have a cleanup function in the core path (if an extension is
present), we'll need to make CONFIG_MCTP a bool, rather than a tristate.

Includes a fix for a build warning with clang:
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:23:51 +01:00
Jakub Kicinski
7df621a3ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  7b50ecfcc6cd ("net: Rename ->stream_memory_read to ->sock_is_readable")
  4c1e34c0dbff ("vsock: Enable y2038 safe timeval for timeout")

drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
  0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table")
  e77bcdd1f639 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.")

Adjacent code addition in both cases, keep both.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 10:43:58 -07:00
Leon Romanovsky
ee775b5695 devlink: Simplify internal devlink params implementation
Reduce extra indirection from devlink_params_*() API. Such change
makes it clear that we can drop devlink->lock from these flows, because
everything is executed when the devlink is not registered yet.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:55:17 +01:00
Yajun Deng
b859a360d8 xdp: Remove redundant warning
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't
equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is
redundant.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:13:57 -07:00