30460 Commits

Author SHA1 Message Date
Andrii Nakryiko
aaf6886d9b libbpf: remove deprecated BTF APIs
Get rid of deprecated BTF-related APIs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28 13:13:32 -07:00
Andrii Nakryiko
d320fad217 libbpf: remove deprecated probing APIs
Get rid of deprecated feature-probing APIs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28 13:13:32 -07:00
Andrii Nakryiko
53e6af3a76 libbpf: remove deprecated XDP APIs
Get rid of deprecated bpf_set_link*() and bpf_get_link*() APIs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28 13:13:32 -07:00
Andrii Nakryiko
765a34130e libbpf: remove deprecated low-level APIs
Drop low-level APIs as well as high-level (and very confusingly named)
BPF object loading bpf_prog_load_xattr() and bpf_prog_load_deprecated()
APIs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28 13:13:32 -07:00
Andrii Nakryiko
f366006342 libbpf: move xsk.{c,h} into selftests/bpf
Remove deprecated xsk APIs from libbpf. But given we have selftests
relying on this, move those files (with minimal adjustments to make them
compilable) under selftests/bpf.

We also remove all the removed APIs from libbpf.map, while overall
keeping version inheritance chain, as most APIs are backwards
compatible so there is no need to reassign them as LIBBPF_1.0.0 versions.

Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28 13:13:32 -07:00
Daniel Müller
fd75733da2 bpf: Merge "types_are_compat" logic into relo_core.c
BPF type compatibility checks (bpf_core_types_are_compat()) are
currently duplicated between kernel and user space. That's a historical
artifact more than intentional doing and can lead to subtle bugs where
one implementation is adjusted but another is forgotten.

That happened with the enum64 work, for example, where the libbpf side
was changed (commit 23b2a3a8f63a ("libbpf: Add enum64 relocation
support")) to use the btf_kind_core_compat() helper function but the
kernel side was not (commit 6089fb325cf7 ("bpf: Add btf enum64
support")).

This patch addresses both the duplication issue, by merging both
implementations and moving them into relo_core.c, and fixes the alluded
to kind check (by giving preference to libbpf's already adjusted logic).

For discussion of the topic, please refer to:
https://lore.kernel.org/bpf/CAADnVQKbWR7oarBdewgOBZUPzryhRYvEbkhyPJQHHuxq=0K1gw@mail.gmail.com/T/#mcc99f4a33ad9a322afaf1b9276fb1f0b7add9665

Changelog:
v1 -> v2:
- limited libbpf recursion limit to 32
- changed name to __bpf_core_types_are_compat
- included warning previously present in libbpf version
- merged kernel and user space changes into a single patch

Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220623182934.2582827-1-deso@posteo.net
2022-06-24 14:15:37 -07:00
Jiri Olsa
b168852eb8 perf tools: Rework prologue generation code
Some functions we use for bpf prologue generation are going to be
deprecated. This change reworks current code not to use them.

We need to replace following functions/struct:
   bpf_program__set_prep
   bpf_program__nth_fd
   struct bpf_prog_prep_result

Currently we use bpf_program__set_prep to hook perf callback before
program is loaded and provide new instructions with the prologue.

We replace this function/ality by taking instructions for specific
program, attaching prologue to them and load such new ebpf programs
with prologue using separate bpf_prog_load calls (outside libbpf
load machinery).

Before we can take and use program instructions, we need libbpf to
actually load it. This way we get the final shape of its instructions
with all relocations and verifier adjustments).

There's one glitch though.. perf kprobe program already assumes
generated prologue code with proper values in argument registers,
so loading such program directly will fail in the verifier.

That's where the fallback pre-load handler fits in and prepends
the initialization code to the program. Once such program is loaded
we take its instructions, cut off the initialization code and prepend
the prologue.

I know.. sorry ;-)

To have access to the program when loading this patch adds support to
register 'fallback' section handler to take care of perf kprobe programs.
The fallback means that it handles any section definition besides the
ones that libbpf handles.

The handler serves two purposes:
  - allows perf programs to have special arguments in section name
  - allows perf to use pre-load callback where we can attach init
    code (zeroing all argument registers) to each perf program

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/bpf/20220616202214.70359-2-jolsa@kernel.org
2022-06-24 13:36:23 -07:00
Eduard Zingerman
41188e9e9d selftest/bpf: Test for use-after-free bug fix in inline_bpf_loop
This test verifies that bpf_loop() inlining works as expected when
address of `env->prog` is updated. This address is updated upon BPF
program reallocation.

Reallocation is handled by bpf_prog_realloc(), which reuses old memory
if page boundary is not crossed. The value of `len` in the test is
chosen to cross this boundary on bpf_loop() patching.

Verify that the use-after-free bug in inline_bpf_loop() reported by
Dan Carpenter is fixed.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-3-eddyz87@gmail.com
2022-06-24 16:51:00 +02:00
Jörn-Thorben Hinz
6dc7a0baf1 selftests/bpf: Fix rare segfault in sock_fields prog test
test_sock_fields__detach() got called with a null pointer here when one
of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load()
call resulted in a jump to the "done" label.

A skeletons *__detach() is not safe to call with a null pointer, though.
This led to a segfault.

Go the easy route and only call test_sock_fields__destroy() which is
null-pointer safe and includes detaching.

Came across this while looking[1] to introduce the usage of
bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with
vmlinux.h.

[1] https://lore.kernel.org/bpf/629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/

Fixes: 8f50f16ff39d ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads")
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
2022-06-23 10:52:12 -07:00
Jörn-Thorben Hinz
f14a3f644a selftests/bpf: Test a BPF CC implementing the unsupported get_info()
Test whether a TCP CC implemented in BPF providing get_info() is
rejected correctly. get_info() is unsupported in a BPF CC. The check for
required functions in a BPF CC has moved, this test ensures unsupported
functions are still rejected correctly.

Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23 09:49:58 -07:00
Jörn-Thorben Hinz
0735627d78 selftests/bpf: Test an incomplete BPF CC
Test whether a TCP CC implemented in BPF providing neither cong_avoid()
nor cong_control() is correctly rejected. This check solely depends on
tcp_register_congestion_control() now, which is invoked during
bpf_map__attach_struct_ops().

Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23 09:49:57 -07:00
Jörn-Thorben Hinz
6e945d57cc selftests/bpf: Test a BPF CC writing sk_pacing_*
Test whether a TCP CC implemented in BPF is allowed to write
sk_pacing_rate and sk_pacing_status in struct sock. This is needed when
cong_control() is implemented and used.

Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23 09:49:57 -07:00
Dave Marchevsky
7308748925 selftests/bpf: Add benchmark for local_storage get
Add a benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage. Only one inner map is created - a hashmap
meant to hold tid -> data mapping for all tasks. Size of the hashmap is
hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
keys which are actually fetched as part of the benchmark is
configurable.

Addition of this benchmark is inspired by conversation with Alexei in a
previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so effect of
future work on both is clear.

Regarding the benchmark results. On a powerful system (Skylake, 20
cores, 256gb ram):

Hashmap Control
===============
        num keys: 10
hashmap (control) sequential    get:  hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s

        num keys: 1000
hashmap (control) sequential    get:  hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s

        num keys: 10000
hashmap (control) sequential    get:  hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s

        num keys: 100000
hashmap (control) sequential    get:  hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s

        num keys: 4194304
hashmap (control) sequential    get:  hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
        num_maps: 1
local_storage cache sequential  get:  hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get:  hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get:  hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get:  hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get:  hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get:  hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get:  hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get:  hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" is creating maps 0..n in order and
then doing bpf_task_storage_get calls in the same order, the benchmark
is effectively ensuring that a map will not be in cache when the program
tries to access it.

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

To get a sense of the overhead of the benchmark program, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.c and ran the benchmark on the same host as the
'real' run. Results:

Hashmap Control
===============
        num keys: 10
hashmap (control) sequential    get:  hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s

        num keys: 1000
hashmap (control) sequential    get:  hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s

        num keys: 10000
hashmap (control) sequential    get:  hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s

        num keys: 100000
hashmap (control) sequential    get:  hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s

        num keys: 4194304
hashmap (control) sequential    get:  hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s

Local Storage
=============
        num_maps: 1
local_storage cache sequential  get:  hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get:  hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get:  hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get:  hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get:  hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get:  hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get:  hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get:  hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get:  hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and
"sequential get" are:

hashmap_control_1k:   ~53.8ns
hashmap_control_10k:  ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1:     ~12.1ns
sequential_get_10:    ~16.0ns
sequential_get_16:    ~13.8ns
sequential_get_17:    ~16.8ns
sequential_get_24:    ~40.9ns
sequential_get_32:    ~65.2ns
sequential_get_100:   ~148.2ns
sequential_get_1000:  ~2204ns

Clearly demonstrating a cliff.

In the discussion for v1 of this patch, Alexei noted that local_storage
was 2.5x faster than a large hashmap when initially implemented [1]. The
benchmark results show that local_storage is 5-10x faster: a
long-running BPF application putting some pid-specific info into a
hashmap for each pid it sees will probably see on the order of 10-100k
pids. Bench numbers for hashmaps of this size are ~10x slower than
sequential_get_16, but as the number of local_storage maps grows far
past local_storage cache size the performance advantage shrinks and
eventually reverses.

When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
  [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-22 19:14:33 -07:00
Eduard Zingerman
0e1bf9ed20 selftests/bpf: BPF test_prog selftests for bpf_loop inlining
Two new test BPF programs for test_prog selftests checking bpf_loop
behavior. Both are corner cases for bpf_loop inlinig transformation:
 - check that bpf_loop behaves correctly when callback function is not
   a compile time constant
 - check that local function variables are not affected by allocating
   additional stack storage for registers spilled by loop inlining

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:40:52 -07:00
Eduard Zingerman
f8acfdd044 selftests/bpf: BPF test_verifier selftests for bpf_loop inlining
A number of test cases for BPF selftests test_verifier to check how
bpf_loop inline transformation rewrites the BPF program. The following
cases are covered:
 - happy path
 - no-rewrite when flags is non-zero
 - no-rewrite when callback is non-constant
 - subprogno in insn_aux is updated correctly when dead sub-programs
   are removed
 - check that correct stack offsets are assigned for spilling of R6-R8
   registers

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:40:51 -07:00
Eduard Zingerman
7a42008ca5 selftests/bpf: allow BTF specs and func infos in test_verifier tests
The BTF and func_info specification for test_verifier tests follows
the same notation as in prog_tests/btf.c tests. E.g.:

  ...
  .func_info = { { 0, 6 }, { 8, 7 } },
  .func_info_cnt = 2,
  .btf_strings = "\0int\0",
  .btf_types = {
    BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4),
    BTF_PTR_ENC(1),
  },
  ...

The BTF specification is loaded only when specified.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:40:51 -07:00
Eduard Zingerman
933ff53191 selftests/bpf: specify expected instructions in test_verifier tests
Allows to specify expected and unexpected instruction sequences in
test_verifier test cases. The instructions are requested from kernel
after BPF program loading, thus allowing to check some of the
transformations applied by BPF verifier.

- `expected_insn` field specifies a sequence of instructions expected
  to be found in the program;
- `unexpected_insn` field specifies a sequence of instructions that
  are not expected to be found in the program;
- `INSN_OFF_MASK` and `INSN_IMM_MASK` values could be used to mask
  `off` and `imm` fields.
- `SKIP_INSNS` could be used to specify that some instructions in the
  (un)expected pattern are not important (behavior similar to usage of
  `\t` in `errstr` field).

The intended usage is as follows:

  {
	"inline simple bpf_loop call",
	.insns = {
	/* main */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
	BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_2,
			BPF_PSEUDO_FUNC, 0, 6),
    ...
	BPF_EXIT_INSN(),
	/* callback */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),
	BPF_EXIT_INSN(),
	},
	.expected_insns = {
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
	SKIP_INSNS(),
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1)
	},
	.unexpected_insns = {
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0,
			INSN_OFF_MASK, INSN_IMM_MASK),
	},
	.prog_type = BPF_PROG_TYPE_TRACEPOINT,
	.result = ACCEPT,
	.runs = 0,
  },

Here it is expected that move of 1 to register 1 would remain in place
and helper function call instruction would be replaced by a relative
call instruction.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:40:51 -07:00
Maxim Mikityanskiy
e068c0776b selftests/bpf: Enable config options needed for xdp_synproxy test
This commit adds the kernel config options needed to run the recently
added xdp_synproxy test. Users without these options will hit errors
like this:

test_synproxy:FAIL:iptables -t raw -I PREROUTING         -i tmp1 -p
tcp -m tcp --syn --dport 8080 -j CT --notrack unexpected error: 256
(errno 22)

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220620104939.4094104-1-maximmi@nvidia.com
2022-06-20 14:10:54 +02:00
Jakub Kicinski
9fb424c4c2 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-06-17

We've added 72 non-merge commits during the last 15 day(s) which contain
a total of 92 files changed, 4582 insertions(+), 834 deletions(-).

The main changes are:

1) Add 64 bit enum value support to BTF, from Yonghong Song.

2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov.

3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a
   socket especially useful in synproxy scenarios, from Maxim Mikityanskiy.

4) Fix libbpf's internal USDT address translation logic for shared libraries as
   well as uprobe's symbol file offset calculation, from Andrii Nakryiko.

5) Extend libbpf to provide an API for textual representation of the various
   map/prog/attach/link types and use it in bpftool, from Daniel Müller.

6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the
   core seen in 32 bit when storing BPF function addresses, from Pu Lehui.

7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases
   for 'long' types, from Douglas Raillard.

8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting
   has been unreliable and caused a regression on COS, from Quentin Monnet.

9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link
   detachment, from Tadeusz Struk.

10) Fix bpftool build bootstrapping during cross compilation which was pointing
    to the wrong AR process, from Shahab Vahedi.

11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi.

12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there
    is no free element. Also add a benchmark as reproducer, from Feng Zhou.

13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin.

14) Various minor cleanup and improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
  bpf: Fix bpf_skc_lookup comment wrt. return type
  bpf: Fix non-static bpf_func_proto struct definitions
  selftests/bpf: Don't force lld on non-x86 architectures
  selftests/bpf: Add selftests for raw syncookie helpers in TC mode
  bpf: Allow the new syncookie helpers to work with SKBs
  selftests/bpf: Add selftests for raw syncookie helpers
  bpf: Add helpers to issue and check SYN cookies in XDP
  bpf: Allow helpers to accept pointers with a fixed size
  bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
  selftests/bpf: add tests for sleepable (uk)probes
  libbpf: add support for sleepable uprobe programs
  bpf: allow sleepable uprobe programs to attach
  bpf: implement sleepable uprobes by chaining gps
  bpf: move bpf_prog to bpf.h
  libbpf: Fix internal USDT address translation logic for shared libraries
  samples/bpf: Check detach prog exist or not in xdp_fwd
  selftests/bpf: Avoid skipping certain subtests
  selftests/bpf: Fix test_varlen verification failure with latest llvm
  bpftool: Do not check return value from libbpf_set_strict_mode()
  Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
  ...
====================

Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-17 19:35:19 -07:00
Ido Schimmel
ed62af4546 selftests: spectrum-2: tc_flower_scale: Dynamically set scale target
Instead of hard coding the scale target in the test, dynamically set it
based on the maximum number of flow counters and their current
occupancy.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Petr Machata
be00853bfd selftests: mlxsw: Add a RIF counter scale test
This tests creates as many RIFs as possible, ideally more than there can be
RIF counters (though that is currently only possible on Spectrum-1). It
then tries to enable L3 HW stats on each of the RIFs. It also contains the
traffic test, which tries to run traffic through a log2 of those counters
and checks that the traffic is shown in the counter values.

Like with tc_flower traffic test, take a log2 subset of rules. The logic
behind picking log2 rules is that then every bit of the instantiated item's
number is exercised. This should catch issues whether they happen at the
high end, low end, or somewhere in between.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Petr Machata
dd5d20e17c selftests: mlxsw: tc_flower_scale: Add a traffic test
Add a test that checks that the created filters do actually trigger on
matching traffic.

Exercising all the rules would be a very lengthy process. Instead, take a
log2 subset of rules. The logic behind picking log2 rules is that then
every bit of the instantiated item's number is exercised. This should catch
issues whether they happen at the high end, low end, or somewhere in
between.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Petr Machata
35d5829e86 selftests: mlxsw: resource_scale: Pass target count to cleanup
The scale tests are verifying behavior of mlxsw when number of instances of
some resource reaches the ASIC capacity. The number of instances is
referred to as "target" number.

No scale tests so far needed to know this target number to clean up. E.g.
the tc_flower simply removes the clsact qdisc that all the tested filters
are hooked onto, and that takes care of collecting all the filters.

However, for the RIF counter test, which is being added in a future patch,
VLAN netdevices are created. These are created as part of the test, but of
course the cleanup needs to undo them again. For that it needs to know how
many there were. To support this usage, pass the target number to the
cleanup callback.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Petr Machata
8cad339db3 selftests: mlxsw: resource_scale: Allow skipping a test
The scale tests are currently testing two things: that some number of
instances of a given resource can actually be created; and that when an
attempt is made to create more than the supported amount, the failures are
noted and handled gracefully.

Sometimes the scale test depends on more than one resource. In particular,
a following patch will add a RIF counter scale test, which depends on the
number of RIF counters that can be bound, and also on the number of RIFs
that can be created.

When the test is limited by the auxiliary resource and not by the primary
one, there's no point trying to run the overflow test, because it would be
testing exhaustion of the wrong resource.

To support this use case, when the $test_get_target yields 0, skip the test
instead.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Petr Machata
3128b9f51e selftests: mlxsw: resource_scale: Introduce traffic tests
The scale tests are currently testing two things: that some number of
instances of a given resource can actually be created; and that when an
attempt is made to create more than the supported amount, the failures are
noted and handled gracefully.

However the ability to allocate the resource does not mean that the
resource actually works when passing traffic. For that, make it possible
for a given scale to also test traffic.

Traffic test is only run on the positive leg of the scale test (no point
trying to pass traffic when the expected outcome is that the resource will
not be allocated). Traffic tests are opt-in, if a given test does not
expose it, it is not run.

To this end, delay the test cleanup until after the traffic test is run.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Ido Schimmel
d3ffeb2dba selftests: mlxsw: resource_scale: Update scale target after test setup
The scale of each resource is tested in the following manner:

1. The scale target is queried.
2. The test setup is prepared.
3. The test is invoked.

In some cases, the occupancy of a resource changes as part of the second
step, requiring the test to return a scale target that takes this change
into account.

Make this more robust by re-querying the scale target after the second
step.

Another possible solution is to swap the first and second steps, but
when a test needs to be skipped (i.e., scale target is zero), the setup
would have been in vain.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Amit Cohen
e386a527fc selftests: mirror_gre_bridge_1q_lag: Enslave port to bridge before other configurations
Using mlxsw driver, the configurations are offloaded just in case that
there is a physical port which is enslaved to the virtual device
(e.g., to a bridge). In 'mirror_gre_bridge_1q_lag' test, the bridge gets an
address and route before there are ports in the bridge. It means that these
configurations are not offloaded.

Till now the test passes with mlxsw driver even that the RIF of the
bridge is not in the hardware, because the ARP packets are trapped in
layer 2 and also mirrored, so there is no real need of the RIF in hardware.
The previous patch changed the traps 'ARP_REQUEST' and 'ARP_RESPONSE' to
be done at layer 3 instead of layer 2. With this change the ARP packets are
not trapped during the test, as the RIF is not in the hardware because of
the order of configurations.

Reorder the configurations to make them to be offloaded, then the test will
pass with the change of the traps.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17 10:31:33 +01:00
Andrii Nakryiko
08c79c9cd6 selftests/bpf: Don't force lld on non-x86 architectures
LLVM's lld linker doesn't have a universal architecture support (e.g.,
it definitely doesn't work on s390x), so be safe and force lld for
urandom_read and liburandom_read.so only on x86 architectures.

This should fix s390x CI runs.

Fixes: 3e6fe5ce4d48 ("libbpf: Fix internal USDT address translation logic for shared libraries")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617045512.1339795-1-andrii@kernel.org
2022-06-17 10:16:01 +02:00
Maxim Mikityanskiy
784d5dc0ef selftests/bpf: Add selftests for raw syncookie helpers in TC mode
This commit extends selftests for the new BPF helpers
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} to also test the TC BPF
functionality added in the previous commit.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-7-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 21:20:30 -07:00
Maxim Mikityanskiy
fb5cd0ce70 selftests/bpf: Add selftests for raw syncookie helpers
This commit adds selftests for the new BPF helpers:
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6}.

xdp_synproxy_kern.c is a BPF program that generates SYN cookies on
allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
iptables module.

xdp_synproxy.c is a userspace control application that allows to
configure the following options in runtime: list of allowed ports, MSS,
window scale, TTL.

A selftest is added to prog_tests that leverages the above programs to
test the functionality of the new helpers.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-5-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 21:20:30 -07:00
Maxim Mikityanskiy
33bf988504 bpf: Add helpers to issue and check SYN cookies in XDP
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
program to generate SYN cookies in response to TCP SYN packets and to
check those cookies upon receiving the first ACK packet (the final
packet of the TCP handshake).

Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a
listening socket on the local machine, which allows to use them together
with synproxy to accelerate SYN cookie generation.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 21:20:30 -07:00
Maxim Mikityanskiy
ac80287a6a bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
bpf_tcp_gen_syncookie expects the full length of the TCP header (with
all options), and bpf_tcp_check_syncookie accepts lengths bigger than
sizeof(struct tcphdr). Fix the documentation that says these lengths
should be exactly sizeof(struct tcphdr).

While at it, fix a typo in the name of struct ipv6hdr.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-2-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 21:20:29 -07:00
Delyan Kratunov
cb3f4a4a46 selftests/bpf: add tests for sleepable (uk)probes
Add tests that ensure sleepable uprobe programs work correctly.
Add tests that ensure sleepable kprobe programs cannot attach.

Also add tests that attach both sleepable and non-sleepable
uprobe programs to the same location (i.e. same bpf_prog_array).

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/c744e5bb7a5c0703f05444dc41f2522ba3579a48.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 19:27:30 -07:00
Delyan Kratunov
c4cac71fc8 libbpf: add support for sleepable uprobe programs
Add section mappings for u(ret)probe.s programs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/aedbc3b74f3523f00010a7b0df8f3388cca59f16.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 19:27:30 -07:00
Andrii Nakryiko
3e6fe5ce4d libbpf: Fix internal USDT address translation logic for shared libraries
Perform the same virtual address to file offset translation that libbpf
is doing for executable ELF binaries also for shared libraries.
Currently libbpf is making a simplifying and sometimes wrong assumption
that for shared libraries relative virtual addresses inside ELF are
always equal to file offsets.

Unfortunately, this is not always the case with LLVM's lld linker, which
now by default generates quite more complicated ELF segments layout.
E.g., for liburandom_read.so from selftests/bpf, here's an excerpt from
readelf output listing ELF segments (a.k.a. program headers):

  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x0005e4 0x0005e4 R   0x1000
  LOAD           0x0005f0 0x00000000000015f0 0x00000000000015f0 0x000160 0x000160 R E 0x1000
  LOAD           0x000750 0x0000000000002750 0x0000000000002750 0x000210 0x000210 RW  0x1000
  LOAD           0x000960 0x0000000000003960 0x0000000000003960 0x000028 0x000029 RW  0x1000

Compare that to what is generated by GNU ld (or LLVM lld's with extra
-znoseparate-code argument which disables this cleverness in the name of
file size reduction):

  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000550 0x000550 R   0x1000
  LOAD           0x001000 0x0000000000001000 0x0000000000001000 0x000131 0x000131 R E 0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x0000ac 0x0000ac R   0x1000
  LOAD           0x002dc0 0x0000000000003dc0 0x0000000000003dc0 0x000262 0x000268 RW  0x1000

You can see from the first example above that for executable (Flg == "R E")
PT_LOAD segment (LOAD #2), Offset doesn't match VirtAddr columns.
And it does in the second case (GNU ld output).

This is important because all the addresses, including USDT specs,
operate in a virtual address space, while kernel is expecting file
offsets when performing uprobe attach. So such mismatches have to be
properly taken care of and compensated by libbpf, which is what this
patch is fixing.

Also patch clarifies few function and variable names, as well as updates
comments to reflect this important distinction (virtaddr vs file offset)
and to ephasize that shared libraries are not all that different from
executables in this regard.

This patch also changes selftests/bpf Makefile to force urand_read and
liburand_read.so to be built with Clang and LLVM's lld (and explicitly
request this ELF file size optimization through -znoseparate-code linker
parameter) to validate libbpf logic and ensure regressions don't happen
in the future. I've bundled these selftests changes together with libbpf
changes to keep the above description tied with both libbpf and
selftests changes.

Fixes: 74cc6311cec9 ("libbpf: Add USDT notes parsing and resolution logic")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616055543.3285835-1-andrii@kernel.org
2022-06-17 01:20:10 +02:00
Linus Torvalds
48a23ec6ff Mostly driver fixes.
Current release - regressions:
 
  - Revert "net: Add a second bind table hashed by port and address",
    needs more work
 
  - amd-xgbe: use platform_irq_count(), static setup of IRQ resources
    had been removed from DT core
 
  - dts: at91: ksz9477_evb: add phy-mode to fix port/phy validation
 
 Current release - new code bugs:
 
  - hns3: modify the ring param print info
 
 Previous releases - always broken:
 
  - axienet: make the 64b addressable DMA depends on 64b architectures
 
  - iavf: fix issue with MAC address of VF shown as zero
 
  - ice: fix PTP TX timestamp offset calculation
 
  - usb: ax88179_178a needs FLAG_SEND_ZLP
 
 Misc:
 
  - document some net.sctp.* sysctls
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKrcx8ACgkQMUZtbf5S
 IrsKjA/9Ho+cxnGAvx7ngQepqAU8RQDFy7sQoFHiGqs+jMeph/E81PM2QDBR9g9h
 k/s7YRpLGuxWFT7KUJScNl0ZyPgSk5EHcqy202ToYyDQv+srLnh5bgbRykMF2Unc
 D4mf63a2pNo9S0L1PmMz87p+XaWIwblqQ0wbl5F97e7eAWel+y7rPCBqR0lZ9Il7
 w8rZp6iOVOhD495s1ikqOYUVCntepC9MQIo8iIE/WrREiOWmZNNbV8RzvuHRNQs6
 j9eLsukKwTfekQbzR3SXbYxyjwRowAQ3bD5sEL3MuqflsxRpVm5lEqN0AuVlAo3C
 IJZFSFqnusC4cSUYVdfWhYlx8om+uw4XKzfqQD/T7yobjoVA/Mmt/Uf7Mw6krR+g
 bI+/bpgX7WpLYQNtBFAils5pY36pthN+zg9FuU0v7tNLgC3AmQqA8sRI/fRCVJFV
 b1Wmk6Ldj1lCynX0KpzU6XSFGzP2Ht9CYReImiwvZbaABIoM14woHhRPrh8UGWIY
 sdpLoR+XRyL/0N1W7l0FgbGm/zOaEbh8fo0ZGYHLukXPUby6osiV36frzjxOj/NO
 DqNkPq4ajfWFvcWdqbfRKXwpLyM/Ki2WpQjvaNzDLOL74sDspr8wjnIOOLbuHv/8
 NW6tcWwfIu9nkDJOpRedh+O2gj6FKdruobdKVgQd376J0kxWLv0=
 =JSV9
 -----END PGP SIGNATURE-----

Merge tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Mostly driver fixes.

  Current release - regressions:

   - Revert "net: Add a second bind table hashed by port and address",
     needs more work

   - amd-xgbe: use platform_irq_count(), static setup of IRQ resources
     had been removed from DT core

   - dts: at91: ksz9477_evb: add phy-mode to fix port/phy validation

  Current release - new code bugs:

   - hns3: modify the ring param print info

  Previous releases - always broken:

   - axienet: make the 64b addressable DMA depends on 64b architectures

   - iavf: fix issue with MAC address of VF shown as zero

   - ice: fix PTP TX timestamp offset calculation

   - usb: ax88179_178a needs FLAG_SEND_ZLP

  Misc:

   - document some net.sctp.* sysctls"

* tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
  net: axienet: add missing error return code in axienet_probe()
  Revert "net: Add a second bind table hashed by port and address"
  net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg
  net: usb: ax88179_178a needs FLAG_SEND_ZLP
  MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS
  ARM: dts: at91: ksz9477_evb: fix port/phy validation
  net: bgmac: Fix an erroneous kfree() in bgmac_remove()
  ice: Fix memory corruption in VF driver
  ice: Fix queue config fail handling
  ice: Sync VLAN filtering features for DVM
  ice: Fix PTP TX timestamp offset calculation
  mlxsw: spectrum_cnt: Reorder counter pools
  docs: networking: phy: Fix a typo
  amd-xgbe: Use platform_irq_count()
  octeontx2-vf: Add support for adaptive interrupt coalescing
  xilinx:  Fix build on x86.
  net: axienet: Use iowrite64 to write all 64b descriptor pointers
  net: axienet: make the 64b addresable DMA depends on 64b archectures
  net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization
  net: hns3: fix PF rss size initialization bug
  ...
2022-06-16 11:51:32 -07:00
Joanne Koong
593d1ebe00 Revert "net: Add a second bind table hashed by port and address"
This reverts:

commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry")
Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/

There are a few things that need to be fixed here:
* Updating bhash2 in cases where the socket's rcv saddr changes
* Adding bhash2 hashbucket locks

Links to syzbot reports:
https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/
https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/

Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com
Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-16 11:07:59 -07:00
Yonghong Song
3831cd1f9f selftests/bpf: Avoid skipping certain subtests
Commit 704c91e59fe0 ('selftests/bpf: Test "bpftool gen min_core_btf"')
added a test test_core_btfgen to test core relocation with btf
generated with 'bpftool gen min_core_btf'. Currently,
among 76 subtests, 25 are skipped.

  ...
  #46/69   core_reloc_btfgen/enumval:OK
  #46/70   core_reloc_btfgen/enumval___diff:OK
  #46/71   core_reloc_btfgen/enumval___val3_missing:OK
  #46/72   core_reloc_btfgen/enumval___err_missing:SKIP
  #46/73   core_reloc_btfgen/enum64val:OK
  #46/74   core_reloc_btfgen/enum64val___diff:OK
  #46/75   core_reloc_btfgen/enum64val___val3_missing:OK
  #46/76   core_reloc_btfgen/enum64val___err_missing:SKIP
  ...
  #46      core_reloc_btfgen:SKIP
  Summary: 1/51 PASSED, 25 SKIPPED, 0 FAILED

Alexei found that in the above core_reloc_btfgen/enum64val___err_missing
should not be skipped.

Currently, the core_reloc tests have some negative tests.
In Commit 704c91e59fe0, for core_reloc_btfgen, all negative tests
are skipped with the following condition
  if (!test_case->btf_src_file || test_case->fails) {
	test__skip();
	continue;
  }
This is too conservative. Negative tests do not fail
mkstemp() and run_btfgen() should not be skipped.
There are a few negative tests indeed failing run_btfgen()
and this patch added 'run_btfgen_fails' to mark these tests
so that they can be skipped for btfgen tests. With this,
we have
  ...
  #46/69   core_reloc_btfgen/enumval:OK
  #46/70   core_reloc_btfgen/enumval___diff:OK
  #46/71   core_reloc_btfgen/enumval___val3_missing:OK
  #46/72   core_reloc_btfgen/enumval___err_missing:OK
  #46/73   core_reloc_btfgen/enum64val:OK
  #46/74   core_reloc_btfgen/enum64val___diff:OK
  #46/75   core_reloc_btfgen/enum64val___val3_missing:OK
  #46/76   core_reloc_btfgen/enum64val___err_missing:OK
  ...
  Summary: 1/62 PASSED, 14 SKIPPED, 0 FAILED

Totally 14 subtests are skipped instead of 25.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220614055526.628299-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-14 16:18:29 -07:00
Yonghong Song
96752e1ec0 selftests/bpf: Fix test_varlen verification failure with latest llvm
With latest llvm15, test_varlen failed with the following verifier log:

  17: (85) call bpf_probe_read_kernel_str#115   ; R0_w=scalar(smin=-4095,smax=256)
  18: (bf) r1 = r0                      ; R0_w=scalar(id=1,smin=-4095,smax=256) R1_w=scalar(id=1,smin=-4095,smax=256)
  19: (67) r1 <<= 32                    ; R1_w=scalar(smax=1099511627776,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32_max=)
  20: (bf) r2 = r1                      ; R1_w=scalar(id=2,smax=1099511627776,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32)
  21: (c7) r2 s>>= 32                   ; R2=scalar(smin=-2147483648,smax=256)
  ; if (len >= 0) {
  22: (c5) if r2 s< 0x0 goto pc+7       ; R2=scalar(umax=256,var_off=(0x0; 0x1ff))
  ; payload4_len1 = len;
  23: (18) r2 = 0xffffc90000167418      ; R2_w=map_value(off=1048,ks=4,vs=1572,imm=0)
  25: (63) *(u32 *)(r2 +0) = r0         ; R0=scalar(id=1,smin=-4095,smax=256) R2_w=map_value(off=1048,ks=4,vs=1572,imm=0)
  26: (77) r1 >>= 32                    ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; payload += len;
  27: (18) r6 = 0xffffc90000167424      ; R6_w=map_value(off=1060,ks=4,vs=1572,imm=0)
  29: (0f) r6 += r1                     ; R1_w=Pscalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6_w=map_value(off=1060,ks=4,vs=1572,umax=4294967295,var_off=(0)
  ; len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
  30: (bf) r1 = r6                      ; R1_w=map_value(off=1060,ks=4,vs=1572,umax=4294967295,var_off=(0x0; 0xffffffff)) R6_w=map_value(off=1060,ks=4,vs=1572,um)
  31: (b7) r2 = 256                     ; R2_w=256
  32: (18) r3 = 0xffffc90000164100      ; R3_w=map_value(off=256,ks=4,vs=1056,imm=0)
  34: (85) call bpf_probe_read_kernel_str#115
  R1 unbounded memory access, make sure to bounds check any such access
  processed 27 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
  -- END PROG LOAD LOG --
  libbpf: failed to load program 'handler32_signed'

The failure is due to
  20: (bf) r2 = r1                      ; R1_w=scalar(id=2,smax=1099511627776,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32)
  21: (c7) r2 s>>= 32                   ; R2=scalar(smin=-2147483648,smax=256)
  22: (c5) if r2 s< 0x0 goto pc+7       ; R2=scalar(umax=256,var_off=(0x0; 0x1ff))
  26: (77) r1 >>= 32                    ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  29: (0f) r6 += r1                     ; R1_w=Pscalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6_w=map_value(off=1060,ks=4,vs=1572,umax=4294967295,var_off=(0)
where r1 has conservative value range compared to r2 and r1 is used later.

In llvm, commit [1] triggered the above code generation and caused
verification failure.

It may take a while for llvm to address this issue. In the main time,
let us change the variable 'len' type to 'long' and adjust condition properly.
Tested with llvm14 and latest llvm15, both worked fine.

 [1] https://reviews.llvm.org/D126647

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220613233449.2860753-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-14 16:17:07 -07:00
Quentin Monnet
93270357da bpftool: Do not check return value from libbpf_set_strict_mode()
The function always returns 0, so we don't need to check whether the
return value is 0 or not.

This change was first introduced in commit a777e18f1bcd ("bpftool: Use
libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"), but later reverted to
restore the unconditional rlimit bump in bpftool. Let's re-add it.

Co-developed-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220610112648.29695-3-quentin@isovalent.com
2022-06-14 22:18:56 +02:00
Quentin Monnet
6b4384ff10 Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.

In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
kernel has switched to memcg-based memory accounting. Thanks to the
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
with other systems and ask libbpf to raise the limit for us if
necessary.

How do we know if memcg-based accounting is supported? There is a probe
in libbpf to check this. But this probe currently relies on the
availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
landed in the same kernel version as the memory accounting change. This
works in the generic case, but it may fail, for example, if the helper
function has been backported to an older kernel. This has been observed
for Google Cloud's Container-Optimized OS (COS), where the helper is
available but rlimit is still in use. The probe succeeds, the rlimit is
not raised, and probing features with bpftool, for example, fails.

A patch was submitted [0] to update this probe in libbpf, based on what
the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
0, attempt to load a BPF object, and reset the rlimit. But it may induce
some hard-to-debug flakiness if another process starts, or the current
application is killed, while the rlimit is reduced, and the approach was
discarded.

As a workaround to ensure that the rlimit bump does not depend on the
availability of a given helper, we restore the unconditional rlimit bump
in bpftool for now.

  [0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
  [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20220610112648.29695-2-quentin@isovalent.com
2022-06-14 22:18:06 +02:00
Yonghong Song
c49a44b39b libbpf: Fix an unsigned < 0 bug
Andrii reported a bug with the following information:

  2859 	if (enum64_placeholder_id == 0) {
  2860 		enum64_placeholder_id = btf__add_int(btf, "enum64_placeholder", 1, 0);
  >>>     CID 394804:  Control flow issues  (NO_EFFECT)
  >>>     This less-than-zero comparison of an unsigned value is never true. "enum64_placeholder_id < 0U".
  2861 		if (enum64_placeholder_id < 0)
  2862 			return enum64_placeholder_id;
  2863    	...

Here enum64_placeholder_id declared as '__u32' so enum64_placeholder_id < 0
is always false. Declare enum64_placeholder_id as 'int' in order to capture
the potential error properly.

Fixes: f2a625889bb8 ("libbpf: Add enum64 sanitization")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220613054314.1251905-1-yhs@fb.com
2022-06-14 17:01:54 +02:00
Linus Torvalds
24625f7d91 ARM64:
* Properly reset the SVE/SME flags on vcpu load
 
 * Fix a vgic-v2 regression regarding accessing the pending
 state of a HW interrupt from userspace (and make the code
 common with vgic-v3)
 
 * Fix access to the idreg range for protected guests
 
 * Ignore 'kvm-arm.mode=protected' when using VHE
 
 * Return an error from kvm_arch_init_vm() on allocation failure
 
 * A bunch of small cleanups (comments, annotations, indentation)
 
 RISC-V:
 
 * Typo fix in arch/riscv/kvm/vmid.c
 
 * Remove broken reference pattern from MAINTAINERS entry
 
 x86-64:
 
 * Fix error in page tables with MKTME enabled
 
 * Dirty page tracking performance test extended to running a nested
   guest
 
 * Disable APICv/AVIC in cases that it cannot implement correctly
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKjTIAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNhPQgAiIVtp8aepujUM/NhkNyK3SIdLzlS
 oZCZiS6bvaecKXi/QvhBU0EBxAEyrovk3lmVuYNd41xI+PDjyaA4SDIl5DnToGUw
 bVPNFSYqjpF939vUUKjc0RCdZR4o5g3Od3tvWoHTHviS1a8aAe5o9pcpHpD0D6Mp
 Gc/o58nKAOPl3htcFKmjymqo3Y6yvkJU9NB7DCbL8T5mp5pJ959Mw1/LlmBaAzJC
 OofrynUm4NjMyAj/mAB1FhHKFyQfjBXLhiVlS0SLiiEA/tn9/OXyVFMKG+n5VkAZ
 Q337GMFe2RikEIuMEr3Rc4qbZK3PpxHhaj+6MPRuM0ho/P4yzl2Nyb/OhA==
 =h81Q
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "While last week's pull request contained miscellaneous fixes for x86,
  this one covers other architectures, selftests changes, and a bigger
  series for APIC virtualization bugs that were discovered during 5.20
  development. The idea is to base 5.20 development for KVM on top of
  this tag.

  ARM64:

   - Properly reset the SVE/SME flags on vcpu load

   - Fix a vgic-v2 regression regarding accessing the pending state of a
     HW interrupt from userspace (and make the code common with vgic-v3)

   - Fix access to the idreg range for protected guests

   - Ignore 'kvm-arm.mode=protected' when using VHE

   - Return an error from kvm_arch_init_vm() on allocation failure

   - A bunch of small cleanups (comments, annotations, indentation)

  RISC-V:

   - Typo fix in arch/riscv/kvm/vmid.c

   - Remove broken reference pattern from MAINTAINERS entry

  x86-64:

   - Fix error in page tables with MKTME enabled

   - Dirty page tracking performance test extended to running a nested
     guest

   - Disable APICv/AVIC in cases that it cannot implement correctly"

[ This merge also fixes a misplaced end parenthesis bug introduced in
  commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC
  ID or APIC base") pointed out by Sean Christopherson ]

Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits)
  KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
  KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2
  KVM: selftests: Clean up LIBKVM files in Makefile
  KVM: selftests: Link selftests directly with lib object files
  KVM: selftests: Drop unnecessary rule for STATIC_LIBS
  KVM: selftests: Add a helper to check EPT/VPID capabilities
  KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h
  KVM: selftests: Refactor nested_map() to specify target level
  KVM: selftests: Drop stale function parameter comment for nested_map()
  KVM: selftests: Add option to create 2M and 1G EPT mappings
  KVM: selftests: Replace x86_page_size with PG_LEVEL_XX
  KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE
  KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put
  KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
  KVM: x86: disable preemption while updating apicv inhibition
  KVM: x86: SVM: fix avic_kick_target_vcpus_fast
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
  KVM: x86: document AVIC/APICv inhibit reasons
  KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs
  ...
2022-06-14 07:57:18 -07:00
Linus Torvalds
8e8afafb0b Yet another hw vulnerability with a software mitigation: Processor MMIO
Stale Data.
 
 They are a class of MMIO-related weaknesses which can expose stale data
 by propagating it into core fill buffers. Data which can then be leaked
 using the usual speculative execution methods.
 
 Mitigations include this set along with microcode updates and are
 similar to MDS and TAA vulnerabilities: VERW now clears those buffers
 too.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKXMkMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWGPD/idalLIhhV5F2+hZIKm0WSnsBxAOh9K
 7y8xBxpQQ5FUfW3vm7Pg3ro6VJp7w2CzKoD4lGXzGHriusn3qst3vkza9Ay8xu8g
 RDwKe6hI+p+Il9BV9op3f8FiRLP9bcPMMReW/mRyYsOnJe59hVNwRAL8OG40PY4k
 hZgg4Psfvfx8bwiye5efjMSe4fXV7BUCkr601+8kVJoiaoszkux9mqP+cnnB5P3H
 zW1d1jx7d6eV1Y063h7WgiNqQRYv0bROZP5BJkufIoOHUXDpd65IRF3bDnCIvSEz
 KkMYJNXb3qh7EQeHS53NL+gz2EBQt+Tq1VH256qn6i3mcHs85HvC68gVrAkfVHJE
 QLJE3MoXWOqw+mhwzCRrEXN9O1lT/PqDWw8I4M/5KtGG/KnJs+bygmfKBbKjIVg4
 2yQWfMmOgQsw3GWCRjgEli7aYbDJQjany0K/qZTq54I41gu+TV8YMccaWcXgDKrm
 cXFGUfOg4gBm4IRjJ/RJn+mUv6u+/3sLVqsaFTs9aiib1dpBSSUuMGBh548Ft7g2
 5VbFVSDaLjB2BdlcG7enlsmtzw0ltNssmqg7jTK/L7XNVnvxwUoXw+zP7RmCLEYt
 UV4FHXraMKNt2ZketlomC8ui2hg73ylUp4pPdMXCp7PIXp9sVamRTbpz12h689VJ
 /s55bWxHkR6S
 =LBxT
 -----END PGP SIGNATURE-----

Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 MMIO stale data fixes from Thomas Gleixner:
 "Yet another hw vulnerability with a software mitigation: Processor
  MMIO Stale Data.

  They are a class of MMIO-related weaknesses which can expose stale
  data by propagating it into core fill buffers. Data which can then be
  leaked using the usual speculative execution methods.

  Mitigations include this set along with microcode updates and are
  similar to MDS and TAA vulnerabilities: VERW now clears those buffers
  too"

* tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/mmio: Print SMT warning
  KVM: x86/speculation: Disable Fill buffer clear within guests
  x86/speculation/mmio: Reuse SRBDS mitigation for SBDS
  x86/speculation/srbds: Update SRBDS mitigation selection
  x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data
  x86/speculation/mmio: Enable CPU Fill buffer clearing on idle
  x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations
  x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data
  x86/speculation: Add a common function for MD_CLEAR mitigation update
  x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug
  Documentation: Add documentation for Processor MMIO Stale Data
2022-06-14 07:43:15 -07:00
Linus Torvalds
3cae0d8475 Random number generator fixes for Linux 5.19-rc2.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKktaEACgkQSfxwEqXe
 A65Q+RAAkyLfX6kxQlTZbglTTiapbGwpk7+VhM0pJfc3guXhIe8kNZ5cX/mfkGeZ
 f/ebUmWSJhqaNz1OIZeBZQ98ESwh8imfWaxWDkqsFh4c+hGsSp2xwIszMn3Hg+7L
 Sm/0Q71eZaSnRBGxWRVBbz3tTppUBS4nJxvFj8iM3jWWUffZa0m/w1lMvqc8kNJu
 kM1ONqb+CEuHOJyltUaH2qEXD6fQE3IOpPdC6PdsFqFX8qLN/pwZO6tpY8tYPt3j
 AUubp8F3eR4Y7WcmMi1b7BmiRgg47jdsS18aqRSH8CuYvIBbXHNM+tuK54zh/888
 d+s9Bj6ALR4z1/a8HXqtudCazYU+1VozWxVIELcpDWTX4wKgqVZ0HLEz7sAEfCgV
 wSIcIY8TRJlOTL43KenbJqbyIOsvAidqWEYz5ogF9WYUlaD82s2j8pUMj4wQoD5w
 VJqF2CaoewU0BOGK7rZFnElN5rPlfEJtNBZQOEo16BzA2tXilPgRFoOmKs9TMRQo
 dGotfp62rUuS14b9x7zc9be0QmtnmwxQKO9U6SRVd5X2HXU/E5PsqTsbt0PKlSbA
 qemw2CehPB+uFs0cqbbeI5VBpVnPQJkaflTfOVr04h7623KJ5Pnblv7/8iSedOOh
 L4ACxPO5I5MrM8rJAR4g85SYVg3A/HaTufA5XtR5QB/qmEaErXg=
 =1Lf0
 -----END PGP SIGNATURE-----

Merge tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator fixes from Jason Donenfeld:

 - A fix for a 5.19 regression for a case in which early device tree
   initializes the RNG, which flips a static branch.

   On most plaforms, jump labels aren't initialized until much later, so
   this caused splats. On a few mailing list threads, we cooked up easy
   fixes for arm64, arm32, and risc-v. But then things looked slightly
   more involved for xtensa, powerpc, arc, and mips. And at that point,
   when we're patching 7 architectures in a place before the console is
   even available, it seems like the cost/risk just wasn't worth it.

   So random.c works around it now by checking the already exported
   `static_key_initialized` boolean, as though somebody already ran into
   this issue in the past. I'm not super jazzed about that; it'd be
   prettier to not have to complicate downstream code. But I suppose
   it's practical.

 - A few small code nits and adding a missing __init annotation.

 - A change to the default config values to use the cpu and bootloader's
   seeds for initializing the RNG earlier.

   This brings them into line with what all the distros do (Fedora/RHEL,
   Debian, Ubuntu, Gentoo, Arch, NixOS, Alpine, SUSE, and Void... at
   least), and moreover will now give us test coverage in various test
   beds that might have caught the above device tree bug earlier.

 - A change to WireGuard CI's configuration to increase test coverage
   around the RNG.

 - A documentation comment fix to unrelated maintainerless CRC code that
   I was asked to take, I guess because it has to do with polynomials
   (which the RNG thankfully no longer uses).

* tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  wireguard: selftests: use maximum cpu features and allow rng seeding
  random: remove rng_has_arch_random()
  random: credit cpu and bootloader seeds by default
  random: do not use jump labels before they are initialized
  random: account for arch randomness in bits
  random: mark bootloader randomness code as __init
  random: avoid checking crng_ready() twice in random_init()
  crc-itu-t: fix typo in CRC ITU-T polynomial comment
2022-06-12 10:33:38 -07:00
Feng Zhou
89eda98428 selftest/bpf/benchs: Add bpf_map benchmark
Add benchmark for hash_map to reproduce the worst case
that non-stop update when map's free is zero.

Just like this:
./run_bench_bpf_hashmap_full_update.sh
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
1:hash_map_full_perf 555830 events per sec
...

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20220610023308.93798-3-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-11 14:25:35 -07:00
Jason A. Donenfeld
17b0128a13 wireguard: selftests: use maximum cpu features and allow rng seeding
By forcing the maximum CPU that QEMU has available, we expose additional
capabilities, such as the RNDR instruction, which increases test
coverage. This then allows the CI to skip the fake seeding step in some
cases. Also enable STRICT_KERNEL_RWX to catch issues related to early
jump labels when the RNG is initialized at boot.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-11 15:38:08 +02:00
Linus Torvalds
825464e79d Networking fixes for 5.19-rc2, including fixes from bpf and netfilter.
Current release - regressions:
   - eth: amt: fix possible null-ptr-deref in amt_rcv()
 
 Previous releases - regressions:
   - tcp: use alloc_large_system_hash() to allocate table_perturb
 
   - af_unix: fix a data-race in unix_dgram_peer_wake_me()
 
   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
 
   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF
 
 Previous releases - always broken:
   - ipv6: fix signed integer overflow in __ip6_append_data
 
   - netfilter:
     - nat: really support inet nat without l3 address
     - nf_tables: memleak flow rule from commit path
 
   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs
 
   - openvswitch: fix misuse of the cached connection on tuple changes
 
   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred
 
   - eth: altera: fix refcount leak in altera_tse_mdio_create
 
 Misc:
   - add Quentin Monnet to bpftool maintainers
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmKhykgSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkN7sQAIn+ZmzQqTm5MVWnlvt/GcRGjjMP2VQY
 60oS2re8QC773yWoP6PvXqxCSFc99paDCC5BmCK6DMLbp9yuVSp5W8iAPuFuyjXE
 /Nur4Ti57LcGJ8ZpcJheBD4cRFbf+xtsGzx9a1WhUDrCYASo7vqRes5Eos2dT7P7
 qjgTduhUtaj6S1CfenfTnYqemZPzSGa+1euDuQ/Bu4mjCPUTrNZZQVYjmfTYM9p1
 UzwfCQr9TtmRKo8wLFHnYDLoWHNpfp55SNL0ShAwIQqgldiJ2OdMje+a2Sa4m6uF
 etRz8H0WrGVqfneD424tdyZv4nwhHw5dnaSrGe8DGq98c4/lIIcVyC38oDAbfWqI
 l8p7ZmtvNid7rpgoQFcxKpb2TAYAI+jaFq5GySEhvj5ZAblNQgFyghfMGPoncXCO
 XW6va8TtP2lmHFScAljQiQb6GNwDO52x77/q14Jkwvr+DILRKXMZZ3hCGrKUn5JM
 lafGkdL5ufm+E9C9RlaWN3imb2KoRj+wdThgV79efEPGG1py7yLOPVMoOCP3qmLq
 torcGcfDi1LGb7ohQxN6tCMv0JgXjS5nd1i+qJnImpkhRrUmahOfmpnElHoPuzs3
 6FU8HR77Eo15x70Jt+WOMy4oXrNh2MeEm8/Fhpj84MEhKpxVn+2o/53M+++5h+ru
 YtiLwEri0dCA
 =rdoB
 -----END PGP SIGNATURE-----

Merge tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - eth: amt: fix possible null-ptr-deref in amt_rcv()

  Previous releases - regressions:

   - tcp: use alloc_large_system_hash() to allocate table_perturb

   - af_unix: fix a data-race in unix_dgram_peer_wake_me()

   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling

   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF

  Previous releases - always broken:

   - ipv6: fix signed integer overflow in __ip6_append_data

   - netfilter:
       - nat: really support inet nat without l3 address
       - nf_tables: memleak flow rule from commit path

   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs

   - openvswitch: fix misuse of the cached connection on tuple changes

   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred

   - eth: altera: fix refcount leak in altera_tse_mdio_create

  Misc:

   - add Quentin Monnet to bpftool maintainers"

* tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  net: amd-xgbe: fix clang -Wformat warning
  tcp: use alloc_large_system_hash() to allocate table_perturb
  net: dsa: realtek: rtl8365mb: fix GMII caps for ports with internal PHY
  net: dsa: mv88e6xxx: correctly report serdes link failure
  net: dsa: mv88e6xxx: fix BMSR error to be consistent with others
  net: dsa: mv88e6xxx: use BMSR_ANEGCOMPLETE bit for filling an_complete
  net: altera: Fix refcount leak in altera_tse_mdio_create
  net: openvswitch: fix misuse of the cached connection on tuple changes
  net: ethernet: mtk_eth_soc: fix misuse of mem alloc interface netdev[napi]_alloc_frag
  ip_gre: test csum_start instead of transport header
  au1000_eth: stop using virt_to_bus()
  ipv6: Fix signed integer overflow in l2tp_ip6_sendmsg
  ipv6: Fix signed integer overflow in __ip6_append_data
  nfc: nfcmrvl: Fix memory leak in nfcmrvl_play_deferred
  nfc: st21nfca: fix incorrect sizing calculations in EVT_TRANSACTION
  nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
  nfc: st21nfca: fix incorrect validating logic in EVT_TRANSACTION
  net: ipv6: unexport __init-annotated seg6_hmac_init()
  net: xfrm: unexport __init-annotated xfrm4_protocol_init()
  net: mdio: unexport __init-annotated mdio_bus_init()
  ...
2022-06-09 12:06:52 -07:00
David Matlack
e0f3f46e42 KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
The selftests nested code only supports 4-level paging at the moment.
This means it cannot map nested guest physical addresses with more than
48 bits. Allow perf_test_util nested mode to work on hosts with more
than 48 physical addresses by restricting the guest test region to
48-bits.

While here, opportunistically fix an off-by-one error when dealing with
vm_get_max_gfn(). perf_test_util.c was treating this as the maximum
number of GFNs, rather than the maximum allowed GFN. This didn't result
in any correctness issues, but it did end up shifting the test region
down slightly when using huge pages.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220520233249.3776001-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:27 -04:00
David Matlack
71d4896619 KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2
Add an option to dirty_log_perf_test that configures the vCPUs to run in
L2 instead of L1. This makes it possible to benchmark the dirty logging
performance of nested virtualization, which is particularly interesting
because KVM must shadow L1's EPT/NPT tables.

For now this support only works on x86_64 CPUs with VMX. Otherwise
passing -n results in the test being skipped.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220520233249.3776001-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:27 -04:00