IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Implement bpf_prog_query_opts as a more expendable version of
bpf_prog_query. Expose new prog_attach_flags and attach_btf_func_id as
well:
* prog_attach_flags is a per-program attach_type; relevant only for
lsm cgroup program which might have different attach_flags
per attach_btf_id
* attach_btf_func_id is a new field expose for prog_query which
specifies real btf function id for lsm cgroup attachments
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-10-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
lsm_cgroup/ is the prefix for BPF_LSM_CGROUP.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-9-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Has been slowly getting out of sync, let's update it.
resolve_btfids usage has been updated to match the header changes.
Also bring new parts of tools/include/uapi/linux/bpf.h.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-8-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow attaching to lsm hooks in the cgroup context.
Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.
For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.
Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.
Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.
Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Seems like we missed to add 2 APIs to libbpf.map and another API was
misspelled. Fix it in libbpf.map.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-16-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove support for legacy features and behaviors that previously had to
be disabled by calling libbpf_set_strict_mode():
- legacy BPF map definitions are not supported now;
- RLIMIT_MEMLOCK auto-setting, if necessary, is always on (but see
libbpf_set_memlock_rlim());
- program name is used for program pinning (instead of section name);
- cleaned up error returning logic;
- entry BPF programs should have SEC() always.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-15-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Libbpf 1.0 stops support legacy-style BPF map definitions. Selftests has
been migrated away from using legacy BPF map definitions except for two
selftests, to make sure that legacy functionality still worked in
pre-1.0 libbpf. Now it's time to let those tests go as libbpf 1.0 is
imminent.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-14-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Clean up internals that had to deal with the possibility of
multi-instance bpf_programs. Libbpf 1.0 doesn't support this, so all
this is not necessary now and can be simplified.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Keep the LIBBPF_DEPRECATED_SINCE macro "framework" for future
deprecations, but clean up 0.x related helper macros.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove all the public APIs that are related to creating multi-instance
bpf_programs through custom preprocessing callback and generally working
with them.
Also remove all the bpf_{object,map,program}__[set_]priv() APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove a bunch of high-level bpf_object/bpf_map/bpf_program related
APIs. All the APIs related to private per-object/map/prog state,
program preprocessing callback, and generally everything multi-instance
related is removed in a separate patch.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Drop low-level APIs as well as high-level (and very confusingly named)
BPF object loading bpf_prog_load_xattr() and bpf_prog_load_deprecated()
APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove deprecated xsk APIs from libbpf. But given we have selftests
relying on this, move those files (with minimal adjustments to make them
compilable) under selftests/bpf.
We also remove all the removed APIs from libbpf.map, while overall
keeping version inheritance chain, as most APIs are backwards
compatible so there is no need to reassign them as LIBBPF_1.0.0 versions.
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF type compatibility checks (bpf_core_types_are_compat()) are
currently duplicated between kernel and user space. That's a historical
artifact more than intentional doing and can lead to subtle bugs where
one implementation is adjusted but another is forgotten.
That happened with the enum64 work, for example, where the libbpf side
was changed (commit 23b2a3a8f6 ("libbpf: Add enum64 relocation
support")) to use the btf_kind_core_compat() helper function but the
kernel side was not (commit 6089fb325c ("bpf: Add btf enum64
support")).
This patch addresses both the duplication issue, by merging both
implementations and moving them into relo_core.c, and fixes the alluded
to kind check (by giving preference to libbpf's already adjusted logic).
For discussion of the topic, please refer to:
https://lore.kernel.org/bpf/CAADnVQKbWR7oarBdewgOBZUPzryhRYvEbkhyPJQHHuxq=0K1gw@mail.gmail.com/T/#mcc99f4a33ad9a322afaf1b9276fb1f0b7add9665
Changelog:
v1 -> v2:
- limited libbpf recursion limit to 32
- changed name to __bpf_core_types_are_compat
- included warning previously present in libbpf version
- merged kernel and user space changes into a single patch
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220623182934.2582827-1-deso@posteo.net
Some functions we use for bpf prologue generation are going to be
deprecated. This change reworks current code not to use them.
We need to replace following functions/struct:
bpf_program__set_prep
bpf_program__nth_fd
struct bpf_prog_prep_result
Currently we use bpf_program__set_prep to hook perf callback before
program is loaded and provide new instructions with the prologue.
We replace this function/ality by taking instructions for specific
program, attaching prologue to them and load such new ebpf programs
with prologue using separate bpf_prog_load calls (outside libbpf
load machinery).
Before we can take and use program instructions, we need libbpf to
actually load it. This way we get the final shape of its instructions
with all relocations and verifier adjustments).
There's one glitch though.. perf kprobe program already assumes
generated prologue code with proper values in argument registers,
so loading such program directly will fail in the verifier.
That's where the fallback pre-load handler fits in and prepends
the initialization code to the program. Once such program is loaded
we take its instructions, cut off the initialization code and prepend
the prologue.
I know.. sorry ;-)
To have access to the program when loading this patch adds support to
register 'fallback' section handler to take care of perf kprobe programs.
The fallback means that it handles any section definition besides the
ones that libbpf handles.
The handler serves two purposes:
- allows perf programs to have special arguments in section name
- allows perf to use pre-load callback where we can attach init
code (zeroing all argument registers) to each perf program
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/bpf/20220616202214.70359-2-jolsa@kernel.org
This test verifies that bpf_loop() inlining works as expected when
address of `env->prog` is updated. This address is updated upon BPF
program reallocation.
Reallocation is handled by bpf_prog_realloc(), which reuses old memory
if page boundary is not crossed. The value of `len` in the test is
chosen to cross this boundary on bpf_loop() patching.
Verify that the use-after-free bug in inline_bpf_loop() reported by
Dan Carpenter is fixed.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-3-eddyz87@gmail.com
test_sock_fields__detach() got called with a null pointer here when one
of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load()
call resulted in a jump to the "done" label.
A skeletons *__detach() is not safe to call with a null pointer, though.
This led to a segfault.
Go the easy route and only call test_sock_fields__destroy() which is
null-pointer safe and includes detaching.
Came across this while looking[1] to introduce the usage of
bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with
vmlinux.h.
[1] https://lore.kernel.org/bpf/629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/
Fixes: 8f50f16ff3 ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads")
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
Test whether a TCP CC implemented in BPF providing get_info() is
rejected correctly. get_info() is unsupported in a BPF CC. The check for
required functions in a BPF CC has moved, this test ensures unsupported
functions are still rejected correctly.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Test whether a TCP CC implemented in BPF providing neither cong_avoid()
nor cong_control() is correctly rejected. This check solely depends on
tcp_register_congestion_control() now, which is invoked during
bpf_map__attach_struct_ops().
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Test whether a TCP CC implemented in BPF is allowed to write
sk_pacing_rate and sk_pacing_status in struct sock. This is needed when
cong_control() is implemented and used.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond current
local_storage implementation's cache size.
"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.
While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage. Only one inner map is created - a hashmap
meant to hold tid -> data mapping for all tasks. Size of the hashmap is
hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
keys which are actually fetched as part of the benchmark is
configurable.
Addition of this benchmark is inspired by conversation with Alexei in a
previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so effect of
future work on both is clear.
Regarding the benchmark results. On a powerful system (Skylake, 20
cores, 256gb ram):
Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" is creating maps 0..n in order and
then doing bpf_task_storage_get calls in the same order, the benchmark
is effectively ensuring that a map will not be in cache when the program
tries to access it.
For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.
To get a sense of the overhead of the benchmark program, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.c and ran the benchmark on the same host as the
'real' run. Results:
Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s
Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s
Adjusting for overhead, latency numbers for "hashmap control" and
"sequential get" are:
hashmap_control_1k: ~53.8ns
hashmap_control_10k: ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1: ~12.1ns
sequential_get_10: ~16.0ns
sequential_get_16: ~13.8ns
sequential_get_17: ~16.8ns
sequential_get_24: ~40.9ns
sequential_get_32: ~65.2ns
sequential_get_100: ~148.2ns
sequential_get_1000: ~2204ns
Clearly demonstrating a cliff.
In the discussion for v1 of this patch, Alexei noted that local_storage
was 2.5x faster than a large hashmap when initially implemented [1]. The
benchmark results show that local_storage is 5-10x faster: a
long-running BPF application putting some pid-specific info into a
hashmap for each pid it sees will probably see on the order of 10-100k
pids. Bench numbers for hashmaps of this size are ~10x slower than
sequential_get_16, but as the number of local_storage maps grows far
past local_storage cache size the performance advantage shrinks and
eventually reverses.
When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.
[0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
[1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Two new test BPF programs for test_prog selftests checking bpf_loop
behavior. Both are corner cases for bpf_loop inlinig transformation:
- check that bpf_loop behaves correctly when callback function is not
a compile time constant
- check that local function variables are not affected by allocating
additional stack storage for registers spilled by loop inlining
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A number of test cases for BPF selftests test_verifier to check how
bpf_loop inline transformation rewrites the BPF program. The following
cases are covered:
- happy path
- no-rewrite when flags is non-zero
- no-rewrite when callback is non-constant
- subprogno in insn_aux is updated correctly when dead sub-programs
are removed
- check that correct stack offsets are assigned for spilling of R6-R8
registers
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BTF and func_info specification for test_verifier tests follows
the same notation as in prog_tests/btf.c tests. E.g.:
...
.func_info = { { 0, 6 }, { 8, 7 } },
.func_info_cnt = 2,
.btf_strings = "\0int\0",
.btf_types = {
BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4),
BTF_PTR_ENC(1),
},
...
The BTF specification is loaded only when specified.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allows to specify expected and unexpected instruction sequences in
test_verifier test cases. The instructions are requested from kernel
after BPF program loading, thus allowing to check some of the
transformations applied by BPF verifier.
- `expected_insn` field specifies a sequence of instructions expected
to be found in the program;
- `unexpected_insn` field specifies a sequence of instructions that
are not expected to be found in the program;
- `INSN_OFF_MASK` and `INSN_IMM_MASK` values could be used to mask
`off` and `imm` fields.
- `SKIP_INSNS` could be used to specify that some instructions in the
(un)expected pattern are not important (behavior similar to usage of
`\t` in `errstr` field).
The intended usage is as follows:
{
"inline simple bpf_loop call",
.insns = {
/* main */
BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_2,
BPF_PSEUDO_FUNC, 0, 6),
...
BPF_EXIT_INSN(),
/* callback */
BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.expected_insns = {
BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
SKIP_INSNS(),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1)
},
.unexpected_insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0,
INSN_OFF_MASK, INSN_IMM_MASK),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
.result = ACCEPT,
.runs = 0,
},
Here it is expected that move of 1 to register 1 would remain in place
and helper function call instruction would be replaced by a relative
call instruction.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds the kernel config options needed to run the recently
added xdp_synproxy test. Users without these options will hit errors
like this:
test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p
tcp -m tcp --syn --dport 8080 -j CT --notrack unexpected error: 256
(errno 22)
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220620104939.4094104-1-maximmi@nvidia.com
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-06-17
We've added 72 non-merge commits during the last 15 day(s) which contain
a total of 92 files changed, 4582 insertions(+), 834 deletions(-).
The main changes are:
1) Add 64 bit enum value support to BTF, from Yonghong Song.
2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov.
3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a
socket especially useful in synproxy scenarios, from Maxim Mikityanskiy.
4) Fix libbpf's internal USDT address translation logic for shared libraries as
well as uprobe's symbol file offset calculation, from Andrii Nakryiko.
5) Extend libbpf to provide an API for textual representation of the various
map/prog/attach/link types and use it in bpftool, from Daniel Müller.
6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the
core seen in 32 bit when storing BPF function addresses, from Pu Lehui.
7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases
for 'long' types, from Douglas Raillard.
8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting
has been unreliable and caused a regression on COS, from Quentin Monnet.
9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link
detachment, from Tadeusz Struk.
10) Fix bpftool build bootstrapping during cross compilation which was pointing
to the wrong AR process, from Shahab Vahedi.
11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi.
12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there
is no free element. Also add a benchmark as reproducer, from Feng Zhou.
13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin.
14) Various minor cleanup and improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpf: Fix bpf_skc_lookup comment wrt. return type
bpf: Fix non-static bpf_func_proto struct definitions
selftests/bpf: Don't force lld on non-x86 architectures
selftests/bpf: Add selftests for raw syncookie helpers in TC mode
bpf: Allow the new syncookie helpers to work with SKBs
selftests/bpf: Add selftests for raw syncookie helpers
bpf: Add helpers to issue and check SYN cookies in XDP
bpf: Allow helpers to accept pointers with a fixed size
bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
selftests/bpf: add tests for sleepable (uk)probes
libbpf: add support for sleepable uprobe programs
bpf: allow sleepable uprobe programs to attach
bpf: implement sleepable uprobes by chaining gps
bpf: move bpf_prog to bpf.h
libbpf: Fix internal USDT address translation logic for shared libraries
samples/bpf: Check detach prog exist or not in xdp_fwd
selftests/bpf: Avoid skipping certain subtests
selftests/bpf: Fix test_varlen verification failure with latest llvm
bpftool: Do not check return value from libbpf_set_strict_mode()
Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
...
====================
Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of hard coding the scale target in the test, dynamically set it
based on the maximum number of flow counters and their current
occupancy.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This tests creates as many RIFs as possible, ideally more than there can be
RIF counters (though that is currently only possible on Spectrum-1). It
then tries to enable L3 HW stats on each of the RIFs. It also contains the
traffic test, which tries to run traffic through a log2 of those counters
and checks that the traffic is shown in the counter values.
Like with tc_flower traffic test, take a log2 subset of rules. The logic
behind picking log2 rules is that then every bit of the instantiated item's
number is exercised. This should catch issues whether they happen at the
high end, low end, or somewhere in between.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a test that checks that the created filters do actually trigger on
matching traffic.
Exercising all the rules would be a very lengthy process. Instead, take a
log2 subset of rules. The logic behind picking log2 rules is that then
every bit of the instantiated item's number is exercised. This should catch
issues whether they happen at the high end, low end, or somewhere in
between.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scale tests are verifying behavior of mlxsw when number of instances of
some resource reaches the ASIC capacity. The number of instances is
referred to as "target" number.
No scale tests so far needed to know this target number to clean up. E.g.
the tc_flower simply removes the clsact qdisc that all the tested filters
are hooked onto, and that takes care of collecting all the filters.
However, for the RIF counter test, which is being added in a future patch,
VLAN netdevices are created. These are created as part of the test, but of
course the cleanup needs to undo them again. For that it needs to know how
many there were. To support this usage, pass the target number to the
cleanup callback.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scale tests are currently testing two things: that some number of
instances of a given resource can actually be created; and that when an
attempt is made to create more than the supported amount, the failures are
noted and handled gracefully.
Sometimes the scale test depends on more than one resource. In particular,
a following patch will add a RIF counter scale test, which depends on the
number of RIF counters that can be bound, and also on the number of RIFs
that can be created.
When the test is limited by the auxiliary resource and not by the primary
one, there's no point trying to run the overflow test, because it would be
testing exhaustion of the wrong resource.
To support this use case, when the $test_get_target yields 0, skip the test
instead.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scale tests are currently testing two things: that some number of
instances of a given resource can actually be created; and that when an
attempt is made to create more than the supported amount, the failures are
noted and handled gracefully.
However the ability to allocate the resource does not mean that the
resource actually works when passing traffic. For that, make it possible
for a given scale to also test traffic.
Traffic test is only run on the positive leg of the scale test (no point
trying to pass traffic when the expected outcome is that the resource will
not be allocated). Traffic tests are opt-in, if a given test does not
expose it, it is not run.
To this end, delay the test cleanup until after the traffic test is run.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scale of each resource is tested in the following manner:
1. The scale target is queried.
2. The test setup is prepared.
3. The test is invoked.
In some cases, the occupancy of a resource changes as part of the second
step, requiring the test to return a scale target that takes this change
into account.
Make this more robust by re-querying the scale target after the second
step.
Another possible solution is to swap the first and second steps, but
when a test needs to be skipped (i.e., scale target is zero), the setup
would have been in vain.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using mlxsw driver, the configurations are offloaded just in case that
there is a physical port which is enslaved to the virtual device
(e.g., to a bridge). In 'mirror_gre_bridge_1q_lag' test, the bridge gets an
address and route before there are ports in the bridge. It means that these
configurations are not offloaded.
Till now the test passes with mlxsw driver even that the RIF of the
bridge is not in the hardware, because the ARP packets are trapped in
layer 2 and also mirrored, so there is no real need of the RIF in hardware.
The previous patch changed the traps 'ARP_REQUEST' and 'ARP_RESPONSE' to
be done at layer 3 instead of layer 2. With this change the ARP packets are
not trapped during the test, as the RIF is not in the hardware because of
the order of configurations.
Reorder the configurations to make them to be offloaded, then the test will
pass with the change of the traps.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LLVM's lld linker doesn't have a universal architecture support (e.g.,
it definitely doesn't work on s390x), so be safe and force lld for
urandom_read and liburandom_read.so only on x86 architectures.
This should fix s390x CI runs.
Fixes: 3e6fe5ce4d ("libbpf: Fix internal USDT address translation logic for shared libraries")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617045512.1339795-1-andrii@kernel.org
This commit extends selftests for the new BPF helpers
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} to also test the TC BPF
functionality added in the previous commit.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-7-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds selftests for the new BPF helpers:
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6}.
xdp_synproxy_kern.c is a BPF program that generates SYN cookies on
allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
iptables module.
xdp_synproxy.c is a userspace control application that allows to
configure the following options in runtime: list of allowed ports, MSS,
window scale, TTL.
A selftest is added to prog_tests that leverages the above programs to
test the functionality of the new helpers.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-5-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
program to generate SYN cookies in response to TCP SYN packets and to
check those cookies upon receiving the first ACK packet (the final
packet of the TCP handshake).
Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a
listening socket on the local machine, which allows to use them together
with synproxy to accelerate SYN cookie generation.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_tcp_gen_syncookie expects the full length of the TCP header (with
all options), and bpf_tcp_check_syncookie accepts lengths bigger than
sizeof(struct tcphdr). Fix the documentation that says these lengths
should be exactly sizeof(struct tcphdr).
While at it, fix a typo in the name of struct ipv6hdr.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-2-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add tests that ensure sleepable uprobe programs work correctly.
Add tests that ensure sleepable kprobe programs cannot attach.
Also add tests that attach both sleepable and non-sleepable
uprobe programs to the same location (i.e. same bpf_prog_array).
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/c744e5bb7a5c0703f05444dc41f2522ba3579a48.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>