Add a feature detector for kernel-side arg:ctx (__arg_ctx) tag support. If
support is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks, in favor of letting the kernel handle this completely.
The test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in a much simpler, though round-about and
inefficient, way) and skip the tests accordingly. This is done to still be
able to execute this test on older kernels (like in libbpf CI).
Note that the BPF token series ([0]) does a major refactor and moves a lot of
the libbpf-internal feature detection "framework" code, so to avoid unnecessary
conflicts we keep the newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow-up to
re-integrate everything and move the code into its final place in the
newly added (by the BPF token series) features.c file.
[0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=*
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Out of all the special global func arg tag annotations, __arg_ctx is
practically the most immediately useful and the most critical to have
working across a multitude of kernel versions, if possible. This would allow
end users to write much simpler code if __arg_ctx semantics worked for
older kernels that don't natively understand btf_decl_tag("arg:ctx") in
the verifier logic.
Luckily, it is possible to ensure __arg_ctx works on old kernels through
a bit of extra work done by libbpf, at least in a lot of common cases.
To explain the overall idea, we need to go back to how the context argument
was supported in global funcs before __arg_ctx support was added. This
was done based on special struct name checks in the kernel. E.g., for
BPF_PROG_TYPE_PERF_EVENT the expectation is that the argument type `struct
bpf_perf_event_data *` marks that argument as PTR_TO_CTX. This is all
good as long as the global function is used from a single BPF program type
only, which is often not the case. If the same subprog has to be called
from, say, kprobe and perf_event program types, there is no single
definition that would satisfy the BPF verifier. The subprog will have a context
argument either for kprobe (if using the bpf_user_pt_regs_t struct name) or
perf_event (with the bpf_perf_event_data struct name), but not both.
This limitation was the reason to add btf_decl_tag("arg:ctx"), making
the actual argument type unimportant, so that the user can just define a
"generic" signature:
__noinline int global_subprog(void *ctx __arg_ctx) { ... }
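For illustration, here is a minimal sketch of one such generic subprog shared
by two different program types (assuming the usual libbpf BPF-side headers
and the __arg_ctx macro; program and function names are made up):

    /* one shared global subprog; __arg_ctx tells the verifier ctx is PTR_TO_CTX */
    __noinline int global_subprog(void *ctx __arg_ctx)
    {
    	return bpf_get_smp_processor_id();
    }

    SEC("kprobe/do_sys_openat2")
    int handle_kprobe(struct pt_regs *ctx)
    {
    	return global_subprog(ctx);
    }

    SEC("perf_event")
    int handle_perf_event(struct bpf_perf_event_data *ctx)
    {
    	return global_subprog(ctx);
    }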
I won't belabor how libbpf implements subprograms; see the huge
comment next to the bpf_object_relocate_calls() function. The idea is that
each main/entry BPF program gets its own copy of global_subprog's code
appended.
This per-program copy of global subprog code *and* associated func_info
.BTF.ext information, pointing to FUNC -> FUNC_PROTO BTF type chain
allows libbpf to simulate __arg_ctx behavior transparently, even if the
kernel doesn't yet support __arg_ctx annotation natively.
The idea is straightforward: each time we append global subprog's code
and func_info information, we adjust its FUNC -> FUNC_PROTO type
information, if necessary (that is, libbpf can detect the presence of
btf_decl_tag("arg:ctx") just like BPF verifier would do it).
The rest is just mechanical and somewhat painful BTF manipulation code.
It's painful because we need to clone FUNC -> FUNC_PROTO, instead of
reusing it, as same FUNC -> FUNC_PROTO chain might be used by another
main BPF program within the same BPF object, so we can't just modify it
in-place (and cloning BTF types within the same struct btf object is
painful due to constant memory invalidation, see comments in code).
Uploaded BPF object's BTF information has to work for all BPF
programs at the same time.
Once we have FUNC -> FUNC_PROTO clones, we make sure that instead of
using some `void *ctx` parameter definition, we have the expected `struct
bpf_perf_event_data *ctx` definition (as far as the BPF verifier and kernel
are concerned), which will mark it as context for the BPF verifier. The same
global subprog relocated and copied into another main BPF program will
get different type information according to that main program's type. It all
works out in the end in a completely transparent way for the end user.
Libbpf maintains an internal program type -> expected context struct name
mapping. Note that not all BPF program types have a named context
struct, so this approach won't work for such programs (just as it
didn't before __arg_ctx). Native __arg_ctx support in the kernel therefore
remains important for generic context support across all BPF program types.
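Conceptually (this is an illustrative sketch, not the literal libbpf data
structure), the mapping is just a small table keyed by program type, using the
struct names mentioned above:

    static const struct {
    	enum bpf_prog_type prog_type;
    	const char *ctx_name;	/* expected context struct name */
    } ctx_name_map[] = {
    	{ BPF_PROG_TYPE_KPROBE,     "bpf_user_pt_regs_t" },
    	{ BPF_PROG_TYPE_PERF_EVENT, "bpf_perf_event_data" },
    	/* program types without a named context struct have no entry */
    };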
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With all the preparations in previous patches done, we are ready to
postpone the BTF loading and sanitization step until after all the
relocations are performed.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move the logic of finding and assigning exception callback indices from
the BTF sanitization step to the program relocations step, which seems more
logical and will unblock moving BTF loading to after the relocation step.
Exception callback discovery and assignment has no dependency on BTF
being loaded into the kernel; it only uses BTF information. It does need
to happen before subprogram relocations happen, though, which is why the
split.
No functional changes.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move map creation to later during BPF object loading by pre-creating
stable placeholder FDs (utilizing memfd_create()). Use dup2()
syscall to then atomically make those placeholder FDs point to real
kernel BPF map objects.
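The technique boils down to the following sketch (error handling trimmed;
function names here are illustrative, not the literal libbpf helpers):

    /* build with _GNU_SOURCE for memfd_create() on glibc */
    #include <sys/mman.h>	/* memfd_create(), MFD_CLOEXEC */
    #include <unistd.h>	/* dup2(), close() */

    /* reserve a stable FD number early, before the real map exists */
    static int create_placeholder_fd(void)
    {
    	return memfd_create("libbpf-placeholder-fd", MFD_CLOEXEC);
    }

    /* later, atomically redirect the cached FD to the real BPF map */
    static int swap_in_real_fd(int placeholder_fd, int real_map_fd)
    {
    	if (dup2(real_map_fd, placeholder_fd) < 0)
    		return -1;
    	close(real_map_fd);
    	return 0;
    }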
This change allows delaying BPF map creation until after all the BPF
program relocations. That, in turn, allows delaying BTF finalization and
loading into the kernel until after all the relocations as well. We'll take
advantage of the latter in subsequent patches to allow libbpf to adjust
BTF in a way that helps with BPF global function usage.
Clean up a few places where we close map->fd, which now shouldn't
happen, because map->fd should be a valid FD regardless of whether the map
was created or not. Surprisingly and nicely, this simplifies a bunch of
error handling code. If this change doesn't backfire, I'm tempted to
pre-create such stable FDs for other entities (progs, maybe even BTF).
We previously did some manipulations to make gen_loader work with fake
map FDs; with stable map FDs this hack is not necessary for maps (we
still have it for BTF, but I left it as is for now).
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With the upcoming switch to preallocated placeholder FDs for maps,
switch various getters/setters away from checking map->fd. Use the
map_is_created() helper, which detects whether a BPF map can be modified based
on map->obj->loaded state, with a special provision for maps set up with
bpf_map__reuse_fd().
For backwards compatibility, we take map_is_created() into account in
bpf_map__fd() getter as well. This way before bpf_object__load() phase
bpf_map__fd() will always return -1, just as before the changes in
subsequent patches adding stable map->fd placeholders.
We also get rid of all internal uses of bpf_map__fd() getter, as it's
more oriented for uses external to libbpf. The above map_is_created()
check actually interferes with some of the internal uses, if map FD is
fetched through bpf_map__fd().
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of inferring whether a map already points to a previously
created/pinned BPF map (which the user can specify with the bpf_map__reuse_fd() API),
use an explicit map->reused flag that is set in such a case.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It makes future grepping and code analysis a bit easier.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240104013847.3875810-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
An issue occurred while reading an ELF file in libbpf.c during fuzzing:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
4206 in libbpf.c
(gdb) bt
#0 0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
#1 0x000000000094f9d6 in bpf_object.collect_relos () at libbpf.c:6706
#2 0x000000000092bef3 in bpf_object_open () at libbpf.c:7437
#3 0x000000000092c046 in bpf_object.open_mem () at libbpf.c:7497
#4 0x0000000000924afa in LLVMFuzzerTestOneInput () at fuzz/bpf-object-fuzzer.c:16
#5 0x000000000060be11 in testblitz_engine::fuzzer::Fuzzer::run_one ()
#6 0x000000000087ad92 in tracing::span::Span::in_scope ()
#7 0x00000000006078aa in testblitz_engine::fuzzer::util::walkdir ()
#8 0x00000000005f3217 in testblitz_engine::entrypoint::main::{{closure}} ()
#9 0x00000000005f2601 in main ()
(gdb)
scn_data was NULL at this code (tools/lib/bpf/src/libbpf.c):
if (rel->r_offset % BPF_INSN_SZ || rel->r_offset >= scn_data->d_size) {
The scn_data is derived from the code above it:
scn = elf_sec_by_idx(obj, sec_idx);
scn_data = elf_sec_data(obj, scn);
relo_sec_name = elf_sec_str(obj, shdr->sh_name);
sec_name = elf_sec_name(obj, scn);
if (!relo_sec_name || !sec_name) // scn_data itself is never checked for NULL
	return -EINVAL;
In certain special scenarios, such as reading a malformed ELF file,
it is possible for scn_data to be a NULL pointer.
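One possible shape of the fix is an early bail-out right after obtaining the
section data (a sketch only; the exact error code and placement in the actual
patch may differ):

    scn = elf_sec_by_idx(obj, sec_idx);
    scn_data = elf_sec_data(obj, scn);
    if (!scn_data)	/* malformed ELF: no data for this section */
    	return -EINVAL;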
Signed-off-by: Mingyi Zhang <zhangmingyi5@huawei.com>
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Changye Wu <wuchangye@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231221033947.154564-1-liuxin350@huawei.com
clang can generate (with -g -Wa,--compress-debug-sections) 4-byte
aligned DWARF sections that declare themselves to be 8-byte aligned in
the section header. Since DWARF sections are dropped during linking
anyway, just skip running the sanity checks on them.
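A rough sketch of the idea, assuming a helper along these lines (the name and
exact placement in linker.c are illustrative):

    #include <stdbool.h>
    #include <string.h>

    /* DWARF sections are all named ".debug_*"; they are dropped at link time,
     * so the alignment sanity checks can be skipped for them. */
    static bool is_dwarf_sec_name(const char *name)
    {
    	return strncmp(name, ".debug_", sizeof(".debug_") - 1) == 0;
    }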
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/ZXcFRJVKbKxtEL5t@nz.home/
Link: https://lore.kernel.org/bpf/20231219110324.8989-1-hi@alyssa.is
Add a set of __arg_xxx macros which can be used to augment BPF global
subprogs/functions with extra information for use by BPF verifier.
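As a sketch of what such annotations boil down to (the authoritative macro set
and spellings live in libbpf's headers; treat this as illustrative):

    #define __arg_ctx	__attribute__((btf_decl_tag("arg:ctx")))
    #define __arg_nonnull	__attribute__((btf_decl_tag("arg:nonnull")))

    /* usage: extra verifier knowledge about global subprog arguments */
    __noinline int process_value(void *ctx __arg_ctx, int *value __arg_nonnull)
    {
    	return *value;
    }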
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To allow an external admin authority to override the default BPF FS location
(/sys/fs/bpf) for implicit BPF token creation, teach libbpf to recognize the
LIBBPF_BPF_TOKEN_PATH envvar. If it is specified and the user application
didn't explicitly specify either the bpf_token_path or the bpf_token_fd
option, it will be treated exactly like the bpf_token_path option,
overriding the default /sys/fs/bpf location and making the BPF token mandatory.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add BPF token support to BPF object-level functionality.
BPF token is supported by BPF object logic either as an explicitly
provided BPF token from outside (through BPF FS path or explicit BPF
token FD), or implicitly (unless prevented through
bpf_object_open_opts).
Implicit mode is assumed to be the most common one for user namespaced
unprivileged workloads. The assumption is that privileged container
manager sets up default BPF FS mount point at /sys/fs/bpf with BPF token
delegation options (delegate_{cmds,maps,progs,attachs} mount options).
BPF object during loading will attempt to create BPF token from
/sys/fs/bpf location, and pass it for all relevant operations
(currently, map creation, BTF load, and program load).
In this implicit mode, if BPF token creation fails due to whatever
reason (BPF FS is not mounted, or kernel doesn't support BPF token,
etc), this is not considered an error. BPF object loading sequence will
proceed with no BPF token.
In explicit BPF token mode, the user explicitly provides either a custom BPF
FS mount point path or creates a BPF token on their own and just passes the
token FD directly. In such a case, the BPF object will either dup() the token FD
(so the caller is not required to hold onto it for the entire duration of the
BPF object's lifetime) or will attempt to create a BPF token from the provided
BPF FS location. If BPF token creation fails, that is considered a critical
error and the BPF object load fails with an error.
Libbpf provides a way to disable implicit BPF token creation, if it
causes any trouble (BPF token is designed to be completely optional and
shouldn't cause any problems even if provided, but in the world of BPF
LSM, custom security logic can be installed that might change the outcome
depending on the presence of a BPF token). To disable libbpf's default BPF
token creation behavior, the user should provide either an invalid (negative)
BPF token FD or an empty bpf_token_path option.
BPF token presence can influence libbpf's feature probing, so if BPF
object has associated BPF token, feature probing is instructed to use
BPF object-specific feature detection cache and token FD.
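For illustration, an explicit-mode open might look roughly like this (a sketch
using the bpf_token_path option named above; the path and object file name are
made up):

    LIBBPF_OPTS(bpf_object_open_opts, opts,
    	/* explicit mode: create a BPF token from a custom BPF FS mount */
    	.bpf_token_path = "/sys/fs/bpf-delegated");
    struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", &opts);

    /* implicit mode is the default (no option set); an empty bpf_token_path
     * or a negative bpf_token_fd disables implicit token creation, per above */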
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adjust feature probing callbacks to take into account an optional token_fd.
In unprivileged contexts, some feature detectors would fail to detect
kernel support just because a BPF program, BPF map, or BTF object can't be
loaded due to the privileged nature of those operations. So when a BPF object
is loaded with a BPF token, this token should be used for feature probing.
This patch sets up support for this scenario, but we don't yet pass a
non-zero token FD. That will be added in the next patch.
We also switch the BPF cookie detector from using a kprobe program to a
tracepoint one, as a tracepoint is a somewhat less dangerous BPF program
type and has a higher likelihood of being allowed through a BPF token in the
future. This change has no effect on detection behavior.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's quite a lot of well-isolated code, so it seems like a good
candidate to move out of libbpf.c to reduce that file's size.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a feat_supported() helper that accepts a feature cache instead of
a bpf_object. This allows low-level code in bpf.c to not know or care
about the higher-level concept of a bpf_object, yet it will be able to utilize
custom feature checking in cases where a BPF token might influence the
outcome.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Split the list of supported feature detectors with their corresponding
callbacks from the actual cached supported/missing values. This will allow
more flexible per-token or per-object feature detectors in
subsequent refactorings.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
=== Motivation ===
Similar to reading from CO-RE bitfields, we need a CO-RE aware bitfield
writing wrapper to make the verifier happy.
Two alternatives to this approach are:
1. Use the upcoming `preserve_static_offset` [0] attribute to disable
CO-RE on specific structs.
2. Use broader byte-sized writes to write to bitfields.
(1) is a bit hard to use. It requires specific and not-very-obvious
annotations to the bpftool-generated vmlinux.h. It's also not generally
available in released LLVM versions yet.
(2) makes the code quite hard to read and write. And especially if
BPF_CORE_READ_BITFIELD() is already being used, it makes more sense
to have an inverse helper for writing.
=== Implementation details ===
Since the logic is a bit non-obvious, I thought it would be helpful
to explain exactly what's going on.
To start, it helps to explain what LSHIFT_U64 (lshift) and RSHIFT_U64
(rshift) are designed to mean. Consider the core of the
BPF_CORE_READ_BITFIELD() algorithm:
val <<= __CORE_RELO(s, field, LSHIFT_U64);
val = val >> __CORE_RELO(s, field, RSHIFT_U64);
Basically what happens is we lshift to clear the non-relevant (blank)
higher order bits. Then we rshift to bring the relevant bits (bitfield)
down to LSB position (while also clearing blank lower order bits). To
illustrate:
Start: ........XXX......
Lshift: XXX......00000000
Rshift: 00000000000000XXX
where `.` means blank bit, `0` means 0 bit, and `X` means bitfield bit.
After the two operations, the bitfield is ready to be interpreted as a
regular integer.
Next, we want to build an alternative (but more helpful) mental model
on lshift and rshift. That is, to consider:
* rshift as the total number of blank bits in the u64
* lshift as number of blank bits left of the bitfield in the u64
Take a moment to consider why that is true by consulting the above
diagram.
With this insight, we can now define the following relationship:
            bitfield
                _
               | |
       0.....00XXX0...00
       |      |   |    |
       |______|   |    |
        lshift    |    |
                  |____|
            (rshift - lshift)
That is, we know the number of higher order blank bits is just lshift.
And the number of lower order blank bits is (rshift - lshift).
Finally, we can examine the core of the write side algorithm:
mask = (~0ULL << rshift) >> lshift; // 1
val = (val & ~mask) | ((nval << rpad) & mask); // 2
1. Compute a mask where the set bits are the bitfield bits. The first
left shift zeros out exactly the number of blank bits, leaving a
bitfield sized set of 1s. The subsequent right shift inserts the
correct amount of higher order blank bits.
2. On the left of the `|`, mask out the bitfield bits. This creates
0s where the new bitfield bits will go. On the right of the `|`,
bring nval into the correct bit position and mask out any bits
that fall outside of the bitfield. Finally, by OR'ing the two
halves, we get the final set of bits to write back.
[0]: https://reviews.llvm.org/D133361
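Putting the pieces together, a simplified sketch of such a write helper could
look as follows (relocation accessors as named above; details of the real
macro, e.g. endianness handling, may differ):

    #define BPF_CORE_WRITE_BITFIELD_SKETCH(s, field, new_val) ({	\
    	void *p = (void *)(s) + __CORE_RELO(s, field, BYTE_OFFSET);	\
    	unsigned int byte_size = __CORE_RELO(s, field, BYTE_SIZE);	\
    	unsigned int lshift = __CORE_RELO(s, field, LSHIFT_U64);	\
    	unsigned int rshift = __CORE_RELO(s, field, RSHIFT_U64);	\
    	unsigned int rpad = rshift - lshift;	\
    	unsigned long long mask, val, nval = (new_val);	\
    	\
    	/* load the storage unit holding the bitfield */	\
    	switch (byte_size) {	\
    	case 1: val = *(unsigned char *)p; break;	\
    	case 2: val = *(unsigned short *)p; break;	\
    	case 4: val = *(unsigned int *)p; break;	\
    	case 8: val = *(unsigned long long *)p; break;	\
    	}	\
    	mask = (~0ULL << rshift) >> lshift;		/* step 1 */	\
    	val = (val & ~mask) | ((nval << rpad) & mask);	/* step 2 */	\
    	/* store the merged value back */	\
    	switch (byte_size) {	\
    	case 1: *(unsigned char *)p = val; break;	\
    	case 2: *(unsigned short *)p = val; break;	\
    	case 4: *(unsigned int *)p = val; break;	\
    	case 8: *(unsigned long long *)p = val; break;	\
    	}	\
    })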
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jonathan Lemon <jlemon@aviatrix.com>
Signed-off-by: Jonathan Lemon <jlemon@aviatrix.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/4d3dd215a4fd57d980733886f9c11a45e1a9adf3.1702325874.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Before the change, the `systemd` build on `i686-linux` failed as:
$ bpftool gen object src/core/bpf/socket_bind/socket-bind.bpf.o src/core/bpf/socket_bind/socket-bind.bpf.unstripped.o
Error: failed to link 'src/core/bpf/socket_bind/socket-bind.bpf.unstripped.o': Invalid argument (22)
After the change it fails as:
$ bpftool gen object src/core/bpf/socket_bind/socket-bind.bpf.o src/core/bpf/socket_bind/socket-bind.bpf.unstripped.o
libbpf: ELF section #9 has inconsistent alignment addr=8 != d=4 in src/core/bpf/socket_bind/socket-bind.bpf.unstripped.o
Error: failed to link 'src/core/bpf/socket_bind/socket-bind.bpf.unstripped.o': Invalid argument (22)
Now it's slightly easier to figure out what is wrong with an ELF file.
Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231208215100.435876-1-slyich@gmail.com
In libbpf, when determining whether we need to load vmlinux btf, we're
currently (among other things) checking whether there is any struct_ops
program present in the object. This works for most realistic struct_ops
maps, as a struct_ops map is of course typically composed of one or more
struct_ops programs. However, that technically need not be the case. A
struct_ops interface could be defined which allows a map to be specified
which one or more non-prog fields, and which provides default behavior
if no struct_ops progs is actually provided otherwise. For sched_ext,
for example, you technically only need to specify the name of the
scheduler in the struct_ops map, with the core scheduler logic providing
default behavior if no prog is actually specified.
If we were to define and try to load such a struct_ops map, we would
crash in libbpf when initializing it as obj->btf_vmlinux will be NULL:
Reading symbols from minimal...
(gdb) r
Starting program: minimal_example
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x000055555558308c in btf__type_cnt (btf=0x0) at btf.c:612
612 return btf->start_id + btf->nr_types;
(gdb) bt
type_name=0x5555555d99e3 "sched_ext_ops", kind=4) at btf.c:914
kind=4) at btf.c:942
type=0x7fffffffe558, type_id=0x7fffffffe548, ...
data_member=0x7fffffffe568) at libbpf.c:948
kern_btf=0x0) at libbpf.c:1017
at libbpf.c:8059
So as to account for such bare-bones struct_ops maps, let's update
obj_needs_vmlinux_btf() to also iterate over an obj's maps and check
whether any of them are struct_ops maps.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/20231208061704.400463-1-void@manifault.com
Allow the user to specify a token_fd for the bpf_btf_load() API that wraps the
kernel's BPF_BTF_LOAD command. This allows loading BTF from an unprivileged
process as long as it has a BPF token allowing the BPF_BTF_LOAD command, which
can be created and delegated by a privileged process.
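A usage sketch, assuming the token FD lands in bpf_btf_load_opts as described
(local variable names are illustrative):

    LIBBPF_OPTS(bpf_btf_load_opts, opts,
    	.token_fd = token_fd /* BPF token delegating BPF_BTF_LOAD */);
    int btf_fd = bpf_btf_load(btf_data, btf_size, &opts);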
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-15-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We need to get offsets for static variables in following changes,
so make elf_resolve_syms_offsets take an st_type value as an argument
and pass it to elf_sym_iter_new.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231125193130.834322-2-jolsa@kernel.org
Build
-----
* Compile BPF programs by default if clang (>= 12.0.1) is available to
enable more features like kernel lock contention, off-cpu profiling,
kwork, sample filtering and so on. It can be disabled by passing
BUILD_BPF_SKEL=0 to make.
* Produce better error messages for bison on debug build (make DEBUG=1)
by defining YYDEBUG symbol internally.
perf record
-----------
* Track sideband events (like FORK/MMAP) from all CPUs even if perf record
targets a subset of CPUs only (using -C option). Otherwise it may lose
some information that happened on a CPU outside the target list.
* Fix checking raw sched_switch tracepoint argument using system BTF.
This affects off-cpu profiling which attaches a BPF program to the raw
tracepoint.
perf lock contention
--------------------
* Add --lock-cgroup option to see contention by cgroups. This should be
used with BPF only (using -b option).
$ sudo perf lock con -ab --lock-cgroup -- sleep 1
 contended   total wait     max wait     avg wait                    cgroup
       835     14.06 ms     41.19 us     16.83 us   /system.slice/led.service
        25    122.38 us     13.77 us      4.89 us   /
        44     23.73 us      3.87 us       539 ns   /user.slice/user-657345.slice/session-c4.scope
         1       491 ns       491 ns       491 ns   /system.slice/connectd.service
* Add -G/--cgroup-filter option to see contention only for given cgroups.
This can be useful when you identified a cgroup in the above command and
want to investigate more on it. It also works with other output options
like -t/--threads and -l/--lock-addr.
$ sudo perf lock con -ab -G /user.slice/user-657345.slice/session-c4.scope -- sleep 1
 contended   total wait     max wait     avg wait         type   caller
         8     77.11 us     17.98 us      9.64 us     spinlock   futex_wake+0xc8
         2     24.56 us     14.66 us     12.28 us     spinlock   tick_do_update_jiffies64+0x25
         1      4.97 us      4.97 us      4.97 us     spinlock   futex_q_lock+0x2a
* Use per-cpu array for better spinlock tracking. This is to improve
performance of the BPF program and to avoid nested contention on a lock
in the BPF hash map.
* Update callstack check for PowerPC. To find a representative caller of a
lock, it needs to look up the call stacks. It ends the lookup when it sees
0 in the call stack buffer. However, PowerPC call stacks can have 0 values
in the beginning so skip them when it expects valid call stacks after.
perf kwork
----------
* Support 'sched' class (for -k option) so that it can see task scheduling
event (using sched_switch tracepoint) as well as irq and workqueue items.
* Add perf kwork top subcommand to show more accurate cpu utilization with
the sched class above. It works both with recorded data (using the perf kwork
record command) and with BPF (using the -b option). Unlike the perf top command,
it does not support interactive mode (yet).
$ sudo perf kwork top -b -k sched
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 160702.425 ms, 8 cpus
%Cpu(s): 36.00% id, 0.00% hi, 0.00% si
%Cpu0 [|||||||||||||||||| 61.66%]
%Cpu1 [|||||||||||||||||| 61.27%]
%Cpu2 [||||||||||||||||||| 66.40%]
%Cpu3 [|||||||||||||||||| 61.28%]
%Cpu4 [|||||||||||||||||| 61.82%]
%Cpu5 [||||||||||||||||||||||| 77.41%]
%Cpu6 [|||||||||||||||||| 61.73%]
%Cpu7 [|||||||||||||||||| 63.25%]
PID SPID %CPU RUNTIME COMMMAND
-------------------------------------------------------------
0 0 38.72 8089.463 ms [swapper/1]
0 0 38.71 8084.547 ms [swapper/3]
0 0 38.33 8007.532 ms [swapper/0]
0 0 38.26 7992.985 ms [swapper/6]
0 0 38.17 7971.865 ms [swapper/4]
0 0 36.74 7447.765 ms [swapper/7]
0 0 33.59 6486.942 ms [swapper/2]
0 0 22.58 3771.268 ms [swapper/5]
9545 9351 2.48 447.136 ms sched-messaging
9574 9351 2.09 418.583 ms sched-messaging
9724 9351 2.05 372.407 ms sched-messaging
9531 9351 2.01 368.804 ms sched-messaging
9512 9351 2.00 362.250 ms sched-messaging
9514 9351 1.95 357.767 ms sched-messaging
9538 9351 1.86 384.476 ms sched-messaging
9712 9351 1.84 386.490 ms sched-messaging
9723 9351 1.83 380.021 ms sched-messaging
9722 9351 1.82 382.738 ms sched-messaging
9517 9351 1.81 354.794 ms sched-messaging
9559 9351 1.79 344.305 ms sched-messaging
9725 9351 1.77 365.315 ms sched-messaging
<SNIP>
* Add hard/soft-irq statistics to perf kwork top. This will show the
total CPU utilization with IRQ stats like below:
$ sudo perf kwork top -b -k sched,irq,softirq
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 12554.889 ms, 8 cpus
%Cpu(s): 96.23% id, 0.10% hi, 0.19% si <---- here
%Cpu0 [| 4.60%]
%Cpu1 [| 4.59%]
%Cpu2 [ 2.73%]
%Cpu3 [| 3.81%]
<SNIP>
perf bench
----------
* Add -G/--cgroups option to perf bench sched pipe. The pipe bench is
good to measure context switch overhead. With this option, it puts
the reader and writer tasks in separate cgroups to enforce context
switch between two different cgroups.
It's also necessary to pin the tasks to a single CPU to accurately
measure the impact of cgroup context switches.
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.307 [sec]
3.078180 usecs/op
324867 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000':
200,026 context-switches
63 cgroup-switches
0.321637922 seconds time elapsed
You can see a small number of cgroup-switches because both the writer and
reader tasks are in the same cgroup.
$ sudo mkdir /sys/fs/cgroup/{AAA,BBB}
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.351 [sec]
3.512990 usecs/op
284657 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB':
200,020 context-switches
200,019 cgroup-switches
0.365034567 seconds time elapsed
Now context-switches and cgroup-switches are almost the same, and you can
see the pipe operations took a little longer.
* Kill child processes when perf bench sched messaging exited abnormally.
Otherwise it'd leave the child doing unnecessary work.
perf test
---------
* Fix various shellcheck issues on the tests written in shell script.
* Skip tests when condition is not satisfied:
- object code reading test for non-text section addresses.
- CoreSight test if cs_etm// event is not available.
- lock contention test if not enough CPUs.
Event parsing
-------------
* Make PMU alias name loading lazy to reduce the startup time in the
event parsing code for perf record, stat and others in the general
case.
* Lazily compute PMU default config. In the same sense, delay PMU
initialization until it's really needed to reduce the startup cost.
* Fix event term values that are raw events. The event specification
can have several terms including event name. But sometimes it clashes
with raw event encoding which starts with 'r' and has hex-digits.
For example, an event named 'read' should be processed as a normal
event but it was mis-treated as a raw encoding and caused a failure.
$ perf stat -e 'uncore_imc_free_running/event=read/' -a sleep 1
event syntax error: '..nning/event=read/'
\___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
Event metrics
-------------
* Add "Compat" regex to match event with multiple identifiers.
* Usual updates for Intel, Power10, Arm telemetry/CMN and AmpereOne.
Misc
----
* Assorted memory leak fixes and footprint reduction.
* Add "bpf_skeletons" to perf version --build-options so that users can
check whether their perf tools have BPF support easily.
* Fix unaligned access in Intel-PT packet decoder found by undefined-behavior
sanitizer.
* Avoid frequency mode for the dummy event. Surprisingly it'd impact
kernel timer tick handler performance by force iterating all PMU events.
* Update bash shell completion for events and metrics.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Merge tag 'perf-tools-for-v6.7-1-2023-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Namhyung Kim:
"Build:
- Compile BPF programs by default if clang (>= 12.0.1) is available
to enable more features like kernel lock contention, off-cpu
profiling, kwork, sample filtering and so on.
This can be disabled by passing BUILD_BPF_SKEL=0 to make.
- Produce better error messages for bison on debug build (make
DEBUG=1) by defining YYDEBUG symbol internally.
perf record:
- Track sideband events (like FORK/MMAP) from all CPUs even if perf
record targets a subset of CPUs only (using -C option). Otherwise
it may lose some information that happened on a CPU outside the target
list.
- Fix checking raw sched_switch tracepoint argument using system BTF.
This affects off-cpu profiling which attaches a BPF program to the
raw tracepoint.
perf lock contention:
- Add --lock-cgroup option to see contention by cgroups. This should
be used with BPF only (using -b option).
$ sudo perf lock con -ab --lock-cgroup -- sleep 1
 contended   total wait     max wait     avg wait                    cgroup
       835     14.06 ms     41.19 us     16.83 us   /system.slice/led.service
        25    122.38 us     13.77 us      4.89 us   /
        44     23.73 us      3.87 us       539 ns   /user.slice/user-657345.slice/session-c4.scope
         1       491 ns       491 ns       491 ns   /system.slice/connectd.service
- Add -G/--cgroup-filter option to see contention only for given
cgroups.
This can be useful when you identified a cgroup in the above
command and want to investigate more on it. It also works with
other output options like -t/--threads and -l/--lock-addr.
$ sudo perf lock con -ab -G /user.slice/user-657345.slice/session-c4.scope -- sleep 1
 contended   total wait     max wait     avg wait         type   caller
         8     77.11 us     17.98 us      9.64 us     spinlock   futex_wake+0xc8
         2     24.56 us     14.66 us     12.28 us     spinlock   tick_do_update_jiffies64+0x25
         1      4.97 us      4.97 us      4.97 us     spinlock   futex_q_lock+0x2a
- Use per-cpu array for better spinlock tracking. This is to improve
performance of the BPF program and to avoid nested contention on a
lock in the BPF hash map.
- Update callstack check for PowerPC. To find a representative caller
of a lock, it needs to look up the call stacks. It ends the lookup
when it sees 0 in the call stack buffer. However, PowerPC call
stacks can have 0 values in the beginning so skip them when it
expects valid call stacks after.
perf kwork:
- Support 'sched' class (for -k option) so that it can see task
scheduling event (using sched_switch tracepoint) as well as irq and
workqueue items.
- Add perf kwork top subcommand to show more accurate cpu utilization
with the sched class above. It works both with recorded data (using the
perf kwork record command) and with BPF (using the -b option). Unlike the perf
top command, it does not support interactive mode (yet).
$ sudo perf kwork top -b -k sched
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 160702.425 ms, 8 cpus
%Cpu(s): 36.00% id, 0.00% hi, 0.00% si
%Cpu0 [|||||||||||||||||| 61.66%]
%Cpu1 [|||||||||||||||||| 61.27%]
%Cpu2 [||||||||||||||||||| 66.40%]
%Cpu3 [|||||||||||||||||| 61.28%]
%Cpu4 [|||||||||||||||||| 61.82%]
%Cpu5 [||||||||||||||||||||||| 77.41%]
%Cpu6 [|||||||||||||||||| 61.73%]
%Cpu7 [|||||||||||||||||| 63.25%]
PID SPID %CPU RUNTIME COMMMAND
-------------------------------------------------------------
0 0 38.72 8089.463 ms [swapper/1]
0 0 38.71 8084.547 ms [swapper/3]
0 0 38.33 8007.532 ms [swapper/0]
0 0 38.26 7992.985 ms [swapper/6]
0 0 38.17 7971.865 ms [swapper/4]
0 0 36.74 7447.765 ms [swapper/7]
0 0 33.59 6486.942 ms [swapper/2]
0 0 22.58 3771.268 ms [swapper/5]
9545 9351 2.48 447.136 ms sched-messaging
9574 9351 2.09 418.583 ms sched-messaging
9724 9351 2.05 372.407 ms sched-messaging
9531 9351 2.01 368.804 ms sched-messaging
9512 9351 2.00 362.250 ms sched-messaging
9514 9351 1.95 357.767 ms sched-messaging
9538 9351 1.86 384.476 ms sched-messaging
9712 9351 1.84 386.490 ms sched-messaging
9723 9351 1.83 380.021 ms sched-messaging
9722 9351 1.82 382.738 ms sched-messaging
9517 9351 1.81 354.794 ms sched-messaging
9559 9351 1.79 344.305 ms sched-messaging
9725 9351 1.77 365.315 ms sched-messaging
<SNIP>
- Add hard/soft-irq statistics to perf kwork top. This will show the
total CPU utilization with IRQ stats like below:
$ sudo perf kwork top -b -k sched,irq,softirq
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 12554.889 ms, 8 cpus
%Cpu(s): 96.23% id, 0.10% hi, 0.19% si <---- here
%Cpu0 [| 4.60%]
%Cpu1 [| 4.59%]
%Cpu2 [ 2.73%]
%Cpu3 [| 3.81%]
<SNIP>
perf bench:
- Add -G/--cgroups option to perf bench sched pipe. The pipe bench is
good to measure context switch overhead. With this option, it puts
the reader and writer tasks in separate cgroups to enforce context
switch between two different cgroups.
It's also necessary to pin the tasks to a single CPU to
accurately measure the impact of cgroup context switches.
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.307 [sec]
3.078180 usecs/op
324867 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000':
200,026 context-switches
63 cgroup-switches
0.321637922 seconds time elapsed
You can see a small number of cgroup-switches because both the writer and
reader tasks are in the same cgroup.
$ sudo mkdir /sys/fs/cgroup/{AAA,BBB}
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.351 [sec]
3.512990 usecs/op
284657 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB':
200,020 context-switches
200,019 cgroup-switches
0.365034567 seconds time elapsed
Now context-switches and cgroup-switches are almost the same, and you
can see the pipe operations took a little longer.
- Kill child processes when perf bench sched messaging exited
abnormally. Otherwise it'd leave the child doing unnecessary work.
perf test:
- Fix various shellcheck issues on the tests written in shell script.
- Skip tests when condition is not satisfied:
- object code reading test for non-text section addresses.
- CoreSight test if cs_etm// event is not available.
- lock contention test if not enough CPUs.
Event parsing:
- Make PMU alias name loading lazy to reduce the startup time in the
event parsing code for perf record, stat and others in the general
case.
- Lazily compute PMU default config. In the same sense, delay PMU
initialization until it's really needed to reduce the startup cost.
- Fix event term values that are raw events. The event specification
can have several terms including event name. But sometimes it
clashes with raw event encoding which starts with 'r' and has
hex-digits.
For example, an event named 'read' should be processed as a normal
event but it was mis-treated as a raw encoding and caused a
failure.
$ perf stat -e 'uncore_imc_free_running/event=read/' -a sleep 1
event syntax error: '..nning/event=read/'
\___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
Event metrics:
- Add "Compat" regex to match event with multiple identifiers.
- Usual updates for Intel, Power10, Arm telemetry/CMN and AmpereOne.
Misc:
- Assorted memory leak fixes and footprint reduction.
- Add "bpf_skeletons" to perf version --build-options so that users
can check whether their perf tools have BPF support easily.
- Fix unaligned access in Intel-PT packet decoder found by
undefined-behavior sanitizer.
- Avoid frequency mode for the dummy event. Surprisingly it'd impact
kernel timer tick handler performance by force iterating all PMU
events.
- Update bash shell completion for events and metrics"
* tag 'perf-tools-for-v6.7-1-2023-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (187 commits)
perf vendor events intel: Update tsx_cycles_per_elision metrics
perf vendor events intel: Update bonnell version number to v5
perf vendor events intel: Update westmereex events to v4
perf vendor events intel: Update meteorlake events to v1.06
perf vendor events intel: Update knightslanding events to v16
perf vendor events intel: Add typo fix for ivybridge FP
perf vendor events intel: Update a spelling in haswell/haswellx
perf vendor events intel: Update emeraldrapids to v1.01
perf vendor events intel: Update alderlake/alderlake events to v1.23
perf build: Disable BPF skeletons if clang version is < 12.0.1
perf callchain: Fix spelling mistake "statisitcs" -> "statistics"
perf report: Fix spelling mistake "heirachy" -> "hierarchy"
perf python: Fix binding linkage due to rename and move of evsel__increase_rlimit()
perf tests: test_arm_coresight: Simplify source iteration
perf vendor events intel: Add tigerlake two metrics
perf vendor events intel: Add broadwellde two metrics
perf vendor events intel: Fix broadwellde tma_info_system_dram_bw_use metric
perf mem_info: Add and use map_symbol__exit and addr_map_symbol__exit
perf callchain: Minor layout changes to callchain_list
perf callchain: Make brtype_stat in callchain_list optional
...
Comparing pointers when reference count checking is enabled is tricky to do
without a SEGV. Add a convenience macro to simplify this and use it.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Make the implicit REFCOUNT_CHECKING robust when building with GCC.
Fixes: 9be6ab181b ("libperf rc_check: Enable implicitly with sanitizers")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
This adds the bpf_program__attach_netkit() API to libbpf. Overall it is very
similar to tcx. The API looks as follows:
LIBBPF_API struct bpf_link *
bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
const struct bpf_netkit_opts *opts);
The struct bpf_netkit_opts is modeled in a similar way as struct bpf_tcx_opts
for supporting bpf_mprog control parameters. The attach location for the
primary and peer device is derived from the program section "netkit/primary"
and "netkit/peer", respectively.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-4-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Fix the too-eager assumption that the SHT_GNU_verdef ELF section is going to be
present whenever a binary has an SHT_GNU_versym section. It seems like either
SHT_GNU_verdef or SHT_GNU_verneed can be used, so failing on a missing
SHT_GNU_verdef actually breaks use cases in production.
One specific reported issue, which was used to manually test this fix,
was trying to attach to the `readline` function in the BASH binary.
Fixes: bb7fa09399 ("libbpf: Support symbol versioning for uprobe")
Reported-by: Liam Wisehart <liamwisehart@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Manu Bretelle <chantr4@gmail.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Link: https://lore.kernel.org/bpf/20231016182840.4033346-1-andrii@kernel.org
io__getline will free the line on error but it doesn't clear the out
argument. This may lead to the line being freed twice, like in
tools/perf/util/srcline.c as detected by clang-tidy.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: llvm@lists.linux.dev
Cc: Ming Wang <wangming01@loongson.cn>
Cc: Tom Rix <trix@redhat.com>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20231009183920.200859-17-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Golang symbols in ELF files are different from C/C++ ones
in that they contain special characters like '*', '(' and ')'.
With generics, things get more complicated; there are
symbols like:
github.com/cilium/ebpf/internal.(*Deque[go.shape.interface { Format(fmt.State, int32); TypeName() string;github.com/cilium/ebpf/btf.copy() github.com/cilium/ebpf/btf.Type}]).Grow
Match such symbols using `%m[^\n]` in sscanf; this
excludes the newline character, which typically does not appear in ELF
symbols. This should work in most use cases and also
works for unicode letters in identifiers. If newlines do
show up in ELF symbols, users can still attach to such a
symbol by specifying bpf_uprobe_opts::func_name.
A working example can be found at this repo ([0]).
[0]: https://github.com/chenhengqi/libbpf-go-symbols
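To sketch the idea (the format string and spec below are purely illustrative,
not the exact one used by libbpf): %m[^\n] makes sscanf allocate the match and
stop only at a newline, so '*', '(', ')' and spaces in Go symbols survive intact.

    char *path = NULL, *func = NULL;

    /* e.g. spec = "uprobe//usr/bin/app:main.(*Server).Handle" */
    if (sscanf(spec, "uprobe/%m[^:]:%m[^\n]", &path, &func) != 2)
    	return -EINVAL;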
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230929155954.92448-1-hengqi.chen@gmail.com
Add ring__avail_data_size for querying the currently available data in
the ringbuffer, similar to the BPF_RB_AVAIL_DATA flag in
bpf_ringbuf_query. This is racy during ongoing operations but is still
useful for overall information on how a ringbuffer is behaving.
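A usage sketch (pairs with fetching a struct ring * from the same API family;
the ring index is illustrative):

    struct ring *r = ring_buffer__ring(rb, 0);
    if (r) {
    	size_t avail = ring__avail_data_size(r);
    	fprintf(stderr, "ring 0 has %zu bytes of unconsumed data\n", avail);
    }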
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-8-martin.kelly@crowdstrike.com
Switch rb->rings to be an array of pointers instead of a contiguous
block. This allows for each ring pointer to be stable after
ring_buffer__add is called, which allows us to expose struct ring * to
the user without gotchas. Without this change, the realloc in
ring_buffer__add could invalidate a struct ring *, making it unsafe to
give to the user.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-3-martin.kelly@crowdstrike.com
Refactor the cleanup code in ring_buffer__add to use a unified err_out
label. This reduces code duplication, as well as plugging a potential
leak if mmap_sz != (__u64)(size_t)mmap_sz (currently this would miss
unmapping tmp because ringbuf_unmap_ring isn't called).
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230925215045.2375758-2-martin.kelly@crowdstrike.com
In the current implementation, we assume that a symbol found in the .dynsym
section has a version suffix and use it to compare with the symbol the user
supplied. According to the spec ([0]), this assumption is incorrect: the
version info of dynamic symbols is stored in the .gnu.version and
.gnu.version_d sections of ELF objects. For example:
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
000000000009b1a0 T __pthread_rwlock_wrlock@GLIBC_2.2.5
000000000009b1a0 T pthread_rwlock_wrlock@@GLIBC_2.34
000000000009b1a0 T pthread_rwlock_wrlock@GLIBC_2.2.5
$ readelf -W --dyn-syms /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
706: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 __pthread_rwlock_wrlock@GLIBC_2.2.5
2568: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@@GLIBC_2.34
2571: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@GLIBC_2.2.5
In this case, specifying pthread_rwlock_wrlock@@GLIBC_2.34 or
pthread_rwlock_wrlock@GLIBC_2.2.5 in bpf_uprobe_opts::func_name won't work,
because the qualified name does NOT match `pthread_rwlock_wrlock` (without
the version suffix) in the .dynsym section.
This commit implements symbol versioning for dynsym and allows users to
specify symbols in the following forms:
- func
- func@LIB_VERSION
- func@@LIB_VERSION
In case of a symbol conflict, error out; users should resolve it by
specifying a qualified name.
[0]: https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/symversion.html
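For illustration, attaching to a specific version could look roughly like this
(library path reused from the example above; prog and pid handling are assumed):

    LIBBPF_OPTS(bpf_uprobe_opts, uopts,
    	/* also accepted: plain "pthread_rwlock_wrlock" or the "@@GLIBC_2.34" form */
    	.func_name = "pthread_rwlock_wrlock@GLIBC_2.2.5");
    struct bpf_link *link;

    link = bpf_program__attach_uprobe_opts(prog, -1 /* any pid */,
    		"/lib/x86_64-linux-gnu/libc.so.6",
    		0 /* offset resolved from func_name */, &uopts);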
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230918024813.237475-3-hengqi.chen@gmail.com
Dynamic symbols in shared library may have the same name, for example:
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
000000000009b1a0 T __pthread_rwlock_wrlock@GLIBC_2.2.5
000000000009b1a0 T pthread_rwlock_wrlock@@GLIBC_2.34
000000000009b1a0 T pthread_rwlock_wrlock@GLIBC_2.2.5
$ readelf -W --dyn-syms /lib/x86_64-linux-gnu/libc.so.6 | grep rwlock_wrlock
706: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 __pthread_rwlock_wrlock@GLIBC_2.2.5
2568: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@@GLIBC_2.34
2571: 000000000009b1a0 878 FUNC GLOBAL DEFAULT 15 pthread_rwlock_wrlock@GLIBC_2.2.5
Currently, users can't attach a uprobe to pthread_rwlock_wrlock because
there are two symbols named pthread_rwlock_wrlock and both have global
binding, so libbpf considers it a conflict.
Since both of them are at the same offset, we could accept one of them
harmlessly. Note that we already do this in elf_resolve_syms_offsets.
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230918024813.237475-2-hengqi.chen@gmail.com
Alexei Starovoitov says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 79 files changed, 5275 insertions(+), 600 deletions(-).
The main changes are:
1) Basic BTF validation in libbpf, from Andrii Nakryiko.
2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi.
3) next_thread cleanups, from Oleg Nesterov.
4) Add mcpu=v4 support to arm32, from Puranjay Mohan.
5) Add support for __percpu pointers in bpf progs, from Yonghong Song.
6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang.
7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probability, from Hou Tao.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman",
Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle),
Song Liu, Stanislav Fomichev, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to libbpf to append exception callbacks when loading a
program. The exception callback is found by discovering the declaration
tag 'exception_callback:<value>' and finding the callback in the value
of the tag.
The process is done in two steps. First, for each main program, the
bpf_object__sanitize_and_load_btf function finds and marks its
corresponding exception callback as defined by the declaration tag on
it. Second, bpf_object__reloc_code is modified to append the indicated
exception callback at the end of the instruction iteration (since
exception callback will never be appended in that loop, as it is not
directly referenced).
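Roughly, on the BPF program side this ties together as in the following sketch
(macro and callback names are examples, not necessarily the exact selftest or
bpf_experimental.h spellings):

    /* associate the main prog with its exception callback via the decl tag */
    #define __exception_cb(name) \
    	__attribute__((btf_decl_tag("exception_callback:" #name)))

    int my_exception_cb(__u64 cookie)
    {
    	return (int)cookie;	/* becomes the program's return value */
    }

    SEC("tc")
    __exception_cb(my_exception_cb)
    int prog_with_exceptions(struct __sk_buff *ctx)
    {
    	bpf_assert(ctx->len > 0);	/* throws to my_exception_cb on failure */
    	return 0;
    }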
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactor bpf_object__append_subprog_code out of bpf_object__reloc_code
to be able to reuse it to append subprog related code for the exception
callback to the main program.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>