iv/linux

Author	SHA1	Message	Date
Liu Pan	01dc26c980	libbpf: Explicitly call write to append content to file When data is written to an fd by calling "vdprintf", most implementations of the standard library end up writing it with the writev syscall. But "uprobe_events/kprobe_events" does not allow segmented writes, so switch the "append_to_file" function to an explicit write() call. Signed-off-by: Liu Pan <patteliu@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230320030720.650-1-patteliu@gmail.com	2023-03-20 11:59:45 -07:00
Alexei Starovoitov	bb4a6a9237	selftest/bpf: Add a test case for ld_imm64 copy logic. Add a test case to exercise {btf_id, btf_obj_fd} copy logic between ld_imm64 insns. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230319203014.55866-2-alexei.starovoitov@gmail.com	2023-03-20 09:26:41 -07:00
Alexei Starovoitov	a506d6ce1d	libbpf: Fix ld_imm64 copy logic for ksym in light skeleton. Unlike normal libbpf the light skeleton 'loader' program is doing btf_find_by_name_kind() call at run-time to find ksym in the kernel and populate its {btf_id, btf_obj_fd} pair in ld_imm64 insn. To avoid doing the search multiple times for the same ksym it remembers the first patched ld_imm64 insn and copies {btf_id, btf_obj_fd} from it into subsequent ld_imm64 insn. Fix a bug in copying logic, since it may incorrectly clear BPF_PSEUDO_BTF_ID flag. Also replace always true if (btf_obj_fd >= 0) check with unconditional JMP_JA to clarify the code. Fixes: `d995816b77` ("libbpf: Avoid reload of imm for weak, unresolved, repeating ksym") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230319203014.55866-1-alexei.starovoitov@gmail.com	2023-03-20 09:26:41 -07:00
Sreevani Sreejith	08ff1c9f3e	bpf, docs: Libbpf overview documentation This patch documents overview of libbpf, including its features for developing BPF programs. Signed-off-by: Sreevani Sreejith <ssreevani@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230315195405.2051559-1-ssreevani@meta.com	2023-03-18 10:17:39 -07:00
Manu Bretelle	2be7aa76cc	selftests/bpf: Add --json-summary option to test_progs Currently, test_progs outputs all stdout/stderr as it runs, and when it is done, prints a summary. It is non-trivial for tooling to parse that output and extract meaningful information from it. This change adds a new option, `--json-summary`/`-J` that lets the caller specify a file where `test_progs{,-no_alu32}` can write a summary of the run in a json format that can later be parsed by tooling. Currently, it creates a summary section with successes/skipped/failures followed by a list of failed tests and subtests. A test contains the following fields: - name: the name of the test - number: the number of the test - message: the log message that was printed by the test. - failed: A boolean indicating whether the test failed or not. Currently we only output failed tests, but in the future, successful tests could be added. - subtests: A list of subtests associated with this test. A subtest contains the following fields: - name: same as above - number: same as above - message: the log message that was printed by the subtest. 
- failed: same as above but for the subtest An example run and json content below: ``` $ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print $1","}' | tr -d '\n') -j -J /tmp/test_progs.json $ jq < /tmp/test_progs.json | head -n 30 { "success": 29, "success_subtest": 23, "skipped": 3, "failed": 28, "results": [ { "name": "bpf_cookie", "number": 10, "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n", "failed": true, "subtests": [ { "name": "multi_kprobe_link_api", "number": 2, "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "multi_kprobe_attach_api", "number": 3, "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "lsm", "number": 8, "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n", "failed": true } ``` The file can then be used to print a summary of the test run and list of failing tests/subtests: ``` $ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"' Success: 29/23, Skipped: 3, Failed: 28 $ jq -r < /tmp/test_progs.json '.results | map([ if .failed then "#\(.number) \(.name)" else empty end, ( . 
as {name: $tname, number: $tnum} | .subtests | map( if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end ) ) ]) | flatten | .[]' | head -n 20 #10 bpf_cookie #10/2 bpf_cookie/multi_kprobe_link_api #10/3 bpf_cookie/multi_kprobe_attach_api #10/8 bpf_cookie/lsm #15 bpf_mod_race #15/1 bpf_mod_race/ksym (used_btfs UAF) #15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF) #36 cgroup_hierarchical_stats #61 deny_namespace #61/1 deny_namespace/unpriv_userns_create_no_bpf #73 fexit_stress #83 get_func_ip_test #99 kfunc_dynptr_param #99/1 kfunc_dynptr_param/dynptr_data_null #99/4 kfunc_dynptr_param/dynptr_data_null #100 kprobe_multi_bench_attach #100/1 kprobe_multi_bench_attach/kernel #100/2 kprobe_multi_bench_attach/modules #101 kprobe_multi_test #101/1 kprobe_multi_test/skel_api ``` Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230317163256.3809328-1-chantr4@gmail.com	2023-03-17 16:00:33 -07:00
Andrii Nakryiko	6cae5a7106	Merge branch 'bpf: Add detection of kfuncs.' Alexei Starovoitov says: ==================== From: Alexei Starovoitov <ast@kernel.org> Allow BPF programs to detect at load time whether a particular kfunc exists. Patch 1: Allow ld_imm64 to point to kfunc in the kernel. Patch 2: Fix relocation of kfunc in ld_imm64 insn when kfunc is in kernel module. Patch 3: Introduce bpf_ksym_exists() macro. Patch 4: selftest. NOTE: detection of kfuncs from light skeleton is not supported yet. ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2023-03-17 15:46:03 -07:00
Alexei Starovoitov	95fdf6e313	selftests/bpf: Add test for bpf_ksym_exists(). Add load and run time test for bpf_ksym_exists() and check that the verifier performs dead code elimination for non-existing kfunc. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230317201920.62030-5-alexei.starovoitov@gmail.com	2023-03-17 15:46:02 -07:00
Alexei Starovoitov	5cbd3fe3a9	libbpf: Introduce bpf_ksym_exists() macro. Introduce bpf_ksym_exists() macro that can be used by BPF programs to detect at load time whether particular ksym (either variable or kfunc) is present in the kernel. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230317201920.62030-4-alexei.starovoitov@gmail.com	2023-03-17 15:46:00 -07:00
Alexei Starovoitov	5fc13ad59b	libbpf: Fix relocation of kfunc ksym in ld_imm64 insn. void *p = kfunc; -> generates ld_imm64 insn. kfunc() -> generates bpf_call insn. libbpf patches bpf_call insn correctly while only btf_id part of ld_imm64 is set in the former case. Which means that pointers to kfuncs in modules are not patched correctly and the verifier rejects load of such programs due to btf_id being out of range. Fix libbpf to patch ld_imm64 for kfunc. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230317201920.62030-3-alexei.starovoitov@gmail.com	2023-03-17 15:44:27 -07:00
Alexei Starovoitov	58aa2afbb1	bpf: Allow ld_imm64 instruction to point to kfunc. Allow ld_imm64 insn with BPF_PSEUDO_BTF_ID to hold the address of kfunc. The ld_imm64 pointing to a valid kfunc will be seen as non-null PTR_TO_MEM by is_branch_taken() logic of the verifier, while libbpf will resolve the address of an unknown kfunc as ld_imm64 reg, 0 which will also be recognized by is_branch_taken() and the verifier will proceed with dead code elimination. BPF programs can use this logic to detect at load time whether kfunc is present in the kernel with bpf_ksym_exists() macro that is introduced in the next patches. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230317201920.62030-2-alexei.starovoitov@gmail.com	2023-03-17 15:44:26 -07:00
Bagas Sanjaya	0f10f647f4	bpf, docs: Use internal linking for link to netdev subsystem doc Commit `d56b0c461d` ("bpf, docs: Fix link to netdev-FAQ target") attempts to fix a linking problem with the undefined "netdev-FAQ" label introduced in `287f4fa99a` ("docs: Update references to netdev-FAQ") by changing the internal cross reference to the netdev subsystem documentation (Documentation/process/maintainer-netdev.rst) to an external one at docs.kernel.org. However, the linking problem is still not resolved, as the generated link points to a non-existent netdev-FAQ section of the external doc, which, when clicked, will instead go to the top of the doc. Revert back to internal linking by simply mentioning the doc path while massaging the leading text to the link, since the netdev subsystem doc contains no FAQs but rather general information about the subsystem. Fixes: `d56b0c461d` ("bpf, docs: Fix link to netdev-FAQ target") Fixes: `287f4fa99a` ("docs: Update references to netdev-FAQ") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230314074449.23620-1-bagasdotme@gmail.com	2023-03-17 13:58:57 +01:00
Viktor Malik	bd5314f8dd	kallsyms, bpf: Move find_kallsyms_symbol_value out of internal header Moving find_kallsyms_symbol_value from kernel/module/internal.h to include/linux/module.h. The reason is that internal.h is not prepared to be included when CONFIG_MODULES=n. find_kallsyms_symbol_value is used by kernel/bpf/verifier.c and including internal.h from it (without modules) leads to a compilation error: In file included from ../include/linux/container_of.h:5, from ../include/linux/list.h:5, from ../include/linux/timer.h:5, from ../include/linux/workqueue.h:9, from ../include/linux/bpf.h:10, from ../include/linux/bpf-cgroup.h:5, from ../kernel/bpf/verifier.c:7: ../kernel/bpf/../module/internal.h: In function 'mod_find': ../include/linux/container_of.h:20:54: error: invalid use of undefined type 'struct module' 20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \ | ^~ [...] This patch fixes the above error. Fixes: `31bf1dbccf` ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/oe-kbuild-all/202303161404.OrmfCy09-lkp@intel.com/ Link: https://lore.kernel.org/bpf/20230317095601.386738-1-vmalik@redhat.com	2023-03-17 13:45:51 +01:00
Alexei Starovoitov	94bbbdfbde	Merge branch 'double-fix bpf_test_run + XDP_PASS recycling' Alexander Lobakin says: ==================== Enabling skb PP recycling revealed a couple issues in the bpf_test_run code. Recycling broke the assumption that the headroom won't ever be touched during the test_run execution: xdp_scrub_frame() invalidates the XDP frame at the headroom start, while neigh xmit code overwrites 2 bytes to the left of the Ethernet header. The first makes the kernel panic in certain cases, while the second breaks xdp_do_redirect selftest on BE. test_run is a limited-scope entity, so let's hope no more corner cases will happen here or at least they will be as easy and pleasant to fix as those two. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 22:20:09 -07:00
Alexander Lobakin	5640b6d894	selftests/bpf: fix "metadata marker" getting overwritten by the netstack Alexei noticed xdp_do_redirect test on BPF CI started failing on BE systems after skb PP recycling was enabled: test_xdp_do_redirect:PASS:prog_run 0 nsec test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec test_xdp_do_redirect:PASS:pkt_count_zero 0 nsec test_xdp_do_redirect:FAIL:pkt_count_tc unexpected pkt_count_tc: actual 220 != expected 9998 test_max_pkt_size:PASS:prog_run_max_size 0 nsec test_max_pkt_size:PASS:prog_run_too_big 0 nsec close_netns:PASS:setns 0 nsec #289 xdp_do_redirect:FAIL Summary: 270/1674 PASSED, 30 SKIPPED, 1 FAILED and it doesn't happen on LE systems. Ilya then hunted it down to: #0 0x0000000000aaeee6 in neigh_hh_output (hh=0x83258df0, skb=0x88142200) at linux/include/net/neighbour.h:503 #1 0x0000000000ab2cda in neigh_output (skip_cache=false, skb=0x88142200, n=<optimized out>) at linux/include/net/neighbour.h:544 #2 ip6_finish_output2 (net=net@entry=0x88edba00, sk=sk@entry=0x0, skb=skb@entry=0x88142200) at linux/net/ipv6/ip6_output.c:134 #3 0x0000000000ab4cbc in __ip6_finish_output (skb=0x88142200, sk=0x0, net=0x88edba00) at linux/net/ipv6/ip6_output.c:195 #4 ip6_finish_output (net=0x88edba00, sk=0x0, skb=0x88142200) at linux/net/ipv6/ip6_output.c:206 xdp_do_redirect test places a u32 marker (0x42) right before the Ethernet header, then checks it in the XDP program and returns %XDP_ABORTED if it's not there. Neigh xmit code likes to round up hard header length to speed up copying the header, so it overwrites two bytes in front of the Eth header. On LE systems, 0x42 is one byte at `data - 4`, while on BE it's `data - 1`, which explains why it happens only there. It didn't happen previously because %XDP_PASS meant the page would be discarded and replaced by a new one, but now it can be recycled as well, while bpf_test_run code doesn't reinitialize the content of recycled pages. 
This mark is limited to this particular test and its setup though, so there's no need to predict 1000 different possible cases. Just move it 4 bytes to the left, still keeping it 32 bit to match on more bytes. Fixes: `9c94bbf9a8` ("xdp: recycle Page Pool backed skbs built from XDP frames") Reported-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/CAADnVQ+B_JOU+EpP=DKhbY9yXdN6GiRPnpTTXfEZ9sNkUeb-yQ@mail.gmail.com Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> # + debugging Link: https://lore.kernel.org/bpf/8341c1d9f935f410438e79d3bd8a9cc50aefe105.camel@linux.ibm.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230316175051.922550-3-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 22:20:09 -07:00
Alexander Lobakin	e5995bc7e2	bpf, test_run: fix crashes due to XDP frame overwriting/corruption syzbot and Ilya faced the splats when %XDP_PASS happens for bpf_test_run after skb PP recycling was enabled for {__,}xdp_build_skb_from_frame(): BUG: kernel NULL pointer dereference, address: 0000000000000d28 RIP: 0010:memset_erms+0xd/0x20 arch/x86/lib/memset_64.S:66 [...] Call Trace: <TASK> __finalize_skb_around net/core/skbuff.c:321 [inline] __build_skb_around+0x232/0x3a0 net/core/skbuff.c:379 build_skb_around+0x32/0x290 net/core/skbuff.c:444 __xdp_build_skb_from_frame+0x121/0x760 net/core/xdp.c:622 xdp_recv_frames net/bpf/test_run.c:248 [inline] xdp_test_run_batch net/bpf/test_run.c:334 [inline] bpf_test_run_xdp_live+0x1289/0x1930 net/bpf/test_run.c:362 bpf_prog_test_run_xdp+0xa05/0x14e0 net/bpf/test_run.c:1418 [...] This happens because it calls xdp_scrub_frame(), which nullifies xdpf->data. bpf_test_run code doesn't reinit the frame when the XDP program doesn't adjust head or tail. Previously, %XDP_PASS meant the page would be released from the pool and returned to the MM layer, but now it does return to the Pool with the nullified xdpf->data, which doesn't get reinitialized then. So, in addition to checking whether the head and/or tail have been adjusted, check also for a potential XDP frame corruption. xdpf->data is 100% affected and also xdpf->flags is the field closest to the metadata / frame start. Checking for these two should be enough for non-extreme cases. 
Fixes: `9c94bbf9a8` ("xdp: recycle Page Pool backed skbs built from XDP frames") Reported-by: syzbot+e1d1b65f7c32f2a86a9f@syzkaller.appspotmail.com Link: https://lore.kernel.org/bpf/000000000000f1985705f6ef2243@google.com Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/e07dd94022ad5731705891b9487cc9ed66328b94.camel@linux.ibm.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230316175051.922550-2-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 22:20:09 -07:00
Luis Gerhorst	082cdc69a4	bpf: Remove misleading spec_v1 check on var-offset stack read For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCES is returned if the register's offset is not constant, thereby rejecting the program. In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1]. Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat. This check was introduced by commit `01f810ace9` ("bpf: Allow variable-offset stack access") which correctly notes that "variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them". However, it does not further discuss why the check in check_stack_read() is necessary. The code which made this check obsolete was also introduced in this commit. I have compiled ~650 programs from the Linux selftests, Linux samples, Cilium, and libbpf/examples projects and confirmed that none of these trigger the check in check_stack_read() [2]. Instead, all of these programs are, as expected, already rejected when constructing the variable-offset pointers. 
Note that the check in check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the code removed by this patch does not (the error removed does not appear in the "verification_error" values). For reproducibility, the repository linked includes the raw data and scripts used to create the plot. [1] https://arxiv.org/pdf/1807.03757.pdf [2] `53dc19fcf4/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf` Fixes: `01f810ace9` ("bpf: Allow variable-offset stack access") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de	2023-03-16 22:05:50 +01:00
Alexei Starovoitov	deb9fd64d1	Merge branch 'Make struct bpf_cpumask RCU safe' David Vernet says: ==================== The struct bpf_cpumask type is currently not RCU safe. It uses the bpf_mem_cache_{alloc,free}() APIs to allocate and release cpumasks, and those allocations may be reused before an RCU grace period has elapsed. We want to be able to enable using this pattern in BPF programs: private(MASK) static struct bpf_cpumask __kptr *global; int BPF_PROG(prog, ...) { struct bpf_cpumask *cpumask; bpf_rcu_read_lock(); cpumask = global; if (!cpumask) { bpf_rcu_read_unlock(); return -1; } bpf_cpumask_setall(cpumask); ... bpf_rcu_read_unlock(); } In other words, to be able to pass a kptr to KF_RCU bpf_cpumask kfuncs without requiring the acquisition and release of refcounts using bpf_cpumask_kptr_get(). This patchset enables this by making the struct bpf_cpumask type RCU safe, and removing the bpf_cpumask_kptr_get() function. --- v1: https://lore.kernel.org/all/20230316014122.678082-2-void@manifault.com/ Changelog: ---------- v1 -> v2: - Add doxygen comment for new @rcu field in struct bpf_cpumask. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:31 -07:00
David Vernet	fec2c6d14f	bpf,docs: Remove bpf_cpumask_kptr_get() from documentation Now that the kfunc no longer exists, we can remove it and instead describe how RCU can be used to get a struct bpf_cpumask from a map value. This patch updates the BPF documentation accordingly. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230316054028.88924-6-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:30 -07:00
David Vernet	1b403ce77d	bpf: Remove bpf_cpumask_kptr_get() kfunc Now that struct bpf_cpumask is RCU safe, there's no need for this kfunc. Rather than doing the following: private(MASK) static struct bpf_cpumask __kptr *global; int BPF_PROG(prog, s32 cpu, ...) { struct bpf_cpumask *cpumask; bpf_rcu_read_lock(); cpumask = bpf_cpumask_kptr_get(&global); if (!cpumask) { bpf_rcu_read_unlock(); return -1; } bpf_cpumask_setall(cpumask); ... bpf_cpumask_release(cpumask); bpf_rcu_read_unlock(); } Programs can instead simply do (assume same global cpumask): int BPF_PROG(prog, ...) { struct bpf_cpumask *cpumask; bpf_rcu_read_lock(); cpumask = global; if (!cpumask) { bpf_rcu_read_unlock(); return -1; } bpf_cpumask_setall(cpumask); ... bpf_rcu_read_unlock(); } In other words, no extra atomic acquire / release, and less boilerplate code. This patch removes both the kfunc, as well as its selftests and documentation. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230316054028.88924-5-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:30 -07:00
David Vernet	a5a197df58	bpf/selftests: Test using global cpumask kptr with RCU Now that struct bpf_cpumask * is considered an RCU-safe type according to the verifier, we should add tests that validate its common usages. This patch adds those tests to the cpumask test suite. Subsequent changes will remove bpf_cpumask_kptr_get(), and will adjust the selftest and BPF documentation accordingly. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230316054028.88924-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:30 -07:00
David Vernet	63d2d83d21	bpf: Mark struct bpf_cpumask as rcu protected struct bpf_cpumask is a BPF-wrapper around the struct cpumask type which can be instantiated by a BPF program, and then queried as a cpumask in similar fashion to normal kernel code. The previous patch in this series makes the type fully RCU safe, so the type can be included in the rcu_protected_type BTF ID list. A subsequent patch will remove bpf_cpumask_kptr_get(), as it's no longer useful now that we can just treat the type as RCU safe by default and do our own if check. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230316054028.88924-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:30 -07:00
David Vernet	77473d1a96	bpf: Free struct bpf_cpumask in call_rcu handler The struct bpf_cpumask type uses the bpf_mem_cache_{alloc,free}() APIs to allocate and free its cpumasks. The bpf_mem allocator may currently immediately reuse some memory when it's freed, without waiting for an RCU read cycle to elapse. We want to be able to treat struct bpf_cpumask objects as completely RCU safe. This is necessary for two reasons: 1. bpf_cpumask_kptr_get() currently does an RCU-protected refcnt_inc_not_zero(). This of course assumes that the underlying memory is not reused, and is therefore unsafe in its current form. 2. We want to be able to get rid of bpf_cpumask_kptr_get() entirely, and instead use the superior kptr RCU semantics now afforded by the verifier. This patch fixes (1), and enables (2), by making struct bpf_cpumask RCU safe. A subsequent patch will update the verifier to allow struct bpf_cpumask * pointers to be passed to KF_RCU kfuncs, and then a later patch will remove bpf_cpumask_kptr_get(). Fixes: `516f4d3397` ("bpf: Enable cpumasks to be queried and used as kptrs") Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230316054028.88924-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-16 12:28:30 -07:00
Daniel Müller	6cb9430be1	libbpf: Ignore warnings about "inefficient alignment" Some consumers of libbpf compile the code base with different warnings enabled. In a report for perf, for example, -Wpacked was set which caused warnings about "inefficient alignment" to be emitted on a subset of supported architectures. With this change we silence specifically those warnings, as we intentionally worked with packed structs. This is a similar resolution as in `b2f10cd4e8` ("perf cpumap: Fix alignment for masks in event encoding"). Fixes: `1eebcb6063` ("libbpf: Implement basic zip archive parsing support") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/bpf/CA+G9fYtBnwxAWXi2+GyNByApxnf_DtP1-6+_zOKAdJKnJBexjg@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20230315171550.1551603-1-deso@posteo.net	2023-03-16 18:20:08 +01:00
Martin KaFai Lau	226efec2b0	selftests/bpf: Fix a fd leak in an error path in network_helpers.c In __start_server, it leaks a fd when setsockopt(SO_REUSEPORT) fails. This patch fixes it. Fixes: `eed92afdd1` ("bpf: selftest: Test batching and bpf_(get|set)sockopt in bpf tcp iter") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230316000726.1016773-2-martin.lau@linux.dev	2023-03-16 18:12:59 +01:00
Martin KaFai Lau	ed01385c0d	selftests/bpf: Use ASSERT_EQ instead of ASSERT_OK for testing memcmp result The tcp_hdr_options test ensures that the received tcp hdr option and the sk local storage have the expected values. It uses memcmp to check that. Testing the memcmp result with ASSERT_OK is confusing because ASSERT_OK will print out the errno which is not set. This patch uses ASSERT_EQ to check for 0 instead. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230316000726.1016773-1-martin.lau@linux.dev	2023-03-16 18:12:59 +01:00
Alexei Starovoitov	72fe61d745	Merge branch 'Fix attaching fentry/fexit/fmod_ret/lsm to modules' Viktor Malik says: ==================== I noticed that the verifier behaves incorrectly when attaching to fentry of multiple functions of the same name located in different modules (or in vmlinux). The reason for this is that if the target program is not specified, the verifier will search kallsyms for the trampoline address to attach to. The entire kallsyms is always searched, not respecting the module in which the function to attach to is located. As Yonghong correctly pointed out, there is yet another issue - the trampoline acquires the module reference in register_fentry which means that if the module is unloaded between the place where the address is found in the verifier and register_fentry, it is possible that another module is loaded to the same address in the meantime, which may lead to errors. This patch fixes the above issues by extracting the module name from the BTF of the attachment target (which must be specified) and by doing the search in kallsyms of the correct module. At the same time, the module reference is acquired right after the address is found and only released right before the program itself is unloaded. 
--- Changes in v10: - added the new test to DENYLIST.aarch64 (suggested by Andrii) - renamed the test source file to match the test name Changes in v9: - two small changes suggested by Jiri Olsa and Jiri's ack Changes in v8: - added module_put to error paths in bpf_check_attach_target after the module reference is acquired Changes in v7: - refactored the module reference manipulation (comments by Jiri Olsa) - cleaned up the test (comments by Andrii Nakryiko) Changes in v6: - storing the module reference inside bpf_prog_aux instead of bpf_trampoline and releasing it when the program is unloaded (suggested by Jiri Olsa) Changes in v5: - fixed acquiring and releasing of module references by trampolines to prevent modules being unloaded between address lookup and trampoline allocation Changes in v4: - reworked module kallsyms lookup approach using existing functions, verifier now calls btf_try_get_module to retrieve the module and find_kallsyms_symbol_value to get the symbol address (suggested by Alexei) - included Jiri Olsa's comments - improved description of the new test and added it as a comment into the test source Changes in v3: - added trivial implementation for kallsyms_lookup_name_in_module() for !CONFIG_MODULES (noticed by test robot, fix suggested by Hao Luo) Changes in v2: - introduced and used more space-efficient kallsyms lookup function, suggested by Jiri Olsa - included Hao Luo's comments ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-15 18:38:30 -07:00
Viktor Malik	aa3d65de4b	bpf/selftests: Test fentry attachment to shadowed functions Adds a new test that tries to attach a program to fentry of two functions of the same name, one located in vmlinux and the other in bpf_testmod. To avoid conflicts with existing tests, a new function "bpf_fentry_shadow_test" was created both in vmlinux and in bpf_testmod. The previous commit fixed a bug which caused this test to fail. The verifier would always use the vmlinux function's address as the target trampoline address, hence trying to create two trampolines for a single address, which is forbidden. The test (similarly to other fentry/fexit tests) is not working on arm64 at the moment. Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/5fe2f364190b6f79b085066ed7c5989c5bc475fa.1678432753.git.vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-15 18:38:30 -07:00
Viktor Malik	31bf1dbccf	bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules This resolves two problems with attachment of fentry/fexit/fmod_ret/lsm to functions located in modules: 1. The verifier tries to find the address to attach to in kallsyms. This is always done by searching the entire kallsyms, not respecting the module in which the function is located. Such approach causes an incorrect attachment address to be computed if the function to attach to is shadowed by a function of the same name located earlier in kallsyms. 2. If the address to attach to is located in a module, the module reference is only acquired in register_fentry. If the module is unloaded between the place where the address is found (bpf_check_attach_target in the verifier) and register_fentry, it is possible that another module is loaded to the same address which may lead to potential errors. Since the attachment must contain the BTF of the program to attach to, we extract the module from it and search for the function address in the correct module (resolving problem no. 1). Then, the module reference is taken directly in bpf_check_attach_target and stored in the bpf program (in bpf_prog_aux). The reference is only released when the program is unloaded (resolving problem no. 2). Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/3f6a9d8ae850532b5ef864ef16327b0f7a669063.1678432753.git.vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-15 18:38:21 -07:00
Tejun Heo	b8a2e3f93d	cgroup: Make current_cgns_cgroup_dfl() safe to call after exit_task_namespace() The commit `332ea1f697` ("bpf: Add bpf_cgroup_from_id() kfunc") added bpf_cgroup_from_id() which calls current_cgns_cgroup_dfl() through cgroup_get_from_id(). However, BPF programs may be attached to a point where current->nsproxy has already been cleared to NULL by exit_task_namespace() and calling bpf_cgroup_from_id() would cause an oops. Just return the system-wide root if nsproxy has been cleared. This allows all cgroups to be looked up after the task passed through exit_task_namespace(), which semantically makes sense. Given that the only way to get this behavior is through BPF programs, it seems safe but let's see what others think. Fixes: `332ea1f697` ("bpf: Add bpf_cgroup_from_id() kfunc") Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/ZBDuVWiFj2jiz3i8@slm.duckdns.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 16:56:20 -07:00
Alexei Starovoitov	3c2611bac0	selftests/bpf: Fix trace_virtqueue_add_sgs test issue with LLVM 17. LLVM commit https://reviews.llvm.org/D143726 introduced the hoistMinMax optimization that transformed (i < VIRTIO_MAX_SGS) && (i < out_sgs) into i < MIN(VIRTIO_MAX_SGS, out_sgs) and caused the verifier to stop recognizing such a loop as bounded, which resulted in the following test failure: libbpf: prog 'trace_virtqueue_add_sgs': BPF program load failed: Bad address libbpf: prog 'trace_virtqueue_add_sgs': -- BEGIN PROG LOAD LOG -- The sequence of 8193 jumps is too complex. verification time 789206 usec stack depth 56 processed 156446 insns (limit 1000000) max_states_per_insn 7 total_states 1746 peak_states 1701 mark_read 12 -- END PROG LOAD LOG -- libbpf: prog 'trace_virtqueue_add_sgs': failed to load: -14 libbpf: failed to load object 'loop6.bpf.o' Work around the verifier limitation for now with inline asm that prevents this particular optimization. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:28:11 -07:00
Alexei Starovoitov	5584d9e63e	Merge branch 'xdp: recycle Page Pool backed skbs built from XDP frames' Alexander Lobakin says: ==================== Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway. __xdp_build_skb_from_frame() missed the moment when the networking stack became able to recycle skb pages backed by a page_pool. This was making e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was also affected in some scenarios. A lot of drivers use skb_mark_for_recycle() already, it's been almost two years and seems like there are no issues in using it in the generic code too. {__,}xdp_release_frame() can then be removed as it lost its last user. Page Pool becomes then zero-alloc (or almost) in the abovementioned cases, too. Other memory type models (who needs them at this point) have no changes. Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte IPv6 UDP, iavf w/XDP[0] (CONFIG_PAGE_POOL_STATS is enabled): Plain %XDP_PASS on baseline, Page Pool driver: src cpu Rx drops dst cpu Rx 2.1 Mpps N/A 2.1 Mpps cpumap redirect (cross-core, w/o leaving its NUMA node) on baseline: 6.8 Mpps 5.0 Mpps 1.8 Mpps cpumap redirect with skb PP recycling: 7.9 Mpps 5.7 Mpps 2.2 Mpps +22% (from cpumap redir on baseline) [0] https://github.com/alobakin/linux/commits/iavf-xdp ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:20:05 -07:00
Alexander Lobakin	d4e492338d	xdp: remove unused {__,}xdp_release_frame() __xdp_build_skb_from_frame() was the last user of {__,}xdp_release_frame(), which detaches pages from the page_pool. All the consumers now recycle Page Pool skbs and page, except mlx5, stmmac and tsnep drivers, which use page_pool_release_page() directly (might change one day). It's safe to assume this functionality is not needed anymore and can be removed (in favor of recycling). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-5-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:20:05 -07:00
Alexander Lobakin	9c94bbf9a8	xdp: recycle Page Pool backed skbs built from XDP frames __xdp_build_skb_from_frame() state(d): /* Until page_pool get SKB return path, release DMA here */ Page Pool got skb pages recycling in April 2021, but missed this function. xdp_release_frame() is relevant only for Page Pool backed frames and it detaches the page from the corresponding page_pool in order to make it freeable via page_frag_free(). It can instead just mark the output skb as eligible for recycling if the frame is backed by a pp. No change for other memory model types (the same condition check as before). cpumap redirect and veth on Page Pool drivers now become zero-alloc (or almost). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-4-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:20:05 -07:00
Alexander Lobakin	2c854e5fcd	net: page_pool, skbuff: make skb_mark_for_recycle() always available skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates unneeded complication when using it in the generic code. For now, it's only used in the drivers always selecting Page Pool, so this works. Move the guards so that preprocessor will cut out only the operation itself and the function will still be a noop on !PAGE_POOL systems, but available there as well. No functional changes. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-3-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:20:04 -07:00
Alexander Lobakin	487deb3e33	selftests/bpf: robustify test_xdp_do_redirect with more payload magics Currently, the test relies on that only dropped ("xmitted") frames will be recycled and if a frame became an skb, it will be freed later by the stack and never come back to its page_pool. So, it easily gets broken by trying to recycle skbs[0]: test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec test_xdp_do_redirect:FAIL:pkt_count_zero unexpected pkt_count_zero: actual 9936 != expected 2 test_xdp_do_redirect:PASS:pkt_count_tc 0 nsec That huge mismatch happened because after the TC ingress hook zeroes the magic, the page gets recycled when skb is freed, not returned to the MM layer. "Live frames" mode initializes only new pages and keeps the recycled ones as is by design, so they appear with zeroed magic on the Rx path again. Expand the possible magic values from two: 0 (was "xmitted"/dropped or did hit the TC hook) and 0x42 (hit the input XDP prog) to three: the new one will mark frames hit the TC hook, so that they will elide both @pkt_count_zero and @pkt_count_xdp. They can then be recycled to their page_pool or returned to the page allocator, this won't affect the counters anyhow. Just make sure to mark them as "input" (0x42) when they appear on the Rx path again. Also make an enum from those magics, so that they will be always visible and can be changed in just one place anytime. This also eases adding any new marks later on. Link: https://github.com/kernel-patches/bpf/actions/runs/4386538411/jobs/7681081789 Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-2-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-14 15:20:04 -07:00
Martin KaFai Lau	283b40c52d	Merge branch 'bpf: Allow helpers access ptr_to_btf_id.' Alexei Starovoitov says: ==================== From: Alexei Starovoitov <ast@kernel.org> Allow code like: bpf_strncmp(task->comm, 16, "foo"); ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-03-13 23:08:21 -07:00
Alexei Starovoitov	f25fd60882	selftests/bpf: Add various tests to check helper access into ptr_to_btf_id. Add various tests to check helper access into ptr_to_btf_id. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-4-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-03-13 23:08:21 -07:00
Alexei Starovoitov	3e30be4288	bpf: Allow helpers access trusted PTR_TO_BTF_ID. The verifier rejects the code: bpf_strncmp(task->comm, 16, "my_task"); with the message: 16: (85) call bpf_strncmp#182 R1 type=trusted_ptr_ expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf Teach the verifier that such access pattern is safe. Do not allow untrusted and legacy ptr_to_btf_id to be passed into helpers. Reported-by: David Vernet <void@manifault.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-3-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-03-13 23:08:21 -07:00
Alexei Starovoitov	c9267aa8b7	bpf: Fix bpf_strncmp proto. bpf_strncmp() doesn't write into its first argument. Make sure that the verifier knows about it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-2-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-03-13 23:08:21 -07:00
Dave Thaler	b9fe8e8d03	bpf, docs: Add signed comparison example Improve clarity by adding an example of a signed comparison instruction Signed-off-by: Dave Thaler <dthaler@microsoft.com> Acked-by: David Vernet <void@manifault.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230310233814.4641-1-dthaler1968@googlemail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 22:10:45 -07:00
Ross Zwisler	ab4c15feb2	selftests/bpf: use canonical ftrace path The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many tests in the bpf selftest code still refer to this older debugfs path, so let's update them to avoid confusion. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-3-zwisler@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 21:51:30 -07:00
Ross Zwisler	27d7fdf06f	bpf: use canonical ftrace path The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many comments and samples in the bpf code still refer to this older debugfs path, so let's update them to avoid confusion. There are a few spots where the bpf code explicitly checks both tracefs and debugfs (tools/bpf/bpftool/tracelog.c and tools/lib/api/fs/fs.c) and I've left those alone so that the tools can continue to work with both paths. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-2-zwisler@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 21:51:30 -07:00
Dave Marchevsky	9e36a204bd	bpf: Disable migration when freeing stashed local kptr using obj drop When a local kptr is stashed in a map and freed when the map goes away, currently an error like the below appears: [ 39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875 [ 39.196549] caller is bpf_mem_free+0x56/0xc0 [ 39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G O 6.2.0-13016-g22df776a9a86 #4477 [ 39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 39.198949] Workqueue: events_unbound bpf_map_free_deferred [ 39.199470] Call Trace: [ 39.199703] <TASK> [ 39.199911] dump_stack_lvl+0x60/0x70 [ 39.200267] check_preemption_disabled+0xbf/0xe0 [ 39.200704] bpf_mem_free+0x56/0xc0 [ 39.201032] ? bpf_obj_new_impl+0xa0/0xa0 [ 39.201430] bpf_obj_free_fields+0x1cd/0x200 [ 39.201838] array_map_free+0xad/0x220 [ 39.202193] ? finish_task_switch+0xe5/0x3c0 [ 39.202614] bpf_map_free_deferred+0xea/0x210 [ 39.203006] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.203460] process_one_work+0x64f/0xbe0 [ 39.203822] ? pwq_dec_nr_in_flight+0x110/0x110 [ 39.204264] ? do_raw_spin_lock+0x107/0x1c0 [ 39.204662] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.205107] worker_thread+0x74/0x7a0 [ 39.205451] ? process_one_work+0xbe0/0xbe0 [ 39.205818] kthread+0x171/0x1a0 [ 39.206111] ? kthread_complete_and_exit+0x20/0x20 [ 39.206552] ret_from_fork+0x1f/0x30 [ 39.206886] </TASK> This happens because the call to __bpf_obj_drop_impl I added in the patch adding support for stashing local kptrs doesn't disable migration. Prior to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF program, whereas now it can be called from map free path, so it's necessary to explicitly disable migration. Also, refactor a bit to just call __bpf_obj_drop_impl directly instead of bothering w/ dtor union and setting pointer-to-obj_drop. 
Fixes: `c8e1875409` ("bpf: Support __kptr to local kptrs") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 16:55:04 -07:00
David Vernet	22df776a9a	tasks: Extract rcu_users out of union In commit `3fbd7ee285` ("tasks: Add a count of task RCU users"), a count on the number of RCU users was added to struct task_struct. This was done so as to enable the removal of task_rcu_dereference(), and allow tasks to be protected by RCU even after exiting and being removed from the runqueue. In this commit, the 'refcount_t rcu_users' field that keeps track of this refcount was put into a union co-located with 'struct rcu_head rcu', so as to avoid taking up any extra space in task_struct. This was possible to do safely, because the field was only ever decremented by a static set of specific callers, and then never incremented again. While this restriction of there only being a small, static set of users of this field has worked fine, it prevents us from leveraging the field to use RCU to protect tasks in other contexts. During tracing, for example, it would be useful to be able to collect some tasks that performed a certain operation, put them in a map, and then periodically summarize who they are, which cgroup they're in, how much CPU time they've utilized, etc. While this can currently be done with 'usage', it becomes tricky when a task is already in a map, or if a reference should only be taken if a task is valid and will not soon be reaped. Ideally, we could do something like pass a reference to a map value, and then try to acquire a reference to the task in an RCU read region by using refcount_inc_not_zero(). Similarly, in sched_ext, schedulers are using integer pids to remember tasks, and then looking them up with find_task_by_pid_ns(). This is slow, error prone, and adds complexity. It would be more convenient and performant if BPF schedulers could instead store tasks directly in maps, and then leverage RCU to ensure they can be safely accessed with low overhead. Finally, overloading fields like this is error prone. 
Someone that wants to use 'rcu_users' could easily overlook the fact that once the rcu callback is scheduled, the refcount will go back to being nonzero, thus precluding the use of refcount_inc_not_zero(). Furthermore, as described below, it's possible to extract the fields of the union without changing the size of task_struct. There are several possible ways to enable this: 1. The lightest touch approach is likely the one proposed in this patch, which is to simply extract 'rcu_users' and 'rcu' from the union, so that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users' refcount. If we have a trusted task pointer, this would allow us to use refcount_inc_not_zero() inside of an RCU region to determine if we can safely acquire a reference to the task and store it in a map. As mentioned below, this can be done without changing the size of task_struct, by moving the location of the union to another location that has padding gaps we can fill in. 2. Removing 'refcount_t rcu_users', and instead having the entire task be freed in an rcu callback. This is likely the most sound overall design, though it changes the behavioral semantics exposed to callers, who currently expect that a task that's successfully looked up in e.g. the pid_list with find_task_by_pid_ns(), can always have a 'usage' reference acquired on them, as it's guaranteed to be > 0 until after the next gp. In order for this approach to work, we'd have to audit all callers. This approach also slightly changes behavior observed by user space by not invoking trace_sched_process_free() until the whole task_struct is actually being freed, rather than just after it's exited. It also may change timings, as memory will be freed in an RCU callback rather than immediately when the final 'usage' refcount drops to 0. This also is arguably a benefit, as it provides more predictable performance to callers who are refcounting tasks. 3. 
There may be other solutions as well that don't require changing the layout of task_struct. For example, we could possibly do something complex from the BPF side, such as listen for task exit and remove a task from a map when the task is exiting. This would likely require significant custom handling for task_struct in the verifier, so a more generalizable solution is likely warranted. As mentioned above, this patch proposes the lightest-touch approach which allows callers elsewhere in the kernel to use 'rcu_users' to ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from the union. There is no size change in task_struct with this patch. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 12:42:24 -07:00
Andrii Nakryiko	34f0677e7a	bpf: fix precision propagation verbose logging Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: `529409ea92` ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-13 11:46:44 -07:00
Alexei Starovoitov	49b5300f1f	Merge branch 'Support stashing local kptrs with bpf_kptr_xchg' Dave Marchevsky says: ==================== Local kptrs are kptrs allocated via bpf_obj_new with a type specified in program BTF. A BPF program which creates a local kptr has exclusive control of the lifetime of the kptr, and, prior to terminating, must: * free the kptr via bpf_obj_drop * If the kptr is a {list,rbtree} node, add the node to a {list, rbtree}, thereby passing control of the lifetime to the collection This series adds a third option: * stash the kptr in a map value using bpf_kptr_xchg As indicated by the use of "stash" to describe this behavior, the intended use of this feature is temporary storage of local kptrs. For example, a sched_ext ([0]) scheduler may want to create an rbtree node for each new cgroup on cgroup init, but to add that node to the rbtree as part of a separate program which runs on enqueue. Stashing the node in a map_value allows its lifetime to outlive the execution of the cgroup_init program. Behavior: There is no semantic difference between adding a kptr to a graph collection and "stashing" it in a map. In both cases exclusive ownership of the kptr's lifetime is passed to some containing data structure, which is responsible for bpf_obj_drop'ing it when the container goes away. Since graph collections also expect exclusive ownership of the nodes they contain, graph nodes cannot be both stashed in a map_value and contained by their corresponding collection. Implementation: Two observations simplify the verifier changes for this feature. First, kptrs ("referenced kptrs" until a recent renaming) require registration of a dtor function as part of their acquire/release semantics, so that a referenced kptr which is placed in a map_value is properly released when the map goes away. We want this exact behavior for local kptrs, but with bpf_obj_drop as the dtor instead of a per-btf_id dtor. 
The second observation is that, in terms of identification, "referenced kptr" and "local kptr" already don't interfere with one another. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); struct map_value *mapval; struct node_data *res; int key = 0; res = bpf_obj_new(typeof(*res)); if (!res) { /* err handling */ } mapval = bpf_map_lookup_elem(&some_nodes, &key); if (!mapval) { /* err handling */ } res = bpf_kptr_xchg(&mapval->node, res); if (res) bpf_obj_drop(res); The __kptr tag identifies map_value's node as a referenced kptr, while the PTR_TO_BTF_ID which bpf_obj_new returns - a type in some non-vmlinux, non-module BTF - identifies res as a local kptr. Type tag on the pointer indicates referenced kptr, while the type of the pointee indicates local kptr. So using existing facilities we can tell the verifier about a "referenced kptr" pointer to a "local kptr" pointee. When kptr_xchg'ing a kptr into a map_value, the verifier can recognize local kptr types and treat them like referenced kptrs with a properly-typed bpf_obj_drop as a dtor. Other implementation notes: We don't need to do anything special to enforce "graph nodes cannot be both stashed in a map_value and contained by their corresponding collection" * bpf_kptr_xchg both returns and takes as input a (possibly-null) owning reference. It does not accept non-owning references as input by virtue of requiring a ref_obj_id. By definition, if a program has an owning ref to a node, the node isn't in a collection, so it's safe to pass ownership via bpf_kptr_xchg. 
Summary of patches: * Patch 1 modifies BTF plumbing to support using bpf_obj_drop as a dtor * Patch 2 adds verifier plumbing to support MEM_ALLOC-flagged param for bpf_kptr_xchg * Patch 3 adds selftests exercising the new behavior Changelog: v1 -> v2: https://lore.kernel.org/bpf/20230309180111.1618459-1-davemarchevsky@fb.com/ Patch #s used below refer to the patch's position in v1 unless otherwise specified. Patches 1-3 were applied and are not included in v2. Rebase onto latest bpf-next: "libbpf: Revert poisoning of strlcpy" Patch 4: "bpf: Support __kptr to local kptrs" * Remove !btf_is_kernel(btf) check, WARN_ON_ONCE instead (Alexei) Patch 6: "selftests/bpf: Add local kptr stashing test" * Add test which stashes 2 nodes and later unstashes one of them using a separate BPF program (Alexei) * Fix incorrect runner subtest name for original test (was "rbtree_add_nodes") ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 16:38:06 -08:00
Dave Marchevsky	5d8d6634cc	selftests/bpf: Add local kptr stashing test Add a new selftest, local_kptr_stash, which uses bpf_kptr_xchg to stash a bpf_obj_new-allocated object in a map. Test the following scenarios: * Stash two rb_nodes in an arraymap, don't unstash them, rely on map free to destruct them * Stash two rb_nodes in an arraymap, unstash the second one in a separate program, rely on map free to destruct first Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 16:38:05 -08:00
Dave Marchevsky	738c96d5e2	bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg The previous patch added necessary plumbing for verifier and runtime to know what to do with non-kernel PTR_TO_BTF_IDs in map values, but didn't provide any way to get such local kptrs into a map value. This patch modifies verifier handling of bpf_kptr_xchg to allow MEM_ALLOC kptr types. check_reg_type is modified to accept MEM_ALLOC-flagged input to bpf_kptr_xchg despite such types not being in btf_ptr_types. This could have been done with a MAYBE_MEM_ALLOC equivalent to MAYBE_NULL, but bpf_kptr_xchg is the only helper that I can foresee using MAYBE_MEM_ALLOC, so keep it special-cased for now. The verifier tags bpf_kptr_xchg retval MEM_ALLOC if and only if the BTF associated with the retval is not kernel BTF. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 16:38:05 -08:00
Dave Marchevsky	c8e1875409	bpf: Support __kptr to local kptrs If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module BTF - it must have been allocated by bpf_obj_new and therefore must be free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new. This patch adds support for treating __kptr-tagged pointers to "local kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr acquire / release semantics. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); If struct node_data had a matching definition in kernel BTF, the verifier would expect a destructor for the type to be registered. Since struct node_data does not match any type in kernel BTF, the verifier knows that there is no kfunc that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can only come from bpf_obj_new. So instead of searching for a registered dtor, a bpf_obj_drop dtor can be assumed. This allows the runtime to properly destruct such kptrs in bpf_obj_free_fields, which enables maps to clean up map_vals w/ such kptrs when going away. Implementation notes: * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr. Before this patch, the variable would only ever point to vmlinux or module BTFs, but now it can point to some program BTF for local kptr type. It's later used to populate the (btf, btf_id) pair in kptr btf field. * It's necessary to btf_get the program BTF when populating btf_field for local kptr. btf_record_free later does a btf_put. * Behavior for non-local referenced kptrs is not modified, as bpf_find_btf_id helper only searches vmlinux and module BTFs for matching BTF type. 
If such a type is found, btf_field_kptr's btf will pass btf_is_kernel check, and the associated release function is some one-argument dtor. If btf_is_kernel check fails, associated release function is two-arg bpf_obj_drop_impl. Before this patch only btf_field_kptr's w/ kernel or module BTFs were created. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 16:38:05 -08:00
Dave Thaler	c1f9e14e3b	bpf, docs: Explain helper functions Add brief text about existence of helper functions, with details to go in separate psABI text. Note that text about runtime functions (kfuncs) is part of a separate patch, not this one. Signed-off-by: Dave Thaler <dthaler@microsoft.com> Link: https://lore.kernel.org/r/20230308205303.1308-1-dthaler1968@googlemail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 13:02:00 -08:00


1168893 Commits