linux

iv/linux

Author	SHA1	Message	Date
Andrii Nakryiko	6bcd39d366	selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF Add a self-tests validating libbpf is able to perform CO-RE relocations against the type defined in kernel module BTF. if bpf_testmod.o is not supported by the kernel (e.g., due to version mismatch), skip tests, instead of failing. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-9-andrii@kernel.org	2020-12-03 17:38:21 -08:00
Andrii Nakryiko	5ed31472b9	selftests/bpf: Add support for marking sub-tests as skipped Previously skipped sub-tests would be counted as passing with ":OK" appened in the log. Change that to be accounted as ":SKIP". Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-8-andrii@kernel.org	2020-12-03 17:38:21 -08:00
Andrii Nakryiko	9f7fa22589	selftests/bpf: Add bpf_testmod kernel module for testing Add bpf_testmod module, which is conceptually out-of-tree module and provides ways for selftests/bpf to test various kernel module-related functionality: raw tracepoint, fentry/fexit/fmod_ret, etc. This module will be auto-loaded by test_progs test runner and expected by some of selftests to be present and loaded. Pahole currently isn't able to generate BTF for static functions in kernel modules, so make sure traced function is global. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201203204634.1325171-7-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Andrii Nakryiko	4f33a53d56	libbpf: Add kernel module BTF support for CO-RE relocations Teach libbpf to search for candidate types for CO-RE relocations across kernel modules BTFs, in addition to vmlinux BTF. If at least one candidate type is found in vmlinux BTF, kernel module BTFs are not iterated. If vmlinux BTF has no matching candidates, then find all kernel module BTFs and search for all matching candidates across all of them. Kernel's support for module BTFs are inferred from the support for BTF name pointer in BPF UAPI. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-6-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Andrii Nakryiko	0f7515ca7c	libbpf: Refactor CO-RE relocs to not assume a single BTF object Refactor CO-RE relocation candidate search to not expect a single BTF, rather return all candidate types with their corresponding BTF objects. This will allow to extend CO-RE relocations to accommodate kernel module BTFs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201203204634.1325171-5-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Andrii Nakryiko	a19f93cfaf	libbpf: Add internal helper to load BTF data by FD Add a btf_get_from_fd() helper, which constructs struct btf from in-kernel BTF data by FD. This is used for loading module BTFs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-4-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Andrii Nakryiko	2fe8890848	bpf: Keep module's btf_data_size intact after load Having real btf_data_size stored in struct module is benefitial to quickly determine which kernel modules have associated BTF object and which don't. There is no harm in keeping this info, as opposed to keeping invalid pointer. Fixes: `607c543f93` ("bpf: Sanitize BTF data pointer after module is loaded") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-3-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Andrii Nakryiko	12cc126df8	bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() __module_address() needs to be called with preemption disabled or with module_mutex taken. preempt_disable() is enough for read-only uses, which is what this fix does. Also, module_put() does internal check for NULL, so drop it as well. Fixes: `a38d1107f9` ("bpf: support raw tracepoints in modules") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201203204634.1325171-2-andrii@kernel.org	2020-12-03 17:38:20 -08:00
Alexei Starovoitov	cadd64807c	Merge branch 'Add support to set window_clamp from bpf setsockops' Prankur gupta says: ==================== This patch contains support to set tcp window_field field from bpf setsockops. v2: Used TCP_WINDOW_CLAMP setsockopt logic for bpf_setsockopt (review comment addressed) v3: Created a common function for duplicated code (review comment addressed) v4: Removing logic to pass struct sock and struct tcp_sock together (review comment addressed) ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-12-03 17:25:24 -08:00
Prankur gupta	55144f31f0	selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP Adding selftests for new added functionality to set TCP_WINDOW_CLAMP from bpf setsockopt. Signed-off-by: Prankur gupta <prankgup@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202213152.435886-3-prankgup@fb.com	2020-12-03 17:23:24 -08:00
Prankur gupta	cb81110997	bpf: Adds support for setting window clamp Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP, which sets the maximum receiver window size. It will be useful for limiting receiver window based on RTT. Signed-off-by: Prankur gupta <prankgup@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com	2020-12-03 17:23:24 -08:00
Colin Ian King	2faa7328f5	samples/bpf: Fix spelling mistake "recieving" -> "receiving" There is a spelling mistake in an error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20201203114452.1060017-1-colin.king@canonical.com	2020-12-03 12:18:02 -08:00
Brendan Jackman	58c185b85d	bpf: Fix cold build of test_progs-no_alu32 This object lives inside the trunner output dir, i.e. tools/testing/selftests/bpf/no_alu32/btf_data.o At some point it gets copied into the parent directory during another part of the build, but that doesn't happen when building test_progs-no_alu32 from clean. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20201203120850.859170-1-jackmanb@google.com	2020-12-03 12:10:53 -08:00
Stanislav Fomichev	d6d418bd8f	libbpf: Cap retries in sys_bpf_prog_load I've seen a situation, where a process that's under pprof constantly generates SIGPROF which prevents program loading indefinitely. The right thing to do probably is to disable signals in the upper layers while loading, but it still would be nice to get some error from libbpf instead of an endless loop. Let's add some small retry limit to the program loading: try loading the program 5 (arbitrary) times and give up. v2: * 10 -> 5 retires (Andrii Nakryiko) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202231332.3923644-1-sdf@google.com	2020-12-03 12:01:18 -08:00
Toke Høiland-Jørgensen	9cf309c56f	libbpf: Sanitise map names before pinning When we added sanitising of map names before loading programs to libbpf, we still allowed periods in the name. While the kernel will accept these for the map names themselves, they are not allowed in file names when pinning maps. This means that bpf_object__pin_maps() will fail if called on an object that contains internal maps (such as sections .rodata). Fix this by replacing periods with underscores when constructing map pin paths. This only affects the paths generated by libbpf when bpf_object__pin_maps() is called with a path argument. Any pin paths set by bpf_map__set_pin_path() are unaffected, and it will still be up to the caller to avoid invalid characters in those. Fixes: `113e6b7e15` ("libbpf: Sanitise internal map names so they are not rejected by the kernel") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203093306.107676-1-toke@redhat.com	2020-12-03 11:57:42 -08:00
Andrei Matei	80b2b5c3a7	libbpf: Fail early when loading programs with unspecified type Before this patch, a program with unspecified type (BPF_PROG_TYPE_UNSPEC) would be passed to the BPF syscall, only to have the kernel reject it with an opaque invalid argument error. This patch makes libbpf reject such programs with a nicer error message - in particular libbpf now tries to diagnose bad ELF section names at both open time and load time. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203043410.59699-1-andreimatei1@gmail.com	2020-12-03 11:37:05 -08:00
Andrii Nakryiko	a8b415c9bd	Merge branch 'Fixes for ima selftest' KP Singh says: ==================== From: KP Singh <kpsingh@google.com> # v3 -> v4 * Fix typos. * Update commit message for the indentation patch. * Added Andrii's acks. # v2 -> v3 * Added missing tags. * Indentation fixes + some other fixes suggested by Andrii. * Re-indent file to tabs. The selftest for the bpf_ima_inode_hash helper uses a shell script to setup the system for ima. While this worked without an issue on recent desktop distros, it failed on environments with stripped out shells like busybox which is also used by the bpf CI. This series fixes the assumptions made on the availablity of certain command line switches and the expectation that securityfs being mounted by default. It also adds the missing kernel config dependencies in tools/testing/selftests/bpf and, lastly, changes the indentation of ima_setup.sh to use tabs. ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2020-12-03 11:20:21 -08:00
KP Singh	ffebecd9d4	selftests/bpf: Indent ima_setup.sh with tabs. The file was formatted with spaces instead of tabs and went unnoticed as checkpatch.pl did not complain (probably because this is a shell script). Re-indent it with tabs to be consistent with other scripts. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-5-kpsingh@chromium.org	2020-12-03 11:20:21 -08:00
KP Singh	d932e043b9	selftests/bpf: Add config dependency on BLK_DEV_LOOP The ima selftest restricts its scope to a test filesystem image mounted on a loop device and prevents permanent ima policy changes for the whole system. Fixes: `34b82d3ac1` ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-4-kpsingh@chromium.org	2020-12-03 11:20:21 -08:00
KP Singh	1ee076719d	selftests/bpf: Ensure securityfs mount before writing ima policy SecurityFS may not be mounted even if it is enabled in the kernel config. So, check if the mount exists in /proc/mounts by parsing the file and, if not, mount it on /sys/kernel/security. Fixes: `34b82d3ac1` ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-3-kpsingh@chromium.org	2020-12-03 11:20:21 -08:00
KP Singh	3db980449b	selftests/bpf: Update ima_setup.sh for busybox losetup on busybox does not output the name of loop device on using -f with --show. It also doesn't support -j to find the loop devices for a given backing file. losetup is updated to use "-a" which is available on busybox. blkid does not support options (-s and -o) to only display the uuid, so parse the output instead. Not all environments have mkfs.ext4, the test requires a loop device with a backing image file which could formatted with any filesystem. Update to using mkfs.ext2 which is available on busybox. Fixes: `34b82d3ac1` ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-2-kpsingh@chromium.org	2020-12-03 11:20:20 -08:00
Alexei Starovoitov	61b759480e	Merge branch 'libbpf: add support for privileged/unprivileged control separation' Mariusz Dudek says: ==================== From: Mariusz Dudek <mariuszx.dudek@intel.com> This patch series adds support for separation of eBPF program load and xsk socket creation. In for example a Kubernetes environment you can have an AF_XDP CNI or daemonset that is responsible for launching pods that execute an application using AF_XDP sockets. It is desirable that the pod runs with as low privileges as possible, CAP_NET_RAW in this case, and that all operations that require privileges are contained in the CNI or daemonset. In this case, you have to be able separate ePBF program load from xsk socket creation. Currently, this will not work with the xsk_socket__create APIs because you need to have CAP_NET_ADMIN privileges to load eBPF program and CAP_SYS_ADMIN privileges to create update xsk_bpf_maps. To be exact xsk_set_bpf_maps does not need those privileges but it takes the prog_fd and xsks_map_fd and those are known only to process that was loading eBPF program. The api bpf_prog_get_fd_by_id that looks up the fd of the prog using an prog_id and bpf_map_get_fd_by_id that looks for xsks_map_fd usinb map_id both requires CAP_SYS_ADMIN. With this patch, the pod can be run with CAP_NET_RAW capability only. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. Without this patch in case of insufficient rights ENOPERM is returned by xsk_socket__create. To resolve this privileges issue two new APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_sokcet__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Usage example: int xsk_setup_xdp_prog(int ifindex, int xsks_map_fd) int xsk_socket__update_xskmap(struct xsk_socket xsk, int xsks_map_fd); Inserts AF_XDP socket "fd" into the xskmap. The first patch introduces the new APIs. The second patch provides a new sample applications working as control and modification to existing xdpsock application to work with less privileges. This patch set is based on bpf-next commit `97306be45f` (Merge branch 'switch to memcg-based memory accounting') Since v6 - rebase on `97306be45f` to resolve RLIMIT conflicts Since v5 - fixed sample/bpf/xdpsock_user.c to resolve merge conflicts Since v4 - sample/bpf/Makefile issues fixed Since v3: - force_set_map flag removed - leaking of xsk struct fixed - unified function error returning policy implemented Since v2: - new APIs moved itto LIBBPF_0.3.0 section - struct bpf_prog_cfg_opts removed - loading own eBPF program via xsk_setup_xdp_prog functionality removed Since v1: - struct bpf_prog_cfg improved for backward/forward compatibility - API xsk_update_xskmap renamed to xsk_socket__update_xskmap - commit message formatting fixed ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-12-03 10:38:06 -08:00
Mariusz Dudek	3627d9702d	samples/bpf: Sample application for eBPF load and socket creation split Introduce a sample program to demonstrate the control and data plane split. For the control plane part a new program called xdpsock_ctrl_proc is introduced. For the data plane part, some code was added to xdpsock_user.c to act as the data plane entity. Application xdpsock_ctrl_proc works as control entity with sudo privileges (CAP_SYS_ADMIN and CAP_NET_ADMIN are sufficient) and the extended xdpsock as data plane entity with CAP_NET_RAW capability only. Usage example: sudo ./samples/bpf/xdpsock_ctrl_proc -i <interface> sudo ./samples/bpf/xdpsock -i <interface> -q <queue_id> -n <interval> -N -l -R Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-3-mariuszx.dudek@intel.com	2020-12-03 10:37:59 -08:00
Mariusz Dudek	e459f49b43	libbpf: Separate XDP program load with xsk socket creation Add support for separation of eBPF program load and xsk socket creation. This is needed for use-case when you want to privide as little privileges as possible to the data plane application that will handle xsk socket creation and incoming traffic. With this patch the data entity container can be run with only CAP_NET_RAW capability to fulfill its purpose of creating xsk socket and handling packages. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. To resolve privileges issue two APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_socket__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-2-mariuszx.dudek@intel.com	2020-12-03 10:37:59 -08:00
Brendan Jackman	22e8ebe35a	tools/resolve_btfids: Fix some error messages Add missing newlines and fix polarity of strerror argument. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20201203102234.648540-1-jackmanb@google.com	2020-12-03 10:25:47 -08:00
Stanislav Fomichev	a874c8c389	selftests/bpf: Copy file using read/write in local storage test Splice (copy_file_range) doesn't work on all filesystems. I'm running test kernels on top of my read-only disk image and it uses plan9 under the hood. This prevents test_local_storage from successfully passing. There is really no technical reason to use splice, so lets do old-school read/write to copy file; this should work in all environments. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202174947.3621989-1-sdf@google.com	2020-12-03 10:22:45 -08:00
Alexei Starovoitov	0d1e026959	Merge branch 'bpftool: improve split BTF support' Andrii Nakryiko says: ==================== Few follow up improvements to bpftool for split BTF support: - emit "name <anon>" for non-named BTFs in `bpftool btf show` command; - when dumping /sys/kernel/btf/<module> use /sys/kernel/btf/vmlinux as the base BTF, unless base BTF is explicitly specified with -B flag. This patch set also adds btf__base_btf() getter to access base BTF of the struct btf. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-12-03 10:17:27 -08:00
Andrii Nakryiko	fa4528379a	tools/bpftool: Auto-detect split BTFs in common cases In case of working with module's split BTF from /sys/kernel/btf/*, auto-substitute /sys/kernel/btf/vmlinux as the base BTF. This makes using bpftool with module BTFs faster and simpler. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-4-andrii@kernel.org	2020-12-03 10:17:27 -08:00
Andrii Nakryiko	0cfdcd6378	libbpf: Add base BTF accessor Add ability to get base BTF. It can be also used to check if BTF is split BTF. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-3-andrii@kernel.org	2020-12-03 10:17:26 -08:00
Andrii Nakryiko	71ccb50074	tools/bpftool: Emit name <anon> for anonymous BTFs For consistency of output, emit "name <anon>" for BTFs without the name. This keeps output more consistent and obvious. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-2-andrii@kernel.org	2020-12-03 10:17:26 -08:00
Alexei Starovoitov	97306be45f	Merge branch 'switch to memcg-based memory accounting' Roman Gushchin says: ==================== Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf use cases (e.g. memlock()). Any specific value is either too low and creates false failures or is too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and charge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. 6) rlimits are generally considered as (at least partially) obsolete. They do not provide a comprehensive system for the control of physical resources: memory, cpu, io etc. All resource control developments in the recent years were related to cgroups. In order to overcome these problems let's switch to the memory cgroup-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of the memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. Cgroup-based accounting adds new requirements: 1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled. These options are usually enabled, maybe excluding tiny builds for embedded devices. 2) The system should have a configured cgroup hierarchy, including reasonable memory limits and/or guarantees. Modern systems usually delegate this task to systemd or similar task managers. Without meeting these requirements there are no limits on how much memory bpf can use and a non-root user is able to hurt the system by allocating too much. But because per-user rlimits do not provide a functional system to protect and manage physical resources anyway, anyone who seriously depends on it, should use cgroups. When a bpf map is created, the memory cgroup of the process which creates the map is recorded. Subsequently all memory allocation related to the bpf map are charged to the same cgroup. It includes allocations made from interrupts and by any processes. Bpf program memory is charged to the memory cgroup of a process which loads the program. The patchset consists of the following parts: 1) 4 mm patches are required on the mm side, otherwise vmallocs cannot be mapped to userspace 2) memcg-based accounting for various bpf objects: progs and maps 3) removal of the rlimit-based accounting 4) removal of rlimit adjustments in userspace samples v9: - always charge the saved memory cgroup, by Daniel, Toke and Alexei - added bpf_map_kzalloc() - rebase and minor fixes v8: - extended the cover letter to be more clear on new requirements, by Daniel - an approximate value is provided by map memlock info, by Alexei v7: - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei - switched allocations made from an interrupt context to new helpers, by Daniel - rebase and minor fixes v6: - rebased to the latest version of the remote charging API - fixed signatures, added acks v5: - rebased to the latest version of the remote charging API - implemented kmem accounting from an interrupt context, by Shakeel - rebased to latest changes in mm allowed to map vmallocs to userspace - fixed a build issue in kselftests, by Alexei - fixed a use-after-free bug in bpf_map_free_deferred() - added bpf line info coverage, by Shakeel - split bpf map charging preparations into a separate patch v4: - covered allocations made from an interrupt context, by Daniel - added some clarifications to the cover letter v3: - droped the userspace part for further discussions/refinements, by Andrii and Song v2: - fixed build issue, caused by the remaining rlimit-based accounting for sockhash maps ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-12-02 18:46:19 -08:00
Roman Gushchin	5b0764b2d3	bpf: samples: Do not touch RLIMIT_MEMLOCK Since bpf is not using rlimit memlock for the memory accounting and control, do not change the limit in sample applications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-35-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	3ac1f01b43	bpf: Eliminate rlimit-based memory accounting for bpf progs Do not use rlimit-based memory accounting for bpf progs. It has been replaced with memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	80ee81e040	bpf: Eliminate rlimit-based memory accounting infra for bpf maps Remove rlimit-based accounting infrastructure code, which is not used anymore. To provide a backward compatibility, use an approximation of the bpf map memory footprint as a "memlock" value, available to a user via map info. The approximation is based on the maximal number of elements and key and value sizes. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	ab31be378a	bpf: Eliminate rlimit-based memory accounting for bpf local storage maps Do not use rlimit-based memory accounting for bpf local storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	819a4f3235	bpf: Eliminate rlimit-based memory accounting for xskmap maps Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	370868107b	bpf: Eliminate rlimit-based memory accounting for stackmap maps Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	0d2c4f9640	bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	abbdd0813f	bpf: Eliminate rlimit-based memory accounting for bpf ringbuffer Do not use rlimit-based memory accounting for bpf ringbuffer. It has been replaced with the memcg-based memory accounting. bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM) and a valid pointer, so to simplify the code make it return NULL in the first case. This allows to drop a couple of lines in ringbuf_map_alloc() and also makes it look similar to other memory allocating function like kmalloc(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-28-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	db54330d3e	bpf: Eliminate rlimit-based memory accounting for reuseport_array maps Do not use rlimit-based memory accounting for reuseport_array maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-27-guro@fb.com	2020-12-02 18:32:47 -08:00
Roman Gushchin	a37fb7ef24	bpf: Eliminate rlimit-based memory accounting for queue_stack_maps maps Do not use rlimit-based memory accounting for queue_stack maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-26-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	cbddcb574d	bpf: Eliminate rlimit-based memory accounting for lpm_trie maps Do not use rlimit-based memory accounting for lpm_trie maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-25-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	755e5d5536	bpf: Eliminate rlimit-based memory accounting for hashtab maps Do not use rlimit-based memory accounting for hashtab maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-24-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	844f157f6c	bpf: Eliminate rlimit-based memory accounting for devmap maps Do not use rlimit-based memory accounting for devmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	087b0d39fe	bpf: Eliminate rlimit-based memory accounting for cgroup storage maps Do not use rlimit-based memory accounting for cgroup storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-22-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	711cabaf14	bpf: Eliminate rlimit-based memory accounting for cpumap maps Do not use rlimit-based memory accounting for cpumap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-21-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	f043733f31	bpf: Eliminate rlimit-based memory accounting for bpf_struct_ops maps Do not use rlimit-based memory accounting for bpf_struct_ops maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-20-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	1bc5975613	bpf: Eliminate rlimit-based memory accounting for arraymap maps Do not use rlimit-based memory accounting for arraymap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-19-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	28e1dcdef0	bpf: Refine memcg-based memory accounting for xskmap maps Extend xskmap memory accounting to include the memory taken by the xsk_map_node structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com	2020-12-02 18:32:46 -08:00
Roman Gushchin	7846dd9f83	bpf: Refine memcg-based memory accounting for sockmap and sockhash maps Include internal metadata into the memcg-based memory accounting. Also include the memory allocated on updating an element. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com	2020-12-02 18:32:46 -08:00

1 2 3 4 5 ...

967898 Commits