IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
A new field bpf_seq_afinfo is added to tcp_iter_state
to provide bpf tcp iterator afinfo. There are two
reasons on why we did this.
First, the current way to get afinfo from PDE_DATA
does not work for bpf iterator as its seq_file
inode does not conform to /proc/net/{tcp,tcp6}
inode structures. More specifically, anonymous
bpf iterator will use an anonymous inode which
is shared in the system and we cannot change inode
private data structure at all.
Second, bpf iterator for tcp/tcp6 wants to
traverse all tcp and tcp6 sockets in one pass
and bpf program can control whether they want
to skip one sk_family or not. Having a different
afinfo with family AF_UNSPEC make it easier
to understand in the code.
This patch does not change /proc/net/{tcp,tcp6} behavior
as the bpf_seq_afinfo will be NULL for these two proc files.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230804.3987829-1-yhs@fb.com
This patch adds support of SO_KEEPALIVE flag and TCP related options
to bpf_setsockopt() routine. This is helpful if we want to enable or tune
TCP keepalive for applications which don't do it in the userspace code.
v3:
- update kernel-doc in uapi (Nikita Vetoshkin <nekto0n@yandex-team.ru>)
v4:
- update kernel-doc in tools too (Alexei Starovoitov)
- add test to selftests (Alexei Starovoitov)
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200620153052.9439-3-zeil@yandex-team.ru
./test_progs-no_alu32 -t get_stack_raw_tp
fails due to:
52: (85) call bpf_get_stack#67
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (c7) r1 s>>= 32
; if (usize < 0)
57: (c5) if r1 s< 0x0 goto pc+26
R0=inv(id=0,smax_value=800) R1_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,smax_value=800) R9=inv800
; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
58: (1f) r9 -= r8
; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
59: (bf) r2 = r7
60: (0f) r2 += r1
regs=1 stack=0 before 52: (85) call bpf_get_stack#67
; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
61: (bf) r1 = r6
62: (bf) r3 = r9
63: (b7) r4 = 0
64: (85) call bpf_get_stack#67
R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=800,var_off=(0x0; 0x3ff),s32_max_value=1023,u32_max_value=1023) R3_w=inv(id=0,umax_value=9223372036854776608)
R3 unbounded memory access, use 'var &= const' or 'if (var < const)'
In the C code:
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
if (ksize < 0)
return 0;
We used to have problem with pointer arith in R2.
Now it's a problem with two integers in R3.
'if (usize < 0)' is comparing R1 and makes it [0,800], but R8 stays [-inf,800].
Both registers represent the same 'usize' variable.
Then R9 -= R8 is doing 800 - [-inf, 800]
so the result of "max_len - usize" looks unbounded to the verifier while
it's obvious in C code that "max_len - usize" should be [0, 800].
To workaround the problem convert ksize and usize variables from int to long.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Prevent loading/parsing vmlinux BTF twice in some cases: for CO-RE relocations
and for BTF-aware hooks (tp_btf, fentry/fexit, etc).
Fixes: a6ed02cac6 ("libbpf: Load btf_vmlinux only once per object.")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200624043805.1794620-1-andriin@fb.com
Building bpftool yields the following complaint:
pids.c: In function 'emit_obj_refs_json':
pids.c:175:80: warning: declaration of 'json_wtr' shadows a global declaration [-Wshadow]
175 | void emit_obj_refs_json(struct obj_refs_table *table, __u32 id, json_writer_t *json_wtr)
| ~~~~~~~~~~~~~~~^~~~~~~~
In file included from pids.c:11:
main.h:141:23: note: shadowed declaration is here
141 | extern json_writer_t *json_wtr;
| ^~~~~~~~
Let's rename the variable.
v2:
- Rename the variable instead of calling the global json_wtr directly.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200623213600.16643-1-quentin@isovalent.com
Currently, if the clang-bpf-co-re feature is not available, the build
fails with e.g.
CC prog.o
prog.c:1462:10: fatal error: profiler.skel.h: No such file or directory
1462 | #include "profiler.skel.h"
| ^~~~~~~~~~~~~~~~~
This is due to the fact that the BPFTOOL_WITHOUT_SKELETONS macro is not
defined, despite BUILD_BPF_SKELS not being set. Fix this by correctly
evaluating $(BUILD_BPF_SKELS) when deciding on whether to add
-DBPFTOOL_WITHOUT_SKELETONS to CFLAGS.
Fixes: 05aca6da3b ("tools/bpftool: Generalize BPF skeleton support and generate vmlinux.h")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200623103710.10370-1-tklauser@distanz.ch
Extend original variable-length tests with a case to catch a common
existing pattern of testing for < 0 for errors. Note because
verifier also tracks upper bounds and we know it can not be greater
than MAX_LEN here we can skip upper bound check.
In ALU64 enabled compilation converting from long->int return types
in probe helpers results in extra instruction pattern, <<= 32, s >>= 32.
The trade-off is the non-ALU64 case works. If you really care about
every extra insn (XDP case?) then you probably should be using original
int type.
In addition adding a sext insn to bpf might help the verifier in the
general case to avoid these types of tricks.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200623032224.4020118-3-andriin@fb.com
Add selftest that validates variable-length data reading and concatentation
with one big shared data array. This is a common pattern in production use for
monitoring and tracing applications, that potentially can read a lot of data,
but overall read much less. Such pattern allows to determine precisely what
amount of data needs to be sent over perfbuf/ringbuf and maximize efficiency.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200623032224.4020118-2-andriin@fb.com
Switch most of BPF helper definitions from returning int to long. These
definitions are coming from comments in BPF UAPI header and are used to
generate bpf_helper_defs.h (under libbpf) to be later included and used from
BPF programs.
In actual in-kernel implementation, all the helpers are defined as returning
u64, but due to some historical reasons, most of them are actually defined as
returning int in UAPI (usually, to return 0 on success, and negative value on
error).
This actually causes Clang to quite often generate sub-optimal code, because
compiler believes that return value is 32-bit, and in a lot of cases has to be
up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
before they can be used further in BPF code.
Besides just "polluting" the code, these 32-bit shifts quite often cause
problems for cases in which return value matters. This is especially the case
for the family of bpf_probe_read_str() functions. There are few other similar
helpers (e.g., bpf_read_branch_records()), in which return value is used by
BPF program logic to record variable-length data and process it. For such
cases, BPF program logic carefully manages offsets within some array or map to
read variable-length data. For such uses, it's crucial for BPF verifier to
track possible range of register values to prove that all the accesses happen
within given memory bounds. Those extraneous zero-extending bit shifts,
inserted by Clang (and quite often interleaved with other code, which makes
the issues even more challenging and sometimes requires employing extra
per-variable compiler barriers), throws off verifier logic and makes it mark
registers as having unknown variable offset. We'll study this pattern a bit
later below.
Another common pattern is to check return of BPF helper for non-zero state to
detect error conditions and attempt alternative actions in such case. Even in
this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
quite often leads to sub-optimal and unnecessary extra code. We'll look at
this pattern as well.
Clang's BPF target supports two modes of code generation: ALU32, in which it
is capable of using lower 32-bit parts of registers, and no-ALU32, in which
only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
above described problems, but not in all cases.
This patch switches all the cases in which BPF helpers return 0 or negative
error from returning int to returning long. It is shown below that such change
in definition leads to equivalent or better code. No-ALU32 mode benefits more,
but ALU32 mode doesn't degrade or still gets improved code generation.
Another class of cases switched from int to long are bpf_probe_read_str()-like
helpers, which encode successful case as non-negative values, while still
returning negative value for errors.
In all of such cases, correctness is preserved due to two's complement
encoding of negative values and the fact that all helpers return values with
32-bit absolute value. Two's complement ensures that for negative values
higher 32 bits are all ones and when truncated, leave valid negative 32-bit
value with the same value. Non-negative values have upper 32 bits set to zero
and similarly preserve value when high 32 bits are truncated. This means that
just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
require any extra shifts).
To minimize the chances of regressions, two code patterns were investigated,
as mentioned above. For both patterns, BPF assembly was analyzed in
ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
new 64-bit long return type.
Case 1. Variable-length data reading and concatenation. This is quite
ubiquitous pattern in tracing/monitoring applications, reading data like
process's environment variables, file path, etc. In such case, many pieces of
string-like variable-length data are read into a single big buffer, and at the
end of the process, only a part of array containing actual data is sent to
user-space for further processing. This case is tested in test_varlen.c
selftest (in the next patch). Code flow is roughly as follows:
void *payload = &sample->payload;
u64 len;
len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
if (len <= MAX_SZ1) {
payload += len;
sample->len1 = len;
}
len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
if (len <= MAX_SZ2) {
payload += len;
sample->len2 = len;
}
/* and so on */
sample->total_len = payload - &sample->payload;
/* send over, e.g., perf buffer */
There could be two variations with slightly different code generated: when len
is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
BPF assembly instructions between two successive invocations of
bpf_probe_read_kernel_str() were used to check code regressions. Results are
below, followed by short analysis. Left side is using helpers with int return
type, the right one is after the switch to long.
ALU32 + INT ALU32 + LONG
=========== ============
64-BIT (13 insns): 64-BIT (10 insns):
------------------------------------ ------------------------------------
17: call 115 17: call 115
18: if w0 > 256 goto +9 <LBB0_4> 18: if r0 > 256 goto +6 <LBB0_4>
19: w1 = w0 19: r1 = 0 ll
20: r1 <<= 32 21: *(u64 *)(r1 + 0) = r0
21: r1 s>>= 32 22: r6 = 0 ll
22: r2 = 0 ll 24: r6 += r0
24: *(u64 *)(r2 + 0) = r1 00000000000000c8 <LBB0_4>:
25: r6 = 0 ll 25: r1 = r6
27: r6 += r1 26: w2 = 256
00000000000000e0 <LBB0_4>: 27: r3 = 0 ll
28: r1 = r6 29: call 115
29: w2 = 256
30: r3 = 0 ll
32: call 115
32-BIT (11 insns): 32-BIT (12 insns):
------------------------------------ ------------------------------------
17: call 115 17: call 115
18: if w0 > 256 goto +7 <LBB1_4> 18: if w0 > 256 goto +8 <LBB1_4>
19: r1 = 0 ll 19: r1 = 0 ll
21: *(u32 *)(r1 + 0) = r0 21: *(u32 *)(r1 + 0) = r0
22: w1 = w0 22: r0 <<= 32
23: r6 = 0 ll 23: r0 >>= 32
25: r6 += r1 24: r6 = 0 ll
00000000000000d0 <LBB1_4>: 26: r6 += r0
26: r1 = r6 00000000000000d8 <LBB1_4>:
27: w2 = 256 27: r1 = r6
28: r3 = 0 ll 28: w2 = 256
30: call 115 29: r3 = 0 ll
31: call 115
In ALU32 mode, the variant using 64-bit length variable clearly wins and
avoids unnecessary zero-extension bit shifts. In practice, this is even more
important and good, because BPF code won't need to do extra checks to "prove"
that payload/len are within good bounds.
32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
uses extra register. The latter might potentially lose some range information,
but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
after check at 18:, and shifting 32 bits left/right keeps that range intact.
We should probably look into Clang's logic and see why it chooses bitshifts
over sub-register assignments for this.
NO-ALU32 + INT NO-ALU32 + LONG
============== ===============
64-BIT (14 insns): 64-BIT (10 insns):
------------------------------------ ------------------------------------
17: call 115 17: call 115
18: r0 <<= 32 18: if r0 > 256 goto +6 <LBB0_4>
19: r1 = r0 19: r1 = 0 ll
20: r1 >>= 32 21: *(u64 *)(r1 + 0) = r0
21: if r1 > 256 goto +7 <LBB0_4> 22: r6 = 0 ll
22: r0 s>>= 32 24: r6 += r0
23: r1 = 0 ll 00000000000000c8 <LBB0_4>:
25: *(u64 *)(r1 + 0) = r0 25: r1 = r6
26: r6 = 0 ll 26: r2 = 256
28: r6 += r0 27: r3 = 0 ll
00000000000000e8 <LBB0_4>: 29: call 115
29: r1 = r6
30: r2 = 256
31: r3 = 0 ll
33: call 115
32-BIT (13 insns): 32-BIT (13 insns):
------------------------------------ ------------------------------------
17: call 115 17: call 115
18: r1 = r0 18: r1 = r0
19: r1 <<= 32 19: r1 <<= 32
20: r1 >>= 32 20: r1 >>= 32
21: if r1 > 256 goto +6 <LBB1_4> 21: if r1 > 256 goto +6 <LBB1_4>
22: r2 = 0 ll 22: r2 = 0 ll
24: *(u32 *)(r2 + 0) = r0 24: *(u32 *)(r2 + 0) = r0
25: r6 = 0 ll 25: r6 = 0 ll
27: r6 += r1 27: r6 += r1
00000000000000e0 <LBB1_4>: 00000000000000e0 <LBB1_4>:
28: r1 = r6 28: r1 = r6
29: r2 = 256 29: r2 = 256
30: r3 = 0 ll 30: r3 = 0 ll
32: call 115 32: call 115
In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
len, code is identical.
So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
difference stems from internal Clang decision, rather than compiler lacking
enough information about types.
Case 2. Let's look at the simpler case of checking return result of BPF helper
for errors. The code is very simple:
long bla;
if (bpf_probe_read_kenerl(&bla, sizeof(bla), 0))
return 1;
else
return 0;
ALU32 + CHECK (9 insns) ALU32 + CHECK (9 insns)
==================================== ====================================
0: r1 = r10 0: r1 = r10
1: r1 += -8 1: r1 += -8
2: w2 = 8 2: w2 = 8
3: r3 = 0 3: r3 = 0
4: call 113 4: call 113
5: w1 = w0 5: r1 = r0
6: w0 = 1 6: w0 = 1
7: if w1 != 0 goto +1 <LBB2_2> 7: if r1 != 0 goto +1 <LBB2_2>
8: w0 = 0 8: w0 = 0
0000000000000048 <LBB2_2>: 0000000000000048 <LBB2_2>:
9: exit 9: exit
Almost identical code, the only difference is the use of full register
assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
architectures, new BPF assembly might be slightly less optimal, in theory. But
one can argue that's not a big issue, given that use of full registers is
still prevalent (e.g., for parameter passing).
NO-ALU32 + CHECK (11 insns) NO-ALU32 + CHECK (9 insns)
==================================== ====================================
0: r1 = r10 0: r1 = r10
1: r1 += -8 1: r1 += -8
2: r2 = 8 2: r2 = 8
3: r3 = 0 3: r3 = 0
4: call 113 4: call 113
5: r1 = r0 5: r1 = r0
6: r1 <<= 32 6: r0 = 1
7: r1 >>= 32 7: if r1 != 0 goto +1 <LBB2_2>
8: r0 = 1 8: r0 = 0
9: if r1 != 0 goto +1 <LBB2_2> 0000000000000048 <LBB2_2>:
10: r0 = 0 9: exit
0000000000000058 <LBB2_2>:
11: exit
NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
shifts.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200623032224.4020118-1-andriin@fb.com
Andrii Nakryiko says:
====================
This patch set implements libbpf support for a second kind of special externs,
kernel symbols, in addition to existing Kconfig externs.
Right now, only untyped (const void) externs are supported, which, in
C language, allow only to take their address. In the future, with kernel BTF
getting type info about its own global and per-cpu variables, libbpf will
extend this support with BTF type info, which will allow to also directly
access variable's contents and follow its internal pointers, similarly to how
it's possible today in fentry/fexit programs.
As a first practical use of this functionality, bpftool gained ability to show
PIDs of processes that have open file descriptors for BPF map/program/link/BTF
object. It relies on iter/task_file BPF iterator program to extract this
information efficiently.
There was a bunch of bpftool refactoring (especially Makefile) necessary to
generalize bpftool's internal BPF program use. This includes generalization of
BPF skeletons support, addition of a vmlinux.h generation, extracting and
building minimal subset of bpftool for bootstrapping.
v2->v3:
- fix sec_btf_id check (Hao);
v1->v2:
- docs fixes (Quentin);
- dual GPL/BSD license for pid_inter.bpf.c (Quentin);
- NULL-init kcfg_data (Hao Luo);
rfc->v1:
- show pids, if supported by kernel, always (Alexei);
- switched iter output to binary to support showing process names;
- update man pages;
- fix few minor bugs in libbpf w.r.t. extern iteration.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add statements about bpftool being able to discover process info, holding
reference to BPF map, prog, link, or BTF. Show example output as well.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-10-andriin@fb.com
Adapt Makefile to support BPF skeleton generation beyond single profiler.bpf.c
case. Also add vmlinux.h generation and switch profiler.bpf.c to use it.
clang-bpf-global-var feature is extended and renamed to clang-bpf-co-re to
check for support of preserve_access_index attribute, which, together with BTF
for global variables, is the minimum requirement for modern BPF programs.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-7-andriin@fb.com
Build minimal "bootstrap mode" bpftool to enable skeleton (and, later,
vmlinux.h generation), instead of building almost complete, but slightly
different (w/o skeletons, etc) bpftool to bootstrap complete bpftool build.
Current approach doesn't scale well (engineering-wise) when adding more BPF
programs to bpftool and other complicated functionality, as it requires
constant adjusting of the code to work in both bootstrapped mode and normal
mode.
So it's better to build only minimal bpftool version that supports only BPF
skeleton code generation and BTF-to-C conversion. Thankfully, this is quite
easy to accomplish due to internal modularity of bpftool commands. This will
also allow to keep adding new functionality to bpftool in general, without the
need to care about bootstrap mode for those new parts of bpftool.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-6-andriin@fb.com
Move functions that parse map and prog by id/tag/name/etc outside of
map.c/prog.c, respectively. These functions are used outside of those files
and are generic enough to be in common. This also makes heavy-weight map.c and
prog.c more decoupled from the rest of bpftool files and facilitates more
lightweight bootstrap bpftool variant.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-5-andriin@fb.com
Validate libbpf is able to handle weak and strong kernel symbol externs in BPF
code correctly.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-4-andriin@fb.com
Add support for another (in addition to existing Kconfig) special kind of
externs in BPF code, kernel symbol externs. Such externs allow BPF code to
"know" kernel symbol address and either use it for comparisons with kernel
data structures (e.g., struct file's f_op pointer, to distinguish different
kinds of file), or, with the help of bpf_probe_user_kernel(), to follow
pointers and read data from global variables. Kernel symbol addresses are
found through /proc/kallsyms, which should be present in the system.
Currently, such kernel symbol variables are typeless: they have to be defined
as `extern const void <symbol>` and the only operation you can do (in C code)
with them is to take its address. Such extern should reside in a special
section '.ksyms'. bpf_helpers.h header provides __ksym macro for this. Strong
vs weak semantics stays the same as with Kconfig externs. If symbol is not
found in /proc/kallsyms, this will be a failure for strong (non-weak) extern,
but will be defaulted to 0 for weak externs.
If the same symbol is defined multiple times in /proc/kallsyms, then it will
be error if any of the associated addresses differs. In that case, address is
ambiguous, so libbpf falls on the side of caution, rather than confusing user
with randomly chosen address.
In the future, once kernel is extended with variables BTF information, such
ksym externs will be supported in a typed version, which will allow BPF
program to read variable's contents directly, similarly to how it's done for
fentry/fexit input arguments.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-3-andriin@fb.com
Switch existing Kconfig externs to be just one of few possible kinds of more
generic externs. This refactoring is in preparation for ksymbol extern
support, added in the follow up patch. There are no functional changes
intended.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20200619231703.738941-2-andriin@fb.com
Add a bunch of getter for various aspects of BPF map. Some of these attribute
(e.g., key_size, value_size, type, etc) are available right now in struct
bpf_map_def, but this patch adds getter allowing to fetch them individually.
bpf_map_def approach isn't very scalable, when ABI stability requirements are
taken into account. It's much easier to extend libbpf and add support for new
features, when each aspect of BPF map has separate getter/setter.
Getters follow the common naming convention of not explicitly having "get" in
its name: bpf_map__type() returns map type, bpf_map__key_size() returns
key_size. Setters, though, explicitly have set in their name:
bpf_map__set_type(), bpf_map__set_key_size().
This patch ensures we now have a getter and a setter for the following
map attributes:
- type;
- max_entries;
- map_flags;
- numa_node;
- key_size;
- value_size;
- ifindex.
bpf_map__resize() enforces unnecessary restriction of max_entries > 0. It is
unnecessary, because libbpf actually supports zero max_entries for some cases
(e.g., for PERF_EVENT_ARRAY map) and treats it specially during map creation
time. To allow setting max_entries=0, new bpf_map__set_max_entries() setter is
added. bpf_map__resize()'s behavior is preserved for backwards compatibility
reasons.
Map ifindex getter is added as well. There is a setter already, but no
corresponding getter. Fix this assymetry as well. bpf_map__set_ifindex()
itself is converted from void function into error-returning one, similar to
other setters. The only error returned right now is -EBUSY, if BPF map is
already loaded and has corresponding FD.
One lacking attribute with no ability to get/set or even specify it
declaratively is numa_node. This patch fixes this gap and both adds
programmatic getter/setter, as well as adds support for numa_node field in
BTF-defined map.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200621062112.3006313-1-andriin@fb.com
Add selftests to test access to map pointers from bpf program for all
map types except struct_ops (that one would need additional work).
verifier test focuses mostly on scenarios that must be rejected.
prog_tests test focuses on accessing multiple fields both scalar and a
nested struct from bpf program and verifies that those fields have
expected values.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/139a6a17f8016491e39347849b951525335c6eb4.1592600985.git.rdna@fb.com
Set map_btf_name and map_btf_id for all map types so that map fields can
be accessed by bpf programs.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
There are multiple use-cases when it's convenient to have access to bpf
map fields, both `struct bpf_map` and map type specific struct-s such as
`struct bpf_array`, `struct bpf_htab`, etc.
For example while working with sock arrays it can be necessary to
calculate the key based on map->max_entries (some_hash % max_entries).
Currently this is solved by communicating max_entries via "out-of-band"
channel, e.g. via additional map with known key to get info about target
map. That works, but is not very convenient and error-prone while
working with many maps.
In other cases necessary data is dynamic (i.e. unknown at loading time)
and it's impossible to get it at all. For example while working with a
hash table it can be convenient to know how much capacity is already
used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).
At the same time kernel knows this info and can provide it to bpf
program.
Fill this gap by adding support to access bpf map fields from bpf
program for both `struct bpf_map` and map type specific fields.
Support is implemented via btf_struct_access() so that a user can define
their own `struct bpf_map` or map type specific struct in their program
with only necessary fields and preserve_access_index attribute, cast a
map to this struct and use a field.
For example:
struct bpf_map {
__u32 max_entries;
} __attribute__((preserve_access_index));
struct bpf_array {
struct bpf_map map;
__u32 elem_size;
} __attribute__((preserve_access_index));
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 4);
__type(key, __u32);
__type(value, __u32);
} m_array SEC(".maps");
SEC("cgroup_skb/egress")
int cg_skb(void *ctx)
{
struct bpf_array *array = (struct bpf_array *)&m_array;
struct bpf_map *map = (struct bpf_map *)&m_array;
/* .. use map->max_entries or array->map.max_entries .. */
}
Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
corresponding struct. Only reading from map fields is supported.
For btf_struct_access() to work there should be a way to know btf id of
a struct that corresponds to a map type. To get btf id there should be a
way to get a stringified name of map-specific struct, such as
"bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
`struct bpf_map_ops` to handle it:
* .map_btf_name keeps a btf name of a struct returned by map_alloc();
* .map_btf_id is used to cache btf id of that struct.
To make btf ids calculation cheaper they're calculated once while
preparing btf_vmlinux and cached same way as it's done for btf_id field
of `struct bpf_func_proto`
While calculating btf ids, struct names are NOT checked for collision.
Collisions will be checked as a part of the work to prepare btf ids used
in verifier in compile time that should land soon. The only known
collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs
net/core/sock_map.c) was fixed earlier.
Both new fields .map_btf_name and .map_btf_id must be set for a map type
for the feature to work. If neither is set for a map type, verifier will
return ENOTSUPP on a try to access map_ptr of corresponding type. If
just one of them set, it's verifier misconfiguration.
Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
supported separately.
The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
perfmon_capable() so that unpriv programs won't have access to bpf map
fields.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
There are two different `struct bpf_htab` in bpf code in the following
files:
- kernel/bpf/hashtab.c
- net/core/sock_map.c
It makes it impossible to find proper btf_id by name = "bpf_htab" and
kind = BTF_KIND_STRUCT what is needed to support access to map ptr so
that bpf program can access `struct bpf_htab` fields.
To make it possible one of the struct-s should be renamed, sock_map.c
looks like a better candidate for rename since it's specialized version
of hashtab.
Rename it to bpf_shtab ("sh" stands for Sock Hash).
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/c006a639e03c64ca50fc87c4bb627e0bfba90f4e.1592600985.git.rdna@fb.com
btf_parse_vmlinux() implements manual search for struct bpf_ctx_convert
since at the time of implementing btf_find_by_name_kind() was not
available.
Later btf_find_by_name_kind() was introduced in 27ae7997a6 ("bpf:
Introduce BPF_PROG_TYPE_STRUCT_OPS"). It provides similar search
functionality and can be leveraged in btf_parse_vmlinux(). Do it.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/6e12d5c3e8a3d552925913ef73a695dd1bb27800.1592600985.git.rdna@fb.com
Relicense it to be compatible with the rest of bpftool files.
Suggested-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200619222024.519774-1-andriin@fb.com
Added two test_verifier subtests for 32bit pointer/scalar arithmetic
with BPF_SUB operator. They are passing verifier now.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200618234632.3321367-1-yhs@fb.com
When do experiments with llvm (disabling instcombine and
simplifyCFG), I hit the following error with test_seg6_loop.o.
; R1=pkt(id=0,off=0,r=48,imm=0), R7=pkt(id=0,off=40,r=48,imm=0)
w2 = w7
; R2_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
w2 -= w1
R2 32-bit pointer arithmetic prohibited
The corresponding source code is:
uint32_t srh_off
// srh and skb->data are all packet pointers
srh_off = (char *)srh - (char *)(long)skb->data;
The verifier does not support 32-bit pointer/scalar arithmetic.
Without my llvm change, the code looks like
; R3=pkt(id=0,off=40,r=48,imm=0), R8=pkt(id=0,off=0,r=48,imm=0)
w3 -= w8
; R3_w=inv(id=0)
This is explicitly allowed in verifier if both registers are
pointers and the opcode is BPF_SUB.
To fix this problem, I changed the verifier to allow
32-bit pointer/scaler BPF_SUB operations.
At the source level, the issue could be workarounded with
inline asm or changing "uint32_t srh_off" to "uint64_t srh_off".
But I feel that verifier change might be the right thing to do.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200618234631.3321118-1-yhs@fb.com
The cache_idx is currently picked by RR. There is chance that
the same cache_idx will be picked by multiple sk_storage_maps while
other cache_idx is still unused. e.g. It could happen when the
sk_storage_map is recreated during the restart of the user
space process.
This patch tracks the usage count for each cache_idx. There is
16 of them now (defined in BPF_SK_STORAGE_CACHE_SIZE).
It will try to pick the free cache_idx. If none was found,
it would pick one with the minimal usage count.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200617174226.2301909-1-kafai@fb.com
During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on
success, instead of amount of data successfully read. This majorly breaks
applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str()
and their results. Fix this by returning actual number of bytes read.
Fixes: 8d92db5c04 ("bpf: rework the compat kernel probe handling")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
Pull networking fixes from David Miller:
1) Don't get per-cpu pointer with preemption enabled in nft_set_pipapo,
fix from Stefano Brivio.
2) Fix memory leak in ctnetlink, from Pablo Neira Ayuso.
3) Multiple definitions of MPTCP_PM_MAX_ADDR, from Geliang Tang.
4) Accidently disabling NAPI in non-error paths of macb_open(), from
Charles Keepax.
5) Fix races between alx_stop and alx_remove, from Zekun Shen.
6) We forget to re-enable SRIOV during resume in bnxt_en driver, from
Michael Chan.
7) Fix memory leak in ipv6_mc_destroy_dev(), from Wang Hai.
8) rxtx stats use wrong index in mvpp2 driver, from Sven Auhagen.
9) Fix memory leak in mptcp_subflow_create_socket error path, from Wei
Yongjun.
10) We should not adjust the TCP window advertised when sending dup acks
in non-SACK mode, because it won't be counted as a dup by the sender
if the window size changes. From Eric Dumazet.
11) Destroy the right number of queues during remove in mvpp2 driver,
from Sven Auhagen.
12) Various WOL and PM fixes to e1000 driver, from Chen Yu, Vaibhav
Gupta, and Arnd Bergmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
e1000e: fix unused-function warning
e1000: use generic power management
e1000e: Do not wake up the system via WOL if device wakeup is disabled
lan743x: add MODULE_DEVICE_TABLE for module loading alias
mlxsw: spectrum: Adjust headroom buffers for 8x ports
bareudp: Fixed configuration to avoid having garbage values
mvpp2: remove module bugfix
tcp: grow window for OOO packets only for SACK flows
mptcp: fix memory leak in mptcp_subflow_create_socket()
netfilter: flowtable: Make nf_flow_table_offload_add/del_cb inline
net/sched: act_ct: Make tcf_ct_flow_table_restore_skb inline
net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
mvpp2: ethtool rxtx stats fix
MAINTAINERS: switch to my private email for Renesas Ethernet drivers
rocker: fix incorrect error handling in dma_rings_init
test_objagg: Fix potential memory leak in error handling
net: ethernet: mtk-star-emac: simplify interrupt handling
mld: fix memory leak in ipv6_mc_destroy_dev()
bnxt_en: Return from timer if interface is not in open state.
bnxt_en: Fix AER reset logic on 57500 chips.
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7pMu8ACgkQ+7dXa6fL
C2sgNRAAnOCq281ojebwVSIkRDVGlxBODNeJtcgOOC4ib3jZM++vhdnnJgJIr8kc
UOQ+LF4E5hNgwELubCrLOx/AjIzVuzfrreFNOPh3P3TSjyxW/7AU+tFGkdnLkYun
NyadOXxI9Dk84UBN1LrmRm3ccAbF6nDf/KcPykS0oAEh12LVm6sDpVJz9+1uclnK
Xq0rgl+zrR0+SPplPYz4P/OEPTgNfpLV9DHVYfkvsvEhwb/TaUmiLj9SEgndp+fg
L3CT66QXoG9zds9hYFVODQM3devaXOpGNU0vsc9+Xg57BWuYvVed24eH5oBrcBQo
F5kon+mcZlHtmTG87UJ6vFUwfHGeYqKKRb9XTbKbATtIWvkB3XM4Jz/XUlaAIE+R
y0njNYEoIn4wHkleL/KeHmFPFSYG7pZpAN3wqhXZ9wVptXRDSB10OK3vpgLD/2rM
V68FmBin6eStE5qZ8Mu9qMQxXb1buknoef37FIXUozjc+VMPrg5dbG6GjcW/CqIC
LynaNUvrQOvF0ZFVzMt7ffZPrdDYlqqzyN0bReMdibys4BPKo24gSr5aVMLt7YXf
ZaJeApcSdsphs4uUmtHKlHYgUQrSEl9pSGmc4hcq9bNIKHo9S618LG9uuUplOjdP
j0L8N6uWBHQCjAvu6kDm8Wp5pRPPUnTgaXDsok7yP2GLRqBEm3Q=
=bYOZ
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"I've managed to get xfstests kind of working with afs. Here are a set
of patches that fix most of the bugs found.
There are a number of primary issues:
- Incorrect handling of mtime and non-handling of ctime. It might be
argued, that the latter isn't a bug since the AFS protocol doesn't
support ctime, but I should probably still update it locally.
- Shared-write mmap, truncate and writeback bugs. This includes not
changing i_size under the callback lock, overwriting local i_size
with the reply from the server after a partial writeback, not
limiting the writeback from an mmapped page to EOF.
- Checks for an abort code indicating that the primary vnode in an
operation was deleted by a third-party are done in the wrong place.
- Silly rename bugs. This includes an incomplete conversion to the
new operation handling, duplicate nlink handling, nlink changing
not being done inside the callback lock and insufficient handling
of third-party conflicting directory changes.
And some secondary ones:
- The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.
- Remove a couple of unused or incompletely used bits.
- Remove a couple of redundant success checks.
These seem to fix all the data-corruption bugs found by
./check -afs -g quick
along with the obvious silly rename bugs and time bugs.
There are still some test failures, but they seem to fall into two
classes: firstly, the authentication/security model is different to
the standard UNIX model and permission is arbitrated by the server and
cached locally; and secondly, there are a number of features that AFS
does not support (such as mknod). But in these cases, the tests
themselves need to be adapted or skipped.
Using the in-kernel afs client with xfstests also found a bug in the
AuriStor AFS server that has been fixed for a future release"
* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix silly rename
afs: afs_vnode_commit_status() doesn't need to check the RPC error
afs: Fix use of afs_check_for_remote_deletion()
afs: Remove afs_operation::abort_code
afs: Fix yfs_fs_fetch_status() to honour vnode selector
afs: Remove yfs_fs_fetch_file_status() as it's not used
afs: Fix the mapping of the UAEOVERFLOW abort code
afs: Fix truncation issues and mmap writeback size
afs: Concoct ctimes
afs: Fix EOF corruption
afs: afs_write_end() should change i_size under the right lock
afs: Fix non-setting of mtime when writing into mmap
Hi Linus,
Please, pull the following patches that replace zero-length arrays with
flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size for the
array declaration entirely:
struct something {
size_t count;
struct foo items[];
};
This is the way the kernel expects dynamically sized trailing elements to be
declared. It allows the compiler to generate errors when the flexible array
does not occur last in the structure, which helps to prevent some kind of
undefined behavior[3] bugs from being inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via sizeof(),
CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For instance, there is no
mechanism that warns us that the following application of the sizeof() operator
to a zero-length array always results in zero:
struct something {
size_t count;
struct foo items[0];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items) * instance->count;
memcpy(instance->items, source, size);
At the last line of code above, size turns out to be zero, when one might have
thought it represents the total size in bytes of the dynamic memory recently
allocated for the trailing array items. Here are a couple examples of this
issue[4][5]. Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such operators will
be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through the use of
a flexible array member:
struct something {
size_t count;
struct foo items[];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items[0]) * instance->count;
memcpy(instance->items, source, size);
Thanks
--
Gustavo
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] https://git.kernel.org/linus/76497732932f15e7323dc805e8ea8dc11bb587cf
[4] https://git.kernel.org/linus/f2cd32a443da694ac4e28fbf4ac6f9d5cc63a539
[5] https://git.kernel.org/linus/ab91c2a89f86be2898cee208d492816ec238b2cf
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl7oSmYACgkQRwW0y0cG
2zGEiw/9FiH3MBwMlPVJPcneY1wCH/N6ZSf+kr7SJiVwV/YbBe9EWuaKZ0D4vAWm
kTACkOfsZ1me1OKz9wNrOxn0zezTMFQK2PLPgzKIPuK0Hg8MW1EU63RIRsnr0bPc
b90wZwyBQtLbGRC3/9yAACKwFZe/SeYoV5rr8uylffA35HZW3SZbTex6XnGCF9Q5
UYwnz7vNg+9VH1GRQeB5jlqL7mAoRzJ49I/TL3zJr04Mn+xC+vVBS7XwipDd03p+
foC6/KmGhlCO9HMPASReGrOYNPydDAMKLNPdIfUlcTKHWsoTjGOcW/dzfT4rUu6n
nKr5rIqJ4FdlIvXZL5P5w7Uhkwbd3mus5G0HBk+V/cUScckCpBou+yuGzjxXSitQ
o0qPsGjWr3v+gxRWHj8YO/9MhKKKW0Iy+QmAC9+uLnbfJdbUwYbLIXbsOKnokCA8
jkDEr64F5hFTKtajIK4VToJK1CsM3D9dwTub27lwZysHn3RYSQdcyN+9OiZgdzpc
GlI6QoaqKR9AT4b/eBmqlQAKgA07zSQ5RsIjRm6hN3d7u/77x2kyrreo+trJyVY2
F17uEOzfTqZyxtkPayE8DVjTtbByoCuBR0Vm1oMAFxjyqZQY5daalB0DKd1mdYqi
khIXqNAuYqHOb898fEuzidjV38hxZ9y8SAym3P7WnYl+Hxz+8Jo=
=8HUQ
-----END PGP SIGNATURE-----
Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should no
longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size
for the array declaration entirely:
struct something {
size_t count;
struct foo items[];
};
This is the way the kernel expects dynamically sized trailing elements
to be declared. It allows the compiler to generate errors when the
flexible array does not occur last in the structure, which helps to
prevent some kind of undefined behavior[3] bugs from being
inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via
sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
instance, there is no mechanism that warns us that the following
application of the sizeof() operator to a zero-length array always
results in zero:
struct something {
size_t count;
struct foo items[0];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items) * instance->count;
memcpy(instance->items, source, size);
At the last line of code above, size turns out to be zero, when one
might have thought it represents the total size in bytes of the
dynamic memory recently allocated for the trailing array items. Here
are a couple examples of this issue[4][5].
Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such
operators will be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through
the use of a flexible array member:
struct something {
size_t count;
struct foo items[];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items[0]) * instance->count;
memcpy(instance->items, source, size);
instead"
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
[4] commit f2cd32a443 ("rndis_wlan: Remove logically dead code")
[5] commit ab91c2a89f ("tpm: eventlog: Replace zero-length array with flexible-array member")
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
* tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits)
w1: Replace zero-length array with flexible-array
tracing/probe: Replace zero-length array with flexible-array
soc: ti: Replace zero-length array with flexible-array
tifm: Replace zero-length array with flexible-array
dmaengine: tegra-apb: Replace zero-length array with flexible-array
stm class: Replace zero-length array with flexible-array
Squashfs: Replace zero-length array with flexible-array
ASoC: SOF: Replace zero-length array with flexible-array
ima: Replace zero-length array with flexible-array
sctp: Replace zero-length array with flexible-array
phy: samsung: Replace zero-length array with flexible-array
RxRPC: Replace zero-length array with flexible-array
rapidio: Replace zero-length array with flexible-array
media: pwc: Replace zero-length array with flexible-array
firmware: pcdp: Replace zero-length array with flexible-array
oprofile: Replace zero-length array with flexible-array
block: Replace zero-length array with flexible-array
tools/testing/nvdimm: Replace zero-length array with flexible-array
libata: Replace zero-length array with flexible-array
kprobes: Replace zero-length array with flexible-array
...
The purgatory Makefile removes -fstack-protector options if they were
configured in, but does not currently add -fno-stack-protector.
If gcc was configured with the --enable-default-ssp configure option,
this results in the stack protector still being enabled for the
purgatory (absent distro-specific specs files that might disable it
again for freestanding compilations), if the main kernel is being
compiled with stack protection enabled (if it's disabled for the main
kernel, the top-level Makefile will add -fno-stack-protector).
This will break the build since commit
e4160b2e4b ("x86/purgatory: Fail the build if purgatory.ro has missing symbols")
and prior to that would have caused runtime failure when trying to use
kexec.
Explicitly add -fno-stack-protector to avoid this, as done in other
Makefiles that need to disable the stack protector.
Reported-by: Gabriel C <nix.or.die@googlemail.com>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2020-06-16
This series contains fixes to e1000 and e1000e.
Chen fixes an e1000e issue where systems could be waken via WoL, even
though the user has disabled the wakeup bit via sysfs.
Vaibhav Gupta updates the e1000 driver to clean up the legacy Power
Management hooks.
Arnd Bergmann cleans up the inconsistent use CONFIG_PM_SLEEP
preprocessor tags, which also resolves the compiler warnings about the
possibility of unused structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_PM_SLEEP #ifdef checks in this file are inconsistent,
leading to a warning about sometimes unused function:
drivers/net/ethernet/intel/e1000e/netdev.c:137:13: error: unused function 'e1000e_check_me' [-Werror,-Wunused-function]
Rather than adding more #ifdefs, just remove them completely
and mark the PM functions as __maybe_unused to let the compiler
work it out on it own.
Fixes: e086ba2fcc ("e1000e: disable s0ix entry and exit flows for ME systems")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
With legacy PM hooks, it was the responsibility of a driver to manage PCI
states and also the device's power state. The generic approach is to let PCI
core handle the work.
e1000_suspend() calls __e1000_shutdown() to perform intermediate tasks.
__e1000_shutdown() modifies the value of "wake" (device should be wakeup
enabled or not), responsible for controlling the flow of legacy PM.
Since, PCI core has no idea about the value of "wake", new code for generic
PM may produce unexpected results. Thus, use "device_set_wakeup_enable()"
to wakeup-enable the device accordingly.
Signed-off-by: Vaibhav Gupta <vaibhavgupta40@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently the system will be woken up via WOL(Wake On LAN) even if the
device wakeup ability has been disabled via sysfs:
cat /sys/devices/pci0000:00/0000:00:1f.6/power/wakeup
disabled
The system should not be woken up if the user has explicitly
disabled the wake up ability for this device.
This patch clears the WOL ability of this network device if the
user has disabled the wake up ability in sysfs.
Fixes: bc7f75fa97 ("[E1000E]: New pci-express e1000 driver")
Reported-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Without a MODULE_DEVICE_TABLE the attributes are missing that create
an alias for auto-loading the module in userspace via hotplug.
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix AFS's silly rename by the following means:
(1) Set the destination directory in afs_do_silly_rename() so as to avoid
misbehaviour and indicate that the directory data version will
increment by 1 so as to avoid warnings about unexpected changes in the
DV. Also indicate that the ctime should be updated to avoid xfstest
grumbling.
(2) Note when the server indicates that a directory changed more than we
expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a
third party change, checking on successful completion of unlink and
rename.
The problem is that the FS.RemoveFile RPC op doesn't report the status
of the unlinked file, though YFS.RemoveFile2 does. This can be
mitigated by the assumption that if the directory DV cranked by
exactly 1, we can be sure we removed one link from the file; further,
ordinarily in AFS, files cannot be hardlinked across directories, so
if we reduce nlink to 0, the file is deleted.
However, if the directory DV jumps by more than 1, we cannot know if a
third party intervened by adding or removing a link on the file we
just removed a link from.
The same also goes for any vnode that is at the destination of the
FS.Rename RPC op.
(3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock
section along with the other attribute updates if ->op_unlinked is set
on the descriptor for the appropriate vnode.
(4) Issue a follow up status fetch to the unlinked file in the event of a
third party conflict that makes it impossible for us to know if we
actually deleted the file or not.
(5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to
the user about the nlink of a silly deleted file so that it appears as
0, not 1.
Found with the generic/035 and generic/084 xfstests.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
The port's headroom buffers are used to store packets while they
traverse the device's pipeline and also to store packets that are egress
mirrored.
On Spectrum-3, ports with eight lanes use two headroom buffers between
which the configured headroom size is split.
In order to prevent packet loss, multiply the calculated headroom size
by two for 8x ports.
Fixes: da382875c6 ("mlxsw: spectrum: Extend to support Spectrum-3 ASIC")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Code to initialize the conf structure while gathering the configuration
of the device was missing.
Fixes: 571912c69f ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Signed-off-by: Martin <martin.varghese@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The remove function does not destroy all
BM Pools when per cpu pool is active.
When reloading the mvpp2 as a module the BM Pools
are still active in hardware and due to the bug
have twice the size now old + new.
This eventually leads to a kernel crash.
v2:
* add Fixes tag
Fixes: 7d04b0b13b ("mvpp2: percpu buffers")
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in 2013, we made a change that broke fast retransmit
for non SACK flows.
Indeed, for these flows, a sender needs to receive three duplicate
ACK before starting fast retransmit. Sending ACK with different
receive window do not count.
Even if enabling SACK is strongly recommended these days,
there still are some cases where it has to be disabled.
Not increasing the window seems better than having to
rely on RTO.
After the fix, following packetdrill test gives :
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+0 < . 1:1(0) ack 1 win 514
+0 accept(3, ..., ...) = 4
+0 < . 1:1001(1000) ack 1 win 514
// Quick ack
+0 > . 1:1(0) ack 1001 win 264
+0 < . 2001:3001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 3001:4001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 4001:5001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 1001:2001(1000) ack 1 win 514
// Hole is repaired.
+0 > . 1:1(0) ack 5001 win 272
Fixes: 4e4f1fc226 ("tcp: properly increase rcv_ssthresh for ofo packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>