IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The BPF atomic operations with the BPF_FETCH modifier along with
BPF_XCHG and BPF_CMPXCHG are fully ordered but the RISC-V JIT implements
all atomic operations except BPF_CMPXCHG with relaxed ordering.
Section 8.1 of the "The RISC-V Instruction Set Manual Volume I:
Unprivileged ISA" [1], titled, "Specifying Ordering of Atomic
Instructions" says:
| To provide more efficient support for release consistency [5], each
| atomic instruction has two bits, aq and rl, used to specify additional
| memory ordering constraints as viewed by other RISC-V harts.
and
| If only the aq bit is set, the atomic memory operation is treated as
| an acquire access.
| If only the rl bit is set, the atomic memory operation is treated as a
| release access.
|
| If both the aq and rl bits are set, the atomic memory operation is
| sequentially consistent.
Fix this by setting both aq and rl bits as 1 for operations with
BPF_FETCH and BPF_XCHG.
[1] https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
Fixes: dd642ccb45ec ("riscv, bpf: Implement more atomic operations for RV64")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20240505201633.123115-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF_ATOMIC_OP() macro documentation states that "BPF_ADD | BPF_FETCH"
should be the same as atomic_fetch_add(), which is currently not the
case on s390x: the serialization instruction "bcr 14,0" is missing.
This applies to "and", "or" and "xor" variants too.
s390x is allowed to reorder stores with subsequent fetches from
different addresses, so code relying on BPF_FETCH acting as a barrier,
for example:
stw [%r0], 1
afadd [%r1], %r2
ldxw %r3, [%r4]
may be broken. Fix it by emitting "bcr 14,0".
Note that a separate serialization instruction is not needed for
BPF_XCHG and BPF_CMPXCHG, because COMPARE AND SWAP performs
serialization itself.
Fixes: ba3b86b9cef0 ("s390/bpf: Implement new atomic ops")
Reported-by: Puranjay Mohan <puranjay12@gmail.com>
Closes: https://lore.kernel.org/bpf/mb61p34qvq3wf.fsf@kernel.org/
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240507000557.12048-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting
a read from struct thread_info. The SP_EL0 system register holds the
pointer to the task_struct and thread_info is the first member of this
struct. We can read the cpu number from the thread_info.
Here is how the ARM64 JITed assembly changes after this commit:
ARM64 JIT
===========
BEFORE AFTER
-------- -------
int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id();
mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0
movk x10, #0x802b, lsl #16 ldr w7, [x10, #24]
movk x10, #0x8000, lsl #32
blr x10
add x7, x0, #0x0
Performance improvement using benchmark[1]
./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
+---------------+-------------------+-------------------+--------------+
| Name | Before | After | % change |
|---------------+-------------------+-------------------+--------------|
| glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% |
| arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% |
| hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% |
+---------------+-------------------+-------------------+--------------+
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF
JITs.
Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
access using tpidr_el1"), the per-cpu offset for the CPU is stored in
the tpidr_el1/2 register of that CPU.
To support this BPF instruction in the ARM64 JIT, the following ARM64
instructions are emitted:
mov dst, src // Move src to dst, if src != dst
mrs tmp, tpidr_el1/2 // Move per-cpu offset of the current cpu in tmp.
add dst, dst, tmp // Add the per cpu offset to the dst.
To measure the performance improvement provided by this change, the
benchmark in [1] was used:
Before:
glob-arr-inc : 23.597 ± 0.012M/s
arr-inc : 23.173 ± 0.019M/s
hash-inc : 12.186 ± 0.028M/s
After:
glob-arr-inc : 23.819 ± 0.034M/s
arr-inc : 23.285 ± 0.017M/s
hash-inc : 12.419 ± 0.011M/s
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
RISCV saves the pointer to the CPU's task_struct in the TP (thread
pointer) register. This makes it trivial to get the CPU's processor id.
As thread_info is the first member of task_struct, we can read the
processor id from TP + offsetof(struct thread_info, cpu).
RISCV64 JIT output for `call bpf_get_smp_processor_id`
======================================================
Before After
-------- -------
auipc t1,0x848c ld a5,32(tp)
jalr 604(t1)
mv a5,a0
Benchmark using [1] on Qemu.
./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
+---------------+------------------+------------------+--------------+
| Name | Before | After | % change |
|---------------+------------------+------------------+--------------|
| glob-arr-inc | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s | + 24.04% |
| arr-inc | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s | + 23.56% |
| hash-inc | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s | + 32.18% |
+---------------+------------------+------------------+--------------+
NOTE: This benchmark includes changes from this patch and the previous
patch that implemented the per-cpu insn.
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF
JITs.
RISC-V uses generic per-cpu implementation where the offsets for CPUs
are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
the address of the task_struct in TP register. The first element in
task_struct is struct thread_info, and we can get the cpu number by
reading from the TP register + offsetof(struct thread_info, cpu).
Once we have the cpu number in a register we read the offset for that
cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
offset to the destination register.
To measure the improvement from this change, the benchmark in [1] was
used on Qemu:
Before:
glob-arr-inc : 1.127 ± 0.013M/s
arr-inc : 1.121 ± 0.004M/s
hash-inc : 0.681 ± 0.052M/s
After:
glob-arr-inc : 1.138 ± 0.011M/s
arr-inc : 1.366 ± 0.006M/s
hash-inc : 0.676 ± 0.001M/s
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This will add eBPF JIT support to the 32-bit ARCv2 processors. The
implementation is qualified by running the BPF tests on a Synopsys HSDK
board with "ARC HS38 v2.1c at 500 MHz" as the 4-core CPU.
The test_bpf.ko reports 2-10 fold improvements in execution time of its
tests. For instance:
test_bpf: #33 tcpdump port 22 jited:0 704 1766 2104 PASS
test_bpf: #33 tcpdump port 22 jited:1 120 224 260 PASS
test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:0 238 PASS
test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:1 23 PASS
test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:0 2034681 PASS
test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:1 1020022 PASS
Deployment and structure
------------------------
The related codes are added to "arch/arc/net":
- bpf_jit.h -- The interface that a back-end translator must provide
- bpf_jit_core.c -- Knows how to handle the input eBPF byte stream
- bpf_jit_arcv2.c -- The back-end code that knows the translation logic
The bpf_int_jit_compile() at the end of bpf_jit_core.c is the entrance
to the whole process. Normally, the translation is done in one pass,
namely the "normal pass". In case some relocations are not known during
this pass, some data (arc_jit_data) is allocated for the next pass to
come. This possible next (and last) pass is called the "extra pass".
1. Normal pass # The necessary pass
1a. Dry run # Get the whole JIT length, epilogue offset, etc.
1b. Emit phase # Allocate memory and start emitting instructions
2. Extra pass # Only needed if there are relocations to be fixed
2a. Patch relocations
Support status
--------------
The JIT compiler supports BPF instructions up to "cpu=v4". However, it
does not yet provide support for:
- Tail calls
- Atomic operations
- 64-bit division/remainder
- BPF_PROBE_MEM* (exception table)
The result of "test_bpf" test suite on an HSDK board is:
hsdk-lnx# insmod test_bpf.ko test_suite=test_bpf
test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed]
All the failing test cases are due to the ones that were not JIT'ed.
Categorically, they can be represented as:
.-----------.------------.-------------.
| test type | opcodes | # of cases |
|-----------+------------+-------------|
| atomic | 0xC3, 0xDB | 149 |
| div64 | 0x37, 0x3F | 22 |
| mod64 | 0x97, 0x9F | 15 |
`-----------^------------+-------------|
| (total) 186 |
`-------------'
Setup: build config
-------------------
The following configs must be set to have a working JIT test:
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_TEST_BPF=m
The following options are not necessary for the tests module,
but are good to have:
CONFIG_DEBUG_INFO=y # prerequisite for below
CONFIG_DEBUG_INFO_BTF=y # so bpftool can generate vmlinux.h
CONFIG_FTRACE=y #
CONFIG_BPF_SYSCALL=y # all these options lead to
CONFIG_KPROBE_EVENTS=y # having CONFIG_BPF_EVENTS=y
CONFIG_PERF_EVENTS=y #
Some BPF programs provide data through /sys/kernel/debug:
CONFIG_DEBUG_FS=y
arc# mount -t debugfs debugfs /sys/kernel/debug
Setup: elfutils
---------------
The libdw.{so,a} library that is used by pahole for processing
the final binary must come from elfutils 0.189 or newer. The
support for ARCv2 [1] has been added since that version.
[1]
https://sourceware.org/git/?p=elfutils.git;a=commit;h=de3d46b3e7
Setup: pahole
-------------
The line below in linux/scripts/Makefile.btf must be commented out:
pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats
Or else, the build will fail:
$ make V=1
...
BTF .btf.vmlinux.bin.o
pahole -J --btf_gen_floats \
-j --lang_exclude=rust \
--skip_encoding_btf_inconsistent_proto \
--btf_gen_optimized .tmp_vmlinux.btf
Complex, interval and imaginary float types are not supported
Encountered error while encoding BTF.
...
BTFIDS vmlinux
./tools/bpf/resolve_btfids/resolve_btfids vmlinux
libbpf: failed to find '.BTF' ELF section in vmlinux
FAILED: load BTF from vmlinux: No data available
This is due to the fact that the ARC toolchains generate
"complex float" DIE entries in libgcc and at the moment, pahole
can't handle such entries.
Running the tests
-----------------
host$ scp /bld/linux/lib/test_bpf.ko arc:
arc # sysctl net.core.bpf_jit_enable=1
arc # insmod test_bpf.ko test_suite=test_bpf
...
test_bpf: #1048 Staggered jumps: JMP32_JSLE_X jited:1 697811 PASS
test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed]
Acknowledgments
---------------
- Claudiu Zissulescu for his unwavering support
- Yuriy Kolerov for testing and troubleshooting
- Vladimir Isaev for the pahole workaround
- Sergey Matyukevich for paving the road by adding the interpreter support
Signed-off-by: Shahab Vahedi <shahab@synopsys.com>
Link: https://lore.kernel.org/r/20240430145604.38592-1-list+bpf@vahedi.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-05-08 (most Intel drivers)
This series contains updates to i40e, iavf, ice, igb, igc, e1000e, and ixgbe
drivers.
Asbjørn Sloth Tønnesen adds checks against supported flower control flags
for i40e, iavf, ice, and igb drivers.
Michal corrects filters removed during eswitch release for ice.
Corinna Vinschen defers PTP initialization to later in probe so that
netdev log entry is initialized on igc.
Ilpo Järvinen removes a couple of unused, duplicate defines on
e1000e and ixgbe.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
net: e1000e & ixgbe: Remove PCI_HEADER_TYPE_MFD duplicates
igc: fix a log entry using uninitialized netdev
ice: remove correct filters during eswitch release
igb: flower: validate control flags
ice: flower: validate control flags
iavf: flower: validate control flags
i40e: flower: validate control flags
====================
Link: https://lore.kernel.org/r/20240508173342.2760994-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Asbjørn Sloth Tønnesen says:
====================
net: qede: convert filter code to use extack
This series converts the filter code in the qede driver
to use NL_SET_ERR_MSG_*(extack, ...) for error handling.
Patch 1-12 converts qede_parse_flow_attr() to use extack,
along with all it's static helper functions.
qede_parse_flow_attr() is used in two places:
- qede_add_tc_flower_fltr()
- qede_flow_spec_to_rule()
In the latter call site extack is faked in the same way as
is done in mlxsw (patch 12).
While the conversion is going on, some error messages are silenced
in between patch 1-12. If wanted could squash patch 1-12 in a v3, but
I felt that it would be easier to review as 12 more trivial patches.
Patch 13 and 14, finishes up by converting qede_parse_actions(),
and ensures that extack is propagated to it, in both call contexts.
v1: https://lore.kernel.org/netdev/20240507104421.1628139-1-ast@fiberby.net/
====================
Link: https://lore.kernel.org/r/20240508143404.95901-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert DP_NOTICE/DP_INFO to NL_SET_ERR_MSG_MOD.
Keep edev around for use with QEDE_RSS_COUNT().
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-15-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass extack to qede_flow_spec_validate() when called in
qede_flow_spec_to_rule().
Pass extack to qede_parse_actions().
Not converting qede_flow_spec_validate() to use extack for
errors, as it's only called from qede_flow_spec_to_rule(),
where extack is faked into a DP_NOTICE anyway, so opting to
keep DP_VERBOSE/DP_NOTICE usage.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-14-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since qede_parse_flow_attr() now does error reporting
through extack, then give it a fake extack and extract the
error message afterwards if one was set.
The extracted error message is then passed on through
DP_NOTICE(), including messages that was earlier issued
with DP_INFO().
This fake extack approach is already used by
mlxsw_env_linecard_modules_power_mode_apply() in
drivers/net/ethernet/mellanox/mlxsw/core_env.c
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-13-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_parse_flow_attr() to take extack,
and drop the edev argument.
Convert DP_NOTICE calls to use NL_SET_ERR_MSG_* instead.
Pass extack in calls to qede_flow_parse_{tcp,udp}_v{4,6}().
In calls to qede_parse_flow_attr(), if extack is
unavailable, then use NULL for now, until a
subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-12-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define extack locally, to reduce line lengths and aid future users.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-11-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_udp_v4() to take extack,
and drop the edev argument.
Pass extack in call to qede_flow_parse_v4_common().
In call to qede_flow_parse_udp_v4(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-10-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_udp_v6() to take extack,
and drop the edev argument.
Pass extack in call to qede_flow_parse_v6_common().
In call to qede_flow_parse_udp_v6(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-9-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_tcp_v4() to take extack,
and drop the edev argument.
Pass extack in call to qede_flow_parse_v4_common().
In call to qede_flow_parse_tcp_v4(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-8-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_tcp_v6() to take extack,
and drop the edev argument.
Pass extack in call to qede_flow_parse_v6_common().
In call to qede_flow_parse_tcp_v6(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-7-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_v4_common() to take extack,
and drop the edev argument.
Convert DP_NOTICE call to use NL_SET_ERR_MSG_MOD instead.
Pass extack in calls to qede_flow_parse_ports() and
qede_set_v4_tuple_to_profile().
In calls to qede_flow_parse_v4_common(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-6-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_v6_common() to take extack,
and drop the edev argument.
Convert DP_NOTICE call to use NL_SET_ERR_MSG_MOD instead.
Pass extack in calls to qede_flow_parse_ports() and
qede_set_v6_tuple_to_profile().
In calls to qede_flow_parse_v6_common(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-5-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_set_v4_tuple_to_profile() to take extack,
and drop the edev argument.
Convert DP_INFO call to use NL_SET_ERR_MSG_MOD instead.
In calls to qede_set_v4_tuple_to_profile(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_set_v6_tuple_to_profile() to take extack,
and drop the edev argument.
Convert DP_INFO call to use NL_SET_ERR_MSG_MOD instead.
In calls to qede_set_v6_tuple_to_profile(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert qede_flow_parse_ports to use extack,
and drop the edev argument.
Convert DP_NOTICE call to use NL_SET_ERR_MSG_MOD instead.
In calls to qede_flow_parse_ports(), use NULL as extack
for now, until a subsequent patch makes extack available.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508143404.95901-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some usb drivers try to set small skb->truesize and break
core networking stacks.
In this patch, I removed one of the skb->truesize override.
I also replaced one skb_clone() by an allocation of a fresh
and small skb, to get minimally sized skbs, like we did
in commit 1e2c61172342 ("net: cdc_ncm: reduce skb truesize
in rx path") and 4ce62d5b2f7a ("net: usb: ax88179_178a:
stop lying about skb->truesize")
v3: also fix a sparse error ( https://lore.kernel.org/oe-kbuild-all/202405091310.KvncIecx-lkp@intel.com/ )
v2: leave the skb_trim() game because smsc95xx_rx_csum_offload()
needs the csum part. (Jakub)
While we are it, use get_unaligned() in smsc95xx_rx_csum_offload().
Fixes: 2f7ca802bdae ("net: Add SMSC LAN9500 USB2.0 10/100 ethernet adapter driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steve Glendinning <steve.glendinning@shawell.net>
Cc: UNGLinuxDriver@microchip.com
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240509083313.2113832-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1af2dface5d2 ("af_unix: Don't access successor in unix_del_edges()
during GC.") fixed use-after-free by avoid accessing edge->successor while
GC is in progress.
However, there could be a small race window where another process could
call unix_del_edges() while gc_in_progress is true and __skb_queue_purge()
is on the way.
So, we need another marker for struct scm_fp_list which indicates if the
skb is garbage-collected.
This patch adds dead flag in struct scm_fp_list and set it true before
calling __skb_queue_purge().
Fixes: 1af2dface5d2 ("af_unix: Don't access successor in unix_del_edges() during GC.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240508171150.50601-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
linux/gpio.h is deprecated and subject to remove.
The driver doesn't use it directly, replace it
with what is really being used.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508114519.972082-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Real number of Transmit queues are incremented when user enables HTB
class and vice versa. Depending on SKB priority driver returns transmit
queue (Txq). Transmit queues and Send queues are one-to-one mapped.
In few scenarios, Driver is returning transmit queue value which is
greater than real number of transmit queue and Stack detects this as
error and overwrites transmit queue value.
For example
user has added two classes and real number of queues are incremented
accordingly
- tc class add dev eth1 parent 1: classid 1:1 htb
rate 100Mbit ceil 100Mbit prio 1 quantum 1024
- tc class add dev eth1 parent 1: classid 1:2 htb
rate 100Mbit ceil 200Mbit prio 7 quantum 1024
now if user deletes the class with id 1:1, driver decrements the real
number of queues
- tc class del dev eth1 classid 1:1
But for the class with id 1:2, driver is returning transmit queue
value which is higher than real number of transmit queue leading
to below error
eth1 selects TX queue x, but real number of TX queues is x
This patch solves the problem by assigning deleted class transmit
queue/send queue to active class.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508070935.11501-1-hkelam@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make use of standard helpers to simplify filling in stats strings.
The first two ethtool_puts() changes address the following fortification
warnings flagged by W=1 builds with clang-18. (The last ethtool_puts
change does not because the warning relates to writing beyond the first
element of an array, and gve_gstrings_priv_flags only has one element.)
.../fortify-string.h:562:4: warning: call to '__read_overflow2_field' declared with 'warning' attribute: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
562 | __read_overflow2_field(q_size_field, size);
| ^
.../fortify-string.h:562:4: warning: call to '__read_overflow2_field' declared with 'warning' attribute: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
Likewise, the same changes resolve the same problems flagged by Smatch.
.../gve_ethtool.c:100 gve_get_strings() error: __builtin_memcpy() '*gve_gstrings_main_stats' too small (32 vs 576)
.../gve_ethtool.c:120 gve_get_strings() error: __builtin_memcpy() '*gve_gstrings_adminq_stats' too small (32 vs 512)
Compile tested only.
Reviewed-by: Shailend Chand <shailend@google.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240508-gve-comma-v2-2-1ac919225f13@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Although it does not seem to have any untoward side-effects,
the use of ';' to separate to assignments seems more appropriate than ','.
Flagged by clang-18 -Wcomma
No functional change intended.
Compile tested only.
Reviewed-by: Shailend Chand <shailend@google.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240508-gve-comma-v2-1-1ac919225f13@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Slow machines can delay scheduling of the packets for milliseconds.
Increase the delay to 8ms if KSFT_MACHINE_SLOW. Try to limit the
variability by moving setsockopts earlier (before we read time).
This fixes the "TXTIME rel" failures on debug kernels, like:
Case ICMPv4 - TXTIME rel returned '', expected 'OK'
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240510005705.43069-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On slow machines the SND timestamp sometimes doesn't arrive before
we quit. The test only waits as long as the packet delay, so it's
easy for a race condition to happen.
Double the wait but do a bit of polling, once the SND timestamp
arrives there's no point to wait any longer.
This fixes the "TXTIME abs" failures on debug kernels, like:
Case ICMPv4 - TXTIME abs returned '', expected 'OK'
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240510005705.43069-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFEKqOX9jqZmzwkCc1GSBvS7ZkBkFAmY5ajUACgkQ1GSBvS7Z
kBmR5w//U8Cr5QqBzvczVZNmEtgNm4Zsy9DR2QnV7Wzp1hH/4DL7oDzJX319sw0i
jNUu5QnGhbq5T3hBF/GIbugSQSVhTeS2oRk/Qxc/k8X7cNchlk+iB9U1YF8Lppj8
Y7pB6pT20SFO/8nhQguYQC8F+W0TKnpwSNcykHHIF1THe1MO8PfPbkdcvty28Mvs
xB/IiYee5nm88Chx/+PAuJWFSaUK/AmS5jlcBRKpffzPu/HtQ6cLtC8eRrZlYG0v
plVkExIlsRYhCudBp8ihhLQx2Nc3PpbRPDnE8AKyQCOILXM29/Lk85U4583Lbg9t
bMQuVEm1lw0HPH2iDMRwkkdsl09xPZX7XpnAv5zhEXnl9kDaEwLjV3eXaxdMUmK0
wHd05OMnPiOeyTyBvAjVdjNythfgg8fdy40K8E7G2BXw0G9Yv+/klyT2LM8bhmYp
ec2tI9FyxzCtY/Dz89Of2xQCrFLSXCHVjQAwmAYheSw247JfKu2m3fSnC5ABd/cm
B70VBkXESYk78uzK6JtpRCqA3CgppeB7tryOJ60Ac0GczeZUSUirbSJxH5laMF71
aPnhw+PRep+GWaX0ym2dmaIiK6NnkbEHk5B4oHwJFoJzwqJ2ezUAVh0oQssSxpwJ
zU4i8CtFegVWNsyly7N8js96a6T/Sn5Sny0NTCt+nJpbCwLaaD0=
=aPYm
-----END PGP SIGNATURE-----
Merge tag 'gtp-24-05-07' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/gtp
Pablo neira Ayuso says:
====================
gtp pull request 24-05-07
This v3 includes:
- fix for clang uninitialized variable per Jakub.
- address Smatch and Coccinelle reports per Simon
- remove inline in new IPv6 support per Simon
- fix memleaks in netlink control plane per Simon
-o-
The following patchset contains IPv6 GTP driver support for net-next,
this also includes IPv6 over IPv4 and vice-versa:
Patch #1 removes a unnecessary stack variable initialization in the
socket routine.
Patch #2 deals with GTP extension headers. This variable length extension
header to decapsulate packets accordingly. Otherwise, packets are
dropped when these extension headers are present which breaks
interoperation with other non-Linux based GTP implementations.
Patch #3 prepares for IPv6 support by moving IPv4 specific fields in PDP
context objects to a union.
Patch #4 adds IPv6 support while retaining backward compatibility.
Three new attributes allows to declare an IPv6 GTP tunnel
GTPA_FAMILY, GTPA_PEER_ADDR6 and GTPA_MS_ADDR6 as well as
IFLA_GTP_LOCAL6 to declare the IPv6 GTP UDP socket. Up to this
patch, only IPv6 outer in IPv6 inner is supported.
Patch #5 uses IPv6 address /64 prefix for UE/MS in the inner headers.
Unlike IPv4, which provides a 1:1 mapping between UE/MS,
IPv6 tunnel encapsulates traffic for /64 address as specified
by 3GPP TS. Patch has been split from Patch #4 to highlight
this behaviour.
Patch #6 passes up IPv6 link-local traffic, such as IPv6 SLAAC, for
handling to userspace so they are handled as control packets.
Patch #7 prepares to allow for GTP IPv4 over IPv6 and vice-versa by
moving IP specific debugging out of the function to build
IPv4 and IPv6 GTP packets.
Patch #8 generalizes TOS/DSCP handling following similar approach as
in the existing iptunnel infrastructure.
Patch #9 adds a helper function to build an IPv4 GTP packet in the outer
header.
Patch #10 adds a helper function to build an IPv6 GTP packet in the outer
header.
Patch #11 adds support for GTP IPv4-over-IPv6 and vice-versa.
Patch #12 allows to use the same TID/TEID (tunnel identifier) for inner
IPv4 and IPv6 packets for better UE/MS dual stack integration.
This series integrates with the osmocom.org project CI and TTCN-3 test
infrastructure (Oliver Smith) as well as the userspace libgtpnl library.
Thanks to Harald Welte, Oliver Smith and Pau Espin for reviewing and
providing feedback through the osmocom.org redmine platform to make this
happen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sven Auhagen reports transaction failures with following error:
./main.nft:13:1-26: Error: Could not process rule: Cannot allocate memory
percpu: allocation failed, size=16 align=8 atomic=1, atomic alloc failed, no space left
This points to failing pcpu allocation with GFP_ATOMIC flag.
However, transactions happen from user context and are allowed to sleep.
One case where we can call into percpu allocator with GFP_ATOMIC is
nft_counter expression.
Normally this happens from control plane, so this could use GFP_KERNEL
instead. But one use case, element insertion from packet path,
needs to use GFP_ATOMIC allocations (nft_dynset expression).
At this time, .clone callbacks always use GFP_ATOMIC for this reason.
Add gfp_t argument to the .clone function and pass GFP_KERNEL or
GFP_ATOMIC flag depending on context, this allows all clone memory
allocations to sleep for the normal (transaction) case.
Cc: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new test script that uses packetdrill tool to exercise conntrack
state machine.
Needs ip/ip6tables and conntrack tool (to check if we have an entry in
the expected state).
Test cases added here cover following scenarios:
1. already-acked (retransmitted) packets are not tagged as INVALID
2. RST packet coming when conntrack is already closing (FIN/CLOSE_WAIT)
transitions conntrack to CLOSE even if the RST is not an exact match
3. RST packets with out-of-window sequence numbers are marked as INVALID
4. SYN+Challenge ACK: check that challenge ack is allowed to pass
5. Old SYN/ACK: check conntrack handles the case where SYN is answered
with SYN/ACK for an old, previous connection attempt
6. Check SYN reception while in ESTABLISHED state generates a challenge
ack, RST response clears 'outdated' state + next SYN retransmit gets
us into 'SYN_RECV' conntrack state.
Tests get run twice, once with ipv4 and once with ipv6.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This set type keeps two copies of the sets' content,
priv->match (live version, used to match from packet path)
priv->clone (work-in-progress version of the 'future' priv->match).
All additions and removals are done on priv->clone. When transaction
completes, priv->clone becomes priv->match and a new clone is allocated
for use by next transaction.
Problem is that the cloning requires GFP_KERNEL allocations but we
cannot fail at either commit or abort time.
This patch defers the clone until we get an insertion or removal
request. This allows us to handle OOM situations correctly.
This also allows to remove ->dirty in a followup change:
If ->clone exists, ->dirty is always true
If ->clone is NULL, ->dirty is always false, no elements were added
or removed (except catchall elements which are external to the specific
set backend).
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The helper uses priv->clone unconditionally which will fail once we do
the clone conditionally on first insert or removal.
'nft get element' from userspace needs to use priv->match since this
runs from rcu read side lock section.
Prepare for this by passing the match backend data as argument.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In IPv6, ipv6_rcv_core will parse the hop-by-hop type extension header and increase skb->transport_header by one extension header length.
But if there are more other extension headers like fragment header at this time, the skb->transport_header points to the second extension header,
not the transport layer header or the first extension header.
This will result in the start and nexthdrp variable not pointing to the same position in ipv6frag_thdr_trunced,
and ipv6_skip_exthdr returning incorrect offset and frag_off.Sometimes,the length of the last sharded packet is smaller than the calculated incorrect offset, resulting in packet loss.
We can use network header to offset and calculate the correct position to solve this problem.
Fixes: 9d9e937b1c8b (ipv6/netfilter: Discard first fragment not including all headers)
Signed-off-by: Gao Xingwang <gaoxingwang1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DCCP is going away soon, and had no twsk_unique() method.
We can directly call tcp_twsk_unique() for TCP sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240507164140.940547-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Applications are sensitive to long network latency, particularly
heartbeat monitoring ones. Longer the tx timeout recovery higher the
risk with such applications on a production machines. This patch
remedies, yet honoring device set tx timeout.
Modify watchdog next timeout to be shorter than the device specified.
Compute the next timeout be equal to device watchdog timeout less the
how long ago queue stop had been done. At next watchdog timeout tx
timeout handler is called into if still in stopped state. Either called
or not called, restore the watchdog timeout back to device specified.
Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
Link: https://lore.kernel.org/r/20240508133617.4424-1-praveen.kannoju@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The btf_features list can be used for pahole v1.26 and later -
it is useful because if a feature is not yet implemented it will
not exit with a failure message. This will allow us to add feature
requests to the pahole options without having to check pahole versions
in future; if the version of pahole supports the feature it will be
added.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240507135514.490467-1-alan.maguire@oracle.com
Geliang Tang says:
====================
From: Geliang Tang <tanggeliang@kylinos.cn>
This patchset adds post_socket_cb pointer into
struct network_helper_opts to make start_server_addr() helper
more flexible. With these modifications, many duplicate codes
can be dropped.
Patches 1-3 address Martin's comments in the previous series.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch uses public helper connect_to_fd() exported in network_helpers.h
instead of the local defined function connect_to_server() in
test_tcp_check_syncookie_user.c. This can avoid duplicate code.
Then the arguments "addr" and "len" of run_test() become useless, drop them
too.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/e0ae6b790ac0abc7193aadfb2660c8c9eb0fe1f0.1714907662.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>