IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The else condition within __netdev_alloc_frag_align() is an open coded
__napi_alloc_frag_align().
Use __napi_alloc_frag_align() instead of open coding it.
Move fragsz assignment before page_frag_alloc_align() invocation because
__napi_alloc_frag_align() also contains this statement.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-4-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
1e7962114c ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error")
165f87691a ("bnxt_en: add timestamping statistics support")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Long used destructors kfree_skb and kfree_skb_reason do not pass
receiving socket to packet drop tracepoints trace_kfree_skb.
This makes it hard to track packet drops of a certain netns (container)
or a socket (user application).
The naming of these destructors are also not consistent with most sk/skb
operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
Introduce a new functions sk_skb_reason_drop as drop-in replacement for
kfree_skb_reason on local receiving path. Callers can now pass receiving
sockets to the tracepoints.
kfree_skb and kfree_skb_reason are still usable but they are now just
inline helpers that call sk_skb_reason_drop.
Note it is not feasible to do the same to consume_skb. Packets not
dropped can flow through multiple receive handlers, and have multiple
receiving sockets. Leave it untouched for now.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This declaration was added to the header to be called from ethtool.
ethtool is separated from core for code organization but it is not really
a separate entity, it controls very core things.
As ethtool is an internal stuff it is not wise to have it in netdevice.h.
Move the declaration to net/core/dev.h instead.
Remove the EXPORT_SYMBOL_GPL call as ethtool can not be built as a module.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-2-b2a086257b63@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We only use the mapping in a single context in a short and contained scope,
so kmap_local_page is sufficient and cheaper. This will also allow
skb_datagram_iter to be called from softirq context.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20240613113504.1079860-1-sagi@grimberg.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-EINVAL will interrupt the dump. The correct error to return
if we have more data to dump is -EMSGSIZE.
Discovered by doing:
for i in `seq 80`; do ip link add type veth; done
./cli.py --dbg-small-recv 5300 --spec netdev.yaml --dump dev-get >> /dev/null
[...]
nl_len = 64 (48) nl_flags = 0x0 nl_type = 19
nl_len = 20 (4) nl_flags = 0x2 nl_type = 3
error: -22
Fixes: d3d854fd6a ("netdev-genl: create a simple family for netdev stuff")
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240613213044.3675745-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzkaller builds (CONFIG_DEBUG_NET=y) frequently trigger a debug
hint in pskb_may_pull.
We'd like to retain this debug check because it might hint at integer
overflows and other issues (kernel code should pull headers, not huge
value).
In bpf case, this splat isn't interesting at all: such (nonsensical)
bpf programs are typically generated by a fuzzer anyway.
Do what Eric suggested and suppress such warning.
For CONFIG_DEBUG_NET=n we don't need the extra check because
pskb_may_pull will do the right thing: return an error without the
WARN() backtrace.
Fixes: 219eee9c0d ("net: skbuff: add overflow debug check to pull/push helpers")
Reported-by: syzbot+0c4150bff9fff3bf023c@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://syzkaller.appspot.com/bug?extid=0c4150bff9fff3bf023c
Link: https://lore.kernel.org/netdev/9f254c96-54f2-4457-b7ab-1d9f6187939c@gmail.com/
Link: https://lore.kernel.org/bpf/20240614101801.9496-1-fw@strlen.de
When calculating hashes for the purpose of multipath forwarding, both IPv4
and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
randomly-generated seed. That's a fine choice by default, but unfortunately
some deployments may need a tighter control over the seed used.
In this patch, make the seed configurable by adding a new sysctl key,
net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used
specifically for multipath forwarding and not for the other concerns that
flow_hash_from_keys() is used for, such as queue selection. Expose the knob
as sysctl because other such settings, such as headers to hash, are also
handled that way. Like those, the multipath hash seed is a per-netns
variable.
Despite being placed in the net.ipv4 namespace, the multipath seed sysctl
is used for both IPv4 and IPv6, similarly to e.g. a number of TCP
variables.
The seed used by flow_hash_from_keys() is a 128-bit quantity. However it
seems that usually the seed is a much more modest value. 32 bits seem
typical (Cisco, Cumulus), some systems go even lower. For that reason, and
to decouple the user interface from implementation details, go with a
32-bit quantity, which is then quadruplicated to form the siphash key.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to previous patch: apply same logic for
__skb_get_hash_symmetric and let callers pass the netns to the dissector
core.
Existing function is turned into a wrapper to avoid adjusting all
callers, nft_hash.c uses new function.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240608221057.16070-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Years ago flow dissector gained ability to delegate flow dissection
to a bpf program, scoped per netns.
Unfortunately, skb_get_hash() only gets an sk_buff argument instead
of both net+skb. This means the flow dissector needs to obtain the
netns pointer from somewhere else.
The netns is derived from skb->dev, and if that is not available, from
skb->sk. If neither is set, we hit a (benign) WARN_ON_ONCE().
Trying both dev and sk covers most cases, but not all, as recently
reported by Christoph Paasch.
In case of nf-generated tcp reset, both sk and dev are NULL:
WARNING: .. net/core/flow_dissector.c:1104
skb_flow_dissect_flow_keys include/linux/skbuff.h:1536 [inline]
skb_get_hash include/linux/skbuff.h:1578 [inline]
nft_trace_init+0x7d/0x120 net/netfilter/nf_tables_trace.c:320
nft_do_chain+0xb26/0xb90 net/netfilter/nf_tables_core.c:268
nft_do_chain_ipv4+0x7a/0xa0 net/netfilter/nft_chain_filter.c:23
nf_hook_slow+0x57/0x160 net/netfilter/core.c:626
__ip_local_out+0x21d/0x260 net/ipv4/ip_output.c:118
ip_local_out+0x26/0x1e0 net/ipv4/ip_output.c:127
nf_send_reset+0x58c/0x700 net/ipv4/netfilter/nf_reject_ipv4.c:308
nft_reject_ipv4_eval+0x53/0x90 net/ipv4/netfilter/nft_reject_ipv4.c:30
[..]
syzkaller did something like this:
table inet filter {
chain input {
type filter hook input priority filter; policy accept;
meta nftrace set 1
tcp dport 42 reject with tcp reset
}
chain output {
type filter hook output priority filter; policy accept;
# empty chain is enough
}
}
... then sends a tcp packet to port 42.
Initial attempt to simply set skb->dev from nf_reject_ipv4 doesn't cover
all cases: skbs generated via ipv4 igmp_send_report trigger similar splat.
Moreover, Pablo Neira found that nft_hash.c uses __skb_get_hash_symmetric()
which would trigger same warn splat for such skbs.
Lets allow callers to pass the current netns explicitly.
The nf_trace infrastructure is adjusted to use the new helper.
__skb_get_hash_symmetric is handled in the next patch.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/494
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240608221057.16070-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We currently have dev_get_tstats64() for collecting per-cpu stats of
type pcpu_sw_netstats ("tstats"). However, tstats doesn't allow for
accounting tx/rx drops. We do have a stats variant that does have stats
for dropped packets: struct pcpu_dstats, but there are no core helpers
for using those stats.
The VRF driver uses dstats, by providing its own collation/fetch
functions to do so.
This change adds a common implementation for dstats-type collection,
used when pcpu_stat_type == NETDEV_PCPU_STAT_DSTAT. This is based on the
VRF driver's existing stats collator (plus the unused tx_drops stat from
there). We will switch the VRF driver to use this in the next change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607-dstats-v3-2-cc781fe116f7@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we have an intermediate layer of code for handling
rtnl-level netlink dump quirks, we can move the rtnl_lock
taking there.
For dump handlers with RTNL_FLAG_DUMP_SPLIT_NLM_DONE we can
avoid taking rtnl_lock just to generate NLM_DONE, once again.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having an OR in WARN_ON() makes me sad because it's impossible to tell
which condition is true when triggered.
Split a WARN_ON() with an OR in page_pool_disable_direct_recycling().
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
491aee894a ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391 ("net: use unrcu_pointer() helper")
b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.
Also make inet_rcv_compat const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf ("inet: bring NLM_DONE out to a separate recv() again")
Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).
Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.
Tested:
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
--dump getaddr --json '{"ifa-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
--dump getroute --json '{"rtm-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
--dump getlink
Fixes: 3e41af9076 ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to current semantics we should never try to shift data
between skbs which differ on decrypted or pp_recycle status.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata.
This is a prerequisite for matching these control flags using TC flower.
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
After fixing four different bugs involving dst_cache
users, it might be worth adding a check about BH being
blocked by dst_cache callers.
DEBUG_NET_WARN_ON_ONCE(!in_softirq());
It is not fatal, if we missed valid case where no
BH deadlock is to be feared, we might change this.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following sysctl are global and can't be read from a netns:
net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max
Make the following sysctl parameters available readonly from within a
network namespace, allowing a container to read them.
Signed-off-by: Matteo Croce <teknoraver@meta.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20240530232722.45255-2-technoboy85@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/ti/icssg/icssg_classifier.c
abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/
No other adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-1-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlWtmQAKCRDbK58LschI
g0TUAQDT76jx7Rq1DShCtZ3eqiBMNkYczK8b+GqNsSG8YGduaAEA1jn/GN+H65Rh
atQZ/pYAfLZflMV04+XE0GyBr5q1uQg=
=NczG
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-28
We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).
The main changes are:
1) Rename skb's mono_delivery_time to tstamp_type for extensibility
and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
from Abhishek Chauhan.
2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
CT zones can be used from XDP/tc BPF netfilter CT helper functions,
from Brad Cowie.
3) Several tweaks to the instruction-set.rst IETF doc to address
the Last Call review comments, from Dave Thaler.
4) Small batch of riscv64 BPF JIT optimizations in order to emit more
compressed instructions to the JITed image for better icache efficiency,
from Xiao Wang.
5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
diffing and forcing more natural type definitions ordering,
from Mykyta Yatsenko.
6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
a syzbot/KCSAN race report for the tx_errors counter,
from Jiang Yunshui.
7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
which started treating const structs as constants and thus breaking
full BTF program name resolution, from Ivan Babrou.
8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
some of the test-internal array sizes, from Geliang Tang.
9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
pahole, from Alan Maguire.
10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
rebuilds in some corner cases, from Artem Savkov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
bpf, net: Use DEV_STAT_INC()
bpf, docs: Fix instruction.rst indentation
bpf, docs: Clarify call local offset
bpf, docs: Add table captions
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
bpf, docs: Use RFC 2119 language for ISA requirements
bpf, docs: Move sentence about returning R0 to abi.rst
bpf: constify member bpf_sysctl_kern:: Table
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
riscv, bpf: Use STACK_ALIGN macro for size rounding up
riscv, bpf: Optimize zextw insn with Zba extension
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
net: Add additional bit to support clockid_t timestamp type
net: Rename mono_delivery_time to tstamp_type for scalabilty
selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
net: netfilter: Make ct zone opts configurable for bpf ct helpers
selftests/bpf: Fix prog numbers in test_sockmap
bpf: Remove unused variable "prev_state"
bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
bpf: Fix order of args in call to bpf_map_kvcalloc
...
====================
Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit cdfbabfb2f ("net: Work around lockdep limitation in
sockets that use sockets"), it introduces 'af_kern_callback_keys'
to lockdep-init of sk_callback_lock according to 'sk_kern_sock',
it modifies sock_init_data() only, and sk_clone_lock() calls
sk_init_common() to initialize sk_callback_lock too, so the
lockdep-init of sk_callback_lock should be moved to sk_init_common().
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Link: https://lore.kernel.org/r/20240526145718.9542-2-gouhao@uniontech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk_psock_get will return NULL if the refcount of psock has gone to 0, which
will happen when the last call of sk_psock_put is done. However,
sk_psock_drop may not have finished yet, so the close callback will still
point to sock_map_close despite psock being NULL.
This can be reproduced with a thread deleting an element from the sock map,
while the second one creates a socket, adds it to the map and closes it.
That will trigger the WARN_ON_ONCE:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Modules linked in:
CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02
RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293
RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000
RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0
RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3
R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840
R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870
FS: 000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0
Call Trace:
<TASK>
unix_release+0x87/0xc0 net/unix/af_unix.c:1048
__sock_release net/socket.c:659 [inline]
sock_close+0xbe/0x240 net/socket.c:1421
__fput+0x42b/0x8a0 fs/file_table.c:422
__do_sys_close fs/open.c:1556 [inline]
__se_sys_close fs/open.c:1541 [inline]
__x64_sys_close+0x7f/0x110 fs/open.c:1541
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb37d618070
Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070
RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Use sk_psock, which will only check that the pointer is not been set to
NULL yet, which should only happen after the callbacks are restored. If,
then, a reference can still be gotten, we may call sk_psock_stop and cancel
psock->work.
As suggested by Paolo Abeni, reorder the condition so the control flow is
less convoluted.
After that change, the reproducer does not trigger the WARN_ON_ONCE
anymore.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reported-by: syzbot+07a2e4a1a57118ef7355@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355
Fixes: aadb2bb83f ("sock_map: Fix a potential use-after-free in sock_map_close()")
Fixes: 5b4a79ba65 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
Cc: stable@vger.kernel.org
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20240524144702.1178377-1-cascardo@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlTGFAAKCRDbK58LschI
g5NXAP0QRn8nBSxJHIswFSOwRiCyhOhR7YL2P0c+RGcRMA+ZSAD9E1cwsYXsPu3L
ummQ52AMaMfouHg6aW+rFIoupkGSnwc=
=QctA
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-05-27
We've added 15 non-merge commits during the last 7 day(s) which contain
a total of 18 files changed, 583 insertions(+), 55 deletions(-).
The main changes are:
1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread
while the promise was to filter by process, from Andrii Nakryiko.
2) Fix the recent influx of syzkaller reports to sockmap which triggered
a locking rule violation by performing a map_delete, from Jakub Sitnicki.
3) Fixes to netkit driver in particular on skb->pkt_type override upon pass
verdict, from Daniel Borkmann.
4) Fix an integer overflow in resolve_btfids which can wrongly trigger build
failures, from Friedrich Vock.
5) Follow-up fixes for ARC JIT reported by static analyzers,
from Shahab Vahedi.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Cover verifier checks for mutating sockmap/sockhash
Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
bpf: Allow delete from sockmap/sockhash only if update is allowed
selftests/bpf: Add netkit test for pkt_type
selftests/bpf: Add netkit tests for mac address
netkit: Fix pkt_type override upon netkit pass verdict
netkit: Fix setting mac address in l2 mode
ARC, bpf: Fix issues reported by the static analyzers
selftests/bpf: extend multi-uprobe tests with USDTs
selftests/bpf: extend multi-uprobe tests with child thread case
libbpf: detect broken PID filtering logic for multi-uprobe
bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic
bpf: fix multi-uprobe PID filtering logic
bpf: Fix potential integer overflow in resolve_btfids
MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT
====================
Link: https://lore.kernel.org/r/20240527203551.29712-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit ff91059932.
This check is no longer needed. BPF programs attached to tracepoints are
now rejected by the verifier when they attempt to delete from a
sockmap/sockhash maps.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-2-944b372f2101@cloudflare.com
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.
To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.
We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
- optimize DMA sync calls when they are no-ops (Alexander Lobakin)
- fix swiotlb padding for untrusted devices (Michael Kelley)
- add documentation for swiotb (Michael Kelley)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmZLV+gLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPO7hAAlKuXigzwcrVEUnfRGRdaZ28xbmffyC1dPfw8HRZe
xJqvD51aJ/VOoOCcUyt3hNLEQHwtjEk4eM0xGcAASMdwceU58doJCcDJBpbbgbDK
CPKJgBLQBC1JfAJUpRiJkV4RsudRhAyndIzUPVgkz0WObpEgDpfO0ClHRF/0Pavy
1sBFVFMbB1ewb/D8ffpp+DWfwrwu0oMC3A2LkYu2F5SQFWuVOpbNemrnZ6K2ckPt
2mcLpJ308+sti8Ka/LrI2akU8JCLYMYDQnue/44v3X3Gm63cMcEx/fj5M5x6m71n
P+cxAkjsGDHybnfjbUvR842to8msRsH4CI4Zbb69+5HDlWSadM8JhQd74oeii6o6
RiGPrrFEk7vCxFOkUsqGFYMykEX+71wXfQ1Mpp/b4QgdqBLkxW4ozQ3Ya7ASUs2z
TLLmQvIXtYKGnyU+RdOkvS6piHjd4wVHOhuGVdXqVT7WrbaPeovY4TNSTV2ZA1gE
9Y5RCdrX9xeGGNjsYXKwsWGvXVsm6UTQmQVUsatQb3ic+K3S6tQR9pwzk0HmhMuM
BscWHSAEL7T8ZZ5Ydph45Cw/6xdH7LggD+nRtLcdAuzCika12eabZHsO0DrF533n
qXYOjZOgsMEZWICynxq6+EGQKGWY+F+GyKDMU2w2Es5OgMa9Bqb40aSF+Q887s96
xwI=
=Pa8W
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- optimize DMA sync calls when they are no-ops (Alexander Lobakin)
- fix swiotlb padding for untrusted devices (Michael Kelley)
- add documentation for swiotb (Michael Kelley)
* tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping:
dma: fix DMA sync for drivers not calling dma_set_mask*()
xsk: use generic DMA sync shortcut instead of a custom one
page_pool: check for DMA sync shortcut earlier
page_pool: don't use driver-set flags field directly
page_pool: make sure frag API fields don't span between cachelines
iommu/dma: avoid expensive indirect calls for sync operations
dma: avoid redundant calls for sync operations
dma: compile-out DMA sync op calls when not used
iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices
swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()
Documentation/core-api: add swiotlb documentation
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+
4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8
UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s
i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq
6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV
t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9
8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5
c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe
0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj
qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ
4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x
xLq2iq+ehw==
=S6Xt
-----END PGP SIGNATURE-----
Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept
requests.
This is very similar to previous work that enabled the same hint for
doing receives on sockets. By far the majority of the work here is
refactoring to enable the networking side to pass back whether or not
the socket had more pending requests after accepting the current one,
the last patch just wires it up for io_uring.
Not only does this enable applications to know whether there are more
connections to accept right now, it also enables smarter logic for
io_uring multishot accept on whether to retry immediately or wait for
a poll trigger"
* tag 'net-accept-more-20240515' of git://git.kernel.dk/linux:
io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept
net: pass back whether socket was empty post accept
net: have do_accept() take a struct proto_accept_arg argument
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel
invocation or not, add a struct proto_accept_arg struct as the argument.
This then holds all of these arguments, and prepares accept for being
able to pass back more information.
No functional changes in this patch.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
TX queue stop and wake are counted by some drivers.
Support reporting these via netdev-genl queue stats.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP
flows.
To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
6.94% [kernel] [k] inet_gro_receive
3.02% [kernel] [k] tcp_gro_receive
patch applied:
4.27% [kernel] [k] tcp_gro_receive
4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
10.09% [kernel] [k] inet_gro_receive
2.08% [kernel] [k] tcp_gro_receive
patch applied:
6.97% [kernel] [k] inet_gro_receive
3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1af2dface5 ("af_unix: Don't access successor in unix_del_edges()
during GC.") fixed use-after-free by avoid accessing edge->successor while
GC is in progress.
However, there could be a small race window where another process could
call unix_del_edges() while gc_in_progress is true and __skb_queue_purge()
is on the way.
So, we need another marker for struct scm_fp_list which indicates if the
skb is garbage-collected.
This patch adds dead flag in struct scm_fp_list and set it true before
calling __skb_queue_purge().
Fixes: 1af2dface5 ("af_unix: Don't access successor in unix_del_edges() during GC.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240508171150.50601-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
35d92abfba ("net: hns3: fix kernel crash when devlink reload during initialization")
2a1a1a7b5f ("net: hns3: add command queue trace for hns3")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need to use this_cpu_ptr(dst_cache->cache) twice.
Compiler is unable to optimize the second call, because of
per-cpu constraints.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132717.627518-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_cache->reset_ts is read or written locklessly,
add READ_ONCE() and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132000.614591-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IEEE 802.1q specification provides recommendation and examples which can
be used as good default values for different drivers.
This patch implements mapping examples documented in IEEE 802.1Q-2022 in
Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming
and mapping DSCP to Traffic Type inspired by RFC8325.
This helpers will be used in followup patches for dsa/microchip DCB
implementation.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can save a couple more function calls in the Page Pool code if we
check for dma_need_sync() earlier, just when we test pp->p.dma_sync.
Move both these checks into an inline wrapper and call the PP wrapper
over the generic DMA sync function only when both are true.
You can't cache the result of dma_need_sync() in &page_pool, as it may
change anytime if an SWIOTLB buffer is allocated or mapped.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
page_pool::p is driver-defined params, copied directly from the
structure passed to page_pool_create(). The structure isn't meant
to be modified by the Page Pool core code and this even might look
confusing[0][1].
In order to be able to alter some flags, let's define our own, internal
fields the same way as the already existing one (::has_init_callback).
They are defined as bits in the driver-set params, leave them so here
as well, to not waste byte-per-bit or so. Almost 30 bits are still free
for future extensions.
We could've defined only new flags here or only the ones we may need
to alter, but checking some flags in one place while others in another
doesn't sound convenient or intuitive. ::flags passed by the driver can
now go to the "slow" PP params.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
After commit 5027ec19f1 ("net: page_pool: split the page_pool_params
into fast and slow") that made &page_pool contain only "hot" params at
the start, cacheline boundary chops frag API fields group in the middle
again.
To not bother with this each time fast params get expanded or shrunk,
let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to
their actual size (2 longs + 1 int). This ensures 16-byte alignment for
the 32-bit architectures and 32-byte alignment for the 64-bit ones,
excluding unnecessary false-sharing.
::page_state_hold_cnt is used quite intensively on hotpath no matter if
frag API is used, so move it to the newly created hole in the first
cacheline.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>