IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Looks like the IPv6 version of the patch under Fixes was
a copy/paste of the IPv4 but hit the wrong spot.
It is tcp_v6_rcv() which uses drop_reason as a boolean, and
needs to be protected against reason == 0 before calling free.
tcp_v6_do_rcv() has a pretty straightforward flow.
The resulting warning looks like this:
WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775
Call Trace:
tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767)
ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438)
ip6_input_finish (include/linux/rcupdate.h:726)
ip6_input (include/linux/netfilter.h:307)
Fixes: f8319dfd1b ("net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20220520021347.2270207-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, misc
updates and fallout fixes from recent Florian's code rewritting (from
last pull request):
1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi.
2) Avoid unnecessary GC with a timestamp in conncount, from William Tu
and Yifeng Sun.
3) Remove TCP conntrack debugging, from Florian Westphal.
4) Fix compilation warning in ctnetlink, from Florian.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"
netfilter: conntrack: remove pr_debug callsites from tcp tracker
netfilter: nf_conncount: reduce unnecessary GC
netfilter: Use l3mdev flow key when re-routing mangled packets
====================
Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.
The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.
Stop👏making👏bespoke👏random👏number👏generators👏.
Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.
The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Commit 40867d74c3 ("net: Add l3mdev index to flow struct and avoid oif
reset for port devices") introduces a flow key specific for layer 3
domains, such as a VRF master device. This allows for explicit VRF domain
selection instead of abusing the oif flow key.
Update ip[6]_route_me_harder() to make use of that new key when re-routing
mangled packets within VRFs instead of setting the flow oif, making it
consistent with other users.
Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().
And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.
Fixes: 1330b6ef33 ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
INET6_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET6_MATCH() an inline function
to ease our changes.
v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
Change the name to inet6_match(), this is no longer a macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.
Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.
BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg
write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
__ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0xffffff9b
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.
Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.
This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
BIG TCP ipv6 packets (bigger than 64K).
This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that resulting packet can go through IPv6/TCP stacks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.
If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.
v2: perform HBH removal from ipv6_gso_segment() instead of
skb_segment() (Alexander feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.
The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.
1) Add a dedicated list for conntrack event redelivery.
2) Include event redelivery list in conntrack dumps of dying type.
3) Remove per-cpu dying list for event redelivery, not used anymore.
4) Add netns .pre_exit to cttimeout to zap timeout objects before
synchronize_rcu() call.
5) Remove nf_ct_unconfirmed_destroy.
6) Add generation id for conntrack extensions for conntrack
timeout and helpers.
7) Detach timeout policy from conntrack on cttimeout module removal.
8) Remove __nf_ct_unconfirmed_destroy.
9) Remove unconfirmed list.
10) Remove unconditional local_bh_disable in init_conntrack().
11) Consolidate conntrack iterator nf_ct_iterate_cleanup().
12) Detect if ctnetlink listeners exist to short-circuit event
path early.
13) Un-inline nf_ct_ecache_ext_add().
14) Add nf_conntrack_events autodetect ctnetlink listener mode
and make it default.
15) Add nf_ct_ecache_exist() to check for event cache extension.
16) Extend flowtable reverse route lookup to include source, iif,
tos and mark, from Sven Auhagen.
17) Do not verify zero checksum UDP packets in nf_reject,
from Kevin Mitchell.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-05-13
1) Cleanups for the code behind the XFRM offload API. This is a
preparation for the extension of the API for policy offload.
From Leon Romanovsky.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: drop not needed flags variable in XFRM offload struct
net/mlx5e: Use XFRM state direction instead of flags
netdevsim: rely on XFRM state direction instead of flags
ixgbe: propagate XFRM offload state direction instead of flags
xfrm: store and rely on direction to construct offload flags
xfrm: rename xfrm_state_offload struct to allow reuse
xfrm: delete not used number of external headers
xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================
Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.
Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.
Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commits:
0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()")
d507204d3c ("tcp/dccp: add tw->tw_bslot")
As Leonard pointed out, a newly allocated netns can happen
to reuse a freed 'struct net'.
While TCP TW timers were covered by my patches, other things were not:
1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
at 4-tuple plus the 'struct net' pointer.
2) /proc/net/tcp[6] and inet_diag, same reason.
3) hashinfo->bhash[], same reason.
Fixing all this seems risky, lets instead revert.
In the future, we might have a per netns tcp hash table, or
a per netns list of timewait sockets...
Fixes: 0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).
After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.
The commit 05c0b35709 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.
There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call. This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443). [ On top of the slowness in
adding to the tail in the IPv6 case ]. The latter patch has a
selftest to demonstrate this case.
This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.
Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches. The
changes in this patch is highlighted here for the review
purpose.
1. Because of the listening_hash removal, lhash2 can use the
sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
This will also keep the sk_unhashed() check to work as is
after stop adding sk to listening_hash.
The union is removed from inet_listen_hashbucket because
only nulls_head is needed.
2. icsk->icsk_listen_portaddr_node and its helpers are removed.
3. The current lhash2 users needs to iterate with sk_nulls_node
instead of icsk_listen_portaddr_node.
One case is in the inet[6]_lhash2_lookup().
Another case is the seq_file iterator in tcp_ipv4.c.
One thing to note is sk_nulls_next() is needed
because the old inet_lhash2_for_each_icsk_continue()
does a "next" first before iterating.
4. Move the remaining listening_hash usage to lhash2
inet_reuseport_add_sock() which this series is
trying to improve.
inet_diag.c and mptcp_diag.c are the final two
remaining use cases and is moved to lhash2 now also.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After removal of Innova IPsec support from mlx5 driver, the last user
of this XFRM_ESP_NO_TRAILER was gone too. This means that we can safely
remove it as no other hardware is capable (or need) to remove ESP trailer.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
7cd23e5300 ("secure_seq: use SipHash in place of MD5"), but the output
remained truncated to 32-bit only. In order to exploit more bits from the
hash, let's make the functions return the full 64-bit of siphash_3u32().
We also make sure the port offset calculation in __inet_hash_connect()
remains done on 32-bit to avoid the need for div_u64_rem() and an extra
cost on 32-bit systems.
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Creating a new netdevice allocates at least ~50Kb of memory for various
kernel objects, but only ~5Kb of them are accounted to memcg. As a result,
creating an unlimited number of netdevice inside a memcg-limited container
does not fall within memcg restrictions, consumes a significant part
of the host's memory, can cause global OOM and lead to random kills of
host processes.
The main consumers of non-accounted memory are:
~10Kb 80+ kernfs nodes
~6Kb ipv6_add_dev() allocations
6Kb __register_sysctl_table() allocations
4Kb neigh_sysctl_register() allocations
4Kb __devinet_sysctl_register() allocations
4Kb __addrconf_sysctl_register() allocations
Accounting of these objects allows to increase the share of memcg-related
memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice
on typical VM with default Fedora 35 kernel) and this should be enough
to somehow protect the host from misuse inside container.
Other related objects are quite small and may not be taken into account
to minimize the expected performance degradation.
It should be separately mentonied ~300 bytes of percpu allocation
of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes
it can become the main consumer of memory.
This patch does not enables kernfs accounting as it affects
other parts of the kernel and should be discussed separately.
However, even without kernfs, this patch significantly improves the
current situation and allows to take into account more than half
of all netdevice allocations.
Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently we made o_seqno atomic_t. Stop special-casing TUNNEL_SEQ, and
always mark IP6GRE[TAP] devices as NETIF_F_LLTX, since we no longer need
the TX lock (&txq->_xmit_lock).
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Whenever RCU protected list replaces an object,
the pointer to the new object needs to be updated
_before_ the call to kfree_rcu() or call_rcu()
Also ip6_mc_msfilter() needs to update the pointer
before releasing the mc_lock mutex.
Note that linux-5.13 was supporting kfree_rcu(NULL, rcu),
so this fix does not need the conditional test I was
forced to use in the equivalent patch for IPv4.
Fixes: 882ba1f73c ("mld: convert ipv6_mc_socklist->sflist to RCU")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Throw neigh checks in ip6_finish_output2() under a single slow path if,
so we don't have the overhead in the hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two callers of __ip6_finish_output(), both are in
ip6_finish_output(). We can combine the call sites into one and handle
return code after, that will inline __ip6_finish_output().
Note, error handling under NET_XMIT_CN will only return 0 if
__ip6_finish_output() succeded, and in this case it return 0.
Considering that NET_XMIT_SUCCESS is 0, it'll be returning exactly the
same result for it as before.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK
should be included in the ancillary data returned by recvmsg().
Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs().
Signed-off-by: Erin MacNeil <lnx.erin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() looks better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
collect_md mode is racy for [IP6]GRE[TAP] devices. Consider the
following sequence of events:
1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
add ... external". "ip" ignores "[o]seq" if "external" is specified,
so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
it uses lockless TX);
2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
gre_build_header(skb, tun_hlen,
flags, protocol,
tunnel_id_to_key32(tun_info->key.tun_id),
(flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
: 0); ^^^^^^^^^^^^^^^^^
Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
try to do this tunnel->o_seqno++ in parallel, which is racy. Fix it by
making o_seqno atomic_t.
As mentioned by Eric Dumazet in commit b790e01aee ("ip_gre: lockless
xmit"), making o_seqno atomic_t increases "chance for packets being out
of order at receiver" when NETIF_F_LLTX is on.
Maybe a better fix would be:
1. Do not ignore "oseq" in external mode. Users MUST specify "oseq" if
they want the kernel to allow sequencing of outgoing packets;
2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
with "oseq".
Unfortunately, that would break userspace.
We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
do it in separate patches to keep this fix minimal.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 77a5196a80 ("gre: add sequence number for collect md mode.")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in
native mode. According to RFC 2890 2.2., "The first datagram is sent
with a sequence number of 0." Fix it.
It is worth mentioning that o_seqno already starts from 0 in collect_md
mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(),
where tunnel->o_seqno is passed to gre_build_header() before getting
incremented.
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit referenced below fixed packet re-routing if Netfilter mangles
a routing key property of a packet and the packet is routed in a VRF L3
domain. The fix, however, addressed IPv4 re-routing, only.
This commit applies the same behavior for IPv6. While at it, untangle
the nested ternary operator to make the code more readable.
Fixes: 6d8b49c3a3 ("netfilter: Update ip_route_me_harder to consider L3 domain")
Cc: stable@vger.kernel.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch replaces some sk_ipv6only tests with ipv6_only_sock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 9fe516ba3f ("inet: move ipv6only in sock_common"),
ipv6_only_sock() and __ipv6_only_sock() are the same macro. Let's
remove the one.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix mistake in the original patch where limits were specified but the
handler didn't take care of the limits.
Signed-off-by: Arun Ajith S <aajith@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new neighbour cache entry in STALE state for routers on receiving
an unsolicited (gratuitous) neighbour advertisement with
target link-layer-address option specified.
This is similar to the arp_accept configuration for IPv4.
A new sysctl endpoint is created to turn on this behaviour:
/proc/sys/net/ipv6/conf/interface/accept_unsolicited_na.
Signed-off-by: Arun Ajith S <aajith@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reads and Writes to ip6_rt_gc_expire always have been racy,
as syzbot reported lately [1]
There is a possible risk of under-flow, leading
to unexpected high value passed to fib6_run_gc(),
although I have not observed this in the field.
Hosts hitting ip6_dst_gc() very hard are under pretty bad
state anyway.
[1]
BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc
read-write to 0xffff888102110744 of 4 bytes by task 13165 on cpu 1:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
read-write to 0xffff888102110744 of 4 bytes by task 11607 on cpu 0:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
value changed: 0x00000bb3 -> 0x00000ba9
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11607 Comm: kworker/0:21 Not tainted 5.18.0-rc1-syzkaller-00037-g42e7a03d3bad-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: mld mld_ifc_work
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220413181333.649424-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Feng reported an skb_under_panic BUG triggered by running
test_ip6gretap() in tools/testing/selftests/bpf/test_tunnel.sh:
[ 82.492551] skbuff: skb_under_panic: text:ffffffffb268bb8e len:403 put:12 head:ffff9997c5480000 data:ffff9997c547fff8 tail:0x18b end:0x2c0 dev:ip6gretap11
<...>
[ 82.607380] Call Trace:
[ 82.609389] <TASK>
[ 82.611136] skb_push.cold.109+0x10/0x10
[ 82.614289] __gre6_xmit+0x41e/0x590
[ 82.617169] ip6gre_tunnel_xmit+0x344/0x3f0
[ 82.620526] dev_hard_start_xmit+0xf1/0x330
[ 82.623882] sch_direct_xmit+0xe4/0x250
[ 82.626961] __dev_queue_xmit+0x720/0xfe0
<...>
[ 82.633431] packet_sendmsg+0x96a/0x1cb0
[ 82.636568] sock_sendmsg+0x30/0x40
<...>
The following sequence of events caused the BUG:
1. During ip6gretap device initialization, tunnel->tun_hlen (e.g. 4) is
calculated based on old flags (see ip6gre_calc_hlen());
2. packet_snd() reserves header room for skb A, assuming
tunnel->tun_hlen is 4;
3. Later (in clsact Qdisc), the eBPF program sets a new tunnel key for
skb A using bpf_skb_set_tunnel_key() (see _ip6gretap_set_tunnel());
4. __gre6_xmit() detects the new tunnel key, and recalculates
"tun_hlen" (e.g. 12) based on new flags (e.g. TUNNEL_KEY and
TUNNEL_SEQ);
5. gre_build_header() calls skb_push() with insufficient reserved header
room, triggering the BUG.
As sugguested by Cong, fix it by moving the call to skb_cow_head() after
the recalculation of tun_hlen.
Reproducer:
OBJ=$LINUX/tools/testing/selftests/bpf/test_tunnel_kern.o
ip netns add at_ns0
ip link add veth0 type veth peer name veth1
ip link set veth0 netns at_ns0
ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
ip netns exec at_ns0 ip link set dev veth0 up
ip link set dev veth1 up mtu 1500
ip addr add dev veth1 172.16.1.200/24
ip netns exec at_ns0 ip addr add ::11/96 dev veth0
ip netns exec at_ns0 ip link set dev veth0 up
ip addr add dev veth1 ::22/96
ip link set dev veth1 up
ip netns exec at_ns0 \
ip link add dev ip6gretap00 type ip6gretap seq flowlabel 0xbcdef key 2 \
local ::11 remote ::22
ip netns exec at_ns0 ip addr add dev ip6gretap00 10.1.1.100/24
ip netns exec at_ns0 ip addr add dev ip6gretap00 fc80::100/96
ip netns exec at_ns0 ip link set dev ip6gretap00 up
ip link add dev ip6gretap11 type ip6gretap external
ip addr add dev ip6gretap11 10.1.1.200/24
ip addr add dev ip6gretap11 fc80::200/24
ip link set dev ip6gretap11 up
tc qdisc add dev ip6gretap11 clsact
tc filter add dev ip6gretap11 egress bpf da obj $OBJ sec ip6gretap_set_tunnel
tc filter add dev ip6gretap11 ingress bpf da obj $OBJ sec ip6gretap_get_tunnel
ping6 -c 3 -w 10 -q ::11
Fixes: 6712abc168 ("ip6_gre: add ip6 gre and gretap collect_md mode")
Reported-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not update tunnel->tun_hlen in data plane code. Use a local variable
instead, just like "tunnel_hlen" in net/ipv4/ip_gre.c:gre_fb_xmit().
Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-04-14
1) Fix the output interface for VRF cases in xfrm_dst_lookup.
From David Ahern.
2) Fix write out of bounds by doing COW on esp output when the
packet size is larger than a page.
From Sabrina Dubroca.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in ip6_protocol_deliver_rcu() with
kfree_skb_reason().
No new reasons are added.
Some paths are ignored, as they are not common, such as encapsulation
on non-final protocol.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in ip6_rcv_core() with kfree_skb_reason().
No new drop reasons are added.
Seems now we use 'SKB_DROP_REASON_IP_INHDR' for too many case during
ipv6 header parse or check, just like what 'IPSTATS_MIB_INHDRERRORS'
do. Will it be too general and hard to know what happened?
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in TLV encoded option header parsing with
kfree_skb_reason(). Following functions are involved:
ip6_parse_tlv()
ipv6_hop_ra()
ipv6_hop_ioam()
ipv6_hop_jumbo()
ipv6_hop_calipso()
ipv6_dest_hao()
Most skb drops during this process are regarded as 'InHdrErrors',
as 'IPSTATS_MIB_INHDRERRORS' is used when ip6_parse_tlv() fails,
which make we use 'SKB_DROP_REASON_IP_INHDR' correspondingly.
However, 'IP_INHDR' is a relatively general reason. Therefore, we
can use other reasons with higher priority in some cases. For example,
'SKB_DROP_REASON_UNHANDLED_PROTO' is used for unknown TLV options.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two call chains for ipv6_hop_jumbo(). The first one is:
ipv6_destopt_rcv() -> ip6_parse_tlv() -> ipv6_hop_jumbo()
On this call chain, the drop statistics will be done in
ipv6_destopt_rcv() with 'IPSTATS_MIB_INHDRERRORS' if ipv6_hop_jumbo()
returns false.
The second call chain is:
ip6_rcv_core() -> ipv6_parse_hopopts() -> ip6_parse_tlv()
And the drop statistics will also be done in ip6_rcv_core() with
'IPSTATS_MIB_INHDRERRORS' if ipv6_hop_jumbo() returns false.
Therefore, the statistics in ipv6_hop_jumbo() is redundant, which
means the drop is counted twice. The statistics in ipv6_hop_jumbo()
is almost the same as the outside, except the
'IPSTATS_MIB_INTRUNCATEDPKTS', which seems that we have to ignore it.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to add the skb drop reasons support to icmpv6_param_prob(),
introduce the function icmpv6_param_prob_reason() and make
icmpv6_param_prob() an inline call to it. This new function will be
used in the following patches.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() which is used in ip6_forward() and ip_forward()
with kfree_skb_reason().
The new drop reason 'SKB_DROP_REASON_PKT_TOO_BIG' is introduced for
the case that the length of the packet exceeds MTU and can't
fragment.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in ip6_pkt_drop() with kfree_skb_reason().
No new reason is added.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Address the following coccicheck warning:
net/ipv6/exthdrs.c:620:44-45: WARNING opportunity for swap()
by using swap() for the swapping of variable values and drop
the tmp (`addr`) variable that is not needed any more.
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ebe48d368e ("esp: Fix possible buffer overflow in ESP
transformation") tried to fix skb_page_frag_refill usage in ESP by
capping allocsize to 32k, but that doesn't completely solve the issue,
as skb_page_frag_refill may return a single page. If that happens, we
will write out of bounds, despite the check introduced in the previous
patch.
This patch forces COW in cases where we would end up calling
skb_page_frag_refill with a size larger than a page (first in
esp_output_head with tailen, then in esp_output_tail with
skb->data_len).
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kongweibin reported a kernel panic in ip6_forward() when input interface
has no in6 dev associated.
The following tc commands were used to reproduce this panic:
tc qdisc del dev vxlan100 root
tc qdisc add dev vxlan100 root netem corrupt 5%
CC: stable@vger.kernel.org
Fixes: ccd27f05ae ("ipv6: fix 'disable_policy' for fwd packets")
Reported-by: kongweibin <kongweibin2@huawei.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Replace unnecessary list_for_each_entry_continue() in nf_tables,
from Jakob Koschel.
2) Add struct nf_conntrack_net_ecache to conntrack event cache and
use it, from Florian Westphal.
3) Refactor ctnetlink_dump_list(), also from Florian.
4) Bump module reference counter on cttimeout object addition/removal,
from Florian.
5) Consolidate nf_log MAC printer, from Phil Sutter.
6) Add basic logging support for unknown ethertype, from Phil Sutter.
7) Consolidate check for sysctl nf_log_all_netns toggle, also from Phil.
8) Replace hardcode value in nft_bitwise, from Jeremy Sowden.
9) Rename BASIC-like goto tags in nft_bitwise to more meaningful names,
also from Jeremy.
10) nft_fib support for reverse path filtering with policy-based routing
on iif. Extend selftests to cover for this new usecase, from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If policy-based routing using the iif selector is used, then the fib
expression fails to look up for the reverse path from the prerouting
hook because the input interface cannot be inferred. In order to support
this scenario, extend the fib expression to allow to use after the route
lookup, from the forward hook.
This patch also adds support for the input hook for usability reasons.
Since the prerouting hook cannot be used for the scenario described
above, users need two rules: one for the forward chain and another rule
for the input chain to check for the reverse path check for locally
targeted traffic.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with
kfree_skb_reason().
In order to get the reasons of the skb drops after icmp message handle,
we change the return type of 'handler()' in 'struct icmp_control' from
'bool' to 'enum skb_drop_reason'. This may change its original
intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means
success now. Therefore, all 'handler' and the call of them need to be
handled. Following 'handler' functions are involved:
icmp_unreach()
icmp_redirect()
icmp_echo()
icmp_timestamp()
icmp_discard()
And following new drop reasons are added:
SKB_DROP_REASON_ICMP_CSUM
SKB_DROP_REASON_INVALID_PROTO
The reason 'INVALID_PROTO' is introduced for the case that the packet
doesn't follow rfc 1122 and is dropped. This is not a common case, and
I believe we can locate the problem from the data in the packet. For now,
this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types.
Maybe there should be a document file for these reasons. For example,
list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO'
drop reason. Therefore, users can locate their problems according to the
document.
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment rx_otherhost_dropped counter when packet dropped due to
mismatched dest MAC addr.
An example when this drop can occur is when manually crafting raw
packets that will be consumed by a user space application via a tap
device. For testing purposes local traffic was generated using trafgen
for the client and netcat to start a server
Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other
with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the
counter was incremented. (Also had to modify iproute2 to show the stat,
additional patch for that coming next.)
Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
idev->addr_list needs to be protected by idev->lock. However, it is not
always possible to do so while iterating and performing actions on
inet6_ifaddr instances. For example, multiple functions (like
addrconf_{join,leave}_anycast) eventually call down to other functions
that acquire the idev->lock. The current code temporarily unlocked the
idev->lock during the loops, which can cause race conditions. Moving the
locks up is also not an appropriate solution as the ordering of lock
acquisition will be inconsistent with for example mc_lock.
This solution adds an additional field to inet6_ifaddr that is used
to temporarily add the instances to a temporary list while holding
idev->lock. The temporary list can then be traversed without holding
idev->lock. This change was done in two places. In addrconf_ifdown, the
list_for_each_entry_safe variant of the list loop is also no longer
necessary as there is no deletion within that specific loop.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220403231523.45843-1-dossche.niels@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.
Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
can trigger, and without a repro we would have to spend
considerable time finding the bug.
Instead of complaining too late, we want to catch where
and when tp->snd_cwnd is set to an illegal value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a same action when the variable is initialized
Signed-off-by: Hongbin Wang <wh_bin@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv6/ip6mr.c:1656:14: warning: unused variable 'do_wrmifwhole'
Move it to the CONFIG_IPV6_PIMSM_V2 scope where its used.
Fixes: 4b340a5a72 ("net: ip6mr: add support for passing full packet on wrong mif")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'
As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
into 'flags' and 'noblock' with finally obsolete bit operations like this:
skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);
And this is not even done consistently with the 'flags' parameter.
This patch removes the obsolete and costly splitting into two parameters
and only performs bit operations when really needed on the caller side.
One missing conversion thankfully reported by kernel test robot. I missed
to enable kunit tests to build the mctp code.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF devices are the loopbacks for VRFs, and a loopback can not be
assigned to a VRF. Accordingly, the condition in ip6_pkt_drop should
be '||' not '&&'.
Fixes: 1d3fd8a10b ("vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach")
Reported-by: Pudak, Filip <Filip.Pudak@windriver.com>
Reported-by: Xiao, Jiguang <Jiguang.Xiao@windriver.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220404150908.2937-1-dsahern@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The fib expression stores to a register, so we can't add empty stub.
Check that the register that is being written is in fact redundant.
In most cases, this is expected to cancel tracking as re-use is
unlikely.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Skip register tracking for expressions that perform read-only operations
on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to
avoid defining stubs for these expressions.
This patch re-enables register tracking which was disabled in ed5f85d422
("netfilter: nf_tables: disable register tracking"). Follow up patches
add remaining register tracking for existing expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-03-16
1) Fix a kernel-info-leak in pfkey.
From Haimin Zhang.
2) Fix an incorrect check of the return value of ipv6_skip_exthdr.
From Sabrina Dubroca.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
esp6: fix check on ipv6_skip_exthdr's return value
af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register
====================
Link: https://lore.kernel.org/r/20220316121142.3142336-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The fundamental premise of VRF and l3mdev core code is binding a socket
to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
Legacy code resets flowi_oif to the l3mdev losing any original port
device binding. Ben (among others) has demonstrated use cases where the
original port device binding is important and needs to be retained.
This patch handles that by adding a new entry to the common flow struct
that can indicate the l3mdev index for later rule and table matching
avoiding the need to reset flowi_oif.
In addition to allowing more use cases that require port device binds,
this patch brings a few datapath simplications:
1. l3mdev_fib_rule_match is only called when walking fib rules and
always after l3mdev_update_flow. That allows an optimization to bail
early for non-VRF type uses cases when flowi_l3mdev is not set. Also,
only that index needs to be checked for the FIB table id.
2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
(e.g., VRF) device. By resetting flowi_oif only for this case the
FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed,
removing several checks in the datapath. The flowi_iif path can be
simplified to only be called if the it is not loopback (loopback can
not be assigned to an L3 domain) and the l3mdev index is not already
set.
3. Avoid another device lookup in the output path when the fib lookup
returns a reject failure.
Note: 2 functional tests for local traffic with reject fib rules are
updated to reflect the new direct failure at FIB lookup time for ping
rather than the failure on packet path. The current code fails like this:
HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: Warning: source address might be selected on device other than: eth1
PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.
--- 172.16.3.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
where the test now directly fails:
HINT: Fails since address on vrf device is out of device scope
COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
ping: connect: No route to host
Signed-off-by: David Ahern <dsahern@kernel.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5f9c55c806 ("ipv6: check return value of ipv6_skip_exthdr")
introduced an incorrect check, which leads to all ESP packets over
either TCPv6 or UDPv6 encapsulation being dropped. In this particular
case, offset is negative, since skb->data points to the ESP header in
the following chain of headers, while skb->network_header points to
the IPv6 header:
IPv6 | ext | ... | ext | UDP | ESP | ...
That doesn't seem to be a problem, especially considering that if we
reach esp6_input_done2, we're guaranteed to have a full set of headers
available (otherwise the packet would have been dropped earlier in the
stack). However, it means that the return value will (intentionally)
be negative. We can make the test more specific, as the expected
return value of ipv6_skip_exthdr will be the (negated) size of either
a UDP header, or a TCP header with possible options.
In the future, we should probably either make ipv6_skip_exthdr
explicitly accept negative offsets (and adjust its return value for
error cases), or make ipv6_skip_exthdr only take non-negative
offsets (and audit all callers).
Fixes: 5f9c55c806 ("ipv6: check return value of ipv6_skip_exthdr")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-03-09
1) Fix IPv6 PMTU discovery for xfrm interfaces.
From Lina Wang.
2) Revert failing for policies and states that are
configured with XFRMA_IF_ID 0. It broke a
user configuration. From Kai Lueke.
3) Fix a possible buffer overflow in the ESP output path.
4) Fix ESP GSO for tunnel and BEET mode on inter address
family tunnels.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We have a number of cases where function returns drop/no drop
decision as a boolean. Now that we want to report the reason
code as well we have to pass extra output arguments.
We can make the reason code evaluate correctly as bool.
I believe we're good to reorder the reasons as they are
reported to user space as strings.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The esp tunnel GSO handlers use skb_mac_gso_segment to
push the inner packet to the segmentation handlers.
However, skb_mac_gso_segment takes the Ethernet Protocol
ID from 'skb->protocol' which is wrong for inter address
family tunnels. We fix this by introducing a new
skb_eth_gso_segment function.
This function can be used if it is necessary to pass the
Ethernet Protocol ID directly to the segmentation handler.
First users of this function will be the esp4 and esp6
tunnel segmentation handlers.
Fixes: c35fe4106b ("xfrm: Add mode handlers for IPsec on layer 2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The xfrm{4,6}_beet_gso_segment() functions did not correctly set the
SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 gso types for the address family
tunneling case. Fix this by setting these gso types.
Fixes: 384a46ea7b ("esp4: add gso_segment for esp4 beet mode")
Fixes: 7f9e40eb18 ("esp6: add gso_segment for esp6 beet mode")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The maximum message size that can be send is bigger than
the maximum site that skb_page_frag_refill can allocate.
So it is possible to write beyond the allocated buffer.
Fix this by doing a fallback to COW in that case.
v2:
Avoid get get_order() costs as suggested by Linus Torvalds.
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Reported-by: valis <sec@valis.email>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
While investigating on why a synchronize_net() has been added recently
in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report()
might drop skbs in some cases.
Discussion about removing synchronize_net() from ipv6_mc_down()
will happen in a different thread.
Fixes: f185de28d9 ("mld: add new workqueues for process mld events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous patches handled the delivery_time in the ingress path
before the routing decision is made. This patch can postpone clearing
delivery_time in a skb until knowing it is delivered locally and also
set the (rcv) timestamp if needed. This patch moves the
skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
and ip6_input_finish().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IOAM is a hop-by-hop option with a temporary iana allocation (49).
Since it is hop-by-hop, it is done before the input routing decision.
One of the traced data field is the (rcv) timestamp.
When the locally generated skb is looping from egress to ingress over
a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
delivery time before it is known that it will be delivered locally
and received by another sk.
Like handling the network tapping (tcpdump) in the earlier patch,
this patch gets the timestamp if needed without over-writing the
delivery_time in the skb->tstamp. skb_tstamp_cond() is added to do the
ktime_get_real() with an extra cond arg to check on top of the
netstamp_needed_key static key. skb_tstamp_cond() will also be used in
a latter patch and it needs the netstamp_needed_key check.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A latter patch will postpone the delivery_time clearing until the stack
knows the skb is being delivered locally (i.e. calling
skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4
and at ip6_input_finish() for IPv6). That will allow other kernel
forwarding path (e.g. ip[6]_forward) to keep the delivery_time also.
A very similar IPv6 defrag codes have been duplicated in
multiple places: regular IPv6, nf_conntrack, and 6lowpan.
Unlike the IPv4 defrag which is done before ip_local_deliver_finish(),
the regular IPv6 defrag is done after ip6_input_finish().
Thus, no change should be needed in the regular IPv6 defrag
logic because skb_clear_delivery_time() should have been called.
6lowpan also does not need special handling on delivery_time
because it is a non-inet packet_type.
However, cf_conntrack has a case in NF_INET_PRE_ROUTING that needs
to do the IPv6 defrag earlier. Thus, it needs to save the
mono_delivery_time bit in the inet_frag_queue which is similar
to how it is handled in the previous patch for the IPv4 defrag.
This patch chooses to do it consistently and stores the mono_delivery_time
in the inet_frag_queue for all cases such that it will be easier
for the future refactoring effort on the IPv6 reasm code.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
If skb->tstamp has the mono delivery_time, clearing it can hurt
the performance when it finally transmits out to fq@phy-dev.
The earlier patch added a skb->mono_delivery_time bit to
flag the skb->tstamp carrying the mono delivery_time.
This patch adds skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.
The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally. It will be done in a latter patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).
Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.
Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.
While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev. For example, when forwarding from egress to
ingress and then finally back to egress:
tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
^ ^
reset rest
This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq. A latter patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.
In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
[ This patch is a prep work. The following patches will
get the other parts of the stack ready first. Then another patch
after that will finally set the skb->mono_delivery_time. ]
skb_set_delivery_time() function is added. It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.
A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place
holder. It will be enabled in a latter patch.
A similar case in tcp_ipv6 can be done with
skb_set_delivery_time() in tcp_v6_send_response().
[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in tunnel mode, if outer interface(ipv4) is less, it is easily to let
inner IPV6 mtu be less than 1280. If so, a Packet Too Big ICMPV6 message
is received. When send again, packets are fragmentized with 1280, they
are still rejected with ICMPV6(Packet Too Big) by xfrmi_xmit2().
According to RFC4213 Section3.2.2:
if (IPv4 path MTU - 20) is less than 1280
if packet is larger than 1280 bytes
Send ICMPv6 "packet too big" with MTU=1280
Drop packet
else
Encapsulate but do not set the Don't Fragment
flag in the IPv4 header. The resulting IPv4
packet might be fragmented by the IPv4 layer
on the encapsulator or by some router along
the IPv4 path.
endif
else
if packet is larger than (IPv4 path MTU - 20)
Send ICMPv6 "packet too big" with
MTU = (IPv4 path MTU - 20).
Drop packet.
else
Encapsulate and set the Don't Fragment flag
in the IPv4 header.
endif
endif
Packets should be fragmentized with ipv4 outer interface, so change it.
After it is fragemtized with ipv4, there will be double fragmenation.
No.48 & No.51 are ipv6 fragment packets, No.48 is double fragmentized,
then tunneled with IPv4(No.49& No.50), which obey spec. And received peer
cannot decrypt it rightly.
48 2002::10 2002::11 1296(length) IPv6 fragment (off=0 more=y ident=0xa20da5bc nxt=50)
49 0x0000 (0) 2002::10 2002::11 1304 IPv6 fragment (off=0 more=y ident=0x7448042c nxt=44)
50 0x0000 (0) 2002::10 2002::11 200 ESP (SPI=0x00035000)
51 2002::10 2002::11 180 Echo (ping) request
52 0x56dc 2002::10 2002::11 248 IPv6 fragment (off=1232 more=n ident=0xa20da5bc nxt=50)
xfrm6_noneed_fragment has fixed above issues. Finally, it acted like below:
1 0x6206 192.168.1.138 192.168.1.1 1316 Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6206) [Reassembled in #2]
2 0x6206 2002::10 2002::11 88 IPv6 fragment (off=0 more=y ident=0x1f440778 nxt=50)
3 0x0000 2002::10 2002::11 248 ICMPv6 Echo (ping) request
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are two reasons for addrconf_notify() to be called with NETDEV_DOWN:
either the network device is actually going down, or IPv6 was disabled
on the interface.
If either of them stays down while the other is toggled, we repeatedly
call the code for NETDEV_DOWN, including ipv6_mc_down(), while never
calling the corresponding ipv6_mc_up() in between. This will cause a
new entry in idev->mc_tomb to be allocated for each multicast group
the interface is subscribed to, which in turn leaks one struct ifmcaddr6
per nontrivial multicast group the interface is subscribed to.
The following reproducer will leak at least $n objects:
ip addr add ff2e::4242/32 dev eth0 autojoin
sysctl -w net.ipv6.conf.eth0.disable_ipv6=1
for i in $(seq 1 $n); do
ip link set up eth0; ip link set down eth0
done
Joining groups with IPV6_ADD_MEMBERSHIP (unprivileged) or setting the
sysctl net.ipv6.conf.eth0.forwarding to 1 (=> subscribing to ff02::2)
can also be used to create a nontrivial idev->mc_list, which will the
leak objects with the right up-down-sequence.
Based on both sources for NETDEV_DOWN events the interface IPv6 state
should be considered:
- not ready if the network interface is not ready OR IPv6 is disabled
for it
- ready if the network interface is ready AND IPv6 is enabled for it
The functions ipv6_mc_up() and ipv6_down() should only be run when this
state changes.
Implement this by remembering when the IPv6 state is ready, and only
run ipv6_mc_down() if it actually changed from ready to not ready.
The other direction (not ready -> ready) already works correctly, as:
- the interface notification triggered codepath for NETDEV_UP /
NETDEV_CHANGE returns early if ipv6 is disabled, and
- the disable_ipv6=0 triggered codepath skips fully initializing the
interface as long as addrconf_link_ready(dev) returns false
- calling ipv6_mc_up() repeatedly does not leak anything
Fixes: 3ce62a84d5 ("ipv6: exit early in addrconf_notify() if IPv6 is disabled")
Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() which is used in the packet egress path of IP layer
with kfree_skb_reason(). Functions that are involved include:
__ip_queue_xmit()
ip_finish_output()
ip_mc_finish_output()
ip6_output()
ip6_finish_output()
ip6_finish_output2()
Following new drop reasons are introduced:
SKB_DROP_REASON_IP_OUTNOROUTES
SKB_DROP_REASON_BPF_CGROUP_EGRESS
SKB_DROP_REASON_IPV6DISABLED
SKB_DROP_REASON_NEIGH_CREATEFAIL
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Fix PMTU for IPv6 if the reported MTU minus the ESP overhead is
smaller than 1280. From Jiri Bohac.
2) Fix xfrm interface ID and inter address family tunneling when
migrating xfrm states. From Yan Yan.
3) Add missing xfrm intrerface ID initialization on xfrmi_changelink.
From Antony Antony.
4) Enforce validity of xfrm offload input flags so that userspace can't
send undefined flags to the offload driver.
From Leon Romanovsky.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions do essentially the same work to verify TCP-MD5 sign.
Code can be merged into one family-independent function in order to
reduce copy'n'paste and generated code.
Later with TCP-AO option added, this will allow to create one function
that's responsible for segment verification, that will have all the
different checks for MD5/AO/non-signed packets, which in turn will help
to see checks for all corner-cases in one function, rather than spread
around different families and functions.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
valid_lft, prefered_lft and tstamp are always accessed under the lock
"lock" in other places. Reading these without taking the lock may result
in inconsistencies regarding the calculation of the valid and preferred
variables since decisions are taken on these fields for those variables.
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Link: https://lore.kernel.org/r/20220223131954.6570-1-niels.dossche@ugent.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All other skbs allocated for TCP tx are using MAX_TCP_HEADER already.
MAX_HEADER can be too small for some cases (like eBPF based encapsulation),
so this can avoid extra pskb_expand_head() in lower stacks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220222031115.4005060-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch separate NS message allocation steps from ndisc_send_ns(),
so it could be used in other places, like bonding, to allocate and
send IPv6 NS message.
Also export ndisc_send_skb() and ndisc_ns_create() for later bonding usage.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encounter a tcp drop issue in our cloud environment. Packet GROed in
host forwards to a VM virtio_net nic with net_failover enabled. VM acts
as a IPVS LB with ipip encapsulation. The full path like:
host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
-> ipip encap -> net_failover tx -> virtio_net tx
When net_failover transmits a ipip pkt (gso_type = 0x0103, which means
SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso
did because it supports TSO and GSO_IPXIP4. But network_header points to
inner ip header.
Call Trace:
tcp4_gso_segment ------> return NULL
inet_gso_segment ------> inner iph, network_header points to
ipip_gso_segment
inet_gso_segment ------> outer iph
skb_mac_gso_segment
Afterwards virtio_net transmits the pkt, only inner ip header is modified.
And the outer one just keeps unchanged. The pkt will be dropped in remote
host.
Call Trace:
inet_gso_segment ------> inner iph, outer iph is skipped
skb_mac_gso_segment
__skb_gso_segment
validate_xmit_skb
validate_xmit_skb_list
sch_direct_xmit
__qdisc_run
__dev_queue_xmit ------> virtio_net
dev_hard_start_xmit
__dev_queue_xmit ------> net_failover
ip_finish_output2
ip_output
iptunnel_xmit
ip_tunnel_xmit
ipip_tunnel_xmit ------> ipip
dev_hard_start_xmit
__dev_queue_xmit
ip_finish_output2
ip_output
ip_forward
ip_rcv
__netif_receive_skb_one_core
netif_receive_skb_internal
napi_gro_receive
receive_buf
virtnet_poll
net_rx_action
The root cause of this issue is specific with the rare combination of
SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
SKB_GSO_DODGY is set from external virtio_net. We need to reset network
header when callbacks.gso_segment() returns NULL.
This patch also includes ipv6_gso_segment(), considering SIT, etc.
Fixes: cb32f511a7 ("ipip: add GSO/TSO support")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with
kfree_skb_reason().
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:
SKB_DROP_REASON_SOCKET_BACKLOG
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.
Following drop reasons are added:
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>