2019-05-27 08:55:01 +02:00
// SPDX-License-Identifier: GPL-2.0-or-later
2005-04-16 15:20:36 -07:00
/*
 *	TCP over IPv6
2007-02-09 23:24:49 +09:00
 *	Linux INET6 implementation
2005-04-16 15:20:36 -07:00
 *
 *	Authors:
2007-02-09 23:24:49 +09:00
 *	Pedro Roque		<roque@di.fc.ul.pt>
2005-04-16 15:20:36 -07:00
 *
2007-02-09 23:24:49 +09:00
 *	Based on:
2005-04-16 15:20:36 -07:00
 *	linux/net/ipv4/tcp.c
 *	linux/net/ipv4/tcp_input.c
 *	linux/net/ipv4/tcp_output.c
 *
 *	Fixes:
 *	Hideaki YOSHIFUJI	:	sin6_scope_id support
 *	YOSHIFUJI Hideaki @USAGI and:	Support IPV6_V6ONLY socket option, which
 *	Alexey Kuznetsov		allow both IPv4 and IPv6 sockets to bind
 *					a single port at the same time.
 *	YOSHIFUJI Hideaki @USAGI:	convert /proc/net/tcp6 to seq_file.
 */
2008-12-29 23:04:08 -08:00
#include <linux/bottom_half.h>
2005-04-16 15:20:36 -07:00
#include <linux/module.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/net.h>
#include <linux/jiffies.h>
#include <linux/in.h>
#include <linux/in6.h>
#include <linux/netdevice.h>
#include <linux/init.h>
#include <linux/jhash.h>
#include <linux/ipsec.h>
#include <linux/times.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
2014-03-29 09:27:29 +08:00
#include <linux/uaccess.h>
2005-04-16 15:20:36 -07:00
#include <linux/ipv6.h>
#include <linux/icmpv6.h>
#include <linux/random.h>
2019-05-03 17:01:37 +02:00
#include <linux/indirect_call_wrapper.h>
2005-04-16 15:20:36 -07:00
#include <net/tcp.h>
#include <net/ndisc.h>
2005-08-12 09:26:18 -03:00
#include <net/inet6_hashtables.h>
2005-12-13 23:15:24 -08:00
#include <net/inet6_connection_sock.h>
2005-04-16 15:20:36 -07:00
#include <net/ipv6.h>
#include <net/transp_v6.h>
#include <net/addrconf.h>
#include <net/ip6_route.h>
#include <net/ip6_checksum.h>
#include <net/inet_ecn.h>
#include <net/protocol.h>
#include <net/xfrm.h>
#include <net/snmp.h>
#include <net/dsfield.h>
2005-12-13 23:25:19 -08:00
#include <net/timewait_sock.h>
2008-04-03 14:22:32 -07:00
#include <net/inet_common.h>
2011-08-03 20:50:44 -07:00
#include <net/secure_seq.h>
2013-07-10 17:13:17 +03:00
#include <net/busy_poll.h>
2005-04-16 15:20:36 -07:00
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
2016-01-24 21:20:23 +08:00
#include <crypto/hash.h>
2006-11-14 19:07:45 -08:00
#include <linux/scatterlist.h>
2017-10-23 09:20:24 -07:00
#include <trace/events/tcp.h>
2015-09-29 07:42:39 -07:00
static void	tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb);
static void	tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
2008-08-06 23:50:04 -07:00
				      struct request_sock *req);
2005-04-16 15:20:36 -07:00
2021-11-15 11:02:41 -08:00
INDIRECT_CALLABLE_SCOPE int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
2005-04-16 15:20:36 -07:00
2009-09-01 19:25:04 +00:00
static const struct inet_connection_sock_af_ops ipv6_mapped;
2020-01-09 07:59:21 -08:00
const struct inet_connection_sock_af_ops ipv6_specific;
2006-11-14 19:53:22 -08:00
#ifdef CONFIG_TCP_MD5SIG
2009-09-01 19:25:03 +00:00
static const struct tcp_sock_af_ops tcp_sock_ipv6_specific;
static const struct tcp_sock_af_ops tcp_sock_ipv6_mapped_specific;
2008-04-18 12:45:16 +09:00
#else
2015-09-29 21:24:05 -07:00
static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(const struct sock *sk,
2019-12-30 14:14:28 -08:00
						   const struct in6_addr *addr,
						   int l3index)
2008-04-18 12:45:16 +09:00
{
	return NULL;
}
2006-11-14 19:53:22 -08:00
#endif
2005-04-16 15:20:36 -07:00
2019-03-19 07:01:08 -07:00
/* Helper returning the inet6 address from a given tcp socket.
 * It can be used in TCP stack instead of inet6_sk(sk).
 * This avoids a dereference and allows compiler optimizations.
2019-04-01 03:09:20 -07:00
 * It is a specialized version of inet6_sk_generic().
2019-03-19 07:01:08 -07:00
 */
static struct ipv6_pinfo *tcp_inet6_sk(const struct sock *sk)
{
2019-04-01 03:09:20 -07:00
	unsigned int offset = sizeof(struct tcp6_sock) - sizeof(struct ipv6_pinfo);
2019-03-19 07:01:08 -07:00
2019-04-01 03:09:20 -07:00
	return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
2019-03-19 07:01:08 -07:00
}
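The offset arithmetic in tcp_inet6_sk() only works if struct ipv6_pinfo sits at the very end of struct tcp6_sock, with no padding after it. The following is a minimal user-space sketch of the same trick, not the kernel's code: demo_sock, demo_pinfo, demo_tcp6_sock and demo_inet6_sk() are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct demo_sock {
	int state;
};

struct demo_pinfo {
	long hop_limit;
};

struct demo_tcp6_sock {
	struct demo_sock sk;		/* first member: a demo_sock pointer aliases the container */
	long tcp_private[4];		/* placeholder for protocol state */
	struct demo_pinfo inet6;	/* assumed to be the last member */
};

/* The subtraction below is only valid when no padding follows the last member. */
_Static_assert(offsetof(struct demo_tcp6_sock, inet6) ==
	       sizeof(struct demo_tcp6_sock) - sizeof(struct demo_pinfo),
	       "demo_pinfo must sit at the very end of demo_tcp6_sock");

/* Find the trailing member by pointer arithmetic, without loading a stored pointer. */
static struct demo_pinfo *demo_inet6_sk(const struct demo_sock *sk)
{
	size_t offset = sizeof(struct demo_tcp6_sock) - sizeof(struct demo_pinfo);

	return (struct demo_pinfo *)((unsigned char *)sk + offset);
}
```

Because the offset is a compile-time constant, the compiler can fold it into the address computation. The sketch is only meaningful when the demo_sock pointer really points into a demo_tcp6_sock.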
2012-08-19 03:30:38 +00:00
static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
{
	struct dst_entry *dst = skb_dst(skb);
net: fix IP early demux races
David Wilder reported crashes caused by dst reuse.
<quote David>
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
</quote>
Thanks to his detailed report and analysis, we were able to understand
the core issue.
IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.
When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.
Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.
We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
This patch, being a stable candidate, adds two new helpers, and uses
them only from IP early demux problematic paths.
It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.
Can probably be backported back to linux-3.6 kernels
Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 14:08:53 -08:00
	if (dst && dst_hold_safe(dst)) {
2014-09-08 08:06:07 -07:00
		const struct rt6_info *rt = (const struct rt6_info *)dst;
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 06:33:30 -08:00
		rcu_assign_pointer(sk->sk_rx_dst, dst);
2021-10-25 09:48:16 -07:00
		sk->sk_rx_dst_ifindex = skb->skb_iif;
2021-10-25 09:48:17 -07:00
		sk->sk_rx_dst_cookie = rt6_get_cookie(rt);
2014-09-08 08:06:07 -07:00
	}
2012-08-19 03:30:38 +00:00
}
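The "net: fix IP early demux races" changelog above says a plain atomic increment is unsafe once the refcount may already have dropped to zero, so the count has to be checked before it is raised. Below is a minimal user-space sketch of that "increment unless zero" idea using C11 atomics. It is not the kernel's dst_hold_safe(); struct demo_ref and demo_hold_safe() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for a dst_entry. */
struct demo_ref {
	atomic_int refcnt;
};

/* Take a reference only if the count is still non-zero.
 * Returns false when the object is already on its way to being freed.
 */
static bool demo_hold_safe(struct demo_ref *obj)
{
	int old = atomic_load(&obj->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value and we retry. */
		if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller that gets false must not store or use the pointer, which mirrors how the function above only caches dst when dst_hold_safe() succeeds.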
2017-05-05 06:56:54 -07:00
static u32 tcp_v6_init_seq(const struct sk_buff *skb)
2005-04-16 15:20:36 -07:00
{
2017-05-05 06:56:54 -07:00
	return secure_tcpv6_seq(ipv6_hdr(skb)->daddr.s6_addr32,
				ipv6_hdr(skb)->saddr.s6_addr32,
				tcp_hdr(skb)->dest,
				tcp_hdr(skb)->source);
}
2017-06-07 10:34:39 -07:00
static u32 tcp_v6_init_ts_off(const struct net *net, const struct sk_buff *skb)
2017-05-05 06:56:54 -07:00
{
2017-06-07 10:34:39 -07:00
	return secure_tcpv6_ts_off(net, ipv6_hdr(skb)->daddr.s6_addr32,
2017-05-05 06:56:54 -07:00
				   ipv6_hdr(skb)->saddr.s6_addr32);
2005-04-16 15:20:36 -07:00
}
2018-03-30 15:08:05 -07:00
static int tcp_v6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
			      int addr_len)
{
	/* This check is replicated from tcp_v6_connect() and intended to
	 * prevent BPF program called below from accessing bytes that are out
	 * of the bound specified by user in addr_len.
	 */
	if (addr_len < SIN6_LEN_RFC2133)
		return -EINVAL;
	sock_owned_by_me(sk);
	return BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr);
}
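The addr_len check above rejects address buffers too short to hold the RFC 2133 layout of sockaddr_in6, which predates sin6_scope_id. The following user-space sketch shows that length logic under one stated assumption: the minimum length equals the offset of sin6_scope_id, which is 24 bytes on common ABIs. DEMO_SIN6_LEN_RFC2133, demo_check_addr_len() and demo_has_scope_id() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <netinet/in.h>

/* Assumption: the RFC 2133 sockaddr_in6 ends where sin6_scope_id begins. */
#define DEMO_SIN6_LEN_RFC2133 offsetof(struct sockaddr_in6, sin6_scope_id)

/* Reject buffers that cannot hold family, port, flowinfo and address. */
static int demo_check_addr_len(size_t addr_len)
{
	if (addr_len < DEMO_SIN6_LEN_RFC2133)
		return -EINVAL;
	return 0;
}

/* sin6_scope_id may only be read when the caller passed the full structure. */
static int demo_has_scope_id(size_t addr_len)
{
	return addr_len >= sizeof(struct sockaddr_in6);
}
```

A buffer of exactly the RFC 2133 length passes the first check but is treated as carrying no scope id. This matches how tcp_v6_connect() tests addr_len against sizeof(struct sockaddr_in6) before it reads sin6_scope_id.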
2007-02-09 23:24:49 +09:00
static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
2005-04-16 15:20:36 -07:00
			  int addr_len)
{
	struct sockaddr_in6 *usin = (struct sockaddr_in6 *)uaddr;
2005-12-13 23:26:10 -08:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2022-09-07 18:10:17 -07:00
	struct in6_addr *saddr = NULL, *final_p, final;
2022-01-26 10:07:14 -08:00
	struct inet_timewait_death_row *tcp_death_row;
2019-03-19 07:01:08 -07:00
	struct ipv6_pinfo *np = tcp_inet6_sk(sk);
2022-09-07 18:10:17 -07:00
	struct inet_sock *inet = inet_sk(sk);
2005-04-16 15:20:36 -07:00
	struct tcp_sock *tp = tcp_sk(sk);
2022-09-07 18:10:17 -07:00
	struct net *net = sock_net(sk);
2015-11-29 19:37:57 -08:00
	struct ipv6_txoptions *opt;
2005-04-16 15:20:36 -07:00
	struct dst_entry *dst;
2022-09-07 18:10:17 -07:00
	struct flowi6 fl6;
2005-04-16 15:20:36 -07:00
	int addr_type;
	int err;
2007-02-09 23:24:49 +09:00
	if (addr_len < SIN6_LEN_RFC2133)
2005-04-16 15:20:36 -07:00
		return -EINVAL;
2007-02-09 23:24:49 +09:00
	if (usin->sin6_family != AF_INET6)
2010-09-22 20:43:57 +00:00
		return -EAFNOSUPPORT;
2005-04-16 15:20:36 -07:00
2011-03-12 16:22:43 -05:00
	memset(&fl6, 0, sizeof(fl6));
2005-04-16 15:20:36 -07:00
	if (np->sndflow) {
2011-03-12 16:22:43 -05:00
		fl6.flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
		IP6_ECN_flow_init(fl6.flowlabel);
		if (fl6.flowlabel & IPV6_FLOWLABEL_MASK) {
2005-04-16 15:20:36 -07:00
			struct ip6_flowlabel *flowlabel;
2011-03-12 16:22:43 -05:00
			flowlabel = fl6_sock_lookup(sk, fl6.flowlabel);
2019-07-07 05:34:45 -04:00
			if (IS_ERR(flowlabel))
2005-04-16 15:20:36 -07:00
				return -EINVAL;
			fl6_sock_release(flowlabel);
		}
	}
	/*
2007-02-09 23:24:49 +09:00
	 *	connect() to INADDR_ANY means loopback (BSD'ism).
	 */
2017-02-12 17:26:07 -05:00
	if (ipv6_addr_any(&usin->sin6_addr)) {
		if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
			ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
					       &usin->sin6_addr);
		else
			usin->sin6_addr = in6addr_loopback;
	}
2005-04-16 15:20:36 -07:00
	addr_type = ipv6_addr_type(&usin->sin6_addr);
2013-12-19 18:44:34 +08:00
	if (addr_type & IPV6_ADDR_MULTICAST)
2005-04-16 15:20:36 -07:00
		return -ENETUNREACH;
	if (addr_type & IPV6_ADDR_LINKLOCAL) {
		if (addr_len >= sizeof(struct sockaddr_in6) &&
		    usin->sin6_scope_id) {
			/* If interface is set while binding, indices
			 * must coincide.
			 */
2018-01-04 14:03:54 -08:00
			if (!sk_dev_equal_l3scope(sk, usin->sin6_scope_id))
2005-04-16 15:20:36 -07:00
				return -EINVAL;
			sk->sk_bound_dev_if = usin->sin6_scope_id;
		}
		/* Connect to link-local address requires an interface */
		if (!sk->sk_bound_dev_if)
			return -EINVAL;
	}
	if (tp->rx_opt.ts_recent_stamp &&
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:42:29 -07:00
	    !ipv6_addr_equal(&sk->sk_v6_daddr, &usin->sin6_addr)) {
2005-04-16 15:20:36 -07:00
		tp->rx_opt.ts_recent = 0;
		tp->rx_opt.ts_recent_stamp = 0;
2019-10-10 20:17:41 -07:00
		WRITE_ONCE(tp->write_seq, 0);
2005-04-16 15:20:36 -07:00
	}
2013-10-03 15:42:29 -07:00
	sk->sk_v6_daddr = usin->sin6_addr;
2011-03-12 16:22:43 -05:00
	np->flow_label = fl6.flowlabel;
2005-04-16 15:20:36 -07:00
	/*
	 *	TCP over IPv4
	 */
2017-02-12 17:26:07 -05:00
	if (addr_type & IPV6_ADDR_MAPPED) {
2005-12-13 23:26:10 -08:00
		u32 exthdrlen = icsk->icsk_ext_hdr_len;
2005-04-16 15:20:36 -07:00
		struct sockaddr_in sin;
2022-04-20 10:58:50 +09:00
		if (ipv6_only_sock(sk))
2005-04-16 15:20:36 -07:00
			return -ENETUNREACH;
		sin.sin_family = AF_INET;
		sin.sin_port = usin->sin6_port;
		sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
2005-12-13 23:26:10 -08:00
		icsk->icsk_af_ops = &ipv6_mapped;
2020-01-21 16:56:18 -08:00
		if (sk_is_mptcp(sk))
2020-01-30 10:45:26 +01:00
			mptcpv6_handle_mapped(sk, true);
2005-04-16 15:20:36 -07:00
		sk->sk_backlog_rcv = tcp_v4_do_rcv;
2006-11-14 19:07:45 -08:00
#ifdef CONFIG_TCP_MD5SIG
		tp->af_specific = &tcp_sock_ipv6_mapped_specific;
#endif
2005-04-16 15:20:36 -07:00
		err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
		if (err) {
2005-12-13 23:26:10 -08:00
			icsk->icsk_ext_hdr_len = exthdrlen;
			icsk->icsk_af_ops = &ipv6_specific;
2020-01-21 16:56:18 -08:00
			if (sk_is_mptcp(sk))
2020-01-30 10:45:26 +01:00
				mptcpv6_handle_mapped(sk, false);
2005-04-16 15:20:36 -07:00
			sk->sk_backlog_rcv = tcp_v6_do_rcv;
2006-11-14 19:07:45 -08:00
#ifdef CONFIG_TCP_MD5SIG
			tp->af_specific = &tcp_sock_ipv6_specific;
#endif
2005-04-16 15:20:36 -07:00
			goto failure;
		}
2015-03-18 14:05:35 -07:00
		np->saddr = sk->sk_v6_rcv_saddr;
2005-04-16 15:20:36 -07:00
		return err;
	}
2013-10-03 15:42:29 -07:00
	if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr))
		saddr = &sk->sk_v6_rcv_saddr;
2005-04-16 15:20:36 -07:00
2011-03-12 16:22:43 -05:00
	fl6.flowi6_proto = IPPROTO_TCP;
2013-10-03 15:42:29 -07:00
fl6 . daddr = sk - > sk_v6_daddr ;
2011-11-21 03:39:03 +00:00
fl6 . saddr = saddr ? * saddr : np - > saddr ;
2011-03-12 16:22:43 -05:00
fl6 . flowi6_oif = sk - > sk_bound_dev_if ;
fl6 . flowi6_mark = sk - > sk_mark ;
2011-03-12 16:36:19 -05:00
fl6 . fl6_dport = usin - > sin6_port ;
fl6 . fl6_sport = inet - > inet_sport ;
2016-11-04 02:23:43 +09:00
fl6 . flowi6_uid = sk - > sk_uid ;
2005-04-16 15:20:36 -07:00
2016-04-05 17:10:15 +02:00
opt = rcu_dereference_protected ( np - > opt , lockdep_sock_is_held ( sk ) ) ;
2015-11-29 19:37:57 -08:00
final_p = fl6_update_dst ( & fl6 , opt , & final ) ;
2005-04-16 15:20:36 -07:00
2020-09-27 22:38:26 -04:00
security_sk_classify_flow ( sk , flowi6_to_flowi_common ( & fl6 ) ) ;
2006-08-04 23:12:42 -07:00
2022-09-07 18:10:17 -07:00
dst = ip6_dst_lookup_flow ( net , sk , & fl6 , final_p ) ;
2011-03-01 13:19:07 -08:00
	if (IS_ERR(dst)) {
		err = PTR_ERR(dst);
2005-04-16 15:20:36 -07:00
goto failure ;
2007-05-24 18:17:54 -07:00
}
2005-04-16 15:20:36 -07:00
2015-03-29 14:00:04 +01:00
if ( ! saddr ) {
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stall new TCP connections, since __inet_inherit_port()
also contends for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be cases where a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
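The difference between the two tables can be sketched in userspace as follows. This is a minimal illustration under stated assumptions: the table size `DEMO_BUCKETS`, the helper names, and the use of FNV-1a as the mixing function are all hypothetical choices for the example, not what the kernel uses. It shows that a port-only index puts every socket on a given port in one bucket, while an index over port and address spreads them out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 64u	/* hypothetical table size, a power of two */

/* bhash-style index: the port alone selects the bucket. */
static unsigned int demo_bhash(uint16_t port)
{
	return port & (DEMO_BUCKETS - 1);
}

/* bhash2-style index: mix the 16-byte address and the port (FNV-1a). */
static unsigned int demo_bhash2(const uint8_t *addr, uint16_t port)
{
	uint32_t h = 2166136261u;
	size_t i;

	assert(addr != NULL);
	for (i = 0; i < 16; i++) {
		h ^= addr[i];
		h *= 16777619u;
	}
	h ^= (uint8_t)(port >> 8);
	h *= 16777619u;
	h ^= (uint8_t)(port & 0xff);
	h *= 16777619u;
	return h & (DEMO_BUCKETS - 1);
}

/* Count how many distinct bhash2 buckets 256 addresses on one port use. */
static unsigned int demo_spread(uint16_t port)
{
	uint8_t addr[16] = { 0x20, 0x01, 0x0d, 0xb8 };
	unsigned char seen[DEMO_BUCKETS] = { 0 };
	unsigned int used = 0;
	int last;

	for (last = 0; last < 256; last++) {
		unsigned int b;

		addr[15] = (uint8_t)last;	/* vary only the last byte */
		b = demo_bhash2(addr, port);
		if (!seen[b]) {
			seen[b] = 1;
			used++;
		}
	}
	return used;
}
```

With the port-only index, all 256 sockets would share the single bucket `demo_bhash(port)`, so a conflict check has to walk every one of them under that bucket's lock. The lock-ordering stipulations in the commit message are not modelled here.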
2022-08-22 11:10:21 -07:00
		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
		struct in6_addr prev_v6_rcv_saddr;
		if (icsk->icsk_bind2_hash) {
			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
2022-09-07 18:10:17 -07:00
sk , net , inet - > inet_num ) ;
2022-08-22 11:10:21 -07:00
prev_v6_rcv_saddr = sk - > sk_v6_rcv_saddr ;
}
2011-03-12 16:22:43 -05:00
saddr = & fl6 . saddr ;
2013-10-03 15:42:29 -07:00
sk - > sk_v6_rcv_saddr = * saddr ;
2022-08-22 11:10:21 -07:00
		if (prev_addr_hashbucket) {
			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
			if (err) {
				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
				goto failure;
			}
		}
2005-04-16 15:20:36 -07:00
}
/* set the source address */
2011-11-21 03:39:03 +00:00
np - > saddr = * saddr ;
2009-10-15 06:30:45 +00:00
inet - > inet_rcv_saddr = LOOPBACK4_IPV6 ;
2005-04-16 15:20:36 -07:00
2006-06-30 13:37:03 -07:00
sk - > sk_gso_type = SKB_GSO_TCPV6 ;
2015-12-02 21:53:57 -08:00
ip6_dst_store ( sk , dst , NULL , NULL ) ;
2005-04-16 15:20:36 -07:00
2005-12-13 23:26:10 -08:00
icsk - > icsk_ext_hdr_len = 0 ;
2015-11-29 19:37:57 -08:00
	if (opt)
		icsk->icsk_ext_hdr_len = opt->opt_flen +
					 opt->opt_nflen;
2005-04-16 15:20:36 -07:00
tp - > rx_opt . mss_clamp = IPV6_MIN_MTU - sizeof ( struct tcphdr ) - sizeof ( struct ipv6hdr ) ;
2009-10-15 06:30:45 +00:00
inet - > inet_dport = usin - > sin6_port ;
2005-04-16 15:20:36 -07:00
tcp_set_state ( sk , TCP_SYN_SENT ) ;
2022-09-07 18:10:18 -07:00
tcp_death_row = & net - > ipv4 . tcp_death_row ;
2016-12-28 17:52:32 +08:00
err = inet6_hash_connect ( tcp_death_row , sk ) ;
2005-04-16 15:20:36 -07:00
if ( err )
goto late_failure ;
2015-07-28 16:02:05 -07:00
sk_set_txhash ( sk ) ;
2014-10-22 21:42:01 +05:30
2017-02-22 13:23:55 +03:00
if ( likely ( ! tp - > repair ) ) {
if ( ! tp - > write_seq )
2019-10-10 20:17:41 -07:00
			WRITE_ONCE(tp->write_seq,
				   secure_tcpv6_seq(np->saddr.s6_addr32,
						    sk->sk_v6_daddr.s6_addr32,
						    inet->inet_sport,
						    inet->inet_dport));
2022-09-07 18:10:17 -07:00
tp - > tsoffset = secure_tcpv6_ts_off ( net , np - > saddr . s6_addr32 ,
2017-05-05 06:56:54 -07:00
sk - > sk_v6_daddr . s6_addr32 ) ;
2017-02-22 13:23:55 +03:00
}
2005-04-16 15:20:36 -07:00
net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created, without changing the rest of the socket call sequence:
s = socket()
create a new socket
setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
newly introduced sockopt
If set, new functionality described below will be used.
Return ENOTSUPP if TFO is not supported or not enabled in the
kernel.
connect()
With cookie present, return 0 immediately.
With no cookie, initiate 3WHS with TFO cookie-request option and
return -1 with errno = EINPROGRESS.
write()/sendmsg()
With cookie present, send out SYN with data and return the number of
bytes buffered.
With no cookie, and 3WHS not yet completed, return -1 with errno =
EINPROGRESS.
No MSG_FASTOPEN flag is needed.
read()
Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
write() is not called yet.
Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
established but no msg is received yet.
Return number of bytes read if socket is established and there is
msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
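From the application side, the call sequence described above might look like the following sketch. The helper name `demo_tfo_socket` is hypothetical, and the fallback definition of `TCP_FASTOPEN_CONNECT` (value 30, taken from the Linux UAPI headers) is only there in case the libc headers predate the option. The helper treats any setsockopt() failure as "Fast Open unavailable" rather than inspecting errno.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30	/* value from the Linux UAPI headers */
#endif

/*
 * Create a TCP socket and opt in to Fast Open on connect().
 * Returns the descriptor, or -1 if the socket could not be created or
 * the option is not supported or not enabled on this system.
 */
static int demo_tfo_socket(int family)
{
	int on = 1;
	int fd;

	assert(family == AF_INET || family == AF_INET6);

	fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
		       &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

After this, the application calls connect() and write() exactly as it did before; per the commit message, no MSG_FASTOPEN flag is needed.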
2017-01-23 10:59:22 -08:00
	if (tcp_fastopen_defer_connect(sk, &err))
		return err;
	if (err)
		goto late_failure;
2005-04-16 15:20:36 -07:00
	err = tcp_connect(sk);
	if (err)
		goto late_failure;
	return 0;
late_failure:
	tcp_set_state(sk, TCP_CLOSE);
failure:
2009-10-15 06:30:45 +00:00
inet - > inet_dport = 0 ;
2005-04-16 15:20:36 -07:00
sk - > sk_route_caps = 0 ;
return err ;
}
2012-07-23 09:48:52 +02:00
static void tcp_v6_mtu_reduced ( struct sock * sk )
{
struct dst_entry * dst ;
ipv6: tcp: drop silly ICMPv6 packet too big messages
While TCP stack scales reasonably well, there is still one part that
can be used to DDOS it.
IPv6 Packet too big messages have to lookup/insert a new route,
and if abused by attackers, can easily put hosts under high stress,
with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
ip6_protocol_deliver_rcu()
icmpv6_rcv()
icmpv6_notify()
tcp_v6_err()
tcp_v6_mtu_reduced()
inet6_csk_update_pmtu()
ip6_rt_update_pmtu()
__ip6_rt_update_pmtu()
ip6_rt_cache_alloc()
ip6_dst_alloc()
dst_alloc()
ip6_dst_gc()
fib6_run_gc()
spin_lock_bh() ...
Some of our servers have been hit by malicious ICMPv6 packets
trying to _increase_ the MTU/MSS of TCP flows.
We believe these ICMPv6 packets are a result of a bug in one ISP stack,
since they were blindly sent back for _every_ (small) packet sent to them.
These packets are for one TCP flow:
09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
TCP stack can filter some silly requests :
1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
These tests happen before the IPv6 routing stack is entered, thus
removing the potential contention and route exhaustion.
Note that the IPv6 stack was performing these checks, but too late
(i.e. after the route has been added, and after the potential
garbage collect war)
v2: fix typo caught by Martin, thanks !
v3: exports tcp_mtu_to_mss(), caught by David, thanks !
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
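The two filters listed in this commit message can be sketched as a pure function. This is a simplified illustration: the names are hypothetical, and `demo_mtu_to_mss()` assumes a fixed 40-byte IPv6 header and 20-byte TCP header with no extension headers or TCP options, whereas the kernel's tcp_mtu_to_mss() accounts for more than that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IPV6_MIN_MTU 1280u	/* minimum IPv6 link MTU */
#define DEMO_IPV6_HDR_LEN 40u	/* fixed IPv6 header */
#define DEMO_TCP_HDR_LEN  20u	/* TCP header without options */

/* Simplified MTU-to-MSS conversion (assumption: no options at all). */
static uint32_t demo_mtu_to_mss(uint32_t mtu)
{
	assert(mtu >= DEMO_IPV6_HDR_LEN + DEMO_TCP_HDR_LEN);
	return mtu - DEMO_IPV6_HDR_LEN - DEMO_TCP_HDR_LEN;
}

/*
 * Decide whether a Packet Too Big report is worth handing to the
 * routing layer. Returns false for the "silly" cases:
 *   1) the reported MTU is below the IPv6 minimum;
 *   2) the reported MTU would not shrink the current MSS.
 */
static bool demo_pmtu_should_process(uint32_t reported_mtu, uint32_t mss_cache)
{
	if (reported_mtu < DEMO_IPV6_MIN_MTU)
		return false;
	if (demo_mtu_to_mss(reported_mtu) >= mss_cache)
		return false;
	return true;
}
```

Under these assumptions, the "mtu 1460" reports from the trace above are dropped whenever the flow's cached MSS is already 1400 or less, so they never reach the route lookup.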
2021-07-08 00:21:09 -07:00
u32 mtu ;
2012-07-23 09:48:52 +02:00
	if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
		return;
2021-07-08 00:21:09 -07:00
	mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
	/* Drop requests trying to increase our current mss.
	 * Check done in __ip6_rt_update_pmtu() is too late.
	 */
	if (tcp_mtu_to_mss(sk, mtu) >= tcp_sk(sk)->mss_cache)
		return;
	dst = inet6_csk_update_pmtu(sk, mtu);
2012-07-23 09:48:52 +02:00
	if (!dst)
		return;
	if (inet_csk(sk)->icsk_pmtu_cookie > dst_mtu(dst)) {
		tcp_sync_mss(sk, dst_mtu(dst));
		tcp_simple_retransmit(sk);
	}
}
2018-11-08 12:19:21 +01:00
static int tcp_v6_err ( struct sk_buff * skb , struct inet6_skb_parm * opt ,
2009-06-23 04:31:07 -07:00
u8 type , u8 code , int offset , __be32 info )
2005-04-16 15:20:36 -07:00
{
2013-12-19 18:44:34 +08:00
const struct ipv6hdr * hdr = ( const struct ipv6hdr * ) skb - > data ;
2005-08-12 09:19:38 -03:00
const struct tcphdr * th = ( struct tcphdr * ) ( skb - > data + offset ) ;
2015-03-22 10:22:23 -07:00
struct net * net = dev_net ( skb - > dev ) ;
struct request_sock * fastopen ;
2005-04-16 15:20:36 -07:00
struct ipv6_pinfo * np ;
2007-02-09 23:24:49 +09:00
struct tcp_sock * tp ;
2014-05-11 20:22:12 -07:00
__u32 seq , snd_una ;
2015-03-22 10:22:23 -07:00
struct sock * sk ;
2016-02-02 19:31:12 -08:00
bool fatal ;
2015-03-22 10:22:23 -07:00
int err ;
2005-04-16 15:20:36 -07:00
2015-03-22 10:22:23 -07:00
sk = __inet6_lookup_established ( net , & tcp_hashinfo ,
& hdr - > daddr , th - > dest ,
& hdr - > saddr , ntohs ( th - > source ) ,
2017-08-07 08:44:21 -07:00
skb - > dev - > ifindex , inet6_sdif ( skb ) ) ;
2005-04-16 15:20:36 -07:00
2015-03-22 10:22:23 -07:00
if ( ! sk ) {
2016-04-27 16:44:36 -07:00
__ICMP6_INC_STATS ( net , __in6_dev_get ( skb - > dev ) ,
ICMP6_MIB_INERRORS ) ;
2018-11-08 12:19:21 +01:00
return - ENOENT ;
2005-04-16 15:20:36 -07:00
}
if ( sk - > sk_state = = TCP_TIME_WAIT ) {
2006-10-10 19:41:46 -07:00
inet_twsk_put ( inet_twsk ( sk ) ) ;
2018-11-08 12:19:21 +01:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2015-03-22 10:22:23 -07:00
seq = ntohl ( th - > seq ) ;
2016-02-02 19:31:12 -08:00
fatal = icmpv6_err_convert ( type , code , & err ) ;
2018-11-08 12:19:21 +01:00
if ( sk - > sk_state = = TCP_NEW_SYN_RECV ) {
tcp_req_err ( sk , seq , fatal ) ;
return 0 ;
}
2005-04-16 15:20:36 -07:00
bh_lock_sock ( sk ) ;
2012-07-23 09:48:52 +02:00
if ( sock_owned_by_user ( sk ) & & type ! = ICMPV6_PKT_TOOBIG )
2016-04-27 16:44:39 -07:00
__NET_INC_STATS ( net , LINUX_MIB_LOCKDROPPEDICMPS ) ;
2005-04-16 15:20:36 -07:00
if ( sk - > sk_state = = TCP_CLOSE )
goto out ;
2021-10-25 09:48:22 -07:00
	if (static_branch_unlikely(&ip6_min_hopcount)) {
		/* min_hopcount can be changed concurrently from do_ipv6_setsockopt() */
		if (ipv6_hdr(skb)->hop_limit < READ_ONCE(tcp_inet6_sk(sk)->min_hopcount)) {
			__NET_INC_STATS(net, LINUX_MIB_TCPMINTTLDROP);
			goto out;
		}
IPv6: Generic TTL Security Mechanism (final version)
This patch adds IPv6 support for RFC5082 Generalized TTL Security Mechanism.
This does not apply to users of mapped addresses; the IPv6 and IPv4 socket options are separate.
The server does have to deal with both IPv4 and IPv6 socket options
and the client has to handle the different options for each family.
On client:
int ttl = 255;
getaddrinfo(argv[1], argv[2], &hint, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET) {
setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
} else if (rp->ai_family == AF_INET6) {
setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
&ttl, sizeof(ttl));
}
if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
...
On server:
int minttl = 255 - maxhops;
getaddrinfo(NULL, port, &hints, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET6)
setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
&minttl, sizeof(minttl));
setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
break;
...
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
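The receive-side test that this option enables is a single comparison, sketched below as standalone functions. The names `demo_min_hopcount` and `demo_gtsm_accept` are hypothetical; the comparison direction mirrors the hop_limit check in tcp_v6_err() shown in this file, and the `255 - maxhops` computation comes from the server example in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Minimum acceptable hop limit for a peer at most max_hops routers away,
 * assuming the peer sends with the maximum hop limit of 255.
 */
static int demo_min_hopcount(int max_hops)
{
	assert(max_hops >= 0 && max_hops <= 255);
	return 255 - max_hops;
}

/*
 * GTSM-style filter: accept a packet only if its hop limit has not been
 * decremented below the configured minimum.
 */
static bool demo_gtsm_accept(uint8_t hop_limit, int min_hopcount)
{
	assert(min_hopcount >= 0 && min_hopcount <= 255);
	return hop_limit >= min_hopcount;
}
```

A min_hopcount of 0 accepts every packet, which corresponds to the check being effectively disabled.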
2010-04-22 15:24:53 -07:00
}
2005-04-16 15:20:36 -07:00
tp = tcp_sk ( sk ) ;
2014-05-11 20:22:12 -07:00
	/* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child()) */
2019-10-10 20:17:38 -07:00
fastopen = rcu_dereference ( tp - > fastopen_rsk ) ;
2014-05-11 20:22:12 -07:00
snd_una = fastopen ? tcp_rsk ( fastopen ) - > snt_isn : tp - > snd_una ;
2005-04-16 15:20:36 -07:00
if ( sk - > sk_state ! = TCP_LISTEN & &
2014-05-11 20:22:12 -07:00
! between ( seq , snd_una , tp - > snd_nxt ) ) {
2016-04-27 16:44:39 -07:00
__NET_INC_STATS ( net , LINUX_MIB_OUTOFWINDOWICMPS ) ;
2005-04-16 15:20:36 -07:00
goto out ;
}
2019-03-19 07:01:08 -07:00
np = tcp_inet6_sk ( sk ) ;
2005-04-16 15:20:36 -07:00
2012-07-12 00:25:15 -07:00
if ( type = = NDISC_REDIRECT ) {
2017-03-10 16:40:33 +11:00
if ( ! sock_owned_by_user ( sk ) ) {
struct dst_entry * dst = __sk_dst_check ( sk , np - > dst_cookie ) ;
2012-07-12 00:25:15 -07:00
2017-03-10 16:40:33 +11:00
			if (dst)
				dst->ops->redirect(dst, sk, skb);
		}
2013-04-07 04:53:15 +00:00
goto out ;
2012-07-12 00:25:15 -07:00
}
2005-04-16 15:20:36 -07:00
if ( type = = ICMPV6_PKT_TOOBIG ) {
2021-07-08 00:21:09 -07:00
u32 mtu = ntohl ( info ) ;
2013-03-18 07:01:28 +00:00
		/* We are not interested in TCP_LISTEN and open_requests
		 * (SYN-ACKs sent out by Linux are always < 576 bytes so
		 * they should go through unfragmented).
		 */
		if (sk->sk_state == TCP_LISTEN)
			goto out;
2013-12-15 03:41:14 +01:00
if ( ! ip6_sk_accept_pmtu ( sk ) )
goto out ;
2021-07-08 00:21:09 -07:00
		if (mtu < IPV6_MIN_MTU)
			goto out;
		WRITE_ONCE(tp->mtu_info, mtu);
2012-07-23 09:48:52 +02:00
if ( ! sock_owned_by_user ( sk ) )
tcp_v6_mtu_reduced ( sk ) ;
2012-09-05 10:53:18 +00:00
else if ( ! test_and_set_bit ( TCP_MTU_REDUCED_DEFERRED ,
2016-12-03 11:14:57 -08:00
& sk - > sk_tsq_flags ) )
2012-09-05 10:53:18 +00:00
sock_hold ( sk ) ;
2005-04-16 15:20:36 -07:00
goto out ;
}
2005-06-18 22:47:21 -07:00
	/* Might be for a request_sock */
2005-04-16 15:20:36 -07:00
switch ( sk - > sk_state ) {
case TCP_SYN_SENT :
2014-05-11 20:22:12 -07:00
case TCP_SYN_RECV :
/* Only in fast or simultaneous open. If a fast open socket is
2020-09-17 21:35:17 -07:00
* already accepted it is treated as a connected one below .
2014-05-11 20:22:12 -07:00
*/
2015-03-29 14:00:04 +01:00
if ( fastopen & & ! fastopen - > sk )
2014-05-11 20:22:12 -07:00
break ;
tcp: allow traceroute -Mtcp for unpriv users
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.
$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 192.168.86.1 3.631 ms 3.512 ms 3.405 ms
2 10.1.10.1 4.183 ms 4.125 ms 4.072 ms
3 96.120.88.125 20.621 ms 19.462 ms 20.553 ms
4 96.110.177.65 24.271 ms 25.351 ms 25.250 ms
5 69.139.199.197 44.492 ms 43.075 ms 44.346 ms
6 68.86.143.93 27.969 ms 25.184 ms 25.092 ms
7 96.112.146.18 25.323 ms 96.112.146.22 25.583 ms 96.112.146.26 24.502 ms
8 72.14.239.204 24.405 ms 74.125.37.224 16.326 ms 17.194 ms
9 209.85.251.9 18.154 ms 209.85.247.55 14.449 ms 209.85.251.9 26.296 ms^C
We can easily support traceroute over TCP, by queueing an error message
into socket error queue.
Note that applications need to set the IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SENT state.
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
{cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
Suggested-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 11:00:02 -07:00
ipv6_icmp_error ( sk , skb , err , th - > dest , ntohl ( info ) , ( u8 * ) th ) ;
2005-04-16 15:20:36 -07:00
if ( ! sock_owned_by_user ( sk ) ) {
sk - > sk_err = err ;
2021-06-27 18:48:21 -04:00
sk_error_report ( sk ) ; /* Wake people up to see the error (see connect in sock.c) */
2005-04-16 15:20:36 -07:00
			tcp_done(sk);
		} else
			sk->sk_err_soft = err;
		goto out;
2020-05-27 17:34:58 -07:00
	case TCP_LISTEN:
		break;
	default:
		/* check if this ICMP message allows revert of backoff.
		 * (see RFC 6069)
		 */
		if (!fastopen && type == ICMPV6_DEST_UNREACH &&
		    code == ICMPV6_NOROUTE)
			tcp_ld_RTO_revert(sk, seq);
2005-04-16 15:20:36 -07:00
}
if ( ! sock_owned_by_user ( sk ) & & np - > recverr ) {
sk - > sk_err = err ;
2021-06-27 18:48:21 -04:00
sk_error_report ( sk ) ;
2005-04-16 15:20:36 -07:00
	} else
		sk->sk_err_soft = err;
out:
	bh_unlock_sock(sk);
	sock_put(sk);
2018-11-08 12:19:21 +01:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2015-09-25 07:39:21 -07:00
static int tcp_v6_send_synack ( const struct sock * sk , struct dst_entry * dst ,
2014-06-25 17:09:58 +03:00
struct flowi * fl ,
2012-06-28 12:34:19 +00:00
struct request_sock * req ,
2015-10-02 11:43:35 -07:00
struct tcp_fastopen_cookie * foc ,
2020-08-20 12:00:52 -07:00
enum tcp_synack_type synack_type ,
struct sk_buff * syn_skb )
2005-04-16 15:20:36 -07:00
{
2013-10-09 15:21:29 -07:00
struct inet_request_sock * ireq = inet_rsk ( req ) ;
2019-03-19 07:01:08 -07:00
struct ipv6_pinfo * np = tcp_inet6_sk ( sk ) ;
2016-06-27 15:05:28 -04:00
struct ipv6_txoptions * opt ;
2014-06-25 17:09:58 +03:00
struct flowi6 * fl6 = & fl - > u . ip6 ;
2013-12-19 18:44:34 +08:00
struct sk_buff * skb ;
2012-06-28 12:34:20 +00:00
int err = - ENOMEM ;
2020-09-09 17:50:48 -07:00
u8 tclass ;
2005-04-16 15:20:36 -07:00
2012-06-28 12:34:21 +00:00
/* First, grab a route. */
2015-09-29 07:42:42 -07:00
if ( ! dst & & ( dst = inet6_csk_route_req ( sk , fl6 , req ,
IPPROTO_TCP ) ) = = NULL )
2008-02-29 11:43:03 -08:00
goto done ;
2012-06-28 12:34:20 +00:00
2020-08-20 12:00:52 -07:00
skb = tcp_make_synack ( sk , dst , req , foc , synack_type , syn_skb ) ;
2012-06-28 12:34:20 +00:00
2005-04-16 15:20:36 -07:00
if ( skb ) {
2013-10-09 15:21:29 -07:00
__tcp_v6_send_check ( skb , & ireq - > ir_v6_loc_addr ,
& ireq - > ir_v6_rmt_addr ) ;
2005-04-16 15:20:36 -07:00
2013-10-09 15:21:29 -07:00
fl6 - > daddr = ireq - > ir_v6_rmt_addr ;
2015-03-29 14:00:05 +01:00
if ( np - > repflow & & ireq - > pktopts )
2014-01-17 17:15:03 +01:00
fl6 - > flowlabel = ip6_flowlabel ( ipv6_hdr ( ireq - > pktopts ) ) ;
2022-07-22 11:22:04 -07:00
tclass = READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_reflect_tos ) ?
2020-12-08 09:55:08 -08:00
( tcp_rsk ( req ) - > syn_tos & ~ INET_ECN_MASK ) |
( np - > tclass & INET_ECN_MASK ) :
2020-11-19 13:23:51 -08:00
np - > tclass ;
2020-11-20 19:47:44 -08:00
if ( ! INET_ECN_is_capable ( tclass ) & &
tcp_bpf_ca_needs_ecn ( ( struct sock * ) req ) )
tclass | = INET_ECN_ECT_0 ;
rcu_read_lock ( ) ;
opt = ireq - > ipv6_opt ;
2016-06-27 15:05:28 -04:00
if ( ! opt )
opt = rcu_dereference ( np - > opt ) ;
2021-07-09 18:28:23 +03:00
err = ip6_xmit ( sk , skb , fl6 , skb - > mark ? : sk - > sk_mark , opt ,
2020-11-19 13:23:51 -08:00
tclass , sk - > sk_priority ) ;
2016-01-08 09:35:51 -08:00
rcu_read_unlock ( ) ;
2006-11-14 11:21:36 -02:00
err = net_xmit_eval ( err ) ;
2005-04-16 15:20:36 -07:00
}
done :
return err ;
}
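The traffic-class selection in tcp_v6_send_synack can be sketched in userspace as follows. This is only an illustration under stated assumptions: synack_tclass and its parameters are hypothetical names, the sysctl and BPF checks are reduced to plain integer flags, and the ECN constants are restated locally rather than taken from kernel headers.

```c
#include <stdint.h>

/* The low two bits of the traffic class carry ECN (RFC 3168). */
#define ECN_MASK  0x03u
#define ECN_ECT_0 0x02u

/*
 * Hypothetical helper, not a kernel function.
 * reflect_tos: nonzero when TOS reflection is enabled
 * syn_tos:     traffic class seen on the incoming SYN
 * sk_tclass:   traffic class configured on the listening socket
 * needs_ecn:   nonzero when the congestion control requests ECN
 */
static uint8_t synack_tclass(int reflect_tos, uint8_t syn_tos,
                             uint8_t sk_tclass, int needs_ecn)
{
    uint8_t tclass;

    if (reflect_tos)
        /* DSCP bits come from the SYN, ECN bits from the socket. */
        tclass = (uint8_t)((syn_tos & ~ECN_MASK) | (sk_tclass & ECN_MASK));
    else
        tclass = sk_tclass;

    /* Same test as INET_ECN_is_capable(): is the ECT(0) bit set? */
    if (!(tclass & ECN_ECT_0) && needs_ecn)
        tclass |= ECN_ECT_0;

    return tclass;
}
```

With reflection enabled, a SYN carrying 0xB9 and a socket class of 0x02 yields 0xBA: the upper six bits are copied from the SYN while the ECN field stays under local control.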
2010-01-17 19:09:39 -08:00
2005-06-18 22:47:21 -07:00
static void tcp_v6_reqsk_destructor ( struct request_sock * req )
2005-04-16 15:20:36 -07:00
{
2016-06-27 15:05:28 -04:00
kfree ( inet_rsk ( req ) - > ipv6_opt ) ;
2021-10-25 09:48:25 -07:00
consume_skb ( inet_rsk ( req ) - > pktopts ) ;
2005-04-16 15:20:36 -07:00
}
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2015-09-25 07:39:15 -07:00
static struct tcp_md5sig_key * tcp_v6_md5_do_lookup ( const struct sock * sk ,
2019-12-30 14:14:28 -08:00
const struct in6_addr * addr ,
int l3index )
2006-11-14 19:07:45 -08:00
{
2019-12-30 14:14:28 -08:00
return tcp_md5_do_lookup ( sk , l3index ,
( union tcp_md5_addr * ) addr , AF_INET6 ) ;
2006-11-14 19:07:45 -08:00
}
2015-09-25 07:39:15 -07:00
static struct tcp_md5sig_key * tcp_v6_md5_lookup ( const struct sock * sk ,
2015-03-24 15:58:56 -07:00
const struct sock * addr_sk )
2006-11-14 19:07:45 -08:00
{
2019-12-30 14:14:28 -08:00
int l3index ;
l3index = l3mdev_master_ifindex_by_index ( sock_net ( sk ) ,
addr_sk - > sk_bound_dev_if ) ;
return tcp_v6_md5_do_lookup ( sk , & addr_sk - > sk_v6_daddr ,
l3index ) ;
2006-11-14 19:07:45 -08:00
}
2017-06-15 18:07:07 -07:00
static int tcp_v6_parse_md5_keys ( struct sock * sk , int optname ,
2020-07-23 08:09:05 +02:00
sockptr_t optval , int optlen )
2006-11-14 19:07:45 -08:00
{
struct tcp_md5sig cmd ;
struct sockaddr_in6 * sin6 = ( struct sockaddr_in6 * ) & cmd . tcpm_addr ;
2019-12-30 14:14:28 -08:00
int l3index = 0 ;
2017-06-15 18:07:07 -07:00
u8 prefixlen ;
2021-10-15 10:26:05 +03:00
u8 flags ;
2006-11-14 19:07:45 -08:00
if ( optlen < sizeof ( cmd ) )
return - EINVAL ;
2020-07-23 08:09:05 +02:00
if ( copy_from_sockptr ( & cmd , optval , sizeof ( cmd ) ) )
2006-11-14 19:07:45 -08:00
return - EFAULT ;
if ( sin6 - > sin6_family ! = AF_INET6 )
return - EINVAL ;
2021-10-15 10:26:05 +03:00
flags = cmd . tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX ;
2017-06-15 18:07:07 -07:00
if ( optname = = TCP_MD5SIG_EXT & &
cmd . tcpm_flags & TCP_MD5SIG_FLAG_PREFIX ) {
prefixlen = cmd . tcpm_prefixlen ;
if ( prefixlen > 128 | | ( ipv6_addr_v4mapped ( & sin6 - > sin6_addr ) & &
prefixlen > 32 ) )
return - EINVAL ;
} else {
prefixlen = ipv6_addr_v4mapped ( & sin6 - > sin6_addr ) ? 32 : 128 ;
}
2021-10-15 10:26:05 +03:00
if ( optname = = TCP_MD5SIG_EXT & & cmd . tcpm_ifindex & &
2019-12-30 14:14:29 -08:00
cmd . tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX ) {
struct net_device * dev ;
rcu_read_lock ( ) ;
dev = dev_get_by_index_rcu ( sock_net ( sk ) , cmd . tcpm_ifindex ) ;
if ( dev & & netif_is_l3_master ( dev ) )
l3index = dev - > ifindex ;
rcu_read_unlock ( ) ;
/* ok to reference set/not set outside of rcu;
* right now device MUST be an L3 master
*/
if ( ! dev | | ! l3index )
return - EINVAL ;
}
2006-11-14 19:07:45 -08:00
if ( ! cmd . tcpm_keylen ) {
2007-08-24 23:16:08 -07:00
if ( ipv6_addr_v4mapped ( & sin6 - > sin6_addr ) )
2012-01-31 05:18:33 +00:00
return tcp_md5_do_del ( sk , ( union tcp_md5_addr * ) & sin6 - > sin6_addr . s6_addr32 [ 3 ] ,
2019-12-30 14:14:29 -08:00
AF_INET , prefixlen ,
2021-10-15 10:26:05 +03:00
l3index , flags ) ;
2012-01-31 05:18:33 +00:00
return tcp_md5_do_del ( sk , ( union tcp_md5_addr * ) & sin6 - > sin6_addr ,
2021-10-15 10:26:05 +03:00
AF_INET6 , prefixlen , l3index , flags ) ;
2006-11-14 19:07:45 -08:00
}
if ( cmd . tcpm_keylen > TCP_MD5SIG_MAXKEYLEN )
return - EINVAL ;
2012-01-31 05:18:33 +00:00
if ( ipv6_addr_v4mapped ( & sin6 - > sin6_addr ) )
return tcp_md5_do_add ( sk , ( union tcp_md5_addr * ) & sin6 - > sin6_addr . s6_addr32 [ 3 ] ,
2021-10-15 10:26:05 +03:00
AF_INET , prefixlen , l3index , flags ,
2019-12-30 14:14:28 -08:00
cmd . tcpm_key , cmd . tcpm_keylen ,
GFP_KERNEL ) ;
2006-11-14 19:07:45 -08:00
2012-01-31 05:18:33 +00:00
return tcp_md5_do_add ( sk , ( union tcp_md5_addr * ) & sin6 - > sin6_addr ,
2021-10-15 10:26:05 +03:00
AF_INET6 , prefixlen , l3index , flags ,
2019-12-30 14:14:28 -08:00
cmd . tcpm_key , cmd . tcpm_keylen , GFP_KERNEL ) ;
2006-11-14 19:07:45 -08:00
}
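The prefix-length rules applied by tcp_v6_parse_md5_keys can be sketched in userspace roughly like this. The helper names are hypothetical, has_prefix_flag stands in for the combined check "optname is TCP_MD5SIG_EXT and TCP_MD5SIG_FLAG_PREFIX is set", and -1 stands in for -EINVAL.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

#define PREFIX_INVALID (-1)   /* stand-in for -EINVAL */

/*
 * Hypothetical helper: returns the prefix length the key would be
 * stored with, or PREFIX_INVALID when the request would be rejected.
 */
static int md5_effective_prefixlen(const struct in6_addr *addr,
                                   int has_prefix_flag,
                                   unsigned int requested)
{
    int v4mapped = IN6_IS_ADDR_V4MAPPED(addr);

    if (has_prefix_flag) {
        /* A v4-mapped address only has 32 meaningful bits. */
        if (requested > 128 || (v4mapped && requested > 32))
            return PREFIX_INVALID;
        return (int)requested;
    }
    /* No prefix requested: match the full address. */
    return v4mapped ? 32 : 128;
}

/* Convenience wrapper that parses a textual IPv6 address first. */
static int md5_prefixlen_for(const char *text, int has_prefix_flag,
                             unsigned int requested)
{
    struct in6_addr addr;

    if (inet_pton(AF_INET6, text, &addr) != 1)
        return PREFIX_INVALID;
    return md5_effective_prefixlen(&addr, has_prefix_flag, requested);
}
```

For example, md5_prefixlen_for("::ffff:192.0.2.1", 1, 64) is rejected, while the same call with a prefix of 24 is accepted and the key would then be handled as an AF_INET entry.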
2016-06-27 18:51:53 +02:00
static int tcp_v6_md5_hash_headers ( struct tcp_md5sig_pool * hp ,
const struct in6_addr * daddr ,
const struct in6_addr * saddr ,
const struct tcphdr * th , int nbytes )
2006-11-14 19:07:45 -08:00
{
struct tcp6_pseudohdr * bp ;
2008-07-19 00:01:42 -07:00
struct scatterlist sg ;
2016-06-27 18:51:53 +02:00
struct tcphdr * _th ;
2008-04-17 13:19:16 +09:00
2016-06-27 18:51:53 +02:00
bp = hp - > scratch ;
2006-11-14 19:07:45 -08:00
/* 1. TCP pseudo-header (RFC2460) */
2011-11-21 03:39:03 +00:00
bp - > saddr = * saddr ;
bp - > daddr = * daddr ;
2008-07-19 00:01:42 -07:00
bp - > protocol = cpu_to_be32 ( IPPROTO_TCP ) ;
2008-07-31 21:36:07 -07:00
bp - > len = cpu_to_be32 ( nbytes ) ;
2006-11-14 19:07:45 -08:00
2016-06-27 18:51:53 +02:00
_th = ( struct tcphdr * ) ( bp + 1 ) ;
memcpy ( _th , th , sizeof ( * th ) ) ;
_th - > check = 0 ;
sg_init_one ( & sg , bp , sizeof ( * bp ) + sizeof ( * th ) ) ;
ahash_request_set_crypt ( hp - > md5_req , & sg , NULL ,
sizeof ( * bp ) + sizeof ( * th ) ) ;
2016-01-24 21:20:23 +08:00
return crypto_ahash_update ( hp - > md5_req ) ;
2008-07-19 00:01:42 -07:00
}
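The pseudo-header that tcp_v6_md5_hash_headers feeds into the hash can be illustrated in userspace. The struct below is a hypothetical stand-in for the kernel's struct tcp6_pseudohdr, assuming the same field order (source, destination, length, protocol) and 32-bit big-endian length and protocol fields.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the IPv6 TCP pseudo-header (40 bytes). */
struct pseudo6 {
    struct in6_addr saddr;
    struct in6_addr daddr;
    uint32_t len;       /* upper-layer length, network byte order */
    uint32_t protocol;  /* 3 zero bytes + next header, network byte order */
};

/* Hypothetical helper: fill the pseudo-header for a TCP segment. */
static void pseudo6_fill(struct pseudo6 *ph,
                         const struct in6_addr *saddr,
                         const struct in6_addr *daddr,
                         uint32_t nbytes)
{
    memset(ph, 0, sizeof(*ph));
    ph->saddr = *saddr;
    ph->daddr = *daddr;
    ph->len = htonl(nbytes);
    ph->protocol = htonl(IPPROTO_TCP);
}
```

In the kernel code the TCP header, with its checksum field zeroed, is placed directly after this structure in the scratch buffer so that both can be hashed through a single scatterlist entry.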
2007-10-26 00:41:21 -07:00
2016-06-27 18:51:53 +02:00
static int tcp_v6_md5_hash_hdr ( char * md5_hash , const struct tcp_md5sig_key * key ,
2011-04-22 04:53:02 +00:00
const struct in6_addr * daddr , struct in6_addr * saddr ,
2011-10-24 02:46:04 -04:00
const struct tcphdr * th )
2008-07-19 00:01:42 -07:00
{
struct tcp_md5sig_pool * hp ;
2016-01-24 21:20:23 +08:00
struct ahash_request * req ;
2008-07-19 00:01:42 -07:00
hp = tcp_get_md5sig_pool ( ) ;
if ( ! hp )
goto clear_hash_noput ;
2016-01-24 21:20:23 +08:00
req = hp - > md5_req ;
2008-07-19 00:01:42 -07:00
2016-01-24 21:20:23 +08:00
if ( crypto_ahash_init ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
2016-06-27 18:51:53 +02:00
if ( tcp_v6_md5_hash_headers ( hp , daddr , saddr , th , th - > doff < < 2 ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
if ( tcp_md5_hash_key ( hp , key ) )
goto clear_hash ;
2016-01-24 21:20:23 +08:00
ahash_request_set_crypt ( req , NULL , md5_hash , 0 ) ;
if ( crypto_ahash_final ( req ) )
2006-11-14 19:07:45 -08:00
goto clear_hash ;
tcp_put_md5sig_pool ( ) ;
return 0 ;
2008-07-19 00:01:42 -07:00
2006-11-14 19:07:45 -08:00
clear_hash :
tcp_put_md5sig_pool ( ) ;
clear_hash_noput :
memset ( md5_hash , 0 , 16 ) ;
2008-07-19 00:01:42 -07:00
return 1 ;
2006-11-14 19:07:45 -08:00
}
2015-03-24 15:58:55 -07:00
static int tcp_v6_md5_hash_skb ( char * md5_hash ,
const struct tcp_md5sig_key * key ,
2011-10-24 02:46:04 -04:00
const struct sock * sk ,
const struct sk_buff * skb )
2006-11-14 19:07:45 -08:00
{
2011-04-22 04:53:02 +00:00
const struct in6_addr * saddr , * daddr ;
2008-07-19 00:01:42 -07:00
struct tcp_md5sig_pool * hp ;
2016-01-24 21:20:23 +08:00
struct ahash_request * req ;
2011-10-24 02:46:04 -04:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2006-11-14 19:07:45 -08:00
2015-03-24 15:58:55 -07:00
if ( sk ) { /* valid for established/request sockets */
saddr = & sk - > sk_v6_rcv_saddr ;
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:42:29 -07:00
daddr = & sk - > sk_v6_daddr ;
2008-07-19 00:01:42 -07:00
} else {
2011-04-22 04:53:02 +00:00
const struct ipv6hdr * ip6h = ipv6_hdr ( skb ) ;
2008-07-19 00:01:42 -07:00
saddr = & ip6h - > saddr ;
daddr = & ip6h - > daddr ;
2006-11-14 19:07:45 -08:00
}
2008-07-19 00:01:42 -07:00
hp = tcp_get_md5sig_pool ( ) ;
if ( ! hp )
goto clear_hash_noput ;
2016-01-24 21:20:23 +08:00
req = hp - > md5_req ;
2008-07-19 00:01:42 -07:00
2016-01-24 21:20:23 +08:00
if ( crypto_ahash_init ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
2016-06-27 18:51:53 +02:00
if ( tcp_v6_md5_hash_headers ( hp , daddr , saddr , th , skb - > len ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
if ( tcp_md5_hash_skb_data ( hp , skb , th - > doff < < 2 ) )
goto clear_hash ;
if ( tcp_md5_hash_key ( hp , key ) )
goto clear_hash ;
2016-01-24 21:20:23 +08:00
ahash_request_set_crypt ( req , NULL , md5_hash , 0 ) ;
if ( crypto_ahash_final ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
tcp_put_md5sig_pool ( ) ;
return 0 ;
clear_hash :
tcp_put_md5sig_pool ( ) ;
clear_hash_noput :
memset ( md5_hash , 0 , 16 ) ;
return 1 ;
2006-11-14 19:07:45 -08:00
}
2015-10-02 11:43:28 -07:00
# endif
2015-09-25 07:39:08 -07:00
static void tcp_v6_init_req ( struct request_sock * req ,
const struct sock * sk_listener ,
2014-06-25 17:09:53 +03:00
struct sk_buff * skb )
{
2018-12-12 15:27:38 -08:00
bool l3_slave = ipv6_l3mdev_skb ( TCP_SKB_CB ( skb ) - > header . h6 . flags ) ;
2014-06-25 17:09:53 +03:00
struct inet_request_sock * ireq = inet_rsk ( req ) ;
2019-03-19 07:01:08 -07:00
const struct ipv6_pinfo * np = tcp_inet6_sk ( sk_listener ) ;
2014-06-25 17:09:53 +03:00
ireq - > ir_v6_rmt_addr = ipv6_hdr ( skb ) - > saddr ;
ireq - > ir_v6_loc_addr = ipv6_hdr ( skb ) - > daddr ;
/* So that link locals have meaning */
2018-12-12 15:27:38 -08:00
if ( ( ! sk_listener - > sk_bound_dev_if | | l3_slave ) & &
2014-06-25 17:09:53 +03:00
ipv6_addr_type ( & ireq - > ir_v6_rmt_addr ) & IPV6_ADDR_LINKLOCAL )
2014-10-17 09:17:20 -07:00
ireq - > ir_iif = tcp_v6_iif ( skb ) ;
2014-06-25 17:09:53 +03:00
2014-09-05 15:33:32 -07:00
if ( ! TCP_SKB_CB ( skb ) - > tcp_tw_isn & &
2015-09-25 07:39:08 -07:00
( ipv6_opt_accepted ( sk_listener , skb , & TCP_SKB_CB ( skb ) - > header . h6 ) | |
2014-09-27 09:50:56 -07:00
np - > rxopt . bits . rxinfo | |
2014-06-25 17:09:53 +03:00
np - > rxopt . bits . rxoinfo | | np - > rxopt . bits . rxhlim | |
np - > rxopt . bits . rxohlim | | np - > repflow ) ) {
2017-06-30 13:07:58 +03:00
refcount_inc ( & skb - > users ) ;
2014-06-25 17:09:53 +03:00
ireq - > pktopts = skb ;
}
}
2015-09-29 07:42:50 -07:00
static struct dst_entry * tcp_v6_route_req ( const struct sock * sk ,
2020-11-30 16:36:30 +01:00
struct sk_buff * skb ,
2015-09-29 07:42:50 -07:00
struct flowi * fl ,
2020-11-30 16:36:30 +01:00
struct request_sock * req )
2014-06-25 17:09:55 +03:00
{
2020-11-30 16:36:30 +01:00
tcp_v6_init_req ( req , sk , skb ) ;
if ( security_inet_conn_request ( sk , skb , req ) )
return NULL ;
2015-09-29 07:42:42 -07:00
return inet6_csk_route_req ( sk , & fl - > u . ip6 , req , IPPROTO_TCP ) ;
2014-06-25 17:09:55 +03:00
}
2008-02-07 21:49:26 -08:00
struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
2005-04-16 15:20:36 -07:00
. family = AF_INET6 ,
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basically tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
. obj_size = sizeof ( struct tcp6_request_sock ) ,
2014-06-25 17:09:59 +03:00
. rtx_syn_ack = tcp_rtx_synack ,
2005-06-18 22:47:21 -07:00
. send_ack = tcp_v6_reqsk_send_ack ,
. destructor = tcp_v6_reqsk_destructor ,
2010-01-17 19:09:39 -08:00
. send_reset = tcp_v6_send_reset ,
2014-03-29 09:27:29 +08:00
. syn_ack_timeout = tcp_syn_ack_timeout ,
2005-04-16 15:20:36 -07:00
} ;
2020-01-09 07:59:21 -08:00
const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
2014-06-25 17:10:00 +03:00
. mss_clamp = IPV6_MIN_MTU - sizeof ( struct tcphdr ) -
sizeof ( struct ipv6hdr ) ,
2014-06-25 17:09:53 +03:00
# ifdef CONFIG_TCP_MD5SIG
2015-03-24 15:58:56 -07:00
. req_md5_lookup = tcp_v6_md5_lookup ,
2009-07-16 05:04:51 +00:00
. calc_md5_hash = tcp_v6_md5_hash_skb ,
2006-11-30 19:16:28 -08:00
# endif
2014-06-25 17:09:54 +03:00
# ifdef CONFIG_SYN_COOKIES
. cookie_init_seq = cookie_v6_init_sequence ,
# endif
2014-06-25 17:09:55 +03:00
. route_req = tcp_v6_route_req ,
2017-05-05 06:56:54 -07:00
. init_seq = tcp_v6_init_seq ,
. init_ts_off = tcp_v6_init_ts_off ,
2014-06-25 17:09:58 +03:00
. send_synack = tcp_v6_send_synack ,
2014-06-25 17:09:53 +03:00
} ;
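The mss_clamp initializer above is plain arithmetic on fixed sizes. A minimal sketch, assuming the usual values (IPv6 minimum MTU of 1280 bytes, a 40-byte IPv6 header and a 20-byte TCP header without options); the names are local to this example and not kernel identifiers:

```c
/* Sizes restated locally for illustration. */
enum {
    EX_IPV6_MIN_MTU = 1280, /* minimum link MTU required by IPv6 */
    EX_IPV6_HDR_LEN = 40,   /* fixed IPv6 header */
    EX_TCP_HDR_LEN  = 20,   /* TCP header without options */
};

/* Hypothetical helper: the largest MSS that fits a minimum-MTU link. */
static unsigned int v6_mss_clamp(void)
{
    return EX_IPV6_MIN_MTU - EX_TCP_HDR_LEN - EX_IPV6_HDR_LEN;
}
```

The result is 1220 bytes, the segment size that is safe on any IPv6 path without relying on path MTU discovery.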
2006-11-14 19:07:45 -08:00
2015-09-29 07:42:39 -07:00
static void tcp_v6_send_response ( const struct sock * sk , struct sk_buff * skb , u32 seq ,
2014-12-09 09:56:08 -08:00
u32 ack , u32 win , u32 tsval , u32 tsecr ,
int oif , struct tcp_md5sig_key * key , int rst ,
ipv6: tcp: send consistent autoflowlabel in SYN_RECV state
This is a followup of commit c67b85558ff2 ("ipv6: tcp: send consistent
autoflowlabel in TIME_WAIT state"), but for SYN_RECV state.
In some cases, TCP sends a challenge ACK on behalf of a SYN_RECV request.
When this happens, we want to use the flow label that was used when
the prior SYNACK packet was sent, instead of another one.
After this patch, following packetdrill passes:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+.2 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > (flowlabel 0x11) S. 0:0(0) ack 1 <...>
// Test if a challenge ack is properly sent (same flowlabel as prior SYNACK)
+.01 < . 4000000000:4000000000(0) ack 1 win 320
+0 > (flowlabel 0x11) . 1:1(0) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220831203729.458000-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-31 13:37:29 -07:00
u8 tclass , __be32 label , u32 priority , u32 txhash )
2005-04-16 15:20:36 -07:00
{
2011-10-21 05:22:42 -04:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
struct tcphdr * t1 ;
2005-04-16 15:20:36 -07:00
struct sk_buff * buff ;
2011-03-12 16:22:43 -05:00
struct flowi6 fl6 ;
2014-12-09 09:56:08 -08:00
struct net * net = sk ? sock_net ( sk ) : dev_net ( skb_dst ( skb ) - > dev ) ;
2008-03-07 11:16:26 -08:00
struct sock * ctl_sk = net - > ipv6 . tcp_sk ;
2008-10-09 14:41:38 -07:00
unsigned int tot_len = sizeof ( struct tcphdr ) ;
2021-04-01 16:19:44 -07:00
__be32 mrst = 0 , * topt ;
2009-06-02 05:19:30 +00:00
struct dst_entry * dst ;
2018-05-10 16:53:51 +10:00
__u32 mark = 0 ;
2005-04-16 15:20:36 -07:00
2013-02-11 05:50:19 +00:00
if ( tsecr )
2008-10-09 14:42:40 -07:00
tot_len + = TCPOLEN_TSTAMP_ALIGNED ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
if ( key )
tot_len + = TCPOLEN_MD5SIG_ALIGNED ;
# endif
2021-04-01 16:19:44 -07:00
# ifdef CONFIG_MPTCP
if ( rst & & ! key ) {
mrst = mptcp_reset_option ( skb ) ;
if ( mrst )
tot_len + = sizeof ( __be32 ) ;
}
# endif
2022-02-21 19:11:15 -08:00
buff = alloc_skb ( MAX_TCP_HEADER , GFP_ATOMIC ) ;
2015-03-29 14:00:04 +01:00
if ( ! buff )
2007-02-09 23:24:49 +09:00
return ;
2005-04-16 15:20:36 -07:00
2022-02-21 19:11:15 -08:00
skb_reserve ( buff , MAX_TCP_HEADER ) ;
2005-04-16 15:20:36 -07:00
networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)
Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 14:29:23 +02:00
t1 = skb_push ( buff , tot_len ) ;
2010-04-21 00:47:15 -07:00
skb_reset_transport_header ( buff ) ;
2005-04-16 15:20:36 -07:00
/* Swap the send and the receive. */
memset ( t1 , 0 , sizeof ( * t1 ) ) ;
t1 - > dest = th - > source ;
t1 - > source = th - > dest ;
2006-11-14 19:07:45 -08:00
t1 - > doff = tot_len / 4 ;
2008-10-09 14:42:40 -07:00
t1 - > seq = htonl ( seq ) ;
t1 - > ack_seq = htonl ( ack ) ;
t1 - > ack = ! rst | | ! th - > ack ;
t1 - > rst = rst ;
t1 - > window = htons ( win ) ;
2005-04-16 15:20:36 -07:00
tcpv6: convert opt[] -> topt in tcp_v6_send_reset
after this I get:
$ diff-funcs tcp_v6_send_reset tcp_ipv6.c tcp_ipv6.c tcp_v6_send_ack
--- tcp_ipv6.c:tcp_v6_send_reset()
+++ tcp_ipv6.c:tcp_v6_send_ack()
@@ -1,4 +1,5 @@
-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
u32 ts,
+ struct tcp_md5sig_key *key)
{
struct tcphdr *th = tcp_hdr(skb), *t1;
struct sk_buff *buff;
@@ -7,31 +8,14 @@
struct sock *ctl_sk = net->ipv6.tcp_sk;
unsigned int tot_len = sizeof(struct tcphdr);
__be32 *topt;
-#ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *key;
-#endif
-
- if (th->rst)
- return;
-
- if (!ipv6_unicast_destination(skb))
- return;
+ if (ts)
+ tot_len += TCPOLEN_TSTAMP_ALIGNED;
#ifdef CONFIG_TCP_MD5SIG
- if (sk)
- key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr);
- else
- key = NULL;
-
if (key)
tot_len += TCPOLEN_MD5SIG_ALIGNED;
#endif
- /*
- * We need to grab some memory, and put together an RST,
- * and then put it into the queue to be sent.
- */
-
buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
GFP_ATOMIC);
if (buff == NULL)
@@ -46,18 +30,20 @@
t1->dest = th->source;
t1->source = th->dest;
t1->doff = tot_len / 4;
- t1->rst = 1;
-
- if(th->ack) {
- t1->seq = th->ack_seq;
- } else {
- t1->ack = 1;
- t1->ack_seq = htonl(ntohl(th->seq) + th->syn + th->fin
- + skb->len - (th->doff<<2));
- }
+ t1->seq = htonl(seq);
+ t1->ack_seq = htonl(ack);
+ t1->ack = 1;
+ t1->window = htons(win);
topt = (__be32 *)(t1 + 1);
+ if (ts) {
+ *topt++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
+ (TCPOPT_TIMESTAMP << 8) |
TCPOLEN_TIMESTAMP);
+ *topt++ = htonl(tcp_time_stamp);
+ *topt++ = htonl(ts);
+ }
+
#ifdef CONFIG_TCP_MD5SIG
if (key) {
*topt++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
@@ -84,15 +70,10 @@
fl.fl_ip_sport = t1->source;
security_skb_classify_flow(skb, &fl);
- /* Pass a socket to ip6_dst_lookup either it is for RST
- * Underlying function will use this to retrieve the network
- * namespace
- */
if (!ip6_dst_lookup(ctl_sk, &buff->dst, &fl)) {
if (xfrm_lookup(&buff->dst, &fl, NULL, 0) >= 0) {
ip6_xmit(ctl_sk, buff, &fl, NULL, 0);
TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
- TCP_INC_STATS_BH(net, TCP_MIB_OUTRSTS);
return;
}
}
...which starts to be trivial to combine.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 14:42:01 -07:00
topt = ( __be32 * ) ( t1 + 1 ) ;
2013-02-11 05:50:19 +00:00
if ( tsecr ) {
2008-10-09 14:42:40 -07:00
* topt + + = htonl ( ( TCPOPT_NOP < < 24 ) | ( TCPOPT_NOP < < 16 ) |
( TCPOPT_TIMESTAMP < < 8 ) | TCPOLEN_TIMESTAMP ) ;
2013-02-11 05:50:19 +00:00
* topt + + = htonl ( tsval ) ;
* topt + + = htonl ( tsecr ) ;
2008-10-09 14:42:40 -07:00
}
2021-04-01 16:19:44 -07:00
if ( mrst )
* topt + + = mrst ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
if ( key ) {
2008-10-09 14:42:01 -07:00
* topt + + = htonl ( ( TCPOPT_NOP < < 24 ) | ( TCPOPT_NOP < < 16 ) |
( TCPOPT_MD5SIG < < 8 ) | TCPOLEN_MD5SIG ) ;
tcp_v6_md5_hash_hdr ( ( __u8 * ) topt , key ,
2008-10-09 14:37:47 -07:00
& ipv6_hdr ( skb ) - > saddr ,
& ipv6_hdr ( skb ) - > daddr , t1 ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2011-03-12 16:22:43 -05:00
memset ( & fl6 , 0 , sizeof ( fl6 ) ) ;
2011-11-21 03:39:03 +00:00
fl6 . daddr = ipv6_hdr ( skb ) - > saddr ;
fl6 . saddr = ipv6_hdr ( skb ) - > daddr ;
2014-01-16 17:21:22 +01:00
fl6 . flowlabel = label ;
2005-04-16 15:20:36 -07:00
2010-04-21 14:59:20 -07:00
buff - > ip_summed = CHECKSUM_PARTIAL ;
2011-03-12 16:22:43 -05:00
__tcp_v6_send_check ( buff , & fl6 . saddr , & fl6 . daddr ) ;
2005-04-16 15:20:36 -07:00
2011-03-12 16:22:43 -05:00
fl6 . flowi6_proto = IPPROTO_TCP ;
net: ipv6: Fix oif in TCP SYN+ACK route lookup.
net-next commit 9c76a11, ipv6: tcp_ipv6 policy route issue, had
a boolean logic error that caused incorrect behaviour for TCP
SYN+ACK when oif-based rules are in use. Specifically:
1. If a SYN comes in from a global address, and sk_bound_dev_if
is not set, the routing lookup has oif set to the interface
the SYN came in on. Instead, it should have oif unset,
because for global addresses, the incoming interface doesn't
necessarily have any bearing on the interface the SYN+ACK is
sent out on.
2. If a SYN comes in from a link-local address, and
sk_bound_dev_if is set, the routing lookup has oif set to the
interface the SYN came in on. Instead, it should have oif set
to sk_bound_dev_if, because that's what the application
requested.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11 13:19:12 +09:00
if ( rt6_need_strict ( & fl6 . daddr ) & & ! oif )
2014-10-17 09:17:20 -07:00
fl6 . flowi6_oif = tcp_v6_iif ( skb ) ;
2016-11-09 09:07:26 -08:00
else {
if ( ! oif & & netif_index_is_l3_master ( net , skb - > skb_iif ) )
oif = skb - > skb_iif ;
fl6 . flowi6_oif = oif ;
}
2016-05-04 21:26:08 -07:00
2019-06-08 17:58:51 -07:00
if ( sk ) {
2022-08-31 13:37:29 -07:00
if ( sk - > sk_state = = TCP_TIME_WAIT )
2019-06-08 17:58:51 -07:00
mark = inet_twsk ( sk ) - > tw_mark ;
2022-08-31 13:37:29 -07:00
else
2019-06-08 17:58:51 -07:00
mark = sk - > sk_mark ;
2022-03-02 11:55:25 -08:00
skb_set_delivery_time ( buff , tcp_transmit_time ( sk ) , true ) ;
2019-06-08 17:58:51 -07:00
}
2022-08-31 13:37:29 -07:00
if ( txhash ) {
/* autoflowlabel/skb_get_hash_flowi6 rely on buff->hash */
skb_set_hash ( buff , txhash , PKT_HASH_TYPE_L4 ) ;
}
2018-05-10 16:53:51 +10:00
fl6 . flowi6_mark = IP6_REPLY_MARK ( net , skb - > mark ) ? : mark ;
2011-03-12 16:36:19 -05:00
fl6 . fl6_dport = t1 - > dest ;
fl6 . fl6_sport = t1 - > source ;
2016-11-04 02:23:43 +09:00
fl6 . flowi6_uid = sock_net_uid ( net , sk & & sk_fullsock ( sk ) ? sk : NULL ) ;
2020-09-27 22:38:26 -04:00
security_skb_classify_flow ( skb , flowi6_to_flowi_common ( & fl6 ) ) ;
2005-04-16 15:20:36 -07:00
2008-03-05 10:48:35 -08:00
/* Pass a socket to ip6_dst_lookup even if it is for RST ;
* the underlying function uses it to retrieve the network
* namespace
*/
2022-07-07 10:01:39 +00:00
if ( sk & & sk - > sk_state ! = TCP_TIME_WAIT )
dst = ip6_dst_lookup_flow ( net , sk , & fl6 , NULL ) ; /*sk's xfrm_policy can be referred*/
else
dst = ip6_dst_lookup_flow ( net , ctl_sk , & fl6 , NULL ) ;
2011-03-01 13:19:07 -08:00
if ( ! IS_ERR ( dst ) ) {
skb_dst_set ( buff , dst ) ;
2020-09-08 14:29:02 -07:00
ip6_xmit ( ctl_sk , buff , & fl6 , fl6 . flowi6_mark , NULL ,
tclass & ~ INET_ECN_MASK , priority ) ;
2016-04-29 14:16:47 -07:00
TCP_INC_STATS ( net , TCP_MIB_OUTSEGS ) ;
2011-03-01 13:19:07 -08:00
if ( rst )
2016-04-29 14:16:47 -07:00
TCP_INC_STATS ( net , TCP_MIB_OUTRSTS ) ;
2011-03-01 13:19:07 -08:00
return ;
2005-04-16 15:20:36 -07:00
}
kfree_skb ( buff ) ;
}
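The `tclass & ~INET_ECN_MASK` expression passed to ip6_xmit() above clears the two ECN bits of the traffic class, so control packets (RST, bare ACK) carry the DSCP but never an ECN codepoint. The following is a minimal userspace sketch of that masking, not kernel code. `EXAMPLE_ECN_MASK`, `strip_ecn()` and `strip_ecn_demo()` are hypothetical names, and the mask value 0x3 is assumed to match the kernel's INET_ECN_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors the kernel's INET_ECN_MASK (low two bits of the DS field). */
#define EXAMPLE_ECN_MASK 0x3

/* Hypothetical helper: keep the six DSCP bits, clear the two ECN bits. */
uint8_t strip_ecn(uint8_t tclass)
{
	return (uint8_t)(tclass & ~EXAMPLE_ECN_MASK);
}

/* Small self-check with a DSCP EF (0xB8) value carrying ECT(0) (0x02). */
void strip_ecn_demo(void)
{
	assert(strip_ecn(0xB8 | 0x02) == 0xB8);
}
```

A traffic class that has no ECN bits set passes through unchanged, so the mask is safe to apply unconditionally.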
2015-09-29 07:42:39 -07:00
static void tcp_v6_send_reset ( const struct sock * sk , struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2011-10-21 05:22:42 -04:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2019-06-05 07:55:09 -07:00
struct ipv6hdr * ipv6h = ipv6_hdr ( skb ) ;
2008-10-09 14:42:40 -07:00
u32 seq = 0 , ack_seq = 0 ;
2008-10-09 21:11:56 -07:00
struct tcp_md5sig_key * key = NULL ;
2012-01-31 22:35:48 +00:00
# ifdef CONFIG_TCP_MD5SIG
const __u8 * hash_location = NULL ;
unsigned char newhash [ 16 ] ;
int genhash ;
struct sock * sk1 = NULL ;
# endif
2019-06-05 07:55:09 -07:00
__be32 label = 0 ;
2019-09-24 08:01:15 -07:00
u32 priority = 0 ;
2019-06-05 07:55:09 -07:00
struct net * net ;
2017-10-23 09:20:24 -07:00
int oif = 0 ;
2005-04-16 15:20:36 -07:00
2008-10-09 14:42:40 -07:00
if ( th - > rst )
2005-04-16 15:20:36 -07:00
return ;
2014-11-25 07:40:04 -08:00
/* If sk is not NULL, it means we did a successful lookup and the incoming
 * route had to be correct. Prequeue might have dropped our dst.
 */
if ( ! sk & & ! ipv6_unicast_destination ( skb ) )
2008-10-09 14:42:40 -07:00
return ;
2005-04-16 15:20:36 -07:00
2019-06-07 12:23:48 -07:00
net = sk ? sock_net ( sk ) : dev_net ( skb_dst ( skb ) - > dev ) ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2016-04-01 08:52:17 -07:00
rcu_read_lock ( ) ;
2012-01-31 22:35:48 +00:00
hash_location = tcp_parse_md5sig_option ( th ) ;
tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
Make it cope with inet tw sockets and then provide tw sk.
This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0
0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1
0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46
// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46
// fails without this change -- default route is used
1.301 > R 1:1(0) win 0
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-21 21:29:26 +01:00
if ( sk & & sk_fullsock ( sk ) ) {
2019-12-30 14:14:28 -08:00
int l3index ;
/* sdif set, means packet ingressed via a device
* in an L3 domain and inet_iif is set to it .
*/
l3index = tcp_v6_sdif ( skb ) ? tcp_v6_iif_l3_slave ( skb ) : 0 ;
key = tcp_v6_md5_do_lookup ( sk , & ipv6h - > saddr , l3index ) ;
2015-12-21 21:29:25 +01:00
} else if ( hash_location ) {
2019-12-30 14:14:26 -08:00
int dif = tcp_v6_iif_l3_slave ( skb ) ;
int sdif = tcp_v6_sdif ( skb ) ;
2019-12-30 14:14:28 -08:00
int l3index ;
2019-12-30 14:14:26 -08:00
2012-01-31 22:35:48 +00:00
/*
 * The active side is lost. Try to find the listening socket through
 * the source port, and then find the md5 key through the listening socket.
 * We do not loosen security here:
 * the incoming packet is checked against the md5 hash computed with the
 * key found, and no RST is generated if the md5 hash doesn't match.
 */
2019-06-05 07:55:09 -07:00
sk1 = inet6_lookup_listener ( net ,
2016-02-10 11:50:38 -05:00
& tcp_hashinfo , NULL , 0 ,
& ipv6h - > saddr ,
2013-01-22 09:50:39 +00:00
th - > source , & ipv6h - > daddr ,
2019-12-30 14:14:26 -08:00
ntohs ( th - > source ) , dif , sdif ) ;
2012-01-31 22:35:48 +00:00
if ( ! sk1 )
2016-04-01 08:52:17 -07:00
goto out ;
2012-01-31 22:35:48 +00:00
2019-12-30 14:14:28 -08:00
/* sdif set, means packet ingressed via a device
* in an L3 domain and dif is set to it .
*/
l3index = tcp_v6_sdif ( skb ) ? dif : 0 ;
key = tcp_v6_md5_do_lookup ( sk1 , & ipv6h - > saddr , l3index ) ;
2012-01-31 22:35:48 +00:00
if ( ! key )
2016-04-01 08:52:17 -07:00
goto out ;
2012-01-31 22:35:48 +00:00
2015-03-24 15:58:55 -07:00
genhash = tcp_v6_md5_hash_skb ( newhash , key , NULL , skb ) ;
2012-01-31 22:35:48 +00:00
if ( genhash | | memcmp ( hash_location , newhash , 16 ) ! = 0 )
2016-04-01 08:52:17 -07:00
goto out ;
2012-01-31 22:35:48 +00:00
}
2006-11-14 19:07:45 -08:00
# endif
2008-10-09 14:42:40 -07:00
if ( th - > ack )
seq = ntohl ( th - > ack_seq ) ;
else
ack_seq = ntohl ( th - > seq ) + th - > syn + th - > fin + skb - > len -
( th - > doff < < 2 ) ;
2005-04-16 15:20:36 -07:00
2017-10-23 09:20:24 -07:00
if ( sk ) {
oif = sk - > sk_bound_dev_if ;
2019-07-10 06:40:09 -07:00
if ( sk_fullsock ( sk ) ) {
const struct ipv6_pinfo * np = tcp_inet6_sk ( sk ) ;
2018-02-06 20:50:23 -08:00
trace_tcp_send_reset ( sk , skb ) ;
2019-07-10 06:40:09 -07:00
if ( np - > repflow )
label = ip6_flowlabel ( ipv6h ) ;
2019-09-24 08:01:15 -07:00
priority = sk - > sk_priority ;
2019-07-10 06:40:09 -07:00
}
2019-09-24 08:01:16 -07:00
if ( sk - > sk_state = = TCP_TIME_WAIT ) {
2019-06-05 07:55:10 -07:00
label = cpu_to_be32 ( inet_twsk ( sk ) - > tw_flowlabel ) ;
2019-09-24 08:01:16 -07:00
priority = inet_twsk ( sk ) - > tw_priority ;
}
2019-06-05 07:55:09 -07:00
} else {
2019-07-01 06:39:36 -07:00
if ( net - > ipv6 . sysctl . flowlabel_reflect & FLOWLABEL_REFLECT_TCP_RESET )
2019-06-05 07:55:09 -07:00
label = ip6_flowlabel ( ipv6h ) ;
2017-10-23 09:20:24 -07:00
}
2020-09-08 14:29:02 -07:00
tcp_v6_send_response ( sk , skb , seq , ack_seq , 0 , 0 , 0 , oif , key , 1 ,
2022-08-31 13:37:29 -07:00
ipv6_get_dsfield ( ipv6h ) , label , priority , 0 ) ;
2012-01-31 22:35:48 +00:00
# ifdef CONFIG_TCP_MD5SIG
2016-04-01 08:52:17 -07:00
out :
rcu_read_unlock ( ) ;
2012-01-31 22:35:48 +00:00
# endif
2008-10-09 14:42:40 -07:00
}
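The sequence number selection in tcp_v6_send_reset() above follows the usual reset rule: if the offending segment carried an ACK, the RST reuses that acknowledgment number as its own sequence number. Otherwise the RST acknowledges everything the segment occupied in sequence space (payload plus one for SYN and one for FIN). The sketch below restates that arithmetic in plain userspace C. `struct seg_info`, `struct rst_numbers` and `rst_numbers_for()` are hypothetical names, and `len` is assumed to count the TCP header plus payload, as `skb->len` does at this point in the receive path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the header fields the reset path reads. */
struct seg_info {
	uint32_t seq;      /* sequence number, host byte order */
	uint32_t ack_seq;  /* acknowledgment number, host byte order */
	bool ack, syn, fin;
	uint32_t len;      /* TCP header + payload, in bytes (assumed) */
	uint8_t doff;      /* TCP header length in 32-bit words */
};

struct rst_numbers {
	uint32_t seq;
	uint32_t ack_seq;
};

/* Compute the seq/ack_seq pair a RST would carry for the given segment. */
struct rst_numbers rst_numbers_for(const struct seg_info *in)
{
	struct rst_numbers out = { 0, 0 };

	/* A valid TCP header is at least 5 words and fits in the segment. */
	assert(in->doff >= 5 && in->len >= (uint32_t)in->doff * 4);

	if (in->ack)
		out.seq = in->ack_seq;
	else
		out.ack_seq = in->seq + in->syn + in->fin + in->len -
			      ((uint32_t)in->doff << 2);
	return out;
}
```

The arithmetic is done on unsigned 32-bit values, so it wraps modulo 2^32 in the same way TCP sequence space does.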
2005-04-16 15:20:36 -07:00
2015-09-29 07:42:39 -07:00
static void tcp_v6_send_ack ( const struct sock * sk , struct sk_buff * skb , u32 seq ,
2014-12-09 09:56:08 -08:00
u32 ack , u32 win , u32 tsval , u32 tsecr , int oif ,
2014-01-16 17:21:22 +01:00
struct tcp_md5sig_key * key , u8 tclass ,
2022-08-31 13:37:29 -07:00
__be32 label , u32 priority , u32 txhash )
2008-10-09 14:42:40 -07:00
{
2014-12-09 09:56:08 -08:00
tcp_v6_send_response ( sk , skb , seq , ack , win , tsval , tsecr , oif , key , 0 ,
2022-08-31 13:37:29 -07:00
tclass , label , priority , txhash ) ;
2005-04-16 15:20:36 -07:00
}
static void tcp_v6_timewait_ack ( struct sock * sk , struct sk_buff * skb )
{
2005-08-09 20:09:30 -07:00
struct inet_timewait_sock * tw = inet_twsk ( sk ) ;
2006-11-14 19:07:45 -08:00
struct tcp_timewait_sock * tcptw = tcp_twsk ( sk ) ;
2005-04-16 15:20:36 -07:00
2014-12-09 09:56:08 -08:00
tcp_v6_send_ack ( sk , skb , tcptw - > tw_snd_nxt , tcptw - > tw_rcv_nxt ,
2005-08-09 20:09:30 -07:00
tcptw - > tw_rcv_wnd > > tw - > tw_rcv_wscale ,
2017-05-16 14:00:14 -07:00
tcp_time_stamp_raw ( ) + tcptw - > tw_ts_offset ,
2014-03-29 09:27:31 +08:00
tcptw - > tw_ts_recent , tw - > tw_bound_dev_if , tcp_twsk_md5_key ( tcptw ) ,
2022-08-31 13:37:29 -07:00
tw - > tw_tclass , cpu_to_be32 ( tw - > tw_flowlabel ) , tw - > tw_priority ,
tw - > tw_txhash ) ;
2005-04-16 15:20:36 -07:00
2005-08-09 20:09:30 -07:00
inet_twsk_put ( tw ) ;
2005-04-16 15:20:36 -07:00
}
2015-09-29 07:42:39 -07:00
static void tcp_v6_reqsk_send_ack ( const struct sock * sk , struct sk_buff * skb ,
2008-08-06 23:50:04 -07:00
struct request_sock * req )
2005-04-16 15:20:36 -07:00
{
2019-12-30 14:14:28 -08:00
int l3index ;
l3index = tcp_v6_sdif ( skb ) ? tcp_v6_iif_l3_slave ( skb ) : 0 ;
2014-05-11 20:22:13 -07:00
/* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
* sk - > sk_state = = TCP_SYN_RECV - > for Fast Open .
*/
tcp: properly scale window in tcp_v[46]_reqsk_send_ack()
When sending an ack in SYN_RECV state, we must scale the offered
window if wscale option was negotiated and accepted.
Tested:
The following packetdrill test demonstrates the issue:
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// Establish a connection.
+0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0>
+0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7>
+0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100>
// check that window is properly scaled !
+0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22 11:31:10 -07:00
/* RFC 7323 2.3
* The window field ( SEG . WND ) of every outgoing segment , with the
* exception of < SYN > segments , MUST be right - shifted by
* Rcv . Wind . Shift bits :
*/
2014-12-09 09:56:08 -08:00
tcp_v6_send_ack ( sk , skb , ( sk - > sk_state = = TCP_LISTEN ) ?
2014-05-11 20:22:13 -07:00
tcp_rsk ( req ) - > snt_isn + 1 : tcp_sk ( sk ) - > snd_nxt ,
2016-08-22 11:31:10 -07:00
tcp_rsk ( req ) - > rcv_nxt ,
req - > rsk_rcv_wnd > > inet_rsk ( req ) - > rcv_wscale ,
2017-05-16 14:00:14 -07:00
tcp_time_stamp_raw ( ) + tcp_rsk ( req ) - > ts_off ,
2016-12-01 11:32:06 +01:00
req - > ts_recent , sk - > sk_bound_dev_if ,
2019-12-30 14:14:28 -08:00
tcp_v6_md5_do_lookup ( sk , & ipv6_hdr ( skb ) - > saddr , l3index ) ,
2022-08-31 13:37:29 -07:00
ipv6_get_dsfield ( ipv6_hdr ( skb ) ) , 0 , sk - > sk_priority ,
tcp_rsk ( req ) - > txhash ) ;
2005-04-16 15:20:36 -07:00
}
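The window argument passed by tcp_v6_reqsk_send_ack() above is the request socket's receive window right-shifted by the negotiated scale, as the RFC 7323 comment requires. The packetdrill trace in the commit message shows the effect: the unscaled SYN-ACK advertises 28960, and with wscale 7 the following ACK advertises 226. A minimal sketch of that computation follows. `scaled_window()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: value to place in the 16-bit window field of a
 * non-SYN segment, given the real receive window and the shift count
 * negotiated through the window scale option.
 */
uint32_t scaled_window(uint32_t rcv_wnd, uint8_t rcv_wscale)
{
	/* RFC 7323 limits the shift count to 14. */
	assert(rcv_wscale <= 14);
	return rcv_wnd >> rcv_wscale;
}
```

For example, `scaled_window(28960, 7)` yields 226, matching the trace. The peer reverses the operation with a left shift, so it sees 226 << 7 = 28928 bytes rather than the full 28960; the shift discards the low bits.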
2015-10-02 11:43:32 -07:00
static struct sock * tcp_v6_cookie_check ( struct sock * sk , struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2015-10-02 11:43:32 -07:00
# ifdef CONFIG_SYN_COOKIES
2007-04-10 21:04:22 -07:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
2010-06-03 00:43:44 +00:00
if ( ! th - > syn )
2008-02-07 21:49:26 -08:00
sk = cookie_v6_check ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
# endif
return sk ;
}
2019-07-29 09:59:14 -07:00
u16 tcp_v6_get_syncookie ( struct sock * sk , struct ipv6hdr * iph ,
struct tcphdr * th , u32 * cookie )
{
u16 mss = 0 ;
# ifdef CONFIG_SYN_COOKIES
mss = tcp_get_syncookie_mss ( & tcp6_request_sock_ops ,
& tcp_request_sock_ipv6_ops , sk , th ) ;
if ( mss ) {
* cookie = __cookie_v6_init_sequence ( iph , th , & mss ) ;
tcp_synq_overflow ( sk ) ;
}
# endif
return mss ;
}
2005-04-16 15:20:36 -07:00
static int tcp_v6_conn_request ( struct sock * sk , struct sk_buff * skb )
{
if ( skb - > protocol = = htons ( ETH_P_IP ) )
return tcp_v4_conn_request ( sk , skb ) ;
if ( ! ipv6_unicast_destination ( skb ) )
2007-02-09 23:24:49 +09:00
goto drop ;
2005-04-16 15:20:36 -07:00
2021-03-17 09:55:15 -07:00
if ( ipv6_addr_v4mapped ( & ipv6_hdr ( skb ) - > saddr ) ) {
__IP6_INC_STATS ( sock_net ( sk ) , NULL , IPSTATS_MIB_INHDRERRORS ) ;
return 0 ;
}
2014-06-25 17:10:02 +03:00
return tcp_conn_request ( & tcp6_request_sock_ops ,
& tcp_request_sock_ipv6_ops , sk , skb ) ;
2005-04-16 15:20:36 -07:00
drop :
2016-04-01 08:52:20 -07:00
tcp_listendrop ( sk ) ;
2005-04-16 15:20:36 -07:00
return 0 ; /* don't send reset */
}
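tcp_v6_conn_request() above refuses a SYN whose IPv6 source address is IPv4-mapped (::ffff:a.b.c.d), since such an address should never appear on the wire in a native IPv6 packet. The check amounts to comparing the first 96 bits of the address against a fixed prefix. The sketch below shows that comparison on a raw 16-byte address in userspace. `addr_is_v4mapped()` and `addr_is_v4mapped_demo()` are hypothetical names, not the kernel's ipv6_addr_v4mapped().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if addr has the ::ffff:0:0/96 prefix. */
bool addr_is_v4mapped(const uint8_t addr[16])
{
	/* 80 zero bits followed by 16 one bits. */
	static const uint8_t prefix[12] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
	};

	return memcmp(addr, prefix, sizeof(prefix)) == 0;
}

/* Small self-check using ::ffff:192.0.2.1 from the documentation range. */
void addr_is_v4mapped_demo(void)
{
	const uint8_t mapped[16] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 192, 0, 2, 1
	};

	assert(addr_is_v4mapped(mapped));
}
```

The unspecified address (all zeroes) does not match, because bytes 10 and 11 must both be 0xff.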
2017-02-05 20:23:22 -08:00
static void tcp_v6_restore_cb ( struct sk_buff * skb )
{
/* We need to move header back to the beginning if xfrm6_policy_check()
* and tcp_v6_fill_cb ( ) are going to be called again .
* ip6_datagram_recv_specific_ctl ( ) also expects IP6CB to be there .
*/
memmove ( IP6CB ( skb ) , & TCP_SKB_CB ( skb ) - > header . h6 ,
sizeof ( struct inet6_skb_parm ) ) ;
}
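tcp_v6_restore_cb() above uses memmove() rather than memcpy() because the source and destination both live inside the same skb control buffer and may overlap. The sketch below illustrates the same pattern on a plain byte array. The 48-byte buffer size matches `sk_buff::cb`, but `PARM_SIZE`, `SAVED_OFFSET` and `restore_front()` are hypothetical values and names chosen only for illustration, not the real layout of `struct tcp_skb_cb`.

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE      48  /* size of the control buffer */
#define PARM_SIZE    24  /* hypothetical size of the saved IPv6 block */
#define SAVED_OFFSET  8  /* hypothetical offset where it was stashed */

static_assert(SAVED_OFFSET + PARM_SIZE <= CB_SIZE,
	      "saved block must fit inside the control buffer");

/*
 * Hypothetical helper: move the saved block back to the start of the
 * buffer. Source [8, 32) and destination [0, 24) overlap, so memcpy()
 * would be undefined behavior here; memmove() handles the overlap.
 */
void restore_front(unsigned char cb[CB_SIZE])
{
	memmove(cb, cb + SAVED_OFFSET, PARM_SIZE);
}
```

Bytes past the moved block are left as they were, which is why callers that need them must read them before the restore.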
2015-09-29 07:42:48 -07:00
static struct sock * tcp_v6_syn_recv_sock ( const struct sock * sk , struct sk_buff * skb ,
2013-12-19 18:44:34 +08:00
struct request_sock * req ,
2015-10-22 08:20:46 -07:00
struct dst_entry * dst ,
struct request_sock * req_unhash ,
bool * own_req )
2005-04-16 15:20:36 -07:00
{
2013-10-09 15:21:29 -07:00
struct inet_request_sock * ireq ;
2015-09-29 07:42:48 -07:00
struct ipv6_pinfo * newnp ;
2019-03-19 07:01:08 -07:00
const struct ipv6_pinfo * np = tcp_inet6_sk ( sk ) ;
2015-11-29 19:37:57 -08:00
struct ipv6_txoptions * opt ;
2005-04-16 15:20:36 -07:00
struct inet_sock * newinet ;
tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such an event happens, the TCP stack code has a race condition that
occurs between the moment a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.
This patch fixes the race condition by checking whether a child
socket already exists for the same client when we are adding the second
child socket to the established connections hashtable. If one
exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 11:11:33 +00:00
bool found_dup_sk = false ;
2005-04-16 15:20:36 -07:00
struct tcp_sock * newtp ;
struct sock * newsk ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key * key ;
2019-12-30 14:14:28 -08:00
int l3index ;
2006-11-14 19:07:45 -08:00
# endif
2012-06-28 12:34:19 +00:00
struct flowi6 fl6 ;
2005-04-16 15:20:36 -07:00
if ( skb - > protocol = = htons ( ETH_P_IP ) ) {
/*
* v6 mapped
*/
2015-10-22 08:20:46 -07:00
newsk = tcp_v4_syn_recv_sock ( sk , skb , req , dst ,
req_unhash , own_req ) ;
2005-04-16 15:20:36 -07:00
2015-03-29 14:00:04 +01:00
if ( ! newsk )
2005-04-16 15:20:36 -07:00
return NULL ;
2019-03-19 07:01:08 -07:00
inet_sk ( newsk ) - > pinet6 = tcp_inet6_sk ( newsk ) ;
2005-04-16 15:20:36 -07:00
2019-03-19 07:01:08 -07:00
newnp = tcp_inet6_sk ( newsk ) ;
2005-04-16 15:20:36 -07:00
newtp = tcp_sk ( newsk ) ;
memcpy ( newnp , np , sizeof ( struct ipv6_pinfo ) ) ;
2015-03-18 14:05:35 -07:00
newnp - > saddr = newsk - > sk_v6_rcv_saddr ;
2005-04-16 15:20:36 -07:00
2005-12-13 23:15:52 -08:00
inet_csk ( newsk ) - > icsk_af_ops = & ipv6_mapped ;
2020-01-21 16:56:18 -08:00
if ( sk_is_mptcp ( newsk ) )
2020-01-30 10:45:26 +01:00
mptcpv6_handle_mapped ( newsk , true ) ;
2005-04-16 15:20:36 -07:00
newsk - > sk_backlog_rcv = tcp_v4_do_rcv ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
newtp - > af_specific = & tcp_sock_ipv6_mapped_specific ;
# endif
2017-05-09 16:59:54 -07:00
newnp - > ipv6_mc_list = NULL ;
2011-09-25 02:21:30 +00:00
newnp - > ipv6_ac_list = NULL ;
newnp - > ipv6_fl_list = NULL ;
2005-04-16 15:20:36 -07:00
newnp - > pktoptions = NULL ;
newnp - > opt = NULL ;
2019-03-19 05:45:35 -07:00
newnp - > mcast_oif = inet_iif ( skb ) ;
newnp - > mcast_hops = ip_hdr ( skb ) - > ttl ;
newnp - > rcv_flowinfo = 0 ;
2014-01-17 17:15:03 +01:00
if ( np - > repflow )
2019-03-19 05:45:35 -07:00
newnp - > flow_label = 0 ;
2005-04-16 15:20:36 -07:00
2005-08-09 19:45:38 -07:00
/*
* No need to charge this sock to the relevant IPv6 refcnt debug socks count
* here , tcp_create_openreq_child now does this for us , see the comment in
* that function for the gory details . - acme
2005-04-16 15:20:36 -07:00
*/
/* This is a tricky place. Until this moment IPv4 tcp
2005-12-13 23:15:52 -08:00
worked with IPv6 icsk . icsk_af_ops .
2005-04-16 15:20:36 -07:00
Sync it now .
*/
2005-12-13 23:26:10 -08:00
tcp_sync_mss ( newsk , inet_csk ( newsk ) - > icsk_pmtu_cookie ) ;
2005-04-16 15:20:36 -07:00
return newsk ;
}
2013-10-09 15:21:29 -07:00
ireq = inet_rsk ( req ) ;
2005-04-16 15:20:36 -07:00
if ( sk_acceptq_is_full ( sk ) )
goto out_overflow ;
2010-12-02 12:14:29 -08:00
if ( ! dst ) {
2015-09-29 07:42:42 -07:00
dst = inet6_csk_route_req ( sk , & fl6 , req , IPPROTO_TCP ) ;
2010-12-02 12:14:29 -08:00
if ( ! dst )
2005-04-16 15:20:36 -07:00
goto out ;
2007-02-09 23:24:49 +09:00
}
2005-04-16 15:20:36 -07:00
newsk = tcp_create_openreq_child ( sk , req , skb ) ;
2015-03-29 14:00:04 +01:00
if ( ! newsk )
2010-10-21 13:06:43 +02:00
goto out_nonewsk ;
2005-04-16 15:20:36 -07:00
2005-08-09 19:45:38 -07:00
/*
* No need to charge this sock to the relevant IPv6 refcnt debug socks
* count here , tcp_create_openreq_child now does this for us , see the
* comment in that function for the gory details . - acme
*/
2005-04-16 15:20:36 -07:00
2006-08-25 15:55:43 -07:00
newsk - > sk_gso_type = SKB_GSO_TCPV6 ;
2015-12-02 21:53:57 -08:00
ip6_dst_store ( newsk , dst , NULL , NULL ) ;
2012-08-19 03:30:38 +00:00
inet6_sk_rx_dst_set ( newsk , skb ) ;
2005-04-16 15:20:36 -07:00
2019-03-19 07:01:08 -07:00
inet_sk ( newsk ) - > pinet6 = tcp_inet6_sk ( newsk ) ;
2005-04-16 15:20:36 -07:00
newtp = tcp_sk ( newsk ) ;
newinet = inet_sk ( newsk ) ;
2019-03-19 07:01:08 -07:00
newnp = tcp_inet6_sk ( newsk ) ;
2005-04-16 15:20:36 -07:00
memcpy ( newnp , np , sizeof ( struct ipv6_pinfo ) ) ;
2013-10-09 15:21:29 -07:00
newsk - > sk_v6_daddr = ireq - > ir_v6_rmt_addr ;
newnp - > saddr = ireq - > ir_v6_loc_addr ;
newsk - > sk_v6_rcv_saddr = ireq - > ir_v6_loc_addr ;
newsk - > sk_bound_dev_if = ireq - > ir_iif ;
2005-04-16 15:20:36 -07:00
2007-02-09 23:24:49 +09:00
/* Now IPv6 options...
2005-04-16 15:20:36 -07:00
First : no IPv4 options .
*/
2011-04-21 09:45:37 +00:00
newinet - > inet_opt = NULL ;
2017-05-09 16:59:54 -07:00
newnp - > ipv6_mc_list = NULL ;
2011-09-25 02:21:30 +00:00
newnp - > ipv6_ac_list = NULL ;
2007-03-16 16:14:03 -07:00
newnp - > ipv6_fl_list = NULL ;
2005-04-16 15:20:36 -07:00
/* Clone RX bits */
newnp - > rxopt . all = np - > rxopt . all ;
newnp - > pktoptions = NULL ;
newnp - > opt = NULL ;
2014-10-17 09:17:20 -07:00
newnp - > mcast_oif = tcp_v6_iif ( skb ) ;
2007-04-25 17:54:47 -07:00
newnp - > mcast_hops = ipv6_hdr ( skb ) - > hop_limit ;
2013-12-08 15:46:57 +01:00
newnp - > rcv_flowinfo = ip6_flowinfo ( ipv6_hdr ( skb ) ) ;
2014-01-17 17:15:03 +01:00
if ( np - > repflow )
newnp - > flow_label = ip6_flowlabel ( ipv6_hdr ( skb ) ) ;
2005-04-16 15:20:36 -07:00
2020-12-08 09:55:08 -08:00
/* Set ToS of the new socket based upon the value of incoming SYN.
* ECT bits are set later in tcp_init_transfer ( ) .
*/
2022-07-22 11:22:04 -07:00
if ( READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_reflect_tos ) )
2020-09-09 17:50:48 -07:00
newnp - > tclass = tcp_rsk ( req ) - > syn_tos & ~ INET_ECN_MASK ;
2005-04-16 15:20:36 -07:00
/* Clone native IPv6 options from listening socket (if any)
   Yes, keeping a reference count would be much more clever,
   but we do one more thing here: reattach optmem
   to newsk.
 */
2016-06-27 15:05:28 -04:00
opt = ireq - > ipv6_opt ;
if ( ! opt )
opt = rcu_dereference ( np - > opt ) ;
2015-11-29 19:37:57 -08:00
if ( opt ) {
opt = ipv6_dup_options ( newsk , opt ) ;
RCU_INIT_POINTER ( newnp - > opt , opt ) ;
}
2005-12-13 23:26:10 -08:00
inet_csk ( newsk ) - > icsk_ext_hdr_len = 0 ;
2015-11-29 19:37:57 -08:00
if ( opt )
inet_csk ( newsk ) - > icsk_ext_hdr_len = opt - > opt_nflen +
opt - > opt_flen ;
2005-04-16 15:20:36 -07:00
net: tcp: add per route congestion control
This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.
For our specific use case, this additionally facilitates deployment
of DCTCP, for example, applications can easily serve internal
traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given it's actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.
The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter allows
to enforce the given algorithm so that applications in user space
would not be allowed to overwrite that algorithm for that destination.
The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.
Joint work with Florian Westphal.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05 23:57:48 +01:00
tcp_ca_openreq_child ( newsk , dst ) ;
2005-04-16 15:20:36 -07:00
tcp_sync_mss ( newsk , dst_mtu ( dst ) ) ;
2017-02-02 08:04:56 -08:00
newtp - > advmss = tcp_mss_clamp ( tcp_sk ( sk ) , dst_metric_advmss ( dst ) ) ;
2012-04-22 09:45:47 +00:00
2005-04-16 15:20:36 -07:00
tcp_initialize_rcv_mss ( newsk ) ;
2009-10-15 06:30:45 +00:00
newinet - > inet_daddr = newinet - > inet_saddr = LOOPBACK4_IPV6 ;
newinet - > inet_rcv_saddr = LOOPBACK4_IPV6 ;
2005-04-16 15:20:36 -07:00
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2019-12-30 14:14:28 -08:00
l3index = l3mdev_master_ifindex_by_index ( sock_net ( sk ) , ireq - > ir_iif ) ;
2006-11-14 19:07:45 -08:00
/* Copy over the MD5 key from the original socket */
2019-12-30 14:14:28 -08:00
key = tcp_v6_md5_do_lookup ( sk , & newsk - > sk_v6_daddr , l3index ) ;
2015-03-29 14:00:05 +01:00
if ( key ) {
2006-11-14 19:07:45 -08:00
/* We're using one, so create a matching key
* on the newsk structure . If we fail to get
* memory , then we end up not copying the key
* across . Shucks .
*/
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:42:29 -07:00
tcp_md5_do_add ( newsk , ( union tcp_md5_addr * ) & newsk - > sk_v6_daddr ,
2021-10-15 10:26:05 +03:00
AF_INET6 , 128 , l3index , key - > flags , key - > key , key - > keylen ,
2015-11-30 08:57:28 -08:00
sk_gfp_mask ( sk , GFP_ATOMIC ) ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2010-10-21 13:06:43 +02:00
if ( __inet_inherit_port ( sk , newsk ) < 0 ) {
inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:
unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
[<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
[<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
[<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
[<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
[<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
[<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
[<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
[<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
[<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
[<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
[<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
[<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
[<ffffffff814cee68>] ip_rcv+0x217/0x267
[<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
[<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.
This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...
Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.
Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
A similar approach is taken for dccp by calling dccp_done().
This bug has been in the kernel since 093d282321 (tproxy: fix hash locking
issue when using port redirection in __inet_inherit_port()), that is, since
version 2.6.37.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
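The reference-count argument in this message can be shown with a minimal userspace sketch. It is an illustration only: `struct obj`, `obj_clone_lock()` and `obj_put()` are hypothetical stand-ins, and the only property taken from the message is that a freshly cloned object starts with a count of 2, so one put does not free it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted socket. */
struct obj {
	int refcnt;
	bool *freed;	/* set to true when the object is released */
};

/* Mirrors the behaviour described above: a clone starts with two
 * references.
 */
static struct obj *obj_clone_lock(bool *freed)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcnt = 2;
	o->freed = freed;
	*freed = false;
	return o;
}

/* Drop one reference; release the object when the last one is gone. */
static void obj_put(struct obj *o)
{
	assert(o->refcnt > 0);
	if (--o->refcnt == 0) {
		*o->freed = true;
		free(o);
	}
}
```

An error path that calls `obj_put()` once leaves the count at 1 and the memory allocated, which is the leak pattern the message describes; a second put (or a teardown helper that performs it) is needed.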
2012-12-14 04:07:58 +00:00
inet_csk_prepare_forced_close ( newsk ) ;
tcp_done ( newsk ) ;
2010-10-21 13:06:43 +02:00
goto out ;
}
tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such an event happens, the TCP stack code has a race condition that
occurs between the moment a lookup is done in the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets for the same
client to the userspace program.
This patch fixes the race condition by checking whether a child
socket already exists for the same client when we are adding the second
child socket to the established connections hashtable. If one
exists, we drop the packet and discard the second child socket
for the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
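The fix described in this message amounts to checking for a duplicate at insertion time rather than in a separate, earlier lookup. A minimal single-threaded sketch of that idea follows. All names (`struct conn`, `struct conn_table`, `conn_insert_unique()`) are hypothetical, and the table is a plain chained hash; real concurrent code would perform the whole function with the bucket lock held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 16

/* Hypothetical connection entry keyed by a client identifier. */
struct conn {
	unsigned int client_id;
	struct conn *next;
};

struct conn_table {
	struct conn *buckets[TABLE_SIZE];
};

/* Insert 'c' unless an entry for the same client already exists.
 * The duplicate check and the insert form one step, so two callers
 * cannot both succeed for the same client. Returns true if 'c' was
 * inserted; otherwise stores the existing entry in *dup and returns
 * false.
 */
static bool conn_insert_unique(struct conn_table *t, struct conn *c,
			       struct conn **dup)
{
	struct conn **head = &t->buckets[c->client_id % TABLE_SIZE];
	struct conn *it;

	*dup = NULL;
	for (it = *head; it; it = it->next) {
		if (it->client_id == c->client_id) {
			*dup = it;
			return false;
		}
	}
	c->next = *head;
	*head = c;
	return true;
}
```

A caller that gets `false` back would discard its own entry, in the same way the patch discards the second child socket.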
2020-11-20 11:11:33 +00:00
* own_req = inet_ehash_nolisten ( newsk , req_to_sk ( req_unhash ) ,
& found_dup_sk ) ;
2015-11-05 11:07:13 -08:00
if ( * own_req ) {
2015-11-05 12:50:19 -08:00
tcp_move_syn ( newtp , req ) ;
2015-11-05 11:07:13 -08:00
/* Clone pktoptions received with SYN, if we own the req */
if ( ireq - > pktopts ) {
newnp - > pktoptions = skb_clone ( ireq - > pktopts ,
2015-11-30 08:57:28 -08:00
sk_gfp_mask ( sk , GFP_ATOMIC ) ) ;
2015-11-05 11:07:13 -08:00
consume_skb ( ireq - > pktopts ) ;
ireq - > pktopts = NULL ;
2017-02-05 20:23:22 -08:00
if ( newnp - > pktoptions ) {
tcp_v6_restore_cb ( newnp - > pktoptions ) ;
2015-11-05 11:07:13 -08:00
skb_set_owner_r ( newnp - > pktoptions , newsk ) ;
2017-02-05 20:23:22 -08:00
}
2015-11-05 11:07:13 -08:00
}
2020-11-20 11:11:33 +00:00
} else {
if ( ! req_unhash & & found_dup_sk ) {
/* This code path should be executed in the
* syncookie case only
*/
bh_unlock_sock ( newsk ) ;
sock_put ( newsk ) ;
newsk = NULL ;
}
2015-10-30 09:46:12 -07:00
}
2005-04-16 15:20:36 -07:00
return newsk ;
out_overflow :
2016-04-27 16:44:39 -07:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_LISTENOVERFLOWS ) ;
2010-10-21 13:06:43 +02:00
out_nonewsk :
2005-04-16 15:20:36 -07:00
dst_release ( dst ) ;
2010-10-21 13:06:43 +02:00
out :
2016-04-01 08:52:20 -07:00
tcp_listendrop ( sk ) ;
2005-04-16 15:20:36 -07:00
return NULL ;
}
2021-02-01 17:41:32 +00:00
INDIRECT_CALLABLE_DECLARE ( struct dst_entry * ipv4_dst_check ( struct dst_entry * ,
u32 ) ) ;
2005-04-16 15:20:36 -07:00
/* The socket must have its spinlock held when we get
2015-10-02 11:43:39 -07:00
* here , unless it is a TCP_LISTEN socket .
2005-04-16 15:20:36 -07:00
*
* We have a potential double - lock case here , so even when
* doing backlog processing we use the BH locking scheme .
* This is because we cannot sleep with the original spinlock
* held .
*/
2021-11-15 11:02:41 -08:00
INDIRECT_CALLABLE_SCOPE
int tcp_v6_do_rcv ( struct sock * sk , struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2019-03-19 07:01:08 -07:00
struct ipv6_pinfo * np = tcp_inet6_sk ( sk ) ;
2005-04-16 15:20:36 -07:00
struct sk_buff * opt_skb = NULL ;
2022-02-20 15:06:34 +08:00
enum skb_drop_reason reason ;
2019-03-19 07:01:08 -07:00
struct tcp_sock * tp ;
2005-04-16 15:20:36 -07:00
/* Imagine: socket is IPv6. IPv4 packet arrives,
goes to IPv4 receive handler and backlogged .
From backlog it always goes here . Kerboom . . .
Fortunately , tcp_rcv_established and rcv_established
handle them correctly , but it is not case with
tcp_v6_hnd_req and tcp_v6_send_reset ( ) . - - ANK
*/
if ( skb - > protocol = = htons ( ETH_P_IP ) )
return tcp_v4_do_rcv ( sk , skb ) ;
/*
* socket locking is here for SMP purposes as backlog rcv
* is currently called with bh processing disabled .
*/
/* Do Stevens' IPV6_PKTOPTIONS.
Yes , guys , it is the only place in our code , where we
may make it not affecting IPv4 .
The rest of code is protocol independent ,
and I do not like idea to uglify IPv4 .
Actually , all the idea behind IPV6_PKTOPTIONS
looks not very well thought . For now we latch
options , received in the last packet , enqueued
by tcp . Feel free to propose better solution .
2007-02-09 23:24:49 +09:00
- - ANK ( 980728 )
2005-04-16 15:20:36 -07:00
*/
if ( np - > rxopt . all )
2015-11-30 08:57:28 -08:00
opt_skb = skb_clone ( skb , sk_gfp_mask ( sk , GFP_ATOMIC ) ) ;
2005-04-16 15:20:36 -07:00
2022-02-20 15:06:34 +08:00
reason = SKB_DROP_REASON_NOT_SPECIFIED ;
2005-04-16 15:20:36 -07:00
if ( sk - > sk_state = = TCP_ESTABLISHED ) { /* Fast path */
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation on an RCU-protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
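The ordering issue between [a] and [b] can be sketched in plain C. This is a single-threaded illustration, not RCU: `struct entry`, `struct owner`, `entry_release()` and `owner_drop_cached()` are hypothetical names, and the sketch only shows the order of operations, with the cached pointer cleared before the reference is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted route entry. */
struct entry {
	int refcnt;
};

/* Hypothetical owner that caches a pointer readers may follow. */
struct owner {
	struct entry *cached;
};

/* Drop one reference. In real code the last put would schedule the
 * actual free.
 */
static void entry_release(struct entry *e)
{
	assert(e->refcnt > 0);
	e->refcnt--;
}

/* Unpublish first, then drop the reference: by the time the release
 * runs, no new reader can reach the entry through owner->cached.
 * Returns the entry that was dropped, or NULL if nothing was cached.
 */
static struct entry *owner_drop_cached(struct owner *o)
{
	struct entry *e = o->cached;

	if (!e)
		return NULL;
	o->cached = NULL;	/* step [b] now happens before step [a] */
	entry_release(e);
	return e;
}
```

With the opposite order, a reader could still load `cached` after the release had begun, which is the window the message describes.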
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 06:33:30 -08:00
struct dst_entry * dst ;
dst = rcu_dereference_protected ( sk - > sk_rx_dst ,
lockdep_sock_is_held ( sk ) ) ;
2012-08-06 05:09:33 +00:00
2011-08-14 19:45:55 +00:00
sock_rps_save_rxhash ( sk , skb ) ;
2014-11-11 05:54:27 -08:00
sk_mark_napi_id ( sk , skb ) ;
2012-08-06 05:09:33 +00:00
if ( dst ) {
2021-10-25 09:48:16 -07:00
if ( sk - > sk_rx_dst_ifindex ! = skb - > skb_iif | |
2021-02-01 17:41:32 +00:00
INDIRECT_CALL_1 ( dst - > ops - > check , ip6_dst_check ,
2021-10-25 09:48:17 -07:00
dst , sk - > sk_rx_dst_cookie ) = = NULL ) {
2021-12-20 06:33:30 -08:00
RCU_INIT_POINTER ( sk - > sk_rx_dst , NULL ) ;
2012-08-06 05:09:33 +00:00
dst_release ( dst ) ;
}
}
2018-05-29 23:27:31 +08:00
tcp_rcv_established ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
if ( opt_skb )
goto ipv6_pktoptions ;
return 0 ;
}
2015-06-03 23:49:21 -07:00
if ( tcp_checksum_complete ( skb ) )
2005-04-16 15:20:36 -07:00
goto csum_err ;
2007-02-09 23:24:49 +09:00
if ( sk - > sk_state = = TCP_LISTEN ) {
2015-10-02 11:43:32 -07:00
struct sock * nsk = tcp_v6_cookie_check ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
if ( ! nsk )
goto discard ;
2013-12-19 18:44:34 +08:00
if ( nsk ! = sk ) {
2005-04-16 15:20:36 -07:00
if ( tcp_child_process ( sk , nsk , skb ) )
goto reset ;
if ( opt_skb )
__kfree_skb ( opt_skb ) ;
return 0 ;
}
2011-04-06 13:07:09 -07:00
} else
2011-08-14 19:45:55 +00:00
sock_rps_save_rxhash ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
2015-09-29 07:42:41 -07:00
if ( tcp_rcv_state_process ( sk , skb ) )
2005-04-16 15:20:36 -07:00
goto reset ;
if ( opt_skb )
goto ipv6_pktoptions ;
return 0 ;
reset :
2006-11-14 19:07:45 -08:00
tcp_v6_send_reset ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
discard :
if ( opt_skb )
__kfree_skb ( opt_skb ) ;
2022-02-20 15:06:34 +08:00
kfree_skb_reason ( skb , reason ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
csum_err :
2022-02-20 15:06:34 +08:00
reason = SKB_DROP_REASON_TCP_CSUM ;
2021-05-14 13:04:25 -07:00
trace_tcp_bad_csum ( skb ) ;
2016-04-29 14:16:47 -07:00
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_CSUMERRORS ) ;
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_INERRS ) ;
2005-04-16 15:20:36 -07:00
goto discard ;
ipv6_pktoptions :
/* Do you ask, what is it?
1. skb was enqueued by tcp .
2. skb is added to tail of read queue , rather than out of order .
3. socket is not in passive state .
4. Finally , it really contains options , which user wants to receive .
*/
tp = tcp_sk ( sk ) ;
if ( TCP_SKB_CB ( opt_skb ) - > end_seq = = tp - > rcv_nxt & &
! ( ( 1 < < sk - > sk_state ) & ( TCPF_CLOSE | TCPF_LISTEN ) ) ) {
[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542).
Support several new socket options / ancillary data:
IPV6_RECVPKTINFO, IPV6_PKTINFO,
IPV6_RECVHOPOPTS, IPV6_HOPOPTS,
IPV6_RECVDSTOPTS, IPV6_DSTOPTS, IPV6_RTHDRDSTOPTS,
IPV6_RECVRTHDR, IPV6_RTHDR,
IPV6_RECVHOPOPTS, IPV6_HOPOPTS
Old semantics are preserved as IPV6_2292xxxx so that
we can maintain backward compatibility.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2005-09-08 09:59:17 +09:00
if ( np - > rxopt . bits . rxinfo | | np - > rxopt . bits . rxoinfo )
2014-10-17 09:17:20 -07:00
np - > mcast_oif = tcp_v6_iif ( opt_skb ) ;
2005-09-08 09:59:17 +09:00
if ( np - > rxopt . bits . rxhlim | | np - > rxopt . bits . rxohlim )
2007-04-25 17:54:47 -07:00
np - > mcast_hops = ipv6_hdr ( opt_skb ) - > hop_limit ;
2013-12-08 15:46:59 +01:00
if ( np - > rxopt . bits . rxflow | | np - > rxopt . bits . rxtclass )
2013-12-08 15:46:57 +01:00
np - > rcv_flowinfo = ip6_flowinfo ( ipv6_hdr ( opt_skb ) ) ;
2014-01-17 17:15:03 +01:00
if ( np - > repflow )
np - > flow_label = ip6_flowlabel ( ipv6_hdr ( opt_skb ) ) ;
2014-09-27 09:50:56 -07:00
if ( ipv6_opt_accepted ( sk , opt_skb , & TCP_SKB_CB ( opt_skb ) - > header . h6 ) ) {
2005-04-16 15:20:36 -07:00
skb_set_owner_r ( opt_skb , sk ) ;
2016-10-12 19:01:45 +02:00
tcp_v6_restore_cb ( opt_skb ) ;
2005-04-16 15:20:36 -07:00
opt_skb = xchg ( & np - > pktoptions , opt_skb ) ;
} else {
__kfree_skb ( opt_skb ) ;
opt_skb = xchg ( & np - > pktoptions , NULL ) ;
}
}
2021-10-25 09:48:25 -07:00
consume_skb ( opt_skb ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2014-12-22 18:22:48 +01:00
static void tcp_v6_fill_cb ( struct sk_buff * skb , const struct ipv6hdr * hdr ,
const struct tcphdr * th )
{
/* This is tricky: we move IP6CB at its correct location into
* TCP_SKB_CB ( ) . It must be done after xfrm6_policy_check ( ) , because
* _decode_session6 ( ) uses IP6CB ( ) .
* barrier ( ) makes sure compiler won ' t play aliasing games .
*/
memmove ( & TCP_SKB_CB ( skb ) - > header . h6 , IP6CB ( skb ) ,
sizeof ( struct inet6_skb_parm ) ) ;
barrier ( ) ;
TCP_SKB_CB ( skb ) - > seq = ntohl ( th - > seq ) ;
TCP_SKB_CB ( skb ) - > end_seq = ( TCP_SKB_CB ( skb ) - > seq + th - > syn + th - > fin +
skb - > len - th - > doff * 4 ) ;
TCP_SKB_CB ( skb ) - > ack_seq = ntohl ( th - > ack_seq ) ;
TCP_SKB_CB ( skb ) - > tcp_flags = tcp_flag_byte ( th ) ;
TCP_SKB_CB ( skb ) - > tcp_tw_isn = 0 ;
TCP_SKB_CB ( skb ) - > ip_dsfield = ipv6_get_dsfield ( hdr ) ;
TCP_SKB_CB ( skb ) - > sacked = 0 ;
2017-08-22 17:08:48 -04:00
TCP_SKB_CB ( skb ) - > has_rxtstamp =
skb - > tstamp | | skb_hwtstamps ( skb ) - > hwtstamp ;
2014-12-22 18:22:48 +01:00
}
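The end_seq arithmetic in tcp_v6_fill_cb() can be tried out in user space. The following is only a sketch under stated assumptions: `toy_end_seq` is a hypothetical helper (not a kernel function), `skb_len` stands for skb->len counted from the start of the TCP header, and `doff` is the header length in 32-bit words. SYN and FIN each consume one sequence number, and the result wraps modulo 2^32 like the kernel's u32 fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the end_seq computation:
 * end_seq = seq + syn + fin + (segment length - TCP header length).
 */
static uint32_t toy_end_seq(uint32_t seq, bool syn, bool fin,
			    uint32_t skb_len, unsigned int doff)
{
	/* doff counts 32-bit words and must cover the basic 20-byte header */
	assert(doff >= 5 && skb_len >= doff * 4);

	/* unsigned arithmetic wraps modulo 2^32, like sequence numbers */
	return seq + syn + fin + skb_len - doff * 4;
}
```

For example, a 120-byte segment with a 20-byte header and no flags advances the sequence space by exactly its 100 payload bytes.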
2019-05-03 17:01:37 +02:00
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv ( struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2022-02-20 15:06:31 +08:00
enum skb_drop_reason drop_reason ;
2017-08-07 08:44:21 -07:00
int sdif = inet6_sdif ( skb ) ;
2019-12-30 14:14:26 -08:00
int dif = inet6_iif ( skb ) ;
2011-10-21 05:22:42 -04:00
const struct tcphdr * th ;
2011-04-22 04:53:02 +00:00
const struct ipv6hdr * hdr ;
2016-04-01 08:52:17 -07:00
bool refcounted ;
2005-04-16 15:20:36 -07:00
struct sock * sk ;
int ret ;
2008-07-16 20:20:58 -07:00
struct net * net = dev_net ( skb - > dev ) ;
2005-04-16 15:20:36 -07:00
2022-02-20 15:06:31 +08:00
drop_reason = SKB_DROP_REASON_NOT_SPECIFIED ;
2005-04-16 15:20:36 -07:00
if ( skb - > pkt_type ! = PACKET_HOST )
goto discard_it ;
/*
* Count it even if it ' s bad .
*/
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_INSEGS ) ;
2005-04-16 15:20:36 -07:00
if ( ! pskb_may_pull ( skb , sizeof ( struct tcphdr ) ) )
goto discard_it ;
2016-05-13 09:16:40 -07:00
th = ( const struct tcphdr * ) skb - > data ;
2005-04-16 15:20:36 -07:00
2022-02-20 15:06:31 +08:00
if ( unlikely ( th - > doff < sizeof ( struct tcphdr ) / 4 ) ) {
drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL ;
2005-04-16 15:20:36 -07:00
goto bad_packet ;
2022-02-20 15:06:31 +08:00
}
2005-04-16 15:20:36 -07:00
if ( ! pskb_may_pull ( skb , th - > doff * 4 ) )
goto discard_it ;
2014-05-02 16:29:51 -07:00
if ( skb_checksum_init ( skb , IPPROTO_TCP , ip6_compute_pseudo ) )
2013-04-29 08:39:56 +00:00
goto csum_error ;
2005-04-16 15:20:36 -07:00
2016-05-13 09:16:40 -07:00
th = ( const struct tcphdr * ) skb - > data ;
IPv6: Generic TTL Security Mechanism (final version)
This patch adds IPv6 support for RFC5082 Generalized TTL Security Mechanism.
Not to users of mapped address; the IPV6 and IPV4 socket options are separate.
The server does have to deal with both IPv4 and IPv6 socket options
and the client has to handle the different options for each family.
On client:
int ttl = 255;
getaddrinfo(argv[1], argv[2], &hint, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET) {
setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
} else if (rp->ai_family == AF_INET6) {
setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
&ttl, sizeof(ttl));
}
if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
...
On server:
int minttl = 255 - maxhops;
getaddrinfo(NULL, port, &hints, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET6)
setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
&minttl, sizeof(minttl));
setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
break;
...
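The receive-side decision behind these options can be sketched in user-space C. This is a minimal illustration, not the kernel implementation: `gtsm_min_hopcount` and `gtsm_accept` are hypothetical helper names. The comparison follows the `hop_limit < min_hopcount` drop test in tcp_v6_rcv(), and the `255 - maxhops` formula is the one used in the server example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: lowest hop limit to accept when the peer sends
 * with hop limit 255 and is at most max_hops routers away.
 */
static int gtsm_min_hopcount(int max_hops)
{
	assert(max_hops >= 0 && max_hops <= 255);
	return 255 - max_hops;
}

/* Hypothetical helper: a packet passes unless its hop limit is below
 * the configured minimum (the kernel drops when hop_limit < min).
 */
static bool gtsm_accept(uint8_t hop_limit, int min_hopcount)
{
	return hop_limit >= min_hopcount;
}
```

With max_hops set to 1, a directly connected peer (hop limit 255) and a peer one router away (254) are accepted, while a packet arriving with the common default of 64 is dropped.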
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 15:24:53 -07:00
hdr = ipv6_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
2015-10-13 17:12:54 -07:00
lookup :
2016-02-10 11:50:38 -05:00
sk = __inet6_lookup_skb ( & tcp_hashinfo , skb , __tcp_hdrlen ( th ) ,
2017-08-07 08:44:21 -07:00
th - > source , th - > dest , inet6_iif ( skb ) , sdif ,
2016-04-01 08:52:17 -07:00
& refcounted ) ;
2005-04-16 15:20:36 -07:00
if ( ! sk )
goto no_tcp_socket ;
process :
if ( sk - > sk_state = = TCP_TIME_WAIT )
goto do_time_wait ;
2015-10-02 11:43:32 -07:00
if ( sk - > sk_state = = TCP_NEW_SYN_RECV ) {
struct request_sock * req = inet_reqsk ( sk ) ;
2018-02-13 06:14:12 -08:00
bool req_stolen = false ;
2016-02-18 05:39:18 -08:00
struct sock * nsk ;
2015-10-02 11:43:32 -07:00
sk = req - > rsk_listener ;
2022-03-07 16:44:21 -08:00
drop_reason = tcp_inbound_md5_hash ( sk , skb ,
& hdr - > saddr , & hdr - > daddr ,
AF_INET6 , dif , sdif ) ;
if ( drop_reason ) {
2016-08-24 08:50:24 -07:00
sk_drops_add ( sk , skb ) ;
2015-10-02 11:43:32 -07:00
reqsk_put ( req ) ;
goto discard_it ;
}
2018-06-12 23:09:37 +00:00
if ( tcp_checksum_complete ( skb ) ) {
reqsk_put ( req ) ;
goto csum_error ;
}
2016-02-18 05:39:18 -08:00
if ( unlikely ( sk - > sk_state ! = TCP_LISTEN ) ) {
tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
This patch also changes the code to call reuseport_migrate_sock() and
inet_reqsk_clone(), but unlike the other cases, we do not call
inet_reqsk_clone() right after reuseport_migrate_sock().
Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
has three kinds of refcnt:
(A) for listener itself
(B) carried by request_sock
(C) sock_hold() in tcp_v[46]_rcv()
While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.
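The role of the three references can be illustrated with a toy counter. This is a sketch under stated assumptions, not kernel code: `toy_listener`, `toy_hold`, `toy_put` and `toy_rcv_keeps_listener_alive` are hypothetical names, and a plain int replaces the kernel's atomic refcount. It only shows that the listener survives losing (A) and (B) as long as the receive path still holds (C).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy listener: "freed" is set when the last reference is dropped. */
struct toy_listener {
	int refcnt;
	bool freed;
};

static void toy_hold(struct toy_listener *l)
{
	assert(!l->freed);	/* taking a reference on a freed object is a bug */
	l->refcnt++;
}

static void toy_put(struct toy_listener *l)
{
	assert(l->refcnt > 0);
	if (--l->refcnt == 0)
		l->freed = true;
}

/* Returns true if the listener stays alive while only (C) is held and
 * is freed once (C) is dropped as well.
 */
static bool toy_rcv_keeps_listener_alive(void)
{
	struct toy_listener l = { .refcnt = 0, .freed = false };
	bool alive_with_c_only;

	toy_hold(&l);		/* (A) the listener itself */
	toy_hold(&l);		/* (B) carried by the request_sock */
	toy_hold(&l);		/* (C) taken in the receive path */

	toy_put(&l);		/* close(listener) drops (A) */
	toy_put(&l);		/* accept(listener) drops (B) */
	alive_with_c_only = !l.freed;

	toy_put(&l);		/* receive path is done, drops (C) */
	return alive_with_c_only && l.freed;
}
```

Without the third toy_hold(), the second toy_put() would already free the listener while the receive path was still using it.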
For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.
In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating req after
winning the "own_req" race, we can avoid such a worst situation:
CPU 1 looks up req1
CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
2021-06-12 21:32:20 +09:00
nsk = reuseport_migrate_sock ( sk , req_to_sk ( req ) , skb ) ;
if ( ! nsk ) {
inet_csk_reqsk_queue_drop_and_put ( sk , req ) ;
goto lookup ;
}
sk = nsk ;
/* reuseport_migrate_sock() has already held one sk_refcnt
* before returning .
*/
} else {
sock_hold ( sk ) ;
2015-10-13 17:12:54 -07:00
}
2016-04-01 08:52:17 -07:00
refcounted = true ;
2017-09-08 12:44:47 -07:00
nsk = NULL ;
2017-12-03 09:32:59 -08:00
if ( ! tcp_filter ( sk , skb ) ) {
th = ( const struct tcphdr * ) skb - > data ;
hdr = ipv6_hdr ( skb ) ;
tcp_v6_fill_cb ( skb , hdr , th ) ;
2018-02-13 06:14:12 -08:00
nsk = tcp_check_req ( sk , skb , req , false , & req_stolen ) ;
2022-02-20 15:06:31 +08:00
} else {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER ;
2017-12-03 09:32:59 -08:00
}
2015-10-02 11:43:32 -07:00
if ( ! nsk ) {
reqsk_put ( req ) ;
2018-02-13 06:14:12 -08:00
if ( req_stolen ) {
/* Another cpu got exclusive access to req
* and created a full blown socket .
* Try to feed this packet to this socket
* instead of discarding it .
*/
tcp_v6_restore_cb ( skb ) ;
sock_put ( sk ) ;
goto lookup ;
}
2016-02-18 05:39:18 -08:00
goto discard_and_relse ;
2015-10-02 11:43:32 -07:00
}
if ( nsk = = sk ) {
reqsk_put ( req ) ;
tcp_v6_restore_cb ( skb ) ;
} else if ( tcp_child_process ( sk , nsk , skb ) ) {
tcp_v6_send_reset ( nsk , skb ) ;
2016-02-18 05:39:18 -08:00
goto discard_and_relse ;
2015-10-02 11:43:32 -07:00
} else {
2016-02-18 05:39:18 -08:00
sock_put ( sk ) ;
2015-10-02 11:43:32 -07:00
return 0 ;
}
}
2021-10-25 09:48:22 -07:00
if ( static_branch_unlikely ( & ip6_min_hopcount ) ) {
/* min_hopcount can be changed concurrently from do_ipv6_setsockopt() */
if ( hdr - > hop_limit < READ_ONCE ( tcp_inet6_sk ( sk ) - > min_hopcount ) ) {
__NET_INC_STATS ( net , LINUX_MIB_TCPMINTTLDROP ) ;
goto discard_and_relse ;
}
2010-04-22 15:24:53 -07:00
}
2022-02-20 15:06:31 +08:00
if ( ! xfrm6_policy_check ( sk , XFRM_POLICY_IN , skb ) ) {
drop_reason = SKB_DROP_REASON_XFRM_POLICY ;
2005-04-16 15:20:36 -07:00
goto discard_and_relse ;
2022-02-20 15:06:31 +08:00
}
2005-04-16 15:20:36 -07:00
2022-03-07 16:44:21 -08:00
drop_reason = tcp_inbound_md5_hash ( sk , skb , & hdr - > saddr , & hdr - > daddr ,
AF_INET6 , dif , sdif ) ;
if ( drop_reason )
2014-08-07 02:38:22 +04:00
goto discard_and_relse ;
2022-02-20 15:06:31 +08:00
if ( tcp_filter ( sk , skb ) ) {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER ;
2005-04-16 15:20:36 -07:00
goto discard_and_relse ;
2022-02-20 15:06:31 +08:00
}
2016-11-10 13:12:35 -08:00
th = ( const struct tcphdr * ) skb - > data ;
hdr = ipv6_hdr ( skb ) ;
2017-12-03 09:32:59 -08:00
tcp_v6_fill_cb ( skb , hdr , th ) ;
2005-04-16 15:20:36 -07:00
skb - > dev = NULL ;
2015-10-02 11:43:39 -07:00
if ( sk - > sk_state = = TCP_LISTEN ) {
ret = tcp_v6_do_rcv ( sk , skb ) ;
goto put_and_return ;
}
sk_incoming_cpu_update ( sk ) ;
2006-09-25 22:28:47 -07:00
bh_lock_sock_nested ( sk ) ;
2016-03-14 10:52:15 -07:00
tcp_segs_in ( tcp_sk ( sk ) , skb ) ;
2005-04-16 15:20:36 -07:00
ret = 0 ;
if ( ! sock_owned_by_user ( sk ) ) {
2017-07-30 03:57:18 +02:00
ret = tcp_v6_do_rcv ( sk , skb ) ;
tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
Moreover, performing the __kfree_skb() in the recvmsg() context
adds latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-22 08:56:40 -07:00
} else {
2022-02-20 15:06:33 +08:00
if ( tcp_add_backlog ( sk , skb , & drop_reason ) )
2019-03-22 08:56:40 -07:00
goto discard_and_relse ;
2010-03-04 18:01:41 +00:00
}
2005-04-16 15:20:36 -07:00
bh_unlock_sock ( sk ) ;
2015-10-02 11:43:39 -07:00
put_and_return :
2016-04-01 08:52:17 -07:00
if ( refcounted )
sock_put ( sk ) ;
2005-04-16 15:20:36 -07:00
return ret ? - 1 : 0 ;
no_tcp_socket :
2022-02-20 15:06:31 +08:00
drop_reason = SKB_DROP_REASON_NO_SOCKET ;
2005-04-16 15:20:36 -07:00
if ( ! xfrm6_policy_check ( NULL , XFRM_POLICY_IN , skb ) )
goto discard_it ;
2014-12-22 18:22:48 +01:00
tcp_v6_fill_cb ( skb , hdr , th ) ;
2015-06-03 23:49:21 -07:00
if ( tcp_checksum_complete ( skb ) ) {
2013-04-29 08:39:56 +00:00
csum_error :
2022-02-20 15:06:31 +08:00
drop_reason = SKB_DROP_REASON_TCP_CSUM ;
2021-05-14 13:04:25 -07:00
trace_tcp_bad_csum ( skb ) ;
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_CSUMERRORS ) ;
2005-04-16 15:20:36 -07:00
bad_packet :
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_INERRS ) ;
2005-04-16 15:20:36 -07:00
} else {
2006-11-14 19:07:45 -08:00
tcp_v6_send_reset ( NULL , skb ) ;
2005-04-16 15:20:36 -07:00
}
discard_it :
2022-05-19 19:13:47 -07:00
SKB_DR_OR ( drop_reason , NOT_SPECIFIED ) ;
2022-02-20 15:06:31 +08:00
kfree_skb_reason ( skb , drop_reason ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
discard_and_relse :
2016-04-01 08:52:19 -07:00
sk_drops_add ( sk , skb ) ;
2016-04-01 08:52:17 -07:00
if ( refcounted )
sock_put ( sk ) ;
2005-04-16 15:20:36 -07:00
goto discard_it ;
do_time_wait :
if ( ! xfrm6_policy_check ( NULL , XFRM_POLICY_IN , skb ) ) {
2022-02-20 15:06:31 +08:00
drop_reason = SKB_DROP_REASON_XFRM_POLICY ;
2006-10-10 19:41:46 -07:00
inet_twsk_put ( inet_twsk ( sk ) ) ;
2005-04-16 15:20:36 -07:00
goto discard_it ;
}
2014-12-22 18:22:48 +01:00
tcp_v6_fill_cb ( skb , hdr , th ) ;
2013-04-29 08:39:56 +00:00
if ( tcp_checksum_complete ( skb ) ) {
inet_twsk_put ( inet_twsk ( sk ) ) ;
goto csum_error ;
2005-04-16 15:20:36 -07:00
}
2006-10-10 19:41:46 -07:00
switch ( tcp_timewait_state_process ( inet_twsk ( sk ) , skb , th ) ) {
2005-04-16 15:20:36 -07:00
case TCP_TW_SYN :
{
struct sock * sk2 ;
2008-03-25 21:47:49 +09:00
sk2 = inet6_lookup_listener ( dev_net ( skb - > dev ) , & tcp_hashinfo ,
2016-02-10 11:50:38 -05:00
skb , __tcp_hdrlen ( th ) ,
2013-01-22 09:50:39 +00:00
& ipv6_hdr ( skb ) - > saddr , th - > source ,
2007-04-25 17:54:47 -07:00
& ipv6_hdr ( skb ) - > daddr ,
2018-07-19 12:41:18 -07:00
ntohs ( th - > dest ) ,
tcp_v6_iif_l3_slave ( skb ) ,
2017-08-07 08:44:21 -07:00
sdif ) ;
2015-03-29 14:00:05 +01:00
if ( sk2 ) {
2005-08-09 20:44:40 -07:00
struct inet_timewait_sock * tw = inet_twsk ( sk ) ;
2015-07-08 14:28:30 -07:00
inet_twsk_deschedule_put ( tw ) ;
2005-04-16 15:20:36 -07:00
sk = sk2 ;
2015-03-27 12:24:22 +03:00
tcp_v6_restore_cb ( skb ) ;
2016-04-01 08:52:17 -07:00
refcounted = false ;
2005-04-16 15:20:36 -07:00
goto process ;
}
}
2017-10-16 16:36:52 -05:00
/* to ACK */
2020-03-12 15:50:22 -07:00
fallthrough ;
2005-04-16 15:20:36 -07:00
case TCP_TW_ACK :
tcp_v6_timewait_ack ( sk , skb ) ;
break ;
case TCP_TW_RST :
tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
Make it cope with inet tw sockets and then provide tw sk.
This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0
0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1
0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46
// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46
// fails without this change -- default route is used
1.301 > R 1:1(0) win 0
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-21 21:29:26 +01:00
tcp_v6_send_reset ( sk , skb ) ;
inet_twsk_deschedule_put ( inet_twsk ( sk ) ) ;
goto discard_it ;
2014-03-29 09:27:29 +08:00
case TCP_TW_SUCCESS :
;
2005-04-16 15:20:36 -07:00
}
goto discard_it ;
}
tcp/udp: Make early_demux back namespacified.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-13 10:52:07 -07:00
void tcp_v6_early_demux ( struct sk_buff * skb )
2012-07-26 12:18:11 +00:00
{
const struct ipv6hdr * hdr ;
const struct tcphdr * th ;
struct sock * sk ;
if ( skb - > pkt_type ! = PACKET_HOST )
return ;
if ( ! pskb_may_pull ( skb , skb_transport_offset ( skb ) + sizeof ( struct tcphdr ) ) )
return ;
hdr = ipv6_hdr ( skb ) ;
th = tcp_hdr ( skb ) ;
if ( th - > doff < sizeof ( struct tcphdr ) / 4 )
return ;
2014-10-17 09:17:20 -07:00
/* Note : We use inet6_iif() here, not tcp_v6_iif() */
2012-07-26 12:18:11 +00:00
sk = __inet6_lookup_established ( dev_net ( skb - > dev ) , & tcp_hashinfo ,
& hdr - > saddr , th - > source ,
& hdr - > daddr , ntohs ( th - > dest ) ,
2017-08-07 08:44:21 -07:00
inet6_iif ( skb ) , inet6_sdif ( skb ) ) ;
2012-07-26 12:18:11 +00:00
if ( sk ) {
skb - > sk = sk ;
skb - > destructor = sock_edemux ;
2015-03-15 21:12:13 -07:00
if ( sk_fullsock ( sk ) ) {
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
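The ordering described above (clear the pointer first, release afterwards) can be sketched in user space with C11 atomics. This is only an analogy under stated assumptions: `toy_dst`, `toy_sock`, `toy_dst_release` and `toy_sock_drop_rx_dst` are hypothetical names, an atomic exchange stands in for the kernel's RCU pointer update, and the grace period that delays the actual free in the kernel is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; all names are hypothetical. */
struct toy_dst {
	atomic_int refcnt;
};

struct toy_sock {
	_Atomic(struct toy_dst *) rx_dst;	/* cached input route */
};

/* Drop one reference; returns 1 if it was the last one. In the kernel
 * the memory would only be freed after an RCU grace period.
 */
static int toy_dst_release(struct toy_dst *dst)
{
	int old = atomic_fetch_sub(&dst->refcnt, 1);

	assert(old >= 1);
	return old == 1;
}

/* Unpublish first, release second: once the exchange has completed,
 * no new reader can load the pointer from the socket.
 */
static int toy_sock_drop_rx_dst(struct toy_sock *sk)
{
	struct toy_dst *dst = atomic_exchange(&sk->rx_dst,
					      (struct toy_dst *)NULL);

	if (!dst)
		return 0;
	return toy_dst_release(dst);
}
```

Doing the two steps in the opposite order, as in [a] and [b], leaves a window in which the socket still publishes a pointer to an object whose last reference is already gone.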
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 06:33:30 -08:00
struct dst_entry * dst = rcu_dereference ( sk -> sk_rx_dst ) ;
2012-10-22 21:41:48 +00:00
2012-07-26 12:18:11 +00:00
if ( dst )
2021-10-25 09:48:17 -07:00
dst = dst_check ( dst , sk -> sk_rx_dst_cookie ) ;
2012-07-26 12:18:11 +00:00
if ( dst &&
2021-10-25 09:48:16 -07:00
sk -> sk_rx_dst_ifindex == skb -> skb_iif )
2012-07-26 12:18:11 +00:00
skb_dst_set_noref ( skb , dst ) ;
}
}
}
2010-12-01 18:09:13 -08:00
static struct timewait_sock_ops tcp6_timewait_sock_ops = {
. twsk_obj_size = sizeof ( struct tcp6_timewait_sock ) ,
. twsk_unique = tcp_twsk_unique ,
2014-03-29 09:27:29 +08:00
. twsk_destructor = tcp_twsk_destructor ,
2010-12-01 18:09:13 -08:00
} ;
2020-06-19 12:12:35 -07:00
INDIRECT_CALLABLE_SCOPE void tcp_v6_send_check ( struct sock * sk , struct sk_buff * skb )
{
2021-11-15 11:02:32 -08:00
__tcp_v6_send_check ( skb , & sk -> sk_v6_rcv_saddr , & sk -> sk_v6_daddr ) ;
2020-06-19 12:12:35 -07:00
}
2020-01-09 07:59:21 -08:00
const struct inet_connection_sock_af_ops ipv6_specific = {
2006-03-20 22:48:35 -08:00
. queue_xmit = inet6_csk_xmit ,
. send_check = tcp_v6_send_check ,
. rebuild_header = inet6_sk_rebuild_header ,
2012-08-06 05:09:33 +00:00
. sk_rx_dst_set = inet6_sk_rx_dst_set ,
2006-03-20 22:48:35 -08:00
. conn_request = tcp_v6_conn_request ,
. syn_recv_sock = tcp_v6_syn_recv_sock ,
. net_header_len = sizeof ( struct ipv6hdr ) ,
ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing
Quoting Tore Anderson from :
https://bugzilla.kernel.org/show_bug.cgi?id=42572
When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
size does not take into account the size of the IPv6 Fragmentation
header that needs to be included in outbound packets, causing every
transmitted TCP segment to be fragmented across two IPv6 packets, the
latter of which will only contain 8 bytes of actual payload.
RTAX_FEATURE_ALLFRAG is typically set on a route in response to
receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
PTBs with MTU < 1280 are still valid, in particular when an IPv6
packet is sent to an IPv4 destination through a stateless translator.
Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
the path will be translated to ICMPv6 PTB which may then indicate an
MTU of less than 1280.
The Linux kernel refuses to reduce the effective MTU to anything below
1280 bytes, instead it sets it to exactly 1280 bytes, and
RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
instead of 1232 (additionally taking into account the 8 bytes required
by the IPv6 Fragmentation extension header).
This in turn results in rather inefficient transmission, as every
transmitted TCP segment now is split in two fragments containing
1232+8 bytes of payload.
After this patch, all outgoing packets that include a
Fragmentation header are "atomic" or "non-fragmented" fragments,
i.e., they have both Offset=0 and More Fragments=0.
With help from David S. Miller
Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-24 07:37:38 +00:00
. net_frag_header_len = sizeof ( struct frag_hdr ) ,
2006-03-20 22:48:35 -08:00
. setsockopt = ipv6_setsockopt ,
. getsockopt = ipv6_getsockopt ,
. addr2sockaddr = inet6_csk_addr2sockaddr ,
. sockaddr_len = sizeof ( struct sockaddr_in6 ) ,
2014-08-14 12:40:05 -04:00
. mtu_reduced = tcp_v6_mtu_reduced ,
2005-04-16 15:20:36 -07:00
} ;
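The RTAX_FEATURE_ALLFRAG commit message above comes down to simple header arithmetic. The following is a minimal sketch of that arithmetic only, not the kernel's implementation; the constant names and `payload_room()` are hypothetical and chosen for illustration.

```c
#include <stddef.h>

/* Header sizes as quoted in the commit message; the names are illustrative. */
enum {
	IPV6_HDR_LEN  = 40, /* fixed IPv6 header */
	IPV6_FRAG_LEN = 8,  /* IPv6 Fragmentation extension header */
};

/*
 * Room left after the network-layer headers for a given path MTU.
 * When allfrag is non-zero, every packet carries a Fragmentation header,
 * so its 8 bytes are reserved up front instead of spilling into a
 * second, nearly empty fragment.
 */
static size_t payload_room(size_t path_mtu, int allfrag)
{
	size_t hdrs = IPV6_HDR_LEN + (allfrag ? IPV6_FRAG_LEN : 0);

	return path_mtu > hdrs ? path_mtu - hdrs : 0;
}
```

With a path MTU of 1280, this gives the 1240 and 1232 byte figures quoted in the message. The `.net_frag_header_len` field in `ipv6_specific` is what supplies the extra 8 bytes on the kernel side.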
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2009-09-01 19:25:03 +00:00
static const struct tcp_sock_af_ops tcp_sock_ipv6_specific = {
2006-11-14 19:07:45 -08:00
. md5_lookup = tcp_v6_md5_lookup ,
2008-07-19 00:01:42 -07:00
. calc_md5_hash = tcp_v6_md5_hash_skb ,
2006-11-14 19:07:45 -08:00
. md5_parse = tcp_v6_parse_md5_keys ,
} ;
2006-11-14 19:53:22 -08:00
# endif
2006-11-14 19:07:45 -08:00
2005-04-16 15:20:36 -07:00
/*
* TCP over IPv4 via INET6 API
*/
2009-09-01 19:25:04 +00:00
static const struct inet_connection_sock_af_ops ipv6_mapped = {
2006-03-20 22:48:35 -08:00
. queue_xmit = ip_queue_xmit ,
. send_check = tcp_v4_send_check ,
. rebuild_header = inet_sk_rebuild_header ,
2012-08-09 14:11:00 +00:00
. sk_rx_dst_set = inet_sk_rx_dst_set ,
2006-03-20 22:48:35 -08:00
. conn_request = tcp_v6_conn_request ,
. syn_recv_sock = tcp_v6_syn_recv_sock ,
. net_header_len = sizeof ( struct iphdr ) ,
. setsockopt = ipv6_setsockopt ,
. getsockopt = ipv6_getsockopt ,
. addr2sockaddr = inet6_csk_addr2sockaddr ,
. sockaddr_len = sizeof ( struct sockaddr_in6 ) ,
2014-08-14 12:40:05 -04:00
. mtu_reduced = tcp_v4_mtu_reduced ,
2005-04-16 15:20:36 -07:00
} ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2009-09-01 19:25:03 +00:00
static const struct tcp_sock_af_ops tcp_sock_ipv6_mapped_specific = {
2006-11-14 19:07:45 -08:00
. md5_lookup = tcp_v4_md5_lookup ,
2008-07-19 00:01:42 -07:00
. calc_md5_hash = tcp_v4_md5_hash_skb ,
2006-11-14 19:07:45 -08:00
. md5_parse = tcp_v6_parse_md5_keys ,
} ;
2006-11-14 19:53:22 -08:00
# endif
2006-11-14 19:07:45 -08:00
2005-04-16 15:20:36 -07:00
/* NOTE: A lot of things set to zero explicitly by call to
* sk_alloc ( ) so need not be done here .
*/
static int tcp_v6_init_sock ( struct sock * sk )
{
2005-08-10 04:03:31 -03:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
2012-04-19 09:55:21 +00:00
tcp_init_sock ( sk ) ;
2005-04-16 15:20:36 -07:00
2005-12-13 23:15:52 -08:00
icsk -> icsk_af_ops = & ipv6_specific ;
2005-04-16 15:20:36 -07:00
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2012-04-23 03:21:58 -04:00
tcp_sk ( sk ) -> af_specific = & tcp_sock_ipv6_specific ;
2006-11-14 19:07:45 -08:00
# endif
2005-04-16 15:20:36 -07:00
return 0 ;
}
2008-06-14 17:04:49 -07:00
static void tcp_v6_destroy_sock ( struct sock * sk )
2005-04-16 15:20:36 -07:00
{
tcp_v4_destroy_sock ( sk ) ;
2008-06-14 17:04:49 -07:00
inet6_destroy_sock ( sk ) ;
2005-04-16 15:20:36 -07:00
}
2007-04-21 20:13:44 +09:00
# ifdef CONFIG_PROC_FS
2005-04-16 15:20:36 -07:00
/* Proc filesystem TCPv6 sock list dumping. */
2007-02-09 23:24:49 +09:00
static void get_openreq6 ( struct seq_file * seq ,
2015-10-02 11:43:30 -07:00
const struct request_sock * req , int i )
2005-04-16 15:20:36 -07:00
{
inet: get rid of central tcp/dccp listener timer
One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for an awfully long time, with the socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.
SYNACK are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We now can afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more than what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 19:04:20 -07:00
long ttd = req -> rsk_timer . expires - jiffies ;
2013-10-09 15:21:29 -07:00
const struct in6_addr * src = & inet_rsk ( req ) -> ir_v6_loc_addr ;
const struct in6_addr * dest = & inet_rsk ( req ) -> ir_v6_rmt_addr ;
2005-04-16 15:20:36 -07:00
if ( ttd < 0 )
ttd = 0 ;
seq_printf ( seq ,
"%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
2013-08-15 13:42:14 +02:00
"%02X %08X:%08X %02X:%08lX %08X %5u %8d %d %d %pK\n" ,
2005-04-16 15:20:36 -07:00
i ,
src -> s6_addr32 [ 0 ] , src -> s6_addr32 [ 1 ] ,
src -> s6_addr32 [ 2 ] , src -> s6_addr32 [ 3 ] ,
2013-10-10 00:04:37 -07:00
inet_rsk ( req ) -> ir_num ,
2005-04-16 15:20:36 -07:00
dest -> s6_addr32 [ 0 ] , dest -> s6_addr32 [ 1 ] ,
dest -> s6_addr32 [ 2 ] , dest -> s6_addr32 [ 3 ] ,
2013-10-09 15:21:29 -07:00
ntohs ( inet_rsk ( req ) -> ir_rmt_port ) ,
2005-04-16 15:20:36 -07:00
TCP_SYN_RECV ,
2013-12-19 18:44:34 +08:00
0 , 0 , /* could print option size, but that is af dependent. */
2007-02-09 23:24:49 +09:00
1 , /* timers active (only the expire timer) */
jiffies_to_clock_t ( ttd ) ,
2012-10-27 23:16:46 +00:00
req -> num_timeout ,
2015-10-02 11:43:30 -07:00
from_kuid_munged ( seq_user_ns ( seq ) ,
sock_i_uid ( req -> rsk_listener ) ) ,
2007-02-09 23:24:49 +09:00
0 , /* non standard timer */
2005-04-16 15:20:36 -07:00
0 , /* open_requests have no inode */
0 , req ) ;
}
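get_openreq6() above reports the time left on the request timer as `expires - jiffies`, clamped at zero. A minimal userspace sketch of that pattern follows; `ticks_remaining()` is a hypothetical helper, not a kernel function, and it assumes the usual two's-complement conversion from `unsigned long` to `long`.

```c
/*
 * Ticks left until `expires`, given the current tick counter `now`.
 * Both are free-running unsigned counters that may wrap around, so the
 * difference is taken in unsigned arithmetic and only then reinterpreted
 * as signed. A timer that has already expired yields a negative delta,
 * which is clamped to 0, mirroring the "if (ttd < 0) ttd = 0" step.
 */
static long ticks_remaining(unsigned long expires, unsigned long now)
{
	long delta = (long)(expires - now);

	return delta < 0 ? 0 : delta;
}
```

Taking the difference before the sign conversion is what keeps the result correct when the counter wraps between `now` and `expires`.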
static void get_tcp6_sock ( struct seq_file * seq , struct sock * sp , int i )
{
2011-04-22 04:53:02 +00:00
const struct in6_addr * dest , * src ;
2005-04-16 15:20:36 -07:00
__u16 destp , srcp ;
int timer_active ;
unsigned long timer_expires ;
2011-10-21 05:22:42 -04:00
const struct inet_sock * inet = inet_sk ( sp ) ;
const struct tcp_sock * tp = tcp_sk ( sp ) ;
2005-08-09 20:10:42 -07:00
const struct inet_connection_sock * icsk = inet_csk ( sp ) ;
2015-09-29 07:42:52 -07:00
const struct fastopen_queue * fastopenq = & icsk -> icsk_accept_queue . fastopenq ;
2015-11-12 08:43:18 -08:00
int rx_queue ;
int state ;
2005-04-16 15:20:36 -07:00
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:42:29 -07:00
dest = & sp -> sk_v6_daddr ;
src = & sp -> sk_v6_rcv_saddr ;
2009-10-15 06:30:45 +00:00
destp = ntohs ( inet -> inet_dport ) ;
srcp = ntohs ( inet -> inet_sport ) ;
2005-08-09 20:10:42 -07:00
2016-06-06 15:07:18 -07:00
if ( icsk -> icsk_pending == ICSK_TIME_RETRANS ||
2017-01-12 22:11:33 -08:00
icsk -> icsk_pending == ICSK_TIME_REO_TIMEOUT ||
2016-06-06 15:07:18 -07:00
icsk -> icsk_pending == ICSK_TIME_LOSS_PROBE ) {
2005-04-16 15:20:36 -07:00
timer_active = 1 ;
2005-08-09 20:10:42 -07:00
timer_expires = icsk -> icsk_timeout ;
} else if ( icsk -> icsk_pending == ICSK_TIME_PROBE0 ) {
2005-04-16 15:20:36 -07:00
timer_active = 4 ;
2005-08-09 20:10:42 -07:00
timer_expires = icsk -> icsk_timeout ;
2005-04-16 15:20:36 -07:00
} else if ( timer_pending ( & sp -> sk_timer ) ) {
timer_active = 2 ;
timer_expires = sp -> sk_timer . expires ;
} else {
timer_active = 0 ;
timer_expires = jiffies ;
}
2017-12-20 11:12:52 +08:00
state = inet_sk_state_load ( sp ) ;
2015-11-12 08:43:18 -08:00
if ( state == TCP_LISTEN )
2019-11-05 14:11:53 -08:00
rx_queue = READ_ONCE ( sp -> sk_ack_backlog ) ;
2015-11-12 08:43:18 -08:00
else
/* Because we don't lock the socket,
* we might find a transient negative value.
*/
2019-10-10 20:17:39 -07:00
rx_queue = max_t ( int , READ_ONCE ( tp -> rcv_nxt ) -
2019-10-10 20:17:40 -07:00
READ_ONCE ( tp -> copied_seq ) , 0 ) ;
2015-11-12 08:43:18 -08:00
2005-04-16 15:20:36 -07:00
seq_printf ( seq ,
"%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
2013-08-15 13:42:14 +02:00
"%02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %lu %lu %u %u %d\n" ,
2005-04-16 15:20:36 -07:00
i ,
src -> s6_addr32 [ 0 ] , src -> s6_addr32 [ 1 ] ,
src -> s6_addr32 [ 2 ] , src -> s6_addr32 [ 3 ] , srcp ,
dest -> s6_addr32 [ 0 ] , dest -> s6_addr32 [ 1 ] ,
dest -> s6_addr32 [ 2 ] , dest -> s6_addr32 [ 3 ] , destp ,
2015-11-12 08:43:18 -08:00
state ,
2019-10-10 20:17:41 -07:00
READ_ONCE ( tp -> write_seq ) - tp -> snd_una ,
2015-11-12 08:43:18 -08:00
rx_queue ,
2005-04-16 15:20:36 -07:00
timer_active ,
2012-08-08 21:13:53 +00:00
jiffies_delta_to_clock_t ( timer_expires - jiffies ) ,
2005-08-09 20:10:42 -07:00
icsk -> icsk_retransmits ,
2012-05-24 01:10:10 -06:00
from_kuid_munged ( seq_user_ns ( seq ) , sock_i_uid ( sp ) ) ,
2005-08-10 04:03:31 -03:00
icsk -> icsk_probes_out ,
2005-04-16 15:20:36 -07:00
sock_i_ino ( sp ) ,
2017-06-30 13:08:01 +03:00
refcount_read ( & sp -> sk_refcnt ) , sp ,
2008-06-27 20:00:19 -07:00
jiffies_to_clock_t ( icsk -> icsk_rto ) ,
jiffies_to_clock_t ( icsk -> icsk_ack . ato ) ,
2019-01-25 10:53:19 -08:00
( icsk -> icsk_ack . quick << 1 ) | inet_csk_in_pingpong_mode ( sp ) ,
2022-04-05 16:35:38 -07:00
tcp_snd_cwnd ( tp ) ,
2015-11-12 08:43:18 -08:00
state == TCP_LISTEN ?
2015-09-29 07:42:52 -07:00
fastopenq -> max_qlen :
2014-05-11 20:22:12 -07:00
( tcp_in_initial_slowstart ( tp ) ? -1 : tp -> snd_ssthresh )
2005-04-16 15:20:36 -07:00
) ;
}
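For non-listening sockets, get_tcp6_sock() above derives rx_queue from two 32-bit sequence numbers read without the socket lock, so the difference can be transiently negative. The sketch below shows the same clamp in plain C; `rx_queue_bytes()` is a hypothetical name, and the cast assumes the usual two's-complement conversion that the kernel's `max_t(int, ...)` also relies on.

```c
#include <stdint.h>

/*
 * Bytes received but not yet copied to user space.
 * The subtraction is done modulo 2^32 so it stays correct across
 * sequence-number wraparound; the result is then viewed as signed and
 * a transient negative value (from an unlocked, racy read) is
 * reported as 0.
 */
static int rx_queue_bytes(uint32_t rcv_nxt, uint32_t copied_seq)
{
	int32_t diff = (int32_t)(rcv_nxt - copied_seq);

	return diff < 0 ? 0 : diff;
}
```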
2007-02-09 23:24:49 +09:00
static void get_timewait6_sock ( struct seq_file * seq ,
2005-08-09 20:09:30 -07:00
struct inet_timewait_sock * tw , int i )
2005-04-16 15:20:36 -07:00
{
tcp/dccp: get rid of central timewait timer
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.
This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.
Tested:
On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)
Before patch :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171
While test is running, we can observe 25 or even 33 ms latencies.
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
After patch :
About 90% increase of throughput :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992
And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 18:51:09 -07:00
long delta = tw -> tw_timer . expires - jiffies ;
2011-04-22 04:53:02 +00:00
const struct in6_addr * dest , * src ;
2005-04-16 15:20:36 -07:00
__u16 destp , srcp ;
2013-10-03 15:42:29 -07:00
dest = & tw -> tw_v6_daddr ;
src = & tw -> tw_v6_rcv_saddr ;
2005-04-16 15:20:36 -07:00
destp = ntohs ( tw -> tw_dport ) ;
srcp = ntohs ( tw -> tw_sport ) ;
seq_printf ( seq ,
"%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
net: convert %p usage to %pK
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
The supporting code for kptr_restrict and %pK are currently in the -mm
tree. This patch converts users of %p in net/ to %pK. Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-23 12:17:35 +00:00
"%02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK\n" ,
2005-04-16 15:20:36 -07:00
i ,
src -> s6_addr32 [ 0 ] , src -> s6_addr32 [ 1 ] ,
src -> s6_addr32 [ 2 ] , src -> s6_addr32 [ 3 ] , srcp ,
dest -> s6_addr32 [ 0 ] , dest -> s6_addr32 [ 1 ] ,
dest -> s6_addr32 [ 2 ] , dest -> s6_addr32 [ 3 ] , destp ,
tw -> tw_substate , 0 , 0 ,
2012-08-08 21:13:53 +00:00
3 , jiffies_delta_to_clock_t ( delta ) , 0 , 0 , 0 , 0 ,
2017-06-30 13:08:01 +03:00
refcount_read ( & tw -> tw_refcnt ) , tw ) ;
2005-04-16 15:20:36 -07:00
}
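The "%p to %pK" commit message quoted inside get_timewait6_sock() describes three kptr_restrict settings. As a rough model of those rules only, the hypothetical helper below returns the value a reader would see; it is not the kernel's vsprintf code, and later kernels may treat plain %p differently from what that message assumes.

```c
#include <stdint.h>

/*
 * Value shown for a %pK pointer, following the rules in the commit
 * message:
 *   kptr_restrict == 0: same as plain %p (pointer shown)
 *   kptr_restrict == 1: zeros unless the reader has CAP_SYSLOG
 *   kptr_restrict == 2: zeros regardless of privileges
 * Any other setting is treated like 2 here, which is an assumption of
 * this sketch rather than documented behaviour.
 */
static uintptr_t pk_visible_value(const void *ptr, int kptr_restrict,
				  int has_cap_syslog)
{
	if (kptr_restrict == 0)
		return (uintptr_t)ptr;
	if (kptr_restrict == 1 && has_cap_syslog)
		return (uintptr_t)ptr;
	return 0;
}
```

This is why the last column of /proc/net/tcp6 often reads as all zeros for unprivileged readers.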
static int tcp6_seq_show ( struct seq_file * seq , void * v )
{
struct tcp_iter_state * st ;
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 00:22:02 -07:00
struct sock * sk = v ;
2005-04-16 15:20:36 -07:00
if ( v == SEQ_START_TOKEN ) {
seq_puts ( seq ,
" sl "
" local_address "
" remote_address "
" st tx_queue rx_queue tr tm->when retrnsmt "
" uid timeout inode \n " ) ;
goto out ;
}
st = seq -> private ;
2015-10-02 11:43:32 -07:00
if ( sk -> sk_state == TCP_TIME_WAIT )
get_timewait6_sock ( seq , v , st -> num ) ;
else if ( sk -> sk_state == TCP_NEW_SYN_RECV )
2015-10-02 11:43:30 -07:00
get_openreq6 ( seq , v , st -> num ) ;
2015-10-02 11:43:32 -07:00
else
get_tcp6_sock ( seq , v , st -> num ) ;
2005-04-16 15:20:36 -07:00
out :
return 0 ;
}
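Each row emitted by the three helpers above prints an endpoint as four `%08X` words of the address followed by `:%04X` for the port, with the words taken from `s6_addr32[]` in host byte order. The sketch below round-trips one such field in userspace. Both function names are hypothetical, and the parser assumes it runs on a machine with the same byte order as the kernel that produced the line, which holds when reading the local /proc/net/tcp6.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an endpoint in the same layout as the seq_printf() calls above. */
static void format_tcp6_endpoint(const struct in6_addr *addr, unsigned int port,
				 char *buf, size_t len)
{
	uint32_t w[4];

	/* View the 16 address bytes as four host-order 32-bit words. */
	memcpy(w, addr->s6_addr, sizeof(w));
	snprintf(buf, len, "%08X%08X%08X%08X:%04X",
		 (unsigned int)w[0], (unsigned int)w[1],
		 (unsigned int)w[2], (unsigned int)w[3], port);
}

/* Parse one "ADDRESS:PORT" field; returns 0 on success, -1 if malformed. */
static int parse_tcp6_endpoint(const char *field, struct in6_addr *addr,
			       unsigned int *port)
{
	unsigned int w[4], p;
	uint32_t w32[4];
	int i;

	if (sscanf(field, "%8x%8x%8x%8x:%4x",
		   &w[0], &w[1], &w[2], &w[3], &p) != 5)
		return -1;

	for (i = 0; i < 4; i++)
		w32[i] = (uint32_t)w[i];
	/* Store the words back in host order to recover the address bytes. */
	memcpy(addr->s6_addr, w32, sizeof(w32));
	*port = p;
	return 0;
}
```

The recovered `struct in6_addr` can then be passed to `inet_ntop()` for display. The port needs no byte swap because the kernel already applies `ntohs()` before printing it.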
2018-04-11 09:31:28 +02:00
static const struct seq_operations tcp6_seq_ops = {
. show = tcp6_seq_show ,
. start = tcp_seq_start ,
. next = tcp_seq_next ,
. stop = tcp_seq_stop ,
} ;
2005-04-16 15:20:36 -07:00
static struct tcp_seq_afinfo tcp6_seq_afinfo = {
. family = AF_INET6 ,
} ;
2010-01-17 03:35:32 +00:00
int __net_init tcp6_proc_init ( struct net * net )
2005-04-16 15:20:36 -07:00
{
2018-04-10 19:42:55 +02:00
if ( ! proc_create_net_data ( "tcp6" , 0444 , net -> proc_net , & tcp6_seq_ops ,
sizeof ( struct tcp_iter_state ) , & tcp6_seq_afinfo ) )
2018-04-11 09:31:28 +02:00
return -ENOMEM ;
return 0 ;
2005-04-16 15:20:36 -07:00
}
2008-03-21 04:14:45 -07:00
void tcp6_proc_exit ( struct net * net )
2005-04-16 15:20:36 -07:00
{
2018-04-11 09:31:28 +02:00
remove_proc_entry ( "tcp6" , net -> proc_net ) ;
2005-04-16 15:20:36 -07:00
}
# endif
struct proto tcpv6_prot = {
. name = "TCPv6" ,
. owner = THIS_MODULE ,
. close = tcp_close ,
2018-03-30 15:08:05 -07:00
. pre_connect = tcp_v6_pre_connect ,
2005-04-16 15:20:36 -07:00
. connect = tcp_v6_connect ,
. disconnect = tcp_disconnect ,
2005-08-09 20:10:42 -07:00
. accept = inet_csk_accept ,
2005-04-16 15:20:36 -07:00
. ioctl = tcp_ioctl ,
. init = tcp_v6_init_sock ,
. destroy = tcp_v6_destroy_sock ,
. shutdown = tcp_shutdown ,
. setsockopt = tcp_setsockopt ,
. getsockopt = tcp_getsockopt ,
2021-01-15 08:34:59 -08:00
. bpf_bypass_getsockopt = tcp_bpf_bypass_getsockopt ,
2017-01-09 16:55:12 +01:00
. keepalive = tcp_set_keepalive ,
2005-04-16 15:20:36 -07:00
. recvmsg = tcp_recvmsg ,
2010-07-10 20:41:55 +00:00
. sendmsg = tcp_sendmsg ,
. sendpage = tcp_sendpage ,
2005-04-16 15:20:36 -07:00
. backlog_rcv = tcp_v6_do_rcv ,
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues)
The goal of TSQ is to reduce the number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 05:50:31 +00:00
. release_cb = tcp_release_cb ,
2016-02-10 11:50:36 -05:00
. hash = inet6_hash ,
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 04:06:04 -08:00
. unhash = inet_unhash ,
. get_port = inet_csk_get_port ,
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. When the return value is
non-zero, __inet_bind() sets inet_saddr and inet_rcv_saddr to 0 and
exits:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Take UDP as an example. A UDP socket is added to
'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' once
sk->sk_prot->get_port() succeeds. If 'inet->inet_rcv_saddr' was
specified, 'sk' ends up in an 'hslot2' of 'hash2' that it doesn't
belong to (because inet_saddr has been reset to 0), and received UDP
packets will not be delivered to this sock. If 'inet->inet_rcv_saddr'
was not specified, the sock works fine and can receive packets,
which is weird, as the 'bind()' has already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP it is inet_put_port(); for UDP it is
udp_lib_unhash(); for ICMP (ping) it is ping_unhash().
Therefore, after sys_bind() fails because of
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket is unbound, which
means that it can then be bound to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
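The undo step described above can be illustrated with a small userspace model. This is only a sketch: every `model_*` name is hypothetical, the `hook_err` parameter stands in for the return value of the post-bind BPF hook, and plain -1 replaces real errno values. It does not reproduce the kernel's __inet_bind().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of get_port()/put_port(); not kernel code. */
#define MODEL_MAX_PORT 1024

struct model_sock {
	unsigned short port;	/* 0 means "not bound" */
	unsigned int saddr;	/* bound source address, 0 means "any" */
};

struct model_proto {
	int (*get_port)(struct model_sock *sk, unsigned short port);
	void (*put_port)(struct model_sock *sk);
};

static bool model_port_in_use[MODEL_MAX_PORT];

/* Reserve a port for the socket; fails if the port is taken or invalid. */
static int model_get_port(struct model_sock *sk, unsigned short port)
{
	if (port == 0 || port >= MODEL_MAX_PORT || model_port_in_use[port])
		return -1;
	model_port_in_use[port] = true;
	sk->port = port;
	return 0;
}

/* Release whatever model_get_port() reserved. */
static void model_put_port(struct model_sock *sk)
{
	if (sk->port) {
		model_port_in_use[sk->port] = false;
		sk->port = 0;
	}
}

static const struct model_proto model_tcp_proto = {
	.get_port = model_get_port,
	.put_port = model_put_port,
};

/* hook_err is an assumption: it plays the role of the post-bind hook result. */
static int model_bind(const struct model_proto *prot, struct model_sock *sk,
		      unsigned int saddr, unsigned short port, int hook_err)
{
	sk->saddr = saddr;
	if (prot->get_port(sk, port)) {
		sk->saddr = 0;
		return -1;
	}
	if (hook_err) {
		/* Undo get_port() first, then clear the address. */
		if (prot->put_port)
			prot->put_port(sk);
		sk->saddr = 0;
		return hook_err;
	}
	return 0;
}
```

In this model, a bind rejected by the hook leaves the port free, so a later bind to the same port succeeds.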
2022-01-06 21:20:20 +08:00
. put_port = inet_put_port ,
2021-03-30 19:32:31 -07:00
# ifdef CONFIG_BPF_SYSCALL
. psock_update_sk_prot = tcp_bpf_update_proto ,
# endif
2005-04-16 15:20:36 -07:00
. enter_memory_pressure = tcp_enter_memory_pressure ,
2017-06-07 13:29:12 -07:00
. leave_memory_pressure = tcp_leave_memory_pressure ,
tcp: TCP_NOTSENT_LOWAT socket option
The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
A TCP receiver might announce a big window, and TCP sender autotuning
might allow a large number of bytes in the write queue, but this has
little performance impact if a large part of this buffering is wasted:
the write queue needs to be large only to deal with a large BDP, not
necessarily to cope with scheduling delays (incoming ACKs make room
for the application to queue more bytes).
For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events
(or to be awakened in a blocking sendmsg()).
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat.
Note this might increase the number of sendmsg()/sendfile() calls
when using non-blocking sockets,
and increase the number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128 KB has no bad effect on the throughput or CPU usage
of a single flow, although there is an increase in context switches.
A bonus is that we hold the socket lock for a shorter amount
of time, which should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
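For the per-socket path described above, an application would set the option with setsockopt(). The following is a minimal sketch for Linux, assuming the helper names (`set_notsent_lowat`, `get_notsent_lowat`, `open_tcp_with_lowat`) are free to choose; they are not part of any library. The fallback define uses the option number from the Linux UAPI headers in case the libc headers predate it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* Linux UAPI value, for older libc headers */
#endif

/* Set the per-socket unsent-bytes limit; returns 0 or -1 with errno set. */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/* Read the limit back from the socket; returns 0 or -1 with errno set. */
static int get_notsent_lowat(int fd, unsigned int *bytes)
{
	socklen_t len = sizeof(*bytes);

	return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}

/*
 * Hypothetical convenience helper: open a TCP socket of the given family
 * (AF_INET or AF_INET6) and apply the limit before it is used.
 * Returns the file descriptor, or -1 on failure.
 */
static int open_tcp_with_lowat(int family, unsigned int bytes)
{
	int fd = socket(family, SOCK_STREAM, IPPROTO_TCP);

	if (fd < 0)
		return -1;
	if (set_notsent_lowat(fd, bytes) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller might use `open_tcp_with_lowat(AF_INET6, 128 * 1024)` to match the 128 KB value used in the measurements above; sockets that never set the option fall back to the sysctl.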
2013-07-22 20:27:07 -07:00
. stream_memory_free = tcp_stream_memory_free ,
2005-04-16 15:20:36 -07:00
. sockets_allocated = & tcp_sockets_allocated ,
2022-06-08 23:34:08 -07:00
2005-04-16 15:20:36 -07:00
. memory_allocated = & tcp_memory_allocated ,
2022-06-08 23:34:08 -07:00
. per_cpu_fw_alloc = & tcp_memory_per_cpu_fw_alloc ,
2005-04-16 15:20:36 -07:00
. memory_pressure = & tcp_memory_pressure ,
2005-08-09 20:11:41 -07:00
. orphan_count = & tcp_orphan_count ,
2013-10-19 16:25:36 -07:00
. sysctl_mem = sysctl_tcp_mem ,
2017-11-07 00:29:28 -08:00
. sysctl_wmem_offset = offsetof ( struct net , ipv4 . sysctl_tcp_wmem ) ,
. sysctl_rmem_offset = offsetof ( struct net , ipv4 . sysctl_tcp_rmem ) ,
2005-04-16 15:20:36 -07:00
. max_header = MAX_TCP_HEADER ,
. obj_size = sizeof ( struct tcp6_sock ) ,
2017-01-18 02:53:44 -08:00
. slab_flags = SLAB_TYPESAFE_BY_RCU ,
2005-12-13 23:25:19 -08:00
. twsk_prot = & tcp6_timewait_sock_ops ,
2005-06-18 22:47:21 -07:00
. rsk_prot = & tcp6_request_sock_ops ,
2008-03-22 16:50:58 -07:00
. h . hashinfo = & tcp_hashinfo ,
2010-07-10 20:41:55 +00:00
. no_autobind = true ,
2015-12-16 12:30:05 +09:00
. diag_destroy = tcp_abort ,
2005-04-16 15:20:36 -07:00
} ;
2020-06-02 00:07:05 +05:30
EXPORT_SYMBOL_GPL ( tcpv6_prot ) ;
2005-04-16 15:20:36 -07:00
tcp/udp: Make early_demux back namespacified.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
prevented us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
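The difference between the old and new behaviour can be modelled in a few lines of userspace C. This is a sketch of the logic as the message describes it, not the kernel code: `struct model_net` and all `model_*` functions are hypothetical, and the handler only counts how often it is called.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-netns knobs; the real ones live in struct net. */
struct model_net {
	bool ip_early_demux;
	bool tcp_early_demux;
};

static int model_demux_calls;

static void model_tcp_early_demux(void)
{
	model_demux_calls++;
}

/* Old scheme: one global handler pointer shared by every netns. */
static void (*model_global_handler)(void) = model_tcp_early_demux;

/*
 * Old sysctl write: the value is stored in the writer's netns, but the
 * global handler is switched according to init_net's value.
 */
static void model_old_sysctl_write(struct model_net *net,
				   const struct model_net *init_net, bool val)
{
	net->tcp_early_demux = val;
	model_global_handler = init_net->tcp_early_demux ?
			       model_tcp_early_demux : NULL;
}

/* Old receive path: the per-netns tcp knob is never consulted. */
static bool model_old_rcv_finish(const struct model_net *net)
{
	if (net->ip_early_demux && model_global_handler) {
		model_global_handler();
		return true;
	}
	return false;
}

/* New receive path: both knobs come from the receiving netns, direct call. */
static bool model_rcv_finish(const struct model_net *net)
{
	if (net->ip_early_demux && net->tcp_early_demux) {
		model_tcp_early_demux();
		return true;
	}
	return false;
}
```

With the old scheme, turning the knob off in a child netns changes nothing as long as init_net leaves it on; the new receive path honours the child's own setting.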
2022-07-13 10:52:07 -07:00
static const struct inet6_protocol tcpv6_protocol = {
2005-04-16 15:20:36 -07:00
. handler = tcp_v6_rcv ,
. err_handler = tcp_v6_err ,
. flags = INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL ,
} ;
static struct inet_protosw tcpv6_protosw = {
. type = SOCK_STREAM ,
. protocol = IPPROTO_TCP ,
. prot = & tcpv6_prot ,
. ops = & inet6_stream_ops ,
2005-12-13 23:26:10 -08:00
. flags = INET_PROTOSW_PERMANENT |
INET_PROTOSW_ICSK ,
2005-04-16 15:20:36 -07:00
} ;
2010-01-17 03:35:32 +00:00
static int __net_init tcpv6_net_init ( struct net * net )
2008-03-07 11:16:02 -08:00
{
2008-04-03 14:28:30 -07:00
return inet_ctl_sock_create ( & net -> ipv6 . tcp_sk , PF_INET6 ,
SOCK_RAW , IPPROTO_TCP , net ) ;
2008-03-07 11:16:02 -08:00
}
2010-01-17 03:35:32 +00:00
static void __net_exit tcpv6_net_exit ( struct net * net )
2008-03-07 11:16:02 -08:00
{
2008-04-03 14:28:30 -07:00
inet_ctl_sock_destroy ( net -> ipv6 . tcp_sk ) ;
2009-12-03 02:29:09 +00:00
}
2022-05-12 14:14:56 -07:00
static void __net_exit tcpv6_net_exit_batch ( struct list_head * net_exit_list )
{
inet_twsk_purge ( & tcp_hashinfo , AF_INET6 ) ;
}
2008-03-07 11:16:02 -08:00
static struct pernet_operations tcpv6_net_ops = {
2009-12-03 02:29:09 +00:00
. init = tcpv6_net_init ,
. exit = tcpv6_net_exit ,
2022-05-12 14:14:56 -07:00
. exit_batch = tcpv6_net_exit_batch ,
2008-03-07 11:16:02 -08:00
} ;
2007-12-11 02:25:35 -08:00
int __init tcpv6_init ( void )
2005-04-16 15:20:36 -07:00
{
2007-12-11 02:25:35 -08:00
int ret ;
2012-11-15 08:49:15 +00:00
ret = inet6_add_protocol ( & tcpv6_protocol , IPPROTO_TCP ) ;
if ( ret )
2012-11-15 08:49:22 +00:00
goto out ;
2012-11-15 08:49:15 +00:00
2005-04-16 15:20:36 -07:00
/* register inet6 protocol */
2007-12-11 02:25:35 -08:00
ret = inet6_register_protosw ( & tcpv6_protosw ) ;
if ( ret )
goto out_tcpv6_protocol ;
2008-03-07 11:16:02 -08:00
ret = register_pernet_subsys ( & tcpv6_net_ops ) ;
2007-12-11 02:25:35 -08:00
if ( ret )
goto out_tcpv6_protosw ;
2020-01-21 16:56:15 -08:00
ret = mptcpv6_init ( ) ;
if ( ret )
goto out_tcpv6_pernet_subsys ;
2007-12-11 02:25:35 -08:00
out :
return ret ;
2006-01-11 15:53:04 -08:00
2020-01-21 16:56:15 -08:00
out_tcpv6_pernet_subsys :
unregister_pernet_subsys ( & tcpv6_net_ops ) ;
2007-12-11 02:25:35 -08:00
out_tcpv6_protosw :
inet6_unregister_protosw ( & tcpv6_protosw ) ;
2012-11-15 08:49:15 +00:00
out_tcpv6_protocol :
inet6_del_protocol ( & tcpv6_protocol , IPPROTO_TCP ) ;
2007-12-11 02:25:35 -08:00
goto out ;
}
2007-12-13 05:34:58 -08:00
void tcpv6_exit ( void )
2007-12-11 02:25:35 -08:00
{
2008-03-07 11:16:02 -08:00
unregister_pernet_subsys ( & tcpv6_net_ops ) ;
2007-12-11 02:25:35 -08:00
inet6_unregister_protosw ( & tcpv6_protosw ) ;
inet6_del_protocol ( & tcpv6_protocol , IPPROTO_TCP ) ;
2005-04-16 15:20:36 -07:00
}
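tcpv6_init() registers four things in order and, on failure, jumps to a label that unregisters only what already succeeded, in reverse order. A minimal userspace sketch of that goto-based unwinding follows; the `model_*` names and the `model_fail_at` switch are assumptions made for illustration and stand in for the real registration calls.

```c
/* Hypothetical stand-ins for the four registration steps in tcpv6_init(). */
enum {
	STEP_PROTOCOL,	/* inet6_add_protocol() */
	STEP_PROTOSW,	/* inet6_register_protosw() */
	STEP_PERNET,	/* register_pernet_subsys() */
	STEP_MPTCP,	/* mptcpv6_init() */
	STEP_COUNT
};

static int model_registered[STEP_COUNT];
static int model_fail_at = -1;	/* step forced to fail, or -1 for none */

static int model_register(int step)
{
	if (step == model_fail_at)
		return -1;
	model_registered[step] = 1;
	return 0;
}

static void model_unregister(int step)
{
	model_registered[step] = 0;
}

static int model_init(void)
{
	int ret;

	ret = model_register(STEP_PROTOCOL);
	if (ret)
		goto out;
	ret = model_register(STEP_PROTOSW);
	if (ret)
		goto out_protocol;
	ret = model_register(STEP_PERNET);
	if (ret)
		goto out_protosw;
	ret = model_register(STEP_MPTCP);
	if (ret)
		goto out_pernet;
out:
	return ret;

	/* Each label undoes one earlier step and falls through to the next. */
out_pernet:
	model_unregister(STEP_PERNET);
out_protosw:
	model_unregister(STEP_PROTOSW);
out_protocol:
	model_unregister(STEP_PROTOCOL);
	goto out;
}
```

Because the labels fall through, a failure at any step leaves nothing registered, which mirrors the ordering tcpv6_exit() uses for a full teardown.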