2019-05-27 08:55:01 +02:00
// SPDX-License-Identifier: GPL-2.0-or-later
2005-04-16 15:20:36 -07:00
/*
 *	IPv6 output functions
2007-02-09 23:24:49 +09:00
 *	Linux INET6 implementation
2005-04-16 15:20:36 -07:00
 *
 *	Authors:
2007-02-09 23:24:49 +09:00
 *	Pedro Roque		<roque@di.fc.ul.pt>
2005-04-16 15:20:36 -07:00
 *
 *	Based on linux/net/ipv4/ip_output.c
 *
 *	Changes:
 *	A.N.Kuznetsov	:	arithmetic in fragmentation.
 *				extension headers are implemented.
 *				route changes now work.
 *				ip6_forward does not confuse sniffers.
 *				etc.
 *
 *	H. von Brand	:	Added missing #include <linux/string.h>
2014-08-24 21:53:10 +01:00
 *	Imran Patel	:	frag id should be in NBO
2005-04-16 15:20:36 -07:00
 *	Kazunori MIYAZAWA @USAGI
 *			:	add ip6_append_data and related functions
 *				for datagram xmit
 */
# include <linux/errno.h>
2008-01-11 19:15:08 -08:00
# include <linux/kernel.h>
2005-04-16 15:20:36 -07:00
# include <linux/string.h>
# include <linux/socket.h>
# include <linux/net.h>
# include <linux/netdevice.h>
# include <linux/if_arp.h>
# include <linux/in6.h>
# include <linux/tcp.h>
# include <linux/route.h>
2006-05-27 23:05:54 -07:00
# include <linux/module.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2005-04-16 15:20:36 -07:00
2016-11-23 16:52:29 +01:00
# include <linux/bpf-cgroup.h>
2005-04-16 15:20:36 -07:00
# include <linux/netfilter.h>
# include <linux/netfilter_ipv6.h>
# include <net/sock.h>
# include <net/snmp.h>
# include <net/ipv6.h>
# include <net/ndisc.h>
# include <net/protocol.h>
# include <net/ip6_route.h>
# include <net/addrconf.h>
# include <net/rawv6.h>
# include <net/icmp.h>
# include <net/xfrm.h>
# include <net/checksum.h>
2008-04-03 09:22:53 +09:00
# include <linux/mroute6.h>
2015-10-12 11:47:10 -07:00
# include <net/l3mdev.h>
2016-08-24 20:10:43 -07:00
# include <net/lwtunnel.h>
2005-04-16 15:20:36 -07:00
2015-06-12 22:12:04 -05:00
static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
2005-04-16 15:20:36 -07:00
{
2009-06-02 05:19:30 +00:00
	struct dst_entry *dst = skb_dst(skb);
2005-04-16 15:20:36 -07:00
	struct net_device *dev = dst->dev;
2019-06-24 16:01:08 +02:00
	const struct in6_addr *nexthop;
2011-07-14 07:53:20 -07:00
	struct neighbour *neigh;
2013-01-17 12:54:00 +00:00
	int ret;
2005-04-16 15:20:36 -07:00
2007-04-25 17:54:47 -07:00
	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr)) {
2009-06-02 05:19:30 +00:00
		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
2005-04-16 15:20:36 -07:00
2015-04-05 22:19:04 -04:00
		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
2018-02-28 23:29:30 +02:00
		    ((mroute6_is_socket(net, skb) &&
2008-12-10 16:07:08 -08:00
		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
2008-04-03 09:22:53 +09:00
		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
					 &ipv6_hdr(skb)->saddr))) {
2005-04-16 15:20:36 -07:00
			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
			/* Do not check for IFF_ALLMULTI; multicast routing
			   is not supported in any case.
			 */
			if (newskb)
2010-03-23 04:09:07 +01:00
				NF_HOOK(NFPROTO_IPV6, NF_INET_POST_ROUTING,
2015-09-15 20:04:16 -05:00
					net, sk, newskb, NULL, newskb->dev,
2012-06-12 10:16:35 +00:00
					dev_loopback_xmit);
2005-04-16 15:20:36 -07:00
2007-04-25 17:54:47 -07:00
			if (ipv6_hdr(skb)->hop_limit == 0) {
2015-09-15 20:04:09 -05:00
				IP6_INC_STATS(net, idev,
2008-10-08 10:54:51 -07:00
					      IPSTATS_MIB_OUTDISCARDS);
2005-04-16 15:20:36 -07:00
				kfree_skb(skb);
				return 0;
			}
		}
2015-09-15 20:04:09 -05:00
		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
2013-02-10 02:33:35 +00:00
		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
		    IPV6_ADDR_SCOPE_NODELOCAL &&
		    !(dev->flags & IFF_LOOPBACK)) {
			kfree_skb(skb);
			return 0;
		}
2005-04-16 15:20:36 -07:00
	}
2016-08-24 20:10:43 -07:00
	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
		int res = lwtunnel_xmit(skb);
		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
			return res;
	}
2013-01-17 12:54:00 +00:00
	rcu_read_lock_bh();
2015-05-22 20:55:58 -07:00
	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
2013-01-17 12:54:00 +00:00
	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
	if (unlikely(!neigh))
		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
	if (!IS_ERR(neigh)) {
2017-02-06 23:14:12 +02:00
		sock_confirm_neigh(skb, neigh);
2019-04-05 16:30:33 -07:00
		ret = neigh_output(neigh, skb, false);
2013-01-17 12:54:00 +00:00
		rcu_read_unlock_bh();
		return ret;
	}
	rcu_read_unlock_bh();
2011-07-16 17:26:00 -07:00
2015-09-15 20:04:09 -05:00
	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
2010-04-13 15:28:11 +02:00
	kfree_skb(skb);
	return -EINVAL;
2005-04-16 15:20:36 -07:00
}
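The node-local scope check at the end of the multicast branch of ip6_finish_output2() can be tried out in user space. What follows is a minimal sketch, not kernel code: `mc_scope()` and `drop_before_xmit()` are hypothetical helper names, and `SCOPE_NODELOCAL` is assumed to match what `IPV6_ADDR_SCOPE_NODELOCAL` evaluates to (0x01), with the scope taken from the low four bits of the second address byte as `IPV6_ADDR_MC_SCOPE()` is understood to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match IPV6_ADDR_SCOPE_NODELOCAL in the kernel headers. */
#define SCOPE_NODELOCAL 0x01

/* Hypothetical helper: scope nibble of a multicast address (ffXS::,
 * where S is the scope), i.e. the low four bits of byte 1. */
static unsigned int mc_scope(const uint8_t addr[16])
{
	assert(addr != NULL);
	return addr[1] & 0x0f;
}

/* Hypothetical helper: true when a multicast packet must not leave the
 * node, i.e. its scope is at most node-local and the egress device is
 * not the loopback device. */
static bool drop_before_xmit(const uint8_t addr[16], bool dev_is_loopback)
{
	return mc_scope(addr) <= SCOPE_NODELOCAL && !dev_is_loopback;
}
```

With this model, a packet to ff01::1 is dropped on a regular interface but still delivered on loopback, while a link-local destination such as ff02::1 is never affected by the check.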
2019-05-28 16:59:38 -07:00
static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
2010-04-13 15:28:11 +02:00
{
2017-12-21 17:32:24 +01:00
#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
	/* Policy lookup after SNAT yielded a new policy */
	if (skb_dst(skb)->xfrm) {
		IP6CB(skb)->flags |= IP6SKB_REROUTED;
		return dst_output(net, sk, skb);
	}
#endif
2010-04-13 15:28:11 +02:00
	if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
2013-11-06 17:52:19 +01:00
	    dst_allfrag(skb_dst(skb)) ||
	    (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
2015-06-12 22:12:04 -05:00
		return ip6_fragment(net, sk, skb, ip6_finish_output2);
2010-04-13 15:28:11 +02:00
	else
2015-06-12 22:12:04 -05:00
		return ip6_finish_output2(net, sk, skb);
2010-04-13 15:28:11 +02:00
}
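The three-way condition that chooses between ip6_fragment() and ip6_finish_output2() can be read as a single predicate. The sketch below restates it in user space under stated assumptions: `struct pkt_model` and `needs_fragmentation()` are hypothetical names, and each field stands in for the kernel accessor named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the values the kernel reads
 * from the skb and its dst entry. */
struct pkt_model {
	size_t len;           /* skb->len */
	size_t mtu;           /* ip6_skb_dst_mtu(skb) */
	size_t frag_max_size; /* IP6CB(skb)->frag_max_size, 0 if unset */
	bool is_gso;          /* skb_is_gso(skb) */
	bool allfrag;         /* dst_allfrag(skb_dst(skb)) */
};

/* True when the packet would be handed to the fragmentation path. */
static bool needs_fragmentation(const struct pkt_model *p)
{
	assert(p != NULL);
	if (p->len > p->mtu && !p->is_gso)
		return true;	/* too big, and not left to GSO */
	if (p->allfrag)
		return true;	/* route asks for fragmenting everything */
	return p->frag_max_size && p->len > p->frag_max_size;
}
```

Note that a GSO packet larger than the MTU is not fragmented here by the first clause alone; only `allfrag` or a smaller `frag_max_size` sends it down the fragmentation path.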
2019-05-28 16:59:38 -07:00
static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	int ret;
	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
	switch (ret) {
	case NET_XMIT_SUCCESS:
		return __ip6_finish_output(net, sk, skb);
	case NET_XMIT_CN:
		return __ip6_finish_output(net, sk, skb) ? : ret;
	default:
		kfree_skb(skb);
		return ret;
	}
}
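The `? :` in the NET_XMIT_CN case is the GNU conditional with an omitted middle operand: the transmit result is returned if it is non-zero, otherwise the congestion verdict is. A user-space sketch of the whole switch follows. `finish_output_model()` and `struct egress_result` are hypothetical names, `xmit_ret` stands for whatever __ip6_finish_output() would return if it were called, and the NET_XMIT_* values are redefined locally on the assumption that they match the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's definitions. */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02

/* The "?:" fallback only makes sense if success is zero. */
static_assert(NET_XMIT_SUCCESS == 0, "success must be zero");

struct egress_result {
	int ret;          /* value the caller sees */
	bool transmitted; /* the transmit path was entered */
	bool freed;       /* the packet was freed without transmitting */
};

/* bpf_ret: verdict of the cgroup egress program.
 * xmit_ret: hypothetical result of the transmit path, used only when
 * the verdict lets the packet through. */
static struct egress_result finish_output_model(int bpf_ret, int xmit_ret)
{
	struct egress_result r = { bpf_ret, false, false };

	switch (bpf_ret) {
	case NET_XMIT_SUCCESS:
		r.transmitted = true;
		r.ret = xmit_ret;
		break;
	case NET_XMIT_CN:
		r.transmitted = true;
		/* portable spelling of "xmit_ret ?: bpf_ret" */
		r.ret = xmit_ret ? xmit_ret : bpf_ret;
		break;
	default:
		r.freed = true;
		break;
	}
	return r;
}
```

So a congestion verdict still transmits the packet, and NET_XMIT_CN only reaches the caller when transmission itself reported success.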
2015-10-07 16:48:47 -05:00
int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
2005-04-16 15:20:36 -07:00
{
2010-04-13 15:28:11 +02:00
	struct net_device *dev = skb_dst(skb)->dev;
2009-06-02 05:19:30 +00:00
	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
2015-09-17 17:21:31 -05:00
2017-06-09 12:06:07 -07:00
	skb->protocol = htons(ETH_P_IPV6);
	skb->dev = dev;
2008-06-28 14:17:11 +09:00
	if (unlikely(idev->cnf.disable_ipv6)) {
2015-09-15 20:04:10 -05:00
		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
2008-06-28 14:17:11 +09:00
		kfree_skb(skb);
		return 0;
	}
2015-09-15 20:04:16 -05:00
	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
			    net, sk, skb, NULL, dev,
2010-04-13 15:32:16 +02:00
			    ip6_finish_output,
			    !(IP6CB(skb)->flags & IP6SKB_REROUTED));
2005-04-16 15:20:36 -07:00
}
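ip6_output() ends in NF_HOOK_COND(), which runs the POST_ROUTING hook only when the packet has not already been rerouted and otherwise calls ip6_finish_output() directly. The sketch below models that control flow as I understand it. `hook_cond()`, `pkt_fn` and the three example callbacks are hypothetical, and an `int` counter stands in for the packet; the real helper takes net, sk, skb and device arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; an int counter stands in for the skb. */
typedef int (*pkt_fn)(int *pkt);

/* Run the hook only if cond is true. The continuation okfn is called
 * when the hook is skipped or when it returns 1 (packet accepted);
 * any other hook result is returned as is. */
static int hook_cond(pkt_fn hook, pkt_fn okfn, int *pkt, bool cond)
{
	int ret;

	assert(hook != NULL && okfn != NULL);
	if (!cond || (ret = hook(pkt)) == 1)
		ret = okfn(pkt);
	return ret;
}

/* Example callbacks for experimenting with the model. */
static int hook_accept(int *pkt) { (void)pkt; return 1; }
static int hook_drop(int *pkt)   { (void)pkt; return -1; }
static int okfn_count(int *pkt)  { (*pkt)++; return 0; }
```

In terms of ip6_output(), `cond` corresponds to `!(IP6CB(skb)->flags & IP6SKB_REROUTED)`: a rerouted packet bypasses the hook, so even a hook that would drop it never sees it a second time.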
2018-01-22 20:06:42 +00:00
bool ip6_autoflowlabel(struct net *net, const struct ipv6_pinfo *np)
net: reevalulate autoflowlabel setting after sysctl setting
sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
If sockopt doesn't set autoflowlabel, outgoing packets from the hosts are
supposed to not include flowlabel. This is true for normal packet, but
not for reset packet.
The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
changed, so the sock will keep the old behavior in terms of auto
flowlabel. Reset packet is suffering from this problem, because reset
packet is sent from a special control socket, which is created at boot
time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
socket will always have its ipv6_pinfo.autoflowlabel set, even after
the user sets sysctl.ipv6.auto_flowlabels to 2, so reset packets will always
have flowlabel. Normal sock created before sysctl setting suffers from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks in the hosts.
To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from user, otherwise we always call
ip6_default_np_autolabel() which has the new settings of sysctl.
Note, this changes behavior a little bit. Before commit 42240901f7c4
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
existing connection will change autoflowlabel behavior. After that
commit, autoflowlabel behavior is sticky in the whole life of the sock.
With this patch, the behavior isn't sticky again.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 12:10:21 -08:00
{
	if (!np->autoflowlabel_set)
		return ip6_default_np_autolabel(net);
	else
		return np->autoflowlabel;
}
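The effect described in the commit message, that a socket without an explicit IPV6_AUTOFLOWLABEL setting follows the sysctl again, can be seen in a small user-space model. This is a sketch under assumptions: `struct sock_model`, `default_autolabel()` and `autoflowlabel()` are hypothetical names, and the mode values 0 to 3 (off, opt-out, opt-in, forced) reflect my understanding of the auto_flowlabels sysctl rather than anything stated in this file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed meaning of the auto_flowlabels sysctl values. */
enum auto_flowlabel_mode {
	FL_OFF = 0,    /* never generate flow labels */
	FL_OPTOUT = 1, /* on by default, sockets may opt out */
	FL_OPTIN = 2,  /* off by default, sockets may opt in */
	FL_FORCED = 3  /* always on */
};

/* Hypothetical stand-in for the two ipv6_pinfo bits involved. */
struct sock_model {
	bool autoflowlabel_set; /* IPV6_AUTOFLOWLABEL sockopt was used */
	bool autoflowlabel;     /* value chosen through the sockopt */
};

/* Stand-in for ip6_default_np_autolabel(): the sysctl default. */
static bool default_autolabel(enum auto_flowlabel_mode mode)
{
	return mode == FL_OPTOUT || mode == FL_FORCED;
}

/* Mirrors ip6_autoflowlabel(): the sockopt wins if it was ever set,
 * otherwise the current sysctl value is consulted on every call. */
static bool autoflowlabel(enum auto_flowlabel_mode mode,
			  const struct sock_model *sk)
{
	assert(sk != NULL);
	if (!sk->autoflowlabel_set)
		return default_autolabel(mode);
	return sk->autoflowlabel;
}
```

A long-lived socket that never used the sockopt, such as the control socket mentioned above, therefore changes behaviour as soon as the mode changes from 1 to 2, while a socket that set the option keeps its own choice.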
2005-04-16 15:20:36 -07:00
/*
2015-09-25 07:39:20 -07:00
 * xmit an sk_buff (used by TCP, SCTP and DCCP)
 * Note: socket lock is not held for SYNACK packets, but might be modified
 * by calls to skb_set_owner_w() and ipv6_local_error(),
 * which are using proper atomic operations or spinlocks.
2005-04-16 15:20:36 -07:00
 */
2015-09-25 07:39:20 -07:00
int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
2019-09-24 08:01:14 -07:00
	     __u32 mark, struct ipv6_txoptions *opt, int tclass, u32 priority)
2005-04-16 15:20:36 -07:00
{
2008-10-08 10:54:51 -07:00
	struct net *net = sock_net(sk);
2015-09-25 07:39:20 -07:00
	const struct ipv6_pinfo *np = inet6_sk(sk);
2011-03-12 16:22:43 -05:00
	struct in6_addr *first_hop = &fl6->daddr;
2009-06-02 05:19:30 +00:00
	struct dst_entry *dst = skb_dst(skb);
ipv6: Check available headroom in ip6_xmit() even without options
Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.
On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().
Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().
KASan says:
[ 264.967848] ==================================================================
[ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
[ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
[ 264.967870]
[ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[ 264.967887] Call Trace:
[ 264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0)
[ 264.967903] [<00000000017e379c>] dump_stack+0x23c/0x290
[ 264.967912] [<00000000007bc594>] print_address_description+0xf4/0x290
[ 264.967919] [<00000000007bc8fc>] kasan_report+0x13c/0x240
[ 264.967927] [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[ 264.967935] [<000000000163f890>] ip6_finish_output+0x430/0x7f0
[ 264.967943] [<000000000163fe44>] ip6_output+0x1f4/0x580
[ 264.967953] [<000000000163882a>] ip6_xmit+0xfea/0x1ce8
[ 264.967963] [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8
[ 264.968033] [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[ 264.968037] [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[ 264.968041] [<0000000001220020>] dev_hard_start_xmit+0x268/0x928
[ 264.968069] [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350
[ 264.968071] [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478
[ 264.968075] [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0
[ 264.968078] [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8
[ 264.968081] [<00000000013ddd1e>] ip_output+0x226/0x4c0
[ 264.968083] [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938
[ 264.968100] [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[ 264.968116] [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[ 264.968131] [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[ 264.968146] [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
[ 264.968161] [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[ 264.968177] [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[ 264.968192] [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[ 264.968208] [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[ 264.968212] [<0000000001197942>] __sys_connect+0x21a/0x450
[ 264.968215] [<000000000119aff8>] sys_socketcall+0x3d0/0xb08
[ 264.968218] [<000000000184ea7a>] system_call+0x2a2/0x2c0
[...]
Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.
This issue is older than git history.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 19:30:36 +01:00
	unsigned int head_room;
2005-04-16 15:20:36 -07:00
	struct ipv6hdr *hdr;
2011-03-12 16:22:43 -05:00
	u8 proto = fl6->flowi6_proto;
2005-04-16 15:20:36 -07:00
	int seg_len = skb->len;
2009-08-09 08:12:48 +00:00
	int hlimit = -1;
2005-04-16 15:20:36 -07:00
	u32 mtu;
2018-12-06 19:30:36 +01:00
	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
	if (opt)
		head_room += opt->opt_nflen + opt->opt_flen;
	if (unlikely(skb_headroom(skb) < head_room)) {
		struct sk_buff *skb2 = skb_realloc_headroom(skb, head_room);
		if (!skb2) {
			IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
				      IPSTATS_MIB_OUTDISCARDS);
			kfree_skb(skb);
			return -ENOBUFS;
2005-04-16 15:20:36 -07:00
		}
2018-12-06 19:30:36 +01:00
		if (skb->sk)
			skb_set_owner_w(skb2, skb->sk);
		consume_skb(skb);
		skb = skb2;
	}
	if (opt) {
		seg_len += opt->opt_nflen + opt->opt_flen;
2005-04-16 15:20:36 -07:00
		if (opt->opt_flen)
			ipv6_push_frag_opts(skb, opt, &proto);
2018-12-06 19:30:36 +01:00
2005-04-16 15:20:36 -07:00
		if (opt->opt_nflen)
2016-11-08 14:59:20 +01:00
			ipv6_push_nfrag_opts(skb, opt, &proto, &first_hop,
					     &fl6->saddr);
2005-04-16 15:20:36 -07:00
	}
2007-04-10 20:46:21 -07:00
	skb_push(skb, sizeof(struct ipv6hdr));
	skb_reset_network_header(skb);
2007-04-25 17:54:47 -07:00
	hdr = ipv6_hdr(skb);
2005-04-16 15:20:36 -07:00
	/*
	 * Fill in the IPv6 header
	 */
2011-10-27 00:44:35 -04:00
	if (np)
2005-04-16 15:20:36 -07:00
		hlimit = np->hop_limit;
	if (hlimit < 0)
2008-03-10 06:00:30 -04:00
		hlimit = ip6_dst_hoplimit(dst);
2005-04-16 15:20:36 -07:00
2014-07-01 21:33:10 -07:00
	ip6_flow_hdr(hdr, tclass, ip6_make_flowlabel(net, skb, fl6->flowlabel,
2017-12-20 12:10:21 -08:00
				ip6_autoflowlabel(net, np), fl6));
2005-09-08 10:19:03 +09:00
2005-04-16 15:20:36 -07:00
	hdr->payload_len = htons(seg_len);
	hdr->nexthdr = proto;
	hdr->hop_limit = hlimit;
2011-11-21 03:39:03 +00:00
	hdr->saddr = fl6->saddr;
	hdr->daddr = *first_hop;
2005-04-16 15:20:36 -07:00
2013-08-26 12:31:23 +02:00
	skb->protocol = htons(ETH_P_IPV6);
2019-09-24 08:01:14 -07:00
	skb->priority = priority;
2017-01-26 22:56:21 +01:00
	skb->mark = mark;
2006-01-08 22:37:26 -08:00
2005-04-16 15:20:36 -07:00
	mtu = dst_mtu(dst);
2014-05-04 16:39:18 -07:00
	if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) {
2009-06-02 05:19:30 +00:00
		IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)),
2009-04-27 02:45:02 -07:00
				 IPSTATS_MIB_OUT, skb->len);
2016-09-10 12:09:53 -07:00
		/* if egress device is enslaved to an L3 master device pass the
		 * skb to its handler for processing
		 */
		skb = l3mdev_ip6_out((struct sock *)sk, skb);
		if (unlikely(!skb))
			return 0;
2015-09-25 07:39:20 -07:00
		/* hooks should never assume socket lock is held.
		 * we promote our socket to non const
		 */
2015-09-15 20:04:16 -05:00
		return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
2015-09-25 07:39:20 -07:00
			       net, (struct sock *)sk, skb, NULL, dst->dev,
2015-10-07 16:48:35 -05:00
			       dst_output);
2005-04-16 15:20:36 -07:00
	}
	skb->dev = dst->dev;
2015-09-25 07:39:20 -07:00
	/* ipv6_local_error() does not require socket lock,
	 * we promote our socket to non const
	 */
	ipv6_local_error((struct sock *)sk, EMSGSIZE, fl6, mtu);
2009-06-02 05:19:30 +00:00
	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_FRAGFAILS);
2005-04-16 15:20:36 -07:00
	kfree_skb(skb);
	return -EMSGSIZE;
}
2007-02-22 22:05:40 +09:00
EXPORT_SYMBOL(ip6_xmit);
2005-04-16 15:20:36 -07:00
static int ip6_call_ra_chain ( struct sk_buff * skb , int sel )
{
struct ip6_ra_chain * ra ;
struct sock * last = NULL ;
read_lock ( & ip6_ra_lock ) ;
for ( ra = ip6_ra_chain ; ra ; ra = ra - > next ) {
struct sock * sk = ra - > sk ;
2005-08-09 19:44:42 -07:00
if ( sk & & ra - > sel = = sel & &
( ! sk - > sk_bound_dev_if | |
sk - > sk_bound_dev_if = = skb - > dev - > ifindex ) ) {
2019-03-01 15:31:03 -08:00
struct ipv6_pinfo * np = inet6_sk ( sk ) ;
if ( np & & np - > rtalert_isolate & &
! net_eq ( sock_net ( sk ) , dev_net ( skb - > dev ) ) ) {
continue ;
}
2005-04-16 15:20:36 -07:00
if ( last ) {
struct sk_buff * skb2 = skb_clone ( skb , GFP_ATOMIC ) ;
if ( skb2 )
rawv6_rcv ( last , skb2 ) ;
}
last = sk ;
}
}
if ( last ) {
rawv6_rcv ( last , skb ) ;
read_unlock ( & ip6_ra_lock ) ;
return 1 ;
}
read_unlock ( & ip6_ra_lock ) ;
return 0 ;
}
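The delivery loop in ip6_call_ra_chain() hands a clone to every matching listener except the last one, which receives the original buffer, so a single match costs no copy. A minimal user-space sketch of that pattern follows. The struct ra_listener type and the counters are hypothetical stand-ins for sockets and skbs, and the rtalert_isolate namespace check and clone-allocation failure are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one registered router-alert listener. */
struct ra_listener {
	int sel;                /* router-alert value the listener asked for */
	int bound_ifindex;      /* 0 means "any interface" */
	int clones_received;    /* how many cloned packets it was given */
	int originals_received; /* how many original packets it was given */
};

/*
 * Deliver one packet to every matching listener. Every match except the
 * last gets a clone; the last match consumes the original.
 * Returns 1 if the packet was consumed, 0 if nobody matched.
 */
static int deliver_to_listeners(struct ra_listener *l, size_t n,
				int sel, int ifindex)
{
	struct ra_listener *last = NULL;
	size_t i;

	assert(l != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (l[i].sel != sel)
			continue;
		if (l[i].bound_ifindex && l[i].bound_ifindex != ifindex)
			continue;
		/* A later match exists, so the previous one only gets a clone. */
		if (last)
			last->clones_received++;
		last = &l[i];
	}

	if (last) {
		last->originals_received++;
		return 1;
	}
	return 0;
}
```

Deferring delivery by one iteration is what lets the loop decide, without a second pass, which listener can take ownership of the original.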
2006-09-22 14:41:44 -07:00
static int ip6_forward_proxy_check ( struct sk_buff * skb )
{
2007-04-25 17:54:47 -07:00
struct ipv6hdr * hdr = ipv6_hdr ( skb ) ;
2006-09-22 14:41:44 -07:00
u8 nexthdr = hdr - > nexthdr ;
2011-11-30 17:05:51 -08:00
__be16 frag_off ;
2006-09-22 14:41:44 -07:00
int offset ;
if ( ipv6_ext_hdr ( nexthdr ) ) {
2011-11-30 17:05:51 -08:00
offset = ipv6_skip_exthdr ( skb , sizeof ( * hdr ) , & nexthdr , & frag_off ) ;
2006-09-22 14:41:44 -07:00
if ( offset < 0 )
return 0 ;
} else
offset = sizeof ( struct ipv6hdr ) ;
if ( nexthdr = = IPPROTO_ICMPV6 ) {
struct icmp6hdr * icmp6 ;
2007-04-10 20:50:43 -07:00
if ( ! pskb_may_pull ( skb , ( skb_network_header ( skb ) +
offset + 1 - skb - > data ) ) )
2006-09-22 14:41:44 -07:00
return 0 ;
2007-04-10 20:50:43 -07:00
icmp6 = ( struct icmp6hdr * ) ( skb_network_header ( skb ) + offset ) ;
2006-09-22 14:41:44 -07:00
switch ( icmp6 - > icmp6_type ) {
case NDISC_ROUTER_SOLICITATION :
case NDISC_ROUTER_ADVERTISEMENT :
case NDISC_NEIGHBOUR_SOLICITATION :
case NDISC_NEIGHBOUR_ADVERTISEMENT :
case NDISC_REDIRECT :
/* For reaction involving unicast neighbor discovery
* message destined to the proxied address , pass it to
* input function .
*/
return 1 ;
default :
break ;
}
}
2006-09-22 14:42:18 -07:00
/*
* The proxying router can ' t forward traffic sent to a link - local
* address , so signal the sender and discard the packet . This
* behavior is clarified by the MIPv6 specification .
*/
if ( ipv6_addr_type ( & hdr - > daddr ) & IPV6_ADDR_LINKLOCAL ) {
dst_link_failure ( skb ) ;
return - 1 ;
}
2006-09-22 14:41:44 -07:00
return 0 ;
}
2015-09-15 20:04:18 -05:00
static inline int ip6_forward_finish ( struct net * net , struct sock * sk ,
struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2018-04-05 21:29:47 +00:00
struct dst_entry * dst = skb_dst ( skb ) ;
__IP6_INC_STATS ( net , ip6_dst_idev ( dst ) , IPSTATS_MIB_OUTFORWDATAGRAMS ) ;
__IP6_ADD_STATS ( net , ip6_dst_idev ( dst ) , IPSTATS_MIB_OUTOCTETS , skb - > len ) ;
2018-12-04 08:15:11 +00:00
# ifdef CONFIG_NET_SWITCHDEV
if ( skb - > offload_l3_fwd_mark ) {
consume_skb ( skb ) ;
return 0 ;
}
# endif
2018-12-14 06:46:49 -08:00
skb - > tstamp = 0 ;
2015-10-07 16:48:35 -05:00
return dst_output ( net , sk , skb ) ;
2005-04-16 15:20:36 -07:00
}
net: ip, ipv6: handle gso skbs in forwarding path
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested simply shrinking mss via ->gso_size to avoid
software segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses software segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. It's not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also it's not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 23:09:12 +01:00
static bool ip6_pkt_too_big ( const struct sk_buff * skb , unsigned int mtu )
{
2014-05-05 00:03:34 +02:00
if ( skb - > len < = mtu )
2014-02-13 23:09:12 +01:00
return false ;
2014-05-04 16:39:18 -07:00
/* ipv6 conntrack defrag sets max_frag_size + ignore_df */
2014-02-13 23:09:12 +01:00
if ( IP6CB ( skb ) - > frag_max_size & & IP6CB ( skb ) - > frag_max_size > mtu )
return true ;
2014-05-04 16:39:18 -07:00
if ( skb - > ignore_df )
2014-05-05 00:03:34 +02:00
return false ;
2018-03-01 17:13:37 +11:00
if ( skb_is_gso ( skb ) & & skb_gso_validate_network_len ( skb , mtu ) )
2014-02-13 23:09:12 +01:00
return false ;
return true ;
}
2005-04-16 15:20:36 -07:00
int ip6_forward ( struct sk_buff * skb )
{
2018-04-16 13:42:16 -04:00
struct inet6_dev * idev = __in6_dev_get_safely ( skb - > dev ) ;
2009-06-02 05:19:30 +00:00
struct dst_entry * dst = skb_dst ( skb ) ;
2007-04-25 17:54:47 -07:00
struct ipv6hdr * hdr = ipv6_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
struct inet6_skb_parm * opt = IP6CB ( skb ) ;
2008-03-25 21:47:49 +09:00
struct net * net = dev_net ( dst - > dev ) ;
2010-02-26 04:34:49 -08:00
u32 mtu ;
2007-02-09 23:24:49 +09:00
2008-07-19 22:35:03 -07:00
if ( net - > ipv6 . devconf_all - > forwarding = = 0 )
2005-04-16 15:20:36 -07:00
goto error ;
2014-03-11 10:40:08 +08:00
if ( skb - > pkt_type ! = PACKET_HOST )
goto drop ;
2015-10-08 18:19:53 +02:00
if ( unlikely ( skb - > sk ) )
goto drop ;
2008-06-19 16:22:28 -07:00
if ( skb_warn_if_lro ( skb ) )
goto drop ;
2005-04-16 15:20:36 -07:00
if ( ! xfrm6_policy_check ( NULL , XFRM_POLICY_FWD , skb ) ) {
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INDISCARDS ) ;
2005-04-16 15:20:36 -07:00
goto drop ;
}
2007-03-26 23:22:20 -07:00
skb_forward_csum ( skb ) ;
2005-04-16 15:20:36 -07:00
/*
* We DO NOT make any processing on
* RA packets , pushing them to user level AS IS
* without any WARRANTY that application will be able
* to interpret them . The reason is that we
* cannot make anything clever here .
*
* We are not end - node , so that if packet contains
* AH / ESP , we cannot make anything .
* Defragmentation also would be mistake , RA packets
* cannot be fragmented , because there is no warranty
* that different fragments will go along one path . - - ANK
*/
2013-06-22 11:13:13 +09:00
if ( unlikely ( opt - > flags & IP6SKB_ROUTERALERT ) ) {
if ( ip6_call_ra_chain ( skb , ntohs ( opt - > ra ) ) )
2005-04-16 15:20:36 -07:00
return 0 ;
}
/*
* check hop limit ( the decrement is delayed until after skb COW )
*/
if ( hdr - > hop_limit < = 1 ) {
/* Force OUTPUT device used as source address */
skb - > dev = dst - > dev ;
2010-02-18 08:25:24 +00:00
icmpv6_send ( skb , ICMPV6_TIME_EXCEED , ICMPV6_EXC_HOPLIMIT , 0 ) ;
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INHDRERRORS ) ;
2005-04-16 15:20:36 -07:00
kfree_skb ( skb ) ;
return - ETIMEDOUT ;
}
2006-09-22 14:43:49 -07:00
/* XXX: idev->cnf.proxy_ndp? */
2008-07-19 22:35:03 -07:00
if ( net - > ipv6 . devconf_all - > proxy_ndp & &
2008-03-07 11:14:16 -08:00
pneigh_lookup ( & nd_tbl , net , & hdr - > daddr , skb - > dev , 0 ) ) {
2006-09-22 14:42:18 -07:00
int proxied = ip6_forward_proxy_check ( skb ) ;
if ( proxied > 0 )
2006-09-22 14:41:44 -07:00
return ip6_input ( skb ) ;
2006-09-22 14:42:18 -07:00
else if ( proxied < 0 ) {
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INDISCARDS ) ;
2006-09-22 14:42:18 -07:00
goto drop ;
}
2006-09-22 14:41:44 -07:00
}
2005-04-16 15:20:36 -07:00
if ( ! xfrm6_route_forward ( skb ) ) {
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INDISCARDS ) ;
2005-04-16 15:20:36 -07:00
goto drop ;
}
2009-06-02 05:19:30 +00:00
dst = skb_dst ( skb ) ;
2005-04-16 15:20:36 -07:00
/* IPv6 specs say nothing about it, but it is clear that we cannot
send redirects to source routed frames .
2007-08-24 19:08:55 +09:00
We don ' t send redirects to frames decapsulated from IPsec .
2005-04-16 15:20:36 -07:00
*/
2018-06-01 00:05:21 -04:00
if ( IP6CB ( skb ) - > iif = = dst - > dev - > ifindex & &
opt - > srcrt = = 0 & & ! skb_sec_path ( skb ) ) {
2005-04-16 15:20:36 -07:00
struct in6_addr * target = NULL ;
2012-06-08 23:24:18 -07:00
struct inet_peer * peer ;
2005-04-16 15:20:36 -07:00
struct rt6_info * rt ;
/*
* incoming and outgoing devices are the same
* send a redirect .
*/
rt = ( struct rt6_info * ) dst ;
2012-01-27 15:32:19 -08:00
if ( rt - > rt6i_flags & RTF_GATEWAY )
target = & rt - > rt6i_gateway ;
2005-04-16 15:20:36 -07:00
else
target = & hdr - > daddr ;
2015-05-22 20:55:57 -07:00
peer = inet_getpeer_v6 ( net - > ipv6 . peers , & hdr - > daddr , 1 ) ;
2011-02-04 15:55:25 -08:00
2005-04-16 15:20:36 -07:00
/* Limit redirects both by destination (here)
and by source ( inside ndisc_send_redirect )
*/
2012-06-08 23:24:18 -07:00
if ( inet_peer_xrlim_allow ( peer , 1 * HZ ) )
2012-01-27 15:30:48 -08:00
ndisc_send_redirect ( skb , target ) ;
2012-07-10 03:58:16 -07:00
if ( peer )
inet_putpeer ( peer ) ;
2007-05-09 13:53:44 -07:00
} else {
int addrtype = ipv6_addr_type ( & hdr - > saddr ) ;
2005-04-16 15:20:36 -07:00
/* This check is security critical. */
2008-06-25 16:55:26 +09:00
if ( addrtype = = IPV6_ADDR_ANY | |
addrtype & ( IPV6_ADDR_MULTICAST | IPV6_ADDR_LOOPBACK ) )
2007-05-09 13:53:44 -07:00
goto error ;
if ( addrtype & IPV6_ADDR_LINKLOCAL ) {
icmpv6_send ( skb , ICMPV6_DEST_UNREACH ,
2010-02-18 08:25:24 +00:00
ICMPV6_NOT_NEIGHBOUR , 0 ) ;
2007-05-09 13:53:44 -07:00
goto error ;
}
2005-04-16 15:20:36 -07:00
}
2014-01-09 10:01:16 +01:00
mtu = ip6_dst_mtu_forward ( dst ) ;
2010-02-26 04:34:49 -08:00
if ( mtu < IPV6_MIN_MTU )
mtu = IPV6_MIN_MTU ;
2014-02-13 23:09:12 +01:00
if ( ip6_pkt_too_big ( skb , mtu ) ) {
2005-04-16 15:20:36 -07:00
/* Again, force OUTPUT device used as source address */
skb - > dev = dst - > dev ;
2010-02-26 04:34:49 -08:00
icmpv6_send ( skb , ICMPV6_PKT_TOOBIG , 0 , mtu ) ;
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INTOOBIGERRORS ) ;
2016-04-27 16:44:40 -07:00
__IP6_INC_STATS ( net , ip6_dst_idev ( dst ) ,
IPSTATS_MIB_FRAGFAILS ) ;
2005-04-16 15:20:36 -07:00
kfree_skb ( skb ) ;
return - EMSGSIZE ;
}
if ( skb_cow ( skb , dst - > dev - > hard_header_len ) ) {
2016-04-27 16:44:40 -07:00
__IP6_INC_STATS ( net , ip6_dst_idev ( dst ) ,
IPSTATS_MIB_OUTDISCARDS ) ;
2005-04-16 15:20:36 -07:00
goto drop ;
}
2007-04-25 17:54:47 -07:00
hdr = ipv6_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
/* Mangling hops number delayed to point after skb COW */
2007-02-09 23:24:49 +09:00
2005-04-16 15:20:36 -07:00
hdr - > hop_limit - - ;
2015-09-15 20:04:16 -05:00
return NF_HOOK ( NFPROTO_IPV6 , NF_INET_FORWARD ,
net , NULL , skb , skb - > dev , dst - > dev ,
2007-11-19 18:53:30 -08:00
ip6_forward_finish ) ;
2005-04-16 15:20:36 -07:00
error :
2018-04-16 13:42:16 -04:00
__IP6_INC_STATS ( net , idev , IPSTATS_MIB_INADDRERRORS ) ;
2005-04-16 15:20:36 -07:00
drop :
kfree_skb ( skb ) ;
return - EINVAL ;
}
static void ip6_copy_metadata ( struct sk_buff * to , struct sk_buff * from )
{
to - > pkt_type = from - > pkt_type ;
to - > priority = from - > priority ;
to - > protocol = from - > protocol ;
2009-06-02 05:19:30 +00:00
skb_dst_drop ( to ) ;
skb_dst_set ( to , dst_clone ( skb_dst ( from ) ) ) ;
2005-04-16 15:20:36 -07:00
to - > dev = from - > dev ;
2006-11-09 15:19:14 -08:00
to - > mark = from - > mark ;
2005-04-16 15:20:36 -07:00
2018-07-23 16:50:48 +02:00
skb_copy_hash ( to , from ) ;
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_NET_SCHED
to - > tc_index = from - > tc_index ;
# endif
2007-03-14 16:44:01 -07:00
nf_copy ( to , from ) ;
sk_buff: add skb extension infrastructure
This adds an optional extension infrastructure, with ipsec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
an MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preserved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (it's unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additional overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If refcount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18 17:15:16 +01:00
skb_ext_copy ( to , from ) ;
2006-06-09 00:29:17 -07:00
skb_copy_secmark ( to , from ) ;
2005-04-16 15:20:36 -07:00
}
2019-05-29 13:25:32 +02:00
int ip6_fraglist_init ( struct sk_buff * skb , unsigned int hlen , u8 * prevhdr ,
u8 nexthdr , __be32 frag_id ,
struct ip6_fraglist_iter * iter )
{
unsigned int first_len ;
struct frag_hdr * fh ;
/* BUILD HEADER */
* prevhdr = NEXTHDR_FRAGMENT ;
iter - > tmp_hdr = kmemdup ( skb_network_header ( skb ) , hlen , GFP_ATOMIC ) ;
if ( ! iter - > tmp_hdr )
return - ENOMEM ;
2019-06-02 11:24:18 -07:00
iter - > frag = skb_shinfo ( skb ) - > frag_list ;
2019-05-29 13:25:32 +02:00
skb_frag_list_init ( skb ) ;
iter - > offset = 0 ;
iter - > hlen = hlen ;
iter - > frag_id = frag_id ;
iter - > nexthdr = nexthdr ;
__skb_pull ( skb , hlen ) ;
fh = __skb_push ( skb , sizeof ( struct frag_hdr ) ) ;
__skb_push ( skb , hlen ) ;
skb_reset_network_header ( skb ) ;
memcpy ( skb_network_header ( skb ) , iter - > tmp_hdr , hlen ) ;
fh - > nexthdr = nexthdr ;
fh - > reserved = 0 ;
fh - > frag_off = htons ( IP6_MF ) ;
fh - > identification = frag_id ;
first_len = skb_pagelen ( skb ) ;
skb - > data_len = first_len - skb_headlen ( skb ) ;
skb - > len = first_len ;
ipv6_hdr ( skb ) - > payload_len = htons ( first_len - sizeof ( struct ipv6hdr ) ) ;
return 0 ;
}
EXPORT_SYMBOL ( ip6_fraglist_init ) ;
void ip6_fraglist_prepare ( struct sk_buff * skb ,
struct ip6_fraglist_iter * iter )
{
struct sk_buff * frag = iter - > frag ;
unsigned int hlen = iter - > hlen ;
struct frag_hdr * fh ;
frag - > ip_summed = CHECKSUM_NONE ;
skb_reset_transport_header ( frag ) ;
fh = __skb_push ( frag , sizeof ( struct frag_hdr ) ) ;
__skb_push ( frag , hlen ) ;
skb_reset_network_header ( frag ) ;
memcpy ( skb_network_header ( frag ) , iter - > tmp_hdr , hlen ) ;
iter - > offset + = skb - > len - hlen - sizeof ( struct frag_hdr ) ;
fh - > nexthdr = iter - > nexthdr ;
fh - > reserved = 0 ;
fh - > frag_off = htons ( iter - > offset ) ;
if ( frag - > next )
fh - > frag_off | = htons ( IP6_MF ) ;
fh - > identification = iter - > frag_id ;
ipv6_hdr ( frag ) - > payload_len = htons ( frag - > len - sizeof ( struct ipv6hdr ) ) ;
ip6_copy_metadata ( frag , skb ) ;
}
EXPORT_SYMBOL ( ip6_fraglist_prepare ) ;
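Both fraglist helpers build fh->frag_off the same way: the byte offset is always a multiple of 8, so its low three bits are free and the lowest one carries the "more fragments" flag. A minimal user-space sketch of that encoding is shown below. The helper names are hypothetical, and the IP6_MF and IP6_OFFSET values are restated here on the assumption that they match the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Assumed to match the kernel's definitions. */
#define IP6_MF     0x0001 /* "more fragments" flag */
#define IP6_OFFSET 0xFFF8 /* mask for the fragment offset bits */

/* Build the on-wire frag_off value from a byte offset and the M flag. */
static uint16_t frag_off_encode(unsigned int offset, int more)
{
	uint16_t v;

	/* Offsets are in bytes but must sit on an 8-byte boundary. */
	assert((offset & 7) == 0 && offset <= IP6_OFFSET);

	v = htons((uint16_t)offset);
	if (more)
		v |= htons(IP6_MF);
	return v;
}

/* Extract the byte offset from an on-wire frag_off value. */
static unsigned int frag_off_offset(uint16_t wire)
{
	return ntohs(wire) & IP6_OFFSET;
}

/* Extract the M flag from an on-wire frag_off value. */
static int frag_off_more(uint16_t wire)
{
	return (ntohs(wire) & IP6_MF) != 0;
}
```

Because the flag and the offset occupy disjoint bits, OR-ing two values that are already in network byte order gives the same result as converting the combined host-order value once.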
2019-05-29 13:25:34 +02:00
void ip6_frag_init ( struct sk_buff * skb , unsigned int hlen , unsigned int mtu ,
unsigned short needed_tailroom , int hdr_room , u8 * prevhdr ,
u8 nexthdr , __be32 frag_id , struct ip6_frag_state * state )
{
state - > prevhdr = prevhdr ;
state - > nexthdr = nexthdr ;
state - > frag_id = frag_id ;
state - > hlen = hlen ;
state - > mtu = mtu ;
state - > left = skb - > len - hlen ; /* Space per frame */
state - > ptr = hlen ; /* Where to start from */
state - > hroom = hdr_room ;
state - > troom = needed_tailroom ;
state - > offset = 0 ;
}
EXPORT_SYMBOL ( ip6_frag_init ) ;
struct sk_buff * ip6_frag_next ( struct sk_buff * skb , struct ip6_frag_state * state )
{
u8 * prevhdr = state - > prevhdr , * fragnexthdr_offset ;
struct sk_buff * frag ;
struct frag_hdr * fh ;
unsigned int len ;
len = state - > left ;
/* IF: it doesn't fit, use 'mtu' - the data space left */
if ( len > state - > mtu )
len = state - > mtu ;
/* IF: we are not sending up to and including the packet end
then align the next start on an eight byte boundary */
if ( len < state - > left )
len & = ~ 7 ;
/* Allocate buffer */
frag = alloc_skb ( len + state - > hlen + sizeof ( struct frag_hdr ) +
state - > hroom + state - > troom , GFP_ATOMIC ) ;
if ( ! frag )
return ERR_PTR ( - ENOMEM ) ;
/*
* Set up data on packet
*/
ip6_copy_metadata ( frag , skb ) ;
skb_reserve ( frag , state - > hroom ) ;
skb_put ( frag , len + state - > hlen + sizeof ( struct frag_hdr ) ) ;
skb_reset_network_header ( frag ) ;
fh = ( struct frag_hdr * ) ( skb_network_header ( frag ) + state - > hlen ) ;
frag - > transport_header = ( frag - > network_header + state - > hlen +
sizeof ( struct frag_hdr ) ) ;
/*
* Charge the memory for the fragment to any owner
* it might possess
*/
if ( skb - > sk )
skb_set_owner_w ( frag , skb - > sk ) ;
/*
* Copy the packet header into the new buffer .
*/
skb_copy_from_linear_data ( skb , skb_network_header ( frag ) , state - > hlen ) ;
fragnexthdr_offset = skb_network_header ( frag ) ;
fragnexthdr_offset + = prevhdr - skb_network_header ( skb ) ;
* fragnexthdr_offset = NEXTHDR_FRAGMENT ;
/*
* Build fragment header .
*/
fh - > nexthdr = state - > nexthdr ;
fh - > reserved = 0 ;
fh - > identification = state - > frag_id ;
/*
* Copy a block of the IP datagram .
*/
BUG_ON ( skb_copy_bits ( skb , state - > ptr , skb_transport_header ( frag ) ,
len ) ) ;
state - > left - = len ;
fh - > frag_off = htons ( state - > offset ) ;
if ( state - > left > 0 )
fh - > frag_off | = htons ( IP6_MF ) ;
ipv6_hdr ( frag ) - > payload_len = htons ( frag - > len - sizeof ( struct ipv6hdr ) ) ;
state - > ptr + = len ;
state - > offset + = len ;
return frag ;
}
EXPORT_SYMBOL ( ip6_frag_next ) ;
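The length selection at the top of ip6_frag_next() caps each fragment at the available space and, for every fragment but the last, rounds the length down to a multiple of 8 so the next offset stays aligned. The sketch below reproduces only that arithmetic in user space. The function name plan_fragments and its parameters are hypothetical, and space plays the role of state->mtu, that is, the payload room per fragment after headers are accounted for.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split `left` payload bytes into fragments of at most `space` bytes.
 * Lengths of the first `max` fragments are written to out[].
 * Returns the total number of fragments needed.
 */
static size_t plan_fragments(unsigned int left, unsigned int space,
			     unsigned int *out, size_t max)
{
	size_t n = 0;

	/* With less than 8 bytes of room the aligned length would be 0. */
	assert(space >= 8);

	while (left > 0) {
		unsigned int len = left;

		/* It does not fit: use the space available. */
		if (len > space)
			len = space;
		/* Not the final fragment: keep the next offset 8-byte aligned. */
		if (len < left)
			len &= ~7u;

		if (n < max)
			out[n] = len;
		n++;
		left -= len;
	}
	return n;
}
```

For example, 2000 bytes of payload with 1230 bytes of room per fragment yields a first fragment of 1224 bytes (1230 rounded down to a multiple of 8) and a final fragment of 776 bytes.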
2015-06-12 22:12:04 -05:00
int ip6_fragment ( struct net * net , struct sock * sk , struct sk_buff * skb ,
int ( * output ) ( struct net * , struct sock * , struct sk_buff * ) )
2005-04-16 15:20:36 -07:00
{
struct sk_buff * frag ;
2014-08-24 21:53:10 +01:00
struct rt6_info * rt = ( struct rt6_info * ) skb_dst ( skb ) ;
2015-04-01 17:07:44 +02:00
struct ipv6_pinfo * np = skb - > sk & & ! dev_recursion_level ( ) ?
inet6_sk ( skb - > sk ) : NULL ;
2019-05-29 13:25:34 +02:00
struct ip6_frag_state state ;
unsigned int mtu , hlen , nexthdr_offset ;
2019-10-16 18:00:56 -07:00
ktime_t tstamp = skb - > tstamp ;
2019-05-29 13:25:34 +02:00
int hroom , err = 0 ;
2015-05-22 20:55:56 -07:00
__be32 frag_id ;
2005-04-16 15:20:36 -07:00
u8 * prevhdr , nexthdr = 0 ;
2017-05-17 22:54:11 -04:00
err = ip6_find_1stfragopt ( skb , & prevhdr ) ;
if ( err < 0 )
ipv6: Prevent overrun when parsing v6 header options
The KASAN warning reported below was discovered with a syzkaller
program. The reproducer is basically:
int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
send(s, &one_byte_of_data, 1, MSG_MORE);
send(s, &more_than_mtu_bytes_data, 2000, 0);
The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non zero
byte of data, and the second send call triggers the fragmentation path.
The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option. Since nexthdr points
to an invalid option, the calculated size of the network header
can be made much larger than the linear section of the skb, and data
is read outside of it.
This fix makes ip6_find_1stfragopt return an error if it detects
running out-of-bounds.
[ 42.361487] ==================================================================
[ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
[ 42.366469]
[ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[ 42.368824] Call Trace:
[ 42.369183] dump_stack+0xb3/0x10b
[ 42.369664] print_address_description+0x73/0x290
[ 42.370325] kasan_report+0x252/0x370
[ 42.370839] ? ip6_fragment+0x11c8/0x3730
[ 42.371396] check_memory_region+0x13c/0x1a0
[ 42.371978] memcpy+0x23/0x50
[ 42.372395] ip6_fragment+0x11c8/0x3730
[ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110
[ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0
[ 42.374263] ? ip6_forward+0x2e30/0x2e30
[ 42.374803] ip6_finish_output+0x584/0x990
[ 42.375350] ip6_output+0x1b7/0x690
[ 42.375836] ? ip6_finish_output+0x990/0x990
[ 42.376411] ? ip6_fragment+0x3730/0x3730
[ 42.376968] ip6_local_out+0x95/0x160
[ 42.377471] ip6_send_skb+0xa1/0x330
[ 42.377969] ip6_push_pending_frames+0xb3/0xe0
[ 42.378589] rawv6_sendmsg+0x2051/0x2db0
[ 42.379129] ? rawv6_bind+0x8b0/0x8b0
[ 42.379633] ? _copy_from_user+0x84/0xe0
[ 42.380193] ? debug_check_no_locks_freed+0x290/0x290
[ 42.380878] ? ___sys_sendmsg+0x162/0x930
[ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120
[ 42.382074] ? sock_has_perm+0x1f6/0x290
[ 42.382614] ? ___sys_sendmsg+0x167/0x930
[ 42.383173] ? lock_downgrade+0x660/0x660
[ 42.383727] inet_sendmsg+0x123/0x500
[ 42.384226] ? inet_sendmsg+0x123/0x500
[ 42.384748] ? inet_recvmsg+0x540/0x540
[ 42.385263] sock_sendmsg+0xca/0x110
[ 42.385758] SYSC_sendto+0x217/0x380
[ 42.386249] ? SYSC_connect+0x310/0x310
[ 42.386783] ? __might_fault+0x110/0x1d0
[ 42.387324] ? lock_downgrade+0x660/0x660
[ 42.387880] ? __fget_light+0xa1/0x1f0
[ 42.388403] ? __fdget+0x18/0x20
[ 42.388851] ? sock_common_setsockopt+0x95/0xd0
[ 42.389472] ? SyS_setsockopt+0x17f/0x260
[ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe
[ 42.390650] SyS_sendto+0x40/0x50
[ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.391731] RIP: 0033:0x7fbbb711e383
[ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
[ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
[ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
[ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
[ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
[ 42.397257]
[ 42.397411] Allocated by task 3789:
[ 42.397702] save_stack_trace+0x16/0x20
[ 42.398005] save_stack+0x46/0xd0
[ 42.398267] kasan_kmalloc+0xad/0xe0
[ 42.398548] kasan_slab_alloc+0x12/0x20
[ 42.398848] __kmalloc_node_track_caller+0xcb/0x380
[ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0
[ 42.399654] __alloc_skb+0xf8/0x580
[ 42.400003] sock_wmalloc+0xab/0xf0
[ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0
[ 42.400813] ip6_append_data+0x1a8/0x2f0
[ 42.401122] rawv6_sendmsg+0x11ee/0x2db0
[ 42.401505] inet_sendmsg+0x123/0x500
[ 42.401860] sock_sendmsg+0xca/0x110
[ 42.402209] ___sys_sendmsg+0x7cb/0x930
[ 42.402582] __sys_sendmsg+0xd9/0x190
[ 42.402941] SyS_sendmsg+0x2d/0x50
[ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.403718]
[ 42.403871] Freed by task 1794:
[ 42.404146] save_stack_trace+0x16/0x20
[ 42.404515] save_stack+0x46/0xd0
[ 42.404827] kasan_slab_free+0x72/0xc0
[ 42.405167] kfree+0xe8/0x2b0
[ 42.405462] skb_free_head+0x74/0xb0
[ 42.405806] skb_release_data+0x30e/0x3a0
[ 42.406198] skb_release_all+0x4a/0x60
[ 42.406563] consume_skb+0x113/0x2e0
[ 42.406910] skb_free_datagram+0x1a/0xe0
[ 42.407288] netlink_recvmsg+0x60d/0xe40
[ 42.407667] sock_recvmsg+0xd7/0x110
[ 42.408022] ___sys_recvmsg+0x25c/0x580
[ 42.408395] __sys_recvmsg+0xd6/0x190
[ 42.408753] SyS_recvmsg+0x2d/0x50
[ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.409513]
[ 42.409665] The buggy address belongs to the object at ffff88000969e780
[ 42.409665] which belongs to the cache kmalloc-512 of size 512
[ 42.410846] The buggy address is located 24 bytes inside of
[ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980)
[ 42.411941] The buggy address belongs to the page:
[ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
[ 42.413298] flags: 0x100000000008100(slab|head)
[ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
[ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
[ 42.415074] page dumped because: kasan: bad access detected
[ 42.415604]
[ 42.415757] Memory state around the buggy address:
[ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 42.418273] ^
[ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419273] ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419882] ==================================================================
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 14:36:23 -04:00
goto fail ;
2017-05-17 22:54:11 -04:00
hlen = err ;
2005-04-16 15:20:36 -07:00
nexthdr = * prevhdr ;
2019-04-02 19:38:04 +08:00
nexthdr_offset = prevhdr - skb_network_header ( skb ) ;
2005-04-16 15:20:36 -07:00
2007-04-20 15:53:27 -07:00
mtu = ip6_skb_dst_mtu ( skb ) ;
2007-04-20 15:52:39 -07:00
/* We must not fragment if the socket is set to force MTU discovery
2010-02-26 04:34:49 -08:00
* or if the skb is not generated by a local socket .
2007-04-20 15:52:39 -07:00
*/
2015-05-22 00:44:16 +02:00
if ( unlikely ( ! skb - > ignore_df & & skb - > len > mtu ) )
goto fail_toobig ;
2012-05-18 21:51:44 +00:00
2015-05-22 00:44:16 +02:00
if ( IP6CB ( skb ) - > frag_max_size ) {
if ( IP6CB ( skb ) - > frag_max_size > mtu )
goto fail_toobig ;
/* don't send fragments larger than what we received */
mtu = IP6CB ( skb ) - > frag_max_size ;
if ( mtu < IPV6_MIN_MTU )
mtu = IPV6_MIN_MTU ;
2007-04-20 15:52:39 -07:00
}
2006-02-24 13:18:33 -08:00
if ( np & & np - > frag_size < mtu ) {
if ( np - > frag_size )
mtu = np - > frag_size ;
}
2015-10-28 13:21:04 +01:00
if ( mtu < hlen + sizeof ( struct frag_hdr ) + 8 )
2015-10-16 11:32:43 +02:00
goto fail_toobig ;
2015-10-28 13:21:03 +01:00
mtu - = hlen + sizeof ( struct frag_hdr ) ;
2005-04-16 15:20:36 -07:00
2015-05-22 20:55:57 -07:00
frag_id = ipv6_select_ident ( net , & ipv6_hdr ( skb ) - > daddr ,
& ipv6_hdr ( skb ) - > saddr ) ;
2015-05-22 20:55:56 -07:00
2015-10-27 22:40:42 +01:00
if ( skb - > ip_summed = = CHECKSUM_PARTIAL & &
( err = skb_checksum_help ( skb ) ) )
goto fail ;
2019-04-02 19:38:04 +08:00
prevhdr = skb_network_header ( skb ) + nexthdr_offset ;
2015-09-16 17:26:14 +02:00
hroom = LL_RESERVED_SPACE ( rt - > dst . dev ) ;
2010-08-23 00:13:46 -07:00
if ( skb_has_frag_list ( skb ) ) {
2016-11-19 04:08:08 +03:00
unsigned int first_len = skb_pagelen ( skb ) ;
2019-05-29 13:25:32 +02:00
struct ip6_fraglist_iter iter ;
2010-09-21 08:47:45 +00:00
struct sk_buff * frag2 ;
2005-04-16 15:20:36 -07:00
if ( first_len - hlen > mtu | |
( ( first_len - hlen ) & 7 ) | |
2015-09-16 17:26:14 +02:00
skb_cloned ( skb ) | |
skb_headroom ( skb ) < ( hroom + sizeof ( struct frag_hdr ) ) )
2005-04-16 15:20:36 -07:00
goto slow_path ;
2009-06-09 00:20:05 -07:00
skb_walk_frags ( skb , frag ) {
2005-04-16 15:20:36 -07:00
/* Correct geometry. */
if ( frag - > len > mtu | |
( ( frag - > len & 7 ) & & frag - > next ) | |
2015-09-16 17:26:14 +02:00
skb_headroom ( frag ) < ( hlen + hroom + sizeof ( struct frag_hdr ) ) )
2010-09-21 08:47:45 +00:00
goto slow_path_clean ;
2005-04-16 15:20:36 -07:00
/* Partially cloned skb? */
if ( skb_shared ( frag ) )
2010-09-21 08:47:45 +00:00
goto slow_path_clean ;
2005-05-18 22:52:33 -07:00
BUG_ON ( frag - > sk ) ;
if ( skb - > sk ) {
frag - > sk = skb - > sk ;
frag - > destructor = sock_wfree ;
}
2010-09-21 08:47:45 +00:00
skb - > truesize - = frag - > truesize ;
2005-04-16 15:20:36 -07:00
}
2019-05-29 13:25:32 +02:00
err = ip6_fraglist_init ( skb , hlen , prevhdr , nexthdr , frag_id ,
& iter ) ;
if ( err < 0 )
2015-09-16 17:26:14 +02:00
goto fail ;
2006-11-04 20:11:37 +09:00
2005-04-16 15:20:36 -07:00
for ( ; ; ) {
/* Prepare header of the next frame,
* before previous one went down . */
2019-05-29 13:25:32 +02:00
if ( iter . frag )
ip6_fraglist_prepare ( skb , & iter ) ;
2007-02-09 23:24:49 +09:00
2019-10-16 18:00:56 -07:00
skb - > tstamp = tstamp ;
2015-06-12 22:12:04 -05:00
err = output ( net , sk , skb ) ;
2014-08-24 21:53:10 +01:00
if ( ! err )
2010-06-10 23:31:35 -07:00
IP6_INC_STATS ( net , ip6_dst_idev ( & rt - > dst ) ,
2008-10-08 10:54:51 -07:00
IPSTATS_MIB_FRAGCREATES ) ;
2006-08-02 13:41:21 -07:00
2019-05-29 13:25:32 +02:00
if ( err | | ! iter . frag )
2005-04-16 15:20:36 -07:00
break ;
2019-05-29 13:25:32 +02:00
skb = ip6_fraglist_next ( & iter ) ;
2005-04-16 15:20:36 -07:00
}
2019-05-29 13:25:32 +02:00
kfree ( iter . tmp_hdr ) ;
2005-04-16 15:20:36 -07:00
if ( err = = 0 ) {
2010-06-10 23:31:35 -07:00
IP6_INC_STATS ( net , ip6_dst_idev ( & rt - > dst ) ,
2008-10-08 10:54:51 -07:00
IPSTATS_MIB_FRAGOKS ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2019-06-02 11:24:18 -07:00
kfree_skb_list ( iter . frag ) ;
2005-04-16 15:20:36 -07:00
2010-06-10 23:31:35 -07:00
IP6_INC_STATS ( net , ip6_dst_idev ( & rt - > dst ) ,
2008-10-08 10:54:51 -07:00
IPSTATS_MIB_FRAGFAILS ) ;
2005-04-16 15:20:36 -07:00
return err ;
2010-09-21 08:47:45 +00:00
slow_path_clean :
skb_walk_frags ( skb , frag2 ) {
if ( frag2 = = frag )
break ;
frag2 - > sk = NULL ;
frag2 - > destructor = NULL ;
skb - > truesize + = frag2 - > truesize ;
}
2005-04-16 15:20:36 -07:00
}
slow_path :
/*
* Fragment the datagram .
*/
2019-05-29 13:25:34 +02:00
ip6_frag_init ( skb , hlen , mtu , rt - > dst . dev - > needed_tailroom ,
LL_RESERVED_SPACE ( rt - > dst . dev ) , prevhdr , nexthdr , frag_id ,
& state ) ;
2005-04-16 15:20:36 -07:00
/*
* Keep copying data until we run out .
*/
2019-05-29 13:25:34 +02:00
while ( state . left > 0 ) {
frag = ip6_frag_next ( skb , & state ) ;
if ( IS_ERR ( frag ) ) {
err = PTR_ERR ( frag ) ;
2005-04-16 15:20:36 -07:00
goto fail ;
}
/*
* Put this fragment into the sending queue .
*/
2019-10-16 18:00:56 -07:00
frag - > tstamp = tstamp ;
2015-06-12 22:12:04 -05:00
err = output ( net , sk , frag ) ;
2005-04-16 15:20:36 -07:00
if ( err )
goto fail ;
2006-08-02 13:41:21 -07:00
2009-06-02 05:19:30 +00:00
IP6_INC_STATS ( net , ip6_dst_idev ( skb_dst ( skb ) ) ,
2008-10-08 10:54:51 -07:00
IPSTATS_MIB_FRAGCREATES ) ;
2005-04-16 15:20:36 -07:00
}
2009-06-02 05:19:30 +00:00
IP6_INC_STATS ( net , ip6_dst_idev ( skb_dst ( skb ) ) ,
2006-11-04 20:11:37 +09:00
IPSTATS_MIB_FRAGOKS ) ;
2012-04-24 10:17:59 +00:00
consume_skb ( skb ) ;
2005-04-16 15:20:36 -07:00
return err ;
2015-05-22 00:44:16 +02:00
fail_toobig :
if ( skb - > sk & & dst_allfrag ( skb_dst ( skb ) ) )
sk_nocaps_add ( skb - > sk , NETIF_F_GSO_MASK ) ;
icmpv6_send ( skb , ICMPV6_PKT_TOOBIG , 0 , mtu ) ;
err = - EMSGSIZE ;
2005-04-16 15:20:36 -07:00
fail :
2009-06-02 05:19:30 +00:00
IP6_INC_STATS ( net , ip6_dst_idev ( skb_dst ( skb ) ) ,
2006-11-04 20:11:37 +09:00
IPSTATS_MIB_FRAGFAILS ) ;
2007-02-09 23:24:49 +09:00
kfree_skb ( skb ) ;
2005-04-16 15:20:36 -07:00
return err ;
}
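The slow path of ip6_fragment() cuts the fragmentable part of the packet into pieces that fit the reduced mtu. Below is a user-space sketch of that arithmetic under stated assumptions: payload_len stands for the bytes after the unfragmentable headers, the fragment header is 8 bytes, and every fragment except the last is rounded down to a multiple of 8 bytes (that rounding happens in the part of ip6_frag_next() not shown here, so treat it as an assumption). The ex_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EX_FRAG_HDR_LEN 8u	/* assumed sizeof(struct frag_hdr) */

struct ex_frag {
	unsigned int offset;	/* byte offset within the fragmentable part */
	unsigned int len;	/* payload bytes carried by this fragment */
	int more;		/* non-zero if further fragments follow */
};

/*
 * Plan the fragments for payload_len bytes. hlen is the length of the
 * unfragmentable headers. Returns the number of fragments written to
 * out, or -1 if the mtu is too small or out has fewer than the
 * required number of slots.
 */
static int ex_plan_fragments(unsigned int payload_len, unsigned int mtu,
			     unsigned int hlen, struct ex_frag *out,
			     size_t max)
{
	unsigned int room, offset = 0, left = payload_len;
	size_t n = 0;

	assert(out != NULL);

	/* Same lower bound as the fail_toobig check in ip6_fragment(). */
	if (mtu < hlen + EX_FRAG_HDR_LEN + 8)
		return -1;
	room = mtu - (hlen + EX_FRAG_HDR_LEN);

	while (left > 0) {
		unsigned int len = left > room ? room : left;

		/* Every fragment but the last must be 8-byte aligned. */
		if (len < left)
			len &= ~7u;
		if (n == max)
			return -1;
		out[n].offset = offset;
		out[n].len = len;
		out[n].more = left > len;
		n++;
		offset += len;
		left -= len;
	}
	return (int)n;
}
```

With mtu = 1280, hlen = 40 and a 5044-byte payload, this yields four 1232-byte fragments followed by a final 116-byte fragment at offset 4928.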
2011-04-22 04:53:02 +00:00
static inline int ip6_rt_check ( const struct rt6key * rt_key ,
const struct in6_addr * fl_addr ,
const struct in6_addr * addr_cache )
2006-08-23 17:19:18 -07:00
{
2010-09-22 20:43:57 +00:00
return ( rt_key - > plen ! = 128 | | ! ipv6_addr_equal ( fl_addr , & rt_key - > addr ) ) & &
2015-03-29 14:00:04 +01:00
( ! addr_cache | | ! ipv6_addr_equal ( fl_addr , addr_cache ) ) ;
2006-08-23 17:19:18 -07:00
}
2006-07-30 20:19:33 -07:00
static struct dst_entry * ip6_sk_dst_check ( struct sock * sk ,
struct dst_entry * dst ,
2011-04-22 04:53:02 +00:00
const struct flowi6 * fl6 )
2005-04-16 15:20:36 -07:00
{
2006-07-30 20:19:33 -07:00
struct ipv6_pinfo * np = inet6_sk ( sk ) ;
2013-06-26 04:15:07 -07:00
struct rt6_info * rt ;
2005-04-16 15:20:36 -07:00
2006-07-30 20:19:33 -07:00
if ( ! dst )
goto out ;
2013-06-26 04:15:07 -07:00
if ( dst - > ops - > family ! = AF_INET6 ) {
dst_release ( dst ) ;
return NULL ;
}
rt = ( struct rt6_info * ) dst ;
2006-07-30 20:19:33 -07:00
/* Yes, checking route validity in not connected
* case is not very simple . Take into account ,
* that we do not support routing by source , TOS ,
2014-08-24 21:53:10 +01:00
* and MSG_DONTROUTE - - ANK ( 980726 )
2006-07-30 20:19:33 -07:00
*
2006-08-23 17:19:18 -07:00
* 1. ip6_rt_check ( ) : If route was host route ,
* check that cached destination is current .
2006-07-30 20:19:33 -07:00
* If it is network route , we still may
* check its validity using saved pointer
* to the last used address : daddr_cache .
* We do not want to save whole address now ,
* ( because main consumer of this service
* is tcp , which has not this problem ) ,
* so that the last trick works only on connected
* sockets .
* 2. oif also should be the same .
*/
2011-03-12 16:22:43 -05:00
if ( ip6_rt_check ( & rt - > rt6i_dst , & fl6 - > daddr , np - > daddr_cache ) | |
2006-08-29 17:15:09 -07:00
# ifdef CONFIG_IPV6_SUBTREES
2011-03-12 16:22:43 -05:00
ip6_rt_check ( & rt - > rt6i_src , & fl6 - > saddr , np - > saddr_cache ) | |
2006-08-29 17:15:09 -07:00
# endif
2015-10-12 11:47:10 -07:00
( ! ( fl6 - > flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF ) & &
( fl6 - > flowi6_oif & & fl6 - > flowi6_oif ! = dst - > dev - > ifindex ) ) ) {
2006-07-30 20:19:33 -07:00
dst_release ( dst ) ;
dst = NULL ;
2005-04-16 15:20:36 -07:00
}
2006-07-30 20:19:33 -07:00
out :
return dst ;
}
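The validity test described in the comment inside ip6_sk_dst_check() comes down to one boolean predicate per address. The following user-space sketch mirrors ip6_rt_check() with standard socket types; struct ex_rt_key and the ex_ helpers are hypothetical stand-ins for struct rt6key and ipv6_addr_equal(), not kernel APIs.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct rt6key: a prefix and its length. */
struct ex_rt_key {
	struct in6_addr addr;
	int plen;
};

static bool ex_addr_equal(const struct in6_addr *a, const struct in6_addr *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/*
 * Returns true when the cached route must be dropped:
 *  - the route is not a /128 host route for exactly this flow address, and
 *  - there is no "last used address" cache, or it names another address.
 */
static bool ex_rt_check(const struct ex_rt_key *key,
			const struct in6_addr *fl_addr,
			const struct in6_addr *addr_cache)
{
	assert(key != NULL && fl_addr != NULL);
	return (key->plen != 128 || !ex_addr_equal(fl_addr, &key->addr)) &&
	       (!addr_cache || !ex_addr_equal(fl_addr, addr_cache));
}
```

A host route that matches the flow address keeps the cache valid even without an address cache, while a network route is only trusted when the cached last-used address equals the flow address.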
2015-09-25 07:39:12 -07:00
static int ip6_dst_lookup_tail ( struct net * net , const struct sock * sk ,
2011-03-12 16:22:43 -05:00
struct dst_entry * * dst , struct flowi6 * fl6 )
2006-07-30 20:19:33 -07:00
{
2011-07-17 23:09:49 -07:00
# ifdef CONFIG_IPV6_OPTIMISTIC_DAD
struct neighbour * n ;
2012-07-02 22:43:47 -07:00
struct rt6_info * rt ;
2011-07-17 23:09:49 -07:00
# endif
int err ;
2016-01-29 12:30:19 +01:00
int flags = 0 ;
2006-07-30 20:19:33 -07:00
2015-05-05 13:36:59 +03:00
/* The correct way to handle this would be to do
* ip6_route_get_saddr , and then ip6_route_output ; however ,
* the route - specific preferred source forces the
* ip6_route_output call _before_ ip6_route_get_saddr .
*
* In source specific routing ( no src = any default route ) ,
* ip6_route_output will fail given src = any saddr , though , so
* that's why we try it again later .
*/
if ( ipv6_addr_any ( & fl6 - > saddr ) & & ( ! * dst | | ! ( * dst ) - > error ) ) {
2018-04-20 15:38:02 -07:00
struct fib6_info * from ;
2015-05-05 13:36:59 +03:00
struct rt6_info * rt ;
bool had_dst = * dst ! = NULL ;
2005-04-16 15:20:36 -07:00
2015-05-05 13:36:59 +03:00
if ( ! had_dst )
* dst = ip6_route_output ( net , sk , fl6 ) ;
rt = ( * dst ) - > error ? NULL : ( struct rt6_info * ) * dst ;
2018-04-20 15:38:02 -07:00
rcu_read_lock ( ) ;
from = rt ? rcu_dereference ( rt - > from ) : NULL ;
err = ip6_route_get_saddr ( net , from , & fl6 - > daddr ,
2011-04-13 21:10:57 +00:00
sk ? inet6_sk ( sk ) - > srcprefs : 0 ,
& fl6 - > saddr ) ;
2018-04-20 15:38:02 -07:00
rcu_read_unlock ( ) ;
2005-07-27 11:45:17 -07:00
if ( err )
2005-04-16 15:20:36 -07:00
goto out_err_release ;
2015-05-05 13:36:59 +03:00
/* If we had an erroneous initial result, pretend it
* never existed and let the SA - enabled version take
* over .
*/
if ( ! had_dst & & ( * dst ) - > error ) {
dst_release ( * dst ) ;
* dst = NULL ;
}
2016-01-29 12:30:19 +01:00
if ( fl6 - > flowi6_oif )
flags | = RT6_LOOKUP_F_IFACE ;
2005-04-16 15:20:36 -07:00
}
2015-05-05 13:36:59 +03:00
if ( ! * dst )
2016-01-29 12:30:19 +01:00
* dst = ip6_route_output_flags ( net , sk , fl6 , flags ) ;
2015-05-05 13:36:59 +03:00
err = ( * dst ) - > error ;
if ( err )
goto out_err_release ;
2007-04-25 17:08:10 -07:00
# ifdef CONFIG_IPV6_OPTIMISTIC_DAD
2008-09-09 13:51:35 -07:00
/*
* Here if the dst entry we've looked up
* has a neighbour entry that is in the INCOMPLETE
* state and the src address from the flow is
* marked as OPTIMISTIC , we release the found
* dst entry and replace it instead with the
* dst entry of the nexthop router
*/
2012-07-06 09:19:05 +02:00
rt = ( struct rt6_info * ) * dst ;
2013-01-17 12:53:55 +00:00
rcu_read_lock_bh ( ) ;
2015-05-22 20:55:58 -07:00
n = __ipv6_neigh_lookup_noref ( rt - > dst . dev ,
rt6_nexthop ( rt , & fl6 - > daddr ) ) ;
2013-01-17 12:53:55 +00:00
err = n & & ! ( n - > nud_state & NUD_VALID ) ? - EINVAL : 0 ;
rcu_read_unlock_bh ( ) ;
if ( err ) {
2008-09-09 13:51:35 -07:00
struct inet6_ifaddr * ifp ;
2011-03-12 16:22:43 -05:00
struct flowi6 fl_gw6 ;
2008-09-09 13:51:35 -07:00
int redirect ;
2011-03-12 16:22:43 -05:00
ifp = ipv6_get_ifaddr ( net , & fl6 - > saddr ,
2008-09-09 13:51:35 -07:00
( * dst ) - > dev , 1 ) ;
redirect = ( ifp & & ifp - > flags & IFA_F_OPTIMISTIC ) ;
if ( ifp )
in6_ifa_put ( ifp ) ;
if ( redirect ) {
/*
* We need to get the dst entry for the
* default router instead
*/
dst_release ( * dst ) ;
2011-03-12 16:22:43 -05:00
memcpy ( & fl_gw6 , fl6 , sizeof ( struct flowi6 ) ) ;
memset ( & fl_gw6 . daddr , 0 , sizeof ( struct in6_addr ) ) ;
* dst = ip6_route_output ( net , sk , & fl_gw6 ) ;
2014-11-23 21:28:43 +00:00
err = ( * dst ) - > error ;
if ( err )
2008-09-09 13:51:35 -07:00
goto out_err_release ;
2007-04-25 17:08:10 -07:00
}
2008-09-09 13:51:35 -07:00
}
2007-04-25 17:08:10 -07:00
# endif
2017-02-12 17:26:06 -05:00
if ( ipv6_addr_v4mapped ( & fl6 - > saddr ) & &
2017-02-18 19:00:45 -05:00
! ( ipv6_addr_v4mapped ( & fl6 - > daddr ) | | ipv6_addr_any ( & fl6 - > daddr ) ) ) {
err = - EAFNOSUPPORT ;
goto out_err_release ;
}
2007-04-25 17:08:10 -07:00
2005-04-16 15:20:36 -07:00
return 0 ;
out_err_release :
dst_release ( * dst ) ;
* dst = NULL ;
2016-09-10 12:09:59 -07:00
2016-06-16 16:24:25 -07:00
if ( err = = - ENETUNREACH )
IP6_INC_STATS ( net , NULL , IPSTATS_MIB_OUTNOROUTES ) ;
2005-04-16 15:20:36 -07:00
return err ;
}
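The last check in ip6_dst_lookup_tail() rejects flows that pair a v4-mapped source with a native IPv6 destination. A user-space sketch of the same rule, using the standard IN6_IS_ADDR_V4MAPPED and IN6_IS_ADDR_UNSPECIFIED macros in place of the kernel helpers, might look as follows; ex_check_v4mapped() and ex_addr() are hypothetical names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>

/* Parse a textual IPv6 address; aborts on malformed input. */
static struct in6_addr ex_addr(const char *text)
{
	struct in6_addr a;
	int ok = inet_pton(AF_INET6, text, &a);

	assert(ok == 1);
	return a;
}

/*
 * A v4-mapped source address is only acceptable when the destination
 * is v4-mapped as well, or is still unspecified (::).
 * Returns 0 if the pair is acceptable, -EAFNOSUPPORT otherwise.
 */
static int ex_check_v4mapped(const struct in6_addr *saddr,
			     const struct in6_addr *daddr)
{
	if (IN6_IS_ADDR_V4MAPPED(saddr) &&
	    !(IN6_IS_ADDR_V4MAPPED(daddr) || IN6_IS_ADDR_UNSPECIFIED(daddr)))
		return -EAFNOSUPPORT;
	return 0;
}
```

Note that the rule is one-directional: a native IPv6 source with a v4-mapped destination is not rejected by this particular check.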
2005-11-29 16:28:56 -08:00
2006-07-30 20:19:33 -07:00
/**
* ip6_dst_lookup - perform route lookup on flow
* @ net : network namespace to perform the lookup in
* @ sk : socket which provides route info
* @ dst : pointer to dst_entry * for result
2011-03-12 16:22:43 -05:00
* @ fl6 : flow to lookup
2006-07-30 20:19:33 -07:00
*
* This function performs a route lookup on the given flow .
*
* It returns zero on success , or a standard errno code on error .
*/
2015-07-30 13:34:53 -07:00
int ip6_dst_lookup ( struct net * net , struct sock * sk , struct dst_entry * * dst ,
struct flowi6 * fl6 )
2006-07-30 20:19:33 -07:00
{
* dst = NULL ;
2015-07-30 13:34:53 -07:00
return ip6_dst_lookup_tail ( net , sk , dst , fl6 ) ;
2006-07-30 20:19:33 -07:00
}
2005-12-13 23:23:20 -08:00
EXPORT_SYMBOL_GPL ( ip6_dst_lookup ) ;
2006-07-30 20:19:33 -07:00
/**
2011-03-01 13:19:07 -08:00
* ip6_dst_lookup_flow - perform route lookup on flow with ipsec
* @ sk : socket which provides route info
2011-03-12 16:22:43 -05:00
* @ fl6 : flow to lookup
2011-03-01 13:19:07 -08:00
* @ final_dst : final destination address for ipsec lookup
*
* This function performs a route lookup on the given flow .
*
* It returns a valid dst pointer on success , or a pointer encoded
* error code .
*/
2015-09-25 07:39:12 -07:00
struct dst_entry * ip6_dst_lookup_flow ( const struct sock * sk , struct flowi6 * fl6 ,
2013-08-28 08:04:14 +02:00
const struct in6_addr * final_dst )
2011-03-01 13:19:07 -08:00
{
struct dst_entry * dst = NULL ;
int err ;
2015-07-30 13:34:53 -07:00
err = ip6_dst_lookup_tail ( sock_net ( sk ) , sk , & dst , fl6 ) ;
2011-03-01 13:19:07 -08:00
if ( err )
return ERR_PTR ( err ) ;
if ( final_dst )
2011-11-21 03:39:03 +00:00
fl6 - > daddr = * final_dst ;
2011-03-01 14:59:04 -08:00
2014-09-16 10:08:40 +02:00
return xfrm_lookup_route ( sock_net ( sk ) , dst , flowi6_to_flowi ( fl6 ) , sk , 0 ) ;
2011-03-01 13:19:07 -08:00
}
EXPORT_SYMBOL_GPL ( ip6_dst_lookup_flow ) ;
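ip6_dst_lookup_flow() returns "a valid dst pointer on success, or a pointer encoded error code". The idea behind that convention can be sketched in user space as below. This is an illustration of the technique, not the kernel's include/linux/err.h: the ex_ names are hypothetical, and the 4095 limit is assumed to match the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_MAX_ERRNO 4095	/* assumed to match the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer. */
static inline void *ex_err_ptr(long err)
{
	assert(err < 0 && err >= -EX_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static inline long ex_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/*
 * Error pointers live in the top EX_MAX_ERRNO addresses of the address
 * space, which no valid object is expected to occupy.
 */
static inline bool ex_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-EX_MAX_ERRNO;
}
```

A caller then tests the result before using it, in the same shape as the IS_ERR(frag) / PTR_ERR(frag) pair in ip6_fragment(): check ex_is_err(p) first, and only on failure extract the code with ex_ptr_err(p).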
/**
* ip6_sk_dst_lookup_flow - perform socket cached route lookup on flow
2006-07-30 20:19:33 -07:00
* @ sk : socket which provides the dst cache and route info
2011-03-12 16:22:43 -05:00
* @ fl6 : flow to lookup
2011-03-01 13:19:07 -08:00
* @ final_dst : final destination address for ipsec lookup
2018-04-03 15:00:08 +03:00
* @ connected : whether @ sk is connected or not
2006-07-30 20:19:33 -07:00
*
* This function performs a route lookup on the given flow with the
* possibility of using the cached route in the socket if it is valid .
* It will take the socket dst lock when operating on the dst cache .
* As a result , this function can only be used in process context .
*
2018-04-03 15:00:08 +03:00
* In addition , for a connected socket , cache the dst in the socket
* if the current cache is not valid .
*
2011-03-01 13:19:07 -08:00
* It returns a valid dst pointer on success , or a pointer encoded
* error code .
2006-07-30 20:19:33 -07:00
*/
2011-03-12 16:22:43 -05:00
struct dst_entry * ip6_sk_dst_lookup_flow ( struct sock * sk , struct flowi6 * fl6 ,
2018-04-03 15:00:08 +03:00
const struct in6_addr * final_dst ,
bool connected )
2006-07-30 20:19:33 -07:00
{
2011-03-01 13:19:07 -08:00
struct dst_entry * dst = sk_dst_check ( sk , inet6_sk ( sk ) - > dst_cookie ) ;
2006-07-30 20:19:33 -07:00
2011-03-12 16:22:43 -05:00
dst = ip6_sk_dst_check ( sk , dst , fl6 ) ;
2018-04-03 15:00:08 +03:00
if ( dst )
return dst ;
dst = ip6_dst_lookup_flow ( sk , fl6 , final_dst ) ;
if ( connected & & ! IS_ERR ( dst ) )
ip6_sk_dst_store_flow ( sk , dst_clone ( dst ) , fl6 ) ;
2011-03-01 13:19:07 -08:00
ipv6: Skip XFRM lookup if dst_entry in socket cache is valid
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.
If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.
To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:
1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500
2) Monitor the packet flow and set up a UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &
3) Send a datagram that needs fragmentation with a connected socket
# perl -e 'print "@" x 1470' | socat - udp6:[::1]:12345
2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
(^ ICMPv6 Parameter Problem)
00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136
4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
What happens in step (3) is:
1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,
2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.
To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.
The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.
Joint work with Hannes Frederic Sowa.
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 15:13:34 +02:00
return dst ;
2006-07-30 20:19:33 -07:00
}
2011-03-01 13:19:07 -08:00
EXPORT_SYMBOL_GPL ( ip6_sk_dst_lookup_flow ) ;
2006-07-30 20:19:33 -07:00
2009-02-05 15:15:50 -08:00
static inline struct ipv6_opt_hdr * ip6_opt_dup ( struct ipv6_opt_hdr * src ,
gfp_t gfp )
{
return src ? kmemdup ( src , ( src - > hdrlen + 1 ) * 8 , gfp ) : NULL ;
}
static inline struct ipv6_rt_hdr * ip6_rthdr_dup ( struct ipv6_rt_hdr * src ,
gfp_t gfp )
{
return src ? kmemdup ( src , ( src - > hdrlen + 1 ) * 8 , gfp ) : NULL ;
}
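Both duplication helpers above rely on the IPv6 extension header length rule: hdrlen counts 8-byte units and excludes the first 8 bytes, so the total size is (hdrlen + 1) * 8. A user-space sketch with malloc() in place of kmemdup() is shown below; struct ex_opt_hdr and the ex_ functions are hypothetical, and the caller is assumed to pass a buffer that really holds the advertised number of bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* First two bytes shared by hop-by-hop, destination and routing headers. */
struct ex_opt_hdr {
	uint8_t nexthdr;
	uint8_t hdrlen;	/* length in 8-byte units, not counting the first 8 bytes */
};

/* Total header size in bytes. */
static size_t ex_optlen(const struct ex_opt_hdr *hdr)
{
	assert(hdr != NULL);
	return ((size_t)hdr->hdrlen + 1) * 8;
}

/*
 * Duplicate a whole extension header. Returns NULL for a NULL source
 * or on allocation failure; the caller frees the copy with free().
 */
static struct ex_opt_hdr *ex_opt_dup(const struct ex_opt_hdr *src)
{
	struct ex_opt_hdr *copy;
	size_t len;

	if (!src)
		return NULL;
	len = ex_optlen(src);
	copy = malloc(len);
	if (copy)
		memcpy(copy, src, len);
	return copy;
}
```

Because a NULL source and a failed allocation both yield NULL, callers have to tell them apart the way ip6_setup_cork() does: an error is reported only when the source was set but the copy is missing.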
2013-07-02 08:04:05 +02:00
static void ip6_append_data_mtu ( unsigned int * mtu ,
2012-05-26 01:30:53 +00:00
int * maxfraglen ,
unsigned int fragheaderlen ,
struct sk_buff * skb ,
2013-07-02 08:04:05 +02:00
struct rt6_info * rt ,
ipv6: ip6_append_data_mtu do not handle the mtu of the second fragment properly
In ip6_append_data_mtu(), when the xfrm mode is not tunnel (for example,
transport), the IPsec header has to be added to the first fragment, so the mtu
is decreased to reserve space for it. When the second fragment comes, the mtu
should be restored, as commit 0c1833797a5a6ec23ea9261d979aa18078720b74
describes. However, commit 75a493e60ac4bbe2e977e7129d6d8cbb0dd236be uses
*mtu = min(*mtu, ...) to change the mtu, so the new mtu always stays
equal to the first fragment's and is never restored.
Testing with ping6 -c1 -s5000 $ip (mtu=1280) gives:
...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232
...frag (1232|1216)
...frag (2448|1216)
...frag (3664|1216)
...frag (4880|164)
which should be:
...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232
...frag (1232|1232)
...frag (2464|1232)
...frag (3696|1232)
...frag (4928|116)
So remove the min() when restoring the mtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Fixes: 75a493e60ac4bb ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-17 12:51:01 +08:00
unsigned int orig_mtu )
2012-05-26 01:30:53 +00:00
{
if ( ! ( rt - > dst . flags & DST_XFRM_TUNNEL ) ) {
2015-03-29 14:00:04 +01:00
if ( ! skb ) {
2012-05-26 01:30:53 +00:00
/* first fragment, reserve header_len */
2014-03-17 12:51:01 +08:00
* mtu = orig_mtu - rt - > dst . header_len ;
2012-05-26 01:30:53 +00:00
} else {
/*
* this fragment is not first , the headers
* space is regarded as data space .
*/
2014-03-17 12:51:01 +08:00
* mtu = orig_mtu ;
2012-05-26 01:30:53 +00:00
}
* maxfraglen = ( ( * mtu - fragheaderlen ) & ~ 7 )
+ fragheaderlen - sizeof ( struct frag_hdr ) ;
}
}
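The maxfraglen expression at the end of ip6_append_data_mtu() is easier to follow with numbers. The sketch below reproduces the arithmetic in user space, assuming an 8-byte fragment header; ex_maxfraglen() is a hypothetical name, and the 16-byte reservation used in the note after the code is an illustrative value, not one taken from the source.

```c
#include <assert.h>

#define EX_FRAG_HDR_LEN 8u	/* assumed sizeof(struct frag_hdr) */

/*
 * fragheaderlen covers the IPv6 header plus any unfragmentable
 * extension headers. The data behind those headers is rounded down to
 * a multiple of 8 bytes, then the headers are added back and the
 * fragment header, which is not present yet, is subtracted.
 */
static unsigned int ex_maxfraglen(unsigned int mtu, unsigned int fragheaderlen)
{
	assert(mtu >= fragheaderlen);
	return ((mtu - fragheaderlen) & ~7u) + fragheaderlen - EX_FRAG_HDR_LEN;
}
```

With mtu = 1280 and fragheaderlen = 40 the result is 1272, which leaves 1272 - 40 = 1232 data bytes per fragment, the size seen in the expected ping6 trace. If 16 bytes stay reserved (mtu = 1264), the result drops to 1256 and each fragment carries only 1216 bytes, which is the pattern of the faulty trace.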
2015-01-31 10:40:13 -05:00
static int ip6_setup_cork ( struct sock * sk , struct inet_cork_full * cork ,
2016-05-02 21:40:07 -07:00
struct inet6_cork * v6_cork , struct ipcm6_cookie * ipc6 ,
2018-07-06 10:12:57 -04:00
struct rt6_info * rt , struct flowi6 * fl6 )
2015-01-31 10:40:13 -05:00
{
struct ipv6_pinfo * np = inet6_sk ( sk ) ;
unsigned int mtu ;
2016-05-02 21:40:07 -07:00
struct ipv6_txoptions * opt = ipc6 - > opt ;
2015-01-31 10:40:13 -05:00
/*
* setup for corking
*/
if ( opt ) {
if ( WARN_ON ( v6_cork - > opt ) )
return - EINVAL ;
2017-10-21 12:26:23 -07:00
v6_cork - > opt = kzalloc ( sizeof ( * opt ) , sk - > sk_allocation ) ;
2015-03-29 14:00:04 +01:00
if ( unlikely ( ! v6_cork - > opt ) )
2015-01-31 10:40:13 -05:00
return - ENOBUFS ;
2017-10-21 12:26:23 -07:00
v6_cork - > opt - > tot_len = sizeof ( * opt ) ;
2015-01-31 10:40:13 -05:00
v6_cork - > opt - > opt_flen = opt - > opt_flen ;
v6_cork - > opt - > opt_nflen = opt - > opt_nflen ;
v6_cork - > opt - > dst0opt = ip6_opt_dup ( opt - > dst0opt ,
sk - > sk_allocation ) ;
if ( opt - > dst0opt & & ! v6_cork - > opt - > dst0opt )
return - ENOBUFS ;
v6_cork - > opt - > dst1opt = ip6_opt_dup ( opt - > dst1opt ,
sk - > sk_allocation ) ;
if ( opt - > dst1opt & & ! v6_cork - > opt - > dst1opt )
return - ENOBUFS ;
v6_cork - > opt - > hopopt = ip6_opt_dup ( opt - > hopopt ,
sk - > sk_allocation ) ;
if ( opt - > hopopt & & ! v6_cork - > opt - > hopopt )
return - ENOBUFS ;
v6_cork - > opt - > srcrt = ip6_rthdr_dup ( opt - > srcrt ,
sk - > sk_allocation ) ;
if ( opt - > srcrt & & ! v6_cork - > opt - > srcrt )
return - ENOBUFS ;
/* need source address above miyazawa*/
}
dst_hold ( & rt - > dst ) ;
cork - > base . dst = & rt - > dst ;
cork - > fl . u . ip6 = * fl6 ;
2016-05-02 21:40:07 -07:00
v6_cork - > hop_limit = ipc6 - > hlimit ;
v6_cork - > tclass = ipc6 - > tclass ;
2015-01-31 10:40:13 -05:00
if ( rt - > dst . flags & DST_XFRM_TUNNEL )
mtu = np - > pmtudisc > = IPV6_PMTUDISC_PROBE ?
2018-01-10 12:45:10 -05:00
READ_ONCE ( rt - > dst . dev - > mtu ) : dst_mtu ( & rt - > dst ) ;
2015-01-31 10:40:13 -05:00
else
mtu = np - > pmtudisc > = IPV6_PMTUDISC_PROBE ?
2018-01-17 00:00:25 -05:00
READ_ONCE ( rt - > dst . dev - > mtu ) : dst_mtu ( xfrm_dst_path ( & rt - > dst ) ) ;
2015-01-31 10:40:13 -05:00
if ( np - > frag_size < mtu ) {
if ( np - > frag_size )
mtu = np - > frag_size ;
}
2018-01-10 12:45:10 -05:00
if ( mtu < IPV6_MIN_MTU )
return - EINVAL ;
2015-01-31 10:40:13 -05:00
cork - > base . fragsize = mtu ;
2018-07-06 10:12:59 -04:00
cork - > base . gso_size = ipc6 - > gso_size ;
2018-07-06 10:12:58 -04:00
cork - > base . tx_flags = 0 ;
2019-09-11 15:50:51 -04:00
cork - > base . mark = ipc6 - > sockc . mark ;
2018-07-06 10:12:58 -04:00
sock_tx_timestamp ( sk , ipc6 - > sockc . tsflags , & cork - > base . tx_flags ) ;
udp: generate gso with UDP_SEGMENT
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.
To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.
A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.
Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.
Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles
tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles
tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles
udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles
udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles
[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")
Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:
perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
Note the reduction in calls/s with GSO. Bytes per syscall
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 13:42:17 -04:00
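The commit message above says that the total length may exceed the MTU and that the last segment is shorter when the total is not an exact multiple of the segment size. A small sketch of that split, assuming a hypothetical helper gso_split_payload() and struct gso_split (neither exists in the kernel; the real segmentation happens later in the GSO path):

```c
#include <assert.h>

/* Hypothetical result type for this sketch. */
struct gso_split {
	unsigned int nsegs;    /* number of datagrams produced */
	unsigned int last_len; /* payload length of the final datagram */
};

/* Split total_len payload bytes into segments of gso_size bytes each. */
static struct gso_split gso_split_payload(unsigned int total_len,
					  unsigned int gso_size)
{
	struct gso_split s;

	assert(gso_size > 0 && total_len > 0);
	/* Round up: a partial trailing segment still counts as one. */
	s.nsegs = (total_len + gso_size - 1) / gso_size;
	s.last_len = total_len - (s.nsegs - 1) * gso_size;
	return s;
}
```

For example, 3001 bytes with a segment size of 1000 would yield four datagrams, the last carrying a single byte.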
2017-11-28 15:40:46 -05:00
if ( dst_allfrag ( xfrm_dst_path ( & rt - > dst ) ) )
2015-01-31 10:40:13 -05:00
cork - > base . flags | = IPCORK_ALLFRAG ;
cork - > base . length = 0 ;
2018-07-06 10:12:57 -04:00
cork - > base . transmit_time = ipc6 - > sockc . transmit_time ;
2018-07-03 15:42:50 -07:00
2015-01-31 10:40:13 -05:00
return 0 ;
}
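The fragment size chosen by ip6_setup_cork() above comes down to two rules: a non-zero socket frag_size smaller than the path MTU takes precedence, and anything below the IPv6 minimum MTU is rejected. A userspace sketch of just that selection, assuming a hypothetical helper cork_fragsize() and leaving out the pmtudisc and xfrm handling:

```c
#include <assert.h>
#include <errno.h>

#define IPV6_MIN_MTU 1280u /* minimum link MTU required by IPv6 */

/*
 * Pick the fragment size to cork with.
 * path_mtu:  MTU already chosen from the device or the route
 * frag_size: per-socket override, 0 meaning "not set"
 * Returns 0 and stores the result, or -EINVAL if it is below 1280.
 */
static int cork_fragsize(unsigned int path_mtu, unsigned int frag_size,
			 unsigned int *fragsize)
{
	unsigned int mtu = path_mtu;

	assert(fragsize);
	if (frag_size && frag_size < mtu)
		mtu = frag_size;
	if (mtu < IPV6_MIN_MTU)
		return -EINVAL;
	*fragsize = mtu;
	return 0;
}
```

A frag_size larger than the path MTU is ignored, so the override can only shrink the fragment size, never grow it.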
2015-01-31 10:40:14 -05:00
static int __ip6_append_data ( struct sock * sk ,
struct flowi6 * fl6 ,
struct sk_buff_head * queue ,
struct inet_cork * cork ,
struct inet6_cork * v6_cork ,
struct page_frag * pfrag ,
int getfrag ( void * from , char * to , int offset ,
int len , int odd , struct sk_buff * skb ) ,
void * from , int length , int transhdrlen ,
2018-07-06 10:12:57 -04:00
unsigned int flags , struct ipcm6_cookie * ipc6 )
2005-04-16 15:20:36 -07:00
{
2012-05-26 01:30:53 +00:00
struct sk_buff * skb , * skb_prev = NULL ;
2018-03-23 14:47:30 +01:00
unsigned int maxfraglen , fragheaderlen , mtu , orig_mtu , pmtu ;
2018-11-30 15:32:39 -05:00
struct ubuf_info * uarg = NULL ;
2015-01-31 10:40:14 -05:00
int exthdrlen = 0 ;
int dst_exthdrlen = 0 ;
2005-04-16 15:20:36 -07:00
int hh_len ;
int copy ;
int err ;
int offset = 0 ;
2014-08-04 22:11:47 -04:00
u32 tskey = 0 ;
2015-01-31 10:40:14 -05:00
struct rt6_info * rt = ( struct rt6_info * ) cork - > dst ;
struct ipv6_txoptions * opt = v6_cork - > opt ;
2015-01-31 10:40:18 -05:00
int csummode = CHECKSUM_NONE ;
2015-10-27 22:40:41 +01:00
unsigned int maxnonfragsize , headersize ;
2018-03-31 13:16:26 -07:00
unsigned int wmem_alloc_delta = 0 ;
2019-05-30 18:01:21 -04:00
bool paged , extra_uref = false ;
2005-04-16 15:20:36 -07:00
2015-01-31 10:40:14 -05:00
skb = skb_peek_tail ( queue ) ;
if ( ! skb ) {
exthdrlen = opt ? opt - > opt_flen : 0 ;
2013-01-16 12:47:40 +00:00
dst_exthdrlen = rt - > dst . header_len - rt - > rt6i_nfheader_len ;
2005-04-16 15:20:36 -07:00
}
2015-01-31 10:40:14 -05:00
2018-04-26 13:42:19 -04:00
paged = ! ! cork - > gso_size ;
2018-04-26 13:42:17 -04:00
mtu = cork - > gso_size ? IP6_MAX_MTU : cork - > fragsize ;
ipv6: ip6_append_data_mtu do not handle the mtu of the second fragment properly
In ip6_append_data_mtu(), when the xfrm mode is not tunnel (such as
transport), the IPsec header needs to be added to the first fragment, so the
mtu is decreased to reserve space for it. When the second fragment comes, the
mtu should be restored, as commit 0c1833797a5a6ec23ea9261d979aa18078720b74
said. However, commit 75a493e60ac4bbe2e977e7129d6d8cbb0dd236be uses
*mtu = min(*mtu, ...) to change the mtu, so the new mtu is always equal to
the first fragment's and can never be restored.
Testing with ping6 -c1 -s5000 $ip (mtu=1280) gives:
...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232
...frag (1232|1216)
...frag (2448|1216)
...frag (3664|1216)
...frag (4880|164)
which should be:
...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232
...frag (1232|1232)
...frag (2464|1232)
...frag (3696|1232)
...frag (4928|116)
So remove the min() when restoring the mtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Fixes: 75a493e60ac4bb ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-17 12:51:01 +08:00
orig_mtu = mtu ;
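The effect described in the commit message above can be reproduced with plain arithmetic. The sketch below is a userspace model, not the kernel function: fragment_mtu(), payload_per_fragment() and enum mtu_policy are hypothetical names, the clamp is modelled as min(previous mtu, orig_mtu), and a reserved header length of 16 bytes is an assumed value chosen because it reproduces the 1216 and 1232 byte fragments in the traces.

```c
#include <assert.h>

#define FRAG_HDR_LEN 8u /* assumed: sizeof(struct frag_hdr) */

/* Hypothetical switch between the fixed and the regressed behaviour. */
enum mtu_policy { MTU_RESTORE, MTU_MIN_CLAMP };

/*
 * mtu used for one fragment. The first fragment gives up header_len
 * bytes; later fragments should get orig_mtu back.
 */
static unsigned int fragment_mtu(unsigned int prev_mtu, unsigned int orig_mtu,
				 unsigned int header_len, int first,
				 enum mtu_policy policy)
{
	assert(orig_mtu > header_len);
	if (first)
		return orig_mtu - header_len;
	if (policy == MTU_MIN_CLAMP) /* the min() can never grow the mtu */
		return prev_mtu < orig_mtu ? prev_mtu : orig_mtu;
	return orig_mtu;
}

/* Payload per non-final fragment for a given mtu. */
static unsigned int payload_per_fragment(unsigned int mtu,
					 unsigned int fragheaderlen)
{
	return ((mtu - fragheaderlen) & ~7u) - FRAG_HDR_LEN;
}
```

Under these assumptions the clamped policy keeps later fragments at mtu 1264 (1216 payload bytes), while restoring orig_mtu gives 1280 (1232 payload bytes).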
2005-04-16 15:20:36 -07:00
2018-07-06 10:12:58 -04:00
if ( cork - > tx_flags & SKBTX_ANY_SW_TSTAMP & &
sk - > sk_tsflags & SOF_TIMESTAMPING_OPT_ID )
tskey = sk - > sk_tskey + + ;
2010-06-10 23:31:35 -07:00
hh_len = LL_RESERVED_SPACE ( rt - > dst . dev ) ;
2005-04-16 15:20:36 -07:00
2007-12-20 20:41:12 -08:00
fragheaderlen = sizeof ( struct ipv6hdr ) + rt - > rt6i_nfheader_len +
2007-11-13 21:33:32 -08:00
( opt ? opt - > opt_nflen : 0 ) ;
2013-12-16 12:36:44 +01:00
maxfraglen = ( ( mtu - fragheaderlen ) & ~ 7 ) + fragheaderlen -
sizeof ( struct frag_hdr ) ;
2005-04-16 15:20:36 -07:00
2015-10-27 22:40:41 +01:00
headersize = sizeof ( struct ipv6hdr ) +
( opt ? opt - > opt_flen + opt - > opt_nflen : 0 ) +
( dst_allfrag ( & rt - > dst ) ?
sizeof ( struct frag_hdr ) : 0 ) +
rt - > rt6i_nfheader_len ;
2018-03-23 14:47:30 +01:00
/* as per RFC 7112 section 5, the entire IPv6 Header Chain must fit
* the first fragment
*/
if ( headersize + transhdrlen > mtu )
goto emsgsize ;
2016-05-02 21:40:07 -07:00
if ( cork - > length + length > mtu - headersize & & ipc6 - > dontfrag & &
2015-10-27 22:40:41 +01:00
( sk - > sk_protocol = = IPPROTO_UDP | |
sk - > sk_protocol = = IPPROTO_RAW ) ) {
ipv6_local_rxpmtu ( sk , fl6 , mtu - headersize +
sizeof ( struct ipv6hdr ) ) ;
goto emsgsize ;
}
2013-12-16 12:36:44 +01:00
2015-10-27 22:40:41 +01:00
if ( ip6_sk_ignore_df ( sk ) )
maxnonfragsize = sizeof ( struct ipv6hdr ) + IPV6_MAXPLEN ;
else
maxnonfragsize = mtu ;
2013-12-16 12:36:44 +01:00
2015-10-27 22:40:41 +01:00
if ( cork - > length + length > maxnonfragsize - headersize ) {
2013-12-16 12:36:44 +01:00
emsgsize :
2018-03-23 14:47:30 +01:00
pmtu = max_t ( int , mtu - headersize + sizeof ( struct ipv6hdr ) , 0 ) ;
ipv6_local_error ( sk , EMSGSIZE , fl6 , pmtu ) ;
2015-10-27 22:40:41 +01:00
return - EMSGSIZE ;
2005-04-16 15:20:36 -07:00
}
2015-10-27 22:40:41 +01:00
/* CHECKSUM_PARTIAL only with no extension headers and when
* we are not going to fragment
*/
if ( transhdrlen & & sk - > sk_protocol = = IPPROTO_UDP & &
headersize = = sizeof ( struct ipv6hdr ) & &
2017-01-29 22:52:53 -05:00
length < = mtu - headersize & &
2018-04-26 13:42:17 -04:00
( ! ( flags & MSG_MORE ) | | cork - > gso_size ) & &
2015-12-14 11:19:44 -08:00
rt - > dst . dev - > features & ( NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM ) )
2015-10-27 22:40:41 +01:00
csummode = CHECKSUM_PARTIAL ;
2018-11-30 15:32:39 -05:00
if ( flags & MSG_ZEROCOPY & & length & & sock_flag ( sk , SOCK_ZEROCOPY ) ) {
uarg = sock_zerocopy_realloc ( sk , length , skb_zcopy ( skb ) ) ;
if ( ! uarg )
return - ENOBUFS ;
2019-06-07 17:57:48 -04:00
extra_uref = ! skb_zcopy ( skb ) ; /* only ref on new uarg */
2018-11-30 15:32:39 -05:00
if ( rt - > dst . dev - > features & NETIF_F_SG & &
csummode = = CHECKSUM_PARTIAL ) {
paged = true ;
} else {
uarg - > zerocopy = 0 ;
2018-11-30 15:32:40 -05:00
skb_zcopy_set ( skb , uarg , & extra_uref ) ;
2018-11-30 15:32:39 -05:00
}
}
2005-04-16 15:20:36 -07:00
/*
* Let ' s try using as much space as possible .
* Use MTU if total length of the message fits into the MTU .
* Otherwise , we need to reserve fragment header and
* fragment alignment ( = 8 - 15 octets , in total ) .
*
* Note that we may need to " move " the data from the tail of
2007-02-09 23:24:49 +09:00
* the buffer to the new fragment when we split
2005-04-16 15:20:36 -07:00
* the message .
*
2007-02-09 23:24:49 +09:00
* FIXME : It may be fragmented into multiple chunks
2005-04-16 15:20:36 -07:00
* at once if non - fragmentable extension headers
* are too large .
2007-02-09 23:24:49 +09:00
* - - yoshfuji
2005-04-16 15:20:36 -07:00
*/
2013-09-21 06:27:00 +02:00
cork - > length + = length ;
if ( ! skb )
2005-04-16 15:20:36 -07:00
goto alloc_new_skb ;
while ( length > 0 ) {
/* Check if the remaining data fits into current packet. */
2011-05-06 15:02:07 -07:00
copy = ( cork - > length < = mtu & & ! ( cork - > flags & IPCORK_ALLFRAG ) ? mtu : maxfraglen ) - skb - > len ;
2005-04-16 15:20:36 -07:00
if ( copy < length )
copy = maxfraglen - skb - > len ;
if ( copy < = 0 ) {
char * data ;
unsigned int datalen ;
unsigned int fraglen ;
unsigned int fraggap ;
unsigned int alloclen ;
2018-11-24 14:21:16 -05:00
unsigned int pagedlen ;
2005-04-16 15:20:36 -07:00
alloc_new_skb :
/* There's no room in the current skb */
2012-05-26 01:30:53 +00:00
if ( skb )
fraggap = skb - > len - maxfraglen ;
2005-04-16 15:20:36 -07:00
else
fraggap = 0 ;
2012-05-26 01:30:53 +00:00
/* update mtu and maxfraglen if necessary */
2015-03-29 14:00:04 +01:00
if ( ! skb | | ! skb_prev )
2012-05-26 01:30:53 +00:00
ip6_append_data_mtu ( & mtu , & maxfraglen ,
2013-07-02 08:04:05 +02:00
fragheaderlen , skb , rt ,
2014-03-17 12:51:01 +08:00
orig_mtu ) ;
2012-05-26 01:30:53 +00:00
skb_prev = skb ;
2005-04-16 15:20:36 -07:00
/*
* If remaining data exceeds the mtu ,
* we know we need more fragment ( s ) .
*/
datalen = length + fraggap ;
2012-05-26 01:30:53 +00:00
if ( datalen > ( cork - > length < = mtu & & ! ( cork - > flags & IPCORK_ALLFRAG ) ? mtu : maxfraglen ) - fragheaderlen )
datalen = maxfraglen - fragheaderlen - rt - > dst . trailer_len ;
2018-04-26 13:42:19 -04:00
fraglen = datalen + fragheaderlen ;
2018-11-24 14:21:16 -05:00
pagedlen = 0 ;
2018-04-26 13:42:19 -04:00
2005-04-16 15:20:36 -07:00
if ( ( flags & MSG_MORE ) & &
2010-06-10 23:31:35 -07:00
! ( rt - > dst . dev - > features & NETIF_F_SG ) )
2005-04-16 15:20:36 -07:00
alloclen = mtu ;
2018-04-26 13:42:19 -04:00
else if ( ! paged )
alloclen = fraglen ;
else {
alloclen = min_t ( int , fraglen , MAX_HEADER ) ;
pagedlen = fraglen - alloclen ;
}
2005-04-16 15:20:36 -07:00
2011-10-11 01:43:33 +00:00
alloclen + = dst_exthdrlen ;
2012-05-26 01:30:53 +00:00
if ( datalen ! = length + fraggap ) {
/*
* this is not the last fragment , the trailer
* space is regarded as data space .
*/
datalen + = rt - > dst . trailer_len ;
}
alloclen + = rt - > dst . trailer_len ;
fraglen = datalen + fragheaderlen ;
2005-04-16 15:20:36 -07:00
/*
* We just reserve space for fragment header .
2007-02-09 23:24:49 +09:00
* Note : this may be overallocation if the message
2005-04-16 15:20:36 -07:00
* ( without MSG_MORE ) fits into the MTU .
*/
alloclen + = sizeof ( struct frag_hdr ) ;
2018-04-26 13:42:19 -04:00
copy = datalen - transhdrlen - fraggap - pagedlen ;
2017-05-19 14:17:48 -07:00
if ( copy < 0 ) {
err = - EINVAL ;
goto error ;
}
2005-04-16 15:20:36 -07:00
if ( transhdrlen ) {
skb = sock_alloc_send_skb ( sk ,
alloclen + hh_len ,
( flags & MSG_DONTWAIT ) , & err ) ;
} else {
skb = NULL ;
2018-03-31 13:16:26 -07:00
if ( refcount_read ( & sk - > sk_wmem_alloc ) + wmem_alloc_delta < =
2005-04-16 15:20:36 -07:00
2 * sk - > sk_sndbuf )
2018-03-31 13:16:26 -07:00
skb = alloc_skb ( alloclen + hh_len ,
sk - > sk_allocation ) ;
2015-03-29 14:00:04 +01:00
if ( unlikely ( ! skb ) )
2005-04-16 15:20:36 -07:00
err = - ENOBUFS ;
}
2015-03-29 14:00:04 +01:00
if ( ! skb )
2005-04-16 15:20:36 -07:00
goto error ;
/*
* Fill in the control structures
*/
2013-08-26 12:31:23 +02:00
skb - > protocol = htons ( ETH_P_IPV6 ) ;
2015-01-31 10:40:18 -05:00
skb - > ip_summed = csummode ;
2005-04-16 15:20:36 -07:00
skb - > csum = 0 ;
2012-03-19 22:36:10 +00:00
/* reserve for fragmentation and ipsec header */
skb_reserve ( skb , hh_len + sizeof ( struct frag_hdr ) +
dst_exthdrlen ) ;
2005-04-16 15:20:36 -07:00
/*
* Find where to start putting bytes
*/
2018-04-26 13:42:19 -04:00
data = skb_put ( skb , fraglen - pagedlen ) ;
2012-03-19 22:36:10 +00:00
skb_set_network_header ( skb , exthdrlen ) ;
data + = fragheaderlen ;
2007-04-10 21:21:55 -07:00
skb - > transport_header = ( skb - > network_header +
fragheaderlen ) ;
2005-04-16 15:20:36 -07:00
if ( fraggap ) {
skb - > csum = skb_copy_and_csum_bits (
skb_prev , maxfraglen ,
data + transhdrlen , fraggap , 0 ) ;
skb_prev - > csum = csum_sub ( skb_prev - > csum ,
skb - > csum ) ;
data + = fraggap ;
2006-08-13 20:12:58 -07:00
pskb_trim_unique ( skb_prev , maxfraglen ) ;
2005-04-16 15:20:36 -07:00
}
2017-05-19 14:17:48 -07:00
if ( copy > 0 & &
getfrag ( from , data + transhdrlen , offset ,
copy , fraggap , skb ) < 0 ) {
2005-04-16 15:20:36 -07:00
err = - EFAULT ;
kfree_skb ( skb ) ;
goto error ;
}
offset + = copy ;
2018-04-26 13:42:19 -04:00
length - = copy + transhdrlen ;
2005-04-16 15:20:36 -07:00
transhdrlen = 0 ;
exthdrlen = 0 ;
2011-10-11 01:43:33 +00:00
dst_exthdrlen = 0 ;
2005-04-16 15:20:36 -07:00
2018-11-30 15:32:40 -05:00
/* Only the initial fragment is time stamped */
skb_shinfo ( skb ) - > tx_flags = cork - > tx_flags ;
cork - > tx_flags = 0 ;
skb_shinfo ( skb ) - > tskey = tskey ;
tskey = 0 ;
skb_zcopy_set ( skb , uarg , & extra_uref ) ;
2017-02-06 23:14:16 +02:00
if ( ( flags & MSG_CONFIRM ) & & ! skb_prev )
skb_set_dst_pending_confirm ( skb , 1 ) ;
2005-04-16 15:20:36 -07:00
/*
* Put the packet on the pending queue
*/
2018-03-31 13:16:26 -07:00
if ( ! skb - > destructor ) {
skb - > destructor = sock_wfree ;
skb - > sk = sk ;
wmem_alloc_delta + = skb - > truesize ;
}
2015-01-31 10:40:14 -05:00
__skb_queue_tail ( queue , skb ) ;
2005-04-16 15:20:36 -07:00
continue ;
}
if ( copy > length )
copy = length ;
2018-05-17 13:13:29 -04:00
if ( ! ( rt - > dst . dev - > features & NETIF_F_SG ) & &
skb_tailroom ( skb ) > = copy ) {
2005-04-16 15:20:36 -07:00
unsigned int off ;
off = skb - > len ;
if ( getfrag ( from , skb_put ( skb , copy ) ,
offset , copy , off , skb ) < 0 ) {
__skb_trim ( skb , off ) ;
err = - EFAULT ;
goto error ;
}
2018-11-30 15:32:39 -05:00
} else if ( ! uarg | | ! uarg - > zerocopy ) {
2005-04-16 15:20:36 -07:00
int i = skb_shinfo ( skb ) - > nr_frags ;
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
This is done to increase the probability of coalescing small write() calls
into single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It is also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on the loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-23 23:04:42 +00:00
err = - ENOMEM ;
if ( ! sk_page_frag_refill ( sk , pfrag ) )
2005-04-16 15:20:36 -07:00
goto error ;
2012-09-23 23:04:42 +00:00
if ( ! skb_can_coalesce ( skb , i , pfrag - > page ,
pfrag - > offset ) ) {
err = - EMSGSIZE ;
if ( i = = MAX_SKB_FRAGS )
goto error ;
__skb_fill_page_desc ( skb , i , pfrag - > page ,
pfrag - > offset , 0 ) ;
skb_shinfo ( skb ) - > nr_frags = + + i ;
get_page ( pfrag - > page ) ;
2005-04-16 15:20:36 -07:00
}
2012-09-23 23:04:42 +00:00
copy = min_t ( int , copy , pfrag - > size - pfrag - > offset ) ;
2011-10-18 21:00:24 +00:00
if ( getfrag ( from ,
2012-09-23 23:04:42 +00:00
page_address ( pfrag - > page ) + pfrag - > offset ,
offset , copy , skb - > len , skb ) < 0 )
goto error_efault ;
pfrag - > offset + = copy ;
skb_frag_size_add ( & skb_shinfo ( skb ) - > frags [ i - 1 ] , copy ) ;
2005-04-16 15:20:36 -07:00
skb - > len + = copy ;
skb - > data_len + = copy ;
2008-01-22 22:39:26 -08:00
skb - > truesize + = copy ;
2018-03-31 13:16:26 -07:00
wmem_alloc_delta + = copy ;
2018-11-30 15:32:39 -05:00
} else {
err = skb_zerocopy_iter_dgram ( skb , from , copy ) ;
if ( err < 0 )
goto error ;
2005-04-16 15:20:36 -07:00
}
offset + = copy ;
length - = copy ;
}
2012-09-23 23:04:42 +00:00
2018-04-04 14:30:01 +02:00
if ( wmem_alloc_delta )
refcount_add ( wmem_alloc_delta , & sk - > sk_wmem_alloc ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
2012-09-23 23:04:42 +00:00
error_efault:
	err = -EFAULT;
2005-04-16 15:20:36 -07:00
error:
2018-12-08 06:22:46 -05:00
	if (uarg)
		sock_zerocopy_put_abort(uarg, extra_uref);
2011-05-06 15:02:07 -07:00
	cork->length -= length;
2008-10-08 10:54:51 -07:00
	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
2018-03-31 13:16:26 -07:00
	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
2005-04-16 15:20:36 -07:00
	return err;
}
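The "per task frag allocator" commit message above quotes about 16 pages per 64KB TSO skb at PAGE_SIZE = 4096, and 8x fewer frags with 32768-byte (order-3) frags. A minimal userspace sketch of that arithmetic follows. The helper names frags_needed() and order_bytes() are hypothetical and are not kernel functions; only the numbers come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many fragments of frag_bytes each are needed
 * to cover payload_bytes (ceiling division). */
static size_t frags_needed(size_t payload_bytes, size_t frag_bytes)
{
	assert(frag_bytes > 0);
	return (payload_bytes + frag_bytes - 1) / frag_bytes;
}

/* Hypothetical helper: size in bytes of an order-N allocation, i.e.
 * 2^order contiguous pages of page_size bytes each. */
static size_t order_bytes(unsigned int order, size_t page_size)
{
	return page_size << order;
}
```

With these helpers, frags_needed(65536, 4096) gives the 16 order-0 pages mentioned in the message, while frags_needed(65536, order_bytes(3, 4096)) gives 2, which is where the 8x reduction in mapped and unmapped frags comes from.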
2015-01-31 10:40:14 -05:00
int ip6_append_data(struct sock *sk,
		    int getfrag(void *from, char *to, int offset, int len,
				int odd, struct sk_buff *skb),
2016-05-02 21:40:07 -07:00
		    void *from, int length, int transhdrlen,
		    struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
2018-07-06 10:12:57 -04:00
		    struct rt6_info *rt, unsigned int flags)
2015-01-31 10:40:14 -05:00
{
	struct inet_sock *inet = inet_sk(sk);
	struct ipv6_pinfo *np = inet6_sk(sk);
	int exthdrlen;
	int err;
	if (flags & MSG_PROBE)
		return 0;
	if (skb_queue_empty(&sk->sk_write_queue)) {
		/*
		 * setup for corking
		 */
2016-05-02 21:40:07 -07:00
		err = ip6_setup_cork(sk, &inet->cork, &np->cork,
2018-07-06 10:12:57 -04:00
				     ipc6, rt, fl6);
2015-01-31 10:40:14 -05:00
		if (err)
			return err;
2016-05-02 21:40:07 -07:00
		exthdrlen = (ipc6->opt ? ipc6->opt->opt_flen : 0);
2015-01-31 10:40:14 -05:00
		length += exthdrlen;
		transhdrlen += exthdrlen;
	} else {
		fl6 = &inet->cork.fl.u.ip6;
		transhdrlen = 0;
	}
	return __ip6_append_data(sk, fl6, &sk->sk_write_queue, &inet->cork.base,
				 &np->cork, sk_page_frag(sk), getfrag,
2018-07-06 10:12:57 -04:00
				 from, length, transhdrlen, flags, ipc6);
2015-01-31 10:40:14 -05:00
}
2012-04-29 21:48:53 +00:00
EXPORT_SYMBOL_GPL(ip6_append_data);
2005-04-16 15:20:36 -07:00
2015-01-31 10:40:13 -05:00
static void ip6_cork_release(struct inet_cork_full *cork,
			     struct inet6_cork *v6_cork)
2007-11-05 21:04:31 -08:00
{
2015-01-31 10:40:13 -05:00
	if (v6_cork->opt) {
		kfree(v6_cork->opt->dst0opt);
		kfree(v6_cork->opt->dst1opt);
		kfree(v6_cork->opt->hopopt);
		kfree(v6_cork->opt->srcrt);
		kfree(v6_cork->opt);
		v6_cork->opt = NULL;
2009-02-05 15:15:50 -08:00
	}
2015-01-31 10:40:13 -05:00
	if (cork->base.dst) {
		dst_release(cork->base.dst);
		cork->base.dst = NULL;
		cork->base.flags &= ~IPCORK_ALLFRAG;
2007-11-05 21:04:31 -08:00
	}
2015-01-31 10:40:13 -05:00
	memset(&cork->fl, 0, sizeof(cork->fl));
2007-11-05 21:04:31 -08:00
}
2015-01-31 10:40:15 -05:00
struct sk_buff *__ip6_make_skb(struct sock *sk,
			       struct sk_buff_head *queue,
			       struct inet_cork_full *cork,
			       struct inet6_cork *v6_cork)
2005-04-16 15:20:36 -07:00
{
	struct sk_buff *skb, *tmp_skb;
	struct sk_buff **tail_skb;
	struct in6_addr final_dst_buf, *final_dst = &final_dst_buf;
	struct ipv6_pinfo *np = inet6_sk(sk);
2008-10-08 10:54:51 -07:00
	struct net *net = sock_net(sk);
2005-04-16 15:20:36 -07:00
	struct ipv6hdr *hdr;
2015-01-31 10:40:15 -05:00
	struct ipv6_txoptions *opt = v6_cork->opt;
	struct rt6_info *rt = (struct rt6_info *)cork->base.dst;
	struct flowi6 *fl6 = &cork->fl.u.ip6;
2011-03-12 16:22:43 -05:00
	unsigned char proto = fl6->flowi6_proto;
2005-04-16 15:20:36 -07:00
2015-01-31 10:40:15 -05:00
	skb = __skb_dequeue(queue);
2015-03-29 14:00:04 +01:00
	if (!skb)
2005-04-16 15:20:36 -07:00
		goto out;
	tail_skb = &(skb_shinfo(skb)->frag_list);
	/* move skb->data to ip header from ext header */
2007-04-10 20:50:43 -07:00
	if (skb->data < skb_network_header(skb))
2007-03-10 22:16:10 -03:00
		__skb_pull(skb, skb_network_offset(skb));
2015-01-31 10:40:15 -05:00
	while ((tmp_skb = __skb_dequeue(queue)) != NULL) {
2007-03-16 17:26:39 -03:00
		__skb_pull(tmp_skb, skb_network_header_len(skb));
2005-04-16 15:20:36 -07:00
		*tail_skb = tmp_skb;
		tail_skb = &(tmp_skb->next);
		skb->len += tmp_skb->len;
		skb->data_len += tmp_skb->len;
		skb->truesize += tmp_skb->truesize;
		tmp_skb->destructor = NULL;
		tmp_skb->sk = NULL;
	}
2008-02-12 18:07:27 -08:00
	/* Allow local fragmentation. */
2014-05-04 16:39:18 -07:00
	skb->ignore_df = ip6_sk_ignore_df(sk);
2008-02-12 18:07:27 -08:00
2011-11-21 03:39:03 +00:00
	*final_dst = fl6->daddr;
2007-03-16 17:26:39 -03:00
	__skb_pull(skb, skb_network_header_len(skb));
2005-04-16 15:20:36 -07:00
	if (opt && opt->opt_flen)
		ipv6_push_frag_opts(skb, opt, &proto);
	if (opt && opt->opt_nflen)
2016-11-08 14:59:20 +01:00
		ipv6_push_nfrag_opts(skb, opt, &proto, &final_dst, &fl6->saddr);
2005-04-16 15:20:36 -07:00
2007-04-10 20:46:21 -07:00
	skb_push(skb, sizeof(struct ipv6hdr));
	skb_reset_network_header(skb);
2007-04-25 17:54:47 -07:00
	hdr = ipv6_hdr(skb);
2007-02-09 23:24:49 +09:00
2015-01-31 10:40:15 -05:00
	ip6_flow_hdr(hdr, v6_cork->tclass,
2014-07-01 21:33:10 -07:00
		     ip6_make_flowlabel(net, skb, fl6->flowlabel,
net: reevalulate autoflowlabel setting after sysctl setting
sysctl.ipv6.auto_flowlabels is 1 by default. On our hosts, we set it to 2.
If the sockopt doesn't set autoflowlabel, outgoing packets from the hosts are
supposed to not include a flowlabel. This is true for normal packets, but
not for reset packets.
The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. Later, if
we change sysctl.ipv6.auto_flowlabels, ipv6_pinfo.autoflowlabel isn't
changed, so the sock will keep the old behavior in terms of auto
flowlabel. Reset packets suffer from this problem, because a reset
packet is sent from a special control socket, which is created at boot
time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
socket will always have its ipv6_pinfo.autoflowlabel set, even after
the user sets sysctl.ipv6.auto_flowlabels to 2, so reset packets will always
have a flowlabel. A normal sock created before the sysctl setting suffers from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks on the hosts.
To fix this, if the IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from the user, otherwise we always call
ip6_default_np_autolabel() which has the new settings of the sysctl.
Note, this changes behavior a little bit. Before commit 42240901f7c4
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock isn't sticky, e.g., if the sysctl changes,
an existing connection will change autoflowlabel behavior. After that
commit, autoflowlabel behavior is sticky for the whole life of the sock.
With this patch, the behavior isn't sticky again.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 12:10:21 -08:00
					ip6_autoflowlabel(net, np), fl6));
2015-01-31 10:40:15 -05:00
	hdr->hop_limit = v6_cork->hop_limit;
2005-04-16 15:20:36 -07:00
	hdr->nexthdr = proto;
2011-11-21 03:39:03 +00:00
	hdr->saddr = fl6->saddr;
	hdr->daddr = *final_dst;
2005-04-16 15:20:36 -07:00
2006-01-08 22:37:26 -08:00
	skb->priority = sk->sk_priority;
2019-09-11 15:50:51 -04:00
	skb->mark = cork->base.mark;
2006-01-08 22:37:26 -08:00
2018-07-03 15:42:50 -07:00
	skb->tstamp = cork->base.transmit_time;
2010-06-10 23:31:35 -07:00
	skb_dst_set(skb, dst_clone(&rt->dst));
2009-04-27 02:45:02 -07:00
	IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
2007-09-16 16:52:35 -07:00
	if (proto == IPPROTO_ICMPV6) {
2009-06-02 05:19:30 +00:00
		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
2007-09-16 16:52:35 -07:00
2014-03-31 20:14:10 +02:00
		ICMP6MSGOUT_INC_STATS(net, idev, icmp6_hdr(skb)->icmp6_type);
		ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTMSGS);
2007-09-16 16:52:35 -07:00
	}
2015-01-31 10:40:15 -05:00
	ip6_cork_release(cork, v6_cork);
out:
	return skb;
}
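The autoflowlabel commit message quoted inside __ip6_make_skb() describes a two-step decision: use the per-socket setting if the IPV6_AUTOFLOWLABEL sockopt was used, otherwise fall back to whatever the sysctl says right now. A minimal userspace sketch of that decision follows. struct sketch_sock and both function names are hypothetical stand-ins, not the kernel's definitions. The mapping of sysctl values (1 and 3 default on, 0 and 2 default off) is an assumption based on the admin modes introduced by commit 42240901f7c4, and the sketch does not model any further overrides applied when the label itself is computed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-socket fields the commit
 * message talks about. */
struct sketch_sock {
	bool autoflowlabel_set;	/* IPV6_AUTOFLOWLABEL sockopt was used */
	bool autoflowlabel;	/* value the user asked for */
};

/* Assumption: sysctl values 1 and 3 turn automatic flow labels on by
 * default, 0 and 2 leave them off unless a socket opts in. */
static bool sketch_default_autolabel(int sysctl_auto_flowlabels)
{
	return sysctl_auto_flowlabels == 1 || sysctl_auto_flowlabels == 3;
}

/* Decision described in the commit message: an explicit sockopt wins,
 * otherwise the current sysctl value is consulted on every call, so a
 * later sysctl change takes effect on existing sockets. */
static bool sketch_autoflowlabel(int sysctl_auto_flowlabels,
				 const struct sketch_sock *sk)
{
	if (!sk->autoflowlabel_set)
		return sketch_default_autolabel(sysctl_auto_flowlabels);
	return sk->autoflowlabel;
}
```

Under these assumptions, a socket created while the sysctl was 1 and never touched by the sockopt stops getting automatic labels as soon as the sysctl is changed to 2, which is the non-sticky behavior the commit restores.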
int ip6_send_skb(struct sk_buff *skb)
{
	struct net *net = sock_net(skb->sk);
	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
	int err;
2015-10-07 16:48:46 -05:00
	err = ip6_local_out(net, skb->sk, skb);
2005-04-16 15:20:36 -07:00
	if (err) {
		if (err > 0)
ip: Report qdisc packet drops
Christoph Lameter pointed out that packet drops at qdisc level were not
accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
reported to the user (-ENOBUFS errors) and SNMP counters updated.
IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.
This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.
Example after a UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048
send() syscalls do, however, still return an OK status, so as not to
break applications.
Note: the send() manual page explicitly says for the -ENOBUFS error:
"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "
This is not true for IP_RECVERR enabled sockets: a send() syscall
that hits a qdisc drop returns an ENOBUFS error.
Many thanks to Christoph, David, and last but not least, Alexey!
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-02 18:05:33 -07:00
			err = net_xmit_errno(err);
2005-04-16 15:20:36 -07:00
		if (err)
2015-01-31 10:40:15 -05:00
			IP6_INC_STATS(net, rt->rt6i_idev,
				      IPSTATS_MIB_OUTDISCARDS);
2005-04-16 15:20:36 -07:00
	}
	return err;
2015-01-31 10:40:15 -05:00
}
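The error handling in ip6_send_skb() above distinguishes positive return codes (queueing layer verdicts) from negative errnos, and only counts a discard when an error remains after the mapping. A minimal userspace sketch of that control flow follows. All SKETCH_ and sketch_ names are hypothetical; the three code values and the mapping (congestion notification becomes 0, any other positive code becomes -ENOBUFS) are assumed to mirror the kernel's NET_XMIT_* codes and net_xmit_errno(), and the counter stands in for IPSTATS_MIB_OUTDISCARDS.

```c
#include <errno.h>

/* Assumed to mirror the kernel's NET_XMIT_* return codes. */
#define SKETCH_NET_XMIT_SUCCESS	0x00
#define SKETCH_NET_XMIT_DROP	0x01	/* skb dropped */
#define SKETCH_NET_XMIT_CN	0x02	/* congestion notification */

/* Assumed to mirror net_xmit_errno(): congestion notification is not
 * reported as an error, any other positive code becomes -ENOBUFS. */
static int sketch_net_xmit_errno(int e)
{
	return e != SKETCH_NET_XMIT_CN ? -ENOBUFS : 0;
}

/* Same shape as the tail of ip6_send_skb(): map positive codes first,
 * then bump the discard counter only if an error is still left. */
static int sketch_send_result(int local_out_ret, unsigned long *out_discards)
{
	int err = local_out_ret;

	if (err) {
		if (err > 0)
			err = sketch_net_xmit_errno(err);
		if (err)
			(*out_discards)++;
	}
	return err;
}
```

In this sketch a qdisc drop is both counted and returned as -ENOBUFS, a congestion notification is neither, and a negative errno from the output path is counted and passed through unchanged.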
int ip6_push_pending_frames(struct sock *sk)
{
	struct sk_buff *skb;
	skb = ip6_finish_skb(sk);
	if (!skb)
		return 0;
	return ip6_send_skb(skb);
2005-04-16 15:20:36 -07:00
}
2012-04-29 21:48:53 +00:00
EXPORT_SYMBOL_GPL(ip6_push_pending_frames);
2005-04-16 15:20:36 -07:00
2015-01-31 10:40:14 -05:00
static void __ip6_flush_pending_frames(struct sock *sk,
2015-01-31 10:40:15 -05:00
				       struct sk_buff_head *queue,
				       struct inet_cork_full *cork,
				       struct inet6_cork *v6_cork)
2005-04-16 15:20:36 -07:00
{
	struct sk_buff *skb;
2015-01-31 10:40:14 -05:00
	while ((skb = __skb_dequeue_tail(queue)) != NULL) {
2009-06-02 05:19:30 +00:00
		if (skb_dst(skb))
			IP6_INC_STATS(sock_net(sk), ip6_dst_idev(skb_dst(skb)),
2007-09-11 11:31:43 +02:00
				      IPSTATS_MIB_OUTDISCARDS);
2005-04-16 15:20:36 -07:00
		kfree_skb(skb);
	}
2015-01-31 10:40:15 -05:00
	ip6_cork_release(cork, v6_cork);
2005-04-16 15:20:36 -07:00
}
2015-01-31 10:40:14 -05:00
void ip6_flush_pending_frames(struct sock *sk)
{
2015-01-31 10:40:15 -05:00
	__ip6_flush_pending_frames(sk, &sk->sk_write_queue,
				   &inet_sk(sk)->cork, &inet6_sk(sk)->cork);
2015-01-31 10:40:14 -05:00
}
2012-04-29 21:48:53 +00:00
EXPORT_SYMBOL_GPL(ip6_flush_pending_frames);
2015-01-31 10:40:15 -05:00
struct sk_buff *ip6_make_skb(struct sock *sk,
			     int getfrag(void *from, char *to, int offset,
					 int len, int odd, struct sk_buff *skb),
			     void *from, int length, int transhdrlen,
2016-05-02 21:40:07 -07:00
			     struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
2015-01-31 10:40:15 -05:00
			     struct rt6_info *rt, unsigned int flags,
2018-07-06 10:12:57 -04:00
			     struct inet_cork_full *cork)
2015-01-31 10:40:15 -05:00
{
	struct inet6_cork v6_cork;
	struct sk_buff_head queue;
2016-05-02 21:40:07 -07:00
	int exthdrlen = (ipc6->opt ? ipc6->opt->opt_flen : 0);
2015-01-31 10:40:15 -05:00
	int err;
	if (flags & MSG_PROBE)
		return NULL;
	__skb_queue_head_init(&queue);
2018-04-26 13:42:15 -04:00
	cork->base.flags = 0;
	cork->base.addr = 0;
	cork->base.opt = NULL;
	cork->base.dst = NULL;
2015-01-31 10:40:15 -05:00
	v6_cork.opt = NULL;
2018-07-06 10:12:57 -04:00
	err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6);
2018-01-10 03:45:49 -08:00
	if (err) {
2018-04-26 13:42:15 -04:00
		ip6_cork_release(cork, &v6_cork);
2015-01-31 10:40:15 -05:00
		return ERR_PTR(err);
2018-01-10 03:45:49 -08:00
	}
2016-05-02 21:40:07 -07:00
	if (ipc6->dontfrag < 0)
		ipc6->dontfrag = inet6_sk(sk)->dontfrag;
2015-01-31 10:40:15 -05:00
2018-04-26 13:42:15 -04:00
	err = __ip6_append_data(sk, fl6, &queue, &cork->base, &v6_cork,
2015-01-31 10:40:15 -05:00
				&current->task_frag, getfrag, from,
				length + exthdrlen, transhdrlen + exthdrlen,
2018-07-06 10:12:57 -04:00
				flags, ipc6);
2015-01-31 10:40:15 -05:00
	if (err) {
2018-04-26 13:42:15 -04:00
		__ip6_flush_pending_frames(sk, &queue, cork, &v6_cork);
2015-01-31 10:40:15 -05:00
		return ERR_PTR(err);
	}
2018-04-26 13:42:15 -04:00
	return __ip6_make_skb(sk, &queue, cork, &v6_cork);
2015-01-31 10:40:15 -05:00
}
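ip6_make_skb() above has three kinds of return value: NULL when MSG_PROBE is set, ERR_PTR(err) when setup or append fails, and a valid skb otherwise. A caller therefore has to tell an encoded errno apart from a real pointer. A minimal userspace sketch of that pointer encoding follows. The sketch_ helpers are hypothetical reimplementations written for illustration, not the kernel's err.h; the 4095 limit is assumed to match the kernel's MAX_ERRNO.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's MAX_ERRNO: the top 4095 address values
 * are reserved for encoded negative errnos. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno from an encoded pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr falls in the reserved range at the top of the address
 * space, i.e. it encodes an errno rather than pointing at an object. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* True for both the "nothing to send" (NULL) and the error case. */
static inline bool sketch_is_err_or_null(const void *ptr)
{
	return !ptr || sketch_is_err(ptr);
}
```

A caller in this style would first check sketch_is_err_or_null() on the result, treat NULL as "nothing to send", read the errno with sketch_ptr_err() in the error case, and only hand a remaining valid pointer on to the send path.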