// SPDX-License-Identifier: GPL-2.0-only
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		The Internet Protocol (IP) output module.
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Donald Becker, <becker@super.org>
 *		Alan Cox, <Alan.Cox@linux.org>
 *		Richard Underwood
 *		Stefan Becker, <stefanb@yello.ping.de>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Hirokazu Takahashi, <taka@valinux.co.jp>
 *
 *	See ip_input.c for original log
 *
 *	Fixes:
 *		Alan Cox	:	Missing nonblock feature in ip_build_xmit.
 *		Mike Kilburn	:	htons() missing in ip_build_xmit.
 *		Bradford Johnson:	Fix faulty handling of some frames when
 *					no route is found.
 *		Alexander Demenshin:	Missing sk/skb free in ip_queue_xmit
 *					(in case if packet not accepted by
 *					output firewall rules)
 *		Mike McLagan	:	Routing by source
 *		Alexey Kuznetsov:	use new route cache
 *		Andi Kleen:		Fix broken PMTU recovery and remove
 *					some redundant tests.
 *	Vitaly E. Lavrov	:	Transparent proxy revived after year coma.
 *		Andi Kleen	:	Replace ip_reply with ip_send_reply.
 *		Andi Kleen	:	Split fast and slow ip_build_xmit path
 *					for decreased register pressure on x86
 *					and more readability.
 *		Marc Boucher	:	When call_out_firewall returns FW_QUEUE,
 *					silently drop skb instead of failing with -EPERM.
 *		Detlev Wengorz	:	Copy protocol for fragments.
 *		Hirokazu Takahashi:	HW checksumming for outgoing UDP
 *					datagrams.
 *		Hirokazu Takahashi:	sendfile() on UDP works now.
 */

#include <linux/uaccess.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/in.h>
#include <linux/inet.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/init.h>

#include <net/snmp.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <net/route.h>
#include <net/xfrm.h>
#include <linux/skbuff.h>
#include <net/sock.h>
#include <net/arp.h>
#include <net/icmp.h>
#include <net/checksum.h>
#include <net/gso.h>
#include <net/inetpeer.h>
#include <net/inet_ecn.h>
#include <net/lwtunnel.h>
#include <linux/bpf-cgroup.h>
#include <linux/igmp.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_bridge.h>
#include <linux/netlink.h>
#include <linux/tcp.h>

static int
ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
	    unsigned int mtu,
	    int (*output)(struct net *, struct sock *, struct sk_buff *));

/* Generate a checksum for an outgoing IP datagram. */
void ip_send_check(struct iphdr *iph)
{
	iph->check = 0;
	iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
}
EXPORT_SYMBOL(ip_send_check);

int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct iphdr *iph = ip_hdr(skb);

	IP_INC_STATS(net, IPSTATS_MIB_OUTREQUESTS);

	iph_set_totlen(iph, skb->len);
	ip_send_check(iph);

	/* if egress device is enslaved to an L3 master device pass the
	 * skb to its handler for processing
	 */
	skb = l3mdev_ip_out(sk, skb);
	if (unlikely(!skb))
		return 0;

	skb->protocol = htons(ETH_P_IP);

	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
		       net, sk, skb, NULL, skb_dst(skb)->dev,
		       dst_output);
}

int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	int err;

	err = __ip_local_out(net, sk, skb);
	if (likely(err == 1))
		err = dst_output(net, sk, skb);

	return err;
}
EXPORT_SYMBOL_GPL(ip_local_out);

static inline int ip_select_ttl(const struct inet_sock *inet,
				const struct dst_entry *dst)
{
	int ttl = READ_ONCE(inet->uc_ttl);

	if (ttl < 0)
		ttl = ip4_dst_hoplimit(dst);
	return ttl;
}

/*
 *		Add an ip header to a skbuff and send it out.
 *
 */
int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
			  __be32 saddr, __be32 daddr, struct ip_options_rcu *opt,
			  u8 tos)
{
	const struct inet_sock *inet = inet_sk(sk);
	struct rtable *rt = skb_rtable(skb);
	struct net *net = sock_net(sk);
	struct iphdr *iph;

	/* Build the IP header. */
	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->opt.optlen : 0));
	skb_reset_network_header(skb);
	iph = ip_hdr(skb);
	iph->version = 4;
	iph->ihl = 5;
	iph->tos = tos;
	iph->ttl = ip_select_ttl(inet, &rt->dst);
	iph->daddr = (opt && opt->opt.srr ? opt->opt.faddr : daddr);
	iph->saddr = saddr;
	iph->protocol = sk->sk_protocol;
	/* Do not bother generating IPID for small packets (eg SYNACK) */
	if (skb->len <= IPV4_MIN_MTU || ip_dont_fragment(sk, &rt->dst)) {
		iph->frag_off = htons(IP_DF);
		iph->id = 0;
	} else {
		iph->frag_off = 0;
		/* TCP packets here are SYNACK with fat IPv4/TCP options.
		 * Avoid using the hashed IP ident generator.
		 */
		if (sk->sk_protocol == IPPROTO_TCP)
			iph->id = (__force __be16)get_random_u16();
		else
			__ip_select_ident(net, iph, 1);
	}

	if (opt && opt->opt.optlen) {
		iph->ihl += opt->opt.optlen >> 2;
		ip_options_build(skb, &opt->opt, daddr, rt);
	}

	skb->priority = READ_ONCE(sk->sk_priority);
	if (!skb->mark)
		skb->mark = READ_ONCE(sk->sk_mark);

	/* Send it out. */
	return ip_local_out(net, skb->sk, skb);
}
EXPORT_SYMBOL_GPL(ip_build_and_send_pkt);

static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct dst_entry *dst = skb_dst(skb);
	struct rtable *rt = (struct rtable *)dst;
	struct net_device *dev = dst->dev;
	unsigned int hh_len = LL_RESERVED_SPACE(dev);
	struct neighbour *neigh;
	bool is_v6gw = false;

	if (rt->rt_type == RTN_MULTICAST) {
		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
	} else if (rt->rt_type == RTN_BROADCAST)
		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);

	/* OUTOCTETS should be counted after fragment */
	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);

	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
		skb = skb_expand_head(skb, hh_len);
		if (!skb)
			return -ENOMEM;
	}

	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
		int res = lwtunnel_xmit(skb);

		if (res != LWTUNNEL_XMIT_CONTINUE)
			return res;
	}

	rcu_read_lock();
	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
	if (!IS_ERR(neigh)) {
		int res;

		sock_confirm_neigh(skb, neigh);
		/* if crossing protocols, can not use the cached header */
		res = neigh_output(neigh, skb, is_v6gw);
		rcu_read_unlock();
		return res;
	}
	rcu_read_unlock();

	net_dbg_ratelimited("%s: No header cache and no neighbour!\n",
			    __func__);
	kfree_skb_reason(skb, SKB_DROP_REASON_NEIGH_CREATEFAIL);
	return PTR_ERR(neigh);
}

static int ip_finish_output_gso(struct net *net, struct sock *sk,
				struct sk_buff *skb, unsigned int mtu)
{
	struct sk_buff *segs, *nskb;
	netdev_features_t features;
	int ret = 0;

	/* common case: seglen is <= mtu
	 */
	if (skb_gso_validate_network_len(skb, mtu))
		return ip_finish_output2(net, sk, skb);

	/* Slowpath - GSO segment length exceeds the egress MTU.
	 *
	 * This can happen in several cases:
	 *  - Forwarding of a TCP GRO skb, when DF flag is not set.
	 *  - Forwarding of an skb that arrived on a virtualization interface
	 *    (virtio-net/vhost/tap) with TSO/GSO size set by other network
	 *    stack.
	 *  - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
	 *    interface with a smaller MTU.
	 *  - Arriving GRO skb (or GSO skb in a virtualized environment) that is
	 *    bridged to a NETIF_F_TSO tunnel stacked over an interface with an
	 *    insufficient MTU.
	 */
	features = netif_skb_features(skb);
	BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_GSO_CB_OFFSET);
	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
	if (IS_ERR_OR_NULL(segs)) {
		kfree_skb(skb);
		return -ENOMEM;
	}

	consume_skb(skb);

	skb_list_walk_safe(segs, segs, nskb) {
		int err;

		skb_mark_not_on_list(segs);
		err = ip_fragment(net, sk, segs, mtu, ip_finish_output2);

		if (err && ret == 0)
			ret = err;
	}

	return ret;
}

static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	unsigned int mtu;

#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
	/* Policy lookup after SNAT yielded a new policy */
	if (skb_dst(skb)->xfrm) {
		IPCB(skb)->flags |= IPSKB_REROUTED;
		return dst_output(net, sk, skb);
	}
#endif
	mtu = ip_skb_dst_mtu(sk, skb);
	if (skb_is_gso(skb))
		return ip_finish_output_gso(net, sk, skb, mtu);

	if (skb->len > mtu || IPCB(skb)->frag_max_size)
		return ip_fragment(net, sk, skb, mtu, ip_finish_output2);

	return ip_finish_output2(net, sk, skb);
}

static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	int ret;

	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
	switch (ret) {
	case NET_XMIT_SUCCESS:
		return __ip_finish_output(net, sk, skb);
	case NET_XMIT_CN:
		return __ip_finish_output(net, sk, skb) ? : ret;
	default:
		kfree_skb_reason(skb, SKB_DROP_REASON_BPF_CGROUP_EGRESS);
		return ret;
	}
}

static int ip_mc_finish_output(struct net *net, struct sock *sk,
			       struct sk_buff *skb)
{
	struct rtable *new_rt;
	bool do_cn = false;
	int ret, err;

	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
	switch (ret) {
	case NET_XMIT_CN:
		do_cn = true;
		fallthrough;
	case NET_XMIT_SUCCESS:
		break;
	default:
		kfree_skb_reason(skb, SKB_DROP_REASON_BPF_CGROUP_EGRESS);
		return ret;
	}

	/* Reset rt_iif so that inet_iif() will return skb->skb_iif. Setting
	 * this to non-zero causes ipi_ifindex in in_pktinfo to be overwritten,
	 * see ipv4_pktinfo_prepare().
	 */
	new_rt = rt_dst_clone(net->loopback_dev, skb_rtable(skb));
	if (new_rt) {
		new_rt->rt_iif = 0;
		skb_dst_drop(skb);
		skb_dst_set(skb, &new_rt->dst);
	}

	err = dev_loopback_xmit(net, sk, skb);
	return (do_cn && err) ? ret : err;
}

int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct rtable *rt = skb_rtable(skb);
	struct net_device *dev = rt->dst.dev;

	/*
	 *	If the indicated interface is up and running, send the packet.
	 */
	skb->dev = dev;
	skb->protocol = htons(ETH_P_IP);

	/*
	 *	Multicasts are looped back for other local users
	 */

	if (rt->rt_flags & RTCF_MULTICAST) {
		if (sk_mc_loop(sk)
#ifdef CONFIG_IP_MROUTE
		/* Small optimization: do not loopback not local frames,
		   which returned after forwarding; they will be dropped
		   by ip_mr_input in any case.
		   Note, that local frames are looped back to be delivered
		   to local recipients.

		   This check is duplicated in ip_mr_input at the moment.
		 */
		    &&
		    ((rt->rt_flags & RTCF_LOCAL) ||
		     !(IPCB(skb)->flags & IPSKB_FORWARDED))
#endif
		   ) {
			struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
			if (newskb)
				NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
					net, sk, newskb, NULL, newskb->dev,
					ip_mc_finish_output);
		}

		/* Multicasts with ttl 0 must not go beyond the host */

		if (ip_hdr(skb)->ttl == 0) {
			kfree_skb(skb);
			return 0;
		}
	}

	if (rt->rt_flags & RTCF_BROADCAST) {
		struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
		if (newskb)
			NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
				net, sk, newskb, NULL, newskb->dev,
				ip_mc_finish_output);
	}

	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
			    net, sk, skb, NULL, skb->dev,
			    ip_finish_output,
			    !(IPCB(skb)->flags & IPSKB_REROUTED));
}

int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct net_device *dev = skb_dst(skb)->dev, *indev = skb->dev;

	skb->dev = dev;
	skb->protocol = htons(ETH_P_IP);

	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
			    net, sk, skb, indev, dev,
			    ip_finish_output,
			    !(IPCB(skb)->flags & IPSKB_REROUTED));
}
EXPORT_SYMBOL(ip_output);

/*
 * copy saddr and daddr, possibly using 64bit load/stores
 * Equivalent to :
 *   iph->saddr = fl4->saddr;
 *   iph->daddr = fl4->daddr;
 */
static void ip_copy_addrs(struct iphdr *iph, const struct flowi4 *fl4)
{
	BUILD_BUG_ON(offsetof(typeof(*fl4), daddr) !=
		     offsetof(typeof(*fl4), saddr) + sizeof(fl4->saddr));

	iph->saddr = fl4->saddr;
	iph->daddr = fl4->daddr;
}

/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
		    __u8 tos)
{
	struct inet_sock *inet = inet_sk(sk);
	struct net *net = sock_net(sk);
	struct ip_options_rcu *inet_opt;
	struct flowi4 *fl4;
	struct rtable *rt;
	struct iphdr *iph;
	int res;

	/* Skip all of this if the packet is already routed,
	 * f.e. by something like SCTP.
	 */
	rcu_read_lock();
	inet_opt = rcu_dereference(inet->inet_opt);
	fl4 = &fl->u.ip4;
	rt = skb_rtable(skb);
	if (rt)
		goto packet_routed;

	/* Make sure we can route this packet. */
	rt = (struct rtable *)__sk_dst_check(sk, 0);
	if (!rt) {
		__be32 daddr;

		/* Use correct destination address if we have options. */
		daddr = inet->inet_daddr;
		if (inet_opt && inet_opt->opt.srr)
			daddr = inet_opt->opt.faddr;

		/* If this fails, retransmit mechanism of transport layer will
		 * keep trying until route appears or the connection times
		 * itself out.
		 */
		rt = ip_route_output_ports(net, fl4, sk,
					   daddr, inet->inet_saddr,
					   inet->inet_dport,
					   inet->inet_sport,
					   sk->sk_protocol,
ipv4: Set the routing scope properly in ip_route_output_ports().
Set scope automatically in ip_route_output_ports() (using the socket
SOCK_LOCALROUTE flag). This way, callers don't have to overload the
tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does.
For callers that don't pass a struct sock, this doesn't change anything
as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL.
Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or
RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use
ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with
the RTO_ONLINK flag now becomes unnecessary.
In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos
parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk
parameter is a kernel socket, which doesn't have any configuration path
for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports()
will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c
doesn't need to be modified.
Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as
these macros are now unused.
The objective is to eventually remove RTO_ONLINK entirely to allow
converting ->flowi4_tos to dscp_t. This will ensure proper isolation
between the DSCP and ECN bits, thus minimising the risk of introducing
bugs where TOS values interfere with ECN.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09 19:43:37 +03:00
RT_TOS ( tos ) ,
2011-03-12 08:00:52 +03:00
sk - > sk_bound_dev_if ) ;
if ( IS_ERR ( rt ) )
goto no_route ;
2010-06-11 10:31:35 +04:00
sk_setup_caps ( sk , & rt - > dst ) ;
2005-04-17 02:20:36 +04:00
}
2010-06-11 10:31:35 +04:00
skb_dst_set_noref ( skb , & rt - > dst ) ;
2005-04-17 02:20:36 +04:00
packet_routed :
2019-09-17 20:39:49 +03:00
if ( inet_opt & & inet_opt - > opt . is_strictroute & & rt - > rt_uses_gateway )
2005-04-17 02:20:36 +04:00
goto no_route ;
/* OK, we know where to send it, allocate and build IP header. */
2011-04-21 13:45:37 +04:00
skb_push ( skb , sizeof ( struct iphdr ) + ( inet_opt ? inet_opt - > opt . optlen : 0 ) ) ;
2007-03-11 01:40:39 +03:00
skb_reset_network_header ( skb ) ;
2007-04-21 09:47:35 +04:00
iph = ip_hdr ( skb ) ;
2018-07-02 13:21:11 +03:00
* ( ( __be16 * ) iph ) = htons ( ( 4 < < 12 ) | ( 5 < < 8 ) | ( tos & 0xff ) ) ;
2014-05-05 03:39:18 +04:00
if ( ip_dont_fragment ( sk , & rt - > dst ) & & ! skb - > ignore_df )
2005-04-17 02:20:36 +04:00
iph - > frag_off = htons ( IP_DF ) ;
else
iph - > frag_off = 0 ;
2010-06-11 10:31:35 +04:00
iph - > ttl = ip_select_ttl ( inet , & rt - > dst ) ;
2005-04-17 02:20:36 +04:00
iph - > protocol = sk - > sk_protocol ;
2011-11-30 23:00:53 +04:00
ip_copy_addrs ( iph , fl4 ) ;
2005-04-17 02:20:36 +04:00
/* Transport layer set skb->h.foo itself. */
2011-04-21 13:45:37 +04:00
if ( inet_opt & & inet_opt - > opt . optlen ) {
iph - > ihl + = inet_opt - > opt . optlen > > 2 ;
2022-01-28 19:06:54 +03:00
ip_options_build ( skb , & inet_opt - > opt , inet - > inet_daddr , rt ) ;
2005-04-17 02:20:36 +04:00
}
2015-10-08 00:48:42 +03:00
ip_select_ident_segs ( net , skb , sk ,
2015-03-25 19:07:44 +03:00
skb_shinfo ( skb ) - > gso_segs ? : 1 ) ;
2005-04-17 02:20:36 +04:00
2014-04-15 20:58:34 +04:00
/* TODO : should we use skb->sk here instead of sk ? */
2023-07-28 18:03:18 +03:00
skb - > priority = READ_ONCE ( sk - > sk_priority ) ;
2023-07-28 18:03:15 +03:00
skb - > mark = READ_ONCE ( sk - > sk_mark ) ;
2005-04-17 02:20:36 +04:00
2015-10-08 00:48:46 +03:00
res = ip_local_out ( net , sk , skb ) ;
2010-05-10 15:31:49 +04:00
rcu_read_unlock ( ) ;
return res ;
2005-04-17 02:20:36 +04:00
no_route :
2010-05-10 15:31:49 +04:00
rcu_read_unlock ( ) ;
2015-10-08 00:48:42 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_OUTNOROUTES ) ;
2022-02-26 07:18:29 +03:00
kfree_skb_reason ( skb , SKB_DROP_REASON_IP_OUTNOROUTES ) ;
2005-04-17 02:20:36 +04:00
return - EHOSTUNREACH ;
}
2018-07-02 13:21:11 +03:00
EXPORT_SYMBOL ( __ip_queue_xmit ) ;
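The single 16-bit store in __ip_queue_xmit() fills in version, header length and TOS at once. The sketch below is a userspace illustration under stated assumptions: `first_word_sketch()` and `add_options_sketch()` are hypothetical helpers, and the header is modelled as two raw bytes rather than as struct iphdr. It shows that the store yields the byte 0x45 (version 4, IHL 5) followed by the TOS byte, and how the later `ihl += optlen >> 2` step grows the low nibble.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdint.h>
#include <string.h>

/* Build the first two header bytes: version 4, header length of five
 * 32-bit words (no options yet), then the TOS byte.
 */
static void first_word_sketch(uint8_t out[2], uint8_t tos)
{
	uint16_t w = htons((4 << 12) | (5 << 8) | (tos & 0xff));

	memcpy(out, &w, sizeof(w));
}

/* Mirror of "iph->ihl += optlen >> 2": options are counted in 32-bit
 * words and added to the low nibble of the first byte.
 */
static void add_options_sketch(uint8_t out[2], unsigned int optlen)
{
	unsigned int ihl = (out[0] & 0x0f) + (optlen >> 2);

	out[0] = (uint8_t)((out[0] & 0xf0) | (ihl & 0x0f));
}
```

With a TOS of 0x10 and 8 bytes of options, the first byte goes from 0x45 to 0x47 and the second byte stays 0x10.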
2005-04-17 02:20:36 +04:00
2020-06-19 22:12:34 +03:00
int ip_queue_xmit ( struct sock * sk , struct sk_buff * skb , struct flowi * fl )
{
2023-09-22 06:42:16 +03:00
return __ip_queue_xmit ( sk , skb , fl , READ_ONCE ( inet_sk ( sk ) - > tos ) ) ;
2020-06-19 22:12:34 +03:00
}
EXPORT_SYMBOL ( ip_queue_xmit ) ;
2005-04-17 02:20:36 +04:00
static void ip_copy_metadata ( struct sk_buff * to , struct sk_buff * from )
{
to - > pkt_type = from - > pkt_type ;
to - > priority = from - > priority ;
to - > protocol = from - > protocol ;
2019-04-29 16:39:30 +03:00
to - > skb_iif = from - > skb_iif ;
2009-06-02 09:19:30 +04:00
skb_dst_drop ( to ) ;
2010-07-02 03:48:22 +04:00
skb_dst_copy ( to , from ) ;
2005-04-17 02:20:36 +04:00
to - > dev = from - > dev ;
2006-11-10 02:19:14 +03:00
to - > mark = from - > mark ;
2005-04-17 02:20:36 +04:00
2018-07-23 17:50:48 +03:00
skb_copy_hash ( to , from ) ;
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_NET_SCHED
to - > tc_index = from - > tc_index ;
# endif
2007-03-15 02:44:01 +03:00
nf_copy ( to , from ) ;
sk_buff: add skb extension infrastructure
This adds an optional extension infrastructure, with ipsec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
an MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preserved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (it's unlikely that all extensions are active/needed at the same
time for the same skb).
The allocated memory (except the small extension header) is not cleared,
so no additional overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If refcount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18 19:15:16 +03:00
skb_ext_copy ( to , from ) ;
2016-09-09 15:43:16 +03:00
# if IS_ENABLED(CONFIG_IP_VS)
2005-10-22 14:39:21 +04:00
to - > ipvs_property = from - > ipvs_property ;
2005-04-17 02:20:36 +04:00
# endif
2006-06-09 11:29:17 +04:00
skb_copy_secmark ( to , from ) ;
2005-04-17 02:20:36 +04:00
}
2015-06-13 05:55:31 +03:00
static int ip_fragment ( struct net * net , struct sock * sk , struct sk_buff * skb ,
2015-05-22 17:32:50 +03:00
unsigned int mtu ,
2015-06-13 05:55:31 +03:00
int ( * output ) ( struct net * , struct sock * , struct sk_buff * ) )
2015-05-16 00:15:37 +03:00
{
struct iphdr * iph = ip_hdr ( skb ) ;
2015-05-22 17:32:51 +03:00
if ( ( iph - > frag_off & htons ( IP_DF ) ) = = 0 )
2015-06-13 05:55:31 +03:00
return ip_do_fragment ( net , sk , skb , output ) ;
2015-05-22 17:32:51 +03:00
if ( unlikely ( ! skb - > ignore_df | |
2015-05-16 00:15:37 +03:00
( IPCB ( skb ) - > frag_max_size & &
IPCB ( skb ) - > frag_max_size > mtu ) ) ) {
2015-09-16 04:04:00 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGFAILS ) ;
2015-05-16 00:15:37 +03:00
icmp_send ( skb , ICMP_DEST_UNREACH , ICMP_FRAG_NEEDED ,
htonl ( mtu ) ) ;
kfree_skb ( skb ) ;
return - EMSGSIZE ;
}
2015-06-13 05:55:31 +03:00
return ip_do_fragment ( net , sk , skb , output ) ;
2015-05-16 00:15:37 +03:00
}
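The decision made by ip_fragment() can be restated as a small pure function. This is a sketch only: `frag_verdict_sketch()` and the `frag_verdict` enum are hypothetical names, and the ICMP error, statistics and skb freeing done on the reject path are left out.

```c
#include <stdbool.h>

enum frag_verdict { FRAG_DO, FRAG_REJECT };

/* df_set:        IP_DF is set in the header's frag_off
 * ignore_df:     skb->ignore_df
 * frag_max_size: IPCB(skb)->frag_max_size (0 when not recorded)
 * mtu:           path MTU passed to ip_fragment()
 */
static enum frag_verdict frag_verdict_sketch(bool df_set, bool ignore_df,
					     unsigned int frag_max_size,
					     unsigned int mtu)
{
	/* Without DF, fragmentation is always allowed. */
	if (!df_set)
		return FRAG_DO;

	/* With DF, fragment only if the sender asked to ignore DF and
	 * no recorded fragment size exceeds the MTU.
	 */
	if (!ignore_df || (frag_max_size && frag_max_size > mtu))
		return FRAG_REJECT;

	return FRAG_DO;
}
```

FRAG_REJECT corresponds to the branch that sends ICMP_FRAG_NEEDED and returns -EMSGSIZE. FRAG_DO corresponds to both calls to ip_do_fragment().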
2019-05-29 14:25:31 +03:00
void ip_fraglist_init ( struct sk_buff * skb , struct iphdr * iph ,
unsigned int hlen , struct ip_fraglist_iter * iter )
{
unsigned int first_len = skb_pagelen ( skb ) ;
2019-06-02 21:24:18 +03:00
iter - > frag = skb_shinfo ( skb ) - > frag_list ;
2019-05-29 14:25:31 +03:00
skb_frag_list_init ( skb ) ;
iter - > offset = 0 ;
iter - > iph = iph ;
iter - > hlen = hlen ;
skb - > data_len = first_len - skb_headlen ( skb ) ;
skb - > len = first_len ;
iph - > tot_len = htons ( first_len ) ;
iph - > frag_off = htons ( IP_MF ) ;
ip_send_check ( iph ) ;
}
EXPORT_SYMBOL ( ip_fraglist_init ) ;
void ip_fraglist_prepare ( struct sk_buff * skb , struct ip_fraglist_iter * iter )
{
unsigned int hlen = iter - > hlen ;
struct iphdr * iph = iter - > iph ;
struct sk_buff * frag ;
frag = iter - > frag ;
frag - > ip_summed = CHECKSUM_NONE ;
skb_reset_transport_header ( frag ) ;
__skb_push ( frag , hlen ) ;
skb_reset_network_header ( frag ) ;
memcpy ( skb_network_header ( frag ) , iph , hlen ) ;
iter - > iph = ip_hdr ( frag ) ;
iph = iter - > iph ;
iph - > tot_len = htons ( frag - > len ) ;
ip_copy_metadata ( frag , skb ) ;
iter - > offset + = skb - > len - hlen ;
iph - > frag_off = htons ( iter - > offset > > 3 ) ;
if ( frag - > next )
iph - > frag_off | = htons ( IP_MF ) ;
/* Ready, complete checksum */
ip_send_check ( iph ) ;
}
EXPORT_SYMBOL ( ip_fraglist_prepare ) ;
2019-05-29 14:25:33 +03:00
void ip_frag_init ( struct sk_buff * skb , unsigned int hlen ,
2019-10-19 19:26:37 +03:00
unsigned int ll_rs , unsigned int mtu , bool DF ,
2019-05-29 14:25:33 +03:00
struct ip_frag_state * state )
{
struct iphdr * iph = ip_hdr ( skb ) ;
2019-10-19 19:26:37 +03:00
state - > DF = DF ;
2019-05-29 14:25:33 +03:00
state - > hlen = hlen ;
state - > ll_rs = ll_rs ;
state - > mtu = mtu ;
state - > left = skb - > len - hlen ; /* Payload bytes still to fragment */
state - > ptr = hlen ; /* Where to start from */
state - > offset = ( ntohs ( iph - > frag_off ) & IP_OFFSET ) < < 3 ;
state - > not_last_frag = iph - > frag_off & htons ( IP_MF ) ;
}
EXPORT_SYMBOL ( ip_frag_init ) ;
2019-05-29 14:25:35 +03:00
static void ip_frag_ipcb ( struct sk_buff * from , struct sk_buff * to ,
2021-08-23 06:17:59 +03:00
bool first_frag )
2019-05-29 14:25:35 +03:00
{
/* Copy the flags to each fragment. */
IPCB ( to ) - > flags = IPCB ( from ) - > flags ;
/* ANK: dirty, but effective trick. Upgrade options only if
* the segment to be fragmented was THE FIRST ( otherwise ,
* options are already fixed ) and make it ONCE
* on the initial skb , so that all the following fragments
* will inherit fixed options .
*/
if ( first_frag )
ip_options_fragment ( from ) ;
}
2019-05-29 14:25:33 +03:00
struct sk_buff * ip_frag_next ( struct sk_buff * skb , struct ip_frag_state * state )
{
unsigned int len = state - > left ;
struct sk_buff * skb2 ;
struct iphdr * iph ;
/* IF: it doesn't fit, use 'mtu' - the data space left */
if ( len > state - > mtu )
len = state - > mtu ;
/* IF: we are not sending up to and including the packet end
then align the next start on an eight byte boundary */
if ( len < state - > left ) {
len & = ~ 7 ;
}
/* Allocate buffer */
skb2 = alloc_skb ( len + state - > hlen + state - > ll_rs , GFP_ATOMIC ) ;
if ( ! skb2 )
return ERR_PTR ( - ENOMEM ) ;
/*
* Set up data on packet
*/
ip_copy_metadata ( skb2 , skb ) ;
skb_reserve ( skb2 , state - > ll_rs ) ;
skb_put ( skb2 , len + state - > hlen ) ;
skb_reset_network_header ( skb2 ) ;
skb2 - > transport_header = skb2 - > network_header + state - > hlen ;
/*
* Charge the memory for the fragment to any owner
* it might possess
*/
if ( skb - > sk )
skb_set_owner_w ( skb2 , skb - > sk ) ;
/*
* Copy the packet header into the new buffer .
*/
skb_copy_from_linear_data ( skb , skb_network_header ( skb2 ) , state - > hlen ) ;
/*
* Copy a block of the IP datagram .
*/
if ( skb_copy_bits ( skb , state - > ptr , skb_transport_header ( skb2 ) , len ) )
BUG ( ) ;
state - > left - = len ;
/*
* Fill in the new header fields .
*/
iph = ip_hdr ( skb2 ) ;
iph - > frag_off = htons ( ( state - > offset > > 3 ) ) ;
2019-10-19 19:26:37 +03:00
if ( state - > DF )
iph - > frag_off | = htons ( IP_DF ) ;
2019-05-29 14:25:33 +03:00
/*
 * Added AC : If we are fragmenting a fragment that's not the
 * last fragment then keep MF set on each piece
 */
if ( state - > left > 0 | | state - > not_last_frag )
iph - > frag_off | = htons ( IP_MF ) ;
state - > ptr + = len ;
state - > offset + = len ;
iph - > tot_len = htons ( len + state - > hlen ) ;
ip_send_check ( iph ) ;
return skb2 ;
}
EXPORT_SYMBOL ( ip_frag_next ) ;
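The length and offset arithmetic of ip_frag_next() can be followed without any skb handling. The sketch below is a userspace model under stated assumptions: `frag_plan`, `frag_desc` and `frag_plan_next()` are hypothetical names, buffers are not allocated or copied, and the DF and not_last_frag inputs of the real function are omitted. It keeps the three steps that decide fragment geometry: clamp to the data space, round down to a multiple of 8 unless this is the final piece, and advance the offset.

```c
#include <stdbool.h>

/* Hypothetical userspace mirror of the state ip_frag_next() consumes. */
struct frag_plan {
	unsigned int hlen;	/* IP header length, options included */
	unsigned int mtu;	/* data space per fragment: link MTU - hlen */
	unsigned int left;	/* payload bytes not yet emitted */
	unsigned int offset;	/* payload offset of the next fragment */
};

struct frag_desc {
	unsigned int offset;	/* payload offset carried by this fragment */
	unsigned int tot_len;	/* header plus data, as stored in tot_len */
	bool more_fragments;	/* whether IP_MF would be set */
};

/* Describe the next fragment; returns false once the payload is exhausted. */
static bool frag_plan_next(struct frag_plan *p, struct frag_desc *d)
{
	unsigned int len = p->left;

	if (len == 0)
		return false;

	/* If it doesn't fit, use the data space available per fragment. */
	if (len > p->mtu)
		len = p->mtu;

	/* Every fragment but the last must carry a multiple of 8 bytes,
	 * because frag_off counts in 8-byte units.
	 */
	if (len < p->left)
		len &= ~7u;

	p->left -= len;
	d->offset = p->offset;
	d->tot_len = len + p->hlen;
	d->more_fragments = p->left > 0;
	p->offset += len;
	return true;
}
```

Take a 28-byte header (20 bytes plus 8 bytes of options), a link MTU of 1000 (so a data space of 972) and 2008 bytes of payload. The model yields fragments at offsets 0, 968 and 1936 with total lengths 996, 996 and 100.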
2005-04-17 02:20:36 +04:00
/*
* This IP datagram is too large to be sent in one piece . Break it up into
* smaller pieces ( each of size equal to IP header plus
* a block of the data of the original IP data part ) that will yet fit in a
* single device frame , and queue such a frame for sending .
*/
2015-06-13 05:55:31 +03:00
int ip_do_fragment ( struct net * net , struct sock * sk , struct sk_buff * skb ,
int ( * output ) ( struct net * , struct sock * , struct sk_buff * ) )
2005-04-17 02:20:36 +04:00
{
struct iphdr * iph ;
struct sk_buff * skb2 ;
2022-03-02 22:55:25 +03:00
bool mono_delivery_time = skb - > mono_delivery_time ;
2009-06-02 09:14:27 +04:00
struct rtable * rt = skb_rtable ( skb ) ;
2019-05-29 14:25:33 +03:00
unsigned int mtu , hlen , ll_rs ;
2019-05-29 14:25:31 +03:00
struct ip_fraglist_iter iter ;
2019-10-17 04:00:56 +03:00
ktime_t tstamp = skb - > tstamp ;
2019-05-29 14:25:33 +03:00
struct ip_frag_state state ;
2005-04-17 02:20:36 +04:00
int err = 0 ;
2015-10-28 00:40:40 +03:00
/* for offloaded checksums cleanup checksum before fragmentation */
if ( skb - > ip_summed = = CHECKSUM_PARTIAL & &
( err = skb_checksum_help ( skb ) ) )
goto fail ;
2005-04-17 02:20:36 +04:00
/*
* Point into the IP datagram header .
*/
2007-04-21 09:47:35 +04:00
iph = ip_hdr ( skb ) ;
2005-04-17 02:20:36 +04:00
2016-06-29 21:47:03 +03:00
mtu = ip_skb_dst_mtu ( sk , skb ) ;
2015-05-22 17:32:51 +03:00
if ( IPCB ( skb ) - > frag_max_size & & IPCB ( skb ) - > frag_max_size < mtu )
mtu = IPCB ( skb ) - > frag_max_size ;
2005-04-17 02:20:36 +04:00
/*
* Setup starting values .
*/
hlen = iph - > ihl * 4 ;
2014-01-09 13:01:15 +04:00
mtu = mtu - hlen ; /* Size of data space */
2005-12-14 10:14:27 +03:00
IPCB ( skb ) - > flags | = IPSKB_FRAG_COMPLETE ;
2017-07-14 12:04:16 +03:00
ll_rs = LL_RESERVED_SPACE ( rt - > dst . dev ) ;
2005-04-17 02:20:36 +04:00
/* When frag_list is given, use it. First, check its validity:
* some transformers could create wrong frag_list or break existing
* one , it is not prohibited . In this case fall back to copying .
*
* LATER : this step can be merged to real generation of fragments ,
* we can switch to copy when see the first bad fragment .
*/
2010-08-23 11:13:46 +04:00
if ( skb_has_frag_list ( skb ) ) {
2010-09-21 12:47:45 +04:00
struct sk_buff * frag , * frag2 ;
2016-11-19 04:08:08 +03:00
unsigned int first_len = skb_pagelen ( skb ) ;
2005-04-17 02:20:36 +04:00
if ( first_len - hlen > mtu | |
( ( first_len - hlen ) & 7 ) | |
2011-06-22 07:33:34 +04:00
ip_is_fragment ( iph ) | |
2017-07-14 12:04:16 +03:00
skb_cloned ( skb ) | |
skb_headroom ( skb ) < ll_rs )
2005-04-17 02:20:36 +04:00
goto slow_path ;
2009-06-09 11:19:37 +04:00
skb_walk_frags ( skb , frag ) {
2005-04-17 02:20:36 +04:00
/* Correct geometry. */
if ( frag - > len > mtu | |
( ( frag - > len & 7 ) & & frag - > next ) | |
2017-07-14 12:04:16 +03:00
skb_headroom ( frag ) < hlen + ll_rs )
2010-09-21 12:47:45 +04:00
goto slow_path_clean ;
2005-04-17 02:20:36 +04:00
/* Partially cloned skb? */
if ( skb_shared ( frag ) )
2010-09-21 12:47:45 +04:00
goto slow_path_clean ;
2005-05-19 09:52:33 +04:00
BUG_ON ( frag - > sk ) ;
if ( skb - > sk ) {
frag - > sk = skb - > sk ;
frag - > destructor = sock_wfree ;
}
2010-09-21 12:47:45 +04:00
skb - > truesize - = frag - > truesize ;
2005-04-17 02:20:36 +04:00
}
/* Everything is OK. Generate! */
2019-05-29 14:25:31 +03:00
ip_fraglist_init ( skb , iph , hlen , & iter ) ;
2021-08-30 12:16:40 +03:00
2005-04-17 02:20:36 +04:00
for ( ; ; ) {
/* Prepare header of the next frame,
* before previous one went down . */
2019-05-29 14:25:35 +03:00
if ( iter . frag ) {
ipv4: fix ip option filtering for locally generated fragments
During IP fragmentation we sanitize IP options. This means overwriting
options which should not be copied with NOPs. Only the first fragment
has the original, full options.
ip_fraglist_prepare() copies the IP header and options from previous
fragment to the next one. Commit 19c3401a917b ("net: ipv4: place control
buffer handling away from fragmentation iterators") moved sanitizing
options before ip_fraglist_prepare() which means options are sanitized
and then overwritten again with the old values.
Fixing this is not enough, however, nor did the sanitization work
prior to aforementioned commit.
ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
for the length of the options. ipcb->opt of fragments is not populated
(it's 0), only the head skb has the state properly built. So even when
called at the right time ip_options_fragment() does nothing. This seems
to date back all the way to v2.5.44 when the fast path for pre-fragmented
skbs had been introduced. Prior to that ip_options_build() would have been
called for every fragment (in fact ever since v2.5.44 the fragmentation
handling in ip_options_build() has been dead code, I'll clean it up in
-next).
In the original patch (see Link) caixf mentions fixing the handling
for fragments other than the second one, but I'm not sure how _any_
fragment could have had their options sanitized with the code
as it stood.
Tested with python (MTU on lo lowered to 1000 to force fragmentation):
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
bytearray([7,4,5,192, 20|0x80,4,1,0]))
s.sendto(b'1'*2000, ('127.0.0.1', 1234))
Before:
IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.36500 > localhost.search-agent: UDP, length 2000
IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
After:
IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: caixf <ooppublic@163.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-22 03:57:31 +03:00
bool first_frag = ( iter . offset = = 0 ) ;
2021-08-23 06:17:59 +03:00
IPCB ( iter . frag ) - > flags = IPCB ( skb ) - > flags ;
2019-05-29 14:25:31 +03:00
ip_fraglist_prepare ( skb , & iter ) ;
2022-01-22 03:57:31 +03:00
if ( first_frag & & IPCB ( skb ) - > opt . optlen ) {
/* ipcb->opt is not populated for frags
* coming from __ip_make_skb ( ) ,
* ip_options_fragment ( ) needs optlen
*/
IPCB ( iter . frag ) - > opt . optlen =
IPCB ( skb ) - > opt . optlen ;
ip_options_fragment ( iter . frag ) ;
ip_send_check ( iter . iph ) ;
}
2019-05-29 14:25:35 +03:00
}
2005-04-17 02:20:36 +04:00
2022-03-02 22:55:25 +03:00
skb_set_delivery_time ( skb , tstamp , mono_delivery_time ) ;
2015-06-13 05:55:31 +03:00
err = output ( net , sk , skb ) ;
2005-04-17 02:20:36 +04:00
2006-08-03 00:41:21 +04:00
if ( ! err )
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGCREATES ) ;
2019-05-29 14:25:31 +03:00
if ( err | | ! iter . frag )
2005-04-17 02:20:36 +04:00
break ;
2019-05-29 14:25:31 +03:00
skb = ip_fraglist_next ( & iter ) ;
2005-04-17 02:20:36 +04:00
}
if ( err = = 0 ) {
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGOKS ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
2019-06-02 21:24:18 +03:00
kfree_skb_list ( iter . frag ) ;
2019-04-04 14:54:20 +03:00
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGFAILS ) ;
2005-04-17 02:20:36 +04:00
return err ;
2010-09-21 12:47:45 +04:00
slow_path_clean :
skb_walk_frags ( skb , frag2 ) {
if ( frag2 = = frag )
break ;
frag2 - > sk = NULL ;
frag2 - > destructor = NULL ;
skb - > truesize + = frag2 - > truesize ;
}
2005-04-17 02:20:36 +04:00
}
slow_path :
/*
* Fragment the datagram .
*/
2019-10-19 19:26:37 +03:00
ip_frag_init ( skb , hlen , ll_rs , mtu , IPCB ( skb ) - > flags & IPSKB_FRAG_PMTU ,
& state ) ;
2005-04-17 02:20:36 +04:00
/*
* Keep copying data until we run out .
*/
2019-05-29 14:25:33 +03:00
while ( state . left > 0 ) {
2019-05-29 14:25:35 +03:00
bool first_frag = ( state . offset = = 0 ) ;
2019-05-29 14:25:33 +03:00
skb2 = ip_frag_next ( skb , & state ) ;
if ( IS_ERR ( skb2 ) ) {
err = PTR_ERR ( skb2 ) ;
2005-04-17 02:20:36 +04:00
goto fail ;
}
2021-08-23 06:17:59 +03:00
ip_frag_ipcb ( skb , skb2 , first_frag ) ;
2005-04-17 02:20:36 +04:00
/*
* Put this fragment into the sending queue .
*/
2022-03-02 22:55:25 +03:00
skb_set_delivery_time ( skb2 , tstamp , mono_delivery_time ) ;
2015-06-13 05:55:31 +03:00
err = output ( net , sk , skb2 ) ;
2005-04-17 02:20:36 +04:00
if ( err )
goto fail ;
2006-08-03 00:41:21 +04:00
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGCREATES ) ;
2005-04-17 02:20:36 +04:00
}
2012-06-04 05:17:19 +04:00
consume_skb ( skb ) ;
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGOKS ) ;
2005-04-17 02:20:36 +04:00
return err ;
fail :
2007-02-09 17:24:47 +03:00
kfree_skb ( skb ) ;
2015-09-16 04:03:59 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_FRAGFAILS ) ;
2005-04-17 02:20:36 +04:00
return err ;
}
2015-05-16 00:15:37 +03:00
EXPORT_SYMBOL ( ip_do_fragment ) ;
2006-04-05 00:42:35 +04:00
2005-04-17 02:20:36 +04:00
int
ip_generic_getfrag ( void * from , char * to , int offset , int len , int odd , struct sk_buff * skb )
{
2014-11-24 21:23:40 +03:00
struct msghdr * msg = from ;
2005-04-17 02:20:36 +04:00
2006-08-30 03:44:56 +04:00
if ( skb - > ip_summed = = CHECKSUM_PARTIAL ) {
2016-11-04 01:17:31 +03:00
if ( ! copy_from_iter_full ( to , len , & msg - > msg_iter ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
} else {
2006-11-15 08:36:14 +03:00
__wsum csum = 0 ;
2016-11-04 01:17:31 +03:00
if ( ! csum_and_copy_from_iter_full ( to , len , & csum , & msg - > msg_iter ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
skb - > csum = csum_block_add ( skb - > csum , csum , odd ) ;
}
return 0 ;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( ip_generic_getfrag ) ;
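The `odd` argument passed to csum_block_add() in ip_generic_getfrag() matters because a block that starts at an odd byte offset is misaligned by one byte relative to the 16-bit words of the running checksum. The following userspace sketch illustrates the idea with hypothetical helpers (`fold16()`, `csum_partial_sketch()`, `csum_block_add_sketch()`) working on folded 16-bit sums. It is not the kernel's implementation, which operates on 32-bit __wsum values. The point shown is that swapping the two bytes of the second block's sum before adding gives the same result as summing the whole buffer in one pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back into the low 16 bits (ones' complement addition). */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of a buffer taken as big-endian 16-bit words,
 * with a trailing odd byte padded by zero.
 */
static uint16_t csum_partial_sketch(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum = fold16(sum + (((uint32_t)buf[i] << 8) | buf[i + 1]));
	if (i < len)
		sum = fold16(sum + ((uint32_t)buf[i] << 8));
	return (uint16_t)sum;
}

/* Add the sum of a later block to a running sum. When the block began
 * at an odd offset, its bytes sit in the opposite halves of each word,
 * so its sum is byte-swapped before the addition.
 */
static uint16_t csum_block_add_sketch(uint16_t csum, uint16_t csum2, int odd)
{
	if (odd & 1)
		csum2 = (uint16_t)((csum2 << 8) | (csum2 >> 8));
	return fold16((uint32_t)csum + csum2);
}
```

For the bytes 01 02 03 04 05 split after the third byte, the two partial sums are 0x0402 and 0x0405. Combining them with an odd offset gives 0x0906, the same as summing all five bytes at once.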
2005-04-17 02:20:36 +04:00
2011-05-09 04:24:10 +04:00
static int __ip_append_data ( struct sock * sk ,
struct flowi4 * fl4 ,
struct sk_buff_head * queue ,
2011-03-01 05:36:47 +03:00
struct inet_cork * cork ,
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, thats order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 03:04:42 +04:00
struct page_frag * pfrag ,
2011-03-01 05:36:47 +03:00
int getfrag ( void * from , char * to , int offset ,
int len , int odd , struct sk_buff * skb ) ,
void * from , int length , int transhdrlen ,
unsigned int flags )
2005-04-17 02:20:36 +04:00
{
struct inet_sock * inet = inet_sk ( sk ) ;
2018-11-30 23:32:39 +03:00
struct ubuf_info * uarg = NULL ;
2005-04-17 02:20:36 +04:00
struct sk_buff * skb ;
2011-03-02 10:00:58 +03:00
struct ip_options * opt = cork - > opt ;
2005-04-17 02:20:36 +04:00
int hh_len ;
int exthdrlen ;
int mtu ;
int copy ;
int err ;
int offset = 0 ;
2022-07-12 23:52:25 +03:00
bool zc = false ;
2013-10-27 20:29:11 +04:00
unsigned int maxfraglen , fragheaderlen , maxnonfragsize ;
2005-04-17 02:20:36 +04:00
int csummode = CHECKSUM_NONE ;
2011-03-01 05:36:47 +03:00
struct rtable * rt = ( struct rtable * ) cork - > dst ;
2024-02-13 14:04:28 +03:00
bool paged , hold_tskey , extra_uref = false ;
2018-03-31 23:16:25 +03:00
unsigned int wmem_alloc_delta = 0 ;
2014-08-05 06:11:47 +04:00
u32 tskey = 0 ;
2005-04-17 02:20:36 +04:00
2011-06-06 00:48:47 +04:00
skb = skb_peek_tail ( queue ) ;
exthdrlen = ! skb ? rt - > dst . header_len : 0 ;
udp: generate gso with UDP_SEGMENT
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.
To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.
A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.
Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.
Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles
tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles
tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles
udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles
udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles
[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")
Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:
perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
Note the reduction in calls/s with GSO. Bytes per syscall
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
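As a rough illustration of the interface this commit message describes, the sketch below sets UDP_SEGMENT on a socket and works out how one large send would be cut into datagrams. The helper names (set_udp_segment, gso_segment_count, gso_last_segment_len) are invented for this example, and the fallback #define values are assumptions for toolchains whose headers lack them. It is not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>

/* Fallbacks for older libc headers (assumption: values match linux/udp.h). */
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif

/* Hypothetical helper: ask the kernel to cut every send on this UDP
 * socket into datagrams of gso_size payload bytes.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int set_udp_segment(int fd, int gso_size)
{
	return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
}

/* Number of datagrams produced for `total` payload bytes. */
size_t gso_segment_count(size_t total, size_t gso_size)
{
	assert(gso_size > 0);
	return (total + gso_size - 1) / gso_size;
}

/* Payload length of the final datagram; it is shorter than gso_size
 * when total is not an exact multiple of the segment size. */
size_t gso_last_segment_len(size_t total, size_t gso_size)
{
	size_t rem;

	assert(gso_size > 0);
	if (total == 0)
		return 0;
	rem = total % gso_size;
	return rem ? rem : gso_size;
}
```

For instance, a 4000-byte send with a segment size of 1472 would go out as three datagrams, the last one carrying 1056 bytes.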
2018-04-26 20:42:17 +03:00
mtu = cork - > gso_size ? IP_MAX_MTU : cork - > fragsize ;
2018-04-26 20:42:19 +03:00
paged = ! ! cork - > gso_size ;
2018-04-26 20:42:17 +03:00
2010-06-11 10:31:35 +04:00
hh_len = LL_RESERVED_SPACE ( rt - > dst . dev ) ;
2005-04-17 02:20:36 +04:00
fragheaderlen = sizeof ( struct iphdr ) + ( opt ? opt - > optlen : 0 ) ;
maxfraglen = ( ( mtu - fragheaderlen ) & ~ 7 ) + fragheaderlen ;
2020-08-29 12:09:18 +03:00
maxnonfragsize = ip_sk_ignore_df ( sk ) ? IP_MAX_MTU : mtu ;
2005-04-17 02:20:36 +04:00
2013-10-27 20:29:11 +04:00
if ( cork - > length + length > maxnonfragsize - fragheaderlen ) {
2011-05-09 04:24:10 +04:00
ip_local_error ( sk , EMSGSIZE , fl4 - > daddr , inet - > inet_dport ,
2013-12-19 05:13:36 +04:00
mtu - ( opt ? opt - > optlen : 0 ) ) ;
2005-04-17 02:20:36 +04:00
return - EMSGSIZE ;
}
/*
* transhdrlen > 0 means that this is the first fragment and we wish
* it won ' t be fragmented in the future .
*/
if ( transhdrlen & &
length + fragheaderlen < = mtu & &
2015-12-14 22:19:44 +03:00
rt - > dst . dev - > features & ( NETIF_F_HW_CSUM | NETIF_F_IP_CSUM ) & &
2018-04-26 20:42:17 +03:00
( ! ( flags & MSG_MORE ) | | cork - > gso_size ) & &
2018-04-12 22:03:13 +03:00
( ! exthdrlen | | ( rt - > dst . dev - > features & NETIF_F_HW_ESP_TX_CSUM ) ) )
2006-08-30 03:44:56 +04:00
csummode = CHECKSUM_PARTIAL ;
2005-04-17 02:20:36 +04:00
2022-07-12 23:52:33 +03:00
if ( ( flags & MSG_ZEROCOPY ) & & length ) {
struct msghdr * msg = from ;
if ( getfrag = = ip_generic_getfrag & & msg - > msg_ubuf ) {
if ( skb_zcopy ( skb ) & & msg - > msg_ubuf ! = skb_zcopy ( skb ) )
return - EINVAL ;
/* Leave uarg NULL if can't zerocopy, callers should
* be able to handle it .
*/
if ( ( rt - > dst . dev - > features & NETIF_F_SG ) & &
csummode = = CHECKSUM_PARTIAL ) {
paged = true ;
zc = true ;
uarg = msg - > msg_ubuf ;
}
} else if ( sock_flag ( sk , SOCK_ZEROCOPY ) ) {
uarg = msg_zerocopy_realloc ( sk , length , skb_zcopy ( skb ) ) ;
if ( ! uarg )
return - ENOBUFS ;
extra_uref = ! skb_zcopy ( skb ) ; /* only ref on new uarg */
if ( rt - > dst . dev - > features & NETIF_F_SG & &
csummode = = CHECKSUM_PARTIAL ) {
paged = true ;
zc = true ;
} else {
2022-09-23 19:39:04 +03:00
uarg_to_msgzc ( uarg ) - > zerocopy = 0 ;
2022-07-12 23:52:33 +03:00
skb_zcopy_set ( skb , uarg , & extra_uref ) ;
}
2018-11-30 23:32:39 +03:00
}
2023-05-22 15:11:20 +03:00
} else if ( ( flags & MSG_SPLICE_PAGES ) & & length ) {
2023-08-16 11:15:38 +03:00
if ( inet_test_bit ( HDRINCL , sk ) )
2023-05-22 15:11:20 +03:00
return - EPERM ;
2023-06-14 11:04:16 +03:00
if ( rt - > dst . dev - > features & NETIF_F_SG & &
getfrag = = ip_generic_getfrag )
2023-05-22 15:11:20 +03:00
/* We need an empty buffer to attach stuff to */
paged = true ;
else
flags & = ~ MSG_SPLICE_PAGES ;
2018-11-30 23:32:39 +03:00
}
2011-03-01 05:36:47 +03:00
cork - > length + = length ;
2005-04-17 02:20:36 +04:00
2024-02-13 14:04:28 +03:00
hold_tskey = cork - > tx_flags & SKBTX_ANY_TSTAMP & &
READ_ONCE ( sk - > sk_tsflags ) & SOF_TIMESTAMPING_OPT_ID ;
if ( hold_tskey )
tskey = atomic_inc_return ( & sk - > sk_tskey ) - 1 ;
2005-04-17 02:20:36 +04:00
/* So, what's going on in the loop below?
*
* We use the calculated fragment length to generate a chain of skbs ,
* each of which is an IP fragment ready for sending to the network after
* adding the appropriate IP header .
*/
2010-06-15 05:52:25 +04:00
if ( ! skb )
2005-04-17 02:20:36 +04:00
goto alloc_new_skb ;
while ( length > 0 ) {
/* Check if the remaining data fits into current packet. */
copy = mtu - skb - > len ;
if ( copy < length )
copy = maxfraglen - skb - > len ;
if ( copy < = 0 ) {
char * data ;
unsigned int datalen ;
unsigned int fraglen ;
unsigned int fraggap ;
2021-06-24 00:44:38 +03:00
unsigned int alloclen , alloc_extra ;
2018-11-24 22:21:16 +03:00
unsigned int pagedlen ;
2005-04-17 02:20:36 +04:00
struct sk_buff * skb_prev ;
alloc_new_skb :
skb_prev = skb ;
if ( skb_prev )
fraggap = skb_prev - > len - maxfraglen ;
else
fraggap = 0 ;
/*
* If remaining data exceeds the mtu ,
* we know we need more fragment ( s ) .
*/
datalen = length + fraggap ;
if ( datalen > mtu - fragheaderlen )
datalen = maxfraglen - fragheaderlen ;
fraglen = datalen + fragheaderlen ;
2018-11-24 22:21:16 +03:00
pagedlen = 0 ;
2005-04-17 02:20:36 +04:00
2021-06-24 00:44:38 +03:00
alloc_extra = hh_len + 15 ;
alloc_extra + = exthdrlen ;
/* The last fragment gets additional space at tail.
* Note , with MSG_MORE we overallocate on fragments ,
* because we have no idea what fragment will be
* the last .
*/
if ( datalen = = length + fraggap )
alloc_extra + = rt - > dst . trailer_len ;
2007-02-09 17:24:47 +03:00
if ( ( flags & MSG_MORE ) & &
2010-06-11 10:31:35 +04:00
! ( rt - > dst . dev - > features & NETIF_F_SG ) )
2005-04-17 02:20:36 +04:00
alloclen = mtu ;
2021-06-24 00:44:38 +03:00
else if ( ! paged & &
( fraglen + alloc_extra < SKB_MAX_ALLOC | |
! ( rt - > dst . dev - > features & NETIF_F_SG ) ) )
2010-09-21 00:16:27 +04:00
alloclen = fraglen ;
2022-08-25 15:06:31 +03:00
else {
2022-07-12 23:52:25 +03:00
alloclen = fragheaderlen + transhdrlen ;
pagedlen = datalen - transhdrlen ;
2018-04-26 20:42:19 +03:00
}
2005-04-17 02:20:36 +04:00
2021-06-24 00:44:38 +03:00
alloclen + = alloc_extra ;
2011-06-22 05:04:37 +04:00
2005-04-17 02:20:36 +04:00
if ( transhdrlen ) {
2021-06-24 00:44:38 +03:00
skb = sock_alloc_send_skb ( sk , alloclen ,
2005-04-17 02:20:36 +04:00
( flags & MSG_DONTWAIT ) , & err ) ;
} else {
skb = NULL ;
2018-03-31 23:16:25 +03:00
if ( refcount_read ( & sk - > sk_wmem_alloc ) + wmem_alloc_delta < =
2005-04-17 02:20:36 +04:00
2 * sk - > sk_sndbuf )
2021-06-24 00:44:38 +03:00
skb = alloc_skb ( alloclen ,
2018-03-31 23:16:25 +03:00
sk - > sk_allocation ) ;
2015-04-03 11:17:26 +03:00
if ( unlikely ( ! skb ) )
2005-04-17 02:20:36 +04:00
err = - ENOBUFS ;
}
2015-04-03 11:17:26 +03:00
if ( ! skb )
2005-04-17 02:20:36 +04:00
goto error ;
/*
* Fill in the control structures
*/
skb - > ip_summed = csummode ;
skb - > csum = 0 ;
skb_reserve ( skb , hh_len ) ;
2014-07-15 01:55:06 +04:00
2005-04-17 02:20:36 +04:00
/*
* Find where to start putting bytes .
*/
2018-04-26 20:42:19 +03:00
data = skb_put ( skb , fraglen + exthdrlen - pagedlen ) ;
2007-03-12 04:39:41 +03:00
skb_set_network_header ( skb , exthdrlen ) ;
2007-04-11 08:21:55 +04:00
skb - > transport_header = ( skb - > network_header +
fragheaderlen ) ;
2011-06-22 05:05:37 +04:00
data + = fragheaderlen + exthdrlen ;
2005-04-17 02:20:36 +04:00
if ( fraggap ) {
skb - > csum = skb_copy_and_csum_bits (
skb_prev , maxfraglen ,
2020-07-11 03:07:10 +03:00
data + transhdrlen , fraggap ) ;
2005-04-17 02:20:36 +04:00
skb_prev - > csum = csum_sub ( skb_prev - > csum ,
skb - > csum ) ;
data + = fraggap ;
2006-08-14 07:12:58 +04:00
pskb_trim_unique ( skb_prev , maxfraglen ) ;
2005-04-17 02:20:36 +04:00
}
2018-04-26 20:42:19 +03:00
copy = datalen - transhdrlen - fraggap - pagedlen ;
udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
__ip_append_data() can get into an infinite loop when asked to splice into
a partially-built UDP message that has more than the frag-limit data and up
to the MTU limit. Something like:
pipe(pfd);
sfd = socket(AF_INET, SOCK_DGRAM, 0);
connect(sfd, ...);
send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE);
write(pfd[1], buffer, 8);
splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);
where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).
The problem is that the calculation of the amount to copy in
__ip_append_data() goes negative in two places, and, in the second place,
this gets subtracted from the length remaining, thereby increasing it.
This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), because the terms in:
copy = datalen - transhdrlen - fraggap - pagedlen;
then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
This causes:
length -= copy + transhdrlen;
to increase the length to more than the amount of data in msg->msg_iter,
which causes skb_splice_from_iter() to be unable to fill the request and it
returns less than 'copied' - which means that length never gets to 0 and we
never exit the loop.
Fix this by:
(1) Insert a note about the dodgy calculation of 'copy'.
(2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
equation, so that 'offset' isn't regressed and 'length' isn't
increased, which will mean that length and thus copy should match the
amount left in the iterator.
(3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
we're asked to splice more than is in the iterator. It might be
better to not give the warning or even just give a 'short' write.
[!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY
avoids the problem by simply assuming that everything asked for got copied,
not just the amount that was in the iterator. This is a potential bug for
the future.
Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
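The arithmetic described in this commit message can be reproduced in isolation. The sketch below is a simplified model of one pass through the alloc_new_skb branch when all payload is paged; the struct and function names (skb_step, model_new_skb_step) are hypothetical, and the model assumes the remaining data fits in a single fragment. The sample numbers follow from the reproducer as I read it: with an MTU of 8192 and a 20-byte IP header, maxfraglen is 8188, while 8161 bytes plus an 8-byte UDP header produce an 8189-byte skb, so fraggap is 1.

```c
#include <assert.h>

/* Result of one modelled pass through the new-skb branch. */
struct skb_step {
	int copy;          /* bytes that would be copied into the linear area */
	int length_after;  /* remaining length once the pass completes */
};

/*
 * Simplified model, not the kernel code: pagedlen is always
 * datalen - transhdrlen, as it is on the paged (splice/zerocopy) path.
 * `clamp` selects the behaviour after the fix, where a negative copy
 * is reset to zero.
 */
struct skb_step model_new_skb_step(int length, int fraggap,
				   int transhdrlen, int clamp)
{
	struct skb_step r;
	int datalen = length + fraggap;
	int pagedlen = datalen - transhdrlen;
	/* The terms cancel, leaving copy == -fraggap. */
	int copy = datalen - transhdrlen - fraggap - pagedlen;

	if (clamp && copy < 0)
		copy = 0;

	r.copy = copy;
	r.length_after = length - (copy + transhdrlen);
	return r;
}
```

With length = 8, fraggap = 1 and transhdrlen = 0, the unclamped model leaves 9 bytes still to send, one more than the iterator holds, which is the condition that kept the loop from terminating. With the clamp, the remaining length stays at 8.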
2023-08-01 18:48:53 +03:00
/* [!] NOTE: copy will be negative if pagedlen>0
* because then the equation reduces to - fraggap .
*/
2005-04-17 02:20:36 +04:00
if ( copy > 0 & & getfrag ( from , data + transhdrlen , offset , copy , fraggap , skb ) < 0 ) {
err = - EFAULT ;
kfree_skb ( skb ) ;
goto error ;
2023-08-01 18:48:53 +03:00
} else if ( flags & MSG_SPLICE_PAGES ) {
copy = 0 ;
2005-04-17 02:20:36 +04:00
}
offset + = copy ;
2018-04-26 20:42:19 +03:00
length - = copy + transhdrlen ;
2005-04-17 02:20:36 +04:00
transhdrlen = 0 ;
exthdrlen = 0 ;
csummode = CHECKSUM_NONE ;
2018-11-30 23:32:40 +03:00
/* only the initial fragment is time stamped */
skb_shinfo ( skb ) - > tx_flags = cork - > tx_flags ;
cork - > tx_flags = 0 ;
skb_shinfo ( skb ) - > tskey = tskey ;
tskey = 0 ;
skb_zcopy_set ( skb , uarg , & extra_uref ) ;
2017-02-07 00:14:16 +03:00
if ( ( flags & MSG_CONFIRM ) & & ! skb_prev )
skb_set_dst_pending_confirm ( skb , 1 ) ;
2005-04-17 02:20:36 +04:00
/*
* Put the packet on the pending queue .
*/
2018-03-31 23:16:25 +03:00
if ( ! skb - > destructor ) {
skb - > destructor = sock_wfree ;
skb - > sk = sk ;
wmem_alloc_delta + = skb - > truesize ;
}
2011-03-01 05:36:47 +03:00
__skb_queue_tail ( queue , skb ) ;
2005-04-17 02:20:36 +04:00
continue ;
}
if ( copy > length )
copy = length ;
2018-05-17 20:13:29 +03:00
if ( ! ( rt - > dst . dev - > features & NETIF_F_SG ) & &
skb_tailroom ( skb ) > = copy ) {
2005-04-17 02:20:36 +04:00
unsigned int off ;
off = skb - > len ;
2007-02-09 17:24:47 +03:00
if ( getfrag ( from , skb_put ( skb , copy ) ,
2005-04-17 02:20:36 +04:00
offset , copy , off , skb ) < 0 ) {
__skb_trim ( skb , off ) ;
err = - EFAULT ;
goto error ;
}
2023-05-22 15:11:20 +03:00
} else if ( flags & MSG_SPLICE_PAGES ) {
struct msghdr * msg = from ;
2023-08-01 18:48:53 +03:00
err = - EIO ;
if ( WARN_ON_ONCE ( copy > msg - > msg_iter . count ) )
goto error ;
2023-05-22 15:11:20 +03:00
err = skb_splice_from_iter ( skb , & msg - > msg_iter , copy ,
sk - > sk_allocation ) ;
if ( err < 0 )
goto error ;
copy = err ;
wmem_alloc_delta + = copy ;
2022-07-12 23:52:33 +03:00
} else if ( ! zc ) {
2005-04-17 02:20:36 +04:00
int i = skb_shinfo ( skb ) - > nr_frags ;
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, thats order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
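The "16 pages per skb" and "8x" figures in this commit message follow from a simple ceiling division. The sketch below is only a check of that arithmetic; frags_needed is a name made up for this example and has no counterpart in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of page fragments needed to hold `len`
 * bytes when each fragment can carry at most `frag_size` bytes. */
size_t frags_needed(size_t len, size_t frag_size)
{
	assert(frag_size > 0);
	return (len + frag_size - 1) / frag_size;
}
```

A 64KB TSO packet needs frags_needed(65536, 4096) = 16 fragments with order-0 pages and frags_needed(65536, 32768) = 2 fragments with 32768-byte frags, which is where the eightfold reduction comes from.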
2012-09-24 03:04:42 +04:00
err = - ENOMEM ;
if ( ! sk_page_frag_refill ( sk , pfrag ) )
2005-04-17 02:20:36 +04:00
goto error ;
2012-09-24 03:04:42 +04:00
2022-07-12 23:52:33 +03:00
skb_zcopy_downgrade_managed ( skb ) ;
2012-09-24 03:04:42 +04:00
if ( ! skb_can_coalesce ( skb , i , pfrag - > page ,
pfrag - > offset ) ) {
err = - EMSGSIZE ;
if ( i = = MAX_SKB_FRAGS )
goto error ;
__skb_fill_page_desc ( skb , i , pfrag - > page ,
pfrag - > offset , 0 ) ;
skb_shinfo ( skb ) - > nr_frags = + + i ;
get_page ( pfrag - > page ) ;
2005-04-17 02:20:36 +04:00
}
2012-09-24 03:04:42 +04:00
copy = min_t ( int , copy , pfrag - > size - pfrag - > offset ) ;
if ( getfrag ( from ,
page_address ( pfrag - > page ) + pfrag - > offset ,
offset , copy , skb - > len , skb ) < 0 )
goto error_efault ;
pfrag - > offset + = copy ;
skb_frag_size_add ( & skb_shinfo ( skb ) - > frags [ i - 1 ] , copy ) ;
2022-06-22 19:09:03 +03:00
skb_len_add ( skb , copy ) ;
2018-03-31 23:16:25 +03:00
wmem_alloc_delta + = copy ;
2018-11-30 23:32:39 +03:00
} else {
err = skb_zerocopy_iter_dgram ( skb , from , copy ) ;
if ( err < 0 )
goto error ;
2005-04-17 02:20:36 +04:00
}
offset + = copy ;
length - = copy ;
}
2018-04-04 15:30:01 +03:00
if ( wmem_alloc_delta )
refcount_add ( wmem_alloc_delta , & sk - > sk_wmem_alloc ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
2012-09-24 03:04:42 +04:00
error_efault :
err = - EFAULT ;
2005-04-17 02:20:36 +04:00
error :
2021-01-07 01:18:41 +03:00
net_zcopy_put_abort ( uarg , extra_uref ) ;
2011-03-01 05:36:47 +03:00
cork - > length - = length ;
2008-07-17 07:19:49 +04:00
IP_INC_STATS ( sock_net ( sk ) , IPSTATS_MIB_OUTDISCARDS ) ;
2018-03-31 23:16:25 +03:00
refcount_add ( wmem_alloc_delta , & sk - > sk_wmem_alloc ) ;
2024-02-13 14:04:28 +03:00
if ( hold_tskey )
atomic_dec ( & sk - > sk_tskey ) ;
2007-02-09 17:24:47 +03:00
return err ;
2005-04-17 02:20:36 +04:00
}
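The fragment sizing done near the top of __ip_append_data() (fragheaderlen and maxfraglen) can be checked with a few lines of standalone C. This is a sketch under stated assumptions: the helper names and the IPV4_BASE_HDR_LEN constant are invented here, and 20 bytes stands in for sizeof(struct iphdr). The payload of every fragment except the last must be a multiple of 8 bytes because the IPv4 fragment offset field counts in 8-byte units, which is why the low three bits are masked off.

```c
#include <assert.h>

#define IPV4_BASE_HDR_LEN 20u	/* assumed sizeof(struct iphdr), no options */

/* IP header length for a fragment, including optlen bytes of options. */
unsigned int frag_header_len(unsigned int optlen)
{
	assert(optlen <= 40);	/* IPv4 allows at most 40 bytes of options */
	return IPV4_BASE_HDR_LEN + optlen;
}

/* Largest non-final fragment: header plus the biggest payload that
 * fits in the MTU and is a multiple of 8 bytes. */
unsigned int max_frag_len(unsigned int mtu, unsigned int fragheaderlen)
{
	assert(mtu >= fragheaderlen);
	return ((mtu - fragheaderlen) & ~7u) + fragheaderlen;
}
```

With an MTU of 1500 and no options the result is 1500, since 1480 is already a multiple of 8. With 4 bytes of options it drops to 1496, and with an MTU of 8192 and no options it is 8188.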
2011-03-01 05:36:47 +03:00
static int ip_setup_cork ( struct sock * sk , struct inet_cork * cork ,
struct ipcm_cookie * ipc , struct rtable * * rtp )
{
2011-04-21 13:45:37 +04:00
struct ip_options_rcu * opt ;
2011-03-01 05:36:47 +03:00
struct rtable * rt ;
2018-04-16 05:16:45 +03:00
rt = * rtp ;
if ( unlikely ( ! rt ) )
return - EFAULT ;
2024-01-29 12:10:17 +03:00
cork - > fragsize = ip_sk_use_pmtu ( sk ) ?
dst_mtu ( & rt - > dst ) : READ_ONCE ( rt - > dst . dev - > mtu ) ;
if ( ! inetdev_valid_mtu ( cork - > fragsize ) )
return - ENETUNREACH ;
2011-03-01 05:36:47 +03:00
/*
* setup for corking .
*/
opt = ipc - > opt ;
if ( opt ) {
2015-04-03 11:17:26 +03:00
if ( ! cork - > opt ) {
2011-03-01 05:36:47 +03:00
cork - > opt = kmalloc ( sizeof ( struct ip_options ) + 40 ,
sk - > sk_allocation ) ;
2015-04-03 11:17:26 +03:00
if ( unlikely ( ! cork - > opt ) )
2011-03-01 05:36:47 +03:00
return - ENOBUFS ;
}
2011-04-21 13:45:37 +04:00
memcpy ( cork - > opt , & opt - > opt , sizeof ( struct ip_options ) + opt - > opt . optlen ) ;
2011-03-01 05:36:47 +03:00
cork - > flags | = IPCORK_OPT ;
cork - > addr = ipc - > addr ;
}
2018-04-16 05:16:45 +03:00
2018-07-06 17:12:59 +03:00
cork - > gso_size = ipc - > gso_size ;
2019-12-06 07:43:46 +03:00
2011-03-01 05:36:47 +03:00
cork - > dst = & rt - > dst ;
2019-12-06 07:43:46 +03:00
/* We stole this route, caller should not release it. */
* rtp = NULL ;
2011-03-01 05:36:47 +03:00
cork - > length = 0 ;
2013-09-24 17:43:09 +04:00
cork - > ttl = ipc - > ttl ;
cork - > tos = ipc - > tos ;
2019-09-11 22:50:51 +03:00
cork - > mark = ipc - > sockc . mark ;
2013-09-24 17:43:09 +04:00
cork - > priority = ipc - > priority ;
2018-07-04 01:42:49 +03:00
cork - > transmit_time = ipc - > sockc . transmit_time ;
2018-07-06 17:12:58 +03:00
cork - > tx_flags = 0 ;
sock_tx_timestamp ( sk , ipc - > sockc . tsflags , & cork - > tx_flags ) ;
2011-03-01 05:36:47 +03:00
return 0 ;
}
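/*
 * A minimal userspace sketch of the option-copy step in ip_setup_cork()
 * above. The destination is allocated once at worst-case size: the fixed
 * part plus 40 bytes, the most option data an IPv4 header can carry
 * (a 60-byte maximum header minus the 20-byte base header). It is then
 * reused for later copies. struct opts_sketch, cork_copy_opts() and
 * MAX_IPOPTLEN_SKETCH are hypothetical names, and the explicit length
 * check is an addition of this sketch, not something the function above
 * performs.
 */

```c
#include <stdlib.h>
#include <string.h>

#define MAX_IPOPTLEN_SKETCH 40	/* 60-byte max IPv4 header - 20-byte base */

/* Hypothetical stand-in for struct ip_options: fixed part + option bytes. */
struct opts_sketch {
	unsigned char optlen;	/* number of valid bytes in data[] */
	unsigned char data[];	/* flexible array member */
};

/*
 * Copy src into *dstp, allocating the destination at worst-case size the
 * first time so that later calls can reuse it.
 * Returns 0 on success, -1 on allocation failure or over-long options.
 */
static int cork_copy_opts(struct opts_sketch **dstp,
			  const struct opts_sketch *src)
{
	if (src->optlen > MAX_IPOPTLEN_SKETCH)
		return -1;
	if (!*dstp) {
		*dstp = malloc(sizeof(**dstp) + MAX_IPOPTLEN_SKETCH);
		if (!*dstp)
			return -1;
	}
	/* Only the valid prefix is copied, never the full 40 bytes. */
	memcpy(*dstp, src, sizeof(*src) + src->optlen);
	return 0;
}
```

/*
 * Sizing the buffer for the worst case is what lets a second call skip
 * the allocation regardless of how long the new option block is.
 */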
/*
2023-05-22 15:11:23 +03:00
* ip_append_data ( ) can make one large IP datagram from many pieces of
* data . Each piece will be held on the socket until
* ip_push_pending_frames ( ) is called . Each piece can be a page or
* non - page data .
2011-03-01 05:36:47 +03:00
*
* Not only UDP , other transport protocols - e . g . raw sockets - can use
* this interface potentially .
*
* LATER : length must be adjusted by pad at tail , when it is required .
*/
2011-05-09 04:24:10 +04:00
int ip_append_data ( struct sock * sk , struct flowi4 * fl4 ,
2011-03-01 05:36:47 +03:00
int getfrag ( void * from , char * to , int offset , int len ,
int odd , struct sk_buff * skb ) ,
void * from , int length , int transhdrlen ,
struct ipcm_cookie * ipc , struct rtable * * rtp ,
unsigned int flags )
{
struct inet_sock * inet = inet_sk ( sk ) ;
int err ;
if ( flags & MSG_PROBE )
return 0 ;
if ( skb_queue_empty ( & sk - > sk_write_queue ) ) {
2011-05-07 02:02:07 +04:00
err = ip_setup_cork ( sk , & inet - > cork . base , ipc , rtp ) ;
2011-03-01 05:36:47 +03:00
if ( err )
return err ;
} else {
transhdrlen = 0 ;
}
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub fragments, since some arches have PAGE_SIZE=65536
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 03:04:42 +04:00
return __ip_append_data ( sk , fl4 , & sk - > sk_write_queue , & inet - > cork . base ,
sk_page_frag ( sk ) , getfrag ,
2011-03-01 05:36:47 +03:00
from , length , transhdrlen , flags ) ;
}
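/*
 * A small sketch of the first-piece rule in ip_append_data() above: the
 * per-datagram state is captured only when the write queue is empty, and
 * every later piece is appended with a transport header length of zero.
 * struct pending_sketch and append_piece() are hypothetical; the model
 * only counts pieces and does not queue any data.
 */

```c
#include <stddef.h>

/* Hypothetical model of a corked socket: only counts queued pieces. */
struct pending_sketch {
	size_t npieces;	/* pieces queued since the last push */
	int corked;	/* set once per-datagram state was captured */
};

/*
 * Returns the transport header length that applies to this piece:
 * the caller's value for the first piece, 0 for every later one.
 */
static int append_piece(struct pending_sketch *p, int transhdrlen)
{
	if (p->npieces == 0)
		p->corked = 1;		/* stands in for ip_setup_cork() */
	else
		transhdrlen = 0;	/* header already accounted for */
	p->npieces++;
	return transhdrlen;
}
```

/*
 * With an 8-byte transport header, the first append_piece() call returns
 * 8 and every following call on the same pending datagram returns 0.
 */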
static void ip_cork_release ( struct inet_cork * cork )
2007-11-06 08:03:24 +03:00
{
2011-03-01 05:36:47 +03:00
cork - > flags & = ~ IPCORK_OPT ;
kfree ( cork - > opt ) ;
cork - > opt = NULL ;
dst_release ( cork - > dst ) ;
cork - > dst = NULL ;
2007-11-06 08:03:24 +03:00
}
2005-04-17 02:20:36 +04:00
/*
* Combine all pending IP fragments on the socket into one IP datagram
* and push them out .
*/
2011-03-01 05:36:47 +03:00
struct sk_buff * __ip_make_skb ( struct sock * sk ,
2011-05-09 04:12:19 +04:00
struct flowi4 * fl4 ,
2011-03-01 05:36:47 +03:00
struct sk_buff_head * queue ,
struct inet_cork * cork )
2005-04-17 02:20:36 +04:00
{
struct sk_buff * skb , * tmp_skb ;
struct sk_buff * * tail_skb ;
struct inet_sock * inet = inet_sk ( sk ) ;
2008-07-15 10:00:43 +04:00
struct net * net = sock_net ( sk ) ;
2005-04-17 02:20:36 +04:00
struct ip_options * opt = NULL ;
2011-03-01 05:36:47 +03:00
struct rtable * rt = ( struct rtable * ) cork - > dst ;
2005-04-17 02:20:36 +04:00
struct iphdr * iph ;
2023-09-22 06:42:15 +03:00
u8 pmtudisc , ttl ;
2006-01-07 00:24:29 +03:00
__be16 df = 0 ;
2005-04-17 02:20:36 +04:00
2015-04-03 11:17:26 +03:00
skb = __skb_dequeue ( queue ) ;
if ( ! skb )
2005-04-17 02:20:36 +04:00
goto out ;
tail_skb = & ( skb_shinfo ( skb ) - > frag_list ) ;
/* move skb->data to ip header from ext header */
2007-04-11 07:50:43 +04:00
if ( skb - > data < skb_network_header ( skb ) )
2007-03-11 04:16:10 +03:00
__skb_pull ( skb , skb_network_offset ( skb ) ) ;
2011-03-01 05:36:47 +03:00
while ( ( tmp_skb = __skb_dequeue ( queue ) ) ! = NULL ) {
2007-03-16 23:26:39 +03:00
__skb_pull ( tmp_skb , skb_network_header_len ( skb ) ) ;
2005-04-17 02:20:36 +04:00
* tail_skb = tmp_skb ;
tail_skb = & ( tmp_skb - > next ) ;
skb - > len + = tmp_skb - > len ;
skb - > data_len + = tmp_skb - > len ;
skb - > truesize + = tmp_skb - > truesize ;
tmp_skb - > destructor = NULL ;
tmp_skb - > sk = NULL ;
}
/* Unless user demanded real pmtu discovery (IP_PMTUDISC_DO), we allow
* the frame generated here to be fragmented . No matter how transforms
* change the size of the packet , it will come out .
*/
2014-05-05 03:39:18 +04:00
skb - > ignore_df = ip_sk_ignore_df ( sk ) ;
2005-04-17 02:20:36 +04:00
/* DF bit is set when we want to see DF on outgoing frames.
2014-05-05 03:39:18 +04:00
* If ignore_df is set too , we still allow this frame to be fragmented
2005-04-17 02:20:36 +04:00
* locally . */
2023-09-22 06:42:15 +03:00
pmtudisc = READ_ONCE ( inet - > pmtudisc ) ;
if ( pmtudisc = = IP_PMTUDISC_DO | |
pmtudisc = = IP_PMTUDISC_PROBE | |
2010-06-11 10:31:35 +04:00
( skb - > len < = dst_mtu ( & rt - > dst ) & &
ip_dont_fragment ( sk , & rt - > dst ) ) )
2005-04-17 02:20:36 +04:00
df = htons ( IP_DF ) ;
2011-03-01 05:36:47 +03:00
if ( cork - > flags & IPCORK_OPT )
opt = cork - > opt ;
2005-04-17 02:20:36 +04:00
2013-09-24 17:43:09 +04:00
if ( cork - > ttl ! = 0 )
ttl = cork - > ttl ;
else if ( rt - > rt_type = = RTN_MULTICAST )
2023-09-22 06:42:14 +03:00
ttl = READ_ONCE ( inet - > mc_ttl ) ;
2005-04-17 02:20:36 +04:00
else
2010-06-11 10:31:35 +04:00
ttl = ip_select_ttl ( inet , & rt - > dst ) ;
2005-04-17 02:20:36 +04:00
2013-09-19 02:29:52 +04:00
iph = ip_hdr ( skb ) ;
2005-04-17 02:20:36 +04:00
iph - > version = 4 ;
iph - > ihl = 5 ;
2023-09-22 06:42:16 +03:00
iph - > tos = ( cork - > tos ! = - 1 ) ? cork - > tos : READ_ONCE ( inet - > tos ) ;
2005-04-17 02:20:36 +04:00
iph - > frag_off = df ;
iph - > ttl = ttl ;
iph - > protocol = sk - > sk_protocol ;
2011-11-30 23:00:53 +04:00
ip_copy_addrs ( iph , fl4 ) ;
2015-03-25 19:07:44 +03:00
ip_select_ident ( net , skb , sk ) ;
2005-04-17 02:20:36 +04:00
2011-05-14 01:21:27 +04:00
if ( opt ) {
2020-08-29 12:21:30 +03:00
iph - > ihl + = opt - > optlen > > 2 ;
2022-01-28 19:06:54 +03:00
ip_options_build ( skb , opt , cork - > addr , rt ) ;
2011-05-14 01:21:27 +04:00
}
2023-09-21 23:28:11 +03:00
skb - > priority = ( cork - > tos ! = - 1 ) ? cork - > priority : READ_ONCE ( sk - > sk_priority ) ;
2019-09-11 22:50:51 +03:00
skb - > mark = cork - > mark ;
2018-07-04 01:42:49 +03:00
skb - > tstamp = cork - > transmit_time ;
2008-11-25 03:07:50 +03:00
/*
* Steal rt from cork . dst to avoid a pair of atomic_inc / atomic_dec
* on dst refcount
*/
2011-03-01 05:36:47 +03:00
cork - > dst = NULL ;
2010-06-11 10:31:35 +04:00
skb_dst_set ( skb , & rt - > dst ) ;
2005-04-17 02:20:36 +04:00
2023-04-20 15:40:35 +03:00
if ( iph - > protocol = = IPPROTO_ICMP ) {
u8 icmp_type ;
/* For such sockets, transhdrlen is zero when ip_append_data() is called,
* so the icmphdr is not in the skb linear region and icmp_type cannot
* be read via icmp_hdr ( skb ) - > type .
*/
2023-08-16 11:15:38 +03:00
if ( sk - > sk_type = = SOCK_RAW & &
! inet_test_bit ( HDRINCL , sk ) )
2023-04-20 15:40:35 +03:00
icmp_type = fl4 - > fl4_icmp_type ;
else
icmp_type = icmp_hdr ( skb ) - > type ;
icmp_out_count ( net , icmp_type ) ;
}
2007-09-17 20:57:33 +04:00
2011-03-01 05:36:47 +03:00
ip_cork_release ( cork ) ;
out :
return skb ;
}
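/*
 * A userspace sketch of the chaining loop in __ip_make_skb() above. The
 * first buffer becomes the head; each later buffer is hooked on through
 * a pointer-to-pointer tail, so the first link (head->frag_list) and all
 * later links (->next) are written by the same statement. struct
 * buf_sketch and merge_bufs() are hypothetical and keep only the length
 * bookkeeping; header pulling, truesize and ownership are left out.
 */

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct buf_sketch {
	struct buf_sketch *next;	/* link inside the fragment chain */
	struct buf_sketch *frag_list;	/* head only: first chained buffer */
	unsigned int len;		/* total bytes, chained ones included */
	unsigned int data_len;		/* bytes held outside the head */
};

/* Chain bufs[1..n-1] behind bufs[0]; returns the head, NULL if n == 0. */
static struct buf_sketch *merge_bufs(struct buf_sketch **bufs, size_t n)
{
	struct buf_sketch *head, **tail;
	size_t i;

	if (n == 0)
		return NULL;

	head = bufs[0];
	tail = &head->frag_list;
	for (i = 1; i < n; i++) {
		*tail = bufs[i];		/* fill the current empty slot */
		tail = &bufs[i]->next;		/* next slot to fill */
		head->len += bufs[i]->len;
		head->data_len += bufs[i]->len;
	}
	return head;
}
```

/*
 * Merging buffers of 100, 50 and 25 bytes leaves the head with len 175
 * and data_len 75; the head's own 100 bytes are not counted in data_len.
 */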
2012-08-10 06:22:47 +04:00
int ip_send_skb ( struct net * net , struct sk_buff * skb )
2011-03-01 05:36:47 +03:00
{
int err ;
2015-10-08 00:48:46 +03:00
err = ip_local_out ( net , skb - > sk , skb ) ;
2005-04-17 02:20:36 +04:00
if ( err ) {
if ( err > 0 )
ip: Report qdisc packet drops
Christoph Lameter pointed out that packet drops at qdisc level were not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.
IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.
This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.
Example after a UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048
send() syscalls do, however, still return an OK status, to not
break applications.
Note : send() manual page explicitly says for -ENOBUFS error :
"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "
This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.
Many thanks to Christoph, David, and last but not least, Alexey !
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-03 05:05:33 +04:00
err = net_xmit_errno ( err ) ;
2005-04-17 02:20:36 +04:00
if ( err )
2011-03-01 05:36:47 +03:00
IP_INC_STATS ( net , IPSTATS_MIB_OUTDISCARDS ) ;
2005-04-17 02:20:36 +04:00
}
return err ;
}
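/*
 * A sketch of the result handling in ip_send_skb() above. Negative
 * values are already errno codes; positive values are queueing verdicts
 * that are first mapped to an errno, and only a non-zero result after
 * that mapping is counted as a discard. The numeric values are assumed
 * to mirror the kernel's NET_XMIT_* codes and net_xmit_errno(); the
 * *_SKETCH names and normalise_send_result() are hypothetical.
 */

```c
#include <errno.h>

/* Assumed to mirror NET_XMIT_SUCCESS / NET_XMIT_DROP / NET_XMIT_CN. */
enum {
	XMIT_SUCCESS_SKETCH = 0,
	XMIT_DROP_SKETCH = 1,
	XMIT_CN_SKETCH = 2,
};

/* Congestion notification is not an error; any other verdict is. */
static int xmit_errno_sketch(int verdict)
{
	return verdict != XMIT_CN_SKETCH ? -ENOBUFS : 0;
}

/* Map a send result to 0 or a negative errno, counting real failures. */
static int normalise_send_result(int err, unsigned long *discards)
{
	if (err) {
		if (err > 0)
			err = xmit_errno_sketch(err);
		if (err)
			(*discards)++;
	}
	return err;
}
```

/*
 * Under these assumptions a congestion notification returns 0 and leaves
 * the counter alone, while a drop verdict returns -ENOBUFS and bumps it.
 */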
2011-05-09 04:12:19 +04:00
int ip_push_pending_frames ( struct sock * sk , struct flowi4 * fl4 )
2011-03-01 05:36:47 +03:00
{
2011-03-01 05:36:47 +03:00
struct sk_buff * skb ;
2011-05-09 04:12:19 +04:00
skb = ip_finish_skb ( sk , fl4 ) ;
2011-03-01 05:36:47 +03:00
if ( ! skb )
return 0 ;
/* Netfilter gets the whole, unfragmented skb. */
2012-08-10 06:22:47 +04:00
return ip_send_skb ( sock_net ( sk ) , skb ) ;
2011-03-01 05:36:47 +03:00
}
2005-04-17 02:20:36 +04:00
/*
* Throw away all pending data on the socket .
*/
2011-03-01 05:36:47 +03:00
static void __ip_flush_pending_frames ( struct sock * sk ,
struct sk_buff_head * queue ,
struct inet_cork * cork )
2005-04-17 02:20:36 +04:00
{
struct sk_buff * skb ;
2011-03-01 05:36:47 +03:00
while ( ( skb = __skb_dequeue_tail ( queue ) ) ! = NULL )
2005-04-17 02:20:36 +04:00
kfree_skb ( skb ) ;
2011-03-01 05:36:47 +03:00
ip_cork_release ( cork ) ;
}
void ip_flush_pending_frames ( struct sock * sk )
{
2011-05-07 02:02:07 +04:00
__ip_flush_pending_frames ( sk , & sk - > sk_write_queue , & inet_sk ( sk ) - > cork . base ) ;
2005-04-17 02:20:36 +04:00
}
2011-03-01 05:36:47 +03:00
struct sk_buff * ip_make_skb ( struct sock * sk ,
2011-05-09 04:12:19 +04:00
struct flowi4 * fl4 ,
2011-03-01 05:36:47 +03:00
int getfrag ( void * from , char * to , int offset ,
int len , int odd , struct sk_buff * skb ) ,
void * from , int length , int transhdrlen ,
struct ipcm_cookie * ipc , struct rtable * * rtp ,
2018-04-26 20:42:15 +03:00
struct inet_cork * cork , unsigned int flags )
2011-03-01 05:36:47 +03:00
{
struct sk_buff_head queue ;
int err ;
if ( flags & MSG_PROBE )
return NULL ;
__skb_queue_head_init ( & queue ) ;
2018-04-26 20:42:15 +03:00
cork - > flags = 0 ;
cork - > addr = 0 ;
cork - > opt = NULL ;
err = ip_setup_cork ( sk , cork , ipc , rtp ) ;
2011-03-01 05:36:47 +03:00
if ( err )
return ERR_PTR ( err ) ;
2018-04-26 20:42:15 +03:00
err = __ip_append_data ( sk , fl4 , & queue , cork ,
2012-09-24 03:04:42 +04:00
& current - > task_frag , getfrag ,
2011-03-01 05:36:47 +03:00
from , length , transhdrlen , flags ) ;
if ( err ) {
2018-04-26 20:42:15 +03:00
__ip_flush_pending_frames ( sk , & queue , cork ) ;
2011-03-01 05:36:47 +03:00
return ERR_PTR ( err ) ;
}
2018-04-26 20:42:15 +03:00
return __ip_make_skb ( sk , fl4 , & queue , cork ) ;
2011-03-01 05:36:47 +03:00
}
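/*
 * ip_make_skb() above has three kinds of return value: NULL when there
 * is nothing to send (MSG_PROBE), an error encoded in the pointer, or a
 * real packet. The sketch below re-creates the pointer-encoding idea in
 * userspace so a caller's three-way check can be seen in isolation. The
 * *_sketch helpers are hypothetical re-implementations, not the kernel
 * macros, and the 4095 limit is an assumption of this sketch.
 */

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095	/* assumed size of the reserved error range */

/* Encode a negative errno in a pointer value. */
static void *err_ptr_sketch(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno from an encoded pointer. */
static long ptr_err_sketch(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the topmost MAX_ERRNO_SKETCH addresses. */
static int is_err_sketch(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

enum outcome_sketch { OUT_NOTHING, OUT_ERROR, OUT_PACKET };

/* The three-way check a caller needs: NULL, encoded error, or packet. */
static enum outcome_sketch classify_result(const void *p)
{
	if (!p)
		return OUT_NOTHING;
	if (is_err_sketch(p))
		return OUT_ERROR;
	return OUT_PACKET;
}
```

/*
 * The NULL test has to come first: NULL is not inside the error range,
 * so an error-only check would treat "nothing to send" as a packet.
 */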
2005-04-17 02:20:36 +04:00
/*
* Fetch data from kernel space and fill in checksum if needed .
*/
2007-02-09 17:24:47 +03:00
static int ip_reply_glue_bits ( void * dptr , char * to , int offset ,
2005-04-17 02:20:36 +04:00
int len , int odd , struct sk_buff * skb )
{
2006-11-15 08:36:34 +03:00
__wsum csum ;
2005-04-17 02:20:36 +04:00
2020-07-11 07:12:07 +03:00
csum = csum_partial_copy_nocheck ( dptr + offset , to , len ) ;
2005-04-17 02:20:36 +04:00
skb - > csum = csum_block_add ( skb - > csum , csum , odd ) ;
2007-02-09 17:24:47 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
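/*
 * ip_reply_glue_bits() above folds the checksum of each copied block
 * into a running sum and passes along whether the block starts at an
 * odd byte offset. The sketch below shows why that flag matters: a block
 * that begins on an odd byte has its bytes in the opposite halves of the
 * 16-bit words, so its partial sum is byte-swapped before it is added.
 * fold16(), block_sum() and block_add() are hypothetical userspace
 * helpers working on big-endian words; they are not the kernel routines
 * and are only meant for short buffers (the 32-bit accumulator is not
 * folded inside the loop).
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum a block as big-endian 16-bit words, zero-padding an odd tail. */
static uint16_t block_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	return fold16(sum);
}

/* Add a block's sum to a running sum; swap bytes for odd offsets. */
static uint16_t block_add(uint16_t sum, uint16_t part, int odd)
{
	if (odd)
		part = (uint16_t)((part << 8) | (part >> 8));
	return fold16((uint32_t)sum + part);
}
```

/*
 * Splitting a buffer at any offset and recombining the two partial sums
 * with block_add(), passing odd = offset & 1, gives the same value as
 * block_sum() over the whole buffer.
 */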
2007-02-09 17:24:47 +03:00
/*
2005-04-17 02:20:36 +04:00
* Generic function to send a packet as reply to another packet .
2012-07-19 11:34:03 +04:00
* Used to send some TCP resets / acks so far .
2005-04-17 02:20:36 +04:00
*/
2015-01-30 08:35:05 +03:00
void ip_send_unicast_reply ( struct sock * sk , struct sk_buff * skb ,
2014-09-27 20:50:55 +04:00
const struct ip_options * sopt ,
__be32 daddr , __be32 saddr ,
const struct ip_reply_arg * arg ,
2023-05-23 19:14:52 +03:00
unsigned int len , u64 transmit_time , u32 txhash )
2005-04-17 02:20:36 +04:00
{
2011-04-21 13:45:37 +04:00
struct ip_options_data replyopts ;
2005-04-17 02:20:36 +04:00
struct ipcm_cookie ipc ;
2011-05-09 04:12:19 +04:00
struct flowi4 fl4 ;
2009-06-02 09:14:27 +04:00
struct rtable * rt = skb_rtable ( skb ) ;
2015-01-30 08:35:05 +03:00
struct net * net = sock_net ( sk ) ;
2012-07-19 11:34:03 +04:00
struct sk_buff * nskb ;
2014-10-15 16:24:02 +04:00
int err ;
2015-08-13 23:59:08 +03:00
int oif ;
2005-04-17 02:20:36 +04:00
2017-08-03 19:07:06 +03:00
if ( __ip_options_echo ( net , & replyopts . opt . opt , skb , sopt ) )
2005-04-17 02:20:36 +04:00
return ;
2018-07-06 17:12:54 +03:00
ipcm_init ( & ipc ) ;
2011-05-10 00:22:43 +04:00
ipc . addr = daddr ;
2019-06-14 07:22:35 +03:00
ipc . sockc . transmit_time = transmit_time ;
2005-04-17 02:20:36 +04:00
2011-04-21 13:45:37 +04:00
if ( replyopts . opt . opt . optlen ) {
2005-04-17 02:20:36 +04:00
ipc . opt = & replyopts . opt ;
2011-04-21 13:45:37 +04:00
if ( replyopts . opt . opt . srr )
daddr = replyopts . opt . opt . faddr ;
2005-04-17 02:20:36 +04:00
}
2015-08-13 23:59:08 +03:00
oif = arg - > bound_dev_if ;
2016-11-09 20:07:26 +03:00
if ( ! oif & & netif_index_is_l3_master ( net , skb - > skb_iif ) )
oif = skb - > skb_iif ;
2015-08-13 23:59:08 +03:00
flowi4_init_output ( & fl4 , oif ,
2018-05-10 09:53:51 +03:00
IP4_REPLY_MARK ( net , skb - > mark ) ? : sk - > sk_mark ,
2011-10-24 11:06:21 +04:00
RT_TOS ( arg - > tos ) ,
2012-07-19 11:34:03 +04:00
RT_SCOPE_UNIVERSE , ip_hdr ( skb ) - > protocol ,
2011-05-09 04:12:19 +04:00
ip_reply_arg_flowi_flags ( arg ) ,
2012-06-28 14:21:41 +04:00
daddr , saddr ,
2016-11-03 20:23:43 +03:00
tcp_hdr ( skb ) - > source , tcp_hdr ( skb ) - > dest ,
arg - > uid ) ;
2020-09-28 05:38:26 +03:00
security_skb_classify_flow ( skb , flowi4_to_flowi_common ( & fl4 ) ) ;
2022-07-07 13:01:39 +03:00
rt = ip_route_output_flow ( net , & fl4 , sk ) ;
2011-05-09 04:12:19 +04:00
if ( IS_ERR ( rt ) )
return ;
2005-04-17 02:20:36 +04:00
2020-09-09 00:09:34 +03:00
inet_sk ( sk ) - > tos = arg - > tos & ~ INET_ECN_MASK ;
2005-04-17 02:20:36 +04:00
2007-04-21 09:47:35 +04:00
sk - > sk_protocol = ip_hdr ( skb ) - > protocol ;
2007-06-05 08:32:46 +04:00
sk - > sk_bound_dev_if = arg - > bound_dev_if ;
2022-08-23 20:46:44 +03:00
sk - > sk_sndbuf = READ_ONCE ( sysctl_wmem_default ) ;
2020-07-01 23:00:06 +03:00
ipc . sockc . mark = fl4 . flowi4_mark ;
2014-10-15 16:24:02 +04:00
err = ip_append_data ( sk , & fl4 , ip_reply_glue_bits , arg - > iov - > iov_base ,
len , 0 , & ipc , & rt , MSG_DONTWAIT ) ;
if ( unlikely ( err ) ) {
ip_flush_pending_frames ( sk ) ;
goto out ;
}
2012-07-19 11:34:03 +04:00
nskb = skb_peek ( & sk - > sk_write_queue ) ;
if ( nskb ) {
2005-04-17 02:20:36 +04:00
if ( arg - > csumoffset > = 0 )
2012-07-19 11:34:03 +04:00
* ( ( __sum16 * ) skb_transport_header ( nskb ) +
arg - > csumoffset ) = csum_fold ( csum_add ( nskb - > csum ,
2007-04-26 05:04:18 +04:00
arg - > csum ) ) ;
2012-07-19 11:34:03 +04:00
nskb - > ip_summed = CHECKSUM_NONE ;
2022-03-02 22:55:50 +03:00
nskb - > mono_delivery_time = ! ! transmit_time ;
2023-05-23 19:14:52 +03:00
if ( txhash )
skb_set_hash ( nskb , txhash , PKT_HASH_TYPE_L4 ) ;
2011-05-09 04:12:19 +04:00
ip_push_pending_frames ( sk , & fl4 ) ;
2005-04-17 02:20:36 +04:00
}
2014-10-15 16:24:02 +04:00
out :
2005-04-17 02:20:36 +04:00
ip_rt_put ( rt ) ;
}
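/*
 * ip_send_unicast_reply() above stores arg->tos with the ECN bits masked
 * off, so the reply keeps the DSCP part of the TOS byte but does not
 * echo ECN codepoints. A tiny sketch of that masking follows;
 * ECN_MASK_SKETCH (the two low-order bits) and reply_tos() are
 * hypothetical names for illustration.
 */

```c
#include <stdint.h>

#define ECN_MASK_SKETCH 0x3	/* two low-order bits of the TOS byte */

/* Keep the upper six (DSCP) bits, clear the two ECN bits. */
static uint8_t reply_tos(uint8_t tos)
{
	return (uint8_t)(tos & ~ECN_MASK_SKETCH);
}
```

/*
 * For example, a TOS byte of 0xb9 (DSCP 46 with ECN bits 01) becomes
 * 0xb8 once the ECN bits are cleared.
 */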
void __init ip_init ( void )
{
ip_rt_init ( ) ;
inet_initpeers ( ) ;
2014-01-11 04:09:45 +04:00
# if defined(CONFIG_IP_MULTICAST)
igmp_mc_init ( ) ;
2005-04-17 02:20:36 +04:00
# endif
}