// SPDX-License-Identifier: GPL-2.0-or-later
/*
 *	PF_INET6 socket protocol family
 *	Linux INET6 implementation
 *
 *	Authors:
 *	Pedro Roque		<roque@di.fc.ul.pt>
 *
 *	Adapted from linux/net/ipv4/af_inet.c
 *
 *	Fixes:
 *	piggy, Karl Knutson	:	Socket protocol table
 *	Hideaki YOSHIFUJI	:	sin6_scope_id support
 *	Arnaldo Melo		:	check proc_net_create return, cleanups
 */

#define pr_fmt(fmt) "IPv6: " fmt

#include <linux/module.h>
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/socket.h>
#include <linux/in.h>
#include <linux/kernel.h>
#include <linux/timer.h>
#include <linux/string.h>
#include <linux/sockios.h>
#include <linux/net.h>
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <linux/slab.h>

#include <linux/inet.h>
#include <linux/netdevice.h>
#include <linux/icmpv6.h>
#include <linux/netfilter_ipv6.h>

#include <net/ip.h>
#include <net/ipv6.h>
#include <net/udp.h>
#include <net/udplite.h>
#include <net/tcp.h>
#include <net/ping.h>
#include <net/protocol.h>
#include <net/inet_common.h>
#include <net/route.h>
#include <net/transp_v6.h>
#include <net/ip6_route.h>
#include <net/addrconf.h>
#include <net/ipv6_stubs.h>
#include <net/ndisc.h>
#ifdef CONFIG_IPV6_TUNNEL
#include <net/ip6_tunnel.h>
#endif
#include <net/calipso.h>
#include <net/seg6.h>
#include <net/rpl.h>
#include <net/compat.h>
#include <net/xfrm.h>
#include <net/ioam6.h>
#include <net/rawv6.h>

#include <linux/uaccess.h>
#include <linux/mroute6.h>

#include "ip6_offload.h"

MODULE_AUTHOR("Cast of dozens");
MODULE_DESCRIPTION("IPv6 protocol stack for Linux");
MODULE_LICENSE("GPL");

/* The inetsw6 table contains everything that inet6_create needs to
 * build a new socket.
 */
static struct list_head inetsw6[SOCK_MAX];
static DEFINE_SPINLOCK(inetsw6_lock);

struct ipv6_params ipv6_defaults = {
	.disable_ipv6 = 0,
	.autoconf = 1,
};

static int disable_ipv6_mod;

module_param_named(disable, disable_ipv6_mod, int, 0444);
MODULE_PARM_DESC(disable, "Disable IPv6 module such that it is non-functional");

module_param_named(disable_ipv6, ipv6_defaults.disable_ipv6, int, 0444);
MODULE_PARM_DESC(disable_ipv6, "Disable IPv6 on all interfaces");

module_param_named(autoconf, ipv6_defaults.autoconf, int, 0444);
MODULE_PARM_DESC(autoconf, "Enable IPv6 address autoconfiguration on all interfaces");

bool ipv6_mod_enabled(void)
{
	return disable_ipv6_mod == 0;
}
EXPORT_SYMBOL_GPL(ipv6_mod_enabled);

static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
{
	const int offset = sk->sk_prot->obj_size - sizeof(struct ipv6_pinfo);

	return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
}
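
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the layout arithmetic used by inet6_sk_generic() above. It
 * assumes, as that helper does, that the IPv6-specific state is the last
 * member of the protocol's socket object, so its address can be derived
 * from the object size alone. The names demo_pinfo, demo_udp6_sock and
 * demo_tail are hypothetical, and the sketch only holds when the outer
 * struct has no padding after its last member.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_pinfo {			/* stands in for struct ipv6_pinfo */
	int hop_limit;
	int mcast_hops;
};

struct demo_udp6_sock {			/* stands in for a protocol socket object */
	long generic_state[8];
	struct demo_pinfo inet6;	/* must remain the last member */
};

/* Locate the trailing demo_pinfo from the base pointer and object size. */
static struct demo_pinfo *demo_tail(void *obj, size_t obj_size)
{
	const size_t offset = obj_size - sizeof(struct demo_pinfo);

	return (struct demo_pinfo *)((uint8_t *)obj + offset);
}
```
/*
 * Calling demo_tail(&s, sizeof(s)) on a struct demo_udp6_sock s yields
 * &s.inet6, which is the same relationship inet6_create() relies on when it
 * stores the result of inet6_sk_generic() in inet_sk(sk)->pinet6.
 */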

void inet6_sock_destruct(struct sock *sk)
{
	inet6_cleanup_sock(sk);
	inet_sock_destruct(sk);
}
EXPORT_SYMBOL_GPL(inet6_sock_destruct);

static int inet6_create(struct net *net, struct socket *sock, int protocol,
			int kern)
{
	struct inet_sock *inet;
	struct ipv6_pinfo *np;
	struct sock *sk;
	struct inet_protosw *answer;
	struct proto *answer_prot;
	unsigned char answer_flags;
	int try_loading_module = 0;
	int err;

	if (protocol < 0 || protocol >= IPPROTO_MAX)
		return -EINVAL;

	/* Look for the requested type/protocol pair. */
lookup_protocol:
	err = -ESOCKTNOSUPPORT;
	rcu_read_lock();
	list_for_each_entry_rcu(answer, &inetsw6[sock->type], list) {

		err = 0;
		/* Check the non-wild match. */
		if (protocol == answer->protocol) {
			if (protocol != IPPROTO_IP)
				break;
		} else {
			/* Check for the two wild cases. */
			if (IPPROTO_IP == protocol) {
				protocol = answer->protocol;
				break;
			}
			if (IPPROTO_IP == answer->protocol)
				break;
		}
		err = -EPROTONOSUPPORT;
	}

	if (err) {
		if (try_loading_module < 2) {
			rcu_read_unlock();
			/*
			 * Be more specific, e.g. net-pf-10-proto-132-type-1
			 * (net-pf-PF_INET6-proto-IPPROTO_SCTP-type-SOCK_STREAM)
			 */
			if (++try_loading_module == 1)
				request_module("net-pf-%d-proto-%d-type-%d",
						PF_INET6, protocol, sock->type);
			/*
			 * Fall back to generic, e.g. net-pf-10-proto-132
			 * (net-pf-PF_INET6-proto-IPPROTO_SCTP)
			 */
			else
				request_module("net-pf-%d-proto-%d",
						PF_INET6, protocol);
			goto lookup_protocol;
		} else
			goto out_rcu_unlock;
	}

	err = -EPERM;
	if (sock->type == SOCK_RAW && !kern &&
	    !ns_capable(net->user_ns, CAP_NET_RAW))
		goto out_rcu_unlock;

	sock->ops = answer->ops;
	answer_prot = answer->prot;
	answer_flags = answer->flags;
	rcu_read_unlock();

	WARN_ON(!answer_prot->slab);

	err = -ENOBUFS;
	sk = sk_alloc(net, PF_INET6, GFP_KERNEL, answer_prot, kern);
	if (!sk)
		goto out;

	sock_init_data(sock, sk);

	err = 0;
	if (INET_PROTOSW_REUSE & answer_flags)
		sk->sk_reuse = SK_CAN_REUSE;

	inet = inet_sk(sk);
	inet->is_icsk = (INET_PROTOSW_ICSK & answer_flags) != 0;

	if (SOCK_RAW == sock->type) {
		inet->inet_num = protocol;
		if (IPPROTO_RAW == protocol)
			inet->hdrincl = 1;
	}

	sk->sk_destruct		= inet6_sock_destruct;
	sk->sk_family		= PF_INET6;
	sk->sk_protocol		= protocol;

	sk->sk_backlog_rcv	= answer->prot->backlog_rcv;

	inet_sk(sk)->pinet6 = np = inet6_sk_generic(sk);
	np->hop_limit	= -1;
	np->mcast_hops	= IPV6_DEFAULT_MCASTHOPS;
	np->mc_loop	= 1;
	np->mc_all	= 1;
	np->pmtudisc	= IPV6_PMTUDISC_WANT;
	np->repflow	= net->ipv6.sysctl.flowlabel_reflect & FLOWLABEL_REFLECT_ESTABLISHED;
	sk->sk_ipv6only	= net->ipv6.sysctl.bindv6only;
	sk->sk_txrehash = READ_ONCE(net->core.sysctl_txrehash);

	/* Init the ipv4 part of the socket since we can have sockets
	 * using v6 API for ipv4.
	 */
	inet->uc_ttl	= -1;

	inet->mc_loop	= 1;
	inet->mc_ttl	= 1;
	inet->mc_index	= 0;
	RCU_INIT_POINTER(inet->mc_list, NULL);
	inet->rcv_tos	= 0;

	if (READ_ONCE(net->ipv4.sysctl_ip_no_pmtu_disc))
		inet->pmtudisc = IP_PMTUDISC_DONT;
	else
		inet->pmtudisc = IP_PMTUDISC_WANT;

	if (inet->inet_num) {
		/* It assumes that any protocol which allows
		 * the user to assign a number at socket
		 * creation time automatically shares.
		 */
		inet->inet_sport = htons(inet->inet_num);
		err = sk->sk_prot->hash(sk);
		if (err) {
			sk_common_release(sk);
			goto out;
		}
	}
	if (sk->sk_prot->init) {
		err = sk->sk_prot->init(sk);
		if (err) {
			sk_common_release(sk);
			goto out;
		}
	}

	if (!kern) {
		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
		if (err) {
			sk_common_release(sk);
			goto out;
		}
	}
out:
	return err;
out_rcu_unlock:
	rcu_read_unlock();
	goto out;
}
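
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the type/protocol matching rules applied by the lookup loop in
 * inet6_create() above. It models one inetsw6[] list as a plain array and
 * leaves out RCU and module autoloading. The names demo_protosw and
 * demo_lookup are hypothetical.
 */
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <netinet/in.h>		/* IPPROTO_IP, IPPROTO_TCP, IPPROTO_UDP */

struct demo_protosw {		/* hypothetical stand-in for struct inet_protosw */
	int protocol;
};

/*
 * Walk the entries registered for one socket type. On success, return the
 * index of the matching entry and store the resolved protocol in *protocol;
 * otherwise return a negative errno value.
 */
static int demo_lookup(const struct demo_protosw *tbl, size_t n, int *protocol)
{
	int err = -ESOCKTNOSUPPORT;	/* nothing registered for this type */
	size_t i;

	for (i = 0; i < n; i++) {
		if (*protocol == tbl[i].protocol) {
			/* Exact match, unless both sides are the wildcard. */
			if (*protocol != IPPROTO_IP)
				return (int)i;
		} else {
			/* Caller passed 0: adopt the entry's protocol. */
			if (*protocol == IPPROTO_IP) {
				*protocol = tbl[i].protocol;
				return (int)i;
			}
			/* Entry is a wildcard: it accepts any protocol. */
			if (tbl[i].protocol == IPPROTO_IP)
				return (int)i;
		}
		err = -EPROTONOSUPPORT;
	}
	return err;
}
```
/*
 * With a single IPPROTO_TCP entry, a request for protocol 0 resolves to
 * IPPROTO_TCP, while a request for IPPROTO_UDP fails with -EPROTONOSUPPORT.
 * With a single wildcard entry, any non-zero protocol matches, but protocol
 * 0 does not, because a wildcard request cannot be resolved by a wildcard
 * entry.
 */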

static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
			u32 flags)
{
	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
	struct inet_sock *inet = inet_sk(sk);
	struct ipv6_pinfo *np = inet6_sk(sk);
	struct net *net = sock_net(sk);
	__be32 v4addr = 0;
	unsigned short snum;
	bool saved_ipv6only;
	int addr_type = 0;
	int err = 0;

	if (addr->sin6_family != AF_INET6)
		return -EAFNOSUPPORT;

	addr_type = ipv6_addr_type(&addr->sin6_addr);
	if ((addr_type & IPV6_ADDR_MULTICAST) && sk->sk_type == SOCK_STREAM)
		return -EINVAL;

	snum = ntohs(addr->sin6_port);
	if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) &&
	    snum && inet_port_requires_bind_service(net, snum) &&
	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
		return -EACCES;

	if (flags & BIND_WITH_LOCK)
		lock_sock(sk);

	/* Check these errors (active socket, double bind). */
	if (sk->sk_state != TCP_CLOSE || inet->inet_num) {
		err = -EINVAL;
		goto out;
	}

	/* Check if the address belongs to the host. */
	if (addr_type == IPV6_ADDR_MAPPED) {
		struct net_device *dev = NULL;
		int chk_addr_ret;

		/* Binding to v4-mapped address on a v6-only socket
		 * makes no sense
		 */
		if (ipv6_only_sock(sk)) {
			err = -EINVAL;
			goto out;
		}

		rcu_read_lock();
		if (sk->sk_bound_dev_if) {
			dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
			if (!dev) {
				err = -ENODEV;
				goto out_unlock;
			}
		}

		/* Reproduce AF_INET checks to make the bindings consistent */
		v4addr = addr->sin6_addr.s6_addr32[3];
		chk_addr_ret = inet_addr_type_dev_table(net, dev, v4addr);
		rcu_read_unlock();

		if (!inet_addr_valid_or_nonlocal(net, inet, v4addr,
						 chk_addr_ret)) {
			err = -EADDRNOTAVAIL;
			goto out;
		}
	} else {
		if (addr_type != IPV6_ADDR_ANY) {
			struct net_device *dev = NULL;

			rcu_read_lock();
			if (__ipv6_addr_needs_scope_id(addr_type)) {
				if (addr_len >= sizeof(struct sockaddr_in6) &&
				    addr->sin6_scope_id) {
					/* Override any existing binding, if another one
					 * is supplied by user.
					 */
					sk->sk_bound_dev_if = addr->sin6_scope_id;
				}

				/* Binding to link-local address requires an interface */
				if (!sk->sk_bound_dev_if) {
					err = -EINVAL;
					goto out_unlock;
				}
			}

			if (sk->sk_bound_dev_if) {
				dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
				if (!dev) {
					err = -ENODEV;
					goto out_unlock;
				}
			}

			/* ipv4 addr of the socket is invalid.  Only the
			 * unspecified and mapped address have a v4 equivalent.
			 */
			v4addr = LOOPBACK4_IPV6;
			if (!(addr_type & IPV6_ADDR_MULTICAST)) {
				if (!ipv6_can_nonlocal_bind(net, inet) &&
				    !ipv6_chk_addr(net, &addr->sin6_addr,
						   dev, 0)) {
					err = -EADDRNOTAVAIL;
					goto out_unlock;
				}
			}
			rcu_read_unlock();
		}
	}

	inet->inet_rcv_saddr = v4addr;
	inet->inet_saddr = v4addr;

	sk->sk_v6_rcv_saddr = addr->sin6_addr;

	if (!(addr_type & IPV6_ADDR_MULTICAST))
		np->saddr = addr->sin6_addr;

	saved_ipv6only = sk->sk_ipv6only;
	if (addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED)
		sk->sk_ipv6only = 1;

	/* Make sure we are allowed to bind here. */
	if (snum || !(inet->bind_address_no_port ||
		      (flags & BIND_FORCE_ADDRESS_NO_PORT))) {
		err = sk->sk_prot->get_port(sk, snum);
		if (err) {
			sk->sk_ipv6only = saved_ipv6only;
			inet_reset_saddr(sk);
			goto out;
		}
		if (!(flags & BIND_FROM_BPF)) {
			err = BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk);
			if (err) {
				sk->sk_ipv6only = saved_ipv6only;
				inet_reset_saddr(sk);
if ( sk - > sk_prot - > put_port )
sk - > sk_prot - > put_port ( sk ) ;
2020-05-08 20:46:11 +03:00
goto out ;
}
2018-03-31 01:08:07 +03:00
}
2005-04-17 02:20:36 +04:00
}
2018-01-25 10:15:27 +03:00
if ( addr_type ! = IPV6_ADDR_ANY )
2005-04-17 02:20:36 +04:00
sk - > sk_userlocks | = SOCK_BINDADDR_LOCK ;
if ( snum )
sk - > sk_userlocks | = SOCK_BINDPORT_LOCK ;
2009-10-15 10:30:45 +04:00
inet - > inet_sport = htons ( inet - > inet_num ) ;
inet - > inet_dport = 0 ;
inet - > inet_daddr = 0 ;
2005-04-17 02:20:36 +04:00
out :
2020-05-08 20:46:10 +03:00
if ( flags & BIND_WITH_LOCK )
2018-03-31 01:08:04 +03:00
release_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
return err ;
2009-11-02 14:10:39 +03:00
out_unlock :
rcu_read_unlock ( ) ;
goto out ;
2005-04-17 02:20:36 +04:00
}
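The error handling in the bind path above (reserve the port with get_port(), run the post-bind hook, and call put_port() if the hook rejects the bind) can be modelled outside the kernel. What follows is a minimal userspace sketch, not kernel code: struct model_sock, model_get_port(), model_put_port(), model_bind() and the hook_fails flag are hypothetical names invented for illustration, and the port table is a plain array rather than the real hash tables.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PORTS 16	/* tiny port space, for illustration only */

/* Hypothetical stand-in for the protocol's port table. */
static bool model_port_in_use[MODEL_PORTS];

struct model_sock {
	unsigned int port;	/* 0 means "not bound" */
	bool ipv6only;
};

/* Models sk->sk_prot->get_port(): reserve the port or fail if it is taken. */
static int model_get_port(struct model_sock *sk, unsigned int snum)
{
	if (snum == 0 || snum >= MODEL_PORTS || model_port_in_use[snum])
		return -EADDRINUSE;
	model_port_in_use[snum] = true;
	sk->port = snum;
	return 0;
}

/* Models sk->sk_prot->put_port(): undo a successful get_port(). */
static void model_put_port(struct model_sock *sk)
{
	if (sk->port) {
		model_port_in_use[sk->port] = false;
		sk->port = 0;
	}
}

/*
 * Models the tail of the bind path: every failure after state has been
 * changed restores that state before returning.
 */
static int model_bind(struct model_sock *sk, unsigned int snum, bool hook_fails)
{
	bool saved_ipv6only = sk->ipv6only;
	int err;

	sk->ipv6only = true;

	err = model_get_port(sk, snum);
	if (err) {
		sk->ipv6only = saved_ipv6only;
		return err;
	}

	if (hook_fails) {
		/* The post-bind hook rejected the bind: release the port too. */
		sk->ipv6only = saved_ipv6only;
		model_put_port(sk);
		return -EPERM;
	}

	assert(model_port_in_use[snum]);
	return 0;
}
```

In this model, a bind that is rejected by the hook leaves the port free, so a later model_bind() on the same port by another socket succeeds. Without the model_put_port() call the port would stay reserved by a socket whose bind had reported failure.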
2007-02-22 16:05:40 +03:00
2018-04-17 20:00:39 +03:00
/* bind for INET6 API */
int inet6_bind ( struct socket * sock , struct sockaddr * uaddr , int addr_len )
{
struct sock * sk = sock - > sk ;
2021-01-27 22:31:39 +03:00
u32 flags = BIND_WITH_LOCK ;
2022-02-18 02:48:41 +03:00
const struct proto * prot ;
2018-04-17 20:00:39 +03:00
int err = 0 ;
2022-02-18 02:48:41 +03:00
/* IPV6_ADDRFORM can change sk->sk_prot under us. */
prot = READ_ONCE ( sk - > sk_prot ) ;
2018-04-17 20:00:39 +03:00
/* If the socket has its own bind function then use it. */
2022-02-18 02:48:41 +03:00
if ( prot - > bind )
return prot - > bind ( sk , uaddr , addr_len ) ;
2018-04-17 20:00:39 +03:00
if ( addr_len < SIN6_LEN_RFC2133 )
return - EINVAL ;
/* BPF prog is run before any checks are done so that if the prog
* changes context in a wrong way it will be caught .
*/
2021-01-27 22:31:39 +03:00
err = BPF_CGROUP_RUN_PROG_INET_BIND_LOCK ( sk , uaddr ,
2021-08-19 12:24:20 +03:00
CGROUP_INET6_BIND , & flags ) ;
2018-04-17 20:00:39 +03:00
if ( err )
return err ;
2021-01-27 22:31:39 +03:00
return __inet6_bind ( sk , uaddr , addr_len , flags ) ;
2018-04-17 20:00:39 +03:00
}
EXPORT_SYMBOL ( inet6_bind ) ;
2005-04-17 02:20:36 +04:00
int inet6_release ( struct socket * sock )
{
struct sock * sk = sock - > sk ;
2015-03-29 16:00:04 +03:00
if ( ! sk )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
/* Free mc lists */
ipv6_sock_mc_close ( sk ) ;
/* Free ac lists */
ipv6_sock_ac_close ( sk ) ;
return inet_release ( sock ) ;
}
2007-02-22 16:05:40 +03:00
EXPORT_SYMBOL ( inet6_release ) ;
2022-10-20 01:36:02 +03:00
void inet6_cleanup_sock ( struct sock * sk )
2005-04-17 02:20:36 +04:00
{
struct ipv6_pinfo * np = inet6_sk ( sk ) ;
struct sk_buff * skb ;
struct ipv6_txoptions * opt ;
/* Release rx options */
2012-05-05 14:13:53 +04:00
skb = xchg ( & np - > pktoptions , NULL ) ;
2018-09-20 12:37:46 +03:00
kfree_skb ( skb ) ;
2010-04-23 15:26:09 +04:00
2012-05-05 14:13:53 +04:00
skb = xchg ( & np - > rxpmtu , NULL ) ;
2018-09-20 12:37:46 +03:00
kfree_skb ( skb ) ;
2005-04-17 02:20:36 +04:00
/* Free flowlabels */
fl6_free_socklist ( sk ) ;
/* Free tx options */
2015-11-30 06:37:57 +03:00
opt = xchg ( ( __force struct ipv6_txoptions * * ) & np - > opt , NULL ) ;
if ( opt ) {
atomic_sub ( opt - > tot_len , & sk - > sk_omem_alloc ) ;
txopt_put ( opt ) ;
}
2005-04-17 02:20:36 +04:00
}
2022-10-06 21:53:46 +03:00
EXPORT_SYMBOL_GPL ( inet6_cleanup_sock ) ;
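inet6_cleanup_sock() above detaches each pointer with xchg() before freeing it, so that only one caller can ever own and release a given object. A rough userspace analogue using C11 atomics is sketched below. It is an illustration under assumptions, not kernel code: struct model_buf, model_slot, model_install(), model_detach() and model_release() are hypothetical names, and atomic_exchange() stands in for the kernel's xchg().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct model_buf {
	int len;
};

/* Hypothetical stand-in for a slot such as np->pktoptions. */
static _Atomic(struct model_buf *) model_slot;

/* Atomically take whatever is in the slot and leave NULL behind. */
static struct model_buf *model_detach(void)
{
	return atomic_exchange(&model_slot, (struct model_buf *)NULL);
}

/* Install a new buffer, freeing any buffer that was there before. */
static int model_install(int len)
{
	struct model_buf *buf = malloc(sizeof(*buf));
	struct model_buf *old;

	if (!buf)
		return -1;
	buf->len = len;
	old = atomic_exchange(&model_slot, buf);
	free(old);	/* free(NULL) is a no-op */
	return 0;
}

/* Returns 1 if a buffer was released, 0 if the slot was already empty. */
static int model_release(void)
{
	struct model_buf *buf = model_detach();

	if (!buf)
		return 0;
	free(buf);
	return 1;
}
```

Because the exchange hands the old pointer to exactly one caller, calling model_release() twice frees the buffer once; the second call sees NULL and does nothing.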
2005-04-17 02:20:36 +04:00
/*
* This does both peername and sockname .
*/
int inet6_getname ( struct socket * sock , struct sockaddr * uaddr ,
bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.
The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.
Simple example:
# ./cilium/cilium service list
ID Frontend Service Type Backend
1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80
Before; curl's verbose output example, no getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
After; with getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 01:45:45 +03:00
int peer )
2005-04-17 02:20:36 +04:00
{
2012-05-05 14:13:53 +04:00
struct sockaddr_in6 * sin = ( struct sockaddr_in6 * ) uaddr ;
2005-04-17 02:20:36 +04:00
struct sock * sk = sock - > sk ;
struct inet_sock * inet = inet_sk ( sk ) ;
struct ipv6_pinfo * np = inet6_sk ( sk ) ;
2007-02-09 17:24:49 +03:00
2005-04-17 02:20:36 +04:00
sin - > sin6_family = AF_INET6 ;
sin - > sin6_flowinfo = 0 ;
sin - > sin6_scope_id = 0 ;
2021-10-27 00:30:14 +03:00
lock_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
if ( peer ) {
2021-10-27 00:30:14 +03:00
if ( ! inet - > inet_dport | |
( ( ( 1 < < sk - > sk_state ) & ( TCPF_CLOSE | TCPF_SYN_SENT ) ) & &
peer = = 1 ) ) {
release_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
return - ENOTCONN ;
2021-10-27 00:30:14 +03:00
}
2009-10-15 10:30:45 +04:00
sin - > sin6_port = inet - > inet_dport ;
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 02:42:29 +04:00
sin - > sin6_addr = sk - > sk_v6_daddr ;
2005-04-17 02:20:36 +04:00
if ( np - > sndflow )
sin - > sin6_flowinfo = np - > flow_label ;
2021-10-27 00:30:14 +03:00
BPF_CGROUP_RUN_SA_PROG ( sk , ( struct sockaddr * ) sin ,
CGROUP_INET6_GETPEERNAME ) ;
2005-04-17 02:20:36 +04:00
} else {
2013-10-04 02:42:29 +04:00
if ( ipv6_addr_any ( & sk - > sk_v6_rcv_saddr ) )
2011-11-21 07:39:03 +04:00
sin - > sin6_addr = np - > saddr ;
2005-04-17 02:20:36 +04:00
else
2013-10-04 02:42:29 +04:00
sin - > sin6_addr = sk - > sk_v6_rcv_saddr ;
2009-10-15 10:30:45 +04:00
sin - > sin6_port = inet - > inet_sport ;
2021-10-27 00:30:14 +03:00
BPF_CGROUP_RUN_SA_PROG ( sk , ( struct sockaddr * ) sin ,
CGROUP_INET6_GETSOCKNAME ) ;
2021-01-15 19:35:01 +03:00
}
2013-03-08 06:07:19 +04:00
sin - > sin6_scope_id = ipv6_iface_scope_id ( & sin - > sin6_addr ,
sk - > sk_bound_dev_if ) ;
2021-10-27 00:30:14 +03:00
release_sock ( sk ) ;
2018-02-12 22:00:20 +03:00
return sizeof ( * sin ) ;
2005-04-17 02:20:36 +04:00
}
2007-02-22 16:05:40 +03:00
EXPORT_SYMBOL ( inet6_getname ) ;
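The peer check in inet6_getname() above tests the socket state with a bitmask: the state number is turned into a single bit and compared against the set of states in which no peer address is available. The sketch below restates that condition in userspace C. The MODEL_ names and model_peer_unavailable() are hypothetical; the three state values are copied from the kernel's TCP state enum, and everything else is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the kernel's TCP state numbering. */
enum {
	MODEL_TCP_ESTABLISHED = 1,
	MODEL_TCP_SYN_SENT = 2,
	MODEL_TCP_CLOSE = 7,
};

/* One bit per state, in the style of the kernel's TCPF_* flags. */
#define MODEL_TCPF(state) (1u << (state))

/*
 * Returns true when a getpeername()-style request should fail with
 * -ENOTCONN: either no destination port is set, or the socket is closed
 * or still connecting and the caller asked with peer == 1.
 */
static bool model_peer_unavailable(uint16_t dport, int state, int peer)
{
	unsigned int no_peer_states = MODEL_TCPF(MODEL_TCP_CLOSE) |
				      MODEL_TCPF(MODEL_TCP_SYN_SENT);

	return !dport ||
	       ((MODEL_TCPF(state) & no_peer_states) && peer == 1);
}
```

Note that a peer value other than 1 skips the state test, so only the missing destination port makes such a request fail.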
2005-04-17 02:20:36 +04:00
int inet6_ioctl ( struct socket * sock , unsigned int cmd , unsigned long arg )
{
2020-05-18 09:28:05 +03:00
void __user * argp = ( void __user * ) arg ;
2005-04-17 02:20:36 +04:00
struct sock * sk = sock - > sk ;
2008-03-25 20:26:21 +03:00
struct net * net = sock_net ( sk ) ;
2022-02-18 02:48:41 +03:00
const struct proto * prot ;
2005-04-17 02:20:36 +04:00
2012-05-05 14:13:53 +04:00
switch ( cmd ) {
2005-04-17 02:20:36 +04:00
case SIOCADDRT :
2020-05-18 09:28:05 +03:00
case SIOCDELRT : {
struct in6_rtmsg rtmsg ;
2005-04-17 02:20:36 +04:00
2020-05-18 09:28:05 +03:00
if ( copy_from_user ( & rtmsg , argp , sizeof ( rtmsg ) ) )
return - EFAULT ;
return ipv6_route_ioctl ( net , cmd , & rtmsg ) ;
}
2005-04-17 02:20:36 +04:00
case SIOCSIFADDR :
2020-05-18 09:28:05 +03:00
return addrconf_add_ifaddr ( net , argp ) ;
2005-04-17 02:20:36 +04:00
case SIOCDIFADDR :
2020-05-18 09:28:05 +03:00
return addrconf_del_ifaddr ( net , argp ) ;
2005-04-17 02:20:36 +04:00
case SIOCSIFDSTADDR :
2020-05-18 09:28:05 +03:00
return addrconf_set_dstaddr ( net , argp ) ;
2005-04-17 02:20:36 +04:00
default :
2022-02-18 02:48:41 +03:00
/* IPV6_ADDRFORM can change sk->sk_prot under us. */
prot = READ_ONCE ( sk - > sk_prot ) ;
if ( ! prot - > ioctl )
2006-01-04 01:18:33 +03:00
return - ENOIOCTLCMD ;
2022-02-18 02:48:41 +03:00
return prot - > ioctl ( sk , cmd , arg ) ;
2005-04-17 02:20:36 +04:00
}
/*NOTREACHED*/
2010-09-23 00:43:57 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2007-02-22 16:05:40 +03:00
EXPORT_SYMBOL ( inet6_ioctl ) ;
2020-05-18 09:28:06 +03:00
# ifdef CONFIG_COMPAT
struct compat_in6_rtmsg {
struct in6_addr rtmsg_dst ;
struct in6_addr rtmsg_src ;
struct in6_addr rtmsg_gateway ;
u32 rtmsg_type ;
u16 rtmsg_dst_len ;
u16 rtmsg_src_len ;
u32 rtmsg_metric ;
u32 rtmsg_info ;
u32 rtmsg_flags ;
s32 rtmsg_ifindex ;
} ;
static int inet6_compat_routing_ioctl ( struct sock * sk , unsigned int cmd ,
struct compat_in6_rtmsg __user * ur )
{
struct in6_rtmsg rt ;
if ( copy_from_user ( & rt . rtmsg_dst , & ur - > rtmsg_dst ,
3 * sizeof ( struct in6_addr ) ) | |
get_user ( rt . rtmsg_type , & ur - > rtmsg_type ) | |
get_user ( rt . rtmsg_dst_len , & ur - > rtmsg_dst_len ) | |
get_user ( rt . rtmsg_src_len , & ur - > rtmsg_src_len ) | |
get_user ( rt . rtmsg_metric , & ur - > rtmsg_metric ) | |
get_user ( rt . rtmsg_info , & ur - > rtmsg_info ) | |
get_user ( rt . rtmsg_flags , & ur - > rtmsg_flags ) | |
get_user ( rt . rtmsg_ifindex , & ur - > rtmsg_ifindex ) )
return - EFAULT ;
return ipv6_route_ioctl ( sock_net ( sk ) , cmd , & rt ) ;
}
int inet6_compat_ioctl ( struct socket * sock , unsigned int cmd , unsigned long arg )
{
void __user * argp = compat_ptr ( arg ) ;
struct sock * sk = sock - > sk ;
switch ( cmd ) {
case SIOCADDRT :
case SIOCDELRT :
return inet6_compat_routing_ioctl ( sk , cmd , argp ) ;
default :
return - ENOIOCTLCMD ;
}
}
EXPORT_SYMBOL_GPL ( inet6_compat_ioctl ) ;
# endif /* CONFIG_COMPAT */
2019-07-03 17:06:55 +03:00
INDIRECT_CALLABLE_DECLARE ( int udpv6_sendmsg ( struct sock * , struct msghdr * ,
size_t ) ) ;
ipv6: provide and use ipv6 specific version for {recv, send}msg
This will simplify indirect call wrapper invocation in the following
patch.
No functional change intended, any - out-of-tree - IPv6 user of
inet_{recv,send}msg can keep using the existing functions.
SCTP code still uses the existing version even for ipv6: as this series
will not add ICW for SCTP, moving to the new helper would not give
any benefit.
The only other in-kernel user of inet_{recv,send}msg is
pvcalls_conn_back_read(), but pvcalls explicitly creates only IPv4 sockets,
so no need to update that code path, too.
v1 -> v2: drop inet6_{recv,send}msg declaration from header file,
prefer ICW macro instead
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 17:06:53 +03:00
int inet6_sendmsg ( struct socket * sock , struct msghdr * msg , size_t size )
{
struct sock * sk = sock - > sk ;
2022-02-18 02:48:41 +03:00
const struct proto * prot ;
2019-07-03 17:06:53 +03:00
if ( unlikely ( inet_send_prepare ( sk ) ) )
return - EAGAIN ;
2022-02-18 02:48:41 +03:00
/* IPV6_ADDRFORM can change sk->sk_prot under us. */
prot = READ_ONCE ( sk - > sk_prot ) ;
return INDIRECT_CALL_2 ( prot - > sendmsg , tcp_sendmsg , udpv6_sendmsg ,
2019-07-03 17:06:55 +03:00
sk , msg , size ) ;
2019-07-03 17:06:53 +03:00
}
2019-07-03 17:06:55 +03:00
INDIRECT_CALLABLE_DECLARE ( int udpv6_recvmsg ( struct sock * , struct msghdr * ,
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-11 15:49:55 +03:00
size_t , int , int * ) ) ;
2019-07-03 17:06:53 +03:00
int inet6_recvmsg ( struct socket * sock , struct msghdr * msg , size_t size ,
int flags )
{
struct sock * sk = sock - > sk ;
2022-02-18 02:48:41 +03:00
const struct proto * prot ;
2019-07-03 17:06:53 +03:00
int addr_len = 0 ;
int err ;
if ( likely ( ! ( flags & MSG_ERRQUEUE ) ) )
sock_rps_record_flow ( sk ) ;
2022-02-18 02:48:41 +03:00
/* IPV6_ADDRFORM can change sk->sk_prot under us. */
prot = READ_ONCE ( sk - > sk_prot ) ;
err = INDIRECT_CALL_2 ( prot - > recvmsg , tcp_recvmsg , udpv6_recvmsg ,
2022-04-11 15:49:55 +03:00
sk , msg , size , flags , & addr_len ) ;
2019-07-03 17:06:53 +03:00
if ( err > = 0 )
msg - > msg_namelen = addr_len ;
return err ;
}
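inet6_sendmsg() and inet6_recvmsg() above read sk->sk_prot once and then dispatch through INDIRECT_CALL_2(), which compares the function pointer against two likely targets and calls the matching one directly, falling back to a true indirect call otherwise. The following userspace sketch shows the idea only. MODEL_INDIRECT_CALL_2, struct model_proto and the model_*_sendmsg() functions are hypothetical; the kernel macro is defined differently and is only active when retpolines are enabled.

```c
typedef int (*model_sendmsg_t)(int len);

/* Three hypothetical protocol handlers with distinguishable results. */
static int model_tcp_sendmsg(int len) { return len + 1000; }
static int model_udp_sendmsg(int len) { return len + 2000; }
static int model_raw_sendmsg(int len) { return len + 3000; }

/*
 * Compare the pointer against two expected targets and call the match
 * by name (a direct call); otherwise call through the pointer.
 */
#define MODEL_INDIRECT_CALL_2(f, f2, f1, ...)			\
	((f) == (f2) ? f2(__VA_ARGS__) :			\
	 (f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

struct model_proto {
	model_sendmsg_t sendmsg;
};

static int model_send(const struct model_proto *prot, int len)
{
	/* Read the pointer once so the comparison and the call agree. */
	model_sendmsg_t fn = prot->sendmsg;

	return MODEL_INDIRECT_CALL_2(fn, model_tcp_sendmsg,
				     model_udp_sendmsg, len);
}
```

The result is the same whichever branch is taken; the benefit in the kernel is that the two common cases avoid the cost of a retpoline-protected indirect branch.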
2005-12-22 23:49:22 +03:00
const struct proto_ops inet6_stream_ops = {
2006-03-21 09:48:35 +03:00
. family = PF_INET6 ,
. owner = THIS_MODULE ,
. release = inet6_release ,
. bind = inet6_bind ,
. connect = inet_stream_connect , /* ok */
. socketpair = sock_no_socketpair , /* a do nothing */
. accept = inet_accept , /* ok */
. getname = inet6_getname ,
2018-06-28 19:43:44 +03:00
. poll = tcp_poll , /* ok */
2006-03-21 09:48:35 +03:00
. ioctl = inet6_ioctl , /* must change */
2019-04-17 23:51:48 +03:00
. gettstamp = sock_gettstamp ,
2006-03-21 09:48:35 +03:00
. listen = inet_listen , /* ok */
. shutdown = inet_shutdown , /* ok */
. setsockopt = sock_common_setsockopt , /* ok */
. getsockopt = sock_common_getsockopt , /* ok */
2019-07-03 17:06:53 +03:00
. sendmsg = inet6_sendmsg , /* retpoline's sake */
. recvmsg = inet6_recvmsg , /* retpoline's sake */
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfer of pages from skbs to one VMA.
This operation only uses down_read(&current->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski with great details.
Benefits are :
- Better scalability, in case multiple threads reuse VMAS
(without mmap()/munmap() calls) since mmap_sem won't be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, the kernel can report how
many bytes were successfully mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 18:58:08 +03:00
# ifdef CONFIG_MMU
tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.
Implement mmap() system call so that applications can avoid
copying data without complex splice() games.
Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)
Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.
If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.
Application must fallback to recvmsg() to read the problematic sequence.
mmap() won't block, regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.
An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()
On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
Tested:
mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
Without mmap() (tcp_mmap -s)
received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
With mmap() on receiver (tcp_mmap -s -z)
received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 20:33:38 +03:00
. mmap = tcp_mmap ,
2018-04-27 18:58:08 +03:00
# endif
2010-07-11 00:41:55 +04:00
. sendpage = inet_sendpage ,
2017-08-16 08:31:10 +03:00
. sendmsg_locked = tcp_sendmsg_locked ,
. sendpage_locked = tcp_sendpage_locked ,
2007-11-07 10:31:58 +03:00
. splice_read = tcp_splice_read ,
2016-08-29 00:43:18 +03:00
. read_sock = tcp_read_sock ,
2022-06-15 19:20:12 +03:00
. read_skb = tcp_read_skb ,
2016-08-29 00:43:18 +03:00
. peek_len = tcp_peek_len ,
2006-03-21 09:45:21 +03:00
# ifdef CONFIG_COMPAT
2020-05-18 09:28:06 +03:00
. compat_ioctl = inet6_compat_ioctl ,
2006-03-21 09:45:21 +03:00
# endif
2018-04-16 20:33:35 +03:00
. set_rcvlowat = tcp_set_rcvlowat ,
2005-04-17 02:20:36 +04:00
} ;
2005-12-22 23:49:22 +03:00
const struct proto_ops inet6_dgram_ops = {
2006-03-21 09:48:35 +03:00
. family = PF_INET6 ,
. owner = THIS_MODULE ,
. release = inet6_release ,
. bind = inet6_bind ,
. connect = inet_dgram_connect , /* ok */
. socketpair = sock_no_socketpair , /* a do nothing */
. accept = sock_no_accept , /* a do nothing */
. getname = inet6_getname ,
2018-06-28 19:43:44 +03:00
. poll = udp_poll , /* ok */
2006-03-21 09:48:35 +03:00
. ioctl = inet6_ioctl , /* must change */
2019-04-17 23:51:48 +03:00
. gettstamp = sock_gettstamp ,
2006-03-21 09:48:35 +03:00
. listen = sock_no_listen , /* ok */
. shutdown = inet_shutdown , /* ok */
. setsockopt = sock_common_setsockopt , /* ok */
. getsockopt = sock_common_getsockopt , /* ok */
ipv6: provide and use ipv6 specific version for {recv, send}msg
This will simplify indirect call wrapper invocation in the following
patch.
No functional change intended, any - out-of-tree - IPv6 user of
inet_{recv,send}msg can keep using the existing functions.
SCTP code still uses the existing version even for ipv6: as this series
will not add ICW for SCTP, moving to the new helper would not give
any benefit.
The only other in-kernel user of inet_{recv,send}msg is
pvcalls_conn_back_read(), but pvcalls explicitly creates only IPv4 sockets,
so there is no need to update that code path either.
v1 -> v2: drop inet6_{recv,send}msg declaration from header file,
prefer ICW macro instead
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 17:06:53 +03:00
. sendmsg = inet6_sendmsg , /* retpoline's sake */
. recvmsg = inet6_recvmsg , /* retpoline's sake */
2022-06-15 19:20:12 +03:00
. read_skb = udp_read_skb ,
2006-03-21 09:48:35 +03:00
. mmap = sock_no_mmap ,
. sendpage = sock_no_sendpage ,
2016-04-05 19:41:16 +03:00
. set_peek_off = sk_set_peek_off ,
2006-03-21 09:45:21 +03:00
# ifdef CONFIG_COMPAT
2020-05-18 09:28:06 +03:00
. compat_ioctl = inet6_compat_ioctl ,
2006-03-21 09:45:21 +03:00
# endif
2005-04-17 02:20:36 +04:00
} ;
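The inet6_sendmsg and inet6_recvmsg entries above are marked "retpoline's sake": giving IPv6 its own functions lets a caller compare the ops pointer against a known target and call that target directly. A minimal userspace sketch of that idea follows. The macro and every sketch_* name are hypothetical stand-ins, not the kernel's definitions, and whether a real build avoids the indirect branch depends on the compiler and configuration.

```c
#include <assert.h>

/* Hypothetical wrapper: if the pointer equals the expected target,
 * call the target by name (a direct call); otherwise fall back to
 * the ordinary indirect call through the pointer.
 */
#define SKETCH_INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

typedef int (*sketch_sendmsg_t)(int len);

/* Stand-in for the IPv6-specific handler. */
int sketch_inet6_sendmsg(int len)
{
	return len;
}

/* Stand-in for any other handler reachable through the same pointer. */
int sketch_other_sendmsg(int len)
{
	return -len;
}

/* Generic caller that only holds a function pointer. */
int sketch_sock_sendmsg(sketch_sendmsg_t op, int len)
{
	return SKETCH_INDIRECT_CALL_1(op, sketch_inet6_sendmsg, len);
}
```

Both paths return the same result as a plain indirect call would; only the call mechanism differs.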
2009-10-05 09:58:39 +04:00
static const struct net_proto_family inet6_family_ops = {
2005-04-17 02:20:36 +04:00
. family = PF_INET6 ,
. create = inet6_create ,
. owner = THIS_MODULE ,
} ;
2007-12-11 13:25:01 +03:00
int inet6_register_protosw(struct inet_protosw *p)
2005-04-17 02:20:36 +04:00
{
	struct list_head *lh;
	struct inet_protosw *answer;
	struct list_head *last_perm;
2007-12-11 13:25:01 +03:00
	int protocol = p->protocol;
	int ret;
2005-04-17 02:20:36 +04:00
	spin_lock_bh(&inetsw6_lock);
2007-12-11 13:25:01 +03:00
	ret = -EINVAL;
2005-04-17 02:20:36 +04:00
	if (p->type >= SOCK_MAX)
		goto out_illegal;
	/* If we are trying to override a permanent protocol, bail. */
	answer = NULL;
2007-12-11 13:25:01 +03:00
	ret = -EPERM;
2005-04-17 02:20:36 +04:00
	last_perm = &inetsw6[p->type];
	list_for_each(lh, &inetsw6[p->type]) {
		answer = list_entry(lh, struct inet_protosw, list);
		/* Check only the non-wild match. */
		if (INET_PROTOSW_PERMANENT & answer->flags) {
			if (protocol == answer->protocol)
				break;
			last_perm = lh;
		}
		answer = NULL;
	}
	if (answer)
		goto out_permanent;
	/* Add the new entry after the last permanent entry if any, so that
	 * the new entry does not override a permanent entry when matched with
	 * a wild-card protocol. But it is allowed to override any existing
2007-02-09 17:24:49 +03:00
	 * non-permanent entry. This means that when we remove this entry, the
2005-04-17 02:20:36 +04:00
	 * system automatically returns to the old behavior.
	 */
	list_add_rcu(&p->list, last_perm);
2007-12-11 13:25:01 +03:00
	ret = 0;
2005-04-17 02:20:36 +04:00
out:
	spin_unlock_bh(&inetsw6_lock);
2007-12-11 13:25:01 +03:00
	return ret;
2005-04-17 02:20:36 +04:00
out_permanent:
2012-05-15 18:11:53 +04:00
	pr_err("Attempt to override permanent protocol %d\n", protocol);
2005-04-17 02:20:36 +04:00
	goto out;
out_illegal:
2012-05-15 18:11:53 +04:00
	pr_err("Ignoring attempt to register invalid socket type %d\n",
2005-04-17 02:20:36 +04:00
	       p->type);
	goto out;
}
2007-02-22 16:05:40 +03:00
EXPORT_SYMBOL(inet6_register_protosw);
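The ordering rule in inet6_register_protosw() (reject a clash with a permanent entry, otherwise insert right after the last permanent entry) can be sketched in userspace. This is an illustration only: it uses a plain singly linked list with a dummy head instead of the kernel's list_head and RCU helpers, takes no lock, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_PERMANENT 0x1 /* stand-in for INET_PROTOSW_PERMANENT */

struct sketch_protosw {
	struct sketch_protosw *next;
	int protocol;
	unsigned int flags;
};

/* head is a dummy node; real entries start at head->next.
 * Returns 0 on success, -EPERM if a permanent entry owns the protocol.
 */
int sketch_register(struct sketch_protosw *head, struct sketch_protosw *p)
{
	struct sketch_protosw *last_perm = head;
	struct sketch_protosw *it;

	for (it = head->next; it; it = it->next) {
		/* Only permanent entries are checked for a clash. */
		if (it->flags & SKETCH_PERMANENT) {
			if (it->protocol == p->protocol)
				return -EPERM;
			last_perm = it;
		}
	}

	/* Insert after the last permanent entry, ahead of any
	 * previously registered non-permanent entries.
	 */
	p->next = last_perm->next;
	last_perm->next = p;
	return 0;
}
```

Because a later non-permanent registration lands ahead of an earlier one, removing it restores the earlier entry's precedence, which is the behavior the comment in the function describes.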
2005-04-17 02:20:36 +04:00
void
inet6_unregister_protosw(struct inet_protosw *p)
{
	if (INET_PROTOSW_PERMANENT & p->flags) {
2012-05-15 18:11:53 +04:00
		pr_err("Attempt to unregister permanent protocol %d\n",
2005-04-17 02:20:36 +04:00
		       p->protocol);
	} else {
		spin_lock_bh(&inetsw6_lock);
		list_del_rcu(&p->list);
		spin_unlock_bh(&inetsw6_lock);
		synchronize_net();
	}
}
2007-02-22 16:05:40 +03:00
EXPORT_SYMBOL(inet6_unregister_protosw);
2005-12-14 10:22:54 +03:00
int inet6_sk_rebuild_header(struct sock *sk)
{
	struct ipv6_pinfo *np = inet6_sk(sk);
2011-03-02 00:19:07 +03:00
	struct dst_entry *dst;
2005-12-14 10:22:54 +03:00
	dst = __sk_dst_check(sk, np->dst_cookie);
2015-03-29 16:00:04 +03:00
	if (!dst) {
2005-12-14 10:22:54 +03:00
		struct inet_sock *inet = inet_sk(sk);
2010-06-02 01:35:01 +04:00
		struct in6_addr *final_p, final;
2011-03-13 00:22:43 +03:00
		struct flowi6 fl6;
		memset(&fl6, 0, sizeof(fl6));
		fl6.flowi6_proto = sk->sk_protocol;
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common.
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including the upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6.
This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not easily doable.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 02:42:29 +04:00
		fl6.daddr = sk->sk_v6_daddr;
2011-11-21 07:39:03 +04:00
		fl6.saddr = np->saddr;
2011-03-13 00:22:43 +03:00
		fl6.flowlabel = np->flow_label;
		fl6.flowi6_oif = sk->sk_bound_dev_if;
		fl6.flowi6_mark = sk->sk_mark;
2011-03-13 00:36:19 +03:00
		fl6.fl6_dport = inet->inet_dport;
		fl6.fl6_sport = inet->inet_sport;
2016-11-03 20:23:43 +03:00
		fl6.flowi6_uid = sk->sk_uid;
2020-09-28 05:38:26 +03:00
		security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6));
2011-03-13 00:22:43 +03:00
2015-11-30 06:37:57 +03:00
		rcu_read_lock();
		final_p = fl6_update_dst(&fl6, rcu_dereference(np->opt),
					 &final);
		rcu_read_unlock();
2011-03-13 00:22:43 +03:00
2019-12-04 17:35:52 +03:00
		dst = ip6_dst_lookup_flow(sock_net(sk), sk, &fl6, final_p);
2011-03-02 00:19:07 +03:00
		if (IS_ERR(dst)) {
2005-12-14 10:22:54 +03:00
			sk->sk_route_caps = 0;
2023-03-15 23:57:43 +03:00
			WRITE_ONCE(sk->sk_err_soft, -PTR_ERR(dst));
2011-03-02 00:19:07 +03:00
			return PTR_ERR(dst);
2005-12-14 10:22:54 +03:00
		}
2015-12-03 08:53:57 +03:00
		ip6_dst_store(sk, dst, NULL, NULL);
2005-12-14 10:22:54 +03:00
	}
	return 0;
}
EXPORT_SYMBOL_GPL(inet6_sk_rebuild_header);
2014-09-27 20:50:56 +04:00
bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb,
		       const struct inet6_skb_parm *opt)
2005-12-14 10:24:28 +03:00
{
2012-05-18 10:14:11 +04:00
	const struct ipv6_pinfo *np = inet6_sk(sk);
2005-12-14 10:24:28 +03:00
	if (np->rxopt.all) {
2015-07-09 00:32:12 +03:00
		if (((opt->flags & IP6SKB_HOPBYHOP) &&
		     (np->rxopt.bits.hopopts || np->rxopt.bits.ohopopts)) ||
2013-12-08 18:47:01 +04:00
		    (ip6_flowinfo((struct ipv6hdr *)skb_network_header(skb)) &&
2005-12-14 10:24:28 +03:00
		     np->rxopt.bits.rxflow) ||
		    (opt->srcrt && (np->rxopt.bits.srcrt ||
		     np->rxopt.bits.osrcrt)) ||
		    ((opt->dst1 || opt->dst0) &&
		     (np->rxopt.bits.dstopts || np->rxopt.bits.odstopts)))
2012-05-18 10:14:11 +04:00
			return true;
2005-12-14 10:24:28 +03:00
	}
2012-05-18 10:14:11 +04:00
	return false;
2005-12-14 10:24:28 +03:00
}
EXPORT_SYMBOL_GPL(ipv6_opt_accepted);
2009-03-09 11:18:29 +03:00
static struct packet_type ipv6_packet_type __read_mostly = {
2009-02-01 11:45:17 +03:00
. type = cpu_to_be16 ( ETH_P_IPV6 ) ,
2008-02-27 17:14:03 +03:00
. func = ipv6_rcv ,
2018-07-05 17:49:42 +03:00
. list_func = ipv6_list_rcv ,
2012-11-15 12:49:11 +04:00
} ;
2008-02-27 17:14:03 +03:00
static int __init ipv6_packet_init ( void )
{
dev_add_pack ( & ipv6_packet_type ) ;
return 0 ;
}
static void ipv6_packet_cleanup ( void )
{
dev_remove_pack ( & ipv6_packet_type ) ;
}
2008-10-08 01:48:53 +04:00
static int __net_init ipv6_init_mibs(struct net *net)
{
2013-10-08 02:51:58 +04:00
	int i;
2014-05-06 02:55:55 +04:00
	net->mib.udp_stats_in6 = alloc_percpu(struct udp_mib);
	if (!net->mib.udp_stats_in6)
2008-10-08 01:49:36 +04:00
		return -ENOMEM;
2014-05-06 02:55:55 +04:00
	net->mib.udplite_stats_in6 = alloc_percpu(struct udp_mib);
	if (!net->mib.udplite_stats_in6)
2008-10-08 01:50:06 +04:00
		goto err_udplite_mib;
2014-05-06 02:55:55 +04:00
	net->mib.ipv6_statistics = alloc_percpu(struct ipstats_mib);
	if (!net->mib.ipv6_statistics)
2008-10-08 21:36:03 +04:00
		goto err_ip_mib;
2013-10-08 02:51:58 +04:00
	for_each_possible_cpu(i) {
		struct ipstats_mib *af_inet6_stats;
2014-05-06 02:55:55 +04:00
		af_inet6_stats = per_cpu_ptr(net->mib.ipv6_statistics, i);
2013-10-08 02:51:58 +04:00
		u64_stats_init(&af_inet6_stats->syncp);
	}
2014-05-06 02:55:55 +04:00
	net->mib.icmpv6_statistics = alloc_percpu(struct icmpv6_mib);
	if (!net->mib.icmpv6_statistics)
2008-10-08 21:36:03 +04:00
		goto err_icmp_mib;
2011-11-13 05:24:04 +04:00
	net->mib.icmpv6msg_statistics = kzalloc(sizeof(struct icmpv6msg_mib),
						GFP_KERNEL);
	if (!net->mib.icmpv6msg_statistics)
2008-10-08 21:36:03 +04:00
		goto err_icmpmsg_mib;
2008-10-08 01:48:53 +04:00
	return 0;
2008-10-08 01:50:06 +04:00
2008-10-08 21:36:03 +04:00
err_icmpmsg_mib:
2014-05-06 02:55:55 +04:00
	free_percpu(net->mib.icmpv6_statistics);
2008-10-08 21:36:03 +04:00
err_icmp_mib:
2014-05-06 02:55:55 +04:00
	free_percpu(net->mib.ipv6_statistics);
2008-10-08 21:36:03 +04:00
err_ip_mib:
2014-05-06 02:55:55 +04:00
	free_percpu(net->mib.udplite_stats_in6);
2008-10-08 01:50:06 +04:00
err_udplite_mib:
2014-05-06 02:55:55 +04:00
	free_percpu(net->mib.udp_stats_in6);
2008-10-08 01:50:06 +04:00
	return -ENOMEM;
2008-10-08 01:48:53 +04:00
}
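ipv6_init_mibs() uses the common goto-unwind idiom: each failed allocation jumps to a label that frees everything allocated before it, in reverse order. A minimal userspace sketch of the same shape follows, assuming malloc()/free() in place of alloc_percpu()/free_percpu(). The sketch_* names and the fail_at parameter are hypothetical and exist only to make the failure paths easy to exercise.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct sketch_mibs {
	void *udp;
	void *udplite;
	void *ip;
};

/* fail_at forces the Nth allocation (1-based) to fail; 0 means none. */
int sketch_init_mibs(struct sketch_mibs *m, int fail_at)
{
	m->udp = (fail_at == 1) ? NULL : malloc(64);
	if (!m->udp)
		return -ENOMEM;

	m->udplite = (fail_at == 2) ? NULL : malloc(64);
	if (!m->udplite)
		goto err_udplite;

	m->ip = (fail_at == 3) ? NULL : malloc(64);
	if (!m->ip)
		goto err_ip;

	return 0;

	/* Labels fall through, so later failures undo more state. */
err_ip:
	free(m->udplite);
	m->udplite = NULL;
err_udplite:
	free(m->udp);
	m->udp = NULL;
	return -ENOMEM;
}

/* Counterpart of a successful init: release everything. */
void sketch_cleanup_mibs(struct sketch_mibs *m)
{
	free(m->udp);
	free(m->udplite);
	free(m->ip);
}
```

Each label is named after the allocation that failed, so the cleanup below the label covers exactly the allocations that had already succeeded.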
2010-01-17 06:35:32 +03:00
static void ipv6_cleanup_mibs ( struct net * net )
2008-10-08 01:48:53 +04:00
{
2014-05-06 02:55:55 +04:00
free_percpu ( net - > mib . udp_stats_in6 ) ;
free_percpu ( net - > mib . udplite_stats_in6 ) ;
free_percpu ( net - > mib . ipv6_statistics ) ;
free_percpu ( net - > mib . icmpv6_statistics ) ;
2011-11-13 05:24:04 +04:00
kfree ( net - > mib . icmpv6msg_statistics ) ;
2008-10-08 01:48:53 +04:00
}
2008-10-14 05:54:07 +04:00
static int __net_init inet6_net_init ( struct net * net )
2008-01-10 13:48:33 +03:00
{
2008-03-21 14:14:17 +03:00
int err = 0 ;
2008-01-10 13:54:53 +03:00
net - > ipv6 . sysctl . bindv6only = 0 ;
2008-01-10 14:02:40 +03:00
net - > ipv6 . sysctl . icmpv6_time = 1 * HZ ;
2018-08-10 18:48:15 +03:00
net - > ipv6 . sysctl . icmpv6_echo_ignore_all = 0 ;
2019-03-19 19:37:12 +03:00
net - > ipv6 . sysctl . icmpv6_echo_ignore_multicast = 0 ;
2019-03-20 17:29:27 +03:00
net - > ipv6 . sysctl . icmpv6_echo_ignore_anycast = 0 ;
2023-04-19 04:32:38 +03:00
net - > ipv6 . sysctl . icmpv6_error_anycast_as_unicast = 0 ;
2019-04-17 23:35:49 +03:00
/* By default, rate limit error messages.
 * Except for pmtu discovery, it would break it.
 * proc_do_large_bitmap needs pointer to the bitmap.
*/
bitmap_set ( net - > ipv6 . sysctl . icmpv6_ratemask , 0 , ICMPV6_ERRMSG_MAX + 1 ) ;
bitmap_clear ( net - > ipv6 . sysctl . icmpv6_ratemask , ICMPV6_PKT_TOOBIG , 1 ) ;
net - > ipv6 . sysctl . icmpv6_ratemask_ptr = net - > ipv6 . sysctl . icmpv6_ratemask ;
2014-01-17 20:15:05 +04:00
net - > ipv6 . sysctl . flowlabel_consistency = 1 ;
2015-08-01 02:52:12 +03:00
net - > ipv6 . sysctl . auto_flowlabels = IP6_DEFAULT_AUTO_FLOW_LABELS ;
2015-03-24 01:36:05 +03:00
net - > ipv6 . sysctl . idgen_retries = 3 ;
net - > ipv6 . sysctl . idgen_delay = 1 * HZ ;
2015-08-01 02:52:13 +03:00
net - > ipv6 . sysctl . flowlabel_state_ranges = 0 ;
2017-10-31 00:16:00 +03:00
net - > ipv6 . sysctl . max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT ;
net - > ipv6 . sysctl . max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT ;
net - > ipv6 . sysctl . max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN ;
net - > ipv6 . sysctl . max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN ;
2021-02-01 22:47:55 +03:00
net - > ipv6 . sysctl . fib_notify_on_flag_change = 0 ;
2014-10-06 21:58:37 +04:00
atomic_set ( & net - > ipv6 . fib6_sernum , 1 ) ;
2008-01-10 13:56:03 +03:00
ipv6: ioam: Data plane support for Pre-allocated Trace
Implement support for processing the IOAM Pre-allocated Trace with IPv6,
see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].
A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
- net.ipv6.conf.XXX.ioam6_enabled
Two other sysctls are introduced to define IOAM IDs, represented by an integer.
They are respectively per-namespace and per-interface:
- net.ipv6.ioam6_id
- net.ipv6.conf.XXX.ioam6_id
The value of the first one represents the IOAM ID of the node itself (u32; max
and default value = U32_MAX>>8, due to hop limit concatenation) while the other
represents the IOAM ID of an interface (u16; max and default value = U16_MAX).
Each "ioam6_id" sysctl has a "_wide" equivalent:
- net.ipv6.ioam6_id_wide
- net.ipv6.conf.XXX.ioam6_id_wide
The value of the first one represents the wide IOAM ID of the node itself (u64;
max and default value = U64_MAX>>8, due to hop limit concatenation) while the
other represents the wide IOAM ID of an interface (u32; max and default value
= U32_MAX).
The use of short and wide equivalents is not exclusive, a deployment could
choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
could be an identifier for a physical interface, whereas
net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
logical sub-interface. Documentation about new sysctls is provided at the end
of this patchset.
Two relativistic hash tables are used: one for IOAM namespaces, the other for
IOAM schemas. A namespace can only have a single active schema and a schema
can only be attached to a single namespace (1:1 relationship).
[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
[3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
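The commit message above explains the U32_MAX>>8 and U64_MAX>>8 limits as a consequence of hop limit concatenation: one byte of the field is taken by the hop limit, leaving 24 (or 56) bits for the node ID. A minimal sketch of that arithmetic follows. Placing the hop limit in the most significant byte is an assumption made here for illustration, and the sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the limits named in the commit message. */
#define SKETCH_IOAM6_MAX_ID      (UINT32_MAX >> 8) /* 24 usable bits */
#define SKETCH_IOAM6_MAX_ID_WIDE (UINT64_MAX >> 8) /* 56 usable bits */

/* Pack an 8-bit hop limit and a 24-bit node ID into one 32-bit field. */
uint32_t sketch_pack_short(uint8_t hop_limit, uint32_t node_id)
{
	assert(node_id <= SKETCH_IOAM6_MAX_ID);
	return ((uint32_t)hop_limit << 24) | node_id;
}

/* Pack an 8-bit hop limit and a 56-bit node ID into one 64-bit field. */
uint64_t sketch_pack_wide(uint8_t hop_limit, uint64_t node_id)
{
	assert(node_id <= SKETCH_IOAM6_MAX_ID_WIDE);
	return ((uint64_t)hop_limit << 56) | node_id;
}
```

An ID larger than the limit would overlap the hop-limit byte, which is why the maximum and the default coincide with the shifted value.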
2021-07-20 22:42:57 +03:00
net - > ipv6 . sysctl . ioam6_id = IOAM6_DEFAULT_ID ;
net - > ipv6 . sysctl . ioam6_id_wide = IOAM6_DEFAULT_ID_WIDE ;
2008-10-08 01:48:53 +04:00
err = ipv6_init_mibs ( net ) ;
if ( err )
return err ;
2008-03-21 14:14:17 +03:00
# ifdef CONFIG_PROC_FS
err = udp6_proc_init ( net ) ;
if ( err )
goto out ;
2008-03-21 14:14:45 +03:00
err = tcp6_proc_init ( net ) ;
if ( err )
goto proc_tcp6_fail ;
2008-03-27 02:52:32 +03:00
err = ac6_proc_init ( net ) ;
if ( err )
goto proc_ac6_fail ;
2008-03-21 14:14:17 +03:00
# endif
return err ;
2008-03-21 14:14:45 +03:00
# ifdef CONFIG_PROC_FS
2008-03-27 02:52:32 +03:00
proc_ac6_fail :
tcp6_proc_exit ( net ) ;
2008-03-21 14:14:45 +03:00
proc_tcp6_fail :
udp6_proc_exit ( net ) ;
2008-10-08 01:48:53 +04:00
out :
ipv6_cleanup_mibs ( net ) ;
return err ;
2008-03-21 14:14:45 +03:00
# endif
2008-01-10 13:48:33 +03:00
}
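The ICMPv6 rate-mask default set in inet6_net_init() (set every error-type bit, then clear the Packet Too Big bit so PMTU discovery is not throttled) can be sketched with a plain word array. This is a userspace illustration only: the numeric values 127 and 2 are assumed for ICMPV6_ERRMSG_MAX and ICMPV6_PKT_TOOBIG, and the sketch_* helpers are hypothetical replacements for bitmap_set(), bitmap_clear() and test_bit().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_ERRMSG_MAX 127 /* assumed value of ICMPV6_ERRMSG_MAX */
#define SKETCH_PKT_TOOBIG 2   /* assumed value of ICMPV6_PKT_TOOBIG */
#define SKETCH_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_ERRMSG_MAX + SKETCH_WORD_BITS) / SKETCH_WORD_BITS)

void sketch_ratemask_init(unsigned long *mask)
{
	unsigned int bit;

	memset(mask, 0, SKETCH_WORDS * sizeof(unsigned long));

	/* Rate limit every error type from 0 to SKETCH_ERRMSG_MAX. */
	for (bit = 0; bit <= SKETCH_ERRMSG_MAX; bit++)
		mask[bit / SKETCH_WORD_BITS] |= 1UL << (bit % SKETCH_WORD_BITS);

	/* Exempt Packet Too Big so PMTU discovery keeps working. */
	mask[SKETCH_PKT_TOOBIG / SKETCH_WORD_BITS] &=
		~(1UL << (SKETCH_PKT_TOOBIG % SKETCH_WORD_BITS));
}

/* True if the given error type is subject to rate limiting. */
bool sketch_ratemask_test(const unsigned long *mask, unsigned int type)
{
	assert(type <= SKETCH_ERRMSG_MAX);
	return (mask[type / SKETCH_WORD_BITS] >> (type % SKETCH_WORD_BITS)) & 1UL;
}
```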
2010-01-17 06:35:32 +03:00
static void __net_exit inet6_net_exit ( struct net * net )
2008-01-10 13:48:33 +03:00
{
2008-03-21 14:14:17 +03:00
# ifdef CONFIG_PROC_FS
udp6_proc_exit ( net ) ;
2008-03-21 14:14:45 +03:00
tcp6_proc_exit ( net ) ;
2008-03-27 02:52:32 +03:00
ac6_proc_exit ( net ) ;
2008-03-21 14:14:17 +03:00
# endif
2008-10-08 01:48:53 +04:00
ipv6_cleanup_mibs ( net ) ;
2008-01-10 13:48:33 +03:00
}
static struct pernet_operations inet6_net_ops = {
. init = inet6_net_init ,
. exit = inet6_net_exit ,
} ;
2019-02-13 22:53:38 +03:00
static int ipv6_route_input ( struct sk_buff * skb )
{
ip6_route_input ( skb ) ;
return skb_dst ( skb ) - > error ;
}
2013-08-31 09:44:30 +04:00
static const struct ipv6_stub ipv6_stub_impl = {
. ipv6_sock_mc_join = ipv6_sock_mc_join ,
. ipv6_sock_mc_drop = ipv6_sock_mc_drop ,
2019-12-04 17:35:53 +03:00
. ipv6_dst_lookup_flow = ip6_dst_lookup_flow ,
2019-02-13 22:53:38 +03:00
. ipv6_route_input = ipv6_route_input ,
2018-05-10 06:34:25 +03:00
. fib6_get_table = fib6_get_table ,
. fib6_table_lookup = fib6_table_lookup ,
. fib6_lookup = fib6_lookup ,
2019-04-17 00:35:59 +03:00
. fib6_select_path = fib6_select_path ,
2018-05-21 19:08:14 +03:00
. ip6_mtu_from_fib6 = ip6_mtu_from_fib6 ,
2019-04-06 02:30:24 +03:00
. fib6_nh_init = fib6_nh_init ,
. fib6_nh_release = fib6_nh_release ,
2021-11-22 18:15:12 +03:00
. fib6_nh_release_dsts = fib6_nh_release_dsts ,
2019-05-22 22:04:40 +03:00
. fib6_update_sernum = fib6_update_sernum_stub ,
2019-05-22 22:04:41 +03:00
. fib6_rt_update = fib6_rt_update ,
2019-05-22 22:04:39 +03:00
. ip6_del_rt = ip6_del_rt ,
2013-08-31 09:44:30 +04:00
. udpv6_encap_enable = udpv6_encap_enable ,
2013-08-31 09:44:36 +04:00
. ndisc_send_na = ndisc_send_na ,
2020-04-27 18:59:34 +03:00
# if IS_ENABLED(CONFIG_XFRM)
2020-05-04 11:06:06 +03:00
. xfrm6_local_rxpmtu = xfrm6_local_rxpmtu ,
2020-04-27 18:59:34 +03:00
. xfrm6_udp_encap_rcv = xfrm6_udp_encap_rcv ,
2020-04-27 18:59:35 +03:00
. xfrm6_rcv_encap = xfrm6_rcv_encap ,
2020-04-27 18:59:34 +03:00
# endif
2013-08-31 09:44:34 +04:00
. nd_tbl = & nd_tbl ,
2020-08-28 18:14:31 +03:00
. ipv6_fragment = ip6_fragment ,
2021-03-30 04:45:43 +03:00
. ipv6_dev_find = ipv6_dev_find ,
2013-08-31 09:44:30 +04:00
} ;
2018-03-31 01:08:05 +03:00
static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
. inet6_bind = __inet6_bind ,
2018-10-15 20:27:45 +03:00
. udp6_lib_lookup = __udp6_lib_lookup ,
2022-08-17 09:18:34 +03:00
. ipv6_setsockopt = do_ipv6_setsockopt ,
2022-09-02 03:29:31 +03:00
. ipv6_getsockopt = do_ipv6_getsockopt ,
2018-03-31 01:08:05 +03:00
} ;
2005-04-17 02:20:36 +04:00
static int __init inet6_init ( void )
{
2007-02-09 17:24:49 +03:00
struct list_head * r ;
2009-03-04 14:18:11 +03:00
int err = 0 ;
2005-04-17 02:20:36 +04:00
2015-03-01 15:58:29 +03:00
sock_skb_cb_check_size ( sizeof ( struct inet6_skb_parm ) ) ;
2006-09-01 11:29:06 +04:00
2009-03-04 14:18:11 +03:00
/* Register the socket-side information for inet6_create. */
2012-05-05 14:13:53 +04:00
	for (r = &inetsw6[0]; r < &inetsw6[SOCK_MAX]; ++r)
2009-03-04 14:18:11 +03:00
INIT_LIST_HEAD ( r ) ;
2022-09-16 11:48:21 +03:00
raw_hashinfo_init ( & raw_v6_hashinfo ) ;
2009-06-01 14:07:33 +04:00
if ( disable_ipv6_mod ) {
2012-05-15 18:11:53 +04:00
		pr_info("Loaded, but administratively disabled, reboot required to enable\n");
2009-03-04 14:18:11 +03:00
goto out ;
}
2005-04-17 02:20:36 +04:00
err = proto_register ( & tcpv6_prot , 1 ) ;
if ( err )
goto out ;
err = proto_register ( & udpv6_prot , 1 ) ;
if ( err )
goto out_unregister_tcp_proto ;
2006-11-27 22:10:57 +03:00
err = proto_register ( & udplitev6_prot , 1 ) ;
2005-04-17 02:20:36 +04:00
if ( err )
goto out_unregister_udp_proto ;
2006-11-27 22:10:57 +03:00
err = proto_register ( & rawv6_prot , 1 ) ;
if ( err )
goto out_unregister_udplite_proto ;
2013-05-23 00:17:31 +04:00
err = proto_register ( & pingv6_prot , 1 ) ;
if ( err )
2018-08-28 14:40:52 +03:00
goto out_unregister_raw_proto ;
2005-04-17 02:20:36 +04:00
/* We MUST register RAW sockets before we create the ICMP6,
 * IGMP6, or NDISC control sockets.
*/
2007-12-11 13:25:35 +03:00
err = rawv6_init ( ) ;
if ( err )
2018-08-28 14:40:52 +03:00
goto out_unregister_ping_proto ;
2005-04-17 02:20:36 +04:00
/* Register the family here so that the init calls below will
 * be able to create sockets. (?? is this dangerous ??)
*/
2005-11-12 02:05:47 +03:00
err = sock_register ( & inet6_family_ops ) ;
if ( err )
2007-12-11 13:25:35 +03:00
goto out_sock_register_fail ;
2005-04-17 02:20:36 +04:00
/*
* ipngwg API draft makes clear that the correct semantics
* for TCP and UDP is to consider one TCP and UDP instance
2011-03-31 05:57:33 +04:00
* in a host available by both INET and INET6 APIs and
2005-04-17 02:20:36 +04:00
 * able to communicate via both network protocols.
*/
2008-01-10 13:48:33 +03:00
err = register_pernet_subsys ( & inet6_net_ops ) ;
if ( err )
goto register_pernet_fail ;
2008-07-03 08:13:30 +04:00
err = ip6_mr_init ( ) ;
if ( err )
goto ipmr_fail ;
2017-03-05 23:34:53 +03:00
err = icmpv6_init ( ) ;
if ( err )
goto icmp_fail ;
2008-02-29 22:13:15 +03:00
err = ndisc_init ( ) ;
2005-04-17 02:20:36 +04:00
if ( err )
goto ndisc_fail ;
2008-02-29 22:13:15 +03:00
err = igmp6_init ( ) ;
2005-04-17 02:20:36 +04:00
if ( err )
goto igmp_fail ;
2013-08-31 09:44:30 +04:00
2005-08-10 06:42:34 +04:00
err = ipv6_netfilter_init ( ) ;
if ( err )
goto netfilter_fail ;
2005-04-17 02:20:36 +04:00
/* Create /proc/foo6 entries. */
# ifdef CONFIG_PROC_FS
err = - ENOMEM ;
if ( raw6_proc_init ( ) )
goto proc_raw6_fail ;
2006-11-27 22:10:57 +03:00
if ( udplite6_proc_init ( ) )
goto proc_udplite6_fail ;
2005-04-17 02:20:36 +04:00
if ( ipv6_misc_proc_init ( ) )
goto proc_misc6_fail ;
if ( if6_proc_init ( ) )
goto proc_if6_fail ;
# endif
2007-12-07 11:44:29 +03:00
err = ip6_route_init ( ) ;
if ( err )
goto ip6_route_fail ;
2013-09-09 23:45:04 +04:00
err = ndisc_late_init ( ) ;
if ( err )
goto ndisc_late_fail ;
2007-12-11 13:23:18 +03:00
err = ip6_flowlabel_init ( ) ;
if ( err )
goto ip6_flowlabel_fail ;
2018-11-02 23:23:57 +03:00
err = ipv6_anycast_init ( ) ;
if ( err )
goto ipv6_anycast_fail ;
2005-04-17 02:20:36 +04:00
err = addrconf_init ( ) ;
if ( err )
goto addrconf_fail ;
/* Init v6 extension headers. */
2007-12-11 13:23:54 +03:00
err = ipv6_exthdrs_init ( ) ;
if ( err )
goto ipv6_exthdrs_fail ;
2007-12-11 13:24:29 +03:00
err = ipv6_frag_init ( ) ;
if ( err )
goto ipv6_frag_fail ;
2005-04-17 02:20:36 +04:00
/* Init v6 transport protocols. */
2007-12-11 13:25:35 +03:00
err = udpv6_init ( ) ;
if ( err )
goto udpv6_fail ;
2005-07-06 01:41:20 +04:00
2007-12-11 13:25:35 +03:00
err = udplitev6_init ( ) ;
if ( err )
goto udplitev6_fail ;
2016-04-05 18:22:51 +03:00
err = udpv6_offload_init ( ) ;
if ( err )
goto udpv6_offload_fail ;
2007-12-11 13:25:35 +03:00
err = tcpv6_init ( ) ;
if ( err )
goto tcpv6_fail ;
err = ipv6_packet_init ( ) ;
if ( err )
goto ipv6_packet_fail ;
2008-03-05 21:45:36 +03:00
2013-05-23 00:17:31 +04:00
err = pingv6_init ( ) ;
if ( err )
goto pingv6_fail ;
2016-06-27 22:02:46 +03:00
err = calipso_init ( ) ;
if ( err )
goto calipso_fail ;
2016-11-08 16:57:40 +03:00
err = seg6_init ( ) ;
if ( err )
goto seg6_fail ;
2020-03-28 01:00:22 +03:00
err = rpl_init ( ) ;
if ( err )
goto rpl_fail ;
2021-07-20 22:42:57 +03:00
err = ioam6_init ( ) ;
if ( err )
goto ioam6_fail ;
2017-03-28 21:49:16 +03:00
err = igmp6_late_init ( ) ;
if ( err )
goto igmp6_late_err ;
2008-03-05 21:45:36 +03:00
# ifdef CONFIG_SYSCTL
err = ipv6_sysctl_register ( ) ;
if ( err )
goto sysctl_fail ;
# endif
2017-04-24 15:18:28 +03:00
/* ensure that ipv6 stubs are visible only after ipv6 is ready */
wmb ( ) ;
ipv6_stub = & ipv6_stub_impl ;
2018-03-31 01:08:05 +03:00
ipv6_bpf_stub = & ipv6_bpf_stub_impl ;
2005-04-17 02:20:36 +04:00
out :
return err ;
2008-03-05 21:45:36 +03:00
# ifdef CONFIG_SYSCTL
sysctl_fail :
2017-03-28 21:49:16 +03:00
igmp6_late_cleanup ( ) ;
2008-03-05 21:45:36 +03:00
# endif
2017-03-28 21:49:16 +03:00
igmp6_late_err :
2021-07-20 22:42:57 +03:00
ioam6_exit ( ) ;
ioam6_fail :
2020-03-28 01:00:22 +03:00
rpl_exit ( ) ;
rpl_fail :
2017-03-28 21:49:16 +03:00
seg6_exit ( ) ;
2016-11-08 16:57:40 +03:00
seg6_fail :
calipso_exit ( ) ;
2016-06-27 22:02:46 +03:00
calipso_fail :
pingv6_exit ( ) ;
2013-05-23 00:17:31 +04:00
pingv6_fail :
2013-11-17 00:17:24 +04:00
ipv6_packet_cleanup ( ) ;
2007-12-11 13:25:35 +03:00
ipv6_packet_fail :
tcpv6_exit ( ) ;
tcpv6_fail :
2016-04-05 18:22:51 +03:00
udpv6_offload_exit ( ) ;
udpv6_offload_fail :
2007-12-11 13:25:35 +03:00
udplitev6_exit ( ) ;
udplitev6_fail :
udpv6_exit ( ) ;
udpv6_fail :
ipv6_frag_exit ( ) ;
2007-12-11 13:24:29 +03:00
ipv6_frag_fail :
ipv6_exthdrs_exit ( ) ;
2007-12-11 13:23:54 +03:00
ipv6_exthdrs_fail :
addrconf_cleanup ( ) ;
2005-04-17 02:20:36 +04:00
addrconf_fail :
2018-11-02 23:23:57 +03:00
ipv6_anycast_cleanup ( ) ;
ipv6_anycast_fail :
2005-04-17 02:20:36 +04:00
ip6_flowlabel_cleanup ( ) ;
2007-12-11 13:23:18 +03:00
ip6_flowlabel_fail :
2013-09-09 23:45:04 +04:00
ndisc_late_cleanup ( ) ;
ndisc_late_fail :
2005-04-17 02:20:36 +04:00
ip6_route_cleanup ( ) ;
2007-12-07 11:44:29 +03:00
ip6_route_fail :
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_PROC_FS
if6_proc_exit ( ) ;
proc_if6_fail :
ipv6_misc_proc_exit ( ) ;
proc_misc6_fail :
2006-11-27 22:10:57 +03:00
udplite6_proc_exit ( ) ;
proc_udplite6_fail :
2005-04-17 02:20:36 +04:00
raw6_proc_exit ( ) ;
proc_raw6_fail :
# endif
2005-08-10 06:42:34 +04:00
ipv6_netfilter_fini ( ) ;
netfilter_fail :
2005-04-17 02:20:36 +04:00
igmp6_cleanup ( ) ;
igmp_fail :
ndisc_cleanup ( ) ;
ndisc_fail :
2018-08-28 14:40:51 +03:00
icmpv6_cleanup ( ) ;
2005-04-17 02:20:36 +04:00
icmp_fail :
2018-08-28 14:40:51 +03:00
ip6_mr_cleanup ( ) ;
2017-03-05 23:34:53 +03:00
ipmr_fail :
2018-08-28 14:40:51 +03:00
unregister_pernet_subsys ( & inet6_net_ops ) ;
2008-01-10 13:48:33 +03:00
register_pernet_fail :
2005-11-12 02:05:47 +03:00
sock_unregister ( PF_INET6 ) ;
2007-12-07 11:44:29 +03:00
rtnl_unregister_all ( PF_INET6 ) ;
2007-12-11 13:25:35 +03:00
out_sock_register_fail :
rawv6_exit ( ) ;
2013-05-23 00:17:31 +04:00
out_unregister_ping_proto :
proto_unregister ( & pingv6_prot ) ;
2005-04-17 02:20:36 +04:00
out_unregister_raw_proto :
proto_unregister ( & rawv6_prot ) ;
2006-11-27 22:10:57 +03:00
out_unregister_udplite_proto :
proto_unregister ( & udplitev6_prot ) ;
2005-04-17 02:20:36 +04:00
out_unregister_udp_proto :
proto_unregister ( & udpv6_prot ) ;
out_unregister_tcp_proto :
proto_unregister ( & tcpv6_prot ) ;
goto out ;
}
module_init ( inet6_init ) ;
MODULE_ALIAS_NETPROTO ( PF_INET6 ) ;
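inet6_init() issues wmb() before assigning ipv6_stub and ipv6_bpf_stub, so that the stub pointers become visible only after the rest of IPv6 is set up. A rough userspace analogue of that publish step follows, assuming C11 atomics: a release store stands in for the barrier plus pointer assignment, and an acquire load stands in for the reader side. This is not how the kernel expresses it, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_stub {
	int (*route_input)(int);
};

int sketch_route_input(int x)
{
	return x + 1;
}

struct sketch_stub sketch_impl;
_Atomic(const struct sketch_stub *) sketch_stub_ptr;

/* Writer: finish initialization first, then publish the pointer.
 * The release store orders the earlier writes before the publish.
 */
void sketch_publish(void)
{
	sketch_impl.route_input = sketch_route_input;
	atomic_store_explicit(&sketch_stub_ptr, &sketch_impl,
			      memory_order_release);
}

/* Reader: returns -1 until the stub has been published. */
int sketch_call(int x)
{
	const struct sketch_stub *s =
		atomic_load_explicit(&sketch_stub_ptr, memory_order_acquire);

	return s ? s->route_input(x) : -1;
}
```

A reader that observes the published pointer is then guaranteed to observe the initialized function table behind it.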