License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to apply to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                              # files
   -----------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                          270
   GPL-2.0+ WITH Linux-syscall-note                         169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
   LGPL-2.1+ WITH Linux-syscall-note                         15
   GPL-1.0+ WITH Linux-syscall-note                          14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
   LGPL-2.0+ WITH Linux-syscall-note                          4
   LGPL-2.1 WITH Linux-syscall-note                           3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
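The decision rules above can be sketched roughly as follows. This is only an illustration: the enum, the function names and the convention that NULL or an empty string means "no license found" are assumptions made for the sketch, not part of the tooling that was actually used.

```c
#include <string.h>

/* Hypothetical outcome of comparing two scanner results for one file. */
enum verdict { USE_DEFAULT, USE_AGREED, NEEDS_MANUAL_REVIEW };

/*
 * a and b are the license identifiers reported by the two scanners.
 * NULL or "" stands for "no license traces found" (an assumption of
 * this sketch).
 */
static enum verdict reconcile(const char *a, const char *b)
{
	int a_none = !a || !*a;
	int b_none = !b || !*b;

	if (a_none && b_none)
		return USE_DEFAULT;		/* top level COPYING applies */
	if (!a_none && !b_none && strcmp(a, b) == 0)
		return USE_AGREED;		/* both scanners agree */
	return NEEDS_MANUAL_REVIEW;		/* any disagreement */
}

/* Identifier used when no license information was found in the file. */
static const char *default_identifier(const char *path)
{
	return strstr(path, "/uapi/") ?
		"GPL-2.0 WITH Linux-syscall-note" : "GPL-2.0";
}
```

Only the two unambiguous outcomes are decided automatically here; everything else is routed to a human, which mirrors the manual inspection step described above.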
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
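As an illustration of the "different comment types" point, a minimal sketch of choosing the tag line by file type might look like this. The helper name spdx_line() is hypothetical and the real script handled more file types; the sketch only covers .c files (C++ style comment) and .h files (C style comment).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a given file.
 * Returns the length written, or -1 for file types this sketch does
 * not handle.
 */
static int spdx_line(const char *path, const char *id, char *out, size_t outlen)
{
	const char *ext = strrchr(path, '.');

	if (ext && strcmp(ext, ".c") == 0)
		return snprintf(out, outlen, "// SPDX-License-Identifier: %s", id);
	if (ext && strcmp(ext, ".h") == 0)
		return snprintf(out, outlen, "/* SPDX-License-Identifier: %s */", id);
	return -1;
}
```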
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
// SPDX-License-Identifier: GPL-2.0
2005-04-17 02:20:36 +04:00
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		The IP to API glue.
2007-02-09 17:24:47 +03:00
*
2005-04-17 02:20:36 +04:00
 * Authors:	see ip.c
 *
 * Fixes:
 *		Many		:	Split from ip.c, see ip.c for history.
 *		Martin Mares	:	TOS setting fixed.
2007-02-09 17:24:47 +03:00
 *		Alan Cox	:	Fixed a couple of oopses in Martin's
2005-04-17 02:20:36 +04:00
 *			TOS tweaks.
 *		Mike McLagan	:	Routing by source
 */
# include <linux/module.h>
# include <linux/types.h>
# include <linux/mm.h>
# include <linux/skbuff.h>
# include <linux/ip.h>
# include <linux/icmp.h>
2005-12-27 07:43:12 +03:00
# include <linux/inetdevice.h>
2005-04-17 02:20:36 +04:00
# include <linux/netdevice.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
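The first step, deciding which header a file needs, can be sketched as below. This is not the logic of slabh-sweep.py; it is a deliberately naive substring check, and the token lists and names are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: which include a source file needs. */
enum needed_include { NEED_NONE, NEED_GFP_H, NEED_SLAB_H };

/*
 * Naive substring heuristic. slab usage wins because slab.h pulls in
 * gfp.h, so a file using both only needs slab.h.
 */
static enum needed_include needed_include(const char *src)
{
	static const char *const slab_tokens[] = {
		"kmalloc(", "kzalloc(", "kfree(", "kmem_cache_",
	};
	static const char *const gfp_tokens[] = {
		"GFP_KERNEL", "GFP_ATOMIC", "alloc_pages(",
	};
	size_t i;

	for (i = 0; i < sizeof(slab_tokens) / sizeof(slab_tokens[0]); i++)
		if (strstr(src, slab_tokens[i]))
			return NEED_SLAB_H;
	for (i = 0; i < sizeof(gfp_tokens) / sizeof(gfp_tokens[0]); i++)
		if (strstr(src, gfp_tokens[i]))
			return NEED_GFP_H;
	return NEED_NONE;
}
```

A real tool would have to tokenize the source rather than match substrings, since comments, strings and similarly named functions would otherwise cause false positives.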
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2005-04-17 02:20:36 +04:00
# include <net/sock.h>
# include <net/ip.h>
# include <net/icmp.h>
2005-12-14 10:26:10 +03:00
# include <net/tcp_states.h>
2005-04-17 02:20:36 +04:00
# include <linux/udp.h>
# include <linux/igmp.h>
# include <linux/netfilter.h>
# include <linux/route.h>
# include <linux/mroute.h>
2011-10-22 08:07:47 +04:00
# include <net/inet_ecn.h>
2005-04-17 02:20:36 +04:00
# include <net/route.h>
# include <net/xfrm.h>
2008-04-27 12:06:07 +04:00
# include <net/compat.h>
2015-01-06 00:56:17 +03:00
# include <net/checksum.h>
2011-12-10 13:48:31 +04:00
# if IS_ENABLED(CONFIG_IPV6)
2005-04-17 02:20:36 +04:00
# include <net/transp_v6.h>
# endif
2012-06-28 14:59:11 +04:00
# include <net/ip_fib.h>
2005-04-17 02:20:36 +04:00
# include <linux/errqueue.h>
2016-12-24 22:46:01 +03:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
2018-05-22 05:22:30 +03:00
# include <linux/bpfilter.h>
2005-04-17 02:20:36 +04:00
/*
 *	SOL_IP control messages.
 */
static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
{
ipv4: PKTINFO doesnt need dst reference
On Monday, 7 November 2011 at 15:33 +0100, Eric Dumazet wrote:
> At least, in recent kernels we don't change dst->refcnt in the forwarding
> path (using NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoids the atomic_{inc|dec}(dst->refcnt).
>
OK I found it, I did some extra tests and believe it's ready.
[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.
We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.
We can instead store the needed information in skb->cb[], so that only
the softirq handler really accesses dst, improving cache hit ratios.
This removes two atomic operations per packet, and false sharing as
well.
On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.
IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.
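For context, the user space side of IP_PKTINFO looks roughly like the sketch below. The helper names enable_pktinfo() and recv_with_pktinfo() are hypothetical; the socket calls and the CMSG_* macros are the standard ones. IPPROTO_IP is used here, which has the same value as SOL_IP on Linux.

```c
#define _GNU_SOURCE		/* for struct in_pktinfo on glibc */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask the kernel to attach an IP_PKTINFO control message to each datagram. */
static int enable_pktinfo(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/*
 * Receive one datagram and copy its in_pktinfo into *pi.
 * Returns the payload length, or -1 on error or if no IP_PKTINFO
 * control message was delivered.
 */
static ssize_t recv_with_pktinfo(int fd, void *buf, size_t len,
				 struct in_pktinfo *pi)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
		struct cmsghdr align;	/* forces suitable alignment */
	} ctl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctl.buf,
		.msg_controllen = sizeof(ctl.buf),
	};
	struct cmsghdr *cmsg;
	ssize_t n = recvmsg(fd, &msg, 0);

	if (n < 0)
		return -1;
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_IP &&
		    cmsg->cmsg_type == IP_PKTINFO) {
			memcpy(pi, CMSG_DATA(cmsg), sizeof(*pi));
			return n;
		}
	}
	return -1;
}
```

A multihomed UDP server would typically use pi->ipi_addr (the destination address of the received packet) to pick the source address of its reply.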
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09 11:24:35 +04:00
	struct in_pktinfo info = *PKTINFO_SKB_CB(skb);
2005-04-17 02:20:36 +04:00
2007-04-21 09:47:35 +04:00
	info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
2005-04-17 02:20:36 +04:00
	put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
}
static void ip_cmsg_recv_ttl(struct msghdr *msg, struct sk_buff *skb)
{
2007-04-21 09:47:35 +04:00
	int ttl = ip_hdr(skb)->ttl;
2005-04-17 02:20:36 +04:00
	put_cmsg(msg, SOL_IP, IP_TTL, sizeof(int), &ttl);
}
static void ip_cmsg_recv_tos(struct msghdr *msg, struct sk_buff *skb)
{
2007-04-21 09:47:35 +04:00
	put_cmsg(msg, SOL_IP, IP_TOS, 1, &ip_hdr(skb)->tos);
2005-04-17 02:20:36 +04:00
}
static void ip_cmsg_recv_opts(struct msghdr *msg, struct sk_buff *skb)
{
	if (IPCB(skb)->opt.optlen == 0)
		return;
2007-04-21 09:47:35 +04:00
	put_cmsg(msg, SOL_IP, IP_RECVOPTS, IPCB(skb)->opt.optlen,
		 ip_hdr(skb) + 1);
2005-04-17 02:20:36 +04:00
}
2017-08-03 19:07:06 +03:00
static void ip_cmsg_recv_retopts(struct net *net, struct msghdr *msg,
				 struct sk_buff *skb)
2005-04-17 02:20:36 +04:00
{
	unsigned char optbuf[sizeof(struct ip_options) + 40];
2012-04-15 05:34:41 +04:00
	struct ip_options *opt = (struct ip_options *)optbuf;
2005-04-17 02:20:36 +04:00
	if (IPCB(skb)->opt.optlen == 0)
		return;
2017-08-03 19:07:06 +03:00
	if (ip_options_echo(net, opt, skb)) {
2005-04-17 02:20:36 +04:00
		msg->msg_flags |= MSG_CTRUNC;
		return;
	}
	ip_options_undo(opt);
	put_cmsg(msg, SOL_IP, IP_RETOPTS, opt->optlen, opt->__data);
}
2016-11-02 18:02:16 +03:00
static void ip_cmsg_recv_fragsize(struct msghdr *msg, struct sk_buff *skb)
{
	int val;
	if (IPCB(skb)->frag_max_size == 0)
		return;
	val = IPCB(skb)->frag_max_size;
	put_cmsg(msg, SOL_IP, IP_RECVFRAGSIZE, sizeof(val), &val);
}
2015-01-06 00:56:17 +03:00
static void ip_cmsg_recv_checksum(struct msghdr *msg, struct sk_buff *skb,
2016-10-24 04:03:06 +03:00
				  int tlen, int offset)
2015-01-06 00:56:17 +03:00
{
	__wsum csum = skb->csum;
	if (skb->ip_summed != CHECKSUM_COMPLETE)
		return;
2017-02-21 11:33:18 +03:00
	if (offset != 0) {
		int tend_off = skb_transport_offset(skb) + tlen;
		csum = csum_sub(csum, skb_checksum(skb, tend_off, offset, 0));
	}
2015-01-06 00:56:17 +03:00
	put_cmsg(msg, SOL_IP, IP_CHECKSUM, sizeof(__wsum), &csum);
}
[SECURITY]: TCP/UDP getpeersec
This patch implements an application of the LSM-IPSec networking
controls whereby an application can determine the label of the
security association its TCP or UDP sockets are currently connected to
via getsockopt and the auxiliary data mechanism of recvmsg.
Patch purpose:
This patch enables a security-aware application to retrieve the
security context of an IPSec security association a particular TCP or
UDP socket is using. The application can then use this security
context to determine the security context for processing on behalf of
the peer at the other end of this connection. In the case of UDP, the
security context is for each individual packet. An example
application is the inetd daemon, which could be modified to start
daemons running at security contexts dependent on the remote client.
Patch design approach:
- Design for TCP
The patch enables the SELinux LSM to set the peer security context for
a socket based on the security context of the IPSec security
association. The application may retrieve this context using
getsockopt. When called, the kernel determines if the socket is a
connected (TCP_ESTABLISHED) TCP socket and, if so, uses the dst_entry
cache on the socket to retrieve the security associations. If a
security association has a security context, the context string is
returned, as for UNIX domain sockets.
- Design for UDP
Unlike TCP, UDP is connectionless. This requires a somewhat different
API to retrieve the peer security context. With TCP, the peer
security context stays the same throughout the connection, thus it can
be retrieved at any time between when the connection is established
and when it is torn down. With UDP, each read/write can have a
different peer and thus the security context might change every time.
As a result the security context retrieval must be done TOGETHER with
the packet retrieval.
The solution is to build upon the existing Unix domain socket API for
retrieving user credentials. Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).
Patch implementation details:
- Implementation for TCP
The security context can be retrieved by applications using getsockopt
with the existing SO_PEERSEC flag. As an example (ignoring error
checking):
getsockopt(sockfd, SOL_SOCKET, SO_PEERSEC, optbuf, &optlen);
printf("Socket peer context is: %s\n", optbuf);
The SELinux function, selinux_socket_getpeersec, is extended to check
for labeled security associations for connected (TCP_ESTABLISHED ==
sk->sk_state) TCP sockets only. If so, the socket has a dst_cache of
struct dst_entry values that may refer to security associations. If
these have security associations with security contexts, the security
context is returned.
getsockopt returns a buffer that contains a security context string or
the buffer is unmodified.
- Implementation for UDP
To retrieve the security context, the application first indicates to
the kernel such desire by setting the IP_PASSSEC option via
setsockopt. Then the application retrieves the security context using
the auxiliary data mechanism.
An example server application for UDP should look like this:
toggle = 1;
toggle_len = sizeof(toggle);
setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
cmsg_hdr->cmsg_level == SOL_IP &&
cmsg_hdr->cmsg_type == SCM_SECURITY) {
memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
}
}
ip_setsockopt is enhanced with a new socket option IP_PASSSEC to allow
a server socket to receive the security context of the peer. A new
ancillary message type, SCM_SECURITY, is added as well.
When the packet is received we get the security context from the
sec_path pointer which is contained in the sk_buff, and copy it to the
ancillary message space. An additional LSM hook,
selinux_socket_getpeersec_udp, is defined to retrieve the security
context from the SELinux space. The existing function,
selinux_socket_getpeersec does not suit our purpose, because the
security context is copied directly to user space, rather than to
kernel space.
Testing:
We have tested the patch by setting up TCP and UDP connections between
applications on two machines using the IPSec policies that result in
labeled security associations being built. For TCP, we can then
extract the peer security context using getsockopt on either end. For
UDP, the receiving end can retrieve the security context using the
auxiliary data mechanism of recvmsg.
Signed-off-by: Catherine Zhang <cxzhang@watson.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-21 09:41:23 +03:00
static void ip_cmsg_recv_security(struct msghdr *msg, struct sk_buff *skb)
{
	char *secdata;
2006-08-03 01:12:06 +04:00
	u32 seclen, secid;
2006-03-21 09:41:23 +03:00
	int err;
2006-08-03 01:12:06 +04:00
	err = security_socket_getpeersec_dgram(NULL, skb, &secid);
	if (err)
		return;
	err = security_secid_to_secctx(secid, &secdata, &seclen);
2006-03-21 09:41:23 +03:00
	if (err)
		return;
	put_cmsg(msg, SOL_IP, SCM_SECURITY, seclen, secdata);
2006-08-03 01:12:06 +04:00
	security_release_secctx(secdata, seclen);
[SECURITY]: TCP/UDP getpeersec
This patch implements an application of the LSM-IPSec networking
controls whereby an application can determine the label of the
security association its TCP or UDP sockets are currently connected to
via getsockopt and the auxiliary data mechanism of recvmsg.
Patch purpose:
This patch enables a security-aware application to retrieve the
security context of an IPSec security association a particular TCP or
UDP socket is using. The application can then use this security
context to determine the security context for processing on behalf of
the peer at the other end of this connection. In the case of UDP, the
security context is for each individual packet. An example
application is the inetd daemon, which could be modified to start
daemons running at security contexts dependent on the remote client.
Patch design approach:
- Design for TCP
The patch enables the SELinux LSM to set the peer security context for
a socket based on the security context of the IPSec security
association. The application may retrieve this context using
getsockopt. When called, the kernel determines if the socket is a
connected (TCP_ESTABLISHED) TCP socket and, if so, uses the dst_entry
cache on the socket to retrieve the security associations. If a
security association has a security context, the context string is
returned, as for UNIX domain sockets.
- Design for UDP
Unlike TCP, UDP is connectionless. This requires a somewhat different
API to retrieve the peer security context. With TCP, the peer
security context stays the same throughout the connection, thus it can
be retrieved at any time between when the connection is established
and when it is torn down. With UDP, each read/write can have a
different peer, and thus the security context might change every time.
As a result the security context retrieval must be done TOGETHER with
the packet retrieval.
The solution is to build upon the existing Unix domain socket API for
retrieving user credentials. Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).
Patch implementation details:
- Implementation for TCP
The security context can be retrieved by applications using getsockopt
with the existing SO_PEERSEC flag. As an example (ignoring error
checking):
getsockopt(sockfd, SOL_SOCKET, SO_PEERSEC, optbuf, &optlen);
printf("Socket peer context is: %s\n", optbuf);
The SELinux function, selinux_socket_getpeersec, is extended to check
for labeled security associations for connected (TCP_ESTABLISHED ==
sk->sk_state) TCP sockets only. If so, the socket has a dst_cache of
struct dst_entry values that may refer to security associations. If
these have security associations with security contexts, the security
context is returned.
getsockopt returns a buffer that contains a security context string or
the buffer is unmodified.
- Implementation for UDP
To retrieve the security context, the application first tells the
kernel that it wants it by setting the IP_PASSSEC option via
setsockopt. Then the application retrieves the security context using
the auxiliary data mechanism.
An example server application for UDP should look like this:
toggle = 1;
toggle_len = sizeof(toggle);
setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
cmsg_hdr->cmsg_level == SOL_IP &&
cmsg_hdr->cmsg_type == SCM_SECURITY) {
memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
}
}
ip_setsockopt is enhanced with a new socket option, IP_PASSSEC, to allow
a server socket to receive the security context of the peer. A new
ancillary message type, SCM_SECURITY, is added to carry it.
When the packet is received we get the security context from the
sec_path pointer which is contained in the sk_buff, and copy it to the
ancillary message space. An additional LSM hook,
selinux_socket_getpeersec_udp, is defined to retrieve the security
context from the SELinux space. The existing function,
selinux_socket_getpeersec, does not suit our purpose, because the
security context is copied directly to user space, rather than to
kernel space.
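The receive side of the UDP flow can be sketched a little more defensively by walking every control message instead of looking only at the first one. The helper names enable_passsec and extract_peer_context are hypothetical, and the fallback values for IP_PASSSEC and SCM_SECURITY are assumptions taken from the Linux headers for C libraries that do not expose them.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_IP
#define SOL_IP 0		/* assumption: same value as IPPROTO_IP */
#endif
#ifndef IP_PASSSEC
#define IP_PASSSEC 18		/* assumption: value used by Linux */
#endif
#ifndef SCM_SECURITY
#define SCM_SECURITY 0x03	/* assumption: value used by Linux */
#endif

/* Ask the kernel to attach the peer security context to each datagram.
 * Note that setsockopt takes the option length by value. */
static int enable_passsec(int sockfd)
{
	int toggle = 1;

	return setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, sizeof(toggle));
}

/* Copy the SCM_SECURITY payload of a received message into out and
 * NUL-terminate it. Returns the payload length, or -1 if no such
 * control message is present or it does not fit. */
static int extract_peer_context(struct msghdr *msg, char *out, size_t outlen)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL && out != NULL);
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		size_t len;

		if (cmsg->cmsg_level != SOL_IP ||
		    cmsg->cmsg_type != SCM_SECURITY)
			continue;
		if (cmsg->cmsg_len < CMSG_LEN(0))
			return -1;
		len = cmsg->cmsg_len - CMSG_LEN(0);
		if (len >= outlen)
			return -1;
		memcpy(out, CMSG_DATA(cmsg), len);
		out[len] = '\0';
		return (int)len;
	}
	return -1;
}
```

A server would call enable_passsec once after creating the socket, then call extract_peer_context on the msghdr filled in by each recvmsg.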
Testing:
We have tested the patch by setting up TCP and UDP connections between
applications on two machines using the IPSec policies that result in
labeled security associations being built. For TCP, we can then
extract the peer security context using getsockopt on either end. For
UDP, the receiving end can retrieve the security context using the
auxiliary data mechanism of recvmsg.
Signed-off-by: Catherine Zhang <cxzhang@watson.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-21 09:41:23 +03:00
}
2008-11-20 12:54:27 +03:00
static void ip_cmsg_recv_dstaddr ( struct msghdr * msg , struct sk_buff * skb )
2008-11-17 06:32:39 +03:00
{
2019-01-08 00:47:33 +03:00
__be16 _ports [ 2 ] , * ports ;
2008-11-17 06:32:39 +03:00
struct sockaddr_in sin ;
/* All current transport protocols have the port numbers in the
* first four bytes of the transport header and this function is
* written with this assumption in mind .
*/
2019-01-08 00:47:33 +03:00
ports = skb_header_pointer ( skb , skb_transport_offset ( skb ) ,
sizeof ( _ports ) , & _ports ) ;
if ( ! ports )
return ;
2008-11-17 06:32:39 +03:00
sin . sin_family = AF_INET ;
2018-09-30 21:33:39 +03:00
sin . sin_addr . s_addr = ip_hdr ( skb ) - > daddr ;
2008-11-17 06:32:39 +03:00
sin . sin_port = ports [ 1 ] ;
memset ( sin . sin_zero , 0 , sizeof ( sin . sin_zero ) ) ;
put_cmsg ( msg , SOL_IP , IP_ORIGDSTADDR , sizeof ( sin ) , & sin ) ;
}
2005-04-17 02:20:36 +04:00
2016-11-04 13:28:58 +03:00
void ip_cmsg_recv_offset ( struct msghdr * msg , struct sock * sk ,
struct sk_buff * skb , int tlen , int offset )
2005-04-17 02:20:36 +04:00
{
2023-08-16 11:15:33 +03:00
unsigned long flags = inet_cmsg_flags ( inet_sk ( sk ) ) ;
if ( ! flags )
return ;
2005-04-17 02:20:36 +04:00
/* Ordered by supposed usage frequency */
2015-01-06 00:56:15 +03:00
if ( flags & IP_CMSG_PKTINFO ) {
2005-04-17 02:20:36 +04:00
ip_cmsg_recv_pktinfo ( msg , skb ) ;
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_PKTINFO ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_TTL ) {
2005-04-17 02:20:36 +04:00
ip_cmsg_recv_ttl ( msg , skb ) ;
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_TTL ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_TOS ) {
2005-04-17 02:20:36 +04:00
ip_cmsg_recv_tos ( msg , skb ) ;
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_TOS ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_RECVOPTS ) {
2005-04-17 02:20:36 +04:00
ip_cmsg_recv_opts ( msg , skb ) ;
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_RECVOPTS ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_RETOPTS ) {
2017-08-03 19:07:06 +03:00
ip_cmsg_recv_retopts ( sock_net ( sk ) , msg , skb ) ;
2006-03-21 09:41:23 +03:00
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_RETOPTS ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_PASSSEC ) {
2006-03-21 09:41:23 +03:00
ip_cmsg_recv_security ( msg , skb ) ;
2008-11-17 06:32:39 +03:00
2015-01-06 00:56:15 +03:00
flags & = ~ IP_CMSG_PASSSEC ;
if ( ! flags )
return ;
}
2015-01-06 00:56:17 +03:00
if ( flags & IP_CMSG_ORIGDSTADDR ) {
2008-11-17 06:32:39 +03:00
ip_cmsg_recv_dstaddr ( msg , skb ) ;
2015-01-06 00:56:17 +03:00
flags & = ~ IP_CMSG_ORIGDSTADDR ;
if ( ! flags )
return ;
}
if ( flags & IP_CMSG_CHECKSUM )
2016-10-24 04:03:06 +03:00
ip_cmsg_recv_checksum ( msg , skb , tlen , offset ) ;
2016-11-02 18:02:16 +03:00
if ( flags & IP_CMSG_RECVFRAGSIZE )
ip_cmsg_recv_fragsize ( msg , skb ) ;
2005-04-17 02:20:36 +04:00
}
2015-01-06 00:56:16 +03:00
EXPORT_SYMBOL ( ip_cmsg_recv_offset ) ;
2005-04-17 02:20:36 +04:00
2016-04-03 06:08:10 +03:00
int ip_cmsg_send ( struct sock * sk , struct msghdr * msg , struct ipcm_cookie * ipc ,
2014-02-19 00:38:08 +04:00
bool allow_ipv6 )
2005-04-17 02:20:36 +04:00
{
2013-09-24 17:43:08 +04:00
int err , val ;
2005-04-17 02:20:36 +04:00
struct cmsghdr * cmsg ;
2016-04-03 06:08:10 +03:00
struct net * net = sock_net ( sk ) ;
2005-04-17 02:20:36 +04:00
2014-12-11 06:22:04 +03:00
for_each_cmsghdr ( cmsg , msg ) {
2005-04-17 02:20:36 +04:00
if ( ! CMSG_OK ( msg , cmsg ) )
return - EINVAL ;
2014-11-11 04:54:25 +03:00
# if IS_ENABLED(CONFIG_IPV6)
2014-02-19 00:38:08 +04:00
if ( allow_ipv6 & &
cmsg - > cmsg_level = = SOL_IPV6 & &
cmsg - > cmsg_type = = IPV6_PKTINFO ) {
struct in6_pktinfo * src_info ;
if ( cmsg - > cmsg_len < CMSG_LEN ( sizeof ( * src_info ) ) )
return - EINVAL ;
src_info = ( struct in6_pktinfo * ) CMSG_DATA ( cmsg ) ;
if ( ! ipv6_addr_v4mapped ( & src_info - > ipi6_addr ) )
return - EINVAL ;
2018-02-16 22:03:03 +03:00
if ( src_info - > ipi6_ifindex )
ipc - > oif = src_info - > ipi6_ifindex ;
2014-02-19 00:38:08 +04:00
ipc - > addr = src_info - > ipi6_addr . s6_addr32 [ 3 ] ;
continue ;
}
# endif
2016-04-03 06:08:10 +03:00
if ( cmsg - > cmsg_level = = SOL_SOCKET ) {
2022-10-20 09:54:41 +03:00
err = __sock_cmsg_send ( sk , cmsg , & ipc - > sockc ) ;
2016-05-13 16:14:37 +03:00
if ( err )
return err ;
2016-04-03 06:08:10 +03:00
continue ;
}
2005-04-17 02:20:36 +04:00
if ( cmsg - > cmsg_level ! = SOL_IP )
continue ;
switch ( cmsg - > cmsg_type ) {
case IP_RETOPTS :
2017-01-03 15:42:17 +03:00
err = cmsg - > cmsg_len - sizeof ( struct cmsghdr ) ;
2016-02-04 17:23:28 +03:00
/* Our caller is responsible for freeing ipc->opt */
2020-07-23 09:08:57 +03:00
err = ip_options_get ( net , & ipc - > opt ,
KERNEL_SOCKPTR ( CMSG_DATA ( cmsg ) ) ,
2009-06-02 11:42:16 +04:00
err < 40 ? err : 40 ) ;
2005-04-17 02:20:36 +04:00
if ( err )
return err ;
break ;
case IP_PKTINFO :
{
struct in_pktinfo * info ;
if ( cmsg - > cmsg_len ! = CMSG_LEN ( sizeof ( struct in_pktinfo ) ) )
return - EINVAL ;
info = ( struct in_pktinfo * ) CMSG_DATA ( cmsg ) ;
2018-02-16 22:03:03 +03:00
if ( info - > ipi_ifindex )
ipc - > oif = info - > ipi_ifindex ;
2005-04-17 02:20:36 +04:00
ipc - > addr = info - > ipi_spec_dst . s_addr ;
break ;
}
2013-09-24 17:43:08 +04:00
case IP_TTL :
if ( cmsg - > cmsg_len ! = CMSG_LEN ( sizeof ( int ) ) )
return - EINVAL ;
val = * ( int * ) CMSG_DATA ( cmsg ) ;
if ( val < 1 | | val > 255 )
return - EINVAL ;
ipc - > ttl = val ;
break ;
case IP_TOS :
2016-09-08 07:52:56 +03:00
if ( cmsg - > cmsg_len = = CMSG_LEN ( sizeof ( int ) ) )
val = * ( int * ) CMSG_DATA ( cmsg ) ;
else if ( cmsg - > cmsg_len = = CMSG_LEN ( sizeof ( u8 ) ) )
val = * ( u8 * ) CMSG_DATA ( cmsg ) ;
else
2013-09-24 17:43:08 +04:00
return - EINVAL ;
if ( val < 0 | | val > 255 )
return - EINVAL ;
ipc - > tos = val ;
ipc - > priority = rt_tos2priority ( ipc - > tos ) ;
break ;
2023-05-22 15:08:20 +03:00
case IP_PROTOCOL :
if ( cmsg - > cmsg_len ! = CMSG_LEN ( sizeof ( int ) ) )
return - EINVAL ;
val = * ( int * ) CMSG_DATA ( cmsg ) ;
if ( val < 1 | | val > 255 )
return - EINVAL ;
ipc - > protocol = val ;
break ;
2005-04-17 02:20:36 +04:00
default :
return - EINVAL ;
}
}
return 0 ;
}
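From user space, the IP_TTL case handled above corresponds to a sendmsg control message whose length is exactly CMSG_LEN(sizeof(int)) and whose value lies in 1..255. The following is a minimal sketch of building such a message; set_ttl_cmsg is a hypothetical helper name, and the SOL_IP fallback is an assumption for C libraries that do not define it.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_IP
#define SOL_IP 0		/* assumption: same value as IPPROTO_IP */
#endif

/* Attach a single IP_TTL control message to msg, using ctrl as the
 * control buffer. Returns 0 on success, or -1 if the TTL is outside
 * 1..255 or the buffer is too small. */
static int set_ttl_cmsg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
			int ttl)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL && ctrl != NULL);
	if (ttl < 1 || ttl > 255 || ctrl_len < CMSG_SPACE(sizeof(int)))
		return -1;

	memset(ctrl, 0, ctrl_len);
	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(int));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_IP;
	cmsg->cmsg_type = IP_TTL;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));	/* exact length is required */
	memcpy(CMSG_DATA(cmsg), &ttl, sizeof(ttl));
	return 0;
}
```

The caller fills in msg_name and msg_iov as usual and passes the msghdr to sendmsg; the control buffer should be suitably aligned for struct cmsghdr.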
2010-06-09 20:21:07 +04:00
static void ip_ra_destroy_rcu ( struct rcu_head * head )
2010-06-07 07:12:08 +04:00
{
2010-06-09 20:21:07 +04:00
struct ip_ra_chain * ra = container_of ( head , struct ip_ra_chain , rcu ) ;
sock_put ( ra - > saved_sk ) ;
kfree ( ra ) ;
2010-06-07 07:12:08 +04:00
}
2005-04-17 02:20:36 +04:00
2009-06-02 11:42:16 +04:00
int ip_ra_control ( struct sock * sk , unsigned char on ,
void ( * destructor ) ( struct sock * ) )
2005-04-17 02:20:36 +04:00
{
2010-10-25 07:32:44 +04:00
struct ip_ra_chain * ra , * new_ra ;
struct ip_ra_chain __rcu * * rap ;
2018-03-22 12:45:32 +03:00
struct net * net = sock_net ( sk ) ;
2005-04-17 02:20:36 +04:00
2009-10-15 10:30:45 +04:00
if ( sk - > sk_type ! = SOCK_RAW | | inet_sk ( sk ) - > inet_num = = IPPROTO_RAW )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
new_ra = on ? kmalloc ( sizeof ( * new_ra ) , GFP_KERNEL ) : NULL ;
2019-05-24 06:24:26 +03:00
if ( on & & ! new_ra )
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2018-03-22 12:45:40 +03:00
mutex_lock ( & net - > ipv4 . ra_mutex ) ;
2018-03-22 12:45:32 +03:00
for ( rap = & net - > ipv4 . ra_chain ;
2018-03-22 12:45:02 +03:00
( ra = rcu_dereference_protected ( * rap ,
2018-03-22 12:45:40 +03:00
lockdep_is_held ( & net - > ipv4 . ra_mutex ) ) ) ! = NULL ;
2010-10-25 07:32:44 +04:00
rap = & ra - > next ) {
2005-04-17 02:20:36 +04:00
if ( ra - > sk = = sk ) {
if ( on ) {
2018-03-22 12:45:40 +03:00
mutex_unlock ( & net - > ipv4 . ra_mutex ) ;
2005-11-08 20:41:34 +03:00
kfree ( new_ra ) ;
2005-04-17 02:20:36 +04:00
return - EADDRINUSE ;
}
2010-06-09 20:21:07 +04:00
/* don't let ip_call_ra_chain() use sk again */
ra - > sk = NULL ;
2014-09-09 19:11:41 +04:00
RCU_INIT_POINTER ( * rap , ra - > next ) ;
2018-03-22 12:45:40 +03:00
mutex_unlock ( & net - > ipv4 . ra_mutex ) ;
2005-04-17 02:20:36 +04:00
if ( ra - > destructor )
ra - > destructor ( sk ) ;
2010-06-09 20:21:07 +04:00
/*
* Delay sock_put ( sk ) and kfree ( ra ) until after one rcu grace
* period . This guarantees that ip_call_ra_chain ( ) does not need
* to mess with socket refcounts .
*/
ra - > saved_sk = sk ;
call_rcu ( & ra - > rcu , ip_ra_destroy_rcu ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
}
2018-03-22 12:45:02 +03:00
if ( ! new_ra ) {
2018-03-22 12:45:40 +03:00
mutex_unlock ( & net - > ipv4 . ra_mutex ) ;
2005-04-17 02:20:36 +04:00
return - ENOBUFS ;
2018-03-22 12:45:02 +03:00
}
2005-04-17 02:20:36 +04:00
new_ra - > sk = sk ;
new_ra - > destructor = destructor ;
2014-09-09 19:11:41 +04:00
RCU_INIT_POINTER ( new_ra - > next , ra ) ;
2010-06-07 07:12:08 +04:00
rcu_assign_pointer ( * rap , new_ra ) ;
2005-04-17 02:20:36 +04:00
sock_hold ( sk ) ;
2018-03-22 12:45:40 +03:00
mutex_unlock ( & net - > ipv4 . ra_mutex ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
2020-07-24 16:03:09 +03:00
static void ipv4_icmp_error_rfc4884 ( const struct sk_buff * skb ,
struct sock_ee_data_rfc4884 * out )
{
switch ( icmp_hdr ( skb ) - > type ) {
case ICMP_DEST_UNREACH :
case ICMP_TIME_EXCEEDED :
case ICMP_PARAMETERPROB :
ip_icmp_error_rfc4884 ( skb , out , sizeof ( struct icmphdr ) ,
icmp_hdr ( skb ) - > un . reserved [ 1 ] * 4 ) ;
}
}
2007-02-09 17:24:47 +03:00
void ip_icmp_error ( struct sock * sk , struct sk_buff * skb , int err ,
2006-09-28 05:34:21 +04:00
__be16 port , u32 info , u8 * payload )
2005-04-17 02:20:36 +04:00
{
struct sock_exterr_skb * serr ;
skb = skb_clone ( skb , GFP_ATOMIC ) ;
if ( ! skb )
return ;
2007-02-09 17:24:47 +03:00
serr = SKB_EXT_ERR ( skb ) ;
2005-04-17 02:20:36 +04:00
serr - > ee . ee_errno = err ;
serr - > ee . ee_origin = SO_EE_ORIGIN_ICMP ;
2007-03-13 20:43:18 +03:00
serr - > ee . ee_type = icmp_hdr ( skb ) - > type ;
serr - > ee . ee_code = icmp_hdr ( skb ) - > code ;
2005-04-17 02:20:36 +04:00
serr - > ee . ee_pad = 0 ;
serr - > ee . ee_info = info ;
serr - > ee . ee_data = 0 ;
2007-03-13 20:43:18 +03:00
serr - > addr_offset = ( u8 * ) & ( ( ( struct iphdr * ) ( icmp_hdr ( skb ) + 1 ) ) - > daddr ) -
2007-04-11 07:50:43 +04:00
skb_network_header ( skb ) ;
2005-04-17 02:20:36 +04:00
serr - > port = port ;
2015-04-03 11:17:27 +03:00
if ( skb_pull ( skb , payload - skb - > data ) ) {
2023-08-16 11:15:36 +03:00
if ( inet_test_bit ( RECVERR_RFC4884 , sk ) )
2020-07-24 16:03:09 +03:00
ipv4_icmp_error_rfc4884 ( skb , & serr - > ee . ee_rfc4884 ) ;
icmp: support rfc 4884
Add setsockopt SOL_IP/IP_RECVERR_RFC4884 to return the offset to an
extension struct if present.
ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior. It stores the offset
in words to the extension header in u8 icmphdr.un.reserved[1].
The field is valid only for ICMP types destination unreachable, time
exceeded and parameter problem, if length is at least 128 bytes and
entire packet does not exceed 576 bytes.
Return the offset to the start of the extension struct when reading an
ICMP error from the error queue, if it matches the above constraints.
Do not return the raw u8 field. Return the offset from the start of
the user buffer, in bytes. The kernel does not return the network and
transport headers, so subtract those.
Also validate the headers. Return the offset regardless of validation,
as an invalid extension must still not be misinterpreted as part of
the original datagram. Note that !invalid does not imply valid. If
the extension version does not match, no validation can take place,
for instance.
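The offset arithmetic described above can be illustrated with a simplified sketch. It is not the kernel's implementation: the function name rfc4884_user_offset and its parameters are hypothetical, stripped_hdr_len stands for the network and transport headers that are not returned to the user, and the 576-byte total-size constraint is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert the RFC 4884 length field (in 32-bit words) into a byte
 * offset from the start of the user buffer. Returns 0 when the field
 * does not describe a usable extension. */
static size_t rfc4884_user_offset(uint8_t len_words, size_t stripped_hdr_len,
				  size_t user_len)
{
	size_t off = (size_t)len_words * 4;	/* words to bytes */

	/* The original datagram must be padded to at least 128 bytes. */
	if (off < 128 || off < stripped_hdr_len)
		return 0;

	/* Headers the kernel does not return are subtracted. */
	off -= stripped_hdr_len;
	assert(off <= (size_t)255 * 4);		/* bounded by the u8 field */

	/* The extension must start inside the data handed to the user. */
	if (off > user_len)
		return 0;
	return off;
}
```

For example, a length field of 32 words with 28 bytes of stripped headers gives a byte offset of 128 - 28 = 100 into the user buffer.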
For backward compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
For forward compatibility, reserve only setsockopt value 1, leaving
other bits for additional icmp extensions.
Changes
v1->v2:
- convert word offset to byte offset from start of user buffer
- return in ee_data as u8 may be insufficient
- define extension struct and object header structs
- return len only if constraints met
- if returning len, also validate
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10 16:29:02 +03:00
2007-03-13 23:10:43 +03:00
skb_reset_transport_header ( skb ) ;
if ( sock_queue_err_skb ( sk , skb ) = = 0 )
return ;
}
kfree_skb ( skb ) ;
2005-04-17 02:20:36 +04:00
}
2022-10-12 10:49:29 +03:00
EXPORT_SYMBOL_GPL ( ip_icmp_error ) ;
2005-04-17 02:20:36 +04:00
2006-09-28 05:33:40 +04:00
void ip_local_error ( struct sock * sk , int err , __be32 daddr , __be16 port , u32 info )
2005-04-17 02:20:36 +04:00
{
struct sock_exterr_skb * serr ;
struct iphdr * iph ;
struct sk_buff * skb ;
2023-08-16 11:15:35 +03:00
if ( ! inet_test_bit ( RECVERR , sk ) )
2005-04-17 02:20:36 +04:00
return ;
skb = alloc_skb ( sizeof ( struct iphdr ) , GFP_ATOMIC ) ;
if ( ! skb )
return ;
2007-03-11 01:15:25 +03:00
skb_put ( skb , sizeof ( struct iphdr ) ) ;
skb_reset_network_header ( skb ) ;
2007-04-21 09:47:35 +04:00
iph = ip_hdr ( skb ) ;
2005-04-17 02:20:36 +04:00
iph - > daddr = daddr ;
2007-02-09 17:24:47 +03:00
serr = SKB_EXT_ERR ( skb ) ;
2005-04-17 02:20:36 +04:00
serr - > ee . ee_errno = err ;
serr - > ee . ee_origin = SO_EE_ORIGIN_LOCAL ;
2007-02-09 17:24:47 +03:00
serr - > ee . ee_type = 0 ;
2005-04-17 02:20:36 +04:00
serr - > ee . ee_code = 0 ;
serr - > ee . ee_pad = 0 ;
serr - > ee . ee_info = info ;
serr - > ee . ee_data = 0 ;
2007-04-11 07:50:43 +04:00
serr - > addr_offset = ( u8 * ) & iph - > daddr - skb_network_header ( skb ) ;
2005-04-17 02:20:36 +04:00
serr - > port = port ;
2007-04-20 07:29:13 +04:00
__skb_pull ( skb , skb_tail_pointer ( skb ) - skb - > data ) ;
2007-03-13 23:10:43 +03:00
skb_reset_transport_header ( skb ) ;
2005-04-17 02:20:36 +04:00
if ( sock_queue_err_skb ( sk , skb ) )
kfree_skb ( skb ) ;
}
2015-06-23 08:34:39 +03:00
/* For some errors we have valid addr_offset even with zero payload and
* zero port . Also , addr_offset should be supported if port is set .
*/
static inline bool ipv4_datagram_support_addr ( struct sock_exterr_skb * serr )
{
return serr - > ee . ee_origin = = SO_EE_ORIGIN_ICMP | |
serr - > ee . ee_origin = = SO_EE_ORIGIN_LOCAL | | serr - > port ;
}
2015-03-08 04:33:22 +03:00
/* IPv4 supports cmsg on all ICMP errors and some timestamps
*
* Timestamp code paths do not initialize the fields expected by cmsg :
* the PKTINFO fields in skb - > cb [ ] . Fill those in here .
*/
static bool ipv4_datagram_support_cmsg ( const struct sock * sk ,
struct sk_buff * skb ,
int ee_origin )
net-timestamp: allow reading recv cmsg on errqueue with origin tstamp
Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.
On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. The only change is to set ifindex, which is
not initialized for all error origins.
In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.
The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:
ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
v1 -> v2
large rewrite
- integrate with existing pktinfo cmsg generation code
- on ipv4: only send with new flag, to maintain legacy behavior
- on ipv6: send at most a single pktinfo cmsg
- on ipv6: initialize fields if not yet initialized
The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches
1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
(http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-01 06:22:34 +03:00
{
2015-03-08 04:33:22 +03:00
struct in_pktinfo * info ;
if ( ee_origin = = SO_EE_ORIGIN_ICMP )
return true ;
2014-12-01 06:22:34 +03:00
2015-03-08 04:33:22 +03:00
if ( ee_origin = = SO_EE_ORIGIN_LOCAL )
return false ;
/* Support IP_PKTINFO on tstamp packets if requested, to correlate
2017-04-13 02:24:35 +03:00
* timestamp with egress dev . Not possible for packets without iif
2015-03-08 04:33:22 +03:00
* or without payload ( SOF_TIMESTAMPING_OPT_TSONLY ) .
*/
2017-04-13 02:24:35 +03:00
info = PKTINFO_SKB_CB ( skb ) ;
if ( ! ( sk - > sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG ) | |
! info - > ipi_ifindex )
2014-12-01 06:22:34 +03:00
return false ;
info - > ipi_spec_dst . s_addr = ip_hdr ( skb ) - > saddr ;
return true ;
}
2007-02-09 17:24:47 +03:00
/*
2005-04-17 02:20:36 +04:00
* Handle MSG_ERRQUEUE
*/
2013-11-23 03:46:12 +04:00
int ip_recv_error ( struct sock * sk , struct msghdr * msg , int len , int * addr_len )
2005-04-17 02:20:36 +04:00
{
struct sock_exterr_skb * serr ;
2014-09-01 05:30:27 +04:00
struct sk_buff * skb ;
2014-01-18 01:53:15 +04:00
DECLARE_SOCKADDR ( struct sockaddr_in * , sin , msg - > msg_name ) ;
2005-04-17 02:20:36 +04:00
struct {
struct sock_extended_err ee ;
struct sockaddr_in offender ;
} errhdr ;
int err ;
int copied ;
err = - EAGAIN ;
2014-09-01 05:30:27 +04:00
skb = sock_dequeue_err_skb ( sk ) ;
2015-04-03 11:17:26 +03:00
if ( ! skb )
2005-04-17 02:20:36 +04:00
goto out ;
copied = skb - > len ;
if ( copied > len ) {
msg - > msg_flags | = MSG_TRUNC ;
copied = len ;
}
2014-11-06 00:46:40 +03:00
err = skb_copy_datagram_msg ( skb , 0 , msg , copied ) ;
2016-04-22 08:27:32 +03:00
if ( unlikely ( err ) ) {
kfree_skb ( skb ) ;
return err ;
}
2005-04-17 02:20:36 +04:00
sock_recv_timestamp ( msg , sk , skb ) ;
serr = SKB_EXT_ERR ( skb ) ;
2015-06-23 08:34:39 +03:00
if ( sin & & ipv4_datagram_support_addr ( serr ) ) {
2005-04-17 02:20:36 +04:00
sin - > sin_family = AF_INET ;
2007-04-11 07:50:43 +04:00
sin - > sin_addr . s_addr = * ( __be32 * ) ( skb_network_header ( skb ) +
serr - > addr_offset ) ;
2005-04-17 02:20:36 +04:00
sin - > sin_port = serr - > port ;
memset ( & sin - > sin_zero , 0 , sizeof ( sin - > sin_zero ) ) ;
2013-11-23 03:46:12 +04:00
* addr_len = sizeof ( * sin ) ;
2005-04-17 02:20:36 +04:00
}
memcpy ( & errhdr . ee , & serr - > ee , sizeof ( struct sock_extended_err ) ) ;
sin = & errhdr . offender ;
2015-01-15 21:18:40 +03:00
memset ( sin , 0 , sizeof ( * sin ) ) ;
2014-12-01 06:22:34 +03:00
2015-03-08 04:33:22 +03:00
if ( ipv4_datagram_support_cmsg ( sk , skb , serr - > ee . ee_origin ) ) {
2005-04-17 02:20:36 +04:00
sin - > sin_family = AF_INET ;
2007-04-21 09:47:35 +04:00
sin - > sin_addr . s_addr = ip_hdr ( skb ) - > saddr ;
2023-08-16 11:15:33 +03:00
if ( inet_cmsg_flags ( inet_sk ( sk ) ) )
2005-04-17 02:20:36 +04:00
ip_cmsg_recv ( msg , skb ) ;
}
put_cmsg ( msg , SOL_IP , IP_RECVERR , sizeof ( errhdr ) , & errhdr ) ;
/* Now we could try to dump offended packet options */
msg - > msg_flags | = MSG_ERRQUEUE ;
err = copied ;
2016-04-22 08:27:32 +03:00
consume_skb ( skb ) ;
2005-04-17 02:20:36 +04:00
out :
return err ;
}
2021-11-19 23:41:34 +03:00
void __ip_sock_set_tos ( struct sock * sk , int val )
2020-05-28 08:12:26 +03:00
{
if ( sk - > sk_type = = SOCK_STREAM ) {
val & = ~ INET_ECN_MASK ;
val | = inet_sk ( sk ) - > tos & INET_ECN_MASK ;
}
if ( inet_sk ( sk ) - > tos ! = val ) {
inet_sk ( sk ) - > tos = val ;
2023-07-28 18:03:18 +03:00
WRITE_ONCE ( sk - > sk_priority , rt_tos2priority ( val ) ) ;
2020-05-28 08:12:26 +03:00
sk_dst_reset ( sk ) ;
}
}
void ip_sock_set_tos ( struct sock * sk , int val )
{
lock_sock ( sk ) ;
__ip_sock_set_tos ( sk , val ) ;
release_sock ( sk ) ;
}
EXPORT_SYMBOL ( ip_sock_set_tos ) ;
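The ECN handling in __ip_sock_set_tos() can be illustrated in isolation. This is a user-space sketch, not kernel code: sketch_merge_tos and SKETCH_ECN_MASK are hypothetical names, and the mask value 0x3 is an assumption mirroring INET_ECN_MASK (the two low-order ECN bits of the TOS byte).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror INET_ECN_MASK: the two low-order bits of TOS. */
#define SKETCH_ECN_MASK 0x3

/*
 * Hypothetical helper: compute the TOS value that would be stored.
 * For stream sockets the caller-supplied ECN bits are discarded and
 * the socket's current ECN bits are kept; other socket types take the
 * requested value unchanged.
 */
static int sketch_merge_tos(int cur_tos, int val, bool is_stream)
{
	if (is_stream) {
		val &= ~SKETCH_ECN_MASK;
		val |= cur_tos & SKETCH_ECN_MASK;
	}
	return val;
}
```

For example, with a current TOS of 0x02 on a stream socket, a request for 0xff keeps the upper six bits of the request and the existing ECN bits, giving 0xfe.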
2005-04-17 02:20:36 +04:00
2020-05-28 08:12:27 +03:00
void ip_sock_set_freebind ( struct sock * sk )
{
2023-08-16 11:15:37 +03:00
inet_set_bit ( FREEBIND , sk ) ;
2020-05-28 08:12:27 +03:00
}
EXPORT_SYMBOL ( ip_sock_set_freebind ) ;
2020-05-28 08:12:28 +03:00
void ip_sock_set_recverr ( struct sock * sk )
{
2023-08-16 11:15:35 +03:00
inet_set_bit ( RECVERR , sk ) ;
2020-05-28 08:12:28 +03:00
}
EXPORT_SYMBOL ( ip_sock_set_recverr ) ;
2020-05-28 08:12:29 +03:00
int ip_sock_set_mtu_discover ( struct sock * sk , int val )
{
if ( val < IP_PMTUDISC_DONT | | val > IP_PMTUDISC_OMIT )
return - EINVAL ;
lock_sock ( sk ) ;
inet_sk ( sk ) - > pmtudisc = val ;
release_sock ( sk ) ;
return 0 ;
}
EXPORT_SYMBOL ( ip_sock_set_mtu_discover ) ;
2020-05-28 08:12:30 +03:00
void ip_sock_set_pktinfo ( struct sock * sk )
{
2023-08-16 11:15:33 +03:00
inet_set_bit ( PKTINFO , sk ) ;
2020-05-28 08:12:30 +03:00
}
EXPORT_SYMBOL ( ip_sock_set_pktinfo ) ;
2005-04-17 02:20:36 +04:00
/*
2009-06-02 11:42:16 +04:00
* Socket option code for IP . This is the end of the line after any
* TCP , UDP etc options on an IP socket .
2005-04-17 02:20:36 +04:00
*/
2015-03-18 20:50:42 +03:00
static bool setsockopt_needs_rtnl ( int optname )
{
switch ( optname ) {
case IP_ADD_MEMBERSHIP :
case IP_ADD_SOURCE_MEMBERSHIP :
ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which don't grab rtnl.
As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.
So this patch:
- moves rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 20:50:43 +03:00
case IP_BLOCK_SOURCE :
2015-03-18 20:50:42 +03:00
case IP_DROP_MEMBERSHIP :
2015-03-18 20:50:43 +03:00
case IP_DROP_SOURCE_MEMBERSHIP :
case IP_MSFILTER :
case IP_UNBLOCK_SOURCE :
case MCAST_BLOCK_SOURCE :
case MCAST_MSFILTER :
2015-03-18 20:50:42 +03:00
case MCAST_JOIN_GROUP :
2015-03-18 20:50:43 +03:00
case MCAST_JOIN_SOURCE_GROUP :
2015-03-18 20:50:42 +03:00
case MCAST_LEAVE_GROUP :
2015-03-18 20:50:43 +03:00
case MCAST_LEAVE_SOURCE_GROUP :
case MCAST_UNBLOCK_SOURCE :
2015-03-18 20:50:42 +03:00
return true ;
}
return false ;
}
2005-04-17 02:20:36 +04:00
2020-03-30 05:37:56 +03:00
static int set_mcast_msfilter ( struct sock * sk , int ifindex ,
int numsrc , int fmode ,
struct sockaddr_storage * group ,
struct sockaddr_storage * list )
{
struct ip_msfilter * msf ;
struct sockaddr_in * psin ;
int err , i ;
2021-08-04 21:23:25 +03:00
msf = kmalloc ( IP_MSFILTER_SIZE ( numsrc ) , GFP_KERNEL ) ;
2020-03-30 05:37:56 +03:00
if ( ! msf )
return - ENOBUFS ;
psin = ( struct sockaddr_in * ) group ;
if ( psin - > sin_family ! = AF_INET )
goto Eaddrnotavail ;
msf - > imsf_multiaddr = psin - > sin_addr . s_addr ;
msf - > imsf_interface = 0 ;
msf - > imsf_fmode = fmode ;
msf - > imsf_numsrc = numsrc ;
for ( i = 0 ; i < numsrc ; + + i ) {
psin = ( struct sockaddr_in * ) & list [ i ] ;
if ( psin - > sin_family ! = AF_INET )
goto Eaddrnotavail ;
2021-07-31 20:08:30 +03:00
msf - > imsf_slist_flex [ i ] = psin - > sin_addr . s_addr ;
2020-03-30 05:37:56 +03:00
}
err = ip_mc_msfilter ( sk , msf , ifindex ) ;
kfree ( msf ) ;
return err ;
Eaddrnotavail :
kfree ( msf ) ;
return - EADDRNOTAVAIL ;
}
2020-07-23 09:08:58 +03:00
static int copy_group_source_from_sockptr ( struct group_source_req * greqs ,
sockptr_t optval , int optlen )
2020-07-17 09:23:26 +03:00
{
if ( in_compat_syscall ( ) ) {
struct compat_group_source_req gr32 ;
if ( optlen ! = sizeof ( gr32 ) )
return - EINVAL ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & gr32 , optval , sizeof ( gr32 ) ) )
2020-07-17 09:23:26 +03:00
return - EFAULT ;
greqs - > gsr_interface = gr32 . gsr_interface ;
greqs - > gsr_group = gr32 . gsr_group ;
greqs - > gsr_source = gr32 . gsr_source ;
} else {
if ( optlen ! = sizeof ( * greqs ) )
return - EINVAL ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( greqs , optval , sizeof ( * greqs ) ) )
2020-07-17 09:23:26 +03:00
return - EFAULT ;
}
return 0 ;
}
2020-04-27 17:49:26 +03:00
static int do_mcast_group_source ( struct sock * sk , int optname ,
2020-07-23 09:08:58 +03:00
sockptr_t optval , int optlen )
2020-04-27 17:49:26 +03:00
{
2020-07-17 09:23:26 +03:00
struct group_source_req greqs ;
2020-04-27 17:49:26 +03:00
struct ip_mreq_source mreqs ;
struct sockaddr_in * psin ;
int omode , add , err ;
2020-07-23 09:08:58 +03:00
err = copy_group_source_from_sockptr ( & greqs , optval , optlen ) ;
2020-07-17 09:23:26 +03:00
if ( err )
return err ;
if ( greqs . gsr_group . ss_family ! = AF_INET | |
greqs . gsr_source . ss_family ! = AF_INET )
2020-04-27 17:49:26 +03:00
return - EADDRNOTAVAIL ;
2020-07-17 09:23:26 +03:00
psin = ( struct sockaddr_in * ) & greqs . gsr_group ;
2020-04-27 17:49:26 +03:00
mreqs . imr_multiaddr = psin - > sin_addr . s_addr ;
2020-07-17 09:23:26 +03:00
psin = ( struct sockaddr_in * ) & greqs . gsr_source ;
2020-04-27 17:49:26 +03:00
mreqs . imr_sourceaddr = psin - > sin_addr . s_addr ;
mreqs . imr_interface = 0 ; /* use index for mc_source */
if ( optname = = MCAST_BLOCK_SOURCE ) {
omode = MCAST_EXCLUDE ;
add = 1 ;
} else if ( optname = = MCAST_UNBLOCK_SOURCE ) {
omode = MCAST_EXCLUDE ;
add = 0 ;
} else if ( optname = = MCAST_JOIN_SOURCE_GROUP ) {
struct ip_mreqn mreq ;
2020-07-17 09:23:26 +03:00
psin = ( struct sockaddr_in * ) & greqs . gsr_group ;
2020-04-27 17:49:26 +03:00
mreq . imr_multiaddr = psin - > sin_addr ;
mreq . imr_address . s_addr = 0 ;
2020-07-17 09:23:26 +03:00
mreq . imr_ifindex = greqs . gsr_interface ;
2020-04-27 17:49:26 +03:00
err = ip_mc_join_group_ssm ( sk , & mreq , MCAST_INCLUDE ) ;
if ( err & & err ! = - EADDRINUSE )
return err ;
2020-07-17 09:23:26 +03:00
greqs . gsr_interface = mreq . imr_ifindex ;
2020-04-27 17:49:26 +03:00
omode = MCAST_INCLUDE ;
add = 1 ;
} else /* MCAST_LEAVE_SOURCE_GROUP */ {
omode = MCAST_INCLUDE ;
add = 0 ;
}
2020-07-17 09:23:26 +03:00
return ip_mc_source ( add , omode , sk , & mreqs , greqs . gsr_interface ) ;
2020-04-27 17:49:26 +03:00
}
2020-07-23 09:08:58 +03:00
static int ip_set_mcast_msfilter ( struct sock * sk , sockptr_t optval , int optlen )
2020-07-17 09:23:24 +03:00
{
struct group_filter * gsf = NULL ;
int err ;
if ( optlen < GROUP_FILTER_SIZE ( 0 ) )
return - EINVAL ;
2022-08-23 20:46:49 +03:00
if ( optlen > READ_ONCE ( sysctl_optmem_max ) )
2020-07-17 09:23:24 +03:00
return - ENOBUFS ;
2020-07-23 09:08:58 +03:00
gsf = memdup_sockptr ( optval , optlen ) ;
2020-07-17 09:23:24 +03:00
if ( IS_ERR ( gsf ) )
return PTR_ERR ( gsf ) ;
/* numsrc >= (4G-140)/128 overflow in 32 bits */
err = - ENOBUFS ;
if ( gsf - > gf_numsrc > = 0x1ffffff | |
2022-07-15 20:17:43 +03:00
gsf - > gf_numsrc > READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_igmp_max_msf ) )
2020-07-17 09:23:24 +03:00
goto out_free_gsf ;
err = - EINVAL ;
if ( GROUP_FILTER_SIZE ( gsf - > gf_numsrc ) > optlen )
goto out_free_gsf ;
err = set_mcast_msfilter ( sk , gsf - > gf_interface , gsf - > gf_numsrc ,
2021-08-04 23:45:36 +03:00
gsf - > gf_fmode , & gsf - > gf_group ,
gsf - > gf_slist_flex ) ;
2020-07-17 09:23:24 +03:00
out_free_gsf :
kfree ( gsf ) ;
return err ;
}
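The order of the length checks in ip_set_mcast_msfilter() can be sketched as a standalone function. This is an illustration under stated assumptions: struct sketch_group_filter, SKETCH_FILTER_SIZE and sketch_check_filter are hypothetical names, the struct layout is assumed to resemble struct group_filter, and the two limits are passed in as plain parameters instead of being read from sysctls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Assumed layout, modelled on struct group_filter. */
struct sketch_group_filter {
	uint32_t gf_interface;
	struct sockaddr_storage gf_group;
	uint32_t gf_fmode;
	uint32_t gf_numsrc;
	struct sockaddr_storage gf_slist[];
};

/* Bytes needed for a filter carrying n source addresses. */
#define SKETCH_FILTER_SIZE(n) \
	(offsetof(struct sketch_group_filter, gf_slist) + \
	 (size_t)(n) * sizeof(struct sockaddr_storage))

/*
 * Hypothetical helper applying the same sequence of checks:
 * header present, total size within the option memory limit,
 * source count within bounds, and source list fully covered by optlen.
 */
static int sketch_check_filter(size_t optlen, uint32_t numsrc,
			       size_t optmem_max, uint32_t max_msf)
{
	if (optlen < SKETCH_FILTER_SIZE(0))
		return -EINVAL;
	if (optlen > optmem_max)
		return -ENOBUFS;
	if (numsrc >= 0x1ffffff || numsrc > max_msf)
		return -ENOBUFS;
	if (SKETCH_FILTER_SIZE(numsrc) > optlen)
		return -EINVAL;
	return 0;
}
```

In the kernel function the source count is only known after the buffer has been copied in, which is why the two size checks on optlen come first.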
2020-07-23 09:08:58 +03:00
static int compat_ip_set_mcast_msfilter ( struct sock * sk , sockptr_t optval ,
2020-07-17 09:23:24 +03:00
int optlen )
{
2021-08-04 23:45:36 +03:00
const int size0 = offsetof ( struct compat_group_filter , gf_slist_flex ) ;
2020-07-17 09:23:24 +03:00
struct compat_group_filter * gf32 ;
unsigned int n ;
void * p ;
int err ;
if ( optlen < size0 )
return - EINVAL ;
2022-08-23 20:46:49 +03:00
if ( optlen > READ_ONCE ( sysctl_optmem_max ) - 4 )
2020-07-17 09:23:24 +03:00
return - ENOBUFS ;
p = kmalloc ( optlen + 4 , GFP_KERNEL ) ;
if ( ! p )
return - ENOMEM ;
2021-08-04 23:45:36 +03:00
gf32 = p + 4 ; /* we want ->gf_group and ->gf_slist_flex aligned */
2020-07-17 09:23:24 +03:00
err = - EFAULT ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( gf32 , optval , optlen ) )
2020-07-17 09:23:24 +03:00
goto out_free_gsf ;
/* numsrc >= (4G-140)/128 overflow in 32 bits */
n = gf32 - > gf_numsrc ;
err = - ENOBUFS ;
if ( n > = 0x1ffffff )
goto out_free_gsf ;
err = - EINVAL ;
2021-08-04 23:45:36 +03:00
if ( offsetof ( struct compat_group_filter , gf_slist_flex [ n ] ) > optlen )
2020-07-17 09:23:24 +03:00
goto out_free_gsf ;
/* numsrc >= (4G-140)/128 overflow in 32 bits */
err = - ENOBUFS ;
2022-07-15 20:17:43 +03:00
if ( n > READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_igmp_max_msf ) )
2020-07-17 09:23:26 +03:00
goto out_free_gsf ;
2020-07-17 09:23:24 +03:00
err = set_mcast_msfilter ( sk , gf32 - > gf_interface , n , gf32 - > gf_fmode ,
2021-08-04 23:45:36 +03:00
& gf32 - > gf_group , gf32 - > gf_slist_flex ) ;
2020-07-17 09:23:24 +03:00
out_free_gsf :
kfree ( p ) ;
return err ;
}
2020-07-17 09:23:25 +03:00
static int ip_mcast_join_leave ( struct sock * sk , int optname ,
2020-07-23 09:08:58 +03:00
sockptr_t optval , int optlen )
2020-07-17 09:23:25 +03:00
{
struct ip_mreqn mreq = { } ;
struct sockaddr_in * psin ;
struct group_req greq ;
if ( optlen < sizeof ( struct group_req ) )
return - EINVAL ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & greq , optval , sizeof ( greq ) ) )
2020-07-17 09:23:25 +03:00
return - EFAULT ;
psin = ( struct sockaddr_in * ) & greq . gr_group ;
if ( psin - > sin_family ! = AF_INET )
return - EINVAL ;
mreq . imr_multiaddr = psin - > sin_addr ;
mreq . imr_ifindex = greq . gr_interface ;
if ( optname = = MCAST_JOIN_GROUP )
return ip_mc_join_group ( sk , & mreq ) ;
return ip_mc_leave_group ( sk , & mreq ) ;
}
static int compat_ip_mcast_join_leave ( struct sock * sk , int optname ,
2020-07-23 09:08:58 +03:00
sockptr_t optval , int optlen )
2020-07-17 09:23:25 +03:00
{
struct compat_group_req greq ;
struct ip_mreqn mreq = { } ;
struct sockaddr_in * psin ;
if ( optlen < sizeof ( struct compat_group_req ) )
return - EINVAL ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & greq , optval , sizeof ( greq ) ) )
2020-07-17 09:23:25 +03:00
return - EFAULT ;
psin = ( struct sockaddr_in * ) & greq . gr_group ;
if ( psin - > sin_family ! = AF_INET )
return - EINVAL ;
mreq . imr_multiaddr = psin - > sin_addr ;
mreq . imr_ifindex = greq . gr_interface ;
if ( optname = = MCAST_JOIN_GROUP )
2020-07-17 09:23:26 +03:00
return ip_mc_join_group ( sk , & mreq ) ;
return ip_mc_leave_group ( sk , & mreq ) ;
2020-07-17 09:23:25 +03:00
}
2021-10-25 19:48:24 +03:00
DEFINE_STATIC_KEY_FALSE ( ip4_min_ttl ) ;
2022-08-17 09:18:26 +03:00
int do_ip_setsockopt ( struct sock * sk , int level , int optname ,
sockptr_t optval , unsigned int optlen )
2005-04-17 02:20:36 +04:00
{
struct inet_sock * inet = inet_sk ( sk ) ;
2016-02-09 00:29:22 +03:00
struct net * net = sock_net ( sk ) ;
2008-11-03 11:27:11 +03:00
int val = 0 , err ;
2015-03-18 20:50:42 +03:00
bool needs_rtnl = setsockopt_needs_rtnl ( optname ) ;
2005-04-17 02:20:36 +04:00
2012-11-11 15:20:01 +04:00
switch ( optname ) {
case IP_PKTINFO :
case IP_RECVTTL :
case IP_RECVOPTS :
case IP_RECVTOS :
case IP_RETOPTS :
case IP_TOS :
case IP_TTL :
case IP_HDRINCL :
case IP_MTU_DISCOVER :
case IP_RECVERR :
case IP_ROUTER_ALERT :
case IP_FREEBIND :
case IP_PASSSEC :
case IP_TRANSPARENT :
case IP_MINTTL :
case IP_NODEFRAG :
inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).
As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But kernel does not know yet if this socket is going to be a listener or
be connected.
It has very limited choices (no full knowledge of final 4-tuple for a
connect())
With limited ephemeral port range (about 32K ports), it is very easy to
fill the space.
This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
This new feature is available for both IPv4 and IPv6 (Thanks Neal)
Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
IPv6 test :
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.
lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets
lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
Check that a given source port is indeed used by many different
connections :
lpaa23:~# ss -t src :40000 | head -10
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07 07:17:57 +03:00
case IP_BIND_ADDRESS_NO_PORT :
2012-11-11 15:20:01 +04:00
case IP_UNICAST_IF :
case IP_MULTICAST_TTL :
case IP_MULTICAST_ALL :
case IP_MULTICAST_LOOP :
case IP_RECVORIGDSTADDR :
2015-01-06 00:56:17 +03:00
case IP_CHECKSUM :
2016-11-02 18:02:16 +03:00
case IP_RECVFRAGSIZE :
icmp: support rfc 4884
Add setsockopt SOL_IP/IP_RECVERR_4884 to return the offset to an
extension struct if present.
ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior. It stores the offset
in words to the extension header in u8 icmphdr.un.reserved[1].
The field is valid only for ICMP types destination unreachable, time
exceeded and parameter problem, if length is at least 128 bytes and
entire packet does not exceed 576 bytes.
Return the offset to the start of the extension struct when reading an
ICMP error from the error queue, if it matches the above constraints.
Do not return the raw u8 field. Return the offset from the start of
the user buffer, in bytes. The kernel does not return the network and
transport headers, so subtract those.
Also validate the headers. Return the offset regardless of validation,
as an invalid extension must still not be misinterpreted as part of
the original datagram. Note that !invalid does not imply valid. If
the extension version does not match, no validation can take place,
for instance.
For backward compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
For forward compatibility, reserve only setsockopt value 1, leaving
other bits for additional icmp extensions.
Changes
v1->v2:
- convert word offset to byte offset from start of user buffer
- return in ee_data as u8 may be insufficient
- define extension struct and object header structs
- return len only if constraints met
- if returning len, also validate
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10 16:29:02 +03:00
case IP_RECVERR_RFC4884 :
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24 16:36:43 +03:00
case IP_LOCAL_PORT_RANGE:
2005-04-17 02:20:36 +04:00
if (optlen >= sizeof(int)) {
2020-07-23 09:08:58 +03:00
if (copy_from_sockptr(&val, optval, sizeof(val)))
2005-04-17 02:20:36 +04:00
return -EFAULT;
} else if (optlen >= sizeof(char)) {
unsigned char ucval;
2020-07-23 09:08:58 +03:00
if (copy_from_sockptr(&ucval, optval, sizeof(ucval)))
2005-04-17 02:20:36 +04:00
return -EFAULT;
val = (int) ucval;
}
}
/* If optlen==0, it is equivalent to val == 0 */
2018-03-22 12:45:12 +03:00
if (optname == IP_ROUTER_ALERT)
return ip_ra_control(sk, val ? 1 : 0, NULL);
2007-11-06 08:32:31 +03:00
if (ip_mroute_opt(optname))
2020-07-23 09:08:58 +03:00
return ip_mroute_setsockopt(sk, optname, optval, optlen);
2005-04-17 02:20:36 +04:00
inet: set/get simple options locklessly
Now that we have inet->inet_flags, we can set the following options
without having to hold the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE.
ip_sock_set_pktinfo() no longer holds the socket lock.
Similarly, we can get the following options without holding
the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16 11:15:34 +03:00
/* Handle options that can be set without locking the socket. */
switch (optname) {
case IP_PKTINFO:
inet_assign_bit(PKTINFO, sk, val);
return 0;
case IP_RECVTTL:
inet_assign_bit(TTL, sk, val);
return 0;
case IP_RECVTOS:
inet_assign_bit(TOS, sk, val);
return 0;
case IP_RECVOPTS:
inet_assign_bit(RECVOPTS, sk, val);
return 0;
case IP_RETOPTS:
inet_assign_bit(RETOPTS, sk, val);
return 0;
case IP_PASSSEC:
inet_assign_bit(PASSSEC, sk, val);
return 0;
case IP_RECVORIGDSTADDR:
inet_assign_bit(ORIGDSTADDR, sk, val);
return 0;
case IP_RECVFRAGSIZE:
if (sk->sk_type != SOCK_RAW && sk->sk_type != SOCK_DGRAM)
return -EINVAL;
inet_assign_bit(RECVFRAGSIZE, sk, val);
return 0;
2023-08-16 11:15:35 +03:00
case IP_RECVERR:
inet_assign_bit(RECVERR, sk, val);
if (!val)
skb_queue_purge(&sk->sk_error_queue);
return 0;
2023-08-16 11:15:36 +03:00
case IP_RECVERR_RFC4884:
if (val < 0 || val > 1)
return -EINVAL;
inet_assign_bit(RECVERR_RFC4884, sk, val);
return 0;
2023-08-16 11:15:37 +03:00
case IP_FREEBIND:
if (optlen < 1)
return -EINVAL;
inet_assign_bit(FREEBIND, sk, val);
return 0;
2023-08-16 11:15:38 +03:00
case IP_HDRINCL:
if (sk->sk_type != SOCK_RAW)
return -ENOPROTOOPT;
inet_assign_bit(HDRINCL, sk, val);
return 0;
}
2005-04-17 02:20:36 +04:00
err = 0 ;
2015-03-18 20:50:42 +03:00
if ( needs_rtnl )
rtnl_lock ( ) ;
2022-08-17 09:17:37 +03:00
sockopt_lock_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
switch ( optname ) {
2007-03-09 07:44:43 +03:00
case IP_OPTIONS :
{
2011-04-21 13:45:37 +04:00
struct ip_options_rcu * old , * opt = NULL ;
2009-10-23 09:59:21 +04:00
if ( optlen > 40 )
2007-03-09 07:44:43 +03:00
goto e_inval ;
2020-07-23 09:08:58 +03:00
err = ip_options_get ( sock_net ( sk ) , & opt , optval , optlen ) ;
2007-03-09 07:44:43 +03:00
if ( err )
break ;
2011-04-21 13:45:37 +04:00
old = rcu_dereference_protected ( inet - > inet_opt ,
2016-04-05 18:10:15 +03:00
lockdep_sock_is_held ( sk ) ) ;
2007-03-09 07:44:43 +03:00
if ( inet - > is_icsk ) {
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2011-12-10 13:48:31 +04:00
# if IS_ENABLED(CONFIG_IPV6)
2007-03-09 07:44:43 +03:00
if ( sk - > sk_family = = PF_INET | |
( ! ( ( 1 < < sk - > sk_state ) &
( TCPF_LISTEN | TCPF_CLOSE ) ) & &
2009-10-15 10:30:45 +04:00
inet - > inet_daddr ! = LOOPBACK4_IPV6 ) ) {
2005-04-17 02:20:36 +04:00
# endif
2011-04-21 13:45:37 +04:00
if ( old )
icsk - > icsk_ext_hdr_len - = old - > opt . optlen ;
2007-03-09 07:44:43 +03:00
if ( opt )
2011-04-21 13:45:37 +04:00
icsk - > icsk_ext_hdr_len + = opt - > opt . optlen ;
2007-03-09 07:44:43 +03:00
icsk - > icsk_sync_mss ( sk , icsk - > icsk_pmtu_cookie ) ;
2011-12-10 13:48:31 +04:00
# if IS_ENABLED(CONFIG_IPV6)
2005-04-17 02:20:36 +04:00
}
2007-03-09 07:44:43 +03:00
# endif
2005-04-17 02:20:36 +04:00
}
2011-04-21 13:45:37 +04:00
rcu_assign_pointer ( inet - > inet_opt , opt ) ;
if ( old )
2012-01-07 05:08:33 +04:00
kfree_rcu ( old , rcu ) ;
2007-03-09 07:44:43 +03:00
break ;
}
2015-01-06 00:56:17 +03:00
case IP_CHECKSUM :
if ( val ) {
2023-08-16 11:15:33 +03:00
if ( ! ( inet_test_bit ( CHECKSUM , sk ) ) ) {
2015-01-06 00:56:17 +03:00
inet_inc_convert_csum ( sk ) ;
2023-08-16 11:15:33 +03:00
inet_set_bit ( CHECKSUM , sk ) ;
2015-01-06 00:56:17 +03:00
}
} else {
2023-08-16 11:15:33 +03:00
if ( inet_test_bit ( CHECKSUM , sk ) ) {
2015-01-06 00:56:17 +03:00
inet_dec_convert_csum ( sk ) ;
2023-08-16 11:15:33 +03:00
inet_clear_bit ( CHECKSUM , sk ) ;
2015-01-06 00:56:17 +03:00
}
}
break ;
2007-03-09 07:44:43 +03:00
case IP_TOS : /* This sets both TOS and Precedence */
2020-05-28 08:12:26 +03:00
__ip_sock_set_tos ( sk , val ) ;
2007-03-09 07:44:43 +03:00
break ;
case IP_TTL :
2009-06-02 11:42:16 +04:00
if ( optlen < 1 )
2007-03-09 07:44:43 +03:00
goto e_inval ;
2013-01-08 01:17:00 +04:00
if ( val ! = - 1 & & ( val < 1 | | val > 255 ) )
2007-03-09 07:44:43 +03:00
goto e_inval ;
inet - > uc_ttl = val ;
break ;
2010-06-15 05:07:31 +04:00
case IP_NODEFRAG :
if ( sk - > sk_type ! = SOCK_RAW ) {
err = - ENOPROTOOPT ;
break ;
}
inet - > nodefrag = val ? 1 : 0 ;
break ;
inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).
As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge of finding an available
port.
But the kernel does not yet know whether this socket is going to be a
listener or be connected.
It has very limited choices (no full knowledge of the final 4-tuple for a
connect()).
With a limited ephemeral port range (about 32K ports), it is very easy to
fill the space.
This patch adds a new SOL_IP socket option, asking the kernel to ignore
the 0 port provided by the application in bind(IP, port=0) and only
remember the given IP address.
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
This new feature is available for both IPv4 and IPv6 (Thanks Neal)
Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also, getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
IPv6 test:
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
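The bind() half of the traces above can be reproduced with a short C sketch. The helper name port_after_bind_no_port() is hypothetical and only for illustration; the fallback value 24 for IP_BIND_ADDRESS_NO_PORT is what linux/in.h defines, and IPPROTO_IP is used as the level because it equals SOL_IP on Linux. On a kernel carrying this patch, the function is expected to report port 0 right after bind(), matching the first getsockname() line in the strace output.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24	/* value from linux/in.h */
#endif

/*
 * Hypothetical helper (not part of the patch): bind a TCP socket to
 * ip:0 with IP_BIND_ADDRESS_NO_PORT set and return the local port
 * reported by getsockname() in host byte order, or -1 on any error.
 */
static inline int port_after_bind_no_port(const char *ip)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int one = 1;
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(0);	/* let the kernel pick at connect() time */
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
		goto out;

	/* Ask the kernel to remember only the address, not to reserve a port. */
	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &one, sizeof(one)) < 0)
		goto out;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		goto out;

	port = ntohs(addr.sin_port);
out:
	close(fd);
	return port;
}
```

A caller would follow the bind() with connect(); only then does the kernel choose a source port, which is what allows the port to be shared across distinct 4-tuples.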
I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.
lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets
lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
Check that a given source port is indeed used by many different
connections:
lpaa23:~# ss -t src :40000 | head -10
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07 07:17:57 +03:00
case IP_BIND_ADDRESS_NO_PORT :
inet - > bind_address_no_port = val ? 1 : 0 ;
break ;
2007-03-09 07:44:43 +03:00
case IP_MTU_DISCOVER :
2014-02-26 04:20:42 +04:00
if ( val < IP_PMTUDISC_DONT | | val > IP_PMTUDISC_OMIT )
2007-03-09 07:44:43 +03:00
goto e_inval ;
inet - > pmtudisc = val ;
break ;
case IP_MULTICAST_TTL :
if ( sk - > sk_type = = SOCK_STREAM )
goto e_inval ;
2009-06-02 11:42:16 +04:00
if ( optlen < 1 )
2007-03-09 07:44:43 +03:00
goto e_inval ;
2008-11-03 11:27:11 +03:00
if ( val = = - 1 )
2007-03-09 07:44:43 +03:00
val = 1 ;
if ( val < 0 | | val > 255 )
goto e_inval ;
inet - > mc_ttl = val ;
break ;
case IP_MULTICAST_LOOP :
2009-06-02 11:42:16 +04:00
if ( optlen < 1 )
2007-03-09 07:44:43 +03:00
goto e_inval ;
inet - > mc_loop = ! ! val ;
break ;
2012-02-08 13:11:07 +04:00
case IP_UNICAST_IF :
{
struct net_device * dev = NULL ;
int ifindex ;
2018-01-25 06:37:38 +03:00
int midx ;
2012-02-08 13:11:07 +04:00
if ( optlen ! = sizeof ( int ) )
goto e_inval ;
ifindex = ( __force int ) ntohl ( ( __force __be32 ) val ) ;
if ( ifindex = = 0 ) {
inet - > uc_index = 0 ;
err = 0 ;
break ;
}
dev = dev_get_by_index ( sock_net ( sk ) , ifindex ) ;
err = - EADDRNOTAVAIL ;
if ( ! dev )
break ;
2018-01-25 06:37:38 +03:00
midx = l3mdev_master_ifindex ( dev ) ;
2012-02-08 13:11:07 +04:00
dev_put ( dev ) ;
err = - EINVAL ;
2020-08-25 15:10:37 +03:00
if ( sk - > sk_bound_dev_if & & midx ! = sk - > sk_bound_dev_if )
2012-02-08 13:11:07 +04:00
break ;
inet - > uc_index = ifindex ;
err = 0 ;
break ;
}
2007-03-09 07:44:43 +03:00
case IP_MULTICAST_IF :
{
struct ip_mreqn mreq ;
struct net_device * dev = NULL ;
2016-12-30 02:39:37 +03:00
int midx ;
2007-03-09 07:44:43 +03:00
if ( sk - > sk_type = = SOCK_STREAM )
goto e_inval ;
/*
* Check the arguments are allowable
*/
2009-09-22 19:41:10 +04:00
if ( optlen < sizeof ( struct in_addr ) )
goto e_inval ;
2007-03-09 07:44:43 +03:00
err = - EFAULT ;
if ( optlen > = sizeof ( struct ip_mreqn ) ) {
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreq , optval , sizeof ( mreq ) ) )
2005-04-17 02:20:36 +04:00
break ;
2007-03-09 07:44:43 +03:00
} else {
memset ( & mreq , 0 , sizeof ( mreq ) ) ;
2012-05-04 02:37:45 +04:00
if ( optlen > = sizeof ( struct ip_mreq ) ) {
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreq , optval ,
sizeof ( struct ip_mreq ) ) )
2012-05-04 02:37:45 +04:00
break ;
} else if ( optlen > = sizeof ( struct in_addr ) ) {
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreq . imr_address , optval ,
sizeof ( struct in_addr ) ) )
2012-05-04 02:37:45 +04:00
break ;
}
2007-03-09 07:44:43 +03:00
}
if ( ! mreq . imr_ifindex ) {
2008-03-18 08:44:53 +03:00
if ( mreq . imr_address . s_addr = = htonl ( INADDR_ANY ) ) {
2007-03-09 07:44:43 +03:00
inet - > mc_index = 0 ;
inet - > mc_addr = 0 ;
err = 0 ;
2005-04-17 02:20:36 +04:00
break ;
}
2008-03-25 20:26:21 +03:00
dev = ip_dev_find ( sock_net ( sk ) , mreq . imr_address . s_addr ) ;
2009-10-19 10:41:58 +04:00
if ( dev )
2007-03-09 07:44:43 +03:00
mreq . imr_ifindex = dev - > ifindex ;
} else
2009-10-19 10:41:58 +04:00
dev = dev_get_by_index ( sock_net ( sk ) , mreq . imr_ifindex ) ;
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
err = - EADDRNOTAVAIL ;
if ( ! dev )
break ;
2016-12-30 02:39:37 +03:00
midx = l3mdev_master_ifindex ( dev ) ;
2009-10-19 10:41:58 +04:00
dev_put ( dev ) ;
2007-03-09 07:44:43 +03:00
err = - EINVAL ;
if ( sk - > sk_bound_dev_if & &
2016-12-30 02:39:37 +03:00
mreq . imr_ifindex ! = sk - > sk_bound_dev_if & &
2020-08-25 15:10:37 +03:00
midx ! = sk - > sk_bound_dev_if )
2007-03-09 07:44:43 +03:00
break ;
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
inet - > mc_index = mreq . imr_ifindex ;
inet - > mc_addr = mreq . imr_address . s_addr ;
err = 0 ;
break ;
}
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
case IP_ADD_MEMBERSHIP :
case IP_DROP_MEMBERSHIP :
{
struct ip_mreqn mreq ;
2005-04-17 02:20:36 +04:00
2007-08-25 09:16:39 +04:00
err = - EPROTO ;
if ( inet_sk ( sk ) - > is_icsk )
break ;
2007-03-09 07:44:43 +03:00
if ( optlen < sizeof ( struct ip_mreq ) )
goto e_inval ;
err = - EFAULT ;
if ( optlen > = sizeof ( struct ip_mreqn ) ) {
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreq , optval , sizeof ( mreq ) ) )
2005-04-17 02:20:36 +04:00
break ;
2007-03-09 07:44:43 +03:00
} else {
memset ( & mreq , 0 , sizeof ( mreq ) ) ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreq , optval ,
sizeof ( struct ip_mreq ) ) )
2005-04-17 02:20:36 +04:00
break ;
2007-03-09 07:44:43 +03:00
}
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
if ( optname = = IP_ADD_MEMBERSHIP )
ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which don't grab rtnl.
As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.
So this patch:
- moves rtnl handling to callers instead, while already fixing some
reversed locking situations, like in vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 20:50:43 +03:00
err = ip_mc_join_group ( sk , & mreq ) ;
2007-03-09 07:44:43 +03:00
else
err = ip_mc_leave_group ( sk , & mreq ) ;
2007-03-09 07:44:43 +03:00
break ;
}
case IP_MSFILTER :
{
struct ip_msfilter * msf ;
2021-08-04 21:23:25 +03:00
if ( optlen < IP_MSFILTER_SIZE ( 0 ) )
2007-03-09 07:44:43 +03:00
goto e_inval ;
2022-08-23 20:46:49 +03:00
if ( optlen > READ_ONCE ( sysctl_optmem_max ) ) {
2007-03-09 07:44:43 +03:00
err = - ENOBUFS ;
2005-04-17 02:20:36 +04:00
break ;
}
2020-07-23 09:08:58 +03:00
msf = memdup_sockptr ( optval , optlen ) ;
2017-05-14 01:26:06 +03:00
if ( IS_ERR ( msf ) ) {
err = PTR_ERR ( msf ) ;
2007-03-09 07:44:43 +03:00
break ;
}
/* numsrc >= (1G-4) overflow in 32 bits */
if ( msf - > imsf_numsrc > = 0x3ffffffcU | |
2022-07-15 20:17:43 +03:00
msf - > imsf_numsrc > READ_ONCE ( net - > ipv4 . sysctl_igmp_max_msf ) ) {
2007-03-09 07:44:43 +03:00
kfree ( msf ) ;
err = - ENOBUFS ;
break ;
}
2021-08-04 21:23:25 +03:00
if ( IP_MSFILTER_SIZE ( msf - > imsf_numsrc ) > optlen ) {
2007-03-09 07:44:43 +03:00
kfree ( msf ) ;
err = - EINVAL ;
break ;
}
err = ip_mc_msfilter ( sk , msf , 0 ) ;
kfree ( msf ) ;
break ;
}
case IP_BLOCK_SOURCE :
case IP_UNBLOCK_SOURCE :
case IP_ADD_SOURCE_MEMBERSHIP :
case IP_DROP_SOURCE_MEMBERSHIP :
{
struct ip_mreq_source mreqs ;
int omode , add ;
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
if ( optlen ! = sizeof ( struct ip_mreq_source ) )
goto e_inval ;
2020-07-23 09:08:58 +03:00
if ( copy_from_sockptr ( & mreqs , optval , sizeof ( mreqs ) ) ) {
2005-04-17 02:20:36 +04:00
err = - EFAULT ;
break ;
}
2007-03-09 07:44:43 +03:00
if ( optname = = IP_BLOCK_SOURCE ) {
omode = MCAST_EXCLUDE ;
add = 1 ;
} else if ( optname = = IP_UNBLOCK_SOURCE ) {
omode = MCAST_EXCLUDE ;
add = 0 ;
} else if ( optname = = IP_ADD_SOURCE_MEMBERSHIP ) {
struct ip_mreqn mreq ;
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
mreq . imr_multiaddr . s_addr = mreqs . imr_multiaddr ;
mreq . imr_address . s_addr = mreqs . imr_interface ;
mreq . imr_ifindex = 0 ;
ipv4/igmp: init group mode as INCLUDE when join source group
Based on RFC3376 5.1
If no interface
state existed for that multicast address before the change (i.e., the
change consisted of creating a new per-interface record), or if no
state exists after the change (i.e., the change consisted of deleting
a per-interface record), then the "non-existent" state is considered
to have a filter mode of INCLUDE and an empty source list.
Which means a new multicast group should start with state IN().
Function ip_mc_join_group() works correctly for IGMP ASM(Any-Source Multicast)
mode. It adds a group with state EX() and inits crcount to mc_qrv,
so the kernel will send a TO_EX() report message after adding group.
But for IGMPv3 SSM(Source-specific multicast) JOIN_SOURCE_GROUP mode, we
split the group joining into two steps. First we join the group like ASM,
i.e. via ip_mc_join_group(). So the state changes from IN() to EX().
Then we add the source-specific address with INCLUDE mode. So the state
changes from EX() to IN(A).
Before the first step sends a group change record, we finished the second
step. So we will only send the second change record. i.e. TO_IN(A).
According to the RFC, we should actually send an ALLOW(A) message for
SSM JOIN_SOURCE_GROUP, as the state should mimic the 'IN() to IN(A)'
transition.
The issue was exposed by commit a052517a8ff65 ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we used to send both ALLOW(A) and TO_IN(A). After this change we only send
TO_IN(A).
Fix it by adding a new parameter to init group mode. Also add new wrapper
functions so we don't need to change too much code.
v1 -> v2:
In my first version I only cleared the group change record. But this is not
enough, because when a new group joins, it will init as EXCLUDE and trigger
a filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses' sf_crcount. This will prevent earlier joined addresses from sending
state change records if multiple source addresses joined at the same time.
In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.
Fixes: a052517a8ff65 ("net/multicast: should not send source list records when have filter mode change")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 17:41:26 +03:00
err = ip_mc_join_group_ssm ( sk , & mreq , MCAST_INCLUDE ) ;
2007-03-09 07:44:43 +03:00
if ( err & & err ! = - EADDRINUSE )
2005-04-17 02:20:36 +04:00
break ;
2007-03-09 07:44:43 +03:00
omode = MCAST_INCLUDE ;
add = 1 ;
} else /* IP_DROP_SOURCE_MEMBERSHIP */ {
omode = MCAST_INCLUDE ;
add = 0 ;
}
err = ip_mc_source ( add , omode , sk , & mreqs , 0 ) ;
break ;
}
case MCAST_JOIN_GROUP :
case MCAST_LEAVE_GROUP :
2020-07-17 09:23:26 +03:00
if ( in_compat_syscall ( ) )
err = compat_ip_mcast_join_leave ( sk , optname , optval ,
optlen ) ;
else
err = ip_mcast_join_leave ( sk , optname , optval , optlen ) ;
2007-03-09 07:44:43 +03:00
break ;
case MCAST_JOIN_SOURCE_GROUP :
case MCAST_LEAVE_SOURCE_GROUP :
case MCAST_BLOCK_SOURCE :
case MCAST_UNBLOCK_SOURCE :
2020-07-17 09:23:26 +03:00
err = do_mcast_group_source ( sk , optname , optval , optlen ) ;
2007-03-09 07:44:43 +03:00
break ;
case MCAST_MSFILTER :
2020-07-17 09:23:26 +03:00
if ( in_compat_syscall ( ) )
err = compat_ip_set_mcast_msfilter ( sk , optval , optlen ) ;
else
err = ip_set_mcast_msfilter ( sk , optval , optlen ) ;
2007-03-09 07:44:43 +03:00
break ;
2009-05-28 11:00:46 +04:00
case IP_MULTICAST_ALL :
if ( optlen < 1 )
goto e_inval ;
if ( val ! = 0 & & val ! = 1 )
goto e_inval ;
inet - > mc_all = val ;
break ;
2007-03-09 07:44:43 +03:00
case IP_IPSEC_POLICY :
case IP_XFRM_POLICY :
err = - EPERM ;
2022-08-17 09:17:37 +03:00
if ( ! sockopt_ns_capable ( sock_net ( sk ) - > user_ns , CAP_NET_ADMIN ) )
2005-04-17 02:20:36 +04:00
break ;
2020-07-23 09:08:58 +03:00
err = xfrm_user_policy ( sk , optname , optval , optlen ) ;
2007-03-09 07:44:43 +03:00
break ;
2005-04-17 02:20:36 +04:00
2008-10-01 18:30:02 +04:00
case IP_TRANSPARENT :
2022-08-17 09:17:37 +03:00
if ( ! ! val & & ! sockopt_ns_capable ( sock_net ( sk ) - > user_ns , CAP_NET_RAW ) & &
! sockopt_ns_capable ( sock_net ( sk ) - > user_ns , CAP_NET_ADMIN ) ) {
2008-10-01 18:30:02 +04:00
err = - EPERM ;
break ;
}
if ( optlen < 1 )
goto e_inval ;
inet - > transparent = ! ! val ;
break ;
2010-01-12 03:28:01 +03:00
case IP_MINTTL :
if ( optlen < 1 )
goto e_inval ;
if ( val < 0 | | val > 255 )
goto e_inval ;
2021-10-25 19:48:24 +03:00
if ( val )
static_branch_enable ( & ip4_min_ttl ) ;
2021-10-25 19:48:23 +03:00
/* tcp_v4_err() and tcp_v4_rcv() might read min_ttl
 * while we are changing it.
 */
WRITE_ONCE(inet->min_ttl, val);
2010-01-12 03:28:01 +03:00
break ;
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection are done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP, this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
port being shared with other connected UDP sockets.
Hence, relying on the network stack to find a free source port limits the
number of outgoing UDP flows from a single IP address to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
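The narrowing rule above can be illustrated with a small C sketch. This is not the kernel's implementation: struct port_range and effective_port_range() are hypothetical names, and treating a bound of 0 as "unset" is an assumption made for the example. Each per-socket bound is applied only when it falls inside the per-netns range; otherwise the per-netns bound is kept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for this sketch: an inclusive port range. */
struct port_range {
	uint16_t lo;
	uint16_t hi;
};

/*
 * Hypothetical helper (not a kernel function): compute the range a
 * socket would effectively use, given the per-netns range and the
 * per-socket range. A per-socket bound of 0 is assumed to mean "unset".
 */
static inline struct port_range
effective_port_range(struct port_range netns, struct port_range sk)
{
	struct port_range eff = netns;

	assert(netns.lo <= netns.hi);

	/* A per-socket bound can only narrow the per-netns range. */
	if (sk.lo != 0 && netns.lo <= sk.lo && sk.lo <= netns.hi)
		eff.lo = sk.lo;
	if (sk.hi != 0 && netns.lo <= sk.hi && sk.hi <= netns.hi)
		eff.hi = sk.hi;

	return eff;
}
```

With a per-netns range of 32768-60999, a per-socket range of 40000-40511 is used as given, while a per-socket range of 1000-2000 is ignored and the per-netns range applies.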
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
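For C callers, the same packing can be sketched as below. The helper names pack_port_range() and unpack_port_range() are hypothetical; the unpacking and the sanity check mirror what the IP_LOCAL_PORT_RANGE case in the setsockopt handler does with the u32 value (low 16 bits are the lower bound, high 16 bits the upper bound).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack the bounds into the u32 passed to
 * setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &v, sizeof(v)).
 */
static inline uint32_t pack_port_range(uint16_t lo, uint16_t hi)
{
	return (uint32_t)hi << 16 | lo;
}

/*
 * Hypothetical helper: split the u32 back into its bounds, the way the
 * kernel-side handler does. Returns 0 if the pair is acceptable, -1 if
 * both bounds are set and lo > hi (the case the kernel rejects with
 * EINVAL).
 */
static inline int unpack_port_range(uint32_t val, uint16_t *lo, uint16_t *hi)
{
	*lo = (uint16_t)val;
	*hi = (uint16_t)(val >> 16);

	if (*lo != 0 && *hi != 0 && *lo > *hi)
		return -1;
	return 0;
}
```

Because the value is a plain integer in host byte order, no htons()/htonl() conversion is applied to either bound.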
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24 16:36:43 +03:00
case IP_LOCAL_PORT_RANGE:
{
const __u16 lo = val;
const __u16 hi = val >> 16;
if (optlen != sizeof(__u32))
goto e_inval;
if (lo != 0 && hi != 0 && lo > hi)
goto e_inval;
inet->local_port_range.lo = lo;
inet->local_port_range.hi = hi;
break;
}
2007-03-09 07:44:43 +03:00
default :
err = - ENOPROTOOPT ;
break ;
2005-04-17 02:20:36 +04:00
}
2022-08-17 09:17:37 +03:00
sockopt_release_sock ( sk ) ;
2015-03-18 20:50:42 +03:00
if ( needs_rtnl )
rtnl_unlock ( ) ;
2005-04-17 02:20:36 +04:00
return err ;
e_inval :
2022-08-17 09:17:37 +03:00
sockopt_release_sock ( sk ) ;
2015-03-18 20:50:42 +03:00
if ( needs_rtnl )
rtnl_unlock ( ) ;
2005-04-17 02:20:36 +04:00
return - EINVAL ;
}
2010-04-29 02:31:51 +04:00
/**
net-timestamp: allow reading recv cmsg on errqueue with origin tstamp
Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.
On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. The only change is to set ifindex, which is
not initialized for all error origins.
In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.
The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:
ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
v1 -> v2
large rewrite
- integrate with existing pktinfo cmsg generation code
- on ipv4: only send with new flag, to maintain legacy behavior
- on ipv6: send at most a single pktinfo cmsg
- on ipv6: initialize fields if not yet initialized
The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches
1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
(http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-01 06:22:34 +03:00
* ipv4_pktinfo_prepare - transfer some info from rtable to skb
2010-04-29 02:31:51 +04:00
* @sk: socket
* @skb: buffer
*
2012-06-28 14:59:11 +04:00
* To support IP_CMSG_PKTINFO option, we store rt_iif and specific
* destination in skb->cb[] before dst drop.
2013-12-09 00:15:44 +04:00
* This way, receiver doesn't make cache line misses to read rtable.
2010-04-29 02:31:51 +04:00
*/
2013-10-07 20:01:40 +04:00
void ipv4_pktinfo_prepare ( const struct sock * sk , struct sk_buff * skb )
2010-04-29 02:31:51 +04:00
{
ipv4: PKTINFO doesnt need dst reference
On Monday, 7 November 2011 at 15:33 +0100, Eric Dumazet wrote:
> At least, in recent kernels we dont change dst->refcnt in forwarding
> path (using NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>
OK I found it, I did some extra tests and believe its ready.
[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.
We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.
We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.
This removes two atomic operations per packet, and false sharing as
well.
On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.
IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
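For readers on the userspace side, here is a minimal sketch (not taken from the patch) of how a receiver could pull the interface index out of an IP_PKTINFO control message. The names pktinfo_ifindex() and demo_pktinfo_ifindex() are hypothetical; the demo hand-builds the control buffer that recvmsg() would normally fill, so it runs without a socket.

```c
#define _GNU_SOURCE		/* exposes struct in_pktinfo in glibc */
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return the ipi_ifindex carried in an IP_PKTINFO cmsg, or -1 if absent. */
int pktinfo_ifindex(struct msghdr *msg)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_IP &&
		    cmsg->cmsg_type == IP_PKTINFO) {
			struct in_pktinfo info;

			/* CMSG_DATA() may be unaligned for the struct, so copy. */
			memcpy(&info, CMSG_DATA(cmsg), sizeof(info));
			return info.ipi_ifindex;
		}
	}
	return -1;
}

/* Hypothetical demo: build one IP_PKTINFO cmsg by hand and parse it back. */
int demo_pktinfo_ifindex(int ifindex)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
		struct cmsghdr align;	/* forces cmsghdr alignment */
	} control;
	struct in_pktinfo info;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&control, 0, sizeof(control));
	memset(&info, 0, sizeof(info));
	memset(&msg, 0, sizeof(msg));
	info.ipi_ifindex = ifindex;

	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_PKTINFO;
	cmsg->cmsg_len = CMSG_LEN(sizeof(info));
	memcpy(CMSG_DATA(cmsg), &info, sizeof(info));

	return pktinfo_ifindex(&msg);
}
```

In a real receiver the same loop would run over the msghdr returned by recvmsg() on a socket that has IP_PKTINFO enabled.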
2011-11-09 11:24:35 +04:00
	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
2023-08-16 11:15:33 +03:00
	bool prepare = inet_test_bit(PKTINFO, sk) ||
2014-01-20 06:43:08 +04:00
		       ipv6_sk_rxinfo(sk);
2011-11-09 11:24:35 +04:00
2014-01-20 06:43:08 +04:00
	if (prepare && skb_rtable(skb)) {
2016-05-10 21:19:51 +03:00
		/* skb->cb is overloaded: prior to this point it is IP{6}CB
		 * which has interface index (iif) as the first member of the
		 * underlying inet{6}_skb_parm struct. This code then overlays
		 * PKTINFO_SKB_CB and in_pktinfo also has iif as the first
2016-12-29 11:45:04 +03:00
		 * element so the iif is picked up from the prior IPCB. If iif
		 * is the loopback interface, then return the sending interface
		 * (e.g., process binds socket to eth0 for Tx which is
		 * redirected to loopback in the rtable/dst).
2016-05-10 21:19:51 +03:00
		 */
2017-09-14 03:11:37 +03:00
		struct rtable *rt = skb_rtable(skb);
		bool l3slave = ipv4_l3mdev_skb(IPCB(skb)->flags);
		if (pktinfo->ipi_ifindex == LOOPBACK_IFINDEX)
2016-12-29 11:45:04 +03:00
			pktinfo->ipi_ifindex = inet_iif(skb);
2017-09-14 03:11:37 +03:00
		else if (l3slave && rt && rt->rt_iif)
			pktinfo->ipi_ifindex = rt->rt_iif;
2016-12-29 11:45:04 +03:00
2012-06-28 14:59:11 +04:00
		pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
2011-11-09 11:24:35 +04:00
	} else {
		pktinfo->ipi_ifindex = 0;
		pktinfo->ipi_spec_dst.s_addr = 0;
	}
2017-08-03 19:07:07 +03:00
	skb_dst_drop(skb);
2010-04-29 02:31:51 +04:00
}
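The skb->cb[] overlay that ipv4_pktinfo_prepare() relies on can be modeled in userspace. The sketch below is illustrative only: every model_* name is hypothetical, the two structs are simplified stand-ins for inet_skb_parm and in_pktinfo, and the only property it demonstrates is that both views keep the interface index as their first member, so a value stored through one view is read back through the other.

```c
#include <assert.h>
#include <string.h>

#define MODEL_CB_SIZE 48	/* skb->cb[] is a 48-byte scratch area */

/* Simplified stand-ins; only the position of the first member matters. */
struct model_ip_parm {
	int iif;		/* first member, like inet_skb_parm */
	int flags;
};

struct model_pktinfo {
	int ipi_ifindex;	/* first member, like in_pktinfo */
	unsigned int ipi_spec_dst;
	unsigned int ipi_addr;
};

struct model_skb {
	unsigned char cb[MODEL_CB_SIZE];
};

/* The receive path stores the ingress interface through the IPCB-like view. */
void model_set_iif(struct model_skb *skb, int iif)
{
	struct model_ip_parm parm = { .iif = iif, .flags = 0 };

	memcpy(skb->cb, &parm, sizeof(parm));
}

/* A later reader interprets the same bytes through the pktinfo-like view. */
int model_pktinfo_ifindex(const struct model_skb *skb)
{
	struct model_pktinfo info;

	memcpy(&info, skb->cb, sizeof(info));
	return info.ipi_ifindex;
}

/* Store through one view, read through the other. */
int model_overlay_roundtrip(int iif)
{
	struct model_skb skb;

	memset(&skb, 0, sizeof(skb));
	model_set_iif(&skb, iif);
	return model_pktinfo_ifindex(&skb);
}
```

The model copies with memcpy() rather than casting the buffer, which keeps it free of aliasing problems in portable C; the kernel itself uses cast macros over skb->cb.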
2020-07-23 09:09:07 +03:00
int ip_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
		  unsigned int optlen)
2006-03-21 09:45:21 +03:00
{
	int err;
	if (level != SOL_IP)
		return -ENOPROTOOPT;
2020-07-23 09:09:07 +03:00
	err = do_ip_setsockopt(sk, level, optname, optval, optlen);
2018-11-05 16:31:41 +03:00
#if IS_ENABLED(CONFIG_BPFILTER_UMH)
2018-05-22 05:22:30 +03:00
	if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
	    optname < BPFILTER_IPT_SET_MAX)
2020-07-23 09:09:07 +03:00
		err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
2018-05-22 05:22:30 +03:00
#endif
2006-03-21 09:45:21 +03:00
#ifdef CONFIG_NETFILTER
	/* we need to exclude all possible ENOPROTOOPTs except default case */
	if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
2007-11-06 08:32:31 +03:00
	    optname != IP_IPSEC_POLICY &&
	    optname != IP_XFRM_POLICY &&
netfilter: on sockopt() acquire sock lock only in the required scope
Syzbot reported several deadlocks in the netfilter area caused by
rtnl lock and socket lock being acquired with a different order on
different code paths, leading to backtraces like the following one:
======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc9+ #212 Not tainted
------------------------------------------------------
syzkaller041579/3682 is trying to acquire lock:
(sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
include/net/sock.h:1463 [inline]
(sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
but task is already holding lock:
(rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (rtnl_mutex){+.+.}:
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
find_check_entry.isra.7+0x935/0xcf0
net/ipv6/netfilter/ip6_tables.c:580
translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0
-> #0 (sk_lock-AF_INET6){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
lock_sock include/net/sock.h:1463 [inline]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(rtnl_mutex);
lock(sk_lock-AF_INET6);
lock(rtnl_mutex);
lock(sk_lock-AF_INET6);
*** DEADLOCK ***
1 lock held by syzkaller041579/3682:
#0: (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
The problem, as Florian noted, is that nf_setsockopt() is always
called with the socket lock held, even if the lock itself is required only
for very tight scopes and only for some operations.
This patch addresses the issue by moving the lock_sock() call only
where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
no longer needs to acquire both locks.
Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-30 21:01:40 +03:00
	    !ip_mroute_opt(optname))
2020-07-23 09:09:07 +03:00
		err = nf_setsockopt(sk, PF_INET, optname, optval, optlen);
2006-03-21 09:45:21 +03:00
#endif
	return err;
}
2009-06-02 11:42:16 +04:00
EXPORT_SYMBOL(ip_setsockopt);
2006-03-21 09:45:21 +03:00
2005-04-17 02:20:36 +04:00
/*
2009-06-02 11:42:16 +04:00
 *	Get the options. Note for future reference. The GET of IP options gets
 *	the _received_ ones. The set sets the _sent_ ones.
2005-04-17 02:20:36 +04:00
 */
2015-11-04 02:41:16 +03:00
static bool getsockopt_needs_rtnl(int optname)
{
	switch (optname) {
	case IP_MSFILTER:
	case MCAST_MSFILTER:
		return true;
	}
	return false;
}
2022-09-02 03:28:28 +03:00
static int ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
				 sockptr_t optlen, int len)
2020-07-17 09:23:23 +03:00
{
2021-08-04 23:45:36 +03:00
	const int size0 = offsetof(struct group_filter, gf_slist_flex);
2020-07-17 09:23:23 +03:00
	struct group_filter gsf;
2022-09-02 03:28:28 +03:00
	int num, gsf_size;
2020-07-17 09:23:23 +03:00
	int err;
	if (len < size0)
		return -EINVAL;
2022-09-02 03:28:28 +03:00
	if (copy_from_sockptr(&gsf, optval, size0))
2020-07-17 09:23:23 +03:00
		return -EFAULT;
	num = gsf.gf_numsrc;
2022-09-02 03:28:28 +03:00
	err = ip_mc_gsfget(sk, &gsf, optval,
			   offsetof(struct group_filter, gf_slist_flex));
2020-07-17 09:23:23 +03:00
	if (err)
		return err;
	if (gsf.gf_numsrc < num)
		num = gsf.gf_numsrc;
2022-09-02 03:28:28 +03:00
	gsf_size = GROUP_FILTER_SIZE(num);
	if (copy_to_sockptr(optlen, &gsf_size, sizeof(int)) ||
	    copy_to_sockptr(optval, &gsf, size0))
2020-07-17 09:23:23 +03:00
		return -EFAULT;
	return 0;
}
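The length that ip_get_mcast_msfilter() writes back can be sketched from userspace with the GROUP_FILTER_SIZE() macro that libc exports in <netinet/in.h>. This is a rough model under the assumption that the libc macro matches the kernel's formula; filter_bytes() and reported_len() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Bytes a caller needs for a group_filter with room for n source addresses. */
size_t filter_bytes(uint32_t n)
{
	return GROUP_FILTER_SIZE(n);
}

/*
 * Model of the reported length: the count is clamped to the smaller of what
 * the caller asked for and what the socket actually holds.
 */
size_t reported_len(uint32_t requested, uint32_t actual)
{
	uint32_t num = actual < requested ? actual : requested;

	return GROUP_FILTER_SIZE(num);
}
```

A caller would typically size its buffer with filter_bytes(n), set gf_numsrc to n, and then compare the returned length against the buffer size to see whether fewer sources were reported.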
2022-09-02 03:28:28 +03:00
static int compat_ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
					sockptr_t optlen, int len)
2020-07-17 09:23:23 +03:00
{
2021-08-04 23:45:36 +03:00
	const int size0 = offsetof(struct compat_group_filter, gf_slist_flex);
2020-07-17 09:23:23 +03:00
	struct compat_group_filter gf32;
	struct group_filter gf;
	int num;
2020-07-17 09:23:26 +03:00
	int err;
2020-07-17 09:23:23 +03:00
	if (len < size0)
		return -EINVAL;
2022-09-02 03:28:28 +03:00
	if (copy_from_sockptr(&gf32, optval, size0))
2020-07-17 09:23:23 +03:00
		return -EFAULT;
	gf.gf_interface = gf32.gf_interface;
	gf.gf_fmode = gf32.gf_fmode;
	num = gf.gf_numsrc = gf32.gf_numsrc;
	gf.gf_group = gf32.gf_group;
2022-09-02 03:28:28 +03:00
	err = ip_mc_gsfget(sk, &gf, optval,
			   offsetof(struct compat_group_filter, gf_slist_flex));
2020-07-17 09:23:23 +03:00
	if (err)
		return err;
	if (gf.gf_numsrc < num)
		num = gf.gf_numsrc;
	len = GROUP_FILTER_SIZE(num) - (sizeof(gf) - sizeof(gf32));
2022-09-02 03:28:28 +03:00
	if (copy_to_sockptr(optlen, &len, sizeof(int)) ||
	    copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_fmode),
				   &gf.gf_fmode, sizeof(gf.gf_fmode)) ||
	    copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_numsrc),
				   &gf.gf_numsrc, sizeof(gf.gf_numsrc)))
2020-07-17 09:23:23 +03:00
		return -EFAULT;
	return 0;
}
2022-09-02 03:29:25 +03:00
int do_ip_getsockopt(struct sock *sk, int level, int optname,
		     sockptr_t optval, sockptr_t optlen)
2005-04-17 02:20:36 +04:00
{
	struct inet_sock *inet = inet_sk(sk);
2015-11-04 02:41:16 +03:00
	bool needs_rtnl = getsockopt_needs_rtnl(optname);
	int val, err = 0;
2005-04-17 02:20:36 +04:00
	int len;
2007-02-09 17:24:47 +03:00
2007-03-09 07:44:43 +03:00
	if (level != SOL_IP)
2005-04-17 02:20:36 +04:00
		return -EOPNOTSUPP;
2007-11-06 08:32:31 +03:00
	if (ip_mroute_opt(optname))
2008-11-03 11:27:11 +03:00
		return ip_mroute_getsockopt(sk, optname, optval, optlen);
2005-04-17 02:20:36 +04:00
2022-09-02 03:28:28 +03:00
	if (copy_from_sockptr(&len, optlen, sizeof(int)))
2005-04-17 02:20:36 +04:00
		return -EFAULT;
2007-03-09 07:44:43 +03:00
	if (len < 0)
2005-04-17 02:20:36 +04:00
		return -EINVAL;
2007-02-09 17:24:47 +03:00
inet: set/get simple options locklessly
Now that we have inet->inet_flags, we can set the following options
without having to hold the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE.
ip_sock_set_pktinfo() no longer holds the socket lock.
Similarly, we can get the following options without holding
the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16 11:15:34 +03:00
	/* Handle options that can be read without locking the socket. */
	switch (optname) {
	case IP_PKTINFO:
		val = inet_test_bit(PKTINFO, sk);
		goto copyval;
	case IP_RECVTTL:
		val = inet_test_bit(TTL, sk);
		goto copyval;
	case IP_RECVTOS:
		val = inet_test_bit(TOS, sk);
		goto copyval;
	case IP_RECVOPTS:
		val = inet_test_bit(RECVOPTS, sk);
		goto copyval;
	case IP_RETOPTS:
		val = inet_test_bit(RETOPTS, sk);
		goto copyval;
	case IP_PASSSEC:
		val = inet_test_bit(PASSSEC, sk);
		goto copyval;
	case IP_RECVORIGDSTADDR:
		val = inet_test_bit(ORIGDSTADDR, sk);
		goto copyval;
	case IP_CHECKSUM:
		val = inet_test_bit(CHECKSUM, sk);
		goto copyval;
	case IP_RECVFRAGSIZE:
		val = inet_test_bit(RECVFRAGSIZE, sk);
		goto copyval;
2023-08-16 11:15:35 +03:00
	case IP_RECVERR:
		val = inet_test_bit(RECVERR, sk);
		goto copyval;
2023-08-16 11:15:36 +03:00
	case IP_RECVERR_RFC4884:
		val = inet_test_bit(RECVERR_RFC4884, sk);
		goto copyval;
2023-08-16 11:15:37 +03:00
	case IP_FREEBIND:
		val = inet_test_bit(FREEBIND, sk);
		goto copyval;
2023-08-16 11:15:38 +03:00
	case IP_HDRINCL:
		val = inet_test_bit(HDRINCL, sk);
		goto copyval;
2023-08-16 11:15:34 +03:00
	}
2015-11-04 02:41:16 +03:00
	if (needs_rtnl)
		rtnl_lock();
2022-09-02 03:28:34 +03:00
	sockopt_lock_sock(sk);
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
	switch (optname) {
	case IP_OPTIONS:
	{
		unsigned char optbuf[sizeof(struct ip_options) + 40];
2011-04-21 13:45:37 +04:00
		struct ip_options *opt = (struct ip_options *)optbuf;
		struct ip_options_rcu *inet_opt;
		inet_opt = rcu_dereference_protected(inet->inet_opt,
2016-04-05 18:10:15 +03:00
						     lockdep_sock_is_held(sk));
2007-03-09 07:44:43 +03:00
		opt->optlen = 0;
2011-04-21 13:45:37 +04:00
		if (inet_opt)
			memcpy(optbuf, &inet_opt->opt,
			       sizeof(struct ip_options) +
			       inet_opt->opt.optlen);
2022-09-02 03:28:34 +03:00
		sockopt_release_sock(sk);
2007-03-09 07:44:43 +03:00
2022-09-02 03:28:28 +03:00
		if (opt->optlen == 0) {
			len = 0;
			return copy_to_sockptr(optlen, &len, sizeof(int));
		}
2007-03-09 07:44:43 +03:00
		ip_options_undo(opt);
		len = min_t(unsigned int, len, opt->optlen);
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optlen, &len, sizeof(int)))
2007-03-09 07:44:43 +03:00
			return -EFAULT;
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optval, opt->__data, len))
2007-03-09 07:44:43 +03:00
			return -EFAULT;
		return 0;
	}
	case IP_TOS:
		val = inet->tos;
		break;
	case IP_TTL:
2016-02-15 13:11:27 +03:00
	{
		struct net *net = sock_net(sk);
2007-03-09 07:44:43 +03:00
		val = (inet->uc_ttl == -1 ?
2022-07-13 23:51:51 +03:00
		       READ_ONCE(net->ipv4.sysctl_ip_default_ttl) :
2007-03-09 07:44:43 +03:00
		       inet->uc_ttl);
		break;
2016-02-15 13:11:27 +03:00
	}
2010-09-11 00:26:56 +04:00
	case IP_NODEFRAG:
		val = inet->nodefrag;
		break;
inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).
As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But the kernel does not know yet whether this socket is going to be a
listener or be connected.
It has very limited choices (no full knowledge of the final 4-tuple for a
connect()).
With a limited ephemeral port range (about 32K ports), it is very easy to
fill the space.
This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
This new feature is available for both IPv4 and IPv6 (Thanks Neal)
Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also, getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
IPv6 test :
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.
lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets
lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
Check that a given source port is indeed used by many different
connections :
lpaa23:~# ss -t src :40000 | head -10
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07 07:17:57 +03:00
	case IP_BIND_ADDRESS_NO_PORT:
		val = inet->bind_address_no_port;
		break;
2007-03-09 07:44:43 +03:00
	case IP_MTU_DISCOVER:
		val = inet->pmtudisc;
		break;
	case IP_MTU:
	{
		struct dst_entry *dst;
		val = 0;
		dst = sk_dst_get(sk);
		if (dst) {
			val = dst_mtu(dst);
			dst_release(dst);
2005-04-17 02:20:36 +04:00
		}
2007-03-09 07:44:43 +03:00
		if (!val) {
2022-09-02 03:28:34 +03:00
			sockopt_release_sock(sk);
2007-03-09 07:44:43 +03:00
			return -ENOTCONN;
2005-04-17 02:20:36 +04:00
		}
2007-03-09 07:44:43 +03:00
		break;
	}
	case IP_MULTICAST_TTL:
		val = inet->mc_ttl;
		break;
	case IP_MULTICAST_LOOP:
		val = inet->mc_loop;
		break;
2012-02-08 13:11:07 +04:00
	case IP_UNICAST_IF:
		val = (__force int)htonl((__u32)inet->uc_index);
		break;
2007-03-09 07:44:43 +03:00
	case IP_MULTICAST_IF:
	{
		struct in_addr addr;
		len = min_t(unsigned int, len, sizeof(struct in_addr));
		addr.s_addr = inet->mc_addr;
2022-09-02 03:28:34 +03:00
		sockopt_release_sock(sk);
2005-04-17 02:20:36 +04:00
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optlen, &len, sizeof(int)))
2007-03-09 07:44:43 +03:00
			return -EFAULT;
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optval, &addr, len))
2007-03-09 07:44:43 +03:00
			return -EFAULT;
		return 0;
	}
	case IP_MSFILTER:
	{
		struct ip_msfilter msf;
2021-08-04 21:23:25 +03:00
		if (len < IP_MSFILTER_SIZE(0)) {
2015-11-04 02:41:16 +03:00
			err = -EINVAL;
			goto out;
2005-04-17 02:20:36 +04:00
		}
2022-09-02 03:28:28 +03:00
		if (copy_from_sockptr(&msf, optval, IP_MSFILTER_SIZE(0))) {
2015-11-04 02:41:16 +03:00
			err = -EFAULT;
			goto out;
2005-04-17 02:20:36 +04:00
		}
2022-09-02 03:28:28 +03:00
		err = ip_mc_msfget(sk, &msf, optval, optlen);
2015-11-04 02:41:16 +03:00
		goto out;
2007-03-09 07:44:43 +03:00
	}
	case MCAST_MSFILTER:
2020-07-17 09:23:26 +03:00
		if (in_compat_syscall())
			err = compat_ip_get_mcast_msfilter(sk, optval, optlen,
							   len);
		else
			err = ip_get_mcast_msfilter(sk, optval, optlen, len);
2015-11-04 02:41:16 +03:00
		goto out;
2009-05-28 11:00:46 +04:00
	case IP_MULTICAST_ALL:
		val = inet->mc_all;
		break;
2007-03-09 07:44:43 +03:00
	case IP_PKTOPTIONS:
	{
		struct msghdr msg;
2005-04-17 02:20:36 +04:00
2022-09-02 03:28:34 +03:00
		sockopt_release_sock(sk);
2005-04-17 02:20:36 +04:00
2007-03-09 07:44:43 +03:00
		if (sk->sk_type != SOCK_STREAM)
			return -ENOPROTOOPT;
2005-04-17 02:20:36 +04:00
2022-09-02 03:28:28 +03:00
		if (optval.is_kernel) {
			msg.msg_control_is_user = false;
			msg.msg_control = optval.kernel;
		} else {
			msg.msg_control_is_user = true;
			msg.msg_control_user = optval.user;
		}
2007-03-09 07:44:43 +03:00
		msg.msg_controllen = len;
2020-07-17 09:23:26 +03:00
		msg.msg_flags = in_compat_syscall() ? MSG_CMSG_COMPAT : 0;
2005-04-17 02:20:36 +04:00
2023-08-16 11:15:33 +03:00
		if (inet_test_bit(PKTINFO, sk)) {
2007-03-09 07:44:43 +03:00
			struct in_pktinfo info;
2009-10-15 10:30:45 +04:00
			info.ipi_addr.s_addr = inet->inet_rcv_saddr;
			info.ipi_spec_dst.s_addr = inet->inet_rcv_saddr;
2007-03-09 07:44:43 +03:00
			info.ipi_ifindex = inet->mc_index;
			put_cmsg(&msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
2005-04-17 02:20:36 +04:00
		}
2023-08-16 11:15:33 +03:00
		if (inet_test_bit(TTL, sk)) {
2007-03-09 07:44:43 +03:00
			int hlim = inet->mc_ttl;
			put_cmsg(&msg, SOL_IP, IP_TTL, sizeof(hlim), &hlim);
		}
2023-08-16 11:15:33 +03:00
		if (inet_test_bit(TOS, sk)) {
2012-02-09 13:35:49 +04:00
			int tos = inet->rcv_tos;
			put_cmsg(&msg, SOL_IP, IP_TOS, sizeof(tos), &tos);
		}
2007-03-09 07:44:43 +03:00
		len -= msg.msg_controllen;
2022-09-02 03:28:28 +03:00
		return copy_to_sockptr(optlen, &len, sizeof(int));
2007-03-09 07:44:43 +03:00
	}
2008-10-01 18:30:02 +04:00
	case IP_TRANSPARENT:
		val = inet->transparent;
		break;
2010-01-12 03:28:01 +03:00
	case IP_MINTTL:
		val = inet->min_ttl;
		break;
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is, those which leave it to the
network stack to find a free local port at connect() time, can use
this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port limits the
number of outgoing UDP flows from a single IP address to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
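As a C counterpart to the Python snippets in the message above, the following sketch packs and unpacks the u32 value in the same layout the getsockopt side uses (high bound in the upper 16 bits, low bound in the lower 16 bits). The port_range_* helper names are hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a [lo, hi] port range into the u32 passed to IP_LOCAL_PORT_RANGE. */
uint32_t port_range_pack(uint16_t lo, uint16_t hi)
{
	return (uint32_t)hi << 16 | lo;
}

/* Low bound: the lower 16 bits of the packed value. */
uint16_t port_range_lo(uint32_t v)
{
	return (uint16_t)(v & 0xffff);
}

/* High bound: the upper 16 bits of the packed value. */
uint16_t port_range_hi(uint32_t v)
{
	return (uint16_t)(v >> 16);
}
```

A caller would pass the packed value by address to setsockopt() at the SOL_IP level and could decode the value read back by getsockopt() with the two accessors.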
2023-01-24 16:36:43 +03:00
	case IP_LOCAL_PORT_RANGE:
		val = inet->local_port_range.hi << 16 | inet->local_port_range.lo;
		break;
2023-05-22 15:08:20 +03:00
	case IP_PROTOCOL:
		val = inet_sk(sk)->inet_num;
		break;
2007-03-09 07:44:43 +03:00
	default:
2022-09-02 03:28:34 +03:00
		sockopt_release_sock(sk);
2007-03-09 07:44:43 +03:00
		return -ENOPROTOOPT;
2005-04-17 02:20:36 +04:00
	}
2022-09-02 03:28:34 +03:00
	sockopt_release_sock(sk);
2023-08-16 11:15:34 +03:00
copyval:
2009-06-02 11:42:16 +04:00
	if (len < sizeof(int) && len > 0 && val >= 0 && val <= 255) {
2005-04-17 02:20:36 +04:00
		unsigned char ucval = (unsigned char)val;
		len = 1;
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optlen, &len, sizeof(int)))
2005-04-17 02:20:36 +04:00
			return -EFAULT;
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optval, &ucval, 1))
2005-04-17 02:20:36 +04:00
			return -EFAULT;
	} else {
		len = min_t(unsigned int, sizeof(int), len);
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optlen, &len, sizeof(int)))
2005-04-17 02:20:36 +04:00
			return -EFAULT;
2022-09-02 03:28:28 +03:00
		if (copy_to_sockptr(optval, &val, len))
2005-04-17 02:20:36 +04:00
			return -EFAULT;
	}
	return 0;
2015-11-04 02:41:16 +03:00
out:
2022-09-02 03:28:34 +03:00
	sockopt_release_sock(sk);
2015-11-04 02:41:16 +03:00
	if (needs_rtnl)
		rtnl_unlock();
	return err;
2005-04-17 02:20:36 +04:00
}
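The length handling at the copyval label is easy to misread, so here is a small userspace model of it. It is a sketch only: model_copyval() and model_copyval_len() are hypothetical names, and the model writes into a plain buffer where the kernel uses copy_to_sockptr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the copyval tail. len is the caller-supplied buffer length,
 * already known to be >= 0. Returns the length reported back through
 * optlen and fills out[] with the bytes written through optval.
 */
int model_copyval(int val, int len, unsigned char *out)
{
	if (len > 0 && (size_t)len < sizeof(int) && val >= 0 && val <= 255) {
		/* Short buffer and a value that fits in a byte: write one byte. */
		out[0] = (unsigned char)val;
		return 1;
	}
	if ((size_t)len > sizeof(int))
		len = (int)sizeof(int);	/* never write more than an int */
	memcpy(out, &val, (size_t)len);
	return len;
}

/* Convenience wrapper that only reports the resulting length. */
int model_copyval_len(int val, int len)
{
	unsigned char buf[sizeof(int)] = { 0 };

	return model_copyval(val, len, buf);
}
```

The model makes one consequence visible: a value outside 0..255 read into a 2- or 3-byte buffer is truncated to the leading bytes of the int rather than rejected.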
2006-03-21 09:45:21 +03:00
int ip_getsockopt(struct sock *sk, int level,
2007-03-09 07:44:43 +03:00
		  int optname, char __user *optval, int __user *optlen)
2006-03-21 09:45:21 +03:00
{
	int err;
2022-09-02 03:28:28 +03:00
	err = do_ip_getsockopt(sk, level, optname,
			       USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
2008-04-29 14:23:22 +04:00
2018-11-05 16:31:41 +03:00
#if IS_ENABLED(CONFIG_BPFILTER_UMH)
2018-05-22 05:22:30 +03:00
	if (optname >= BPFILTER_IPT_SO_GET_INFO &&
	    optname < BPFILTER_IPT_GET_MAX)
		err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
#endif
2006-03-21 09:45:21 +03:00
#ifdef CONFIG_NETFILTER
	/* we need to exclude all possible ENOPROTOOPTs except default case */
2007-11-06 08:32:31 +03:00
	if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
	    !ip_mroute_opt(optname)) {
2007-02-09 17:24:47 +03:00
		int len;
2006-03-21 09:45:21 +03:00
2006-03-21 09:48:35 +03:00
		if (get_user(len, optlen))
2006-03-21 09:45:21 +03:00
			return -EFAULT;
2020-07-17 09:23:20 +03:00
		err = nf_getsockopt(sk, PF_INET, optname, optval, &len);
2006-03-21 09:45:21 +03:00
		if (err >= 0)
			err = put_user(len, optlen);
		return err;
	}
#endif
	return err;
}
2020-07-17 09:23:26 +03:00
EXPORT_SYMBOL(ip_getsockopt);