2017-11-01 15:08:43 +01:00
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
2012-12-17 13:47:09 +00:00
# ifndef _UAPI_ASM_SOCKET_H
# define _UAPI_ASM_SOCKET_H
2019-03-11 16:38:17 +01:00
# include <linux/posix_types.h>
2012-12-17 13:47:09 +00:00
# include <asm/sockios.h>
/* For setsockopt(2) */
/*
* Note: we only bother about making the SOL_SOCKET options
* same as OSF/1, as that's all that "normal" programs are
* likely to set.  We don't necessarily want to be binary
* compatible with _everything_.
*/
# define SOL_SOCKET 0xffff
# define SO_DEBUG 0x0001
# define SO_REUSEADDR 0x0004
# define SO_KEEPALIVE 0x0008
# define SO_DONTROUTE 0x0010
# define SO_BROADCAST 0x0020
# define SO_LINGER 0x0080
# define SO_OOBINLINE 0x0100
2013-01-22 09:49:50 +00:00
# define SO_REUSEPORT 0x0200
2012-12-17 13:47:09 +00:00
# define SO_TYPE 0x1008
# define SO_ERROR 0x1007
# define SO_SNDBUF 0x1001
# define SO_RCVBUF 0x1002
# define SO_SNDBUFFORCE 0x100a
# define SO_RCVBUFFORCE 0x100b
# define SO_RCVLOWAT 0x1010
# define SO_SNDLOWAT 0x1011
2019-02-02 07:34:53 -08:00
# define SO_RCVTIMEO_OLD 0x1012
# define SO_SNDTIMEO_OLD 0x1013
2012-12-17 13:47:09 +00:00
# define SO_ACCEPTCONN 0x1014
# define SO_PROTOCOL 0x1028
# define SO_DOMAIN 0x1029
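Because the numeric values above follow OSF/1 (SOL_SOCKET is 0xffff here, while many other Linux architectures use 1), user code should refer to these options by name only. A minimal sketch of that, using two options from the list above; the helper names enable_reuseaddr() and socket_type() are hypothetical and not part of this header:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: enable SO_REUSEADDR on fd.
 * Returns 0 on success, -1 with errno set on failure. */
int enable_reuseaddr(int fd)
{
    int on = 1;

    /* Symbolic names only: the numeric values differ between architectures. */
    return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
}

/* Hypothetical helper: read SO_TYPE (SOCK_STREAM, SOCK_DGRAM, ...) into *type.
 * Returns 0 on success, -1 with errno set on failure. */
int socket_type(int fd, int *type)
{
    socklen_t len = sizeof(*type);

    assert(type != NULL);
    return getsockopt(fd, SOL_SOCKET, SO_TYPE, type, &len);
}
```

A program built this way picks up whichever values the target architecture's header defines at compile time.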
/* linux-specific, might as well be the same as on i386 */
# define SO_NO_CHECK 11
# define SO_PRIORITY 12
# define SO_BSDCOMPAT 14
# define SO_PASSCRED 17
# define SO_PEERCRED 18
# define SO_BINDTODEVICE 25
/* Socket filtering */
# define SO_ATTACH_FILTER 26
# define SO_DETACH_FILTER 27
# define SO_GET_FILTER SO_ATTACH_FILTER
# define SO_PEERNAME 28
# define SO_PEERSEC 30
# define SO_PASSSEC 34
/* Security levels - as per NRL IPv6 - don't actually do anything */
# define SO_SECURITY_AUTHENTICATION 19
# define SO_SECURITY_ENCRYPTION_TRANSPORT 20
# define SO_SECURITY_ENCRYPTION_NETWORK 21
# define SO_MARK 36
# define SO_RXQ_OVFL 40
# define SO_WIFI_STATUS 41
# define SCM_WIFI_STATUS SO_WIFI_STATUS
# define SO_PEEK_OFF 42
/* Instruct lower device to use last 4-bytes of skb data as FCS */
# define SO_NOFCS 43
2013-01-16 22:55:49 +01:00
# define SO_LOCK_FILTER 44
2012-12-17 13:47:09 +00:00
2013-03-28 11:19:25 +00:00
# define SO_SELECT_ERR_QUEUE 45
2013-09-24 08:20:52 -07:00
# define SO_BUSY_POLL 46
# define SO_MAX_PACING_RATE 47
2013-06-14 16:33:57 +03:00
2014-01-17 17:09:45 +01:00
# define SO_BPF_EXTENSIONS 48
net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of millions of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use:
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as the whole networking stack should then run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 05:54:28 -08:00
# define SO_INCOMING_CPU 49
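The getsockopt() call from the commit message above can be wrapped with error handling and followed by a silo choice. This is only a sketch: get_incoming_cpu() and pick_worker() are hypothetical names, the modulo mapping is an assumed policy rather than anything the commit prescribes, and the fallback value 49 is taken from this header for toolchains whose headers predate the option.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49 /* assumption: value as defined in this header */
#endif

/* Hypothetical helper: store the CPU that delivered the last packet in *cpu.
 * Returns 0 on success, -1 with errno set on failure. */
int get_incoming_cpu(int fd, int *cpu)
{
    socklen_t len = sizeof(*cpu);

    assert(cpu != NULL);
    return getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, cpu, &len);
}

/* Hypothetical policy: map a CPU number onto one of nworkers silos.
 * A negative CPU (nothing recorded yet) falls back to worker 0. */
int pick_worker(int cpu, int nworkers)
{
    assert(nworkers > 0);
    if (cpu < 0)
        return 0;
    return cpu % nworkers;
}
```

With one worker thread pinned per RX queue CPU, the modulo reduces to the identity mapping; other layouts would need their own table.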
2014-12-01 15:06:35 -08:00
# define SO_ATTACH_BPF 50
# define SO_DETACH_BPF SO_DETACH_FILTER
2016-01-04 17:41:47 -05:00
# define SO_ATTACH_REUSEPORT_CBPF 51
# define SO_ATTACH_REUSEPORT_EBPF 52
2016-02-24 10:02:52 -08:00
# define SO_CNX_ADVICE 53
2016-11-27 23:07:18 -08:00
# define SCM_TIMESTAMPING_OPT_STATS 54
2017-03-20 15:22:03 -04:00
# define SO_MEMINFO 55
2017-03-24 10:08:36 -07:00
# define SO_INCOMING_NAPI_ID 56
2017-04-05 19:00:55 -07:00
# define SO_COOKIE 57
2017-05-21 23:13:37 -04:00
# define SCM_TIMESTAMPING_PKTINFO 58
net: introduce SO_PEERGROUPS getsockopt
This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
retrieve the auxiliary groups of the remote peer. It is designed to
naturally extend SO_PEERCRED. That is, the underlying data is from the
same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
is, if the provided buffer is too small, ERANGE is returned and @optlen
is updated. Otherwise, the information is copied, @optlen is set to the
actual size, and 0 is returned.
While SO_PEERCRED (and thus `struct ucred') already returns the primary
group, it lacks the auxiliary group vector. However, nearly all access
controls (including kernel side VFS and SYSVIPC, but also user-space
polkit, DBus, ...) consider the entire set of groups, rather than just
the primary group. But this is currently not possible with pure
SO_PEERCRED. Instead, user-space has to work around this and query the
system database for the auxiliary groups of a UID retrieved via
SO_PEERCRED.
Unfortunately, there is no race-free way to query the auxiliary groups
of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
solution is to use getgrouplist(3p), which itself falls back to NSS and
whatever is configured in nsswitch.conf(3). This effectively checks
which groups we *would* assign to the user if it logged in *now*. On
normal systems it is as easy as reading /etc/group, but with NSS it can
resort to querying network databases (e.g., LDAP), using IPC or network
communication.
Long story short: Whenever we want to use auxiliary groups for access
checks on IPC, we need further IPC to talk to the user/group databases,
rather than just relying on SO_PEERCRED and the incoming socket. This
is unfortunate, and might even result in dead-locks if the database
query uses the same IPC as the original request.
So far, those recursions / dead-locks have been avoided by using
primitive IPC for all crucial NSS modules. However, we want to avoid
re-inventing the wheel for each NSS module that might be involved in
user/group queries. Hence, we would preferably make DBus (and other IPC
that supports access-management based on groups) work without resorting
to the user/group database. This new SO_PEERGROUPS socket option would allow us
to make dbus-daemon work without ever calling into NSS.
Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 10:47:15 +02:00
# define SO_PEERGROUPS 59
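The ERANGE contract described in the commit message above (buffer too small: ERANGE is returned and the length is updated) suggests a query-then-retry loop. A sketch under that reading; get_peer_groups() is a hypothetical name, the retry limit of 4 is arbitrary, and the fallback value 59 is taken from this header.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_PEERGROUPS
#define SO_PEERGROUPS 59 /* assumption: value as defined in this header */
#endif

/* Hypothetical helper: fetch the peer's auxiliary groups.
 * On success returns 0 and stores a malloc'ed array in *groups (NULL when
 * the peer has no auxiliary groups) and its length in *count.
 * On failure returns -1 with errno set. The caller frees *groups. */
int get_peer_groups(int fd, gid_t **groups, size_t *count)
{
    socklen_t len = 0; /* start empty so the kernel reports the needed size */
    gid_t *buf = NULL;
    int saved_errno;

    assert(groups != NULL && count != NULL);
    for (int attempt = 0; attempt < 4; attempt++) {
        if (getsockopt(fd, SOL_SOCKET, SO_PEERGROUPS, buf, &len) == 0) {
            *groups = buf;
            *count = len / sizeof(gid_t);
            return 0;
        }
        if (errno != ERANGE || len == 0)
            break; /* a real error, not just "buffer too small" */

        /* len now holds the required size in bytes; grow and retry. */
        gid_t *bigger = realloc(buf, len);
        if (bigger == NULL)
            break;
        buf = bigger;
    }
    saved_errno = errno;
    free(buf);
    errno = saved_errno;
    return -1;
}
```

A caller would typically pair this with SO_PEERCRED on the same socket, so that the UID, primary GID, and auxiliary groups all come from the peer's credentials rather than from a user database lookup.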
2017-08-03 16:29:40 -04:00
# define SO_ZEROCOPY 60
2018-07-03 15:42:48 -07:00
# define SO_TXTIME 61
# define SCM_TXTIME SO_TXTIME
net: introduce SO_BINDTOIFINDEX sockopt
This introduces a new generic SOL_SOCKET-level socket option called
SO_BINDTOIFINDEX. It behaves similarly to SO_BINDTODEVICE, but takes a
network interface index as argument, rather than the network interface
name.
User-space often refers to network-interfaces via their index, but has
to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
This might pose problems when the network-device is renamed
asynchronously by other parts of the system. When this happens, the
SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
device.
In most cases user-space only ever operates on devices which they
either manage themselves, or otherwise have a guarantee that the device
name will not change (e.g., devices that are UP cannot be renamed).
However, particularly in libraries this guarantee is non-obvious and it
would be nice if that race-condition would simply not exist. It would
make it easier for those libraries to operate even in situations where
the device-name might change under the hood.
A real use-case that we recently hit is trying to start the network
stack early in the initrd but make it survive into the real system.
Existing distributions rename network-interfaces during the transition
from initrd into the real system. This, obviously, cannot affect
devices that are up and running (unless you also consider moving them
between network-namespaces). However, the network manager now has to
make sure its management engine for dormant devices will not run in
parallel to these renames. Particularly, when you offload operations
like DHCP into separate processes, these might set up their sockets
early, and thus have to resolve the device-name possibly running into
this race-condition.
By avoiding a call to resolve the device-name, we no longer depend on
the name and can run network setup of dormant devices in parallel to
the transition off the initrd. The SO_BINDTOIFINDEX socket option plugs this
race.
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15 14:42:14 +01:00
# define SO_BINDTOIFINDEX 62
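Binding by index, as the commit message above describes, reduces to a single setsockopt() call that takes an int. A minimal sketch; bind_to_ifindex() is a hypothetical name, the fallback value 62 is taken from this header, and the call may fail on kernels that predate the option or for callers lacking the required privileges.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_BINDTOIFINDEX
#define SO_BINDTOIFINDEX 62 /* assumption: value as defined in this header */
#endif

/* Hypothetical helper: bind fd to the interface with the given index.
 * No name lookup is involved, so a concurrent rename cannot redirect it.
 * Returns 0 on success, -1 with errno set on failure. */
int bind_to_ifindex(int fd, int ifindex)
{
    assert(ifindex >= 0); /* interface indexes are never negative */
    return setsockopt(fd, SOL_SOCKET, SO_BINDTOIFINDEX,
                      &ifindex, sizeof(ifindex));
}
```

The index would normally come from whatever already identified the device, for example a netlink notification, so no name ever needs to be resolved.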
2019-02-02 07:34:46 -08:00
# define SO_TIMESTAMP_OLD 29
# define SO_TIMESTAMPNS_OLD 35
# define SO_TIMESTAMPING_OLD 37
2019-02-02 07:34:50 -08:00
# define SO_TIMESTAMP_NEW 63
# define SO_TIMESTAMPNS_NEW 64
2019-02-02 07:34:51 -08:00
# define SO_TIMESTAMPING_NEW 65
2019-02-02 07:34:50 -08:00
2019-02-02 07:34:54 -08:00
# define SO_RCVTIMEO_NEW 66
# define SO_SNDTIMEO_NEW 67
2019-02-02 07:34:46 -08:00
2019-06-13 15:00:01 -07:00
# define SO_DETACH_REUSEPORT_BPF 68
net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not scheduled,
busy-polling will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavily loaded
NAPI context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer the enabling of interrupts and
instead schedule the NAPI context from a watchdog timer. When a user
enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will time out, and
regular softirq handling will resume.
In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will time out and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-11-30 19:51:56 +01:00
# define SO_PREFER_BUSY_POLL 69
2020-11-30 19:51:57 +01:00
# define SO_BUSY_POLL_BUDGET 70
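The last step of the example usage above, enabling SO_BUSY_POLL and SO_PREFER_BUSY_POLL on the socket, might look like the following sketch. enable_preferred_busy_poll() is a hypothetical name, the fallback values 46 and 69 are taken from this header, and both calls can fail (for instance for unprivileged callers or on older kernels), so the return value should be checked.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46 /* assumption: value as defined in this header */
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69 /* assumption: value as defined in this header */
#endif

/* Hypothetical helper: request busy-polling for up to `usecs` microseconds
 * and mark the socket as preferring busy-polling over softirq processing.
 * Returns 0 on success, -1 with errno set on the first failing call. */
int enable_preferred_busy_poll(int fd, int usecs)
{
    int on = 1;

    assert(usecs >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on));
}
```

The socket options only take full effect together with the napi_defer_hard_irqs and gro_flush_timeout settings shown in the example usage.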
2020-11-30 19:51:56 +01:00
2021-06-23 15:56:45 +02:00
# define SO_NETNS_COOKIE 71
2021-08-04 10:55:56 +03:00
# define SO_BUF_LOCK 72
2019-02-02 07:34:54 -08:00
# if !defined(__KERNEL__)
2019-02-02 07:34:53 -08:00
2019-02-02 07:34:50 -08:00
# if __BITS_PER_LONG == 64
# define SO_TIMESTAMP SO_TIMESTAMP_OLD
# define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
2019-02-02 07:34:51 -08:00
# define SO_TIMESTAMPING SO_TIMESTAMPING_OLD
2019-02-02 07:34:54 -08:00
# define SO_RCVTIMEO SO_RCVTIMEO_OLD
# define SO_SNDTIMEO SO_SNDTIMEO_OLD
2019-02-02 07:34:50 -08:00
# else
# define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
# define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
2019-02-02 07:34:51 -08:00
# define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
2019-02-02 07:34:54 -08:00
# define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
# define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
2019-02-02 07:34:50 -08:00
# endif
2019-02-02 07:34:46 -08:00
# define SCM_TIMESTAMP SO_TIMESTAMP
# define SCM_TIMESTAMPNS SO_TIMESTAMPNS
# define SCM_TIMESTAMPING SO_TIMESTAMPING
# endif
2012-12-17 13:47:09 +00:00
# endif /* _UAPI_ASM_SOCKET_H */
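The SO_RCVTIMEO and SO_SNDTIMEO macros above resolve to the _OLD or _NEW option depending on the size of time_t, so user code keeps using the plain names with its own struct timeval. A minimal sketch of setting and reading back a receive timeout; set_recv_timeout() and get_recv_timeout() are hypothetical names, and whole seconds are used to sidestep rounding of the microsecond part.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: set the receive timeout to a whole number of seconds
 * (0 disables the timeout). Returns 0 on success, -1 with errno set. */
int set_recv_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    assert(seconds >= 0);
    /* SO_RCVTIMEO expands to the variant matching this build's time_t. */
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: read the current receive timeout into *tv.
 * Returns 0 on success, -1 with errno set. */
int get_recv_timeout(int fd, struct timeval *tv)
{
    socklen_t len = sizeof(*tv);

    assert(tv != NULL);
    return getsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, tv, &len);
}
```

The same pattern applies to SO_SNDTIMEO; the source never names the _OLD or _NEW variants directly.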