// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * net-sysfs.c - network device class and attributes
 *
 * Copyright (c) 2003 Stephen Hemminger <shemminger@osdl.org>
 */

#include <linux/capability.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/if_arp.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
#include <linux/sched/signal.h>
#include <linux/sched/isolation.h>
#include <linux/nsproxy.h>
#include <net/sock.h>
#include <net/net_namespace.h>
#include <linux/rtnetlink.h>
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it would potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
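The switching rule above can be sketched in plain C. This is only an illustration under simplifying assumptions, not the kernel's code: `struct backlog_queue`, `struct dev_flow` and all helper names are hypothetical stand-ins, the `RPS_NO_CPU` value is assumed, and the tail counter is saved after the packet has been added so that "head >= saved tail" means the packet has left the queue.

```c
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffff	/* "unset" marker; value assumed for this sketch */

/* Hypothetical stand-in for a per-CPU backlog queue. */
struct backlog_queue {
	uint32_t head;		/* incremented on every dequeue */
	uint32_t len;		/* packets currently queued */
	bool online;		/* whether the owning CPU is online */
};

/* Hypothetical stand-in for one rps_dev_flow table entry. */
struct dev_flow {
	uint16_t cpu;		/* "current" CPU for the flow */
	uint32_t last_qtail;	/* tail counter saved at the last enqueue */
};

/* Queue tail counter = queue head counter + queue length. */
static uint32_t queue_tail(const struct backlog_queue *q)
{
	return q->head + q->len;
}

/* Enqueue one packet of the flow and remember the resulting tail counter. */
static void enqueue_for_flow(struct dev_flow *flow, struct backlog_queue *q)
{
	q->len++;
	flow->last_qtail = queue_tail(q);
}

/* Dequeue one packet, advancing the head counter. */
static void dequeue_one(struct backlog_queue *q)
{
	if (q->len > 0) {
		q->len--;
		q->head++;
	}
}

/*
 * Return true when the flow may move from its current CPU to desired_cpu
 * without risking out-of-order delivery.
 */
static bool may_switch_cpu(const struct dev_flow *flow,
			   const struct backlog_queue *queues,
			   uint16_t desired_cpu)
{
	if (flow->cpu == desired_cpu)
		return false;		/* nothing to change */
	if (flow->cpu == RPS_NO_CPU)
		return true;		/* current CPU is unset */
	if (!queues[flow->cpu].online)
		return true;		/* current CPU is offline */
	/* Signed difference keeps the comparison valid across wraparound. */
	return (int32_t)(queues[flow->cpu].head - flow->last_qtail) >= 0;
}
```

A caller would run `may_switch_cpu()` for each received packet and, when it returns true, store the desired CPU in the entry before enqueueing. While a packet of the flow is still queued on the old CPU, the check fails and the flow stays where it is.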
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
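A power-of-two table size lets a hash be reduced to an index with a single mask instead of a modulo. The sketch below assumes the rounding is upward, which the text does not state, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Round n up to the next power of two. Returns 0 when n is 0 or when the
 * result would not fit in size_t.
 */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	if (n == 0 || n > (SIZE_MAX >> 1) + 1)
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/* Map a flow hash to a slot; table_size must be a power of two. */
static size_t flow_slot(uint32_t rxhash, size_t table_size)
{
	return rxhash & (table_size - 1);
}
```

For example, a requested size of 1000 entries would become 1024, and every hash then lands in slots 0 to 1023.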
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS                104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS                          303K tps at 61% CPU

RPC test        tps     CPU%    50/90/99% usec latency  Latency StdDev
   No RFS/RPS   103K    48%     757/900/3185            4472.35
   RPS only:    174K    73%     415/993/2468            491.66
   RFS          223K    73%     379/651/1382            315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
#include <linux/vmalloc.h>
#include <linux/export.h>
#include <linux/jiffies.h>
#include <linux/pm_runtime.h>
#include <linux/of.h>
#include <linux/of_net.h>
#include <linux/cpu.h>

#include "net-sysfs.h"

#ifdef CONFIG_SYSFS
static const char fmt_hex[] = "%#x\n";
static const char fmt_dec[] = "%d\n";
static const char fmt_ulong[] = "%lu\n";
static const char fmt_u64[] = "%llu\n";

static inline int dev_isalive(const struct net_device *dev)
{
	return dev->reg_state <= NETREG_REGISTERED;
}

/* use same locking rules as GIF* ioctl's */
static ssize_t netdev_show(const struct device *dev,
			   struct device_attribute *attr, char *buf,
			   ssize_t (*format)(const struct net_device *, char *))
{
	struct net_device *ndev = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	read_lock(&dev_base_lock);
	if (dev_isalive(ndev))
		ret = (*format)(ndev, buf);
	read_unlock(&dev_base_lock);

	return ret;
}

/* generate a show function for simple field */
#define NETDEVICE_SHOW(field, format_string)				\
static ssize_t format_##field(const struct net_device *dev, char *buf)	\
{									\
	return sprintf(buf, format_string, dev->field);			\
}									\
static ssize_t field##_show(struct device *dev,				\
			    struct device_attribute *attr, char *buf)	\
{									\
	return netdev_show(dev, attr, buf, format_##field);		\
}									\

#define NETDEVICE_SHOW_RO(field, format_string)				\
NETDEVICE_SHOW(field, format_string);					\
static DEVICE_ATTR_RO(field)

#define NETDEVICE_SHOW_RW(field, format_string)				\
NETDEVICE_SHOW(field, format_string);					\
static DEVICE_ATTR_RW(field)

/* use same locking and permission rules as SIF* ioctl's */
static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t len,
			    int (*set)(struct net_device *, unsigned long))
{
	struct net_device *netdev = to_net_dev(dev);
	struct net *net = dev_net(netdev);
	unsigned long new;
	int ret;

	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	ret = kstrtoul(buf, 0, &new);
	if (ret)
		goto err;

	if (!rtnl_trylock())
		return restart_syscall();

	if (dev_isalive(netdev)) {
		ret = (*set)(netdev, new);
		if (ret == 0)
			ret = len;
	}
	rtnl_unlock();
 err:
	return ret;
}

NETDEVICE_SHOW_RO(dev_id, fmt_hex);
NETDEVICE_SHOW_RO(dev_port, fmt_dec);
NETDEVICE_SHOW_RO(addr_assign_type, fmt_dec);
NETDEVICE_SHOW_RO(addr_len, fmt_dec);
NETDEVICE_SHOW_RO(ifindex, fmt_dec);
NETDEVICE_SHOW_RO(type, fmt_dec);
NETDEVICE_SHOW_RO(link_mode, fmt_dec);

static ssize_t iflink_show(struct device *dev, struct device_attribute *attr,
			   char *buf)
{
	struct net_device *ndev = to_net_dev(dev);

	return sprintf(buf, fmt_dec, dev_get_iflink(ndev));
}
static DEVICE_ATTR_RO(iflink);

static ssize_t format_name_assign_type(const struct net_device *dev, char *buf)
net: add name_assign_type netdev attribute
Based on a patch by David Herrmann.
The name_assign_type attribute gives hints where the interface name of a
given net-device comes from. These values are currently defined:
NET_NAME_ENUM:
The ifname is provided by the kernel with an enumerated
suffix, typically based on order of discovery. Names may
be reused and unpredictable.
NET_NAME_PREDICTABLE:
The ifname has been assigned by the kernel in a predictable way
that is guaranteed to avoid reuse and always be the same for a
given device. Examples include statically created devices like
the loopback device and names deduced from hardware properties
(including being given explicitly by the firmware). Names
depending on the order of discovery, or in any other way on the
existence of other devices, must not be marked as PREDICTABLE.
NET_NAME_USER:
The ifname was provided by user-space during net-device setup.
NET_NAME_RENAMED:
The net-device has been renamed from userspace. Once this type is set,
it cannot change again.
NET_NAME_UNKNOWN:
This is an internal placeholder to indicate that we haven't yet
categorized the name. It will not be exposed to userspace, rather
-EINVAL is returned.
The aim of these patches is to improve user-space renaming of interfaces. As
a general rule, userspace must rename interfaces to guarantee that names stay
the same every time a given piece of hardware appears (at boot, or when
attaching it). However, there are several situations where userspace should
not perform the renaming, and that depends both on the policy of the local
admin and, crucially, on the nature of the current interface name.
If an interface was created in response to a userspace request, and userspace
already provided a name, we most probably want to leave that name alone. The
main instance of this is wifi-P2P devices created over nl80211, which currently
have a long-standing bug where they are getting renamed by udev. We label such
names NET_NAME_USER.
If an interface, unbeknown to us, has already been renamed from userspace, we
most probably want to leave that alone as well. This will typically happen when
third-party plugins (for instance to udev, but the interface is generic so could
be from anywhere) rename the interface without informing udev about it. A
typical situation is when you switch root from an installer or an initrd to the
real system and the new instance of udev does not know what happened before
the switch. These types of problems have caused repeated issues in the past. To
solve this, once an interface has been renamed, its name is labelled
NET_NAME_RENAMED.
In many cases, the kernel is actually able to name interfaces in such a
way that there is no need for userspace to rename them. This is the case when
the enumeration order of devices, or in fact any other (non-parent) device on
the system, can not influence the name of the interface. Examples include
statically created devices, or any naming schemes based on hardware properties
of the interface. In this case the admin may prefer to use the kernel-provided
names, and to make that possible we label such names NET_NAME_PREDICTABLE.
We want the kernel to have the possibility of performing predictable interface
naming itself (and exposing to userspace that it has), as the information
necessary for a proper naming scheme for a certain class of devices may not
be exposed to userspace.
The case where renaming is almost certainly desired, is when the kernel has
given the interface a name using global device enumeration based on order of
discovery (ethX, wlanY, etc). These naming schemes are labelled NET_NAME_ENUM.
Lastly, a fallback is left as NET_NAME_UNKNOWN, to indicate that a driver has
not yet been ported. This is mostly useful as a transitionary measure, allowing
us to label the various naming schemes bit by bit.
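The renaming policy described above could be expressed by a userspace tool roughly as follows. This is a hedged sketch, not udev's actual logic: `should_rename()` and its `admin_keeps_predictable` flag are hypothetical, the numeric values are assumed to match the kernel's uapi header, and treating NET_NAME_UNKNOWN under the general "rename" rule is an assumption (the attribute read fails with -EINVAL in that case, so a tool never sees the value directly).

```c
#include <stdbool.h>

/* Name assignment types; numeric values assumed to match the uapi header. */
enum name_assign_type {
	NET_NAME_UNKNOWN = 0,		/* not yet categorized, not exposed */
	NET_NAME_ENUM = 1,		/* enumerated by the kernel (ethX, wlanY) */
	NET_NAME_PREDICTABLE = 2,	/* predictably named by the kernel */
	NET_NAME_USER = 3,		/* provided by userspace at creation */
	NET_NAME_RENAMED = 4,		/* already renamed from userspace */
};

/*
 * Decide whether a renaming tool should touch the interface.
 * admin_keeps_predictable models the local admin's preference for
 * kernel-provided predictable names.
 */
static bool should_rename(int type, bool admin_keeps_predictable)
{
	switch (type) {
	case NET_NAME_USER:
	case NET_NAME_RENAMED:
		return false;	/* someone already chose this name on purpose */
	case NET_NAME_PREDICTABLE:
		return !admin_keeps_predictable;
	case NET_NAME_ENUM:
	case NET_NAME_UNKNOWN:
	default:
		return true;	/* general rule: rename for stable names */
	}
}
```

A tool would read the type from the interface's `name_assign_type` attribute, fall back to NET_NAME_UNKNOWN when the read fails, and then call `should_rename()` before applying its own naming scheme.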
v8: minor documentation fixes
v9: move comment to the right commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 16:37:22 +02:00
{
	return sprintf(buf, fmt_dec, dev->name_assign_type);
}

static ssize_t name_assign_type_show(struct device *dev,
				     struct device_attribute *attr,
				     char *buf)
{
	struct net_device *ndev = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	if (ndev->name_assign_type != NET_NAME_UNKNOWN)
		ret = netdev_show(dev, attr, buf, format_name_assign_type);

	return ret;
}
static DEVICE_ATTR_RO(name_assign_type);

/* use same locking rules as GIFHWADDR ioctl's */
static ssize_t address_show(struct device *dev, struct device_attribute *attr,
			    char *buf)
{
	struct net_device *ndev = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	read_lock(&dev_base_lock);
	if (dev_isalive(ndev))
		ret = sysfs_format_mac(buf, ndev->dev_addr, ndev->addr_len);
	read_unlock(&dev_base_lock);
	return ret;
}
static DEVICE_ATTR_RO(address);

2013-07-24 15:05:33 -07:00
static ssize_t broadcast_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
2005-04-16 15:20:36 -07:00
{
2014-07-23 16:09:10 -07:00
struct net_device * ndev = to_net_dev ( dev ) ;
2017-08-18 13:46:28 -07:00
2014-07-23 16:09:10 -07:00
if (dev_isalive(ndev))
return sysfs_format_mac(buf, ndev->broadcast, ndev->addr_len);
2005-04-16 15:20:36 -07:00
return - EINVAL ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( broadcast ) ;
2005-04-16 15:20:36 -07:00
2014-07-23 16:09:10 -07:00
static int change_carrier ( struct net_device * dev , unsigned long new_carrier )
2012-12-27 23:49:38 +00:00
{
2014-07-23 16:09:10 -07:00
if ( ! netif_running ( dev ) )
2012-12-27 23:49:38 +00:00
return - EINVAL ;
2017-08-18 13:46:28 -07:00
return dev_change_carrier ( dev , ( bool ) new_carrier ) ;
2012-12-27 23:49:38 +00:00
}
2013-07-24 15:05:33 -07:00
static ssize_t carrier_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t len )
2012-12-27 23:49:38 +00:00
{
2021-10-07 16:00:51 +02:00
struct net_device * netdev = to_net_dev ( dev ) ;
/* The check is also done in change_carrier; this helps returning early
* without hitting the trylock / restart in netdev_store .
*/
if (!netdev->netdev_ops->ndo_change_carrier)
return - EOPNOTSUPP ;
2012-12-27 23:49:38 +00:00
return netdev_store ( dev , attr , buf , len , change_carrier ) ;
}
2013-07-24 15:05:33 -07:00
static ssize_t carrier_show ( struct device * dev ,
2002-04-09 12:14:34 -07:00
struct device_attribute * attr , char * buf )
2005-04-16 15:20:36 -07:00
{
struct net_device * netdev = to_net_dev ( dev ) ;
2017-08-18 13:46:28 -07:00
if ( netif_running ( netdev ) )
2005-04-16 15:20:36 -07:00
return sprintf ( buf , fmt_dec , ! ! netif_carrier_ok ( netdev ) ) ;
2017-08-18 13:46:28 -07:00
2005-04-16 15:20:36 -07:00
return - EINVAL ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RW ( carrier ) ;
2005-04-16 15:20:36 -07:00
2013-07-24 15:05:33 -07:00
static ssize_t speed_show ( struct device * dev ,
2009-10-02 09:26:12 +00:00
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
int ret = - EINVAL ;
2021-10-07 16:00:51 +02:00
/* The check is also done in __ethtool_get_link_ksettings; this helps
* returning early without hitting the trylock / restart below .
*/
if (!netdev->ethtool_ops->get_link_ksettings)
return ret ;
2009-10-02 09:26:12 +00:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
2011-04-27 18:32:38 +00:00
if ( netif_running ( netdev ) ) {
2016-02-24 10:58:10 -08:00
struct ethtool_link_ksettings cmd ;
if ( ! __ethtool_get_link_ksettings ( netdev , & cmd ) )
ret = sprintf ( buf , fmt_dec , cmd . base . speed ) ;
2009-10-02 09:26:12 +00:00
}
rtnl_unlock ( ) ;
return ret ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( speed ) ;
2009-10-02 09:26:12 +00:00
2013-07-24 15:05:33 -07:00
static ssize_t duplex_show ( struct device * dev ,
2009-10-02 09:26:12 +00:00
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
int ret = - EINVAL ;
2021-10-07 16:00:51 +02:00
/* The check is also done in __ethtool_get_link_ksettings; this helps
* returning early without hitting the trylock / restart below .
*/
if (!netdev->ethtool_ops->get_link_ksettings)
return ret ;
2009-10-02 09:26:12 +00:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
2011-04-27 18:32:38 +00:00
if ( netif_running ( netdev ) ) {
2016-02-24 10:58:10 -08:00
struct ethtool_link_ksettings cmd ;
if ( ! __ethtool_get_link_ksettings ( netdev , & cmd ) ) {
2012-09-05 04:11:28 +00:00
const char * duplex ;
2016-02-24 10:58:10 -08:00
switch ( cmd . base . duplex ) {
2012-09-05 04:11:28 +00:00
case DUPLEX_HALF:
duplex = "half";
break;
case DUPLEX_FULL:
duplex = "full";
break;
default:
duplex = "unknown";
break;
}
ret = sprintf(buf, "%s\n", duplex);
}
2009-10-02 09:26:12 +00:00
}
rtnl_unlock ( ) ;
return ret ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( duplex ) ;
2009-10-02 09:26:12 +00:00
2020-04-20 00:11:51 +02:00
static ssize_t testing_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
if ( netif_running ( netdev ) )
return sprintf ( buf , fmt_dec , ! ! netif_testing ( netdev ) ) ;
return - EINVAL ;
}
static DEVICE_ATTR_RO ( testing ) ;
2013-07-24 15:05:33 -07:00
static ssize_t dormant_show ( struct device * dev ,
2002-04-09 12:14:34 -07:00
struct device_attribute * attr , char * buf )
2006-03-20 17:09:11 -08:00
{
struct net_device * netdev = to_net_dev ( dev ) ;
if ( netif_running ( netdev ) )
return sprintf ( buf , fmt_dec , ! ! netif_dormant ( netdev ) ) ;
return - EINVAL ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( dormant ) ;
2006-03-20 17:09:11 -08:00
2009-08-05 10:42:58 -07:00
static const char *const operstates[] = {
2006-03-20 17:09:11 -08:00
"unknown",
"notpresent", /* currently unused */
"down",
"lowerlayerdown",
2020-04-20 00:11:51 +02:00
"testing",
2006-03-20 17:09:11 -08:00
"dormant",
"up"
};
2013-07-24 15:05:33 -07:00
static ssize_t operstate_show ( struct device * dev ,
2002-04-09 12:14:34 -07:00
struct device_attribute * attr , char * buf )
2006-03-20 17:09:11 -08:00
{
const struct net_device * netdev = to_net_dev ( dev ) ;
unsigned char operstate ;
read_lock ( & dev_base_lock ) ;
operstate = netdev->operstate;
if ( ! netif_running ( netdev ) )
operstate = IF_OPER_DOWN ;
read_unlock ( & dev_base_lock ) ;
2006-04-05 22:19:47 -07:00
if (operstate >= ARRAY_SIZE(operstates))
2006-03-20 17:09:11 -08:00
return - EINVAL ; /* should not happen */
return sprintf(buf, "%s\n", operstates[operstate]);
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( operstate ) ;
2006-03-20 17:09:11 -08:00
2014-03-29 09:48:35 -07:00
static ssize_t carrier_changes_show ( struct device * dev ,
struct device_attribute * attr ,
char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
2017-08-18 13:46:28 -07:00
2014-03-29 09:48:35 -07:00
return sprintf ( buf , fmt_dec ,
2018-01-18 09:59:13 -08:00
atomic_read(&netdev->carrier_up_count) +
atomic_read(&netdev->carrier_down_count));
2014-03-29 09:48:35 -07:00
}
static DEVICE_ATTR_RO ( carrier_changes ) ;
2018-01-18 09:59:13 -08:00
static ssize_t carrier_up_count_show ( struct device * dev ,
struct device_attribute * attr ,
char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
return sprintf(buf, fmt_dec, atomic_read(&netdev->carrier_up_count));
}
static DEVICE_ATTR_RO ( carrier_up_count ) ;
static ssize_t carrier_down_count_show ( struct device * dev ,
struct device_attribute * attr ,
char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
return sprintf(buf, fmt_dec, atomic_read(&netdev->carrier_down_count));
}
static DEVICE_ATTR_RO ( carrier_down_count ) ;
2005-04-16 15:20:36 -07:00
/* read-write attributes */
2014-07-23 16:09:10 -07:00
static int change_mtu ( struct net_device * dev , unsigned long new_mtu )
2005-04-16 15:20:36 -07:00
{
2017-08-18 13:46:28 -07:00
return dev_set_mtu ( dev , ( int ) new_mtu ) ;
2005-04-16 15:20:36 -07:00
}
2013-07-24 15:05:33 -07:00
static ssize_t mtu_store ( struct device * dev , struct device_attribute * attr ,
2002-04-09 12:14:34 -07:00
const char * buf , size_t len )
2005-04-16 15:20:36 -07:00
{
2002-04-09 12:14:34 -07:00
return netdev_store ( dev , attr , buf , len , change_mtu ) ;
2005-04-16 15:20:36 -07:00
}
2013-07-24 15:05:33 -07:00
NETDEVICE_SHOW_RW ( mtu , fmt_dec ) ;
2005-04-16 15:20:36 -07:00
2014-07-23 16:09:10 -07:00
static int change_flags ( struct net_device * dev , unsigned long new_flags )
2005-04-16 15:20:36 -07:00
{
2018-12-06 17:05:42 +00:00
return dev_change_flags ( dev , ( unsigned int ) new_flags , NULL ) ;
2005-04-16 15:20:36 -07:00
}
2013-07-24 15:05:33 -07:00
static ssize_t flags_store ( struct device * dev , struct device_attribute * attr ,
2002-04-09 12:14:34 -07:00
const char * buf , size_t len )
2005-04-16 15:20:36 -07:00
{
2002-04-09 12:14:34 -07:00
return netdev_store ( dev , attr , buf , len , change_flags ) ;
2005-04-16 15:20:36 -07:00
}
2013-07-24 15:05:33 -07:00
NETDEVICE_SHOW_RW ( flags , fmt_hex ) ;
2005-04-16 15:20:36 -07:00
2013-07-24 15:05:33 -07:00
static ssize_t tx_queue_len_store ( struct device * dev ,
2002-04-09 12:14:34 -07:00
struct device_attribute * attr ,
const char * buf , size_t len )
2005-04-16 15:20:36 -07:00
{
2012-11-16 03:03:04 +00:00
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
2018-01-25 18:26:22 -08:00
return netdev_store ( dev , attr , buf , len , dev_change_tx_queue_len ) ;
2005-04-16 15:20:36 -07:00
}
2017-05-17 13:30:44 +03:00
NETDEVICE_SHOW_RW ( tx_queue_len , fmt_dec ) ;
2005-04-16 15:20:36 -07:00
net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.
Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.
To reach big GRO packets on 10Gbe NIC, one can use :
ethtool -C eth0 rx-usecs 4 rx-frames 44
But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.
Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.
This patch uses a different strategy: let the GRO stack decide what to do,
based on traffic pattern.
Packets with Push flag won't be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.
This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
Receiver performs fewer calls to upper stacks and fewer wakeups.
This also reduces cpu usage on the sender, as it receives fewer ACK
packets.
Note that reducing the number of wakeups increases cpu efficiency, but can
decrease QPS, as applications won't have the chance to warm up cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00
0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00
0.00 0.50
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
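The aggregation ratios quoted in the test notes above can be reproduced by dividing the receive packet rate by the ACK (transmit) packet rate taken from the two sar samples. The following is a minimal sketch of that arithmetic; the function name gro_aggregation_ratio is hypothetical, and using packets received per ACK sent as a stand-in for segments per GRO packet is the same approximation the commit message makes.

```c
#include <assert.h>

/* Approximate segments per GRO packet as received packets per ACK sent.
 * Hypothetical helper for illustration only; rates are packets per second.
 */
static double gro_aggregation_ratio(double rx_pkts_per_sec,
				    double tx_acks_per_sec)
{
	assert(tx_acks_per_sec > 0.0);
	return rx_pkts_per_sec / tx_acks_per_sec;
}
```

Feeding in the first sample (811269.80 received, 305732.30 sent) gives roughly the 2.65 quoted above, and the second sample (811577.30 received, 19230.80 sent) gives roughly 42.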
2014-11-06 21:09:44 -08:00
static int change_gro_flush_timeout ( struct net_device * dev , unsigned long val )
{
2020-04-22 09:13:28 -07:00
WRITE_ONCE(dev->gro_flush_timeout, val);
2014-11-06 21:09:44 -08:00
return 0 ;
}
static ssize_t gro_flush_timeout_store ( struct device * dev ,
2017-08-18 13:46:28 -07:00
struct device_attribute * attr ,
const char * buf , size_t len )
2014-11-06 21:09:44 -08:00
{
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
return netdev_store ( dev , attr , buf , len , change_gro_flush_timeout ) ;
}
NETDEVICE_SHOW_RW ( gro_flush_timeout , fmt_ulong ) ;
net: napi: add hard irqs deferral feature
Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
we added the ability to arm one high resolution timer, that we used
to keep not-complete packets in GRO engine a bit longer, hoping that further
frames might be added to them.
Since then, we added the napi_complete_done() interface, and commit
364b6055738b ("net: busy-poll: return busypolling status to drivers")
allowed drivers to avoid re-arming NIC interrupts if we made a promise
that their NAPI poll() handler would be called in the near future.
This infrastructure can be leveraged, thanks to a new device parameter,
which allows arming the napi hrtimer instead of re-arming the device
hard IRQ.
We have noticed that on some servers with 32 RX queues or more, the chit-chat
between the NIC and the host caused by IRQ delivery and re-arming could hurt
throughput by ~20% on 100Gbit NIC.
In contrast, hrtimers are using local (percpu) resources and might have lower
cost.
The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
as gro_flush_timeout (/sys/class/net/ethX/).
By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
This patch does not change the prior behavior of gro_flush_timeout
if used alone : NIC hard irqs should be rearmed as before.
One concrete usage can be :
echo 20000 >/sys/class/net/eth1/gro_flush_timeout
echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
If at least one packet is retired, then we will reset napi counter
to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
of the queue.
On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
avoidance was only possible if napi->poll() was exhausting its budget
and not calling napi_complete_done().
This feature also can be used to work around some non-optimal NIC irq
coalescing strategies.
Having the ability to insert XX usec delays between each napi->poll()
can increase cache efficiency, since we increase batch sizes.
It also keeps the serving cpus from idling too long, reducing tail latencies.
Co-developed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 09:13:27 -07:00
static int change_napi_defer_hard_irqs ( struct net_device * dev , unsigned long val )
{
2020-04-22 09:13:28 -07:00
WRITE_ONCE(dev->napi_defer_hard_irqs, val);
2020-04-22 09:13:27 -07:00
return 0 ;
}
static ssize_t napi_defer_hard_irqs_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t len )
{
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
return netdev_store ( dev , attr , buf , len , change_napi_defer_hard_irqs ) ;
}
NETDEVICE_SHOW_RW ( napi_defer_hard_irqs , fmt_dec ) ;
2013-07-24 15:05:33 -07:00
static ssize_t ifalias_store ( struct device * dev , struct device_attribute * attr ,
2008-09-22 21:28:11 -07:00
const char * buf , size_t len )
{
struct net_device * netdev = to_net_dev ( dev ) ;
2012-11-16 03:03:04 +00:00
struct net * net = dev_net ( netdev ) ;
2008-09-22 21:28:11 -07:00
size_t count = len ;
2017-11-13 23:21:36 -08:00
ssize_t ret = 0 ;
2008-09-22 21:28:11 -07:00
2012-11-16 03:03:04 +00:00
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
2008-09-22 21:28:11 -07:00
return - EPERM ;
/* ignore trailing newline */
if (len > 0 && buf[len - 1] == '\n')
--count;
2017-11-13 23:21:36 -08:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
2008-09-22 21:28:11 -07:00
2017-11-13 23:21:36 -08:00
if ( dev_isalive ( netdev ) ) {
ret = dev_set_alias ( netdev , buf , count ) ;
if ( ret < 0 )
goto err ;
ret = len ;
netdev_state_change ( netdev ) ;
}
err :
rtnl_unlock ( ) ;
return ret ;
2008-09-22 21:28:11 -07:00
}
2013-07-24 15:05:33 -07:00
static ssize_t ifalias_show ( struct device * dev ,
2008-09-22 21:28:11 -07:00
struct device_attribute * attr , char * buf )
{
const struct net_device * netdev = to_net_dev ( dev ) ;
2017-10-02 23:50:05 +02:00
char tmp [ IFALIASZ ] ;
2008-09-22 21:28:11 -07:00
ssize_t ret = 0 ;
2017-10-02 23:50:05 +02:00
ret = dev_get_alias ( netdev , tmp , sizeof ( tmp ) ) ;
if ( ret > 0 )
ret = sprintf(buf, "%s\n", tmp);
2008-09-22 21:28:11 -07:00
return ret ;
}
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RW ( ifalias ) ;
2011-01-24 03:37:29 +00:00
2014-07-23 16:09:10 -07:00
static int change_group ( struct net_device * dev , unsigned long new_group )
2011-01-24 03:37:29 +00:00
{
2017-08-18 13:46:28 -07:00
dev_set_group ( dev , ( int ) new_group ) ;
2011-01-24 03:37:29 +00:00
return 0 ;
}
2013-07-24 15:05:33 -07:00
static ssize_t group_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t len )
2011-01-24 03:37:29 +00:00
{
return netdev_store ( dev , attr , buf , len , change_group ) ;
}
2013-07-24 15:05:33 -07:00
NETDEVICE_SHOW ( group , fmt_dec ) ;
2018-03-23 15:54:38 -07:00
static DEVICE_ATTR ( netdev_group , 0644 , group_show , group_store ) ;
2013-07-24 15:05:33 -07:00
2015-07-14 13:43:19 -07:00
static int change_proto_down ( struct net_device * dev , unsigned long proto_down )
{
2017-08-18 13:46:28 -07:00
return dev_change_proto_down ( dev , ( bool ) proto_down ) ;
2015-07-14 13:43:19 -07:00
}
static ssize_t proto_down_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t len )
{
return netdev_store ( dev , attr , buf , len , change_proto_down ) ;
}
NETDEVICE_SHOW_RW ( proto_down , fmt_dec ) ;
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
"Noteworthy changes this time around:
1) Multicast rejoin support for team driver, from Jiri Pirko.
2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the latter because there are broken middleware devices which
scramble the timestamp.
From Yuchung Cheng.
3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsent user data. From Eric Dumazet.
4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.
5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.
6) Add a networking development FAQ, from Paul Gortmaker.
7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.
8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.
9) Add SCTP support to openvswitch, from Joe Stringer.
10) Add EF10 chip support to SFC driver, from Ben Hutchings.
11) Add new SYNPROXY netfilter target, from Patrick McHardy.
12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.
13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.
14) Add ipv6 support to vxlan driver, from Cong Wang"
Resolved conflicts as per discussion.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...
2013-09-05 14:54:29 -07:00
static ssize_t phys_port_id_show ( struct device * dev ,
2013-07-29 18:16:51 +02:00
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
ssize_t ret = - EINVAL ;
2021-10-07 16:00:51 +02:00
/* The check is also done in dev_get_phys_port_id; this helps returning
* early without hitting the trylock / restart below .
*/
if (!netdev->netdev_ops->ndo_get_phys_port_id)
return - EOPNOTSUPP ;
2013-07-29 18:16:51 +02:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
if ( dev_isalive ( netdev ) ) {
2014-11-28 14:34:16 +01:00
struct netdev_phys_item_id ppid ;
2013-07-29 18:16:51 +02:00
ret = dev_get_phys_port_id ( netdev , & ppid ) ;
if ( ! ret )
ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
}
rtnl_unlock ( ) ;
return ret ;
}
2013-09-05 14:54:29 -07:00
static DEVICE_ATTR_RO ( phys_port_id ) ;
2015-03-17 20:23:15 -06:00
static ssize_t phys_port_name_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
ssize_t ret = - EINVAL ;
2021-10-07 16:00:51 +02:00
/* The checks are also done in dev_get_phys_port_name; this helps
* returning early without hitting the trylock / restart below .
*/
if (!netdev->netdev_ops->ndo_get_phys_port_name &&
!netdev->netdev_ops->ndo_get_devlink_port)
return - EOPNOTSUPP ;
2015-03-17 20:23:15 -06:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
if ( dev_isalive ( netdev ) ) {
char name [ IFNAMSIZ ] ;
ret = dev_get_phys_port_name ( netdev , name , sizeof ( name ) ) ;
if ( ! ret )
ret = sprintf(buf, "%s\n", name);
}
rtnl_unlock ( ) ;
return ret ;
}
static DEVICE_ATTR_RO ( phys_port_name ) ;
2014-11-28 14:34:19 +01:00
static ssize_t phys_switch_id_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
ssize_t ret = - EINVAL ;
2021-10-07 16:00:51 +02:00
/* The checks are also done in dev_get_port_parent_id; this helps
 * returning early without hitting the trylock/restart below. This works
 * because recurse is false when calling dev_get_port_parent_id.
 */
if (!netdev->netdev_ops->ndo_get_port_parent_id &&
!netdev->netdev_ops->ndo_get_devlink_port)
return - EOPNOTSUPP ;
2014-11-28 14:34:19 +01:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
if ( dev_isalive ( netdev ) ) {
2019-02-06 09:45:46 -08:00
struct netdev_phys_item_id ppid = { } ;
ret = dev_get_port_parent_id ( netdev , & ppid , false ) ;
2014-11-28 14:34:19 +01:00
if ( ! ret )
2019-02-06 09:45:46 -08:00
ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
2014-11-28 14:34:19 +01:00
}
rtnl_unlock ( ) ;
return ret ;
}
static DEVICE_ATTR_RO ( phys_switch_id ) ;
2021-02-08 11:34:10 -08:00
static ssize_t threaded_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct net_device * netdev = to_net_dev ( dev ) ;
ssize_t ret = - EINVAL ;
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
if ( dev_isalive ( netdev ) )
ret = sprintf(buf, fmt_dec, netdev->threaded);
rtnl_unlock ( ) ;
return ret ;
}
static int modify_napi_threaded ( struct net_device * dev , unsigned long val )
{
int ret ;
if (list_empty(&dev->napi_list))
return -EOPNOTSUPP;
if (val != 0 && val != 1)
return - EOPNOTSUPP ;
ret = dev_set_threaded ( dev , val ) ;
return ret ;
}
static ssize_t threaded_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t len )
{
return netdev_store ( dev , attr , buf , len , modify_napi_threaded ) ;
}
static DEVICE_ATTR_RW ( threaded ) ;
2017-08-18 13:46:23 -07:00
static struct attribute * net_class_attrs [ ] __ro_after_init = {
2013-07-24 15:05:33 -07:00
& dev_attr_netdev_group . attr ,
& dev_attr_type . attr ,
& dev_attr_dev_id . attr ,
2014-02-25 18:17:50 +02:00
& dev_attr_dev_port . attr ,
2013-07-24 15:05:33 -07:00
& dev_attr_iflink . attr ,
& dev_attr_ifindex . attr ,
net: add name_assign_type netdev attribute
Based on a patch by David Herrmann.
The name_assign_type attribute gives hints where the interface name of a
given net-device comes from. These values are currently defined:
NET_NAME_ENUM:
The ifname is provided by the kernel with an enumerated
suffix, typically based on order of discovery. Names may
be reused and unpredictable.
NET_NAME_PREDICTABLE:
The ifname has been assigned by the kernel in a predictable way
that is guaranteed to avoid reuse and always be the same for a
given device. Examples include statically created devices like
the loopback device and names deduced from hardware properties
(including being given explicitly by the firmware). Names
depending on the order of discovery, or in any other way on the
existence of other devices, must not be marked as PREDICTABLE.
NET_NAME_USER:
The ifname was provided by user-space during net-device setup.
NET_NAME_RENAMED:
The net-device has been renamed from userspace. Once this type is set,
it cannot change again.
NET_NAME_UNKNOWN:
This is an internal placeholder to indicate that we haven't yet
categorized the name. It will not be exposed to userspace, rather
-EINVAL is returned.
The aim of these patches is to improve user-space renaming of interfaces. As
a general rule, userspace must rename interfaces to guarantee that names stay
the same every time a given piece of hardware appears (at boot, or when
attaching it). However, there are several situations where userspace should
not perform the renaming, and that depends on both the policy of the local
admin, but crucially also on the nature of the current interface name.
If an interface was created in response to a userspace request, and userspace
already provided a name, we most probably want to leave that name alone. The
main instance of this is wifi-P2P devices created over nl80211, which currently
have a long-standing bug where they are getting renamed by udev. We label such
names NET_NAME_USER.
If an interface, unbeknown to us, has already been renamed from userspace, we
most probably want to leave that alone as well. This will typically happen when
third-party plugins (for instance to udev, but the interface is generic so could
be from anywhere) rename the interface without informing udev about it. A
typical situation is when you switch root from an installer or an initrd to the
real system and the new instance of udev does not know what happened before
the switch. These types of problems have caused repeated issues in the past. To
solve this, once an interface has been renamed, its name is labelled
NET_NAME_RENAMED.
In many cases, the kernel is actually able to name interfaces in such a
way that there is no need for userspace to rename them. This is the case when
the enumeration order of devices, or in fact any other (non-parent) device on
the system, can not influence the name of the interface. Examples include
statically created devices, or any naming schemes based on hardware properties
of the interface. In this case the admin may prefer to use the kernel-provided
names, and to make that possible we label such names NET_NAME_PREDICTABLE.
We want the kernel to have the possibility of performing predictable interface
naming itself (and exposing to userspace that it has), as the information
necessary for a proper naming scheme for a certain class of devices may not
be exposed to userspace.
The case where renaming is almost certainly desired, is when the kernel has
given the interface a name using global device enumeration based on order of
discovery (ethX, wlanY, etc). These naming schemes are labelled NET_NAME_ENUM.
Lastly, a fallback is left as NET_NAME_UNKNOWN, to indicate that a driver has
not yet been ported. This is mostly useful as a transitionary measure, allowing
us to label the various naming schemes bit by bit.
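The renaming policy described above can be sketched as a small userspace decision function. This is only an illustration: the enum values match the constants in include/uapi/linux/netdevice.h, but demo_should_rename() and its admin_prefers_kernel_names parameter are hypothetical names, and treating an unreadable or unknown type like an enumerated name is an assumption, not something the commit specifies.

```c
#include <stdbool.h>

/* Name assignment types, with the values used by the kernel UAPI header. */
enum demo_name_assign_type {
	NET_NAME_UNKNOWN     = 0, /* not yet categorized, not exposed to userspace */
	NET_NAME_ENUM        = 1, /* enumerated by the kernel (ethX, wlanY, ...) */
	NET_NAME_PREDICTABLE = 2, /* predictably named by the kernel */
	NET_NAME_USER        = 3, /* provided by userspace at device creation */
	NET_NAME_RENAMED     = 4, /* already renamed from userspace */
};

/*
 * Hypothetical policy helper: should a udev-like daemon rename the interface?
 * admin_prefers_kernel_names models the local admin's policy (an assumption).
 */
bool demo_should_rename(int type, bool admin_prefers_kernel_names)
{
	switch (type) {
	case NET_NAME_USER:
	case NET_NAME_RENAMED:
		/* Someone in userspace already chose this name: leave it alone. */
		return false;
	case NET_NAME_PREDICTABLE:
		/* The kernel name is stable; rename only if the admin wants to. */
		return !admin_prefers_kernel_names;
	case NET_NAME_ENUM:
	default:
		/* Discovery-order names (and, by assumption, unknown ones). */
		return true;
	}
}
```

A daemon would read the integer from /sys/class/net/<ifname>/name_assign_type and pass it to such a helper; a read that fails with EINVAL corresponds to NET_NAME_UNKNOWN.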
v8: minor documentation fixes
v9: move comment to the right commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 16:37:22 +02:00
& dev_attr_name_assign_type . attr ,
2013-07-24 15:05:33 -07:00
& dev_attr_addr_assign_type . attr ,
& dev_attr_addr_len . attr ,
& dev_attr_link_mode . attr ,
& dev_attr_address . attr ,
& dev_attr_broadcast . attr ,
& dev_attr_speed . attr ,
& dev_attr_duplex . attr ,
& dev_attr_dormant . attr ,
2020-04-20 00:11:51 +02:00
& dev_attr_testing . attr ,
2013-07-24 15:05:33 -07:00
& dev_attr_operstate . attr ,
2014-03-29 09:48:35 -07:00
& dev_attr_carrier_changes . attr ,
2013-07-24 15:05:33 -07:00
& dev_attr_ifalias . attr ,
& dev_attr_carrier . attr ,
& dev_attr_mtu . attr ,
& dev_attr_flags . attr ,
& dev_attr_tx_queue_len . attr ,
net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.
Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.
To reach big GRO packets on 10Gbe NIC, one can use :
ethtool -C eth0 rx-usecs 4 rx-frames 44
But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.
Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.
This patch uses a different strategy : Let the GRO stack decide what to do,
based on traffic pattern.
Packets with Push flag won't be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.
This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
The receiver makes fewer calls to upper stacks and causes fewer wakeups.
This also reduces cpu usage on the sender, as it receives fewer ACK
packets.
Note that reducing the number of wakeups increases cpu efficiency, but can
decrease QPS, as applications won't have the chance to warm up cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00 0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00 0.00 0.50
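The aggregation ratios quoted in the test notes follow directly from the first two numeric columns of the sar output, assuming those columns are received and transmitted packets per second. A minimal sketch of that arithmetic, with demo_segs_per_ack() being a hypothetical helper name:

```c
/*
 * Hypothetical helper: received data segments per transmitted ACK, used
 * here as a rough proxy for the average number of segments per GRO packet.
 */
double demo_segs_per_ack(double rx_pps, double tx_pps)
{
	/* Guard against a division by zero when nothing is transmitted. */
	return tx_pps > 0.0 ? rx_pps / tx_pps : 0.0;
}
```

With the figures above, 811269.80 / 305732.30 is about 2.65 before the change and 811577.30 / 19230.80 is about 42 with a 2000 nsec timer.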
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06 21:09:44 -08:00
& dev_attr_gro_flush_timeout . attr ,
net: napi: add hard irqs deferral feature
Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
we added the ability to arm one high resolution timer, that we used
to keep not-complete packets in GRO engine a bit longer, hoping that further
frames might be added to them.
Since then, we added the napi_complete_done() interface, and commit
364b6055738b ("net: busy-poll: return busypolling status to drivers")
allowed drivers to avoid re-arming NIC interrupts if we made a promise
that their NAPI poll() handler would be called in the near future.
This infrastructure can be leveraged, thanks to a new device parameter,
which allows arming the napi hrtimer, instead of re-arming the device
hard IRQ.
We have noticed that on some servers with 32 RX queues or more, the chit-chat
between the NIC and the host caused by IRQ delivery and re-arming could hurt
throughput by ~20% on 100Gbit NIC.
In contrast, hrtimers are using local (percpu) resources and might have lower
cost.
The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
as gro_flush_timeout (/sys/class/net/ethX/).
By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
This patch does not change the prior behavior of gro_flush_timeout
if used alone : NIC hard irqs should be rearmed as before.
One concrete usage can be :
echo 20000 >/sys/class/net/eth1/gro_flush_timeout
echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
If at least one packet is retired, then we will reset napi counter
to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
of the queue.
On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
avoidance was only possible if napi->poll() was exhausting its budget
and not calling napi_complete_done().
This feature also can be used to work around some non-optimal NIC irq
coalescing strategies.
Having the ability to insert XX usec delays between each napi->poll()
can increase cache efficiency, since we increase batch sizes.
It also keeps serving cpus not idle too long, reducing tail latencies.
Co-developed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 09:13:27 -07:00
& dev_attr_napi_defer_hard_irqs . attr ,
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
"Noteworthy changes this time around:
1) Multicast rejoin support for team driver, from Jiri Pirko.
2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the latter because there are broken middleware devices which
scramble the timestamp.
From Yuchung Cheng.
3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsent user data. From Eric Dumazet.
4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.
5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.
6) Add a networking development FAQ, from Paul Gortmaker.
7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.
8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.
9) Add SCTP support to openvswitch, from Joe Stringer.
10) Add EF10 chip support to SFC driver, from Ben Hutchings.
11) Add new SYNPROXY netfilter target, from Patrick McHardy.
12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.
13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.
14) Add ipv6 support to vxlan driver, from Cong Wang"
Resolved conflicts as per discussion.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...
2013-09-05 14:54:29 -07:00
& dev_attr_phys_port_id . attr ,
2015-03-17 20:23:15 -06:00
& dev_attr_phys_port_name . attr ,
2014-11-28 14:34:19 +01:00
& dev_attr_phys_switch_id . attr ,
2015-07-14 13:43:19 -07:00
& dev_attr_proto_down . attr ,
2018-01-18 09:59:13 -08:00
& dev_attr_carrier_up_count . attr ,
& dev_attr_carrier_down_count . attr ,
2021-02-08 11:34:10 -08:00
& dev_attr_threaded . attr ,
2013-07-24 15:05:33 -07:00
NULL ,
2005-04-16 15:20:36 -07:00
} ;
2013-07-24 15:05:33 -07:00
ATTRIBUTE_GROUPS ( net_class ) ;
2005-04-16 15:20:36 -07:00
/* Show a given attribute in the statistics group */
2002-04-09 12:14:34 -07:00
static ssize_t netstat_show ( const struct device * d ,
struct device_attribute * attr , char * buf ,
2005-04-16 15:20:36 -07:00
unsigned long offset )
{
2002-04-09 12:14:34 -07:00
struct net_device * dev = to_net_dev ( d ) ;
2005-04-16 15:20:36 -07:00
ssize_t ret = - EINVAL ;
2010-06-08 07:19:54 +00:00
WARN_ON ( offset > sizeof ( struct rtnl_link_stats64 ) | |
2017-08-18 13:46:28 -07:00
offset % sizeof ( u64 ) ! = 0 ) ;
2005-04-16 15:20:36 -07:00
read_lock ( & dev_base_lock ) ;
2008-05-21 14:12:46 -07:00
if ( dev_isalive ( dev ) ) {
2010-07-07 14:58:56 -07:00
struct rtnl_link_stats64 temp ;
const struct rtnl_link_stats64 * stats = dev_get_stats ( dev , & temp ) ;
2017-08-18 13:46:28 -07:00
ret = sprintf ( buf , fmt_u64 , * ( u64 * ) ( ( ( u8 * ) stats ) + offset ) ) ;
2008-05-21 14:12:46 -07:00
}
2005-04-16 15:20:36 -07:00
read_unlock ( & dev_base_lock ) ;
return ret ;
}
/* generate a read-only statistics attribute */
# define NETSTAT_ENTRY(name) \
2013-07-24 15:05:33 -07:00
static ssize_t name # # _show ( struct device * d , \
2017-08-18 13:46:28 -07:00
struct device_attribute * attr , char * buf ) \
2005-04-16 15:20:36 -07:00
{ \
2002-04-09 12:14:34 -07:00
return netstat_show ( d , attr , buf , \
2010-06-08 07:19:54 +00:00
offsetof ( struct rtnl_link_stats64 , name ) ) ; \
2005-04-16 15:20:36 -07:00
} \
2013-07-24 15:05:33 -07:00
static DEVICE_ATTR_RO ( name )
2005-04-16 15:20:36 -07:00
NETSTAT_ENTRY ( rx_packets ) ;
NETSTAT_ENTRY ( tx_packets ) ;
NETSTAT_ENTRY ( rx_bytes ) ;
NETSTAT_ENTRY ( tx_bytes ) ;
NETSTAT_ENTRY ( rx_errors ) ;
NETSTAT_ENTRY ( tx_errors ) ;
NETSTAT_ENTRY ( rx_dropped ) ;
NETSTAT_ENTRY ( tx_dropped ) ;
NETSTAT_ENTRY ( multicast ) ;
NETSTAT_ENTRY ( collisions ) ;
NETSTAT_ENTRY ( rx_length_errors ) ;
NETSTAT_ENTRY ( rx_over_errors ) ;
NETSTAT_ENTRY ( rx_crc_errors ) ;
NETSTAT_ENTRY ( rx_frame_errors ) ;
NETSTAT_ENTRY ( rx_fifo_errors ) ;
NETSTAT_ENTRY ( rx_missed_errors ) ;
NETSTAT_ENTRY ( tx_aborted_errors ) ;
NETSTAT_ENTRY ( tx_carrier_errors ) ;
NETSTAT_ENTRY ( tx_fifo_errors ) ;
NETSTAT_ENTRY ( tx_heartbeat_errors ) ;
NETSTAT_ENTRY ( tx_window_errors ) ;
NETSTAT_ENTRY ( rx_compressed ) ;
NETSTAT_ENTRY ( tx_compressed ) ;
2016-02-01 18:51:05 -05:00
NETSTAT_ENTRY ( rx_nohandler ) ;
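The NETSTAT_ENTRY() pattern above, one generic reader plus a macro that stamps out a thin per-field wrapper using offsetof(), can be tried outside the kernel. The sketch below is a userspace illustration only: struct demo_stats, demo_stat_read() and DEMO_STAT_ENTRY() are hypothetical stand-ins, not kernel names, and the locking and dev_isalive() check of the real code are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rtnl_link_stats64: all members are u64. */
struct demo_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t rx_bytes;
};

/* Generic reader: fetch the u64 counter located `offset` bytes into *stats. */
uint64_t demo_stat_read(const struct demo_stats *stats, size_t offset)
{
	uint64_t val;

	memcpy(&val, (const unsigned char *)stats + offset, sizeof(val));
	return val;
}

/* Generate one accessor per member, in the spirit of NETSTAT_ENTRY(). */
#define DEMO_STAT_ENTRY(name)						\
uint64_t demo_##name##_show(const struct demo_stats *stats)		\
{									\
	return demo_stat_read(stats, offsetof(struct demo_stats, name));	\
}

DEMO_STAT_ENTRY(rx_packets)
DEMO_STAT_ENTRY(tx_packets)
DEMO_STAT_ENTRY(rx_bytes)
```

Adding a new counter then costs one line per field, which is why the kernel list above can grow (for example with rx_nohandler) without new show logic.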
2005-04-16 15:20:36 -07:00
2017-08-18 13:46:23 -07:00
static struct attribute * netstat_attrs [ ] __ro_after_init = {
2002-04-09 12:14:34 -07:00
& dev_attr_rx_packets . attr ,
& dev_attr_tx_packets . attr ,
& dev_attr_rx_bytes . attr ,
& dev_attr_tx_bytes . attr ,
& dev_attr_rx_errors . attr ,
& dev_attr_tx_errors . attr ,
& dev_attr_rx_dropped . attr ,
& dev_attr_tx_dropped . attr ,
& dev_attr_multicast . attr ,
& dev_attr_collisions . attr ,
& dev_attr_rx_length_errors . attr ,
& dev_attr_rx_over_errors . attr ,
& dev_attr_rx_crc_errors . attr ,
& dev_attr_rx_frame_errors . attr ,
& dev_attr_rx_fifo_errors . attr ,
& dev_attr_rx_missed_errors . attr ,
& dev_attr_tx_aborted_errors . attr ,
& dev_attr_tx_carrier_errors . attr ,
& dev_attr_tx_fifo_errors . attr ,
& dev_attr_tx_heartbeat_errors . attr ,
& dev_attr_tx_window_errors . attr ,
& dev_attr_rx_compressed . attr ,
& dev_attr_tx_compressed . attr ,
2016-02-01 18:51:05 -05:00
& dev_attr_rx_nohandler . attr ,
2005-04-16 15:20:36 -07:00
NULL
} ;
2017-06-29 16:31:26 +05:30
static const struct attribute_group netstat_group = {
2005-04-16 15:20:36 -07:00
. name = " statistics " ,
. attrs = netstat_attrs ,
} ;
2012-11-16 20:46:19 +01:00
# if IS_ENABLED(CONFIG_WIRELESS_EXT) || IS_ENABLED(CONFIG_CFG80211)
static struct attribute * wireless_attrs [ ] = {
NULL
} ;
2017-06-29 16:31:26 +05:30
static const struct attribute_group wireless_group = {
2012-11-16 20:46:19 +01:00
. name = " wireless " ,
. attrs = wireless_attrs ,
} ;
# endif
2013-07-24 15:05:33 -07:00
# else /* CONFIG_SYSFS */
# define net_class_groups NULL
2010-05-16 21:59:45 -07:00
# endif /* CONFIG_SYSFS */
2005-04-16 15:20:36 -07:00
2014-01-16 22:23:28 -08:00
# ifdef CONFIG_SYSFS
2017-08-18 13:46:28 -07:00
# define to_rx_queue_attr(_attr) \
container_of ( _attr , struct rx_queue_attribute , attr )
2010-03-16 08:03:29 +00:00
# define to_rx_queue(obj) container_of(obj, struct netdev_rx_queue, kobj)
static ssize_t rx_queue_attr_show ( struct kobject * kobj , struct attribute * attr ,
char * buf )
{
2017-08-18 13:46:27 -07:00
const struct rx_queue_attribute * attribute = to_rx_queue_attr ( attr ) ;
2010-03-16 08:03:29 +00:00
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
2017-08-18 13:46:24 -07:00
return attribute - > show ( queue , buf ) ;
2010-03-16 08:03:29 +00:00
}
static ssize_t rx_queue_attr_store ( struct kobject * kobj , struct attribute * attr ,
const char * buf , size_t count )
{
2017-08-18 13:46:27 -07:00
const struct rx_queue_attribute * attribute = to_rx_queue_attr ( attr ) ;
2010-03-16 08:03:29 +00:00
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
2017-08-18 13:46:24 -07:00
return attribute - > store ( queue , buf , count ) ;
2010-03-16 08:03:29 +00:00
}
2010-08-31 12:14:13 +00:00
static const struct sysfs_ops rx_queue_sysfs_ops = {
2010-03-16 08:03:29 +00:00
. show = rx_queue_attr_show ,
. store = rx_queue_attr_store ,
} ;
2014-01-16 22:23:28 -08:00
# ifdef CONFIG_RPS
2017-08-18 13:46:24 -07:00
static ssize_t show_rps_map ( struct netdev_rx_queue * queue , char * buf )
2010-03-16 08:03:29 +00:00
{
struct rps_map * map ;
cpumask_var_t mask ;
2015-02-13 14:37:42 -08:00
int i , len ;
2010-03-16 08:03:29 +00:00
if ( ! zalloc_cpumask_var ( & mask , GFP_KERNEL ) )
return - ENOMEM ;
rcu_read_lock ( ) ;
map = rcu_dereference ( queue - > rps_map ) ;
if ( map )
for ( i = 0 ; i < map - > len ; i + + )
cpumask_set_cpu ( map - > cpus [ i ] , mask ) ;
2015-02-13 14:37:42 -08:00
len = snprintf ( buf , PAGE_SIZE , " %*pb \n " , cpumask_pr_args ( mask ) ) ;
2010-03-16 08:03:29 +00:00
rcu_read_unlock ( ) ;
free_cpumask_var ( mask ) ;
2015-02-13 14:37:42 -08:00
return len < PAGE_SIZE ? len : - EINVAL ;
2010-03-16 08:03:29 +00:00
}
2010-04-19 14:40:57 -07:00
static ssize_t store_rps_map ( struct netdev_rx_queue * queue ,
2017-08-18 13:46:24 -07:00
const char * buf , size_t len )
2010-03-16 08:03:29 +00:00
{
struct rps_map * old_map , * map ;
cpumask_var_t mask ;
2020-06-25 18:34:43 -04:00
int err , cpu , i , hk_flags ;
2015-08-13 14:03:16 -04:00
static DEFINE_MUTEX ( rps_map_mutex ) ;
2010-03-16 08:03:29 +00:00
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
if ( ! alloc_cpumask_var ( & mask , GFP_KERNEL ) )
return - ENOMEM ;
err = bitmap_parse ( buf , len , cpumask_bits ( mask ) , nr_cpumask_bits ) ;
if ( err ) {
free_cpumask_var ( mask ) ;
return err ;
}
2020-08-11 18:34:40 -07:00
if ( ! cpumask_empty ( mask ) ) {
hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ ;
cpumask_and ( mask , mask , housekeeping_cpumask ( hk_flags ) ) ;
if ( cpumask_empty ( mask ) ) {
free_cpumask_var ( mask ) ;
return - EINVAL ;
}
2020-06-25 18:34:43 -04:00
}
2012-04-15 05:58:06 +00:00
map = kzalloc ( max_t ( unsigned int ,
2017-08-18 13:46:28 -07:00
RPS_MAP_SIZE ( cpumask_weight ( mask ) ) , L1_CACHE_BYTES ) ,
GFP_KERNEL ) ;
2010-03-16 08:03:29 +00:00
if ( ! map ) {
free_cpumask_var ( mask ) ;
return - ENOMEM ;
}
i = 0 ;
for_each_cpu_and ( cpu , mask , cpu_online_mask )
map - > cpus [ i + + ] = cpu ;
2017-08-18 13:46:28 -07:00
if ( i ) {
2010-03-16 08:03:29 +00:00
map - > len = i ;
2017-08-18 13:46:28 -07:00
} else {
2010-03-16 08:03:29 +00:00
kfree ( map ) ;
map = NULL ;
}
2015-08-13 14:03:16 -04:00
mutex_lock ( & rps_map_mutex ) ;
2010-10-25 03:02:02 +00:00
old_map = rcu_dereference_protected ( queue - > rps_map ,
2015-08-13 14:03:16 -04:00
mutex_is_locked ( & rps_map_mutex ) ) ;
2010-03-16 08:03:29 +00:00
rcu_assign_pointer ( queue - > rps_map , map ) ;
2011-11-17 03:13:26 +00:00
if ( map )
2019-03-22 08:56:38 -07:00
static_branch_inc ( & rps_needed ) ;
2015-08-05 09:39:27 -07:00
if ( old_map )
2019-03-22 08:56:38 -07:00
static_branch_dec ( & rps_needed ) ;
2015-08-05 09:39:27 -07:00
2015-08-13 14:03:16 -04:00
mutex_unlock ( & rps_map_mutex ) ;
2015-08-05 09:39:27 -07:00
if ( old_map )
kfree_rcu ( old_map , rcu ) ;
2010-03-16 08:03:29 +00:00
free_cpumask_var ( mask ) ;
return len ;
}
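The core of store_rps_map(), turning a CPU bitmask into a compact array of CPU numbers and dropping the map entirely when no usable CPU remains, can be sketched in userspace. This is a simplified model under stated assumptions: a single 64-bit word replaces cpumask_var_t, struct demo_rps_map and demo_build_map() are hypothetical names, and the housekeeping filter, mutex and RCU publication of the real function are omitted.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct rps_map. */
struct demo_rps_map {
	unsigned int len;
	uint16_t cpus[];
};

/*
 * Build a map holding every CPU that is set in both `mask` and `online`.
 * Returns NULL when no CPU qualifies (or on allocation failure), which
 * mirrors the "kfree(map); map = NULL" branch above.
 */
struct demo_rps_map *demo_build_map(uint64_t mask, uint64_t online)
{
	uint64_t usable = mask & online;
	struct demo_rps_map *map;
	unsigned int cpu, n = 0;

	if (!usable)
		return NULL;

	/* Room for the worst case of 64 CPUs in this simplified model. */
	map = calloc(1, sizeof(*map) + 64 * sizeof(map->cpus[0]));
	if (!map)
		return NULL;

	for (cpu = 0; cpu < 64; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			map->cpus[n++] = (uint16_t)cpu;

	map->len = n;
	return map;
}
```

For example, a requested mask of 0xf with only CPUs 0 and 2 online yields a map of length 2 containing {0, 2}; the caller owns the result and releases it with free().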
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with the simple approach is that it would potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
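The three switching conditions listed above can be written as one small decision function. The sketch below is an illustration, not the kernel's get_rps_cpu(): demo_rfs_pick_cpu() and its parameters are hypothetical, DEMO_NO_CPU stands in for RPS_NO_CPU, and comparing the counters through a signed difference so that wraparound is tolerated is an assumption that goes beyond what the message states.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NO_CPU 0xffffu /* stand-in for RPS_NO_CPU */

/*
 * Decide which CPU should process the next packet of a flow.
 *  desired:        CPU recorded in the socket flow table
 *  current:        CPU recorded in the per-queue device flow table
 *  current_online: whether `current` is online
 *  current_head:   queue head counter of the current CPU's backlog
 *  last_qtail:     tail counter saved at the flow's last enqueue
 */
unsigned int demo_rfs_pick_cpu(unsigned int desired, unsigned int current,
			       bool current_online,
			       uint32_t current_head, uint32_t last_qtail)
{
	if (desired == current)
		return current;

	/* Nothing to preserve: the flow has no usable current CPU. */
	if (current == DEMO_NO_CPU || !current_online)
		return desired;

	/*
	 * head >= tail means every packet queued through this entry has
	 * been dequeued, so moving the flow cannot reorder packets.
	 */
	if ((int32_t)(current_head - last_qtail) >= 0)
		return desired;

	/* Packets are still pending on the current CPU: stay put. */
	return current;
}
```

The last branch is what prevents reordering: as long as the old backlog still holds packets of the flow, new packets keep following them to the same CPU.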
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
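Rounding the table size to a power of two lets a flow hash be reduced to a table index with a single AND against size - 1, which is why the show handler further down can report the entry count as mask + 1. A minimal sketch, with demo_roundup_pow2() and demo_flow_index() as hypothetical helper names rather than kernel functions:

```c
#include <stdint.h>

/* Round v up to the next power of two (returns 1 for v <= 1). */
uint32_t demo_roundup_pow2(uint32_t v)
{
	uint32_t p = 1;

	/* Stop at 2^31 so the shift can never overflow. */
	while (p < v && p < (UINT32_C(1) << 31))
		p <<= 1;
	return p;
}

/* Reduce a flow hash to a table index; mask is (table size - 1). */
uint32_t demo_flow_index(uint32_t hash, uint32_t mask)
{
	return hash & mask;
}
```

A request for 1000 entries therefore produces a 1024-entry table with mask 1023, and reading the count back gives 1023 + 1 = 1024.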
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS               104K tps at 30% CPU
  No RFS (best RPS config):   290K tps at 63% CPU
  RFS                         303K tps at 61% CPU

RPC test       tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS   103K   48%    757/900/3185             4472.35
  RPS only:    174K   73%    415/993/2468             491.66
  RFS          223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
static ssize_t show_rps_dev_flow_table_cnt ( struct netdev_rx_queue * queue ,
char * buf )
{
struct rps_dev_flow_table * flow_table ;
2011-12-24 06:56:49 +00:00
unsigned long val = 0 ;
rcu_read_lock ( ) ;
flow_table = rcu_dereference ( queue - > rps_flow_table ) ;
if ( flow_table )
2011-12-24 06:56:49 +00:00
val = ( unsigned long ) flow_table - > mask + 1 ;
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg), the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash, which is stored in
the socket structure. The rxhash is passed in skbs received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
The complication of this simple approach is that it could potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or
multiple threads are trying to read from the same sockets, a quickly
changing CPU value in the hash table could cause rampant OOO packets;
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number, and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
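The counter arithmetic above can be sketched in userspace C. This is a
minimal model, not the kernel's code: struct backlog_model and the
backlog_* helpers are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of one per-CPU backlog queue. */
struct backlog_model {
	uint32_t head;	/* incremented on every dequeue */
	uint32_t qlen;	/* number of packets currently queued */
};

/* Queue tail counter as described above: queue head count + queue length. */
static uint32_t backlog_tail(const struct backlog_model *q)
{
	return q->head + q->qlen;
}

/*
 * Enqueue one packet and return the tail counter value that would be
 * saved in the flow's rps_dev_flow_table entry.
 */
static uint32_t backlog_enqueue(struct backlog_model *q)
{
	q->qlen++;
	return backlog_tail(q);
}

/* Dequeue one packet: the head advances, the tail stays where it was. */
static void backlog_dequeue(struct backlog_model *q)
{
	assert(q->qlen > 0);
	q->qlen--;
	q->head++;
}
```

In this model a dequeue never changes the tail counter, so a saved tail
value stays comparable against the head counter later on.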
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
  rps_dev_flow table. This checks if the queue head has advanced
  beyond the last packet that was enqueued using this table entry.
  This guarantees that all packets queued using this entry have been
  dequeued, thus preserving in-order delivery.
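The three conditions above can be written as a single predicate. The
sketch below is a userspace illustration under stated assumptions:
MODEL_NO_CPU, struct flow_entry_model and may_switch_cpu() are hypothetical
names, and the signed-difference comparison is one assumed way to keep the
">=" test meaningful if the 32-bit counters wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for RPS_NO_CPU; the actual value is an assumption here. */
#define MODEL_NO_CPU 0xffffu

/* Hypothetical model of one rps_dev_flow_table entry. */
struct flow_entry_model {
	uint16_t cpu;		/* "current" CPU for the flow */
	uint32_t last_qtail;	/* tail counter saved at the last enqueue */
};

/*
 * Return true if the flow may be moved from its current CPU to the
 * desired CPU without risking out-of-order delivery.
 */
static bool may_switch_cpu(const struct flow_entry_model *e,
			   bool cur_cpu_online, uint32_t cur_cpu_head)
{
	if (e->cpu == MODEL_NO_CPU)	/* current CPU is unset */
		return true;
	if (!cur_cpu_online)		/* current CPU is offline */
		return true;
	/* head >= saved tail: everything queued via this entry was dequeued */
	return (int32_t)(cur_cpu_head - e->last_qtail) >= 0;
}
```

While the head counter is still behind the saved tail, the predicate
returns false and the flow stays on its current CPU.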
Making each queue have its own rps_dev_flow table has two advantages:
1) The tail queue counters will be written on each receive, so
   keeping the table local to the interrupting CPU is good for locality.
2) This allows lockless access to the table. The CPU number and queue
   tail counter need to be accessed together under mutual exclusion
   from netif_receive_skb; we assume that this is only called from
   device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow-oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
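A power-of-two table size lets a hash be reduced to an index with a single
AND against size - 1, which fits the sysfs show path reporting mask + 1 as
the entry count. The sketch below assumes the rounding is upward;
round_up_pow2() and table_mask() are hypothetical helper names, not kernel
functions.

```c
#include <assert.h>
#include <stdint.h>

/* Round a requested entry count up to the next power of two (assumed). */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	/* 0 has no sensible table size; larger values would overflow p. */
	assert(n != 0 && n <= 0x80000000u);
	while (p < n)
		p <<= 1;
	return p;
}

/* Mask that turns a hash into a table index: index = hash & mask. */
static uint32_t table_mask(uint32_t requested)
{
	return round_up_pow2(requested) - 1;
}
```

Under this assumption, a request for 1000 entries yields a 1024-entry
table with mask 1023.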
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher, this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS              104K tps at 30% CPU
  No RFS (best RPS config):  290K tps at 63% CPU
  RFS                        303K tps at 61% CPU
RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only:   174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
	rcu_read_unlock();
2011-12-24 06:56:49 +00:00
	return sprintf(buf, "%lu\n", val);
2010-04-16 16:01:27 -07:00
}
static void rps_dev_flow_table_release(struct rcu_head *rcu)
{
	struct rps_dev_flow_table *table = container_of(rcu,
	    struct rps_dev_flow_table, rcu);
2013-05-05 16:05:55 +00:00
	vfree(table);
2010-04-16 16:01:27 -07:00
}
2010-04-19 14:40:57 -07:00
static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
2017-08-18 13:46:24 -07:00
					    const char *buf, size_t len)
2010-04-16 16:01:27 -07:00
{
2011-12-24 06:56:49 +00:00
	unsigned long mask, count;
2010-04-16 16:01:27 -07:00
	struct rps_dev_flow_table *table, *old_table;
	static DEFINE_SPINLOCK(rps_dev_flow_lock);
2011-12-24 06:56:49 +00:00
	int rc;
2010-04-16 16:01:27 -07:00
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;
2011-12-24 06:56:49 +00:00
	rc = kstrtoul(buf, 0, &count);
	if (rc < 0)
		return rc;
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu),
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= the queue tail counter
saved in the rps_dev_flow table. This checks whether the queue head
has advanced past the last packet that was enqueued using this
table entry, which guarantees that all packets queued using this
entry have been dequeued, thus preserving in-order delivery.
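The three conditions above can be written as a single predicate. The following is a minimal sketch, assuming a hypothetical helper may_switch_cpu and a hypothetical NO_CPU marker standing in for RPS_NO_CPU; the signed-difference comparison is an assumption added here so that the check keeps working if the 32-bit counters wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu   /* hypothetical stand-in for RPS_NO_CPU */

/*
 * Decide whether a flow may move from its current CPU to the desired one.
 * current_head is the current CPU's queue head counter; last_qtail is the
 * queue tail counter saved in the flow's device table entry.
 */
static bool may_switch_cpu(uint16_t desired, uint16_t current,
			   bool current_online,
			   uint32_t current_head, uint32_t last_qtail)
{
	if (desired == current)
		return false;            /* nothing to change */
	if (current == NO_CPU)
		return true;             /* no current CPU set */
	if (!current_online)
		return true;             /* current CPU is offline */
	/* head >= saved tail, compared as a signed difference (assumption). */
	return (int32_t)(current_head - last_qtail) >= 0;
}
```

For example, with a saved tail of 5, a head counter of 4 means a packet of the flow may still be queued on the old CPU, so the switch is refused; once the head reaches 5 the flow can move without reordering.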
Making each queue have its own rps_dev_flow table has two advantages:
1) The queue tail counters are written on each receive, so keeping
the table local to the interrupting CPU is good for locality.
2) It allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from the
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded up to a power of two.
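The power-of-two rounding can be illustrated with the same bit-smearing loop this file uses when a new rps_flow_cnt is stored. The wrapper function flow_count_to_mask is a hypothetical name used only for this sketch; it returns the index mask, so the table holds mask + 1 entries.

```c
#include <assert.h>

/*
 * Return the index mask for a table with at least `count` entries.
 * Assumes count >= 1. Equivalent to roundup_pow_of_two(count) - 1,
 * but computed without the intermediate overflow.
 */
static unsigned long flow_count_to_mask(unsigned long count)
{
	unsigned long mask = count - 1;

	/* Copy the highest set bit into every lower bit position. */
	while ((mask | (mask >> 1)) != mask)
		mask |= (mask >> 1);
	return mask;
}
```

A requested count of 1000 therefore yields a mask of 1023, that is, a table of 1024 entries, while a count that is already a power of two is left unchanged.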
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS depend on cache hierarchy, application load,
and other factors. On simple benchmarks, we don't necessarily see
improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher, this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of the netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but it does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU
RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS 103K   48%    757/900/3185             4472.35
   RPS only:  174K   73%    415/993/2468             491.66
   RFS        223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
if (count) {
2011-12-24 06:56:49 +00:00
mask = count - 1;
/* mask = roundup_pow_of_two(count) - 1;
 * without overflows...
 */
while ((mask | (mask >> 1)) != mask)
mask |= (mask >> 1);
/* On 64 bit arches, must check mask fits in table->mask (u32),
2013-12-08 12:15:44 -08:00
 * and on 32 bit arches, must check
 * RPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
2011-12-24 06:56:49 +00:00
*/
#if BITS_PER_LONG > 32
if (mask > (unsigned long)(u32)mask)
2011-12-22 13:35:22 +00:00
return -EINVAL;
2011-12-24 06:56:49 +00:00
#else
if (mask > (ULONG_MAX - RPS_DEV_FLOW_TABLE_SIZE(1))
2011-12-22 13:35:22 +00:00
/ sizeof(struct rps_dev_flow)) {
2010-04-16 16:01:27 -07:00
/* Enforce a limit to prevent overflow */
return -EINVAL;
}
2011-12-24 06:56:49 +00:00
#endif
table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(mask + 1));
2010-04-16 16:01:27 -07:00
if (!table)
return -ENOMEM;
2011-12-24 06:56:49 +00:00
table->mask = mask;
for (count = 0; count <= mask; count++)
table->flows[count].cpu = RPS_NO_CPU;
2017-08-18 13:46:28 -07:00
} else {
2010-04-16 16:01:27 -07:00
table = NULL;
2017-08-18 13:46:28 -07:00
}
2010-04-16 16:01:27 -07:00
spin_lock(&rps_dev_flow_lock);
2010-10-25 03:02:02 +00:00
old_table = rcu_dereference_protected(queue->rps_flow_table,
lockdep_is_held(&rps_dev_flow_lock));
2010-04-16 16:01:27 -07:00
rcu_assign_pointer(queue->rps_flow_table, table);
spin_unlock(&rps_dev_flow_lock);
if (old_table)
call_rcu(&old_table->rcu, rps_dev_flow_table_release);
return len;
}
2017-08-18 13:46:27 -07:00
static struct rx_queue_attribute rps_cpus_attribute __ro_after_init
2018-03-23 15:54:38 -07:00
= __ATTR(rps_cpus, 0644, show_rps_map, store_rps_map);
2010-03-16 08:03:29 +00:00
2017-08-18 13:46:27 -07:00
static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute __ro_after_init
2018-03-23 15:54:38 -07:00
= __ATTR(rps_flow_cnt, 0644,
2017-08-18 13:46:27 -07:00
show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
2014-01-16 22:23:28 -08:00
#endif /* CONFIG_RPS */
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow_table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table: the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, and we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow_table
for the rxqueue. Both are rounded to a power of two.
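A power-of-two table size lets a hash be mapped to a slot with a mask instead of a modulo. The sketch below assumes the rounding is upward and uses hypothetical helper names; it is not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical helper: smallest power of two >= n, for n in 1..2^31. */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* With a power-of-two size, the slot is the low bits of the hash. */
static uint32_t flow_slot(uint32_t hash, uint32_t table_size)
{
	return hash & (table_size - 1);
}
```

For example, a request for 1000 entries would yield a 1024-entry table under this assumption, and each lookup keeps the low 10 bits of the hash.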
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
No RFS or RPS               104K tps at 30% CPU
No RFS (best RPS config):   290K tps at 63% CPU
RFS                         303K tps at 61% CPU
RPC test      tps     CPU%    50/90/99% usec latency    Latency StdDev
No RFS/RPS    103K    48%     757/900/3185              4472.35
RPS only:     174K    73%     415/993/2468              491.66
RFS           223K    73%     379/651/1382              315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
2017-08-18 13:46:27 -07:00
static struct attribute * rx_queue_default_attrs [ ] __ro_after_init = {
2014-01-16 22:23:28 -08:00
# ifdef CONFIG_RPS
2010-03-16 08:03:29 +00:00
& rps_cpus_attribute . attr ,
2010-04-16 16:01:27 -07:00
& rps_dev_flow_table_cnt_attribute . attr ,
2014-01-16 22:23:28 -08:00
# endif
2010-03-16 08:03:29 +00:00
NULL
} ;
2019-04-01 22:51:35 -04:00
ATTRIBUTE_GROUPS ( rx_queue_default ) ;
2010-03-16 08:03:29 +00:00
static void rx_queue_release ( struct kobject * kobj )
{
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
2014-01-16 22:23:28 -08:00
# ifdef CONFIG_RPS
2010-10-25 03:02:02 +00:00
struct rps_map * map ;
struct rps_dev_flow_table * flow_table ;
2010-03-16 08:03:29 +00:00
2011-08-11 19:30:52 +00:00
map = rcu_dereference_protected ( queue - > rps_map , 1 ) ;
2010-11-16 06:31:39 +00:00
if ( map ) {
RCU_INIT_POINTER ( queue - > rps_map , NULL ) ;
2011-03-18 12:01:31 +08:00
kfree_rcu ( map , rcu ) ;
2010-11-16 06:31:39 +00:00
}
2010-10-25 03:02:02 +00:00
2011-08-11 19:30:52 +00:00
flow_table = rcu_dereference_protected ( queue - > rps_flow_table , 1 ) ;
2010-11-16 06:31:39 +00:00
if ( flow_table ) {
RCU_INIT_POINTER ( queue - > rps_flow_table , NULL ) ;
2010-10-25 03:02:02 +00:00
call_rcu ( & flow_table - > rcu , rps_dev_flow_table_release ) ;
2010-11-16 06:31:39 +00:00
}
2014-01-16 22:23:28 -08:00
# endif
2010-03-16 08:03:29 +00:00
2010-11-16 06:31:39 +00:00
memset ( kobj , 0 , sizeof ( * kobj ) ) ;
2021-12-04 20:21:58 -08:00
dev_put_track ( queue - > dev , & queue - > dev_tracker ) ;
2010-03-16 08:03:29 +00:00
}
2014-01-16 17:24:31 +08:00
static const void * rx_queue_namespace ( struct kobject * kobj )
{
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
struct device * dev = & queue - > dev - > dev ;
const void * ns = NULL ;
if ( dev - > class & & dev - > class - > ns_type )
ns = dev - > class - > namespace ( dev ) ;
return ns ;
}
2018-07-20 21:56:52 +00:00
static void rx_queue_get_ownership ( struct kobject * kobj ,
kuid_t * uid , kgid_t * gid )
{
const struct net * net = rx_queue_namespace ( kobj ) ;
net_ns_get_ownership ( net , uid , gid ) ;
}
2017-08-18 13:46:27 -07:00
static struct kobj_type rx_queue_ktype __ro_after_init = {
2010-03-16 08:03:29 +00:00
. sysfs_ops = & rx_queue_sysfs_ops ,
. release = rx_queue_release ,
2019-04-01 22:51:35 -04:00
. default_groups = rx_queue_default_groups ,
2018-07-20 21:56:52 +00:00
. namespace = rx_queue_namespace ,
. get_ownership = rx_queue_get_ownership ,
2010-03-16 08:03:29 +00:00
} ;
2014-07-23 16:09:10 -07:00
static int rx_queue_add_kobject ( struct net_device * dev , int index )
2010-03-16 08:03:29 +00:00
{
2014-07-23 16:09:10 -07:00
struct netdev_rx_queue * queue = dev - > _rx + index ;
2010-03-16 08:03:29 +00:00
struct kobject * kobj = & queue - > kobj ;
int error = 0 ;
2019-12-17 13:46:34 +02:00
/* Kobject_put later will trigger rx_queue_release call which
* decreases dev refcount : Take that reference here
*/
2021-12-04 20:21:58 -08:00
dev_hold_track ( queue - > dev , & queue - > dev_tracker , GFP_KERNEL ) ;
2019-12-17 13:46:34 +02:00
2014-07-23 16:09:10 -07:00
kobj - > kset = dev - > queues_kset ;
2010-03-16 08:03:29 +00:00
error = kobject_init_and_add ( kobj , & rx_queue_ktype , NULL ,
2017-08-18 13:46:28 -07:00
" rx-%u " , index ) ;
2014-01-16 22:23:28 -08:00
if ( error )
2019-11-20 09:08:16 +02:00
goto err ;
2014-01-16 22:23:28 -08:00
2014-07-23 16:09:10 -07:00
if ( dev - > sysfs_rx_queue_group ) {
error = sysfs_create_group ( kobj , dev - > sysfs_rx_queue_group ) ;
2019-11-20 09:08:16 +02:00
if ( error )
goto err ;
2010-03-16 08:03:29 +00:00
}
kobject_uevent ( kobj , KOBJ_ADD ) ;
return error ;
2019-11-20 09:08:16 +02:00
err :
kobject_put ( kobj ) ;
return error ;
2010-03-16 08:03:29 +00:00
}
2020-02-27 04:37:18 +01:00
static int rx_queue_change_owner ( struct net_device * dev , int index , kuid_t kuid ,
kgid_t kgid )
{
struct netdev_rx_queue * queue = dev - > _rx + index ;
struct kobject * kobj = & queue - > kobj ;
int error ;
error = sysfs_change_owner ( kobj , kuid , kgid ) ;
if ( error )
return error ;
if ( dev - > sysfs_rx_queue_group )
error = sysfs_group_change_owner (
kobj , dev - > sysfs_rx_queue_group , kuid , kgid ) ;
return error ;
}
2014-02-09 14:07:11 +01:00
# endif /* CONFIG_SYSFS */
2010-03-16 08:03:29 +00:00
2010-09-27 08:24:33 +00:00
int
2014-07-23 16:09:10 -07:00
net_rx_queue_update_kobjects ( struct net_device * dev , int old_num , int new_num )
2010-03-16 08:03:29 +00:00
{
2014-01-16 22:23:28 -08:00
# ifdef CONFIG_SYSFS
2010-03-16 08:03:29 +00:00
int i ;
int error = 0 ;
2014-01-16 22:23:28 -08:00
# ifndef CONFIG_RPS
2014-07-23 16:09:10 -07:00
if ( ! dev - > sysfs_rx_queue_group )
2014-01-16 22:23:28 -08:00
return 0 ;
# endif
2010-09-27 08:24:33 +00:00
for ( i = old_num ; i < new_num ; i + + ) {
2014-07-23 16:09:10 -07:00
error = rx_queue_add_kobject ( dev , i ) ;
2010-09-27 08:24:33 +00:00
if ( error ) {
new_num = old_num ;
2010-03-16 08:03:29 +00:00
break ;
2010-09-27 08:24:33 +00:00
}
2010-03-16 08:03:29 +00:00
}
2014-01-16 22:23:28 -08:00
while ( - - i > = new_num ) {
2016-10-24 19:09:53 -07:00
struct kobject * kobj = & dev - > _rx [ i ] . kobj ;
2020-08-19 14:06:36 +02:00
if ( ! refcount_read ( & dev_net ( dev ) - > ns . count ) )
2016-10-24 19:09:53 -07:00
kobj - > uevent_suppress = 1 ;
2014-07-23 16:09:10 -07:00
if ( dev - > sysfs_rx_queue_group )
2016-10-24 19:09:53 -07:00
sysfs_remove_group ( kobj , dev - > sysfs_rx_queue_group ) ;
kobject_put ( kobj ) ;
2014-01-16 22:23:28 -08:00
}
2010-03-16 08:03:29 +00:00
return error ;
2010-11-26 08:36:09 +00:00
# else
return 0 ;
# endif
2010-03-16 08:03:29 +00:00
}
2020-02-27 04:37:18 +01:00
static int net_rx_queue_change_owner ( struct net_device * dev , int num ,
kuid_t kuid , kgid_t kgid )
{
# ifdef CONFIG_SYSFS
int error = 0 ;
int i ;
# ifndef CONFIG_RPS
if ( ! dev - > sysfs_rx_queue_group )
return 0 ;
# endif
for ( i = 0 ; i < num ; i + + ) {
error = rx_queue_change_owner ( dev , i , kuid , kgid ) ;
if ( error )
break ;
}
return error ;
# else
return 0 ;
# endif
}
2011-11-16 12:15:10 +00:00
# ifdef CONFIG_SYSFS
2010-11-21 13:17:27 +00:00
/*
* netdev_queue sysfs structures and functions .
*/
struct netdev_queue_attribute {
struct attribute attr ;
2017-08-18 13:46:24 -07:00
ssize_t ( * show ) ( struct netdev_queue * queue , char * buf ) ;
2010-11-21 13:17:27 +00:00
ssize_t ( * store ) ( struct netdev_queue * queue ,
2017-08-18 13:46:24 -07:00
const char * buf , size_t len ) ;
2010-11-21 13:17:27 +00:00
} ;
2017-08-18 13:46:28 -07:00
# define to_netdev_queue_attr(_attr) \
container_of ( _attr , struct netdev_queue_attribute , attr )
2010-11-21 13:17:27 +00:00
# define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
static ssize_t netdev_queue_attr_show ( struct kobject * kobj ,
struct attribute * attr , char * buf )
{
2017-08-18 13:46:27 -07:00
const struct netdev_queue_attribute * attribute
= to_netdev_queue_attr ( attr ) ;
2010-11-21 13:17:27 +00:00
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
2017-08-18 13:46:24 -07:00
return attribute - > show ( queue , buf ) ;
2010-11-21 13:17:27 +00:00
}
static ssize_t netdev_queue_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t count )
{
2017-08-18 13:46:27 -07:00
const struct netdev_queue_attribute * attribute
= to_netdev_queue_attr ( attr ) ;
2010-11-21 13:17:27 +00:00
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
2017-08-18 13:46:24 -07:00
return attribute - > store ( queue , buf , count ) ;
2010-11-21 13:17:27 +00:00
}
static const struct sysfs_ops netdev_queue_sysfs_ops = {
. show = netdev_queue_attr_show ,
. store = netdev_queue_attr_store ,
} ;
2017-08-18 13:46:26 -07:00
static ssize_t tx_timeout_show ( struct netdev_queue * queue , char * buf )
2011-11-16 12:15:10 +00:00
{
2021-11-16 19:29:21 -08:00
unsigned long trans_timeout = atomic_long_read ( & queue - > trans_timeout ) ;
2011-11-16 12:15:10 +00:00
2020-07-21 15:02:57 +08:00
return sprintf ( buf , fmt_ulong , trans_timeout ) ;
2011-11-16 12:15:10 +00:00
}
2015-09-15 18:28:00 -03:00
static unsigned int get_netdev_queue_index ( struct netdev_queue * queue )
2015-03-18 14:57:33 +02:00
{
struct net_device * dev = queue - > dev ;
2015-09-15 18:28:00 -03:00
unsigned int i ;
2015-03-18 14:57:33 +02:00
2015-09-15 18:28:00 -03:00
i = queue - dev - > _tx ;
2015-03-18 14:57:33 +02:00
BUG_ON ( i > = dev - > num_tx_queues ) ;
return i ;
}
2017-08-18 13:46:26 -07:00
static ssize_t traffic_class_show ( struct netdev_queue * queue ,
2016-10-28 11:43:49 -04:00
char * buf )
{
struct net_device * dev = queue - > dev ;
2021-02-08 14:29:18 -08:00
int num_tc , tc ;
2018-07-09 12:19:32 -04:00
int index ;
2016-10-28 11:43:49 -04:00
2018-07-09 12:19:32 -04:00
if ( ! netif_is_multiqueue ( dev ) )
return - ENOENT ;
2021-02-08 14:29:18 -08:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
2018-07-09 12:19:32 -04:00
index = get_netdev_queue_index ( queue ) ;
2018-07-09 12:19:38 -04:00
/* If queue belongs to subordinate dev use its TC mapping */
dev = netdev_get_tx_queue ( dev , index ) - > sb_dev ? : dev ;
2021-02-08 14:29:18 -08:00
num_tc = dev - > num_tc ;
2018-07-09 12:19:32 -04:00
tc = netdev_txq_to_tc ( dev , index ) ;
2021-02-08 14:29:18 -08:00
rtnl_unlock ( ) ;
2016-10-28 11:43:49 -04:00
if ( tc < 0 )
return - EINVAL ;
2018-07-09 12:19:38 -04:00
/* We can report the traffic class one of two ways:
* Subordinate device traffic classes are reported with the traffic
* class first , and then the subordinate class so for example TC0 on
* subordinate device 2 will be reported as " 0-2 " . If the queue
* belongs to the root device it will be reported with just the
* traffic class , so just " 0 " for TC 0 for example .
*/
2021-02-08 14:29:18 -08:00
return num_tc < 0 ? sprintf ( buf , " %d%d \n " , tc , num_tc ) :
sprintf ( buf , " %d \n " , tc ) ;
2016-10-28 11:43:49 -04:00
}
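The two-format return above can look like a missing separator at first glance. The userspace sketch below (format_tc is a hypothetical name, and snprintf stands in for the kernel's sprintf) shows how, assuming num_tc is negative for a subordinate device as the `num_tc < 0` test suggests, the minus sign of the second value acts as the separator.

```c
#include <stdio.h>

/* Hypothetical userspace mirror of the final formatting step. */
static int format_tc(char *buf, size_t size, int tc, int num_tc)
{
	/* A negative num_tc prints its own '-', producing "tc-subclass". */
	return num_tc < 0 ? snprintf(buf, size, "%d%d\n", tc, num_tc)
			  : snprintf(buf, size, "%d\n", tc);
}
```

Under that assumption, tc 0 with num_tc -2 is rendered as "0-2", matching the example in the comment, while a root-device queue is rendered as just "0".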
# ifdef CONFIG_XPS
2017-08-18 13:46:26 -07:00
static ssize_t tx_maxrate_show ( struct netdev_queue * queue ,
2015-03-18 14:57:33 +02:00
char * buf )
{
return sprintf ( buf , " %lu \n " , queue - > tx_maxrate ) ;
}
2017-08-18 13:46:26 -07:00
static ssize_t tx_maxrate_store ( struct netdev_queue * queue ,
const char * buf , size_t len )
2015-03-18 14:57:33 +02:00
{
struct net_device * dev = queue - > dev ;
int err , index = get_netdev_queue_index ( queue ) ;
u32 rate = 0 ;
2018-07-20 21:56:51 +00:00
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
2021-10-07 16:00:51 +02:00
/* The check is also done later; this helps returning early without
* hitting the trylock / restart below .
*/
if ( ! dev - > netdev_ops - > ndo_set_tx_maxrate )
return - EOPNOTSUPP ;
2015-03-18 14:57:33 +02:00
err = kstrtou32 ( buf , 10 , & rate ) ;
if ( err < 0 )
return err ;
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
err = - EOPNOTSUPP ;
if ( dev - > netdev_ops - > ndo_set_tx_maxrate )
err = dev - > netdev_ops - > ndo_set_tx_maxrate ( dev , index , rate ) ;
rtnl_unlock ( ) ;
if ( ! err ) {
queue - > tx_maxrate = rate ;
return len ;
}
return err ;
}
2017-08-18 13:46:26 -07:00
static struct netdev_queue_attribute queue_tx_maxrate __ro_after_init
= __ATTR_RW ( tx_maxrate ) ;
2015-03-18 14:57:33 +02:00
# endif
2017-08-18 13:46:26 -07:00
static struct netdev_queue_attribute queue_trans_timeout __ro_after_init
= __ATTR_RO ( tx_timeout ) ;
2011-11-16 12:15:10 +00:00
2017-08-18 13:46:26 -07:00
static struct netdev_queue_attribute queue_traffic_class __ro_after_init
= __ATTR_RO ( traffic_class ) ;
2016-10-28 11:43:49 -04:00
2011-11-28 16:33:09 +00:00
# ifdef CONFIG_BQL
/*
* Byte queue limits sysfs structures and functions .
*/
static ssize_t bql_show ( char * buf , unsigned int value )
{
return sprintf ( buf , " %u \n " , value ) ;
}
static ssize_t bql_set ( const char * buf , const size_t count ,
unsigned int * pvalue )
{
unsigned int value ;
int err ;
2017-08-18 13:46:28 -07:00
if ( ! strcmp ( buf , " max " ) | | ! strcmp ( buf , " max \n " ) ) {
2011-11-28 16:33:09 +00:00
value = DQL_MAX_LIMIT ;
2017-08-18 13:46:28 -07:00
} else {
2011-11-28 16:33:09 +00:00
err = kstrtouint ( buf , 10 , & value ) ;
if ( err < 0 )
return err ;
if ( value > DQL_MAX_LIMIT )
return - EINVAL ;
}
* pvalue = value ;
return count ;
}
static ssize_t bql_show_hold_time ( struct netdev_queue * queue ,
char * buf )
{
struct dql * dql = & queue - > dql ;
return sprintf ( buf , " %u \n " , jiffies_to_msecs ( dql - > slack_hold_time ) ) ;
}
static ssize_t bql_set_hold_time ( struct netdev_queue * queue ,
const char * buf , size_t len )
{
struct dql * dql = & queue - > dql ;
2012-04-15 05:58:06 +00:00
unsigned int value ;
2011-11-28 16:33:09 +00:00
int err ;
err = kstrtouint ( buf , 10 , & value ) ;
if ( err < 0 )
return err ;
dql - > slack_hold_time = msecs_to_jiffies ( value ) ;
return len ;
}
2017-08-18 13:46:25 -07:00
static struct netdev_queue_attribute bql_hold_time_attribute __ro_after_init
2018-03-23 15:54:38 -07:00
= __ATTR ( hold_time , 0644 ,
2017-08-18 13:46:25 -07:00
bql_show_hold_time , bql_set_hold_time ) ;
2011-11-28 16:33:09 +00:00
static ssize_t bql_show_inflight ( struct netdev_queue * queue ,
char * buf )
{
struct dql * dql = & queue - > dql ;
return sprintf ( buf , " %u \n " , dql - > num_queued - dql - > num_completed ) ;
}
2017-08-18 13:46:25 -07:00
static struct netdev_queue_attribute bql_inflight_attribute __ro_after_init =
2018-03-23 15:54:38 -07:00
__ATTR ( inflight , 0444 , bql_show_inflight , NULL ) ;
2011-11-28 16:33:09 +00:00
# define BQL_ATTR(NAME, FIELD) \
static ssize_t bql_show_ # # NAME ( struct netdev_queue * queue , \
char * buf ) \
{ \
return bql_show ( buf , queue - > dql . FIELD ) ; \
} \
\
static ssize_t bql_set_ # # NAME ( struct netdev_queue * queue , \
const char * buf , size_t len ) \
{ \
return bql_set ( buf , len , & queue - > dql . FIELD ) ; \
} \
\
2017-08-18 13:46:25 -07:00
static struct netdev_queue_attribute bql_ # # NAME # # _attribute __ro_after_init \
2018-03-23 15:54:38 -07:00
= __ATTR ( NAME , 0644 , \
2017-08-18 13:46:25 -07:00
bql_show_ # # NAME , bql_set_ # # NAME )
2011-11-28 16:33:09 +00:00
2017-08-18 13:46:25 -07:00
BQL_ATTR ( limit , limit ) ;
BQL_ATTR ( limit_max , max_limit ) ;
BQL_ATTR ( limit_min , min_limit ) ;
2011-11-28 16:33:09 +00:00
2017-08-18 13:46:25 -07:00
static struct attribute * dql_attrs [ ] __ro_after_init = {
2011-11-28 16:33:09 +00:00
& bql_limit_attribute . attr ,
& bql_limit_max_attribute . attr ,
& bql_limit_min_attribute . attr ,
& bql_hold_time_attribute . attr ,
& bql_inflight_attribute . attr ,
NULL
} ;
2017-06-29 16:31:26 +05:30
static const struct attribute_group dql_group = {
2011-11-28 16:33:09 +00:00
. name = " byte_queue_limits " ,
. attrs = dql_attrs ,
} ;
# endif /* CONFIG_BQL */
2011-11-16 12:15:10 +00:00
# ifdef CONFIG_XPS
2021-03-18 19:37:50 +01:00
static ssize_t xps_queue_show ( struct net_device * dev , unsigned int index ,
int tc , char * buf , enum xps_map_type type )
2010-11-21 13:17:27 +00:00
{
struct xps_dev_maps * dev_maps ;
2021-03-18 19:37:41 +01:00
unsigned long * mask ;
2021-03-18 19:37:50 +01:00
unsigned int nr_ids ;
int j , len ;
2021-03-18 19:37:49 +01:00
2021-03-18 19:37:44 +01:00
rcu_read_lock ( ) ;
2021-03-18 19:37:50 +01:00
dev_maps = rcu_dereference ( dev - > xps_maps [ type ] ) ;
/* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
* when dev_maps hasn ' t been allocated yet , to be backward compatible .
*/
nr_ids = dev_maps ? dev_maps - > nr_ids :
( type = = XPS_CPUS ? nr_cpu_ids : dev - > num_rx_queues ) ;
2021-03-18 19:37:44 +01:00
2021-03-22 16:43:29 +01:00
mask = bitmap_zalloc ( nr_ids , GFP_NOWAIT ) ;
2021-03-18 19:37:40 +01:00
if ( ! mask ) {
2021-03-18 19:37:50 +01:00
rcu_read_unlock ( ) ;
return - ENOMEM ;
2020-12-23 22:23:21 +01:00
}
2018-05-31 15:59:46 -04:00
2021-03-18 19:37:43 +01:00
if ( ! dev_maps | | tc > = dev_maps - > num_tc )
2021-03-18 19:37:42 +01:00
goto out_no_maps ;
2021-03-18 19:37:45 +01:00
for ( j = 0 ; j < nr_ids ; j + + ) {
2021-03-18 19:37:43 +01:00
int i , tci = j * dev_maps - > num_tc + tc ;
2021-03-18 19:37:42 +01:00
struct xps_map * map ;
map = rcu_dereference ( dev_maps - > attr_map [ tci ] ) ;
if ( ! map )
continue ;
for ( i = map - > len ; i - - ; ) {
if ( map - > queues [ i ] = = index ) {
2021-11-21 19:01:03 +01:00
__set_bit ( j , mask ) ;
2021-03-18 19:37:42 +01:00
break ;
2010-11-21 13:17:27 +00:00
}
}
}
2021-03-18 19:37:42 +01:00
out_no_maps :
2010-11-21 13:17:27 +00:00
rcu_read_unlock ( ) ;
2020-12-23 22:23:21 +01:00
2021-03-18 19:37:44 +01:00
len = bitmap_print_to_pagebuf ( false , buf , mask , nr_ids ) ;
2021-03-18 19:37:40 +01:00
bitmap_free ( mask ) ;
2021-03-18 19:37:50 +01:00
2015-02-13 14:37:42 -08:00
return len < PAGE_SIZE ? len : - EINVAL ;
2021-03-18 19:37:50 +01:00
}
static ssize_t xps_cpus_show ( struct netdev_queue * queue , char * buf )
{
struct net_device * dev = queue - > dev ;
unsigned int index ;
int len , tc ;
if ( ! netif_is_multiqueue ( dev ) )
return - ENOENT ;
index = get_netdev_queue_index ( queue ) ;
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue ( dev , index ) - > sb_dev ? : dev ;
tc = netdev_txq_to_tc ( dev , index ) ;
if ( tc < 0 ) {
rtnl_unlock ( ) ;
return - EINVAL ;
}
/* Make sure the subordinate device can't be freed */
get_device ( & dev - > dev ) ;
rtnl_unlock ( ) ;
len = xps_queue_show ( dev , index , tc , buf , XPS_CPUS ) ;
2020-12-23 22:23:21 +01:00
2021-03-18 19:37:49 +01:00
put_device ( & dev - > dev ) ;
2021-03-18 19:37:50 +01:00
return len ;
2010-11-21 13:17:27 +00:00
}
2017-08-18 13:46:26 -07:00
static ssize_t xps_cpus_store ( struct netdev_queue * queue ,
const char * buf , size_t len )
2010-11-21 13:17:27 +00:00
{
struct net_device * dev = queue - > dev ;
2021-03-18 19:37:41 +01:00
unsigned int index ;
2013-01-10 08:57:02 +00:00
cpumask_var_t mask ;
int err ;
2010-11-21 13:17:27 +00:00
2018-07-09 12:19:32 -04:00
if ( ! netif_is_multiqueue ( dev ) )
return - ENOENT ;
2010-11-21 13:17:27 +00:00
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
if ( ! alloc_cpumask_var ( & mask , GFP_KERNEL ) )
return - ENOMEM ;
index = get_netdev_queue_index ( queue ) ;
err = bitmap_parse ( buf , len , cpumask_bits ( mask ) , nr_cpumask_bits ) ;
if ( err ) {
free_cpumask_var ( mask ) ;
return err ;
}
2020-12-23 22:23:20 +01:00
if ( ! rtnl_trylock ( ) ) {
free_cpumask_var ( mask ) ;
return restart_syscall ( ) ;
}
2013-01-10 08:57:02 +00:00
err = netif_set_xps_queue ( dev , mask , index ) ;
2020-12-23 22:23:20 +01:00
rtnl_unlock ( ) ;
2010-11-21 13:17:27 +00:00
free_cpumask_var ( mask ) ;
2013-01-10 08:57:02 +00:00
return err ? : len ;
2010-11-21 13:17:27 +00:00
}
2017-08-18 13:46:26 -07:00
static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
= __ATTR_RW ( xps_cpus ) ;
2018-06-29 21:27:07 -07:00
static ssize_t xps_rxqs_show ( struct netdev_queue * queue , char * buf )
{
struct net_device * dev = queue - > dev ;
2021-03-18 19:37:50 +01:00
unsigned int index ;
int tc ;
2018-06-29 21:27:07 -07:00
index = get_netdev_queue_index ( queue ) ;
2020-12-23 22:23:23 +01:00
if ( ! rtnl_trylock ( ) )
return restart_syscall ( ) ;
2021-03-18 19:37:43 +01:00
tc = netdev_txq_to_tc ( dev , index ) ;
2021-03-18 19:37:49 +01:00
rtnl_unlock ( ) ;
if ( tc < 0 )
return - EINVAL ;
2021-03-18 19:37:43 +01:00
2021-03-18 19:37:50 +01:00
return xps_queue_show ( dev , index , tc , buf , XPS_RXQS ) ;
2018-06-29 21:27:07 -07:00
}
static ssize_t xps_rxqs_store ( struct netdev_queue * queue , const char * buf ,
size_t len )
{
struct net_device * dev = queue - > dev ;
struct net * net = dev_net ( dev ) ;
2021-03-18 19:37:41 +01:00
unsigned long * mask ;
unsigned int index ;
2018-06-29 21:27:07 -07:00
int err ;
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
return - EPERM ;
2019-03-04 11:48:56 +02:00
mask = bitmap_zalloc ( dev - > num_rx_queues , GFP_KERNEL ) ;
2018-06-29 21:27:07 -07:00
if ( ! mask )
return - ENOMEM ;
index = get_netdev_queue_index ( queue ) ;
err = bitmap_parse ( buf , len , mask , dev - > num_rx_queues ) ;
if ( err ) {
2019-03-04 11:48:56 +02:00
bitmap_free ( mask ) ;
2018-06-29 21:27:07 -07:00
return err ;
}
2020-12-23 22:23:22 +01:00
if ( ! rtnl_trylock ( ) ) {
bitmap_free ( mask ) ;
return restart_syscall ( ) ;
}
2018-08-08 20:07:35 -07:00
cpus_read_lock ( ) ;
2021-03-18 19:37:46 +01:00
err = __netif_set_xps_queue ( dev , mask , index , XPS_RXQS ) ;
2018-08-08 20:07:35 -07:00
cpus_read_unlock ( ) ;
2020-12-23 22:23:22 +01:00
rtnl_unlock ( ) ;
2019-03-04 11:48:56 +02:00
bitmap_free ( mask ) ;
2018-06-29 21:27:07 -07:00
return err ? : len ;
}
static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
= __ATTR_RW ( xps_rxqs ) ;
2011-11-16 12:15:10 +00:00
# endif /* CONFIG_XPS */
2010-11-21 13:17:27 +00:00
2017-08-18 13:46:26 -07:00
static struct attribute * netdev_queue_default_attrs [ ] __ro_after_init = {
2011-11-16 12:15:10 +00:00
& queue_trans_timeout . attr ,
2016-10-28 11:43:49 -04:00
& queue_traffic_class . attr ,
2011-11-16 12:15:10 +00:00
# ifdef CONFIG_XPS
2010-11-21 13:17:27 +00:00
& xps_cpus_attribute . attr ,
2018-06-29 21:27:07 -07:00
& xps_rxqs_attribute . attr ,
2015-03-18 14:57:33 +02:00
& queue_tx_maxrate . attr ,
2011-11-16 12:15:10 +00:00
# endif
2010-11-21 13:17:27 +00:00
NULL
} ;
2019-04-01 22:51:35 -04:00
ATTRIBUTE_GROUPS ( netdev_queue_default ) ;
2010-11-21 13:17:27 +00:00
static void netdev_queue_release ( struct kobject * kobj )
{
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
memset ( kobj , 0 , sizeof ( * kobj ) ) ;
2021-12-04 20:21:59 -08:00
dev_put_track ( queue - > dev , & queue - > dev_tracker ) ;
2010-11-21 13:17:27 +00:00
}
2014-01-16 17:24:31 +08:00
static const void * netdev_queue_namespace ( struct kobject * kobj )
{
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
struct device * dev = & queue - > dev - > dev ;
const void * ns = NULL ;
if ( dev - > class & & dev - > class - > ns_type )
ns = dev - > class - > namespace ( dev ) ;
return ns ;
}
2018-07-20 21:56:52 +00:00
static void netdev_queue_get_ownership ( struct kobject * kobj ,
kuid_t * uid , kgid_t * gid )
{
const struct net * net = netdev_queue_namespace ( kobj ) ;
net_ns_get_ownership ( net , uid , gid ) ;
}
2017-08-18 13:46:26 -07:00
static struct kobj_type netdev_queue_ktype __ro_after_init = {
2010-11-21 13:17:27 +00:00
. sysfs_ops = & netdev_queue_sysfs_ops ,
. release = netdev_queue_release ,
2019-04-01 22:51:35 -04:00
. default_groups = netdev_queue_default_groups ,
2014-01-16 17:24:31 +08:00
. namespace = netdev_queue_namespace ,
2018-07-20 21:56:52 +00:00
. get_ownership = netdev_queue_get_ownership ,
2010-11-21 13:17:27 +00:00
} ;
2014-07-23 16:09:10 -07:00
static int netdev_queue_add_kobject ( struct net_device * dev , int index )
2010-11-21 13:17:27 +00:00
{
2014-07-23 16:09:10 -07:00
struct netdev_queue * queue = dev - > _tx + index ;
2010-11-21 13:17:27 +00:00
struct kobject * kobj = & queue - > kobj ;
int error = 0 ;
2019-12-05 15:57:07 +02:00
/* Kobject_put later will trigger netdev_queue_release call
* which decreases dev refcount : Take that reference here
*/
2021-12-04 20:21:59 -08:00
dev_hold_track ( queue - > dev , & queue - > dev_tracker , GFP_KERNEL ) ;
2019-12-05 15:57:07 +02:00
2014-07-23 16:09:10 -07:00
kobj - > kset = dev - > queues_kset ;
2010-11-21 13:17:27 +00:00
error = kobject_init_and_add ( kobj , & netdev_queue_ktype , NULL ,
2017-08-18 13:46:28 -07:00
" tx-%u " , index ) ;
2011-11-28 16:33:09 +00:00
if ( error )
2019-11-20 09:08:16 +02:00
goto err ;
2011-11-28 16:33:09 +00:00
# ifdef CONFIG_BQL
error = sysfs_create_group ( kobj , & dql_group ) ;
2019-11-20 09:08:16 +02:00
if ( error )
goto err ;
2011-11-28 16:33:09 +00:00
# endif
2010-11-21 13:17:27 +00:00
kobject_uevent ( kobj , KOBJ_ADD ) ;
2019-11-20 19:19:07 -08:00
return 0 ;
2010-11-21 13:17:27 +00:00
2019-11-20 09:08:16 +02:00
err :
kobject_put ( kobj ) ;
return error ;
2010-11-21 13:17:27 +00:00
}
2020-02-27 04:37:18 +01:00
static int tx_queue_change_owner ( struct net_device * ndev , int index ,
kuid_t kuid , kgid_t kgid )
{
struct netdev_queue * queue = ndev - > _tx + index ;
struct kobject * kobj = & queue - > kobj ;
int error ;
error = sysfs_change_owner ( kobj , kuid , kgid ) ;
if ( error )
return error ;
# ifdef CONFIG_BQL
error = sysfs_group_change_owner ( kobj , & dql_group , kuid , kgid ) ;
# endif
return error ;
}
2011-11-16 12:15:10 +00:00
# endif /* CONFIG_SYSFS */
2010-11-21 13:17:27 +00:00
/* Grow or shrink the set of TX queue kobjects from old_num to new_num.
 * If adding a queue fails, new_num is reset to old_num so that the removal
 * loop below unwinds the queues added by this call.
 */
int
2014-07-23 16:09:10 -07:00
netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
2010-11-21 13:17:27 +00:00
{
2011-11-16 12:15:10 +00:00
#ifdef CONFIG_SYSFS
2010-11-21 13:17:27 +00:00
	int i;
	int error = 0;
	for (i = old_num; i < new_num; i++) {
2014-07-23 16:09:10 -07:00
		error = netdev_queue_add_kobject(dev, i);
2010-11-21 13:17:27 +00:00
		if (error) {
			new_num = old_num;
			break;
		}
	}
2011-11-28 16:33:09 +00:00
	while (--i >= new_num) {
2014-07-23 16:09:10 -07:00
		struct netdev_queue *queue = dev->_tx + i;
2011-11-28 16:33:09 +00:00
2020-08-19 14:06:36 +02:00
		if (!refcount_read(&dev_net(dev)->ns.count))
2016-10-24 19:09:53 -07:00
			queue->kobj.uevent_suppress = 1;
2011-11-28 16:33:09 +00:00
#ifdef CONFIG_BQL
		sysfs_remove_group(&queue->kobj, &dql_group);
#endif
		kobject_put(&queue->kobj);
	}
2010-11-21 13:17:27 +00:00
	return error;
2010-11-26 08:36:09 +00:00
#else
	return 0;
2011-11-16 12:15:10 +00:00
#endif /* CONFIG_SYSFS */
2010-11-21 13:17:27 +00:00
}
2020-02-27 04:37:18 +01:00
static int net_tx_queue_change_owner(struct net_device *dev, int num,
				     kuid_t kuid, kgid_t kgid)
{
#ifdef CONFIG_SYSFS
	int error = 0;
	int i;
	for (i = 0; i < num; i++) {
		error = tx_queue_change_owner(dev, i, kuid, kgid);
		if (error)
			break;
	}
	return error;
#else
	return 0;
#endif /* CONFIG_SYSFS */
}
2014-07-23 16:09:10 -07:00
static int register_queue_kobjects(struct net_device *dev)
2010-11-21 13:17:27 +00:00
{
2010-11-26 08:36:09 +00:00
	int error = 0, txq = 0, rxq = 0, real_rx = 0, real_tx = 0;
2010-11-21 13:17:27 +00:00
2011-11-16 12:15:10 +00:00
#ifdef CONFIG_SYSFS
2014-07-23 16:09:10 -07:00
	dev->queues_kset = kset_create_and_add("queues",
2017-08-18 13:46:28 -07:00
					       NULL, &dev->dev.kobj);
2014-07-23 16:09:10 -07:00
	if (!dev->queues_kset)
2010-09-27 08:24:33 +00:00
		return -ENOMEM;
2014-07-23 16:09:10 -07:00
	real_rx = dev->real_num_rx_queues;
2010-11-26 08:36:09 +00:00
#endif
2014-07-23 16:09:10 -07:00
	real_tx = dev->real_num_tx_queues;
2010-11-21 13:17:27 +00:00
2014-07-23 16:09:10 -07:00
	error = net_rx_queue_update_kobjects(dev, 0, real_rx);
2010-11-21 13:17:27 +00:00
	if (error)
		goto error;
2010-11-26 08:36:09 +00:00
	rxq = real_rx;
2010-11-21 13:17:27 +00:00
2014-07-23 16:09:10 -07:00
	error = netdev_queue_update_kobjects(dev, 0, real_tx);
2010-11-21 13:17:27 +00:00
	if (error)
		goto error;
2010-11-26 08:36:09 +00:00
	txq = real_tx;
2010-11-21 13:17:27 +00:00
	return 0;
error:
2014-07-23 16:09:10 -07:00
	netdev_queue_update_kobjects(dev, txq, 0);
	net_rx_queue_update_kobjects(dev, rxq, 0);
2019-03-02 10:34:55 +08:00
#ifdef CONFIG_SYSFS
	kset_unregister(dev->queues_kset);
#endif
2010-11-21 13:17:27 +00:00
	return error;
2010-09-27 08:24:33 +00:00
}
2010-03-16 08:03:29 +00:00
2020-02-27 04:37:18 +01:00
static int queue_change_owner(struct net_device *ndev, kuid_t kuid, kgid_t kgid)
{
	int error = 0, real_rx = 0, real_tx = 0;
#ifdef CONFIG_SYSFS
	if (ndev->queues_kset) {
		error = sysfs_change_owner(&ndev->queues_kset->kobj, kuid, kgid);
		if (error)
			return error;
	}
	real_rx = ndev->real_num_rx_queues;
#endif
	real_tx = ndev->real_num_tx_queues;
	error = net_rx_queue_change_owner(ndev, real_rx, kuid, kgid);
	if (error)
		return error;
	error = net_tx_queue_change_owner(ndev, real_tx, kuid, kgid);
	if (error)
		return error;
	return 0;
}
2014-07-23 16:09:10 -07:00
static void remove_queue_kobjects(struct net_device *dev)
2010-09-27 08:24:33 +00:00
{
2010-11-26 08:36:09 +00:00
	int real_rx = 0, real_tx = 0;
2014-01-16 22:23:28 -08:00
#ifdef CONFIG_SYSFS
2014-07-23 16:09:10 -07:00
	real_rx = dev->real_num_rx_queues;
2010-11-26 08:36:09 +00:00
#endif
2014-07-23 16:09:10 -07:00
	real_tx = dev->real_num_tx_queues;
2010-11-26 08:36:09 +00:00
2014-07-23 16:09:10 -07:00
	net_rx_queue_update_kobjects(dev, real_rx, 0);
	netdev_queue_update_kobjects(dev, real_tx, 0);
2011-11-16 12:15:10 +00:00
#ifdef CONFIG_SYSFS
2014-07-23 16:09:10 -07:00
	kset_unregister(dev->queues_kset);
2010-11-26 08:36:09 +00:00
#endif
2010-03-16 08:03:29 +00:00
}
2010-05-04 17:36:45 -07:00
2013-03-25 20:07:01 -07:00
static bool net_current_may_mount(void)
{
	struct net *net = current->nsproxy->net_ns;
	return ns_capable(net->user_ns, CAP_SYS_ADMIN);
}
2011-06-08 21:13:01 -04:00
static void *net_grab_current_ns(void)
2010-05-04 17:36:45 -07:00
{
2011-06-08 21:13:01 -04:00
	struct net *ns = current->nsproxy->net_ns;
#ifdef CONFIG_NET_NS
	if (ns)
2017-06-30 13:08:08 +03:00
		refcount_inc(&ns->passive);
2011-06-08 21:13:01 -04:00
#endif
	return ns;
2010-05-04 17:36:45 -07:00
}
static const void *net_initial_ns(void)
{
	return &init_net;
}
static const void *net_netlink_ns(struct sock *sk)
{
	return sock_net(sk);
}
2017-08-18 13:46:22 -07:00
const struct kobj_ns_type_operations net_ns_type_operations = {
2010-05-04 17:36:45 -07:00
	.type = KOBJ_NS_TYPE_NET,
2013-03-25 20:07:01 -07:00
	.current_may_mount = net_current_may_mount,
2011-06-08 21:13:01 -04:00
	.grab_current_ns = net_grab_current_ns,
2010-05-04 17:36:45 -07:00
	.netlink_ns = net_netlink_ns,
	.initial_ns = net_initial_ns,
2011-06-08 21:13:01 -04:00
	.drop_ns = net_drop_ns,
2010-05-04 17:36:45 -07:00
};
2010-08-05 17:45:15 +02:00
EXPORT_SYMBOL_GPL(net_ns_type_operations);
2010-05-04 17:36:45 -07:00
2007-08-14 15:15:12 +02:00
static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
2005-04-16 15:20:36 -07:00
{
2002-04-09 12:14:34 -07:00
	struct net_device *dev = to_net_dev(d);
2007-08-14 15:15:12 +02:00
	int retval;
2005-04-16 15:20:36 -07:00
2005-11-16 09:00:00 +01:00
	/* pass interface to uevent. */
2007-08-14 15:15:12 +02:00
	retval = add_uevent_var(env, "INTERFACE=%s", dev->name);
2007-03-30 22:23:12 -07:00
	if (retval)
		goto exit;
2007-03-07 10:49:30 -08:00
	/* pass ifindex to uevent.
	 * ifindex is useful as it won't change (interface name may change)
2017-08-18 13:46:28 -07:00
	 * and is what RtNetlink uses natively.
	 */
2007-08-14 15:15:12 +02:00
	retval = add_uevent_var(env, "IFINDEX=%d", dev->ifindex);
2005-04-16 15:20:36 -07:00
2007-03-30 22:23:12 -07:00
exit:
	return retval;
2005-04-16 15:20:36 -07:00
}
/*
2007-02-09 23:24:36 +09:00
 *	netdev_release -- destroy and free a dead device.
2002-04-09 12:14:34 -07:00
 *	Called when last reference to device kobject is gone.
2005-04-16 15:20:36 -07:00
 */
2002-04-09 12:14:34 -07:00
static void netdev_release(struct device *d)
2005-04-16 15:20:36 -07:00
{
2002-04-09 12:14:34 -07:00
	struct net_device *dev = to_net_dev(d);
2005-04-16 15:20:36 -07:00
	BUG_ON(dev->reg_state != NETREG_RELEASED);
2017-10-02 23:50:05 +02:00
	/* no need to wait for rcu grace period:
	 * device is dead and about to be freed.
	 */
	kfree(rcu_access_pointer(dev->ifalias));
2013-10-30 13:10:44 -07:00
	netdev_freemem(dev);
2005-04-16 15:20:36 -07:00
}
2010-05-04 17:36:45 -07:00
static const void *net_namespace(struct device *d)
{
2015-12-22 23:11:49 +08:00
	struct net_device *dev = to_net_dev(d);
2010-05-04 17:36:45 -07:00
	return dev_net(dev);
}
2018-07-20 21:56:52 +00:00
static void net_get_ownership(struct device *d, kuid_t *uid, kgid_t *gid)
{
	struct net_device *dev = to_net_dev(d);
	const struct net *net = dev_net(dev);
	net_ns_get_ownership(net, uid, gid);
}
2017-08-18 13:46:21 -07:00
static struct class net_class __ro_after_init = {
2005-04-16 15:20:36 -07:00
	.name = "net",
2002-04-09 12:14:34 -07:00
	.dev_release = netdev_release,
2013-07-24 15:05:33 -07:00
	.dev_groups = net_class_groups,
2002-04-09 12:14:34 -07:00
	.dev_uevent = netdev_uevent,
2010-05-04 17:36:45 -07:00
	.ns_type = &net_ns_type_operations,
	.namespace = net_namespace,
2018-07-20 21:56:52 +00:00
	.get_ownership = net_get_ownership,
2005-04-16 15:20:36 -07:00
};
2021-10-06 18:06:54 -07:00
#ifdef CONFIG_OF
2015-03-09 14:31:20 -07:00
static int of_dev_node_match(struct device *dev, const void *data)
{
2020-05-15 11:52:52 +02:00
	for (; dev; dev = dev->parent) {
		if (dev->of_node == data)
			return 1;
	}
2015-03-09 14:31:20 -07:00
2020-05-15 11:52:52 +02:00
	return 0;
2015-03-09 14:31:20 -07:00
}
2015-09-24 20:36:33 +01:00
/*
 * of_find_net_device_by_node - look up the net device for the device node
 * @np: OF device node
 *
 * Looks up the net_device structure corresponding to the device node.
 * If successful, returns a pointer to the net_device with the embedded
 * struct device refcount incremented by one, or NULL on failure. The
 * refcount must be dropped when done with the net_device.
 */
2015-03-09 14:31:20 -07:00
struct net_device *of_find_net_device_by_node(struct device_node *np)
{
	struct device *dev;
	dev = class_find_device(&net_class, NULL, np, of_dev_node_match);
	if (!dev)
		return NULL;
	return to_net_dev(dev);
}
EXPORT_SYMBOL(of_find_net_device_by_node);
#endif
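/* Usage sketch for of_find_net_device_by_node() (illustrative only; the
 * calling code below is hypothetical and not part of this file). The lookup
 * returns with a reference held on the embedded struct device, so a caller
 * would be expected to drop it with put_device() once it is done:
 *
 *	struct net_device *ndev = of_find_net_device_by_node(np);
 *
 *	if (!ndev)
 *		return -ENODEV;	// assumption: the caller picks its own error code
 *	// ... use ndev while the reference is held ...
 *	put_device(&ndev->dev);
 */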
2007-05-19 15:39:25 -07:00
/* Delete sysfs entries but hold kobject reference until after all
 * netdev references are gone.
 */
2014-07-23 16:09:10 -07:00
void netdev_unregister_kobject(struct net_device *ndev)
2005-04-16 15:20:36 -07:00
{
2017-08-18 13:46:28 -07:00
	struct device *dev = &ndev->dev;
2007-05-19 15:39:25 -07:00
2020-08-19 14:06:36 +02:00
	if (!refcount_read(&dev_net(ndev)->ns.count))
2016-10-24 19:09:53 -07:00
		dev_set_uevent_suppress(dev, 1);
2007-05-19 15:39:25 -07:00
	kobject_get(&dev->kobj);
2008-10-27 17:51:47 -07:00
2014-07-23 16:09:10 -07:00
	remove_queue_kobjects(ndev);
2010-03-16 08:03:29 +00:00
2013-02-22 16:34:16 -08:00
	pm_runtime_set_memalloc_noio(dev, false);
2007-05-19 15:39:25 -07:00
	device_del(dev);
2005-04-16 15:20:36 -07:00
}
/* Create sysfs entries for network device. */
2014-07-23 16:09:10 -07:00
int netdev_register_kobject(struct net_device *ndev)
2005-04-16 15:20:36 -07:00
{
2017-08-18 13:46:28 -07:00
	struct device *dev = &ndev->dev;
2014-07-23 16:09:10 -07:00
	const struct attribute_group **groups = ndev->sysfs_groups;
2010-03-16 08:03:29 +00:00
	int error = 0;
2005-04-16 15:20:36 -07:00
2010-05-04 17:36:49 -07:00
	device_initialize(dev);
2002-04-09 12:14:34 -07:00
	dev->class = &net_class;
2014-07-23 16:09:10 -07:00
	dev->platform_data = ndev;
2002-04-09 12:14:34 -07:00
	dev->groups = groups;
2005-04-16 15:20:36 -07:00
2014-07-23 16:09:10 -07:00
	dev_set_name(dev, "%s", ndev->name);
2005-04-16 15:20:36 -07:00
2007-09-26 22:02:53 -07:00
#ifdef CONFIG_SYSFS
2009-10-29 14:18:21 +00:00
	/* Allow for a device specific group */
	if (*groups)
		groups++;
2005-04-16 15:20:36 -07:00
2009-10-29 14:18:21 +00:00
	*groups++ = &netstat_group;
2012-11-16 20:46:19 +01:00
#if IS_ENABLED(CONFIG_WIRELESS_EXT) || IS_ENABLED(CONFIG_CFG80211)
2014-07-23 16:09:10 -07:00
	if (ndev->ieee80211_ptr)
2012-11-16 20:46:19 +01:00
		*groups++ = &wireless_group;
#if IS_ENABLED(CONFIG_WIRELESS_EXT)
2014-07-23 16:09:10 -07:00
	else if (ndev->wireless_handlers)
2012-11-16 20:46:19 +01:00
		*groups++ = &wireless_group;
#endif
#endif
2007-09-26 22:02:53 -07:00
#endif /* CONFIG_SYSFS */
2005-04-16 15:20:36 -07:00
2010-03-16 08:03:29 +00:00
	error = device_add(dev);
	if (error)
2019-04-12 16:36:33 -04:00
		return error;
2010-03-16 08:03:29 +00:00
2014-07-23 16:09:10 -07:00
	error = register_queue_kobjects(ndev);
2019-04-12 16:36:33 -04:00
	if (error) {
		device_del(dev);
		return error;
	}
2010-03-16 08:03:29 +00:00
2013-02-22 16:34:16 -08:00
	pm_runtime_set_memalloc_noio(dev, true);
2010-03-16 08:03:29 +00:00
	return error;
2005-04-16 15:20:36 -07:00
}
2020-02-27 04:37:17 +01:00
/* Change owner for sysfs entries when moving network devices across network
 * namespaces owned by different user namespaces.
 */
int netdev_change_owner(struct net_device *ndev, const struct net *net_old,
			const struct net *net_new)
{
2021-10-25 02:31:48 -04:00
	kuid_t old_uid = GLOBAL_ROOT_UID, new_uid = GLOBAL_ROOT_UID;
	kgid_t old_gid = GLOBAL_ROOT_GID, new_gid = GLOBAL_ROOT_GID;
2020-02-27 04:37:17 +01:00
	struct device *dev = &ndev->dev;
	int error;
	net_ns_get_ownership(net_old, &old_uid, &old_gid);
	net_ns_get_ownership(net_new, &new_uid, &new_gid);
	/* The network namespace was changed but the owning user namespace is
	 * identical so there's no need to change the owner of sysfs entries.
	 */
	if (uid_eq(old_uid, new_uid) && gid_eq(old_gid, new_gid))
		return 0;
	error = device_change_owner(dev, new_uid, new_gid);
	if (error)
		return error;
2020-02-27 04:37:18 +01:00
	error = queue_change_owner(ndev, new_uid, new_gid);
	if (error)
		return error;
2020-02-27 04:37:17 +01:00
	return 0;
}
2017-08-18 13:46:20 -07:00
int netdev_class_create_file_ns(const struct class_attribute *class_attr,
2013-09-11 22:29:04 -04:00
				const void *ns)
2008-06-13 18:12:04 -07:00
{
2013-09-11 22:29:04 -04:00
	return class_create_file_ns(&net_class, class_attr, ns);
2008-06-13 18:12:04 -07:00
}
2013-09-11 22:29:04 -04:00
EXPORT_SYMBOL(netdev_class_create_file_ns);
2008-06-13 18:12:04 -07:00
2017-08-18 13:46:20 -07:00
void netdev_class_remove_file_ns(const struct class_attribute *class_attr,
2013-09-11 22:29:04 -04:00
				 const void *ns)
2008-06-13 18:12:04 -07:00
{
2013-09-11 22:29:04 -04:00
	class_remove_file_ns(&net_class, class_attr, ns);
2008-06-13 18:12:04 -07:00
}
2013-09-11 22:29:04 -04:00
EXPORT_SYMBOL(netdev_class_remove_file_ns);
2008-06-13 18:12:04 -07:00
2014-01-06 01:20:11 +01:00
int __init netdev_kobject_init(void)
2005-04-16 15:20:36 -07:00
{
2010-05-04 17:36:45 -07:00
	kobj_ns_type_register(&net_ns_type_operations);
2005-04-16 15:20:36 -07:00
	return class_register(&net_class);
}