2005-04-17 02:20:36 +04:00
/*
 * net-sysfs.c - network device class and attributes
 *
 * Copyright (c) 2003 Stephen Hemminger <shemminger@osdl.org>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */

#include <linux/capability.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/if_arp.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
#include <linux/slab.h>
#include <linux/nsproxy.h>
#include <net/sock.h>
#include <net/net_namespace.h>
#include <linux/rtnetlink.h>
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it would potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in-order delivery.
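The switching rule above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel's get_rps_cpu(): the struct names backlog_sketch and dev_flow_sketch, the helpers backlog_tail(), pick_cpu() and enqueue(), and the numeric value chosen for RPS_NO_CPU are all hypothetical. It shows the tail counter being computed as head + length at enqueue time, and the current CPU changing only when it is unset, offline, or has already dequeued everything queued through the entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffff	/* "no CPU set" sentinel; the value is an assumption */

/* Hypothetical per-CPU backlog state: head counts dequeues, len is the queue length. */
struct backlog_sketch {
	uint32_t head;
	uint32_t len;
	bool online;
};

/* Hypothetical rps_dev_flow entry: current CPU plus the tail counter saved at last enqueue. */
struct dev_flow_sketch {
	uint16_t cpu;
	uint32_t last_qtail;
};

/* Queue tail counter = queue head counter + queue length. */
static uint32_t backlog_tail(const struct backlog_sketch *q)
{
	return q->head + q->len;
}

/* Return the CPU to steer to; switch to the desired CPU only when no reordering can result. */
static uint16_t pick_cpu(struct dev_flow_sketch *flow, uint16_t desired,
			 const struct backlog_sketch *backlogs)
{
	uint16_t cur = flow->cpu;

	if (desired != cur &&
	    (cur == RPS_NO_CPU ||
	     !backlogs[cur].online ||
	     /* signed difference keeps the comparison valid across counter wraparound */
	     (int32_t)(backlogs[cur].head - flow->last_qtail) >= 0))
		flow->cpu = desired;

	return flow->cpu;
}

/* Enqueue one packet on the flow's current CPU and save the resulting tail counter. */
static void enqueue(struct dev_flow_sketch *flow, struct backlog_sketch *backlogs)
{
	struct backlog_sketch *q = &backlogs[flow->cpu];

	q->len++;
	flow->last_qtail = backlog_tail(q);
}
```

With one packet still pending on the current CPU, pick_cpu() keeps returning that CPU; once the packet is dequeued (head incremented, len decremented), the next call moves the flow to the desired CPU.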
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality; 2)
this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
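A small sketch of why a power-of-two table size is convenient, assuming the rounding is upward and that a slot is chosen by masking the hash (neither detail is spelled out in the message). The helper names round_up_pow2() and flow_slot() are hypothetical, not kernel functions.

```c
#include <stddef.h>

/*
 * Round n up to the next power of two (0 and 1 both yield 1).
 * Assumes n <= SIZE_MAX / 2 + 1 so the shift cannot overflow.
 */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* With a power-of-two size, a hash maps to a slot with a mask instead of a modulo. */
static size_t flow_slot(unsigned int rxhash, size_t table_size)
{
	return rxhash & (table_size - 1);
}
```

Under these assumptions a requested size of 1000 becomes 1024 entries, and every hash lands in the range 0..1023.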
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS              104K tps at 30% CPU
  No RFS (best RPS config):  290K tps at 63% CPU
  RFS                        303K tps at 61% CPU

RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only:   174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 03:01:27 +04:00
#include <linux/vmalloc.h>
#include <linux/export.h>
#include <linux/jiffies.h>
#include <linux/pm_runtime.h>

#include "net-sysfs.h"
#ifdef CONFIG_SYSFS
static const char fmt_hex[] = "%#x\n";
static const char fmt_long_hex[] = "%#lx\n";
static const char fmt_dec[] = "%d\n";
static const char fmt_udec[] = "%u\n";
static const char fmt_ulong[] = "%lu\n";
static const char fmt_u64[] = "%llu\n";

static inline int dev_isalive(const struct net_device *dev)
{
	return dev->reg_state <= NETREG_REGISTERED;
}

/* use same locking rules as GIF* ioctl's */
static ssize_t netdev_show(const struct device *dev,
			   struct device_attribute *attr, char *buf,
			   ssize_t (*format)(const struct net_device *, char *))
{
	struct net_device *net = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	read_lock(&dev_base_lock);
	if (dev_isalive(net))
		ret = (*format)(net, buf);
	read_unlock(&dev_base_lock);

	return ret;
}
/* generate a show function for simple field */
#define NETDEVICE_SHOW(field, format_string) \
static ssize_t format_##field(const struct net_device *net, char *buf) \
{ \
	return sprintf(buf, format_string, net->field); \
} \
static ssize_t field##_show(struct device *dev, \
			    struct device_attribute *attr, char *buf) \
{ \
	return netdev_show(dev, attr, buf, format_##field); \
}

#define NETDEVICE_SHOW_RO(field, format_string) \
NETDEVICE_SHOW(field, format_string); \
static DEVICE_ATTR_RO(field)

#define NETDEVICE_SHOW_RW(field, format_string) \
NETDEVICE_SHOW(field, format_string); \
static DEVICE_ATTR_RW(field)
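The token-pasting pattern used by these macros can be tried outside the kernel. The sketch below is a userspace analogue only: struct fake_netdev and the FAKE_SHOW macro are hypothetical stand-ins, and snprintf replaces the kernel's sprintf into a page-sized sysfs buffer. It shows how one macro invocation per field generates a dedicated <field>_show function.

```c
#include <stdio.h>

/* Hypothetical stand-in for struct net_device with two simple fields. */
struct fake_netdev {
	int mtu;
	unsigned int flags;
};

/* Userspace analogue of NETDEVICE_SHOW: ## pastes the field name into the function name. */
#define FAKE_SHOW(field, format_string) \
static int field##_show(const struct fake_netdev *net, char *buf, size_t len) \
{ \
	return snprintf(buf, len, format_string, net->field); \
}

/* Each line expands to a complete function: mtu_show() and flags_show(). */
FAKE_SHOW(mtu, "%d\n")
FAKE_SHOW(flags, "%#x\n")
```

Calling mtu_show() on a device with mtu set to 1500 fills the buffer with the decimal value and a newline, which is the same text shape a sysfs read would return.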
/* use same locking and permission rules as SIF* ioctl's */
static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t len,
			    int (*set)(struct net_device *, unsigned long))
{
	struct net_device *netdev = to_net_dev(dev);
	struct net *net = dev_net(netdev);
	unsigned long new;
	int ret = -EINVAL;

	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	ret = kstrtoul(buf, 0, &new);
	if (ret)
		goto err;

	if (!rtnl_trylock())
		return restart_syscall();

	if (dev_isalive(netdev)) {
		if ((ret = (*set)(netdev, new)) == 0)
			ret = len;
	}
	rtnl_unlock();
 err:
	return ret;
}
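The parse-then-callback shape of netdev_store() can be illustrated in userspace. The sketch below is an approximation under stated assumptions: parse_ulong(), store_sketch() and set_int() are hypothetical names, strtoul() stands in for kstrtoul() (unlike kstrtoul(), strtoul() also accepts leading whitespace and a sign), and the capability check and rtnl locking are left out entirely.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse an unsigned long with base auto-detection, tolerating one trailing newline. */
static int parse_ulong(const char *s, unsigned long *out)
{
	char *end;

	errno = 0;
	*out = strtoul(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	return 0;
}

/* Parse the buffer, hand the value to the setter, and report len on success. */
static long store_sketch(const char *buf, size_t len, void *target,
			 int (*set)(void *, unsigned long))
{
	unsigned long val;
	int ret = parse_ulong(buf, &val);

	if (ret)
		return ret;
	ret = set(target, val);
	return ret ? ret : (long)len;
}

/* Example setter: stores into an int and rejects values above 65535. */
static int set_int(void *target, unsigned long val)
{
	if (val > 65535)
		return -EINVAL;
	*(int *)target = (int)val;
	return 0;
}
```

As in the kernel function, a successful write returns the full input length so the caller sees the whole buffer as consumed, while a parse or setter failure returns a negative errno value and leaves the target unchanged.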
NETDEVICE_SHOW_RO(dev_id, fmt_hex);
NETDEVICE_SHOW_RO(addr_assign_type, fmt_dec);
NETDEVICE_SHOW_RO(addr_len, fmt_dec);
NETDEVICE_SHOW_RO(iflink, fmt_dec);
NETDEVICE_SHOW_RO(ifindex, fmt_dec);
NETDEVICE_SHOW_RO(type, fmt_dec);
NETDEVICE_SHOW_RO(link_mode, fmt_dec);

/* use same locking rules as GIFHWADDR ioctl's */
static ssize_t address_show(struct device *dev, struct device_attribute *attr,
			    char *buf)
{
	struct net_device *net = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	read_lock(&dev_base_lock);
	if (dev_isalive(net))
		ret = sysfs_format_mac(buf, net->dev_addr, net->addr_len);
	read_unlock(&dev_base_lock);
	return ret;
}
static DEVICE_ATTR_RO(address);

static ssize_t broadcast_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct net_device *net = to_net_dev(dev);

	if (dev_isalive(net))
		return sysfs_format_mac(buf, net->broadcast, net->addr_len);
	return -EINVAL;
}
static DEVICE_ATTR_RO(broadcast);
static int change_carrier(struct net_device *net, unsigned long new_carrier)
{
	if (!netif_running(net))
		return -EINVAL;
	return dev_change_carrier(net, (bool) new_carrier);
}

static ssize_t carrier_store(struct device *dev, struct device_attribute *attr,
			     const char *buf, size_t len)
{
	return netdev_store(dev, attr, buf, len, change_carrier);
}

static ssize_t carrier_show(struct device *dev,
			    struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);

	if (netif_running(netdev)) {
		return sprintf(buf, fmt_dec, !!netif_carrier_ok(netdev));
	}
	return -EINVAL;
}
static DEVICE_ATTR_RW(carrier);
static ssize_t speed_show(struct device *dev,
			  struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);
	int ret = -EINVAL;

	if (!rtnl_trylock())
		return restart_syscall();

	if (netif_running(netdev)) {
		struct ethtool_cmd cmd;

		if (!__ethtool_get_settings(netdev, &cmd))
			ret = sprintf(buf, fmt_udec, ethtool_cmd_speed(&cmd));
	}
	rtnl_unlock();
	return ret;
}
static DEVICE_ATTR_RO(speed);

static ssize_t duplex_show(struct device *dev,
			   struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);
	int ret = -EINVAL;

	if (!rtnl_trylock())
		return restart_syscall();

	if (netif_running(netdev)) {
		struct ethtool_cmd cmd;

		if (!__ethtool_get_settings(netdev, &cmd)) {
			const char *duplex;

			switch (cmd.duplex) {
			case DUPLEX_HALF:
				duplex = "half";
				break;
			case DUPLEX_FULL:
				duplex = "full";
				break;
			default:
				duplex = "unknown";
				break;
			}
			ret = sprintf(buf, "%s\n", duplex);
		}
	}
	rtnl_unlock();
	return ret;
}
static DEVICE_ATTR_RO(duplex);
static ssize_t dormant_show(struct device *dev,
			    struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);

	if (netif_running(netdev))
		return sprintf(buf, fmt_dec, !!netif_dormant(netdev));

	return -EINVAL;
}
static DEVICE_ATTR_RO(dormant);

static const char *const operstates[] = {
	"unknown",
	"notpresent", /* currently unused */
	"down",
	"lowerlayerdown",
	"testing", /* currently unused */
	"dormant",
	"up"
};

static ssize_t operstate_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	const struct net_device *netdev = to_net_dev(dev);
	unsigned char operstate;

	read_lock(&dev_base_lock);
	operstate = netdev->operstate;
	if (!netif_running(netdev))
		operstate = IF_OPER_DOWN;
	read_unlock(&dev_base_lock);

	if (operstate >= ARRAY_SIZE(operstates))
		return -EINVAL; /* should not happen */

	return sprintf(buf, "%s\n", operstates[operstate]);
}
static DEVICE_ATTR_RO(operstate);
/* read-write attributes */

static int change_mtu(struct net_device *net, unsigned long new_mtu)
{
	return dev_set_mtu(net, (int) new_mtu);
}

static ssize_t mtu_store(struct device *dev, struct device_attribute *attr,
			 const char *buf, size_t len)
{
	return netdev_store(dev, attr, buf, len, change_mtu);
}
NETDEVICE_SHOW_RW(mtu, fmt_dec);

static int change_flags(struct net_device *net, unsigned long new_flags)
{
	return dev_change_flags(net, (unsigned int) new_flags);
}

static ssize_t flags_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t len)
{
	return netdev_store(dev, attr, buf, len, change_flags);
}
NETDEVICE_SHOW_RW(flags, fmt_hex);

static int change_tx_queue_len(struct net_device *net, unsigned long new_len)
{
	net->tx_queue_len = new_len;
	return 0;
}

static ssize_t tx_queue_len_store(struct device *dev,
				  struct device_attribute *attr,
				  const char *buf, size_t len)
{
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
}
NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong);
static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
			     const char *buf, size_t len)
{
	struct net_device *netdev = to_net_dev(dev);
	struct net *net = dev_net(netdev);
	size_t count = len;
	ssize_t ret;

	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	/* ignore trailing newline */
	if (len > 0 && buf[len - 1] == '\n')
		--count;

	if (!rtnl_trylock())
		return restart_syscall();
	ret = dev_set_alias(netdev, buf, count);
	rtnl_unlock();

	return ret < 0 ? ret : len;
}

static ssize_t ifalias_show(struct device *dev,
			    struct device_attribute *attr, char *buf)
{
	const struct net_device *netdev = to_net_dev(dev);
	ssize_t ret = 0;

	if (!rtnl_trylock())
		return restart_syscall();
	if (netdev->ifalias)
		ret = sprintf(buf, "%s\n", netdev->ifalias);
	rtnl_unlock();
	return ret;
}
static DEVICE_ATTR_RW(ifalias);

static int change_group(struct net_device *net, unsigned long new_group)
{
	dev_set_group(net, (int) new_group);
	return 0;
}

static ssize_t group_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t len)
{
	return netdev_store(dev, attr, buf, len, change_group);
}
NETDEVICE_SHOW(group, fmt_dec);
static DEVICE_ATTR(netdev_group, S_IRUGO | S_IWUSR, group_show, group_store);
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
"Noteworthy changes this time around:
1) Multicast rejoin support for team driver, from Jiri Pirko.
2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the latter because there are broken middleware devices which
scramble the timestamp.
From Yuchung Cheng.
3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsent user data. From Eric Dumazet.
4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.
5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.
6) Add a networking development FAQ, from Paul Gortmaker.
7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.
8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.
9) Add SCTP support to openvswitch, from Joe Stringer.
10) Add EF10 chip support to SFC driver, from Ben Hutchings.
11) Add new SYNPROXY netfilter target, from Patrick McHardy.
12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.
13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.
14) Add ipv6 support to vxlan driver, from Cong Wang"
Resolved conflicts as per discussion.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...
2013-09-06 01:54:29 +04:00
static ssize_t phys_port_id_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);
	ssize_t ret = -EINVAL;

	if (!rtnl_trylock())
		return restart_syscall();

	if (dev_isalive(netdev)) {
		struct netdev_phys_port_id ppid;

		ret = dev_get_phys_port_id(netdev, &ppid);
		if (!ret)
			ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
	}
	rtnl_unlock();

	return ret;
}
static DEVICE_ATTR_RO(phys_port_id);
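The kernel-only "%*phN" specifier used by phys_port_id_show prints a small byte buffer as hex digits with no separator. Standard C printf has no such specifier, so the sketch below shows a userspace approximation; format_hex_id() is a hypothetical helper written for illustration, not a kernel or libc function.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write id[0..id_len) as lowercase hex with no separators, followed by a
 * newline. Returns the number of characters written (excluding the
 * terminating NUL), or -1 if buf is too small.
 */
static int format_hex_id(char *buf, size_t buf_len,
			 const unsigned char *id, size_t id_len)
{
	size_t i;

	if (buf_len < 2 * id_len + 2)
		return -1;
	for (i = 0; i < id_len; i++)
		sprintf(buf + 2 * i, "%02x", (unsigned int)id[i]);
	buf[2 * id_len] = '\n';
	buf[2 * id_len + 1] = '\0';
	return (int)(2 * id_len + 1);
}
```

A four-byte ID therefore reads back as eight hex characters plus a newline, which is the shape of text a read of the phys_port_id attribute is expected to produce.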
static struct attribute *net_class_attrs[] = {
	&dev_attr_netdev_group.attr,
	&dev_attr_type.attr,
	&dev_attr_dev_id.attr,
	&dev_attr_iflink.attr,
	&dev_attr_ifindex.attr,
	&dev_attr_addr_assign_type.attr,
	&dev_attr_addr_len.attr,
	&dev_attr_link_mode.attr,
	&dev_attr_address.attr,
	&dev_attr_broadcast.attr,
	&dev_attr_speed.attr,
	&dev_attr_duplex.attr,
	&dev_attr_dormant.attr,
	&dev_attr_operstate.attr,
	&dev_attr_ifalias.attr,
	&dev_attr_carrier.attr,
	&dev_attr_mtu.attr,
	&dev_attr_flags.attr,
	&dev_attr_tx_queue_len.attr,
	&dev_attr_phys_port_id.attr,
	NULL,
};
ATTRIBUTE_GROUPS(net_class);
2005-04-17 02:20:36 +04:00
/* Show a given an attribute in the statistics group */
2002-04-09 23:14:34 +04:00
static ssize_t netstat_show ( const struct device * d ,
struct device_attribute * attr , char * buf ,
2005-04-17 02:20:36 +04:00
unsigned long offset )
{
2002-04-09 23:14:34 +04:00
struct net_device * dev = to_net_dev ( d ) ;
2005-04-17 02:20:36 +04:00
ssize_t ret = - EINVAL ;
2010-06-08 11:19:54 +04:00
	WARN_ON(offset > sizeof(struct rtnl_link_stats64) ||
		offset % sizeof(u64) != 0);
2005-04-17 02:20:36 +04:00
read_lock ( & dev_base_lock ) ;
2008-05-22 01:12:46 +04:00
if ( dev_isalive ( dev ) ) {
2010-07-08 01:58:56 +04:00
struct rtnl_link_stats64 temp ;
const struct rtnl_link_stats64 * stats = dev_get_stats ( dev , & temp ) ;
2010-06-08 11:19:54 +04:00
		ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *)stats) + offset));
2008-05-22 01:12:46 +04:00
}
2005-04-17 02:20:36 +04:00
read_unlock ( & dev_base_lock ) ;
return ret ;
}
/* generate a read-only statistics attribute */
# define NETSTAT_ENTRY(name) \
2013-07-25 02:05:33 +04:00
static ssize_t name##_show(struct device *d, \
2002-04-09 23:14:34 +04:00
struct device_attribute * attr , char * buf ) \
2005-04-17 02:20:36 +04:00
{ \
2002-04-09 23:14:34 +04:00
return netstat_show ( d , attr , buf , \
2010-06-08 11:19:54 +04:00
offsetof ( struct rtnl_link_stats64 , name ) ) ; \
2005-04-17 02:20:36 +04:00
} \
2013-07-25 02:05:33 +04:00
static DEVICE_ATTR_RO ( name )
2005-04-17 02:20:36 +04:00
NETSTAT_ENTRY ( rx_packets ) ;
NETSTAT_ENTRY ( tx_packets ) ;
NETSTAT_ENTRY ( rx_bytes ) ;
NETSTAT_ENTRY ( tx_bytes ) ;
NETSTAT_ENTRY ( rx_errors ) ;
NETSTAT_ENTRY ( tx_errors ) ;
NETSTAT_ENTRY ( rx_dropped ) ;
NETSTAT_ENTRY ( tx_dropped ) ;
NETSTAT_ENTRY ( multicast ) ;
NETSTAT_ENTRY ( collisions ) ;
NETSTAT_ENTRY ( rx_length_errors ) ;
NETSTAT_ENTRY ( rx_over_errors ) ;
NETSTAT_ENTRY ( rx_crc_errors ) ;
NETSTAT_ENTRY ( rx_frame_errors ) ;
NETSTAT_ENTRY ( rx_fifo_errors ) ;
NETSTAT_ENTRY ( rx_missed_errors ) ;
NETSTAT_ENTRY ( tx_aborted_errors ) ;
NETSTAT_ENTRY ( tx_carrier_errors ) ;
NETSTAT_ENTRY ( tx_fifo_errors ) ;
NETSTAT_ENTRY ( tx_heartbeat_errors ) ;
NETSTAT_ENTRY ( tx_window_errors ) ;
NETSTAT_ENTRY ( rx_compressed ) ;
NETSTAT_ENTRY ( tx_compressed ) ;
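The offset-based lookup used by netstat_show and NETSTAT_ENTRY can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct sketch_stats, sketch_read_stat and SKETCH_STAT_ENTRY are hypothetical names invented for the illustration, and the structure holds only three of the counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rtnl_link_stats64: every field is 64-bit. */
struct sketch_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t rx_bytes;
};

/* Read the counter that lives `offset` bytes into the structure. */
static uint64_t sketch_read_stat(const struct sketch_stats *stats,
				 size_t offset)
{
	return *(const uint64_t *)((const uint8_t *)stats + offset);
}

/* Generate one accessor per field; offsetof() supplies the byte offset. */
#define SKETCH_STAT_ENTRY(name)						\
static uint64_t sketch_##name(const struct sketch_stats *stats)		\
{									\
	return sketch_read_stat(stats,					\
				offsetof(struct sketch_stats, name));	\
}

SKETCH_STAT_ENTRY(rx_packets)
SKETCH_STAT_ENTRY(tx_packets)
SKETCH_STAT_ENTRY(rx_bytes)
```

Because every field has the same type, one generic reader plus a per-field offset is enough, which is presumably why the original checks that the offset is a multiple of sizeof(u64).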
static struct attribute * netstat_attrs [ ] = {
2002-04-09 23:14:34 +04:00
& dev_attr_rx_packets . attr ,
& dev_attr_tx_packets . attr ,
& dev_attr_rx_bytes . attr ,
& dev_attr_tx_bytes . attr ,
& dev_attr_rx_errors . attr ,
& dev_attr_tx_errors . attr ,
& dev_attr_rx_dropped . attr ,
& dev_attr_tx_dropped . attr ,
& dev_attr_multicast . attr ,
& dev_attr_collisions . attr ,
& dev_attr_rx_length_errors . attr ,
& dev_attr_rx_over_errors . attr ,
& dev_attr_rx_crc_errors . attr ,
& dev_attr_rx_frame_errors . attr ,
& dev_attr_rx_fifo_errors . attr ,
& dev_attr_rx_missed_errors . attr ,
& dev_attr_tx_aborted_errors . attr ,
& dev_attr_tx_carrier_errors . attr ,
& dev_attr_tx_fifo_errors . attr ,
& dev_attr_tx_heartbeat_errors . attr ,
& dev_attr_tx_window_errors . attr ,
& dev_attr_rx_compressed . attr ,
& dev_attr_tx_compressed . attr ,
2005-04-17 02:20:36 +04:00
NULL
} ;
static struct attribute_group netstat_group = {
	.name = "statistics",
. attrs = netstat_attrs ,
} ;
2012-11-16 23:46:19 +04:00
# if IS_ENABLED(CONFIG_WIRELESS_EXT) || IS_ENABLED(CONFIG_CFG80211)
static struct attribute * wireless_attrs [ ] = {
NULL
} ;
static struct attribute_group wireless_group = {
	.name = "wireless",
. attrs = wireless_attrs ,
} ;
# endif
2013-07-25 02:05:33 +04:00
# else /* CONFIG_SYSFS */
# define net_class_groups NULL
2010-05-17 08:59:45 +04:00
# endif /* CONFIG_SYSFS */
2005-04-17 02:20:36 +04:00
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_SYSFS
2010-03-16 11:03:29 +03:00
#define to_rx_queue_attr(_attr) container_of(_attr, \
		struct rx_queue_attribute, attr)
#define to_rx_queue(obj) container_of(obj, struct netdev_rx_queue, kobj)

static ssize_t rx_queue_attr_show(struct kobject *kobj, struct attribute *attr,
				  char *buf)
{
	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
	struct netdev_rx_queue *queue = to_rx_queue(kobj);

	if (!attribute->show)
		return -EIO;

	return attribute->show(queue, attribute, buf);
}

static ssize_t rx_queue_attr_store(struct kobject *kobj, struct attribute *attr,
				   const char *buf, size_t count)
{
	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
	struct netdev_rx_queue *queue = to_rx_queue(kobj);

	if (!attribute->store)
		return -EIO;

	return attribute->store(queue, attribute, buf, count);
}
2010-08-31 16:14:13 +04:00
static const struct sysfs_ops rx_queue_sysfs_ops = {
2010-03-16 11:03:29 +03:00
. show = rx_queue_attr_show ,
. store = rx_queue_attr_store ,
} ;
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_RPS
2010-03-16 11:03:29 +03:00
static ssize_t show_rps_map(struct netdev_rx_queue *queue,
			    struct rx_queue_attribute *attribute, char *buf)
{
	struct rps_map *map;
	cpumask_var_t mask;
	size_t len = 0;
	int i;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	rcu_read_lock();
	map = rcu_dereference(queue->rps_map);
	if (map)
		for (i = 0; i < map->len; i++)
			cpumask_set_cpu(map->cpus[i], mask);

	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
	if (PAGE_SIZE - len < 3) {
		rcu_read_unlock();
		free_cpumask_var(mask);
		return -EINVAL;
	}
	rcu_read_unlock();

	free_cpumask_var(mask);
	len += sprintf(buf + len, "\n");
	return len;
}
2010-04-20 01:40:57 +04:00
static ssize_t store_rps_map ( struct netdev_rx_queue * queue ,
2010-03-16 11:03:29 +03:00
struct rx_queue_attribute * attribute ,
const char * buf , size_t len )
{
struct rps_map * old_map , * map ;
cpumask_var_t mask ;
int err , cpu , i ;
static DEFINE_SPINLOCK ( rps_map_lock ) ;
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
	if (err) {
		free_cpumask_var(mask);
		return err;
	}
2012-04-15 09:58:06 +04:00
map = kzalloc ( max_t ( unsigned int ,
2010-03-16 11:03:29 +03:00
RPS_MAP_SIZE ( cpumask_weight ( mask ) ) , L1_CACHE_BYTES ) ,
GFP_KERNEL ) ;
	if (!map) {
		free_cpumask_var(mask);
		return -ENOMEM;
	}

	i = 0;
	for_each_cpu_and(cpu, mask, cpu_online_mask)
		map->cpus[i++] = cpu;

	if (i)
		map->len = i;
	else {
		kfree(map);
		map = NULL;
	}

	spin_lock(&rps_map_lock);
2010-10-25 07:02:02 +04:00
old_map = rcu_dereference_protected ( queue - > rps_map ,
lockdep_is_held ( & rps_map_lock ) ) ;
2010-03-16 11:03:29 +03:00
rcu_assign_pointer ( queue - > rps_map , map ) ;
spin_unlock ( & rps_map_lock ) ;
2011-11-17 07:13:26 +04:00
if ( map )
2012-02-24 11:31:31 +04:00
static_key_slow_inc ( & rps_needed ) ;
2011-11-17 07:13:26 +04:00
if ( old_map ) {
2011-03-18 07:01:31 +03:00
kfree_rcu ( old_map , rcu ) ;
2012-02-24 11:31:31 +04:00
static_key_slow_dec ( & rps_needed ) ;
2011-11-17 07:13:26 +04:00
}
2010-03-16 11:03:29 +03:00
free_cpumask_var ( mask ) ;
return len ;
}
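The map-building step of store_rps_map (intersect the requested mask with the online CPUs, then list the survivors) can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel implementation: struct sketch_map and sketch_build_map are hypothetical names, CPU masks are reduced to a single 64-bit word, and the RCU pointer swap is left out entirely.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rps_map: a length plus a CPU array. */
struct sketch_map {
	unsigned int len;
	uint16_t cpus[];
};

/*
 * Build a map from the CPUs set in both `requested` and `online`
 * (bit n stands for CPU n). Returns NULL when the intersection is empty
 * or the allocation fails; the caller owns the result and must free() it.
 */
static struct sketch_map *sketch_build_map(uint64_t requested, uint64_t online)
{
	uint64_t usable = requested & online;
	unsigned int weight = 0, cpu, i = 0;
	struct sketch_map *map;

	/* Count the usable CPUs so the array can be sized exactly. */
	for (cpu = 0; cpu < 64; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			weight++;
	if (!weight)
		return NULL;

	map = calloc(1, sizeof(*map) + weight * sizeof(map->cpus[0]));
	if (!map)
		return NULL;

	/* Record each usable CPU number in ascending order. */
	for (cpu = 0; cpu < 64; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			map->cpus[i++] = (uint16_t)cpu;
	map->len = i;
	return map;
}
```

An empty result is reported as NULL, mirroring how the original frees the map and stores a NULL pointer when no requested CPU is online. Unlike the original, which sizes the allocation from the requested mask, the sketch sizes it from the intersection.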
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
The complication of this simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
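The switch rule in the list above can be sketched in user-space C. This is a minimal illustration, not the kernel's get_rps_cpu: every sketch_ name, the SKETCH_NO_CPU value and the struct layouts are assumptions made for the example, and the case of an unset desired CPU is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_CPU 0xffffu	/* hypothetical stand-in for RPS_NO_CPU */

/* Hypothetical per-CPU backlog: head counts dequeues, len is the depth. */
struct sketch_backlog {
	uint32_t head;
	uint32_t len;
	bool online;
};

/* Hypothetical rps_dev_flow_table entry: current CPU plus saved tail. */
struct sketch_dev_flow {
	uint16_t cpu;
	uint32_t last_qtail;
};

/* Tail counter = head counter + queue length. */
static uint32_t sketch_qtail(const struct sketch_backlog *q)
{
	return q->head + q->len;
}

/*
 * Return the CPU to steer to. The flow moves to the desired CPU only when
 * no packet queued through this entry can still be pending on the old CPU.
 */
static uint16_t sketch_select_cpu(struct sketch_dev_flow *flow,
				  uint16_t desired,
				  const struct sketch_backlog *backlogs)
{
	uint16_t cur = flow->cpu;

	if (desired != cur &&
	    (cur == SKETCH_NO_CPU ||
	     !backlogs[cur].online ||
	     /* signed difference keeps the test valid across wraparound */
	     (int32_t)(backlogs[cur].head - flow->last_qtail) >= 0))
		flow->cpu = desired;

	return flow->cpu;
}

/* Enqueue one packet and save the resulting tail counter in the entry. */
static void sketch_enqueue(struct sketch_dev_flow *flow,
			   struct sketch_backlog *backlogs)
{
	struct sketch_backlog *q = &backlogs[flow->cpu];

	q->len++;
	flow->last_qtail = sketch_qtail(q);
}
```

With one packet still queued on CPU 0, a request to move the flow to CPU 1 is refused; once that packet is dequeued (head incremented, len decremented), the same request succeeds.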
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
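Rounding both table sizes to a power of two lets a flow's slot be found with a bit mask instead of a modulo, and the entry count can be recovered as mask + 1. A minimal sketch, assuming hypothetical helper names (sketch_roundup_pow2, sketch_flow_index) that do not exist in the kernel:

```c
#include <stdint.h>

/* Round n up to the next power of two; 0 and 1 both map to 1. */
static uint32_t sketch_roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	/* Stop at 2^31 so the shift can never overflow. */
	while (p < n && p < 0x80000000u)
		p <<= 1;
	return p;
}

/* With a power-of-two size, "hash & mask" replaces "hash % size". */
static uint32_t sketch_flow_index(uint32_t rxhash, uint32_t mask)
{
	return rxhash & mask;
}
```

For a requested count of 1000, the table would hold 1024 entries and the stored mask would be 1023.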
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS                104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS                          303K tps at 61% CPU

RPC test          tps   CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS     103K   48%   757/900/3185             4472.35
   RPS only:      174K   73%   415/993/2468              491.66
   RFS            223K   73%   379/651/1382              315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 03:01:27 +04:00
static ssize_t show_rps_dev_flow_table_cnt ( struct netdev_rx_queue * queue ,
struct rx_queue_attribute * attr ,
char * buf )
{
struct rps_dev_flow_table * flow_table ;
2011-12-24 10:56:49 +04:00
unsigned long val = 0 ;
2010-04-17 03:01:27 +04:00
	rcu_read_lock();
	flow_table = rcu_dereference(queue->rps_flow_table);
	if (flow_table)
2011-12-24 10:56:49 +04:00
		val = (unsigned long)flow_table->mask + 1;
2010-04-17 03:01:27 +04:00
rcu_read_unlock ( ) ;
2011-12-24 10:56:49 +04:00
	return sprintf(buf, "%lu\n", val);
2010-04-17 03:01:27 +04:00
}
static void rps_dev_flow_table_release ( struct rcu_head * rcu )
{
struct rps_dev_flow_table * table = container_of ( rcu ,
struct rps_dev_flow_table , rcu ) ;
2013-05-05 20:05:55 +04:00
vfree ( table ) ;
2010-04-17 03:01:27 +04:00
}
2010-04-20 01:40:57 +04:00
static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it could
allow out-of-order (OOO) packets. If threads are thrashing around
CPUs or multiple threads are trying to read from the same sockets,
a quickly changing CPU value in the hash table could cause rampant
OOO packets; we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
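The counter bookkeeping described above can be sketched in a few lines
of C. This is only an illustration under stated assumptions: the
struct and function names (backlog_queue, backlog_tail,
backlog_enqueue, backlog_dequeue) are hypothetical and are not the
kernel's own data structures.

```c
#include <assert.h>

/* Hypothetical stand-in for a per-CPU backlog queue (names are illustrative). */
struct backlog_queue {
	unsigned int head;	/* incremented on every dequeue */
	unsigned int qlen;	/* number of packets currently queued */
};

/* Queue tail counter = queue head counter + queue length. */
static unsigned int backlog_tail(const struct backlog_queue *q)
{
	return q->head + q->qlen;
}

/*
 * Enqueue one packet and return the tail counter that would be saved
 * in the matching rps_dev_flow entry.
 */
static unsigned int backlog_enqueue(struct backlog_queue *q)
{
	q->qlen++;
	return backlog_tail(q);
}

/* Dequeue one packet: the head advances, the tail stays where it was. */
static void backlog_dequeue(struct backlog_queue *q)
{
	assert(q->qlen > 0);
	q->qlen--;
	q->head++;
}
```

Because a dequeue moves one unit from the length to the head, the tail
value only changes on enqueue, which is what makes a saved tail value
a stable marker for "the last packet queued through this entry".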
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in-order delivery.
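The three switching conditions listed above can be written as a small
predicate. This is a minimal sketch, not the code of get_rps_cpu: the
names dev_flow, NO_CPU and may_switch_cpu are hypothetical, and the
signed-difference comparison is an assumption added here so that the
">=" test keeps working if the counters wrap around.

```c
#include <stdbool.h>

#define NO_CPU 0xffffu	/* illustrative sentinel standing in for RPS_NO_CPU */

/* Hypothetical per-flow entry: current CPU plus tail saved at last enqueue. */
struct dev_flow {
	unsigned int cpu;
	unsigned int last_qtail;
};

/*
 * Return true if the flow may move from its current CPU to desired_cpu.
 * current_online and current_head describe the current CPU's state.
 */
static bool may_switch_cpu(const struct dev_flow *flow,
			   unsigned int desired_cpu,
			   bool current_online,
			   unsigned int current_head)
{
	if (flow->cpu == desired_cpu)
		return false;		/* nothing to change */
	if (flow->cpu == NO_CPU)
		return true;		/* current CPU is unset */
	if (!current_online)
		return true;		/* current CPU is offline */
	/* head >= saved tail: everything queued via this entry was dequeued */
	return (int)(current_head - flow->last_qtail) >= 0;
}
```

With a saved tail of 10, a head of 9 keeps the flow where it is, while
a head of 10 or more allows the move to the desired CPU.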
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality.
2) this allows lockless access to the table: the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, and we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table. The per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
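For the per-rxqueue table, the rounding can be done without overflow
by smearing the top bit of (count - 1) downwards, as the store
function in this file does. The sketch below isolates that loop; the
function name flow_table_mask is hypothetical, and it assumes a
non-zero count (a count of zero means "no table" in the store
function).

```c
#include <assert.h>
#include <limits.h>

/*
 * Return roundup_pow_of_two(count) - 1 without ever computing the
 * rounded value itself, so the result cannot overflow.
 */
static unsigned long flow_table_mask(unsigned long count)
{
	unsigned long mask;

	assert(count != 0);
	mask = count - 1;
	/* OR the mask with its right shift until every lower bit is set */
	while ((mask | (mask >> 1)) != mask)
		mask |= (mask >> 1);
	return mask;
}
```

The table then holds mask + 1 entries and a hash can be reduced to an
index with a bitwise AND against the mask. For example, a requested
count of 5 gives a mask of 7, that is, an 8-entry table.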
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS               104K tps at 30% CPU
   No RFS (best RPS config):   290K tps at 63% CPU
   RFS                         303K tps at 61% CPU
RPC test        tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS   103K   48%    757/900/3185             4472.35
   RPS only:    174K   73%    415/993/2468             491.66
   RFS          223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 03:01:27 +04:00
				     struct rx_queue_attribute *attr,
				     const char *buf, size_t len)
{
2011-12-24 10:56:49 +04:00
	unsigned long mask, count;
2010-04-17 03:01:27 +04:00
	struct rps_dev_flow_table *table, *old_table;
	static DEFINE_SPINLOCK(rps_dev_flow_lock);
2011-12-24 10:56:49 +04:00
	int rc;
2010-04-17 03:01:27 +04:00
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;
2011-12-24 10:56:49 +04:00
	rc = kstrtoul(buf, 0, &count);
	if (rc < 0)
		return rc;
2010-04-17 03:01:27 +04:00
	if (count) {
2011-12-24 10:56:49 +04:00
		mask = count - 1;
		/* mask = roundup_pow_of_two(count) - 1;
		 * without overflows...
		 */
		while ((mask | (mask >> 1)) != mask)
			mask |= (mask >> 1);
		/* On 64 bit arches, must check mask fits in table->mask (u32),
2013-12-09 00:15:44 +04:00
		 * and on 32 bit arches, must check
		 * RPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
2011-12-24 10:56:49 +04:00
		 */
#if BITS_PER_LONG > 32
		if (mask > (unsigned long)(u32)mask)
2011-12-22 17:35:22 +04:00
			return -EINVAL;
2011-12-24 10:56:49 +04:00
#else
		if (mask > (ULONG_MAX - RPS_DEV_FLOW_TABLE_SIZE(1))
2011-12-22 17:35:22 +04:00
				/ sizeof(struct rps_dev_flow)) {
2010-04-17 03:01:27 +04:00
			/* Enforce a limit to prevent overflow */
			return -EINVAL;
		}
2011-12-24 10:56:49 +04:00
#endif
		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(mask + 1));
2010-04-17 03:01:27 +04:00
		if (!table)
			return -ENOMEM;
2011-12-24 10:56:49 +04:00
		table->mask = mask;
		for (count = 0; count <= mask; count++)
			table->flows[count].cpu = RPS_NO_CPU;
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number, and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
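The counter bookkeeping described above can be sketched as follows.
This is a minimal illustration, not the kernel code: the struct and
function names (backlog_queue, dev_flow_entry, enqueue_for_flow,
dequeue_one) are hypothetical stand-ins, and the sketch assumes the
tail value is sampled right after the packet has been added to the
queue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one CPU's backlog queue. */
struct backlog_queue {
	uint32_t head;	/* incremented once per dequeued packet */
	uint32_t len;	/* packets currently waiting in the queue */
};

/* Hypothetical stand-in for one rps_dev_flow_table entry. */
struct dev_flow_entry {
	uint16_t cpu;		/* "current" CPU for the flow */
	uint32_t last_qtail;	/* queue tail counter at last enqueue */
};

/*
 * Enqueue one packet of the flow and remember the resulting tail
 * counter (head count + queue length) in the flow entry.
 */
static uint32_t enqueue_for_flow(struct backlog_queue *q,
				 struct dev_flow_entry *e)
{
	q->len++;
	e->last_qtail = q->head + q->len;
	return e->last_qtail;
}

/* Dequeue one packet: the head counter advances, the length shrinks. */
static void dequeue_one(struct backlog_queue *q)
{
	assert(q->len > 0);
	q->len--;
	q->head++;
}
```

With this bookkeeping, once the head counter has reached the saved
last_qtail value, every packet enqueued through that entry has left
the queue.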
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in-order delivery.
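The three conditions above can be sketched as a single predicate.
This is a sketch under stated assumptions, not the get_rps_cpu
implementation: flow_state, may_switch_cpu and NO_CPU are hypothetical
names, the caller is assumed to supply the online state and head
counter of the current CPU, and the signed-difference comparison (which
tolerates counter wraparound) is an addition that the text above does
not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu	/* hypothetical sentinel standing in for RPS_NO_CPU */

/* Hypothetical per-flow state, mirroring an rps_dev_flow_table entry. */
struct flow_state {
	uint16_t current_cpu;	/* CPU the flow is currently steered to */
	uint32_t last_qtail;	/* tail counter saved at last enqueue */
};

/*
 * Return true when the flow may move to desired_cpu without risking
 * out-of-order delivery.
 */
static bool may_switch_cpu(const struct flow_state *f, uint16_t desired_cpu,
			   bool current_online, uint32_t current_qhead)
{
	if (f->current_cpu == desired_cpu)
		return false;	/* already on the desired CPU */
	if (f->current_cpu == NO_CPU)
		return true;	/* no current CPU recorded yet */
	if (!current_online)
		return true;	/* current CPU went offline */
	/* All packets queued through this entry have been dequeued. */
	return (int32_t)(current_qhead - f->last_qtail) >= 0;
}
```

When the predicate is false, the packet keeps going to the current CPU,
so packets of one flow never sit in two backlog queues at once.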
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table. The per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
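Rounding the table size to a power of two lets a flow's slot be found
with a mask instead of a modulo. A minimal sketch, assuming the size is
rounded up; round_up_pow2 and flow_index are hypothetical helper names,
not kernel functions:

```c
#include <stdint.h>

/*
 * Round n up to the next power of two. Values above 2^31 are clamped
 * to 2^31, and 0 yields 1, to keep the sketch well defined.
 */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n && p < 0x80000000u)
		p <<= 1;
	return p;
}

/* Pick a table slot from the flow hash; mask is (table size - 1). */
static uint32_t flow_index(uint32_t rxhash, uint32_t mask)
{
	return rxhash & mask;
}
```

For example, a requested count of 1000 would give a 1024-entry table
with mask 1023, so every rxhash maps to a slot in the range 0..1023.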
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS               104K tps at 30% CPU
  No RFS (best RPS config):   290K tps at 63% CPU
  RFS                         303K tps at 61% CPU
RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only:   174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 03:01:27 +04:00
} else
table = NULL ;
spin_lock ( & rps_dev_flow_lock ) ;
2010-10-25 07:02:02 +04:00
old_table = rcu_dereference_protected ( queue - > rps_flow_table ,
lockdep_is_held ( & rps_dev_flow_lock ) ) ;
2010-04-17 03:01:27 +04:00
rcu_assign_pointer ( queue - > rps_flow_table , table ) ;
spin_unlock ( & rps_dev_flow_lock ) ;
if ( old_table )
call_rcu ( & old_table - > rcu , rps_dev_flow_table_release ) ;
return len ;
}
2010-03-16 11:03:29 +03:00
static struct rx_queue_attribute rps_cpus_attribute =
__ATTR ( rps_cpus , S_IRUGO | S_IWUSR , show_rps_map , store_rps_map ) ;
2010-04-17 03:01:27 +04:00
static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
__ATTR ( rps_flow_cnt , S_IRUGO | S_IWUSR ,
show_rps_dev_flow_table_cnt , store_rps_dev_flow_table_cnt ) ;
2014-01-17 10:23:28 +04:00
# endif /* CONFIG_RPS */
2010-04-17 03:01:27 +04:00
2010-03-16 11:03:29 +03:00
static struct attribute * rx_queue_default_attrs [ ] = {
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_RPS
2010-03-16 11:03:29 +03:00
& rps_cpus_attribute . attr ,
2010-04-17 03:01:27 +04:00
& rps_dev_flow_table_cnt_attribute . attr ,
2014-01-17 10:23:28 +04:00
# endif
2010-03-16 11:03:29 +03:00
NULL
} ;
static void rx_queue_release ( struct kobject * kobj )
{
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_RPS
2010-10-25 07:02:02 +04:00
struct rps_map * map ;
struct rps_dev_flow_table * flow_table ;
2010-03-16 11:03:29 +03:00
2010-04-17 03:01:27 +04:00
2011-08-11 23:30:52 +04:00
map = rcu_dereference_protected ( queue - > rps_map , 1 ) ;
2010-11-16 09:31:39 +03:00
if ( map ) {
RCU_INIT_POINTER ( queue - > rps_map , NULL ) ;
2011-03-18 07:01:31 +03:00
kfree_rcu ( map , rcu ) ;
2010-11-16 09:31:39 +03:00
}
2010-10-25 07:02:02 +04:00
2011-08-11 23:30:52 +04:00
flow_table = rcu_dereference_protected ( queue - > rps_flow_table , 1 ) ;
2010-11-16 09:31:39 +03:00
if ( flow_table ) {
RCU_INIT_POINTER ( queue - > rps_flow_table , NULL ) ;
2010-10-25 07:02:02 +04:00
call_rcu ( & flow_table - > rcu , rps_dev_flow_table_release ) ;
2010-11-16 09:31:39 +03:00
}
2014-01-17 10:23:28 +04:00
# endif
2010-03-16 11:03:29 +03:00
2010-11-16 09:31:39 +03:00
memset ( kobj , 0 , sizeof ( * kobj ) ) ;
2010-11-09 13:47:38 +03:00
dev_put ( queue - > dev ) ;
2010-03-16 11:03:29 +03:00
}
2014-01-16 13:24:31 +04:00
static const void * rx_queue_namespace ( struct kobject * kobj )
{
struct netdev_rx_queue * queue = to_rx_queue ( kobj ) ;
struct device * dev = & queue - > dev - > dev ;
const void * ns = NULL ;
if ( dev - > class & & dev - > class - > ns_type )
ns = dev - > class - > namespace ( dev ) ;
return ns ;
}
2010-03-16 11:03:29 +03:00
static struct kobj_type rx_queue_ktype = {
. sysfs_ops = & rx_queue_sysfs_ops ,
. release = rx_queue_release ,
. default_attrs = rx_queue_default_attrs ,
2014-01-16 13:24:31 +04:00
. namespace = rx_queue_namespace
2010-03-16 11:03:29 +03:00
} ;
static int rx_queue_add_kobject ( struct net_device * net , int index )
{
struct netdev_rx_queue * queue = net - > _rx + index ;
struct kobject * kobj = & queue - > kobj ;
int error = 0 ;
kobj - > kset = net - > queues_kset ;
error = kobject_init_and_add ( kobj , & rx_queue_ktype , NULL ,
" rx-%u " , index ) ;
2014-01-17 10:23:28 +04:00
if ( error )
goto exit ;
if ( net - > sysfs_rx_queue_group ) {
error = sysfs_create_group ( kobj , net - > sysfs_rx_queue_group ) ;
if ( error )
goto exit ;
2010-03-16 11:03:29 +03:00
}
kobject_uevent ( kobj , KOBJ_ADD ) ;
2010-11-09 13:47:38 +03:00
dev_hold ( queue - > dev ) ;
2010-03-16 11:03:29 +03:00
2014-01-17 10:23:28 +04:00
return error ;
exit :
kobject_put ( kobj ) ;
2010-03-16 11:03:29 +03:00
return error ;
}
2014-01-17 10:23:28 +04:00
# endif /* CONFIG_SYSFS */
2010-03-16 11:03:29 +03:00
2010-09-27 12:24:33 +04:00
int
net_rx_queue_update_kobjects ( struct net_device * net , int old_num , int new_num )
2010-03-16 11:03:29 +03:00
{
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_SYSFS
2010-03-16 11:03:29 +03:00
int i ;
int error = 0 ;
2014-01-17 10:23:28 +04:00
# ifndef CONFIG_RPS
if ( ! net - > sysfs_rx_queue_group )
return 0 ;
# endif
2010-09-27 12:24:33 +04:00
for ( i = old_num ; i < new_num ; i + + ) {
2010-03-16 11:03:29 +03:00
error = rx_queue_add_kobject ( net , i ) ;
2010-09-27 12:24:33 +04:00
if ( error ) {
new_num = old_num ;
2010-03-16 11:03:29 +03:00
break ;
2010-09-27 12:24:33 +04:00
}
2010-03-16 11:03:29 +03:00
}
2014-01-17 10:23:28 +04:00
while ( - - i > = new_num ) {
if ( net - > sysfs_rx_queue_group )
sysfs_remove_group ( & net - > _rx [ i ] . kobj ,
net - > sysfs_rx_queue_group ) ;
2010-09-27 12:24:33 +04:00
kobject_put ( & net - > _rx [ i ] . kobj ) ;
2014-01-17 10:23:28 +04:00
}
2010-03-16 11:03:29 +03:00
return error ;
2010-11-26 11:36:09 +03:00
# else
return 0 ;
# endif
2010-03-16 11:03:29 +03:00
}
2011-11-16 16:15:10 +04:00
# ifdef CONFIG_SYSFS
2010-11-21 16:17:27 +03:00
/*
* netdev_queue sysfs structures and functions .
*/
struct netdev_queue_attribute {
struct attribute attr ;
ssize_t ( * show ) ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attr , char * buf ) ;
ssize_t ( * store ) ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attr , const char * buf , size_t len ) ;
} ;
# define to_netdev_queue_attr(_attr) container_of(_attr, \
struct netdev_queue_attribute , attr )
# define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
static ssize_t netdev_queue_attr_show ( struct kobject * kobj ,
struct attribute * attr , char * buf )
{
struct netdev_queue_attribute * attribute = to_netdev_queue_attr ( attr ) ;
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
return attribute - > show ( queue , attribute , buf ) ;
}
static ssize_t netdev_queue_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t count )
{
struct netdev_queue_attribute * attribute = to_netdev_queue_attr ( attr ) ;
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
return attribute - > store ( queue , attribute , buf , count ) ;
}
static const struct sysfs_ops netdev_queue_sysfs_ops = {
. show = netdev_queue_attr_show ,
. store = netdev_queue_attr_store ,
} ;
2011-11-16 16:15:10 +04:00
static ssize_t show_trans_timeout ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attribute ,
char * buf )
{
unsigned long trans_timeout ;
spin_lock_irq ( & queue - > _xmit_lock ) ;
trans_timeout = queue - > trans_timeout ;
spin_unlock_irq ( & queue - > _xmit_lock ) ;
return sprintf ( buf , " %lu \n " , trans_timeout ) ;
}
static struct netdev_queue_attribute queue_trans_timeout =
__ATTR ( tx_timeout , S_IRUGO , show_trans_timeout , NULL ) ;
2011-11-28 20:33:09 +04:00
# ifdef CONFIG_BQL
/*
* Byte queue limits sysfs structures and functions .
*/
static ssize_t bql_show ( char * buf , unsigned int value )
{
return sprintf ( buf , " %u \n " , value ) ;
}
static ssize_t bql_set ( const char * buf , const size_t count ,
unsigned int * pvalue )
{
unsigned int value ;
int err ;
if ( ! strcmp ( buf , " max " ) | | ! strcmp ( buf , " max \n " ) )
value = DQL_MAX_LIMIT ;
else {
err = kstrtouint ( buf , 10 , & value ) ;
if ( err < 0 )
return err ;
if ( value > DQL_MAX_LIMIT )
return - EINVAL ;
}
* pvalue = value ;
return count ;
}
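/*
 * Illustration only (not part of the original source): how bql_set()
 * treats a few sample writes, as read from the code above. The sample
 * strings are hypothetical.
 *
 *   "max" or "max\n"       -> *pvalue = DQL_MAX_LIMIT, returns count
 *   "1000\n"               -> *pvalue = 1000, returns count
 *   "abc"                  -> returns the kstrtouint() error (-EINVAL),
 *                             *pvalue is left untouched
 *   value > DQL_MAX_LIMIT  -> returns -EINVAL, *pvalue is left untouched
 */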
static ssize_t bql_show_hold_time ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attr ,
char * buf )
{
struct dql * dql = & queue - > dql ;
return sprintf ( buf , " %u \n " , jiffies_to_msecs ( dql - > slack_hold_time ) ) ;
}
static ssize_t bql_set_hold_time ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attribute ,
const char * buf , size_t len )
{
struct dql * dql = & queue - > dql ;
2012-04-15 09:58:06 +04:00
unsigned int value ;
2011-11-28 20:33:09 +04:00
int err ;
err = kstrtouint ( buf , 10 , & value ) ;
if ( err < 0 )
return err ;
dql - > slack_hold_time = msecs_to_jiffies ( value ) ;
return len ;
}
static struct netdev_queue_attribute bql_hold_time_attribute =
__ATTR ( hold_time , S_IRUGO | S_IWUSR , bql_show_hold_time ,
bql_set_hold_time ) ;
static ssize_t bql_show_inflight ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attr ,
char * buf )
{
struct dql * dql = & queue - > dql ;
return sprintf ( buf , " %u \n " , dql - > num_queued - dql - > num_completed ) ;
}
static struct netdev_queue_attribute bql_inflight_attribute =
2012-01-14 11:10:21 +04:00
__ATTR ( inflight , S_IRUGO , bql_show_inflight , NULL ) ;
2011-11-28 20:33:09 +04:00
# define BQL_ATTR(NAME, FIELD) \
static ssize_t bql_show_ # # NAME ( struct netdev_queue * queue , \
struct netdev_queue_attribute * attr , \
char * buf ) \
{ \
return bql_show ( buf , queue - > dql . FIELD ) ; \
} \
\
static ssize_t bql_set_ # # NAME ( struct netdev_queue * queue , \
struct netdev_queue_attribute * attr , \
const char * buf , size_t len ) \
{ \
return bql_set ( buf , len , & queue - > dql . FIELD ) ; \
} \
\
static struct netdev_queue_attribute bql_ # # NAME # # _attribute = \
__ATTR ( NAME , S_IRUGO | S_IWUSR , bql_show_ # # NAME , \
bql_set_ # # NAME ) ;
BQL_ATTR ( limit , limit )
BQL_ATTR ( limit_max , max_limit )
BQL_ATTR ( limit_min , min_limit )
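/*
 * Illustration only (not part of the original source): what one of the
 * invocations above generates. BQL_ATTR(limit_max, max_limit) expands to
 * roughly the following, so NAME picks the sysfs file name and the
 * identifiers, while FIELD picks the member of struct dql:
 *
 *   static ssize_t bql_show_limit_max(struct netdev_queue *queue,
 *                                     struct netdev_queue_attribute *attr,
 *                                     char *buf)
 *   {
 *           return bql_show(buf, queue->dql.max_limit);
 *   }
 *
 *   static ssize_t bql_set_limit_max(struct netdev_queue *queue,
 *                                    struct netdev_queue_attribute *attr,
 *                                    const char *buf, size_t len)
 *   {
 *           return bql_set(buf, len, &queue->dql.max_limit);
 *   }
 *
 *   static struct netdev_queue_attribute bql_limit_max_attribute =
 *           __ATTR(limit_max, S_IRUGO | S_IWUSR, bql_show_limit_max,
 *                  bql_set_limit_max);
 */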
static struct attribute * dql_attrs [ ] = {
& bql_limit_attribute . attr ,
& bql_limit_max_attribute . attr ,
& bql_limit_min_attribute . attr ,
& bql_hold_time_attribute . attr ,
& bql_inflight_attribute . attr ,
NULL
} ;
static struct attribute_group dql_group = {
. name = " byte_queue_limits " ,
. attrs = dql_attrs ,
} ;
# endif /* CONFIG_BQL */
2011-11-16 16:15:10 +04:00
# ifdef CONFIG_XPS
2010-11-21 16:17:27 +03:00
static inline unsigned int get_netdev_queue_index ( struct netdev_queue * queue )
2010-03-16 11:03:29 +03:00
{
2010-11-21 16:17:27 +03:00
struct net_device * dev = queue - > dev ;
int i ;
for ( i = 0 ; i < dev - > num_tx_queues ; i + + )
if ( queue = = & dev - > _tx [ i ] )
break ;
BUG_ON ( i > = dev - > num_tx_queues ) ;
return i ;
}
static ssize_t show_xps_map ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attribute , char * buf )
{
struct net_device * dev = queue - > dev ;
struct xps_dev_maps * dev_maps ;
cpumask_var_t mask ;
unsigned long index ;
size_t len = 0 ;
int i ;
if ( ! zalloc_cpumask_var ( & mask , GFP_KERNEL ) )
return - ENOMEM ;
index = get_netdev_queue_index ( queue ) ;
rcu_read_lock ( ) ;
dev_maps = rcu_dereference ( dev - > xps_maps ) ;
if ( dev_maps ) {
for_each_possible_cpu ( i ) {
struct xps_map * map =
rcu_dereference ( dev_maps - > cpu_map [ i ] ) ;
if ( map ) {
int j ;
for ( j = 0 ; j < map - > len ; j + + ) {
if ( map - > queues [ j ] = = index ) {
cpumask_set_cpu ( i , mask ) ;
break ;
}
}
}
}
}
rcu_read_unlock ( ) ;
len + = cpumask_scnprintf ( buf + len , PAGE_SIZE , mask ) ;
if ( PAGE_SIZE - len < 3 ) {
free_cpumask_var ( mask ) ;
return - EINVAL ;
}
free_cpumask_var ( mask ) ;
len + = sprintf ( buf + len , " \n " ) ;
return len ;
}
static ssize_t store_xps_map ( struct netdev_queue * queue ,
struct netdev_queue_attribute * attribute ,
const char * buf , size_t len )
{
struct net_device * dev = queue - > dev ;
unsigned long index ;
2013-01-10 12:57:02 +04:00
cpumask_var_t mask ;
int err ;
2010-11-21 16:17:27 +03:00
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
if ( ! alloc_cpumask_var ( & mask , GFP_KERNEL ) )
return - ENOMEM ;
index = get_netdev_queue_index ( queue ) ;
err = bitmap_parse ( buf , len , cpumask_bits ( mask ) , nr_cpumask_bits ) ;
if ( err ) {
free_cpumask_var ( mask ) ;
return err ;
}
2013-01-10 12:57:02 +04:00
err = netif_set_xps_queue ( dev , mask , index ) ;
2010-11-21 16:17:27 +03:00
free_cpumask_var ( mask ) ;
2013-01-10 12:57:02 +04:00
return err ? : len ;
2010-11-21 16:17:27 +03:00
}
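/*
 * Illustration only (sample values are hypothetical): store_xps_map()
 * hands the written string to bitmap_parse(), which reads a hexadecimal
 * CPU mask. For the TX queue whose xps_cpus file is written:
 *
 *   "1"  -> CPU 0 is mapped to this queue
 *   "3"  -> CPUs 0 and 1 are mapped to this queue
 *   "f"  -> CPUs 0 to 3 are mapped to this queue
 *
 * The writer needs CAP_NET_ADMIN; otherwise the write fails with -EPERM.
 */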
static struct netdev_queue_attribute xps_cpus_attribute =
__ATTR ( xps_cpus , S_IRUGO | S_IWUSR , show_xps_map , store_xps_map ) ;
2011-11-16 16:15:10 +04:00
# endif /* CONFIG_XPS */
2010-11-21 16:17:27 +03:00
static struct attribute * netdev_queue_default_attrs [ ] = {
2011-11-16 16:15:10 +04:00
& queue_trans_timeout . attr ,
# ifdef CONFIG_XPS
2010-11-21 16:17:27 +03:00
& xps_cpus_attribute . attr ,
2011-11-16 16:15:10 +04:00
# endif
2010-11-21 16:17:27 +03:00
NULL
} ;
static void netdev_queue_release ( struct kobject * kobj )
{
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
memset ( kobj , 0 , sizeof ( * kobj ) ) ;
dev_put ( queue - > dev ) ;
}
2014-01-16 13:24:31 +04:00
static const void * netdev_queue_namespace ( struct kobject * kobj )
{
struct netdev_queue * queue = to_netdev_queue ( kobj ) ;
struct device * dev = & queue - > dev - > dev ;
const void * ns = NULL ;
if ( dev - > class & & dev - > class - > ns_type )
ns = dev - > class - > namespace ( dev ) ;
return ns ;
}
2010-11-21 16:17:27 +03:00
static struct kobj_type netdev_queue_ktype = {
. sysfs_ops = & netdev_queue_sysfs_ops ,
. release = netdev_queue_release ,
. default_attrs = netdev_queue_default_attrs ,
2014-01-16 13:24:31 +04:00
. namespace = netdev_queue_namespace ,
2010-11-21 16:17:27 +03:00
} ;
static int netdev_queue_add_kobject ( struct net_device * net , int index )
{
struct netdev_queue * queue = net - > _tx + index ;
struct kobject * kobj = & queue - > kobj ;
int error = 0 ;
/* A later kobject_put() ends in netdev_queue_release(), which drops a
 * device reference, so take that reference before anything can fail.
 */
dev_hold ( queue - > dev ) ;
kobj - > kset = net - > queues_kset ;
error = kobject_init_and_add ( kobj , & netdev_queue_ktype , NULL ,
" tx-%u " , index ) ;
2011-11-28 20:33:09 +04:00
if ( error )
goto exit ;
# ifdef CONFIG_BQL
error = sysfs_create_group ( kobj , & dql_group ) ;
if ( error )
goto exit ;
# endif
2010-11-21 16:17:27 +03:00
kobject_uevent ( kobj , KOBJ_ADD ) ;
2011-11-28 20:33:09 +04:00
return 0 ;
exit :
kobject_put ( kobj ) ;
2010-11-21 16:17:27 +03:00
return error ;
}
2011-11-16 16:15:10 +04:00
# endif /* CONFIG_SYSFS */
2010-11-21 16:17:27 +03:00
int
netdev_queue_update_kobjects ( struct net_device * net , int old_num , int new_num )
{
2011-11-16 16:15:10 +04:00
# ifdef CONFIG_SYSFS
2010-11-21 16:17:27 +03:00
int i ;
int error = 0 ;
for ( i = old_num ; i < new_num ; i + + ) {
error = netdev_queue_add_kobject ( net , i ) ;
if ( error ) {
new_num = old_num ;
break ;
}
}
2011-11-28 20:33:09 +04:00
while ( - - i > = new_num ) {
struct netdev_queue * queue = net - > _tx + i ;
# ifdef CONFIG_BQL
sysfs_remove_group ( & queue - > kobj , & dql_group ) ;
# endif
kobject_put ( & queue - > kobj ) ;
}
2010-11-21 16:17:27 +03:00
return error ;
2010-11-26 11:36:09 +03:00
# else
return 0 ;
2011-11-16 16:15:10 +04:00
# endif /* CONFIG_SYSFS */
2010-11-21 16:17:27 +03:00
}
static int register_queue_kobjects ( struct net_device * net )
{
2010-11-26 11:36:09 +03:00
int error = 0 , txq = 0 , rxq = 0 , real_rx = 0 , real_tx = 0 ;
2010-11-21 16:17:27 +03:00
2011-11-16 16:15:10 +04:00
# ifdef CONFIG_SYSFS
2010-09-27 12:24:33 +04:00
net - > queues_kset = kset_create_and_add ( " queues " ,
NULL , & net - > dev . kobj ) ;
if ( ! net - > queues_kset )
return - ENOMEM ;
2010-11-26 11:36:09 +03:00
real_rx = net - > real_num_rx_queues ;
# endif
real_tx = net - > real_num_tx_queues ;
2010-11-21 16:17:27 +03:00
2010-11-26 11:36:09 +03:00
error = net_rx_queue_update_kobjects ( net , 0 , real_rx ) ;
2010-11-21 16:17:27 +03:00
if ( error )
goto error ;
2010-11-26 11:36:09 +03:00
rxq = real_rx ;
2010-11-21 16:17:27 +03:00
2010-11-26 11:36:09 +03:00
error = netdev_queue_update_kobjects ( net , 0 , real_tx ) ;
2010-11-21 16:17:27 +03:00
if ( error )
goto error ;
2010-11-26 11:36:09 +03:00
txq = real_tx ;
2010-11-21 16:17:27 +03:00
return 0 ;
error :
netdev_queue_update_kobjects ( net , txq , 0 ) ;
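net_rx_queue_update_kobjects ( net , rxq , 0 ) ;
# ifdef CONFIG_SYSFS
/* Do not leak the "queues" kset created at the top of this function. */
kset_unregister ( net - > queues_kset ) ;
# endif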
return error ;
2010-09-27 12:24:33 +04:00
}
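/*
 * Illustration only (not part of the original source): the sysfs layout
 * that register_queue_kobjects() produces for a device with one RX and
 * one TX queue, assuming CONFIG_SYSFS, CONFIG_RPS, CONFIG_XPS and
 * CONFIG_BQL are all enabled. <iface> stands for the interface name.
 *
 *   /sys/class/net/<iface>/queues/
 *       rx-0/                            attributes from rx_queue_default_attrs
 *       tx-0/tx_timeout
 *       tx-0/xps_cpus
 *       tx-0/byte_queue_limits/limit
 *       tx-0/byte_queue_limits/limit_max
 *       tx-0/byte_queue_limits/limit_min
 *       tx-0/byte_queue_limits/hold_time
 *       tx-0/byte_queue_limits/inflight
 */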
2010-03-16 11:03:29 +03:00
2010-11-21 16:17:27 +03:00
static void remove_queue_kobjects ( struct net_device * net )
2010-09-27 12:24:33 +04:00
{
2010-11-26 11:36:09 +03:00
int real_rx = 0 , real_tx = 0 ;
2014-01-17 10:23:28 +04:00
# ifdef CONFIG_SYSFS
2010-11-26 11:36:09 +03:00
real_rx = net - > real_num_rx_queues ;
# endif
real_tx = net - > real_num_tx_queues ;
net_rx_queue_update_kobjects ( net , real_rx , 0 ) ;
netdev_queue_update_kobjects ( net , real_tx , 0 ) ;
2011-11-16 16:15:10 +04:00
# ifdef CONFIG_SYSFS
2010-03-16 11:03:29 +03:00
kset_unregister ( net - > queues_kset ) ;
2010-11-26 11:36:09 +03:00
# endif
2010-03-16 11:03:29 +03:00
}
2010-05-05 04:36:45 +04:00
2013-03-26 07:07:01 +04:00
static bool net_current_may_mount ( void )
{
struct net * net = current - > nsproxy - > net_ns ;
return ns_capable ( net - > user_ns , CAP_SYS_ADMIN ) ;
}
2011-06-09 05:13:01 +04:00
static void * net_grab_current_ns ( void )
2010-05-05 04:36:45 +04:00
{
2011-06-09 05:13:01 +04:00
struct net * ns = current - > nsproxy - > net_ns ;
# ifdef CONFIG_NET_NS
if ( ns )
atomic_inc ( & ns - > passive ) ;
# endif
return ns ;
2010-05-05 04:36:45 +04:00
}
static const void * net_initial_ns ( void )
{
return & init_net ;
}
static const void * net_netlink_ns ( struct sock * sk )
{
return sock_net ( sk ) ;
}
2010-08-05 19:45:15 +04:00
struct kobj_ns_type_operations net_ns_type_operations = {
2010-05-05 04:36:45 +04:00
. type = KOBJ_NS_TYPE_NET ,
2013-03-26 07:07:01 +04:00
. current_may_mount = net_current_may_mount ,
2011-06-09 05:13:01 +04:00
. grab_current_ns = net_grab_current_ns ,
2010-05-05 04:36:45 +04:00
. netlink_ns = net_netlink_ns ,
. initial_ns = net_initial_ns ,
2011-06-09 05:13:01 +04:00
. drop_ns = net_drop_ns ,
2010-05-05 04:36:45 +04:00
} ;
2010-08-05 19:45:15 +04:00
EXPORT_SYMBOL_GPL ( net_ns_type_operations ) ;
2010-05-05 04:36:45 +04:00
2007-08-14 17:15:12 +04:00
static int netdev_uevent ( struct device * d , struct kobj_uevent_env * env )
2005-04-17 02:20:36 +04:00
{
2002-04-09 23:14:34 +04:00
struct net_device * dev = to_net_dev ( d ) ;
2007-08-14 17:15:12 +04:00
int retval ;
2005-04-17 02:20:36 +04:00
2005-11-16 11:00:00 +03:00
/* pass interface to uevent. */
2007-08-14 17:15:12 +04:00
retval = add_uevent_var ( env , " INTERFACE=%s " , dev - > name ) ;
2007-03-31 09:23:12 +04:00
if ( retval )
goto exit ;
2007-03-07 21:49:30 +03:00
/* pass ifindex to uevent.
 * ifindex is useful as it won't change (interface name may change)
 * and is what RtNetlink uses natively. */
2007-08-14 17:15:12 +04:00
retval = add_uevent_var ( env , " IFINDEX=%d " , dev - > ifindex ) ;
2005-04-17 02:20:36 +04:00
2007-03-31 09:23:12 +04:00
exit :
return retval ;
2005-04-17 02:20:36 +04:00
}
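/*
 * Illustration only (values are hypothetical): for a device named "eth0"
 * with ifindex 2, netdev_uevent() adds these variables to the uevent
 * environment:
 *
 *   INTERFACE=eth0
 *   IFINDEX=2
 */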
/*
2007-02-09 17:24:36 +03:00
 * netdev_release -- destroy and free a dead device.
2002-04-09 23:14:34 +04:00
 * Called when the last reference to the device kobject is gone.
2005-04-17 02:20:36 +04:00
*/
2002-04-09 23:14:34 +04:00
static void netdev_release ( struct device * d )
2005-04-17 02:20:36 +04:00
{
2002-04-09 23:14:34 +04:00
struct net_device * dev = to_net_dev ( d ) ;
2005-04-17 02:20:36 +04:00
BUG_ON ( dev - > reg_state ! = NETREG_RELEASED ) ;
2008-09-23 08:28:11 +04:00
kfree ( dev - > ifalias ) ;
2013-10-31 00:10:44 +04:00
netdev_freemem ( dev ) ;
2005-04-17 02:20:36 +04:00
}
2010-05-05 04:36:45 +04:00
static const void * net_namespace ( struct device * d )
{
struct net_device * dev ;
dev = container_of ( d , struct net_device , dev ) ;
return dev_net ( dev ) ;
}
2005-04-17 02:20:36 +04:00
static struct class net_class = {
. name = " net " ,
2002-04-09 23:14:34 +04:00
. dev_release = netdev_release ,
2013-07-25 02:05:33 +04:00
. dev_groups = net_class_groups ,
2002-04-09 23:14:34 +04:00
. dev_uevent = netdev_uevent ,
2010-05-05 04:36:45 +04:00
. ns_type = & net_ns_type_operations ,
. namespace = net_namespace ,
2005-04-17 02:20:36 +04:00
} ;
2007-05-20 02:39:25 +04:00
/* Delete sysfs entries but hold kobject reference until after all
* netdev references are gone .
*/
2007-09-27 09:02:53 +04:00
void netdev_unregister_kobject ( struct net_device * net )
2005-04-17 02:20:36 +04:00
{
2007-05-20 02:39:25 +04:00
struct device * dev = & ( net - > dev ) ;
kobject_get ( & dev - > kobj ) ;
2008-10-28 03:51:47 +03:00
2010-11-21 16:17:27 +03:00
remove_queue_kobjects ( net ) ;
2010-03-16 11:03:29 +03:00
2013-02-23 04:34:16 +04:00
pm_runtime_set_memalloc_noio ( dev , false ) ;
2007-05-20 02:39:25 +04:00
device_del ( dev ) ;
2005-04-17 02:20:36 +04:00
}
/* Create sysfs entries for network device. */
2007-09-27 09:02:53 +04:00
int netdev_register_kobject ( struct net_device * net )
2005-04-17 02:20:36 +04:00
{
2002-04-09 23:14:34 +04:00
struct device * dev = & ( net - > dev ) ;
2009-06-24 21:06:31 +04:00
const struct attribute_group * * groups = net - > sysfs_groups ;
2010-03-16 11:03:29 +03:00
int error = 0 ;
2005-04-17 02:20:36 +04:00
2010-05-05 04:36:49 +04:00
device_initialize ( dev ) ;
2002-04-09 23:14:34 +04:00
dev - > class = & net_class ;
dev - > platform_data = net ;
dev - > groups = groups ;
2005-04-17 02:20:36 +04:00
2009-03-09 16:51:55 +03:00
dev_set_name ( dev , " %s " , net - > name ) ;
2005-04-17 02:20:36 +04:00
2007-09-27 09:02:53 +04:00
# ifdef CONFIG_SYSFS
2009-10-29 17:18:21 +03:00
/* Allow for a device specific group */
if ( * groups )
groups + + ;
2005-04-17 02:20:36 +04:00
2009-10-29 17:18:21 +03:00
* groups + + = & netstat_group ;
2012-11-16 23:46:19 +04:00
# if IS_ENABLED(CONFIG_WIRELESS_EXT) || IS_ENABLED(CONFIG_CFG80211)
if ( net - > ieee80211_ptr )
* groups + + = & wireless_group ;
# if IS_ENABLED(CONFIG_WIRELESS_EXT)
else if ( net - > wireless_handlers )
* groups + + = & wireless_group ;
# endif
# endif
2007-09-27 09:02:53 +04:00
# endif /* CONFIG_SYSFS */
2005-04-17 02:20:36 +04:00
2010-03-16 11:03:29 +03:00
error = device_add ( dev ) ;
if ( error )
return error ;
2010-11-21 16:17:27 +03:00
error = register_queue_kobjects ( net ) ;
2010-03-16 11:03:29 +03:00
if ( error ) {
device_del ( dev ) ;
return error ;
}
2013-02-23 04:34:16 +04:00
pm_runtime_set_memalloc_noio ( dev , true ) ;
2010-03-16 11:03:29 +03:00
return error ;
2005-04-17 02:20:36 +04:00
}
2013-09-12 06:29:04 +04:00
int netdev_class_create_file_ns ( struct class_attribute * class_attr ,
const void * ns )
2008-06-14 05:12:04 +04:00
{
2013-09-12 06:29:04 +04:00
return class_create_file_ns ( & net_class , class_attr , ns ) ;
2008-06-14 05:12:04 +04:00
}
2013-09-12 06:29:04 +04:00
EXPORT_SYMBOL ( netdev_class_create_file_ns ) ;
2008-06-14 05:12:04 +04:00
2013-09-12 06:29:04 +04:00
void netdev_class_remove_file_ns ( struct class_attribute * class_attr ,
const void * ns )
2008-06-14 05:12:04 +04:00
{
2013-09-12 06:29:04 +04:00
class_remove_file_ns ( & net_class , class_attr , ns ) ;
2008-06-14 05:12:04 +04:00
}
2013-09-12 06:29:04 +04:00
EXPORT_SYMBOL ( netdev_class_remove_file_ns ) ;
2008-06-14 05:12:04 +04:00
2014-01-06 04:20:11 +04:00
int __init netdev_kobject_init ( void )
2005-04-17 02:20:36 +04:00
{
2010-05-05 04:36:45 +04:00
kobj_ns_type_register ( & net_ns_type_operations ) ;
2005-04-17 02:20:36 +04:00
return class_register ( & net_class ) ;
}