2005-04-17 02:20:36 +04:00
/*
* originally based on the dummy device .
*
* Copyright 1999 , Thomas Davis , tadavis @ lbl . gov .
* Licensed under the GPL . Based on dummy . c , and eql . c devices .
*
* bonding . c : an Ethernet Bonding driver
*
* This is useful for talking to Cisco EtherChannel compatible equipment :
* Cisco 5500
* Sun Trunking ( Solaris )
* Alteon AceDirector Trunks
* Linux Bonding
* and probably many L2 switches . . .
*
* How it works :
* ifconfig bond0 ipaddress netmask up
* will set up a network device with an ip address . No mac address
* will be assigned at this time . The hw mac address will come from
* the first slave bonded to the channel . All slaves will then use
* this hw mac address .
*
* ifconfig bond0 down
* will release all slaves , marking them as down .
*
* ifenslave bond0 eth0
* will attach eth0 to bond0 as a slave . eth0 hw mac address will either
* a : be used as initial mac address
* b : if a hw mac address already is there , eth0 ' s hw mac address
* will then be set from bond0 .
*
*/
2009-12-14 07:06:07 +03:00
# define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
2005-04-17 02:20:36 +04:00
# include <linux/kernel.h>
# include <linux/module.h>
# include <linux/types.h>
# include <linux/fcntl.h>
# include <linux/interrupt.h>
# include <linux/ptrace.h>
# include <linux/ioport.h>
# include <linux/in.h>
2005-06-27 01:54:11 +04:00
# include <net/ip.h>
2005-04-17 02:20:36 +04:00
# include <linux/ip.h>
2005-06-27 01:54:11 +04:00
# include <linux/tcp.h>
# include <linux/udp.h>
2005-04-17 02:20:36 +04:00
# include <linux/slab.h>
# include <linux/string.h>
# include <linux/init.h>
# include <linux/timer.h>
# include <linux/socket.h>
# include <linux/ctype.h>
# include <linux/inet.h>
# include <linux/bitops.h>
2009-06-12 23:02:48 +04:00
# include <linux/io.h>
2005-04-17 02:20:36 +04:00
# include <asm/dma.h>
2009-06-12 23:02:48 +04:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
# include <linux/errno.h>
# include <linux/netdevice.h>
# include <linux/inetdevice.h>
bonding: Improve IGMP join processing
In active-backup mode, the current bonding code duplicates IGMP
traffic to all slaves, so that switches are up to date in case of a
failover from an active to a backup interface. If bonding then fails
back to the original active interface, it is likely that the "active
slave" switch's IGMP forwarding for the port will be out of date until
some event occurs to refresh the switch (e.g., a membership query).
This patch alters the behavior of bonding to no longer flood
IGMP to all ports, and to issue IGMP JOINs to the newly active port at
the time of a failover. This ensures that switches are kept up to date
for all cases.
"GOELLESCH Niels" <niels.goellesch@eurocontrol.int> originally
reported this problem, and included a patch. His original patch was
modified by Jay Vosburgh to additionally remove the existing IGMP flood
behavior, use RCU, streamline code paths, fix trailing white space, and
adjust for style.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-03-01 04:03:37 +03:00
# include <linux/igmp.h>
2005-04-17 02:20:36 +04:00
# include <linux/etherdevice.h>
# include <linux/skbuff.h>
# include <net/sock.h>
# include <linux/rtnetlink.h>
# include <linux/smp.h>
# include <linux/if_ether.h>
# include <net/arp.h>
# include <linux/mii.h>
# include <linux/ethtool.h>
# include <linux/if_vlan.h>
# include <linux/if_bonding.h>
2007-12-07 10:40:33 +03:00
# include <linux/jiffies.h>
2010-10-13 20:01:50 +04:00
# include <linux/preempt.h>
2005-06-27 01:52:20 +04:00
# include <net/route.h>
2007-09-12 14:01:34 +04:00
# include <net/net_namespace.h>
2009-10-29 17:18:26 +03:00
# include <net/netns/generic.h>
2012-06-12 10:03:51 +04:00
# include <net/pkt_sched.h>
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 18:54:51 +04:00
# include <linux/rculist.h>
2005-04-17 02:20:36 +04:00
# include "bonding.h"
# include "bond_3ad.h"
# include "bond_alb.h"
/*---------------------------- Module parameters ----------------------------*/
/* monitor all links that often (in milliseconds). <=0 disables monitoring */
# define BOND_LINK_MON_INTERV 0
# define BOND_LINK_ARP_INTERV 0
static int max_bonds = BOND_DEFAULT_MAX_BONDS ;
2010-06-02 12:40:18 +04:00
static int tx_queues = BOND_DEFAULT_TX_QUEUES ;
2011-04-26 19:25:52 +04:00
static int num_peer_notif = 1 ;
2005-04-17 02:20:36 +04:00
static int miimon = BOND_LINK_MON_INTERV ;
2009-06-12 23:02:48 +04:00
static int updelay ;
static int downdelay ;
2005-04-17 02:20:36 +04:00
static int use_carrier = 1 ;
2009-06-12 23:02:48 +04:00
static char * mode ;
static char * primary ;
2009-09-25 07:28:09 +04:00
static char * primary_reselect ;
2009-06-12 23:02:48 +04:00
static char * lacp_rate ;
2011-06-22 13:54:39 +04:00
static int min_links ;
2009-06-12 23:02:48 +04:00
static char * ad_select ;
static char * xmit_hash_policy ;
2005-04-17 02:20:36 +04:00
static int arp_interval = BOND_LINK_ARP_INTERV ;
2009-06-12 23:02:48 +04:00
static char * arp_ip_target [ BOND_MAX_ARP_TARGETS ] ;
static char * arp_validate ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
Any other configuration that needs access to all arp_ip_targets can use
this as well.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 13:49:34 +04:00
static char * arp_all_targets ;
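The per-target bookkeeping described in the commit message above can be sketched in plain C. This is a minimal user-space illustration, not the driver's code: `demo_slave_last_rx`, its parameters, and the use of plain `unsigned long` timestamps are assumptions made for the example, and it ignores jiffies wraparound, which kernel code would have to handle.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: pick the timestamp used to judge whether a slave is
 * still receiving ARP traffic.
 *  - all_targets == 0 ("any"): use the most recent ARP seen from any target.
 *  - all_targets != 0 ("all"): use the oldest per-target timestamp, so one
 *    silent target is enough to make the slave look stale.
 * Timestamps are plain counters here; wraparound is deliberately ignored.
 */
unsigned long demo_slave_last_rx(const unsigned long *target_last_arp_rx,
				 size_t n_targets,
				 unsigned long last_arp_rx,
				 int all_targets)
{
	unsigned long oldest;
	size_t i;

	if (!all_targets || n_targets == 0)
		return last_arp_rx;

	assert(target_last_arp_rx != NULL);

	oldest = target_last_arp_rx[0];
	for (i = 1; i < n_targets; i++) {
		if (target_last_arp_rx[i] < oldest)
			oldest = target_last_arp_rx[i];
	}
	return oldest;
}
```

With per-target timestamps of 100, 40 and 70, the "all" setting would report 40, while the "any" setting keeps reporting the slave's most recent ARP timestamp.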
2009-06-12 23:02:48 +04:00
static char * fail_over_mac ;
2013-07-23 11:25:47 +04:00
static int all_slaves_active ;
2009-06-12 23:02:44 +04:00
static struct bond_params bonding_defaults ;
2010-10-05 18:23:59 +04:00
static int resend_igmp = BOND_DEFAULT_RESEND_IGMP ;
2005-04-17 02:20:36 +04:00
module_param ( max_bonds , int , 0 ) ;
MODULE_PARM_DESC ( max_bonds , " Max number of bonded devices " ) ;
2010-06-02 12:40:18 +04:00
module_param ( tx_queues , int , 0 ) ;
MODULE_PARM_DESC ( tx_queues , " Max number of transmit queues (default = 16) " ) ;
2011-04-26 19:25:52 +04:00
module_param_named ( num_grat_arp , num_peer_notif , int , 0644 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( num_grat_arp , " Number of peer notifications to send on "
" failover event (alias of num_unsol_na) " ) ;
2011-04-26 19:25:52 +04:00
module_param_named ( num_unsol_na , num_peer_notif , int , 0644 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( num_unsol_na , " Number of peer notifications to send on "
" failover event (alias of num_grat_arp) " ) ;
2005-04-17 02:20:36 +04:00
module_param ( miimon , int , 0 ) ;
MODULE_PARM_DESC ( miimon , " Link check interval in milliseconds " ) ;
module_param ( updelay , int , 0 ) ;
MODULE_PARM_DESC ( updelay , " Delay before considering link up, in milliseconds " ) ;
module_param ( downdelay , int , 0 ) ;
2005-11-09 21:35:03 +03:00
MODULE_PARM_DESC ( downdelay , " Delay before considering link down, "
" in milliseconds " ) ;
2005-04-17 02:20:36 +04:00
module_param ( use_carrier , int , 0 ) ;
2005-11-09 21:35:03 +03:00
MODULE_PARM_DESC ( use_carrier , " Use netif_carrier_ok (vs MII ioctls) in miimon; "
" 0 for off, 1 for on (default) " ) ;
2005-04-17 02:20:36 +04:00
module_param ( mode , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( mode , " Mode of operation; 0 for balance-rr, "
2005-11-09 21:35:03 +03:00
" 1 for active-backup, 2 for balance-xor, "
" 3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, "
" 6 for balance-alb " ) ;
2005-04-17 02:20:36 +04:00
module_param ( primary , charp , 0 ) ;
MODULE_PARM_DESC ( primary , " Primary network device to use " ) ;
2009-09-25 07:28:09 +04:00
module_param ( primary_reselect , charp , 0 ) ;
MODULE_PARM_DESC ( primary_reselect , " Reselect primary slave "
" once it comes up; "
" 0 for always (default), "
" 1 for only if speed of primary is "
" better, "
" 2 for only on active slave "
" failure " ) ;
2005-04-17 02:20:36 +04:00
module_param ( lacp_rate , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( lacp_rate , " LACPDU tx rate to request from 802.3ad partner; "
" 0 for slow, 1 for fast " ) ;
2008-11-05 04:51:16 +03:00
module_param ( ad_select , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( ad_select , " 802.3ad aggregation selection logic; "
" 0 for stable (default), 1 for bandwidth, "
" 2 for count " ) ;
2011-06-22 13:54:39 +04:00
module_param ( min_links , int , 0 ) ;
MODULE_PARM_DESC ( min_links , " Minimum number of available links before turning on carrier " ) ;
2005-06-27 01:54:11 +04:00
module_param ( xmit_hash_policy , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( xmit_hash_policy , " balance-xor and 802.3ad hashing method; "
" 0 for layer 2 (default), 1 for layer 3+4, "
" 2 for layer 2+3 " ) ;
2005-04-17 02:20:36 +04:00
module_param ( arp_interval , int , 0 ) ;
MODULE_PARM_DESC ( arp_interval , " arp interval in milliseconds " ) ;
module_param_array ( arp_ip_target , charp , NULL , 0 ) ;
MODULE_PARM_DESC ( arp_ip_target , " arp targets in n.n.n.n form " ) ;
2006-09-23 08:54:53 +04:00
module_param ( arp_validate , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( arp_validate , " validate src/dst of ARP probes; "
" 0 for none (default), 1 for active, "
" 2 for backup, 3 for all " ) ;
2013-06-24 13:49:34 +04:00
module_param ( arp_all_targets , charp , 0 ) ;
MODULE_PARM_DESC ( arp_all_targets , " fail on any/all arp targets timeout; 0 for any (default), 1 for all " ) ;
2008-05-18 08:10:14 +04:00
module_param ( fail_over_mac , charp , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( fail_over_mac , " For active-backup, do not set all slaves to "
" the same MAC; 0 for none (default), "
" 1 for active, 2 for follow " ) ;
2010-06-02 12:39:21 +04:00
module_param ( all_slaves_active , int , 0 ) ;
MODULE_PARM_DESC ( all_slaves_active , " Keep all frames received on an interface "
2011-05-25 08:41:59 +04:00
" by setting active flag for all slaves; "
2010-06-02 12:39:21 +04:00
" 0 for never (default), 1 for always. " ) ;
2010-10-05 18:23:59 +04:00
module_param ( resend_igmp , int , 0 ) ;
2011-05-25 08:41:59 +04:00
MODULE_PARM_DESC ( resend_igmp , " Number of IGMP membership reports to send on "
" link failure " ) ;
2005-04-17 02:20:36 +04:00
/*----------------------------- Global variables ----------------------------*/
2010-10-13 20:01:50 +04:00
# ifdef CONFIG_NET_POLL_CONTROLLER
net: Convert netpoll blocking api in bonding driver to be a counter
A while back I made some changes to enable netpoll in the bonding driver. Among
them was a per-cpu flag that indicated we were in a path that held locks which
could cause the netpoll path to block during tx, and as such the tx path
should queue the frame for later use. This appears to have given rise to a
regression. If one of those paths on which we hold the per-cpu flag yields the
cpu, it's possible for us to come back on a different cpu, leading to us clearing
a different flag than we set. This results in odd netpoll drops, and BUG
backtraces appearing in the log, as we check to make sure that we only clear set
bits, and only set clear bits. I had thought briefly about changing the
offending paths so that they wouldn't sleep, but looking at my original work
more closely, it doesn't appear that a per-cpu flag is warranted. We already
gate the checking of this flag on IFF_IN_NETPOLL, so we don't hit this in the
normal tx case anyway. And practically speaking, the normal use case for
netpoll is to only have one client anyway, so we're not going to erroneously
queue netpoll frames when it's actually safe to do so. As such, let's just
convert that per-cpu flag to an atomic counter. It fixes the rescheduling bugs,
is equivalent from a performance perspective and actually eliminates some code
in the process.
Tested by the reporter and myself, successfully
Reported-by: Liang Zheng <lzheng@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 12:05:50 +03:00
atomic_t netpoll_block_tx = ATOMIC_INIT ( 0 ) ;
2010-10-13 20:01:50 +04:00
# endif
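The switch from a per-cpu flag to a single counter, as described in the commit message above, can be illustrated outside the kernel with C11 atomics. This is only a sketch under stated assumptions: the `demo_*` names are hypothetical, and `<stdatomic.h>` stands in for the kernel's `atomic_t` API.

```c
#include <assert.h>
#include <stdatomic.h>

/* One global counter instead of one flag per CPU (hypothetical name). */
static atomic_int demo_netpoll_block_tx = 0;

/* Enter a section during which netpoll transmit must be deferred. */
void demo_block_netpoll_tx(void)
{
	atomic_fetch_add(&demo_netpoll_block_tx, 1);
}

/* Leave such a section; safe even if the caller migrated to another CPU. */
void demo_unblock_netpoll_tx(void)
{
	int old = atomic_fetch_sub(&demo_netpoll_block_tx, 1);

	/* Every unblock must pair with an earlier block. */
	assert(old > 0);
	(void)old;
}

/* Nonzero while at least one blocking section is active. */
int demo_is_netpoll_tx_blocked(void)
{
	return atomic_load(&demo_netpoll_block_tx) != 0;
}
```

Because the counter is shared, a section that starts on one CPU and ends on another still decrements the same variable it incremented, which is the property the per-cpu flag lacked.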
2009-11-17 13:42:49 +03:00
int bond_net_id __read_mostly ;
2005-04-17 02:20:36 +04:00
2009-06-12 23:02:48 +04:00
static __be32 arp_target [ BOND_MAX_ARP_TARGETS ] ;
static int arp_ip_count ;
2005-04-17 02:20:36 +04:00
static int bond_mode = BOND_MODE_ROUNDROBIN ;
2009-06-12 23:02:48 +04:00
static int xmit_hashtype = BOND_XMIT_POLICY_LAYER2 ;
static int lacp_fast ;
2005-04-17 02:20:36 +04:00
2008-12-10 10:10:38 +03:00
const struct bond_parm_tbl bond_lacp_tbl [ ] = {
2005-04-17 02:20:36 +04:00
{ " slow " , AD_LACP_SLOW } ,
{ " fast " , AD_LACP_FAST } ,
{ NULL , - 1 } ,
} ;
2008-12-10 10:10:38 +03:00
const struct bond_parm_tbl bond_mode_tbl [ ] = {
2005-04-17 02:20:36 +04:00
{ " balance-rr " , BOND_MODE_ROUNDROBIN } ,
{ " active-backup " , BOND_MODE_ACTIVEBACKUP } ,
{ " balance-xor " , BOND_MODE_XOR } ,
{ " broadcast " , BOND_MODE_BROADCAST } ,
{ " 802.3ad " , BOND_MODE_8023AD } ,
{ " balance-tlb " , BOND_MODE_TLB } ,
{ " balance-alb " , BOND_MODE_ALB } ,
{ NULL , - 1 } ,
} ;
2008-12-10 10:10:38 +03:00
const struct bond_parm_tbl xmit_hashtype_tbl [ ] = {
2005-06-27 01:54:11 +04:00
{ " layer2 " , BOND_XMIT_POLICY_LAYER2 } ,
{ " layer3+4 " , BOND_XMIT_POLICY_LAYER34 } ,
2007-12-07 10:40:34 +03:00
{ " layer2+3 " , BOND_XMIT_POLICY_LAYER23 } ,
2005-06-27 01:54:11 +04:00
{ NULL , - 1 } ,
} ;
2013-06-24 13:49:34 +04:00
const struct bond_parm_tbl arp_all_targets_tbl [ ] = {
{ " any " , BOND_ARP_TARGETS_ANY } ,
{ " all " , BOND_ARP_TARGETS_ALL } ,
{ NULL , - 1 } ,
} ;
2008-12-10 10:10:38 +03:00
const struct bond_parm_tbl arp_validate_tbl [ ] = {
2006-09-23 08:54:53 +04:00
{ " none " , BOND_ARP_VALIDATE_NONE } ,
{ " active " , BOND_ARP_VALIDATE_ACTIVE } ,
{ " backup " , BOND_ARP_VALIDATE_BACKUP } ,
{ " all " , BOND_ARP_VALIDATE_ALL } ,
{ NULL , - 1 } ,
} ;
2008-12-10 10:10:38 +03:00
const struct bond_parm_tbl fail_over_mac_tbl [ ] = {
2008-05-18 08:10:14 +04:00
{ " none " , BOND_FOM_NONE } ,
{ " active " , BOND_FOM_ACTIVE } ,
{ " follow " , BOND_FOM_FOLLOW } ,
{ NULL , - 1 } ,
} ;
2009-09-25 07:28:09 +04:00
const struct bond_parm_tbl pri_reselect_tbl [ ] = {
{ " always " , BOND_PRI_RESELECT_ALWAYS } ,
{ " better " , BOND_PRI_RESELECT_BETTER } ,
{ " failure " , BOND_PRI_RESELECT_FAILURE } ,
{ NULL , - 1 } ,
} ;
2008-11-05 04:51:16 +03:00
struct bond_parm_tbl ad_select_tbl [ ] = {
{ " stable " , BOND_AD_STABLE } ,
{ " bandwidth " , BOND_AD_BANDWIDTH } ,
{ " count " , BOND_AD_COUNT } ,
{ NULL , - 1 } ,
} ;
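The tables above pair option names with numeric values and end with a { NULL , - 1 } sentinel, and the parameter descriptions say each option accepts either form ("0 for slow, 1 for fast"). A lookup over such a table might look like the sketch below. `demo_parm_tbl`, `demo_lacp_tbl` and `demo_parse_parm` are hypothetical names for illustration and are not the driver's own parsing routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of a name/value table entry. */
struct demo_parm_tbl {
	const char *modename;
	int mode;
};

/* Example table, terminated by a NULL name like the tables above. */
const struct demo_parm_tbl demo_lacp_tbl[] = {
	{ "slow", 0 },
	{ "fast", 1 },
	{ NULL, -1 },
};

/*
 * Return the value matching buf, which may be an option name or a number.
 * Returns -1 if nothing in the table matches.
 */
int demo_parse_parm(const char *buf, const struct demo_parm_tbl *tbl)
{
	char *end;
	long num;
	int is_num;
	int i;

	assert(buf != NULL && tbl != NULL);

	/* Treat buf as numeric only if the whole string parses as a number. */
	num = strtol(buf, &end, 0);
	is_num = (end != buf && *end == '\0');

	for (i = 0; tbl[i].modename; i++) {
		if (is_num && num == tbl[i].mode)
			return tbl[i].mode;
		if (!is_num && strcmp(buf, tbl[i].modename) == 0)
			return tbl[i].mode;
	}
	return -1;
}
```

Requiring the whole string to be numeric means a name such as "802.3ad" is still compared as a name rather than being misread as the number 802.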
2005-04-17 02:20:36 +04:00
/*-------------------------- Forward declarations ---------------------------*/
2009-06-12 23:02:52 +04:00
static int bond_init ( struct net_device * bond_dev ) ;
2009-10-29 17:18:24 +03:00
static void bond_uninit ( struct net_device * bond_dev ) ;
2005-04-17 02:20:36 +04:00
/*---------------------------- General routines -----------------------------*/
2011-03-07 00:58:46 +03:00
const char * bond_mode_name ( int mode )
2005-04-17 02:20:36 +04:00
{
2008-12-10 10:08:09 +03:00
static const char * names [ ] = {
[ BOND_MODE_ROUNDROBIN ] = " load balancing (round-robin) " ,
[ BOND_MODE_ACTIVEBACKUP ] = " fault-tolerance (active-backup) " ,
[ BOND_MODE_XOR ] = " load balancing (xor) " ,
[ BOND_MODE_BROADCAST ] = " fault-tolerance (broadcast) " ,
2009-06-12 23:02:48 +04:00
[ BOND_MODE_8023AD ] = " IEEE 802.3ad Dynamic link aggregation " ,
2008-12-10 10:08:09 +03:00
[ BOND_MODE_TLB ] = " transmit load balancing " ,
[ BOND_MODE_ALB ] = " adaptive load balancing " ,
} ;
2013-07-24 10:53:26 +04:00
if ( mode < BOND_MODE_ROUNDROBIN | | mode > BOND_MODE_ALB )
2005-04-17 02:20:36 +04:00
return " unknown " ;
2008-12-10 10:08:09 +03:00
return names [ mode ] ;
2005-04-17 02:20:36 +04:00
}
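The lookup in bond_mode_name relies on C99 designated initializers plus a range check before indexing. A stand-alone sketch of the same pattern follows; the `demo_*` enum and strings are made up for the example and do not match the kernel's BOND_MODE_* values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode ids for illustration only. */
enum demo_mode {
	DEMO_MODE_RR,
	DEMO_MODE_AB,
	DEMO_MODE_XOR,
	DEMO_MODE_COUNT
};

/* Map a mode id to a display string, falling back to "unknown". */
const char *demo_mode_name(int mode)
{
	/* Index each string by its enum value so reordering stays safe. */
	static const char *const names[] = {
		[DEMO_MODE_RR]  = "round-robin",
		[DEMO_MODE_AB]  = "active-backup",
		[DEMO_MODE_XOR] = "xor",
	};

	/* The table must cover every id, or a lookup could return NULL. */
	assert(sizeof(names) / sizeof(names[0]) == (size_t)DEMO_MODE_COUNT);

	/* Reject out-of-range ids before touching the array. */
	if (mode < DEMO_MODE_RR || mode >= DEMO_MODE_COUNT)
		return "unknown";

	return names[mode];
}
```

Calling demo_mode_name(DEMO_MODE_AB) yields "active-backup", while any id outside the enum range, including negative ones, yields "unknown".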
/*---------------------------------- VLAN -----------------------------------*/
/**
* bond_dev_queue_xmit - Prepare skb for xmit .
2009-06-12 23:02:48 +04:00
*
2005-04-17 02:20:36 +04:00
* @ bond : bond device that got this skb for tx .
* @ skb : hw accel VLAN tagged skb to transmit
* @ slave_dev : slave that is supposed to xmit this skbuff
*/
2009-06-12 23:02:48 +04:00
int bond_dev_queue_xmit ( struct bonding * bond , struct sk_buff * skb ,
struct net_device * slave_dev )
2005-04-17 02:20:36 +04:00
{
2010-12-13 11:19:28 +03:00
skb - > dev = slave_dev ;
2011-06-03 14:35:52 +04:00
2012-06-12 10:03:51 +04:00
BUILD_BUG_ON ( sizeof ( skb - > queue_mapping ) ! =
2012-07-20 06:28:49 +04:00
sizeof ( qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping ) ) ;
skb - > queue_mapping = qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping ;
2011-06-03 14:35:52 +04:00
2012-08-10 05:24:45 +04:00
if ( unlikely ( netpoll_tx_running ( bond - > dev ) ) )
2011-02-18 02:43:32 +03:00
bond_netpoll_send_skb ( bond_get_slave_by_dev ( bond , slave_dev ) , skb ) ;
2011-02-18 02:43:33 +03:00
else
2010-05-06 11:48:51 +04:00
dev_queue_xmit ( skb ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
/*
2011-07-20 08:54:46 +04:00
* In the following 2 functions , bond_vlan_rx_add_vid and bond_vlan_rx_kill_vid ,
* We don ' t protect the slave list iteration with a lock because :
2005-04-17 02:20:36 +04:00
* a . This operation is performed in IOCTL context ,
* b . The operation is protected by the RTNL semaphore in the 8021 q code ,
* c . Holding a lock with BH disabled while directly calling a base driver
* entry point is generally a BAD idea .
2009-06-12 23:02:48 +04:00
*
2005-04-17 02:20:36 +04:00
* The design of synchronization / protection for this operation in the 8021 q
* module is good for one or more VLAN devices over a single physical device
* and cannot be extended for a teaming solution like bonding , so there is a
* potential race condition here where a net device from the vlan group might
* be referenced ( either by a base driver or the 8021 q code ) while it is being
* removed from the system . However , it turns out we ' re not making matters
* worse , and if it works for regular VLAN usage it will work here too .
*/
/**
* bond_vlan_rx_add_vid - Propagates adding an id to slaves
* @ bond_dev : bonding net device that got called
* @ proto : network protocol ID
* @ vid : vlan id being added
*/
2013-04-19 06:04:28 +04:00
static int bond_vlan_rx_add_vid ( struct net_device * bond_dev ,
__be16 proto , u16 vid )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:13 +04:00
struct slave * slave , * rollback_slave ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
int res ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2013-04-19 06:04:28 +04:00
res = vlan_vid_add ( slave - > dev , proto , vid ) ;
2011-12-08 08:11:17 +04:00
if ( res )
goto unwind ;
2005-04-17 02:20:36 +04:00
}
2011-12-09 04:52:37 +04:00
return 0 ;
2011-12-08 08:11:17 +04:00
unwind :
2013-09-25 11:20:13 +04:00
/* unwind to the slave that failed */
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2013-09-25 11:20:13 +04:00
if ( rollback_slave = = slave )
break ;
vlan_vid_del ( rollback_slave - > dev , proto , vid ) ;
}
2011-12-08 08:11:17 +04:00
return res ;
2005-04-17 02:20:36 +04:00
}
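The add-then-unwind flow of bond_vlan_rx_add_vid can be reproduced in a small user-space sketch: apply an operation to each member in order and, on the first failure, undo it only on the members that already succeeded. `demo_port` and the `demo_*` functions are hypothetical stand-ins for the slave devices and the vlan_vid_add / vlan_vid_del calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slave device. */
struct demo_port {
	int broken;	/* nonzero: adding a vid to this port fails */
	int has_vid;	/* nonzero: the vid is currently installed */
};

static int demo_vid_add(struct demo_port *p)
{
	if (p->broken)
		return -1;
	p->has_vid = 1;
	return 0;
}

static void demo_vid_del(struct demo_port *p)
{
	p->has_vid = 0;
}

/* Add the vid to every port, or to none of them if any port fails. */
int demo_add_vid_all(struct demo_port *ports, size_t n)
{
	size_t i, j;
	int res;

	assert(ports != NULL || n == 0);

	for (i = 0; i < n; i++) {
		res = demo_vid_add(&ports[i]);
		if (res)
			goto unwind;
	}
	return 0;

unwind:
	/* Roll back only the ports before the one that failed. */
	for (j = 0; j < i; j++)
		demo_vid_del(&ports[j]);
	return res;
}
```

If the third of three ports fails, the first two are restored to their previous state and the error from the failing port is returned to the caller.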
/**
* bond_vlan_rx_kill_vid - Propagates deleting an id to slaves
* @ bond_dev : bonding net device that got called
* @ proto : network protocol ID
* @ vid : vlan id being removed
*/
2013-04-19 06:04:28 +04:00
static int bond_vlan_rx_kill_vid ( struct net_device * bond_dev ,
__be16 proto , u16 vid )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
struct slave * slave ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter )
2013-04-19 06:04:28 +04:00
vlan_vid_del ( slave - > dev , proto , vid ) ;
2005-04-17 02:20:36 +04:00
2013-08-29 01:25:15 +04:00
if ( bond_is_lb ( bond ) )
bond_alb_clear_vlan ( bond , vid ) ;
2011-12-09 04:52:37 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
/*------------------------------- Link status -------------------------------*/
2006-03-28 01:27:43 +04:00
/*
* Set the carrier state for the master according to the state of its
* slaves . If any slaves are up , the master is up . In 802.3 ad mode ,
* do special 802.3 ad magic .
*
* Returns zero if carrier state does not change , nonzero if it does .
*/
static int bond_set_carrier ( struct bonding * bond )
{
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2006-03-28 01:27:43 +04:00
struct slave * slave ;
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) )
2006-03-28 01:27:43 +04:00
goto down ;
if ( bond - > params . mode = = BOND_MODE_8023AD )
return bond_3ad_set_carrier ( bond ) ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2006-03-28 01:27:43 +04:00
if ( slave - > link = = BOND_LINK_UP ) {
if ( ! netif_carrier_ok ( bond - > dev ) ) {
netif_carrier_on ( bond - > dev ) ;
return 1 ;
}
return 0 ;
}
}
down :
if ( netif_carrier_ok ( bond - > dev ) ) {
netif_carrier_off ( bond - > dev ) ;
return 1 ;
}
return 0 ;
}
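/*
 * Illustration only, not part of the driver: a minimal userspace sketch of
 * the rule bond_set_carrier() applies outside 802.3ad mode. The master's
 * carrier follows "any slave up", and the return value reports whether the
 * carrier state changed. All names below (cm_bond, CM_LINK_*,
 * cm_set_carrier) are hypothetical stand-ins, not kernel APIs.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cm_link { CM_LINK_DOWN, CM_LINK_UP };

struct cm_bond {
	bool carrier;                /* stands in for netif_carrier_ok() */
	const enum cm_link *links;   /* one entry per slave */
	size_t nslaves;
};

/* Returns 1 if the carrier state changed, 0 if it stayed the same. */
static int cm_set_carrier(struct cm_bond *b)
{
	bool want = false;
	size_t i;

	assert(b != NULL);

	/* An empty bond, or one with no slave up, wants carrier off. */
	for (i = 0; i < b->nslaves; i++) {
		if (b->links[i] == CM_LINK_UP) {
			want = true;
			break;
		}
	}

	if (b->carrier == want)
		return 0;
	b->carrier = want;
	return 1;
}
```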
2005-04-17 02:20:36 +04:00
/*
* Get link speed and duplex from the slave ' s base driver
* using ethtool . If for some reason the call fails or the
bonding:update speed/duplex for NETDEV_CHANGE
Zheng Liang(lzheng@redhat.com) found a bug that if we config bonding with
arp monitor, sometimes bonding driver cannot get the speed and duplex from
its slaves, it will assume them to be 100Mb/sec and Full, please see
/proc/net/bonding/bond0.
But there is no such problem when uses miimon.
(Take igb for example)
I find that the reason is that after dev_open() in bond_enslave(),
bond_update_speed_duplex() will call igb_get_settings()
, but in that function,
it runs ethtool_cmd_speed_set(ecmd, -1); ecmd->duplex = -1;
because igb get an error value of status.
So even dev_open() is called, but the device is not really ready to get its
settings.
Maybe it is safe for us to call igb_get_settings() only after
this message shows up, that is "igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: RX".
So I prefer to update the speed and duplex for a slave when it receives a
NETDEV_CHANGE/NETDEV_UP event.
Changelog
V2:
1 remove the "fake 100/Full" logic in bond_update_speed_duplex(),
set speed and duplex to -1 when it gets error value of speed and duplex.
2 delete the warning in bond_enslave() if bond_update_speed_duplex() returns
error.
3 make bond_info_show_slave() handle bad values of speed and duplex.
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-31 21:20:48 +04:00
* values are invalid , set speed and duplex to - 1 ,
2012-04-26 15:20:30 +04:00
* and return .
2005-04-17 02:20:36 +04:00
*/
2012-04-26 15:20:30 +04:00
static void bond_update_speed_duplex ( struct slave * slave )
2005-04-17 02:20:36 +04:00
{
struct net_device * slave_dev = slave - > dev ;
2011-09-03 07:34:30 +04:00
struct ethtool_cmd ecmd ;
2011-04-13 19:22:31 +04:00
u32 slave_speed ;
2007-08-01 01:00:02 +04:00
int res ;
2005-04-17 02:20:36 +04:00
2011-11-04 12:21:38 +04:00
slave - > speed = SPEED_UNKNOWN ;
slave - > duplex = DUPLEX_UNKNOWN ;
2005-04-17 02:20:36 +04:00
2011-09-03 07:34:30 +04:00
res = __ethtool_get_settings ( slave_dev , & ecmd ) ;
2007-08-01 01:00:02 +04:00
if ( res < 0 )
2012-04-26 15:20:30 +04:00
return ;
2005-04-17 02:20:36 +04:00
2011-09-03 07:34:30 +04:00
slave_speed = ethtool_cmd_speed ( & ecmd ) ;
2011-06-01 14:36:33 +04:00
if ( slave_speed = = 0 | | slave_speed = = ( ( __u32 ) - 1 ) )
2012-04-26 15:20:30 +04:00
return ;
2005-04-17 02:20:36 +04:00
2011-09-03 07:34:30 +04:00
switch ( ecmd . duplex ) {
2005-04-17 02:20:36 +04:00
case DUPLEX_FULL :
case DUPLEX_HALF :
break ;
default :
2012-04-26 15:20:30 +04:00
return ;
2005-04-17 02:20:36 +04:00
}
2011-04-13 19:22:31 +04:00
slave - > speed = slave_speed ;
2011-09-03 07:34:30 +04:00
slave - > duplex = ecmd . duplex ;
2005-04-17 02:20:36 +04:00
2012-04-26 15:20:30 +04:00
return ;
2005-04-17 02:20:36 +04:00
}
/*
* if < dev > supports MII link status reporting , check its link status .
*
* We either do MII / ETHTOOL ioctls , or check netif_carrier_ok ( ) ,
2009-06-12 23:02:48 +04:00
* depending upon the setting of the use_carrier parameter .
2005-04-17 02:20:36 +04:00
*
* Return either BMSR_LSTATUS , meaning that the link is up ( or we
* can ' t tell and just pretend it is ) , or 0 , meaning that the link is
* down .
*
* If reporting is non - zero , instead of faking link up , return - 1 if
* both ETHTOOL and MII ioctls fail ( meaning the device does not
* support them ) . If use_carrier is set , return whatever it says .
* It ' d be nice if there was a good way to tell if a driver supports
* netif_carrier , but there really isn ' t .
*/
2009-06-12 23:02:48 +04:00
static int bond_check_dev_link ( struct bonding * bond ,
struct net_device * slave_dev , int reporting )
2005-04-17 02:20:36 +04:00
{
2008-11-20 08:56:05 +03:00
const struct net_device_ops * slave_ops = slave_dev - > netdev_ops ;
2009-10-29 08:23:54 +03:00
int ( * ioctl ) ( struct net_device * , struct ifreq * , int ) ;
2005-04-17 02:20:36 +04:00
struct ifreq ifr ;
struct mii_ioctl_data * mii ;
2009-08-28 16:05:15 +04:00
if ( ! reporting & & ! netif_running ( slave_dev ) )
return 0 ;
2008-11-20 08:56:05 +03:00
if ( bond - > params . use_carrier )
2005-04-17 02:20:36 +04:00
return netif_carrier_ok ( slave_dev ) ? BMSR_LSTATUS : 0 ;
2009-04-24 05:58:23 +04:00
/* Try to get link status using Ethtool first. */
2012-12-07 10:15:32 +04:00
if ( slave_dev - > ethtool_ops - > get_link )
return slave_dev - > ethtool_ops - > get_link ( slave_dev ) ?
BMSR_LSTATUS : 0 ;
2009-04-24 05:58:23 +04:00
2009-06-12 23:02:48 +04:00
/* Ethtool can't be used, fallback to MII ioctls. */
2008-11-20 08:56:05 +03:00
ioctl = slave_ops - > ndo_do_ioctl ;
2005-04-17 02:20:36 +04:00
if ( ioctl ) {
/* TODO: set pointer to correct ioctl on a per-team-member */
/* basis to make this more efficient. That is, once */
/* we determine the correct ioctl, we will always */
/* call it and not the others for that team */
/* member. */
/*
* We cannot assume that SIOCGMIIPHY will also read a
* register ; not all network drivers ( e . g . , e100 )
* support that .
*/
/* Yes, the mii is overlaid on the ifreq.ifr_ifru */
strncpy ( ifr . ifr_name , slave_dev - > name , IFNAMSIZ ) ;
mii = if_mii ( & ifr ) ;
if ( IOCTL ( slave_dev , & ifr , SIOCGMIIPHY ) = = 0 ) {
mii - > reg_num = MII_BMSR ;
2009-06-12 23:02:48 +04:00
if ( IOCTL ( slave_dev , & ifr , SIOCGMIIREG ) = = 0 )
return mii - > val_out & BMSR_LSTATUS ;
2005-04-17 02:20:36 +04:00
}
}
/*
* If reporting, report that either there's no ndo_do_ioctl,
2007-08-01 01:00:02 +04:00
* or both SIOCGMIIREG and get_link failed ( meaning that we
2005-04-17 02:20:36 +04:00
* cannot report link status ) . If not reporting , pretend
* we ' re ok .
*/
2009-06-12 23:02:48 +04:00
return reporting ? - 1 : BMSR_LSTATUS ;
2005-04-17 02:20:36 +04:00
}
/*----------------------------- Multicast list ------------------------------*/
/*
* Push the promiscuity flag down to appropriate slaves
*/
2008-07-15 07:51:36 +04:00
static int bond_set_promiscuity ( struct bonding * bond , int inc )
2005-04-17 02:20:36 +04:00
{
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2008-07-15 07:51:36 +04:00
int err = 0 ;
2013-09-25 11:20:14 +04:00
2005-04-17 02:20:36 +04:00
if ( USES_PRIMARY ( bond - > params . mode ) ) {
/* write lock already acquired */
if ( bond - > curr_active_slave ) {
2008-07-15 07:51:36 +04:00
err = dev_set_promiscuity ( bond - > curr_active_slave - > dev ,
inc ) ;
2005-04-17 02:20:36 +04:00
}
} else {
struct slave * slave ;
2013-08-01 18:54:47 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-15 07:51:36 +04:00
err = dev_set_promiscuity ( slave - > dev , inc ) ;
if ( err )
return err ;
2005-04-17 02:20:36 +04:00
}
}
2008-07-15 07:51:36 +04:00
return err ;
2005-04-17 02:20:36 +04:00
}
/*
* Push the allmulti flag down to all slaves
*/
2008-07-15 07:51:36 +04:00
static int bond_set_allmulti ( struct bonding * bond , int inc )
2005-04-17 02:20:36 +04:00
{
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2008-07-15 07:51:36 +04:00
int err = 0 ;
2013-09-25 11:20:14 +04:00
2005-04-17 02:20:36 +04:00
if ( USES_PRIMARY ( bond - > params . mode ) ) {
/* write lock already acquired */
if ( bond - > curr_active_slave ) {
2008-07-15 07:51:36 +04:00
err = dev_set_allmulti ( bond - > curr_active_slave - > dev ,
inc ) ;
2005-04-17 02:20:36 +04:00
}
} else {
struct slave * slave ;
2013-08-01 18:54:47 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-15 07:51:36 +04:00
err = dev_set_allmulti ( slave - > dev , inc ) ;
if ( err )
return err ;
2005-04-17 02:20:36 +04:00
}
}
2008-07-15 07:51:36 +04:00
return err ;
2005-04-17 02:20:36 +04:00
}
2010-10-05 18:23:57 +04:00
/*
* Retrieve the list of registered multicast addresses for the bonding
* device and retransmit an IGMP JOIN request to the current active
* slave .
*/
static void bond_resend_igmp_join_requests ( struct bonding * bond )
{
2013-07-20 14:13:53 +04:00
if ( ! rtnl_trylock ( ) ) {
2013-08-01 13:51:42 +04:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , 1 ) ;
2013-07-20 14:13:53 +04:00
return ;
2010-10-05 18:23:57 +04:00
}
2013-07-20 14:13:53 +04:00
call_netdevice_notifiers ( NETDEV_RESEND_IGMP , bond - > dev ) ;
rtnl_unlock ( ) ;
2010-10-05 18:23:57 +04:00
2013-06-12 02:07:02 +04:00
/* We use curr_slave_lock to protect against concurrent access to
* igmp_retrans from multiple running instances of this function and
* bond_change_active_slave
*/
write_lock_bh ( & bond - > curr_slave_lock ) ;
if ( bond - > igmp_retrans > 1 ) {
bond - > igmp_retrans - - ;
2010-10-05 18:23:59 +04:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , HZ / 5 ) ;
2013-06-12 02:07:02 +04:00
}
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2010-10-05 18:23:57 +04:00
}
2010-10-15 15:02:56 +04:00
static void bond_resend_igmp_join_requests_delayed ( struct work_struct * work )
2010-10-05 18:23:57 +04:00
{
struct bonding * bond = container_of ( work , struct bonding ,
2011-05-25 12:38:58 +04:00
mcast_work . work ) ;
2013-03-26 08:10:02 +04:00
2010-10-05 18:23:57 +04:00
bond_resend_igmp_join_requests ( bond ) ;
}
2013-05-31 15:57:30 +04:00
/* Flush bond's hardware addresses from slave
2005-04-17 02:20:36 +04:00
*/
2013-05-31 15:57:30 +04:00
static void bond_hw_addr_flush ( struct net_device * bond_dev ,
2009-06-12 23:02:48 +04:00
struct net_device * slave_dev )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
2013-05-31 15:57:30 +04:00
dev_uc_unsync ( slave_dev , bond_dev ) ;
dev_mc_unsync ( slave_dev , bond_dev ) ;
2005-04-17 02:20:36 +04:00
if ( bond - > params . mode = = BOND_MODE_8023AD ) {
/* del lacpdu mc addr from mc list */
u8 lacpdu_multicast [ ETH_ALEN ] = MULTICAST_LACPDU_ADDR ;
2010-04-02 01:22:57 +04:00
dev_mc_del ( slave_dev , lacpdu_multicast ) ;
2005-04-17 02:20:36 +04:00
}
}
/*--------------------------- Active slave change ---------------------------*/
2013-05-31 15:57:30 +04:00
/* Update the hardware address list and promisc/allmulti for the new and
* old active slaves ( if any ) . Modes that are ! USES_PRIMARY keep all
* slaves up to date at all times; only the USES_PRIMARY modes need to call
* this function to swap these settings during a failover .
2005-04-17 02:20:36 +04:00
*/
2013-05-31 15:57:30 +04:00
static void bond_hw_addr_swap ( struct bonding * bond , struct slave * new_active ,
struct slave * old_active )
2005-04-17 02:20:36 +04:00
{
2013-08-05 16:56:06 +04:00
ASSERT_RTNL ( ) ;
2005-04-17 02:20:36 +04:00
if ( old_active ) {
2009-06-12 23:02:48 +04:00
if ( bond - > dev - > flags & IFF_PROMISC )
2005-04-17 02:20:36 +04:00
dev_set_promiscuity ( old_active - > dev , - 1 ) ;
2009-06-12 23:02:48 +04:00
if ( bond - > dev - > flags & IFF_ALLMULTI )
2005-04-17 02:20:36 +04:00
dev_set_allmulti ( old_active - > dev , - 1 ) ;
2013-05-31 15:57:30 +04:00
bond_hw_addr_flush ( bond - > dev , old_active - > dev ) ;
2005-04-17 02:20:36 +04:00
}
if ( new_active ) {
2008-07-15 07:51:36 +04:00
/* FIXME: Signal errors upstream. */
2009-06-12 23:02:48 +04:00
if ( bond - > dev - > flags & IFF_PROMISC )
2005-04-17 02:20:36 +04:00
dev_set_promiscuity ( new_active - > dev , 1 ) ;
2009-06-12 23:02:48 +04:00
if ( bond - > dev - > flags & IFF_ALLMULTI )
2005-04-17 02:20:36 +04:00
dev_set_allmulti ( new_active - > dev , 1 ) ;
2013-04-18 11:33:38 +04:00
netif_addr_lock_bh ( bond - > dev ) ;
2013-05-31 15:57:30 +04:00
dev_uc_sync ( new_active - > dev , bond - > dev ) ;
dev_mc_sync ( new_active - > dev , bond - > dev ) ;
2013-04-18 11:33:38 +04:00
netif_addr_unlock_bh ( bond - > dev ) ;
2005-04-17 02:20:36 +04:00
}
}
2013-06-26 19:13:39 +04:00
/**
* bond_set_dev_addr - clone slave ' s address to bond
* @ bond_dev : bond net device
* @ slave_dev : slave net device
*
* Should be called with RTNL held .
*/
static void bond_set_dev_addr ( struct net_device * bond_dev ,
struct net_device * slave_dev )
{
2013-06-29 15:16:59 +04:00
pr_debug ( " bond_dev=%p slave_dev=%p slave_dev->addr_len=%d \n " ,
bond_dev , slave_dev , slave_dev - > addr_len ) ;
2013-06-26 19:13:39 +04:00
memcpy ( bond_dev - > dev_addr , slave_dev - > dev_addr , slave_dev - > addr_len ) ;
bond_dev - > addr_assign_type = NET_ADDR_STOLEN ;
call_netdevice_notifiers ( NETDEV_CHANGEADDR , bond_dev ) ;
}
2008-05-18 08:10:14 +04:00
/*
* bond_do_fail_over_mac
*
* Perform special MAC address swapping for fail_over_mac settings
*
* Called with RTNL , bond - > lock for read , curr_slave_lock for write_bh .
*/
static void bond_do_fail_over_mac ( struct bonding * bond ,
struct slave * new_active ,
struct slave * old_active )
2009-02-14 14:15:33 +03:00
__releases ( & bond - > curr_slave_lock )
__releases ( & bond - > lock )
__acquires ( & bond - > lock )
__acquires ( & bond - > curr_slave_lock )
2008-05-18 08:10:14 +04:00
{
u8 tmp_mac [ ETH_ALEN ] ;
struct sockaddr saddr ;
int rv ;
switch ( bond - > params . fail_over_mac ) {
case BOND_FOM_ACTIVE :
2012-03-27 23:18:24 +04:00
if ( new_active ) {
write_unlock_bh ( & bond - > curr_slave_lock ) ;
read_unlock ( & bond - > lock ) ;
2013-06-26 19:13:39 +04:00
bond_set_dev_addr ( bond - > dev , new_active - > dev ) ;
2012-03-27 23:18:24 +04:00
read_lock ( & bond - > lock ) ;
write_lock_bh ( & bond - > curr_slave_lock ) ;
}
2008-05-18 08:10:14 +04:00
break ;
case BOND_FOM_FOLLOW :
/*
* if new_active & & old_active , swap them
* if just old_active , do nothing ( going to no active slave )
* if just new_active , set new_active to bond ' s MAC
*/
if ( ! new_active )
return ;
write_unlock_bh ( & bond - > curr_slave_lock ) ;
read_unlock ( & bond - > lock ) ;
if ( old_active ) {
memcpy ( tmp_mac , new_active - > dev - > dev_addr , ETH_ALEN ) ;
memcpy ( saddr . sa_data , old_active - > dev - > dev_addr ,
ETH_ALEN ) ;
saddr . sa_family = new_active - > dev - > type ;
} else {
memcpy ( saddr . sa_data , bond - > dev - > dev_addr , ETH_ALEN ) ;
saddr . sa_family = bond - > dev - > type ;
}
rv = dev_set_mac_address ( new_active - > dev , & saddr ) ;
if ( rv ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error %d setting MAC of slave %s \n " ,
2008-05-18 08:10:14 +04:00
bond - > dev - > name , - rv , new_active - > dev - > name ) ;
goto out ;
}
if ( ! old_active )
goto out ;
memcpy ( saddr . sa_data , tmp_mac , ETH_ALEN ) ;
saddr . sa_family = old_active - > dev - > type ;
rv = dev_set_mac_address ( old_active - > dev , & saddr ) ;
if ( rv )
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error %d setting MAC of slave %s \n " ,
2008-05-18 08:10:14 +04:00
bond->dev->name, -rv, old_active->dev->name);
out :
read_lock ( & bond - > lock ) ;
write_lock_bh ( & bond - > curr_slave_lock ) ;
break ;
default :
2009-12-14 07:06:07 +03:00
pr_err ( " %s: bond_do_fail_over_mac impossible: bad policy %d \n " ,
2008-05-18 08:10:14 +04:00
bond - > dev - > name , bond - > params . fail_over_mac ) ;
break ;
}
}
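/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * BOND_FOM_FOLLOW address handling described in the comment inside
 * bond_do_fail_over_mac(). With both slaves present their MACs are swapped;
 * with only a new active slave it takes the bond's MAC; with no new active
 * slave nothing happens. fm_follow_swap and FM_ETH_ALEN are hypothetical
 * names, and the error paths of dev_set_mac_address() are not modelled.
 */
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FM_ETH_ALEN 6

/* new_mac and old_mac may be NULL to model a missing slave. */
static void fm_follow_swap(const uint8_t *bond_mac,
			   uint8_t *new_mac, uint8_t *old_mac)
{
	uint8_t tmp[FM_ETH_ALEN];

	assert(bond_mac != NULL);

	if (!new_mac)
		return;	/* going to no active slave: do nothing */

	if (old_mac) {
		/* new active takes old active's MAC, and vice versa */
		memcpy(tmp, new_mac, FM_ETH_ALEN);
		memcpy(new_mac, old_mac, FM_ETH_ALEN);
		memcpy(old_mac, tmp, FM_ETH_ALEN);
	} else {
		/* first active slave: adopt the bond's MAC */
		memcpy(new_mac, bond_mac, FM_ETH_ALEN);
	}
}
```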
2009-09-25 07:28:09 +04:00
static bool bond_should_change_active ( struct bonding * bond )
{
struct slave * prim = bond - > primary_slave ;
struct slave * curr = bond - > curr_active_slave ;
if ( ! prim | | ! curr | | curr - > link ! = BOND_LINK_UP )
return true ;
if ( bond - > force_primary ) {
bond - > force_primary = false ;
return true ;
}
if ( bond - > params . primary_reselect = = BOND_PRI_RESELECT_BETTER & &
( prim - > speed < curr - > speed | |
( prim - > speed = = curr - > speed & & prim - > duplex < = curr - > duplex ) ) )
return false ;
if ( bond - > params . primary_reselect = = BOND_PRI_RESELECT_FAILURE )
return false ;
return true ;
}
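/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * decision order in bond_should_change_active(). The checks run in this
 * order: missing primary or current slave (or current slave not up), then
 * force_primary (which is consumed), then the "better" policy, then the
 * "failure" policy. The rs_* names are hypothetical stand-ins for the
 * bonding structures and BOND_PRI_RESELECT_* values.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rs_policy { RS_ALWAYS, RS_BETTER, RS_FAILURE };

struct rs_slave {
	bool link_up;
	unsigned int speed;
	unsigned int duplex;	/* 0 = half, 1 = full */
};

struct rs_bond {
	const struct rs_slave *primary;
	const struct rs_slave *curr;
	bool force_primary;
	enum rs_policy policy;
};

static bool rs_should_change_active(struct rs_bond *b)
{
	const struct rs_slave *prim, *curr;

	assert(b != NULL);
	prim = b->primary;
	curr = b->curr;

	if (!prim || !curr || !curr->link_up)
		return true;

	if (b->force_primary) {
		b->force_primary = false;	/* one-shot override */
		return true;
	}

	/* "better": stay put unless the primary is strictly better */
	if (b->policy == RS_BETTER &&
	    (prim->speed < curr->speed ||
	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
		return false;

	/* "failure": only reselect when the current slave has failed */
	if (b->policy == RS_FAILURE)
		return false;

	return true;
}
```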
2008-05-18 08:10:14 +04:00
2005-04-17 02:20:36 +04:00
/**
* bond_find_best_slave - select the best available slave to be the active one
* @ bond : our bonding struct
*/
static struct slave * bond_find_best_slave ( struct bonding * bond )
{
2013-09-25 11:20:18 +04:00
struct slave * slave , * bestslave = NULL ;
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
int mintime = bond - > params . updelay ;
2013-09-25 11:20:18 +04:00
if ( bond - > primary_slave & & bond - > primary_slave - > link = = BOND_LINK_UP & &
bond_should_change_active ( bond ) )
return bond - > primary_slave ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:18 +04:00
bond_for_each_slave ( bond , slave , iter ) {
if ( slave - > link = = BOND_LINK_UP )
return slave ;
if ( slave - > link = = BOND_LINK_BACK & & IS_UP ( slave - > dev ) & &
slave - > delay < mintime ) {
mintime = slave - > delay ;
bestslave = slave ;
2005-04-17 02:20:36 +04:00
}
}
return bestslave ;
}
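/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * fallback loop in bond_find_best_slave(), leaving out the primary-slave
 * shortcut. The first slave whose link is up wins immediately; otherwise
 * the slave in the "back" state with the smallest remaining delay below
 * updelay is chosen. The fb_* names are hypothetical, and the function
 * returns an index (or -1) where the driver returns a pointer (or NULL).
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fb_link { FB_LINK_DOWN, FB_LINK_BACK, FB_LINK_UP };

struct fb_slave {
	enum fb_link link;
	bool dev_up;	/* stands in for IS_UP(slave->dev) */
	int delay;	/* remaining updelay ticks */
};

static int fb_find_best(const struct fb_slave *s, size_t n, int updelay)
{
	int best = -1;
	int mintime = updelay;
	size_t i;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (s[i].link == FB_LINK_UP)
			return (int)i;
		if (s[i].link == FB_LINK_BACK && s[i].dev_up &&
		    s[i].delay < mintime) {
			mintime = s[i].delay;
			best = (int)i;
		}
	}
	return best;
}
```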
2011-04-26 19:25:52 +04:00
static bool bond_should_notify_peers ( struct bonding * bond )
{
struct slave * slave = bond - > curr_active_slave ;
pr_debug ( " bond_should_notify_peers: bond %s slave %s \n " ,
bond - > dev - > name , slave ? slave - > dev - > name : " NULL " ) ;
if ( ! slave | | ! bond - > send_peer_notif | |
test_bit ( __LINK_STATE_LINKWATCH_PENDING , & slave - > dev - > state ) )
return false ;
return true ;
}
2005-04-17 02:20:36 +04:00
/**
* bond_change_active_slave - change the active slave into the specified one
* @bond: our bonding struct
* @new_active: the new slave to make the active one
*
* Set the new slave to the bond's settings and unset them on the old
* curr_active_slave.
* Settings include flags, mc-list, promiscuity, allmulti, etc.
*
* If @new_active's link state is %BOND_LINK_BACK we'll set it to %BOND_LINK_UP,
* because it is apparently the best available slave we have, even though its
* updelay hasn't timed out yet.
*
2008-05-18 08:10:14 +04:00
* If new_active is not NULL , caller must hold bond - > lock for read and
* curr_slave_lock for write_bh .
2005-04-17 02:20:36 +04:00
*/
2005-11-09 21:35:51 +03:00
void bond_change_active_slave ( struct bonding * bond , struct slave * new_active )
2005-04-17 02:20:36 +04:00
{
struct slave * old_active = bond - > curr_active_slave ;
2009-06-12 23:02:48 +04:00
if ( old_active = = new_active )
2005-04-17 02:20:36 +04:00
return ;
if ( new_active ) {
2008-05-18 08:10:13 +04:00
new_active - > jiffies = jiffies ;
2005-04-17 02:20:36 +04:00
if ( new_active - > link = = BOND_LINK_BACK ) {
if ( USES_PRIMARY ( bond - > params . mode ) ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: making interface %s the new active one %d ms earlier. \n " ,
bond - > dev - > name , new_active - > dev - > name ,
( bond - > params . updelay - new_active - > delay ) * bond - > params . miimon ) ;
2005-04-17 02:20:36 +04:00
}
new_active - > delay = 0 ;
new_active - > link = BOND_LINK_UP ;
2009-06-12 23:02:48 +04:00
if ( bond - > params . mode = = BOND_MODE_8023AD )
2005-04-17 02:20:36 +04:00
bond_3ad_handle_link_change ( new_active , BOND_LINK_UP ) ;
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) )
2005-04-17 02:20:36 +04:00
bond_alb_handle_link_change ( bond , new_active , BOND_LINK_UP ) ;
} else {
if ( USES_PRIMARY ( bond - > params . mode ) ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: making interface %s the new active one. \n " ,
bond - > dev - > name , new_active - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
}
2009-06-12 23:02:48 +04:00
if ( USES_PRIMARY ( bond - > params . mode ) )
2013-05-31 15:57:30 +04:00
bond_hw_addr_swap ( bond , new_active , old_active ) ;
2005-04-17 02:20:36 +04:00
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 02:20:36 +04:00
bond_alb_handle_active_change ( bond , new_active ) ;
2006-02-22 03:36:44 +03:00
if ( old_active )
bond_set_slave_inactive_flags ( old_active ) ;
if ( new_active )
bond_set_slave_active_flags ( new_active ) ;
2005-04-17 02:20:36 +04:00
} else {
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while
doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 18:54:51 +04:00
rcu_assign_pointer ( bond - > curr_active_slave , new_active ) ;
2005-04-17 02:20:36 +04:00
}
2005-06-27 01:52:20 +04:00
if ( bond - > params . mode = = BOND_MODE_ACTIVEBACKUP ) {
2009-06-12 23:02:48 +04:00
if ( old_active )
2005-06-27 01:52:20 +04:00
bond_set_slave_inactive_flags ( old_active ) ;
if ( new_active ) {
2011-04-26 19:25:52 +04:00
bool should_notify_peers = false ;
2005-06-27 01:52:20 +04:00
bond_set_slave_active_flags ( new_active ) ;
2007-10-10 06:43:39 +04:00
2008-06-14 05:12:01 +04:00
if ( bond - > params . fail_over_mac )
bond_do_fail_over_mac ( bond , new_active ,
old_active ) ;
2008-05-18 08:10:14 +04:00
2011-04-26 19:25:52 +04:00
if ( netif_running ( bond - > dev ) ) {
bond - > send_peer_notif =
bond - > params . num_peer_notif ;
should_notify_peers =
bond_should_notify_peers ( bond ) ;
}
2008-06-14 05:12:02 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
read_unlock ( & bond - > lock ) ;
2012-08-10 02:14:57 +04:00
call_netdevice_notifiers ( NETDEV_BONDING_FAILOVER , bond - > dev ) ;
2011-04-26 19:25:52 +04:00
if ( should_notify_peers )
2012-08-10 02:14:57 +04:00
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS ,
bond - > dev ) ;
2008-06-14 05:12:02 +04:00
read_lock ( & bond - > lock ) ;
write_lock_bh ( & bond - > curr_slave_lock ) ;
2008-05-18 08:10:12 +04:00
}
2005-06-27 01:52:20 +04:00
}
2010-03-25 17:49:05 +03:00
2010-10-05 18:23:57 +04:00
/* resend IGMP joins since active slave has changed or
2011-05-25 12:38:58 +04:00
* all were sent on curr_active_slave .
* resend only if bond is brought up with the affected
* bonding modes and the retransmission is enabled */
if ( netif_running ( bond - > dev ) & & ( bond - > params . resend_igmp > 0 ) & &
( ( USES_PRIMARY ( bond - > params . mode ) & & new_active ) | |
bond - > params . mode = = BOND_MODE_ROUNDROBIN ) ) {
2010-10-05 18:23:59 +04:00
bond - > igmp_retrans = bond - > params . resend_igmp ;
2013-08-01 13:51:42 +04:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , 1 ) ;
2010-03-25 17:49:05 +03:00
}
2005-04-17 02:20:36 +04:00
}
/**
* bond_select_active_slave - select a new active slave , if needed
* @ bond : our bonding struct
*
2009-06-12 23:02:48 +04:00
* This function should be called when one of the following occurs:
2005-04-17 02:20:36 +04:00
* - The old curr_active_slave has been released or lost its link .
* - The primary_slave has got its link back .
* - A slave has got its link back and there ' s no old curr_active_slave .
*
2008-05-18 08:10:14 +04:00
* Caller must hold bond - > lock for read and curr_slave_lock for write_bh .
2005-04-17 02:20:36 +04:00
*/
2005-11-09 21:35:51 +03:00
void bond_select_active_slave ( struct bonding * bond )
2005-04-17 02:20:36 +04:00
{
struct slave * best_slave ;
2006-03-28 01:27:43 +04:00
int rv ;
2005-04-17 02:20:36 +04:00
best_slave = bond_find_best_slave ( bond ) ;
if ( best_slave ! = bond - > curr_active_slave ) {
bond_change_active_slave ( bond , best_slave ) ;
2006-03-28 01:27:43 +04:00
rv = bond_set_carrier ( bond ) ;
if ( ! rv )
return ;
if ( netif_carrier_ok ( bond - > dev ) ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: first active interface up! \n " ,
bond - > dev - > name ) ;
2006-03-28 01:27:43 +04:00
} else {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: now running without any active interface ! \n " ,
bond - > dev - > name ) ;
2006-03-28 01:27:43 +04:00
}
2005-04-17 02:20:36 +04:00
}
}
/*--------------------------- slave list handling ---------------------------*/
/*
* This function attaches the slave to the end of list .
*
* bond - > lock held for writing by caller .
*/
static void bond_attach_slave ( struct bonding * bond , struct slave * new_slave )
{
2013-08-01 18:54:51 +04:00
list_add_tail_rcu ( & new_slave - > list , & bond - > slave_list ) ;
2005-04-17 02:20:36 +04:00
bond - > slave_cnt + + ;
}
/*
* This function detaches the slave from the list .
* WARNING : no check is made to verify if the slave effectively
* belongs to < bond > .
* Nothing is freed on return , structures are just unchained .
* If any slave pointer in bond was pointing to < slave > ,
* it should be changed by the calling function .
*
* bond - > lock held for writing by caller .
*/
static void bond_detach_slave ( struct bonding * bond , struct slave * slave )
{
2013-08-01 18:54:51 +04:00
list_del_rcu ( & slave - > list ) ;
2005-04-17 02:20:36 +04:00
bond - > slave_cnt - - ;
}
2010-05-06 11:48:51 +04:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2011-02-18 02:43:32 +03:00
static inline int slave_enable_netpoll ( struct slave * slave )
2010-05-06 11:48:51 +04:00
{
2011-02-18 02:43:32 +03:00
struct netpoll * np ;
int err = 0 ;
2010-05-06 11:48:51 +04:00
2012-08-10 05:24:37 +04:00
np = kzalloc ( sizeof ( * np ) , GFP_ATOMIC ) ;
2011-02-18 02:43:32 +03:00
err = - ENOMEM ;
if ( ! np )
goto out ;
2012-08-10 05:24:37 +04:00
err = __netpoll_setup ( np , slave - > dev , GFP_ATOMIC ) ;
2011-02-18 02:43:32 +03:00
if ( err ) {
kfree ( np ) ;
goto out ;
2010-05-06 11:48:51 +04:00
}
2011-02-18 02:43:32 +03:00
slave - > np = np ;
out :
return err ;
}
static inline void slave_disable_netpoll ( struct slave * slave )
{
struct netpoll * np = slave - > np ;
if ( ! np )
return ;
slave - > np = NULL ;
2013-02-11 14:25:30 +04:00
__netpoll_free_async ( np ) ;
2011-02-18 02:43:32 +03:00
}
static inline bool slave_dev_support_netpoll ( struct net_device * slave_dev )
{
if ( slave_dev - > priv_flags & IFF_DISABLE_NETPOLL )
return false ;
if ( ! slave_dev - > netdev_ops - > ndo_poll_controller )
return false ;
return true ;
2010-05-06 11:48:51 +04:00
}
static void bond_poll_controller ( struct net_device * bond_dev )
{
2011-02-18 02:43:32 +03:00
}
2013-07-23 11:25:27 +04:00
static void bond_netpoll_cleanup ( struct net_device * bond_dev )
2011-02-18 02:43:32 +03:00
{
2013-07-23 11:25:27 +04:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2010-10-13 20:01:49 +04:00
struct slave * slave ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter )
2011-02-18 02:43:32 +03:00
if ( IS_UP ( slave - > dev ) )
slave_disable_netpoll ( slave ) ;
2010-05-06 11:48:51 +04:00
}
2011-02-18 02:43:32 +03:00
2012-08-10 05:24:37 +04:00
static int bond_netpoll_setup ( struct net_device * dev , struct netpoll_info * ni , gfp_t gfp )
2011-02-18 02:43:32 +03:00
{
struct bonding * bond = netdev_priv ( dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2010-05-06 11:48:51 +04:00
struct slave * slave ;
2013-08-01 18:54:47 +04:00
int err = 0 ;
2010-05-06 11:48:51 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2011-02-18 02:43:32 +03:00
err = slave_enable_netpoll ( slave ) ;
if ( err ) {
2013-07-23 11:25:27 +04:00
bond_netpoll_cleanup ( dev ) ;
2011-02-18 02:43:32 +03:00
break ;
2010-05-06 11:48:51 +04:00
}
}
2011-02-18 02:43:32 +03:00
return err ;
2010-05-06 11:48:51 +04:00
}
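The enable-or-roll-back loop in bond_netpoll_setup() above can be sketched in userspace C. This is only an illustration under assumptions: struct demo_port, demo_enable(), demo_disable() and demo_enable_all() are hypothetical names, the fail_on_enable field is a test hook, and the rollback disables every port unconditionally, whereas bond_netpoll_cleanup() only touches slaves that are up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a slave; not a kernel structure. */
struct demo_port {
    bool fail_on_enable; /* test hook: force demo_enable() to fail */
    bool enabled;
};

/* Returns 0 on success, a negative value on failure. */
static int demo_enable(struct demo_port *p)
{
    if (p->fail_on_enable)
        return -1;
    p->enabled = true;
    return 0;
}

static void demo_disable(struct demo_port *p)
{
    p->enabled = false;
}

/* Enable every port; on the first failure, undo all ports and stop. */
static int demo_enable_all(struct demo_port *ports, size_t n)
{
    size_t i, j;
    int err = 0;

    assert(ports != NULL || n == 0);
    for (i = 0; i < n; i++) {
        err = demo_enable(&ports[i]);
        if (err) {
            for (j = 0; j < n; j++)
                demo_disable(&ports[j]);
            break;
        }
    }
    return err;
}
```

The caller sees either all ports enabled (return value 0) or none enabled (non-zero return), which is the property the setup path relies on.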
2011-02-18 02:43:32 +03:00
# else
static inline int slave_enable_netpoll ( struct slave * slave )
{
return 0 ;
}
static inline void slave_disable_netpoll ( struct slave * slave )
{
}
2010-05-06 11:48:51 +04:00
static void bond_netpoll_cleanup ( struct net_device * bond_dev )
{
}
# endif
2005-04-17 02:20:36 +04:00
/*---------------------------------- IOCTL ----------------------------------*/
2011-11-15 19:29:55 +04:00
static netdev_features_t bond_fix_features ( struct net_device * dev ,
2013-09-02 15:51:41 +04:00
netdev_features_t features )
2005-08-23 09:34:53 +04:00
{
2011-05-07 07:22:17 +04:00
struct bonding * bond = netdev_priv ( dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2011-11-15 19:29:55 +04:00
netdev_features_t mask ;
2013-09-02 15:51:41 +04:00
struct slave * slave ;
2008-10-23 12:11:29 +04:00
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) ) {
2011-05-07 07:22:17 +04:00
/* Disable adding VLANs to empty bond. But why? --mq */
features | = NETIF_F_VLAN_CHALLENGED ;
2013-09-02 15:51:41 +04:00
return features ;
2011-05-07 07:22:17 +04:00
}
2008-10-23 12:11:29 +04:00
2011-05-07 07:22:17 +04:00
mask = features ;
2008-10-23 12:11:29 +04:00
features & = ~ NETIF_F_ONE_FOR_ALL ;
2011-05-07 07:22:17 +04:00
features | = NETIF_F_ALL_FOR_ALL ;
2007-08-11 02:47:58 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-10-23 12:11:29 +04:00
features = netdev_increment_features ( features ,
slave - > dev - > features ,
2011-05-07 07:22:17 +04:00
mask ) ;
}
2013-05-16 11:34:53 +04:00
features = netdev_add_tso_features ( features , mask ) ;
2011-05-07 07:22:17 +04:00
return features ;
}
2011-07-13 18:10:29 +04:00
# define BOND_VLAN_FEATURES (NETIF_F_ALL_CSUM | NETIF_F_SG | \
NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
NETIF_F_HIGHDMA | NETIF_F_LRO )
2011-05-07 07:22:17 +04:00
static void bond_compute_features ( struct bonding * bond )
{
2013-09-02 15:51:42 +04:00
unsigned int flags , dst_release_flag = IFF_XMIT_DST_RELEASE ;
2011-11-15 19:29:55 +04:00
netdev_features_t vlan_features = BOND_VLAN_FEATURES ;
2013-09-25 11:20:14 +04:00
struct net_device * bond_dev = bond - > dev ;
struct list_head * iter ;
struct slave * slave ;
2011-05-07 07:22:17 +04:00
unsigned short max_hard_header_len = ETH_HLEN ;
2012-11-21 08:35:03 +04:00
unsigned int gso_max_size = GSO_MAX_SIZE ;
u16 gso_max_segs = GSO_MAX_SEGS ;
2011-05-07 07:22:17 +04:00
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) )
2011-05-07 07:22:17 +04:00
goto done ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2009-08-28 16:05:12 +04:00
vlan_features = netdev_increment_features ( vlan_features ,
2011-05-07 07:22:17 +04:00
slave - > dev - > vlan_features , BOND_VLAN_FEATURES ) ;
2012-07-17 16:19:48 +04:00
dst_release_flag & = slave - > dev - > priv_flags ;
2006-09-23 08:53:39 +04:00
if ( slave - > dev - > hard_header_len > max_hard_header_len )
max_hard_header_len = slave - > dev - > hard_header_len ;
2012-11-21 08:35:03 +04:00
gso_max_size = min ( gso_max_size , slave - > dev - > gso_max_size ) ;
gso_max_segs = min ( gso_max_segs , slave - > dev - > gso_max_segs ) ;
2006-09-23 08:53:39 +04:00
}
2005-08-23 09:34:53 +04:00
2008-10-23 12:11:29 +04:00
done :
2011-05-07 07:22:17 +04:00
bond_dev - > vlan_features = vlan_features ;
2006-09-23 08:53:39 +04:00
bond_dev - > hard_header_len = max_hard_header_len ;
2012-11-21 08:35:03 +04:00
bond_dev - > gso_max_segs = gso_max_segs ;
netif_set_gso_max_size ( bond_dev , gso_max_size ) ;
2005-08-23 09:34:53 +04:00
2012-07-17 16:19:48 +04:00
flags = bond_dev - > priv_flags & ~ IFF_XMIT_DST_RELEASE ;
bond_dev - > priv_flags = flags | dst_release_flag ;
2011-05-07 07:22:17 +04:00
netdev_change_features ( bond_dev ) ;
2005-08-23 09:34:53 +04:00
}
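The aggregation rule in bond_compute_features() above (largest hard_header_len, smallest gso_max_size and gso_max_segs across all slaves, defaults when there are no slaves) can be sketched in userspace C. This is a minimal sketch under assumptions, not the driver's code: struct demo_slave, struct demo_limits and demo_compute_limits() are hypothetical names, and the vlan_features and priv_flags handling is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the per-slave fields the function reads. */
struct demo_slave {
    unsigned short hard_header_len;
    unsigned int gso_max_size;
    unsigned short gso_max_segs;
};

/* Hypothetical container for the values written back to the bond. */
struct demo_limits {
    unsigned short hard_header_len;
    unsigned int gso_max_size;
    unsigned short gso_max_segs;
};

/*
 * Start from the defaults, then keep the largest header length and the
 * smallest GSO limits seen on any slave. With n == 0 the defaults are
 * returned unchanged, mirroring the "goto done" path.
 */
static struct demo_limits demo_compute_limits(const struct demo_slave *slaves,
                                              size_t n,
                                              struct demo_limits defaults)
{
    struct demo_limits out = defaults;
    size_t i;

    assert(slaves != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (slaves[i].hard_header_len > out.hard_header_len)
            out.hard_header_len = slaves[i].hard_header_len;
        if (slaves[i].gso_max_size < out.gso_max_size)
            out.gso_max_size = slaves[i].gso_max_size;
        if (slaves[i].gso_max_segs < out.gso_max_segs)
            out.gso_max_segs = slaves[i].gso_max_segs;
    }
    return out;
}
```

Taking the minimum of the GSO limits keeps the bond from handing any slave a segment larger than that slave can accept.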
2007-10-10 06:43:38 +04:00
static void bond_setup_by_slave ( struct net_device * bond_dev ,
struct net_device * slave_dev )
{
2008-11-21 07:14:53 +03:00
bond_dev - > header_ops = slave_dev - > header_ops ;
2007-10-10 06:43:38 +04:00
bond_dev - > type = slave_dev - > type ;
bond_dev - > hard_header_len = slave_dev - > hard_header_len ;
bond_dev - > addr_len = slave_dev - > addr_len ;
memcpy ( bond_dev - > broadcast , slave_dev - > broadcast ,
slave_dev - > addr_len ) ;
}
2011-02-23 12:05:42 +03:00
/* On bonding slaves other than the currently active slave, suppress
2011-04-19 07:48:16 +04:00
* duplicates except for alb non - mcast / bcast .
2011-02-23 12:05:42 +03:00
*/
static bool bond_should_deliver_exact_match ( struct sk_buff * skb ,
2011-03-16 11:45:23 +03:00
struct slave * slave ,
struct bonding * bond )
2011-02-23 12:05:42 +03:00
{
2011-03-16 11:46:43 +03:00
if ( bond_is_slave_inactive ( slave ) ) {
2011-03-16 11:45:23 +03:00
if ( bond - > params . mode = = BOND_MODE_ALB & &
2011-02-23 12:05:42 +03:00
skb - > pkt_type ! = PACKET_BROADCAST & &
skb - > pkt_type ! = PACKET_MULTICAST )
return false ;
return true ;
}
return false ;
}
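The decision made by bond_should_deliver_exact_match() above can be written as a small pure function. The sketch below is a userspace illustration under assumptions: the enums and demo_should_deliver_exact_match() are hypothetical names standing in for the bond mode, skb->pkt_type and the result of bond_is_slave_inactive().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the bond mode and the packet type. */
enum demo_mode { DEMO_MODE_ACTIVEBACKUP, DEMO_MODE_ALB };
enum demo_pkt_type { DEMO_PKT_HOST, DEMO_PKT_BROADCAST, DEMO_PKT_MULTICAST };

/*
 * Returns true when a frame received on this slave should be delivered
 * to exact-match handlers only. An active slave never takes this path.
 * An inactive slave takes it, except in ALB mode for frames that are
 * neither broadcast nor multicast.
 */
static bool demo_should_deliver_exact_match(bool slave_inactive,
                                            enum demo_mode mode,
                                            enum demo_pkt_type type)
{
    if (!slave_inactive)
        return false;
    if (mode == DEMO_MODE_ALB &&
        type != DEMO_PKT_BROADCAST &&
        type != DEMO_PKT_MULTICAST)
        return false;
    return true;
}
```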
2011-03-12 06:14:39 +03:00
static rx_handler_result_t bond_handle_frame ( struct sk_buff * * pskb )
2011-02-23 12:05:42 +03:00
{
2011-03-12 06:14:39 +03:00
struct sk_buff * skb = * pskb ;
2011-03-12 06:14:35 +03:00
struct slave * slave ;
2011-03-16 11:45:23 +03:00
struct bonding * bond ;
2012-06-11 23:23:07 +04:00
int ( * recv_probe ) ( const struct sk_buff * , struct bonding * ,
struct slave * ) ;
2012-05-09 05:01:40 +04:00
int ret = RX_HANDLER_ANOTHER ;
2011-02-23 12:05:42 +03:00
2011-03-12 06:14:39 +03:00
skb = skb_share_check ( skb , GFP_ATOMIC ) ;
if ( unlikely ( ! skb ) )
return RX_HANDLER_CONSUMED ;
* pskb = skb ;
2011-02-23 12:05:42 +03:00
2011-03-22 05:38:12 +03:00
slave = bond_slave_get_rcu ( skb - > dev ) ;
bond = slave - > bond ;
2011-03-16 11:45:23 +03:00
if ( bond - > params . arp_interval )
2011-03-12 06:14:35 +03:00
slave - > dev - > last_rx = jiffies ;
2011-02-23 12:05:42 +03:00
2011-10-12 20:04:29 +04:00
recv_probe = ACCESS_ONCE ( bond - > recv_probe ) ;
if ( recv_probe ) {
2012-06-11 23:23:07 +04:00
ret = recv_probe ( skb , bond , slave ) ;
if ( ret = = RX_HANDLER_CONSUMED ) {
consume_skb ( skb ) ;
return ret ;
2011-04-19 07:48:16 +04:00
}
}
2011-03-16 11:45:23 +03:00
if ( bond_should_deliver_exact_match ( skb , slave , bond ) ) {
2011-03-12 06:14:39 +03:00
return RX_HANDLER_EXACT ;
2011-02-23 12:05:42 +03:00
}
2011-03-22 05:38:12 +03:00
skb - > dev = bond - > dev ;
2011-02-23 12:05:42 +03:00
2011-03-16 11:45:23 +03:00
if ( bond - > params . mode = = BOND_MODE_ALB & &
2011-03-22 05:38:12 +03:00
bond - > dev - > priv_flags & IFF_BRIDGE_PORT & &
2011-02-23 12:05:42 +03:00
skb - > pkt_type = = PACKET_HOST ) {
2011-03-03 00:07:14 +03:00
if ( unlikely ( skb_cow_head ( skb ,
skb - > data - skb_mac_header ( skb ) ) ) ) {
kfree_skb ( skb ) ;
2011-03-12 06:14:39 +03:00
return RX_HANDLER_CONSUMED ;
2011-03-03 00:07:14 +03:00
}
2011-03-22 05:38:12 +03:00
memcpy ( eth_hdr ( skb ) - > h_dest , bond - > dev - > dev_addr , ETH_ALEN ) ;
2011-02-23 12:05:42 +03:00
}
2012-05-09 05:01:40 +04:00
return ret ;
2011-02-23 12:05:42 +03:00
}
2013-01-04 02:49:01 +04:00
static int bond_master_upper_dev_link ( struct net_device * bond_dev ,
2013-09-25 11:20:10 +04:00
struct net_device * slave_dev ,
struct slave * slave )
2013-01-04 02:49:01 +04:00
{
int err ;
2013-09-25 11:20:10 +04:00
err = netdev_master_upper_dev_link_private ( slave_dev , bond_dev , slave ) ;
2013-01-04 02:49:01 +04:00
if ( err )
return err ;
slave_dev - > flags | = IFF_SLAVE ;
rtmsg_ifinfo ( RTM_NEWLINK , slave_dev , IFF_SLAVE ) ;
return 0 ;
}
static void bond_upper_dev_unlink ( struct net_device * bond_dev ,
struct net_device * slave_dev )
{
netdev_upper_dev_unlink ( slave_dev , bond_dev ) ;
slave_dev - > flags & = ~ IFF_SLAVE ;
rtmsg_ifinfo ( RTM_NEWLINK , slave_dev , IFF_SLAVE ) ;
}
2005-04-17 02:20:36 +04:00
/* enslave device <slave> to bond device <master> */
2005-11-09 21:35:51 +03:00
int bond_enslave ( struct net_device * bond_dev , struct net_device * slave_dev )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2008-11-20 08:56:05 +03:00
const struct net_device_ops * slave_ops = slave_dev - > netdev_ops ;
2005-04-17 02:20:36 +04:00
struct slave * new_slave = NULL ;
struct sockaddr addr ;
int link_reporting ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
Any other configuration needs the same behaviour if it requires access to all
arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
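Step 3 of the commit message above (with arp_all_targets set, the slave's last receive time is the oldest per-target timestamp) can be sketched in userspace C. This is a sketch under stated assumptions, not the kernel helper: demo_time_before(), demo_oldest_target_rx(), demo_slave_last_rx() and DEMO_MAX_ARP_TARGETS are hypothetical names, the value 16 is chosen only for illustration, and the tick counter is assumed to be a free-running unsigned long that may wrap.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ARP_TARGETS 16 /* hypothetical bound for this sketch */

/* Wrap-safe "a is earlier than b" for a free-running tick counter. */
static int demo_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Oldest timestamp among the first n_targets entries. */
static unsigned long demo_oldest_target_rx(const unsigned long *target_last_rx,
                                           size_t n_targets)
{
    unsigned long oldest;
    size_t i;

    assert(target_last_rx != NULL);
    assert(n_targets > 0 && n_targets <= DEMO_MAX_ARP_TARGETS);
    oldest = target_last_rx[0];
    for (i = 1; i < n_targets; i++)
        if (demo_time_before(target_last_rx[i], oldest))
            oldest = target_last_rx[i];
    return oldest;
}

/*
 * all_targets == 0 ("any"): report the most recent validated ARP.
 * all_targets != 0 ("all"): report the oldest per-target timestamp, so
 * one silent target is enough to make the slave look stale.
 */
static unsigned long demo_slave_last_rx(int all_targets,
                                        unsigned long last_arp_rx,
                                        const unsigned long *target_last_rx,
                                        size_t n_targets)
{
    if (all_targets)
        return demo_oldest_target_rx(target_last_rx, n_targets);
    return last_arp_rx;
}
```

Initialising every target_last_arp_rx[] entry to the same value as last_arp_rx, as the v2->v3 note describes, means both modes report the same timestamp for a freshly enslaved device.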
2013-06-24 13:49:34 +04:00
int res = 0 , i ;
2005-04-17 02:20:36 +04:00
2012-12-07 10:15:32 +04:00
if ( ! bond - > params . use_carrier & &
slave_dev - > ethtool_ops - > get_link = = NULL & &
slave_ops - > ndo_do_ioctl = = NULL ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: no link monitoring support for %s \n " ,
bond_dev - > name , slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
/* already enslaved */
if ( slave_dev - > flags & IFF_SLAVE ) {
2008-12-10 10:09:22 +03:00
pr_debug ( " Error, Device was already enslaved \n " ) ;
2005-04-17 02:20:36 +04:00
return - EBUSY ;
}
/* vlan challenged mutual exclusion */
/* no need to lock since we're protected by rtnl_lock */
if ( slave_dev - > features & NETIF_F_VLAN_CHALLENGED ) {
2008-12-10 10:09:22 +03:00
pr_debug ( " %s: NETIF_F_VLAN_CHALLENGED \n " , slave_dev - > name ) ;
2012-10-14 08:30:56 +04:00
if ( vlan_uses_dev ( bond_dev ) ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error: cannot enslave VLAN challenged slave %s on VLAN enabled bond %s \n " ,
bond_dev - > name , slave_dev - > name , bond_dev - > name ) ;
2005-04-17 02:20:36 +04:00
return - EPERM ;
} else {
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: enslaved VLAN challenged slave %s. Adding VLANs will be blocked as long as %s is part of bond %s \n " ,
bond_dev - > name , slave_dev - > name ,
slave_dev - > name , bond_dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
} else {
2008-12-10 10:09:22 +03:00
pr_debug ( " %s: ! NETIF_F_VLAN_CHALLENGED \n " , slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
2005-09-27 03:11:50 +04:00
/*
* Old ifenslave binaries are no longer supported . These can
2009-06-12 23:02:48 +04:00
* be identified with moderate accuracy by the state of the slave :
2005-09-27 03:11:50 +04:00
* the current ifenslave will set the interface down prior to
* enslaving it ; the old ifenslave will not .
*/
if ( ( slave_dev - > flags & IFF_UP ) ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s is up. This may be due to an out of date ifenslave. \n " ,
2005-09-27 03:11:50 +04:00
slave_dev - > name ) ;
res = - EPERM ;
goto err_undo_flags ;
}
2005-04-17 02:20:36 +04:00
2007-10-10 06:43:38 +04:00
/* set bonding device ether type by slave - bonding netdevices are
* created with ether_setup , so when the slave type is not ARPHRD_ETHER
* there is a need to override some of the type dependent attribs / funcs .
*
* bond ether type mutual exclusion - don ' t allow slaves of dissimilar
* ether type ( eg ARPHRD_ETHER and ARPHRD_INFINIBAND ) share the same bond
*/
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) ) {
2009-07-15 08:56:31 +04:00
if ( bond_dev - > type ! = slave_dev - > type ) {
pr_debug ( " %s: change device type from %d to %d \n " ,
2009-12-14 07:06:07 +03:00
bond_dev - > name ,
bond_dev - > type , slave_dev - > type ) ;
2009-09-15 13:37:40 +04:00
2012-08-10 02:14:57 +04:00
res = call_netdevice_notifiers ( NETDEV_PRE_TYPE_CHANGE ,
bond_dev ) ;
2010-03-10 13:29:35 +03:00
res = notifier_to_errno ( res ) ;
if ( res ) {
pr_err ( " %s: refused to change device type \n " ,
bond_dev - > name ) ;
res = - EBUSY ;
goto err_undo_flags ;
}
2009-09-15 13:37:40 +04:00
2010-03-19 07:00:23 +03:00
/* Flush unicast and multicast addresses */
2010-04-02 01:22:09 +04:00
dev_uc_flush ( bond_dev ) ;
2010-04-02 01:22:57 +04:00
dev_mc_flush ( bond_dev ) ;
2010-03-19 07:00:23 +03:00
2009-07-15 08:56:31 +04:00
if ( slave_dev - > type ! = ARPHRD_ETHER )
bond_setup_by_slave ( bond_dev , slave_dev ) ;
2011-07-26 10:05:38 +04:00
else {
2009-07-15 08:56:31 +04:00
ether_setup ( bond_dev ) ;
2011-07-26 10:05:38 +04:00
bond_dev - > priv_flags & = ~ IFF_TX_SKB_SHARING ;
}
2009-09-15 13:37:40 +04:00
2012-08-10 02:14:57 +04:00
call_netdevice_notifiers ( NETDEV_POST_TYPE_CHANGE ,
bond_dev ) ;
2009-07-15 08:56:31 +04:00
}
2007-10-10 06:43:38 +04:00
} else if ( bond_dev - > type ! = slave_dev - > type ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s ether type (%d) is different from other slaves (%d), can not enslave it. \n " ,
slave_dev - > name ,
slave_dev - > type , bond_dev - > type ) ;
res = - EINVAL ;
goto err_undo_flags ;
2007-10-10 06:43:38 +04:00
}
2008-11-20 08:56:05 +03:00
if ( slave_ops - > ndo_set_mac_address = = NULL ) {
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active. \n " ,
bond_dev - > name ) ;
2008-05-18 08:10:14 +04:00
bond - > params . fail_over_mac = BOND_FOM_ACTIVE ;
} else if ( bond - > params . fail_over_mac ! = BOND_FOM_ACTIVE ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error: The slave device specified does not support setting the MAC address, but fail_over_mac is not set to active. \n " ,
bond_dev - > name ) ;
2007-10-10 06:43:39 +04:00
res = - EOPNOTSUPP ;
goto err_undo_flags ;
}
2005-04-17 02:20:36 +04:00
}
2011-05-20 01:39:10 +04:00
call_netdevice_notifiers ( NETDEV_JOIN , slave_dev ) ;
2010-05-19 05:14:29 +04:00
/* If this is the first slave, then we need to set the master's hardware
* address to be the same as the slave ' s . */
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) & &
2013-08-01 18:54:47 +04:00
bond - > dev - > addr_assign_type = = NET_ADDR_RANDOM )
2013-01-30 14:08:11 +04:00
bond_set_dev_addr ( bond - > dev , slave_dev ) ;
2010-05-19 05:14:29 +04:00
2007-02-07 01:16:40 +03:00
new_slave = kzalloc ( sizeof ( struct slave ) , GFP_KERNEL ) ;
2005-04-17 02:20:36 +04:00
if ( ! new_slave ) {
res = - ENOMEM ;
goto err_undo_flags ;
}
2013-08-01 18:54:47 +04:00
INIT_LIST_HEAD ( & new_slave - > list ) ;
2010-06-02 12:40:18 +04:00
/*
* Set the new_slave ' s queue_id to be zero . Queue ID mapping
* is set via sysfs or module option if desired .
*/
new_slave - > queue_id = 0 ;
2010-05-18 09:42:40 +04:00
/* Save slave's original mtu and then set it to match the bond */
new_slave - > original_mtu = slave_dev - > mtu ;
res = dev_set_mtu ( slave_dev , bond - > dev - > mtu ) ;
if ( res ) {
pr_debug ( " Error %d calling dev_set_mtu \n " , res ) ;
goto err_free ;
}
2005-09-27 03:11:50 +04:00
/*
* Save slave ' s original ( " permanent " ) mac address for modes
* that need it , and for restoring it upon release , and then
* set it to the master ' s address
*/
memcpy ( new_slave - > perm_hwaddr , slave_dev - > dev_addr , ETH_ALEN ) ;
2005-04-17 02:20:36 +04:00
2007-10-10 06:57:24 +04:00
if ( ! bond - > params . fail_over_mac ) {
2007-10-10 06:43:39 +04:00
/*
* Set slave to master ' s mac address . The application already
* set the master ' s mac address to that of the first slave
*/
memcpy ( addr . sa_data , bond_dev - > dev_addr , bond_dev - > addr_len ) ;
addr . sa_family = slave_dev - > type ;
res = dev_set_mac_address ( slave_dev , & addr ) ;
if ( res ) {
2008-12-10 10:09:22 +03:00
pr_debug ( " Error %d calling set_mac_address \n " , res ) ;
2010-05-18 09:42:40 +04:00
goto err_restore_mtu ;
2007-10-10 06:43:39 +04:00
}
2005-09-27 03:11:50 +04:00
}
2005-04-17 02:20:36 +04:00
2005-09-27 03:11:50 +04:00
/* open the slave since the application closed it */
res = dev_open ( slave_dev ) ;
if ( res ) {
2009-06-12 23:02:48 +04:00
pr_debug ( " Opening slave %s failed \n " , slave_dev - > name ) ;
2013-09-25 11:20:10 +04:00
goto err_restore_mac ;
2005-04-17 02:20:36 +04:00
}
2011-03-22 05:38:12 +03:00
new_slave - > bond = bond ;
2005-04-17 02:20:36 +04:00
new_slave - > dev = slave_dev ;
2006-09-23 08:54:10 +04:00
slave_dev - > priv_flags | = IFF_BONDING ;
2005-04-17 02:20:36 +04:00
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 02:20:36 +04:00
/* bond_alb_init_slave() must be called before all other stages since
* it might fail and we do not want to have to undo everything
*/
res = bond_alb_init_slave ( bond , new_slave ) ;
2009-06-12 23:02:48 +04:00
if ( res )
2008-05-03 05:06:02 +04:00
goto err_close ;
2005-04-17 02:20:36 +04:00
}
2013-05-31 15:57:30 +04:00
/* If the mode USES_PRIMARY, then the following is handled by
* bond_change_active_slave ( ) .
2005-04-17 02:20:36 +04:00
*/
if ( ! USES_PRIMARY ( bond - > params . mode ) ) {
/* set promiscuity level to new slave */
if ( bond_dev - > flags & IFF_PROMISC ) {
2008-07-15 07:51:36 +04:00
res = dev_set_promiscuity ( slave_dev , 1 ) ;
if ( res )
goto err_close ;
2005-04-17 02:20:36 +04:00
}
/* set allmulti level to new slave */
if ( bond_dev - > flags & IFF_ALLMULTI ) {
2008-07-15 07:51:36 +04:00
res = dev_set_allmulti ( slave_dev , 1 ) ;
if ( res )
goto err_close ;
2005-04-17 02:20:36 +04:00
}
2008-07-15 11:15:08 +04:00
netif_addr_lock_bh ( bond_dev ) ;
2013-05-31 15:57:30 +04:00
dev_mc_sync_multiple ( slave_dev , bond_dev ) ;
dev_uc_sync_multiple ( slave_dev , bond_dev ) ;
2008-07-15 11:15:08 +04:00
netif_addr_unlock_bh ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
}
if ( bond - > params . mode = = BOND_MODE_8023AD ) {
/* add lacpdu mc addr to mc list */
u8 lacpdu_multicast [ ETH_ALEN ] = MULTICAST_LACPDU_ADDR ;
2010-04-02 01:22:57 +04:00
dev_mc_add ( slave_dev , lacpdu_multicast ) ;
2005-04-17 02:20:36 +04:00
}
2013-08-23 06:45:07 +04:00
res = vlan_vids_add_by_dev ( slave_dev , bond_dev ) ;
if ( res ) {
2013-08-06 14:40:15 +04:00
pr_err ( " %s: Error: Couldn't add bond vlan ids to %s \n " ,
bond_dev - > name , slave_dev - > name ) ;
goto err_close ;
}
2005-04-17 02:20:36 +04:00
write_lock_bh ( & bond - > lock ) ;
bond_attach_slave ( bond , new_slave ) ;
new_slave - > delay = 0 ;
new_slave - > link_failure_count = 0 ;
2008-05-18 08:10:14 +04:00
write_unlock_bh ( & bond - > lock ) ;
2011-05-07 07:22:17 +04:00
bond_compute_features ( bond ) ;
2013-03-12 10:31:32 +04:00
bond_update_speed_duplex ( new_slave ) ;
2008-05-18 08:10:14 +04:00
read_lock ( & bond - > lock ) ;
2012-04-17 06:02:06 +04:00
new_slave - > last_arp_rx = jiffies -
( msecs_to_jiffies ( bond - > params . arp_interval ) + 1 ) ;
2013-06-24 13:49:34 +04:00
for ( i = 0 ; i < BOND_MAX_ARP_TARGETS ; i + + )
new_slave - > target_last_arp_rx [ i ] = new_slave - > last_arp_rx ;
2006-09-23 08:54:53 +04:00
2005-04-17 02:20:36 +04:00
if ( bond - > params . miimon & & ! bond - > params . use_carrier ) {
link_reporting = bond_check_dev_link ( bond , slave_dev , 1 ) ;
if ( ( link_reporting = = - 1 ) & & ! bond - > params . arp_interval ) {
/*
* miimon is set but a bonded network driver
* does not support ETHTOOL / MII and
* arp_interval is not set . Note : if
* use_carrier is enabled , we will never go
* here ( because netif_carrier is always
* supported ) ; thus , we don ' t need to change
* the messages for netif_carrier .
*/
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: MII and ETHTOOL support not available for interface %s, and arp_interval/arp_ip_target module parameters not specified, thus bonding will not detect link failures! see bonding.txt for details. \n " ,
2005-11-09 21:34:57 +03:00
bond_dev - > name , slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
} else if ( link_reporting = = - 1 ) {
/* unable to get link status using mii/ethtool */
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: can't get link status from interface %s; the network driver associated with this interface does not support MII or ETHTOOL link status reporting, thus miimon has no effect on this interface. \n " ,
bond_dev - > name , slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
/* check for initial state */
2012-04-17 06:02:06 +04:00
if ( bond - > params . miimon ) {
if ( bond_check_dev_link ( bond , slave_dev , 0 ) = = BMSR_LSTATUS ) {
if ( bond - > params . updelay ) {
new_slave - > link = BOND_LINK_BACK ;
new_slave - > delay = bond - > params . updelay ;
} else {
new_slave - > link = BOND_LINK_UP ;
}
2005-04-17 02:20:36 +04:00
} else {
2012-04-17 06:02:06 +04:00
new_slave - > link = BOND_LINK_DOWN ;
2005-04-17 02:20:36 +04:00
}
2012-04-17 06:02:06 +04:00
} else if ( bond - > params . arp_interval ) {
new_slave - > link = ( netif_carrier_ok ( slave_dev ) ?
BOND_LINK_UP : BOND_LINK_DOWN ) ;
2005-04-17 02:20:36 +04:00
} else {
2012-04-17 06:02:06 +04:00
new_slave - > link = BOND_LINK_UP ;
2005-04-17 02:20:36 +04:00
}
2012-04-17 06:02:06 +04:00
if ( new_slave - > link ! = BOND_LINK_DOWN )
new_slave - > jiffies = jiffies ;
pr_debug ( " Initial state of slave_dev is BOND_LINK_%s \n " ,
new_slave - > link = = BOND_LINK_DOWN ? " DOWN " :
( new_slave - > link = = BOND_LINK_UP ? " UP " : " BACK " ) ) ;
2005-04-17 02:20:36 +04:00
if ( USES_PRIMARY ( bond - > params . mode ) & & bond - > params . primary [ 0 ] ) {
/* if there is a primary slave, remember it */
2009-09-25 07:28:09 +04:00
if ( strcmp ( bond - > params . primary , new_slave - > dev - > name ) = = 0 ) {
2005-04-17 02:20:36 +04:00
bond - > primary_slave = new_slave ;
2009-09-25 07:28:09 +04:00
bond - > force_primary = true ;
}
2005-04-17 02:20:36 +04:00
}
2008-05-18 08:10:14 +04:00
write_lock_bh ( & bond - > curr_slave_lock ) ;
2005-04-17 02:20:36 +04:00
switch ( bond - > params . mode ) {
case BOND_MODE_ACTIVEBACKUP :
2006-09-23 08:56:15 +04:00
bond_set_slave_inactive_flags ( new_slave ) ;
bond_select_active_slave ( bond ) ;
2005-04-17 02:20:36 +04:00
break ;
case BOND_MODE_8023AD :
/* in 802.3ad mode, the internal mechanism
* will activate the slaves in the selected
* aggregator
*/
bond_set_slave_inactive_flags ( new_slave ) ;
/* if this is the first slave */
2013-08-01 18:54:47 +04:00
if ( bond_first_slave ( bond ) = = new_slave ) {
2005-04-17 02:20:36 +04:00
SLAVE_AD_INFO ( new_slave ) . id = 1 ;
/* Initialize AD with the number of times that the AD timer is called in 1 second .
* This can be called only after the mac address of the bond is set .
*/
2011-06-09 01:19:02 +04:00
bond_3ad_initialize ( bond , 1000 / AD_TIMER_INTERVAL ) ;
2005-04-17 02:20:36 +04:00
} else {
2013-08-01 18:54:47 +04:00
struct slave * prev_slave ;
prev_slave = bond_prev_slave ( bond , new_slave ) ;
2005-04-17 02:20:36 +04:00
SLAVE_AD_INFO ( new_slave ) . id =
2013-08-01 18:54:47 +04:00
SLAVE_AD_INFO ( prev_slave ) . id + 1 ;
2005-04-17 02:20:36 +04:00
}
bond_3ad_bind_slave ( new_slave ) ;
break ;
case BOND_MODE_TLB :
case BOND_MODE_ALB :
2011-03-12 06:14:37 +03:00
bond_set_active_slave ( new_slave ) ;
2007-10-18 04:37:49 +04:00
bond_set_slave_inactive_flags ( new_slave ) ;
bonding: select current active slave when enslaving device for mode tlb and alb
I've hit an issue on my system when I've been using RealTek RTL8139D cards in
bonding interface in mode balancing-alb. When I enslave a card, the current
active slave (bond->curr_active_slave) is not set and the link is therefore
not functional.
----
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: None
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:1f:01:2f:22
----
The thing that gets it right is when I unplug the cable and then I put it back
into the NIC. Then the current active slave is set to eth1 and link is working
just fine. Here is dmesg log with bonding DEBUG messages turned on:
----
ADDRCONF(NETDEV_UP): bond0: link is not ready
event_dev: bond0, event: 1
IFF_MASTER
event_dev: bond0, event: 8
IFF_MASTER
bond_ioctl: master=bond0, cmd=35216
slave_dev=cac5d800:
slave_dev->name=eth1:
eth1: ! NETIF_F_VLAN_CHALLENGED
event_dev: eth1, event: 8
eth1: link up, 100Mbps, full-duplex, lpa 0xC5E1
event_dev: eth1, event: 1
event_dev: eth1, event: 8
IFF_SLAVE
Initial state of slave_dev is BOND_LINK_UP
bonding: bond0: enslaving eth1 as an active interface with an up link.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
event_dev: bond0, event: 4
IFF_MASTER
bond0: no IPv6 routers present
<<<<cable unplug>>>>
eth1: link down
event_dev: eth1, event: 4
IFF_SLAVE
bonding: bond0: link status definitely down for interface eth1, disabling it
event_dev: bond0, event: 4
IFF_MASTER
<<<<cable plug>>>>
eth1: link up, 100Mbps, full-duplex, lpa 0xC5E1
event_dev: eth1, event: 4
IFF_SLAVE
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: making interface eth1 the new active one.
event_dev: eth1, event: 8
IFF_SLAVE
event_dev: eth1, event: 8
IFF_SLAVE
bonding: bond0: first active interface up!
event_dev: bond0, event: 4
IFF_MASTER
----
The current active slave is set by calling bond_select_active_slave() function
from bond_miimon_commit() function when the slave (eth1) link goes to state up.
I also tested this on other machine with Broadcom NetXtreme II BCM5708
1000Base-T NIC and there all works fine. The thing is that this adapter is down
and goes up after few seconds after it is enslaved.
This patch calls bond_select_active_slave() in bond_enslave() function for modes
alb and tlb and makes sure that the current active slave is set up properly even
when the slave state is already up. Tested on both systems, works fine.
Notice: The same problem may also occur in mode 8023AD, but I'm unable to
test that.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-26 03:23:38 +03:00
bond_select_active_slave ( bond ) ;
2005-04-17 02:20:36 +04:00
break ;
default :
2008-12-10 10:09:22 +03:00
pr_debug ( " This slave is always active in trunk mode \n " ) ;
2005-04-17 02:20:36 +04:00
/* always active in trunk mode */
2011-03-12 06:14:37 +03:00
bond_set_active_slave ( new_slave ) ;
2005-04-17 02:20:36 +04:00
/* In trunking mode there is little meaning to curr_active_slave
* anyway ( it holds no special properties of the bond device ) ,
* so we can change it without calling change_active_interface ( )
*/
2012-11-22 06:48:39 +04:00
if ( ! bond - > curr_active_slave & & new_slave - > link = = BOND_LINK_UP )
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
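The pointer publication described in the message above can be approximated in userspace. This is only a sketch under stated assumptions: C11 release/acquire atomics stand in for rcu_assign_pointer() and rcu_dereference(), the struct and function names (slave_sketch, bond_sketch, publish_active, read_active, active_id_or) are hypothetical, and grace periods (synchronize_rcu) are not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct slave and struct bonding. */
struct slave_sketch {
	int id;
	int link_up;
};

struct bond_sketch {
	_Atomic(struct slave_sketch *) curr_active_slave;
};

/* Writer side: a release store makes the slave's fields visible
 * before the pointer itself, which is the ordering that
 * rcu_assign_pointer() is meant to provide in the kernel. */
static void publish_active(struct bond_sketch *bond, struct slave_sketch *s)
{
	assert(bond != NULL);
	atomic_store_explicit(&bond->curr_active_slave, s,
			      memory_order_release);
}

/* Reader side: an acquire load is a conservative userspace
 * substitute for rcu_dereference(). */
static struct slave_sketch *read_active(struct bond_sketch *bond)
{
	assert(bond != NULL);
	return atomic_load_explicit(&bond->curr_active_slave,
				    memory_order_acquire);
}

/* Readers must tolerate a NULL active slave. */
static int active_id_or(struct bond_sketch *bond, int fallback)
{
	struct slave_sketch *s = read_active(bond);

	return s ? s->id : fallback;
}
```

A reader takes one snapshot of the pointer and uses only that snapshot; re-reading the field could observe a different slave.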
2013-08-01 18:54:51 +04:00
rcu_assign_pointer ( bond - > curr_active_slave , new_slave ) ;
2009-06-12 23:02:48 +04:00
2005-04-17 02:20:36 +04:00
break ;
} /* switch(bond_mode) */
2008-05-18 08:10:14 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2006-03-28 01:27:43 +04:00
bond_set_carrier ( bond ) ;
2010-05-06 11:48:51 +04:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2013-07-24 22:53:57 +04:00
slave_dev - > npinfo = bond - > dev - > npinfo ;
2011-02-18 02:43:32 +03:00
if ( slave_dev - > npinfo ) {
if ( slave_enable_netpoll ( new_slave ) ) {
read_unlock ( & bond - > lock ) ;
pr_info ( " Error, %s: master_dev is using netpoll, "
" but new slave device does not support netpoll. \n " ,
bond_dev - > name ) ;
res = - EBUSY ;
2011-12-31 17:26:46 +04:00
goto err_detach ;
2011-02-18 02:43:32 +03:00
}
2010-05-06 11:48:51 +04:00
}
# endif
2011-02-18 02:43:32 +03:00
2008-05-18 08:10:14 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
2005-11-09 21:36:41 +03:00
res = bond_create_slave_symlinks ( bond_dev , slave_dev ) ;
if ( res )
2011-12-31 17:26:46 +04:00
goto err_detach ;
2005-11-09 21:36:41 +03:00
2011-03-22 05:38:12 +03:00
res = netdev_rx_handler_register ( slave_dev , bond_handle_frame ,
new_slave ) ;
if ( res ) {
pr_debug ( " Error %d calling netdev_rx_handler_register \n " , res ) ;
goto err_dest_symlinks ;
}
2013-09-25 11:20:10 +04:00
res = bond_master_upper_dev_link ( bond_dev , slave_dev , new_slave ) ;
if ( res ) {
pr_debug ( " Error %d calling bond_master_upper_dev_link \n " , res ) ;
goto err_unregister ;
}
2009-12-14 07:06:07 +03:00
pr_info ( " %s: enslaving %s as a%s interface with a%s link. \n " ,
bond_dev - > name , slave_dev - > name ,
2011-03-12 06:14:37 +03:00
bond_is_active_slave ( new_slave ) ? " n active " : " backup " ,
2009-12-14 07:06:07 +03:00
new_slave - > link ! = BOND_LINK_DOWN ? " n up " : " down " ) ;
2005-04-17 02:20:36 +04:00
/* enslave is successful */
return 0 ;
/* Undo stages on error */
2013-09-25 11:20:10 +04:00
err_unregister :
netdev_rx_handler_unregister ( slave_dev ) ;
2011-03-22 05:38:12 +03:00
err_dest_symlinks :
bond_destroy_slave_symlinks ( bond_dev , slave_dev ) ;
2011-12-31 17:26:46 +04:00
err_detach :
2013-05-31 15:57:30 +04:00
if ( ! USES_PRIMARY ( bond - > params . mode ) )
bond_hw_addr_flush ( bond_dev , slave_dev ) ;
2013-08-06 14:40:15 +04:00
vlan_vids_del_by_dev ( slave_dev , bond_dev ) ;
2011-12-31 17:26:46 +04:00
write_lock_bh ( & bond - > lock ) ;
bond_detach_slave ( bond , new_slave ) ;
2013-04-18 11:33:36 +04:00
if ( bond - > primary_slave = = new_slave )
bond - > primary_slave = NULL ;
if ( bond - > curr_active_slave = = new_slave ) {
2013-04-22 12:12:22 +04:00
bond_change_active_slave ( bond , NULL ) ;
write_unlock_bh ( & bond - > lock ) ;
2013-04-18 11:33:36 +04:00
read_lock ( & bond - > lock ) ;
write_lock_bh ( & bond - > curr_slave_lock ) ;
bond_select_active_slave ( bond ) ;
write_unlock_bh ( & bond - > curr_slave_lock ) ;
read_unlock ( & bond - > lock ) ;
2013-04-22 12:12:22 +04:00
} else {
write_unlock_bh ( & bond - > lock ) ;
2013-04-18 11:33:36 +04:00
}
2013-04-18 11:33:37 +04:00
slave_disable_netpoll ( new_slave ) ;
2011-12-31 17:26:46 +04:00
2005-04-17 02:20:36 +04:00
err_close :
2013-04-11 13:18:56 +04:00
slave_dev - > priv_flags & = ~ IFF_BONDING ;
2005-04-17 02:20:36 +04:00
dev_close ( slave_dev ) ;
err_restore_mac :
2007-10-10 06:57:24 +04:00
if ( ! bond - > params . fail_over_mac ) {
2008-05-18 08:10:14 +04:00
/* XXX TODO - fom follow mode needs to change master's
* MAC if this slave ' s MAC is in use by the bond , or at
* least print a warning .
*/
2007-10-10 06:43:39 +04:00
memcpy ( addr . sa_data , new_slave - > perm_hwaddr , ETH_ALEN ) ;
addr . sa_family = slave_dev - > type ;
dev_set_mac_address ( slave_dev , & addr ) ;
}
2005-04-17 02:20:36 +04:00
2010-05-18 09:42:40 +04:00
err_restore_mtu :
dev_set_mtu ( slave_dev , new_slave - > original_mtu ) ;
2005-04-17 02:20:36 +04:00
err_free :
kfree ( new_slave ) ;
err_undo_flags :
2011-05-07 07:22:17 +04:00
bond_compute_features ( bond ) ;
2013-06-12 02:07:01 +04:00
/* Enslave of first slave has failed and we need to fix master's mac */
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) & &
2013-06-12 02:07:01 +04:00
ether_addr_equal ( bond_dev - > dev_addr , slave_dev - > dev_addr ) )
eth_hw_addr_random ( bond_dev ) ;
2009-06-12 23:02:48 +04:00
2005-04-17 02:20:36 +04:00
return res ;
}
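The "Undo stages on error" ladder above follows a common C idiom: each failure jumps to the label that undoes exactly the stages already completed, and the labels fall through in reverse order of setup. A minimal standalone sketch, with hypothetical stage names and a bitmask standing in for real device state:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stages; the real function has many more. */
enum {
	STEP_MTU  = 1u << 0,
	STEP_MAC  = 1u << 1,
	STEP_OPEN = 1u << 2,
};

/* Apply one stage, or fail without touching the state. */
static int do_step(unsigned int *state, unsigned int bit, int fail)
{
	if (fail)
		return -EBUSY;
	*state |= bit;
	return 0;
}

/* fail_step selects which stage (1..3) fails; 0 means none. */
static int enslave_sketch(unsigned int *state, int fail_step)
{
	int res;

	assert(state != NULL);

	res = do_step(state, STEP_MTU, fail_step == 1);
	if (res)
		goto err_out;

	res = do_step(state, STEP_MAC, fail_step == 2);
	if (res)
		goto err_restore_mtu;

	res = do_step(state, STEP_OPEN, fail_step == 3);
	if (res)
		goto err_restore_mac;

	return 0;

	/* Undo stages on error, newest first. */
err_restore_mac:
	*state &= ~STEP_MAC;
err_restore_mtu:
	*state &= ~STEP_MTU;
err_out:
	return res;
}
```

Adding a new stage then means adding one label at the top of the ladder and one goto at the new failure point; earlier stages are untouched.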
/*
* Try to release the slave device < slave > from the bond device < master >
* It is legal to access curr_active_slave without a lock because the whole function
2013-02-18 18:09:42 +04:00
* is write - locked . If " all " is true it means that the function is being called
* while destroying a bond interface and all slaves are being released .
2005-04-17 02:20:36 +04:00
*
* The rules for slave state should be :
* for Active / Backup :
* Active stays on , all backups go down
* for Bonded connections :
* The first up interface should be left on and all others downed .
*/
2013-02-18 18:09:42 +04:00
static int __bond_release_one ( struct net_device * bond_dev ,
struct net_device * slave_dev ,
bool all )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
struct slave * slave , * oldcurrent ;
struct sockaddr addr ;
2011-11-15 19:29:55 +04:00
netdev_features_t old_features = bond_dev - > features ;
2005-04-17 02:20:36 +04:00
/* slave is not a slave or master is not master of this slave */
if ( ! ( slave_dev - > flags & IFF_SLAVE ) | |
2013-01-04 02:49:01 +04:00
! netdev_has_upper_dev ( slave_dev , bond_dev ) ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error: cannot release %s. \n " ,
2005-04-17 02:20:36 +04:00
bond_dev - > name , slave_dev - > name ) ;
return - EINVAL ;
}
2010-10-13 20:01:50 +04:00
block_netpoll_tx ( ) ;
2005-04-17 02:20:36 +04:00
write_lock_bh ( & bond - > lock ) ;
slave = bond_get_slave_by_dev ( bond , slave_dev ) ;
if ( ! slave ) {
/* not a slave of this bond */
2009-12-14 07:06:07 +03:00
pr_info ( " %s: %s not enslaved \n " ,
bond_dev - > name , slave_dev - > name ) ;
2006-02-08 08:17:22 +03:00
write_unlock_bh ( & bond - > lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2005-04-17 02:20:36 +04:00
return - EINVAL ;
}
2013-04-02 09:15:16 +04:00
write_unlock_bh ( & bond - > lock ) ;
2013-09-25 11:20:10 +04:00
bond_upper_dev_unlink ( bond_dev , slave_dev ) ;
2011-03-22 05:38:12 +03:00
/* unregister rx_handler early so bond_handle_frame wouldn't be called
* for this slave anymore .
*/
netdev_rx_handler_unregister ( slave_dev ) ;
write_lock_bh ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
/* Inform AD package of unbinding of slave. */
if ( bond - > params . mode = = BOND_MODE_8023AD ) {
/* must be called before the slave is
* detached from the list
*/
bond_3ad_unbind_slave ( slave ) ;
}
2009-12-14 07:06:07 +03:00
pr_info ( " %s: releasing %s interface %s \n " ,
bond_dev - > name ,
2011-03-12 06:14:37 +03:00
bond_is_active_slave ( slave ) ? " active " : " backup " ,
2009-12-14 07:06:07 +03:00
slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
oldcurrent = bond - > curr_active_slave ;
bond - > current_arp_slave = NULL ;
/* release the slave from its bond */
bond_detach_slave ( bond , slave ) ;
2013-08-01 18:54:47 +04:00
if ( ! all & & ! bond - > params . fail_over_mac ) {
if ( ether_addr_equal ( bond_dev - > dev_addr , slave - > perm_hwaddr ) & &
2013-09-25 11:20:21 +04:00
bond_has_slaves ( bond ) )
2013-08-01 18:54:47 +04:00
pr_warn ( " %s: Warning: the permanent HWaddr of %s - %pM - is still in use by %s. Set the HWaddr of %s to a different address to avoid conflicts. \n " ,
bond_dev - > name , slave_dev - > name ,
slave - > perm_hwaddr ,
bond_dev - > name , slave_dev - > name ) ;
}
2009-06-12 23:02:48 +04:00
if ( bond - > primary_slave = = slave )
2005-04-17 02:20:36 +04:00
bond - > primary_slave = NULL ;
2009-06-12 23:02:48 +04:00
if ( oldcurrent = = slave )
2005-04-17 02:20:36 +04:00
bond_change_active_slave ( bond , NULL ) ;
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 02:20:36 +04:00
/* Must be called only after the slave has been
* detached from the list and the curr_active_slave
* has been cleared ( if our_slave = = old_current ) ,
* but before a new active slave is selected .
*/
2008-01-18 03:24:59 +03:00
write_unlock_bh ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
bond_alb_deinit_slave ( bond , slave ) ;
2008-01-18 03:24:59 +03:00
write_lock_bh ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
}
2013-02-18 18:09:42 +04:00
if ( all ) {
2013-08-01 18:54:51 +04:00
rcu_assign_pointer ( bond - > curr_active_slave , NULL ) ;
2013-02-18 18:09:42 +04:00
} else if ( oldcurrent = = slave ) {
2007-10-18 04:37:49 +04:00
/*
* Note that we hold RTNL over this sequence , so there
* is no concern that another slave add / remove event
* will interfere .
*/
write_unlock_bh ( & bond - > lock ) ;
read_lock ( & bond - > lock ) ;
write_lock_bh ( & bond - > curr_slave_lock ) ;
2005-04-17 02:20:36 +04:00
bond_select_active_slave ( bond ) ;
2007-10-18 04:37:49 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
read_unlock ( & bond - > lock ) ;
write_lock_bh ( & bond - > lock ) ;
}
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) ) {
2006-03-28 01:27:43 +04:00
bond_set_carrier ( bond ) ;
2013-01-30 14:08:11 +04:00
eth_hw_addr_random ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
2013-08-29 01:25:12 +04:00
if ( vlan_uses_dev ( bond_dev ) ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " %s: Warning: clearing HW address of %s while it still has VLANs. \n " ,
bond_dev - > name , bond_dev - > name ) ;
pr_warning ( " %s: When re-adding slaves, make sure the bond's HW address matches its VLANs'. \n " ,
bond_dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
write_unlock_bh ( & bond - > lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2013-08-01 18:54:51 +04:00
synchronize_rcu ( ) ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) ) {
2012-04-04 02:56:19 +04:00
call_netdevice_notifiers ( NETDEV_CHANGEADDR , bond - > dev ) ;
2013-03-06 11:10:32 +04:00
call_netdevice_notifiers ( NETDEV_RELEASE , bond - > dev ) ;
}
2012-04-04 02:56:19 +04:00
2011-05-07 07:22:17 +04:00
bond_compute_features ( bond ) ;
if ( ! ( bond_dev - > features & NETIF_F_VLAN_CHALLENGED ) & &
( old_features & NETIF_F_VLAN_CHALLENGED ) )
pr_info ( " %s: last VLAN challenged slave %s left bond %s. VLAN blocking is removed \n " ,
bond_dev - > name , slave_dev - > name , bond_dev - > name ) ;
2005-11-09 21:36:41 +03:00
/* must do this from outside any spinlocks */
bond_destroy_slave_symlinks ( bond_dev , slave_dev ) ;
2013-08-06 14:40:15 +04:00
vlan_vids_del_by_dev ( slave_dev , bond_dev ) ;
2005-04-17 02:20:36 +04:00
2013-05-31 15:57:30 +04:00
/* If the mode USES_PRIMARY, then this case was handled above by
* bond_change_active_slave ( . . . , NULL )
2005-04-17 02:20:36 +04:00
*/
if ( ! USES_PRIMARY ( bond - > params . mode ) ) {
/* unset promiscuity level from slave */
2009-06-12 23:02:48 +04:00
if ( bond_dev - > flags & IFF_PROMISC )
2005-04-17 02:20:36 +04:00
dev_set_promiscuity ( slave_dev , - 1 ) ;
/* unset allmulti level from slave */
2009-06-12 23:02:48 +04:00
if ( bond_dev - > flags & IFF_ALLMULTI )
2005-04-17 02:20:36 +04:00
dev_set_allmulti ( slave_dev , - 1 ) ;
2013-05-31 15:57:30 +04:00
bond_hw_addr_flush ( bond_dev , slave_dev ) ;
2005-04-17 02:20:36 +04:00
}
2011-02-18 02:43:32 +03:00
slave_disable_netpoll ( slave ) ;
2010-05-06 11:48:51 +04:00
2005-04-17 02:20:36 +04:00
/* close slave before restoring its mac address */
dev_close ( slave_dev ) ;
2008-05-18 08:10:14 +04:00
if ( bond - > params . fail_over_mac ! = BOND_FOM_ACTIVE ) {
2007-10-10 06:43:39 +04:00
/* restore original ("permanent") mac address */
memcpy ( addr . sa_data , slave - > perm_hwaddr , ETH_ALEN ) ;
addr . sa_family = slave_dev - > type ;
dev_set_mac_address ( slave_dev , & addr ) ;
}
2005-04-17 02:20:36 +04:00
2010-05-18 09:42:40 +04:00
dev_set_mtu ( slave_dev , slave - > original_mtu ) ;
2011-03-16 11:46:43 +03:00
slave_dev - > priv_flags & = ~ IFF_BONDING ;
2005-04-17 02:20:36 +04:00
kfree ( slave ) ;
return 0 ; /* deletion OK */
}
2013-02-18 18:09:42 +04:00
/* A wrapper used because of ndo_del_link */
int bond_release ( struct net_device * bond_dev , struct net_device * slave_dev )
{
return __bond_release_one ( bond_dev , slave_dev , false ) ;
}
2007-10-10 06:43:43 +04:00
/*
2011-03-19 23:36:18 +03:00
* First release a slave and then destroy the bond if no more slaves are left .
2007-10-10 06:43:43 +04:00
* Must be under rtnl_lock when this function is called .
*/
2010-10-15 09:09:34 +04:00
static int bond_release_and_destroy ( struct net_device * bond_dev ,
struct net_device * slave_dev )
2007-10-10 06:43:43 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2007-10-10 06:43:43 +04:00
int ret ;
ret = bond_release ( bond_dev , slave_dev ) ;
2013-09-25 11:20:21 +04:00
if ( ret = = 0 & & ! bond_has_slaves ( bond ) ) {
2011-02-18 02:43:32 +03:00
bond_dev - > priv_flags | = IFF_DISABLE_NETPOLL ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: destroying bond %s. \n " ,
bond_dev - > name , bond_dev - > name ) ;
2009-06-12 23:02:47 +04:00
unregister_netdevice ( bond_dev ) ;
2007-10-10 06:43:43 +04:00
}
return ret ;
}
2005-04-17 02:20:36 +04:00
/*
* This function changes the active slave to slave < slave_dev > .
* It returns - EINVAL in the following cases .
* - < slave_dev > is not found in the list .
* - There is no active slave now .
* - The link state of < slave_dev > is not BOND_LINK_UP .
* - < slave_dev > is not running .
2009-06-12 23:02:48 +04:00
* In these cases , this function does nothing .
* If < slave_dev > is already active , it does nothing and returns 0 .
* In the other cases , the curr_active_slave pointer is changed and 0 is returned .
2005-04-17 02:20:36 +04:00
*/
static int bond_ioctl_change_active ( struct net_device * bond_dev , struct net_device * slave_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
struct slave * old_active = NULL ;
struct slave * new_active = NULL ;
int res = 0 ;
2009-06-12 23:02:48 +04:00
if ( ! USES_PRIMARY ( bond - > params . mode ) )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2013-01-04 02:49:01 +04:00
/* Verify that bond_dev is indeed the master of slave_dev */
if ( ! ( slave_dev - > flags & IFF_SLAVE ) | |
! netdev_has_upper_dev ( slave_dev , bond_dev ) )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2007-10-18 04:37:49 +04:00
read_lock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
old_active = bond - > curr_active_slave ;
new_active = bond_get_slave_by_dev ( bond , slave_dev ) ;
/*
* Changing to the current active : do nothing ; return success .
*/
2013-08-01 18:54:48 +04:00
if ( new_active & & new_active = = old_active ) {
2007-10-18 04:37:49 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
2013-08-01 18:54:48 +04:00
if ( new_active & &
old_active & &
new_active - > link = = BOND_LINK_UP & &
2005-04-17 02:20:36 +04:00
IS_UP ( new_active - > dev ) ) {
2010-10-13 20:01:50 +04:00
block_netpoll_tx ( ) ;
2007-10-18 04:37:49 +04:00
write_lock_bh ( & bond - > curr_slave_lock ) ;
2005-04-17 02:20:36 +04:00
bond_change_active_slave ( bond , new_active ) ;
2007-10-18 04:37:49 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2009-06-12 23:02:48 +04:00
} else
2005-04-17 02:20:36 +04:00
res = - EINVAL ;
2007-10-18 04:37:49 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
return res ;
}
static int bond_info_query ( struct net_device * bond_dev , struct ifbond * info )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
info - > bond_mode = bond - > params . mode ;
info - > miimon = bond - > params . miimon ;
2007-10-18 04:37:50 +04:00
read_lock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
info - > num_slaves = bond - > slave_cnt ;
2007-10-18 04:37:50 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
static int bond_slave_info_query ( struct net_device * bond_dev , struct ifslave * info )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
int i = 0 , res = - ENODEV ;
2005-04-17 02:20:36 +04:00
struct slave * slave ;
2007-10-18 04:37:50 +04:00
read_lock ( & bond - > lock ) ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2013-08-01 18:54:47 +04:00
if ( i + + = = ( int ) info - > slave_id ) {
2009-04-23 07:39:04 +04:00
res = 0 ;
strcpy ( info - > slave_name , slave - > dev - > name ) ;
info - > link = slave - > link ;
2011-03-12 06:14:37 +03:00
info - > state = bond_slave_state ( slave ) ;
2009-04-23 07:39:04 +04:00
info - > link_failure_count = slave - > link_failure_count ;
2005-04-17 02:20:36 +04:00
break ;
}
}
2007-10-18 04:37:50 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
2009-04-23 07:39:04 +04:00
return res ;
2005-04-17 02:20:36 +04:00
}
/*-------------------------------- Monitoring -------------------------------*/
2008-07-03 05:21:58 +04:00
static int bond_miimon_inspect ( struct bonding * bond )
{
2013-08-01 18:54:47 +04:00
int link_state , commit = 0 ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2008-07-03 05:21:58 +04:00
struct slave * slave ;
2009-04-24 07:57:29 +04:00
bool ignore_updelay ;
ignore_updelay = ! bond - > curr_active_slave ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-03 05:21:58 +04:00
slave - > new_link = BOND_LINK_NOCHANGE ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
link_state = bond_check_dev_link ( bond , slave - > dev , 0 ) ;
2005-04-17 02:20:36 +04:00
switch ( slave - > link ) {
2008-07-03 05:21:58 +04:00
case BOND_LINK_UP :
if ( link_state )
continue ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
slave - > link = BOND_LINK_FAIL ;
slave - > delay = bond - > params . downdelay ;
if ( slave - > delay ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status down for %sinterface %s, disabling it in %d ms. \n " ,
bond - > dev - > name ,
( bond - > params . mode = =
BOND_MODE_ACTIVEBACKUP ) ?
2011-03-12 06:14:37 +03:00
( bond_is_active_slave ( slave ) ?
2009-12-14 07:06:07 +03:00
" active " : " backup " ) : " " ,
slave - > dev - > name ,
bond - > params . downdelay * bond - > params . miimon ) ;
2005-04-17 02:20:36 +04:00
}
2008-07-03 05:21:58 +04:00
/*FALLTHRU*/
case BOND_LINK_FAIL :
if ( link_state ) {
/*
* recovered before downdelay expired
*/
slave - > link = BOND_LINK_UP ;
2005-04-17 02:20:36 +04:00
slave - > jiffies = jiffies ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status up again after %d ms for interface %s. \n " ,
bond - > dev - > name ,
( bond - > params . downdelay - slave - > delay ) *
bond - > params . miimon ,
slave - > dev - > name ) ;
2008-07-03 05:21:58 +04:00
continue ;
2005-04-17 02:20:36 +04:00
}
2008-07-03 05:21:58 +04:00
if ( slave - > delay < = 0 ) {
slave - > new_link = BOND_LINK_DOWN ;
commit + + ;
continue ;
2005-04-17 02:20:36 +04:00
}
2008-07-03 05:21:58 +04:00
slave - > delay - - ;
break ;
case BOND_LINK_DOWN :
if ( ! link_state )
continue ;
slave - > link = BOND_LINK_BACK ;
slave - > delay = bond - > params . updelay ;
if ( slave - > delay ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status up for interface %s, enabling it in %d ms. \n " ,
bond - > dev - > name , slave - > dev - > name ,
ignore_updelay ? 0 :
bond - > params . updelay *
bond - > params . miimon ) ;
2008-07-03 05:21:58 +04:00
}
/*FALLTHRU*/
case BOND_LINK_BACK :
if ( ! link_state ) {
slave - > link = BOND_LINK_DOWN ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status down again after %d ms for interface %s. \n " ,
bond - > dev - > name ,
( bond - > params . updelay - slave - > delay ) *
bond - > params . miimon ,
slave - > dev - > name ) ;
2008-07-03 05:21:58 +04:00
continue ;
}
2009-04-24 07:57:29 +04:00
if ( ignore_updelay )
slave - > delay = 0 ;
2008-07-03 05:21:58 +04:00
if ( slave - > delay < = 0 ) {
slave - > new_link = BOND_LINK_UP ;
commit + + ;
2009-04-24 07:57:29 +04:00
ignore_updelay = false ;
2008-07-03 05:21:58 +04:00
continue ;
2005-04-17 02:20:36 +04:00
}
2008-07-03 05:21:58 +04:00
slave - > delay - - ;
2005-04-17 02:20:36 +04:00
break ;
2008-07-03 05:21:58 +04:00
}
}
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
return commit ;
}
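The inspection loop above amounts to a four-state machine per slave (UP, FAIL, DOWN, BACK) with a countdown measured in polling intervals. The following standalone model is a sketch only: the names (link_state_sketch, slave_model, inspect_one) are hypothetical, the ignore_updelay shortcut and all logging are left out, and the caller is assumed to apply the proposed state, as the commit phase does in the driver.

```c
#include <assert.h>
#include <stddef.h>

enum link_state_sketch { LINK_UP, LINK_FAIL, LINK_DOWN, LINK_BACK };

struct slave_model {
	enum link_state_sketch link;
	int delay;	/* remaining polls before a change is committed */
};

/* One poll for one slave. Returns 1 and sets *new_link when a
 * state change should be committed, 0 otherwise. */
static int inspect_one(struct slave_model *s, int carrier,
		       int updelay, int downdelay,
		       enum link_state_sketch *new_link)
{
	assert(s != NULL && new_link != NULL);

	switch (s->link) {
	case LINK_UP:
		if (carrier)
			return 0;
		s->link = LINK_FAIL;
		s->delay = downdelay;
		/* fall through */
	case LINK_FAIL:
		if (carrier) {
			/* recovered before downdelay expired */
			s->link = LINK_UP;
			return 0;
		}
		if (s->delay <= 0) {
			*new_link = LINK_DOWN;
			return 1;
		}
		s->delay--;
		return 0;
	case LINK_DOWN:
		if (!carrier)
			return 0;
		s->link = LINK_BACK;
		s->delay = updelay;
		/* fall through */
	case LINK_BACK:
		if (!carrier) {
			/* went away again before updelay expired */
			s->link = LINK_DOWN;
			return 0;
		}
		if (s->delay <= 0) {
			*new_link = LINK_UP;
			return 1;
		}
		s->delay--;
		return 0;
	}
	return 0;
}
```

With a delay of 0 the transition is proposed in the same poll that first notices the carrier change, because the initial case falls through into the countdown case.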
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
static void bond_miimon_commit ( struct bonding * bond )
{
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2008-07-03 05:21:58 +04:00
struct slave * slave ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-03 05:21:58 +04:00
switch ( slave - > new_link ) {
case BOND_LINK_NOCHANGE :
continue ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
case BOND_LINK_UP :
slave - > link = BOND_LINK_UP ;
slave - > jiffies = jiffies ;
if ( bond - > params . mode = = BOND_MODE_8023AD ) {
/* prevent it from being the active one */
2011-03-12 06:14:37 +03:00
bond_set_backup_slave ( slave ) ;
2008-07-03 05:21:58 +04:00
} else if ( bond - > params . mode ! = BOND_MODE_ACTIVEBACKUP ) {
/* make it immediately active */
2011-03-12 06:14:37 +03:00
bond_set_active_slave ( slave ) ;
2008-07-03 05:21:58 +04:00
} else if ( slave ! = bond - > primary_slave ) {
/* prevent it from being the active one */
2011-03-12 06:14:37 +03:00
bond_set_backup_slave ( slave ) ;
2005-04-17 02:20:36 +04:00
}
2011-04-13 19:22:31 +04:00
pr_info ( " %s: link status definitely up for interface %s, %u Mbps %s duplex. \n " ,
2010-10-07 01:25:06 +04:00
bond - > dev - > name , slave - > dev - > name ,
2013-06-20 16:34:13 +04:00
slave - > speed = = SPEED_UNKNOWN ? 0 : slave - > speed ,
slave - > duplex ? " full " : " half " ) ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
/* notify ad that the link status has changed */
if ( bond - > params . mode = = BOND_MODE_8023AD )
bond_3ad_handle_link_change ( slave , BOND_LINK_UP ) ;
2007-10-18 04:37:49 +04:00
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) )
2008-07-03 05:21:58 +04:00
bond_alb_handle_link_change ( bond , slave ,
BOND_LINK_UP ) ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
if ( ! bond - > curr_active_slave | |
( slave = = bond - > primary_slave ) )
goto do_failover ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
continue ;
2007-10-18 04:37:49 +04:00
2008-07-03 05:21:58 +04:00
case BOND_LINK_DOWN :
2008-10-31 03:41:14 +03:00
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
2008-07-03 05:21:58 +04:00
slave - > link = BOND_LINK_DOWN ;
2005-04-17 02:20:36 +04:00
2008-07-03 05:21:58 +04:00
if ( bond - > params . mode = = BOND_MODE_ACTIVEBACKUP | |
bond - > params . mode = = BOND_MODE_8023AD )
bond_set_slave_inactive_flags ( slave ) ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status definitely down for interface %s, disabling it \n " ,
bond - > dev - > name , slave - > dev - > name ) ;
2008-07-03 05:21:58 +04:00
if ( bond - > params . mode = = BOND_MODE_8023AD )
bond_3ad_handle_link_change ( slave ,
BOND_LINK_DOWN ) ;
2009-05-27 09:42:36 +04:00
if ( bond_is_lb ( bond ) )
2008-07-03 05:21:58 +04:00
bond_alb_handle_link_change ( bond , slave ,
BOND_LINK_DOWN ) ;
if ( slave = = bond - > curr_active_slave )
goto do_failover ;
continue ;
default :
2009-12-14 07:06:07 +03:00
pr_err ( " %s: invalid new link %d on slave %s \n " ,
2008-07-03 05:21:58 +04:00
bond - > dev - > name , slave - > new_link ,
slave - > dev - > name ) ;
slave - > new_link = BOND_LINK_NOCHANGE ;
continue ;
}
do_failover :
ASSERT_RTNL ( ) ;
2010-10-13 20:01:50 +04:00
block_netpoll_tx ( ) ;
2008-07-03 05:21:58 +04:00
write_lock_bh ( & bond - > curr_slave_lock ) ;
bond_select_active_slave ( bond ) ;
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2008-07-03 05:21:58 +04:00
}
bond_set_carrier ( bond ) ;
2005-04-17 02:20:36 +04:00
}
2007-10-18 04:37:48 +04:00
/*
* bond_mii_monitor
*
* Really a wrapper that splits the mii monitor into two phases : an
2008-07-03 05:21:58 +04:00
* inspection , then ( if inspection indicates something needs to be done )
* an acquisition of appropriate locks followed by a commit phase to
* implement whatever link state changes are indicated .
2007-10-18 04:37:48 +04:00
*/
void bond_mii_monitor ( struct work_struct * work )
{
struct bonding * bond = container_of ( work , struct bonding ,
mii_work . work ) ;
2011-04-26 19:25:52 +04:00
bool should_notify_peers = false ;
bonding: eliminate bond_close race conditions
This patch resolves two sets of race conditions.
Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
first, as follows:
The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and process_one_work() refers to
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.
He included a patch that converted the cancel_delayed_work calls
in bond_close to flush_delayed_work_sync, which eliminated the above
problem.
His patch is incorporated, at least in principle, into this
patch. In this patch, we use cancel_delayed_work_sync in place of
flush_delayed_work_sync, and also convert bond_uninit in addition to
bond_close.
This conversion to _sync, however, opens new races between
bond_close and three periodically executing workqueue functions:
bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
The race occurs because bond_close and bond_uninit are always
called with RTNL held, and these workqueue functions may acquire RTNL to
perform failover-related activities. If bond_close or bond_uninit is
waiting in cancel_delayed_work_sync, deadlock occurs.
These deadlocks are resolved by having the workqueue functions
acquire RTNL conditionally. If the rtnl_trylock() fails, the functions
reschedule and return immediately. For the cases that are attempting to
perform link failover, a delay of 1 is used; for the other cases, the
normal interval is used (as those activities are not as time critical).
Additionally, the bond_mii_monitor function now stores the delay
in a variable (mimicking the structure of activebackup_arp_mon).
Lastly, all of the above renders the kill_timers sentinel moot,
and therefore it has been removed.
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-28 19:42:50 +04:00
unsigned long delay ;
2007-10-18 04:37:48 +04:00
read_lock ( & bond - > lock ) ;
2011-10-28 19:42:50 +04:00
delay = msecs_to_jiffies ( bond - > params . miimon ) ;
2008-07-03 05:21:58 +04:00
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) )
2008-07-03 05:21:58 +04:00
goto re_arm ;
2008-06-14 05:12:03 +04:00
2011-04-26 19:25:52 +04:00
should_notify_peers = bond_should_notify_peers ( bond ) ;
2008-07-03 05:21:58 +04:00
if ( bond_miimon_inspect ( bond ) ) {
2007-10-18 04:37:48 +04:00
read_unlock ( & bond - > lock ) ;
2011-10-28 19:42:50 +04:00
/* Race avoidance with bond_close cancel of workqueue */
if ( ! rtnl_trylock ( ) ) {
read_lock ( & bond - > lock ) ;
delay = 1 ;
should_notify_peers = false ;
goto re_arm ;
}
2007-10-18 04:37:48 +04:00
read_lock ( & bond - > lock ) ;
2008-07-03 05:21:58 +04:00
bond_miimon_commit ( bond ) ;
2008-01-18 03:25:03 +03:00
read_unlock ( & bond - > lock ) ;
rtnl_unlock ( ) ; /* might sleep, hold no other locks */
read_lock ( & bond - > lock ) ;
2007-10-18 04:37:48 +04:00
}
2008-07-03 05:21:58 +04:00
re_arm :
2011-10-28 19:42:50 +04:00
if ( bond - > params . miimon )
queue_delayed_work ( bond - > wq , & bond - > mii_work , delay ) ;
2007-10-18 04:37:48 +04:00
read_unlock ( & bond - > lock ) ;
2011-04-26 19:25:52 +04:00
if ( should_notify_peers ) {
2013-09-02 15:51:38 +04:00
if ( ! rtnl_trylock ( ) )
2011-10-28 19:42:50 +04:00
return ;
2012-08-10 02:14:57 +04:00
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS , bond - > dev ) ;
2011-04-26 19:25:52 +04:00
rtnl_unlock ( ) ;
}
2007-10-18 04:37:48 +04:00
}
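/*
 * Illustration only, not part of the driver: a minimal userspace sketch of
 * the "try the big lock, otherwise re-arm with a short delay" pattern that
 * bond_mii_monitor() uses around rtnl_trylock(). Every name here
 * (monitor_sketch, sketch_trylock, sketch_unlock, monitor_pass_sketch) is
 * hypothetical, and a C11 atomic_flag stands in for RTNL.
 */

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the monitor keeps between runs. */
struct monitor_sketch {
	atomic_flag big_lock;		/* plays the role of RTNL */
	unsigned long interval;		/* normal polling interval, in ticks */
	unsigned long next_delay;	/* delay chosen for the next run */
	int commits;			/* completed commit phases */
};

/* Non-blocking acquire: true if the lock was free and is now held. */
static bool sketch_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set(lock);
}

static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear(lock);
}

/*
 * One monitor pass. If the inspection found work but the lock is busy,
 * do not wait for it: ask to be re-run after a single tick instead.
 * Returns true if the commit phase ran.
 */
static bool monitor_pass_sketch(struct monitor_sketch *m, bool work_needed)
{
	m->next_delay = m->interval;

	if (!work_needed)
		return false;

	if (!sketch_trylock(&m->big_lock)) {
		m->next_delay = 1;	/* retry soon, never block */
		return false;
	}

	m->commits++;			/* the commit phase would run here */
	sketch_unlock(&m->big_lock);
	return true;
}
```

/*
 * A caller would invoke monitor_pass_sketch() from its periodic handler and
 * reschedule itself after next_delay ticks. Because the pass never blocks on
 * the lock, a lock holder waiting for the handler to finish cannot deadlock
 * against it.
 */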
2005-06-27 01:52:20 +04:00
2013-08-29 01:25:11 +04:00
static bool bond_has_this_ip ( struct bonding * bond , __be32 ip )
2006-09-23 08:54:53 +04:00
{
2013-08-29 01:25:11 +04:00
struct net_device * upper ;
struct list_head * iter ;
bool ret = false ;
2006-09-23 08:54:53 +04:00
2012-03-22 20:14:29 +04:00
if ( ip = = bond_confirm_addr ( bond - > dev , 0 , ip ) )
2013-08-29 01:25:11 +04:00
return true ;
2006-09-23 08:54:53 +04:00
2013-08-29 01:25:11 +04:00
rcu_read_lock ( ) ;
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 11:20:07 +04:00
netdev_for_each_all_upper_dev_rcu ( bond - > dev , upper , iter ) {
2013-08-29 01:25:11 +04:00
if ( ip = = bond_confirm_addr ( upper , 0 , ip ) ) {
ret = true ;
break ;
}
2006-09-23 08:54:53 +04:00
}
2013-08-29 01:25:11 +04:00
rcu_read_unlock ( ) ;
2006-09-23 08:54:53 +04:00
2013-08-29 01:25:11 +04:00
return ret ;
2006-09-23 08:54:53 +04:00
}
2005-06-27 01:52:20 +04:00
/*
* We go to the ( large ) trouble of VLAN tagging ARP frames because
* switches in VLAN mode ( especially if ports are configured as
* " native " to a VLAN ) might not pass non - tagged frames .
*/
2007-08-23 04:06:58 +04:00
static void bond_arp_send ( struct net_device * slave_dev , int arp_op , __be32 dest_ip , __be32 src_ip , unsigned short vlan_id )
2005-06-27 01:52:20 +04:00
{
struct sk_buff * skb ;
2013-05-18 05:18:29 +04:00
pr_debug ( " arp %d on slave %s: dst %pI4 src %pI4 vid %d \n " , arp_op ,
slave_dev - > name , & dest_ip , & src_ip , vlan_id ) ;
2009-06-12 23:02:48 +04:00
2005-06-27 01:52:20 +04:00
skb = arp_create ( arp_op , ETH_P_ARP , dest_ip , slave_dev , src_ip ,
NULL , slave_dev - > dev_addr , NULL ) ;
if ( ! skb ) {
2009-12-14 07:06:07 +03:00
pr_err ( " ARP packet allocation failed \n " ) ;
2005-06-27 01:52:20 +04:00
return ;
}
if ( vlan_id ) {
2013-04-19 06:04:29 +04:00
skb = vlan_put_tag ( skb , htons ( ETH_P_8021Q ) , vlan_id ) ;
2005-06-27 01:52:20 +04:00
if ( ! skb ) {
2009-12-14 07:06:07 +03:00
pr_err ( " failed to insert VLAN tag \n " ) ;
2005-06-27 01:52:20 +04:00
return ;
}
}
arp_xmit ( skb ) ;
}
2005-04-17 02:20:36 +04:00
static void bond_arp_send_all ( struct bonding * bond , struct slave * slave )
{
2013-08-29 01:25:10 +04:00
struct net_device * upper , * vlan_upper ;
struct list_head * iter , * vlan_iter ;
2005-06-27 01:52:20 +04:00
struct rtable * rt ;
2013-08-29 01:25:10 +04:00
__be32 * targets = bond - > params . arp_targets , addr ;
int i , vlan_id ;
2005-04-17 02:20:36 +04:00
2013-08-29 01:25:10 +04:00
for ( i = 0 ; i < BOND_MAX_ARP_TARGETS & & targets [ i ] ; i + + ) {
2013-05-18 05:18:29 +04:00
pr_debug ( " basa: target %pI4 \n " , & targets [ i ] ) ;
2005-06-27 01:52:20 +04:00
2013-08-29 01:25:10 +04:00
/* Find out through which dev the packet should go */
2011-03-12 08:00:52 +03:00
rt = ip_route_output ( dev_net ( bond - > dev ) , targets [ i ] , 0 ,
RTO_ONLINK , 0 ) ;
2011-03-03 01:31:35 +03:00
if ( IS_ERR ( rt ) ) {
2013-08-29 01:25:16 +04:00
pr_debug ( " %s: no route to arp_ip_target %pI4 \n " ,
bond - > dev - > name , & targets [ i ] ) ;
2005-06-27 01:52:20 +04:00
continue ;
}
vlan_id = 0 ;
2013-08-29 01:25:10 +04:00
/* bond device itself */
if ( rt - > dst . dev = = bond - > dev )
goto found ;
rcu_read_lock ( ) ;
/* first we search only for vlan devices. for every vlan
* found we verify its upper dev list , searching for the
* rt - > dst . dev . If found we save the tag of the vlan and
* proceed to send the packet .
*
* TODO : QinQ ?
*/
2013-09-25 11:20:07 +04:00
netdev_for_each_all_upper_dev_rcu ( bond - > dev , vlan_upper ,
vlan_iter ) {
2013-08-29 01:25:10 +04:00
if ( ! is_vlan_dev ( vlan_upper ) )
continue ;
2013-09-25 11:20:07 +04:00
netdev_for_each_all_upper_dev_rcu ( vlan_upper , upper ,
iter ) {
2013-08-29 01:25:10 +04:00
if ( upper = = rt - > dst . dev ) {
vlan_id = vlan_dev_vlan_id ( vlan_upper ) ;
rcu_read_unlock ( ) ;
goto found ;
}
2005-06-27 01:52:20 +04:00
}
}
2013-08-29 01:25:10 +04:00
/* if the device we're looking for is not on top of any of
* our upper vlans , then just search for any dev that
* matches , and in case it ' s a vlan - save the id
*/
2013-09-25 11:20:07 +04:00
netdev_for_each_all_upper_dev_rcu ( bond - > dev , upper , iter ) {
2013-08-29 01:25:10 +04:00
if ( upper = = rt - > dst . dev ) {
/* if it's a vlan - get its VID */
if ( is_vlan_dev ( upper ) )
vlan_id = vlan_dev_vlan_id ( upper ) ;
rcu_read_unlock ( ) ;
goto found ;
}
2005-06-27 01:52:20 +04:00
}
2013-08-29 01:25:10 +04:00
rcu_read_unlock ( ) ;
2005-06-27 01:52:20 +04:00
2013-08-29 01:25:10 +04:00
/* Not our device - skip */
2013-08-29 01:25:16 +04:00
pr_debug ( " %s: no path to arp_ip_target %pI4 via rt.dev %s \n " ,
bond - > dev - > name , & targets [ i ] ,
rt - > dst . dev ? rt - > dst . dev - > name : " NULL " ) ;
2005-09-15 01:52:09 +04:00
ip_rt_put ( rt ) ;
2013-08-29 01:25:10 +04:00
continue ;
found :
addr = bond_confirm_addr ( rt - > dst . dev , targets [ i ] , 0 ) ;
ip_rt_put ( rt ) ;
bond_arp_send ( slave - > dev , ARPOP_REQUEST , targets [ i ] ,
addr , vlan_id ) ;
2005-06-27 01:52:20 +04:00
}
}
2007-08-23 04:06:58 +04:00
static void bond_validate_arp ( struct bonding * bond , struct slave * slave , __be32 sip , __be32 tip )
2006-09-23 08:54:53 +04:00
{
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of a bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
The same applies to any other configuration that needs access to all
arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 13:49:34 +04:00
int i ;
2013-06-24 13:49:29 +04:00
if ( ! sip | | ! bond_has_this_ip ( bond , tip ) ) {
pr_debug ( " bva: sip %pI4 tip %pI4 not found \n " , & sip , & tip ) ;
return ;
}
2006-09-23 08:54:53 +04:00
2013-06-24 13:49:34 +04:00
i = bond_get_targets_ip ( bond - > params . arp_targets , sip ) ;
if ( i = = - 1 ) {
2013-06-24 13:49:29 +04:00
pr_debug ( " bva: sip %pI4 not found in targets \n " , & sip ) ;
return ;
2006-09-23 08:54:53 +04:00
}
2013-06-24 13:49:29 +04:00
slave - > last_arp_rx = jiffies ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top on bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
Though any other configuration needs that if we need to have access to all
arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
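The "any" versus "all" rule described in the commit message above can be sketched in userspace as follows. This is a simplified model, not the driver's slave_last_rx(): the names model_last_rx, tick_before and MAX_TARGETS are hypothetical, plain unsigned long values stand in for jiffies, and the arp_validate and dev->last_rx handling is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TARGETS 16	/* hypothetical stand-in for BOND_MAX_ARP_TARGETS */

enum all_targets_mode { TARGETS_ANY = 0, TARGETS_ALL = 1 };

/* Wraparound-safe "a is earlier than b", in the style of time_before(). */
static int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Model of the timestamp the ARP monitor judges a slave by.
 * "any": the most recent validated ARP from whichever target (last_arp_rx).
 * "all": the oldest per-target timestamp, so one silent target ages the slave.
 */
static unsigned long model_last_rx(enum all_targets_mode mode,
				   unsigned long last_arp_rx,
				   const unsigned long *target_last_arp_rx,
				   size_t ntargets)
{
	unsigned long oldest;
	size_t i;

	assert(ntargets <= MAX_TARGETS);

	if (mode == TARGETS_ANY || ntargets == 0)
		return last_arp_rx;

	oldest = target_last_arp_rx[0];
	for (i = 1; i < ntargets; i++)
		if (tick_before(target_last_arp_rx[i], oldest))
			oldest = target_last_arp_rx[i];

	return oldest;
}
```

With per-target timestamps {100, 40, 70}, the "all" mode reports 40, so the slave is treated as if nothing had been received since tick 40.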
2013-06-24 13:49:34 +04:00
slave - > target_last_arp_rx [ i ] = jiffies ;
2006-09-23 08:54:53 +04:00
}
2013-09-07 02:00:26 +04:00
int bond_arp_rcv ( const struct sk_buff * skb , struct bonding * bond ,
struct slave * slave )
2006-09-23 08:54:53 +04:00
{
2012-06-11 23:23:07 +04:00
struct arphdr * arp = ( struct arphdr * ) skb - > data ;
2006-09-23 08:54:53 +04:00
unsigned char * arp_ptr ;
2007-08-23 04:06:58 +04:00
__be32 sip , tip ;
2012-06-11 23:23:07 +04:00
int alen ;
2006-09-23 08:54:53 +04:00
2011-04-19 07:48:16 +04:00
if ( skb - > protocol ! = __cpu_to_be16 ( ETH_P_ARP ) )
2012-05-09 05:01:40 +04:00
return RX_HANDLER_ANOTHER ;
2006-09-23 08:54:53 +04:00
read_lock ( & bond - > lock ) ;
2013-06-24 13:49:31 +04:00
if ( ! slave_do_arp_validate ( bond , slave ) )
goto out_unlock ;
2012-06-11 23:23:07 +04:00
alen = arp_hdr_len ( bond - > dev ) ;
2006-09-23 08:54:53 +04:00
2011-04-19 07:48:16 +04:00
pr_debug ( " bond_arp_rcv: bond %s skb->dev %s \n " ,
bond - > dev - > name , skb - > dev - > name ) ;
2006-09-23 08:54:53 +04:00
2012-06-11 23:23:07 +04:00
if ( alen > skb_headlen ( skb ) ) {
arp = kmalloc ( alen , GFP_ATOMIC ) ;
if ( ! arp )
goto out_unlock ;
if ( skb_copy_bits ( skb , 0 , arp , alen ) < 0 )
goto out_unlock ;
}
2006-09-23 08:54:53 +04:00
2011-04-19 07:48:16 +04:00
if ( arp - > ar_hln ! = bond - > dev - > addr_len | |
2006-09-23 08:54:53 +04:00
skb - > pkt_type = = PACKET_OTHERHOST | |
skb - > pkt_type = = PACKET_LOOPBACK | |
arp - > ar_hrd ! = htons ( ARPHRD_ETHER ) | |
arp - > ar_pro ! = htons ( ETH_P_IP ) | |
arp - > ar_pln ! = 4 )
goto out_unlock ;
arp_ptr = ( unsigned char * ) ( arp + 1 ) ;
2011-04-19 07:48:16 +04:00
arp_ptr + = bond - > dev - > addr_len ;
2006-09-23 08:54:53 +04:00
memcpy ( & sip , arp_ptr , 4 ) ;
2011-04-19 07:48:16 +04:00
arp_ptr + = 4 + bond - > dev - > addr_len ;
2006-09-23 08:54:53 +04:00
memcpy ( & tip , arp_ptr , 4 ) ;
2008-12-10 10:09:22 +03:00
pr_debug ( " bond_arp_rcv: %s %s/%d av %d sv %d sip %pI4 tip %pI4 \n " ,
2011-03-12 06:14:37 +03:00
bond - > dev - > name , slave - > dev - > name , bond_slave_state ( slave ) ,
2009-12-14 07:06:07 +03:00
bond - > params . arp_validate , slave_do_arp_validate ( bond , slave ) ,
& sip , & tip ) ;
2006-09-23 08:54:53 +04:00
/*
* Backup slaves won ' t see the ARP reply , but do come through
* here for each ARP probe ( so we swap the sip / tip to validate
* the probe ) . In a " redundant switch, common router " type of
* configuration , the ARP probe will ( hopefully ) travel from
* the active , through one switch , the router , then the other
* switch before reaching the backup .
bonding: don't trust arp requests unless active slave really works
Currently, if we receive any arp packet on a backup slave in active-backup
mode and arp_validate enabled, we suppose that it's an arp request, swap
source/target ip and try to validate it. This optimization gives us
virtually no downtime in the most common situation (active and backup
slaves are in the same broadcast domain and the active slave failed).
However, if we can't reach the arp_ip_target(s), we end up in an endless
loop of reselecting slaves, because we receive our arp requests, sent by
the active slave, and think that backup slaves are up, thus selecting them
as active and, again, sending arp requests, which fool our backup slaves.
Fix this by not validating the swapped arp packets if the current active
slave didn't receive any arp reply after it was selected as active. This
way we will only accept arp requests if we know that the current active
slave can actually reach arp_ip_target.
v3->v4:
Obey 80 lines and make checkpatch.pl happy, per Sergei's suggestion.
v1->v3:
No change.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 13:49:32 +04:00
*
* We ' trust ' the arp requests if there is an active slave and
* it received valid arp reply ( s ) after it became active . This
* is done to avoid endless looping when we can ' t reach the
* arp_ip_target and fool ourselves with our own arp requests .
2006-09-23 08:54:53 +04:00
*/
2011-03-12 06:14:37 +03:00
if ( bond_is_active_slave ( slave ) )
2006-09-23 08:54:53 +04:00
bond_validate_arp ( bond , slave , sip , tip ) ;
2013-06-24 13:49:32 +04:00
else if ( bond - > curr_active_slave & &
time_after ( slave_last_rx ( bond , bond - > curr_active_slave ) ,
bond - > curr_active_slave - > jiffies ) )
2006-09-23 08:54:53 +04:00
bond_validate_arp ( bond , slave , tip , sip ) ;
out_unlock :
read_unlock ( & bond - > lock ) ;
2012-06-11 23:23:07 +04:00
if ( arp ! = ( struct arphdr * ) skb - > data )
kfree ( arp ) ;
2012-05-09 05:01:40 +04:00
return RX_HANDLER_ANOTHER ;
2006-09-23 08:54:53 +04:00
}
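The pointer arithmetic that bond_arp_rcv() uses to pull sip and tip out of the packet follows the ARP body layout: sender hardware address, sender IP, target hardware address, target IP. A minimal userspace sketch of that walk, assuming IPv4 (4-byte protocol addresses); arp_extract_ips is a hypothetical helper and not part of the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * body points just past the fixed ARP header (the equivalent of "arp + 1"
 * in the driver) and hw_len is the hardware address length (6 for Ethernet).
 */
static void arp_extract_ips(const unsigned char *body, size_t hw_len,
			    unsigned char sip[4], unsigned char tip[4])
{
	const unsigned char *p = body;

	assert(body != NULL);

	p += hw_len;		/* skip sender hardware address */
	memcpy(sip, p, 4);	/* sender IP */
	p += 4 + hw_len;	/* skip sender IP and target hardware address */
	memcpy(tip, p, 4);	/* target IP */
}
```

The copies go through memcpy() rather than a cast because the addresses are not guaranteed to be aligned inside the packet.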
2013-08-03 05:50:36 +04:00
/* function to verify if we're in the arp_interval timeslice, returns true if
* ( last_act - arp_interval ) < = jiffies < = ( last_act + mod * arp_interval +
* arp_interval / 2 ) . the arp_interval / 2 is needed for really fast networks .
*/
static bool bond_time_in_interval ( struct bonding * bond , unsigned long last_act ,
int mod )
{
int delta_in_ticks = msecs_to_jiffies ( bond - > params . arp_interval ) ;
return time_in_range ( jiffies ,
last_act - delta_in_ticks ,
last_act + mod * delta_in_ticks + delta_in_ticks / 2 ) ;
}
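The window test in bond_time_in_interval() can be modelled outside the kernel as below. This is a sketch under stated assumptions: tick_in_range and model_time_in_interval are hypothetical names, the current time is passed in explicitly instead of being read from jiffies, and the comparison copies the signed-difference idiom of the kernel's time_in_range() so that it survives counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe lo <= now <= hi, in the style of time_in_range(). */
static bool tick_in_range(unsigned long now, unsigned long lo,
			  unsigned long hi)
{
	return (long)(now - lo) >= 0 && (long)(hi - now) >= 0;
}

/*
 * True if now lies in
 * [last_act - delta, last_act + mod * delta + delta / 2].
 */
static bool model_time_in_interval(unsigned long now, unsigned long last_act,
				   unsigned long delta, int mod)
{
	assert(mod >= 0);

	return tick_in_range(now, last_act - delta,
			     last_act + (unsigned long)mod * delta + delta / 2);
}
```

For example, with last_act = 1000, delta = 100 and mod = 2 the accepted window is 900 through 1250 inclusive.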
2005-04-17 02:20:36 +04:00
/*
* this function is called regularly to monitor each slave ' s link
* ensuring that traffic is being sent and received when arp monitoring
* is used in load - balancing mode . if the adapter has been dormant , then an
* arp is transmitted to generate traffic . see bond_activebackup_arp_mon for
* arp monitoring in active backup mode .
*/
2007-10-18 04:37:45 +04:00
void bond_loadbalance_arp_mon ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
2007-10-18 04:37:45 +04:00
struct bonding * bond = container_of ( work , struct bonding ,
arp_work . work ) ;
2005-04-17 02:20:36 +04:00
struct slave * slave , * oldcurrent ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
int do_failover = 0 ;
read_lock ( & bond - > lock ) ;
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) )
2005-04-17 02:20:36 +04:00
goto re_arm ;
oldcurrent = bond - > curr_active_slave ;
/* see if any of the previous devices are up now (i.e. they have
* xmt and rcv traffic ) . the curr_active_slave does not come into
* the picture unless it is null . also , slave - > jiffies is not needed
* here because we send an arp on each slave and give a slave as
* long as it needs to get the tx / rx within the delta .
* TODO : what about up / down delay in arp mode ? it wasn ' t here before
* so it can wait
*/
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2010-09-02 09:45:54 +04:00
unsigned long trans_start = dev_trans_start ( slave - > dev ) ;
2005-04-17 02:20:36 +04:00
if ( slave - > link ! = BOND_LINK_UP ) {
2013-08-03 05:50:36 +04:00
if ( bond_time_in_interval ( bond , trans_start , 1 ) & &
bond_time_in_interval ( bond , slave - > dev - > last_rx , 1 ) ) {
2005-04-17 02:20:36 +04:00
slave - > link = BOND_LINK_UP ;
2011-03-12 06:14:37 +03:00
bond_set_active_slave ( slave ) ;
2005-04-17 02:20:36 +04:00
/* primary_slave has no meaning in round-robin
* mode . the window of a slave being up and
* curr_active_slave being null after enslaving
* is closed .
*/
if ( ! oldcurrent ) {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status definitely up for interface %s \n " ,
bond - > dev - > name ,
slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
do_failover = 1 ;
} else {
2009-12-14 07:06:07 +03:00
pr_info ( " %s: interface %s is now up \n " ,
bond - > dev - > name ,
slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
} else {
/* slave->link == BOND_LINK_UP */
/* not all switches will respond to an arp request
* when the source ip is 0 , so don ' t take the link down
* if we don ' t know our ip yet
*/
2013-08-03 05:50:36 +04:00
if ( ! bond_time_in_interval ( bond , trans_start , 2 ) | |
! bond_time_in_interval ( bond , slave - > dev - > last_rx , 2 ) ) {
2005-04-17 02:20:36 +04:00
slave - > link = BOND_LINK_DOWN ;
2011-03-12 06:14:37 +03:00
bond_set_backup_slave ( slave ) ;
2005-04-17 02:20:36 +04:00
2009-06-12 23:02:48 +04:00
if ( slave - > link_failure_count < UINT_MAX )
2005-04-17 02:20:36 +04:00
slave - > link_failure_count + + ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: interface %s is now down. \n " ,
bond - > dev - > name ,
slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
2009-06-12 23:02:48 +04:00
if ( slave = = oldcurrent )
2005-04-17 02:20:36 +04:00
do_failover = 1 ;
}
}
/* note: if switch is in round-robin mode, all links
* must tx arp to ensure all links rx an arp - otherwise
* links may oscillate or not come up at all ; if switch is
* in something like xor mode , there is nothing we can
* do - all replies will be rx ' ed on same link causing slaves
* to be unstable during low / no traffic periods
*/
2009-06-12 23:02:48 +04:00
if ( IS_UP ( slave - > dev ) )
2005-04-17 02:20:36 +04:00
bond_arp_send_all ( bond , slave ) ;
}
if ( do_failover ) {
2010-10-13 20:01:50 +04:00
block_netpoll_tx ( ) ;
2007-10-18 04:37:49 +04:00
write_lock_bh ( & bond - > curr_slave_lock ) ;
2005-04-17 02:20:36 +04:00
bond_select_active_slave ( bond ) ;
2007-10-18 04:37:49 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2005-04-17 02:20:36 +04:00
}
re_arm :
bonding: eliminate bond_close race conditions
This patch resolves two sets of race conditions.
Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
first, as follows:
The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and process_one_work() refers
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.
He included a patch that converted the cancel_delayed_work calls
in bond_close to flush_delayed_work_sync, which eliminated the above
problem.
His patch is incorporated, at least in principle, into this
patch. In this patch, we use cancel_delayed_work_sync in place of
flush_delayed_work_sync, and also convert bond_uninit in addition to
bond_close.
This conversion to _sync, however, opens new races between
bond_close and three periodically executing workqueue functions:
bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
The race occurs because bond_close and bond_uninit are always
called with RTNL held, and these workqueue functions may acquire RTNL to
perform failover-related activities. If bond_close or bond_uninit is
waiting in cancel_delayed_work_sync, deadlock occurs.
These deadlocks are resolved by having the workqueue functions
acquire RTNL conditionally. If the rtnl_trylock() fails, the functions
reschedule and return immediately. For the cases that are attempting to
perform link failover, a delay of 1 is used; for the other cases, the
normal interval is used (as those activities are not as time critical).
Additionally, the bond_mii_monitor function now stores the delay
in a variable (mimicking the structure of activebackup_arp_mon).
Lastly, all of the above renders the kill_timers sentinel moot,
and therefore it has been removed.
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
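The "acquire conditionally, otherwise reschedule" pattern described in the commit message above can be illustrated with a small userspace model. Everything here is an assumption made for illustration: a C11 atomic_flag stands in for RTNL, monitor_once stands in for one pass of a monitor work function, and its return value plays the role of the delay handed to queue_delayed_work().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag big_lock = ATOMIC_FLAG_INIT;	/* stand-in for RTNL */

static bool big_trylock(void)
{
	return !atomic_flag_test_and_set(&big_lock);
}

static void big_unlock(void)
{
	atomic_flag_clear(&big_lock);
}

/*
 * One pass of a periodic monitor. Returns the delay, in ticks, after which
 * the monitor should run again. It never blocks on the lock, so a thread
 * that holds the lock while waiting for this work to finish cannot deadlock.
 */
static unsigned long monitor_once(bool need_commit, unsigned long interval,
				  int *commits)
{
	assert(interval > 0);

	if (need_commit) {
		if (!big_trylock())
			return 1;	/* lock busy: retry on the next tick */
		(*commits)++;		/* failover work happens under the lock */
		big_unlock();
	}

	return interval;
}
```

When the lock is contended the pass does no failover work and asks to be rerun after one tick, which matches the "delay of 1" used for the time-critical failover case.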
2011-10-28 19:42:50 +04:00
if ( bond - > params . arp_interval )
2013-08-03 05:50:36 +04:00
queue_delayed_work ( bond - > wq , & bond - > arp_work ,
msecs_to_jiffies ( bond - > params . arp_interval ) ) ;
2011-10-28 19:42:50 +04:00
2005-04-17 02:20:36 +04:00
read_unlock ( & bond - > lock ) ;
}
/*
2008-05-18 08:10:13 +04:00
* Called to inspect slaves for active - backup mode ARP monitor link state
* changes . Sets new_link in slaves to specify what action should take
* place for the slave . Returns 0 if no changes are found , > 0 if changes
* to link states must be committed .
*
* Called with bond - > lock held for read .
2005-04-17 02:20:36 +04:00
*/
2013-08-03 05:50:36 +04:00
static int bond_ab_arp_inspect ( struct bonding * bond )
2005-04-17 02:20:36 +04:00
{
2013-08-03 05:50:35 +04:00
unsigned long trans_start , last_rx ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
struct slave * slave ;
int commit = 0 ;
2012-08-30 16:02:47 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-05-18 08:10:13 +04:00
slave - > new_link = BOND_LINK_NOCHANGE ;
2013-08-03 05:50:35 +04:00
last_rx = slave_last_rx ( bond , slave ) ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
if ( slave - > link ! = BOND_LINK_UP ) {
2013-08-03 05:50:36 +04:00
if ( bond_time_in_interval ( bond , last_rx , 1 ) ) {
2008-05-18 08:10:13 +04:00
slave - > new_link = BOND_LINK_UP ;
commit + + ;
}
continue ;
}
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
/*
* Give slaves 2 * delta after being enslaved or made
* active . This avoids bouncing , as the last receive
* times need a full ARP monitor cycle to be updated .
*/
2013-08-03 05:50:36 +04:00
if ( bond_time_in_interval ( bond , slave - > jiffies , 2 ) )
2008-05-18 08:10:13 +04:00
continue ;
/*
* Backup slave is down if :
* - No current_arp_slave AND
* - more than 3 * delta since last receive AND
* - the bond has an IP address
*
* Note : a non - null current_arp_slave indicates
* the curr_active_slave went down and we are
* searching for a new one ; under this condition
* we only take the curr_active_slave down - this
* gives each slave a chance to tx / rx traffic
* before being taken out
*/
2011-03-12 06:14:37 +03:00
if ( ! bond_is_active_slave ( slave ) & &
2008-05-18 08:10:13 +04:00
! bond - > current_arp_slave & &
2013-08-03 05:50:36 +04:00
! bond_time_in_interval ( bond , last_rx , 3 ) ) {
2008-05-18 08:10:13 +04:00
slave - > new_link = BOND_LINK_DOWN ;
commit + + ;
}
/*
* Active slave is down if :
* - more than 2 * delta since transmitting OR
* - ( more than 2 * delta since receive AND
* the bond has an IP address )
*/
2010-09-02 09:45:54 +04:00
trans_start = dev_trans_start ( slave - > dev ) ;
2011-03-12 06:14:37 +03:00
if ( bond_is_active_slave ( slave ) & &
2013-08-03 05:50:36 +04:00
( ! bond_time_in_interval ( bond , trans_start , 2 ) | |
! bond_time_in_interval ( bond , last_rx , 2 ) ) ) {
2008-05-18 08:10:13 +04:00
slave - > new_link = BOND_LINK_DOWN ;
commit + + ;
}
2005-04-17 02:20:36 +04:00
}
2008-05-18 08:10:13 +04:00
return commit ;
}
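The per-slave decision made by bond_ab_arp_inspect() reduces to a small truth table. The sketch below is a hypothetical userspace restatement: the bond_time_in_interval() results are precomputed into booleans, and struct slave_view, model_inspect and the LINK_* names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum link_change { LINK_NOCHANGE, LINK_GO_UP, LINK_GO_DOWN };

struct slave_view {
	bool link_up;		/* slave->link == BOND_LINK_UP */
	bool is_active;		/* slave is the current active slave */
	bool in_grace;		/* enslaved or activated within 2 * delta */
	bool rx_within_1;	/* last receive within 1 * delta */
	bool rx_within_2;	/* last receive within 2 * delta */
	bool rx_within_3;	/* last receive within 3 * delta */
	bool tx_within_2;	/* last transmit within 2 * delta */
};

/* probing is true while a current_arp_slave is set. */
static enum link_change model_inspect(const struct slave_view *s, bool probing)
{
	assert(s != NULL);

	if (!s->link_up)
		return s->rx_within_1 ? LINK_GO_UP : LINK_NOCHANGE;

	if (s->in_grace)
		return LINK_NOCHANGE;

	if (!s->is_active && !probing && !s->rx_within_3)
		return LINK_GO_DOWN;

	if (s->is_active && (!s->tx_within_2 || !s->rx_within_2))
		return LINK_GO_DOWN;

	return LINK_NOCHANGE;
}
```

A silent backup slave is only taken down when no probing slave is set, which mirrors the note that the other slaves get a chance to tx/rx before being removed.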
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
/*
* Called to commit link state changes noted by inspection step of
* active - backup mode ARP monitor .
*
* Called with RTNL and bond - > lock for read .
*/
2013-08-03 05:50:36 +04:00
static void bond_ab_arp_commit ( struct bonding * bond )
2008-05-18 08:10:13 +04:00
{
2010-09-02 09:45:54 +04:00
unsigned long trans_start ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
struct slave * slave ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-05-18 08:10:13 +04:00
switch ( slave - > new_link ) {
case BOND_LINK_NOCHANGE :
continue ;
2006-03-28 01:27:43 +04:00
2008-05-18 08:10:13 +04:00
case BOND_LINK_UP :
2010-09-02 09:45:54 +04:00
trans_start = dev_trans_start ( slave - > dev ) ;
2013-08-03 05:50:36 +04:00
if ( bond - > curr_active_slave ! = slave | |
( ! bond - > curr_active_slave & &
bond_time_in_interval ( bond , trans_start , 1 ) ) ) {
2008-05-18 08:10:13 +04:00
slave - > link = BOND_LINK_UP ;
2012-04-05 07:47:43 +04:00
if ( bond - > current_arp_slave ) {
bond_set_slave_inactive_flags (
bond - > current_arp_slave ) ;
bond - > current_arp_slave = NULL ;
}
2008-05-18 08:10:13 +04:00
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status definitely up for interface %s. \n " ,
2009-08-31 15:09:38 +04:00
bond - > dev - > name , slave - > dev - > name ) ;
2008-05-18 08:10:13 +04:00
2009-08-31 15:09:38 +04:00
if ( ! bond - > curr_active_slave | |
( slave = = bond - > primary_slave ) )
goto do_failover ;
2005-04-17 02:20:36 +04:00
2009-08-31 15:09:38 +04:00
}
2005-04-17 02:20:36 +04:00
2009-08-31 15:09:38 +04:00
continue ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
case BOND_LINK_DOWN :
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
slave - > link = BOND_LINK_DOWN ;
2009-08-31 15:09:38 +04:00
bond_set_slave_inactive_flags ( slave ) ;
2008-05-18 08:10:13 +04:00
2009-12-14 07:06:07 +03:00
pr_info ( " %s: link status definitely down for interface %s, disabling it \n " ,
2009-08-31 15:09:38 +04:00
bond - > dev - > name , slave - > dev - > name ) ;
2008-05-18 08:10:13 +04:00
2009-08-31 15:09:38 +04:00
if ( slave = = bond - > curr_active_slave ) {
2008-05-18 08:10:13 +04:00
bond - > current_arp_slave = NULL ;
2009-08-31 15:09:38 +04:00
goto do_failover ;
2005-04-17 02:20:36 +04:00
}
2009-08-31 15:09:38 +04:00
continue ;
2008-05-18 08:10:13 +04:00
default :
2009-12-14 07:06:07 +03:00
pr_err ( " %s: impossible: new_link %d on slave %s \n " ,
2008-05-18 08:10:13 +04:00
bond - > dev - > name , slave - > new_link ,
slave - > dev - > name ) ;
2009-08-31 15:09:38 +04:00
continue ;
2005-04-17 02:20:36 +04:00
}
2009-08-31 15:09:38 +04:00
do_failover :
ASSERT_RTNL ( ) ;
2010-10-13 20:01:50 +04:00
block_netpoll_tx ( ) ;
2008-05-18 08:10:13 +04:00
write_lock_bh ( & bond - > curr_slave_lock ) ;
2009-08-31 15:09:38 +04:00
bond_select_active_slave ( bond ) ;
2008-05-18 08:10:13 +04:00
write_unlock_bh ( & bond - > curr_slave_lock ) ;
2010-10-13 20:01:50 +04:00
unblock_netpoll_tx ( ) ;
2008-05-18 08:10:13 +04:00
}
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
bond_set_carrier ( bond ) ;
}
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
/*
* Send ARP probes for active - backup mode ARP monitor .
*
* Called with bond - > lock held for read .
*/
static void bond_ab_arp_probe ( struct bonding * bond )
{
2013-09-25 11:20:19 +04:00
struct slave * slave , * before = NULL , * new_slave = NULL ;
struct list_head * iter ;
bool found = false ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
read_lock ( & bond - > curr_slave_lock ) ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
if ( bond - > current_arp_slave & & bond - > curr_active_slave )
2009-12-14 07:06:07 +03:00
pr_info ( " PROBE: c_arp %s && cas %s BAD \n " ,
bond - > current_arp_slave - > dev - > name ,
bond - > curr_active_slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
if ( bond - > curr_active_slave ) {
bond_arp_send_all ( bond , bond - > curr_active_slave ) ;
read_unlock ( & bond - > curr_slave_lock ) ;
return ;
}
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
read_unlock ( & bond - > curr_slave_lock ) ;
2007-10-18 04:37:49 +04:00
2008-05-18 08:10:13 +04:00
/* if we don't have a curr_active_slave, search for the next available
* backup slave from the current_arp_slave and make it the candidate
* for becoming the curr_active_slave
*/
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
if ( ! bond - > current_arp_slave ) {
2013-08-01 18:54:47 +04:00
bond - > current_arp_slave = bond_first_slave ( bond ) ;
2008-05-18 08:10:13 +04:00
if ( ! bond - > current_arp_slave )
return ;
}
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
bond_set_slave_inactive_flags ( bond - > current_arp_slave ) ;
2007-10-18 04:37:49 +04:00
2013-09-25 11:20:19 +04:00
bond_for_each_slave ( bond , slave , iter ) {
if ( ! found & & ! before & & IS_UP ( slave - > dev ) )
before = slave ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:19 +04:00
if ( found & & ! new_slave & & IS_UP ( slave - > dev ) )
new_slave = slave ;
2008-05-18 08:10:13 +04:00
/* if the link state is up at this point, we
* mark it down - this can happen if we have
* simultaneous link failures and
* bond_select_active_slave doesn ' t make this
* one the current slave so it is still marked
* up when it is actually down
2005-04-17 02:20:36 +04:00
*/
2013-09-25 11:20:19 +04:00
if ( ! IS_UP ( slave - > dev ) & & slave - > link = = BOND_LINK_UP ) {
2008-05-18 08:10:13 +04:00
slave - > link = BOND_LINK_DOWN ;
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
bond_set_slave_inactive_flags ( slave ) ;
2009-12-14 07:06:07 +03:00
pr_info ( " %s: backup interface %s is now down. \n " ,
bond - > dev - > name , slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
2013-09-25 11:20:19 +04:00
if ( slave = = bond - > current_arp_slave )
found = true ;
2008-05-18 08:10:13 +04:00
}
2013-09-25 11:20:19 +04:00
if ( ! new_slave & & before )
new_slave = before ;
if ( ! new_slave )
return ;
new_slave - > link = BOND_LINK_BACK ;
bond_set_slave_active_flags ( new_slave ) ;
bond_arp_send_all ( bond , new_slave ) ;
new_slave - > jiffies = jiffies ;
bond - > current_arp_slave = new_slave ;
2008-05-18 08:10:13 +04:00
}
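The candidate search in bond_ab_arp_probe() is a single-pass "next up entry after the current one, wrapping around" scan. A minimal sketch over an array of up/down flags, assuming array indices in place of the slave list; next_probe_candidate is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns the index of the first "up" entry after current. If there is
 * none, wraps to the first "up" entry at or before current. Returns -1
 * if no entry is up.
 */
static int next_probe_candidate(const bool *is_up, size_t n, size_t current)
{
	int before = -1, next = -1;
	bool found = false;
	size_t i;

	assert(current < n);

	for (i = 0; i < n; i++) {
		/* remember the first usable entry seen before passing current */
		if (!found && before < 0 && is_up[i])
			before = (int)i;

		/* first usable entry after current wins */
		if (found && next < 0 && is_up[i])
			next = (int)i;

		if (i == current)
			found = true;
	}

	return next >= 0 ? next : before;
}
```

Because found is only set after both checks, the current entry itself can be returned as the wrapped-around candidate when it is the only one that is up.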
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
void bond_activebackup_arp_mon ( struct work_struct * work )
{
struct bonding * bond = container_of ( work , struct bonding ,
arp_work . work ) ;
2011-04-26 19:25:52 +04:00
bool should_notify_peers = false ;
2008-05-18 08:10:13 +04:00
int delta_in_ticks ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
read_lock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
2008-05-18 08:10:13 +04:00
delta_in_ticks = msecs_to_jiffies ( bond - > params . arp_interval ) ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:21 +04:00
if ( ! bond_has_slaves ( bond ) )
2008-05-18 08:10:13 +04:00
goto re_arm ;
2011-04-26 19:25:52 +04:00
should_notify_peers = bond_should_notify_peers ( bond ) ;
2013-08-03 05:50:36 +04:00
if ( bond_ab_arp_inspect ( bond ) ) {
2008-05-18 08:10:13 +04:00
read_unlock ( & bond - > lock ) ;
2011-10-28 19:42:50 +04:00
/* Race avoidance with bond_close flush of workqueue */
if ( ! rtnl_trylock ( ) ) {
read_lock ( & bond - > lock ) ;
delta_in_ticks = 1 ;
should_notify_peers = false ;
goto re_arm ;
}
2008-05-18 08:10:13 +04:00
read_lock ( & bond - > lock ) ;
2013-08-03 05:50:36 +04:00
bond_ab_arp_commit ( bond ) ;
2008-05-18 08:10:13 +04:00
read_unlock ( & bond - > lock ) ;
rtnl_unlock ( ) ;
read_lock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
}
2008-05-18 08:10:13 +04:00
bond_ab_arp_probe ( bond ) ;
2005-04-17 02:20:36 +04:00
re_arm :
2011-10-28 19:42:50 +04:00
if ( bond - > params . arp_interval )
2007-10-18 04:37:45 +04:00
queue_delayed_work ( bond - > wq , & bond - > arp_work , delta_in_ticks ) ;
bonding: eliminate bond_close race conditions
This patch resolves two sets of race conditions.
Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
first, as follows:
The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and proccess_one_work() refers
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.
He included a patch that converted the cancel_delayed_work calls
in bond_close to flush_delayed_work_sync, which eliminated the above
problem.
His patch is incorporated, at least in principle, into this
patch. In this patch, we use cancel_delayed_work_sync in place of
flush_delayed_work_sync, and also convert bond_uninit in addition to
bond_close.
This conversion to _sync, however, opens new races between
bond_close and three periodically executing workqueue functions:
bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
The race occurs because bond_close and bond_uninit are always
called with RTNL held, and these workqueue functions may acquire RTNL to
perform failover-related activities. If bond_close or bond_uninit is
waiting in cancel_delayed_work_sync, deadlock occurs.
These deadlocks are resolved by having the workqueue functions
acquire RTNL conditionally. If the rtnl_trylock() fails, the functions
reschedule and return immediately. For the cases that are attempting to
perform link failover, a delay of 1 is used; for the other cases, the
normal interval is used (as those activities are not as time critical).
Additionally, the bond_mii_monitor function now stores the delay
in a variable (mimicking the structure of activebackup_arp_mon).
Lastly, all of the above renders the kill_timers sentinel moot,
and therefore it has been removed.
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-28 19:42:50 +04:00
2005-04-17 02:20:36 +04:00
read_unlock ( & bond - > lock ) ;
2011-04-26 19:25:52 +04:00
if ( should_notify_peers ) {
2013-09-02 15:51:38 +04:00
if ( ! rtnl_trylock ( ) )
2011-10-28 19:42:50 +04:00
return ;
2012-08-10 02:14:57 +04:00
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS , bond - > dev ) ;
2011-04-26 19:25:52 +04:00
rtnl_unlock ( ) ;
}
2005-04-17 02:20:36 +04:00
}
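The conditional RTNL acquisition described in the "eliminate bond_close race conditions" commit message can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver's code: `demo_rtnl`, `demo_rtnl_trylock`, `demo_rtnl_unlock`, `struct demo_monitor` and `demo_monitor_pass` are hypothetical names, and a C11 `atomic_flag` stands in for RTNL. It only shows the shape of the idea: if the try-lock fails, the periodic function does not block but asks to be requeued with a delay of 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for RTNL: a single global try-lock. */
static atomic_flag demo_rtnl = ATOMIC_FLAG_INIT;

static bool demo_rtnl_trylock(void)
{
	/* test_and_set returns the previous value: false means we took it. */
	return !atomic_flag_test_and_set(&demo_rtnl);
}

static void demo_rtnl_unlock(void)
{
	atomic_flag_clear(&demo_rtnl);
}

struct demo_monitor {
	int interval;	/* normal polling interval, in ticks */
	int failovers;	/* failovers actually performed */
};

/*
 * One pass of a periodic monitor. Returns the delay (in ticks) after
 * which the caller should requeue the work.
 */
static int demo_monitor_pass(struct demo_monitor *m, bool need_failover)
{
	assert(m->interval > 0);

	if (!need_failover)
		return m->interval;

	/* Never sleep on the lock: a closer may hold it while waiting for us. */
	if (!demo_rtnl_trylock())
		return 1;	/* lock busy: retry on the next tick */

	m->failovers++;		/* the failover work would happen here */
	demo_rtnl_unlock();
	return m->interval;
}
```

Because the pass never blocks on the lock, a caller that holds the lock while waiting for the work to finish (as bond_close does with cancel_delayed_work_sync) cannot deadlock against it.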
/*-------------------------- netdev event handling --------------------------*/
/*
* Change device name
*/
static int bond_event_changename ( struct bonding * bond )
{
bond_remove_proc_entry ( bond ) ;
bond_create_proc_entry ( bond ) ;
2009-06-12 23:02:46 +04:00
2010-12-09 18:17:13 +03:00
bond_debug_reregister ( bond ) ;
2005-04-17 02:20:36 +04:00
return NOTIFY_DONE ;
}
2009-06-12 23:02:48 +04:00
static int bond_master_netdev_event ( unsigned long event ,
struct net_device * bond_dev )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * event_bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
switch ( event ) {
case NETDEV_CHANGENAME :
return bond_event_changename ( event_bond ) ;
2012-07-09 14:51:45 +04:00
case NETDEV_UNREGISTER :
bond_remove_proc_entry ( event_bond ) ;
break ;
case NETDEV_REGISTER :
bond_create_proc_entry ( event_bond ) ;
break ;
2013-09-02 15:51:38 +04:00
case NETDEV_NOTIFY_PEERS :
if ( event_bond - > send_peer_notif )
event_bond - > send_peer_notif - - ;
break ;
2005-04-17 02:20:36 +04:00
default :
break ;
}
return NOTIFY_DONE ;
}
2009-06-12 23:02:48 +04:00
static int bond_slave_netdev_event ( unsigned long event ,
struct net_device * slave_dev )
2005-04-17 02:20:36 +04:00
{
2013-01-04 02:49:01 +04:00
struct slave * slave = bond_slave_get_rtnl ( slave_dev ) ;
2013-04-11 13:18:55 +04:00
struct bonding * bond ;
struct net_device * bond_dev ;
2013-01-04 02:49:01 +04:00
u32 old_speed ;
u8 old_duplex ;
2005-04-17 02:20:36 +04:00
2013-04-11 13:18:55 +04:00
/* A netdev event can be generated while enslaving a device
* before netdev_rx_handler_register is called in which case
* slave will be NULL
*/
	if (!slave)
		return NOTIFY_DONE;
	bond_dev = slave->bond->dev;
	bond = slave->bond;
2005-04-17 02:20:36 +04:00
switch ( event ) {
case NETDEV_UNREGISTER :
2013-06-26 19:13:37 +04:00
if ( bond_dev - > type ! = ARPHRD_ETHER )
2013-01-04 02:49:01 +04:00
bond_release_and_destroy ( bond_dev , slave_dev ) ;
else
bond_release ( bond_dev , slave_dev ) ;
2005-04-17 02:20:36 +04:00
break ;
bonding: update speed/duplex for NETDEV_CHANGE
Zheng Liang (lzheng@redhat.com) found a bug: when bonding is configured with
the ARP monitor, the bonding driver sometimes cannot get the speed and duplex
from its slaves and assumes them to be 100 Mb/s and Full; see
/proc/net/bonding/bond0.
The problem does not occur when miimon is used.
(Take igb as an example.)
The reason is that, after dev_open() in bond_enslave(),
bond_update_speed_duplex() calls igb_get_settings(), but that function
runs ethtool_cmd_speed_set(ecmd, -1); ecmd->duplex = -1;
because igb gets an error value for the status.
So even though dev_open() has been called, the device is not really ready to
report its settings.
It may be safe to call igb_get_settings() only after this message shows up:
"igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX".
So I prefer to update the speed and duplex for a slave when it receives a
NETDEV_CHANGE/NETDEV_UP event.
Changelog
V2:
1 remove the "fake 100/Full" logic in bond_update_speed_duplex(),
set speed and duplex to -1 when it gets an error value of speed and duplex.
2 delete the warning in bond_enslave() if bond_update_speed_duplex() returns
an error.
3 make bond_info_show_slave() handle bad values of speed and duplex.
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-31 21:20:48 +04:00
case NETDEV_UP :
2005-04-17 02:20:36 +04:00
case NETDEV_CHANGE :
2013-01-04 02:49:01 +04:00
old_speed = slave - > speed ;
old_duplex = slave - > duplex ;
2009-03-19 04:38:25 +03:00
2013-01-04 02:49:01 +04:00
bond_update_speed_duplex ( slave ) ;
2009-03-19 04:38:25 +03:00
2013-01-04 02:49:01 +04:00
		if (bond->params.mode == BOND_MODE_8023AD) {
			if (old_speed != slave->speed)
				bond_3ad_adapter_speed_changed(slave);
			if (old_duplex != slave->duplex)
				bond_3ad_adapter_duplex_changed(slave);
2009-03-19 04:38:25 +03:00
}
2005-04-17 02:20:36 +04:00
break ;
case NETDEV_DOWN :
/*
* . . . Or is it this ?
*/
break ;
case NETDEV_CHANGEMTU :
		/*
		 * TODO: Should slaves be allowed to
		 * independently alter their MTU? For
		 * an active-backup bond, slaves need
		 * not be the same type of device, so
		 * MTUs may vary. For other modes,
		 * slaves arguably should have the
		 * same MTUs. To do this, we'd need to
		 * take over the slave's change_mtu
		 * function for the duration of their
		 * servitude.
		 */
break ;
case NETDEV_CHANGENAME :
		/*
		 * TODO: handle changing the primary's name
		 */
break ;
2005-08-23 09:34:53 +04:00
case NETDEV_FEAT_CHANGE :
bond_compute_features ( bond ) ;
break ;
2013-07-20 14:13:53 +04:00
case NETDEV_RESEND_IGMP :
/* Propagate to master device */
call_netdevice_notifiers ( event , slave - > bond - > dev ) ;
break ;
2005-04-17 02:20:36 +04:00
default :
break ;
}
return NOTIFY_DONE ;
}
/*
* bond_netdev_event : handle netdev notifier chain events .
*
* This function receives events for the netdev chain . The caller ( an
[PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent than calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 13:16:30 +04:00
* ioctl handler calling blocking_notifier_call_chain ) holds the necessary
2005-04-17 02:20:36 +04:00
* locks for us to safely manipulate the slave devices ( RTNL lock ,
* dev_probe_lock ) .
*/
2009-06-12 23:02:48 +04:00
static int bond_netdev_event ( struct notifier_block * this ,
unsigned long event , void * ptr )
2005-04-17 02:20:36 +04:00
{
2013-05-28 05:30:21 +04:00
struct net_device * event_dev = netdev_notifier_info_to_dev ( ptr ) ;
2005-04-17 02:20:36 +04:00
2008-12-10 10:09:22 +03:00
pr_debug ( " event_dev: %s, event: %lx \n " ,
2009-12-14 07:06:07 +03:00
event_dev ? event_dev - > name : " None " ,
event ) ;
2005-04-17 02:20:36 +04:00
2006-09-23 08:54:10 +04:00
if ( ! ( event_dev - > priv_flags & IFF_BONDING ) )
return NOTIFY_DONE ;
2005-04-17 02:20:36 +04:00
if ( event_dev - > flags & IFF_MASTER ) {
2008-12-10 10:09:22 +03:00
pr_debug ( " IFF_MASTER \n " ) ;
2005-04-17 02:20:36 +04:00
return bond_master_netdev_event ( event , event_dev ) ;
}
if ( event_dev - > flags & IFF_SLAVE ) {
2008-12-10 10:09:22 +03:00
pr_debug ( " IFF_SLAVE \n " ) ;
2005-04-17 02:20:36 +04:00
return bond_slave_netdev_event ( event , event_dev ) ;
}
return NOTIFY_DONE ;
}
static struct notifier_block bond_netdev_notifier = {
. notifier_call = bond_netdev_event ,
} ;
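The notifier-chain mechanism that delivers events to bond_netdev_event can be pictured with a small user-space model. This is a simplified sketch under stated assumptions, not the kernel API: `struct demo_notifier`, `demo_chain_register`, `demo_chain_call`, `demo_count_event` and the `DEMO_NOTIFY_*` values are hypothetical, and the sketch omits the locking, RCU and priority ordering that the real blocking and atomic chains provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes, loosely modelled on NOTIFY_DONE / NOTIFY_STOP. */
#define DEMO_NOTIFY_DONE 0
#define DEMO_NOTIFY_STOP 1

struct demo_notifier {
	int (*call)(struct demo_notifier *self, unsigned long event, void *data);
	struct demo_notifier *next;
};

/* Push a callback onto the front of the chain (no locking in this sketch). */
static void demo_chain_register(struct demo_notifier **head,
				struct demo_notifier *nb)
{
	assert(nb->call != NULL);
	nb->next = *head;
	*head = nb;
}

/* Invoke every callback in order until one asks to stop. */
static int demo_chain_call(struct demo_notifier *head,
			   unsigned long event, void *data)
{
	int ret = DEMO_NOTIFY_DONE;

	for (struct demo_notifier *nb = head; nb; nb = nb->next) {
		ret = nb->call(nb, event, data);
		if (ret == DEMO_NOTIFY_STOP)
			break;
	}
	return ret;
}

/* Example callback: counts occurrences of hypothetical event id 1. */
static int demo_count_event(struct demo_notifier *self,
			    unsigned long event, void *data)
{
	int *counter = data;

	(void)self;
	if (event == 1)
		(*counter)++;
	return DEMO_NOTIFY_DONE;
}
```

Like bond_netdev_event, the example callback ignores events it does not care about and returns a "done" code so that the remaining callbacks on the chain still run.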
2005-06-27 01:54:11 +04:00
/*---------------------------- Hashing Policies -----------------------------*/
bonding: support for IPv6 transmit hashing
Currently the "bonding" driver does not support load balancing outgoing
traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
are currently supported; this patch adds transmit hashing for IPv6 (and
TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
bonding driver. In addition, bounds checking has been added to all
transmit hashing functions.
The algorithm chosen (xor'ing the bottom three quads of the source and
destination addresses together, then xor'ing each byte of that result into
the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
was selected after testing almost 400,000 unique IPv6 addresses harvested
from server logs. This algorithm had the most even distribution for both
big- and little-endian architectures while still using few instructions. Its
behavior also attempts to closely match that of the IPv4 algorithm.
The IPv6 flow label was intentionally not included in the hash as it appears
to be unset in the vast majority of IPv6 traffic sampled, and the current
algorithm not using the flow label already offers a very even distribution.
Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
ie, they are not balanced based on layer 4 information. Additionally,
IPv6 packets with intermediate headers are not balanced based on layer
4 information. In practice these intermediate headers are not common and
this should not cause any problems, and the alternative (a packet-parsing
loop and look-up table) seemed slow and complicated for little gain.
Tested-by: John Eaglesham <linux@8192.net>
Signed-off-by: John Eaglesham <linux@8192.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 00:43:35 +04:00
/*
* Hash for the output device based upon layer 2 data
*/
static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
{
	struct ethhdr *data = (struct ethhdr *)skb->data;

	if (skb_headlen(skb) >= offsetof(struct ethhdr, h_proto))
		return (data->h_dest[5] ^ data->h_source[5]) % count;

	return 0;
}
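As a worked illustration of the layer 2 policy above, the sketch below computes the same expression on plain byte arrays in user space. `demo_hash_l2` is a hypothetical helper, not part of the driver, and it assumes the caller has already extracted the two 6-byte MAC addresses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * XOR the last byte of the destination and source MAC addresses and
 * reduce the result modulo the number of slaves.
 */
static int demo_hash_l2(const uint8_t dst[6], const uint8_t src[6], int count)
{
	assert(count > 0);
	return (dst[5] ^ src[5]) % count;
}
```

Only the final byte of each address contributes, so two peers whose MAC addresses differ only in the earlier bytes always map to the same slave.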
2007-12-07 10:40:34 +03:00
/*
* Hash for the output device based upon layer 2 and layer 3 data . If
2012-08-22 00:43:35 +04:00
* the packet is not IP , fall back on bond_xmit_hash_policy_l2 ( )
2007-12-07 10:40:34 +03:00
*/
2009-10-23 08:09:24 +04:00
static int bond_xmit_hash_policy_l23 ( struct sk_buff * skb , int count )
2007-12-07 10:40:34 +03:00
{
2013-04-15 21:03:24 +04:00
const struct ethhdr * data ;
const struct iphdr * iph ;
const struct ipv6hdr * ipv6h ;
2012-08-22 00:43:35 +04:00
u32 v6hash ;
2013-04-15 21:03:24 +04:00
const __be32 * s , * d ;
2012-08-22 00:43:35 +04:00
if ( skb - > protocol = = htons ( ETH_P_IP ) & &
2013-04-15 21:03:24 +04:00
pskb_network_may_pull ( skb , sizeof ( * iph ) ) ) {
2012-08-22 00:43:35 +04:00
iph = ip_hdr ( skb ) ;
2013-04-15 21:03:24 +04:00
data = ( struct ethhdr * ) skb - > data ;
2007-12-07 10:40:34 +03:00
return ( ( ntohl ( iph - > saddr ^ iph - > daddr ) & 0xffff ) ^
2009-10-23 08:08:46 +04:00
( data - > h_dest [ 5 ] ^ data - > h_source [ 5 ] ) ) % count ;
2012-08-22 00:43:35 +04:00
} else if ( skb - > protocol = = htons ( ETH_P_IPV6 ) & &
2013-04-15 21:03:24 +04:00
pskb_network_may_pull ( skb , sizeof ( * ipv6h ) ) ) {
2012-08-22 00:43:35 +04:00
ipv6h = ipv6_hdr ( skb ) ;
2013-04-15 21:03:24 +04:00
data = ( struct ethhdr * ) skb - > data ;
2012-08-22 00:43:35 +04:00
		s = &ipv6h->saddr.s6_addr32[0];
		d = &ipv6h->daddr.s6_addr32[0];
		v6hash = (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
		v6hash ^= (v6hash >> 24) ^ (v6hash >> 16) ^ (v6hash >> 8);
		return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
2007-12-07 10:40:34 +03:00
}
2012-08-22 00:43:35 +04:00
return bond_xmit_hash_policy_l2 ( skb , count ) ;
2007-12-07 10:40:34 +03:00
}
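The IPv6 branch above can be tried outside the kernel. The following is a minimal userspace sketch, not the driver's code: the function name `sketch_ipv6_l23_hash` and its parameters are hypothetical, the addresses are passed as four raw 32-bit words each (no byte-order conversion, as in the driver), and the two MAC arguments stand in for `h_source[5]` and `h_dest[5]`.

```c
#include <stdint.h>

/* Hypothetical stand-in for the IPv6 branch of the layer 2+3 hash.
 * saddr/daddr: the four 32-bit words of each address, used as stored.
 * src_mac_last/dst_mac_last: final byte of each MAC address.
 * count: number of slaves to balance across (assumed non-zero). */
static unsigned int sketch_ipv6_l23_hash(const uint32_t saddr[4],
                                         const uint32_t daddr[4],
                                         uint8_t src_mac_last,
                                         uint8_t dst_mac_last,
                                         unsigned int count)
{
    /* XOR the bottom three quads of source and destination together. */
    uint32_t h = (saddr[1] ^ daddr[1]) ^ (saddr[2] ^ daddr[2]) ^
                 (saddr[3] ^ daddr[3]);

    /* Fold every byte of the result into the bottom byte. */
    h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);

    /* Mix in the last MAC bytes and reduce to a slave index. */
    return (h ^ dst_mac_last ^ src_mac_last) % count;
}
```

Because every step is an XOR, swapping source and destination (addresses and MAC bytes together) yields the same index, so both directions of a flow map to the same slave number on each end.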
2005-06-27 01:54:11 +04:00
/*
2007-05-09 10:57:56 +04:00
* Hash for the output device based upon layer 3 and layer 4 data . If
2005-06-27 01:54:11 +04:00
* the packet is a frag or not TCP or UDP , just use layer 3 data . If it is
2012-08-22 00:43:35 +04:00
* altogether not IP , fall back on bond_xmit_hash_policy_l2 ( )
2005-06-27 01:54:11 +04:00
*/
2009-10-23 08:09:24 +04:00
static int bond_xmit_hash_policy_l34 ( struct sk_buff * skb , int count )
2005-06-27 01:54:11 +04:00
{
2012-08-22 00:43:35 +04:00
u32 layer4_xor = 0 ;
2013-04-15 21:03:24 +04:00
const struct iphdr * iph ;
const struct ipv6hdr * ipv6h ;
const __be32 * s , * d ;
const __be16 * l4 = NULL ;
__be16 _l4 [ 2 ] ;
int noff = skb_network_offset ( skb ) ;
int poff ;
2005-06-27 01:54:11 +04:00
2012-08-22 00:43:35 +04:00
if ( skb - > protocol = = htons ( ETH_P_IP ) & &
2013-04-15 21:03:24 +04:00
pskb_may_pull ( skb , noff + sizeof ( * iph ) ) ) {
2012-08-22 00:43:35 +04:00
iph = ip_hdr ( skb ) ;
2013-04-15 21:03:24 +04:00
poff = proto_ports_offset ( iph - > protocol ) ;
if ( ! ip_is_fragment ( iph ) & & poff > = 0 ) {
l4 = skb_header_pointer ( skb , noff + ( iph - > ihl < < 2 ) + poff ,
sizeof ( _l4 ) , & _l4 ) ;
if ( l4 )
layer4_xor = ntohs ( l4 [ 0 ] ^ l4 [ 1 ] ) ;
2005-06-27 01:54:11 +04:00
}
return ( layer4_xor ^
( ( ntohl ( iph - > saddr ^ iph - > daddr ) ) & 0xffff ) ) % count ;
2012-08-22 00:43:35 +04:00
} else if ( skb - > protocol = = htons ( ETH_P_IPV6 ) & &
2013-04-15 21:03:24 +04:00
pskb_may_pull ( skb , noff + sizeof ( * ipv6h ) ) ) {
2012-08-22 00:43:35 +04:00
ipv6h = ipv6_hdr ( skb ) ;
2013-04-15 21:03:24 +04:00
poff = proto_ports_offset ( ipv6h - > nexthdr ) ;
if ( poff > = 0 ) {
l4 = skb_header_pointer ( skb , noff + sizeof ( * ipv6h ) + poff ,
sizeof ( _l4 ) , & _l4 ) ;
if ( l4 )
layer4_xor = ntohs ( l4 [ 0 ] ^ l4 [ 1 ] ) ;
2012-08-22 00:43:35 +04:00
}
s = & ipv6h - > saddr . s6_addr32 [ 0 ] ;
d = & ipv6h - > daddr . s6_addr32 [ 0 ] ;
layer4_xor ^ = ( s [ 1 ] ^ d [ 1 ] ) ^ ( s [ 2 ] ^ d [ 2 ] ) ^ ( s [ 3 ] ^ d [ 3 ] ) ;
layer4_xor ^ = ( layer4_xor > > 24 ) ^ ( layer4_xor > > 16 ) ^
( layer4_xor > > 8 ) ;
return layer4_xor % count ;
2005-06-27 01:54:11 +04:00
}
2012-08-22 00:43:35 +04:00
return bond_xmit_hash_policy_l2 ( skb , count ) ;
2005-06-27 01:54:11 +04:00
}
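The IPv4 branch of the layer 3+4 hash above reduces to a few XORs once the header fields are in hand. The sketch below is a userspace approximation under stated assumptions: `sketch_ipv4_l34_hash` and its `use_ports` flag are hypothetical names, and `use_ports` stands in for the driver's combined check that the packet is not a fragment and that the port words could be read. Addresses and ports are passed in network byte order, as they appear in the headers.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* ntohs, ntohl */

/* Hypothetical stand-in for the IPv4 branch of the layer 3+4 hash.
 * use_ports: non-zero when the ports may be hashed (assumption: the
 * caller has already ruled out fragments and unreadable headers).
 * count: number of slaves to balance across (assumed non-zero). */
static unsigned int sketch_ipv4_l34_hash(uint32_t saddr_be, uint32_t daddr_be,
                                         uint16_t sport_be, uint16_t dport_be,
                                         int use_ports, unsigned int count)
{
    uint32_t layer4_xor = 0;

    if (use_ports)
        layer4_xor = ntohs((uint16_t)(sport_be ^ dport_be));

    /* Only the low 16 bits of the address XOR take part in the hash. */
    return (layer4_xor ^ (ntohl(saddr_be ^ daddr_be) & 0xffff)) % count;
}
```

With `use_ports` set to zero the result depends on the addresses alone, which matches the comment's statement that fragments and non-TCP/UDP packets fall back to layer 3 data.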
2005-04-17 02:20:36 +04:00
/*-------------------------- Device entry points ----------------------------*/
2012-11-29 05:31:31 +04:00
static void bond_work_init_all ( struct bonding * bond )
{
INIT_DELAYED_WORK ( & bond - > mcast_work ,
bond_resend_igmp_join_requests_delayed ) ;
INIT_DELAYED_WORK ( & bond - > alb_work , bond_alb_monitor ) ;
INIT_DELAYED_WORK ( & bond - > mii_work , bond_mii_monitor ) ;
if ( bond - > params . mode = = BOND_MODE_ACTIVEBACKUP )
INIT_DELAYED_WORK ( & bond - > arp_work , bond_activebackup_arp_mon ) ;
else
INIT_DELAYED_WORK ( & bond - > arp_work , bond_loadbalance_arp_mon ) ;
INIT_DELAYED_WORK ( & bond - > ad_work , bond_3ad_state_machine_handler ) ;
}
static void bond_work_cancel_all ( struct bonding * bond )
{
cancel_delayed_work_sync ( & bond - > mii_work ) ;
cancel_delayed_work_sync ( & bond - > arp_work ) ;
cancel_delayed_work_sync ( & bond - > alb_work ) ;
cancel_delayed_work_sync ( & bond - > ad_work ) ;
cancel_delayed_work_sync ( & bond - > mcast_work ) ;
}
2005-04-17 02:20:36 +04:00
static int bond_open ( struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
bonding:reset backup and inactive flag of slave
Eduard Sinelnikov (eduard.sinelnikov@gmail.com) found that if we change
bonding mode from active backup to round robin, some slaves are still keeping
"backup", and won't transmit packets.
As Jay Vosburgh(fubar@us.ibm.com) pointed out that we can work around that by
removing the bond_is_active_slave() check, because the "backup" flag is only
meaningful for active backup mode.
But if we just simply ignore the bond_is_active_slave() check,
the transmission will work fine, but we can't maintain the correct value of
"backup" flag for each slaves, though it is meaningless for other mode than
active backup.
I'd like to reset "backup" and "inactive" flag in bond_open,
thus we can keep the correct value of them.
As for bond_is_active_slave(), I'd like to prepare another patch to handle it.
V2:
Use C style comment.
Move read_lock(&bond->curr_slave_lock).
Replace restore with reset, for active backup mode, it means "restore",
but for other modes, it means "reset".
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-15 19:57:35 +04:00
struct slave * slave ;
2005-04-17 02:20:36 +04:00
2011-08-15 19:57:35 +04:00
/* reset slave->backup and slave->inactive */
read_lock ( & bond - > lock ) ;
2013-09-25 11:20:21 +04:00
if ( bond_has_slaves ( bond ) ) {
2011-08-15 19:57:35 +04:00
read_lock ( & bond - > curr_slave_lock ) ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2011-08-15 19:57:35 +04:00
if ( ( bond - > params . mode = = BOND_MODE_ACTIVEBACKUP )
& & ( slave ! = bond - > curr_active_slave ) ) {
bond_set_slave_inactive_flags ( slave ) ;
} else {
bond_set_slave_active_flags ( slave ) ;
}
}
read_unlock ( & bond - > curr_slave_lock ) ;
}
read_unlock ( & bond - > lock ) ;
2012-11-29 05:31:31 +04:00
bond_work_init_all ( bond ) ;
2010-10-05 18:23:57 +04:00
2008-12-10 10:07:13 +03:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 02:20:36 +04:00
/* bond_alb_initialize must be called before the timer
* is started .
*/
2012-11-29 05:31:31 +04:00
if ( bond_alb_initialize ( bond , ( bond - > params . mode = = BOND_MODE_ALB ) ) )
2010-01-26 02:34:15 +03:00
return - ENOMEM ;
2007-10-18 04:37:45 +04:00
queue_delayed_work ( bond - > wq , & bond - > alb_work , 0 ) ;
2005-04-17 02:20:36 +04:00
}
2012-11-29 05:31:31 +04:00
if ( bond - > params . miimon ) /* link check interval, in milliseconds. */
2007-10-18 04:37:45 +04:00
queue_delayed_work ( bond - > wq , & bond - > mii_work , 0 ) ;
2005-04-17 02:20:36 +04:00
if ( bond - > params . arp_interval ) { /* arp interval, in milliseconds. */
2007-10-18 04:37:45 +04:00
queue_delayed_work ( bond - > wq , & bond - > arp_work , 0 ) ;
2006-09-23 08:54:53 +04:00
if ( bond - > params . arp_validate )
2011-04-19 07:48:16 +04:00
bond - > recv_probe = bond_arp_rcv ;
2005-04-17 02:20:36 +04:00
}
if ( bond - > params . mode = = BOND_MODE_8023AD ) {
2007-10-18 04:37:45 +04:00
queue_delayed_work ( bond - > wq , & bond - > ad_work , 0 ) ;
2005-04-17 02:20:36 +04:00
/* register to receive LACPDUs */
2011-04-19 07:48:16 +04:00
bond - > recv_probe = bond_3ad_lacpdu_recv ;
2008-11-05 04:51:16 +03:00
bond_3ad_initiate_agg_selection ( bond , 1 ) ;
2005-04-17 02:20:36 +04:00
}
return 0 ;
}
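The flag reset at the top of bond_open ( ) can be illustrated in isolation. This is a simplified sketch, assuming that a slave's state is just two booleans; the `sketch_*` names are hypothetical, and the real bond_set_slave_inactive_flags ( ) and bond_set_slave_active_flags ( ) helpers may apply further conditions that are not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of the bonding mode and slave state. */
enum sketch_mode { SKETCH_MODE_ROUNDROBIN, SKETCH_MODE_ACTIVEBACKUP };

struct sketch_slave {
    bool backup;
    bool inactive;
};

/* Reset both flags for every slave: in active-backup mode every slave
 * except the current active one is marked backup/inactive; in any other
 * mode all slaves are marked active. */
static void sketch_reset_flags(struct sketch_slave *slaves, size_t n,
                               enum sketch_mode mode,
                               const struct sketch_slave *curr_active)
{
    for (size_t i = 0; i < n; i++) {
        bool standby = (mode == SKETCH_MODE_ACTIVEBACKUP) &&
                       (&slaves[i] != curr_active);

        slaves[i].backup = standby;
        slaves[i].inactive = standby;
    }
}
```

Running the reset again after switching the mode to round robin clears any leftover backup state, which is the situation the commit message describes.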
static int bond_close ( struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
2012-11-29 05:31:31 +04:00
bond_work_cancel_all ( bond ) ;
2013-09-02 15:51:38 +04:00
bond - > send_peer_notif = 0 ;
2013-09-02 15:51:39 +04:00
if ( bond_is_lb ( bond ) )
2005-04-17 02:20:36 +04:00
bond_alb_deinitialize ( bond ) ;
2011-04-19 07:48:16 +04:00
bond - > recv_probe = NULL ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
2010-07-08 01:58:56 +04:00
static struct rtnl_link_stats64 * bond_get_stats ( struct net_device * bond_dev ,
struct rtnl_link_stats64 * stats )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2010-07-08 01:58:56 +04:00
struct rtnl_link_stats64 temp ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
struct slave * slave ;
2010-07-08 01:58:56 +04:00
memset ( stats , 0 , sizeof ( * stats ) ) ;
2005-04-17 02:20:36 +04:00
read_lock_bh ( & bond - > lock ) ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2010-06-08 11:19:54 +04:00
const struct rtnl_link_stats64 * sstats =
2010-07-08 01:58:56 +04:00
dev_get_stats ( slave - > dev , & temp ) ;
stats - > rx_packets + = sstats - > rx_packets ;
stats - > rx_bytes + = sstats - > rx_bytes ;
stats - > rx_errors + = sstats - > rx_errors ;
stats - > rx_dropped + = sstats - > rx_dropped ;
stats - > tx_packets + = sstats - > tx_packets ;
stats - > tx_bytes + = sstats - > tx_bytes ;
stats - > tx_errors + = sstats - > tx_errors ;
stats - > tx_dropped + = sstats - > tx_dropped ;
stats - > multicast + = sstats - > multicast ;
stats - > collisions + = sstats - > collisions ;
stats - > rx_length_errors + = sstats - > rx_length_errors ;
stats - > rx_over_errors + = sstats - > rx_over_errors ;
stats - > rx_crc_errors + = sstats - > rx_crc_errors ;
stats - > rx_frame_errors + = sstats - > rx_frame_errors ;
stats - > rx_fifo_errors + = sstats - > rx_fifo_errors ;
stats - > rx_missed_errors + = sstats - > rx_missed_errors ;
stats - > tx_aborted_errors + = sstats - > tx_aborted_errors ;
stats - > tx_carrier_errors + = sstats - > tx_carrier_errors ;
stats - > tx_fifo_errors + = sstats - > tx_fifo_errors ;
stats - > tx_heartbeat_errors + = sstats - > tx_heartbeat_errors ;
stats - > tx_window_errors + = sstats - > tx_window_errors ;
2008-01-30 05:07:46 +03:00
}
2005-04-17 02:20:36 +04:00
read_unlock_bh ( & bond - > lock ) ;
return stats ;
}
static int bond_do_ioctl ( struct net_device * bond_dev , struct ifreq * ifr , int cmd )
{
struct net_device * slave_dev = NULL ;
struct ifbond k_binfo ;
struct ifbond __user * u_binfo = NULL ;
struct ifslave k_sinfo ;
struct ifslave __user * u_sinfo = NULL ;
struct mii_ioctl_data * mii = NULL ;
2013-01-31 20:31:00 +04:00
struct net * net ;
2005-04-17 02:20:36 +04:00
int res = 0 ;
2009-12-14 07:06:07 +03:00
pr_debug ( "bond_ioctl: master=%s, cmd=%d\n" , bond_dev - > name , cmd ) ;
2005-04-17 02:20:36 +04:00
switch ( cmd ) {
case SIOCGMIIPHY :
mii = if_mii ( ifr ) ;
2009-06-12 23:02:48 +04:00
if ( ! mii )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2009-06-12 23:02:48 +04:00
2005-04-17 02:20:36 +04:00
mii - > phy_id = 0 ;
/* Fall Through */
case SIOCGMIIREG :
/*
* We do this again just in case we were called by SIOCGMIIREG
* instead of SIOCGMIIPHY .
*/
mii = if_mii ( ifr ) ;
2009-06-12 23:02:48 +04:00
if ( ! mii )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2009-06-12 23:02:48 +04:00
2005-04-17 02:20:36 +04:00
if ( mii - > reg_num = = 1 ) {
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
mii - > val_out = 0 ;
2007-10-18 04:37:50 +04:00
read_lock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
read_lock ( & bond - > curr_slave_lock ) ;
2009-06-12 23:02:48 +04:00
if ( netif_carrier_ok ( bond - > dev ) )
2005-04-17 02:20:36 +04:00
mii - > val_out = BMSR_LSTATUS ;
2009-06-12 23:02:48 +04:00
2005-04-17 02:20:36 +04:00
read_unlock ( & bond - > curr_slave_lock ) ;
2007-10-18 04:37:50 +04:00
read_unlock ( & bond - > lock ) ;
2005-04-17 02:20:36 +04:00
}
return 0 ;
case BOND_INFO_QUERY_OLD :
case SIOCBONDINFOQUERY :
u_binfo = ( struct ifbond __user * ) ifr - > ifr_data ;
2009-06-12 23:02:48 +04:00
if ( copy_from_user ( & k_binfo , u_binfo , sizeof ( ifbond ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
res = bond_info_query ( bond_dev , & k_binfo ) ;
2009-06-12 23:02:48 +04:00
if ( res = = 0 & &
copy_to_user ( u_binfo , & k_binfo , sizeof ( ifbond ) ) )
return - EFAULT ;
2005-04-17 02:20:36 +04:00
return res ;
case BOND_SLAVE_INFO_QUERY_OLD :
case SIOCBONDSLAVEINFOQUERY :
u_sinfo = ( struct ifslave __user * ) ifr - > ifr_data ;
2009-06-12 23:02:48 +04:00
if ( copy_from_user ( & k_sinfo , u_sinfo , sizeof ( ifslave ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
res = bond_slave_info_query ( bond_dev , & k_sinfo ) ;
2009-06-12 23:02:48 +04:00
if ( res = = 0 & &
copy_to_user ( u_sinfo , & k_sinfo , sizeof ( ifslave ) ) )
return - EFAULT ;
2005-04-17 02:20:36 +04:00
return res ;
default :
/* Go on */
break ;
}
2013-01-31 20:31:00 +04:00
net = dev_net ( bond_dev ) ;
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
2005-04-17 02:20:36 +04:00
return - EPERM ;
2013-01-31 20:31:00 +04:00
slave_dev = dev_get_by_name ( net , ifr - > ifr_slave ) ;
2005-04-17 02:20:36 +04:00
2009-12-14 07:06:07 +03:00
pr_debug ( "slave_dev=%p:\n" , slave_dev ) ;
2005-04-17 02:20:36 +04:00
2009-06-12 23:02:48 +04:00
if ( ! slave_dev )
2005-04-17 02:20:36 +04:00
res = - ENODEV ;
2009-06-12 23:02:48 +04:00
else {
2009-12-14 07:06:07 +03:00
pr_debug ( "slave_dev->name=%s:\n" , slave_dev - > name ) ;
2005-04-17 02:20:36 +04:00
switch ( cmd ) {
case BOND_ENSLAVE_OLD :
case SIOCBONDENSLAVE :
res = bond_enslave ( bond_dev , slave_dev ) ;
break ;
case BOND_RELEASE_OLD :
case SIOCBONDRELEASE :
res = bond_release ( bond_dev , slave_dev ) ;
break ;
case BOND_SETHWADDR_OLD :
case SIOCBONDSETHWADDR :
2013-01-30 14:08:11 +04:00
bond_set_dev_addr ( bond_dev , slave_dev ) ;
res = 0 ;
2005-04-17 02:20:36 +04:00
break ;
case BOND_CHANGE_ACTIVE_OLD :
case SIOCBONDCHANGEACTIVE :
res = bond_ioctl_change_active ( bond_dev , slave_dev ) ;
break ;
default :
res = - EOPNOTSUPP ;
}
dev_put ( slave_dev ) ;
}
return res ;
}
2011-08-16 07:15:04 +04:00
static void bond_change_rx_flags ( struct net_device * bond_dev , int change )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
2011-08-16 07:15:04 +04:00
if ( change & IFF_PROMISC )
bond_set_promiscuity ( bond ,
bond_dev - > flags & IFF_PROMISC ? 1 : - 1 ) ;
2009-06-12 23:02:48 +04:00
2011-08-16 07:15:04 +04:00
if ( change & IFF_ALLMULTI )
bond_set_allmulti ( bond ,
bond_dev - > flags & IFF_ALLMULTI ? 1 : - 1 ) ;
}
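bond_change_rx_flags ( ) turns a flag change into a +1 or -1 adjustment. The sketch below shows only that mapping; the `SKETCH_IFF_*` constants and the `sketch_rx_deltas` helper are hypothetical names introduced for illustration, with the constants chosen to mirror the usual IFF_PROMISC and IFF_ALLMULTI bit values.

```c
/* Hypothetical mirrors of IFF_PROMISC and IFF_ALLMULTI. */
#define SKETCH_IFF_PROMISC  0x100u
#define SKETCH_IFF_ALLMULTI 0x200u

/* Adjustment to apply for each mode: +1, -1, or 0 when unchanged. */
struct sketch_rx_delta {
    int promisc;
    int allmulti;
};

/* flags:  the device flags after the change.
 * change: bitmask of the flags that were toggled. */
static struct sketch_rx_delta sketch_rx_deltas(unsigned int flags,
                                               unsigned int change)
{
    struct sketch_rx_delta d = { 0, 0 };

    if (change & SKETCH_IFF_PROMISC)
        d.promisc = (flags & SKETCH_IFF_PROMISC) ? 1 : -1;

    if (change & SKETCH_IFF_ALLMULTI)
        d.allmulti = (flags & SKETCH_IFF_ALLMULTI) ? 1 : -1;

    return d;
}
```

A flag that was not toggled produces a zero delta, so nothing is propagated for it.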
2005-04-17 02:20:36 +04:00
2013-05-31 15:57:30 +04:00
static void bond_set_rx_mode ( struct net_device * bond_dev )
2011-08-16 07:15:04 +04:00
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-05-31 15:57:30 +04:00
struct slave * slave ;
2005-04-17 02:20:36 +04:00
2013-08-05 16:56:06 +04:00
ASSERT_RTNL ( ) ;
2008-01-30 05:07:44 +03:00
2013-05-31 15:57:30 +04:00
if ( USES_PRIMARY ( bond - > params . mode ) ) {
2013-08-05 16:56:06 +04:00
slave = rtnl_dereference ( bond - > curr_active_slave ) ;
2013-05-31 15:57:30 +04:00
if ( slave ) {
dev_uc_sync ( slave - > dev , bond_dev ) ;
dev_mc_sync ( slave - > dev , bond_dev ) ;
}
} else {
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2013-05-31 15:57:30 +04:00
dev_uc_sync_multiple ( slave - > dev , bond_dev ) ;
dev_mc_sync_multiple ( slave - > dev , bond_dev ) ;
}
2005-04-17 02:20:36 +04:00
}
}
2012-04-04 02:56:20 +04:00
static int bond_neigh_init ( struct neighbour * n )
2008-11-21 07:14:53 +03:00
{
2012-04-04 02:56:20 +04:00
struct bonding * bond = netdev_priv ( n - > dev ) ;
const struct net_device_ops * slave_ops ;
struct neigh_parms parms ;
2013-08-01 18:54:47 +04:00
struct slave * slave ;
2012-04-04 02:56:20 +04:00
int ret ;
2013-08-01 18:54:47 +04:00
slave = bond_first_slave ( bond ) ;
2012-04-04 02:56:20 +04:00
if ( ! slave )
return 0 ;
slave_ops = slave - > dev - > netdev_ops ;
if ( ! slave_ops - > ndo_neigh_setup )
return 0 ;
parms . neigh_setup = NULL ;
parms . neigh_cleanup = NULL ;
ret = slave_ops - > ndo_neigh_setup ( slave - > dev , & parms ) ;
if ( ret )
return ret ;
/*
* Assign slave ' s neigh_cleanup to neighbour in case cleanup is called
* after the last slave has been detached . Assumes that all slaves
* utilize the same neigh_cleanup ( true at this writing as only user
* is ipoib ) .
*/
n - > parms - > neigh_cleanup = parms . neigh_cleanup ;
if ( ! parms . neigh_setup )
return 0 ;
return parms . neigh_setup ( n ) ;
}
/*
* The bonding ndo_neigh_setup is called at init time before any
* slave exists . So we must declare a proxy setup function which will
* be used at run time to resolve the actual slave neigh param setup .
2013-08-02 21:07:39 +04:00
*
* It ' s also called by master devices ( such as vlans ) to setup their
* underlying devices . In that case - do nothing , we ' re already set up from
* our init .
2012-04-04 02:56:20 +04:00
*/
static int bond_neigh_setup ( struct net_device * dev ,
struct neigh_parms * parms )
{
2013-08-02 21:07:39 +04:00
/* modify only our neigh_parms */
if ( parms - > dev = = dev )
parms - > neigh_setup = bond_neigh_init ;
2008-11-21 07:14:53 +03:00
return 0 ;
}
2005-04-17 02:20:36 +04:00
/*
* Change the MTU of all of a master ' s slaves to match the master
*/
static int bond_change_mtu ( struct net_device * bond_dev , int new_mtu )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:13 +04:00
struct slave * slave , * rollback_slave ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
int res = 0 ;
2008-12-10 10:09:22 +03:00
pr_debug ( " bond=%p, name=%s, new_mtu=%d \n " , bond ,
2009-12-14 07:06:07 +03:00
( bond_dev ? bond_dev - > name : " None " ) , new_mtu ) ;
2005-04-17 02:20:36 +04:00
/* Can't hold bond->lock with bh disabled here since
* some base drivers panic . On the other hand we can ' t
* hold bond - > lock without bh disabled because we ' ll
* deadlock . The only solution is to rely on the fact
* that we ' re under rtnl_lock here , and the slaves
* list won ' t change . This doesn ' t solve the problem
* of setting the slave ' s MTU while it is
* transmitting , but the assumption is that the base
* driver can handle that .
*
* TODO : figure out a way to safely iterate the slaves
* list , but without holding a lock around the actual
* call to the base driver .
*/
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2009-12-14 07:06:07 +03:00
pr_debug ( " s %p s->p %p c_m %p \n " ,
slave ,
2013-08-01 18:54:47 +04:00
bond_prev_slave ( bond , slave ) ,
2009-12-14 07:06:07 +03:00
slave - > dev - > netdev_ops - > ndo_change_mtu ) ;
2005-11-09 21:36:50 +03:00
2005-04-17 02:20:36 +04:00
res = dev_set_mtu ( slave - > dev , new_mtu ) ;
if ( res ) {
/* If we failed to set the slave's mtu to the new value
* we must abort the operation even in ACTIVE_BACKUP
* mode , because if we allow the backup slaves to have
* different mtu values than the active slave we ' ll
* need to change their mtu when doing a failover . That
* means changing their mtu from timer context , which
* is probably not a good idea .
*/
2008-12-10 10:09:22 +03:00
pr_debug ( " err %d %s \n " , res , slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
goto unwind ;
}
}
bond_dev - > mtu = new_mtu ;
return 0 ;
unwind :
/* unwind from head to the slave that failed */
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2005-04-17 02:20:36 +04:00
int tmp_res ;
2013-09-25 11:20:13 +04:00
if ( rollback_slave = = slave )
break ;
tmp_res = dev_set_mtu ( rollback_slave - > dev , bond_dev - > mtu ) ;
2005-04-17 02:20:36 +04:00
if ( tmp_res ) {
2009-12-14 07:06:07 +03:00
pr_debug ( " unwind err %d dev %s \n " ,
2013-09-25 11:20:13 +04:00
tmp_res , rollback_slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
return res ;
}
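/*
 * Illustration only, not part of the driver: a minimal userspace sketch of
 * the "apply to every slave, unwind on the first failure" pattern used by
 * bond_change_mtu() above. struct fake_dev, fake_set_mtu() and set_mtu_all()
 * are hypothetical names invented for this sketch; the real code walks the
 * slave list with bond_for_each_slave() and calls dev_set_mtu().
 */

```c
#include <stddef.h>

/* Hypothetical stand-in for a slave device; not a kernel type. */
struct fake_dev {
	int mtu;
	int max_mtu;	/* largest MTU this fake device accepts */
};

/* Hypothetical setter: 0 on success, -1 if new_mtu exceeds max_mtu. */
static int fake_set_mtu(struct fake_dev *dev, int new_mtu)
{
	if (new_mtu > dev->max_mtu)
		return -1;
	dev->mtu = new_mtu;
	return 0;
}

/*
 * Set new_mtu on devs[0..n-1]. If one device refuses, restore old_mtu on
 * the devices that were already changed and return the error.
 */
static int set_mtu_all(struct fake_dev *devs, size_t n, int old_mtu, int new_mtu)
{
	size_t i, j;
	int res;

	for (i = 0; i < n; i++) {
		res = fake_set_mtu(&devs[i], new_mtu);
		if (res) {
			/* unwind from the head up to the device that failed */
			for (j = 0; j < i; j++)
				fake_set_mtu(&devs[j], old_mtu);
			return res;
		}
	}
	return 0;
}
```

/*
 * As in the driver, a failure during the unwind itself is not propagated;
 * the caller only sees the original error.
 */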
/*
* Change HW address
*
* Note that many devices must be down to change the HW address , and
* downing the master releases all slaves . We can make bonds full of
* bonding devices to test this , however .
*/
static int bond_set_mac_address ( struct net_device * bond_dev , void * addr )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:13 +04:00
struct slave * slave , * rollback_slave ;
2005-04-17 02:20:36 +04:00
struct sockaddr * sa = addr , tmp_sa ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
int res = 0 ;
2008-11-20 08:56:05 +03:00
if ( bond - > params . mode = = BOND_MODE_ALB )
return bond_alb_set_mac_address ( bond_dev , addr ) ;
2009-12-14 07:06:07 +03:00
pr_debug ( " bond=%p, name=%s \n " ,
bond , bond_dev ? bond_dev - > name : " None " ) ;
2005-04-17 02:20:36 +04:00
2013-05-31 15:57:31 +04:00
/* If fail_over_mac is enabled, do nothing and return success.
* Returning an error causes ifenslave to fail .
2007-10-10 06:57:24 +04:00
*/
2013-05-31 15:57:31 +04:00
if ( bond - > params . fail_over_mac )
2007-10-10 06:57:24 +04:00
return 0 ;
2007-10-10 06:43:39 +04:00
2009-06-12 23:02:48 +04:00
if ( ! is_valid_ether_addr ( sa - > sa_data ) )
2005-04-17 02:20:36 +04:00
return - EADDRNOTAVAIL ;
/* Can't hold bond->lock with bh disabled here since
* some base drivers panic . On the other hand we can ' t
* hold bond - > lock without bh disabled because we ' ll
* deadlock . The only solution is to rely on the fact
* that we ' re under rtnl_lock here , and the slaves
* list won ' t change . This doesn ' t solve the problem
* of setting the slave ' s hw address while it is
* transmitting , but the assumption is that the base
* driver can handle that .
*
* TODO : figure out a way to safely iterate the slaves
* list , but without holding a lock around the actual
* call to the base driver .
*/
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2008-11-20 08:56:05 +03:00
const struct net_device_ops * slave_ops = slave - > dev - > netdev_ops ;
2008-12-10 10:09:22 +03:00
pr_debug ( " slave %p %s \n " , slave , slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
2008-11-20 08:56:05 +03:00
if ( slave_ops - > ndo_set_mac_address = = NULL ) {
2005-04-17 02:20:36 +04:00
res = - EOPNOTSUPP ;
2008-12-10 10:09:22 +03:00
pr_debug ( " EOPNOTSUPP %s \n " , slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
goto unwind ;
}
res = dev_set_mac_address ( slave - > dev , addr ) ;
if ( res ) {
/* TODO: consider downing the slave
* and retry ?
* User should expect communications
* breakage anyway until ARP finishes
* updating , so . . .
*/
2008-12-10 10:09:22 +03:00
pr_debug ( " err %d %s \n " , res , slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
goto unwind ;
}
}
/* success */
memcpy ( bond_dev - > dev_addr , sa - > sa_data , bond_dev - > addr_len ) ;
return 0 ;
unwind :
memcpy ( tmp_sa . sa_data , bond_dev - > dev_addr , bond_dev - > addr_len ) ;
tmp_sa . sa_family = bond_dev - > type ;
/* unwind from head to the slave that failed */
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2005-04-17 02:20:36 +04:00
int tmp_res ;
2013-09-25 11:20:13 +04:00
if ( rollback_slave = = slave )
break ;
tmp_res = dev_set_mac_address ( rollback_slave - > dev , & tmp_sa ) ;
2005-04-17 02:20:36 +04:00
if ( tmp_res ) {
2009-12-14 07:06:07 +03:00
pr_debug ( " unwind err %d dev %s \n " ,
2013-09-25 11:20:13 +04:00
tmp_res , rollback_slave - > dev - > name ) ;
2005-04-17 02:20:36 +04:00
}
}
return res ;
}
2013-08-01 18:54:50 +04:00
/**
* bond_xmit_slave_id - transmit skb through slave with slave_id
* @ bond : bonding device that is transmitting
* @ skb : buffer to transmit
* @ slave_id : slave id up to slave_cnt - 1 through which to transmit
*
* This function tries to transmit through slave with slave_id but in case
* it fails , it tries to find the first available slave for transmission .
* The skb is consumed in all cases , thus the function is void .
*/
void bond_xmit_slave_id ( struct bonding * bond , struct sk_buff * skb , int slave_id )
{
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:50 +04:00
struct slave * slave ;
int i = slave_id ;
/* Here we start from the slave with slave_id */
2013-09-25 11:20:14 +04:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 18:54:50 +04:00
if ( - - i < 0 ) {
if ( slave_can_tx ( slave ) ) {
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
return ;
}
}
}
/* Here we start from the first slave up to slave_id */
i = slave_id ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 18:54:50 +04:00
if ( - - i < 0 )
break ;
if ( slave_can_tx ( slave ) ) {
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
return ;
}
}
/* no slave that can tx has been found */
kfree_skb ( skb ) ;
}
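/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * two-pass search done by bond_xmit_slave_id() above, assuming the slaves
 * are modelled as a plain array of "can transmit" flags. pick_tx_slave() is
 * a hypothetical helper; the driver iterates an RCU-protected list and
 * counts down with --i instead of indexing.
 */

```c
#include <stdbool.h>

/*
 * Return the index of the slave that would be used for transmission,
 * or -1 if no slave can transmit (the driver frees the skb in that case).
 */
static int pick_tx_slave(const bool *can_tx, int count, int slave_id)
{
	int i;

	/* First pass: start at slave_id and go to the end of the list. */
	for (i = slave_id; i < count; i++)
		if (can_tx[i])
			return i;

	/* Second pass: wrap around, from the first slave up to slave_id - 1. */
	for (i = 0; i < slave_id && i < count; i++)
		if (can_tx[i])
			return i;

	return -1;
}
```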
2005-04-17 02:20:36 +04:00
static int bond_xmit_roundrobin ( struct sk_buff * skb , struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2010-03-25 17:49:05 +03:00
struct iphdr * iph = ip_hdr ( skb ) ;
2013-08-01 18:54:50 +04:00
struct slave * slave ;
2005-04-17 02:20:36 +04:00
2007-10-18 04:37:47 +04:00
/*
2010-03-25 17:49:05 +03:00
* Start with the curr_active_slave that joined the bond as the
* default for sending IGMP traffic . For failover purposes one
* needs to maintain some consistency for the interface that will
* send the join / membership reports . The curr_active_slave found
* will send all of this type of traffic .
2007-10-18 04:37:47 +04:00
*/
2013-08-01 18:54:50 +04:00
if ( iph - > protocol = = IPPROTO_IGMP & & skb - > protocol = = htons ( ETH_P_IP ) ) {
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover while
doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 18:54:51 +04:00
slave = rcu_dereference ( bond - > curr_active_slave ) ;
2013-08-01 18:54:50 +04:00
if ( slave & & slave_can_tx ( slave ) )
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
bond_xmit_slave_id ( bond , skb , 0 ) ;
2010-03-25 17:49:05 +03:00
} else {
2013-08-01 18:54:50 +04:00
bond_xmit_slave_id ( bond , skb ,
bond - > rr_tx_counter + + % bond - > slave_cnt ) ;
2005-04-17 02:20:36 +04:00
}
2011-05-07 05:48:02 +04:00
2009-07-06 06:23:38 +04:00
return NETDEV_TX_OK ;
2005-04-17 02:20:36 +04:00
}
/*
* in active - backup mode , we know that bond - > curr_active_slave is always valid if
* the bond has a usable interface .
*/
static int bond_xmit_activebackup ( struct sk_buff * skb , struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-08-01 18:54:48 +04:00
struct slave * slave ;
2005-04-17 02:20:36 +04:00
2013-08-01 18:54:51 +04:00
slave = rcu_dereference ( bond - > curr_active_slave ) ;
2013-08-01 18:54:48 +04:00
if ( slave )
2013-08-01 18:54:50 +04:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
2012-06-13 09:30:07 +04:00
kfree_skb ( skb ) ;
2011-05-07 05:48:02 +04:00
2009-07-06 06:23:38 +04:00
return NETDEV_TX_OK ;
2005-04-17 02:20:36 +04:00
}
/*
2005-06-27 01:54:11 +04:00
* In bond_xmit_xor ( ) , we determine the output device by using a
* predetermined xmit_hash_policy ( ) . If the selected device is not enabled ,
* find the next active slave .
2005-04-17 02:20:36 +04:00
*/
static int bond_xmit_xor ( struct sk_buff * skb , struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
2013-08-01 18:54:50 +04:00
bond_xmit_slave_id ( bond , skb ,
bond - > xmit_hash_policy ( skb , bond - > slave_cnt ) ) ;
2011-05-07 05:48:02 +04:00
2009-07-06 06:23:38 +04:00
return NETDEV_TX_OK ;
2005-04-17 02:20:36 +04:00
}
2013-08-01 18:54:49 +04:00
/* in broadcast mode, we send everything to all usable interfaces. */
2005-04-17 02:20:36 +04:00
static int bond_xmit_broadcast ( struct sk_buff * skb , struct net_device * bond_dev )
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-08-01 18:54:49 +04:00
struct slave * slave = NULL ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2005-04-17 02:20:36 +04:00
2013-09-25 11:20:14 +04:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 18:54:49 +04:00
if ( bond_is_last_slave ( bond , slave ) )
break ;
if ( IS_UP ( slave - > dev ) & & slave - > link = = BOND_LINK_UP ) {
struct sk_buff * skb2 = skb_clone ( skb , GFP_ATOMIC ) ;
2005-04-17 02:20:36 +04:00
2013-08-01 18:54:49 +04:00
if ( ! skb2 ) {
pr_err ( " %s: Error: bond_xmit_broadcast(): skb_clone() failed \n " ,
bond_dev - > name ) ;
continue ;
2005-04-17 02:20:36 +04:00
}
2013-08-01 18:54:49 +04:00
/* bond_dev_queue_xmit always returns 0 */
bond_dev_queue_xmit ( bond , skb2 , slave - > dev ) ;
2005-04-17 02:20:36 +04:00
}
}
2013-08-01 18:54:49 +04:00
if ( slave & & IS_UP ( slave - > dev ) & & slave - > link = = BOND_LINK_UP )
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
2012-06-13 09:30:07 +04:00
kfree_skb ( skb ) ;
2009-06-12 23:02:48 +04:00
2009-07-06 06:23:38 +04:00
return NETDEV_TX_OK ;
2005-04-17 02:20:36 +04:00
}
/*------------------------- Device initialization ---------------------------*/
2007-12-07 10:40:34 +03:00
static void bond_set_xmit_hash_policy ( struct bonding * bond )
{
switch ( bond - > params . xmit_policy ) {
case BOND_XMIT_POLICY_LAYER23 :
bond - > xmit_hash_policy = bond_xmit_hash_policy_l23 ;
break ;
case BOND_XMIT_POLICY_LAYER34 :
bond - > xmit_hash_policy = bond_xmit_hash_policy_l34 ;
break ;
case BOND_XMIT_POLICY_LAYER2 :
default :
bond - > xmit_hash_policy = bond_xmit_hash_policy_l2 ;
break ;
}
}
2010-06-02 12:40:18 +04:00
/*
* Lookup the slave that corresponds to a qid
*/
static inline int bond_slave_override ( struct bonding * bond ,
struct sk_buff * skb )
{
struct slave * slave = NULL ;
struct slave * check_slave ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
int res = 1 ;
2010-06-02 12:40:18 +04:00
2011-05-07 05:48:02 +04:00
if ( ! skb - > queue_mapping )
return 1 ;
2010-06-02 12:40:18 +04:00
/* Find out if any slaves have the same mapping as this skb. */
2013-09-25 11:20:14 +04:00
bond_for_each_slave_rcu ( bond , check_slave , iter ) {
2010-06-02 12:40:18 +04:00
if ( check_slave - > queue_id = = skb - > queue_mapping ) {
slave = check_slave ;
break ;
}
}
/* If the slave isn't UP, use default transmit policy. */
if ( slave & & slave - > queue_id & & IS_UP ( slave - > dev ) & &
( slave - > link = = BOND_LINK_UP ) ) {
res = bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
}
return res ;
}
2011-06-03 14:35:52 +04:00
2010-06-02 12:40:18 +04:00
static u16 bond_select_queue ( struct net_device * dev , struct sk_buff * skb )
{
/*
* This helper function exists to help dev_pick_tx get the correct
2011-03-14 09:22:04 +03:00
* destination queue . Using a helper function skips a call to
2010-06-02 12:40:18 +04:00
* skb_tx_hash and will put the skbs in the queue we expect on their
* way down to the bonding driver .
*/
2011-03-14 09:22:04 +03:00
u16 txq = skb_rx_queue_recorded ( skb ) ? skb_get_rx_queue ( skb ) : 0 ;
2011-06-03 14:35:52 +04:00
/*
* Save the original txq to restore before passing to the driver
*/
2012-07-20 06:28:49 +04:00
qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping = skb - > queue_mapping ;
2011-06-03 14:35:52 +04:00
2011-03-14 09:22:04 +03:00
if ( unlikely ( txq > = dev - > real_num_tx_queues ) ) {
2011-04-13 19:22:29 +04:00
do {
2011-03-14 09:22:04 +03:00
txq - = dev - > real_num_tx_queues ;
2011-04-13 19:22:29 +04:00
} while ( txq > = dev - > real_num_tx_queues ) ;
2011-03-14 09:22:04 +03:00
}
return txq ;
2010-06-02 12:40:18 +04:00
}
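/*
 * Illustration only, not part of the driver: a userspace sketch of how
 * bond_select_queue() above folds a recorded rx queue number into the range
 * of tx queues by repeated subtraction. reduce_txq() is a hypothetical name,
 * and the sketch assumes real_num_tx_queues is non-zero (otherwise the loop
 * would never terminate).
 */

```c
#include <stdint.h>

/* Bring txq into [0, real_num_tx_queues) by repeated subtraction. */
static uint16_t reduce_txq(uint16_t txq, uint16_t real_num_tx_queues)
{
	while (txq >= real_num_tx_queues)
		txq -= real_num_tx_queues;
	return txq;
}
```

/*
 * The result equals txq % real_num_tx_queues. The subtraction loop avoids a
 * division in the common case where txq is already in range or needs only
 * one step.
 */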
2011-05-07 05:48:02 +04:00
static netdev_tx_t __bond_start_xmit ( struct sk_buff * skb , struct net_device * dev )
2008-11-21 07:14:53 +03:00
{
2010-06-02 12:40:18 +04:00
struct bonding * bond = netdev_priv ( dev ) ;
if ( TX_QUEUE_OVERRIDE ( bond - > params . mode ) ) {
if ( ! bond_slave_override ( bond , skb ) )
return NETDEV_TX_OK ;
}
2008-11-21 07:14:53 +03:00
switch ( bond - > params . mode ) {
case BOND_MODE_ROUNDROBIN :
return bond_xmit_roundrobin ( skb , dev ) ;
case BOND_MODE_ACTIVEBACKUP :
return bond_xmit_activebackup ( skb , dev ) ;
case BOND_MODE_XOR :
return bond_xmit_xor ( skb , dev ) ;
case BOND_MODE_BROADCAST :
return bond_xmit_broadcast ( skb , dev ) ;
case BOND_MODE_8023AD :
return bond_3ad_xmit_xor ( skb , dev ) ;
case BOND_MODE_ALB :
case BOND_MODE_TLB :
return bond_alb_xmit ( skb , dev ) ;
default :
/* Should never happen, mode already checked */
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error: Unknown bonding mode %d \n " ,
dev - > name , bond - > params . mode ) ;
2008-11-21 07:14:53 +03:00
WARN_ON_ONCE ( 1 ) ;
2012-06-13 09:30:07 +04:00
kfree_skb ( skb ) ;
2008-11-21 07:14:53 +03:00
return NETDEV_TX_OK ;
}
}
2011-05-07 05:48:02 +04:00
static netdev_tx_t bond_start_xmit ( struct sk_buff * skb , struct net_device * dev )
{
struct bonding * bond = netdev_priv ( dev ) ;
netdev_tx_t ret = NETDEV_TX_OK ;
/*
* If we risk deadlock from transmitting this in the
* netpoll path , tell netpoll to queue the frame for later tx
*/
if ( is_netpoll_tx_blocked ( dev ) )
return NETDEV_TX_BUSY ;
2013-08-01 18:54:51 +04:00
rcu_read_lock ( ) ;
2013-09-25 11:20:21 +04:00
if ( bond_has_slaves ( bond ) )
2011-05-07 05:48:02 +04:00
ret = __bond_start_xmit ( skb , dev ) ;
else
2012-06-13 09:30:07 +04:00
kfree_skb ( skb ) ;
2013-08-01 18:54:51 +04:00
rcu_read_unlock ( ) ;
2011-05-07 05:48:02 +04:00
return ret ;
}
2008-11-21 07:14:53 +03:00
2005-04-17 02:20:36 +04:00
/*
* set bond mode specific net device operations
*/
2005-11-09 21:35:51 +03:00
void bond_set_mode_ops ( struct bonding * bond , int mode )
2005-04-17 02:20:36 +04:00
{
2005-06-27 01:54:11 +04:00
struct net_device * bond_dev = bond - > dev ;
2005-04-17 02:20:36 +04:00
switch ( mode ) {
case BOND_MODE_ROUNDROBIN :
break ;
case BOND_MODE_ACTIVEBACKUP :
break ;
case BOND_MODE_XOR :
2007-12-07 10:40:34 +03:00
bond_set_xmit_hash_policy ( bond ) ;
2005-04-17 02:20:36 +04:00
break ;
case BOND_MODE_BROADCAST :
break ;
case BOND_MODE_8023AD :
2007-12-07 10:40:34 +03:00
bond_set_xmit_hash_policy ( bond ) ;
2005-04-17 02:20:36 +04:00
break ;
case BOND_MODE_ALB :
2006-02-22 03:36:44 +03:00
/* FALLTHRU */
case BOND_MODE_TLB :
2005-04-17 02:20:36 +04:00
break ;
default :
/* Should never happen, mode already checked */
2009-12-14 07:06:07 +03:00
pr_err ( " %s: Error: Unknown bonding mode %d \n " ,
bond_dev - > name , mode ) ;
2005-04-17 02:20:36 +04:00
break ;
}
}
2013-04-16 18:46:00 +04:00
static int bond_ethtool_get_settings ( struct net_device * bond_dev ,
struct ethtool_cmd * ecmd )
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
unsigned long speed = 0 ;
2013-09-25 11:20:14 +04:00
struct list_head * iter ;
2013-08-01 18:54:47 +04:00
struct slave * slave ;
2013-04-16 18:46:00 +04:00
ecmd - > duplex = DUPLEX_UNKNOWN ;
ecmd - > port = PORT_OTHER ;
/* Since SLAVE_IS_OK returns false for all inactive or down slaves, we
* do not need to check mode . Though link speed might not represent
* the true receive or transmit bandwidth ( not all modes are symmetric )
* this is an accurate maximum .
*/
read_lock ( & bond - > lock ) ;
2013-09-25 11:20:14 +04:00
bond_for_each_slave ( bond , slave , iter ) {
2013-04-16 18:46:00 +04:00
if ( SLAVE_IS_OK ( slave ) ) {
if ( slave - > speed ! = SPEED_UNKNOWN )
speed + = slave - > speed ;
if ( ecmd - > duplex = = DUPLEX_UNKNOWN & &
slave - > duplex ! = DUPLEX_UNKNOWN )
ecmd - > duplex = slave - > duplex ;
}
}
ethtool_cmd_speed_set ( ecmd , speed ? : SPEED_UNKNOWN ) ;
read_unlock ( & bond - > lock ) ;
2013-08-01 18:54:47 +04:00
2013-04-16 18:46:00 +04:00
return 0 ;
}
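/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * speed aggregation in bond_ethtool_get_settings() above. struct fake_slave,
 * SKETCH_SPEED_UNKNOWN and total_speed() are hypothetical; the "ok" flag
 * stands in for SLAVE_IS_OK() and the constant stands in for SPEED_UNKNOWN.
 */

```c
#include <stddef.h>

#define SKETCH_SPEED_UNKNOWN (-1)	/* assumed placeholder value */

/* Hypothetical slave model: usable flag plus link speed in Mb/s. */
struct fake_slave {
	int ok;
	int speed;
};

/*
 * Sum the speeds of all usable slaves whose speed is known. If nothing
 * contributes, report the speed as unknown, like "speed ? : SPEED_UNKNOWN".
 */
static long total_speed(const struct fake_slave *slaves, size_t n)
{
	long speed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!slaves[i].ok)
			continue;
		if (slaves[i].speed != SKETCH_SPEED_UNKNOWN)
			speed += slaves[i].speed;
	}
	return speed ? speed : SKETCH_SPEED_UNKNOWN;
}
```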
2005-09-27 03:11:50 +04:00
static void bond_ethtool_get_drvinfo ( struct net_device * bond_dev ,
2013-01-06 04:44:26 +04:00
struct ethtool_drvinfo * drvinfo )
2005-09-27 03:11:50 +04:00
{
2013-01-06 04:44:26 +04:00
strlcpy ( drvinfo - > driver , DRV_NAME , sizeof ( drvinfo - > driver ) ) ;
strlcpy ( drvinfo - > version , DRV_VERSION , sizeof ( drvinfo - > version ) ) ;
snprintf ( drvinfo - > fw_version , sizeof ( drvinfo - > fw_version ) , " %d " ,
BOND_ABI_VERSION ) ;
2005-09-27 03:11:50 +04:00
}
2006-09-13 22:30:00 +04:00
static const struct ethtool_ops bond_ethtool_ops = {
2005-09-27 03:11:50 +04:00
. get_drvinfo = bond_ethtool_get_drvinfo ,
2013-04-16 18:46:00 +04:00
. get_settings = bond_ethtool_get_settings ,
2008-09-14 05:17:09 +04:00
. get_link = ethtool_op_get_link ,
2005-08-23 09:34:53 +04:00
} ;
2008-11-20 08:56:05 +03:00
static const struct net_device_ops bond_netdev_ops = {
2009-06-12 23:02:52 +04:00
. ndo_init = bond_init ,
2009-06-12 23:02:47 +04:00
. ndo_uninit = bond_uninit ,
2008-11-20 08:56:05 +03:00
. ndo_open = bond_open ,
. ndo_stop = bond_close ,
2008-11-21 07:14:53 +03:00
. ndo_start_xmit = bond_start_xmit ,
2010-06-02 12:40:18 +04:00
. ndo_select_queue = bond_select_queue ,
2010-06-08 11:19:54 +04:00
. ndo_get_stats64 = bond_get_stats ,
2008-11-20 08:56:05 +03:00
. ndo_do_ioctl = bond_do_ioctl ,
2011-08-16 07:15:04 +04:00
. ndo_change_rx_flags = bond_change_rx_flags ,
2013-05-31 15:57:30 +04:00
. ndo_set_rx_mode = bond_set_rx_mode ,
2008-11-20 08:56:05 +03:00
. ndo_change_mtu = bond_change_mtu ,
2011-07-20 08:54:46 +04:00
. ndo_set_mac_address = bond_set_mac_address ,
2008-11-21 07:14:53 +03:00
. ndo_neigh_setup = bond_neigh_setup ,
2011-07-20 08:54:46 +04:00
. ndo_vlan_rx_add_vid = bond_vlan_rx_add_vid ,
2008-11-20 08:56:05 +03:00
. ndo_vlan_rx_kill_vid = bond_vlan_rx_kill_vid ,
2010-05-06 11:48:51 +04:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2011-02-18 02:43:32 +03:00
. ndo_netpoll_setup = bond_netpoll_setup ,
2010-05-06 11:48:51 +04:00
. ndo_netpoll_cleanup = bond_netpoll_cleanup ,
. ndo_poll_controller = bond_poll_controller ,
# endif
2011-02-13 12:33:01 +03:00
. ndo_add_slave = bond_enslave ,
. ndo_del_slave = bond_release ,
2011-05-07 07:22:17 +04:00
. ndo_fix_features = bond_fix_features ,
2008-11-20 08:56:05 +03:00
} ;
2013-02-18 18:59:23 +04:00
static const struct device_type bond_type = {
. name = " bond " ,
} ;
2010-04-01 01:30:52 +04:00
static void bond_destructor ( struct net_device * bond_dev )
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
if ( bond - > wq )
destroy_workqueue ( bond - > wq ) ;
free_netdev ( bond_dev ) ;
}
2009-06-12 23:02:52 +04:00
static void bond_setup ( struct net_device * bond_dev )
2005-04-17 02:20:36 +04:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 02:20:36 +04:00
/* initialize rwlocks */
rwlock_init ( & bond - > lock ) ;
rwlock_init ( & bond - > curr_slave_lock ) ;
2013-08-01 18:54:47 +04:00
INIT_LIST_HEAD ( & bond - > slave_list ) ;
2009-06-12 23:02:44 +04:00
bond - > params = bonding_defaults ;
2005-04-17 02:20:36 +04:00
/* Initialize pointers */
bond - > dev = bond_dev ;
/* Initialize the device entry points */
2009-06-12 23:02:52 +04:00
ether_setup ( bond_dev ) ;
2008-11-20 08:56:05 +03:00
bond_dev - > netdev_ops = & bond_netdev_ops ;
2005-08-23 09:34:53 +04:00
bond_dev - > ethtool_ops = & bond_ethtool_ops ;
2005-06-27 01:54:11 +04:00
bond_set_mode_ops ( bond , bond - > params . mode ) ;
2005-04-17 02:20:36 +04:00
2010-04-01 01:30:52 +04:00
bond_dev - > destructor = bond_destructor ;
2005-04-17 02:20:36 +04:00
2013-02-18 18:59:23 +04:00
SET_NETDEV_DEVTYPE ( bond_dev , & bond_type ) ;
2005-04-17 02:20:36 +04:00
/* Initialize the device options */
bond_dev - > tx_queue_len = 0 ;
bond_dev - > flags | = IFF_MASTER | IFF_MULTICAST ;
2006-09-23 08:54:10 +04:00
bond_dev - > priv_flags | = IFF_BONDING ;
2011-07-26 10:05:38 +04:00
bond_dev - > priv_flags & = ~ ( IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING ) ;
2009-06-12 23:02:52 +04:00
2005-04-17 02:20:36 +04:00
/* At first, we block adding VLANs. That's the only way to
* prevent problems that occur when adding VLANs over an
* empty bond . The block will be removed once non - challenged
* slaves are enslaved .
*/
bond_dev - > features | = NETIF_F_VLAN_CHALLENGED ;
2006-06-09 23:20:56 +04:00
/* don't acquire bond device's netif_tx_lock when
2005-04-17 02:20:36 +04:00
* transmitting */
bond_dev - > features | = NETIF_F_LLTX ;
/* By default, we declare the bond to be fully
* VLAN hardware accelerated capable . Special
* care is taken in the various xmit functions
* when there are slaves that are not hw accel
* capable
*/
2011-05-07 07:22:17 +04:00
bond_dev - > hw_features = BOND_VLAN_FEATURES |
2013-04-19 06:04:27 +04:00
NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_HW_VLAN_CTAG_FILTER ;
2011-05-07 07:22:17 +04:00
2011-11-15 19:29:55 +04:00
bond_dev - > hw_features & = ~ ( NETIF_F_ALL_CSUM & ~ NETIF_F_HW_CSUM ) ;
2011-05-07 07:22:17 +04:00
bond_dev - > features | = bond_dev - > hw_features ;
2005-04-17 02:20:36 +04:00
}
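The checksum masking at the end of bond_setup() can be looked at in isolation. The following is a minimal userspace sketch, not the driver's code: the F_* bit positions are made up for illustration (the real NETIF_F_* values are defined by the kernel and differ), and keep_only_hw_csum() is a hypothetical helper name. It shows that clearing `ALL_CSUM & ~HW_CSUM` drops every checksum flag except the generic hardware-checksum bit and leaves unrelated bits untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define F_IP_CSUM   (1u << 0)
#define F_HW_CSUM   (1u << 1)
#define F_IPV6_CSUM (1u << 2)
#define F_SG        (1u << 3)
#define F_ALL_CSUM  (F_IP_CSUM | F_HW_CSUM | F_IPV6_CSUM)

/* Drop every checksum flag except F_HW_CSUM; other bits pass through. */
uint32_t keep_only_hw_csum(uint32_t features)
{
	uint32_t out = features & ~(F_ALL_CSUM & ~F_HW_CSUM);

	/* Postcondition: no protocol-specific checksum bit survives. */
	assert((out & (F_IP_CSUM | F_IPV6_CSUM)) == 0);
	return out;
}
```

With all four bits set on input, only F_HW_CSUM and F_SG remain set in the result.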
2009-10-29 17:18:24 +03:00
/*
* Destroy a bonding device .
* Must be under rtnl_lock when this function is called .
*/
static void bond_uninit ( struct net_device * bond_dev )
2008-10-31 03:41:15 +03:00
{
2008-11-13 10:37:49 +03:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 11:20:15 +04:00
struct list_head * iter ;
struct slave * slave ;
2008-10-31 03:41:15 +03:00
2010-05-06 11:48:51 +04:00
bond_netpoll_cleanup ( bond_dev ) ;
2009-10-29 17:18:24 +03:00
/* Release the bonded slaves */
2013-09-25 11:20:15 +04:00
bond_for_each_slave ( bond , slave , iter )
2013-08-01 18:54:47 +04:00
__bond_release_one ( bond_dev , slave - > dev , true ) ;
2013-02-18 18:09:42 +04:00
pr_info ( " %s: released all slaves \n " , bond_dev - > name ) ;
2009-10-29 17:18:24 +03:00
2008-10-31 03:41:15 +03:00
list_del ( & bond - > bond_list ) ;
2010-12-09 18:17:13 +03:00
bond_debug_unregister ( bond ) ;
2008-10-31 03:41:15 +03:00
}
2005-04-17 02:20:36 +04:00
/*------------------------- Module initialization ---------------------------*/
/*
* Convert string input module parms . Accept either the
2008-01-18 03:25:01 +03:00
* number of the mode or its string name . A bit complicated because
* some mode names are substrings of other names , and calls from sysfs
* may have whitespace in the name ( trailing newlines , for example ) .
2005-04-17 02:20:36 +04:00
*/
2008-12-10 10:10:17 +03:00
int bond_parse_parm ( const char * buf , const struct bond_parm_tbl * tbl )
2005-04-17 02:20:36 +04:00
{
2009-02-14 14:15:49 +03:00
int modeint = - 1 , i , rv ;
2008-01-30 05:07:43 +03:00
char * p , modestr [ BOND_MAX_MODENAME_LEN + 1 ] = { 0 , } ;
2008-01-18 03:25:01 +03:00
2008-01-30 05:07:43 +03:00
for ( p = ( char * ) buf ; * p ; p + + )
if ( ! ( isdigit ( * p ) | | isspace ( * p ) ) )
break ;
if ( * p )
2008-01-18 03:25:01 +03:00
rv = sscanf ( buf , " %20s " , modestr ) ;
2008-01-30 05:07:43 +03:00
else
2009-02-14 14:15:49 +03:00
rv = sscanf ( buf , " %d " , & modeint ) ;
2008-01-30 05:07:43 +03:00
if ( ! rv )
return - 1 ;
2005-04-17 02:20:36 +04:00
for ( i = 0 ; tbl [ i ] . modename ; i + + ) {
2009-02-14 14:15:49 +03:00
if ( modeint = = tbl [ i ] . mode )
2008-01-18 03:25:01 +03:00
return tbl [ i ] . mode ;
if ( strcmp ( modestr , tbl [ i ] . modename ) = = 0 )
2005-04-17 02:20:36 +04:00
return tbl [ i ] . mode ;
}
return - 1 ;
}
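The parsing rule described above bond_parse_parm() (accept either a number or a name, tolerate trailing whitespace) can be tried out in userspace. This is a sketch under stated assumptions, not the kernel function: struct parm_tbl, demo_tbl, parse_parm() and MAX_NAME_LEN are hypothetical names, the table entries are sample values, and userspace sscanf() returns EOF on empty input, so the sketch checks for exactly one converted item.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_NAME_LEN 20 /* assumed to match the "%20s" width used above */

struct parm_tbl {
	const char *name;
	int value;
};

/* Sample table, terminated by a NULL name like the driver's tables. */
static const struct parm_tbl demo_tbl[] = {
	{ "balance-rr", 0 },
	{ "active-backup", 1 },
	{ "balance-xor", 2 },
	{ NULL, -1 },
};

/* Return the table value for a numeric or named input, or -1. */
int parse_parm(const char *buf, const struct parm_tbl *tbl)
{
	char name[MAX_NAME_LEN + 1] = { 0 };
	int num = -1, matched, i;
	const char *p;

	assert(buf != NULL);

	/* Input made only of digits and whitespace is treated as a number. */
	for (p = buf; *p; p++)
		if (!(isdigit((unsigned char)*p) || isspace((unsigned char)*p)))
			break;

	if (*p)
		matched = sscanf(buf, "%20s", name); /* stops at whitespace */
	else
		matched = sscanf(buf, "%d", &num);

	if (matched != 1)
		return -1;

	for (i = 0; tbl[i].name; i++) {
		if (num == tbl[i].value)
			return tbl[i].value;
		if (strcmp(name, tbl[i].name) == 0)
			return tbl[i].value;
	}
	return -1;
}
```

Because `%20s` stops at the first whitespace character, a sysfs write such as "active-backup\n" compares equal to the table name without any explicit trimming.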
static int bond_check_params ( struct bond_params * params )
{
2013-05-18 05:18:30 +04:00
int arp_validate_value , fail_over_mac_value , primary_reselect_value , i ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of a bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
Any other configuration that needs access to all arp_ip_targets needs
this as well.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
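The "any" versus "all" rule from the commit message above can be sketched as follows. This is an illustration under assumptions, not the driver's slave_last_rx(): the names effective_last_rx(), slave_is_up(), MAX_TARGETS and the enum are hypothetical, timestamps are plain counters with no jiffies wraparound handling, and the "any" case is modeled here as the newest per-target timestamp.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TARGETS 16 /* hypothetical stand-in for BOND_MAX_ARP_TARGETS */

enum all_targets_mode { TARGETS_ANY = 0, TARGETS_ALL = 1 };

/* Pick the timestamp that decides whether the slave still counts as up:
 * the newest reply in "any" mode, the oldest reply in "all" mode.
 */
unsigned long effective_last_rx(const unsigned long *target_last_rx,
				size_t n, enum all_targets_mode mode)
{
	unsigned long pick;
	size_t i;

	assert(n > 0 && n <= MAX_TARGETS);

	pick = target_last_rx[0];
	for (i = 1; i < n; i++) {
		if (mode == TARGETS_ALL) {
			if (target_last_rx[i] < pick)
				pick = target_last_rx[i];
		} else {
			if (target_last_rx[i] > pick)
				pick = target_last_rx[i];
		}
	}
	return pick;
}

/* The slave is up if the deciding reply is no older than the timeout. */
int slave_is_up(const unsigned long *target_last_rx, size_t n,
		enum all_targets_mode mode,
		unsigned long now, unsigned long timeout)
{
	return now - effective_last_rx(target_last_rx, n, mode) <= timeout;
}
```

In "all" mode a single silent target is enough to age the slave out, which is the behavior the new parameter is meant to provide.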
2013-06-24 13:49:34 +04:00
int arp_all_targets_value ;
2006-09-23 08:54:53 +04:00
2005-04-17 02:20:36 +04:00
/*
* Convert string parameters .
*/
if ( mode ) {
bond_mode = bond_parse_parm ( mode , bond_mode_tbl ) ;
if ( bond_mode = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: Invalid bonding mode \" %s \" \n " ,
2005-04-17 02:20:36 +04:00
mode = = NULL ? " NULL " : mode ) ;
return - EINVAL ;
}
}
2005-06-27 01:54:11 +04:00
if ( xmit_hash_policy ) {
if ( ( bond_mode ! = BOND_MODE_XOR ) & &
( bond_mode ! = BOND_MODE_8023AD ) ) {
2009-12-14 07:06:07 +03:00
pr_info ( " xmit_hash_policy param is irrelevant in mode %s \n " ,
2005-06-27 01:54:11 +04:00
bond_mode_name ( bond_mode ) ) ;
} else {
xmit_hashtype = bond_parse_parm ( xmit_hash_policy ,
xmit_hashtype_tbl ) ;
if ( xmit_hashtype = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: Invalid xmit_hash_policy \" %s \" \n " ,
2009-06-12 23:02:48 +04:00
xmit_hash_policy = = NULL ? " NULL " :
2005-06-27 01:54:11 +04:00
xmit_hash_policy ) ;
return - EINVAL ;
}
}
}
2005-04-17 02:20:36 +04:00
if ( lacp_rate ) {
if ( bond_mode ! = BOND_MODE_8023AD ) {
2009-12-14 07:06:07 +03:00
pr_info ( " lacp_rate param is irrelevant in mode %s \n " ,
bond_mode_name ( bond_mode ) ) ;
2005-04-17 02:20:36 +04:00
} else {
lacp_fast = bond_parse_parm ( lacp_rate , bond_lacp_tbl ) ;
if ( lacp_fast = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: Invalid lacp rate \" %s \" \n " ,
2005-04-17 02:20:36 +04:00
lacp_rate = = NULL ? " NULL " : lacp_rate ) ;
return - EINVAL ;
}
}
}
2008-11-05 04:51:16 +03:00
if ( ad_select ) {
params - > ad_select = bond_parse_parm ( ad_select , ad_select_tbl ) ;
if ( params - > ad_select = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: Invalid ad_select \" %s \" \n " ,
2008-11-05 04:51:16 +03:00
ad_select = = NULL ? " NULL " : ad_select ) ;
return - EINVAL ;
}
if ( bond_mode ! = BOND_MODE_8023AD ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " ad_select param only affects 802.3ad mode \n " ) ;
2008-11-05 04:51:16 +03:00
}
} else {
params - > ad_select = BOND_AD_STABLE ;
}
2009-08-28 17:18:34 +04:00
if ( max_bonds < 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: max_bonds (%d) not in range %d-%d, so it was reset to BOND_DEFAULT_MAX_BONDS (%d) \n " ,
max_bonds , 0 , INT_MAX , BOND_DEFAULT_MAX_BONDS ) ;
2005-04-17 02:20:36 +04:00
max_bonds = BOND_DEFAULT_MAX_BONDS ;
}
if ( miimon < 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: miimon module parameter (%d), not in range 0-%d, so it was reset to %d \n " ,
miimon , INT_MAX , BOND_LINK_MON_INTERV ) ;
2005-04-17 02:20:36 +04:00
miimon = BOND_LINK_MON_INTERV ;
}
if ( updelay < 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: updelay module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
updelay , INT_MAX ) ;
2005-04-17 02:20:36 +04:00
updelay = 0 ;
}
if ( downdelay < 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: downdelay module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
downdelay , INT_MAX ) ;
2005-04-17 02:20:36 +04:00
downdelay = 0 ;
}
if ( ( use_carrier ! = 0 ) & & ( use_carrier ! = 1 ) ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: use_carrier module parameter (%d), not of valid value (0/1), so it was set to 1 \n " ,
use_carrier ) ;
2005-04-17 02:20:36 +04:00
use_carrier = 1 ;
}
2011-04-26 19:25:52 +04:00
if ( num_peer_notif < 0 | | num_peer_notif > 255 ) {
pr_warning ( " Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1 \n " ,
num_peer_notif ) ;
num_peer_notif = 1 ;
}
2005-04-17 02:20:36 +04:00
/* reset values for 802.3ad */
if ( bond_mode = = BOND_MODE_8023AD ) {
if ( ! miimon ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: miimon must be specified, otherwise bonding will not detect link failure, speed and duplex which are essential for 802.3ad operation \n " ) ;
2009-06-12 23:02:48 +04:00
pr_warning ( " Forcing miimon to 100msec \n " ) ;
2005-04-17 02:20:36 +04:00
miimon = 100 ;
}
}
2010-06-02 12:40:18 +04:00
if ( tx_queues < 1 | | tx_queues > 255 ) {
pr_warning ( " Warning: tx_queues (%d) should be between "
" 1 and 255, resetting to %d \n " ,
tx_queues , BOND_DEFAULT_TX_QUEUES ) ;
tx_queues = BOND_DEFAULT_TX_QUEUES ;
}
2010-06-02 12:39:21 +04:00
if ( ( all_slaves_active ! = 0 ) & & ( all_slaves_active ! = 1 ) ) {
pr_warning ( " Warning: all_slaves_active module parameter (%d), "
" not of valid value (0/1), so it was set to "
" 0 \n " , all_slaves_active ) ;
all_slaves_active = 0 ;
}
2010-10-05 18:23:59 +04:00
if ( resend_igmp < 0 | | resend_igmp > 255 ) {
pr_warning ( " Warning: resend_igmp (%d) should be between "
" 0 and 255, resetting to %d \n " ,
resend_igmp , BOND_DEFAULT_RESEND_IGMP ) ;
resend_igmp = BOND_DEFAULT_RESEND_IGMP ;
}
2005-04-17 02:20:36 +04:00
/* reset values for TLB/ALB */
if ( ( bond_mode = = BOND_MODE_TLB ) | |
( bond_mode = = BOND_MODE_ALB ) ) {
if ( ! miimon ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: miimon must be specified, otherwise bonding will not detect link failure and link speed which are essential for TLB/ALB load balancing \n " ) ;
2009-06-12 23:02:48 +04:00
pr_warning ( " Forcing miimon to 100msec \n " ) ;
2005-04-17 02:20:36 +04:00
miimon = 100 ;
}
}
if ( bond_mode = = BOND_MODE_ALB ) {
2009-12-14 07:06:07 +03:00
pr_notice ( " In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (%d msec) is incompatible with the forwarding delay time of the switch \n " ,
updelay ) ;
2005-04-17 02:20:36 +04:00
}
if ( ! miimon ) {
if ( updelay | | downdelay ) {
/* just warn the user the up/down delay will have
* no effect since miimon is zero . . .
*/
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: miimon module parameter not set and updelay (%d) or downdelay (%d) module parameter is set; updelay and downdelay have no effect unless miimon is set \n " ,
updelay , downdelay ) ;
2005-04-17 02:20:36 +04:00
}
} else {
/* don't allow arp monitoring */
if ( arp_interval ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: miimon (%d) and arp_interval (%d) can't be used simultaneously, disabling ARP monitoring \n " ,
miimon , arp_interval ) ;
2005-04-17 02:20:36 +04:00
arp_interval = 0 ;
}
if ( ( updelay % miimon ) ! = 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: updelay (%d) is not a multiple of miimon (%d), updelay rounded to %d ms \n " ,
updelay , miimon ,
( updelay / miimon ) * miimon ) ;
2005-04-17 02:20:36 +04:00
}
updelay / = miimon ;
if ( ( downdelay % miimon ) ! = 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: downdelay (%d) is not a multiple of miimon (%d), downdelay rounded to %d ms \n " ,
downdelay , miimon ,
( downdelay / miimon ) * miimon ) ;
2005-04-17 02:20:36 +04:00
}
downdelay / = miimon ;
}
if ( arp_interval < 0 ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: arp_interval module parameter (%d), not in range 0-%d, so it was reset to %d \n " ,
arp_interval , INT_MAX , BOND_LINK_ARP_INTERV ) ;
2005-04-17 02:20:36 +04:00
arp_interval = BOND_LINK_ARP_INTERV ;
}
2013-05-18 05:18:30 +04:00
for ( arp_ip_count = 0 , i = 0 ;
( arp_ip_count < BOND_MAX_ARP_TARGETS ) & & arp_ip_target [ i ] ; i + + ) {
2005-04-17 02:20:36 +04:00
/* not complete check, but should be good enough to
catch mistakes */
2013-05-18 05:18:30 +04:00
__be32 ip = in_aton ( arp_ip_target [ i ] ) ;
if ( ! isdigit ( arp_ip_target [ i ] [ 0 ] ) | | ip = = 0 | |
ip = = htonl ( INADDR_BROADCAST ) ) {
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: bad arp_ip_target module parameter (%s), ARP monitoring will not be performed \n " ,
2013-05-18 05:18:30 +04:00
arp_ip_target [ i ] ) ;
2005-04-17 02:20:36 +04:00
arp_interval = 0 ;
} else {
2013-06-24 13:49:30 +04:00
if ( bond_get_targets_ip ( arp_target , ip ) = = - 1 )
arp_target [ arp_ip_count + + ] = ip ;
else
pr_warning ( " Warning: duplicate address %pI4 in arp_ip_target, skipping \n " ,
& ip ) ;
2005-04-17 02:20:36 +04:00
}
}
if ( arp_interval & & ! arp_ip_count ) {
/* don't allow arping if no arp_ip_target given... */
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: arp_interval module parameter (%d) specified without providing an arp_ip_target parameter, arp_interval was reset to 0 \n " ,
arp_interval ) ;
2005-04-17 02:20:36 +04:00
arp_interval = 0 ;
}
2006-09-23 08:54:53 +04:00
if ( arp_validate ) {
if ( bond_mode ! = BOND_MODE_ACTIVEBACKUP ) {
2009-12-14 07:06:07 +03:00
pr_err ( " arp_validate only supported in active-backup mode \n " ) ;
2006-09-23 08:54:53 +04:00
return - EINVAL ;
}
if ( ! arp_interval ) {
2009-12-14 07:06:07 +03:00
pr_err ( " arp_validate requires arp_interval \n " ) ;
2006-09-23 08:54:53 +04:00
return - EINVAL ;
}
arp_validate_value = bond_parse_parm ( arp_validate ,
arp_validate_tbl ) ;
if ( arp_validate_value = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: invalid arp_validate \" %s \" \n " ,
2006-09-23 08:54:53 +04:00
arp_validate = = NULL ? " NULL " : arp_validate ) ;
return - EINVAL ;
}
} else
arp_validate_value = 0 ;
2013-06-24 13:49:34 +04:00
arp_all_targets_value = 0 ;
if ( arp_all_targets ) {
arp_all_targets_value = bond_parse_parm ( arp_all_targets ,
arp_all_targets_tbl ) ;
if ( arp_all_targets_value = = - 1 ) {
pr_err ( " Error: invalid arp_all_targets \" %s \" \n " ,
arp_all_targets ) ;
arp_all_targets_value = 0 ;
}
}
2005-04-17 02:20:36 +04:00
if ( miimon ) {
2009-12-14 07:06:07 +03:00
pr_info ( " MII link monitoring set to %d ms \n " , miimon ) ;
2005-04-17 02:20:36 +04:00
} else if ( arp_interval ) {
2009-12-14 07:06:07 +03:00
pr_info ( " ARP monitoring set to %d ms, validate %s, with %d target(s): " ,
arp_interval ,
arp_validate_tbl [ arp_validate_value ] . modename ,
arp_ip_count ) ;
2005-04-17 02:20:36 +04:00
for ( i = 0 ; i < arp_ip_count ; i + + )
2009-08-13 08:11:52 +04:00
pr_cont ( " %s " , arp_ip_target [ i ] ) ;
2005-04-17 02:20:36 +04:00
2009-08-13 08:11:52 +04:00
pr_cont ( " \n " ) ;
2005-04-17 02:20:36 +04:00
2008-06-14 05:12:04 +04:00
} else if ( max_bonds ) {
2005-04-17 02:20:36 +04:00
/* miimon and arp_interval not set, we need one so things
* work as expected , see bonding . txt for details
*/
2011-07-27 14:09:26 +04:00
pr_debug ( " Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details. \n " ) ;
2005-04-17 02:20:36 +04:00
}
if ( primary & & ! USES_PRIMARY ( bond_mode ) ) {
/* currently, using a primary only makes sense
* in active backup , TLB or ALB modes
*/
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: %s primary device specified but has no effect in %s mode \n " ,
primary , bond_mode_name ( bond_mode ) ) ;
2005-04-17 02:20:36 +04:00
primary = NULL ;
}
2009-09-25 07:28:09 +04:00
if ( primary & & primary_reselect ) {
primary_reselect_value = bond_parse_parm ( primary_reselect ,
pri_reselect_tbl ) ;
if ( primary_reselect_value = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: Invalid primary_reselect \" %s \" \n " ,
2009-09-25 07:28:09 +04:00
primary_reselect = =
NULL ? " NULL " : primary_reselect ) ;
return - EINVAL ;
}
} else {
primary_reselect_value = BOND_PRI_RESELECT_ALWAYS ;
}
2008-05-18 08:10:14 +04:00
if ( fail_over_mac ) {
fail_over_mac_value = bond_parse_parm ( fail_over_mac ,
fail_over_mac_tbl ) ;
if ( fail_over_mac_value = = - 1 ) {
2009-12-14 07:06:07 +03:00
pr_err ( " Error: invalid fail_over_mac \" %s \" \n " ,
2008-05-18 08:10:14 +04:00
fail_over_mac = = NULL ? " NULL " : fail_over_mac ) ;
return - EINVAL ;
}
if ( bond_mode ! = BOND_MODE_ACTIVEBACKUP )
2009-12-14 07:06:07 +03:00
pr_warning ( " Warning: fail_over_mac only affects active-backup mode. \n " ) ;
2008-05-18 08:10:14 +04:00
} else {
fail_over_mac_value = BOND_FOM_NONE ;
}
2007-10-10 06:57:24 +04:00
2005-04-17 02:20:36 +04:00
/* fill params struct with the proper values */
params - > mode = bond_mode ;
2005-06-27 01:54:11 +04:00
params - > xmit_policy = xmit_hashtype ;
2005-04-17 02:20:36 +04:00
params - > miimon = miimon ;
2011-04-26 19:25:52 +04:00
params - > num_peer_notif = num_peer_notif ;
2005-04-17 02:20:36 +04:00
params - > arp_interval = arp_interval ;
2006-09-23 08:54:53 +04:00
params - > arp_validate = arp_validate_value ;
2013-06-24 13:49:34 +04:00
params - > arp_all_targets = arp_all_targets_value ;
2005-04-17 02:20:36 +04:00
params - > updelay = updelay ;
params - > downdelay = downdelay ;
params - > use_carrier = use_carrier ;
params - > lacp_fast = lacp_fast ;
params - > primary [ 0 ] = 0 ;
2009-09-25 07:28:09 +04:00
params - > primary_reselect = primary_reselect_value ;
2008-05-18 08:10:14 +04:00
params - > fail_over_mac = fail_over_mac_value ;
2010-06-02 12:40:18 +04:00
params - > tx_queues = tx_queues ;
2010-06-02 12:39:21 +04:00
params - > all_slaves_active = all_slaves_active ;
2010-10-05 18:23:59 +04:00
params - > resend_igmp = resend_igmp ;
2011-06-22 13:54:39 +04:00
params - > min_links = min_links ;
2013-09-13 19:05:33 +04:00
params - > lp_interval = BOND_ALB_DEFAULT_LP_INTERVAL ;
2005-04-17 02:20:36 +04:00
if ( primary ) {
strncpy ( params - > primary , primary , IFNAMSIZ ) ;
params - > primary [ IFNAMSIZ - 1 ] = 0 ;
}
memcpy ( params - > arp_targets , arp_target , sizeof ( arp_target ) ) ;
return 0 ;
}
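The updelay/downdelay handling in bond_check_params() stores each delay as a count of miimon intervals, rounding down, and the warning text reports the rounded value in milliseconds. A minimal sketch of that arithmetic, with hypothetical helper names delay_to_ticks() and effective_delay_ms():

```c
#include <assert.h>

/* Convert a delay in ms into whole monitoring intervals, rounding down. */
int delay_to_ticks(int delay_ms, int miimon_ms)
{
	assert(miimon_ms > 0); /* only meaningful when miimon is enabled */
	return delay_ms / miimon_ms;
}

/* Delay in ms that actually takes effect after the rounding. */
int effective_delay_ms(int delay_ms, int miimon_ms)
{
	return delay_to_ticks(delay_ms, miimon_ms) * miimon_ms;
}
```

For example, a 250 ms updelay with miimon set to 100 ms becomes 2 intervals (200 ms), and any delay shorter than one interval becomes 0.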
2006-11-09 06:51:01 +03:00
static struct lock_class_key bonding_netdev_xmit_lock_key ;
2008-07-23 01:16:42 +04:00
static struct lock_class_key bonding_netdev_addr_lock_key ;
bonding: set qdisc_tx_busylock to avoid LOCKDEP splat
If a qdisc is installed on a bonding device, it's possible to get the
following lockdep splat under stress:
=============================================
[ INFO: possible recursive locking detected ]
3.6.0+ #211 Not tainted
---------------------------------------------
ping/4876 is trying to acquire lock:
(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
but task is already holding lock:
(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
*** DEADLOCK ***
May be due to missing lock nesting notation
6 locks held by ping/4876:
#0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e5030>] raw_sendmsg+0x600/0xc30
#1: (rcu_read_lock_bh){.+....}, at: [<ffffffff815ba4bd>] ip_finish_output+0x12d/0x870
#2: (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
#3: (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
#4: (&bond->lock){++.?..}, at: [<ffffffffa02128c1>] bond_start_xmit+0x31/0x4b0 [bonding]
#5: (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
stack backtrace:
Pid: 4876, comm: ping Not tainted 3.6.0+ #211
Call Trace:
[<ffffffff810a0145>] __lock_acquire+0x715/0x1b80
[<ffffffff810a256b>] ? mark_held_locks+0x9b/0x100
[<ffffffff810a1bf2>] lock_acquire+0x92/0x1d0
[<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
[<ffffffff81726b7c>] _raw_spin_lock+0x3c/0x50
[<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
[<ffffffff8106264d>] ? rcu_read_lock_bh_held+0x5d/0x90
[<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
[<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
[<ffffffffa0212a6a>] bond_start_xmit+0x1da/0x4b0 [bonding]
[<ffffffff815796d0>] dev_hard_start_xmit+0x240/0x6b0
[<ffffffff81597c6e>] sch_direct_xmit+0xfe/0x2a0
[<ffffffff8157a249>] dev_queue_xmit+0x199/0x830
[<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
[<ffffffff815ba96f>] ip_finish_output+0x5df/0x870
[<ffffffff815ba4bd>] ? ip_finish_output+0x12d/0x870
[<ffffffff815bb964>] ip_output+0x54/0xf0
[<ffffffff815bad48>] ip_local_out+0x28/0x90
[<ffffffff815bc444>] ip_send_skb+0x14/0x50
[<ffffffff815bc4b2>] ip_push_pending_frames+0x32/0x40
[<ffffffff815e536a>] raw_sendmsg+0x93a/0xc30
[<ffffffff8128d570>] ? selinux_file_send_sigiotask+0x1f0/0x1f0
[<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
[<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
[<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
[<ffffffff815f6855>] inet_sendmsg+0x125/0x240
[<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
[<ffffffff8155cddb>] sock_sendmsg+0xab/0xe0
[<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
[<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
[<ffffffff8155d18c>] __sys_sendmsg+0x37c/0x390
[<ffffffff81195b2a>] ? fsnotify+0x2ca/0x7e0
[<ffffffff811958e8>] ? fsnotify+0x88/0x7e0
[<ffffffff81361f36>] ? put_ldisc+0x56/0xd0
[<ffffffff8116f98a>] ? fget_light+0x3da/0x510
[<ffffffff8155f6c4>] sys_sendmsg+0x44/0x80
[<ffffffff8172fc22>] system_call_fastpath+0x16/0x1b
Avoid this problem using a distinct lock_class_key for bonding
devices.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-04 03:05:26 +04:00
static struct lock_class_key bonding_tx_busylock_key ;
2006-11-09 06:51:01 +03:00
2008-07-17 11:34:19 +04:00
static void bond_set_lockdep_class_one ( struct net_device * dev ,
struct netdev_queue * txq ,
void * _unused )
2008-07-09 10:13:53 +04:00
{
lockdep_set_class ( & txq - > _xmit_lock ,
& bonding_netdev_xmit_lock_key ) ;
}
static void bond_set_lockdep_class ( struct net_device * dev )
{
2008-07-23 01:16:42 +04:00
lockdep_set_class ( & dev - > addr_list_lock ,
& bonding_netdev_addr_lock_key ) ;
2008-07-17 11:34:19 +04:00
netdev_for_each_tx_queue ( dev , bond_set_lockdep_class_one , NULL ) ;
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-04 03:05:26 +04:00
dev - > qdisc_tx_busylock = & bonding_tx_busylock_key ;
2008-07-09 10:13:53 +04:00
}
2009-06-12 23:02:52 +04:00
/*
* Called from registration process
*/
static int bond_init ( struct net_device * bond_dev )
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
2009-10-29 17:18:26 +03:00
struct bond_net * bn = net_generic ( dev_net ( bond_dev ) , bond_net_id ) ;
bonding: prevent deadlock on slave store with alb mode (v3)
This soft lockup was recently reported:
[root@dell-per715-01 ~]# echo +bond5 > /sys/class/net/bonding_masters
[root@dell-per715-01 ~]# echo +eth1 > /sys/class/net/bond5/bonding/slaves
bonding: bond5: doing slave updates when interface is down.
bonding bond5: master_dev is not up in bond_enslave
[root@dell-per715-01 ~]# echo -eth1 > /sys/class/net/bond5/bonding/slaves
bonding: bond5: doing slave updates when interface is down.
BUG: soft lockup - CPU#12 stuck for 60s! [bash:6444]
CPU 12:
Modules linked in: bonding autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
be2d
Pid: 6444, comm: bash Not tainted 2.6.18-262.el5 #1
RIP: 0010:[<ffffffff80064bf0>] [<ffffffff80064bf0>]
.text.lock.spinlock+0x26/00
RSP: 0018:ffff810113167da8 EFLAGS: 00000286
RAX: ffff810113167fd8 RBX: ffff810123a47800 RCX: 0000000000ff1025
RDX: 0000000000000000 RSI: ffff810123a47800 RDI: ffff81021b57f6f8
RBP: ffff81021b57f500 R08: 0000000000000000 R09: 000000000000000c
R10: 00000000ffffffff R11: ffff81011d41c000 R12: ffff81021b57f000
R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000282
FS: 00002b3b41ef3f50(0000) GS:ffff810123b27940(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b3b456dd000 CR3: 000000031fc60000 CR4: 00000000000006e0
Call Trace:
[<ffffffff80064af9>] _spin_lock_bh+0x9/0x14
[<ffffffff886937d7>] :bonding:tlb_clear_slave+0x22/0xa1
[<ffffffff8869423c>] :bonding:bond_alb_deinit_slave+0xba/0xf0
[<ffffffff8868dda6>] :bonding:bond_release+0x1b4/0x450
[<ffffffff8006457b>] __down_write_nested+0x12/0x92
[<ffffffff88696ae4>] :bonding:bonding_store_slaves+0x25c/0x2f7
[<ffffffff801106f7>] sysfs_write_file+0xb9/0xe8
[<ffffffff80016b87>] vfs_write+0xce/0x174
[<ffffffff80017450>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
It occurs because we are able to change the slave configuration of a bond while
the bond interface is down. The bonding driver initializes some data structures
only after its ndo_open routine is called. Among them is the initialization of
the alb tx and rx hash locks. So if we add or remove a slave without first
opening the bond master device, we run the risk of trying to lock/unlock a
spinlock that has garbage for data in it, which results in the soft lockup above.
Note that sometimes this works, because in many cases an unlocked spinlock has
the raw_lock parameter initialized to zero (meaning that the kzalloc of the
net_device private data is equivalent to calling spin_lock_init), but that's not
true in all cases, and we aren't guaranteed that condition, so we need to pass
the relevant spinlocks through the spin_lock_init function.
Fix it by moving the spin_lock_init calls for the tx and rx hashtable locks to
the ndo_init path, so they are ready for use by the bond_store_slaves path.
Change notes:
v2) Based on conversation with Jay and Nicolas it seems that the ability to
enslave devices while the bond master is down should be safe. As such
this is an outlier bug, and so instead we'll just initialize the errant spinlocks
in the init path rather than the open path, solving the problem. We'll also
remove the warnings about the bond being down during enslave operations, since
it should be safe.
v3) Fix spelling error
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: jtluka@redhat.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: nicolas.2p.debian@gmail.com
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
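The point made in the commit message above (zeroed memory is not guaranteed to be a valid unlocked lock) can be illustrated with a small userspace sketch. Everything here is hypothetical and is not kernel code: `toy_lock`, `toy_bond_info` and `toy_bond_alloc` are invented names, and the toy lock deliberately encodes "unlocked" as 1 so that a zero-filled allocation looks like a held lock. It only assumes C11 `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical toy lock: 1 means unlocked, 0 means locked.
 * With this encoding, zero-filled memory looks like a HELD lock. */
#define TOY_UNLOCKED 1

struct toy_lock {
    atomic_int slock;
};

/* Explicit initialization, playing the role of spin_lock_init(). */
static void toy_lock_init(struct toy_lock *l)
{
    atomic_init(&l->slock, TOY_UNLOCKED);
}

/* Try to take the lock once; returns nonzero on success. */
static int toy_trylock(struct toy_lock *l)
{
    int expected = TOY_UNLOCKED;
    return atomic_compare_exchange_strong(&l->slock, &expected, 0);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_store(&l->slock, TOY_UNLOCKED);
}

/* Hypothetical stand-in for the private data that holds the hash locks. */
struct toy_bond_info {
    struct toy_lock tx_hashtbl_lock;
    struct toy_lock rx_hashtbl_lock;
};

/* Allocate zeroed private data. When init_locks is nonzero the locks are
 * initialized at allocation time (the "init path"); otherwise they are
 * left as raw zeroed memory, as if initialization waited for "open". */
static struct toy_bond_info *toy_bond_alloc(int init_locks)
{
    struct toy_bond_info *info = calloc(1, sizeof(*info));

    if (!info)
        return NULL;
    if (init_locks) {
        toy_lock_init(&info->tx_hashtbl_lock);
        toy_lock_init(&info->rx_hashtbl_lock);
    }
    return info;
}
```

With `toy_bond_alloc(0)` the first `toy_trylock()` fails, and a spinning lock would wait forever, which is the userspace analogue of the soft lockup. With `toy_bond_alloc(1)` the lock can be taken straight away, regardless of whether the device was ever "opened".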
2011-05-25 12:13:01 +04:00
struct alb_bond_info * bond_info = & ( BOND_ALB_INFO ( bond ) ) ;
2009-06-12 23:02:52 +04:00
pr_debug ( " Begin bond_init for %s \n " , bond_dev - > name ) ;
2011-05-25 12:13:01 +04:00
/*
* Initialize locks that may be required during
* en / deslave operations . All of the bond_open work
* ( of which this is part ) should really be moved to
* a phase prior to dev_open
*/
spin_lock_init ( & ( bond_info - > tx_hashtbl_lock ) ) ;
spin_lock_init ( & ( bond_info - > rx_hashtbl_lock ) ) ;
2009-06-12 23:02:52 +04:00
bond - > wq = create_singlethread_workqueue ( bond_dev - > name ) ;
if ( ! bond - > wq )
return - ENOMEM ;
bond_set_lockdep_class ( bond_dev ) ;
2009-10-29 17:18:26 +03:00
list_add_tail ( & bond - > bond_list , & bn - > dev_list ) ;
2009-06-12 23:02:52 +04:00
2009-10-29 17:18:22 +03:00
bond_prepare_sysfs_group ( bond ) ;
2010-04-02 01:22:57 +04:00
2010-12-09 18:17:13 +03:00
bond_debug_register ( bond ) ;
2013-01-30 14:08:11 +04:00
/* Ensure valid dev_addr */
if ( is_zero_ether_addr ( bond_dev - > dev_addr ) & &
2013-06-26 19:13:38 +04:00
bond_dev - > addr_assign_type = = NET_ADDR_PERM )
2013-01-30 14:08:11 +04:00
eth_hw_addr_random ( bond_dev ) ;
2009-06-12 23:02:52 +04:00
return 0 ;
}
2009-10-29 17:18:25 +03:00
static int bond_validate ( struct nlattr * tb [ ] , struct nlattr * data [ ] )
{
if ( tb [ IFLA_ADDRESS ] ) {
if ( nla_len ( tb [ IFLA_ADDRESS ] ) ! = ETH_ALEN )
return - EINVAL ;
if ( ! is_valid_ether_addr ( nla_data ( tb [ IFLA_ADDRESS ] ) ) )
return - EADDRNOTAVAIL ;
}
return 0 ;
}
2012-07-20 06:28:47 +04:00
static unsigned int bond_get_num_tx_queues ( void )
2011-08-10 10:09:44 +04:00
{
2012-04-10 22:34:43 +04:00
return tx_queues ;
2011-08-10 10:09:44 +04:00
}
2009-10-29 17:18:25 +03:00
static struct rtnl_link_ops bond_link_ops __read_mostly = {
2012-07-20 06:28:47 +04:00
. kind = " bond " ,
. priv_size = sizeof ( struct bonding ) ,
. setup = bond_setup ,
. validate = bond_validate ,
. get_num_tx_queues = bond_get_num_tx_queues ,
. get_num_rx_queues = bond_get_num_tx_queues , /* Use the same number
as for TX queues */
2009-10-29 17:18:25 +03:00
} ;
2005-11-09 21:36:04 +03:00
/* Create a new bond based on the specified name and bonding parameters.
2007-01-20 05:15:31 +03:00
* If name is NULL , obtain a suitable " bond%d " name for us .
2005-11-09 21:36:04 +03:00
* Caller must NOT hold rtnl_lock ; we need to release it here before we
* set up our sysfs entries .
*/
2009-10-29 17:18:26 +03:00
int bond_create ( struct net * net , const char * name )
2005-11-09 21:36:04 +03:00
{
struct net_device * bond_dev ;
int res ;
rtnl_lock ( ) ;
2008-01-18 03:25:02 +03:00
2011-04-30 05:21:32 +04:00
bond_dev = alloc_netdev_mq ( sizeof ( struct bonding ) ,
name ? name : " bond%d " ,
bond_setup , tx_queues ) ;
2005-11-09 21:36:04 +03:00
if ( ! bond_dev ) {
2009-12-14 07:06:07 +03:00
pr_err ( " %s: eek! can't alloc netdev! \n " , name ) ;
2010-04-01 01:30:52 +04:00
rtnl_unlock ( ) ;
return - ENOMEM ;
2005-11-09 21:36:04 +03:00
}
2009-10-29 17:18:26 +03:00
dev_net_set ( bond_dev , net ) ;
2009-10-29 17:18:25 +03:00
bond_dev - > rtnl_link_ops = & bond_link_ops ;
2005-11-09 21:36:04 +03:00
res = register_netdevice ( bond_dev ) ;
2006-11-09 06:51:01 +03:00
2011-03-14 09:22:05 +03:00
netif_carrier_off ( bond_dev ) ;
2009-06-12 23:02:46 +04:00
rtnl_unlock ( ) ;
2010-04-01 01:30:52 +04:00
if ( res < 0 )
bond_destructor ( bond_dev ) ;
2009-10-29 17:18:23 +03:00
return res ;
2005-11-09 21:36:04 +03:00
}
2010-01-17 06:35:32 +03:00
static int __net_init bond_net_init ( struct net * net )
2009-10-29 17:18:26 +03:00
{
2009-11-29 18:46:04 +03:00
struct bond_net * bn = net_generic ( net , bond_net_id ) ;
2009-10-29 17:18:26 +03:00
bn - > net = net ;
INIT_LIST_HEAD ( & bn - > dev_list ) ;
bond_create_proc_dir ( bn ) ;
2011-10-13 01:56:25 +04:00
bond_create_sysfs ( bn ) ;
2013-06-24 13:49:29 +04:00
2009-11-29 18:46:04 +03:00
return 0 ;
2009-10-29 17:18:26 +03:00
}
2010-01-17 06:35:32 +03:00
static void __net_exit bond_net_exit ( struct net * net )
2009-10-29 17:18:26 +03:00
{
2009-11-29 18:46:04 +03:00
struct bond_net * bn = net_generic ( net , bond_net_id ) ;
2013-04-06 04:54:38 +04:00
struct bonding * bond , * tmp_bond ;
LIST_HEAD ( list ) ;
2009-10-29 17:18:26 +03:00
2011-10-13 01:56:25 +04:00
bond_destroy_sysfs ( bn ) ;
2009-10-29 17:18:26 +03:00
bond_destroy_proc_dir ( bn ) ;
2013-04-06 04:54:38 +04:00
/* Kill off any bonds created after unregistering bond rtnl ops */
rtnl_lock ( ) ;
list_for_each_entry_safe ( bond , tmp_bond , & bn - > dev_list , bond_list )
unregister_netdevice_queue ( bond - > dev , & list ) ;
unregister_netdevice_many ( & list ) ;
rtnl_unlock ( ) ;
2009-10-29 17:18:26 +03:00
}
static struct pernet_operations bond_net_ops = {
. init = bond_net_init ,
. exit = bond_net_exit ,
2009-11-29 18:46:04 +03:00
. id = & bond_net_id ,
. size = sizeof ( struct bond_net ) ,
2009-10-29 17:18:26 +03:00
} ;
2005-04-17 02:20:36 +04:00
static int __init bonding_init ( void )
{
int i ;
int res ;
2011-03-07 00:58:46 +03:00
pr_info ( " %s " , bond_version ) ;
2005-04-17 02:20:36 +04:00
2005-11-09 21:36:04 +03:00
res = bond_check_params ( & bonding_defaults ) ;
2009-06-12 23:02:48 +04:00
if ( res )
2005-11-09 21:36:04 +03:00
goto out ;
2005-04-17 02:20:36 +04:00
2009-11-29 18:46:04 +03:00
res = register_pernet_subsys ( & bond_net_ops ) ;
2009-10-29 17:18:26 +03:00
if ( res )
goto out ;
2008-01-18 03:25:02 +03:00
2009-10-29 17:18:25 +03:00
res = rtnl_link_register ( & bond_link_ops ) ;
if ( res )
2009-10-30 02:58:54 +03:00
goto err_link ;
2009-10-29 17:18:25 +03:00
2010-12-09 18:17:13 +03:00
bond_create_debugfs ( ) ;
2005-04-17 02:20:36 +04:00
for ( i = 0 ; i < max_bonds ; i + + ) {
2009-10-29 17:18:26 +03:00
res = bond_create ( & init_net , NULL ) ;
2005-11-09 21:36:04 +03:00
if ( res )
goto err ;
2005-04-17 02:20:36 +04:00
}
register_netdevice_notifier ( & bond_netdev_notifier ) ;
2005-11-09 21:36:04 +03:00
out :
2005-04-17 02:20:36 +04:00
return res ;
2009-10-29 17:18:25 +03:00
err :
bond_destroy_debugfs ( ) ;
rtnl_link_unregister ( & bond_link_ops ) ;
2009-10-30 02:58:54 +03:00
err_link :
2009-11-29 18:46:04 +03:00
unregister_pernet_subsys ( & bond_net_ops ) ;
2009-10-29 17:18:25 +03:00
goto out ;
2005-11-09 21:36:04 +03:00
2005-04-17 02:20:36 +04:00
}
static void __exit bonding_exit ( void )
{
unregister_netdevice_notifier ( & bond_netdev_notifier ) ;
2010-12-09 18:17:13 +03:00
bond_destroy_debugfs ( ) ;
2008-05-03 04:49:39 +04:00
2013-04-03 09:46:33 +04:00
rtnl_link_unregister ( & bond_link_ops ) ;
2013-04-06 04:54:37 +04:00
unregister_pernet_subsys ( & bond_net_ops ) ;
2010-10-13 20:01:50 +04:00
# ifdef CONFIG_NET_POLL_CONTROLLER
net: Convert netpoll blocking api in bonding driver to be a counter
A while back I made some changes to enable netpoll in the bonding driver. Among
them was a per-cpu flag that indicated we were in a path that held locks which
could cause the netpoll path to block during tx, and as such the tx path
should queue the frame for later use. This appears to have given rise to a
regression. If one of those paths on which we hold the per-cpu flag yields the
cpu, it's possible for us to come back on a different cpu, leading to us clearing
a different flag than we set. This results in odd netpoll drops, and BUG
backtraces appearing in the log, as we check to make sure that we only clear set
bits, and only set clear bits. I had thought briefly about changing the
offending paths so that they wouldn't sleep, but looking at my original work
more closely, it doesn't appear that a per-cpu flag is warranted. We already
gate the checking of this flag on IFF_IN_NETPOLL, so we don't hit this in the
normal tx case anyway. And practically speaking, the normal use case for
netpoll is to only have one client anyway, so we're not going to erroneously
queue netpoll frames when it's actually safe to do so. As such, let's just
convert that per-cpu flag to an atomic counter. It fixes the rescheduling bugs,
is equivalent from a performance perspective and actually eliminates some code
in the process.
Tested by the reporter and myself, successfully
Reported-by: Liang Zheng <lzheng@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
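As a rough illustration of the counter scheme described in the commit message above, here is a minimal userspace sketch using C11 atomics. It is not the driver's code: the names `toy_block_tx`, `toy_block_netpoll_tx`, `toy_unblock_netpoll_tx` and `toy_is_netpoll_tx_blocked` are hypothetical stand-ins, and the kernel's `atomic_t` API is replaced by `<stdatomic.h>`. The idea is that a single shared counter stays balanced no matter which CPU runs the increment and the matching decrement, whereas a per-cpu flag does not.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the driver's global netpoll_block_tx counter. */
static atomic_int toy_block_tx;

/* Enter a section during which netpoll transmit must be deferred. */
static void toy_block_netpoll_tx(void)
{
    atomic_fetch_add(&toy_block_tx, 1);
}

/* Leave the section. The decrement is valid from any CPU or thread,
 * so it does not matter if the caller migrated after blocking. */
static void toy_unblock_netpoll_tx(void)
{
    atomic_fetch_sub(&toy_block_tx, 1);
}

/* Nonzero while at least one blocking section is active. */
static int toy_is_netpoll_tx_blocked(void)
{
    return atomic_load(&toy_block_tx) != 0;
}

/* Value expected to be zero at teardown; a nonzero result would mean an
 * unbalanced block/unblock pair, which is what the WARN_ON checks for. */
static int toy_block_tx_imbalance(void)
{
    return atomic_load(&toy_block_tx);
}
```

Nested sections compose naturally with a counter: two calls to `toy_block_netpoll_tx()` need two calls to `toy_unblock_netpoll_tx()` before transmit is allowed again.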
2010-12-06 12:05:50 +03:00
/*
* Make sure we don ' t have an imbalance on our netpoll blocking
*/
WARN_ON ( atomic_read ( & netpoll_block_tx ) ) ;
2010-10-13 20:01:50 +04:00
# endif
2005-04-17 02:20:36 +04:00
}
module_init ( bonding_init ) ;
module_exit ( bonding_exit ) ;
MODULE_LICENSE ( " GPL " ) ;
MODULE_VERSION ( DRV_VERSION ) ;
MODULE_DESCRIPTION ( DRV_DESCRIPTION " , v " DRV_VERSION ) ;
MODULE_AUTHOR ( " Thomas Davis, tadavis@lbl.gov and many others " ) ;
2009-10-29 17:18:25 +03:00
MODULE_ALIAS_RTNL_LINK ( " bond " ) ;