2006-01-02 19:04:38 +01:00
/*
* net / tipc / link . c : TIPC link code
2007-02-09 23:25:21 +09:00
*
2016-03-03 14:23:21 -05:00
* Copyright ( c ) 1996 - 2007 , 2012 - 2016 , Ericsson AB
2013-06-17 10:54:42 -04:00
* Copyright ( c ) 2004 - 2007 , 2010 - 2013 , Wind River Systems
2006-01-02 19:04:38 +01:00
* All rights reserved .
*
2006-01-11 13:30:43 +01:00
* Redistribution and use in source and binary forms , with or without
2006-01-02 19:04:38 +01:00
* modification , are permitted provided that the following conditions are met :
*
2006-01-11 13:30:43 +01:00
* 1. Redistributions of source code must retain the above copyright
* notice , this list of conditions and the following disclaimer .
* 2. Redistributions in binary form must reproduce the above copyright
* notice , this list of conditions and the following disclaimer in the
* documentation and / or other materials provided with the distribution .
* 3. Neither the names of the copyright holders nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission .
2006-01-02 19:04:38 +01:00
*
2006-01-11 13:30:43 +01:00
* Alternatively , this software may be distributed under the terms of the
* GNU General Public License ( " GPL " ) version 2 as published by the Free
* Software Foundation .
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS " AS IS "
* AND ANY EXPRESS OR IMPLIED WARRANTIES , INCLUDING , BUT NOT LIMITED TO , THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED . IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT , INDIRECT , INCIDENTAL , SPECIAL , EXEMPLARY , OR
* CONSEQUENTIAL DAMAGES ( INCLUDING , BUT NOT LIMITED TO , PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES ; LOSS OF USE , DATA , OR PROFITS ; OR BUSINESS
* INTERRUPTION ) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY , WHETHER IN
* CONTRACT , STRICT LIABILITY , OR TORT ( INCLUDING NEGLIGENCE OR OTHERWISE )
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE , EVEN IF ADVISED OF THE
2006-01-02 19:04:38 +01:00
* POSSIBILITY OF SUCH DAMAGE .
*/
# include "core.h"
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
# include "subscr.h"
2006-01-02 19:04:38 +01:00
# include "link.h"
2014-11-20 10:29:12 +01:00
# include "bcast.h"
2014-05-14 05:39:15 -04:00
# include "socket.h"
2006-01-02 19:04:38 +01:00
# include "name_distr.h"
# include "discover.h"
2014-11-20 10:29:07 +01:00
# include "netlink.h"
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
# include "monitor.h"
2006-01-02 19:04:38 +01:00
2013-06-17 10:54:48 -04:00
# include <linux/pkt_sched.h>
2015-11-19 14:30:46 -05:00
struct tipc_stats {
2016-11-25 10:35:02 -05:00
u32 sent_pkts ;
u32 recv_pkts ;
2015-11-19 14:30:46 -05:00
u32 sent_states ;
u32 recv_states ;
u32 sent_probes ;
u32 recv_probes ;
u32 sent_nacks ;
u32 recv_nacks ;
u32 sent_acks ;
u32 sent_bundled ;
u32 sent_bundles ;
u32 recv_bundled ;
u32 recv_bundles ;
u32 retransmitted ;
u32 sent_fragmented ;
u32 sent_fragments ;
u32 recv_fragmented ;
u32 recv_fragments ;
u32 link_congs ; /* # port sends blocked by congestion */
u32 deferred_recv ;
u32 duplicates ;
u32 max_queue_sz ; /* send queue size high water mark */
u32 accu_queue_sz ; /* used for send queue size profiling */
u32 queue_sz_counts ; /* used for send queue size profiling */
u32 msg_length_counts ; /* used for message length profiling */
u32 msg_lengths_total ; /* used for message length profiling */
u32 msg_length_profile [ 7 ] ; /* used for msg. length profiling */
} ;
/**
* struct tipc_link - TIPC link data structure
* @ addr : network address of link ' s peer node
* @ name : link name character string
* @ media_addr : media address to use when sending messages over link
* @ timer : link timer
* @ net : pointer to namespace struct
* @ refcnt : reference counter for permanent references ( owner node & timer )
* @ peer_session : link session # being used by peer end of link
* @ peer_bearer_id : bearer id used by link ' s peer endpoint
* @ bearer_id : local bearer id used by link
* @ tolerance : minimum link continuity loss needed to reset link [ in ms ]
* @ abort_limit : # of unacknowledged continuity probes needed to reset link
* @ state : current state of link FSM
* @ peer_caps : bitmap describing capabilities of peer node
* @ silent_intv_cnt : # of timer intervals without any reception from peer
* @ proto_msg : template for control messages generated by link
* @ pmsg : convenience pointer to " proto_msg " field
* @ priority : current link priority
* @ net_plane : current link network plane ( ' A ' through ' H ' )
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
* @ mon_state : cookie with information needed by link monitor
2015-11-19 14:30:46 -05:00
* @ backlog_limit : backlog queue congestion thresholds ( indexed by importance )
* @ exp_msg_count : # of tunnelled messages expected during link changeover
* @ reset_rcv_checkpt : seq # of last acknowledged message at time of link reset
* @ mtu : current maximum packet size for this link
* @ advertised_mtu : advertised own mtu when link is being established
* @ transmitq : queue for sent , non - acked messages
* @ backlogq : queue for messages waiting to be sent
* @ snt_nxt : next sequence number to use for outbound messages
* @ last_retransmitted : sequence number of most recently retransmitted message
* @ stale_count : # of identical retransmit requests made by peer
* @ ackers : # of peers that needs to ack each packet before it can be released
* @ acked : # last packet acked by a certain peer . Used for broadcast .
* @ rcv_nxt : next sequence number to expect for inbound messages
* @ deferred_queue : deferred queue saved OOS b ' cast message received from node
* @ unacked_window : # of inbound messages rx ' d without ack ' ing back to peer
* @ inputq : buffer queue for messages to be delivered upwards
* @ namedq : buffer queue for name table messages to be delivered upwards
* @ next_out : ptr to first unsent outbound message in queue
* @ wakeupq : linked list of wakeup msgs waiting for link congestion to abate
* @ long_msg_seq_no : next identifier to use for outbound fragmented messages
* @ reasm_buf : head of partially reassembled inbound message fragments
* @ bc_rcvr : marks that this is a broadcast receiver link
* @ stats : collects statistics regarding link activity
*/
struct tipc_link {
u32 addr ;
char name [ TIPC_MAX_LINK_NAME ] ;
struct net * net ;
/* Management and link supervision data */
u32 peer_session ;
2016-03-03 14:23:21 -05:00
u32 session ;
2015-11-19 14:30:46 -05:00
u32 peer_bearer_id ;
u32 bearer_id ;
u32 tolerance ;
u32 abort_limit ;
u32 state ;
u16 peer_caps ;
bool active ;
u32 silent_intv_cnt ;
2016-03-03 14:23:21 -05:00
char if_name [ TIPC_MAX_IF_NAME ] ;
2015-11-19 14:30:46 -05:00
u32 priority ;
char net_plane ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
struct tipc_mon_state mon_state ;
2016-04-15 13:33:04 -04:00
u16 rst_cnt ;
2015-11-19 14:30:46 -05:00
/* Failover/synch */
u16 drop_point ;
struct sk_buff * failover_reasm_skb ;
/* Max packet negotiation */
u16 mtu ;
u16 advertised_mtu ;
/* Sending */
struct sk_buff_head transmq ;
struct sk_buff_head backlogq ;
struct {
u16 len ;
u16 limit ;
} backlog [ 5 ] ;
u16 snd_nxt ;
u16 last_retransm ;
u16 window ;
u32 stale_count ;
/* Reception */
u16 rcv_nxt ;
u32 rcv_unacked ;
struct sk_buff_head deferdq ;
struct sk_buff_head * inputq ;
struct sk_buff_head * namedq ;
/* Congestion handling */
struct sk_buff_head wakeupq ;
/* Fragmentation/reassembly */
struct sk_buff * reasm_buf ;
/* Broadcast */
u16 ackers ;
u16 acked ;
struct tipc_link * bc_rcvlink ;
struct tipc_link * bc_sndlink ;
2016-09-01 13:52:50 -04:00
unsigned long prev_retr ;
u16 prev_from ;
u16 prev_to ;
u8 nack_state ;
2015-11-19 14:30:46 -05:00
bool bc_peer_is_up ;
/* Statistics */
struct tipc_stats stats ;
} ;
2012-06-29 00:16:37 -04:00
/*
* Error message prefixes
*/
2015-07-30 18:24:19 -04:00
static const char * link_co_err = " Link tunneling error, " ;
2012-06-29 00:16:37 -04:00
static const char * link_rst_msg = " Resetting link " ;
2014-11-20 10:29:12 +01:00
2015-10-22 08:51:41 -04:00
/* Send states for broadcast NACKs
*/
enum {
BC_NACK_SND_CONDITIONAL ,
BC_NACK_SND_UNCONDITIONAL ,
BC_NACK_SND_SUPPRESS ,
} ;
2016-09-01 13:52:50 -04:00
# define TIPC_BC_RETR_LIMIT 10 /* [ms] */
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/*
* Interval between NACKs when packets arrive out of order
*/
# define TIPC_NACK_INTV (TIPC_MIN_LINK_WIN * 2)
2016-03-03 14:23:21 -05:00
/* Wildcard value for link session numbers. When it is known that
* peer endpoint is down , any session number must be accepted .
2008-06-04 17:29:39 -07:00
*/
2016-03-03 14:23:21 -05:00
# define ANY_SESSION 0x10000
2008-06-04 17:29:39 -07:00
2015-07-30 18:24:21 -04:00
/* Link FSM states:
2006-01-02 19:04:38 +01:00
*/
tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().
We eliminate these flags in this commit.
Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.
Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:25 -04:00
enum {
2015-07-30 18:24:21 -04:00
LINK_ESTABLISHED = 0xe ,
LINK_ESTABLISHING = 0xe < < 4 ,
LINK_RESET = 0x1 < < 8 ,
LINK_RESETTING = 0x2 < < 12 ,
LINK_PEER_RESET = 0xd < < 16 ,
LINK_FAILINGOVER = 0xf < < 20 ,
LINK_SYNCHING = 0xc < < 24
tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().
We eliminate these flags in this commit.
Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.
Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:25 -04:00
} ;
/* Link FSM state checking routines
*/
2015-07-30 18:24:21 -04:00
static int link_is_up ( struct tipc_link * l )
tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().
We eliminate these flags in this commit.
Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.
Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:25 -04:00
{
2015-07-30 18:24:21 -04:00
return l - > state & ( LINK_ESTABLISHED | LINK_SYNCHING ) ;
tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().
We eliminate these flags in this commit.
Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.
Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:25 -04:00
}
2006-01-02 19:04:38 +01:00
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
static int tipc_link_proto_rcv ( struct tipc_link * l , struct sk_buff * skb ,
struct sk_buff_head * xmitq ) ;
2015-07-16 16:54:26 -04:00
static void tipc_link_build_proto_msg ( struct tipc_link * l , int mtyp , bool probe ,
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
bool probe_reply , u16 rcvgap ,
int tolerance , int priority ,
2015-07-16 16:54:26 -04:00
struct sk_buff_head * xmitq ) ;
2015-11-19 14:30:47 -05:00
static void link_print ( struct tipc_link * l , const char * str ) ;
2016-09-01 13:52:51 -04:00
static int tipc_link_build_nack_msg ( struct tipc_link * l ,
struct sk_buff_head * xmitq ) ;
2015-10-22 08:51:41 -04:00
static void tipc_link_build_bc_init_msg ( struct tipc_link * l ,
struct sk_buff_head * xmitq ) ;
static bool tipc_link_release_pkts ( struct tipc_link * l , u16 to ) ;
2015-03-25 12:07:26 -04:00
2006-01-02 19:04:38 +01:00
/*
2006-03-20 22:37:04 -08:00
* Simple non - static link routines ( i . e . referenced outside this file )
2006-01-02 19:04:38 +01:00
*/
2015-07-30 18:24:21 -04:00
bool tipc_link_is_up ( struct tipc_link * l )
2006-01-02 19:04:38 +01:00
{
2015-07-30 18:24:21 -04:00
return link_is_up ( l ) ;
}
2015-10-15 14:52:46 -04:00
bool tipc_link_peer_is_down ( struct tipc_link * l )
{
return l - > state = = LINK_PEER_RESET ;
}
2015-07-30 18:24:21 -04:00
bool tipc_link_is_reset ( struct tipc_link * l )
{
return l - > state & ( LINK_RESET | LINK_FAILINGOVER | LINK_ESTABLISHING ) ;
}
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
bool tipc_link_is_establishing ( struct tipc_link * l )
{
return l - > state = = LINK_ESTABLISHING ;
}
2015-07-30 18:24:21 -04:00
bool tipc_link_is_synching ( struct tipc_link * l )
{
return l - > state = = LINK_SYNCHING ;
}
bool tipc_link_is_failingover ( struct tipc_link * l )
{
return l - > state = = LINK_FAILINGOVER ;
}
bool tipc_link_is_blocked ( struct tipc_link * l )
{
return l - > state & ( LINK_RESETTING | LINK_PEER_RESET | LINK_FAILINGOVER ) ;
2006-01-02 19:04:38 +01:00
}
2015-10-24 22:56:01 +08:00
static bool link_is_bc_sndlink ( struct tipc_link * l )
2015-10-22 08:51:41 -04:00
{
return ! l - > bc_sndlink ;
}
2015-10-24 22:56:01 +08:00
static bool link_is_bc_rcvlink ( struct tipc_link * l )
2015-10-22 08:51:41 -04:00
{
return ( ( l - > bc_rcvlink = = l ) & & ! link_is_bc_sndlink ( l ) ) ;
}
2015-07-16 16:54:19 -04:00
int tipc_link_is_active ( struct tipc_link * l )
2006-01-02 19:04:38 +01:00
{
2015-10-22 08:51:46 -04:00
return l - > active ;
}
2015-07-16 16:54:19 -04:00
2015-10-22 08:51:46 -04:00
void tipc_link_set_active ( struct tipc_link * l , bool active )
{
l - > active = active ;
2006-01-02 19:04:38 +01:00
}
2015-11-19 14:30:46 -05:00
u32 tipc_link_id ( struct tipc_link * l )
{
return l - > peer_bearer_id < < 16 | l - > bearer_id ;
}
int tipc_link_window ( struct tipc_link * l )
{
return l - > window ;
}
int tipc_link_prio ( struct tipc_link * l )
{
return l - > priority ;
}
unsigned long tipc_link_tolerance ( struct tipc_link * l )
{
return l - > tolerance ;
}
struct sk_buff_head * tipc_link_inputq ( struct tipc_link * l )
{
return l - > inputq ;
}
char tipc_link_plane ( struct tipc_link * l )
{
return l - > net_plane ;
}
2015-10-22 08:51:41 -04:00
void tipc_link_add_bc_peer ( struct tipc_link * snd_l ,
struct tipc_link * uc_l ,
struct sk_buff_head * xmitq )
2015-10-22 08:51:39 -04:00
{
2015-10-22 08:51:41 -04:00
struct tipc_link * rcv_l = uc_l - > bc_rcvlink ;
snd_l - > ackers + + ;
rcv_l - > acked = snd_l - > snd_nxt - 1 ;
tipc: correct settings of broadcast link state
Since commit 5266698661401afc5e ("tipc: let broadcast packet
reception use new link receive function") the broadcast send
link state was meant to always be set to LINK_ESTABLISHED, since
we don't need this link to follow the regular link FSM rules. It
was also the intention that this state anyway shouldn't impact
the run-time working state of the link, since the latter in
reality is controlled by the number of registered peers.
We have now discovered that this assumption is not quite correct.
If the broadcast link is reset because of too many retransmissions,
its state will inadvertently go to LINK_RESETTING, and never go
back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
anticipated. This will work well once, but if it happens a second
time, the reset on a link in LINK_RESETTING has has no effect, and
neither the broadcast link nor the unicast links will go down as
they should.
Furthermore, it is confusing that the management tool shows that
this link is in UP state when that obviously isn't the case.
We now ensure that this state strictly follows the true working
state of the link. The state is set to LINK_ESTABLISHED when
the number of peers is non-zero, and to LINK_RESET otherwise.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-19 14:12:50 -05:00
snd_l - > state = LINK_ESTABLISHED ;
2015-10-22 08:51:41 -04:00
tipc_link_build_bc_init_msg ( uc_l , xmitq ) ;
2015-10-22 08:51:39 -04:00
}
2015-10-22 08:51:41 -04:00
void tipc_link_remove_bc_peer ( struct tipc_link * snd_l ,
struct tipc_link * rcv_l ,
struct sk_buff_head * xmitq )
2015-10-22 08:51:39 -04:00
{
2015-10-22 08:51:41 -04:00
u16 ack = snd_l - > snd_nxt - 1 ;
snd_l - > ackers - - ;
2016-07-11 16:08:36 -04:00
rcv_l - > bc_peer_is_up = true ;
rcv_l - > state = LINK_ESTABLISHED ;
2015-10-22 08:51:41 -04:00
tipc_link_bc_ack_rcv ( rcv_l , ack , xmitq ) ;
tipc_link_reset ( rcv_l ) ;
rcv_l - > state = LINK_RESET ;
if ( ! snd_l - > ackers ) {
tipc_link_reset ( snd_l ) ;
tipc: correct settings of broadcast link state
Since commit 5266698661401afc5e ("tipc: let broadcast packet
reception use new link receive function") the broadcast send
link state was meant to always be set to LINK_ESTABLISHED, since
we don't need this link to follow the regular link FSM rules. It
was also the intention that this state anyway shouldn't impact
the run-time working state of the link, since the latter in
reality is controlled by the number of registered peers.
We have now discovered that this assumption is not quite correct.
If the broadcast link is reset because of too many retransmissions,
its state will inadvertently go to LINK_RESETTING, and never go
back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
anticipated. This will work well once, but if it happens a second
time, the reset on a link in LINK_RESETTING has has no effect, and
neither the broadcast link nor the unicast links will go down as
they should.
Furthermore, it is confusing that the management tool shows that
this link is in UP state when that obviously isn't the case.
We now ensure that this state strictly follows the true working
state of the link. The state is set to LINK_ESTABLISHED when
the number of peers is non-zero, and to LINK_RESET otherwise.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-19 14:12:50 -05:00
snd_l - > state = LINK_RESET ;
2015-10-22 08:51:41 -04:00
__skb_queue_purge ( xmitq ) ;
}
2015-10-22 08:51:39 -04:00
}
int tipc_link_bc_peers ( struct tipc_link * l )
{
return l - > ackers ;
}
2016-09-01 13:52:49 -04:00
u16 link_bc_rcv_gap ( struct tipc_link * l )
{
struct sk_buff * skb = skb_peek ( & l - > deferdq ) ;
u16 gap = 0 ;
if ( more ( l - > snd_nxt , l - > rcv_nxt ) )
gap = l - > snd_nxt - l - > rcv_nxt ;
if ( skb )
gap = buf_seqno ( skb ) - l - > rcv_nxt ;
return gap ;
}
2015-10-22 08:51:43 -04:00
void tipc_link_set_mtu ( struct tipc_link * l , int mtu )
{
l - > mtu = mtu ;
}
int tipc_link_mtu ( struct tipc_link * l )
{
return l - > mtu ;
}
2015-11-19 14:30:46 -05:00
u16 tipc_link_rcv_nxt ( struct tipc_link * l )
{
return l - > rcv_nxt ;
}
u16 tipc_link_acked ( struct tipc_link * l )
{
return l - > acked ;
}
char * tipc_link_name ( struct tipc_link * l )
{
return l - > name ;
}
2006-01-02 19:04:38 +01:00
/**
2006-01-18 00:38:21 +01:00
* tipc_link_create - create a new link
2015-07-30 18:24:26 -04:00
* @ n : pointer to associated node
2015-10-22 08:51:36 -04:00
* @ if_name : associated interface name
* @ bearer_id : id ( index ) of associated bearer
* @ tolerance : link tolerance to be used by link
* @ net_plane : network plane ( A , B , c . . ) this link belongs to
* @ mtu : mtu to be advertised by link
* @ priority : priority to be used by link
* @ window : send window to be used by link
* @ session : session to be used by link
2015-07-30 18:24:26 -04:00
* @ ownnode : identity of own node
2015-10-22 08:51:36 -04:00
* @ peer : node id of peer node
2015-10-22 08:51:40 -04:00
* @ peer_caps : bitmap describing peer node capabilities
2015-10-22 08:51:41 -04:00
* @ bc_sndlink : the namespace global link used for broadcast sending
* @ bc_rcvlink : the peer specific link used for broadcast reception
2015-07-30 18:24:26 -04:00
* @ inputq : queue to put messages ready for delivery
* @ namedq : queue to put binding table update messages ready for delivery
* @ link : return value , pointer to put the created link
2007-02-09 23:25:21 +09:00
*
2015-07-30 18:24:26 -04:00
* Returns true if link was created , otherwise false
2006-01-02 19:04:38 +01:00
*/
2015-10-22 08:51:46 -04:00
bool tipc_link_create ( struct net * net , char * if_name , int bearer_id ,
2015-10-22 08:51:36 -04:00
int tolerance , char net_plane , u32 mtu , int priority ,
tipc: remove restrictions on node address values
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format, - it is sub-divided
into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.
However, the 'zone' and 'cluster' levels have in reality never been
fully implemented,and never will be. The result of this has been
that the first 20 bits the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.
In the following commits, we will need to be able to connect between
nodes which are using the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to node identity, -it is from now on only a 32-bit integer with no
assumed internal structure.
Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.
The rules for accepting discovery requests/responses from neighboring
nodes now become:
- If the user is using legacy address format on both peers, reception
of discovery messages is subject to the legacy lookup domain check
in addition to the cluster id check.
- Otherwise, the discovery request/response is always accepted, provided
both peers have the same network id.
This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:47 +01:00
int window , u32 session , u32 self , u32 peer ,
2015-10-22 08:51:41 -04:00
u16 peer_caps ,
struct tipc_link * bc_sndlink ,
struct tipc_link * bc_rcvlink ,
struct sk_buff_head * inputq ,
struct sk_buff_head * namedq ,
2015-07-30 18:24:26 -04:00
struct tipc_link * * link )
2006-01-02 19:04:38 +01:00
{
2018-03-22 20:42:50 +01:00
char * self_str = tipc_own_id_string ( net ) ;
2015-07-30 18:24:26 -04:00
struct tipc_link * l ;
2011-02-28 11:32:27 -05:00
2015-07-30 18:24:26 -04:00
l = kzalloc ( sizeof ( * l ) , GFP_ATOMIC ) ;
if ( ! l )
return false ;
* link = l ;
2016-03-03 14:23:21 -05:00
l - > session = session ;
2011-02-28 11:32:27 -05:00
2015-07-30 18:24:26 -04:00
/* Note: peer i/f name is completed by reset/activate message */
2018-03-22 20:42:50 +01:00
if ( strlen ( self_str ) > 16 )
sprintf ( l - > name , " %x:%s-%x:unknown " , self , if_name , peer ) ;
else
sprintf ( l - > name , " %s:%s-%x:unknown " , self_str , if_name , peer ) ;
2016-03-03 14:23:21 -05:00
strcpy ( l - > if_name , if_name ) ;
2015-07-30 18:24:26 -04:00
l - > addr = peer ;
2015-10-22 08:51:40 -04:00
l - > peer_caps = peer_caps ;
2015-10-22 08:51:46 -04:00
l - > net = net ;
2016-03-03 14:23:21 -05:00
l - > peer_session = ANY_SESSION ;
2015-10-22 08:51:36 -04:00
l - > bearer_id = bearer_id ;
l - > tolerance = tolerance ;
l - > net_plane = net_plane ;
l - > advertised_mtu = mtu ;
l - > mtu = mtu ;
l - > priority = priority ;
tipc_link_set_queue_limits ( l , window ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
l - > ackers = 1 ;
2015-10-22 08:51:41 -04:00
l - > bc_sndlink = bc_sndlink ;
l - > bc_rcvlink = bc_rcvlink ;
2015-07-30 18:24:26 -04:00
l - > inputq = inputq ;
l - > namedq = namedq ;
l - > state = LINK_RESETTING ;
__skb_queue_head_init ( & l - > transmq ) ;
__skb_queue_head_init ( & l - > backlogq ) ;
__skb_queue_head_init ( & l - > deferdq ) ;
skb_queue_head_init ( & l - > wakeupq ) ;
skb_queue_head_init ( l - > inputq ) ;
return true ;
2006-01-02 19:04:38 +01:00
}
2015-10-22 08:51:37 -04:00
/**
* tipc_link_bc_create - create new link to be used for broadcast
* @ n : pointer to associated node
2017-11-30 16:47:25 +01:00
* @ mtu : mtu to be used initially if no peers
2015-10-22 08:51:37 -04:00
* @ window : send window to be used
* @ inputq : queue to put messages ready for delivery
* @ namedq : queue to put binding table update messages ready for delivery
* @ link : return value , pointer to put the created link
*
* Returns true if link was created , otherwise false
*/
2015-10-22 08:51:46 -04:00
bool tipc_link_bc_create ( struct net * net , u32 ownnode , u32 peer ,
2015-10-22 08:51:41 -04:00
int mtu , int window , u16 peer_caps ,
2015-10-22 08:51:37 -04:00
struct sk_buff_head * inputq ,
struct sk_buff_head * namedq ,
2015-10-22 08:51:41 -04:00
struct tipc_link * bc_sndlink ,
2015-10-22 08:51:37 -04:00
struct tipc_link * * link )
{
struct tipc_link * l ;
2015-10-22 08:51:46 -04:00
if ( ! tipc_link_create ( net , " " , MAX_BEARERS , 0 , ' Z ' , mtu , 0 , window ,
2015-10-22 08:51:48 -04:00
0 , ownnode , peer , peer_caps , bc_sndlink ,
2015-10-22 08:51:41 -04:00
NULL , inputq , namedq , link ) )
2015-10-22 08:51:37 -04:00
return false ;
l = * link ;
strcpy ( l - > name , tipc_bclink_name ) ;
tipc_link_reset ( l ) ;
2015-10-22 08:51:41 -04:00
l - > state = LINK_RESET ;
2015-10-22 08:51:39 -04:00
l - > ackers = 0 ;
2015-10-22 08:51:41 -04:00
l - > bc_rcvlink = l ;
2015-10-22 08:51:37 -04:00
2015-10-22 08:51:41 -04:00
/* Broadcast send link is always up */
if ( link_is_bc_sndlink ( l ) )
l - > state = LINK_ESTABLISHED ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
2017-01-18 13:50:53 -05:00
/* Disable replicast if even a single peer doesn't support it */
if ( link_is_bc_rcvlink ( l ) & & ! ( peer_caps & TIPC_BCAST_RCAST ) )
tipc_bcast_disable_rcast ( net ) ;
2015-10-22 08:51:41 -04:00
return true ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
}
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
/**
* tipc_link_fsm_evt - link finite state machine
* @ l : pointer to link
* @ evt : state machine event to be processed
*/
2015-07-30 18:24:21 -04:00
int tipc_link_fsm_evt ( struct tipc_link * l , int evt )
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
{
2015-07-30 18:24:20 -04:00
int rc = 0 ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
switch ( l - > state ) {
2015-07-30 18:24:21 -04:00
case LINK_RESETTING :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
switch ( evt ) {
2015-07-30 18:24:21 -04:00
case LINK_PEER_RESET_EVT :
l - > state = LINK_PEER_RESET ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_RESET_EVT :
l - > state = LINK_RESET ;
break ;
case LINK_FAILURE_EVT :
case LINK_FAILOVER_BEGIN_EVT :
case LINK_ESTABLISH_EVT :
case LINK_FAILOVER_END_EVT :
case LINK_SYNCH_BEGIN_EVT :
case LINK_SYNCH_END_EVT :
default :
goto illegal_evt ;
}
break ;
case LINK_RESET :
switch ( evt ) {
case LINK_PEER_RESET_EVT :
l - > state = LINK_ESTABLISHING ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_FAILOVER_BEGIN_EVT :
l - > state = LINK_FAILINGOVER ;
case LINK_FAILURE_EVT :
case LINK_RESET_EVT :
case LINK_ESTABLISH_EVT :
case LINK_FAILOVER_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_SYNCH_BEGIN_EVT :
case LINK_SYNCH_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
default :
2015-07-30 18:24:21 -04:00
goto illegal_evt ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
}
break ;
2015-07-30 18:24:21 -04:00
case LINK_PEER_RESET :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
switch ( evt ) {
2015-07-30 18:24:21 -04:00
case LINK_RESET_EVT :
l - > state = LINK_ESTABLISHING ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_PEER_RESET_EVT :
case LINK_ESTABLISH_EVT :
case LINK_FAILURE_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_SYNCH_BEGIN_EVT :
case LINK_SYNCH_END_EVT :
case LINK_FAILOVER_BEGIN_EVT :
case LINK_FAILOVER_END_EVT :
default :
goto illegal_evt ;
}
break ;
case LINK_FAILINGOVER :
switch ( evt ) {
case LINK_FAILOVER_END_EVT :
l - > state = LINK_RESET ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_PEER_RESET_EVT :
case LINK_RESET_EVT :
case LINK_ESTABLISH_EVT :
case LINK_FAILURE_EVT :
break ;
case LINK_FAILOVER_BEGIN_EVT :
case LINK_SYNCH_BEGIN_EVT :
case LINK_SYNCH_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
default :
2015-07-30 18:24:21 -04:00
goto illegal_evt ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
}
break ;
2015-07-30 18:24:21 -04:00
case LINK_ESTABLISHING :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
switch ( evt ) {
2015-07-30 18:24:21 -04:00
case LINK_ESTABLISH_EVT :
l - > state = LINK_ESTABLISHED ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_FAILOVER_BEGIN_EVT :
l - > state = LINK_FAILINGOVER ;
break ;
case LINK_RESET_EVT :
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
l - > state = LINK_RESET ;
break ;
2015-07-30 18:24:21 -04:00
case LINK_FAILURE_EVT :
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
case LINK_PEER_RESET_EVT :
2015-07-30 18:24:21 -04:00
case LINK_SYNCH_BEGIN_EVT :
case LINK_FAILOVER_END_EVT :
break ;
case LINK_SYNCH_END_EVT :
default :
goto illegal_evt ;
}
break ;
case LINK_ESTABLISHED :
switch ( evt ) {
case LINK_PEER_RESET_EVT :
l - > state = LINK_PEER_RESET ;
rc | = TIPC_LINK_DOWN_EVT ;
break ;
case LINK_FAILURE_EVT :
l - > state = LINK_RESETTING ;
rc | = TIPC_LINK_DOWN_EVT ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_RESET_EVT :
l - > state = LINK_RESET ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_ESTABLISH_EVT :
2015-08-20 02:12:55 -04:00
case LINK_SYNCH_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_SYNCH_BEGIN_EVT :
l - > state = LINK_SYNCHING ;
break ;
case LINK_FAILOVER_BEGIN_EVT :
case LINK_FAILOVER_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
default :
2015-07-30 18:24:21 -04:00
goto illegal_evt ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
}
break ;
2015-07-30 18:24:21 -04:00
case LINK_SYNCHING :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
switch ( evt ) {
2015-07-30 18:24:21 -04:00
case LINK_PEER_RESET_EVT :
l - > state = LINK_PEER_RESET ;
rc | = TIPC_LINK_DOWN_EVT ;
break ;
case LINK_FAILURE_EVT :
l - > state = LINK_RESETTING ;
rc | = TIPC_LINK_DOWN_EVT ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_RESET_EVT :
l - > state = LINK_RESET ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_ESTABLISH_EVT :
case LINK_SYNCH_BEGIN_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
break ;
2015-07-30 18:24:21 -04:00
case LINK_SYNCH_END_EVT :
l - > state = LINK_ESTABLISHED ;
break ;
case LINK_FAILOVER_BEGIN_EVT :
case LINK_FAILOVER_END_EVT :
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
default :
2015-07-30 18:24:21 -04:00
goto illegal_evt ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
}
break ;
default :
2015-07-30 18:24:21 -04:00
pr_err ( " Unknown FSM state %x in %s \n " , l - > state , l - > name ) ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
}
2015-07-30 18:24:21 -04:00
return rc ;
illegal_evt :
pr_err ( " Illegal FSM event %x in state %x on link %s \n " ,
evt , l - > state , l - > name ) ;
tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.
In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.
Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.
We also rename the link states as follows:
WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN -> TIPC_LINK_RESETTING
RESET_RESET -> TIPC_LINK_ESTABLISHING
The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:27 -04:00
return rc ;
}
2015-07-16 16:54:28 -04:00
/* link_profile_stats - update statistical profiling of traffic
*/
static void link_profile_stats ( struct tipc_link * l )
{
struct sk_buff * skb ;
struct tipc_msg * msg ;
int length ;
/* Update counters used in statistical profiling of send traffic */
l - > stats . accu_queue_sz + = skb_queue_len ( & l - > transmq ) ;
l - > stats . queue_sz_counts + + ;
skb = skb_peek ( & l - > transmq ) ;
if ( ! skb )
return ;
msg = buf_msg ( skb ) ;
length = msg_size ( msg ) ;
if ( msg_user ( msg ) = = MSG_FRAGMENTER ) {
if ( msg_type ( msg ) ! = FIRST_FRAGMENT )
return ;
length = msg_size ( msg_get_wrapped ( msg ) ) ;
}
l - > stats . msg_lengths_total + = length ;
l - > stats . msg_length_counts + + ;
if ( length < = 64 )
l - > stats . msg_length_profile [ 0 ] + + ;
else if ( length < = 256 )
l - > stats . msg_length_profile [ 1 ] + + ;
else if ( length < = 1024 )
l - > stats . msg_length_profile [ 2 ] + + ;
else if ( length < = 4096 )
l - > stats . msg_length_profile [ 3 ] + + ;
else if ( length < = 16384 )
l - > stats . msg_length_profile [ 4 ] + + ;
else if ( length < = 32768 )
l - > stats . msg_length_profile [ 5 ] + + ;
else
l - > stats . msg_length_profile [ 6 ] + + ;
}
/* tipc_link_timeout - perform periodic task as instructed from node timeout
*/
int tipc_link_timeout ( struct tipc_link * l , struct sk_buff_head * xmitq )
{
2016-06-15 14:11:31 +08:00
int mtyp = 0 ;
int rc = 0 ;
2016-04-15 13:33:05 -04:00
bool state = false ;
bool probe = false ;
bool setup = false ;
2015-10-22 08:51:41 -04:00
u16 bc_snt = l - > bc_sndlink - > snd_nxt - 1 ;
u16 bc_acked = l - > bc_rcvlink - > acked ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
struct tipc_mon_state * mstate = & l - > mon_state ;
2015-07-30 18:24:20 -04:00
2015-07-30 18:24:21 -04:00
switch ( l - > state ) {
case LINK_ESTABLISHED :
case LINK_SYNCHING :
2016-04-15 13:33:05 -04:00
mtyp = STATE_MSG ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
link_profile_stats ( l ) ;
tipc_mon_get_state ( l - > net , l - > addr , mstate , l - > bearer_id ) ;
if ( mstate - > reset | | ( l - > silent_intv_cnt > l - > abort_limit ) )
return tipc_link_fsm_evt ( l , LINK_FAILURE_EVT ) ;
2016-04-15 13:33:05 -04:00
state = bc_acked ! = bc_snt ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
state | = l - > bc_rcvlink - > rcv_unacked ;
state | = l - > rcv_unacked ;
state | = ! skb_queue_empty ( & l - > transmq ) ;
state | = ! skb_queue_empty ( & l - > deferdq ) ;
probe = mstate - > probing ;
probe | = l - > silent_intv_cnt ;
if ( probe | | mstate - > monitoring )
l - > silent_intv_cnt + + ;
2015-07-30 18:24:21 -04:00
break ;
case LINK_RESET :
2016-04-15 13:33:05 -04:00
setup = l - > rst_cnt + + < = 4 ;
setup | = ! ( l - > rst_cnt % 16 ) ;
2015-07-30 18:24:20 -04:00
mtyp = RESET_MSG ;
2015-07-30 18:24:21 -04:00
break ;
case LINK_ESTABLISHING :
2016-04-15 13:33:05 -04:00
setup = true ;
2015-07-30 18:24:20 -04:00
mtyp = ACTIVATE_MSG ;
2015-07-30 18:24:21 -04:00
break ;
case LINK_PEER_RESET :
2015-07-30 18:24:23 -04:00
case LINK_RESETTING :
2015-07-30 18:24:21 -04:00
case LINK_FAILINGOVER :
break ;
default :
break ;
2015-07-30 18:24:20 -04:00
}
2015-07-30 18:24:21 -04:00
2016-04-15 13:33:05 -04:00
if ( state | | probe | | setup )
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
tipc_link_build_proto_msg ( l , mtyp , probe , 0 , 0 , 0 , 0 , xmitq ) ;
2015-07-30 18:24:20 -04:00
2015-07-16 16:54:28 -04:00
return rc ;
}
2006-01-02 19:04:38 +01:00
/**
2015-03-25 12:07:25 -04:00
* link_schedule_user - schedule a message sender for wakeup after congestion
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
* @ l : congested link
* @ hdr : header of message that is being sent
2014-08-22 18:09:07 -04:00
* Create pseudo msg to send back to user when congestion abates
2006-01-02 19:04:38 +01:00
*/
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
static int link_schedule_user ( struct tipc_link * l , struct tipc_msg * hdr )
2006-01-02 19:04:38 +01:00
{
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
u32 dnode = tipc_own_addr ( l - > net ) ;
u32 dport = msg_origport ( hdr ) ;
2015-03-25 12:07:25 -04:00
struct sk_buff * skb ;
/* Create and schedule wakeup pseudo message */
skb = tipc_msg_create ( SOCK_WAKEUP , 0 , INT_H_SIZE , 0 ,
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
dnode , l - > addr , dport , 0 , 0 ) ;
2015-03-25 12:07:25 -04:00
if ( ! skb )
2015-07-16 16:54:23 -04:00
return - ENOBUFS ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
msg_set_dest_droppable ( buf_msg ( skb ) , true ) ;
TIPC_SKB_CB ( skb ) - > chain_imp = msg_importance ( hdr ) ;
skb_queue_tail ( & l - > wakeupq , skb ) ;
l - > stats . link_congs + + ;
2015-03-25 12:07:25 -04:00
return - ELINKCONG ;
2006-01-02 19:04:38 +01:00
}
2014-08-22 18:09:07 -04:00
/**
* link_prepare_wakeup - prepare users for wakeup after congestion
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
* @ l : congested link
* Wake up a number of waiting users , as permitted by available space
* in the send queue
2014-08-22 18:09:07 -04:00
*/
tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 12:07:24 -04:00
void link_prepare_wakeup ( struct tipc_link * l )
2006-01-02 19:04:38 +01:00
{
2014-11-26 11:41:51 +08:00
struct sk_buff * skb , * tmp ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
int imp , i = 0 ;
2014-08-22 18:09:07 -04:00
tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 12:07:24 -04:00
skb_queue_walk_safe ( & l - > wakeupq , skb , tmp ) {
imp = TIPC_SKB_CB ( skb ) - > chain_imp ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
if ( l - > backlog [ imp ] . len < l - > backlog [ imp ] . limit ) {
skb_unlink ( skb , & l - > wakeupq ) ;
skb_queue_tail ( l - > inputq , skb ) ;
} else if ( i + + > 10 ) {
2006-01-02 19:04:38 +01:00
break ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
}
2006-01-02 19:04:38 +01:00
}
}
2015-07-30 18:24:19 -04:00
void tipc_link_reset ( struct tipc_link * l )
2006-01-02 19:04:38 +01:00
{
2016-03-03 14:23:21 -05:00
l - > peer_session = ANY_SESSION ;
l - > session + + ;
2015-07-30 18:24:19 -04:00
l - > mtu = l - > advertised_mtu ;
__skb_queue_purge ( & l - > transmq ) ;
__skb_queue_purge ( & l - > deferdq ) ;
2015-07-30 18:24:24 -04:00
skb_queue_splice_init ( & l - > wakeupq , l - > inputq ) ;
2015-10-22 08:51:48 -04:00
__skb_queue_purge ( & l - > backlogq ) ;
l - > backlog [ TIPC_LOW_IMPORTANCE ] . len = 0 ;
l - > backlog [ TIPC_MEDIUM_IMPORTANCE ] . len = 0 ;
l - > backlog [ TIPC_HIGH_IMPORTANCE ] . len = 0 ;
l - > backlog [ TIPC_CRITICAL_IMPORTANCE ] . len = 0 ;
l - > backlog [ TIPC_SYSTEM_IMPORTANCE ] . len = 0 ;
2015-07-30 18:24:19 -04:00
kfree_skb ( l - > reasm_buf ) ;
kfree_skb ( l - > failover_reasm_skb ) ;
l - > reasm_buf = NULL ;
l - > failover_reasm_skb = NULL ;
l - > rcv_unacked = 0 ;
l - > snd_nxt = 1 ;
l - > rcv_nxt = 1 ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
l - > acked = 0 ;
2015-07-30 18:24:19 -04:00
l - > silent_intv_cnt = 0 ;
2016-04-15 13:33:04 -04:00
l - > rst_cnt = 0 ;
2015-07-30 18:24:19 -04:00
l - > stale_count = 0 ;
2015-10-22 08:51:41 -04:00
l - > bc_peer_is_up = false ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
memset ( & l - > mon_state , 0 , sizeof ( l - > mon_state ) ) ;
2015-11-19 14:30:46 -05:00
tipc_link_reset_stats ( l ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-16 16:54:24 -04:00
/**
* tipc_link_xmit ( ) : enqueue buffer list according to queue situation
* @ link : link to use
* @ list : chain of buffers containing message
* @ xmitq : returned list of packets to be sent by caller
*
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
* Consumes the buffer chain .
2015-07-16 16:54:24 -04:00
* Returns 0 if success , or errno : - ELINKCONG , - EMSGSIZE or - ENOBUFS
* Messages at TIPC_SYSTEM_IMPORTANCE are always accepted
*/
int tipc_link_xmit ( struct tipc_link * l , struct sk_buff_head * list ,
struct sk_buff_head * xmitq )
{
struct tipc_msg * hdr = buf_msg ( skb_peek ( list ) ) ;
unsigned int maxwin = l - > window ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
int imp = msg_importance ( hdr ) ;
2015-07-16 16:54:24 -04:00
unsigned int mtu = l - > mtu ;
u16 ack = l - > rcv_nxt - 1 ;
u16 seqno = l - > snd_nxt ;
2015-10-22 08:51:41 -04:00
u16 bc_ack = l - > bc_rcvlink - > rcv_nxt - 1 ;
2015-07-16 16:54:24 -04:00
struct sk_buff_head * transmq = & l - > transmq ;
struct sk_buff_head * backlogq = & l - > backlogq ;
struct sk_buff * skb , * _skb , * bskb ;
2016-11-25 10:35:02 -05:00
int pkt_cnt = skb_queue_len ( list ) ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
int rc = 0 ;
2015-07-16 16:54:24 -04:00
2016-02-11 10:43:15 +01:00
if ( unlikely ( msg_size ( hdr ) > mtu ) ) {
skb_queue_purge ( list ) ;
2015-07-16 16:54:24 -04:00
return - EMSGSIZE ;
2016-02-11 10:43:15 +01:00
}
2015-07-16 16:54:24 -04:00
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
/* Allow oversubscription of one data msg per source at congestion */
if ( unlikely ( l - > backlog [ imp ] . len > = l - > backlog [ imp ] . limit ) ) {
if ( imp = = TIPC_SYSTEM_IMPORTANCE ) {
pr_warn ( " %s<%s>, link overflow " , link_rst_msg , l - > name ) ;
return - ENOBUFS ;
}
rc = link_schedule_user ( l , hdr ) ;
}
2016-11-25 10:35:02 -05:00
if ( pkt_cnt > 1 ) {
l - > stats . sent_fragmented + + ;
l - > stats . sent_fragments + = pkt_cnt ;
}
2015-07-16 16:54:24 -04:00
/* Prepare each packet for sending, and add to relevant queue: */
while ( skb_queue_len ( list ) ) {
skb = skb_peek ( list ) ;
hdr = buf_msg ( skb ) ;
msg_set_seqno ( hdr , seqno ) ;
msg_set_ack ( hdr , ack ) ;
2015-10-22 08:51:41 -04:00
msg_set_bcast_ack ( hdr , bc_ack ) ;
2015-07-16 16:54:24 -04:00
if ( likely ( skb_queue_len ( transmq ) < maxwin ) ) {
_skb = skb_clone ( skb , GFP_ATOMIC ) ;
2016-02-11 10:43:15 +01:00
if ( ! _skb ) {
skb_queue_purge ( list ) ;
2015-07-16 16:54:24 -04:00
return - ENOBUFS ;
2016-02-11 10:43:15 +01:00
}
2015-07-16 16:54:24 -04:00
__skb_dequeue ( list ) ;
__skb_queue_tail ( transmq , skb ) ;
__skb_queue_tail ( xmitq , _skb ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
TIPC_SKB_CB ( skb ) - > ackers = l - > ackers ;
2015-07-16 16:54:24 -04:00
l - > rcv_unacked = 0 ;
2016-11-25 10:35:02 -05:00
l - > stats . sent_pkts + + ;
2015-07-16 16:54:24 -04:00
seqno + + ;
continue ;
}
if ( tipc_msg_bundle ( skb_peek_tail ( backlogq ) , hdr , mtu ) ) {
kfree_skb ( __skb_dequeue ( list ) ) ;
l - > stats . sent_bundled + + ;
continue ;
}
if ( tipc_msg_make_bundle ( & bskb , hdr , mtu , l - > addr ) ) {
kfree_skb ( __skb_dequeue ( list ) ) ;
__skb_queue_tail ( backlogq , bskb ) ;
l - > backlog [ msg_importance ( buf_msg ( bskb ) ) ] . len + + ;
l - > stats . sent_bundled + + ;
l - > stats . sent_bundles + + ;
continue ;
}
l - > backlog [ imp ] . len + = skb_queue_len ( list ) ;
skb_queue_splice_tail_init ( list , backlogq ) ;
}
l - > snd_nxt = seqno ;
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
return rc ;
2015-07-16 16:54:24 -04:00
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
void tipc_link_advance_backlog ( struct tipc_link * l , struct sk_buff_head * xmitq )
{
struct sk_buff * skb , * _skb ;
struct tipc_msg * hdr ;
u16 seqno = l - > snd_nxt ;
u16 ack = l - > rcv_nxt - 1 ;
2015-10-22 08:51:41 -04:00
u16 bc_ack = l - > bc_rcvlink - > rcv_nxt - 1 ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
while ( skb_queue_len ( & l - > transmq ) < l - > window ) {
skb = skb_peek ( & l - > backlogq ) ;
if ( ! skb )
break ;
_skb = skb_clone ( skb , GFP_ATOMIC ) ;
if ( ! _skb )
break ;
__skb_dequeue ( & l - > backlogq ) ;
hdr = buf_msg ( skb ) ;
l - > backlog [ msg_importance ( hdr ) ] . len - - ;
__skb_queue_tail ( & l - > transmq , skb ) ;
__skb_queue_tail ( xmitq , _skb ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
TIPC_SKB_CB ( skb ) - > ackers = l - > ackers ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
msg_set_seqno ( hdr , seqno ) ;
2015-10-22 08:51:41 -04:00
msg_set_ack ( hdr , ack ) ;
msg_set_bcast_ack ( hdr , bc_ack ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > rcv_unacked = 0 ;
2016-11-25 10:35:02 -05:00
l - > stats . sent_pkts + + ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
seqno + + ;
}
l - > snd_nxt = seqno ;
}
2015-10-22 08:51:41 -04:00
static void link_retransmit_failure ( struct tipc_link * l , struct sk_buff * skb )
2006-06-25 23:40:01 -07:00
{
2015-10-22 08:51:41 -04:00
struct tipc_msg * hdr = buf_msg ( skb ) ;
pr_warn ( " Retransmission failure on link <%s> \n " , l - > name ) ;
2017-08-21 17:59:30 +02:00
link_print ( l , " State of link " ) ;
2015-10-22 08:51:41 -04:00
pr_info ( " Failed msg: usr %u, typ %u, len %u, err %u \n " ,
msg_user ( hdr ) , msg_type ( hdr ) , msg_size ( hdr ) , msg_errcode ( hdr ) ) ;
pr_info ( " sqno %u, prev: %x, src: %x \n " ,
msg_seqno ( hdr ) , msg_prevnode ( hdr ) , msg_orignode ( hdr ) ) ;
2006-06-25 23:40:01 -07:00
}
2017-08-21 17:59:30 +02:00
int tipc_link_retrans ( struct tipc_link * l , struct tipc_link * nacker ,
u16 from , u16 to , struct sk_buff_head * xmitq )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
{
struct sk_buff * _skb , * skb = skb_peek ( & l - > transmq ) ;
struct tipc_msg * hdr ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
u16 ack = l - > rcv_nxt - 1 ;
2015-10-22 08:51:41 -04:00
u16 bc_ack = l - > bc_rcvlink - > rcv_nxt - 1 ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
if ( ! skb )
return 0 ;
/* Detect repeated retransmit failures on same packet */
2017-08-21 17:59:30 +02:00
if ( nacker - > last_retransm ! = buf_seqno ( skb ) ) {
nacker - > last_retransm = buf_seqno ( skb ) ;
nacker - > stale_count = 1 ;
} else if ( + + nacker - > stale_count > 100 ) {
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
link_retransmit_failure ( l , skb ) ;
2017-08-21 17:59:30 +02:00
nacker - > stale_count = 0 ;
if ( link_is_bc_sndlink ( l ) )
return TIPC_LINK_DOWN_EVT ;
2015-07-30 18:24:21 -04:00
return tipc_link_fsm_evt ( l , LINK_FAILURE_EVT ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
}
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
/* Move forward to where retransmission should start */
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
skb_queue_walk ( & l - > transmq , skb ) {
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
if ( ! less ( buf_seqno ( skb ) , from ) )
break ;
}
skb_queue_walk_from ( & l - > transmq , skb ) {
if ( more ( buf_seqno ( skb ) , to ) )
break ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
hdr = buf_msg ( skb ) ;
_skb = __pskb_copy ( skb , MIN_H_SIZE , GFP_ATOMIC ) ;
if ( ! _skb )
return 0 ;
hdr = buf_msg ( _skb ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
msg_set_ack ( hdr , ack ) ;
msg_set_bcast_ack ( hdr , bc_ack ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
_skb - > priority = TC_PRIO_CONTROL ;
__skb_queue_tail ( xmitq , _skb ) ;
l - > stats . retransmitted + + ;
}
return 0 ;
}
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
/* tipc_data_input - deliver data and name distr msgs to upper layer
2014-07-01 10:22:40 +02:00
*
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
* Consumes buffer if message is of right type
2014-07-01 10:22:40 +02:00
* Node lock must be held
*/
2015-10-22 08:51:41 -04:00
static bool tipc_data_input ( struct tipc_link * l , struct sk_buff * skb ,
2015-07-30 18:24:25 -04:00
struct sk_buff_head * inputq )
2014-07-01 10:22:40 +02:00
{
2017-10-13 11:04:32 +02:00
struct sk_buff_head * mc_inputq = l - > bc_rcvlink - > inputq ;
2017-01-18 13:50:52 -05:00
struct tipc_msg * hdr = buf_msg ( skb ) ;
switch ( msg_user ( hdr ) ) {
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
case TIPC_LOW_IMPORTANCE :
case TIPC_MEDIUM_IMPORTANCE :
case TIPC_HIGH_IMPORTANCE :
case TIPC_CRITICAL_IMPORTANCE :
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:31 +02:00
if ( unlikely ( msg_in_group ( hdr ) | | msg_mcast ( hdr ) ) ) {
2017-10-13 11:04:32 +02:00
skb_queue_tail ( mc_inputq , skb ) ;
2017-01-18 13:50:52 -05:00
return true ;
}
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:31 +02:00
case CONN_MANAGER :
2017-10-16 16:04:51 +02:00
skb_queue_tail ( inputq , skb ) ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
return true ;
2017-10-13 11:04:32 +02:00
case GROUP_PROTOCOL :
skb_queue_tail ( mc_inputq , skb ) ;
return true ;
2014-07-01 10:22:40 +02:00
case NAME_DISTRIBUTOR :
2015-10-22 08:51:41 -04:00
l - > bc_rcvlink - > state = LINK_ESTABLISHED ;
skb_queue_tail ( l - > namedq , skb ) ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
return true ;
case MSG_BUNDLER :
tipc: eliminate delayed link deletion at link failover
When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.
In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.
We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:01 -04:00
case TUNNEL_PROTOCOL :
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
case MSG_FRAGMENTER :
2014-07-01 10:22:40 +02:00
case BCAST_PROTOCOL :
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
return false ;
2014-07-01 10:22:40 +02:00
default :
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
pr_warn ( " Dropping received illegal msg type \n " ) ;
kfree_skb ( skb ) ;
return false ;
} ;
2014-07-01 10:22:40 +02:00
}
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
/* tipc_link_input - process packet that has passed link protocol check
*
* Consumes buffer
2014-07-01 10:22:40 +02:00
*/
2015-07-30 18:24:25 -04:00
static int tipc_link_input ( struct tipc_link * l , struct sk_buff * skb ,
struct sk_buff_head * inputq )
2014-07-01 10:22:40 +02:00
{
2015-07-30 18:24:19 -04:00
struct tipc_msg * hdr = buf_msg ( skb ) ;
struct sk_buff * * reasm_skb = & l - > reasm_buf ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
struct sk_buff * iskb ;
2015-10-15 14:52:40 -04:00
struct sk_buff_head tmpq ;
2015-07-30 18:24:19 -04:00
int usr = msg_user ( hdr ) ;
2015-07-30 18:24:16 -04:00
int rc = 0 ;
2015-07-30 18:24:19 -04:00
int pos = 0 ;
int ipos = 0 ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
2015-07-30 18:24:19 -04:00
if ( unlikely ( usr = = TUNNEL_PROTOCOL ) ) {
if ( msg_type ( hdr ) = = SYNCH_MSG ) {
__skb_queue_purge ( & l - > deferdq ) ;
goto drop ;
2015-03-25 12:07:26 -04:00
}
2015-07-30 18:24:19 -04:00
if ( ! tipc_msg_extract ( skb , & iskb , & ipos ) )
return rc ;
kfree_skb ( skb ) ;
skb = iskb ;
hdr = buf_msg ( skb ) ;
if ( less ( msg_seqno ( hdr ) , l - > drop_point ) )
goto drop ;
2015-07-30 18:24:25 -04:00
if ( tipc_data_input ( l , skb , inputq ) )
2015-07-30 18:24:19 -04:00
return rc ;
usr = msg_user ( hdr ) ;
reasm_skb = & l - > failover_reasm_skb ;
}
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
2015-07-30 18:24:19 -04:00
if ( usr = = MSG_BUNDLER ) {
2015-10-15 14:52:40 -04:00
skb_queue_head_init ( & tmpq ) ;
2015-07-30 18:24:19 -04:00
l - > stats . recv_bundles + + ;
l - > stats . recv_bundled + = msg_msgcnt ( hdr ) ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
while ( tipc_msg_extract ( skb , & iskb , & pos ) )
2015-10-15 14:52:40 -04:00
tipc_data_input ( l , iskb , & tmpq ) ;
tipc_skb_queue_splice_tail ( & tmpq , inputq ) ;
2015-07-30 18:24:21 -04:00
return 0 ;
2015-07-30 18:24:19 -04:00
} else if ( usr = = MSG_FRAGMENTER ) {
l - > stats . recv_fragments + + ;
if ( tipc_buf_append ( reasm_skb , & skb ) ) {
l - > stats . recv_fragmented + + ;
2015-07-30 18:24:25 -04:00
tipc_data_input ( l , skb , inputq ) ;
2015-10-22 08:51:41 -04:00
} else if ( ! * reasm_skb & & ! link_is_bc_rcvlink ( l ) ) {
pr_warn_ratelimited ( " Unable to build fragment list \n " ) ;
2015-07-30 18:24:21 -04:00
return tipc_link_fsm_evt ( l , LINK_FAILURE_EVT ) ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
}
2015-07-30 18:24:21 -04:00
return 0 ;
2015-07-30 18:24:19 -04:00
} else if ( usr = = BCAST_PROTOCOL ) {
2015-10-22 08:51:46 -04:00
tipc_bcast_lock ( l - > net ) ;
2015-10-22 08:51:41 -04:00
tipc_link_bc_init_rcv ( l - > bc_rcvlink , hdr ) ;
2015-10-22 08:51:46 -04:00
tipc_bcast_unlock ( l - > net ) ;
2015-07-30 18:24:19 -04:00
}
drop :
kfree_skb ( skb ) ;
2015-07-30 18:24:21 -04:00
return 0 ;
2014-07-01 10:22:40 +02:00
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
static bool tipc_link_release_pkts ( struct tipc_link * l , u16 acked )
{
bool released = false ;
struct sk_buff * skb , * tmp ;
skb_queue_walk_safe ( & l - > transmq , skb , tmp ) {
if ( more ( buf_seqno ( skb ) , acked ) )
break ;
__skb_unlink ( skb , & l - > transmq ) ;
kfree_skb ( skb ) ;
released = true ;
}
return released ;
}
2016-04-15 13:33:07 -04:00
/* tipc_link_build_state_msg: prepare link state message for transmission
2015-10-22 08:51:41 -04:00
*
* Note that sending of broadcast ack is coordinated among nodes , to reduce
* risk of ack storms towards the sender
2015-10-15 14:52:41 -04:00
*/
2016-04-15 13:33:07 -04:00
int tipc_link_build_state_msg ( struct tipc_link * l , struct sk_buff_head * xmitq )
2015-10-15 14:52:41 -04:00
{
2015-10-22 08:51:41 -04:00
if ( ! l )
return 0 ;
/* Broadcast ACK must be sent via a unicast link => defer to caller */
if ( link_is_bc_rcvlink ( l ) ) {
2016-03-03 14:23:21 -05:00
if ( ( ( l - > rcv_nxt ^ tipc_own_addr ( l - > net ) ) & 0xf ) ! = 0xf )
2015-10-22 08:51:41 -04:00
return 0 ;
l - > rcv_unacked = 0 ;
2016-09-01 13:52:49 -04:00
/* Use snd_nxt to store peer's snd_nxt in broadcast rcv link */
l - > snd_nxt = l - > rcv_nxt ;
return TIPC_LINK_SND_STATE ;
2015-10-22 08:51:41 -04:00
}
/* Unicast ACK */
2015-10-15 14:52:41 -04:00
l - > rcv_unacked = 0 ;
l - > stats . sent_acks + + ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
tipc_link_build_proto_msg ( l , STATE_MSG , 0 , 0 , 0 , 0 , 0 , xmitq ) ;
2015-10-22 08:51:41 -04:00
return 0 ;
2015-10-15 14:52:41 -04:00
}
2015-10-15 14:52:45 -04:00
/* tipc_link_build_reset_msg: prepare link RESET or ACTIVATE message
*/
void tipc_link_build_reset_msg ( struct tipc_link * l , struct sk_buff_head * xmitq )
{
int mtyp = RESET_MSG ;
tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 13:33:03 -04:00
struct sk_buff * skb ;
2015-10-15 14:52:45 -04:00
if ( l - > state = = LINK_ESTABLISHING )
mtyp = ACTIVATE_MSG ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
tipc_link_build_proto_msg ( l , mtyp , 0 , 0 , 0 , 0 , 0 , xmitq ) ;
tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 13:33:03 -04:00
/* Inform peer that this endpoint is going down if applicable */
skb = skb_peek_tail ( xmitq ) ;
if ( skb & & ( l - > state = = LINK_RESET ) )
msg_set_peer_stopping ( buf_msg ( skb ) , 1 ) ;
2015-10-15 14:52:45 -04:00
}
2015-10-15 14:52:41 -04:00
/* tipc_link_build_nack_msg: prepare link nack message for transmission
2016-09-01 13:52:51 -04:00
* Note that sending of broadcast NACK is coordinated among nodes , to
* reduce the risk of NACK storms towards the sender
2015-10-15 14:52:41 -04:00
*/
2016-09-01 13:52:51 -04:00
static int tipc_link_build_nack_msg ( struct tipc_link * l ,
struct sk_buff_head * xmitq )
2015-10-15 14:52:41 -04:00
{
u32 def_cnt = + + l - > stats . deferred_recv ;
2016-09-01 13:52:51 -04:00
int match1 , match2 ;
2015-10-15 14:52:41 -04:00
2016-09-01 13:52:51 -04:00
if ( link_is_bc_rcvlink ( l ) ) {
match1 = def_cnt & 0xf ;
match2 = tipc_own_addr ( l - > net ) & 0xf ;
if ( match1 = = match2 )
return TIPC_LINK_SND_STATE ;
return 0 ;
}
2015-10-22 08:51:41 -04:00
2015-10-15 14:52:41 -04:00
if ( ( skb_queue_len ( & l - > deferdq ) = = 1 ) | | ! ( def_cnt % TIPC_NACK_INTV ) )
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
tipc_link_build_proto_msg ( l , STATE_MSG , 0 , 0 , 0 , 0 , 0 , xmitq ) ;
2016-09-01 13:52:51 -04:00
return 0 ;
2015-10-15 14:52:41 -04:00
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* tipc_link_rcv - process TIPC packets/messages arriving from off-node
2015-10-15 14:52:41 -04:00
* @ l : the link that should handle the message
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
* @ skb : TIPC packet
* @ xmitq : queue to place packets to be sent after this call
*/
int tipc_link_rcv ( struct tipc_link * l , struct sk_buff * skb ,
struct sk_buff_head * xmitq )
{
2015-10-15 14:52:41 -04:00
struct sk_buff_head * defq = & l - > deferdq ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
struct tipc_msg * hdr ;
tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.
Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:42 -04:00
u16 seqno , rcv_nxt , win_lim ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
int rc = 0 ;
2015-10-15 14:52:41 -04:00
do {
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
hdr = buf_msg ( skb ) ;
2015-10-15 14:52:41 -04:00
seqno = msg_seqno ( hdr ) ;
rcv_nxt = l - > rcv_nxt ;
tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.
Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:42 -04:00
win_lim = rcv_nxt + TIPC_MAX_LINK_WIN ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* Verify and update link state */
2015-10-15 14:52:41 -04:00
if ( unlikely ( msg_user ( hdr ) = = LINK_PROTOCOL ) )
return tipc_link_proto_rcv ( l , skb , xmitq ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
2015-07-30 18:24:21 -04:00
if ( unlikely ( ! link_is_up ( l ) ) ) {
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
if ( l - > state = = LINK_ESTABLISHING )
rc = TIPC_LINK_UP_EVT ;
goto drop ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
}
2015-10-15 14:52:41 -04:00
/* Don't send probe at next timeout expiration */
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > silent_intv_cnt = 0 ;
tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.
Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:42 -04:00
/* Drop if outside receive window */
if ( unlikely ( less ( seqno , rcv_nxt ) | | more ( seqno , win_lim ) ) ) {
l - > stats . duplicates + + ;
goto drop ;
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* Forward queues and wake up waiting users */
if ( likely ( tipc_link_release_pkts ( l , msg_ack ( hdr ) ) ) ) {
tipc_link_advance_backlog ( l , xmitq ) ;
if ( unlikely ( ! skb_queue_empty ( & l - > wakeupq ) ) )
link_prepare_wakeup ( l ) ;
}
tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.
Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:42 -04:00
/* Defer delivery if sequence gap */
if ( unlikely ( seqno ! = rcv_nxt ) ) {
2015-10-15 14:52:43 -04:00
__tipc_skb_queue_sorted ( defq , seqno , skb ) ;
2016-09-01 13:52:51 -04:00
rc | = tipc_link_build_nack_msg ( l , xmitq ) ;
2015-10-15 14:52:41 -04:00
break ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
}
tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.
Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:42 -04:00
/* Deliver packet */
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > rcv_nxt + + ;
2016-11-25 10:35:02 -05:00
l - > stats . recv_pkts + + ;
2015-10-15 14:52:41 -04:00
if ( ! tipc_data_input ( l , skb , l - > inputq ) )
2015-10-22 08:51:41 -04:00
rc | = tipc_link_input ( l , skb , l - > inputq ) ;
2015-10-15 14:52:41 -04:00
if ( unlikely ( + + l - > rcv_unacked > = TIPC_MIN_LINK_WIN ) )
2016-04-15 13:33:07 -04:00
rc | = tipc_link_build_state_msg ( l , xmitq ) ;
2016-09-01 13:52:49 -04:00
if ( unlikely ( rc & ~ TIPC_LINK_SND_STATE ) )
2015-10-22 08:51:41 -04:00
break ;
2015-10-15 14:52:41 -04:00
} while ( ( skb = __skb_dequeue ( defq ) ) ) ;
return rc ;
drop :
kfree_skb ( skb ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
return rc ;
}
2015-07-16 16:54:26 -04:00
static void tipc_link_build_proto_msg ( struct tipc_link * l , int mtyp , bool probe ,
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
bool probe_reply , u16 rcvgap ,
int tolerance , int priority ,
2015-07-16 16:54:26 -04:00
struct sk_buff_head * xmitq )
{
2016-09-01 13:52:49 -04:00
struct tipc_link * bcl = l - > bc_rcvlink ;
2016-03-03 14:23:21 -05:00
struct sk_buff * skb ;
struct tipc_msg * hdr ;
struct sk_buff_head * dfq = & l - > deferdq ;
2016-09-01 13:52:49 -04:00
bool node_up = link_is_up ( bcl ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
struct tipc_mon_state * mstate = & l - > mon_state ;
int dlen = 0 ;
void * data ;
2015-07-16 16:54:26 -04:00
/* Don't send protocol message during reset or link failover */
2015-07-30 18:24:21 -04:00
if ( tipc_link_is_blocked ( l ) )
2015-07-16 16:54:26 -04:00
return ;
2016-03-03 14:23:21 -05:00
if ( ! tipc_link_is_up ( l ) & & ( mtyp = = STATE_MSG ) )
return ;
if ( ! skb_queue_empty ( dfq ) )
rcvgap = buf_seqno ( skb_peek ( dfq ) ) - l - > rcv_nxt ;
skb = tipc_msg_create ( LINK_PROTOCOL , mtyp , INT_H_SIZE ,
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
tipc_max_domain_size , l - > addr ,
2016-03-03 14:23:21 -05:00
tipc_own_addr ( l - > net ) , 0 , 0 , 0 ) ;
if ( ! skb )
return ;
hdr = buf_msg ( skb ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
data = msg_data ( hdr ) ;
2016-03-03 14:23:21 -05:00
msg_set_session ( hdr , l - > session ) ;
msg_set_bearer_id ( hdr , l - > bearer_id ) ;
2015-07-16 16:54:26 -04:00
msg_set_net_plane ( hdr , l - > net_plane ) ;
2015-10-22 08:51:41 -04:00
msg_set_next_sent ( hdr , l - > snd_nxt ) ;
msg_set_ack ( hdr , l - > rcv_nxt - 1 ) ;
2016-09-01 13:52:49 -04:00
msg_set_bcast_ack ( hdr , bcl - > rcv_nxt - 1 ) ;
2016-10-27 18:51:55 -04:00
msg_set_bc_ack_invalid ( hdr , ! node_up ) ;
2015-10-22 08:51:41 -04:00
msg_set_last_bcast ( hdr , l - > bc_sndlink - > snd_nxt - 1 ) ;
2015-07-16 16:54:26 -04:00
msg_set_link_tolerance ( hdr , tolerance ) ;
msg_set_linkprio ( hdr , priority ) ;
msg_set_redundant_link ( hdr , node_up ) ;
msg_set_seq_gap ( hdr , 0 ) ;
2015-10-22 08:51:41 -04:00
msg_set_seqno ( hdr , l - > snd_nxt + U16_MAX / 2 ) ;
2015-07-16 16:54:26 -04:00
if ( mtyp = = STATE_MSG ) {
2016-03-03 14:23:21 -05:00
msg_set_seq_gap ( hdr , rcvgap ) ;
2016-09-01 13:52:49 -04:00
msg_set_bc_gap ( hdr , link_bc_rcv_gap ( bcl ) ) ;
2015-07-16 16:54:26 -04:00
msg_set_probe ( hdr , probe ) ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
msg_set_is_keepalive ( hdr , probe | | probe_reply ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
tipc_mon_prep ( l - > net , data , & dlen , mstate , l - > bearer_id ) ;
msg_set_size ( hdr , INT_H_SIZE + dlen ) ;
skb_trim ( skb , INT_H_SIZE + dlen ) ;
2015-07-16 16:54:26 -04:00
l - > stats . sent_states + + ;
2015-10-22 08:51:41 -04:00
l - > rcv_unacked = 0 ;
2015-07-16 16:54:26 -04:00
} else {
/* RESET_MSG or ACTIVATE_MSG */
msg_set_max_pkt ( hdr , l - > advertised_mtu ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
strcpy ( data , l - > if_name ) ;
msg_set_size ( hdr , INT_H_SIZE + TIPC_MAX_IF_NAME ) ;
skb_trim ( skb , INT_H_SIZE + TIPC_MAX_IF_NAME ) ;
2015-07-16 16:54:26 -04:00
}
2016-03-03 14:23:21 -05:00
if ( probe )
l - > stats . sent_probes + + ;
if ( rcvgap )
l - > stats . sent_nacks + + ;
2015-07-16 16:54:26 -04:00
skb - > priority = TC_PRIO_CONTROL ;
2015-07-30 18:24:19 -04:00
__skb_queue_tail ( xmitq , skb ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-30 18:24:19 -04:00
/* tipc_link_tnl_prepare(): prepare and return a list of tunnel packets
2015-10-15 14:52:41 -04:00
* with contents of the link ' s transmit and backlog queues .
2006-01-02 19:04:38 +01:00
*/
2015-07-30 18:24:19 -04:00
void tipc_link_tnl_prepare ( struct tipc_link * l , struct tipc_link * tnl ,
int mtyp , struct sk_buff_head * xmitq )
2006-01-02 19:04:38 +01:00
{
2015-07-30 18:24:19 -04:00
struct sk_buff * skb , * tnlskb ;
struct tipc_msg * hdr , tnlhdr ;
struct sk_buff_head * queue = & l - > transmq ;
struct sk_buff_head tmpxq , tnlq ;
u16 pktlen , pktcnt , seqno = l - > snd_nxt ;
2006-01-02 19:04:38 +01:00
2015-07-30 18:24:19 -04:00
if ( ! tnl )
2006-01-02 19:04:38 +01:00
return ;
2015-07-30 18:24:19 -04:00
skb_queue_head_init ( & tnlq ) ;
skb_queue_head_init ( & tmpxq ) ;
2015-05-14 10:46:18 -04:00
2015-07-30 18:24:19 -04:00
/* At least one packet required for safe algorithm => add dummy */
skb = tipc_msg_create ( TIPC_LOW_IMPORTANCE , TIPC_DIRECT_MSG ,
2016-03-03 14:23:21 -05:00
BASIC_H_SIZE , 0 , l - > addr , tipc_own_addr ( l - > net ) ,
2015-07-30 18:24:19 -04:00
0 , 0 , TIPC_ERR_NO_PORT ) ;
if ( ! skb ) {
pr_warn ( " %sunable to create tunnel packet \n " , link_co_err ) ;
2006-01-02 19:04:38 +01:00
return ;
}
2015-07-30 18:24:19 -04:00
skb_queue_tail ( & tnlq , skb ) ;
tipc_link_xmit ( l , & tnlq , & tmpxq ) ;
__skb_queue_purge ( & tmpxq ) ;
/* Initialize reusable tunnel packet header */
2016-03-03 14:23:21 -05:00
tipc_msg_init ( tipc_own_addr ( l - > net ) , & tnlhdr , TUNNEL_PROTOCOL ,
2015-07-30 18:24:19 -04:00
mtyp , INT_H_SIZE , l - > addr ) ;
pktcnt = skb_queue_len ( & l - > transmq ) + skb_queue_len ( & l - > backlogq ) ;
msg_set_msgcnt ( & tnlhdr , pktcnt ) ;
msg_set_bearer_id ( & tnlhdr , l - > peer_bearer_id ) ;
tnl :
/* Wrap each packet into a tunnel packet */
2015-03-13 16:08:10 -04:00
skb_queue_walk ( queue , skb ) {
2015-07-30 18:24:19 -04:00
hdr = buf_msg ( skb ) ;
if ( queue = = & l - > backlogq )
msg_set_seqno ( hdr , seqno + + ) ;
pktlen = msg_size ( hdr ) ;
msg_set_size ( & tnlhdr , pktlen + INT_H_SIZE ) ;
2017-01-13 15:46:25 +01:00
tnlskb = tipc_buf_acquire ( pktlen + INT_H_SIZE , GFP_ATOMIC ) ;
2015-07-30 18:24:19 -04:00
if ( ! tnlskb ) {
pr_warn ( " %sunable to send packet \n " , link_co_err ) ;
2006-01-02 19:04:38 +01:00
return ;
}
2015-07-30 18:24:19 -04:00
skb_copy_to_linear_data ( tnlskb , & tnlhdr , INT_H_SIZE ) ;
skb_copy_to_linear_data_offset ( tnlskb , INT_H_SIZE , hdr , pktlen ) ;
__skb_queue_tail ( & tnlq , tnlskb ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-30 18:24:19 -04:00
if ( queue ! = & l - > backlogq ) {
queue = & l - > backlogq ;
goto tnl ;
tipc: change reception of tunnelled failover packets
When a link is reset, and there is a redundant link available, all
sender sockets will steer their subsequent traffic through the
remaining link. In order to guarantee preserved packet order and
cardinality during the transition, we tunnel the failing link's send
queue through the remaining link before we allow any sockets to use it.
In this commit, we change the algorithm for receiving failover
("ORIGINAL_MSG") packets in tipc_link_tunnel_rcv(), at the same time
delegating it to a new subfuncton, tipc_link_failover_rcv(). Instead
of directly returning an extracted inner packet to the packet reception
loop in tipc_rcv(), we first check if it is a message fragment, in which
case we append it to the reset link's fragment chain. If the fragment
chain is complete, we return the whole chain instead of the individual
buffer, eliminating any need for the tipc_rcv() loop to do reassembly of
tunneled packets.
This change makes it possible to further simplify tipc_link_tunnel_rcv(),
as well as the calling tipc_rcv() loop. We will do that in later
commits. It also makes it possible to identify a single spot in the code
where we can tell that a failover procedure is finished, something that
is useful when we are deleting links after a failover. This will also
be done in a later commit.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:29:11 -05:00
}
tipc: change reception of tunnelled duplicate packets
When a second link to a destination comes up, some sender sockets will
steer their subsequent traffic through the new link. In order to
guarantee preserved packet order and cardinality for those sockets, we
tunnel a duplicate of the old link's send queue through the new link
before we open it for regular traffic. The last arriving packet copy,
on whichever link, will be dropped at the receiving end based on the
original sequence number, to ensure that only one copy is delivered to
the end receiver.
In this commit, we change the algorithm for receiving DUPLICATE_MSG
packets, at the same time delegating it to a new subfunction,
tipc_link_dup_rcv(). Instead of returning an extracted inner packet to
the packet reception loop in tipc_rcv(), we just add it to the receiving
(new) link's deferred packet queue. The packet will then be processed by
that link when it receives its first non-tunneled packet, i.e., at
latest when the changeover procedure is finished.
Because tipc_link_tunnel_rcv()/tipc_link_dup_rcv() now is consuming all
packets of type DUPLICATE_MSG, the calling tipc_rcv() function can omit
testing for this. This in turn means that the current conditional jump
to the label 'protocol_check' becomes redundant, and we can remove that
label.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:29:10 -05:00
2015-07-30 18:24:19 -04:00
tipc_link_xmit ( tnl , & tnlq , xmitq ) ;
2006-01-02 19:04:38 +01:00
2015-07-30 18:24:19 -04:00
if ( mtyp = = FAILOVER_MSG ) {
tnl - > drop_point = l - > rcv_nxt ;
tnl - > failover_reasm_skb = l - > reasm_buf ;
l - > reasm_buf = NULL ;
2015-04-02 09:33:00 -04:00
}
2006-01-02 19:04:38 +01:00
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* tipc_link_proto_rcv(): receive link level protocol message :
* Note that network plane id propagates through the network , and may
* change at any time . The node with lowest numerical id determines
* network plane
*/
static int tipc_link_proto_rcv ( struct tipc_link * l , struct sk_buff * skb ,
struct sk_buff_head * xmitq )
{
struct tipc_msg * hdr = buf_msg ( skb ) ;
u16 rcvgap = 0 ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
u16 ack = msg_ack ( hdr ) ;
u16 gap = msg_seq_gap ( hdr ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
u16 peers_snd_nxt = msg_next_sent ( hdr ) ;
u16 peers_tol = msg_link_tolerance ( hdr ) ;
u16 peers_prio = msg_linkprio ( hdr ) ;
tipc: fix stale link problem during synchronization
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has lead to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.
Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen at rare occasions.
The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.
However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trig a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 02:12:56 -04:00
u16 rcv_nxt = l - > rcv_nxt ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
u16 dlen = msg_data_sz ( hdr ) ;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
int mtyp = msg_type ( hdr ) ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
bool reply = msg_probe ( hdr ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
void * data ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
char * if_name ;
int rc = 0 ;
2015-10-22 08:51:41 -04:00
if ( tipc_link_is_blocked ( l ) | | ! xmitq )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
goto exit ;
2016-03-03 14:23:21 -05:00
if ( tipc_own_addr ( l - > net ) > msg_prevnode ( hdr ) )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > net_plane = msg_net_plane ( hdr ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
skb_linearize ( skb ) ;
hdr = buf_msg ( skb ) ;
data = msg_data ( hdr ) ;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
switch ( mtyp ) {
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
case RESET_MSG :
/* Ignore duplicate RESET with old session number */
if ( ( less_eq ( msg_session ( hdr ) , l - > peer_session ) ) & &
2016-03-03 14:23:21 -05:00
( l - > peer_session ! = ANY_SESSION ) )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
break ;
/* fall thru' */
2015-07-30 18:24:21 -04:00
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
case ACTIVATE_MSG :
/* Complete own link name with peer's interface name */
if_name = strrchr ( l - > name , ' : ' ) + 1 ;
if ( sizeof ( l - > name ) - ( if_name - l - > name ) < = TIPC_MAX_IF_NAME )
break ;
if ( msg_data_sz ( hdr ) < TIPC_MAX_IF_NAME )
break ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
strncpy ( if_name , data , TIPC_MAX_IF_NAME ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* Update own tolerance if peer indicates a non-zero value */
if ( in_range ( peers_tol , TIPC_MIN_LINK_TOL , TIPC_MAX_LINK_TOL ) )
l - > tolerance = peers_tol ;
/* Update own priority if peer's priority is higher */
if ( in_range ( peers_prio , l - > priority + 1 , TIPC_MAX_LINK_PRI ) )
l - > priority = peers_prio ;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
/* ACTIVATE_MSG serves as PEER_RESET if link is already down */
tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 13:33:03 -04:00
if ( msg_peer_stopping ( hdr ) )
rc = tipc_link_fsm_evt ( l , LINK_FAILURE_EVT ) ;
else if ( ( mtyp = = RESET_MSG ) | | ! link_is_up ( l ) )
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
rc = tipc_link_fsm_evt ( l , LINK_PEER_RESET_EVT ) ;
/* ACTIVATE_MSG takes up link if it was already locally reset */
if ( ( mtyp = = ACTIVATE_MSG ) & & ( l - > state = = LINK_ESTABLISHING ) )
rc = TIPC_LINK_UP_EVT ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > peer_session = msg_session ( hdr ) ;
l - > peer_bearer_id = msg_bearer_id ( hdr ) ;
if ( l - > mtu > msg_max_pkt ( hdr ) )
l - > mtu = msg_max_pkt ( hdr ) ;
break ;
2015-07-30 18:24:21 -04:00
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
case STATE_MSG :
2015-07-30 18:24:21 -04:00
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* Update own tolerance if peer indicates a non-zero value */
if ( in_range ( peers_tol , TIPC_MIN_LINK_TOL , TIPC_MAX_LINK_TOL ) )
l - > tolerance = peers_tol ;
2016-11-23 21:05:26 -05:00
/* Update own prio if peer indicates a different value */
if ( ( peers_prio ! = l - > priority ) & &
in_range ( peers_prio , 1 , TIPC_MAX_LINK_PRI ) ) {
2016-02-01 08:19:57 +01:00
l - > priority = peers_prio ;
rc = tipc_link_fsm_evt ( l , LINK_FAILURE_EVT ) ;
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > silent_intv_cnt = 0 ;
l - > stats . recv_states + + ;
if ( msg_probe ( hdr ) )
l - > stats . recv_probes + + ;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
if ( ! link_is_up ( l ) ) {
if ( l - > state = = LINK_ESTABLISHING )
rc = TIPC_LINK_UP_EVT ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
break ;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
}
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
tipc_mon_rcv ( l - > net , data , dlen , l - > addr ,
& l - > mon_state , l - > bearer_id ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
2015-07-30 18:24:21 -04:00
/* Send NACK if peer has sent pkts we haven't received yet */
tipc: fix stale link problem during synchronization
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has lead to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.
Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen at rare occasions.
The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.
However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trig a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 02:12:56 -04:00
if ( more ( peers_snd_nxt , rcv_nxt ) & & ! tipc_link_is_synching ( l ) )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
rcvgap = peers_snd_nxt - l - > rcv_nxt ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
if ( rcvgap | | reply )
tipc_link_build_proto_msg ( l , STATE_MSG , 0 , reply ,
rcvgap , 0 , 0 , xmitq ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
tipc_link_release_pkts ( l , ack ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
/* If NACK, retransmit will now start at right position */
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
if ( gap ) {
2017-08-21 17:59:30 +02:00
rc = tipc_link_retrans ( l , l , ack + 1 , ack + gap , xmitq ) ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
l - > stats . recv_nacks + + ;
}
2015-07-30 18:24:21 -04:00
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
tipc_link_advance_backlog ( l , xmitq ) ;
if ( unlikely ( ! skb_queue_empty ( & l - > wakeupq ) ) )
link_prepare_wakeup ( l ) ;
}
exit :
kfree_skb ( skb ) ;
return rc ;
}
2015-10-22 08:51:41 -04:00
/* tipc_link_build_bc_proto_msg() - create broadcast protocol message
*/
static bool tipc_link_build_bc_proto_msg ( struct tipc_link * l , bool bcast ,
u16 peers_snd_nxt ,
struct sk_buff_head * xmitq )
{
struct sk_buff * skb ;
struct tipc_msg * hdr ;
struct sk_buff * dfrd_skb = skb_peek ( & l - > deferdq ) ;
u16 ack = l - > rcv_nxt - 1 ;
u16 gap_to = peers_snd_nxt - 1 ;
skb = tipc_msg_create ( BCAST_PROTOCOL , STATE_MSG , INT_H_SIZE ,
2016-03-03 14:23:21 -05:00
0 , l - > addr , tipc_own_addr ( l - > net ) , 0 , 0 , 0 ) ;
2015-10-22 08:51:41 -04:00
if ( ! skb )
return false ;
hdr = buf_msg ( skb ) ;
msg_set_last_bcast ( hdr , l - > bc_sndlink - > snd_nxt - 1 ) ;
msg_set_bcast_ack ( hdr , ack ) ;
msg_set_bcgap_after ( hdr , ack ) ;
if ( dfrd_skb )
gap_to = buf_seqno ( dfrd_skb ) - 1 ;
msg_set_bcgap_to ( hdr , gap_to ) ;
msg_set_non_seq ( hdr , bcast ) ;
__skb_queue_tail ( xmitq , skb ) ;
return true ;
}
/* tipc_link_build_bc_init_msg() - synchronize broadcast link endpoints.
*
* Give a newly added peer node the sequence number where it should
* start receiving and acking broadcast packets .
*/
2015-10-24 22:56:01 +08:00
static void tipc_link_build_bc_init_msg ( struct tipc_link * l ,
struct sk_buff_head * xmitq )
2015-10-22 08:51:41 -04:00
{
struct sk_buff_head list ;
__skb_queue_head_init ( & list ) ;
if ( ! tipc_link_build_bc_proto_msg ( l - > bc_rcvlink , false , 0 , & list ) )
return ;
2016-10-27 18:51:55 -04:00
msg_set_bc_ack_invalid ( buf_msg ( skb_peek ( & list ) ) , true ) ;
2015-10-22 08:51:41 -04:00
tipc_link_xmit ( l , & list , xmitq ) ;
}
/* tipc_link_bc_init_rcv - receive initial broadcast synch data from peer
*/
void tipc_link_bc_init_rcv ( struct tipc_link * l , struct tipc_msg * hdr )
{
int mtyp = msg_type ( hdr ) ;
u16 peers_snd_nxt = msg_bc_snd_nxt ( hdr ) ;
if ( link_is_up ( l ) )
return ;
if ( msg_user ( hdr ) = = BCAST_PROTOCOL ) {
l - > rcv_nxt = peers_snd_nxt ;
l - > state = LINK_ESTABLISHED ;
return ;
}
if ( l - > peer_caps & TIPC_BCAST_SYNCH )
return ;
if ( msg_peer_node_is_up ( hdr ) )
return ;
/* Compatibility: accept older, less safe initial synch data */
if ( ( mtyp = = RESET_MSG ) | | ( mtyp = = ACTIVATE_MSG ) )
l - > rcv_nxt = peers_snd_nxt ;
}
2016-09-01 13:52:50 -04:00
/* link_bc_retr eval()- check if the indicated range can be retransmitted now
* - Adjust permitted range if there is overlap with previous retransmission
*/
static bool link_bc_retr_eval ( struct tipc_link * l , u16 * from , u16 * to )
{
unsigned long elapsed = jiffies_to_msecs ( jiffies - l - > prev_retr ) ;
if ( less ( * to , * from ) )
return false ;
/* New retransmission request */
if ( ( elapsed > TIPC_BC_RETR_LIMIT ) | |
less ( * to , l - > prev_from ) | | more ( * from , l - > prev_to ) ) {
l - > prev_from = * from ;
l - > prev_to = * to ;
l - > prev_retr = jiffies ;
return true ;
}
/* Inside range of previous retransmit */
if ( ! less ( * from , l - > prev_from ) & & ! more ( * to , l - > prev_to ) )
return false ;
/* Fully or partially outside previous range => exclude overlap */
if ( less ( * from , l - > prev_from ) ) {
* to = l - > prev_from - 1 ;
l - > prev_from = * from ;
}
if ( more ( * to , l - > prev_to ) ) {
* from = l - > prev_to + 1 ;
l - > prev_to = * to ;
}
l - > prev_retr = jiffies ;
return true ;
}
2015-10-22 08:51:41 -04:00
/* tipc_link_bc_sync_rcv - update rcv link according to peer's send state
*/
2016-09-01 13:52:49 -04:00
int tipc_link_bc_sync_rcv ( struct tipc_link * l , struct tipc_msg * hdr ,
struct sk_buff_head * xmitq )
2015-10-22 08:51:41 -04:00
{
2016-09-01 13:52:50 -04:00
struct tipc_link * snd_l = l - > bc_sndlink ;
2015-10-22 08:51:41 -04:00
u16 peers_snd_nxt = msg_bc_snd_nxt ( hdr ) ;
2016-09-01 13:52:49 -04:00
u16 from = msg_bcast_ack ( hdr ) + 1 ;
u16 to = from + msg_bc_gap ( hdr ) - 1 ;
int rc = 0 ;
2015-10-22 08:51:41 -04:00
if ( ! link_is_up ( l ) )
2016-09-01 13:52:49 -04:00
return rc ;
2015-10-22 08:51:41 -04:00
if ( ! msg_peer_node_is_up ( hdr ) )
2016-09-01 13:52:49 -04:00
return rc ;
2015-10-22 08:51:41 -04:00
2016-07-11 16:08:35 -04:00
/* Open when peer ackowledges our bcast init msg (pkt #1) */
if ( msg_ack ( hdr ) )
l - > bc_peer_is_up = true ;
if ( ! l - > bc_peer_is_up )
2016-09-01 13:52:49 -04:00
return rc ;
2015-10-22 08:51:41 -04:00
2016-09-01 13:52:50 -04:00
l - > stats . recv_nacks + + ;
2015-10-22 08:51:41 -04:00
/* Ignore if peers_snd_nxt goes beyond receive window */
if ( more ( peers_snd_nxt , l - > rcv_nxt + l - > window ) )
2016-09-01 13:52:49 -04:00
return rc ;
2016-09-01 13:52:50 -04:00
if ( link_bc_retr_eval ( snd_l , & from , & to ) )
2017-08-21 17:59:30 +02:00
rc = tipc_link_retrans ( snd_l , l , from , to , xmitq ) ;
2016-09-01 13:52:49 -04:00
l - > snd_nxt = peers_snd_nxt ;
if ( link_bc_rcv_gap ( l ) )
rc | = TIPC_LINK_SND_STATE ;
/* Return now if sender supports nack via STATE messages */
if ( l - > peer_caps & TIPC_BCAST_STATE_NACK )
return rc ;
/* Otherwise, be backwards compatible */
2015-10-22 08:51:41 -04:00
if ( ! more ( peers_snd_nxt , l - > rcv_nxt ) ) {
l - > nack_state = BC_NACK_SND_CONDITIONAL ;
2016-09-01 13:52:49 -04:00
return 0 ;
2015-10-22 08:51:41 -04:00
}
/* Don't NACK if one was recently sent or peeked */
if ( l - > nack_state = = BC_NACK_SND_SUPPRESS ) {
l - > nack_state = BC_NACK_SND_UNCONDITIONAL ;
2016-09-01 13:52:49 -04:00
return 0 ;
2015-10-22 08:51:41 -04:00
}
/* Conditionally delay NACK sending until next synch rcv */
if ( l - > nack_state = = BC_NACK_SND_CONDITIONAL ) {
l - > nack_state = BC_NACK_SND_UNCONDITIONAL ;
if ( ( peers_snd_nxt - l - > rcv_nxt ) < TIPC_MIN_LINK_WIN )
2016-09-01 13:52:49 -04:00
return 0 ;
2015-10-22 08:51:41 -04:00
}
/* Send NACK now but suppress next one */
tipc_link_build_bc_proto_msg ( l , true , peers_snd_nxt , xmitq ) ;
l - > nack_state = BC_NACK_SND_SUPPRESS ;
2016-09-01 13:52:49 -04:00
return 0 ;
2015-10-22 08:51:41 -04:00
}
void tipc_link_bc_ack_rcv ( struct tipc_link * l , u16 acked ,
struct sk_buff_head * xmitq )
{
struct sk_buff * skb , * tmp ;
struct tipc_link * snd_l = l - > bc_sndlink ;
if ( ! link_is_up ( l ) | | ! l - > bc_peer_is_up )
return ;
if ( ! more ( acked , l - > acked ) )
return ;
/* Skip over packets peer has already acked */
skb_queue_walk ( & snd_l - > transmq , skb ) {
if ( more ( buf_seqno ( skb ) , l - > acked ) )
break ;
}
/* Update/release the packets peer is acking now */
skb_queue_walk_from_safe ( & snd_l - > transmq , skb , tmp ) {
if ( more ( buf_seqno ( skb ) , acked ) )
break ;
if ( ! - - TIPC_SKB_CB ( skb ) - > ackers ) {
__skb_unlink ( skb , & snd_l - > transmq ) ;
kfree_skb ( skb ) ;
}
}
l - > acked = acked ;
tipc_link_advance_backlog ( snd_l , xmitq ) ;
if ( unlikely ( ! skb_queue_empty ( & snd_l - > wakeupq ) ) )
link_prepare_wakeup ( snd_l ) ;
}
/* tipc_link_bc_nack_rcv(): receive broadcast nack message
2016-09-01 13:52:49 -04:00
* This function is here for backwards compatibility , since
* no BCAST_PROTOCOL / STATE messages occur from TIPC v2 .5 .
2015-10-22 08:51:41 -04:00
*/
int tipc_link_bc_nack_rcv ( struct tipc_link * l , struct sk_buff * skb ,
struct sk_buff_head * xmitq )
{
struct tipc_msg * hdr = buf_msg ( skb ) ;
u32 dnode = msg_destnode ( hdr ) ;
int mtyp = msg_type ( hdr ) ;
u16 acked = msg_bcast_ack ( hdr ) ;
u16 from = acked + 1 ;
u16 to = msg_bcgap_to ( hdr ) ;
u16 peers_snd_nxt = to + 1 ;
int rc = 0 ;
kfree_skb ( skb ) ;
if ( ! tipc_link_is_up ( l ) | | ! l - > bc_peer_is_up )
return 0 ;
if ( mtyp ! = STATE_MSG )
return 0 ;
2016-03-03 14:23:21 -05:00
if ( dnode = = tipc_own_addr ( l - > net ) ) {
2015-10-22 08:51:41 -04:00
tipc_link_bc_ack_rcv ( l , acked , xmitq ) ;
2017-08-21 17:59:30 +02:00
rc = tipc_link_retrans ( l - > bc_sndlink , l , from , to , xmitq ) ;
2015-10-22 08:51:41 -04:00
l - > stats . recv_nacks + + ;
return rc ;
}
/* Msg for other node => suppress own NACK at next sync if applicable */
if ( more ( peers_snd_nxt , l - > rcv_nxt ) & & ! less ( l - > rcv_nxt , from ) )
l - > nack_state = BC_NACK_SND_SUPPRESS ;
return 0 ;
}
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
void tipc_link_set_queue_limits ( struct tipc_link * l , u32 win )
2006-01-02 19:04:38 +01:00
{
tipc: simplify link mtu negotiation
When a link is being established, the two endpoints advertise their
respective interface MTU in the transmitted RESET and ACTIVATE messages.
If there is any difference, the lower of the two MTUs will be selected
for use by both endpoints.
However, as a remnant of earlier attempts to introduce TIPC level
routing. there also exists an MTU discovery mechanism. If an intermediate
node has a lower MTU than the two endpoints, they will discover this
through a bisectional approach, and finally adopt this MTU for common use.
Since there is no TIPC level routing, and probably never will be,
this mechanism doesn't make any sense, and only serves to make the
link level protocol unecessarily complex.
In this commit, we eliminate the MTU discovery algorithm,and fall back
to the simple MTU advertising approach. This change is fully backwards
compatible.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:02 -04:00
int max_bulk = TIPC_MAX_PUBLICATIONS / ( l - > mtu / ITEM_SIZE ) ;
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
l - > window = win ;
2016-08-16 11:53:51 -04:00
l - > backlog [ TIPC_LOW_IMPORTANCE ] . limit = max_t ( u16 , 50 , win ) ;
l - > backlog [ TIPC_MEDIUM_IMPORTANCE ] . limit = max_t ( u16 , 100 , win * 2 ) ;
l - > backlog [ TIPC_HIGH_IMPORTANCE ] . limit = max_t ( u16 , 150 , win * 3 ) ;
l - > backlog [ TIPC_CRITICAL_IMPORTANCE ] . limit = max_t ( u16 , 200 , win * 4 ) ;
tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 12:07:24 -04:00
l - > backlog [ TIPC_SYSTEM_IMPORTANCE ] . limit = max_bulk ;
2006-01-02 19:04:38 +01:00
}
/**
2015-11-19 14:30:46 -05:00
* link_reset_stats - reset link statistics
2015-11-19 14:30:47 -05:00
* @ l : pointer to link
2006-01-02 19:04:38 +01:00
*/
2015-11-19 14:30:46 -05:00
void tipc_link_reset_stats ( struct tipc_link * l )
2006-01-02 19:04:38 +01:00
{
2015-11-19 14:30:46 -05:00
memset ( & l - > stats , 0 , sizeof ( l - > stats ) ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-16 16:54:30 -04:00
static void link_print ( struct tipc_link * l , const char * str )
2006-01-02 19:04:38 +01:00
{
2015-07-16 16:54:30 -04:00
struct sk_buff * hskb = skb_peek ( & l - > transmq ) ;
tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.
The following changes are introduced here:
- A new counter ("ackers") in struct tipc_link, to indicate how many
peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
has acked at the moment, i.e., where in the transmission queue to
start updating buffers when the next ack arrives. This is to avoid
double acknowledgements from a peer, with inadvertent relase of
packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
from a given sequence number, instead of the first packet in the
transmision queue. This is to minimize the number of retransmitted
packets on the broadcast media.
When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:38 -04:00
u16 head = hskb ? msg_seqno ( buf_msg ( hskb ) ) : l - > snd_nxt - 1 ;
2015-07-16 16:54:30 -04:00
u16 tail = l - > snd_nxt - 1 ;
tipc: decouple the relationship between bearer and link
Currently on both paths of message transmission and reception, the
read lock of tipc_net_lock must be held before bearer is accessed,
while the write lock of tipc_net_lock has to be taken before bearer
is configured. Although it can ensure that bearer is always valid on
the two data paths, link and bearer is closely bound together.
So as the part of effort of removing tipc_net_lock, the locking
policy of bearer protection will be adjusted as below: on the two
data paths, RCU is used, and on the configuration path of bearer,
RTNL lock is applied.
Now RCU just covers the path of message reception. To make it possible
to protect the path of message transmission with RCU, link should not
use its stored bearer pointer to access bearer, but it should use the
bearer identity of its attached bearer as index to get bearer instance
from bearer_list array, which can help us decouple the relationship
between bearer and link. As a result, bearer on the path of message
transmission can be safely protected by RCU when we access bearer_list
array within RCU lock protection.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-21 10:55:46 +08:00
2015-07-30 18:24:21 -04:00
pr_info ( " %s Link <%s> state %x \n " , str , l - > name , l - > state ) ;
2015-07-16 16:54:30 -04:00
pr_info ( " XMTQ: %u [%u-%u], BKLGQ: %u, SNDNX: %u, RCVNX: %u \n " ,
skb_queue_len ( & l - > transmq ) , head , tail ,
skb_queue_len ( & l - > backlogq ) , l - > snd_nxt , l - > rcv_nxt ) ;
2006-01-02 19:04:38 +01:00
}
2014-11-20 10:29:07 +01:00
/* Parse and validate nested (link) properties valid for media, bearer and link
*/
int tipc_nl_parse_link_prop ( struct nlattr * prop , struct nlattr * props [ ] )
{
int err ;
err = nla_parse_nested ( props , TIPC_NLA_PROP_MAX , prop ,
2017-04-12 14:34:07 +02:00
tipc_nl_prop_policy , NULL ) ;
2014-11-20 10:29:07 +01:00
if ( err )
return err ;
if ( props [ TIPC_NLA_PROP_PRIO ] ) {
u32 prio ;
prio = nla_get_u32 ( props [ TIPC_NLA_PROP_PRIO ] ) ;
if ( prio > TIPC_MAX_LINK_PRI )
return - EINVAL ;
}
if ( props [ TIPC_NLA_PROP_TOL ] ) {
u32 tol ;
tol = nla_get_u32 ( props [ TIPC_NLA_PROP_TOL ] ) ;
if ( ( tol < TIPC_MIN_LINK_TOL ) | | ( tol > TIPC_MAX_LINK_TOL ) )
return - EINVAL ;
}
if ( props [ TIPC_NLA_PROP_WIN ] ) {
u32 win ;
win = nla_get_u32 ( props [ TIPC_NLA_PROP_WIN ] ) ;
if ( ( win < TIPC_MIN_LINK_WIN ) | | ( win > TIPC_MAX_LINK_WIN ) )
return - EINVAL ;
}
return 0 ;
}
2014-11-20 10:29:12 +01:00
2014-11-24 11:10:29 +01:00
static int __tipc_nl_add_stats ( struct sk_buff * skb , struct tipc_stats * s )
2014-11-20 10:29:12 +01:00
{
int i ;
struct nlattr * stats ;
struct nla_map {
u32 key ;
u32 val ;
} ;
struct nla_map map [ ] = {
2016-11-25 10:35:02 -05:00
{ TIPC_NLA_STATS_RX_INFO , 0 } ,
2014-11-20 10:29:12 +01:00
{ TIPC_NLA_STATS_RX_FRAGMENTS , s - > recv_fragments } ,
{ TIPC_NLA_STATS_RX_FRAGMENTED , s - > recv_fragmented } ,
{ TIPC_NLA_STATS_RX_BUNDLES , s - > recv_bundles } ,
{ TIPC_NLA_STATS_RX_BUNDLED , s - > recv_bundled } ,
2016-11-25 10:35:02 -05:00
{ TIPC_NLA_STATS_TX_INFO , 0 } ,
2014-11-20 10:29:12 +01:00
{ TIPC_NLA_STATS_TX_FRAGMENTS , s - > sent_fragments } ,
{ TIPC_NLA_STATS_TX_FRAGMENTED , s - > sent_fragmented } ,
{ TIPC_NLA_STATS_TX_BUNDLES , s - > sent_bundles } ,
{ TIPC_NLA_STATS_TX_BUNDLED , s - > sent_bundled } ,
{ TIPC_NLA_STATS_MSG_PROF_TOT , ( s - > msg_length_counts ) ?
s - > msg_length_counts : 1 } ,
{ TIPC_NLA_STATS_MSG_LEN_CNT , s - > msg_length_counts } ,
{ TIPC_NLA_STATS_MSG_LEN_TOT , s - > msg_lengths_total } ,
{ TIPC_NLA_STATS_MSG_LEN_P0 , s - > msg_length_profile [ 0 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P1 , s - > msg_length_profile [ 1 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P2 , s - > msg_length_profile [ 2 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P3 , s - > msg_length_profile [ 3 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P4 , s - > msg_length_profile [ 4 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P5 , s - > msg_length_profile [ 5 ] } ,
{ TIPC_NLA_STATS_MSG_LEN_P6 , s - > msg_length_profile [ 6 ] } ,
{ TIPC_NLA_STATS_RX_STATES , s - > recv_states } ,
{ TIPC_NLA_STATS_RX_PROBES , s - > recv_probes } ,
{ TIPC_NLA_STATS_RX_NACKS , s - > recv_nacks } ,
{ TIPC_NLA_STATS_RX_DEFERRED , s - > deferred_recv } ,
{ TIPC_NLA_STATS_TX_STATES , s - > sent_states } ,
{ TIPC_NLA_STATS_TX_PROBES , s - > sent_probes } ,
{ TIPC_NLA_STATS_TX_NACKS , s - > sent_nacks } ,
{ TIPC_NLA_STATS_TX_ACKS , s - > sent_acks } ,
{ TIPC_NLA_STATS_RETRANSMITTED , s - > retransmitted } ,
{ TIPC_NLA_STATS_DUPLICATES , s - > duplicates } ,
{ TIPC_NLA_STATS_LINK_CONGS , s - > link_congs } ,
{ TIPC_NLA_STATS_MAX_QUEUE , s - > max_queue_sz } ,
{ TIPC_NLA_STATS_AVG_QUEUE , s - > queue_sz_counts ?
( s - > accu_queue_sz / s - > queue_sz_counts ) : 0 }
} ;
stats = nla_nest_start ( skb , TIPC_NLA_LINK_STATS ) ;
if ( ! stats )
return - EMSGSIZE ;
for ( i = 0 ; i < ARRAY_SIZE ( map ) ; i + + )
if ( nla_put_u32 ( skb , map [ i ] . key , map [ i ] . val ) )
goto msg_full ;
nla_nest_end ( skb , stats ) ;
return 0 ;
msg_full :
nla_nest_cancel ( skb , stats ) ;
return - EMSGSIZE ;
}
/* Caller should hold appropriate locks to protect the link */
2015-11-19 14:30:45 -05:00
int __tipc_nl_add_link ( struct net * net , struct tipc_nl_msg * msg ,
struct tipc_link * link , int nlflags )
2014-11-20 10:29:12 +01:00
{
2018-03-22 20:42:49 +01:00
u32 self = tipc_own_addr ( net ) ;
2014-11-20 10:29:12 +01:00
struct nlattr * attrs ;
struct nlattr * prop ;
2018-03-22 20:42:49 +01:00
void * hdr ;
int err ;
2014-11-20 10:29:12 +01:00
2015-02-09 09:50:03 +01:00
hdr = genlmsg_put ( msg - > skb , msg - > portid , msg - > seq , & tipc_genl_family ,
2015-04-28 18:33:50 +02:00
nlflags , TIPC_NL_LINK_GET ) ;
2014-11-20 10:29:12 +01:00
if ( ! hdr )
return - EMSGSIZE ;
attrs = nla_nest_start ( msg - > skb , TIPC_NLA_LINK ) ;
if ( ! attrs )
goto msg_full ;
if ( nla_put_string ( msg - > skb , TIPC_NLA_LINK_NAME , link - > name ) )
goto attr_msg_full ;
2018-03-22 20:42:49 +01:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_DEST , tipc_cluster_mask ( self ) ) )
2014-11-20 10:29:12 +01:00
goto attr_msg_full ;
tipc: simplify link mtu negotiation
When a link is being established, the two endpoints advertise their
respective interface MTU in the transmitted RESET and ACTIVATE messages.
If there is any difference, the lower of the two MTUs will be selected
for use by both endpoints.
However, as a remnant of earlier attempts to introduce TIPC level
routing. there also exists an MTU discovery mechanism. If an intermediate
node has a lower MTU than the two endpoints, they will discover this
through a bisectional approach, and finally adopt this MTU for common use.
Since there is no TIPC level routing, and probably never will be,
this mechanism doesn't make any sense, and only serves to make the
link level protocol unecessarily complex.
In this commit, we eliminate the MTU discovery algorithm,and fall back
to the simple MTU advertising approach. This change is fully backwards
compatible.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:02 -04:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_MTU , link - > mtu ) )
2014-11-20 10:29:12 +01:00
goto attr_msg_full ;
2016-11-25 10:35:02 -05:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_RX , link - > stats . recv_pkts ) )
2014-11-20 10:29:12 +01:00
goto attr_msg_full ;
2016-11-25 10:35:02 -05:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_TX , link - > stats . sent_pkts ) )
2014-11-20 10:29:12 +01:00
goto attr_msg_full ;
if ( tipc_link_is_up ( link ) )
if ( nla_put_flag ( msg - > skb , TIPC_NLA_LINK_UP ) )
goto attr_msg_full ;
2015-10-22 08:51:46 -04:00
if ( link - > active )
2014-11-20 10:29:12 +01:00
if ( nla_put_flag ( msg - > skb , TIPC_NLA_LINK_ACTIVE ) )
goto attr_msg_full ;
prop = nla_nest_start ( msg - > skb , TIPC_NLA_LINK_PROP ) ;
if ( ! prop )
goto attr_msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_PROP_PRIO , link - > priority ) )
goto prop_msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_PROP_TOL , link - > tolerance ) )
goto prop_msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_PROP_WIN ,
tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 12:07:24 -04:00
link - > window ) )
2014-11-20 10:29:12 +01:00
goto prop_msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_PROP_PRIO , link - > priority ) )
goto prop_msg_full ;
nla_nest_end ( msg - > skb , prop ) ;
err = __tipc_nl_add_stats ( msg - > skb , & link - > stats ) ;
if ( err )
goto attr_msg_full ;
nla_nest_end ( msg - > skb , attrs ) ;
genlmsg_end ( msg - > skb , hdr ) ;
return 0 ;
prop_msg_full :
nla_nest_cancel ( msg - > skb , prop ) ;
attr_msg_full :
nla_nest_cancel ( msg - > skb , attrs ) ;
msg_full :
genlmsg_cancel ( msg - > skb , hdr ) ;
return - EMSGSIZE ;
}
2015-11-19 14:30:46 -05:00
static int __tipc_nl_add_bc_link_stat ( struct sk_buff * skb ,
struct tipc_stats * stats )
{
int i ;
struct nlattr * nest ;
struct nla_map {
__u32 key ;
__u32 val ;
} ;
struct nla_map map [ ] = {
2016-11-25 10:35:02 -05:00
{ TIPC_NLA_STATS_RX_INFO , stats - > recv_pkts } ,
2015-11-19 14:30:46 -05:00
{ TIPC_NLA_STATS_RX_FRAGMENTS , stats - > recv_fragments } ,
{ TIPC_NLA_STATS_RX_FRAGMENTED , stats - > recv_fragmented } ,
{ TIPC_NLA_STATS_RX_BUNDLES , stats - > recv_bundles } ,
{ TIPC_NLA_STATS_RX_BUNDLED , stats - > recv_bundled } ,
2016-11-25 10:35:02 -05:00
{ TIPC_NLA_STATS_TX_INFO , stats - > sent_pkts } ,
2015-11-19 14:30:46 -05:00
{ TIPC_NLA_STATS_TX_FRAGMENTS , stats - > sent_fragments } ,
{ TIPC_NLA_STATS_TX_FRAGMENTED , stats - > sent_fragmented } ,
{ TIPC_NLA_STATS_TX_BUNDLES , stats - > sent_bundles } ,
{ TIPC_NLA_STATS_TX_BUNDLED , stats - > sent_bundled } ,
{ TIPC_NLA_STATS_RX_NACKS , stats - > recv_nacks } ,
{ TIPC_NLA_STATS_RX_DEFERRED , stats - > deferred_recv } ,
{ TIPC_NLA_STATS_TX_NACKS , stats - > sent_nacks } ,
{ TIPC_NLA_STATS_TX_ACKS , stats - > sent_acks } ,
{ TIPC_NLA_STATS_RETRANSMITTED , stats - > retransmitted } ,
{ TIPC_NLA_STATS_DUPLICATES , stats - > duplicates } ,
{ TIPC_NLA_STATS_LINK_CONGS , stats - > link_congs } ,
{ TIPC_NLA_STATS_MAX_QUEUE , stats - > max_queue_sz } ,
{ TIPC_NLA_STATS_AVG_QUEUE , stats - > queue_sz_counts ?
( stats - > accu_queue_sz / stats - > queue_sz_counts ) : 0 }
} ;
nest = nla_nest_start ( skb , TIPC_NLA_LINK_STATS ) ;
if ( ! nest )
return - EMSGSIZE ;
for ( i = 0 ; i < ARRAY_SIZE ( map ) ; i + + )
if ( nla_put_u32 ( skb , map [ i ] . key , map [ i ] . val ) )
goto msg_full ;
nla_nest_end ( skb , nest ) ;
return 0 ;
msg_full :
nla_nest_cancel ( skb , nest ) ;
return - EMSGSIZE ;
}
int tipc_nl_add_bc_link ( struct net * net , struct tipc_nl_msg * msg )
{
int err ;
void * hdr ;
struct nlattr * attrs ;
struct nlattr * prop ;
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
struct tipc_link * bcl = tn - > bcl ;
if ( ! bcl )
return 0 ;
tipc_bcast_lock ( net ) ;
hdr = genlmsg_put ( msg - > skb , msg - > portid , msg - > seq , & tipc_genl_family ,
NLM_F_MULTI , TIPC_NL_LINK_GET ) ;
2016-02-17 11:47:35 -05:00
if ( ! hdr ) {
tipc_bcast_unlock ( net ) ;
2015-11-19 14:30:46 -05:00
return - EMSGSIZE ;
2016-02-17 11:47:35 -05:00
}
2015-11-19 14:30:46 -05:00
attrs = nla_nest_start ( msg - > skb , TIPC_NLA_LINK ) ;
if ( ! attrs )
goto msg_full ;
/* The broadcast link is always up */
if ( nla_put_flag ( msg - > skb , TIPC_NLA_LINK_UP ) )
goto attr_msg_full ;
if ( nla_put_flag ( msg - > skb , TIPC_NLA_LINK_BROADCAST ) )
goto attr_msg_full ;
if ( nla_put_string ( msg - > skb , TIPC_NLA_LINK_NAME , bcl - > name ) )
goto attr_msg_full ;
2016-11-25 10:35:02 -05:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_RX , 0 ) )
2015-11-19 14:30:46 -05:00
goto attr_msg_full ;
2016-11-25 10:35:02 -05:00
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_LINK_TX , 0 ) )
2015-11-19 14:30:46 -05:00
goto attr_msg_full ;
prop = nla_nest_start ( msg - > skb , TIPC_NLA_LINK_PROP ) ;
if ( ! prop )
goto attr_msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_PROP_WIN , bcl - > window ) )
goto prop_msg_full ;
nla_nest_end ( msg - > skb , prop ) ;
err = __tipc_nl_add_bc_link_stat ( msg - > skb , & bcl - > stats ) ;
if ( err )
goto attr_msg_full ;
tipc_bcast_unlock ( net ) ;
nla_nest_end ( msg - > skb , attrs ) ;
genlmsg_end ( msg - > skb , hdr ) ;
return 0 ;
prop_msg_full :
nla_nest_cancel ( msg - > skb , prop ) ;
attr_msg_full :
nla_nest_cancel ( msg - > skb , attrs ) ;
msg_full :
tipc_bcast_unlock ( net ) ;
genlmsg_cancel ( msg - > skb , hdr ) ;
return - EMSGSIZE ;
}
2016-02-01 08:19:56 +01:00
void tipc_link_set_tolerance ( struct tipc_link * l , u32 tol ,
struct sk_buff_head * xmitq )
2015-11-19 14:30:46 -05:00
{
l - > tolerance = tol ;
2018-02-14 13:34:39 +01:00
if ( link_is_up ( l ) )
tipc_link_build_proto_msg ( l , STATE_MSG , 0 , 0 , 0 , tol , 0 , xmitq ) ;
2015-11-19 14:30:46 -05:00
}
2016-02-01 08:19:56 +01:00
void tipc_link_set_prio ( struct tipc_link * l , u32 prio ,
struct sk_buff_head * xmitq )
2015-11-19 14:30:46 -05:00
{
l - > priority = prio ;
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
tipc_link_build_proto_msg ( l , STATE_MSG , 0 , 0 , 0 , 0 , prio , xmitq ) ;
2015-11-19 14:30:46 -05:00
}
void tipc_link_set_abort_limit ( struct tipc_link * l , u32 limit )
{
l - > abort_limit = limit ;
}