2006-01-02 19:04:38 +01:00
/*
* net/tipc/node.c: TIPC node management routines
2007-02-09 23:25:21 +09:00
*
2016-05-02 11:58:46 -04:00
* Copyright (c) 2000-2006, 2012-2016, Ericsson AB
2014-03-27 12:54:36 +08:00
* Copyright (c) 2005-2006, 2010-2014, Wind River Systems
2006-01-02 19:04:38 +01:00
* All rights reserved .
*
2006-01-11 13:30:43 +01:00
* Redistribution and use in source and binary forms , with or without
2006-01-02 19:04:38 +01:00
* modification , are permitted provided that the following conditions are met :
*
2006-01-11 13:30:43 +01:00
* 1. Redistributions of source code must retain the above copyright
* notice , this list of conditions and the following disclaimer .
* 2. Redistributions in binary form must reproduce the above copyright
* notice , this list of conditions and the following disclaimer in the
* documentation and / or other materials provided with the distribution .
* 3. Neither the names of the copyright holders nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission .
2006-01-02 19:04:38 +01:00
*
2006-01-11 13:30:43 +01:00
* Alternatively , this software may be distributed under the terms of the
* GNU General Public License ( " GPL " ) version 2 as published by the Free
* Software Foundation .
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS " AS IS "
* AND ANY EXPRESS OR IMPLIED WARRANTIES , INCLUDING , BUT NOT LIMITED TO , THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED . IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT , INDIRECT , INCIDENTAL , SPECIAL , EXEMPLARY , OR
* CONSEQUENTIAL DAMAGES ( INCLUDING , BUT NOT LIMITED TO , PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES ; LOSS OF USE , DATA , OR PROFITS ; OR BUSINESS
* INTERRUPTION ) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY , WHETHER IN
* CONTRACT , STRICT LIABILITY , OR TORT ( INCLUDING NEGLIGENCE OR OTHERWISE )
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE , EVEN IF ADVISED OF THE
2006-01-02 19:04:38 +01:00
* POSSIBILITY OF SUCH DAMAGE .
*/
# include "core.h"
2015-02-09 09:50:18 +01:00
# include "link.h"
2006-01-02 19:04:38 +01:00
# include "node.h"
# include "name_distr.h"
2014-08-22 18:09:07 -04:00
# include "socket.h"
2015-05-14 10:46:13 -04:00
# include "bcast.h"
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the members of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
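The domain selection step described above can be sketched in a few lines of C. This is only an illustration, not the code in monitor.c: the helper names (isqrt_u32, local_domain_size, select_local_domain) are hypothetical, and the sketch assumes the peer list is already sorted in ascending order and does not contain the node itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer square root, rounded down (hypothetical helper). */
static unsigned int isqrt_u32(uint32_t n)
{
	unsigned int r = 0;

	while ((uint64_t)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Size of the local monitoring domain, M = sqrt(N) - 1, for N known peers. */
static unsigned int local_domain_size(uint32_t n_peers)
{
	unsigned int root = isqrt_u32(n_peers);

	return root ? root - 1 : 0;
}

/*
 * Copy the M members that follow 'self' in the ascending, circular list
 * 'peers' into 'domain'. Returns the number of members written.
 */
static size_t select_local_domain(const uint32_t *peers, size_t n_peers,
				  uint32_t self, uint32_t *domain, size_t max)
{
	size_t m = local_domain_size((uint32_t)n_peers);
	size_t start = 0;
	size_t i;

	assert(domain != NULL || max == 0);

	/* Find the first peer above our own address; wrap around if none. */
	while (start < n_peers && peers[start] <= self)
		start++;
	if (start == n_peers)
		start = 0;
	if (m > max)
		m = max;
	for (i = 0; i < m; i++)
		domain[i] = peers[(start + i) % n_peers];
	return m;
}
```

With N = 400 the formula gives a local domain of 19 members, which is half of the 38 actively monitored neighbors quoted above; the other half would be the remote domain heads, which this sketch does not select.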
2016-06-13 20:46:22 -04:00
# include "monitor.h"
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/transmits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
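The locking pattern described above (fill local queues while holding the node lock, act on them after releasing it) might look roughly like this in plain C. Everything here is a stand-in: demo_node, demo_link_rcv() and demo_rcv() are hypothetical names, a C11 atomic_flag plays the role of the node spinlock, and the rule that negative packets trigger a retransmission exists only to populate the xmit queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

struct pkt_queue {
	int pkts[QLEN];
	size_t len;
};

struct demo_node {
	atomic_flag lock;		/* stands in for the node lock */
	bool locked;			/* debug aid: true while the lock is held */
	unsigned int delivered;		/* packets handed upwards */
	unsigned int transmitted;	/* packets sent back out */
};

static void demo_node_init(struct demo_node *n)
{
	atomic_flag_clear(&n->lock);
	n->locked = false;
	n->delivered = 0;
	n->transmitted = 0;
}

static void queue_add(struct pkt_queue *q, int pkt)
{
	if (q->len < QLEN)
		q->pkts[q->len++] = pkt;
}

/* Stand-in for the link-level receive step: it only fills the queues. */
static void demo_link_rcv(int pkt, struct pkt_queue *inputq,
			  struct pkt_queue *xmitq)
{
	queue_add(inputq, pkt);
	if (pkt < 0)	/* assumption: negative packets request retransmission */
		queue_add(xmitq, -pkt);
}

static void demo_rcv(struct demo_node *n, int pkt)
{
	struct pkt_queue inputq = { .len = 0 };
	struct pkt_queue xmitq = { .len = 0 };

	/* Short critical section: only queue manipulation happens here. */
	while (atomic_flag_test_and_set(&n->lock))
		;
	n->locked = true;
	demo_link_rcv(pkt, &inputq, &xmitq);
	n->locked = false;
	atomic_flag_clear(&n->lock);

	/* Delivery and transmission run with the lock released. */
	assert(!n->locked);
	n->delivered += (unsigned int)inputq.len;
	n->transmitted += (unsigned int)xmitq.len;
}
```

The point of the design is that the expensive work (delivery upwards, transmission) never runs inside the critical section, so other CPUs contending for the same node lock wait only for the queue manipulation.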
2015-07-16 16:54:31 -04:00
# include "discover.h"
2016-03-04 17:04:42 +01:00
# include "netlink.h"
tipc: enable tracepoints in tipc
For debugging and tracing purposes, this commit enables tracepoints in
TIPC along with some general trace_events, as shown below. It also
defines some 'tipc_*_dump()' functions that make it possible to dump
TIPC object data whenever needed, i.e., for general debug purposes and
not just for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions generate raw data only, in order to offload the
trace event's processing. A tool or script may be needed to parse the
data, but this should be simple.
b) The trace_tipc_*_dump() events should be reserved for failure cases
only (e.g. the retransmission failure case) or for conditions we do not
expect to happen often. We can then consider enabling these events by
default, since they will have almost no effect under normal conditions,
but once the rare condition or failure occurs, we get the full dumped
data for post-analysis.
For other trace purposes, we can reuse these trace classes as templates
for different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) As with other trace_events, features such as 'filter' and 'trigger'
are also usable for the TIPC trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
# include "trace.h"
2019-11-08 12:05:11 +07:00
# include "crypto.h"
2015-10-15 14:52:46 -04:00
2015-11-19 14:30:45 -05:00
# define INVALID_NODE_SIG 0x10000
2018-06-29 13:23:41 +02:00
# define NODE_CLEANUP_AFTER 300000
2015-11-19 14:30:45 -05:00
/* Flags used to take different actions according to flag type
 * TIPC_NOTIFY_NODE_DOWN: notify node is down
 * TIPC_NOTIFY_NODE_UP: notify node is up
 * TIPC_NOTIFY_LINK_UP: notify link is up
 * TIPC_NOTIFY_LINK_DOWN: notify link is down
 */
enum {
	TIPC_NOTIFY_NODE_DOWN = (1 << 3),
	TIPC_NOTIFY_NODE_UP = (1 << 4),
	TIPC_NOTIFY_LINK_UP = (1 << 6),
	TIPC_NOTIFY_LINK_DOWN = (1 << 7)
};
struct tipc_link_entry {
struct tipc_link * link ;
spinlock_t lock ; /* per link */
u32 mtu ;
struct sk_buff_head inputq ;
struct tipc_media_addr maddr ;
} ;
struct tipc_bclink_entry {
struct tipc_link * link ;
struct sk_buff_head inputq1 ;
struct sk_buff_head arrvq ;
struct sk_buff_head inputq2 ;
struct sk_buff_head namedq ;
tipc: update a binding service via broadcast
Currently, binding table updates (adding a service binding to the
name table or withdrawing one) are sent over replicast.
However, if we scale up clusters to > 100 nodes/containers, this
method is less efficient, because it loops through the nodes in the
cluster one by one.
It is worthwhile to use broadcast to update a service binding. This way,
the binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypass the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add sanity check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
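The four rules above can be condensed into a single receive-side decision. The following is a minimal sketch under stated assumptions, not the code in name_distr.c: the struct layouts, the enum and the function names are hypothetical, and the header bits are modelled as plain booleans rather than message header fields. The state mirrors the named_rcv_nxt and named_open members added to struct tipc_bclink_entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum named_verdict { NAMED_DELIVER, NAMED_DEFER, NAMED_DROP };

/* Hypothetical view of the relevant header bits of a name table message. */
struct named_hdr {
	bool is_not_legacy;	/* set in all new-style messages */
	bool is_bulk;		/* unicast 'bulk' update */
	bool is_last_bulk;	/* final message of the bulk update */
	uint16_t seqno;		/* only meaningful for broadcast/replicast */
};

/* Per-peer receive state. */
struct named_rx_state {
	uint16_t rcv_nxt;	/* next expected sequence number */
	bool open;		/* true once the last bulk message has arrived */
};

/* True if a precedes b in 16-bit wrap-around order. */
static bool seq16_less(uint16_t a, uint16_t b)
{
	return ((uint16_t)(a - b) & 0x8000u) != 0;
}

static enum named_verdict named_classify(struct named_rx_state *st,
					 const struct named_hdr *h)
{
	assert(st != NULL && h != NULL);

	/* Rule 4: legacy messages are delivered unconditionally. */
	if (!h->is_not_legacy)
		return NAMED_DELIVER;

	/* Rules 1 and 3: bulk messages bypass sequence number control. */
	if (h->is_bulk) {
		if (h->is_last_bulk)
			st->open = true;
		return NAMED_DELIVER;
	}

	/* Rule 2: in-sequence delivery once reception has been opened. */
	if (st->open && h->seqno == st->rcv_nxt) {
		st->rcv_nxt++;
		return NAMED_DELIVER;
	}

	/* Sanity check: anything older than the expected number is dropped. */
	if (seq16_less(h->seqno, st->rcv_nxt))
		return NAMED_DROP;

	/* Too early or out of order: keep it queued for a later pass. */
	return NAMED_DEFER;
}
```

A caller would typically re-run the classification over its queue of deferred messages after each delivery, since delivering one message may make the next one in sequence eligible.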
2020-06-17 13:56:05 +07:00
u16 named_rcv_nxt ;
bool named_open ;
2015-11-19 14:30:45 -05:00
} ;
/**
* struct tipc_node - TIPC node structure
* @ addr : network address of node
2020-11-29 10:32:47 -08:00
* @ kref : reference counter to node object
2015-11-19 14:30:45 -05:00
* @ lock : rwlock governing access to structure
* @ net : the applicable net namespace
* @ hash : links to adjacent nodes in unsorted hash chain
* @ inputq : pointer to input queue containing messages for msg event
* @ namedq : pointer to name table input queue with name table messages
* @ active_links : bearer ids of active links , used as index into links [ ] array
* @ links : array containing references to all links to node
2020-11-29 10:32:47 -08:00
* @ bc_entry : broadcast link entry
2015-11-19 14:30:45 -05:00
* @ action_flags : bit mask of different types of node actions
* @ state : connectivity state vs peer node
2019-11-08 12:05:09 +07:00
* @ preliminary : a preliminary node or not
2020-11-29 10:32:47 -08:00
* @ failover_sent : failover sent or not
2015-11-19 14:30:45 -05:00
* @ sync_point : sequence number where synch / failover is finished
* @ list : links to adjacent nodes in sorted list of cluster ' s nodes
* @ working_links : number of working links to node ( both active and standby )
* @ link_cnt : number of links to node
* @ capabilities : bitmap , indicating peer node ' s functional capabilities
* @ signature : node instance identifier
* @ link_id : local and remote bearer ids of changing link , if any
2020-11-29 10:32:47 -08:00
* @ peer_id : 128 - bit ID of peer
* @ peer_id_string : ID string of peer
2015-11-19 14:30:45 -05:00
* @ publ_list : list of publications
2020-11-29 10:32:47 -08:00
* @ conn_sks : list of connections ( FIXME )
* @ timer : node ' s keepalive timer
* @ keepalive_intv : keepalive interval in milliseconds
2015-11-19 14:30:45 -05:00
* @ rcu : rcu struct for tipc_node
2018-06-29 13:23:41 +02:00
* @ delete_at : indicates the time for deleting a down node
2020-11-29 10:32:47 -08:00
* @ peer_net : peer ' s net namespace
* @ peer_hash_mix : hash for this peer ( FIXME )
2019-11-08 12:05:11 +07:00
* @ crypto_rx : RX crypto handler
2015-11-19 14:30:45 -05:00
*/
struct tipc_node {
u32 addr ;
struct kref kref ;
rwlock_t lock ;
struct net * net ;
struct hlist_node hash ;
int active_links [ 2 ] ;
struct tipc_link_entry links [ MAX_BEARERS ] ;
struct tipc_bclink_entry bc_entry ;
int action_flags ;
struct list_head list ;
int state ;
2019-11-08 12:05:09 +07:00
bool preliminary ;
2018-09-26 21:00:54 +02:00
bool failover_sent ;
2015-11-19 14:30:45 -05:00
u16 sync_point ;
int link_cnt ;
u16 working_links ;
u16 capabilities ;
u32 signature ;
u32 link_id ;
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
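The receiver side of the trial protocol described above boils down to a three-way decision. The sketch below illustrates it with a flat table of known nodes; the table, the function names and the "increment until unused" address generator are all assumptions made for the example and are not taken from the TIPC sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a peer seen in this or a previous session. */
struct known_node {
	uint8_t id[16];		/* 128-bit node identity */
	uint32_t addr;		/* 32-bit address it used */
};

static const struct known_node *find_by_id(const struct known_node *tbl,
					    size_t n, const uint8_t id[16])
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!memcmp(tbl[i].id, id, 16))
			return &tbl[i];
	return NULL;
}

static bool addr_in_use(const struct known_node *tbl, size_t n, uint32_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].addr == addr)
			return true;
	return false;
}

/*
 * Handle a trial message from node 'id' asking for 'addr'.
 * Returns false to accept silently, or true if a failure message
 * carrying '*suggest' should be sent back.
 */
static bool trial_check(const struct known_node *tbl, size_t n,
			const uint8_t id[16], uint32_t addr,
			uint32_t *suggest)
{
	const struct known_node *prev = find_by_id(tbl, n, id);
	uint32_t cand = addr;

	assert(suggest != NULL);

	if (prev) {
		/* Same peer, same address as before: accept. */
		if (prev->addr == addr)
			return false;
		/* Same peer was up before with another address. */
		*suggest = prev->addr;
		return true;
	}
	/* Unknown peer asking for an unused address: accept. */
	if (!addr_in_use(tbl, n, addr))
		return false;

	/* Address taken by another node: propose an unused one. */
	do {
		cand++;
	} while (cand == 0 || addr_in_use(tbl, n, cand));
	*suggest = cand;
	return true;
}
```

On the requesting side, a true result corresponds to receiving a failure message: the node would apply the suggested value as its trial address and restart the trial period.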
2018-03-22 20:42:51 +01:00
u8 peer_id [ 16 ] ;
2019-11-08 12:05:09 +07:00
char peer_id_string [ NODE_ID_STR_LEN ] ;
2015-11-19 14:30:45 -05:00
struct list_head publ_list ;
struct list_head conn_sks ;
unsigned long keepalive_intv ;
struct timer_list timer ;
struct rcu_head rcu ;
2018-06-29 13:23:41 +02:00
unsigned long delete_at ;
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive path through the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go through
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey the policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against the hash mixes of all
the local namespaces. If it finds a match, this, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
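The proof check described above can be illustrated as a lookup over the namespaces known to the local kernel. This is a simplified sketch: the structures and the function name are hypothetical, and it assumes the proof in the discovery message is the sender's hash mix value itself, which may differ from how the value is actually encoded on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of a namespace hosted by the local kernel. */
struct demo_ns {
	uint32_t hash_mix;	/* per-namespace secret */
	uint32_t cluster_id;
	uint8_t node_id[16];
};

/* Hypothetical view of the relevant fields of a discovery message. */
struct demo_disc {
	uint32_t proof;		/* sender's hash mix */
	uint32_t cluster_id;
	uint8_t node_id[16];
};

/*
 * Return the local namespace that sent 'msg', or NULL if the sender
 * cannot be shown to live in this kernel.
 */
static const struct demo_ns *find_local_peer(const struct demo_ns *ns,
					     size_t n,
					     const struct demo_disc *msg)
{
	size_t i;

	assert(msg != NULL);

	for (i = 0; i < n; i++) {
		if (ns[i].hash_mix != msg->proof)
			continue;
		if (ns[i].cluster_id != msg->cluster_id)
			continue;
		if (memcmp(ns[i].node_id, msg->node_id, 16))
			continue;
		return &ns[i];	/* all three criteria match */
	}
	return NULL;
}
```

Only a non-NULL result would justify switching user data messages to the crossover path; in every other case the traffic keeps using the regular link.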
2019-10-29 07:51:21 +07:00
struct net * peer_net ;
u32 peer_hash_mix ;
2019-11-08 12:05:11 +07:00
# ifdef CONFIG_TIPC_CRYPTO
struct tipc_crypto * crypto_rx ;
# endif
2015-11-19 14:30:45 -05:00
} ;
2015-07-30 18:24:19 -04:00
/* Node FSM states and events:
*/
enum {
SELF_DOWN_PEER_DOWN = 0xdd ,
SELF_UP_PEER_UP = 0xaa ,
SELF_DOWN_PEER_LEAVING = 0xd1 ,
SELF_UP_PEER_COMING = 0xac ,
SELF_COMING_PEER_UP = 0xca ,
SELF_LEAVING_PEER_DOWN = 0x1d ,
NODE_FAILINGOVER = 0xf0 ,
NODE_SYNCHING = 0xcc
} ;
enum {
SELF_ESTABL_CONTACT_EVT = 0xece ,
SELF_LOST_CONTACT_EVT = 0x1ce ,
PEER_ESTABL_CONTACT_EVT = 0x9ece ,
PEER_LOST_CONTACT_EVT = 0x91ce ,
NODE_FAILOVER_BEGIN_EVT = 0xfbe ,
NODE_FAILOVER_END_EVT = 0xfee ,
NODE_SYNCH_BEGIN_EVT = 0xcbe ,
NODE_SYNCH_END_EVT = 0xcee
} ;
2015-07-30 18:24:23 -04:00
static void __tipc_node_link_down ( struct tipc_node * n , int * bearer_id ,
struct sk_buff_head * xmitq ,
struct tipc_media_addr * * maddr ) ;
static void tipc_node_link_down ( struct tipc_node * n , int bearer_id ,
bool delete ) ;
static void node_lost_contact ( struct tipc_node * n , struct sk_buff_head * inputq ) ;
2015-03-26 18:10:24 +08:00
static void tipc_node_delete ( struct tipc_node * node ) ;
2017-10-30 14:06:45 -07:00
static void tipc_node_timeout ( struct timer_list * t ) ;
2015-07-16 16:54:31 -04:00
static void tipc_node_fsm_evt ( struct tipc_node * n , int evt ) ;
2015-11-19 14:30:45 -05:00
static struct tipc_node * tipc_node_find ( struct net * net , u32 addr ) ;
2018-03-22 20:42:51 +01:00
static struct tipc_node * tipc_node_find_by_id ( struct net * net , u8 * id ) ;
2017-10-13 11:04:19 +02:00
static bool node_is_up ( struct tipc_node * n ) ;
2018-06-29 13:23:41 +02:00
static void tipc_node_delete_from_list ( struct tipc_node * node ) ;
2006-01-02 19:04:38 +01:00
tipc: use message to abort connections when losing contact to node
In the current implementation, each 'struct tipc_node' instance keeps
a linked list of those ports/sockets that are connected to the node
represented by that struct. The purpose of this is to let the node
object know which sockets to alert when it loses contact with its peer
node, i.e., which sockets need to have their connections aborted.
This entails an unwanted direct reference from the node structure
back to the port/socket structure, and a need to grab port_lock
when we have to make an upcall to the port. We want to get rid of
this unnecessary BH entry point into the socket, and also eliminate
its use of port_lock.
In this commit, we instead let the node struct keep a list of "connected
socket" structs, which each represents a connected socket, but is
allocated independently by the node at the moment of connection. If
the node loses contact with its peer node, the list is traversed, and
a "connection abort" message is created for each entry in the list. The
message is sent to its respective connected socket using the ordinary
data path, and the receiving socket aborts its connections upon reception
of the message.
This enables us to get rid of the direct reference from 'struct node' to
'struct port', and another unwanted BH access point to the latter.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
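The mechanism described above can be sketched with an ordinary singly linked list. The record fields follow struct tipc_sock_conn (port, peer_port, peer_node); everything else, including the abort_msg layout and the function names, is hypothetical, and the "message" is just a struct written into an output array instead of a buffer sent over the data path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One record per connected socket, allocated by the node. */
struct conn_rec {
	uint32_t port;		/* local socket */
	uint32_t peer_port;
	uint32_t peer_node;
	struct conn_rec *next;
};

/* Hypothetical "connection abort" message. */
struct abort_msg {
	uint32_t dest_port;	/* local socket that must abort */
	uint32_t orig_port;	/* peer socket it was connected to */
	uint32_t orig_node;	/* the node we lost contact with */
};

/* Register a connection at the head of the node's list. */
static bool conn_add(struct conn_rec **head, uint32_t port,
		     uint32_t peer_port, uint32_t peer_node)
{
	struct conn_rec *c = malloc(sizeof(*c));

	assert(head != NULL);
	if (!c)
		return false;
	c->port = port;
	c->peer_port = peer_port;
	c->peer_node = peer_node;
	c->next = *head;
	*head = c;
	return true;
}

/*
 * Contact with the peer node is lost: emit one abort message per record
 * and release the whole list. Returns the number of messages written.
 */
static size_t lost_contact(struct conn_rec **head, struct abort_msg *out,
			   size_t max)
{
	size_t cnt = 0;

	assert(head != NULL);
	while (*head) {
		struct conn_rec *c = *head;

		*head = c->next;
		if (cnt < max) {
			out[cnt].dest_port = c->port;
			out[cnt].orig_port = c->peer_port;
			out[cnt].orig_node = c->peer_node;
			cnt++;
		}
		free(c);
	}
	return cnt;
}
```

Because the node only holds plain records and produces messages, it never needs a pointer into the socket structure or any socket-level lock.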
2014-08-22 18:09:08 -04:00
struct tipc_sock_conn {
u32 port ;
u32 peer_port ;
u32 peer_node ;
struct list_head list ;
} ;
2015-11-19 14:30:45 -05:00
static struct tipc_link *node_active_link(struct tipc_node *n, int sel)
{
	int bearer_id = n->active_links[sel & 1];

	if (unlikely(bearer_id == INVALID_BEARER_ID))
		return NULL;

	return n->links[bearer_id].link;
}
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against the hash mixes of all
the local namespaces. If it finds a match, this, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should note that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-namespace
messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
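The proof check described in the commit message above can be sketched in user space as follows. This is a minimal model, not the kernel implementation: `struct ns_model`, `find_local_peer()` and `NODE_ID_LEN` are hypothetical names, and the proof is assumed to be directly comparable to a namespace's hash mix, which may be simpler than what the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16	/* 128-bit node identity */

/* Hypothetical model of one local network namespace running TIPC */
struct ns_model {
	uint32_t hash_mix;		/* secret readable by all local namespaces */
	uint32_t cluster_id;
	uint8_t node_id[NODE_ID_LEN];
};

/* Return the local namespace that matches the proof, the cluster id and
 * the node id carried by a discovery message, or NULL if there is none.
 * Assumption: the proof equals the sender's hash mix.
 */
static const struct ns_model *find_local_peer(const struct ns_model *ns,
					      size_t cnt, uint32_t proof,
					      uint32_t cluster_id,
					      const uint8_t *node_id)
{
	assert(node_id != NULL);

	for (size_t i = 0; i < cnt; i++) {
		if (ns[i].hash_mix != proof)
			continue;
		if (ns[i].cluster_id != cluster_id)
			continue;
		if (memcmp(ns[i].node_id, node_id, NODE_ID_LEN) != 0)
			continue;
		return &ns[i];	/* all three criteria match */
	}
	return NULL;		/* peer is not a kernel local namespace */
}
```

Only when all three comparisons succeed would the sender treat the peer as kernel local; a mismatch in any of them leaves the traffic on the regular path.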
2019-10-29 07:51:21 +07:00
int tipc_node_get_mtu ( struct net * net , u32 addr , u32 sel , bool connected )
2015-11-19 14:30:45 -05:00
{
struct tipc_node * n ;
int bearer_id ;
unsigned int mtu = MAX_MSG_SIZE ;
n = tipc_node_find ( net , addr ) ;
if ( unlikely ( ! n ) )
return mtu ;
2019-10-29 07:51:21 +07:00
/* Allow MAX_MSG_SIZE when building connection oriented message
* if they are in the same core network
*/
if ( n - > peer_net & & connected ) {
tipc_node_put ( n ) ;
return mtu ;
}
2015-11-19 14:30:45 -05:00
bearer_id = n - > active_links [ sel & 1 ] ;
if ( likely ( bearer_id ! = INVALID_BEARER_ID ) )
mtu = n - > links [ bearer_id ] . mtu ;
tipc_node_put ( n ) ;
return mtu ;
}
2016-05-02 11:58:46 -04:00
2018-04-25 19:29:36 +02:00
bool tipc_node_get_id ( struct net * net , u32 addr , u8 * id )
{
u8 * own_id = tipc_own_id ( net ) ;
struct tipc_node * n ;
if ( ! own_id )
return true ;
if ( addr = = tipc_own_addr ( net ) ) {
memcpy ( id , own_id , TIPC_NODEID_LEN ) ;
return true ;
}
n = tipc_node_find ( net , addr ) ;
if ( ! n )
return false ;
memcpy ( id , & n - > peer_id , TIPC_NODEID_LEN ) ;
tipc_node_put ( n ) ;
return true ;
}
2016-05-02 11:58:46 -04:00
u16 tipc_node_get_capabilities ( struct net * net , u32 addr )
{
struct tipc_node * n ;
u16 caps ;
n = tipc_node_find ( net , addr ) ;
if ( unlikely ( ! n ) )
return TIPC_NODE_CAPABILITIES ;
caps = n - > capabilities ;
tipc_node_put ( n ) ;
return caps ;
}
2019-11-08 12:05:09 +07:00
u32 tipc_node_get_addr ( struct tipc_node * node )
{
return ( node ) ? node - > addr : 0 ;
}
char * tipc_node_get_id_str ( struct tipc_node * node )
{
return node - > peer_id_string ;
}
2019-11-08 12:05:11 +07:00
# ifdef CONFIG_TIPC_CRYPTO
/**
* tipc_node_crypto_rx - Retrieve crypto RX handle from node
2020-11-29 10:32:47 -08:00
* @ __n : target tipc_node
2019-11-08 12:05:11 +07:00
* Note : node ref counter must be held first !
*/
struct tipc_crypto * tipc_node_crypto_rx ( struct tipc_node * __n )
{
return ( __n ) ? __n - > crypto_rx : NULL ;
}
struct tipc_crypto * tipc_node_crypto_rx_by_list ( struct list_head * pos )
{
return container_of ( pos , struct tipc_node , list ) - > crypto_rx ;
}
tipc: add automatic session key exchange
With support from the master key option in the previous commit, it
becomes easy to make frequent updates/exchanges of session keys between
authenticated cluster nodes.
Basically, there are two situations where the key exchange will take
place:
- When a new node joins the cluster (with the master key), it will need
to get its peer's TX key, so that it is able to decrypt further messages
from that peer.
- When a new session key is generated (by either user manual setting or
later automatic rekeying feature), the key will be distributed to all
peer nodes in the cluster.
A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
/KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
using the master key before sending out. Upon receipt of the message it
will be decrypted in the same way as regular messages, then attached as
the sender's RX key in the receiver node.
In this way, the link layer makes the key exchange reliable, while the
crypto layer provides security, integrity and authenticity.
Also, forward security is easily achieved by the user actively changing
the master key, but this should not be required very frequently.
The key exchange feature is independent of the presence of a master key.
Note, however, that the master key is still needed for new nodes to be
able to join the cluster. It is also optional, and can be turned off/on
via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled].
Backward compatibility is guaranteed because, for nodes that do not have
master key support, key exchange messages using the master key (i.e.
tx_key = 0), if any, will be discarded at the message validation step.
In other words, the key exchange feature will be automatically disabled
for those nodes.
v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
error in node.c. The function only exists when built with the TIPC
"CONFIG_TIPC_CRYPTO" option.
v3: use 'info->extack' for a message emitted due to netlink operations
instead (- David's comment).
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 08:17:28 +07:00
struct tipc_crypto * tipc_node_crypto_rx_by_addr ( struct net * net , u32 addr )
{
struct tipc_node * n ;
n = tipc_node_find ( net , addr ) ;
return ( n ) ? n - > crypto_rx : NULL ;
}
2019-11-08 12:05:11 +07:00
# endif
2020-02-10 16:11:09 +08:00
static void tipc_node_free ( struct rcu_head * rp )
2019-11-08 12:05:11 +07:00
{
struct tipc_node * n = container_of ( rp , struct tipc_node , rcu ) ;
# ifdef CONFIG_TIPC_CRYPTO
tipc_crypto_stop ( & n - > crypto_rx ) ;
# endif
kfree ( n ) ;
}
2015-03-26 18:10:24 +08:00
static void tipc_node_kref_release ( struct kref * kref )
{
2016-02-24 11:10:48 -05:00
struct tipc_node * n = container_of ( kref , struct tipc_node , kref ) ;
2015-03-26 18:10:24 +08:00
2016-02-24 11:10:48 -05:00
kfree ( n - > bc_entry . link ) ;
2019-11-08 12:05:11 +07:00
call_rcu ( & n - > rcu , tipc_node_free ) ;
2015-03-26 18:10:24 +08:00
}
2019-11-08 12:05:11 +07:00
void tipc_node_put ( struct tipc_node * node )
2015-03-26 18:10:24 +08:00
{
kref_put ( & node - > kref , tipc_node_kref_release ) ;
}
2020-09-18 08:17:28 +07:00
void tipc_node_get ( struct tipc_node * node )
2015-03-26 18:10:24 +08:00
{
kref_get ( & node - > kref ) ;
}
2011-10-27 15:03:24 -04:00
/*
2011-02-25 18:42:52 -05:00
* tipc_node_find - locate specified node object , if it exists
*/
2015-11-19 14:30:45 -05:00
static struct tipc_node * tipc_node_find ( struct net * net , u32 addr )
2011-02-25 18:42:52 -05:00
{
2016-02-24 11:00:19 -05:00
struct tipc_net * tn = tipc_net ( net ) ;
2011-02-25 18:42:52 -05:00
struct tipc_node * node ;
2016-02-24 11:00:19 -05:00
unsigned int thash = tipc_hashfn ( addr ) ;
2011-02-25 18:42:52 -05:00
2014-03-27 12:54:37 +08:00
rcu_read_lock ( ) ;
2016-02-24 11:00:19 -05:00
hlist_for_each_entry_rcu ( node , & tn - > node_htable [ thash ] , hash ) {
2019-11-08 12:05:09 +07:00
if ( node - > addr ! = addr | | node - > preliminary )
2016-02-24 11:00:19 -05:00
continue ;
if ( ! kref_get_unless_zero ( & node - > kref ) )
node = NULL ;
break ;
2011-02-25 18:42:52 -05:00
}
2014-03-27 12:54:37 +08:00
rcu_read_unlock ( ) ;
2016-02-24 11:00:19 -05:00
return node ;
2011-02-25 18:42:52 -05:00
}
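The lookup above hands out a node only if its reference count has not already dropped to zero. A minimal user-space sketch of that "get unless zero" idea follows, using C11 atomics in place of the kernel's kref and leaving RCU out entirely; `struct obj`, `ref_get_unless_zero()`, `obj_find()` and `obj_put()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted object, keyed by an address */
struct obj {
	atomic_uint refcnt;
	unsigned int addr;
};

/* Take a reference only if the count is still non-zero */
static bool ref_get_unless_zero(atomic_uint *refcnt)
{
	unsigned int old = atomic_load(refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;	/* object is already on its way to being freed */
}

/* Find the object with the given address and take a reference to it */
static struct obj *obj_find(struct obj *tab, size_t cnt, unsigned int addr)
{
	for (size_t i = 0; i < cnt; i++) {
		if (tab[i].addr != addr)
			continue;
		return ref_get_unless_zero(&tab[i].refcnt) ? &tab[i] : NULL;
	}
	return NULL;
}

/* Drop a reference taken by obj_find(); freeing is not modelled here */
static void obj_put(struct obj *o)
{
	assert(atomic_load(&o->refcnt) > 0);
	atomic_fetch_sub(&o->refcnt, 1);
}
```

A caller pairs every successful `obj_find()` with one `obj_put()`, in the same way the functions in this file pair `tipc_node_find()` with `tipc_node_put()`.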
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
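The receiver-side decisions listed in the commit message above can be sketched as a small user-space function. This is an illustration under stated assumptions, not the kernel code: `struct peer`, `try_addr()` and the helper names are hypothetical, a return value of 0 stands for "accept silently", and the way a replacement address is generated (counting upwards to the next unused value) is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ID_LEN 16	/* 128-bit node identity */

/* Hypothetical record of a peer seen in this or an earlier session */
struct peer {
	uint32_t addr;
	uint8_t id[ID_LEN];
};

static const struct peer *find_by_addr(const struct peer *tab, size_t cnt,
				       uint32_t addr)
{
	for (size_t i = 0; i < cnt; i++)
		if (tab[i].addr == addr)
			return &tab[i];
	return NULL;
}

static const struct peer *find_by_id(const struct peer *tab, size_t cnt,
				     const uint8_t *id)
{
	for (size_t i = 0; i < cnt; i++)
		if (!memcmp(tab[i].id, id, ID_LEN))
			return &tab[i];
	return NULL;
}

/* Return 0 to accept the trial address without responding, otherwise
 * the address to suggest in a failure message.
 */
static uint32_t try_addr(const struct peer *tab, size_t cnt,
			 const uint8_t *id, uint32_t addr)
{
	const struct peer *p;

	assert(id != NULL);

	p = find_by_addr(tab, cnt, addr);
	if (p) {
		/* Same peer used this address in a previous session */
		if (!memcmp(p->id, id, ID_LEN))
			return 0;
		/* Taken by another node: suggest the next unused address */
		do {
			addr++;
		} while (addr == 0 || find_by_addr(tab, cnt, addr));
		return addr;
	}

	/* Peer was up before under a different address: suggest that one */
	p = find_by_id(tab, cnt, id);
	if (p)
		return p->addr;

	return 0;	/* address unused and peer unknown: accept */
}
```

A requester receiving a non-zero suggestion would apply it as its new trial address and restart the trial period, as the last bullet of the commit message describes.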
2018-03-22 20:42:51 +01:00
/* tipc_node_find_by_id - locate specified node object by its 128-bit id
* Note : this function is called only when a discovery request failed
* to find the node by its 32 - bit id , and is not time critical
*/
static struct tipc_node * tipc_node_find_by_id ( struct net * net , u8 * id )
{
struct tipc_net * tn = tipc_net ( net ) ;
struct tipc_node * n ;
bool found = false ;
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( n , & tn - > node_list , list ) {
read_lock_bh ( & n - > lock ) ;
if ( ! memcmp ( id , n - > peer_id , 16 ) & &
kref_get_unless_zero ( & n - > kref ) )
found = true ;
read_unlock_bh ( & n - > lock ) ;
if ( found )
break ;
}
rcu_read_unlock ( ) ;
return found ? n : NULL ;
}
2015-11-19 14:30:45 -05:00
static void tipc_node_read_lock ( struct tipc_node * n )
2021-03-11 10:33:23 +07:00
__acquires ( n - > lock )
2015-11-19 14:30:44 -05:00
{
read_lock_bh ( & n - > lock ) ;
}
2015-11-19 14:30:45 -05:00
static void tipc_node_read_unlock ( struct tipc_node * n )
2021-03-11 10:33:23 +07:00
__releases ( n - > lock )
2015-11-19 14:30:44 -05:00
{
read_unlock_bh ( & n - > lock ) ;
}
static void tipc_node_write_lock ( struct tipc_node * n )
2021-03-11 10:33:23 +07:00
__acquires ( n - > lock )
2015-11-19 14:30:44 -05:00
{
write_lock_bh ( & n - > lock ) ;
}
2017-01-24 13:00:43 +01:00
static void tipc_node_write_unlock_fast ( struct tipc_node * n )
2021-03-11 10:33:23 +07:00
__releases ( n - > lock )
2017-01-24 13:00:43 +01:00
{
write_unlock_bh ( & n - > lock ) ;
}
2015-11-19 14:30:44 -05:00
static void tipc_node_write_unlock ( struct tipc_node * n )
2021-03-11 10:33:23 +07:00
__releases ( n - > lock )
2015-11-19 14:30:44 -05:00
{
struct net * net = n - > net ;
u32 addr = 0 ;
u32 flags = n - > action_flags ;
u32 link_id = 0 ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the members of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
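The selection of the "local monitoring domain" described in the commit message above can be sketched as follows. This is a simplified user-space model: `isqrt()`, `local_domain_size()` and `select_local_domain()` are hypothetical names, the ring is assumed to be an ascending array that includes the selecting node itself, and the choice of remote domain heads is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer square root, rounded down */
static unsigned int isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* M = sqrt(N) - 1 members in the local monitoring domain */
static unsigned int local_domain_size(unsigned int n)
{
	unsigned int r = isqrt(n);

	return r ? r - 1 : 0;
}

/* Copy the next M nodes downstream from ring[self_idx] into out[],
 * wrapping around at the end of the ring. Returns the number copied.
 */
static unsigned int select_local_domain(const uint32_t *ring, unsigned int cnt,
					unsigned int self_idx, uint32_t *out,
					unsigned int out_cap)
{
	unsigned int m = local_domain_size(cnt);

	assert(cnt == 0 || self_idx < cnt);
	if (m > out_cap)
		m = out_cap;
	for (unsigned int i = 1; i <= m; i++)
		out[i - 1] = ring[(self_idx + i) % cnt];
	return m;
}
```

For N = 400 this yields 19 local domain members. The figure of 38 quoted in the commit message also counts remote domain heads, which this sketch leaves out.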
2016-06-13 20:46:22 -04:00
u32 bearer_id ;
2015-11-19 14:30:44 -05:00
struct list_head * publ_list ;
if ( likely ( ! flags ) ) {
write_unlock_bh ( & n - > lock ) ;
return ;
}
addr = n - > addr ;
link_id = n - > link_id ;
2016-06-13 20:46:22 -04:00
bearer_id = link_id & 0xffff ;
2015-11-19 14:30:44 -05:00
publ_list = & n - > publ_list ;
n - > action_flags & = ~ ( TIPC_NOTIFY_NODE_DOWN | TIPC_NOTIFY_NODE_UP |
TIPC_NOTIFY_LINK_DOWN | TIPC_NOTIFY_LINK_UP ) ;
write_unlock_bh ( & n - > lock ) ;
if ( flags & TIPC_NOTIFY_NODE_DOWN )
tipc: update a binding service via broadcast
Currently, updates to the binding table (adding a service binding to
the name table or withdrawing a service binding) are sent over replicast.
However, if we are scaling up clusters to > 100 nodes/containers this
method is less effective because of looping through nodes in a cluster one
by one.
It is worthwhile to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypasses the 'bulk', resulting in
disordered publications, or even a withdrawal arriving before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk message so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add sanity check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
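The four delivery rules in the commit message above can be sketched as a receiver-side classifier. This is a minimal user-space model, not the kernel's name distributor: `struct upd`, `struct rcv_state`, `classify()` and the verdict names are hypothetical, the sequence number is assumed to be 16 bits wide, and a deferred message is simply reported rather than queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum verdict { DELIVER, DEFER, DROP };

/* Hypothetical view of the bits carried by a binding table update */
struct upd {
	bool is_not_legacy;
	bool is_bulk;
	bool is_last_bulk;
	uint16_t seqno;		/* assumed width, for illustration only */
};

/* Hypothetical per-peer receive state */
struct rcv_state {
	bool open;		/* last bulk message has arrived */
	uint16_t rcv_nxt;	/* next expected number, starts at the synch point */
};

/* True if a precedes b in wrap-around sequence space */
static bool seq_less(uint16_t a, uint16_t b)
{
	return ((uint16_t)(a - b) & 0x8000u) != 0;
}

static enum verdict classify(struct rcv_state *st, const struct upd *u)
{
	assert(st && u);

	if (!u->is_not_legacy)
		return DELIVER;		/* legacy: always in order */
	if (u->is_bulk) {
		if (u->is_last_bulk)
			st->open = true;	/* broadcasts may now be accepted */
		return DELIVER;		/* bulk: exempt from seqno check */
	}
	if (!st->open)
		return DEFER;		/* bulk update not finished yet */
	if (seq_less(u->seqno, st->rcv_nxt))
		return DROP;		/* older than the synch point */
	if (u->seqno != st->rcv_nxt)
		return DEFER;		/* arrived ahead of its predecessors */
	st->rcv_nxt++;
	return DELIVER;
}
```

A real receiver would keep deferred messages in a queue and retry them after each successful delivery; the sketch only shows how each message is classified on arrival.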
2020-06-17 13:56:05 +07:00
tipc_publ_notify ( net , publ_list , addr , n - > capabilities ) ;
2015-11-19 14:30:44 -05:00
if ( flags & TIPC_NOTIFY_NODE_UP )
tipc: update a binding service via broadcast
Currently, updating binding table (add service binding to
name table/withdraw a service binding) is being sent over replicast.
However, if we are scaling up clusters to > 100 nodes/containers this
method is less affection because of looping through nodes in a cluster one
by one.
It is worth to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypass the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
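The four rules above can be sketched as a single receive-side decision. This is only an illustration of the described logic, not the code added by the commit: the struct layouts, the 16-bit sequence number width, the function names and the assumption that `next_seqno` starts at the agreed synch point are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a received name distribution message. */
struct named_msg {
	bool is_not_legacy;  /* set on every message sent by an upgraded node */
	bool is_bulk;        /* set on unicast 'bulk' messages */
	bool is_last_bulk;   /* set on the final message of a bulk update */
	uint16_t seqno;      /* only meaningful on new, non-bulk messages */
};

/* Hypothetical per-peer receive state. */
struct peer_state {
	bool bulk_done;      /* true once the last bulk message has arrived */
	uint16_t next_seqno; /* next sequence number expected (assumed to start
			      * at the agreed synch point) */
};

enum verdict { DELIVER, DEFER, DROP };

/* Wrap-safe "a comes before b" test for 16-bit sequence numbers. */
static bool seq_before(uint16_t a, uint16_t b)
{
	return (uint16_t)(a - b) >= 0x8000u;
}

enum verdict named_rcv_verdict(struct peer_state *p, const struct named_msg *m)
{
	/* Rule 4: legacy messages are delivered unconditionally. */
	if (!m->is_not_legacy)
		return DELIVER;

	/* Rule 3: bulk messages are exempt from the sequence number check. */
	if (m->is_bulk) {
		/* Rule 1: the last bulk message opens up for broadcasts. */
		if (m->is_last_bulk)
			p->bulk_done = true;
		return DELIVER;
	}

	/* Rule 1: hold broadcast/replicast until the bulk is complete. */
	if (!p->bulk_done)
		return DEFER;

	/* Sanity check from v2: drop anything older than what we expect. */
	if (seq_before(m->seqno, p->next_seqno))
		return DROP;

	/* Rule 2: a gap means disordering; wait for the missing message. */
	if (m->seqno != p->next_seqno)
		return DEFER;

	p->next_seqno++;
	return DELIVER;
}
```

A caller would keep deferred messages in a queue and re-run the check whenever `bulk_done` or `next_seqno` changes; that queue is left out of the sketch.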
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add sanity check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17 13:56:05 +07:00
tipc_named_node_up ( net , addr , n - > capabilities ) ;
2015-11-19 14:30:44 -05:00
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of 399 as before.
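The growth rates quoted above can be checked with a few lines of arithmetic. The helpers below are illustrative only and their names are hypothetical; they evaluate the approximate formulas from the text, not the exact neighbor count the algorithm arrives at. For N = 400 the per-node approximation gives 40, close to the figure of 38 quoted above; the exact number depends on how domains and heads are chosen, which this sketch does not model.

```c
/* Integer square root, rounded down; avoids a dependency on libm. */
unsigned long mon_isqrt(unsigned long n)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Full mesh: every node supervises every other node. */
unsigned long full_mesh_per_node(unsigned long n)
{
	return n - 1;
}

/* Full mesh: cluster-wide monitoring traffic grows at ~(N * (N - 1)). */
unsigned long full_mesh_total(unsigned long n)
{
	return n * (n - 1);
}

/* Ring supervision: per-node load grows at ~(2 * sqrt(N)). */
unsigned long ring_per_node_approx(unsigned long n)
{
	return 2 * mon_isqrt(n);
}

/* Ring supervision: cluster-wide traffic grows at ~(2 * N * sqrt(N)). */
unsigned long ring_total_approx(unsigned long n)
{
	return 2 * n * mon_isqrt(n);
}
```

With n = 400 these return 399 and 159600 for full mesh, against 40 and 16000 for the ring approximation.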
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the members of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
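The first two steps of the algorithm (sorted circular list, then the next M nodes downstream) can be sketched in user space as follows. This is a minimal illustration under stated assumptions, not the monitor implementation: the function names are hypothetical, the list is assumed to include the node itself, and M is taken as the rounded-down integer square root of the list length minus one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison of 32-bit node identities for qsort(). */
static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/* Integer square root, rounded down. */
static size_t isqrt_sz(size_t n)
{
	size_t r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/*
 * Sort 'nodes' (n entries, assumed to include 'self') in ascending order,
 * treat the result as circular, and copy the M = sqrt(n) - 1 entries that
 * follow 'self' into 'domain'. At most 'cap' entries are written.
 * Returns the number of entries written, or 0 if 'self' is not in the list.
 */
size_t local_domain(uint32_t *nodes, size_t n, uint32_t self,
		    uint32_t *domain, size_t cap)
{
	size_t self_idx, m, i;

	qsort(nodes, n, sizeof(*nodes), cmp_u32);

	for (self_idx = 0; self_idx < n; self_idx++)
		if (nodes[self_idx] == self)
			break;
	if (self_idx == n)
		return 0;

	m = isqrt_sz(n);
	m = m ? m - 1 : 0;
	if (m > cap)
		m = cap;

	/* Walk downstream from self, wrapping around the end of the list. */
	for (i = 0; i < m; i++)
		domain[i] = nodes[(self_idx + 1 + i) % n];
	return m;
}
```

Because every node sorts by the same key, two nodes that know the same set of peers derive consistent, overlapping domains without exchanging any coordination state. Remote head selection and generation numbers are not covered here.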
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
if ( flags & TIPC_NOTIFY_LINK_UP ) {
tipc_mon_peer_up ( net , addr , bearer_id ) ;
2015-11-19 14:30:44 -05:00
tipc_nametbl_publish ( net , TIPC_LINK_STATE , addr , addr ,
2018-03-29 23:20:41 +02:00
TIPC_NODE_SCOPE , link_id , link_id ) ;
2016-06-13 20:46:22 -04:00
}
if ( flags & TIPC_NOTIFY_LINK_DOWN ) {
tipc_mon_peer_down ( net , addr , bearer_id ) ;
2015-11-19 14:30:44 -05:00
tipc_nametbl_withdraw ( net , TIPC_LINK_STATE , addr ,
2018-03-29 23:20:43 +02:00
addr , link_id ) ;
2016-06-13 20:46:22 -04:00
}
2015-11-19 14:30:44 -05:00
}
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive path through the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go through
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against the hash mixes of all
the local namespaces. If it finds a match, this, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut the
lower protocol layers.
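The "100% certainty" check described in the list above amounts to a three-way match against every local namespace. The user-space sketch below illustrates that idea only; the struct, the function name and the way the hash mix is stored as one precomputed value per namespace are assumptions made for the example, and the length of 16 for the node id is likewise assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16 /* assumed length of a node identity */

/* Hypothetical record describing one namespace on the local host. */
struct local_ns {
	int net_id;                   /* cluster id used in this namespace */
	uint8_t node_id[NODE_ID_LEN]; /* node identity of this namespace */
	uint32_t hash_mix;            /* stand-in for the namespace's hash mix */
};

/*
 * Return the index of the local namespace that matches the peer's cluster
 * id, node id and presented proof, or -1 if the peer is not kernel local.
 */
int find_local_peer(const struct local_ns *tbl, size_t n, int net_id,
		    const uint8_t *peer_id, uint32_t proof)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].net_id != net_id)
			continue; /* different cluster */
		if (memcmp(tbl[i].node_id, peer_id, NODE_ID_LEN))
			continue; /* different node identity */
		if (tbl[i].hash_mix != proof)
			continue; /* sender does not know the secret */
		return (int)i;
	}
	return -1;
}
```

Only when all three comparisons succeed would the sender switch user data messages to the crossover path; any mismatch leaves the traffic on the regular link.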
Regarding traceability, we should note that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
static void tipc_node_assign_peer_net ( struct tipc_node * n , u32 hash_mixes )
{
int net_id = tipc_netid ( n - > net ) ;
struct tipc_net * tn_peer ;
struct net * tmp ;
u32 hash_chk ;
if ( n - > peer_net )
return ;
for_each_net_rcu ( tmp ) {
tn_peer = tipc_net ( tmp ) ;
if ( ! tn_peer )
continue ;
/* Integrity checking whether node exists in namespace or not */
if ( tn_peer - > net_id ! = net_id )
continue ;
if ( memcmp ( n - > peer_id , tn_peer - > node_id , NODE_ID_LEN ) )
continue ;
hash_chk = tipc_net_hash_mixes ( tmp , tn_peer - > random ) ;
if ( hash_mixes ^ hash_chk )
continue ;
n - > peer_net = tmp ;
n - > peer_hash_mix = hash_mixes ;
break ;
}
}
2019-11-08 12:05:11 +07:00
struct tipc_node * tipc_node_create ( struct net * net , u32 addr , u8 * peer_id ,
u16 capabilities , u32 hash_mixes ,
bool preliminary )
2006-01-02 19:04:38 +01:00
{
2015-01-09 15:27:05 +08:00
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
2015-11-19 14:30:47 -05:00
struct tipc_node * n , * temp_node ;
2018-07-10 01:07:35 +02:00
struct tipc_link * l ;
2019-11-08 12:05:09 +07:00
unsigned long intv ;
2018-07-10 01:07:35 +02:00
int bearer_id ;
2015-11-19 14:30:44 -05:00
int i ;
2006-01-02 19:04:38 +01:00
2015-01-09 15:27:05 +08:00
spin_lock_bh ( & tn - > node_list_lock ) ;
2019-11-08 12:05:09 +07:00
n = tipc_node_find ( net , addr ) ? :
tipc_node_find_by_id ( net , peer_id ) ;
2016-05-02 11:58:46 -04:00
if ( n ) {
2019-11-08 12:05:09 +07:00
if ( ! n - > preliminary )
goto update ;
if ( preliminary )
goto exit ;
/* A preliminary node becomes "real" now, refresh its data */
tipc_node_write_lock ( n ) ;
n - > preliminary = false ;
n - > addr = addr ;
hlist_del_rcu ( & n - > hash ) ;
hlist_add_head_rcu ( & n - > hash ,
& tn - > node_htable [ tipc_hashfn ( addr ) ] ) ;
list_del_rcu ( & n - > list ) ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
if ( n - > addr < temp_node - > addr )
break ;
}
list_add_tail_rcu ( & n - > list , & temp_node - > list ) ;
tipc_node_write_unlock_fast ( n ) ;
update :
2019-10-29 07:51:21 +07:00
if ( n - > peer_hash_mix ^ hash_mixes )
tipc_node_assign_peer_net ( n , hash_mixes ) ;
2018-07-18 19:50:06 +02:00
if ( n - > capabilities = = capabilities )
goto exit ;
2016-05-02 11:58:46 -04:00
/* Same node may come back with new capabilities */
2019-04-11 21:56:28 +02:00
tipc_node_write_lock ( n ) ;
2016-05-02 11:58:46 -04:00
n - > capabilities = capabilities ;
2018-07-10 01:07:35 +02:00
for ( bearer_id = 0 ; bearer_id < MAX_BEARERS ; bearer_id + + ) {
l = n - > links [ bearer_id ] . link ;
if ( l )
tipc_link_update_caps ( l , capabilities ) ;
}
2019-04-11 21:56:28 +02:00
tipc_node_write_unlock_fast ( n ) ;
2019-03-19 18:49:49 +07:00
/* Calculate cluster capabilities */
tn - > capabilities = TIPC_NODE_CAPABILITIES ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
tn - > capabilities & = temp_node - > capabilities ;
}
2019-10-29 07:51:21 +07:00
2019-11-21 10:01:09 +07:00
tipc_bcast_toggle_rcast ( net ,
( tn - > capabilities & TIPC_BCAST_RCAST ) ) ;
2015-02-03 08:59:19 -05:00
goto exit ;
2016-05-02 11:58:46 -04:00
}
2015-11-19 14:30:47 -05:00
n = kzalloc ( sizeof ( * n ) , GFP_ATOMIC ) ;
if ( ! n ) {
2012-06-29 00:16:37 -04:00
pr_warn ( " Node creation failed, no memory \n " ) ;
2015-02-03 08:59:19 -05:00
goto exit ;
2006-06-25 23:52:17 -07:00
}
2019-11-08 12:05:09 +07:00
tipc_nodeid2string ( n - > peer_id_string , peer_id ) ;
2019-11-08 12:05:11 +07:00
# ifdef CONFIG_TIPC_CRYPTO
if ( unlikely ( tipc_crypto_start ( & n - > crypto_rx , net , n ) ) ) {
pr_warn ( " Failed to start crypto RX(%s)! \n " , n - > peer_id_string ) ;
kfree ( n ) ;
n = NULL ;
goto exit ;
}
# endif
2015-11-19 14:30:47 -05:00
n - > addr = addr ;
2019-11-08 12:05:09 +07:00
n - > preliminary = preliminary ;
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
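The receiver side of the trial protocol can be sketched as one decision function. This is one possible reading of the rules above, not the kernel implementation: the table layout, the names and in particular the way a replacement address is generated (counting upwards until a free value is found) are assumptions made for illustration. The commit message does not say how new addresses are chosen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ID_LEN 16 /* 128-bit node identifier */

/* Hypothetical record of a peer this node has seen before. */
struct known_node {
	uint8_t id[ID_LEN];
	uint32_t addr;
};

enum trial_reply {
	TRIAL_ACCEPT, /* stay silent, the requester keeps its trial address */
	TRIAL_FAIL    /* answer with a failure message carrying *suggested */
};

static bool addr_in_use(const struct known_node *tbl, size_t n, uint32_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].addr == addr)
			return true;
	return false;
}

/* Placeholder generator: first non-zero address after 'start' not in use. */
static uint32_t pick_unused(const struct known_node *tbl, size_t n,
			    uint32_t start)
{
	uint32_t a = start;

	do {
		a++;
	} while (a == 0 || addr_in_use(tbl, n, a));
	return a;
}

enum trial_reply check_trial(const struct known_node *tbl, size_t n,
			     const uint8_t *id, uint32_t trial_addr,
			     uint32_t *suggested)
{
	const struct known_node *by_id = NULL, *by_addr = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!memcmp(tbl[i].id, id, ID_LEN))
			by_id = &tbl[i];
		if (tbl[i].addr == trial_addr)
			by_addr = &tbl[i];
	}

	if (by_addr) {
		/* Used by the very same peer in a previous session: accept. */
		if (by_addr == by_id)
			return TRIAL_ACCEPT;
		/* Taken by some other node: suggest a new, unused address. */
		*suggested = pick_unused(tbl, n, trial_addr);
		return TRIAL_FAIL;
	}

	/* Same node was up before with a different address: suggest that. */
	if (by_id) {
		*suggested = by_id->addr;
		return TRIAL_FAIL;
	}

	/* Address unused and peer unknown: accept by not responding. */
	return TRIAL_ACCEPT;
}
```

On the requesting side, a `TRIAL_FAIL` reply would mean applying the suggested value as the new trial address and restarting the trial period.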
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:51 +01:00
memcpy ( & n - > peer_id , peer_id , 16 ) ;
2015-11-19 14:30:47 -05:00
n - > net = net ;
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network
namespace pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with that of the receiving
node/namespace, and follow the regular socket receive path through the
receiving node. This technique gives us a throughput similar to the
node internal throughput, several times larger than if we let the
traffic go through the full network stacks. As a comparison, max
throughput for 64k messages is four times larger than TCP throughput
for the same type of traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey the policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against the hash mixes of all
the local namespaces. If it finds a match, that match, along with a
matching node id and cluster id, is deemed sufficient proof that the
peer node in question is in a local namespace, and a wormhole can be
opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to
shortcut the lower protocol layers.
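The proof check described in the list above (hash mix, node id and cluster id must all match a local namespace) can be sketched as follows. This is a simplified illustration under stated assumptions: `local_ns` and `find_local_peer` are hypothetical names, the array stands in for the kernel's list of network namespaces, and the proof is taken to be the sender's hash mix carried as a plain 32-bit value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16	/* 128-bit node identity */

/* Hypothetical view of one namespace living in the local kernel. */
struct local_ns {
	uint32_t hash_mix;
	uint32_t cluster_id;
	uint8_t node_id[NODE_ID_LEN];
};

/*
 * Return the local namespace matching a discovery message, or NULL if
 * the sender must be treated as an ordinary remote node.
 */
static const struct local_ns *find_local_peer(const struct local_ns *ns,
					      size_t n, uint32_t proof,
					      uint32_t cluster_id,
					      const uint8_t *node_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* All three criteria must hold before a shortcut is allowed. */
		if (ns[i].hash_mix != proof)
			continue;
		if (ns[i].cluster_id != cluster_id)
			continue;
		if (memcmp(ns[i].node_id, node_id, NODE_ID_LEN))
			continue;
		return &ns[i];
	}
	return NULL;
}
```

A NULL result corresponds to leaving the peer's namespace pointer unset, so traffic to that peer keeps using the regular network path.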
Regarding traceability, we should note that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This also holds for this mechanism; by activating tcpdump on
the involved nodes' loopback interfaces, their inter-namespace
messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
n - > peer_net = NULL ;
n - > peer_hash_mix = 0 ;
/* Assign kernel local namespace if exists */
tipc_node_assign_peer_net ( n , hash_mixes ) ;
2015-11-19 14:30:47 -05:00
n - > capabilities = capabilities ;
kref_init ( & n - > kref ) ;
rwlock_init ( & n - > lock ) ;
INIT_HLIST_NODE ( & n - > hash ) ;
INIT_LIST_HEAD ( & n - > list ) ;
INIT_LIST_HEAD ( & n - > publ_list ) ;
INIT_LIST_HEAD ( & n - > conn_sks ) ;
skb_queue_head_init ( & n - > bc_entry . namedq ) ;
skb_queue_head_init ( & n - > bc_entry . inputq1 ) ;
__skb_queue_head_init ( & n - > bc_entry . arrvq ) ;
skb_queue_head_init ( & n - > bc_entry . inputq2 ) ;
2015-11-19 14:30:44 -05:00
for ( i = 0 ; i < MAX_BEARERS ; i + + )
2015-11-19 14:30:47 -05:00
spin_lock_init ( & n - > links [ i ] . lock ) ;
n - > state = SELF_DOWN_PEER_LEAVING ;
2018-06-29 13:23:41 +02:00
n - > delete_at = jiffies + msecs_to_jiffies ( NODE_CLEANUP_AFTER ) ;
2015-11-19 14:30:47 -05:00
n - > signature = INVALID_NODE_SIG ;
n - > active_links [ 0 ] = INVALID_BEARER_ID ;
n - > active_links [ 1 ] = INVALID_BEARER_ID ;
2019-11-08 12:05:09 +07:00
n - > bc_entry . link = NULL ;
2015-11-19 14:30:47 -05:00
tipc_node_get ( n ) ;
2017-10-30 14:06:45 -07:00
timer_setup ( & n - > timer , tipc_node_timeout , 0 ) ;
2019-11-08 12:05:09 +07:00
/* Start a slow timer anyway, crypto needs it */
n - > keepalive_intv = 10000 ;
intv = jiffies + msecs_to_jiffies ( n - > keepalive_intv ) ;
if ( ! mod_timer ( & n - > timer , intv ) )
tipc_node_get ( n ) ;
2016-02-10 16:14:57 -05:00
hlist_add_head_rcu ( & n - > hash , & tn - > node_htable [ tipc_hashfn ( addr ) ] ) ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
if ( n - > addr < temp_node - > addr )
break ;
}
list_add_tail_rcu ( & n - > list , & temp_node - > list ) ;
2019-03-19 18:49:49 +07:00
/* Calculate cluster capabilities */
tn - > capabilities = TIPC_NODE_CAPABILITIES ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
tn - > capabilities & = temp_node - > capabilities ;
}
2019-11-21 10:01:09 +07:00
tipc_bcast_toggle_rcast ( net , ( tn - > capabilities & TIPC_BCAST_RCAST ) ) ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_create ( n , true , " " ) ;
2015-02-03 08:59:19 -05:00
exit :
2015-01-09 15:27:05 +08:00
spin_unlock_bh ( & tn - > node_list_lock ) ;
2015-11-19 14:30:47 -05:00
return n ;
2006-01-02 19:04:38 +01:00
}
2015-07-16 16:54:29 -04:00
static void tipc_node_calculate_timer ( struct tipc_node * n , struct tipc_link * l )
{
2015-11-19 14:30:46 -05:00
unsigned long tol = tipc_link_tolerance ( l ) ;
2015-07-16 16:54:29 -04:00
unsigned long intv = ( ( tol / 4 ) > 500 ) ? 500 : tol / 4 ;
/* Link with lowest tolerance determines timer interval */
2016-06-08 12:00:05 -04:00
if ( intv < n - > keepalive_intv )
n - > keepalive_intv = intv ;
2015-07-16 16:54:29 -04:00
2016-06-08 12:00:05 -04:00
/* Ensure link's abort limit corresponds to current tolerance */
tipc_link_set_abort_limit ( l , tol / n - > keepalive_intv ) ;
2015-07-16 16:54:29 -04:00
}
2018-06-29 13:23:41 +02:00
static void tipc_node_delete_from_list ( struct tipc_node * node )
2006-01-02 19:04:38 +01:00
{
tipc: add automatic session key exchange
With support from the master key option in the previous commit, it
becomes easy to make frequent updates/exchanges of session keys between
authenticated cluster nodes.
Basically, there are two situations where the key exchange will take
place:
- When a new node joins the cluster (with the master key), it will need
to get its peer's TX key, so that it is able to decrypt further
messages from that peer.
- When a new session key is generated (either by manual user setting or
by the later automatic rekeying feature), the key will be distributed
to all peer nodes in the cluster.
A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
/KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
using the master key before sending out. Upon receipt of the message it
will be decrypted in the same way as regular messages, then attached as
the sender's RX key in the receiver node.
In this way, the key exchange is made reliable by the link layer, while
its security, integrity and authenticity are provided by the crypto
layer.
Also, forward security is easily achieved by the user actively changing
the master key, but this should not be required very frequently.
The key exchange feature is independent of the presence of a master
key. Note, however, that the master key is still needed for new nodes
to be able to join the cluster. The feature is also optional, and can be
turned off/on via the sysfs: 'net/tipc/key_exchange_enabled' [default 1:
enabled].
Backward compatibility is guaranteed because, on nodes that do not have
master key support, any key exchange message using the master key
(i.e. tx_key = 0) will be promptly discarded at the message validation
step. In other words, the key exchange feature will be automatically
disabled for those nodes.
v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
error in node.c. The function only exists when built with the TIPC
"CONFIG_TIPC_CRYPTO" option.
v3: use 'info->extack' for a message emitted due to netlink operations
instead (- David's comment).
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 08:17:28 +07:00
# ifdef CONFIG_TIPC_CRYPTO
tipc_crypto_key_flush ( node - > crypto_rx ) ;
# endif
2015-03-26 18:10:24 +08:00
list_del_rcu ( & node - > list ) ;
hlist_del_rcu ( & node - > hash ) ;
2016-02-24 11:10:48 -05:00
tipc_node_put ( node ) ;
2018-06-29 13:23:41 +02:00
}
static void tipc_node_delete ( struct tipc_node * node )
{
2018-12-19 09:17:59 +07:00
trace_tipc_node_delete ( node , true , " " ) ;
2018-06-29 13:23:41 +02:00
tipc_node_delete_from_list ( node ) ;
2016-02-24 11:10:48 -05:00
del_timer_sync ( & node - > timer ) ;
tipc_node_put ( node ) ;
2006-01-02 19:04:38 +01:00
}
2015-01-09 15:27:05 +08:00
void tipc_node_stop ( struct net * net )
2014-03-27 12:54:36 +08:00
{
2016-02-24 11:10:48 -05:00
struct tipc_net * tn = tipc_net ( net ) ;
2014-03-27 12:54:36 +08:00
struct tipc_node * node , * t_node ;
2015-01-09 15:27:05 +08:00
spin_lock_bh ( & tn - > node_list_lock ) ;
2016-02-24 11:10:48 -05:00
list_for_each_entry_safe ( node , t_node , & tn - > node_list , list )
tipc_node_delete ( node ) ;
2015-01-09 15:27:05 +08:00
spin_unlock_bh ( & tn - > node_list_lock ) ;
2014-03-27 12:54:36 +08:00
}
2015-11-19 14:30:42 -05:00
void tipc_node_subscribe ( struct net * net , struct list_head * subscr , u32 addr )
{
struct tipc_node * n ;
if ( in_own_node ( net , addr ) )
return ;
n = tipc_node_find ( net , addr ) ;
if ( ! n ) {
pr_warn ( " Node subscribe rejected, unknown node 0x%x \n " , addr ) ;
return ;
}
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( n ) ;
2015-11-19 14:30:42 -05:00
list_add_tail ( subscr , & n - > publ_list ) ;
2017-01-24 13:00:43 +01:00
tipc_node_write_unlock_fast ( n ) ;
2015-11-19 14:30:42 -05:00
tipc_node_put ( n ) ;
}
void tipc_node_unsubscribe ( struct net * net , struct list_head * subscr , u32 addr )
{
struct tipc_node * n ;
if ( in_own_node ( net , addr ) )
return ;
n = tipc_node_find ( net , addr ) ;
if ( ! n ) {
pr_warn ( " Node unsubscribe rejected, unknown node 0x%x \n " , addr ) ;
return ;
}
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( n ) ;
2015-11-19 14:30:42 -05:00
list_del_init ( subscr ) ;
2017-01-24 13:00:43 +01:00
tipc_node_write_unlock_fast ( n ) ;
2015-11-19 14:30:42 -05:00
tipc_node_put ( n ) ;
}
2015-01-09 15:27:05 +08:00
int tipc_node_add_conn ( struct net * net , u32 dnode , u32 port , u32 peer_port )
tipc: use message to abort connections when losing contact to node
In the current implementation, each 'struct tipc_node' instance keeps
a linked list of those ports/sockets that are connected to the node
represented by that struct. The purpose of this is to let the node
object know which sockets to alert when it loses contact with its peer
node, i.e., which sockets need to have their connections aborted.
This entails an unwanted direct reference from the node structure
back to the port/socket structure, and a need to grab port_lock
when we have to make an upcall to the port. We want to get rid of
this unnecessary BH entry point into the socket, and also eliminate
its use of port_lock.
In this commit, we instead let the node struct keep a list of "connected
socket" structs, each of which represents a connected socket, but is
allocated independently by the node at the moment of connection. If
the node loses contact with its peer node, the list is traversed, and
a "connection abort" message is created for each entry in the list. The
message is sent to its respective connected socket using the ordinary
data path, and the receiving socket aborts its connections upon reception
of the message.
This enables us to get rid of the direct reference from 'struct node' to
'struct port', and another unwanted BH access point to the latter.
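The scheme described above, where the node owns small connection records and turns each one into an abort message when contact is lost, can be sketched in plain C. This is a user-space illustration, not the kernel code: `node_sketch`, `sock_conn`, `abort_msg`, `node_add_conn` and `node_lost_contact` are hypothetical names, a singly linked list replaces the kernel list helpers, and the messages are written to a caller-supplied array instead of being sent over the data path.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One record per connected socket, allocated and owned by the node. */
struct sock_conn {
	uint32_t peer_node;
	uint32_t port;
	uint32_t peer_port;
	struct sock_conn *next;
};

struct node_sketch {
	struct sock_conn *conn_sks;	/* head of the connection list */
};

/* Stand-in for a "connection abort" message addressed to a local port. */
struct abort_msg {
	uint32_t dest_port;	/* local socket that must abort */
	uint32_t orig_node;	/* peer node that was lost */
	uint32_t orig_port;	/* peer port of the connection */
};

/* Register a connection; returns 0 on success, -1 if allocation fails. */
static int node_add_conn(struct node_sketch *n, uint32_t peer_node,
			 uint32_t port, uint32_t peer_port)
{
	struct sock_conn *c = malloc(sizeof(*c));

	if (!c)
		return -1;
	c->peer_node = peer_node;
	c->port = port;
	c->peer_port = peer_port;
	c->next = n->conn_sks;	/* prepend; order is irrelevant here */
	n->conn_sks = c;
	return 0;
}

/*
 * Contact lost: drain the list, producing one abort message per record.
 * Returns the number of messages written to out (at most max).
 */
static size_t node_lost_contact(struct node_sketch *n, struct abort_msg *out,
				size_t max)
{
	size_t cnt = 0;

	while (n->conn_sks && cnt < max) {
		struct sock_conn *c = n->conn_sks;

		n->conn_sks = c->next;
		out[cnt].dest_port = c->port;
		out[cnt].orig_node = c->peer_node;
		out[cnt].orig_port = c->peer_port;
		cnt++;
		free(c);
	}
	return cnt;
}
```

Because the node holds only these records, it never dereferences a socket; the socket learns about the lost peer by receiving the message like any other incoming data.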
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 18:09:08 -04:00
{
struct tipc_node * node ;
struct tipc_sock_conn * conn ;
2015-03-26 18:10:24 +08:00
int err = 0 ;
2014-08-22 18:09:08 -04:00
2015-01-09 15:27:10 +08:00
if ( in_own_node ( net , dnode ) )
2014-08-22 18:09:08 -04:00
return 0 ;
2015-01-09 15:27:05 +08:00
node = tipc_node_find ( net , dnode ) ;
2014-08-22 18:09:08 -04:00
if ( ! node ) {
pr_warn ( " Connecting sock to node 0x%x failed \n " , dnode ) ;
return - EHOSTUNREACH ;
}
conn = kmalloc ( sizeof ( * conn ) , GFP_ATOMIC ) ;
2015-03-26 18:10:24 +08:00
if ( ! conn ) {
err = - EHOSTUNREACH ;
goto exit ;
}
2014-08-22 18:09:08 -04:00
conn - > peer_node = dnode ;
conn - > port = port ;
conn - > peer_port = peer_port ;
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( node ) ;
2014-08-22 18:09:08 -04:00
list_add_tail ( & conn - > list , & node - > conn_sks ) ;
2015-11-19 14:30:44 -05:00
tipc_node_write_unlock ( node ) ;
2015-03-26 18:10:24 +08:00
exit :
tipc_node_put ( node ) ;
return err ;
2014-08-22 18:09:08 -04:00
}
2015-01-09 15:27:05 +08:00
void tipc_node_remove_conn ( struct net * net , u32 dnode , u32 port )
2014-08-22 18:09:08 -04:00
{
struct tipc_node * node ;
struct tipc_sock_conn * conn , * safe ;
2015-01-09 15:27:10 +08:00
if ( in_own_node ( net , dnode ) )
2014-08-22 18:09:08 -04:00
return ;
2015-01-09 15:27:05 +08:00
node = tipc_node_find ( net , dnode ) ;
2014-08-22 18:09:08 -04:00
if ( ! node )
return ;
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( node ) ;
2014-08-22 18:09:08 -04:00
list_for_each_entry_safe ( conn , safe , & node - > conn_sks , list ) {
if ( port ! = conn - > port )
continue ;
list_del ( & conn - > list ) ;
kfree ( conn ) ;
}
2015-11-19 14:30:44 -05:00
tipc_node_write_unlock ( node ) ;
2015-03-26 18:10:24 +08:00
tipc_node_put ( node ) ;
2014-08-22 18:09:08 -04:00
}
2018-06-29 13:23:41 +02:00
static void tipc_node_clear_links ( struct tipc_node * node )
{
int i ;
for ( i = 0 ; i < MAX_BEARERS ; i + + ) {
struct tipc_link_entry * le = & node - > links [ i ] ;
if ( le - > link ) {
kfree ( le - > link ) ;
le - > link = NULL ;
node - > link_cnt - - ;
}
}
}
/* tipc_node_cleanup - delete nodes that have had no
* active links for NODE_CLEANUP_AFTER time
*/
tipc: fix lockdep warning during node delete
We see the following lockdep warning:
[ 2284.078521] ======================================================
[ 2284.078604] WARNING: possible circular locking dependency detected
[ 2284.078604] 4.19.0+ #42 Tainted: G E
[ 2284.078604] ------------------------------------------------------
[ 2284.078604] rmmod/254 is trying to acquire lock:
[ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
[ 2284.078604]
[ 2284.078604] but task is already holding lock:
[ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
[ 2284.078604]
[ 2284.078604] which lock already depends on the new lock.
[ 2284.078604]
[ 2284.078604]
[ 2284.078604] the existing dependency chain (in reverse order) is:
[ 2284.078604]
[ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
[ 2284.078604] tipc_node_timeout+0x20a/0x330 [tipc]
[ 2284.078604] call_timer_fn+0xa1/0x280
[ 2284.078604] run_timer_softirq+0x1f2/0x4d0
[ 2284.078604] __do_softirq+0xfc/0x413
[ 2284.078604] irq_exit+0xb5/0xc0
[ 2284.078604] smp_apic_timer_interrupt+0xac/0x210
[ 2284.078604] apic_timer_interrupt+0xf/0x20
[ 2284.078604] default_idle+0x1c/0x140
[ 2284.078604] do_idle+0x1bc/0x280
[ 2284.078604] cpu_startup_entry+0x19/0x20
[ 2284.078604] start_secondary+0x187/0x1c0
[ 2284.078604] secondary_startup_64+0xa4/0xb0
[ 2284.078604]
[ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
[ 2284.078604] del_timer_sync+0x34/0xa0
[ 2284.078604] tipc_node_delete+0x1a/0x40 [tipc]
[ 2284.078604] tipc_node_stop+0xcb/0x190 [tipc]
[ 2284.078604] tipc_net_stop+0x154/0x170 [tipc]
[ 2284.078604] tipc_exit_net+0x16/0x30 [tipc]
[ 2284.078604] ops_exit_list.isra.8+0x36/0x70
[ 2284.078604] unregister_pernet_operations+0x87/0xd0
[ 2284.078604] unregister_pernet_subsys+0x1d/0x30
[ 2284.078604] tipc_exit+0x11/0x6f2 [tipc]
[ 2284.078604] __x64_sys_delete_module+0x1df/0x240
[ 2284.078604] do_syscall_64+0x66/0x460
[ 2284.078604] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2284.078604]
[ 2284.078604] other info that might help us debug this:
[ 2284.078604]
[ 2284.078604] Possible unsafe locking scenario:
[ 2284.078604]
[ 2284.078604] CPU0 CPU1
[ 2284.078604] ---- ----
[ 2284.078604] lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604] lock((&n->timer)#2);
[ 2284.078604] lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604] lock((&n->timer)#2);
[ 2284.078604]
[ 2284.078604] *** DEADLOCK ***
[ 2284.078604]
[ 2284.078604] 3 locks held by rmmod/254:
[ 2284.078604] #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
[ 2284.078604] #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
[ 2284.078604] #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
[...]
The reason is that the node timer handler sometimes needs to delete a
node which has been disconnected for too long. To do this, it grabs
the lock 'node_list_lock', which may at the same time be held by the
generic node cleanup function, tipc_node_stop(), during module removal.
Since the latter is calling del_timer_sync() inside the same lock, we
have a potential deadlock.
We fix this by letting the timer cleanup function use spin_trylock()
instead of just spin_lock(); when it fails to grab the lock, it
just returns so that the timer handler can terminate its execution.
This is safe to do, since tipc_node_stop() is about to delete both
the timer and the node instance anyway.
Fixes: 6a939f365bdb ("tipc: Auto removal of peer down node instance")
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-26 12:26:14 -05:00
static bool tipc_node_cleanup ( struct tipc_node * peer )
2018-06-29 13:23:41 +02:00
{
2019-03-19 18:49:49 +07:00
struct tipc_node * temp_node ;
2018-06-29 13:23:41 +02:00
struct tipc_net * tn = tipc_net ( peer - > net ) ;
bool deleted = false ;
2018-11-26 12:26:14 -05:00
/* If lock held by tipc_node_stop() the node will be deleted anyway */
if ( ! spin_trylock_bh ( & tn - > node_list_lock ) )
return false ;
2018-06-29 13:23:41 +02:00
tipc_node_write_lock ( peer ) ;
if ( ! node_is_up ( peer ) & & time_after ( jiffies , peer - > delete_at ) ) {
tipc_node_clear_links ( peer ) ;
tipc_node_delete_from_list ( peer ) ;
deleted = true ;
}
tipc_node_write_unlock ( peer ) ;
2019-03-19 18:49:49 +07:00
2019-11-06 13:26:09 +07:00
if ( ! deleted ) {
spin_unlock_bh ( & tn - > node_list_lock ) ;
return deleted ;
}
2019-03-19 18:49:49 +07:00
/* Calculate cluster capabilities */
tn - > capabilities = TIPC_NODE_CAPABILITIES ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
tn - > capabilities & = temp_node - > capabilities ;
}
2019-11-21 10:01:09 +07:00
tipc_bcast_toggle_rcast ( peer - > net ,
( tn - > capabilities & TIPC_BCAST_RCAST ) ) ;
2018-06-29 13:23:41 +02:00
spin_unlock_bh ( & tn - > node_list_lock ) ;
return deleted ;
}
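The trylock approach used in tipc_node_cleanup() can be shown in isolation with a user-space analogue. This is a minimal sketch, assuming POSIX threads in place of kernel spinlocks; `list_lock`, `node_count` and `timer_cleanup()` are hypothetical stand-ins, not kernel symbols. The point is that the timer-side path never blocks on the list lock: if the teardown path already holds it, the timer path gives up and reports that nothing was deleted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the node list lock and the list contents. */
pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
int node_count = 3;

/*
 * Timer-side cleanup. Returns true only if an entry was removed.
 * If the lock is busy, the teardown path is assumed to be deleting
 * everything anyway, so we back off instead of waiting.
 */
bool timer_cleanup(bool expired)
{
	bool deleted = false;

	if (pthread_mutex_trylock(&list_lock) != 0)
		return false;

	if (expired && node_count > 0) {
		node_count--;
		deleted = true;
	}
	pthread_mutex_unlock(&list_lock);
	return deleted;
}
```

Backing off is only acceptable because the lock holder is about to perform the same deletion; in a design where the other holder does unrelated work, the timer would have to retry later.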
2015-07-16 16:54:29 -04:00
/* tipc_node_timeout - handle expiration of node timer
*/
2017-10-30 14:06:45 -07:00
static void tipc_node_timeout ( struct timer_list * t )
2015-07-16 16:54:29 -04:00
{
2017-10-30 14:06:45 -07:00
struct tipc_node * n = from_timer ( n , t , timer ) ;
2015-07-30 18:24:23 -04:00
struct tipc_link_entry * le ;
2015-07-16 16:54:29 -04:00
struct sk_buff_head xmitq ;
2018-06-28 22:39:25 +02:00
int remains = n - > link_cnt ;
2015-07-16 16:54:29 -04:00
int bearer_id ;
int rc = 0 ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_timeout ( n , false , " " ) ;
2018-06-29 13:23:41 +02:00
if ( ! node_is_up ( n ) & & tipc_node_cleanup ( n ) ) {
/* Drop the timer's reference to the node */
tipc_node_put ( n ) ;
return ;
}
2019-11-08 12:05:11 +07:00
# ifdef CONFIG_TIPC_CRYPTO
/* Take any crypto key related actions first */
tipc_crypto_timeout ( n - > crypto_rx ) ;
# endif
2015-07-16 16:54:29 -04:00
__skb_queue_head_init ( & xmitq ) ;
2018-12-06 09:00:09 +07:00
/* Initialize the node interval to a large value (10 seconds); it is then
* recalculated from the lowest link tolerance
*/
tipc_node_read_lock ( n ) ;
n - > keepalive_intv = 10000 ;
tipc_node_read_unlock ( n ) ;
2018-06-28 22:39:25 +02:00
for ( bearer_id = 0 ; remains & & ( bearer_id < MAX_BEARERS ) ; bearer_id + + ) {
2015-11-19 14:30:44 -05:00
tipc_node_read_lock ( n ) ;
2015-07-30 18:24:23 -04:00
le = & n - > links [ bearer_id ] ;
if ( le - > link ) {
2018-06-28 22:39:25 +02:00
spin_lock_bh ( & le - > lock ) ;
2015-07-16 16:54:29 -04:00
/* Link tolerance may change asynchronously: */
2015-07-30 18:24:23 -04:00
tipc_node_calculate_timer ( n , le - > link ) ;
rc = tipc_link_timeout ( le - > link , & xmitq ) ;
2018-06-28 22:39:25 +02:00
spin_unlock_bh ( & le - > lock ) ;
remains - - ;
2015-07-16 16:54:29 -04:00
}
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( n ) ;
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( n - > net , bearer_id , & xmitq , & le - > maddr , n ) ;
2015-07-30 18:24:23 -04:00
if ( rc & TIPC_LINK_DOWN_EVT )
tipc_node_link_down ( n , bearer_id , false ) ;
2015-07-16 16:54:29 -04:00
}
2016-06-08 12:00:05 -04:00
mod_timer ( & n - > timer , jiffies + msecs_to_jiffies ( n - > keepalive_intv ) ) ;
2015-07-16 16:54:29 -04:00
}
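The interval handling in tipc_node_timeout() follows a simple pattern: start from a large default, then let every link that is present lower it, and stop scanning once all known links have been visited. The sketch below shows that pattern on its own. It is an illustration under stated assumptions: `struct link_slot`, `MAX_SLOTS` and `next_timer_interval()` are hypothetical, and the per-link candidate interval is taken as a plain input rather than derived from the link tolerance as the kernel does.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 3		/* illustrative; not TIPC's MAX_BEARERS */
#define DEFAULT_INTV_MS 10000u	/* large starting value (10 seconds) */

/* Hypothetical per-bearer slot. */
struct link_slot {
	int in_use;		/* non-zero if a link exists in this slot */
	unsigned int intv_ms;	/* candidate timer interval for this link */
};

/*
 * Return the interval for the next timer run: the smallest candidate
 * among the links present, or the default if there are none.
 * 'link_cnt' lets the loop stop early once every link has been seen.
 */
unsigned int next_timer_interval(const struct link_slot *slots, int link_cnt)
{
	unsigned int intv = DEFAULT_INTV_MS;
	int remains = link_cnt;
	int i;

	for (i = 0; remains && i < MAX_SLOTS; i++) {
		if (!slots[i].in_use)
			continue;
		if (slots[i].intv_ms < intv)
			intv = slots[i].intv_ms;
		remains--;
	}
	return intv;
}
```

Resetting to the default on every run means the interval can grow again after a link with a short interval has gone away.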
2006-01-02 19:04:38 +01:00
/**
2015-07-30 18:24:23 -04:00
* __tipc_node_link_up - handle addition of link
2020-11-29 10:32:47 -08:00
* @ n : target tipc_node
* @ bearer_id : id of the bearer
* @ xmitq : queue for messages to be xmited on
2015-07-30 18:24:23 -04:00
* Node lock must be held by caller
2006-01-02 19:04:38 +01:00
* Link becomes active ( alone or shared ) or standby , depending on its priority .
*/
2015-07-30 18:24:23 -04:00
static void __tipc_node_link_up ( struct tipc_node * n , int bearer_id ,
struct sk_buff_head * xmitq )
2006-01-02 19:04:38 +01:00
{
2015-07-16 16:54:22 -04:00
int * slot0 = & n - > active_links [ 0 ] ;
int * slot1 = & n - > active_links [ 1 ] ;
2015-07-30 18:24:19 -04:00
struct tipc_link * ol = node_active_link ( n , 0 ) ;
struct tipc_link * nl = n - > links [ bearer_id ] . link ;
2015-07-16 16:54:19 -04:00
2016-05-11 19:15:45 -04:00
if ( ! nl | | tipc_link_is_up ( nl ) )
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementation of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potential callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unfortunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
return ;
tipc_link_fsm_evt ( nl , LINK_ESTABLISH_EVT ) ;
if ( ! tipc_link_is_up ( nl ) )
2015-07-30 18:24:23 -04:00
return ;
2015-07-16 16:54:19 -04:00
n - > working_links + + ;
n - > action_flags | = TIPC_NOTIFY_LINK_UP ;
2015-11-19 14:30:46 -05:00
n - > link_id = tipc_link_id ( nl ) ;
2015-07-30 18:24:19 -04:00
/* Leave room for tunnel header when returning 'mtu' to users: */
2019-11-08 12:05:11 +07:00
n - > links [ bearer_id ] . mtu = tipc_link_mss ( nl ) ;
2014-10-20 14:44:25 +08:00
2015-07-30 18:24:15 -04:00
tipc_bearer_add_dest ( n - > net , bearer_id , n - > addr ) ;
tipc: simplify bearer level broadcast
Until now, we have been keeping track of the exact set of broadcast
destinations through the helper structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.
A closer analysis reveals that it is unnecessary maintaining this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.
This commit introduces these changes.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:42 -04:00
tipc_bcast_inc_bearer_dst_cnt ( n - > net , bearer_id ) ;
2015-07-30 18:24:15 -04:00
2015-01-22 17:10:31 +01:00
pr_debug ( " Established link <%s> on network plane %c \n " ,
2015-11-19 14:30:46 -05:00
tipc_link_name ( nl ) , tipc_link_plane ( nl ) ) ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_link_up ( n , true , " " ) ;
2007-02-09 23:25:21 +09:00
2016-04-15 13:33:07 -04:00
/* Ensure that a STATE message goes first */
tipc_link_build_state_msg ( nl , xmitq ) ;
2015-07-30 18:24:19 -04:00
/* First link? => give it both slots */
if ( ! ol ) {
2015-07-16 16:54:22 -04:00
* slot0 = bearer_id ;
* slot1 = bearer_id ;
2015-10-22 08:51:41 -04:00
tipc_node_fsm_evt ( n , SELF_ESTABL_CONTACT_EVT ) ;
n - > action_flags | = TIPC_NOTIFY_NODE_UP ;
2016-04-28 20:16:08 -04:00
tipc_link_set_active ( nl , true ) ;
2015-10-22 08:51:42 -04:00
tipc_bcast_add_peer ( n - > net , nl , xmitq ) ;
2015-07-16 16:54:19 -04:00
return ;
2006-01-02 19:04:38 +01:00
}
2015-07-16 16:54:22 -04:00
2015-07-30 18:24:19 -04:00
/* Second link => redistribute slots */
2015-11-19 14:30:46 -05:00
if ( tipc_link_prio ( nl ) > tipc_link_prio ( ol ) ) {
pr_debug ( " Old link <%s> becomes standby \n " , tipc_link_name ( ol ) ) ;
2015-07-16 16:54:22 -04:00
* slot0 = bearer_id ;
2015-07-30 18:24:19 -04:00
* slot1 = bearer_id ;
2015-10-22 08:51:46 -04:00
tipc_link_set_active ( nl , true ) ;
tipc_link_set_active ( ol , false ) ;
2015-11-19 14:30:46 -05:00
} else if ( tipc_link_prio ( nl ) = = tipc_link_prio ( ol ) ) {
2015-10-22 08:51:46 -04:00
tipc_link_set_active ( nl , true ) ;
2015-10-22 08:51:47 -04:00
* slot1 = bearer_id ;
2015-07-30 18:24:19 -04:00
} else {
2015-11-19 14:30:46 -05:00
pr_debug ( " New link <%s> is standby \n " , tipc_link_name ( nl ) ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-30 18:24:19 -04:00
/* Prepare synchronization with first link */
tipc_link_tnl_prepare ( ol , nl , SYNCH_MSG , xmitq ) ;
2006-01-02 19:04:38 +01:00
}
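The slot redistribution at the end of __tipc_node_link_up() can be reduced to a small decision function. The following is a sketch of that decision only, with hypothetical names (`struct slots`, `assign_slots()`); it leaves out the FSM events, notifications and tunnel preparation that the real function performs. The first link takes both slots, a higher-priority newcomer takes both slots from the old link, an equal-priority newcomer shares by taking the second slot, and a lower-priority newcomer stays standby.

```c
#include <assert.h>

/* Hypothetical pair of active-link slots, holding bearer ids. */
struct slots {
	int slot0;
	int slot1;
};

/*
 * Decide where a newly established link goes.
 * 'have_old' is non-zero if an active link already exists.
 * Returns 1 if the new link becomes active, 0 if it is standby.
 */
int assign_slots(struct slots *s, int have_old, int new_id,
		 int new_prio, int old_prio)
{
	/* First link, or strictly higher priority: take both slots */
	if (!have_old || new_prio > old_prio) {
		s->slot0 = new_id;
		s->slot1 = new_id;
		return 1;
	}
	/* Equal priority: share the load via the second slot */
	if (new_prio == old_prio) {
		s->slot1 = new_id;
		return 1;
	}
	/* Lower priority: slots unchanged, new link is standby */
	return 0;
}
```

A caller would follow a result of 1 by marking the new link active and, when the old link lost both slots, marking that one standby.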
/**
2015-07-30 18:24:23 -04:00
* tipc_node_link_up - handle addition of link
2020-11-29 10:32:47 -08:00
* @ n : target tipc_node
* @ bearer_id : id of the bearer
* @ xmitq : queue for messages to be xmited on
2015-07-30 18:24:23 -04:00
*
* Link becomes active ( alone or shared ) or standby , depending on its priority .
2006-01-02 19:04:38 +01:00
*/
2015-07-30 18:24:23 -04:00
static void tipc_node_link_up ( struct tipc_node * n , int bearer_id ,
struct sk_buff_head * xmitq )
2006-01-02 19:04:38 +01:00
{
2016-04-15 13:33:06 -04:00
struct tipc_media_addr * maddr ;
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( n ) ;
2015-07-30 18:24:23 -04:00
__tipc_node_link_up ( n , bearer_id , xmitq ) ;
2016-04-15 13:33:06 -04:00
maddr = & n - > links [ bearer_id ] . maddr ;
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( n - > net , bearer_id , xmitq , maddr , n ) ;
2015-11-19 14:30:44 -05:00
tipc_node_write_unlock ( n ) ;
2015-07-30 18:24:23 -04:00
}
tipc: fix missing Name entries due to half-failover
A TIPC link can temporarily fall into a "half-established" state, in which
only one of the link endpoints is ESTABLISHED and starts to send traffic
(PROTOCOL messages), whereas the other link endpoint is not up (e.g. the
network interface goes down immediately after the endpoint receives
ACTIVATE_MSG).
This is a normal situation and will be settled because the link
endpoint will eventually be brought down after the link tolerance time.
However, the situation becomes worse when the second link is
established before the first link endpoint goes down.
For example:
1. Both links <1A-2A>, <1B-2B> down
2. Link endpoint 2A up, but 1A still down (e.g. due to network
disturbance, wrong session, etc.)
3. Link <1B-2B> up
4. Link endpoint 2A down (e.g. due to link tolerance timeout)
5. Node 2 starts failover onto link <1B-2B>
==> Node 1 never starts link failover.
When the "half-failover" situation happens, two consequences have been
observed:
a) Peer link/node gets stuck in FAILINGOVER state;
b) Traffic or user messages that peer node is trying to failover onto
the second link can be partially or completely dropped by this node.
The consequence a) was actually solved by commit c140eb166d68 ("tipc:
fix failover problem"), but that commit didn't cover the b). It's due
to the fact that the tunnel link endpoint has never been prepared for a
failover, so the 'l->drop_point' (and the other data...) is not set
correctly. When a TUNNEL_MSG from peer node arrives on the link,
depending on the inner message's seqno and the current 'l->drop_point'
value, the message can be dropped (- treated as a duplicate message) or
processed.
At this early stage, the traffic messages from peer are likely to be
NAME_DISTRIBUTORs, this means some name table entries will be missed on
the node forever!
The commit resolves the issue by starting the FAILOVER process on this
node as well. Another benefit from this solution is that we ensure the
link will not be re-established until the failover ends.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 17:23:23 +07:00
/**
* tipc_node_link_failover ( ) - start failover in case of " half-failover "
*
* This function is only called in a very special situation where link
* failover can be already started on peer node but not on this node .
2020-11-29 10:32:50 -08:00
* This can happen when e . g . : :
*
2019-05-02 17:23:23 +07:00
* 1. Both links < 1 A - 2 A > , < 1 B - 2 B > down
* 2. Link endpoint 2 A up , but 1 A still down ( e . g . due to network
2020-11-29 10:32:50 -08:00
* disturbance , wrong session , etc . )
2019-05-02 17:23:23 +07:00
* 3. Link < 1 B - 2 B > up
* 4. Link endpoint 2 A down ( e . g . due to link tolerance timeout )
2019-06-17 11:56:12 +07:00
* 5. Node 2 starts failover onto link < 1 B - 2 B >
2019-05-02 17:23:23 +07:00
*
2019-06-17 11:56:12 +07:00
* = = > Node 1 never starts link / node failover !
tipc: fix missing Name entries due to half-failover
TIPC link can temporarily fall into "half-establish" that only one of
the link endpoints is ESTABLISHED and starts to send traffic, PROTOCOL
messages, whereas the other link endpoint is not up (e.g. immediately
when the endpoint receives ACTIVATE_MSG, the network interface goes
down...).
This is a normal situation and will be settled because the link
endpoint will be eventually brought down after the link tolerance time.
However, the situation will become worse when the second link is
established before the first link endpoint goes down,
For example:
1. Both links <1A-2A>, <1B-2B> down
2. Link endpoint 2A up, but 1A still down (e.g. due to network
disturbance, wrong session, etc.)
3. Link <1B-2B> up
4. Link endpoint 2A down (e.g. due to link tolerance timeout)
5. Node B starts failover onto link <1B-2B>
==> Node A does never start link failover.
When the "half-failover" situation happens, two consequences have been
observed:
a) Peer link/node gets stuck in FAILINGOVER state;
b) Traffic or user messages that peer node is trying to failover onto
the second link can be partially or completely dropped by this node.
The consequence a) was actually solved by commit c140eb166d68 ("tipc:
fix failover problem"), but that commit didn't cover the b). It's due
to the fact that the tunnel link endpoint has never been prepared for a
failover, so the 'l->drop_point' (and the other data...) is not set
correctly. When a TUNNEL_MSG from peer node arrives on the link,
depending on the inner message's seqno and the current 'l->drop_point'
value, the message can be dropped (- treated as a duplicate message) or
processed.
At this early stage, the traffic messages from the peer are likely to be
NAME_DISTRIBUTORs, which means some name table entries will be missing on
the node forever!
The commit resolves the issue by starting the FAILOVER process on this
node as well. Another benefit from this solution is that we ensure the
link will not be re-established until the failover ends.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 17:23:23 +07:00
*
 * @n: tipc node structure
 * @l: link peer endpoint failingover (- can be NULL)
 * @tnl: tunnel link
 * @xmitq: queue for messages to be xmited on tnl link later
 */
static void tipc_node_link_failover(struct tipc_node *n, struct tipc_link *l,
				    struct tipc_link *tnl,
				    struct sk_buff_head *xmitq)
{
	/* Avoid to be "self-failover" that can never end */
	if (!tipc_link_is_up(tnl))
		return;
2019-06-17 11:56:12 +07:00
	/* Don't rush, failure link may be in the process of resetting */
	if (l && !tipc_link_is_reset(l))
		return;
2019-05-02 17:23:23 +07:00
	tipc_link_fsm_evt(tnl, LINK_SYNCH_END_EVT);
	tipc_node_fsm_evt(n, NODE_SYNCH_END_EVT);
	n->sync_point = tipc_link_rcv_nxt(tnl) + (U16_MAX / 2 - 1);
	tipc_link_failover_prepare(l, tnl, xmitq);
	if (l)
		tipc_link_fsm_evt(l, LINK_FAILOVER_BEGIN_EVT);
	tipc_node_fsm_evt(n, NODE_FAILOVER_BEGIN_EVT);
}
2015-07-30 18:24:23 -04:00
/**
 * __tipc_node_link_down - handle loss of link
2020-11-29 10:32:47 -08:00
 * @n: target tipc_node
 * @bearer_id: id of the bearer
 * @xmitq: queue for messages to be xmited on
 * @maddr: output media address of the bearer
2015-07-30 18:24:23 -04:00
 */
static void __tipc_node_link_down(struct tipc_node *n, int *bearer_id,
				  struct sk_buff_head *xmitq,
				  struct tipc_media_addr **maddr)
{
	struct tipc_link_entry *le = &n->links[*bearer_id];
2015-07-16 16:54:22 -04:00
	int *slot0 = &n->active_links[0];
	int *slot1 = &n->active_links[1];
2015-11-19 14:30:46 -05:00
	int i, highest = 0, prio;
2015-07-30 18:24:19 -04:00
	struct tipc_link *l, *_l, *tnl;
2006-01-02 19:04:38 +01:00
2015-07-30 18:24:23 -04:00
	l = n->links[*bearer_id].link;
2015-07-30 18:24:21 -04:00
	if (!l || tipc_link_is_reset(l))
2015-07-30 18:24:17 -04:00
		return;
2015-07-16 16:54:19 -04:00
	n->working_links--;
	n->action_flags |= TIPC_NOTIFY_LINK_DOWN;
2015-11-19 14:30:46 -05:00
	n->link_id = tipc_link_id(l);
2006-06-25 23:52:50 -07:00
2015-07-30 18:24:23 -04:00
	tipc_bearer_remove_dest(n->net, *bearer_id, n->addr);
2015-07-30 18:24:17 -04:00
2015-01-22 17:10:31 +01:00
	pr_debug("Lost link <%s> on network plane %c\n",
2015-11-19 14:30:46 -05:00
		 tipc_link_name(l), tipc_link_plane(l));
2014-06-25 20:41:33 -05:00
2015-07-16 16:54:22 -04:00
	/* Select new active link if any available */
	*slot0 = INVALID_BEARER_ID;
	*slot1 = INVALID_BEARER_ID;
	for (i = 0; i < MAX_BEARERS; i++) {
		_l = n->links[i].link;
		if (!_l || !tipc_link_is_up(_l))
			continue;
2015-07-30 18:24:17 -04:00
		if (_l == l)
			continue;
2015-11-19 14:30:46 -05:00
		prio = tipc_link_prio(_l);
		if (prio < highest)
2015-07-16 16:54:22 -04:00
			continue;
2015-11-19 14:30:46 -05:00
		if (prio > highest) {
			highest = prio;
2015-07-16 16:54:22 -04:00
			*slot0 = i;
			*slot1 = i;
			continue;
		}
		*slot1 = i;
	}
2015-07-30 18:24:17 -04:00
2017-10-13 11:04:19 +02:00
	if (!node_is_up(n)) {
2015-10-15 14:52:46 -04:00
		if (tipc_link_peer_is_down(l))
			tipc_node_fsm_evt(n, PEER_LOST_CONTACT_EVT);
		tipc_node_fsm_evt(n, SELF_LOST_CONTACT_EVT);
2018-12-19 09:17:57 +07:00
		trace_tipc_link_reset(l, TIPC_DUMP_ALL, "link down!");
2015-10-15 14:52:46 -04:00
		tipc_link_fsm_evt(l, LINK_RESET_EVT);
2015-07-30 18:24:19 -04:00
		tipc_link_reset(l);
2015-10-15 14:52:45 -04:00
		tipc_link_build_reset_msg(l, xmitq);
		*maddr = &n->links[*bearer_id].maddr;
2015-07-30 18:24:23 -04:00
		node_lost_contact(n, &le->inputq);
tipc: simplify bearer level broadcast
Until now, we have been keeping track of the exact set of broadcast
destinations through the helper structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.
A closer analysis reveals that it is unnecessary to maintain this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.
This commit introduces these changes.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
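The counter-per-bearer idea described in this commit message can be sketched as follows. This is a minimal standalone illustration, not the kernel implementation: `struct bcast_dests`, the helpers `bearer_dst_inc()`, `bearer_dst_dec()` and `bearer_has_dests()`, and the value of `SKETCH_MAX_BEARERS` are all hypothetical names and values chosen for the example.

```c
#include <stdbool.h>

#define SKETCH_MAX_BEARERS 3	/* assumed value, for illustration only */

/* One reachable-destination counter per bearer replaces the full node map */
struct bcast_dests {
	int cnt[SKETCH_MAX_BEARERS];
};

/* Called when a link over 'bearer_id' comes up (hypothetical helper) */
static bool bearer_dst_inc(struct bcast_dests *d, int bearer_id)
{
	if (bearer_id < 0 || bearer_id >= SKETCH_MAX_BEARERS)
		return false;
	d->cnt[bearer_id]++;
	return true;
}

/* Called when a link over 'bearer_id' goes down (hypothetical helper) */
static bool bearer_dst_dec(struct bcast_dests *d, int bearer_id)
{
	if (bearer_id < 0 || bearer_id >= SKETCH_MAX_BEARERS)
		return false;
	if (d->cnt[bearer_id] == 0)
		return false;	/* never let the counter go negative */
	d->cnt[bearer_id]--;
	return true;
}

/* A bearer is worth broadcasting on only if it reaches at least one node */
static bool bearer_has_dests(const struct bcast_dests *d, int bearer_id)
{
	return bearer_id >= 0 && bearer_id < SKETCH_MAX_BEARERS &&
	       d->cnt[bearer_id] > 0;
}
```

The sketch only tracks how many destinations each bearer reaches, not which ones; as the message notes, transient mismatches are left to the broadcast synchronization and retransmission mechanisms.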
2015-10-22 08:51:42 -04:00
		tipc_bcast_dec_bearer_dst_cnt(n->net, *bearer_id);
2015-07-30 18:24:19 -04:00
		return;
	}
2015-10-22 08:51:42 -04:00
	tipc_bcast_dec_bearer_dst_cnt(n->net, *bearer_id);
2015-07-30 18:24:17 -04:00
2015-07-30 18:24:19 -04:00
	/* There is still a working link => initiate failover */
2015-11-19 14:30:46 -05:00
	*bearer_id = n->active_links[0];
	tnl = n->links[*bearer_id].link;
2015-08-20 02:12:55 -04:00
	tipc_link_fsm_evt(tnl, LINK_SYNCH_END_EVT);
	tipc_node_fsm_evt(n, NODE_SYNCH_END_EVT);
2015-11-19 14:30:46 -05:00
	n->sync_point = tipc_link_rcv_nxt(tnl) + (U16_MAX / 2 - 1);
2015-07-30 18:24:23 -04:00
	tipc_link_tnl_prepare(l, tnl, FAILOVER_MSG, xmitq);
2018-12-19 09:17:57 +07:00
	trace_tipc_link_reset(l, TIPC_DUMP_ALL, "link down -> failover!");
2015-07-30 18:24:17 -04:00
	tipc_link_reset(l);
2015-10-15 14:52:46 -04:00
	tipc_link_fsm_evt(l, LINK_RESET_EVT);
2015-07-30 18:24:21 -04:00
	tipc_link_fsm_evt(l, LINK_FAILOVER_BEGIN_EVT);
2015-07-30 18:24:23 -04:00
	tipc_node_fsm_evt(n, NODE_FAILOVER_BEGIN_EVT);
2015-11-19 14:30:46 -05:00
	*maddr = &n->links[*bearer_id].maddr;
2015-07-30 18:24:23 -04:00
}
static void tipc_node_link_down(struct tipc_node *n, int bearer_id, bool delete)
{
	struct tipc_link_entry *le = &n->links[bearer_id];
2019-03-22 15:03:51 +01:00
	struct tipc_media_addr *maddr = NULL;
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridge
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementation of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potential callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unfortunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
 +------------------------------------+
 |RESET_EVT                           |
 |                                    |
 |                             +--------------+
 |           +-----------------|   SYNCHING   |-----------------+
 |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
 |           |                  A            |                  |
 |           |                  |            |                  |
 |           |                  |            |                  |
 |           |                  |SYNCH_      |SYNCH_            |
 |           |                  |BEGIN_EVT   |END_EVT           |
 |           |                  |            |                  |
 |           V                  |            V                  V
 |    +-------------+          +--------------+          +------------+
 |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
 |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
 |           |        EVT        |    A         RESET_EVT       |
 |           |                   |    |                         |
 |           |  +----------------+    |                         |
 |  RESET_EVT|  |RESET_EVT            |                         |
 |           |  |                     |                         |
 |           |  |                     |ESTABLISH_EVT            |
 |           |  |  +-------------+    |                         |
 |           |  |  | RESET_EVT   |    |                         |
 |           |  |  |             |    |                         |
 |           V  V  V             |    |                         |
 |    +-------------+          +--------------+        RESET_EVT|
 +--->|    RESET    |--------->| ESTABLISHING |<----------------+
      +-------------+ PEER_    +--------------+
       |           A  RESET_EVT       |
       |           |                  |
       |           |                  |
       |FAILOVER_  |FAILOVER_         |FAILOVER_
       |BEGIN_EVT  |END_EVT           |BEGIN_EVT
       |           |                  |
       V           |                  |
      +-------------+                 |
      | FAILINGOVER |<----------------+
      +-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
	struct tipc_link *l = le->link;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
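The local domain selection step described in this commit message (take the next M = sqrt(N) - 1 nodes downstream in a sorted, circular list) can be sketched as below. This is an illustration under stated assumptions, not the kernel's monitoring code: the helper names `isqrt_floor()`, `local_domain_size()` and `select_local_domain()` are hypothetical, the square root is rounded down, and the list is assumed to contain the selecting node itself at index `self`.

```c
/* Largest r with r * r <= n; a simple loop is enough for cluster sizes */
static unsigned int isqrt_floor(unsigned int n)
{
	unsigned int r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* M = sqrt(N) - 1, the local domain size given in the text above */
static unsigned int local_domain_size(unsigned int n)
{
	unsigned int r = isqrt_floor(n);

	return r > 0 ? r - 1 : 0;
}

/*
 * Copy the M members that follow position 'self' in the sorted,
 * circular list 'ids' (length n) into 'out'. Returns M.
 * 'out' must have room for at least local_domain_size(n) entries.
 */
static unsigned int select_local_domain(const unsigned int *ids,
					unsigned int n, unsigned int self,
					unsigned int *out)
{
	unsigned int m = local_domain_size(n);
	unsigned int i;

	for (i = 0; i < m; i++)
		out[i] = ids[(self + 1 + i) % n];	/* wrap around the ring */
	return m;
}
```

Under these assumptions a 400-entry list yields a local domain of 19 nodes, and a node near the end of the list wraps around to the lowest identities, which is what makes the list behave as a ring.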
2016-06-13 20:46:22 -04:00
	int old_bearer_id = bearer_id;
2019-03-22 15:03:51 +01:00
	struct sk_buff_head xmitq;
2015-07-30 18:24:23 -04:00
2015-10-15 14:52:44 -04:00
	if (!l)
		return;
2015-07-30 18:24:23 -04:00
	__skb_queue_head_init(&xmitq);
2015-11-19 14:30:44 -05:00
	tipc_node_write_lock(n);
2015-10-15 14:52:44 -04:00
	if (!tipc_link_is_establishing(l)) {
		__tipc_node_link_down(n, &bearer_id, &xmitq, &maddr);
	} else {
		/* Defuse pending tipc_node_link_up() */
tipc: fix link session and re-establish issues
When a link endpoint is re-created (e.g. after a node reboot or
interface reset), its link session number is chosen at random, and the
peer endpoint is synced with this new session number before the link is
re-established.
However, there is a shortcoming in this mechanism that can lead to the
link never being re-established, or failing soon afterwards. It happens when
the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
well as the 'in_session' flag have been set, but suddenly this link
endpoint leaves. When it comes back with a random session number, there
are two situations possible:
1/ If the random session number is larger than (or equal to) the
previous one, the peer endpoint will be updated with this new session
upon receipt of a RESET_MSG from this endpoint, and the link can be re-
established as normal. Otherwise, all the RESET_MSGs from this endpoint
will be rejected by the peer. In turn, when this link endpoint receives
one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
to send STATE_MSGs, but again these messages will be dropped by the
peer due to wrong session.
The peer link endpoint can still become ESTABLISHED after receiving a
traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
will be forced down sooner or later!
Even in case the random session number is larger than the previous one,
it can be that the ACTIVATE_MSG from the peer arrives first, and this
link endpoint moves quickly to ESTABLISHED without sending out any
RESET_MSG yet. Consequently, the peer link will not be updated with the
new session number, and the same link failure scenario as above will
happen.
2/ Another situation can be that, the peer link endpoint was reset due
to any reasons in the meantime, its link state was set to RESET from
ESTABLISHING but still in session, i.e. the 'in_session' flag is not
reset...
Now, if the random session number from this endpoint is less than the
previous one, all the RESET_MSGs from this endpoint will be rejected by
the peer. In the other direction, when this link endpoint receives a
RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
As a result, the link cannot be re-established but gets stuck with this
link endpoint in state ESTABLISHING and the peer in RESET!
Solution:
===========
This link endpoint should not go directly to ESTABLISHED when getting
ACTIVATE_MSG from the peer which may belong to the old session if the
link was re-created. To ensure the session to be correct before the
link is re-established, the peer endpoint in ESTABLISHING state will
send back the last session number in ACTIVATE_MSG for a verification at
this endpoint. Then, if needed, a new and more appropriate session
number will be regenerated to force a re-synch first.
In addition, when a link in ESTABLISHING state is reset, its state will
move to RESET according to the link FSM, along with resetting the
'in_session' flag (and the other data) as a normal link reset, it will
also be deleted if requested.
The solution is backward compatible.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
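The acceptance rule from situation 1/ of this commit message (a RESET_MSG is taken only if its session number is larger than or equal to the one the peer already holds while in session) can be sketched as follows. This is an illustration, not the kernel code: `seq16_more()` and `reset_msg_accepted()` are hypothetical helpers, and treating session numbers as 16-bit values compared with wraparound is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if 'a' is ahead of 'b' by less than half the 16-bit number
 * space (assumed comparison convention, with wraparound).
 */
static bool seq16_more(uint16_t a, uint16_t b)
{
	uint16_t diff = (uint16_t)(a - b);

	return diff != 0 && diff < 0x8000;
}

/*
 * Decide whether the peer accepts a RESET_MSG carrying 'new_session'.
 * While in session, anything older than the stored session is rejected.
 */
static bool reset_msg_accepted(bool in_session, uint16_t peer_session,
			       uint16_t new_session)
{
	if (!in_session)
		return true;
	return !seq16_more(peer_session, new_session);
}
```

With a stored session of 100, a re-created endpoint that comes back with 99 is rejected while one that comes back with 100 or 101 is accepted, which is the asymmetry the commit message describes.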
2019-02-11 13:29:43 +07:00
		tipc_link_reset(l);
2015-10-15 14:52:44 -04:00
		tipc_link_fsm_evt(l, LINK_RESET_EVT);
2015-07-30 18:24:23 -04:00
	}
tipc: fix link session and re-establish issues
When a link endpoint is re-created (e.g. after a node reboot or
interface reset), the link session number is varied by random, the peer
endpoint will be synced with this new session number before the link is
re-established.
However, there is a shortcoming in this mechanism that can lead to the
link never re-established or faced with a failure then. It happens when
the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
well as the 'in_session' flag have been set, but suddenly this link
endpoint leaves. When it comes back with a random session number, there
are two situations possible:
1/ If the random session number is larger than (or equal to) the
previous one, the peer endpoint will be updated with this new session
upon receipt of a RESET_MSG from this endpoint, and the link can be re-
established as normal. Otherwise, all the RESET_MSGs from this endpoint
will be rejected by the peer. In turn, when this link endpoint receives
one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
to send STATE_MSGs, but again these messages will be dropped by the
peer due to wrong session.
The peer link endpoint can still become ESTABLISHED after receiving a
traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
will be forced down sooner or later!
Even in case the random session number is larger than the previous one,
it can be that the ACTIVATE_MSG from the peer arrives first, and this
link endpoint moves quickly to ESTABLISHED without sending out any
RESET_MSG yet. Consequently, the peer link will not be updated with the
new session number, and the same link failure scenario as above will
happen.
2/ Another situation is that the peer link endpoint was reset for
some reason in the meantime: its link state moved from ESTABLISHING to
RESET but it is still in session, i.e. the 'in_session' flag is not
reset...
Now, if the random session number from this endpoint is less than the
previous one, all the RESET_MSGs from this endpoint will be rejected by
the peer. In the other direction, when this link endpoint receives a
RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
As a result, the link cannot be re-established but gets stuck with this
link endpoint in state ESTABLISHING and the peer in RESET!
Solution:
===========
This link endpoint should not go directly to ESTABLISHED when it gets
an ACTIVATE_MSG from the peer, since that message may belong to the old
session if the link was re-created. To ensure that the session is
correct before the link is re-established, the peer endpoint in
ESTABLISHING state will send back the last session number in its
ACTIVATE_MSG for verification at this endpoint. Then, if needed, a new
and more appropriate session number will be generated to force a
re-synch first.
In addition, when a link in ESTABLISHING state is reset, its state will
move to RESET according to the link FSM, and the 'in_session' flag (and
the other data) will be reset as in a normal link reset. The link will
also be deleted if requested.
The solution is backward compatible.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 13:29:43 +07:00
if ( delete ) {
kfree ( l ) ;
le - > link = NULL ;
n - > link_cnt - - ;
}
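The session verification described in the commit message above can be sketched in userspace roughly as follows. This is a minimal illustration under stated assumptions, not the kernel implementation: `struct ep`, `session_before()` and `ep_on_activate()` are hypothetical names, and picking `echoed + 1` as the regenerated session number is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified endpoint state: only the own session number. */
struct ep {
	uint16_t session;
};

/* True if a precedes b in 16-bit wrap-around order (assumed ordering). */
static bool session_before(uint16_t a, uint16_t b)
{
	return a != b && (uint16_t)(b - a) < 0x8000;
}

/* Handle an ACTIVATE_MSG that echoes the last session number the peer
 * has recorded for this endpoint.
 * Returns true if the link may go to ESTABLISHED, false if a re-synch
 * (a new RESET_MSG carrying the current session number) is needed first.
 */
static bool ep_on_activate(struct ep *e, uint16_t echoed)
{
	if (echoed == e->session)
		return true;
	/* Peer remembers a later session than ours: move past it (assumption) */
	if (session_before(e->session, echoed))
		e->session = (uint16_t)(echoed + 1);
	return false;
}
```

With this shape, an endpoint that came back with a smaller random session number does not become ESTABLISHED on a stale ACTIVATE_MSG; it first advertises a session number the peer will accept.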
2018-12-19 09:17:59 +07:00
trace_tipc_node_link_down ( n , true , " node link down or deleted! " ) ;
2015-11-19 14:30:44 -05:00
tipc_node_write_unlock ( n ) ;
tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the members of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-13 20:46:22 -04:00
if ( delete )
tipc_mon_remove_peer ( n - > net , n - > addr , old_bearer_id ) ;
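The local domain selection described in the commit message above (a sorted circular list, then the next M = sqrt(N) - 1 nodes downstream) might be sketched like this. It is a userspace illustration only: `isqrt_u32()`, `local_domain_size()` and `select_local_domain()` are hypothetical helpers, the list is assumed to include the selecting node itself, and remote domain heads are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer square root, rounded down. */
static uint32_t isqrt_u32(uint32_t n)
{
	uint32_t r = 0;

	while ((uint64_t)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* M = sqrt(N) - 1 members in the local monitoring domain. */
static size_t local_domain_size(uint32_t n)
{
	uint32_t r = isqrt_u32(n);

	return r ? r - 1 : 0;
}

/* sorted:   node addresses in ascending order (treated as circular)
 * self_idx: index of the selecting node in that list
 * out:      receives the next M nodes downstream from self
 * Returns the number of entries written to out.
 */
static size_t select_local_domain(const uint32_t *sorted, size_t n,
				  size_t self_idx, uint32_t *out,
				  size_t out_cap)
{
	size_t m = local_domain_size((uint32_t)n);
	size_t i;

	if (m > out_cap)
		m = out_cap;
	for (i = 0; i < m; i++)
		out[i] = sorted[(self_idx + 1 + i) % n];
	return m;
}
```

For N = 400 this gives a local domain of 19 members, which together with a similar number of remote heads is in line with the figure of 38 actively monitored neighbors quoted above.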
2019-03-22 15:03:51 +01:00
if ( ! skb_queue_empty ( & xmitq ) )
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( n - > net , bearer_id , & xmitq , maddr , n ) ;
2015-07-30 18:24:23 -04:00
tipc_sk_rcv ( n - > net , & le - > inputq ) ;
2006-01-02 19:04:38 +01:00
}
2017-10-13 11:04:19 +02:00
static bool node_is_up ( struct tipc_node * n )
2006-01-02 19:04:38 +01:00
{
2015-07-16 16:54:22 -04:00
return n - > active_links [ 0 ] ! = INVALID_BEARER_ID ;
2006-01-02 19:04:38 +01:00
}
2017-10-13 11:04:19 +02:00
bool tipc_node_is_up ( struct net * net , u32 addr )
{
struct tipc_node * n ;
bool retval = false ;
if ( in_own_node ( net , addr ) )
return true ;
n = tipc_node_find ( net , addr ) ;
if ( ! n )
return false ;
retval = node_is_up ( n ) ;
tipc_node_put ( n ) ;
return retval ;
}
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:51 +01:00
static u32 tipc_node_suggest_addr ( struct net * net , u32 addr )
{
struct tipc_node * n ;
addr ^ = tipc_net ( net ) - > random ;
while ( ( n = tipc_node_find ( net , addr ) ) ) {
tipc_node_put ( n ) ;
addr + + ;
}
return addr ;
}
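The requesting side of the trial protocol described in the commit message above could be modelled along these lines. This is a hedged userspace sketch: `struct addr_trial` and the `trial_*()` functions are hypothetical, the value 0 is assumed to mean "no address applied yet", and message transmission is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRIAL_PERIOD_MS 1000u	/* the 1 sec trial period */

/* Hypothetical requester state during address trial. */
struct addr_trial {
	uint32_t trial_addr;	/* address currently being tried */
	uint32_t elapsed_ms;	/* time since the trial (re)started */
	uint32_t own_addr;	/* 0 until the trial has completed */
};

static void trial_start(struct addr_trial *t, uint32_t addr)
{
	t->trial_addr = addr;
	t->elapsed_ms = 0;
	t->own_addr = 0;
}

/* A peer suggested a different (or equal) address: apply it as the new
 * trial address and restart the trial period.
 */
static void trial_on_fail_msg(struct addr_trial *t, uint32_t suggested)
{
	if (t->own_addr)
		return;		/* trial already over, ignore */
	t->trial_addr = suggested;
	t->elapsed_ms = 0;
}

/* Advance time; returns true once the trial address has been applied. */
static bool trial_tick(struct addr_trial *t, uint32_t ms)
{
	if (t->own_addr)
		return true;
	t->elapsed_ms += ms;
	if (t->elapsed_ms >= TRIAL_PERIOD_MS)
		t->own_addr = t->trial_addr;
	return t->own_addr != 0;
}
```

The point of the sketch is that every failure message restarts the full period, so an address is only applied after one uninterrupted second without objections.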
/* tipc_node_try_addr(): Check if addr can be used by peer, suggest other if not
2018-07-06 20:10:03 +02:00
* Returns suggested address if any , otherwise 0
2018-03-22 20:42:51 +01:00
*/
u32 tipc_node_try_addr ( struct net * net , u8 * id , u32 addr )
{
struct tipc_net * tn = tipc_net ( net ) ;
struct tipc_node * n ;
2019-11-08 12:05:09 +07:00
bool preliminary ;
u32 sugg_addr ;
2018-03-22 20:42:51 +01:00
/* Suggest new address if some other peer is using this one */
n = tipc_node_find ( net , addr ) ;
if ( n ) {
if ( ! memcmp ( n - > peer_id , id , NODE_ID_LEN ) )
addr = 0 ;
tipc_node_put ( n ) ;
if ( ! addr )
return 0 ;
return tipc_node_suggest_addr ( net , addr ) ;
}
/* Suggest previously used address if peer is known */
n = tipc_node_find_by_id ( net , id ) ;
if ( n ) {
2019-11-08 12:05:09 +07:00
sugg_addr = n - > addr ;
preliminary = n - > preliminary ;
2018-03-22 20:42:51 +01:00
tipc_node_put ( n ) ;
2019-11-08 12:05:09 +07:00
if ( ! preliminary )
return sugg_addr ;
2018-03-22 20:42:51 +01:00
}
2018-07-06 20:10:03 +02:00
/* Even this node may be in conflict */
2018-03-22 20:42:51 +01:00
if ( tn - > trial_addr = = addr )
return tipc_node_suggest_addr ( net , addr ) ;
2018-07-06 20:10:03 +02:00
return 0 ;
2018-03-22 20:42:51 +01:00
}
void tipc_node_check_dest ( struct net * net , u32 addr ,
u8 * peer_id , struct tipc_bearer * b ,
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive path through the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go through
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey the policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces' hash_mix:es. If it finds a match, then that, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
u16 capabilities , u32 signature , u32 hash_mixes ,
2015-07-30 18:24:22 -04:00
struct tipc_media_addr * maddr ,
bool * respond , bool * dupl_addr )
2015-07-16 16:54:20 -04:00
{
2015-07-30 18:24:22 -04:00
struct tipc_node * n ;
2019-11-08 12:05:09 +07:00
struct tipc_link * l , * snd_l ;
2015-07-30 18:24:26 -04:00
struct tipc_link_entry * le ;
2015-07-30 18:24:22 -04:00
bool addr_match = false ;
bool sign_match = false ;
bool link_up = false ;
bool accept_addr = false ;
2015-07-30 18:24:23 -04:00
bool reset = true ;
2015-10-22 08:51:36 -04:00
char * if_name ;
2016-06-08 12:00:05 -04:00
unsigned long intv ;
2018-09-26 22:28:52 +02:00
u16 session ;
2015-07-30 18:24:26 -04:00
2015-07-30 18:24:22 -04:00
* dupl_addr = false ;
* respond = false ;
2019-11-08 12:05:09 +07:00
n = tipc_node_create ( net , addr , peer_id , capabilities , hash_mixes ,
false ) ;
2015-07-30 18:24:22 -04:00
if ( ! n )
return ;
2015-07-16 16:54:20 -04:00
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( n ) ;
2019-11-08 12:05:09 +07:00
if ( unlikely ( ! n - > bc_entry . link ) ) {
snd_l = tipc_bc_sndlink ( net ) ;
if ( ! tipc_link_bc_create ( net , tipc_own_addr ( net ) ,
2020-05-26 16:38:37 +07:00
addr , peer_id , U16_MAX ,
tipc: introduce variable window congestion control
We introduce a simple variable window congestion control for links.
The algorithm is inspired by the Reno algorithm, covering both 'slow
start', 'congestion avoidance', and 'fast recovery' modes.
- We introduce hard lower and upper window limits per link, still
different and configurable per bearer type.
- We introduce a 'slow start threshold' variable, initially set to
the maximum window size.
- We let a link start at the minimum congestion window, i.e. in slow
start mode, and then let it grow rapidly (+1 per received ACK) until
it reaches the slow start threshold and enters congestion avoidance
mode.
- In congestion avoidance mode we increment the congestion window for
each window-size number of acked packets, up to a possible maximum
equal to the configured maximum window.
- For each non-duplicate NACK received, we drop back to fast recovery
mode, by setting both the slow start threshold and the
congestion window to (current_congestion_window / 2).
- If the timeout handler finds that the transmit queue has not moved
since the previous timeout, it drops the link back to slow start
and forces a probe containing the last sent sequence number to be
sent to the peer, so that the peer can discover the stale situation.
This change does in reality have effect only on unicast ethernet
transport, as we have seen that there is no room whatsoever for
increasing the window max size for the UDP bearer.
For now, we also choose to keep the limits for the broadcast link
unchanged and equal.
This algorithm seems to give a 50-100% throughput improvement for
messages larger than MTU.
Suggested-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-10 00:52:46 +01:00
tipc_link_min_win ( snd_l ) ,
tipc_link_max_win ( snd_l ) ,
2019-11-08 12:05:09 +07:00
n - > capabilities ,
& n - > bc_entry . inputq1 ,
& n - > bc_entry . namedq , snd_l ,
& n - > bc_entry . link ) ) {
pr_warn ( " Broadcast rcv link creation failed, no mem \n " ) ;
tipc_node_write_unlock_fast ( n ) ;
tipc_node_put ( n ) ;
return ;
}
}
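The variable window scheme described in the congestion control commit message above can be illustrated with a small userspace state machine. This is a sketch under stated assumptions rather than the kernel code: `struct cwnd_state` and the `cwnd_*()` functions are hypothetical, clamping the halved window to the minimum is an assumption, and so is resetting the threshold to the maximum window when falling back to slow start.

```c
#include <stdint.h>

/* Hypothetical per-link congestion state. */
struct cwnd_state {
	uint16_t min_win;	/* hard lower window limit */
	uint16_t max_win;	/* hard upper window limit */
	uint16_t ssthresh;	/* slow start threshold */
	uint16_t window;	/* current congestion window */
	uint16_t acked_in_ca;	/* packets acked since last CA increment */
};

static void cwnd_init(struct cwnd_state *s, uint16_t min_win, uint16_t max_win)
{
	s->min_win = min_win;
	s->max_win = max_win;
	s->ssthresh = max_win;	/* threshold starts at the maximum window */
	s->window = min_win;	/* link starts in slow start */
	s->acked_in_ca = 0;
}

/* One ACK received, acknowledging acked_pkts packets. */
static void cwnd_on_ack(struct cwnd_state *s, uint16_t acked_pkts)
{
	if (s->window < s->ssthresh) {
		s->window++;	/* slow start: +1 per received ACK */
		return;
	}
	/* Congestion avoidance: +1 per window-size number of acked packets */
	s->acked_in_ca += acked_pkts;
	if (s->acked_in_ca >= s->window) {
		s->acked_in_ca = 0;
		if (s->window < s->max_win)
			s->window++;
	}
}

/* Non-duplicate NACK: fast recovery, halve threshold and window. */
static void cwnd_on_nack(struct cwnd_state *s)
{
	uint16_t half = s->window / 2;

	if (half < s->min_win)
		half = s->min_win;	/* assumption: never below min_win */
	s->ssthresh = half;
	s->window = half;
	s->acked_in_ca = 0;
}

/* Transmit queue did not move since last timeout: back to slow start. */
static void cwnd_on_stale_timeout(struct cwnd_state *s)
{
	s->ssthresh = s->max_win;	/* assumption, see lead-in */
	s->window = s->min_win;
	s->acked_in_ca = 0;
}
```

Starting from `cwnd_init(&s, 4, 16)`, twelve single ACKs bring the window to 16, a NACK drops both window and threshold to 8, and growth then continues at one step per full window of acknowledged packets.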
2015-07-30 18:24:22 -04:00
2015-07-30 18:24:26 -04:00
le = & n - > links [ b - > identity ] ;
2015-07-30 18:24:22 -04:00
/* Prepare to validate requesting node's signature and media address */
2015-07-30 18:24:26 -04:00
l = le - > link ;
2015-07-30 18:24:22 -04:00
link_up = l & & tipc_link_is_up ( l ) ;
2015-07-30 18:24:26 -04:00
addr_match = l & & ! memcmp ( & le - > maddr , maddr , sizeof ( * maddr ) ) ;
2015-07-30 18:24:22 -04:00
sign_match = ( signature = = n - > signature ) ;
/* These three flags give us eight permutations: */
if ( sign_match & & addr_match & & link_up ) {
/* All is fine. Do nothing. */
2015-07-30 18:24:23 -04:00
reset = false ;
2019-11-08 10:02:37 +07:00
/* Peer node is not a container/local namespace */
if ( ! n - > peer_hash_mix )
n - > peer_hash_mix = hash_mixes ;
2015-07-30 18:24:22 -04:00
} else if ( sign_match & & addr_match & & ! link_up ) {
/* Respond. The link will come up in due time */
* respond = true ;
} else if ( sign_match & & ! addr_match & & link_up ) {
/* Peer has changed i/f address without rebooting.
* If so , the link will reset soon , and the next
* discovery will be accepted . So we can ignore it .
* It may also be an cloned or malicious peer having
* chosen the same node address and signature as an
* existing one .
* Ignore requests until the link goes down , if ever .
*/
* dupl_addr = true ;
} else if ( sign_match & & ! addr_match & & ! link_up ) {
/* Peer link has changed i/f address without rebooting.
* It may also be a cloned or malicious peer ; we can ' t
* distinguish between the two .
* The signature is correct , so we must accept .
*/
accept_addr = true ;
* respond = true ;
} else if ( ! sign_match & & addr_match & & link_up ) {
/* Peer node rebooted. Two possibilities:
* - Delayed re - discovery ; this link endpoint has already
* reset and re - established contact with the peer , before
* receiving a discovery message from that node .
* ( The peer happened to receive one from this node first ) .
* - The peer came back so fast that our side has not
* discovered it yet . Probing from this side will soon
* reset the link , since there can be no working link
* endpoint at the peer end , and the link will re - establish .
* Accept the signature , since it comes from a known peer .
*/
n - > signature = signature ;
} else if ( ! sign_match & & addr_match & & ! link_up ) {
/* The peer node has rebooted.
* Accept signature , since it is a known peer .
*/
n - > signature = signature ;
* respond = true ;
} else if ( ! sign_match & & ! addr_match & & link_up ) {
/* Peer rebooted with new address, or a new/duplicate peer.
* Ignore until the link goes down , if ever .
*/
* dupl_addr = true ;
} else if ( ! sign_match & & ! addr_match & & ! link_up ) {
/* Peer rebooted with new address, or it is a new peer.
* Accept signature and address .
*/
n - > signature = signature ;
accept_addr = true ;
* respond = true ;
}
2015-07-16 16:54:20 -04:00
2015-07-30 18:24:22 -04:00
if ( ! accept_addr )
goto exit ;
2015-07-16 16:54:20 -04:00
2015-07-30 18:24:22 -04:00
/* Now create new link if not already existing */
2015-07-16 16:54:29 -04:00
if ( ! l ) {
tipc: remove restrictions on node address values
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format: it is sub-divided
into an 8-bit zone id, a 12-bit cluster id, and a 12-bit node id.
However, the 'zone' and 'cluster' levels have in reality never been
fully implemented, and never will be. The result of this has been
that the first 20 bits of the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.
In the following commits, we will need to be able to connect between
nodes which are using the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to node identity; it is from now on only a 32-bit integer with no
assumed internal structure.
Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.
The rules for accepting discovery requests/responses from neighboring
nodes now become:
- If the user is using legacy address format on both peers, reception
of discovery messages is subject to the legacy lookup domain check
in addition to the cluster id check.
- Otherwise, the discovery request/response is always accepted, provided
both peers have the same network id.
This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:47 +01:00
if ( n - > link_cnt = = 2 )
2015-07-30 18:24:26 -04:00
goto exit ;
2018-03-22 20:42:47 +01:00
2015-10-22 08:51:36 -04:00
if_name = strchr ( b - > name , ' : ' ) + 1 ;
2018-09-26 22:28:52 +02:00
get_random_bytes ( & session , sizeof ( u16 ) ) ;
2015-10-22 08:51:46 -04:00
if ( ! tipc_link_create ( net , if_name , b - > identity , b - > tolerance ,
2015-10-22 08:51:36 -04:00
b - > net_plane , b - > mtu , b - > priority ,
tipc: introduce variable window congestion control
We introduce a simple variable window congestion control for links.
The algorithm is inspired by the Reno algorithm, covering the 'slow
start', 'congestion avoidance', and 'fast recovery' modes.
- We introduce hard lower and upper window limits per link, still
different and configurable per bearer type.
- We introduce a 'slow start threshold' variable, initially set to
the maximum window size.
- We let a link start at the minimum congestion window, i.e. in slow
start mode, and then let it grow rapidly (+1 per received ACK) until
it reaches the slow start threshold and enters congestion avoidance
mode.
- In congestion avoidance mode we increment the congestion window for
each window-size number of acked packets, up to a possible maximum
equal to the configured maximum window.
- For each non-duplicate NACK received, we drop back to fast recovery
mode, by setting both the slow start threshold and the
congestion window to (current_congestion_window / 2).
- If the timeout handler finds that the transmit queue has not moved
since the previous timeout, it drops the link back to slow start
and forces a probe containing the last sent sequence number to be
sent to the peer, so that the peer can discover the stale situation.
This change does in reality have effect only on unicast ethernet
transport, as we have seen that there is no room whatsoever for
increasing the window max size for the UDP bearer.
For now, we also choose to keep the limits for the broadcast link
unchanged and equal.
This algorithm seems to give a 50-100% throughput improvement for
messages larger than MTU.
Suggested-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-10 00:52:46 +01:00
b - > min_win , b - > max_win , session ,
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:51 +01:00
tipc_own_addr ( net ) , addr , peer_id ,
2015-10-22 08:51:48 -04:00
n - > capabilities ,
2015-10-22 08:51:41 -04:00
tipc_bc_sndlink ( n - > net ) , n - > bc_entry . link ,
& le - > inputq ,
& n - > bc_entry . namedq , & l ) ) {
2015-07-30 18:24:22 -04:00
* respond = false ;
goto exit ;
}
2018-12-19 09:17:57 +07:00
trace_tipc_link_reset ( l , TIPC_DUMP_ALL , " link created! " ) ;
2015-07-30 18:24:26 -04:00
tipc_link_reset ( l ) ;
2015-10-15 14:52:46 -04:00
tipc_link_fsm_evt ( l , LINK_RESET_EVT ) ;
tipc: eliminate risk of premature link setup during failover
When a link goes down, and there is still a working link towards its
destination node, a failover is initiated, and the failed link is not
allowed to re-establish until that procedure is finished. To ensure
this, the concerned link endpoints are set to state LINK_FAILINGOVER,
and the node endpoints to NODE_FAILINGOVER during the failover period.
However, if the link reset is due to a disabled bearer, the
corresponding link endpoint is deleted, and only the node endpoint knows
about the ongoing failover. Now, if the disabled bearer is re-enabled
during the failover period, the discovery mechanism may create a new
link endpoint that is ready to be established, even though this is not
permitted. This situation may cause both the ongoing failover and any
subsequent link synchronization to fail.
In this commit, we ensure that a newly created link goes directly to
state LINK_FAILINGOVER if the corresponding node state is
NODE_FAILINGOVER. This eliminates the problem described above.
Furthermore, we tighten the criteria for which packets are allowed
to end a failover state in the function tipc_node_check_state().
By checking that the receiving link is up and running, instead of just
checking that it is not in failover mode, we eliminate the risk that
protocol packets from the re-created link may cause the failover to
be prematurely terminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 02:12:54 -04:00
if ( n - > state = = NODE_FAILINGOVER )
tipc_link_fsm_evt ( l , LINK_FAILOVER_BEGIN_EVT ) ;
2015-07-30 18:24:26 -04:00
le - > link = l ;
n - > link_cnt + + ;
2015-07-16 16:54:29 -04:00
tipc_node_calculate_timer ( n , l ) ;
2016-06-08 12:00:05 -04:00
if ( n - > link_cnt = = 1 ) {
intv = jiffies + msecs_to_jiffies ( n - > keepalive_intv ) ;
if ( ! mod_timer ( & n - > timer , intv ) )
2015-07-16 16:54:29 -04:00
tipc_node_get ( n ) ;
2016-06-08 12:00:05 -04:00
}
2015-07-16 16:54:29 -04:00
}
2015-07-30 18:24:26 -04:00
memcpy ( & le - > maddr , maddr , sizeof ( * maddr ) ) ;
2015-07-30 18:24:22 -04:00
exit :
2015-11-19 14:30:44 -05:00
tipc_node_write_unlock ( n ) ;
2016-03-03 14:20:41 +01:00
if ( reset & & l & & ! tipc_link_is_reset ( l ) )
2015-07-30 18:24:23 -04:00
tipc_node_link_down ( n , b - > identity , false ) ;
2015-07-30 18:24:22 -04:00
tipc_node_put ( n ) ;
2015-07-16 16:54:20 -04:00
}
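The branch chain at the end of the function above can be read as a decision table over three booleans. The following is a minimal sketch of that table for the five branches whose conditions are visible in this excerpt. The struct `disc_decision`, the field `take_signature` and the function `disc_decide()` are hypothetical names introduced for illustration only, and the sketch assumes that all outputs start out false, which the excerpt does not show. It models the flags only, not the link creation that follows.

```c
#include <stdbool.h>

/* Hypothetical result record; the kernel code uses separate variables. */
struct disc_decision {
	bool covered;		/* false: branch is outside the excerpt */
	bool respond;		/* mirrors *respond = true */
	bool accept_addr;	/* mirrors accept_addr = true */
	bool dupl_addr;		/* mirrors *dupl_addr = true */
	bool take_signature;	/* mirrors n->signature = signature */
};

static struct disc_decision disc_decide(bool sign_match, bool addr_match,
					bool link_up)
{
	/* Assumption: every flag defaults to false before the chain runs */
	struct disc_decision d = { false, false, false, false, false };

	if (sign_match && !addr_match && !link_up) {
		/* Peer changed interface address without rebooting */
		d.covered = true;
		d.accept_addr = true;
		d.respond = true;
	} else if (!sign_match && addr_match && link_up) {
		/* Peer rebooted; link will reset and re-establish by itself */
		d.covered = true;
		d.take_signature = true;
	} else if (!sign_match && addr_match && !link_up) {
		/* Peer rebooted and the link is already down */
		d.covered = true;
		d.take_signature = true;
		d.respond = true;
	} else if (!sign_match && !addr_match && link_up) {
		/* New address or duplicate peer: ignore while link is up */
		d.covered = true;
		d.dupl_addr = true;
	} else if (!sign_match && !addr_match && !link_up) {
		/* New address or new peer: accept everything */
		d.covered = true;
		d.take_signature = true;
		d.accept_addr = true;
		d.respond = true;
	}
	return d;
}
```

Only the branches that set `accept_addr` go on to create or update a link; the others fall through to the exit label.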
2015-07-30 18:24:16 -04:00
void tipc_node_delete_links ( struct net * net , int bearer_id )
{
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
struct tipc_node * n ;
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( n , & tn - > node_list , list ) {
2015-07-30 18:24:23 -04:00
tipc_node_link_down ( n , bearer_id , true ) ;
2015-07-30 18:24:16 -04:00
}
rcu_read_unlock ( ) ;
}
static void tipc_node_reset_links ( struct tipc_node * n )
{
2015-07-30 18:24:23 -04:00
int i ;
2015-07-30 18:24:16 -04:00
2018-03-22 20:42:50 +01:00
pr_warn ( " Resetting all links to %x \n " , n - > addr ) ;
2015-07-30 18:24:16 -04:00
2018-12-19 09:17:59 +07:00
trace_tipc_node_reset_links ( n , true , " " ) ;
2015-07-30 18:24:16 -04:00
for ( i = 0 ; i < MAX_BEARERS ; i + + ) {
2015-07-30 18:24:23 -04:00
tipc_node_link_down ( n , i , false ) ;
2015-07-30 18:24:16 -04:00
}
}
2015-07-16 16:54:30 -04:00
/* tipc_node_fsm_evt - node finite state machine
* Determines when contact is allowed with peer node
*/
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/transmits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
static void tipc_node_fsm_evt ( struct tipc_node * n , int evt )
2015-07-16 16:54:30 -04:00
{
int state = n - > state ;
switch ( state ) {
case SELF_DOWN_PEER_DOWN :
switch ( evt ) {
case SELF_ESTABL_CONTACT_EVT :
state = SELF_UP_PEER_COMING ;
break ;
case PEER_ESTABL_CONTACT_EVT :
state = SELF_COMING_PEER_UP ;
break ;
case SELF_LOST_CONTACT_EVT :
case PEER_LOST_CONTACT_EVT :
break ;
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_SYNCH_BEGIN_EVT :
case NODE_FAILOVER_BEGIN_EVT :
case NODE_FAILOVER_END_EVT :
2015-07-16 16:54:30 -04:00
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
case SELF_UP_PEER_UP :
switch ( evt ) {
case SELF_LOST_CONTACT_EVT :
state = SELF_DOWN_PEER_LEAVING ;
break ;
case PEER_LOST_CONTACT_EVT :
state = SELF_LEAVING_PEER_DOWN ;
break ;
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_BEGIN_EVT :
state = NODE_SYNCHING ;
break ;
case NODE_FAILOVER_BEGIN_EVT :
state = NODE_FAILINGOVER ;
break ;
2015-07-16 16:54:30 -04:00
case SELF_ESTABL_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_FAILOVER_END_EVT :
2015-07-16 16:54:30 -04:00
break ;
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
case SELF_DOWN_PEER_LEAVING :
switch ( evt ) {
case PEER_LOST_CONTACT_EVT :
state = SELF_DOWN_PEER_DOWN ;
break ;
case SELF_ESTABL_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
case SELF_LOST_CONTACT_EVT :
break ;
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_SYNCH_BEGIN_EVT :
case NODE_FAILOVER_BEGIN_EVT :
case NODE_FAILOVER_END_EVT :
2015-07-16 16:54:30 -04:00
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
case SELF_UP_PEER_COMING :
switch ( evt ) {
case PEER_ESTABL_CONTACT_EVT :
state = SELF_UP_PEER_UP ;
break ;
case SELF_LOST_CONTACT_EVT :
tipc: correct error in node fsm
commit 88e8ac7000dc ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47ed ("tipc: extend node FSM").
We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
event will not change the node FSM state, but the (distinct) link FSM
will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
declare node A lost, and post the event SELF_DOWN to its node
FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
that no messages will be accepted from A until it receives another
RESET message that confirms that A's endpoint has been reset. This
is wasteful, since we know this as a fact already from the first
received RESET, but worse is that the link instance's FSM has not
wasted this information, but instead moved on to state ESTABLISHING,
meaning that it repeatedly sends out ACTIVATE messages to the reset
peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
to state ESTABLISHED, and start repeatedly sending out STATE messages
to node B.
4: Node B will consistently drop these messages, since it can only
accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
likely that A will receive an ACTIVATE (which is sent out at a much
higher frequency) before it gets the chance to send a RESET, and A
may hence quickly move back to state ESTABLISHED and continue sending
out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
by chance get in between with sending a RESET, and the situation is
resolved.
Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.
We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from an endpoint that we already
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.
Finally, a symmetry comparison reveals that the FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PEER_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose to fix this one, too.
The node FSM looks as follows after those changes:
+----------------------------------------+
| PEER_DOWN_EVT|
| |
+------------------------+----------------+ |
|SELF_DOWN_EVT | | |
| | | |
| +-----------+ +-----------+ |
| |NODE_ | |NODE_ | |
| +----------|FAILINGOVER|<---------|SYNCHING |-----------+ |
| |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | |
| |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| |
| | | | | | | |
| | | | | | | |
| | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | |
| | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | |
| | | | | | | |
| | | | | | | |
| | | +--------------+ | | |
| | +-------->| SELF_UP_ |<-------+ | |
| | +-----------------| PEER_UP |----------------+ | |
| | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | |
| | | A A | | |
| | | | | | | |
| | | PEER_UP_EVT| |SELF_UP_EVT | | |
| | | | | | | |
V V V | | V V V
+------------+ +-----------+ +-----------+ +------------+
|SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN |
|PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING|
+------------+ +-----------+ +-----------+ +------------+
| | A A | |
| | | | | |
| SELF_ | |SELF_ |PEER_ |PEER_ |
| DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT |
| | | | | |
| | | | | |
| | +--------------+ | |
|PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT|
+------------------->| PEER_DOWN |<--------------------+
+--------------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 12:00:04 -04:00
state = SELF_DOWN_PEER_DOWN ;
2015-07-16 16:54:30 -04:00
break ;
case SELF_ESTABL_CONTACT_EVT :
case PEER_LOST_CONTACT_EVT :
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_FAILOVER_BEGIN_EVT :
tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.
We have now discovered a weakness in the implementation of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.
We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potential callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.
Unfortunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | +----------------+ | |
| RESET_EVT| |RESET_EVT | |
| | | | |
| | | |ESTABLISH_EVT |
| | | +-------------+ | |
| | | | RESET_EVT | | |
| | | | | | |
| V V V | | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 14:52:44 -04:00
break ;
case NODE_SYNCH_BEGIN_EVT :
2015-07-30 18:24:18 -04:00
case NODE_FAILOVER_END_EVT :
2015-07-16 16:54:30 -04:00
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
case SELF_COMING_PEER_UP :
switch ( evt ) {
case SELF_ESTABL_CONTACT_EVT :
state = SELF_UP_PEER_UP ;
break ;
case PEER_LOST_CONTACT_EVT :
2016-06-08 12:00:04 -04:00
state = SELF_DOWN_PEER_DOWN ;
2015-07-16 16:54:30 -04:00
break ;
case SELF_LOST_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
break ;
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_SYNCH_BEGIN_EVT :
case NODE_FAILOVER_BEGIN_EVT :
case NODE_FAILOVER_END_EVT :
2015-07-16 16:54:30 -04:00
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
case SELF_LEAVING_PEER_DOWN :
switch ( evt ) {
case SELF_LOST_CONTACT_EVT :
state = SELF_DOWN_PEER_DOWN ;
break ;
case SELF_ESTABL_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
case PEER_LOST_CONTACT_EVT :
break ;
2015-07-30 18:24:18 -04:00
case NODE_SYNCH_END_EVT :
case NODE_SYNCH_BEGIN_EVT :
case NODE_FAILOVER_BEGIN_EVT :
case NODE_FAILOVER_END_EVT :
default :
goto illegal_evt ;
}
break ;
case NODE_FAILINGOVER :
switch ( evt ) {
case SELF_LOST_CONTACT_EVT :
state = SELF_DOWN_PEER_LEAVING ;
break ;
case PEER_LOST_CONTACT_EVT :
state = SELF_LEAVING_PEER_DOWN ;
break ;
case NODE_FAILOVER_END_EVT :
state = SELF_UP_PEER_UP ;
break ;
case NODE_FAILOVER_BEGIN_EVT :
case SELF_ESTABL_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
break ;
case NODE_SYNCH_BEGIN_EVT :
case NODE_SYNCH_END_EVT :
2015-07-16 16:54:30 -04:00
default :
2015-07-30 18:24:18 -04:00
goto illegal_evt ;
}
break ;
case NODE_SYNCHING :
switch ( evt ) {
case SELF_LOST_CONTACT_EVT :
state = SELF_DOWN_PEER_LEAVING ;
break ;
case PEER_LOST_CONTACT_EVT :
state = SELF_LEAVING_PEER_DOWN ;
break ;
case NODE_SYNCH_END_EVT :
state = SELF_UP_PEER_UP ;
break ;
case NODE_FAILOVER_BEGIN_EVT :
state = NODE_FAILINGOVER ;
break ;
case NODE_SYNCH_BEGIN_EVT :
case SELF_ESTABL_CONTACT_EVT :
case PEER_ESTABL_CONTACT_EVT :
break ;
case NODE_FAILOVER_END_EVT :
default :
goto illegal_evt ;
2015-07-16 16:54:30 -04:00
}
break ;
default :
pr_err ( " Unknown node fsm state %x \n " , state ) ;
break ;
}
2018-12-19 09:17:59 +07:00
trace_tipc_node_fsm ( n - > peer_id , n - > state , state , evt ) ;
2015-07-16 16:54:30 -04:00
n - > state = state ;
2015-07-30 18:24:18 -04:00
return ;
illegal_evt :
pr_err ( " Illegal node fsm evt %x in state %x \n " , evt , state ) ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_fsm ( n - > peer_id , n - > state , state , evt ) ;
2015-07-16 16:54:30 -04:00
}
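The nested switch above can also be expressed as a lookup table, which makes the legal, ignored and illegal combinations easier to compare against the FSM diagram. The following is a minimal sketch of that idea, not the kernel's implementation. State and event names are taken from the switch; the numeric values, the `KEEP` and `ILLEGAL` markers, `next_state[]` and `node_fsm_next()` are assumptions made for this illustration.

```c
/* Names follow the switch above. The numeric values are placeholders
 * chosen for this sketch, not the kernel's definitions.
 */
enum node_state {
	SELF_DOWN_PEER_DOWN = 1, SELF_UP_PEER_UP,
	SELF_DOWN_PEER_LEAVING, SELF_UP_PEER_COMING,
	SELF_COMING_PEER_UP, SELF_LEAVING_PEER_DOWN,
	NODE_FAILINGOVER, NODE_SYNCHING,
	NODE_STATE_MAX
};

enum node_evt {
	SELF_ESTABL_CONTACT_EVT, SELF_LOST_CONTACT_EVT,
	PEER_ESTABL_CONTACT_EVT, PEER_LOST_CONTACT_EVT,
	NODE_FAILOVER_BEGIN_EVT, NODE_FAILOVER_END_EVT,
	NODE_SYNCH_BEGIN_EVT, NODE_SYNCH_END_EVT,
	NODE_EVT_MAX
};

#define KEEP	(-1)	/* legal event that leaves the state unchanged */
#define ILLEGAL	0	/* implicit value of every entry not listed */

static const signed char next_state[NODE_STATE_MAX][NODE_EVT_MAX] = {
	[SELF_DOWN_PEER_DOWN] = {
		[SELF_ESTABL_CONTACT_EVT] = SELF_UP_PEER_COMING,
		[PEER_ESTABL_CONTACT_EVT] = SELF_COMING_PEER_UP,
		[SELF_LOST_CONTACT_EVT] = KEEP, [PEER_LOST_CONTACT_EVT] = KEEP,
	},
	[SELF_UP_PEER_UP] = {
		[SELF_LOST_CONTACT_EVT] = SELF_DOWN_PEER_LEAVING,
		[PEER_LOST_CONTACT_EVT] = SELF_LEAVING_PEER_DOWN,
		[NODE_SYNCH_BEGIN_EVT] = NODE_SYNCHING,
		[NODE_FAILOVER_BEGIN_EVT] = NODE_FAILINGOVER,
		[SELF_ESTABL_CONTACT_EVT] = KEEP, [PEER_ESTABL_CONTACT_EVT] = KEEP,
		[NODE_SYNCH_END_EVT] = KEEP, [NODE_FAILOVER_END_EVT] = KEEP,
	},
	[SELF_DOWN_PEER_LEAVING] = {
		[PEER_LOST_CONTACT_EVT] = SELF_DOWN_PEER_DOWN,
		[SELF_ESTABL_CONTACT_EVT] = KEEP, [PEER_ESTABL_CONTACT_EVT] = KEEP,
		[SELF_LOST_CONTACT_EVT] = KEEP,
	},
	[SELF_UP_PEER_COMING] = {
		[PEER_ESTABL_CONTACT_EVT] = SELF_UP_PEER_UP,
		[SELF_LOST_CONTACT_EVT] = SELF_DOWN_PEER_DOWN,
		[SELF_ESTABL_CONTACT_EVT] = KEEP, [PEER_LOST_CONTACT_EVT] = KEEP,
		[NODE_SYNCH_END_EVT] = KEEP, [NODE_FAILOVER_BEGIN_EVT] = KEEP,
	},
	[SELF_COMING_PEER_UP] = {
		[SELF_ESTABL_CONTACT_EVT] = SELF_UP_PEER_UP,
		[PEER_LOST_CONTACT_EVT] = SELF_DOWN_PEER_DOWN,
		[SELF_LOST_CONTACT_EVT] = KEEP, [PEER_ESTABL_CONTACT_EVT] = KEEP,
	},
	[SELF_LEAVING_PEER_DOWN] = {
		[SELF_LOST_CONTACT_EVT] = SELF_DOWN_PEER_DOWN,
		[SELF_ESTABL_CONTACT_EVT] = KEEP, [PEER_ESTABL_CONTACT_EVT] = KEEP,
		[PEER_LOST_CONTACT_EVT] = KEEP,
	},
	[NODE_FAILINGOVER] = {
		[SELF_LOST_CONTACT_EVT] = SELF_DOWN_PEER_LEAVING,
		[PEER_LOST_CONTACT_EVT] = SELF_LEAVING_PEER_DOWN,
		[NODE_FAILOVER_END_EVT] = SELF_UP_PEER_UP,
		[NODE_FAILOVER_BEGIN_EVT] = KEEP, [SELF_ESTABL_CONTACT_EVT] = KEEP,
		[PEER_ESTABL_CONTACT_EVT] = KEEP,
	},
	[NODE_SYNCHING] = {
		[SELF_LOST_CONTACT_EVT] = SELF_DOWN_PEER_LEAVING,
		[PEER_LOST_CONTACT_EVT] = SELF_LEAVING_PEER_DOWN,
		[NODE_SYNCH_END_EVT] = SELF_UP_PEER_UP,
		[NODE_FAILOVER_BEGIN_EVT] = NODE_FAILINGOVER,
		[NODE_SYNCH_BEGIN_EVT] = KEEP, [SELF_ESTABL_CONTACT_EVT] = KEEP,
		[PEER_ESTABL_CONTACT_EVT] = KEEP,
	},
};

/* Return the state after evt, or ILLEGAL (0) for an unknown state or an
 * event the switch above rejects; the caller then keeps its old state.
 */
static int node_fsm_next(int state, int evt)
{
	int next;

	if (state < 1 || state >= NODE_STATE_MAX ||
	    evt < 0 || evt >= NODE_EVT_MAX)
		return ILLEGAL;
	next = next_state[state][evt];
	return next == KEEP ? state : next;
}
```

Because unlisted entries default to `ILLEGAL`, adding a new event to the enum is rejected in every state until it is listed explicitly, which matches the `default: goto illegal_evt` arms of the switch. The switch form keeps the logging and tracing calls next to the transitions, which a bare table does not.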
2015-10-22 08:51:41 -04:00
static void node_lost_contact ( struct tipc_node * n ,
2015-07-30 18:24:23 -04:00
struct sk_buff_head * inputq )
2006-01-02 19:04:38 +01:00
{
2015-02-05 08:36:42 -05:00
struct tipc_sock_conn * conn , * safe ;
2015-07-30 18:24:23 -04:00
struct tipc_link * l ;
2015-10-22 08:51:41 -04:00
struct list_head * conns = & n - > conn_sks ;
2015-02-05 08:36:42 -05:00
struct sk_buff * skb ;
uint i ;
2006-01-02 19:04:38 +01:00
2018-03-22 20:42:50 +01:00
pr_debug ( " Lost contact with %x \n " , n - > addr ) ;
2018-06-29 13:23:41 +02:00
n - > delete_at = jiffies + msecs_to_jiffies ( NODE_CLEANUP_AFTER ) ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_lost_contact ( n , true , " " ) ;
2011-04-07 11:58:08 -04:00
2015-10-22 08:51:41 -04:00
/* Clean up broadcast state */
tipc: simplify bearer level broadcast
Until now, we have been keeping track of the exact set of broadcast
destinations through the helper structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.
A closer analysis reveals that it is unnecessary to maintain this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.
This commit introduces these changes.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 08:51:42 -04:00
tipc_bcast_remove_peer ( n - > net , n - > bc_entry . link ) ;
2020-10-08 14:31:56 +07:00
skb_queue_purge ( & n - > bc_entry . namedq ) ;
2006-01-02 19:04:38 +01:00
tipc: eliminate delayed link deletion at link failover
When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.
In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.
We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:01 -04:00
/* Abort any ongoing link failover */
2006-01-02 19:04:38 +01:00
for ( i = 0 ; i < MAX_BEARERS ; i + + ) {
2015-10-22 08:51:41 -04:00
l = n - > links [ i ] . link ;
2015-07-30 18:24:23 -04:00
if ( l )
tipc_link_fsm_evt ( l , LINK_FAILOVER_END_EVT ) ;
2006-01-02 19:04:38 +01:00
}
2015-07-30 18:24:23 -04:00
2015-02-05 08:36:42 -05:00
/* Notify publications from this node */
2015-10-22 08:51:41 -04:00
n - > action_flags | = TIPC_NOTIFY_NODE_DOWN ;
2019-11-08 10:02:37 +07:00
n - > peer_net = NULL ;
n - > peer_hash_mix = 0 ;
2015-02-05 08:36:42 -05:00
/* Notify sockets connected to node */
list_for_each_entry_safe ( conn , safe , conns , list ) {
skb = tipc_msg_create ( TIPC_CRITICAL_IMPORTANCE , TIPC_CONN_MSG ,
2015-10-22 08:51:41 -04:00
SHORT_H_SIZE , 0 , tipc_own_addr ( n - > net ) ,
2015-02-05 08:36:42 -05:00
conn - > peer_node , conn - > port ,
conn - > peer_port , TIPC_ERR_NO_NODE ) ;
2015-07-30 18:24:24 -04:00
if ( likely ( skb ) )
2015-07-30 18:24:23 -04:00
skb_queue_tail ( inputq , skb ) ;
2015-02-05 08:36:42 -05:00
list_del ( & conn - > list ) ;
kfree ( conn ) ;
}
2006-01-02 19:04:38 +01:00
}
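The "tipc: simplify bearer level broadcast" message above replaces the full destination map with one counter per bearer. A minimal userspace sketch of that idea follows. The struct bcast_reach and the reach_* helpers are hypothetical names invented for illustration, and the MAX_BEARERS value is an assumption of this sketch; none of this is the kernel's actual implementation.

```c
#include <stdbool.h>

#define MAX_BEARERS 3	/* assumed value for this sketch */

/* Hypothetical model: one reachable-destination counter per bearer,
 * instead of a full map of which nodes are reachable.
 */
struct bcast_reach {
	int dests[MAX_BEARERS];
};

/* A peer became reachable via bearer_id: bump that bearer's counter. */
static void reach_add_peer(struct bcast_reach *r, int bearer_id)
{
	if (bearer_id >= 0 && bearer_id < MAX_BEARERS)
		r->dests[bearer_id]++;
}

/* A peer was lost on bearer_id: drop the counter, never below zero. */
static void reach_remove_peer(struct bcast_reach *r, int bearer_id)
{
	if (bearer_id >= 0 && bearer_id < MAX_BEARERS && r->dests[bearer_id] > 0)
		r->dests[bearer_id]--;
}

/* Broadcast only needs to know whether a bearer reaches anyone at all. */
static bool reach_bearer_in_use(const struct bcast_reach *r, int bearer_id)
{
	return bearer_id >= 0 && bearer_id < MAX_BEARERS &&
	       r->dests[bearer_id] > 0;
}
```

The counter carries no identity information, which is why the commit message relies on broadcast synchronization and retransmission to cover transitional mismatches.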
2014-04-24 16:26:47 +02:00
/**
* tipc_node_get_linkname - get the name of a link
*
2020-11-29 10:32:47 -08:00
* @ net : the applicable net namespace
2014-04-24 16:26:47 +02:00
* @ bearer_id : id of the bearer
2020-07-13 01:15:14 +02:00
* @ addr : peer node address
2014-04-24 16:26:47 +02:00
* @ linkname : link name output buffer
2020-11-29 10:32:47 -08:00
* @ len : size of @ linkname output buffer
2014-04-24 16:26:47 +02:00
*
2020-11-29 10:32:48 -08:00
* Return : 0 on success
2014-04-24 16:26:47 +02:00
*/
2015-01-09 15:27:05 +08:00
int tipc_node_get_linkname ( struct net * net , u32 bearer_id , u32 addr ,
char * linkname , size_t len )
2014-04-24 16:26:47 +02:00
{
struct tipc_link * link ;
2015-03-26 18:10:24 +08:00
int err = - EINVAL ;
2015-01-09 15:27:05 +08:00
struct tipc_node * node = tipc_node_find ( net , addr ) ;
2014-04-24 16:26:47 +02:00
2015-03-26 18:10:24 +08:00
if ( ! node )
return err ;
if ( bearer_id > = MAX_BEARERS )
goto exit ;
2015-11-19 14:30:44 -05:00
tipc_node_read_lock ( node ) ;
2015-07-16 16:54:19 -04:00
link = node - > links [ bearer_id ] . link ;
2014-04-24 16:26:47 +02:00
if ( link ) {
2015-11-19 14:30:46 -05:00
strncpy ( linkname , tipc_link_name ( link ) , len ) ;
2015-03-26 18:10:24 +08:00
err = 0 ;
2014-04-24 16:26:47 +02:00
}
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( node ) ;
2017-08-24 16:31:24 +02:00
exit :
2015-03-26 18:10:24 +08:00
tipc_node_put ( node ) ;
return err ;
2014-04-24 16:26:47 +02:00
}
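tipc_node_get_linkname() copies the name with strncpy(), which leaves the output buffer without a terminating NUL when the source is at least len bytes long. A minimal userspace sketch of a copy that always terminates is shown here. copy_link_name() is a hypothetical helper written for illustration; it is not a kernel function and is not what node.c does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace helper, not part of net/tipc/node.c.
 * Copies src into dst (capacity len) and always NUL-terminates
 * when len > 0.
 * Returns 0 if the whole name fit, -1 if it was truncated or len == 0.
 */
static int copy_link_name(char *dst, const char *src, size_t len)
{
	size_t n;

	if (len == 0)
		return -1;

	n = strlen(src);
	if (n >= len) {
		/* Name too long: keep the first len - 1 bytes and terminate. */
		memcpy(dst, src, len - 1);
		dst[len - 1] = '\0';
		return -1;
	}

	/* Name fits: copy it together with its terminating NUL. */
	memcpy(dst, src, n + 1);
	return 0;
}
```

A caller that must distinguish truncation from success can check the return value; a caller that only needs a printable string can ignore it.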
2014-05-05 08:56:12 +08:00
2014-11-20 10:29:17 +01:00
/* Caller should hold node lock for the passed node */
2014-11-24 11:10:29 +01:00
static int __tipc_nl_add_node ( struct tipc_nl_msg * msg , struct tipc_node * node )
2014-11-20 10:29:17 +01:00
{
void * hdr ;
struct nlattr * attrs ;
2015-02-09 09:50:03 +01:00
hdr = genlmsg_put ( msg - > skb , msg - > portid , msg - > seq , & tipc_genl_family ,
2014-11-20 10:29:17 +01:00
NLM_F_MULTI , TIPC_NL_NODE_GET ) ;
if ( ! hdr )
return - EMSGSIZE ;
2019-04-26 11:13:06 +02:00
attrs = nla_nest_start_noflag ( msg - > skb , TIPC_NLA_NODE ) ;
2014-11-20 10:29:17 +01:00
if ( ! attrs )
goto msg_full ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_NODE_ADDR , node - > addr ) )
goto attr_msg_full ;
2017-10-13 11:04:19 +02:00
if ( node_is_up ( node ) )
2014-11-20 10:29:17 +01:00
if ( nla_put_flag ( msg - > skb , TIPC_NLA_NODE_UP ) )
goto attr_msg_full ;
nla_nest_end ( msg - > skb , attrs ) ;
genlmsg_end ( msg - > skb , hdr ) ;
return 0 ;
attr_msg_full :
nla_nest_cancel ( msg - > skb , attrs ) ;
msg_full :
genlmsg_cancel ( msg - > skb , hdr ) ;
return - EMSGSIZE ;
}
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive path through the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go through
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces' hash_mix values. If it finds a match, that, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
static void tipc_lxc_xmit ( struct net * peer_net , struct sk_buff_head * list )
{
struct tipc_msg * hdr = buf_msg ( skb_peek ( list ) ) ;
struct sk_buff_head inputq ;
switch ( msg_user ( hdr ) ) {
case TIPC_LOW_IMPORTANCE :
case TIPC_MEDIUM_IMPORTANCE :
case TIPC_HIGH_IMPORTANCE :
case TIPC_CRITICAL_IMPORTANCE :
2020-03-26 09:50:29 +07:00
if ( msg_connected ( hdr ) | | msg_named ( hdr ) | |
msg_direct ( hdr ) ) {
2019-10-29 07:51:21 +07:00
tipc_loopback_trace ( peer_net , list ) ;
spin_lock_init ( & list - > lock ) ;
tipc_sk_rcv ( peer_net , list ) ;
return ;
}
if ( msg_mcast ( hdr ) ) {
tipc_loopback_trace ( peer_net , list ) ;
skb_queue_head_init ( & inputq ) ;
tipc_sk_mcast_rcv ( peer_net , list , & inputq ) ;
__skb_queue_purge ( list ) ;
skb_queue_purge ( & inputq ) ;
return ;
}
return ;
case MSG_FRAGMENTER :
if ( tipc_msg_assemble ( list ) ) {
tipc_loopback_trace ( peer_net , list ) ;
skb_queue_head_init ( & inputq ) ;
tipc_sk_mcast_rcv ( peer_net , list , & inputq ) ;
__skb_queue_purge ( list ) ;
skb_queue_purge ( & inputq ) ;
}
return ;
case GROUP_PROTOCOL :
case CONN_MANAGER :
tipc_loopback_trace ( peer_net , list ) ;
spin_lock_init ( & list - > lock ) ;
tipc_sk_rcv ( peer_net , list ) ;
return ;
case LINK_PROTOCOL :
case NAME_DISTRIBUTOR :
case TUNNEL_PROTOCOL :
case BCAST_PROTOCOL :
return ;
default :
return ;
2020-11-01 07:58:22 -08:00
}
2019-10-29 07:51:21 +07:00
}
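The "tipc: improve throughput between nodes in netns" message describes how a receiver decides that a peer lives in a local namespace: the proof carried in the discovery message must match some local namespace's hash mix, and the node id and cluster id must match as well. The check can be sketched in userspace as below. The struct local_ns, the NODE_ID_LEN value, and find_peer_ns() are hypothetical names invented for this illustration; the kernel's real data structures and comparison differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16	/* assumed length for this sketch */

/* Hypothetical model of what is known about one local namespace. */
struct local_ns {
	uint32_t hash_mix;
	uint32_t cluster_id;
	uint8_t node_id[NODE_ID_LEN];
};

/* Return the local namespace that matches the received proof, cluster id
 * and node id, or NULL when no namespace matches all three.
 */
static const struct local_ns *find_peer_ns(const struct local_ns *ns,
					   size_t count, uint32_t proof,
					   uint32_t cluster_id,
					   const uint8_t *node_id)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (ns[i].hash_mix != proof)
			continue;
		if (ns[i].cluster_id != cluster_id)
			continue;
		if (memcmp(ns[i].node_id, node_id, NODE_ID_LEN) != 0)
			continue;
		return &ns[i];
	}
	return NULL;
}
```

Requiring all three fields to match mirrors the commit message's point that the hash mix alone is not treated as sufficient proof.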
2015-07-16 16:54:24 -04:00
/**
2021-01-14 09:04:48 +01:00
* tipc_node_xmit ( ) - general link level function for message sending
2015-07-16 16:54:24 -04:00
* @ net : the applicable net namespace
* @ list : chain of buffers containing message
* @ dnode : address of destination node
* @ selector : a number used for deterministic link selection
tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.
This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.
In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.
The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 10:55:11 -05:00
* Consumes the buffer chain .
2020-11-29 10:32:48 -08:00
* Return : 0 if success , otherwise : - ELINKCONG , - EHOSTUNREACH , - EMSGSIZE , - ENOBUFS
2015-07-16 16:54:24 -04:00
*/
int tipc_node_xmit ( struct net * net , struct sk_buff_head * list ,
u32 dnode , int selector )
{
2015-11-19 14:30:44 -05:00
struct tipc_link_entry * le = NULL ;
2015-07-16 16:54:24 -04:00
struct tipc_node * n ;
struct sk_buff_head xmitq ;
2019-10-29 07:51:21 +07:00
bool node_up = false ;
2016-02-11 10:43:15 +01:00
int bearer_id ;
int rc ;
if ( in_own_node ( net , dnode ) ) {
2019-08-07 12:52:29 +10:00
tipc_loopback_trace ( net , list ) ;
2019-08-15 16:42:50 +02:00
spin_lock_init ( & list - > lock ) ;
2016-02-11 10:43:15 +01:00
tipc_sk_rcv ( net , list ) ;
return 0 ;
}
2015-07-16 16:54:24 -04:00
n = tipc_node_find ( net , dnode ) ;
2016-02-11 10:43:15 +01:00
if ( unlikely ( ! n ) ) {
2019-08-15 16:42:50 +02:00
__skb_queue_purge ( list ) ;
2016-02-11 10:43:15 +01:00
return - EHOSTUNREACH ;
}
tipc_node_read_lock ( n ) ;
2019-10-29 07:51:21 +07:00
node_up = node_is_up ( n ) ;
if ( node_up & & n - > peer_net & & check_net ( n - > peer_net ) ) {
/* xmit inner linux container */
tipc_lxc_xmit ( n - > peer_net , list ) ;
if ( likely ( skb_queue_empty ( list ) ) ) {
tipc_node_read_unlock ( n ) ;
tipc_node_put ( n ) ;
return 0 ;
}
}
2016-02-11 10:43:15 +01:00
bearer_id = n - > active_links [ selector & 1 ] ;
if ( unlikely ( bearer_id = = INVALID_BEARER_ID ) ) {
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( n ) ;
2015-07-16 16:54:24 -04:00
tipc_node_put ( n ) ;
2019-08-15 16:42:50 +02:00
__skb_queue_purge ( list ) ;
2016-02-11 10:43:15 +01:00
return - EHOSTUNREACH ;
2015-07-16 16:54:24 -04:00
}
2015-11-19 14:30:44 -05:00
2016-02-11 10:43:15 +01:00
__skb_queue_head_init ( & xmitq ) ;
le = & n - > links [ bearer_id ] ;
spin_lock_bh ( & le - > lock ) ;
rc = tipc_link_xmit ( le - > link , list , & xmitq ) ;
spin_unlock_bh ( & le - > lock ) ;
tipc_node_read_unlock ( n ) ;
2017-01-03 10:55:11 -05:00
if ( unlikely ( rc = = - ENOBUFS ) )
2016-02-11 10:43:15 +01:00
tipc_node_link_down ( n , bearer_id , false ) ;
2017-01-03 10:55:11 -05:00
else
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( net , bearer_id , & xmitq , & le - > maddr , n ) ;
2016-02-11 10:43:15 +01:00
tipc_node_put ( n ) ;
2015-12-02 15:19:37 -05:00
return rc ;
2015-07-16 16:54:24 -04:00
}
/* tipc_node_xmit_skb(): send single buffer to destination
* Buffers sent via this function are generally TIPC_SYSTEM_IMPORTANCE
* messages , which will not be rejected
* The only exception is datagram messages rerouted after secondary
* lookup , which are rare and safe to dispose of anyway .
*/
int tipc_node_xmit_skb ( struct net * net , struct sk_buff * skb , u32 dnode ,
u32 selector )
{
struct sk_buff_head head ;
2019-08-15 16:42:50 +02:00
__skb_queue_head_init ( & head ) ;
2015-07-16 16:54:24 -04:00
__skb_queue_tail ( & head , skb ) ;
2017-01-03 10:55:11 -05:00
tipc_node_xmit ( net , & head , dnode , selector ) ;
2015-07-16 16:54:24 -04:00
return 0 ;
}
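The congestion handling described in the "reduce risk of user starvation" commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the kernel implementation: `struct toy_link`, `toy_link_xmit()`, `toy_link_release()`, the fixed sender table, and the use of -EAGAIN and -EBUSY as return codes are all hypothetical names and choices made for illustration. It shows a message being queued even when the backlog is full (one oversubscribed message per sender), the sender being told to wait, and the wakeup being issued only once the backlog has drained below its limit.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_SENDERS 8

/* Toy model of a link backlog; NOT the kernel's struct tipc_link. */
struct toy_link {
	int backlog_len;               /* messages currently queued */
	int backlog_limit;             /* nominal backlog capacity */
	bool waiting[TOY_MAX_SENDERS]; /* one wakeup item per blocked sender */
};

void toy_link_init(struct toy_link *l, int limit)
{
	l->backlog_len = 0;
	l->backlog_limit = limit;
	for (int i = 0; i < TOY_MAX_SENDERS; i++)
		l->waiting[i] = false;
}

/*
 * Queue one message from 'sender'.
 * Returns 0 if the link was not congested, or -EAGAIN if the message
 * was queued anyway but the sender must now wait for a wakeup.
 * Assumption of this toy model: a sender that is already waiting gets
 * -EBUSY and nothing is queued, which bounds the oversubscription to
 * one message per sender.
 */
int toy_link_xmit(struct toy_link *l, int sender)
{
	bool congested;

	if (sender < 0 || sender >= TOY_MAX_SENDERS)
		return -EINVAL;
	if (l->waiting[sender])
		return -EBUSY;

	congested = l->backlog_len >= l->backlog_limit;
	l->backlog_len++;              /* the message is accepted either way */
	if (!congested)
		return 0;

	l->waiting[sender] = true;     /* create the wakeup item */
	return -EAGAIN;
}

/*
 * Remove up to 'n' messages from the backlog (e.g. after they were sent).
 * Returns the number of senders woken up; wakeups are issued only when
 * the backlog has dropped below its limit again.
 */
int toy_link_release(struct toy_link *l, int n)
{
	int woken = 0;

	if (n < 0)
		n = 0;
	if (n > l->backlog_len)
		n = l->backlog_len;
	l->backlog_len -= n;

	if (l->backlog_len >= l->backlog_limit)
		return 0;

	for (int i = 0; i < TOY_MAX_SENDERS; i++) {
		if (l->waiting[i]) {
			l->waiting[i] = false;
			woken++;
		}
	}
	return woken;
}
```

In this model a sender that receives -EAGAIN can treat its message as sent and simply wait; it does not need to retry the same message after the wakeup, which is the simplification the commit message points to.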
2017-10-13 11:04:21 +02:00
/* tipc_node_distr_xmit(): send single buffer msgs to individual destinations
* Note : this is only for SYSTEM_IMPORTANCE messages , which cannot be rejected
*/
int tipc_node_distr_xmit ( struct net * net , struct sk_buff_head * xmitq )
{
struct sk_buff * skb ;
u32 selector , dnode ;
while ( ( skb = __skb_dequeue ( xmitq ) ) ) {
selector = msg_origport ( buf_msg ( skb ) ) ;
dnode = msg_destnode ( buf_msg ( skb ) ) ;
tipc_node_xmit_skb ( net , skb , dnode , selector ) ;
}
return 0 ;
}
tipc: update a binding service via broadcast
Currently, binding table updates (adding a service binding to the
name table or withdrawing one) are sent over replicast.
However, when scaling clusters up to > 100 nodes/containers this
method is less efficient because it loops through the nodes in a cluster one
by one.
It is worthwhile to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypasses the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add sanity check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17 13:56:05 +07:00
void tipc_node_broadcast ( struct net * net , struct sk_buff * skb , int rc_dests )
2015-11-19 14:30:42 -05:00
{
2020-06-17 13:56:05 +07:00
struct sk_buff_head xmitq ;
2015-11-19 14:30:42 -05:00
struct sk_buff * txskb ;
struct tipc_node * n ;
2020-06-17 13:56:05 +07:00
u16 dummy ;
2015-11-19 14:30:42 -05:00
u32 dst ;
2020-06-17 13:56:05 +07:00
/* Use broadcast if all nodes support it */
if ( ! rc_dests & & tipc_bcast_get_mode ( net ) ! = BCLINK_MODE_RCAST ) {
__skb_queue_head_init ( & xmitq ) ;
__skb_queue_tail ( & xmitq , skb ) ;
tipc_bcast_xmit ( net , & xmitq , & dummy ) ;
return ;
}
/* Otherwise use legacy replicast method */
2015-11-19 14:30:42 -05:00
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( n , tipc_nodes ( net ) , list ) {
dst = n - > addr ;
if ( in_own_node ( net , dst ) )
continue ;
2017-10-13 11:04:19 +02:00
if ( ! node_is_up ( n ) )
2015-11-19 14:30:42 -05:00
continue ;
txskb = pskb_copy ( skb , GFP_ATOMIC ) ;
if ( ! txskb )
break ;
msg_set_destnode ( buf_msg ( txskb ) , dst ) ;
tipc_node_xmit_skb ( net , txskb , dst , 0 ) ;
}
rcu_read_unlock ( ) ;
kfree_skb ( skb ) ;
}
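The choice that tipc_node_broadcast() makes between a single broadcast and per-node replicast can be sketched outside the kernel as follows. This is a simplified, hypothetical model: `struct toy_peer`, `toy_plan_distribution()`, and the `bcast_supported` flag (standing in for the `tipc_bcast_get_mode(net) != BCLINK_MODE_RCAST` check) are assumptions introduced for illustration, and `rc_dests` is simply passed through as in the source. Instead of copying and sending buffers, the sketch only collects the destination addresses that the replicast loop would visit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_method { TOY_BROADCAST, TOY_REPLICAST };

/* Hypothetical stand-in for the per-peer information used by the loop. */
struct toy_peer {
	uint32_t addr; /* node address */
	bool up;       /* true if the node is currently reachable */
};

/*
 * Decide how a message would be distributed.
 * - If rc_dests is zero and broadcast is supported, one broadcast is
 *   enough and no unicast destinations are produced.
 * - Otherwise every peer that is up, except the own node, becomes a
 *   unicast destination (replicast).
 * 'dsts' must have room for 'npeers' entries; '*ndsts' receives the
 * number of entries written.
 */
enum toy_method toy_plan_distribution(const struct toy_peer *peers,
				      size_t npeers, uint32_t own_addr,
				      int rc_dests, bool bcast_supported,
				      uint32_t *dsts, size_t *ndsts)
{
	*ndsts = 0;

	if (!rc_dests && bcast_supported)
		return TOY_BROADCAST;

	for (size_t i = 0; i < npeers; i++) {
		if (peers[i].addr == own_addr)
			continue;       /* never send to ourselves */
		if (!peers[i].up)
			continue;       /* skip unreachable nodes */
		dsts[(*ndsts)++] = peers[i].addr;
	}
	return TOY_REPLICAST;
}
```

The replicast branch does work proportional to the number of peers, which is the scaling cost the commit message gives as the reason for preferring broadcast in large clusters.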
2017-01-18 13:50:52 -05:00
static void tipc_node_mcast_rcv ( struct tipc_node * n )
{
struct tipc_bclink_entry * be = & n - > bc_entry ;
/* 'arrvq' is under inputq2's lock protection */
spin_lock_bh ( & be - > inputq2 . lock ) ;
spin_lock_bh ( & be - > inputq1 . lock ) ;
skb_queue_splice_tail_init ( & be - > inputq1 , & be - > arrvq ) ;
spin_unlock_bh ( & be - > inputq1 . lock ) ;
spin_unlock_bh ( & be - > inputq2 . lock ) ;
tipc_sk_mcast_rcv ( n - > net , & be - > arrvq , & be - > inputq2 ) ;
}
2016-09-01 13:52:49 -04:00
static void tipc_node_bc_sync_rcv ( struct tipc_node * n , struct tipc_msg * hdr ,
int bearer_id , struct sk_buff_head * xmitq )
{
struct tipc_link * ucl ;
int rc ;
2020-05-26 16:38:36 +07:00
rc = tipc_bcast_sync_rcv ( n - > net , n - > bc_entry . link , hdr , xmitq ) ;
2016-09-01 13:52:49 -04:00
if ( rc & TIPC_LINK_DOWN_EVT ) {
2017-08-21 17:59:30 +02:00
tipc_node_reset_links ( n ) ;
2016-09-01 13:52:49 -04:00
return ;
}
if ( ! ( rc & TIPC_LINK_SND_STATE ) )
return ;
/* If probe message, a STATE response will be sent anyway */
if ( msg_probe ( hdr ) )
return ;
/* Produce a STATE message carrying broadcast NACK */
tipc_node_read_lock ( n ) ;
ucl = n - > links [ bearer_id ] . link ;
if ( ucl )
tipc_link_build_state_msg ( ucl , xmitq ) ;
tipc_node_read_unlock ( n ) ;
}
2015-10-22 08:51:41 -04:00
/**
* tipc_node_bc_rcv - process TIPC broadcast packet arriving from off - node
* @ net : the applicable net namespace
* @ skb : TIPC packet
* @ bearer_id : id of bearer message arrived on
*
* Invoked with no locks held .
*/
2015-10-24 22:56:01 +08:00
static void tipc_node_bc_rcv ( struct net * net , struct sk_buff * skb , int bearer_id )
2015-10-22 08:51:41 -04:00
{
int rc ;
struct sk_buff_head xmitq ;
struct tipc_bclink_entry * be ;
struct tipc_link_entry * le ;
struct tipc_msg * hdr = buf_msg ( skb ) ;
int usr = msg_user ( hdr ) ;
u32 dnode = msg_destnode ( hdr ) ;
struct tipc_node * n ;
__skb_queue_head_init ( & xmitq ) ;
/* If NACK for other node, let rcv link for that node peek into it */
if ( ( usr = = BCAST_PROTOCOL ) & & ( dnode ! = tipc_own_addr ( net ) ) )
n = tipc_node_find ( net , dnode ) ;
else
n = tipc_node_find ( net , msg_prevnode ( hdr ) ) ;
if ( ! n ) {
kfree_skb ( skb ) ;
return ;
}
be = & n - > bc_entry ;
le = & n - > links [ bearer_id ] ;
rc = tipc_bcast_rcv ( net , be - > link , skb ) ;
/* Broadcast ACKs are sent on a unicast link */
2016-09-01 13:52:49 -04:00
if ( rc & TIPC_LINK_SND_STATE ) {
2015-11-19 14:30:44 -05:00
tipc_node_read_lock ( n ) ;
2016-04-15 13:33:07 -04:00
tipc_link_build_state_msg ( le - > link , & xmitq ) ;
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( n ) ;
2015-10-22 08:51:41 -04:00
}
if ( ! skb_queue_empty ( & xmitq ) )
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( net , bearer_id , & xmitq , & le - > maddr , n ) ;
2015-10-22 08:51:41 -04:00
2017-01-18 13:50:52 -05:00
if ( ! skb_queue_empty ( & be - > inputq1 ) )
tipc_node_mcast_rcv ( n ) ;
2016-07-11 16:08:37 -04:00
2018-12-18 17:43:52 +08:00
/* Handle NAME_DISTRIBUTOR messages sent from 1.7 nodes */
if ( ! skb_queue_empty ( & n - > bc_entry . namedq ) )
2020-06-17 13:56:05 +07:00
tipc_named_rcv ( net , & n - > bc_entry . namedq ,
& n - > bc_entry . named_rcv_nxt ,
& n - > bc_entry . named_open ) ;
2018-12-18 17:43:52 +08:00
2017-08-21 17:59:30 +02:00
/* If reassembly or retransmission failure => reset all links to peer */
if ( rc & TIPC_LINK_DOWN_EVT )
tipc_node_reset_links ( n ) ;
2016-07-11 16:08:37 -04:00
2015-10-22 08:51:41 -04:00
tipc_node_put ( n ) ;
}
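The tipc_named_rcv() call in the function above is handed a receive sequence counter and an "open" flag. The four delivery rules listed in the "update a binding service via broadcast" commit message can be modelled roughly as below. This is a sketch under assumptions, not the kernel's dequeue logic: `struct toy_msg`, `struct toy_rcv_state`, `toy_named_classify()`, and `toy_seq_less()` are hypothetical names, the initial value of `rcv_nxt` is assumed to be agreed elsewhere, and the ordering of the checks is one plausible reading of the description.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_verdict { TOY_DELIVER, TOY_DEFER, TOY_DROP };

/* Hypothetical view of the header bits named in the commit message. */
struct toy_msg {
	bool is_not_legacy; /* set on all new-style messages */
	bool is_bulk;       /* unicast bulk update */
	bool is_last_bulk;  /* last message of the bulk update */
	uint16_t seqno;     /* only meaningful for non-bulk new messages */
};

struct toy_rcv_state {
	bool open;          /* true once the last bulk message has arrived */
	uint16_t rcv_nxt;   /* next expected sequence number */
};

/* 16-bit wraparound comparison: true if a is logically before b. */
bool toy_seq_less(uint16_t a, uint16_t b)
{
	return a != b && (uint16_t)(b - a) < 32768u;
}

enum toy_verdict toy_named_classify(struct toy_rcv_state *st,
				    const struct toy_msg *m)
{
	/* Rule 4: legacy messages are delivered unconditionally. */
	if (!m->is_not_legacy)
		return TOY_DELIVER;

	/* Rules 1 and 3: bulk messages bypass the sequence check, and
	 * the last one opens up reception of broadcast updates.
	 */
	if (m->is_bulk) {
		if (m->is_last_bulk)
			st->open = true;
		return TOY_DELIVER;
	}

	/* Rule 1: hold back broadcast updates until the bulk is done. */
	if (!st->open)
		return TOY_DEFER;

	/* Rule 2: deliver in sequence order. */
	if (m->seqno == st->rcv_nxt) {
		st->rcv_nxt++;
		return TOY_DELIVER;
	}

	/* Sanity check: anything below the expected point is dropped. */
	if (toy_seq_less(m->seqno, st->rcv_nxt))
		return TOY_DROP;

	/* Ahead of sequence: keep it queued for a later pass. */
	return TOY_DEFER;
}
```

A caller would run this over its pending queue repeatedly, delivering or dropping entries until a pass makes no further progress; deferred messages simply stay queued.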
2015-07-30 18:24:19 -04:00
/**
* tipc_node_check_state - check and if necessary update node state
2020-11-29 10:32:47 -08:00
* @ n : target tipc_node
2015-07-30 18:24:19 -04:00
* @ skb : TIPC packet
* @ bearer_id : identity of bearer delivering the packet
2020-11-29 10:32:47 -08:00
* @ xmitq : queue for messages to be xmited on
2020-11-29 10:32:48 -08:00
* Return : true if state and msg are ok , otherwise false
2015-07-30 18:24:16 -04:00
*/
2015-07-30 18:24:19 -04:00
static bool tipc_node_check_state ( struct tipc_node * n , struct sk_buff * skb ,
2015-07-30 18:24:21 -04:00
int bearer_id , struct sk_buff_head * xmitq )
2015-07-30 18:24:16 -04:00
{
struct tipc_msg * hdr = buf_msg ( skb ) ;
2015-07-30 18:24:19 -04:00
int usr = msg_user ( hdr ) ;
int mtyp = msg_type ( hdr ) ;
2015-07-30 18:24:16 -04:00
u16 oseqno = msg_seqno ( hdr ) ;
2015-07-30 18:24:19 -04:00
u16 exp_pkts = msg_msgcnt ( hdr ) ;
2015-11-19 14:30:46 -05:00
u16 rcv_nxt , syncpt , dlv_nxt , inputq_len ;
2015-07-30 18:24:19 -04:00
int state = n - > state ;
tipc: fix stale link problem during synchronization
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has led to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.
Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen on rare occasions.
The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.
However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trigger a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 02:12:56 -04:00
struct tipc_link * l , * tnl , * pl = NULL ;
2015-07-30 18:24:23 -04:00
struct tipc_media_addr * maddr ;
2015-11-19 14:30:46 -05:00
int pb_id ;
2015-07-30 18:24:16 -04:00
2018-12-19 09:17:59 +07:00
if ( trace_tipc_node_check_state_enabled ( ) ) {
trace_tipc_skb_dump ( skb , false , " skb for node state check " ) ;
trace_tipc_node_check_state ( n , true , " " ) ;
}
2015-07-30 18:24:19 -04:00
l = n - > links [ bearer_id ] . link ;
if ( ! l )
return false ;
2015-11-19 14:30:46 -05:00
rcv_nxt = tipc_link_rcv_nxt ( l ) ;
2015-07-30 18:24:16 -04:00
2015-07-30 18:24:19 -04:00
if ( likely ( ( state = = SELF_UP_PEER_UP ) & & ( usr ! = TUNNEL_PROTOCOL ) ) )
return true ;
2015-07-30 18:24:16 -04:00
2015-07-30 18:24:19 -04:00
/* Find parallel link, if any */
2015-11-19 14:30:46 -05:00
for ( pb_id = 0 ; pb_id < MAX_BEARERS ; pb_id + + ) {
if ( ( pb_id ! = bearer_id ) & & n - > links [ pb_id ] . link ) {
pl = n - > links [ pb_id ] . link ;
2015-07-30 18:24:19 -04:00
break ;
}
}
2015-07-30 18:24:16 -04:00
2018-12-19 09:17:57 +07:00
if ( ! tipc_link_validate_msg ( l , hdr ) ) {
trace_tipc_skb_dump ( skb , false , " PROTO invalid (2)! " ) ;
trace_tipc_link_dump ( l , TIPC_DUMP_NONE , " PROTO invalid (2)! " ) ;
2018-07-10 01:07:36 +02:00
return false ;
2018-12-19 09:17:57 +07:00
}
2018-07-10 01:07:36 +02:00
2015-11-19 14:30:44 -05:00
/* Check and update node accessibility if applicable */
2015-07-30 18:24:19 -04:00
if ( state = = SELF_UP_PEER_COMING ) {
if ( ! tipc_link_is_up ( l ) )
return true ;
if ( ! msg_peer_link_is_up ( hdr ) )
return true ;
tipc_node_fsm_evt ( n , PEER_ESTABL_CONTACT_EVT ) ;
}
if ( state = = SELF_DOWN_PEER_LEAVING ) {
if ( msg_peer_node_is_up ( hdr ) )
return false ;
tipc_node_fsm_evt ( n , PEER_LOST_CONTACT_EVT ) ;
2015-11-19 14:30:41 -05:00
return true ;
2015-07-30 18:24:19 -04:00
}
2015-11-19 14:30:44 -05:00
if ( state = = SELF_LEAVING_PEER_DOWN )
return false ;
2015-07-30 18:24:19 -04:00
/* Ignore duplicate packets */
tipc: eliminate risk of stalled link synchronization
In commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.
During link synchronization we use the arrival of any packet on
the tunnel link to trigger a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
triggers the link to leave synchronization state.
At the same time, LINK_PROTOCOL packets which happen to have a (non-
valid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.
We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.
Fixes: commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13 12:41:51 -04:00
if ( ( usr ! = LINK_PROTOCOL ) & & less ( oseqno , rcv_nxt ) )
2015-07-30 18:24:19 -04:00
return true ;
/* Initiate or update failover mode if applicable */
if ( ( usr = = TUNNEL_PROTOCOL ) & & ( mtyp = = FAILOVER_MSG ) ) {
syncpt = oseqno + exp_pkts - 1 ;
2019-06-17 11:56:12 +07:00
if ( pl & & ! tipc_link_is_reset ( pl ) ) {
2015-07-30 18:24:23 -04:00
__tipc_node_link_down ( n , & pb_id , xmitq , & maddr ) ;
2018-12-19 09:17:59 +07:00
trace_tipc_node_link_down ( n , true ,
" node link down <- failover! " ) ;
2015-11-19 14:30:46 -05:00
tipc_skb_queue_splice_tail_init ( tipc_link_inputq ( pl ) ,
tipc_link_inputq ( l ) ) ;
2015-07-30 18:24:23 -04:00
}
tipc: fix missing Name entries due to half-failover
A TIPC link can temporarily fall into a "half-establish" state in which only one of
the link endpoints is ESTABLISHED and starts to send traffic, PROTOCOL
messages, whereas the other link endpoint is not up (e.g. immediately
when the endpoint receives ACTIVATE_MSG, the network interface goes
down...).
This is a normal situation and will be settled because the link
endpoint will be eventually brought down after the link tolerance time.
However, the situation will become worse when the second link is
established before the first link endpoint goes down.
For example:
1. Both links <1A-2A>, <1B-2B> down
2. Link endpoint 2A up, but 1A still down (e.g. due to network
disturbance, wrong session, etc.)
3. Link <1B-2B> up
4. Link endpoint 2A down (e.g. due to link tolerance timeout)
5. Node B starts failover onto link <1B-2B>
==> Node A never starts link failover.
When the "half-failover" situation happens, two consequences have been
observed:
a) Peer link/node gets stuck in FAILINGOVER state;
b) Traffic or user messages that peer node is trying to failover onto
the second link can be partially or completely dropped by this node.
The consequence a) was actually solved by commit c140eb166d68 ("tipc:
fix failover problem"), but that commit didn't cover the b). It's due
to the fact that the tunnel link endpoint has never been prepared for a
failover, so the 'l->drop_point' (and the other data...) is not set
correctly. When a TUNNEL_MSG from peer node arrives on the link,
depending on the inner message's seqno and the current 'l->drop_point'
value, the message can be dropped (- treated as a duplicate message) or
processed.
At this early stage, the traffic messages from peer are likely to be
NAME_DISTRIBUTORs, this means some name table entries will be missed on
the node forever!
The commit resolves the issue by starting the FAILOVER process on this
node as well. Another benefit from this solution is that we ensure the
link will not be re-established until the failover ends.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 17:23:23 +07:00
2018-09-26 21:00:54 +02:00
/* If parallel link was already down, and this happened before
2019-05-02 17:23:23 +07:00
* the tunnel link came up , node failover was never started .
* Ensure that a FAILOVER_MSG is sent to get peer out of
* NODE_FAILINGOVER state , also this node must accept
* TUNNEL_MSGs from peer .
2018-09-26 21:00:54 +02:00
*/
2019-05-02 17:23:23 +07:00
if ( n - > state ! = NODE_FAILINGOVER )
tipc_node_link_failover ( n , pl , l , xmitq ) ;
2015-07-30 18:24:19 -04:00
/* If pkts arrive out of order, use lowest calculated syncpt */
if ( less ( syncpt , n - > sync_point ) )
n - > sync_point = syncpt ;
}
/* Open parallel link when tunnel link reaches synch point */
tipc: eliminate risk of premature link setup during failover
When a link goes down, and there is still a working link towards its
destination node, a failover is initiated, and the failed link is not
allowed to re-establish until that procedure is finished. To ensure
this, the concerned link endpoints are set to state LINK_FAILINGOVER,
and the node endpoints to NODE_FAILINGOVER during the failover period.
However, if the link reset is due to a disabled bearer, the
corresponding link endpoint is deleted, and only the node endpoint knows
about the ongoing failover. Now, if the disabled bearer is re-enabled
during the failover period, the discovery mechanism may create a new
link endpoint that is ready to be established, even though this is not
permitted. This situation may cause both the ongoing failover and any
subsequent link synchronization to fail.
In this commit, we ensure that a newly created link goes directly to
state LINK_FAILINGOVER if the corresponding node state is
NODE_FAILINGOVER. This eliminates the problem described above.
Furthermore, we tighten the criteria for which packets are allowed
to end a failover state in the function tipc_node_check_state().
By checking that the receiving link is up and running, instead of just
checking that it is not in failover mode, we eliminate the risk that
protocol packets from the re-created link may cause the failover to
be prematurely terminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
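The tightened criterion described above can be sketched roughly as follows. This is a minimal, self-contained model and not the kernel code: every `demo_` name, the two-state enum, and the state entered once failover ends are assumptions made for illustration. The sequence comparison is modelled on 16-bit wraparound arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced node state set; the real FSM has more states. */
enum demo_node_state { DEMO_SELF_UP_PEER_UP, DEMO_NODE_FAILINGOVER };

struct demo_node {
	enum demo_node_state state;
	uint16_t sync_point;
};

/* Wraparound-safe "left is after right" for 16-bit sequence numbers. */
static bool demo_seq_more(uint16_t left, uint16_t right)
{
	return (int16_t)(uint16_t)(right - left) < 0;
}

/*
 * Returns true only when the failover state was actually ended.
 * The receiving link must be up and running; a link that is merely
 * "not failing over" (e.g. a freshly re-created endpoint) is not enough.
 */
bool demo_try_end_failover(struct demo_node *n, bool rcv_link_up,
			   uint16_t rcv_nxt)
{
	assert(n != NULL);
	if (n->state != DEMO_NODE_FAILINGOVER || !rcv_link_up)
		return false;
	if (!demo_seq_more(rcv_nxt, n->sync_point))
		return false;	/* sync point not passed yet */
	n->state = DEMO_SELF_UP_PEER_UP;	/* assumed post-failover state */
	return true;
}
```

In this model a packet arriving on a link that is not up can never terminate the failover, which is the property the commit message asks for.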
2015-08-20 02:12:54 -04:00
if ( ( n - > state = = NODE_FAILINGOVER ) & & tipc_link_is_up ( l ) ) {
2015-07-30 18:24:21 -04:00
if ( ! more ( rcv_nxt , n - > sync_point ) )
return true ;
2015-07-30 18:24:19 -04:00
tipc_node_fsm_evt ( n , NODE_FAILOVER_END_EVT ) ;
if ( pl )
2015-07-30 18:24:21 -04:00
tipc_link_fsm_evt ( pl , LINK_FAILOVER_END_EVT ) ;
2015-07-30 18:24:19 -04:00
return true ;
}
2015-08-20 02:12:55 -04:00
/* No synching needed if only one link */
if ( ! pl | | ! tipc_link_is_up ( pl ) )
return true ;
tipc: eliminate risk of stalled link synchronization
In commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.
During link synchronization we use the arrival of any packet on
the tunnel link to trigger a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
triggers the link to leave synchronization state.
At the same time, LINK_PROTOCOL packets which happen to have an
(invalid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.
We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.
Fixes: commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
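As a rough illustration of the gating rule above, the sketch below lets only the logically first SYNCH packet open a synch state and derives a sync point the way the non-enhanced branch in this file does (inner sequence number plus expected packet count, minus one). The `demo_` names and the numeric values of the enum constants are hypothetical. Where `exp_pkts` comes from is not shown in this part of the file, so it is simply a parameter here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the message user and message type codes. */
enum { DEMO_TUNNEL_PROTOCOL = 1, DEMO_LINK_PROTOCOL = 2 };
enum { DEMO_SYNCH_MSG = 0, DEMO_FAILOVER_MSG = 1 };

/*
 * Only a SYNCH message carrying outer sequence number 1 may initiate
 * synchronization, independently of the order in which packets arrive.
 */
bool demo_may_begin_synch(int usr, int mtyp, uint16_t oseqno)
{
	return usr == DEMO_TUNNEL_PROTOCOL &&
	       mtyp == DEMO_SYNCH_MSG &&
	       oseqno == 1;
}

/* Sync point computed in 16-bit arithmetic so that it wraps like seqnos. */
uint16_t demo_legacy_syncpt(uint16_t inner_seqno, uint16_t exp_pkts)
{
	assert(exp_pkts > 0);
	return (uint16_t)(inner_seqno + exp_pkts - 1);
}
```

A later SYNCH packet (outer sequence number 2 or higher) is then handled as ordinary tunnel traffic and cannot raise the sync point again.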
2015-10-13 12:41:51 -04:00
/* Initiate synch mode if applicable */
if ( ( usr = = TUNNEL_PROTOCOL ) & & ( mtyp = = SYNCH_MSG ) & & ( oseqno = = 1 ) ) {
2019-07-24 08:56:11 +07:00
if ( n - > capabilities & TIPC_TUNNEL_ENHANCED )
syncpt = msg_syncpt ( hdr ) ;
else
syncpt = msg_seqno ( msg_inner_hdr ( hdr ) ) + exp_pkts - 1 ;
tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes comes up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.
However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before initializing synchronization.
Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it prevents tipc_node_link_up() from
performing the rest of its tasks, and the link endpoint in question hence
is never taken into traffic.
This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.
We fix this by removing the unnecessary event call.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
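The failure mode described in this message can be reproduced with a toy model. The sketch below is an assumption-laden simplification: the `demo_` types and functions are hypothetical, and "taking the link into traffic" is reduced to setting a flag. It only shows why raising the ESTABLISH event before calling the link-up routine defeats that routine's early-return test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical link endpoint: FSM "up" flag and "carrying traffic" flag. */
struct demo_link {
	bool up;
	bool in_traffic;
};

/* Stands in for raising the ESTABLISH event on the link FSM. */
static void demo_fsm_establish(struct demo_link *l)
{
	l->up = true;
}

/* Returns without action if the link is already up, as the message says. */
static void demo_node_link_up(struct demo_link *l)
{
	if (l->up)
		return;
	demo_fsm_establish(l);
	l->in_traffic = true;	/* slot distribution etc. in the real code */
}

/* Old behaviour: premature event, so the link-up work is skipped. */
bool demo_on_synch_buggy(struct demo_link *l)
{
	assert(l != NULL);
	demo_fsm_establish(l);
	demo_node_link_up(l);
	return l->in_traffic;
}

/* Fixed behaviour: let the link-up routine raise the event itself. */
bool demo_on_synch_fixed(struct demo_link *l)
{
	assert(l != NULL);
	if (!l->up)
		demo_node_link_up(l);
	return l->in_traffic;
}
```

Starting from a link that is down, the buggy path leaves it up but never in traffic, while the fixed path ends with both flags set.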
2017-08-08 22:23:56 +02:00
if ( ! tipc_link_is_up ( l ) )
2015-07-30 18:24:23 -04:00
__tipc_node_link_up ( n , bearer_id , xmitq ) ;
2015-07-30 18:24:19 -04:00
if ( n - > state = = SELF_UP_PEER_UP ) {
n - > sync_point = syncpt ;
2015-07-30 18:24:21 -04:00
tipc_link_fsm_evt ( l , LINK_SYNCH_BEGIN_EVT ) ;
2015-07-30 18:24:19 -04:00
tipc_node_fsm_evt ( n , NODE_SYNCH_BEGIN_EVT ) ;
}
2015-07-30 18:24:16 -04:00
}
2015-07-30 18:24:19 -04:00
/* Open tunnel link when parallel link reaches synch point */
2015-11-19 14:30:41 -05:00
if ( n - > state = = NODE_SYNCHING ) {
tipc: fix stale link problem during synchronization
Recent changes to the link synchronization mean that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has led to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.
Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen on rare occasions.
The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.
However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trigger a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
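The termination check that any arriving packet may now trigger depends only on state kept for the parallel link, which is why the packet contents are not needed. A minimal sketch, assuming 16-bit sequence numbers and using hypothetical `demo_` names throughout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the parallel (non-tunnel) link's receive state. */
struct demo_link {
	uint16_t rcv_nxt;	/* next sequence number expected */
	uint16_t inputq_len;	/* received but not yet delivered upwards */
};

/* Wraparound-safe "left is after right" for 16-bit sequence numbers. */
static bool demo_seq_more(uint16_t left, uint16_t right)
{
	return (int16_t)(uint16_t)(right - left) < 0;
}

/* Next sequence number to be delivered: received minus still queued. */
static uint16_t demo_dlv_nxt(const struct demo_link *pl)
{
	assert(pl != NULL);
	return (uint16_t)(pl->rcv_nxt - pl->inputq_len);
}

/* Synchronization may end once delivery has passed the sync point. */
bool demo_synch_done(const struct demo_link *pl, uint16_t sync_point)
{
	return demo_seq_more(demo_dlv_nxt(pl), sync_point);
}
```

Packets still sitting in the input queue do not count as delivered, so a link that has received past the sync point but not yet delivered past it keeps the synch state open.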
2015-08-20 02:12:56 -04:00
if ( tipc_link_is_synching ( l ) ) {
tnl = l ;
} else {
tnl = pl ;
pl = l ;
}
2015-11-19 14:30:46 -05:00
inputq_len = skb_queue_len ( tipc_link_inputq ( pl ) ) ;
dlv_nxt = tipc_link_rcv_nxt ( pl ) - inputq_len ;
2015-08-20 02:12:55 -04:00
if ( more ( dlv_nxt , n - > sync_point ) ) {
2015-08-20 02:12:56 -04:00
tipc_link_fsm_evt ( tnl , LINK_SYNCH_END_EVT ) ;
2015-07-30 18:24:19 -04:00
tipc_node_fsm_evt ( n , NODE_SYNCH_END_EVT ) ;
return true ;
}
2015-08-20 02:12:56 -04:00
if ( l = = pl )
return true ;
2015-07-30 18:24:19 -04:00
if ( ( usr = = TUNNEL_PROTOCOL ) & & ( mtyp = = SYNCH_MSG ) )
return true ;
if ( usr = = LINK_PROTOCOL )
return true ;
return false ;
}
return true ;
2015-07-30 18:24:16 -04:00
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/transmits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
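The principle described above (collect work on caller-owned queues while holding the lock, then deliver and transmit after releasing it) can be sketched as below. This is a single-threaded toy, not the kernel implementation: the lock is a plain flag used only to check ordering, the fixed-size array queue stands in for the real buffer queues, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QLEN 8

/* Hypothetical fixed-size packet queue. */
struct demo_queue {
	int pkts[DEMO_QLEN];
	size_t len;
};

/* Hypothetical node: toy lock flag plus a backlog waiting to be sent. */
struct demo_node {
	bool locked;
	struct demo_queue backlog;
	size_t sent_unlocked;	/* packets transmitted with the lock free */
};

void demo_queue_push(struct demo_queue *q, int pkt)
{
	assert(q->len < DEMO_QLEN);
	q->pkts[q->len++] = pkt;
}

/* Under the lock: only move buffers onto the caller's local queues. */
static void demo_link_rcv(struct demo_node *n, int pkt,
			  struct demo_queue *inputq,
			  struct demo_queue *xmitq)
{
	assert(n->locked);
	demo_queue_push(inputq, pkt);
	for (size_t i = 0; i < n->backlog.len; i++)
		demo_queue_push(xmitq, n->backlog.pkts[i]);
	n->backlog.len = 0;
}

/* Returns the number of buffers handled outside the lock. */
size_t demo_rcv(struct demo_node *n, int pkt)
{
	struct demo_queue inputq = { .len = 0 };
	struct demo_queue xmitq = { .len = 0 };

	n->locked = true;	/* stands in for taking the node lock */
	demo_link_rcv(n, pkt, &inputq, &xmitq);
	n->locked = false;	/* released before any delivery happens */

	for (size_t i = 0; i < xmitq.len; i++) {
		assert(!n->locked);
		n->sent_unlocked++;
	}
	return inputq.len + xmitq.len;
}
```

Keeping the queues on the caller's stack means the locked section only touches node and link state, which is what shortens the time spent holding the lock.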
2015-07-16 16:54:31 -04:00
/**
* tipc_rcv - process TIPC packets / messages arriving from off - node
* @ net : the applicable net namespace
* @ skb : TIPC packet
2020-07-13 01:15:14 +02:00
* @ b : pointer to bearer message arrived on
2015-07-16 16:54:31 -04:00
*
* Invoked with no locks held . Bearer pointer must point to a valid bearer
* structure ( i . e . cannot be NULL ) , but bearer can be inactive .
*/
void tipc_rcv ( struct net * net , struct sk_buff * skb , struct tipc_bearer * b )
{
struct sk_buff_head xmitq ;
2019-11-08 12:05:11 +07:00
struct tipc_link_entry * le ;
2017-02-23 11:10:31 -05:00
struct tipc_msg * hdr ;
2019-11-08 12:05:11 +07:00
struct tipc_node * n ;
2015-07-16 16:54:31 -04:00
int bearer_id = b - > identity ;
2016-04-29 10:40:24 -04:00
u32 self = tipc_own_addr ( net ) ;
2017-02-23 11:10:31 -05:00
int usr , rc = 0 ;
u16 bc_ack ;
2019-11-08 12:05:11 +07:00
# ifdef CONFIG_TIPC_CRYPTO
struct tipc_ehdr * ehdr ;
2015-07-16 16:54:31 -04:00
2019-11-08 12:05:11 +07:00
/* Check if message must be decrypted first */
if ( TIPC_SKB_CB ( skb ) - > decrypted | | ! tipc_ehdr_validate ( skb ) )
goto rcv ;
2015-07-16 16:54:31 -04:00
2019-11-08 12:05:11 +07:00
ehdr = ( struct tipc_ehdr * ) skb - > data ;
if ( likely ( ehdr - > user ! = LINK_CONFIG ) ) {
n = tipc_node_find ( net , ntohl ( ehdr - > addr ) ) ;
if ( unlikely ( ! n ) )
goto discard ;
} else {
n = tipc_node_find_by_id ( net , ehdr - > id ) ;
}
tipc_crypto_rcv ( net , ( n ) ? n - > crypto_rx : NULL , & skb , b ) ;
if ( ! skb )
return ;
rcv :
# endif
2017-02-23 11:10:31 -05:00
/* Ensure message is well-formed before touching the header */
2017-11-15 21:23:56 +01:00
if ( unlikely ( ! tipc_msg_validate ( & skb ) ) )
2015-07-16 16:54:31 -04:00
goto discard ;
2019-11-08 12:05:11 +07:00
__skb_queue_head_init ( & xmitq ) ;
2017-02-23 11:10:31 -05:00
hdr = buf_msg ( skb ) ;
usr = msg_user ( hdr ) ;
bc_ack = msg_bcast_ack ( hdr ) ;
2015-07-16 16:54:31 -04:00
2015-10-22 08:51:41 -04:00
/* Handle arrival of discovery or broadcast packet */
2015-07-16 16:54:31 -04:00
if ( unlikely ( msg_non_seq ( hdr ) ) ) {
2015-10-22 08:51:41 -04:00
if ( unlikely ( usr = = LINK_CONFIG ) )
return tipc_disc_rcv ( net , skb , b ) ;
2015-07-16 16:54:31 -04:00
else
2015-10-22 08:51:41 -04:00
return tipc_node_bc_rcv ( net , skb , bearer_id ) ;
2015-07-16 16:54:31 -04:00
}
2016-04-29 10:40:24 -04:00
/* Discard unicast link messages destined for another node */
if ( unlikely ( ! msg_short ( hdr ) & & ( msg_destnode ( hdr ) ! = self ) ) )
goto discard ;
2015-07-16 16:54:31 -04:00
/* Locate neighboring node that sent packet */
n = tipc_node_find ( net , msg_prevnode ( hdr ) ) ;
if ( unlikely ( ! n ) )
goto discard ;
2015-07-30 18:24:19 -04:00
le = & n - > links [ bearer_id ] ;
2015-07-16 16:54:31 -04:00
2015-10-22 08:51:41 -04:00
/* Ensure broadcast reception is in synch with peer's send state */
2020-05-26 16:38:34 +07:00
if ( unlikely ( usr = = LINK_PROTOCOL ) ) {
if ( unlikely ( skb_linearize ( skb ) ) ) {
tipc_node_put ( n ) ;
goto discard ;
}
hdr = buf_msg ( skb ) ;
2016-09-01 13:52:49 -04:00
tipc_node_bc_sync_rcv ( n , hdr , bearer_id , & xmitq ) ;
2020-05-26 16:38:34 +07:00
} else if ( unlikely ( tipc_link_acked ( n - > bc_entry . link ) ! = bc_ack ) ) {
2016-10-27 18:51:55 -04:00
tipc_bcast_ack_rcv ( net , n - > bc_entry . link , hdr ) ;
2020-05-26 16:38:34 +07:00
}
2015-10-22 08:51:41 -04:00
2015-11-19 14:30:44 -05:00
/* Receive packet directly if conditions permit */
tipc_node_read_lock ( n ) ;
if ( likely ( ( n - > state = = SELF_UP_PEER_UP ) & & ( usr ! = TUNNEL_PROTOCOL ) ) ) {
2015-11-19 14:30:43 -05:00
spin_lock_bh ( & le - > lock ) ;
2015-11-19 14:30:44 -05:00
if ( le - > link ) {
rc = tipc_link_rcv ( le - > link , skb , & xmitq ) ;
skb = NULL ;
}
2015-11-19 14:30:43 -05:00
spin_unlock_bh ( & le - > lock ) ;
2015-07-30 18:24:19 -04:00
}
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( n ) ;
/* Check/update node state before receiving */
if ( unlikely ( skb ) ) {
2017-08-24 16:31:22 +02:00
if ( unlikely ( skb_linearize ( skb ) ) )
2020-04-15 16:40:28 +08:00
goto out_node_put ;
2015-11-19 14:30:44 -05:00
tipc_node_write_lock ( n ) ;
if ( tipc_node_check_state ( n , skb , bearer_id , & xmitq ) ) {
if ( le - > link ) {
rc = tipc_link_rcv ( le - > link , skb , & xmitq ) ;
skb = NULL ;
}
}
tipc_node_write_unlock ( n ) ;
}
2015-07-16 16:54:31 -04:00
if ( unlikely ( rc & TIPC_LINK_UP_EVT ) )
2015-07-30 18:24:19 -04:00
tipc_node_link_up ( n , bearer_id , & xmitq ) ;
2015-07-16 16:54:31 -04:00
if ( unlikely ( rc & TIPC_LINK_DOWN_EVT ) )
2015-07-30 18:24:23 -04:00
tipc_node_link_down ( n , bearer_id , false ) ;
2015-07-30 18:24:19 -04:00
2015-10-22 08:51:41 -04:00
if ( unlikely ( ! skb_queue_empty ( & n - > bc_entry . namedq ) ) )
tipc: update a binding service via broadcast
Currently, updates to the binding table (adding a service binding to the
name table or withdrawing one) are sent over replicast.
However, if we scale clusters up to > 100 nodes/containers, this
method is less efficient because it loops through the nodes in the cluster
one by one.
It is worthwhile to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypasses the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk message so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add sanity check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
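The four rules listed in this commit message can be condensed into one receive-side decision, sketched below. It is only an illustration under stated assumptions: `struct named_msg`, `struct named_state`, `named_check()`, and the three verdicts are hypothetical names, the sequence number is assumed to be 16 bits wide with wraparound comparison, and the initial `rcv_nxt` (the agreed synch point) is assumed to be set by the caller. The real logic lives in tipc_named_rcv() and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum named_verdict { NAMED_DELIVER, NAMED_DEFER, NAMED_DROP };

/* Hypothetical view of the header bits described in the commit message */
struct named_msg {
	bool is_not_legacy; /* set in all new-style messages */
	bool is_bulk;       /* unicast bulk update, always in order */
	bool is_last_bulk;  /* final message of the bulk update */
	uint16_t seqno;     /* only meaningful for broadcast/replicast */
};

struct named_state {
	bool open;          /* true once the last bulk message has arrived */
	uint16_t rcv_nxt;   /* next expected sequence number */
};

/* Wraparound-safe "a is before b" for 16-bit sequence numbers */
static bool seq_less(uint16_t a, uint16_t b)
{
	return (uint16_t)(a - b) >= 0x8000u;
}

static enum named_verdict named_check(struct named_state *st,
				      const struct named_msg *m)
{
	assert(st && m);

	/* Rule 4: legacy messages carry no new bits, deliver directly */
	if (!m->is_not_legacy)
		return NAMED_DELIVER;

	/* Rule 3: bulk messages are exempt from the sequence check */
	if (m->is_bulk) {
		/* Rule 1: the last bulk message opens broadcast reception */
		if (m->is_last_bulk)
			st->open = true;
		return NAMED_DELIVER;
	}

	/* Rule 1: hold back broadcasts until the bulk update is complete */
	if (!st->open)
		return NAMED_DEFER;

	/* Rule 2: use the sequence number to detect disordering */
	if (m->seqno == st->rcv_nxt) {
		st->rcv_nxt++;
		return NAMED_DELIVER;
	}
	if (seq_less(m->seqno, st->rcv_nxt))
		return NAMED_DROP; /* older than the synch point */
	return NAMED_DEFER;        /* arrived early, keep it queued */
}
```

A deferred message stays in the queue and is re-examined on a later pass, which matches the commit's remark that only disordering, not loss or duplication, needs handling at this level.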
2020-06-17 13:56:05 +07:00
tipc_named_rcv ( net , & n - > bc_entry . namedq ,
& n - > bc_entry . named_rcv_nxt ,
& n - > bc_entry . named_open ) ;
2015-07-30 18:24:24 -04:00
2017-01-18 13:50:52 -05:00
if ( unlikely ( ! skb_queue_empty ( & n - > bc_entry . inputq1 ) ) )
tipc_node_mcast_rcv ( n ) ;
2015-07-30 18:24:19 -04:00
if ( ! skb_queue_empty ( & le - > inputq ) )
tipc_sk_rcv ( net , & le - > inputq ) ;
if ( ! skb_queue_empty ( & xmitq ) )
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( net , bearer_id , & xmitq , & le - > maddr , n ) ;
2015-07-30 18:24:19 -04:00
2020-04-15 16:40:28 +08:00
out_node_put :
2015-07-16 16:54:31 -04:00
tipc_node_put ( n ) ;
discard :
kfree_skb ( skb ) ;
}
2018-04-19 11:06:20 +02:00
void tipc_node_apply_property ( struct net * net , struct tipc_bearer * b ,
int prop )
2018-02-14 13:34:39 +01:00
{
struct tipc_net * tn = tipc_net ( net ) ;
int bearer_id = b - > identity ;
struct sk_buff_head xmitq ;
struct tipc_link_entry * e ;
struct tipc_node * n ;
__skb_queue_head_init ( & xmitq ) ;
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( n , & tn - > node_list , list ) {
tipc_node_write_lock ( n ) ;
e = & n - > links [ bearer_id ] ;
2018-04-19 11:06:20 +02:00
if ( e - > link ) {
if ( prop = = TIPC_NLA_PROP_TOL )
tipc_link_set_tolerance ( e - > link , b - > tolerance ,
& xmitq ) ;
else if ( prop = = TIPC_NLA_PROP_MTU )
tipc_link_set_mtu ( e - > link , b - > mtu ) ;
2020-12-07 11:14:24 +03:00
/* Update MTU for node link entry */
e - > mtu = tipc_link_mss ( e - > link ) ;
2018-04-19 11:06:20 +02:00
}
2020-12-07 11:14:24 +03:00
2018-02-14 13:34:39 +01:00
tipc_node_write_unlock ( n ) ;
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( net , bearer_id , & xmitq , & e - > maddr , NULL ) ;
2018-02-14 13:34:39 +01:00
}
rcu_read_unlock ( ) ;
}
2016-08-18 10:33:52 +02:00
int tipc_nl_peer_rm ( struct sk_buff * skb , struct genl_info * info )
{
struct net * net = sock_net ( skb - > sk ) ;
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
struct nlattr * attrs [ TIPC_NLA_NET_MAX + 1 ] ;
2019-11-06 13:26:09 +07:00
struct tipc_node * peer , * temp_node ;
2020-12-03 10:50:45 +07:00
u8 node_id [ NODE_ID_LEN ] ;
u64 * w0 = ( u64 * ) & node_id [ 0 ] ;
u64 * w1 = ( u64 * ) & node_id [ 8 ] ;
2016-08-18 10:33:52 +02:00
u32 addr ;
int err ;
/* We identify the peer by its net */
if ( ! info - > attrs [ TIPC_NLA_NET ] )
return - EINVAL ;
netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:
1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size
The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.
We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated
Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)
@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.
Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.
Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.
In effect then, this adds fully strict validation for any new command.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
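The split into independent strictness options can be pictured as a bitmask that each parser call passes along. The sketch below is a simplified model, not the kernel API: the `VALIDATE_*` names, `struct attr`, and `attr_check()` are hypothetical, and only the MAXTYPE and STRICT_ATTRS options are modelled (TRAILING and UNSPEC need message and policy context that is omitted here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names mirroring the four options in the message */
enum {
	VALIDATE_LIBERAL      = 0,
	VALIDATE_TRAILING     = 1 << 0,
	VALIDATE_MAXTYPE      = 1 << 1,
	VALIDATE_UNSPEC       = 1 << 2,
	VALIDATE_STRICT_ATTRS = 1 << 3
};

/* The old *_strict() behaviour: TRAILING and MAXTYPE only */
#define VALIDATE_DEPRECATED_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE)
/* The default for new users: everything */
#define VALIDATE_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE | \
			 VALIDATE_UNSPEC | VALIDATE_STRICT_ATTRS)

struct attr {
	unsigned int type;
	size_t len;
};

/* Returns 0 if the attribute is accepted, -1 if it is rejected */
static int attr_check(const struct attr *a, unsigned int max_type,
		      size_t expected_len, unsigned int flags)
{
	assert(a);

	/* Unknown types are tolerated unless MAXTYPE is requested */
	if (a->type > max_type)
		return (flags & VALIDATE_MAXTYPE) ? -1 : 0;

	/* Too short is always an error */
	if (a->len < expected_len)
		return -1;

	/* Liberal parsing accepts len >= expected; strict wants equality */
	if ((flags & VALIDATE_STRICT_ATTRS) && a->len != expected_len)
		return -1;

	return 0;
}
```

With this shape, an existing caller keeps its old behaviour by passing the liberal or deprecated-strict mask, while a new caller simply passes the full mask.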
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( attrs , TIPC_NLA_NET_MAX ,
info - > attrs [ TIPC_NLA_NET ] ,
tipc_nl_net_policy , info - > extack ) ;
2016-08-18 10:33:52 +02:00
if ( err )
return err ;
2020-12-03 10:50:45 +07:00
/* attrs[TIPC_NLA_NET_NODEID] and attrs[TIPC_NLA_NET_ADDR] are
* mutually exclusive cases
*/
if ( attrs [ TIPC_NLA_NET_ADDR ] ) {
addr = nla_get_u32 ( attrs [ TIPC_NLA_NET_ADDR ] ) ;
if ( ! addr )
return - EINVAL ;
}
2016-08-18 10:33:52 +02:00
2020-12-03 10:50:45 +07:00
if ( attrs [ TIPC_NLA_NET_NODEID ] ) {
if ( ! attrs [ TIPC_NLA_NET_NODEID_W1 ] )
return - EINVAL ;
* w0 = nla_get_u64 ( attrs [ TIPC_NLA_NET_NODEID ] ) ;
* w1 = nla_get_u64 ( attrs [ TIPC_NLA_NET_NODEID_W1 ] ) ;
addr = hash128to32 ( node_id ) ;
}
2016-08-18 10:33:52 +02:00
if ( in_own_node ( net , addr ) )
return - ENOTSUPP ;
spin_lock_bh ( & tn - > node_list_lock ) ;
peer = tipc_node_find ( net , addr ) ;
if ( ! peer ) {
spin_unlock_bh ( & tn - > node_list_lock ) ;
return - ENXIO ;
}
tipc_node_write_lock ( peer ) ;
if ( peer - > state ! = SELF_DOWN_PEER_DOWN & &
peer - > state ! = SELF_DOWN_PEER_LEAVING ) {
tipc_node_write_unlock ( peer ) ;
err = - EBUSY ;
goto err_out ;
}
2018-06-29 13:23:41 +02:00
tipc_node_clear_links ( peer ) ;
2016-08-18 10:33:52 +02:00
tipc_node_write_unlock ( peer ) ;
tipc_node_delete ( peer ) ;
2019-11-06 13:26:09 +07:00
/* Calculate cluster capabilities */
tn - > capabilities = TIPC_NODE_CAPABILITIES ;
list_for_each_entry_rcu ( temp_node , & tn - > node_list , list ) {
tn - > capabilities & = temp_node - > capabilities ;
}
2019-11-21 10:01:09 +07:00
tipc_bcast_toggle_rcast ( net , ( tn - > capabilities & TIPC_BCAST_RCAST ) ) ;
2016-08-18 10:33:52 +02:00
err = 0 ;
err_out :
tipc_node_put ( peer ) ;
spin_unlock_bh ( & tn - > node_list_lock ) ;
return err ;
}
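The capability recalculation at the end of tipc_nl_peer_rm() is a plain bitwise intersection over the remaining peers. A stand-alone sketch of that step, assuming hypothetical capability bits (`CAP_*`) and an array in place of the RCU-protected node list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; the real ones are defined in node.h */
#define CAP_BCAST_SYNCH (1u << 0)
#define CAP_BCAST_RCAST (1u << 1)
#define CAP_NAMED_BCAST (1u << 2)
#define CAP_ALL (CAP_BCAST_SYNCH | CAP_BCAST_RCAST | CAP_NAMED_BCAST)

/* A feature is usable cluster-wide only if every peer supports it */
static uint16_t cluster_caps(const uint16_t *peer_caps, size_t n)
{
	uint16_t caps = CAP_ALL; /* start from our own full set */
	size_t i;

	assert(peer_caps || n == 0);
	for (i = 0; i < n; i++)
		caps &= peer_caps[i];
	return caps;
}
```

Starting from the full set and masking downwards means that removing the last legacy peer automatically restores the newer features on the next recalculation.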
2014-11-20 10:29:17 +01:00
int tipc_nl_node_dump ( struct sk_buff * skb , struct netlink_callback * cb )
{
int err ;
2015-01-09 15:27:05 +08:00
struct net * net = sock_net ( skb - > sk ) ;
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
2014-11-20 10:29:17 +01:00
int done = cb - > args [ 0 ] ;
int last_addr = cb - > args [ 1 ] ;
struct tipc_node * node ;
struct tipc_nl_msg msg ;
if ( done )
return 0 ;
msg . skb = skb ;
msg . portid = NETLINK_CB ( cb - > skb ) . portid ;
msg . seq = cb - > nlh - > nlmsg_seq ;
rcu_read_lock ( ) ;
2015-03-26 18:10:24 +08:00
if ( last_addr ) {
node = tipc_node_find ( net , last_addr ) ;
if ( ! node ) {
rcu_read_unlock ( ) ;
/* We never set seq or call nl_dump_check_consistent() , so
* setting prev_seq here makes the consistency check in the
* netlink callback handler fail . As a result , the
* NLMSG_DONE message has the NLM_F_DUMP_INTR flag set if
* the node state changed while we had released the lock .
*/
cb - > prev_seq = 1 ;
return - EPIPE ;
}
tipc_node_put ( node ) ;
2014-11-20 10:29:17 +01:00
}
2015-01-09 15:27:05 +08:00
list_for_each_entry_rcu ( node , & tn - > node_list , list ) {
2019-11-08 12:05:09 +07:00
if ( node - > preliminary )
continue ;
2014-11-20 10:29:17 +01:00
if ( last_addr ) {
if ( node - > addr = = last_addr )
last_addr = 0 ;
else
continue ;
}
2015-11-19 14:30:44 -05:00
tipc_node_read_lock ( node ) ;
2014-11-20 10:29:17 +01:00
err = __tipc_nl_add_node ( & msg , node ) ;
if ( err ) {
last_addr = node - > addr ;
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( node ) ;
2014-11-20 10:29:17 +01:00
goto out ;
}
2015-11-19 14:30:44 -05:00
tipc_node_read_unlock ( node ) ;
2014-11-20 10:29:17 +01:00
}
done = 1 ;
out :
cb - > args [ 0 ] = done ;
cb - > args [ 1 ] = last_addr ;
rcu_read_unlock ( ) ;
return skb - > len ;
}
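tipc_nl_node_dump() is resumable: when the reply buffer fills up, it records the address of the node that did not fit and continues from that node on the next call. The sketch below isolates that cursor logic. It is a simplified model with hypothetical names (`struct dump_cb`, `dump_nodes()`), a plain array instead of the RCU node list, and a fixed-capacity output array instead of an skb; address 0 is assumed to be unused because it doubles as the "no cursor" value.

```c
#include <assert.h>
#include <stddef.h>

/* Plays the role of cb->args[0] and cb->args[1] */
struct dump_cb {
	int done;
	unsigned int last_addr;
};

/* Copies up to 'cap' addresses into 'out' and returns how many were written */
static size_t dump_nodes(const unsigned int *addrs, size_t n,
			 struct dump_cb *cb, unsigned int *out, size_t cap)
{
	unsigned int last_addr = cb->last_addr;
	size_t used = 0;
	size_t i;

	assert(cap > 0);
	if (cb->done)
		return 0;

	for (i = 0; i < n; i++) {
		/* Skip forward until we reach the entry we stopped at */
		if (last_addr) {
			if (addrs[i] == last_addr)
				last_addr = 0;
			else
				continue;
		}
		/* Buffer full: remember this entry and resume here later */
		if (used == cap) {
			last_addr = addrs[i];
			goto out;
		}
		out[used++] = addrs[i];
	}
	cb->done = 1;
out:
	cb->last_addr = last_addr;
	return used;
}
```

The sketch leaves out the case where the remembered node has disappeared between calls; the function above handles that by looking the node up first and returning -EPIPE.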
2015-11-19 14:30:45 -05:00
2015-11-19 14:30:46 -05:00
/* tipc_node_find_by_name - locate owner node of link by link's name
2015-11-19 14:30:45 -05:00
* @ net : the applicable net namespace
* @ link_name : pointer to link name string
* @ bearer_id : pointer to index in ' node - > links ' array where the link was found .
*
* Returns pointer to node owning the link , or NULL if no matching link is found .
*/
2015-11-19 14:30:46 -05:00
static struct tipc_node * tipc_node_find_by_name ( struct net * net ,
const char * link_name ,
unsigned int * bearer_id )
2015-11-19 14:30:45 -05:00
{
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
2015-11-19 14:30:46 -05:00
struct tipc_link * l ;
struct tipc_node * n ;
2015-11-19 14:30:45 -05:00
struct tipc_node * found_node = NULL ;
int i ;
* bearer_id = 0 ;
rcu_read_lock ( ) ;
2015-11-19 14:30:46 -05:00
list_for_each_entry_rcu ( n , & tn - > node_list , list ) {
tipc_node_read_lock ( n ) ;
2015-11-19 14:30:45 -05:00
for ( i = 0 ; i < MAX_BEARERS ; i + + ) {
2015-11-19 14:30:46 -05:00
l = n - > links [ i ] . link ;
if ( l & & ! strcmp ( tipc_link_name ( l ) , link_name ) ) {
2015-11-19 14:30:45 -05:00
* bearer_id = i ;
2015-11-19 14:30:46 -05:00
found_node = n ;
2015-11-19 14:30:45 -05:00
break ;
}
}
2015-11-19 14:30:46 -05:00
tipc_node_read_unlock ( n ) ;
2015-11-19 14:30:45 -05:00
if ( found_node )
break ;
}
rcu_read_unlock ( ) ;
return found_node ;
}
int tipc_nl_node_set_link ( struct sk_buff * skb , struct genl_info * info )
{
int err ;
int res = 0 ;
int bearer_id ;
char * name ;
struct tipc_link * link ;
struct tipc_node * node ;
2016-02-01 08:19:56 +01:00
struct sk_buff_head xmitq ;
2015-11-19 14:30:45 -05:00
struct nlattr * attrs [ TIPC_NLA_LINK_MAX + 1 ] ;
struct net * net = sock_net ( skb - > sk ) ;
2016-02-01 08:19:56 +01:00
__skb_queue_head_init ( & xmitq ) ;
2015-11-19 14:30:45 -05:00
if ( ! info - > attrs [ TIPC_NLA_LINK ] )
return - EINVAL ;
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( attrs , TIPC_NLA_LINK_MAX ,
info - > attrs [ TIPC_NLA_LINK ] ,
tipc_nl_link_policy , info - > extack ) ;
2015-11-19 14:30:45 -05:00
if ( err )
return err ;
if ( ! attrs [ TIPC_NLA_LINK_NAME ] )
return - EINVAL ;
name = nla_data ( attrs [ TIPC_NLA_LINK_NAME ] ) ;
if ( strcmp ( name , tipc_bclink_name ) = = 0 )
return tipc_nl_bc_link_set ( net , attrs ) ;
2015-11-19 14:30:46 -05:00
node = tipc_node_find_by_name ( net , name , & bearer_id ) ;
2015-11-19 14:30:45 -05:00
if ( ! node )
return - EINVAL ;
tipc_node_read_lock ( node ) ;
link = node - > links [ bearer_id ] . link ;
if ( ! link ) {
res = - EINVAL ;
goto out ;
}
if ( attrs [ TIPC_NLA_LINK_PROP ] ) {
struct nlattr * props [ TIPC_NLA_PROP_MAX + 1 ] ;
tipc: introduce variable window congestion control
We introduce a simple variable window congestion control for links.
The algorithm is inspired by the Reno algorithm, covering both 'slow
start', 'congestion avoidance', and 'fast recovery' modes.
- We introduce hard lower and upper window limits per link, still
different and configurable per bearer type.
- We introduce a 'slow start threshold' variable, initially set to
the maximum window size.
- We let a link start at the minimum congestion window, i.e. in slow
start mode, and then let it grow rapidly (+1 per received ACK) until
it reaches the slow start threshold and enters congestion avoidance
mode.
- In congestion avoidance mode we increment the congestion window for
each window-size number of acked packets, up to a possible maximum
equal to the configured maximum window.
- For each non-duplicate NACK received, we drop back to fast recovery
mode, by setting both the slow start threshold and the
congestion window to (current_congestion_window / 2).
- If the timeout handler finds that the transmit queue has not moved
since the previous timeout, it drops the link back to slow start
and forces a probe containing the last sent sequence number to be
sent to the peer, so that this can discover the stale situation.
This change does in reality have effect only on unicast ethernet
transport, as we have seen that there is no room whatsoever for
increasing the window max size for the UDP bearer.
For now, we also choose to keep the limits for the broadcast link
unchanged and equal.
This algorithm seems to give a 50-100% throughput improvement for
messages larger than MTU.
Suggested-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
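The window algorithm described in this commit message can be sketched as a small state machine. This is an illustration of the listed rules, not the code in link.c: `struct cc` and the `cc_*()` functions are hypothetical names, and clamping the halved window to the minimum window is an assumption the message does not state.

```c
#include <assert.h>
#include <stdint.h>

struct cc {
	uint16_t min_win;  /* hard lower limit */
	uint16_t max_win;  /* hard upper limit */
	uint16_t ssthresh; /* slow start threshold */
	uint16_t cwnd;     /* current congestion window */
	uint16_t acked;    /* acks counted since the last increment */
};

static void cc_init(struct cc *c, uint16_t min_win, uint16_t max_win)
{
	assert(min_win >= 1 && min_win <= max_win);
	c->min_win = min_win;
	c->max_win = max_win;
	c->ssthresh = max_win; /* initially the maximum window */
	c->cwnd = min_win;     /* start in slow start mode */
	c->acked = 0;
}

static void cc_on_ack(struct cc *c)
{
	/* Slow start: +1 per received ACK until ssthresh is reached */
	if (c->cwnd < c->ssthresh) {
		c->cwnd++;
		return;
	}
	/* Congestion avoidance: +1 per window-size number of acks */
	if (++c->acked >= c->cwnd) {
		c->acked = 0;
		if (c->cwnd < c->max_win)
			c->cwnd++;
	}
}

/* Non-duplicate NACK: halve both the threshold and the window */
static void cc_on_nack(struct cc *c)
{
	uint16_t half = c->cwnd / 2;

	if (half < c->min_win) /* assumption: never go below min_win */
		half = c->min_win;
	c->ssthresh = half;
	c->cwnd = half;
	c->acked = 0;
}

/* Transmit queue did not move since the previous timeout */
static void cc_on_stale_timeout(struct cc *c)
{
	c->cwnd = c->min_win; /* back to slow start */
	c->acked = 0;
}
```

Because a NACK sets the window equal to the threshold, the link resumes directly in congestion avoidance, whereas a stale timeout leaves the threshold alone and restarts from the minimum window.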
2019-12-10 00:52:46 +01:00
err = tipc_nl_parse_link_prop ( attrs [ TIPC_NLA_LINK_PROP ] , props ) ;
2015-11-19 14:30:45 -05:00
if ( err ) {
res = err ;
goto out ;
}
if ( props [ TIPC_NLA_PROP_TOL ] ) {
u32 tol ;
tol = nla_get_u32 ( props [ TIPC_NLA_PROP_TOL ] ) ;
2016-02-01 08:19:56 +01:00
tipc_link_set_tolerance ( link , tol , & xmitq ) ;
2015-11-19 14:30:45 -05:00
}
if ( props [ TIPC_NLA_PROP_PRIO ] ) {
u32 prio ;
prio = nla_get_u32 ( props [ TIPC_NLA_PROP_PRIO ] ) ;
2016-02-01 08:19:56 +01:00
tipc_link_set_prio ( link , prio , & xmitq ) ;
2015-11-19 14:30:45 -05:00
}
if ( props [ TIPC_NLA_PROP_WIN ] ) {
2019-12-10 00:52:46 +01:00
u32 max_win ;
2015-11-19 14:30:45 -05:00
2019-12-10 00:52:46 +01:00
max_win = nla_get_u32 ( props [ TIPC_NLA_PROP_WIN ] ) ;
tipc_link_set_queue_limits ( link ,
tipc_link_min_win ( link ) ,
max_win ) ;
2015-11-19 14:30:45 -05:00
}
}
out :
tipc_node_read_unlock ( node ) ;
2019-11-08 12:05:11 +07:00
tipc_bearer_xmit ( net , bearer_id , & xmitq , & node - > links [ bearer_id ] . maddr ,
NULL ) ;
2015-11-19 14:30:45 -05:00
return res ;
}
int tipc_nl_node_get_link ( struct sk_buff * skb , struct genl_info * info )
{
struct net * net = genl_info_net ( info ) ;
2018-05-08 21:44:06 +08:00
struct nlattr * attrs [ TIPC_NLA_LINK_MAX + 1 ] ;
2015-11-19 14:30:45 -05:00
struct tipc_nl_msg msg ;
char * name ;
int err ;
msg . portid = info - > snd_portid ;
msg . seq = info - > snd_seq ;
2018-05-08 21:44:06 +08:00
if ( ! info - > attrs [ TIPC_NLA_LINK ] )
2015-11-19 14:30:45 -05:00
return - EINVAL ;
2018-05-08 21:44:06 +08:00
netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:
1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size
The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.
We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated
Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)
@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.
Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.
Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.
In effect then, this adds fully strict validation for any new command.
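The split into independent strictness options described above can be illustrated with a plain bitmask. The sketch below is a user-space model only: the names (val_flags, attr_desc, policy_entry, attr_ok, tail_ok) are hypothetical and are not the kernel's netlink API, and the checks are reduced to the four options named in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names modeled on the four options in the text. */
enum val_flags {
	VAL_LIBERAL      = 0,
	VAL_TRAILING     = 1 << 0,  /* no trailing data after the attributes */
	VAL_MAXTYPE      = 1 << 1,  /* reject attrs > max known type */
	VAL_UNSPEC       = 1 << 2,  /* reject attrs without a policy entry */
	VAL_STRICT_ATTRS = 1 << 3,  /* attribute size must match exactly */
	/* The old *_strict() behaviour: TRAILING and MAXTYPE combined. */
	VAL_DEPRECATED_STRICT = VAL_TRAILING | VAL_MAXTYPE,
	/* The default for new users: everything. */
	VAL_STRICT = VAL_TRAILING | VAL_MAXTYPE | VAL_UNSPEC | VAL_STRICT_ATTRS
};

struct attr_desc {
	unsigned int type;
	size_t len;
};

struct policy_entry {
	bool specified;       /* false models an NLA_UNSPEC entry */
	size_t expected_len;
};

/* Validate one attribute against a policy table of max_type + 1 entries. */
static bool attr_ok(const struct attr_desc *a, const struct policy_entry *pol,
		    unsigned int max_type, unsigned int flags)
{
	assert(a && pol);
	if (a->type > max_type)
		return !(flags & VAL_MAXTYPE);  /* unknown type: ignored unless MAXTYPE */
	if (!pol[a->type].specified)
		return !(flags & VAL_UNSPEC);
	if (a->len < pol[a->type].expected_len)
		return false;                   /* too short is never accepted */
	if (a->len > pol[a->type].expected_len && (flags & VAL_STRICT_ATTRS))
		return false;                   /* too long only fails when strict */
	return true;
}

/* Validate the bytes left over after the last attribute. */
static bool tail_ok(size_t trailing_bytes, unsigned int flags)
{
	return trailing_bytes == 0 || !(flags & VAL_TRAILING);
}
```

An over-long attribute, for example, passes under VAL_LIBERAL and VAL_DEPRECATED_STRICT in this model but is rejected under VAL_STRICT, which mirrors why existing callers keep the deprecated variants while new callers get full strictness.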
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( attrs , TIPC_NLA_LINK_MAX ,
info - > attrs [ TIPC_NLA_LINK ] ,
tipc_nl_link_policy , info - > extack ) ;
2018-05-08 21:44:06 +08:00
if ( err )
return err ;
if ( ! attrs [ TIPC_NLA_LINK_NAME ] )
return - EINVAL ;
name = nla_data ( attrs [ TIPC_NLA_LINK_NAME ] ) ;
2015-11-19 14:30:45 -05:00
msg . skb = nlmsg_new ( NLMSG_GOODSIZE , GFP_KERNEL ) ;
if ( ! msg . skb )
return - ENOMEM ;
if ( strcmp ( name , tipc_bclink_name ) = = 0 ) {
2020-05-26 16:38:37 +07:00
err = tipc_nl_add_bc_link ( net , & msg , tipc_net ( net ) - > bcl ) ;
2018-01-10 12:50:25 -08:00
if ( err )
goto err_free ;
2015-11-19 14:30:45 -05:00
} else {
int bearer_id ;
struct tipc_node * node ;
struct tipc_link * link ;
2015-11-19 14:30:46 -05:00
node = tipc_node_find_by_name ( net , name , & bearer_id ) ;
2018-01-10 12:50:25 -08:00
if ( ! node ) {
err = - EINVAL ;
goto err_free ;
}
2015-11-19 14:30:45 -05:00
tipc_node_read_lock ( node ) ;
link = node - > links [ bearer_id ] . link ;
if ( ! link ) {
tipc_node_read_unlock ( node ) ;
2018-01-10 12:50:25 -08:00
err = - EINVAL ;
goto err_free ;
2015-11-19 14:30:45 -05:00
}
err = __tipc_nl_add_link ( net , & msg , link , 0 ) ;
tipc_node_read_unlock ( node ) ;
2018-01-10 12:50:25 -08:00
if ( err )
goto err_free ;
2015-11-19 14:30:45 -05:00
}
return genlmsg_reply ( msg . skb , info ) ;
2018-01-10 12:50:25 -08:00
err_free :
nlmsg_free ( msg . skb ) ;
return err ;
2015-11-19 14:30:45 -05:00
}
int tipc_nl_node_reset_link_stats ( struct sk_buff * skb , struct genl_info * info )
{
int err ;
char * link_name ;
unsigned int bearer_id ;
struct tipc_link * link ;
struct tipc_node * node ;
struct nlattr * attrs [ TIPC_NLA_LINK_MAX + 1 ] ;
struct net * net = sock_net ( skb - > sk ) ;
2020-05-26 16:38:37 +07:00
struct tipc_net * tn = tipc_net ( net ) ;
2015-11-19 14:30:45 -05:00
struct tipc_link_entry * le ;
if ( ! info - > attrs [ TIPC_NLA_LINK ] )
return - EINVAL ;
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( attrs , TIPC_NLA_LINK_MAX ,
info - > attrs [ TIPC_NLA_LINK ] ,
tipc_nl_link_policy , info - > extack ) ;
2015-11-19 14:30:45 -05:00
if ( err )
return err ;
if ( ! attrs [ TIPC_NLA_LINK_NAME ] )
return - EINVAL ;
link_name = nla_data ( attrs [ TIPC_NLA_LINK_NAME ] ) ;
2020-05-26 16:38:37 +07:00
err = - EINVAL ;
if ( ! strcmp ( link_name , tipc_bclink_name ) ) {
err = tipc_bclink_reset_stats ( net , tipc_bc_sndlink ( net ) ) ;
2015-11-19 14:30:45 -05:00
if ( err )
return err ;
return 0 ;
2020-05-26 16:38:37 +07:00
} else if ( strstr ( link_name , tipc_bclink_name ) ) {
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( node , & tn - > node_list , list ) {
tipc_node_read_lock ( node ) ;
link = node - > bc_entry . link ;
if ( link & & ! strcmp ( link_name , tipc_link_name ( link ) ) ) {
err = tipc_bclink_reset_stats ( net , link ) ;
tipc_node_read_unlock ( node ) ;
break ;
}
tipc_node_read_unlock ( node ) ;
}
rcu_read_unlock ( ) ;
return err ;
2015-11-19 14:30:45 -05:00
}
2015-11-19 14:30:46 -05:00
node = tipc_node_find_by_name ( net , link_name , & bearer_id ) ;
2015-11-19 14:30:45 -05:00
if ( ! node )
return - EINVAL ;
le = & node - > links [ bearer_id ] ;
tipc_node_read_lock ( node ) ;
spin_lock_bh ( & le - > lock ) ;
link = node - > links [ bearer_id ] . link ;
if ( ! link ) {
spin_unlock_bh ( & le - > lock ) ;
tipc_node_read_unlock ( node ) ;
return - EINVAL ;
}
2015-11-19 14:30:46 -05:00
tipc_link_reset_stats ( link ) ;
2015-11-19 14:30:45 -05:00
spin_unlock_bh ( & le - > lock ) ;
tipc_node_read_unlock ( node ) ;
return 0 ;
}
/* Caller should hold node lock */
static int __tipc_nl_add_node_links ( struct net * net , struct tipc_nl_msg * msg ,
2020-05-26 16:38:37 +07:00
struct tipc_node * node , u32 * prev_link ,
bool bc_link )
2015-11-19 14:30:45 -05:00
{
u32 i ;
int err ;
for ( i = * prev_link ; i < MAX_BEARERS ; i + + ) {
* prev_link = i ;
if ( ! node - > links [ i ] . link )
continue ;
err = __tipc_nl_add_link ( net , msg ,
node - > links [ i ] . link , NLM_F_MULTI ) ;
if ( err )
return err ;
}
2020-05-26 16:38:37 +07:00
if ( bc_link ) {
* prev_link = i ;
err = tipc_nl_add_bc_link ( net , msg , node - > bc_entry . link ) ;
if ( err )
return err ;
}
2015-11-19 14:30:45 -05:00
* prev_link = 0 ;
return 0 ;
}
2015-11-19 14:30:46 -05:00
int tipc_nl_node_dump_link ( struct sk_buff * skb , struct netlink_callback * cb )
2015-11-19 14:30:45 -05:00
{
struct net * net = sock_net ( skb - > sk ) ;
2020-05-26 16:38:37 +07:00
struct nlattr * * attrs = genl_dumpit_info ( cb ) - > attrs ;
struct nlattr * link [ TIPC_NLA_LINK_MAX + 1 ] ;
2015-11-19 14:30:45 -05:00
struct tipc_net * tn = net_generic ( net , tipc_net_id ) ;
struct tipc_node * node ;
struct tipc_nl_msg msg ;
u32 prev_node = cb - > args [ 0 ] ;
u32 prev_link = cb - > args [ 1 ] ;
int done = cb - > args [ 2 ] ;
2020-05-26 16:38:37 +07:00
bool bc_link = cb - > args [ 3 ] ;
2015-11-19 14:30:45 -05:00
int err ;
if ( done )
return 0 ;
2020-05-26 16:38:37 +07:00
if ( ! prev_node ) {
/* Check if broadcast-receiver links dumping is needed */
if ( attrs & & attrs [ TIPC_NLA_LINK ] ) {
err = nla_parse_nested_deprecated ( link ,
TIPC_NLA_LINK_MAX ,
attrs [ TIPC_NLA_LINK ] ,
tipc_nl_link_policy ,
NULL ) ;
if ( unlikely ( err ) )
return err ;
if ( unlikely ( ! link [ TIPC_NLA_LINK_BROADCAST ] ) )
return - EINVAL ;
bc_link = true ;
}
}
2015-11-19 14:30:45 -05:00
msg . skb = skb ;
msg . portid = NETLINK_CB ( cb - > skb ) . portid ;
msg . seq = cb - > nlh - > nlmsg_seq ;
rcu_read_lock ( ) ;
if ( prev_node ) {
node = tipc_node_find ( net , prev_node ) ;
if ( ! node ) {
/* We never set seq or call nl_dump_check_consistent() ,
* which means that setting prev_seq here will cause the
* consistency check to fail in the netlink callback
* handler , resulting in the last NLMSG_DONE message
* having the NLM_F_DUMP_INTR flag set .
*/
cb - > prev_seq = 1 ;
goto out ;
}
tipc_node_put ( node ) ;
list_for_each_entry_continue_rcu ( node , & tn - > node_list ,
list ) {
tipc_node_read_lock ( node ) ;
err = __tipc_nl_add_node_links ( net , & msg , node ,
2020-05-26 16:38:37 +07:00
& prev_link , bc_link ) ;
2015-11-19 14:30:45 -05:00
tipc_node_read_unlock ( node ) ;
if ( err )
goto out ;
prev_node = node - > addr ;
}
} else {
2020-05-26 16:38:37 +07:00
err = tipc_nl_add_bc_link ( net , & msg , tn - > bcl ) ;
2015-11-19 14:30:45 -05:00
if ( err )
goto out ;
list_for_each_entry_rcu ( node , & tn - > node_list , list ) {
tipc_node_read_lock ( node ) ;
err = __tipc_nl_add_node_links ( net , & msg , node ,
2020-05-26 16:38:37 +07:00
& prev_link , bc_link ) ;
2015-11-19 14:30:45 -05:00
tipc_node_read_unlock ( node ) ;
if ( err )
goto out ;
prev_node = node - > addr ;
}
}
done = 1 ;
out :
rcu_read_unlock ( ) ;
cb - > args [ 0 ] = prev_node ;
cb - > args [ 1 ] = prev_link ;
cb - > args [ 2 ] = done ;
2020-05-26 16:38:37 +07:00
cb - > args [ 3 ] = bc_link ;
2015-11-19 14:30:45 -05:00
return skb - > len ;
}
2016-07-26 08:47:19 +02:00
int tipc_nl_node_set_monitor ( struct sk_buff * skb , struct genl_info * info )
{
struct nlattr * attrs [ TIPC_NLA_MON_MAX + 1 ] ;
struct net * net = sock_net ( skb - > sk ) ;
int err ;
if ( ! info - > attrs [ TIPC_NLA_MON ] )
return - EINVAL ;
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( attrs , TIPC_NLA_MON_MAX ,
info - > attrs [ TIPC_NLA_MON ] ,
tipc_nl_monitor_policy ,
info - > extack ) ;
2016-07-26 08:47:19 +02:00
if ( err )
return err ;
if ( attrs [ TIPC_NLA_MON_ACTIVATION_THRESHOLD ] ) {
u32 val ;
val = nla_get_u32 ( attrs [ TIPC_NLA_MON_ACTIVATION_THRESHOLD ] ) ;
err = tipc_nl_monitor_set_threshold ( net , val ) ;
if ( err )
return err ;
}
return 0 ;
}
2016-07-26 08:47:20 +02:00
static int __tipc_nl_add_monitor_prop ( struct net * net , struct tipc_nl_msg * msg )
{
struct nlattr * attrs ;
void * hdr ;
u32 val ;
hdr = genlmsg_put ( msg - > skb , msg - > portid , msg - > seq , & tipc_genl_family ,
0 , TIPC_NL_MON_GET ) ;
if ( ! hdr )
return - EMSGSIZE ;
2019-04-26 11:13:06 +02:00
attrs = nla_nest_start_noflag ( msg - > skb , TIPC_NLA_MON ) ;
2016-07-26 08:47:20 +02:00
if ( ! attrs )
goto msg_full ;
val = tipc_nl_monitor_get_threshold ( net ) ;
if ( nla_put_u32 ( msg - > skb , TIPC_NLA_MON_ACTIVATION_THRESHOLD , val ) )
goto attr_msg_full ;
nla_nest_end ( msg - > skb , attrs ) ;
genlmsg_end ( msg - > skb , hdr ) ;
return 0 ;
attr_msg_full :
nla_nest_cancel ( msg - > skb , attrs ) ;
msg_full :
genlmsg_cancel ( msg - > skb , hdr ) ;
return - EMSGSIZE ;
}
int tipc_nl_node_get_monitor ( struct sk_buff * skb , struct genl_info * info )
{
struct net * net = sock_net ( skb - > sk ) ;
struct tipc_nl_msg msg ;
int err ;
msg . skb = nlmsg_new ( NLMSG_GOODSIZE , GFP_KERNEL ) ;
2017-04-23 15:09:19 +08:00
if ( ! msg . skb )
return - ENOMEM ;
2016-07-26 08:47:20 +02:00
msg . portid = info - > snd_portid ;
msg . seq = info - > snd_seq ;
err = __tipc_nl_add_monitor_prop ( net , & msg ) ;
if ( err ) {
nlmsg_free ( msg . skb ) ;
return err ;
}
return genlmsg_reply ( msg . skb , info ) ;
}
2016-07-26 08:47:22 +02:00
int tipc_nl_node_dump_monitor ( struct sk_buff * skb , struct netlink_callback * cb )
{
struct net * net = sock_net ( skb - > sk ) ;
u32 prev_bearer = cb - > args [ 0 ] ;
struct tipc_nl_msg msg ;
2018-04-17 21:58:27 +02:00
int bearer_id ;
2016-07-26 08:47:22 +02:00
int err ;
if ( prev_bearer = = MAX_BEARERS )
return 0 ;
msg . skb = skb ;
msg . portid = NETLINK_CB ( cb - > skb ) . portid ;
msg . seq = cb - > nlh - > nlmsg_seq ;
rtnl_lock ( ) ;
2018-04-17 21:58:27 +02:00
for ( bearer_id = prev_bearer ; bearer_id < MAX_BEARERS ; bearer_id + + ) {
2018-04-25 18:29:25 +02:00
err = __tipc_nl_add_monitor ( net , & msg , bearer_id ) ;
2016-07-26 08:47:22 +02:00
if ( err )
2018-04-17 21:58:27 +02:00
break ;
2016-07-26 08:47:22 +02:00
}
rtnl_unlock ( ) ;
2018-04-17 21:58:27 +02:00
cb - > args [ 0 ] = bearer_id ;
2016-07-26 08:47:22 +02:00
return skb - > len ;
}
int tipc_nl_node_dump_monitor_peer ( struct sk_buff * skb ,
struct netlink_callback * cb )
{
struct net * net = sock_net ( skb - > sk ) ;
u32 prev_node = cb - > args [ 1 ] ;
u32 bearer_id = cb - > args [ 2 ] ;
int done = cb - > args [ 0 ] ;
struct tipc_nl_msg msg ;
int err ;
if ( ! prev_node ) {
2019-10-05 20:04:39 +02:00
struct nlattr * * attrs = genl_dumpit_info ( cb ) - > attrs ;
2016-07-26 08:47:22 +02:00
struct nlattr * mon [ TIPC_NLA_MON_MAX + 1 ] ;
if ( ! attrs [ TIPC_NLA_MON ] )
return - EINVAL ;
2019-04-26 14:07:28 +02:00
err = nla_parse_nested_deprecated ( mon , TIPC_NLA_MON_MAX ,
attrs [ TIPC_NLA_MON ] ,
tipc_nl_monitor_policy ,
NULL ) ;
2016-07-26 08:47:22 +02:00
if ( err )
return err ;
if ( ! mon [ TIPC_NLA_MON_REF ] )
return - EINVAL ;
bearer_id = nla_get_u32 ( mon [ TIPC_NLA_MON_REF ] ) ;
if ( bearer_id > = MAX_BEARERS )
return - EINVAL ;
}
if ( done )
return 0 ;
msg . skb = skb ;
msg . portid = NETLINK_CB ( cb - > skb ) . portid ;
msg . seq = cb - > nlh - > nlmsg_seq ;
rtnl_lock ( ) ;
err = tipc_nl_add_monitor_peer ( net , & msg , bearer_id , & prev_node ) ;
if ( ! err )
done = 1 ;
rtnl_unlock ( ) ;
cb - > args [ 0 ] = done ;
cb - > args [ 1 ] = prev_node ;
cb - > args [ 2 ] = bearer_id ;
return skb - > len ;
}
tipc: enable tracepoints in tipc
For debugging and tracing purposes, this commit enables tracepoints in
TIPC along with some general trace_events, listed below. It also
defines some 'tipc_*_dump()' functions that can dump TIPC object
data whenever needed, i.e. for general debug purposes and not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Place the trace functions wherever we want to dump TIPC data
or events.
Note:
a) The dump functions generate raw data only, in order to offload the
trace event's processing. A tool or script may be needed to parse the
data, but this should be simple.
b) The trace_tipc_*_dump() events should be reserved for failure cases
only (e.g. the retransmission failure case) or for conditions we do not
expect to happen too often. We can then consider enabling these events
by default, since they will have almost no effect under normal
conditions, but once the rare condition or failure occurs, we get the
dumped data in full for post-analysis.
For other trace purposes, we can reuse these trace classes as templates
for different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) As with other trace_events, features such as 'filter' or 'trigger'
are also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
2019-11-08 12:05:12 +07:00
# ifdef CONFIG_TIPC_CRYPTO
static int tipc_nl_retrieve_key ( struct nlattr * * attrs ,
struct tipc_aead_key * * key )
{
struct nlattr * attr = attrs [ TIPC_NLA_NODE_KEY ] ;
if ( ! attr )
return - ENODATA ;
* key = ( struct tipc_aead_key * ) nla_data ( attr ) ;
if ( nla_len ( attr ) < tipc_aead_key_size ( * key ) )
return - EINVAL ;
return 0 ;
}
static int tipc_nl_retrieve_nodeid ( struct nlattr * * attrs , u8 * * node_id )
{
struct nlattr * attr = attrs [ TIPC_NLA_NODE_ID ] ;
if ( ! attr )
return - ENODATA ;
if ( nla_len ( attr ) < TIPC_NODEID_LEN )
return - EINVAL ;
* node_id = ( u8 * ) nla_data ( attr ) ;
return 0 ;
}
2020-09-18 08:17:29 +07:00
static int tipc_nl_retrieve_rekeying ( struct nlattr * * attrs , u32 * intv )
{
struct nlattr * attr = attrs [ TIPC_NLA_NODE_REKEYING ] ;
if ( ! attr )
return - ENODATA ;
* intv = nla_get_u32 ( attr ) ;
return 0 ;
}
2020-02-10 16:11:09 +08:00
static int __tipc_nl_node_set_key ( struct sk_buff * skb , struct genl_info * info )
2019-11-08 12:05:12 +07:00
{
struct nlattr * attrs [ TIPC_NLA_NODE_MAX + 1 ] ;
struct net * net = sock_net ( skb - > sk ) ;
2020-09-18 08:17:26 +07:00
struct tipc_crypto * tx = tipc_net ( net ) - > crypto_tx , * c = tx ;
2019-11-08 12:05:12 +07:00
struct tipc_node * n = NULL ;
struct tipc_aead_key * ukey ;
2020-09-18 08:17:29 +07:00
bool rekeying = true , master_key = false ;
2020-09-18 08:17:26 +07:00
u8 * id , * own_id , mode ;
2020-09-18 08:17:29 +07:00
u32 intv = 0 ;
2019-11-08 12:05:12 +07:00
int rc = 0 ;
if ( ! info - > attrs [ TIPC_NLA_NODE ] )
return - EINVAL ;
rc = nla_parse_nested ( attrs , TIPC_NLA_NODE_MAX ,
info - > attrs [ TIPC_NLA_NODE ] ,
tipc_nl_node_policy , info - > extack ) ;
if ( rc )
2020-09-18 08:17:26 +07:00
return rc ;
2019-11-08 12:05:12 +07:00
own_id = tipc_own_id ( net ) ;
if ( ! own_id ) {
2020-09-18 08:17:26 +07:00
GENL_SET_ERR_MSG ( info , " not found own node identity (set id?) " ) ;
return - EPERM ;
2019-11-08 12:05:12 +07:00
}
2020-09-18 08:17:29 +07:00
rc = tipc_nl_retrieve_rekeying ( attrs , & intv ) ;
if ( rc = = - ENODATA )
rekeying = false ;
2019-11-08 12:05:12 +07:00
rc = tipc_nl_retrieve_key ( attrs , & ukey ) ;
2020-09-18 08:17:29 +07:00
if ( rc = = - ENODATA & & rekeying )
goto rekeying ;
else if ( rc )
2020-09-18 08:17:26 +07:00
return rc ;
2019-11-08 12:05:12 +07:00
2020-09-18 08:17:26 +07:00
rc = tipc_aead_key_validate ( ukey , info ) ;
2019-11-08 12:05:12 +07:00
if ( rc )
2020-09-18 08:17:26 +07:00
return rc ;
2019-11-08 12:05:12 +07:00
rc = tipc_nl_retrieve_nodeid ( attrs , & id ) ;
switch ( rc ) {
case - ENODATA :
2020-09-18 08:17:26 +07:00
mode = CLUSTER_KEY ;
tipc: introduce encryption master key
In addition to the supported cluster & per-node encryption keys for the
en/decryption of TIPC messages, we now introduce one option for user to
set a cluster key as 'master key', which is simply a symmetric key like
the former but has a longer life cycle. It has two purposes:
- Authentication of new member nodes in the cluster. New nodes, having
no knowledge of current session keys in the cluster will still be
able to join the cluster as long as they know the master key. This is
because all neighbor discovery (LINK_CONFIG) messages must be
encrypted with this key.
- Encryption of session encryption keys during automatic exchange and
update of those.This is a feature we will introduce in a later commit
in this series.
We insert the new key into the currently unused slot 0 in the key array
and start using it immediately once the user has set it.
After joining, a node only knowing the master key should be fully
communicable to existing nodes in the cluster, although those nodes may
have their own session keys activated (i.e. not the master one). To
support this, we define a 'grace period', starting from the time a node
itself reports having no RX keys, so the existing nodes will use the
master key for encryption instead. The grace period can be extended but
will automatically stop after e.g. 5 seconds without a new report. This
is also the basis for the later key exchange feature, as the new node will
be unable to decrypt anything without the support of the master key.
For user to set a master key, we define a new netlink flag -
'TIPC_NLA_NODE_KEY_MASTER', so it can be added to the current 'set key'
netlink command to specify the setting key to be a master key.
Above all, the traditional cluster/per-node key mechanism is guaranteed
to work when the user chooses not to use this master key option. This is also
compatible with legacy nodes that do not support the feature.
Even this master key can be updated without any interruption of cluster
connectivity, but if this is needed, it has to be coordinated and set by
the user.
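The grace period described above can be modeled in a few lines. This is a hedged user-space sketch, not the TIPC crypto code: the names (peer_state, peer_report_no_keys, pick_tx_slot) are hypothetical, the 5000 ms value is taken from the "e.g. 5 seconds" in this message, and time is passed in explicitly as milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MASTER_KEY_SLOT 0u     /* the text places the master key in slot 0 */
#define GRACE_PERIOD_MS 5000u  /* assumption: the "e.g. 5 seconds" above */

/* Hypothetical per-peer state kept by an existing cluster node. */
struct peer_state {
	bool reported_no_rx_keys;  /* peer said it has no RX keys */
	uint64_t last_report_ms;   /* time of the most recent such report */
};

/* A new report starts or extends the grace period. */
static void peer_report_no_keys(struct peer_state *p, uint64_t now_ms)
{
	p->reported_no_rx_keys = true;
	p->last_report_ms = now_ms;
}

/* Choose the key slot to encrypt with when sending to this peer. */
static unsigned int pick_tx_slot(struct peer_state *p,
				 unsigned int session_slot,
				 bool have_master_key, uint64_t now_ms)
{
	assert(session_slot != MASTER_KEY_SLOT);
	/* The grace period ends once no new report arrived in time. */
	if (p->reported_no_rx_keys &&
	    now_ms - p->last_report_ms >= GRACE_PERIOD_MS)
		p->reported_no_rx_keys = false;
	if (have_master_key && p->reported_no_rx_keys)
		return MASTER_KEY_SLOT;
	return session_slot;
}
```

In this model a node without a master key always falls back to its session key, which corresponds to the statement that the traditional cluster/per-node mechanism keeps working when the option is unused.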
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 08:17:27 +07:00
		master_key = !!(attrs[TIPC_NLA_NODE_KEY_MASTER]);
2019-11-08 12:05:12 +07:00
		break;
	case 0:
2020-09-18 08:17:26 +07:00
		mode = PER_NODE_KEY;
		if (memcmp(id, own_id, NODE_ID_LEN)) {
2019-11-08 12:05:12 +07:00
			n = tipc_node_find_by_id(net, id) ?:
				tipc_node_create(net, 0, id, 0xffffu, 0, true);
2020-09-18 08:17:26 +07:00
			if (unlikely(!n))
				return -ENOMEM;
2019-11-08 12:05:12 +07:00
			c = n->crypto_rx;
		}
		break;
	default:
2020-09-18 08:17:26 +07:00
		return rc;
2019-11-08 12:05:12 +07:00
	}
2020-09-18 08:17:26 +07:00
	/* Initiate the TX/RX key */
tipc: introduce encryption master key
In addition to the supported cluster & per-node encryption keys for the
en/decryption of TIPC messages, we now introduce an option for the user
to set a cluster key as 'master key', which is simply a symmetric key
like the former but has a longer life cycle. It has two purposes:
- Authentication of new member nodes in the cluster. New nodes, having
no knowledge of current session keys in the cluster, will still be
able to join the cluster as long as they know the master key. This is
because all neighbor discovery (LINK_CONFIG) messages must be
encrypted with this key.
- Encryption of session encryption keys during automatic exchange and
update of those. This is a feature we will introduce in a later commit
in this series.
We insert the new key into the currently unused slot 0 in the key array
and start using it immediately once the user has set it.
After joining, a node that only knows the master key should be able to
communicate fully with existing nodes in the cluster, although those
nodes may have their own session keys activated (i.e. not the master
one). To support this, we define a 'grace period', starting from the
time a node itself reports having no RX keys, during which the existing
nodes will use the master key for encryption instead. The grace period
can be extended but will automatically stop after e.g. 5 seconds without
a new report. This is also the basis for the later key exchange feature,
as the new node would be unable to decrypt anything without the support
of the master key.
For the user to set a master key, we define a new netlink flag -
'TIPC_NLA_NODE_KEY_MASTER', which can be added to the current 'set key'
netlink command to mark the key being set as a master key.
Above all, the traditional cluster/per-node key mechanism is guaranteed
to work when the user chooses not to use this master key option. This is
also compatible with legacy nodes that do not support the feature.
Even this master key can be updated without any interruption of cluster
connectivity, but if this is needed, it has to be coordinated and set by
the user.
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
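The grace period described in this commit message can be pictured with a small userspace sketch. It is not the kernel implementation: the `demo_*` names, the millisecond clock and the fixed 5000 ms window are assumptions made for illustration (the text only says "e.g. 5 seconds"). The one detail taken from the text is that the master key lives in slot 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MASTER_SLOT 0     /* slot 0 holds the master key (from the text) */
#define DEMO_GRACE_MS    5000u /* assumed window; the text says "e.g. 5 seconds" */

/* Hypothetical per-peer state: when the current grace period ends. */
struct demo_peer {
	uint64_t grace_deadline_ms;
};

/* The peer reported having no RX keys: start or extend the grace period. */
static void demo_on_no_rx_keys_report(struct demo_peer *p, uint64_t now_ms)
{
	p->grace_deadline_ms = now_ms + DEMO_GRACE_MS;
}

/* The grace period is active until the deadline passes without a new report. */
static bool demo_grace_active(const struct demo_peer *p, uint64_t now_ms)
{
	return now_ms < p->grace_deadline_ms;
}

/* During the grace period, encrypt with the master key instead of the
 * session key, so that a peer knowing only the master key can decrypt.
 */
static int demo_pick_tx_slot(const struct demo_peer *p, uint64_t now_ms,
			     int session_slot)
{
	return demo_grace_active(p, now_ms) ? DEMO_MASTER_SLOT : session_slot;
}
```

Each new report simply pushes the deadline forward, which is one way to read "the grace period can be extended but will automatically stop".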
2020-09-18 08:17:27 +07:00
	rc = tipc_crypto_key_init(c, ukey, mode, master_key);
2020-09-18 08:17:26 +07:00
	if (n)
		tipc_node_put(n);
2020-09-18 08:17:27 +07:00
	if (unlikely(rc < 0)) {
2020-09-18 08:17:26 +07:00
		GENL_SET_ERR_MSG(info, "unable to initiate or attach new key");
		return rc;
tipc: add automatic session key exchange
With support from the master key option in the previous commit, it
becomes easy to make frequent updates/exchanges of session keys between
authenticated cluster nodes.
Basically, there are two situations where the key exchange will take
place:
- When a new node joins the cluster (with the master key), it will need
to get its peer's TX key, so that it is able to decrypt further
messages from that peer.
- When a new session key is generated (by either user manual setting or
the later automatic rekeying feature), the key will be distributed to
all peer nodes in the cluster.
A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
/KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
using the master key before sending out. Upon receipt of the message it
will be decrypted in the same way as regular messages, then attached as
the sender's RX key in the receiver node.
In this way, the key exchange is made reliable by the link layer, while
security, integrity and authenticity are provided by the crypto layer.
Also, forward security can easily be achieved by the user actively
changing the master key, but this should not be required very
frequently.
The key exchange feature is independent of the presence of a master key.
Note however that the master key still is needed for new nodes to be
able to join the cluster. It is also optional, and can be turned off/on
via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled].
Backward compatibility is guaranteed because for nodes that do not have
master key support, key exchange using the master key, i.e. tx_key = 0,
if any, will be discarded shortly at the message validation step. In
other words, the key exchange feature will be automatically disabled for
those nodes.
v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
error in node.c. The function only exists when built with the TIPC
"CONFIG_TIPC_CRYPTO" option.
v3: use 'info->extack' for a message emitted due to netlink operations
instead (- David's comment).
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 08:17:28 +07:00
	} else if (c == tx) {
		/* Distribute TX key but not master one */
		if (!master_key && tipc_crypto_key_distr(tx, rc, NULL))
			GENL_SET_ERR_MSG(info, "failed to replicate new key");
2020-09-18 08:17:29 +07:00
rekeying:
		/* Schedule TX rekeying if needed */
		tipc_crypto_rekeying_sched(tx, rekeying, intv);
2020-09-18 08:17:26 +07:00
	}
	return 0;
2019-11-08 12:05:12 +07:00
}
int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
{
	int err;
	rtnl_lock();
	err = __tipc_nl_node_set_key(skb, info);
	rtnl_unlock();
	return err;
}
2020-02-10 16:11:09 +08:00
static int __tipc_nl_node_flush_key(struct sk_buff *skb,
				    struct genl_info *info)
2019-11-08 12:05:12 +07:00
{
	struct net *net = sock_net(skb->sk);
	struct tipc_net *tn = tipc_net(net);
	struct tipc_node *n;
	tipc_crypto_key_flush(tn->crypto_tx);
	rcu_read_lock();
	list_for_each_entry_rcu(n, &tn->node_list, list)
		tipc_crypto_key_flush(n->crypto_rx);
	rcu_read_unlock();
	return 0;
}
int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info)
{
	int err;
	rtnl_lock();
	err = __tipc_nl_node_flush_key(skb, info);
	rtnl_unlock();
	return err;
}
#endif
tipc: enable tracepoints in tipc
For the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow dumping TIPC object
data whenever needed, that is, for general debug purposes, i.e. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): traces and dumps TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): traces and dumps any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): traces and dumps TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): traces and dumps TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): traces and dumps TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any place where we want to dump TIPC data
or events.
Note:
a) The dump functions generate raw data only, that is, to offload
the trace event's processing. A tool or script may be required to parse
the data, but this should be simple.
b) The trace_tipc_*_dump() events should be reserved for failure cases
only (e.g. the retransmission failure case) or for conditions we do not
expect to happen too often. We can then consider enabling these events
by default, since they will have almost no effect under normal
conditions, but once the rare condition or failure occurs, we get the
dumped data in full for post-analysis.
For other trace purposes, we can reuse these trace classes as templates
but with different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to the 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) As with other trace_events, features like 'filter' or 'trigger'
are also usable for the TIPC trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 09:17:56 +07:00
/**
 * tipc_node_dump - dump TIPC node data
 * @n: tipc node to be dumped
 * @more: dump more?
 *        - false: dump only tipc node data
 *        - true: dump node link data as well
 * @buf: returned buffer of dump data in format
 */
int tipc_node_dump(struct tipc_node *n, bool more, char *buf)
{
	int i = 0;
	size_t sz = (more) ? NODE_LMAX : NODE_LMIN;
	if (!n) {
		i += scnprintf(buf, sz, "node data: (null)\n");
		return i;
	}
	i += scnprintf(buf, sz, "node data: %x", n->addr);
	i += scnprintf(buf + i, sz - i, " %x", n->state);
	i += scnprintf(buf + i, sz - i, " %d", n->active_links[0]);
	i += scnprintf(buf + i, sz - i, " %d", n->active_links[1]);
	i += scnprintf(buf + i, sz - i, " %x", n->action_flags);
	i += scnprintf(buf + i, sz - i, " %u", n->failover_sent);
	i += scnprintf(buf + i, sz - i, " %u", n->sync_point);
	i += scnprintf(buf + i, sz - i, " %d", n->link_cnt);
	i += scnprintf(buf + i, sz - i, " %u", n->working_links);
	i += scnprintf(buf + i, sz - i, " %x", n->capabilities);
	i += scnprintf(buf + i, sz - i, " %lu\n", n->keepalive_intv);
	if (!more)
		return i;
	i += scnprintf(buf + i, sz - i, "link_entry[0]:\n");
	i += scnprintf(buf + i, sz - i, " mtu: %u\n", n->links[0].mtu);
	i += scnprintf(buf + i, sz - i, " media: ");
	i += tipc_media_addr_printf(buf + i, sz - i, &n->links[0].maddr);
	i += scnprintf(buf + i, sz - i, "\n");
	i += tipc_link_dump(n->links[0].link, TIPC_DUMP_NONE, buf + i);
	i += scnprintf(buf + i, sz - i, " inputq: ");
	i += tipc_list_dump(&n->links[0].inputq, false, buf + i);
	i += scnprintf(buf + i, sz - i, "link_entry[1]:\n");
	i += scnprintf(buf + i, sz - i, " mtu: %u\n", n->links[1].mtu);
	i += scnprintf(buf + i, sz - i, " media: ");
	i += tipc_media_addr_printf(buf + i, sz - i, &n->links[1].maddr);
	i += scnprintf(buf + i, sz - i, "\n");
	i += tipc_link_dump(n->links[1].link, TIPC_DUMP_NONE, buf + i);
	i += scnprintf(buf + i, sz - i, " inputq: ");
	i += tipc_list_dump(&n->links[1].inputq, false, buf + i);
	i += scnprintf(buf + i, sz - i, "bclink:\n ");
	i += tipc_link_dump(n->bc_entry.link, TIPC_DUMP_NONE, buf + i);
	return i;
}
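The `i += scnprintf(buf + i, sz - i, ...)` idiom in tipc_node_dump() relies on scnprintf() returning the number of characters actually stored, so `i` never runs past the end of the buffer. The userspace sketch below imitates that behaviour on top of the standard vsnprintf(); `bounded_printf`, `struct demo_node` and `demo_node_dump` are hypothetical names for illustration, not kernel APIs.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Like snprintf(), but returns the number of characters actually stored
 * (excluding the terminating NUL) instead of the would-be length.
 */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (n < 0)
		return 0;                /* encoding error: nothing usable stored */
	if ((size_t)n >= size)
		return (int)(size - 1);  /* output was truncated */
	return n;
}

/* Hypothetical, reduced stand-in for the fields dumped above. */
struct demo_node {
	unsigned int addr;
	unsigned int state;
	int link_cnt;
};

/* Append fields one by one; i stays <= sz - 1, so sz - i never wraps. */
static int demo_node_dump(const struct demo_node *n, char *buf, size_t sz)
{
	int i = 0;

	if (!n)
		return bounded_printf(buf, sz, "node data: (null)\n");
	i += bounded_printf(buf, sz, "node data: %x", n->addr);
	i += bounded_printf(buf + i, sz - i, " %x", n->state);
	i += bounded_printf(buf + i, sz - i, " %d\n", n->link_cnt);
	return i;
}
```

With plain snprintf() the return value could exceed the remaining space after a truncation, and `sz - i` would then wrap around as an unsigned value. Clamping the return value avoids that, at the cost of silently truncating the dump.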
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive path through the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go through
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
has to obey the policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces' hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can be justified to allow it to shortcut
the lower protocol layers.
Regarding traceability, we should note that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
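The proof check in the fourth security bullet can be sketched as a table lookup. This is a simplified userspace illustration, not the kernel code: it assumes the proof carried in the discovery message is compared directly with each local namespace's hash mix, and `struct demo_local_ns`, `demo_find_local_peer` and the 16-byte `DEMO_NODE_ID_LEN` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NODE_ID_LEN 16 /* assumed node identity length for this sketch */

/* Hypothetical view of what a receiver knows about one local namespace. */
struct demo_local_ns {
	uint32_t hash_mix;
	uint32_t cluster_id;
	uint8_t node_id[DEMO_NODE_ID_LEN];
};

/* Return the index of the local namespace whose hash mix, cluster id and
 * node id all match the values seen in a discovery message, or -1 if the
 * peer cannot be shown to live in a local namespace.
 */
static int demo_find_local_peer(const struct demo_local_ns *tbl, size_t cnt,
				uint32_t proof, const uint8_t *node_id,
				uint32_t cluster_id)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		if (tbl[i].hash_mix != proof)
			continue;
		if (tbl[i].cluster_id != cluster_id)
			continue;
		if (memcmp(tbl[i].node_id, node_id, DEMO_NODE_ID_LEN))
			continue;
		return (int)i;
	}
	return -1;
}
```

All three values have to agree before the shortcut is considered. A matching hash mix alone is not treated as sufficient in this sketch, in line with the "along with a matching node id and cluster id" wording.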
2019-10-29 07:51:21 +07:00
void tipc_node_pre_cleanup_net(struct net *exit_net)
{
	struct tipc_node *n;
	struct tipc_net *tn;
	struct net *tmp;
	rcu_read_lock();
	for_each_net_rcu(tmp) {
		if (tmp == exit_net)
			continue;
		tn = tipc_net(tmp);
		if (!tn)
			continue;
		spin_lock_bh(&tn->node_list_lock);
		list_for_each_entry_rcu(n, &tn->node_list, list) {
			if (!n->peer_net)
				continue;
			if (n->peer_net != exit_net)
				continue;
			tipc_node_write_lock(n);
			n->peer_net = NULL;
			n->peer_hash_mix = 0;
			tipc_node_write_unlock_fast(n);
			break;
		}
		spin_unlock_bh(&tn->node_list_lock);
	}
	rcu_read_unlock();
}
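Stripped of RCU and locking, the cleanup above amounts to clearing every cached reference to the exiting namespace. A minimal userspace sketch of that idea follows; the `demo_*` types are hypothetical, the node list is flattened into an array, and unlike the kernel loop (which stops at the first match within each namespace) it clears every match it finds.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct net and struct tipc_node. */
struct demo_ns {
	int id;
};

struct demo_peer_node {
	struct demo_ns *peer_net; /* non-NULL only for kernel-local peers */
	uint32_t peer_hash_mix;
};

/* Invalidate cached pointers to an exiting namespace so that no sender
 * takes the cross-namespace shortcut to it any more. Returns the number
 * of entries cleared.
 */
static size_t demo_drop_peer_refs(struct demo_peer_node *nodes, size_t cnt,
				  const struct demo_ns *exiting)
{
	size_t i, cleared = 0;

	for (i = 0; i < cnt; i++) {
		if (!nodes[i].peer_net || nodes[i].peer_net != exiting)
			continue;
		nodes[i].peer_net = NULL;
		nodes[i].peer_hash_mix = 0;
		cleared++;
	}
	return cleared;
}
```

Entries pointing at other namespaces, or at no namespace at all, are left untouched.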