2006-01-02 19:04:38 +01:00
/*
* net / tipc / msg . h : Include file for TIPC message header routines
2007-02-09 23:25:21 +09:00
*
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
* Copyright ( c ) 2000 - 2007 , 2014 - 2017 Ericsson AB
2011-01-25 13:33:31 -05:00
* Copyright ( c ) 2005 - 2008 , 2010 - 2011 , Wind River Systems
2006-01-02 19:04:38 +01:00
* All rights reserved .
*
2006-01-11 13:30:43 +01:00
* Redistribution and use in source and binary forms , with or without
2006-01-02 19:04:38 +01:00
* modification , are permitted provided that the following conditions are met :
*
2006-01-11 13:30:43 +01:00
* 1. Redistributions of source code must retain the above copyright
* notice , this list of conditions and the following disclaimer .
* 2. Redistributions in binary form must reproduce the above copyright
* notice , this list of conditions and the following disclaimer in the
* documentation and / or other materials provided with the distribution .
* 3. Neither the names of the copyright holders nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission .
2006-01-02 19:04:38 +01:00
*
2006-01-11 13:30:43 +01:00
* Alternatively , this software may be distributed under the terms of the
* GNU General Public License ( " GPL " ) version 2 as published by the Free
* Software Foundation .
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS " AS IS "
* AND ANY EXPRESS OR IMPLIED WARRANTIES , INCLUDING , BUT NOT LIMITED TO , THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED . IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT , INDIRECT , INCIDENTAL , SPECIAL , EXEMPLARY , OR
* CONSEQUENTIAL DAMAGES ( INCLUDING , BUT NOT LIMITED TO , PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES ; LOSS OF USE , DATA , OR PROFITS ; OR BUSINESS
* INTERRUPTION ) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY , WHETHER IN
* CONTRACT , STRICT LIABILITY , OR TORT ( INCLUDING NEGLIGENCE OR OTHERWISE )
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE , EVEN IF ADVISED OF THE
2006-01-02 19:04:38 +01:00
* POSSIBILITY OF SUCH DAMAGE .
*/
# ifndef _TIPC_MSG_H
# define _TIPC_MSG_H
2015-01-09 15:27:07 +08:00
# include <linux/tipc.h>
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
# include "core.h"
2006-01-02 19:04:38 +01:00
2011-04-08 10:50:52 -04:00
/*
* Constants and routines used to read and write TIPC payload message headers
*
* Note : Some items are also used with TIPC internal message headers
*/
2006-01-02 19:04:38 +01:00
# define TIPC_VERSION 2
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
struct plist ;
2008-03-06 15:06:55 -08:00
2010-11-30 12:00:53 +00:00
/*
2011-04-08 10:50:52 -04:00
* Payload message users are defined in TIPC ' s public API :
* - TIPC_LOW_IMPORTANCE
* - TIPC_MEDIUM_IMPORTANCE
* - TIPC_HIGH_IMPORTANCE
* - TIPC_CRITICAL_IMPORTANCE
*/
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
# define TIPC_SYSTEM_IMPORTANCE 4
2011-04-08 10:50:52 -04:00
/*
* Payload message types
2010-11-30 12:00:53 +00:00
*/
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
# define TIPC_CONN_MSG 0
# define TIPC_MCAST_MSG 1
# define TIPC_NAMED_MSG 2
# define TIPC_DIRECT_MSG 3
2017-10-13 11:04:25 +02:00
# define TIPC_GRP_MEMBER_EVT 4
# define TIPC_GRP_BCAST_MSG 5
2017-10-13 11:04:29 +02:00
# define TIPC_GRP_MCAST_MSG 6
# define TIPC_GRP_UCAST_MSG 7
2010-11-30 12:00:53 +00:00
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
/*
* Internal message users
*/
# define BCAST_PROTOCOL 5
# define MSG_BUNDLER 6
# define LINK_PROTOCOL 7
# define CONN_MANAGER 8
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
# define GROUP_PROTOCOL 9
tipc: eliminate delayed link deletion at link failover
When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.
In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.
We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:01 -04:00
# define TUNNEL_PROTOCOL 10
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
# define NAME_DISTRIBUTOR 11
# define MSG_FRAGMENTER 12
# define LINK_CONFIG 13
# define SOCK_WAKEUP 14 /* pseudo user */
2017-10-13 11:04:17 +02:00
# define TOP_SRV 15 /* pseudo user */
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
2011-04-08 10:50:52 -04:00
/*
* Message header sizes
*/
2011-05-31 15:03:18 -04:00
# define SHORT_H_SIZE 24 /* In-cluster basic payload message */
# define BASIC_H_SIZE 32 /* Basic payload message */
# define NAMED_H_SIZE 40 /* Named payload message */
# define MCAST_H_SIZE 44 /* Multicast payload message */
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
# define GROUP_H_SIZE 44 /* Group payload message */
2008-03-06 15:06:55 -08:00
# define INT_H_SIZE 40 /* Internal messages */
# define MIN_H_SIZE 24 /* Smallest legal TIPC header size */
# define MAX_H_SIZE 60 /* Largest possible TIPC header size */
2006-01-02 19:04:38 +01:00
# define MAX_MSG_SIZE (MAX_H_SIZE + TIPC_MAX_USER_MSG_SIZE)
2017-11-30 16:47:25 +01:00
# define FB_MTU 3744
2015-02-27 08:56:57 +01:00
# define TIPC_MEDIA_INFO_OFFSET 5
2011-10-07 15:19:11 -04:00
2015-01-09 15:27:01 +08:00
struct tipc_skb_cb {
2019-11-08 12:05:11 +07:00
union {
struct {
struct sk_buff * tail ;
unsigned long nxt_retr ;
unsigned long retr_stamp ;
u32 bytes_read ;
u32 orig_member ;
u16 chain_imp ;
u16 ackers ;
u16 retr_cnt ;
} __packed ;
# ifdef CONFIG_TIPC_CRYPTO
struct {
struct tipc_crypto * rx ;
struct tipc_aead * last ;
u8 recurs ;
} tx_clone_ctx __packed ;
# endif
} __packed ;
union {
struct {
u8 validated : 1 ;
# ifdef CONFIG_TIPC_CRYPTO
u8 encrypted : 1 ;
u8 decrypted : 1 ;
u8 probe : 1 ;
u8 tx_clone_deferred : 1 ;
# endif
} ;
u8 flags ;
} ;
u8 reserved ;
# ifdef CONFIG_TIPC_CRYPTO
void * crypto_ctx ;
# endif
} __packed ;
2015-01-09 15:27:01 +08:00
# define TIPC_SKB_CB(__skb) ((struct tipc_skb_cb *)&((__skb)->cb[0]))
2010-11-30 12:00:53 +00:00
struct tipc_msg {
__be32 hdr [ 15 ] ;
} ;
2007-02-09 23:25:21 +09:00
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:51 +07:00
/* struct tipc_gap_ack - TIPC Gap ACK block
* @ ack : seqno of the last consecutive packet in link deferdq
* @ gap : number of gap packets since the last ack
*
* E . g :
* link deferdq : 1 2 3 4 10 11 13 14 15 20
* - - > Gap ACK blocks : < 4 , 5 > , < 11 , 1 > , < 15 , 4 > , < 20 , 0 >
*/
struct tipc_gap_ack {
__be16 ack ;
__be16 gap ;
} ;
/* struct tipc_gap_ack_blks
* @ len : actual length of the record
2020-05-26 16:38:34 +07:00
* @ ugack_cnt : number of Gap ACK blocks for unicast ( following the broadcast
* ones )
* @ start_index : starting index for " valid " broadcast Gap ACK blocks
* @ bgack_cnt : number of Gap ACK blocks for broadcast in the record
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:51 +07:00
* @ gacks : array of Gap ACK blocks
2020-05-26 16:38:34 +07:00
*
* 31 16 15 0
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - +
* | bgack_cnt | ugack_cnt | len |
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + -
* | gap | ack | |
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + > bc gacks
* : : : |
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + -
* | gap | ack | |
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + > uc gacks
* : : : |
* + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - + -
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:51 +07:00
*/
struct tipc_gap_ack_blks {
__be16 len ;
2020-05-26 16:38:34 +07:00
union {
u8 ugack_cnt ;
u8 start_index ;
} ;
u8 bgack_cnt ;
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:51 +07:00
struct tipc_gap_ack gacks [ ] ;
} ;
2020-05-26 16:38:34 +07:00
# define MAX_GAP_ACK_BLKS 128
2020-06-18 08:35:00 -05:00
# define MAX_GAP_ACK_BLKS_SZ (sizeof(struct tipc_gap_ack_blks) + \
sizeof ( struct tipc_gap_ack ) * MAX_GAP_ACK_BLKS )
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:51 +07:00
2015-01-09 15:27:01 +08:00
static inline struct tipc_msg * buf_msg ( struct sk_buff * skb )
{
return ( struct tipc_msg * ) skb - > data ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_word ( struct tipc_msg * m , u32 pos )
{
return ntohl ( m - > hdr [ pos ] ) ;
}
2006-01-02 19:04:38 +01:00
static inline void msg_set_word ( struct tipc_msg * m , u32 w , u32 val )
{
m - > hdr [ w ] = htonl ( val ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_bits ( struct tipc_msg * m , u32 w , u32 pos , u32 mask )
{
return ( msg_word ( m , w ) > > pos ) & mask ;
}
2006-01-02 19:04:38 +01:00
static inline void msg_set_bits ( struct tipc_msg * m , u32 w ,
u32 pos , u32 mask , u32 val )
{
2007-04-24 14:51:55 -07:00
val = ( val & mask ) < < pos ;
2008-04-26 22:42:14 -07:00
mask = mask < < pos ;
m - > hdr [ w ] & = ~ htonl ( mask ) ;
m - > hdr [ w ] | = htonl ( val ) ;
2006-01-02 19:04:38 +01:00
}
2008-06-04 17:54:48 -07:00
static inline void msg_swap_words ( struct tipc_msg * msg , u32 a , u32 b )
{
u32 temp = msg - > hdr [ a ] ;
msg - > hdr [ a ] = msg - > hdr [ b ] ;
msg - > hdr [ b ] = temp ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 0
*/
static inline u32 msg_version ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 29 , 7 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_version ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
2008-03-06 15:07:42 -08:00
msg_set_bits ( m , 0 , 29 , 7 , TIPC_VERSION ) ;
2006-01-02 19:04:38 +01:00
}
static inline u32 msg_user ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 25 , 0xf ) ;
}
static inline u32 msg_isdata ( struct tipc_msg * m )
{
2010-09-22 20:43:57 +00:00
return msg_user ( m ) < = TIPC_CRITICAL_IMPORTANCE ;
2006-01-02 19:04:38 +01:00
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_user ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 0 , 25 , 0xf , n ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_hdr_sz ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 21 , 0xf ) < < 2 ;
}
2010-12-31 18:59:32 +00:00
static inline void msg_set_hdr_sz ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 0 , 21 , 0xf , n > > 2 ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_size ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 0 , 0x1ffff ) ;
}
2017-10-13 11:04:29 +02:00
static inline u32 msg_blocks ( struct tipc_msg * m )
{
return ( msg_size ( m ) / 1024 ) + 1 ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_data_sz ( struct tipc_msg * m )
{
return msg_size ( m ) - msg_hdr_sz ( m ) ;
}
2007-02-09 23:25:21 +09:00
static inline int msg_non_seq ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 0 , 20 , 1 ) ;
}
2008-06-04 17:54:48 -07:00
static inline void msg_set_non_seq ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
2008-06-04 17:54:48 -07:00
msg_set_bits ( m , 0 , 20 , 1 , n ) ;
2006-01-02 19:04:38 +01:00
}
2018-09-28 20:23:21 +02:00
static inline int msg_is_syn ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 17 , 1 ) ;
}
static inline void msg_set_syn ( struct tipc_msg * m , u32 d )
{
msg_set_bits ( m , 0 , 17 , 1 , d ) ;
}
2007-02-09 23:25:21 +09:00
static inline int msg_dest_droppable ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 0 , 19 , 1 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_dest_droppable ( struct tipc_msg * m , u32 d )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 0 , 19 , 1 , d ) ;
}
tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.
To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.
To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.
It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 09:59:26 +01:00
static inline int msg_is_keepalive ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 19 , 1 ) ;
}
static inline void msg_set_is_keepalive ( struct tipc_msg * m , u32 d )
{
msg_set_bits ( m , 0 , 19 , 1 , d ) ;
}
2007-02-09 23:25:21 +09:00
static inline int msg_src_droppable ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 0 , 18 , 1 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_src_droppable ( struct tipc_msg * m , u32 d )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 0 , 18 , 1 , d ) ;
}
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 14:00:41 +01:00
static inline int msg_ack_required ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 18 , 1 ) ;
}
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
static inline void msg_set_ack_required ( struct tipc_msg * m )
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 14:00:41 +01:00
{
tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.
In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.
In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.
When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.
Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.
This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.
In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.
Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:38:38 +07:00
msg_set_bits ( m , 0 , 18 , 1 , 1 ) ;
}
static inline int msg_nagle_ack ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 18 , 1 ) ;
}
static inline void msg_set_nagle_ack ( struct tipc_msg * m )
{
msg_set_bits ( m , 0 , 18 , 1 , 1 ) ;
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 14:00:41 +01:00
}
tipc: smooth change between replicast and broadcast
Currently, a multicast stream may start out using replicast, because
there are few destinations, and then it should ideally switch to
L2/broadcast IGMP/multicast when the number of destinations grows beyond
a certain limit. The opposite should happen when the number decreases
below the limit.
To eliminate the risk of message reordering caused by method change,
a sending socket must stick to a previously selected method until it
enters an idle period of 5 seconds. Means there is a 5 seconds pause
in the traffic from the sender socket.
If the sender never makes such a pause, the method will never change,
and transmission may become very inefficient as the cluster grows.
With this commit, we allow such a switch between replicast and
broadcast without any need for a traffic pause.
Solution is to send a dummy message with only the header, also with
the SYN bit set, via broadcast or replicast. For the data message,
the SYN bit is set and sending via replicast or broadcast (inverse
method with dummy).
Then, at receiving side any messages follow first SYN bit message
(data or dummy message), they will be held in deferred queue until
another pair (dummy or data message) arrived in other link.
v2: reverse christmas tree declaration
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 18:49:50 +07:00
static inline bool msg_is_rcast ( struct tipc_msg * m )
{
return msg_bits ( m , 0 , 18 , 0x1 ) ;
}
static inline void msg_set_is_rcast ( struct tipc_msg * m , bool d )
{
msg_set_bits ( m , 0 , 18 , 0x1 , d ) ;
}
2006-01-02 19:04:38 +01:00
static inline void msg_set_size ( struct tipc_msg * m , u32 sz )
{
m - > hdr [ 0 ] = htonl ( ( msg_word ( m , 0 ) & ~ 0x1ffff ) | sz ) ;
}
2015-03-25 12:07:25 -04:00
static inline unchar * msg_data ( struct tipc_msg * m )
{
return ( ( unchar * ) m ) + msg_hdr_sz ( m ) ;
}
2019-06-25 19:37:00 +02:00
static inline struct tipc_msg * msg_inner_hdr ( struct tipc_msg * m )
2015-03-25 12:07:25 -04:00
{
return ( struct tipc_msg * ) msg_data ( m ) ;
}
2006-01-02 19:04:38 +01:00
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 1
*/
2010-11-30 12:00:53 +00:00
static inline u32 msg_type ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 29 , 0x7 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_type ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 29 , 0x7 , n ) ;
}
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
static inline int msg_in_group ( struct tipc_msg * m )
{
2017-10-13 11:04:25 +02:00
int mtyp = msg_type ( m ) ;
2017-10-13 11:04:27 +02:00
return mtyp > = TIPC_GRP_MEMBER_EVT & & mtyp < = TIPC_GRP_UCAST_MSG ;
2017-10-13 11:04:25 +02:00
}
static inline bool msg_is_grp_evt ( struct tipc_msg * m )
{
return msg_type ( m ) = = TIPC_GRP_MEMBER_EVT ;
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_named ( struct tipc_msg * m )
{
return msg_type ( m ) = = TIPC_NAMED_MSG ;
}
static inline u32 msg_mcast ( struct tipc_msg * m )
{
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
int mtyp = msg_type ( m ) ;
2017-10-13 11:04:29 +02:00
return ( ( mtyp = = TIPC_MCAST_MSG ) | | ( mtyp = = TIPC_GRP_BCAST_MSG ) | |
( mtyp = = TIPC_GRP_MCAST_MSG ) ) ;
2010-11-30 12:00:53 +00:00
}
static inline u32 msg_connected ( struct tipc_msg * m )
{
return msg_type ( m ) = = TIPC_CONN_MSG ;
}
2020-03-26 09:50:29 +07:00
static inline u32 msg_direct ( struct tipc_msg * m )
{
return msg_type ( m ) = = TIPC_DIRECT_MSG ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_errcode ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 25 , 0xf ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_errcode ( struct tipc_msg * m , u32 err )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 25 , 0xf , err ) ;
}
tipc: update a binding service via broadcast
Currently, updating binding table (add service binding to
name table/withdraw a service binding) is being sent over replicast.
However, if we are scaling up clusters to > 100 nodes/containers this
method is less affection because of looping through nodes in a cluster one
by one.
It is worth to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypass the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add santiy check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17 13:56:05 +07:00
static inline void msg_set_bulk ( struct tipc_msg * m )
{
msg_set_bits ( m , 1 , 28 , 0x1 , 1 ) ;
}
static inline u32 msg_is_bulk ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 28 , 0x1 ) ;
}
static inline void msg_set_last_bulk ( struct tipc_msg * m )
{
msg_set_bits ( m , 1 , 27 , 0x1 , 1 ) ;
}
static inline u32 msg_is_last_bulk ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 27 , 0x1 ) ;
}
static inline void msg_set_non_legacy ( struct tipc_msg * m )
{
msg_set_bits ( m , 1 , 26 , 0x1 , 1 ) ;
}
static inline u32 msg_is_legacy ( struct tipc_msg * m )
{
return ! msg_bits ( m , 1 , 26 , 0x1 ) ;
}
2007-02-09 23:25:21 +09:00
static inline u32 msg_reroute_cnt ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 1 , 21 , 0xf ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_incr_reroute_cnt ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 21 , 0xf , msg_reroute_cnt ( m ) + 1 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_reset_reroute_cnt ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 21 , 0xf , 0 ) ;
}
static inline u32 msg_lookup_scope ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 19 , 0x3 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_lookup_scope ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 19 , 0x3 , n ) ;
}
2015-05-14 10:46:14 -04:00
static inline u16 msg_bcast_ack ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 1 , 0 , 0xffff ) ;
}
2015-05-14 10:46:14 -04:00
static inline void msg_set_bcast_ack ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 1 , 0 , 0xffff , n ) ;
}
tipc: fix link session and re-establish issues
When a link endpoint is re-created (e.g. after a node reboot or
interface reset), the link session number is varied by random, the peer
endpoint will be synced with this new session number before the link is
re-established.
However, there is a shortcoming in this mechanism that can lead to the
link never re-established or faced with a failure then. It happens when
the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
well as the 'in_session' flag have been set, but suddenly this link
endpoint leaves. When it comes back with a random session number, there
are two situations possible:
1/ If the random session number is larger than (or equal to) the
previous one, the peer endpoint will be updated with this new session
upon receipt of a RESET_MSG from this endpoint, and the link can be re-
established as normal. Otherwise, all the RESET_MSGs from this endpoint
will be rejected by the peer. In turn, when this link endpoint receives
one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
to send STATE_MSGs, but again these messages will be dropped by the
peer due to wrong session.
The peer link endpoint can still become ESTABLISHED after receiving a
traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
will be forced down sooner or later!
Even in case the random session number is larger than the previous one,
it can be that the ACTIVATE_MSG from the peer arrives first, and this
link endpoint moves quickly to ESTABLISHED without sending out any
RESET_MSG yet. Consequently, the peer link will not be updated with the
new session number, and the same link failure scenario as above will
happen.
2/ Another situation can be that, the peer link endpoint was reset due
to any reasons in the meantime, its link state was set to RESET from
ESTABLISHING but still in session, i.e. the 'in_session' flag is not
reset...
Now, if the random session number from this endpoint is less than the
previous one, all the RESET_MSGs from this endpoint will be rejected by
the peer. In the other direction, when this link endpoint receives a
RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
As a result, the link cannot be re-established but gets stuck with this
link endpoint in state ESTABLISHING and the peer in RESET!
Solution:
===========
This link endpoint should not go directly to ESTABLISHED when getting
ACTIVATE_MSG from the peer which may belong to the old session if the
link was re-created. To ensure the session to be correct before the
link is re-established, the peer endpoint in ESTABLISHING state will
send back the last session number in ACTIVATE_MSG for a verification at
this endpoint. Then, if needed, a new and more appropriate session
number will be regenerated to force a re-synch first.
In addition, when a link in ESTABLISHING state is reset, its state will
move to RESET according to the link FSM, along with resetting the
'in_session' flag (and the other data) as a normal link reset, it will
also be deleted if requested.
The solution is backward compatible.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 13:29:43 +07:00
/* Note: reusing bits in word 1 for ACTIVATE_MSG only, to re-synch
* link peer session number
*/
static inline bool msg_dest_session_valid ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 16 , 0x1 ) ;
}
static inline void msg_set_dest_session_valid ( struct tipc_msg * m , bool valid )
{
msg_set_bits ( m , 1 , 16 , 0x1 , valid ) ;
}
static inline u16 msg_dest_session ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 0 , 0xffff ) ;
}
static inline void msg_set_dest_session ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 1 , 0 , 0xffff , n ) ;
}
2006-01-02 19:04:38 +01:00
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 2
*/
2015-05-14 10:46:14 -04:00
static inline u16 msg_ack ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 2 , 16 , 0xffff ) ;
}
2015-05-14 10:46:14 -04:00
static inline void msg_set_ack ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 2 , 16 , 0xffff , n ) ;
}
2015-05-14 10:46:14 -04:00
static inline u16 msg_seqno ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 2 , 0 , 0xffff ) ;
}
2015-05-14 10:46:14 -04:00
static inline void msg_set_seqno ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 2 , 0 , 0xffff , n ) ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Words 3 - 10
*/
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
static inline u32 msg_importance ( struct tipc_msg * m )
{
tipc: improve link congestion algorithm
The link congestion algorithm used until now implies two problems.
- It is too generous towards lower-level messages in situations of high
load by giving "absolute" bandwidth guarantees to the different
priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
available bandwidth. But, in the absence of higher level traffic, the
ratio between two distinct levels becomes unreasonable. E.g. if there
is only LOW and MEDIUM traffic on a system, the former is guaranteed
1/3 of the bandwidth, and the latter 2/3. This again means that if
there is e.g. one LOW user and 10 MEDIUM users, the former will have
33.3% of the bandwidth, and the others will have to compete for the
remainder, i.e. each will end up with 6.7% of the capacity.
- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
but only after the packets bundled into it have passed the congestion
test for their own respective levels. Since bundled packets don't
result in incrementing the level counter for their own importance,
only occasionally for the SYSTEM level counter, they do in practice
obtain SYSTEM level importance. Hence, the current implementation
provides a gap in the congestion algorithm that in the worst case
may lead to a link reset.
We now refine the congestion algorithm as follows:
- A message is accepted to the link backlog only if its own level
counter, and all superior level counters, permit it.
- The importance of a created bundle packet is set according to its
contents. A bundle packet created from messges at levels LOW to
CRITICAL is given importance level CRITICAL, while a bundle created
from a SYSTEM level message is given importance SYSTEM. In the latter
case only subsequent SYSTEM level messages are allowed to be bundled
into it.
This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.
The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.
Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 10:46:17 -04:00
int usr = msg_user ( m ) ;
if ( likely ( ( usr < = TIPC_CRITICAL_IMPORTANCE ) & & ! msg_errcode ( m ) ) )
return usr ;
if ( ( usr = = MSG_FRAGMENTER ) | | ( usr = = MSG_BUNDLER ) )
2015-10-14 09:23:18 -04:00
return msg_bits ( m , 9 , 0 , 0x7 ) ;
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
return TIPC_SYSTEM_IMPORTANCE ;
}
static inline void msg_set_importance ( struct tipc_msg * m , u32 i )
{
tipc: improve link congestion algorithm
The link congestion algorithm used until now implies two problems.
- It is too generous towards lower-level messages in situations of high
load by giving "absolute" bandwidth guarantees to the different
priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
available bandwidth. But, in the absence of higher level traffic, the
ratio between two distinct levels becomes unreasonable. E.g. if there
is only LOW and MEDIUM traffic on a system, the former is guaranteed
1/3 of the bandwidth, and the latter 2/3. This again means that if
there is e.g. one LOW user and 10 MEDIUM users, the former will have
33.3% of the bandwidth, and the others will have to compete for the
remainder, i.e. each will end up with 6.7% of the capacity.
- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
but only after the packets bundled into it have passed the congestion
test for their own respective levels. Since bundled packets don't
result in incrementing the level counter for their own importance,
only occasionally for the SYSTEM level counter, they do in practice
obtain SYSTEM level importance. Hence, the current implementation
provides a gap in the congestion algorithm that in the worst case
may lead to a link reset.
We now refine the congestion algorithm as follows:
- A message is accepted to the link backlog only if its own level
counter, and all superior level counters, permit it.
- The importance of a created bundle packet is set according to its
contents. A bundle packet created from messges at levels LOW to
CRITICAL is given importance level CRITICAL, while a bundle created
from a SYSTEM level message is given importance SYSTEM. In the latter
case only subsequent SYSTEM level messages are allowed to be bundled
into it.
This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.
The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.
Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 10:46:17 -04:00
int usr = msg_user ( m ) ;
if ( likely ( ( usr = = MSG_FRAGMENTER ) | | ( usr = = MSG_BUNDLER ) ) )
2015-10-14 09:23:18 -04:00
msg_set_bits ( m , 9 , 0 , 0x7 , i ) ;
tipc: improve link congestion algorithm
The link congestion algorithm used until now implies two problems.
- It is too generous towards lower-level messages in situations of high
load by giving "absolute" bandwidth guarantees to the different
priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
available bandwidth. But, in the absence of higher level traffic, the
ratio between two distinct levels becomes unreasonable. E.g. if there
is only LOW and MEDIUM traffic on a system, the former is guaranteed
1/3 of the bandwidth, and the latter 2/3. This again means that if
there is e.g. one LOW user and 10 MEDIUM users, the former will have
33.3% of the bandwidth, and the others will have to compete for the
remainder, i.e. each will end up with 6.7% of the capacity.
- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
but only after the packets bundled into it have passed the congestion
test for their own respective levels. Since bundled packets don't
result in incrementing the level counter for their own importance,
only occasionally for the SYSTEM level counter, they do in practice
obtain SYSTEM level importance. Hence, the current implementation
provides a gap in the congestion algorithm that in the worst case
may lead to a link reset.
We now refine the congestion algorithm as follows:
- A message is accepted to the link backlog only if its own level
counter, and all superior level counters, permit it.
- The importance of a created bundle packet is set according to its
contents. A bundle packet created from messges at levels LOW to
CRITICAL is given importance level CRITICAL, while a bundle created
from a SYSTEM level message is given importance SYSTEM. In the latter
case only subsequent SYSTEM level messages are allowed to be bundled
into it.
This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.
The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.
Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 10:46:17 -04:00
else if ( i < TIPC_SYSTEM_IMPORTANCE )
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 16:08:11 -04:00
msg_set_user ( m , i ) ;
else
pr_warn ( " Trying to set illegal importance in message \n " ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_prevnode ( struct tipc_msg * m )
{
return msg_word ( m , 3 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_prevnode ( struct tipc_msg * m , u32 a )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 3 , a ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_origport ( struct tipc_msg * m )
{
2015-03-25 12:07:25 -04:00
if ( msg_user ( m ) = = MSG_FRAGMENTER )
2019-06-25 19:37:00 +02:00
m = msg_inner_hdr ( m ) ;
2010-11-30 12:00:53 +00:00
return msg_word ( m , 4 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_origport ( struct tipc_msg * m , u32 p )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 4 , p ) ;
}
tipc: update a binding service via broadcast
Currently, updating binding table (add service binding to
name table/withdraw a service binding) is being sent over replicast.
However, if we are scaling up clusters to > 100 nodes/containers this
method is less affection because of looping through nodes in a cluster one
by one.
It is worth to use broadcast to update a binding service. This way, the
binding table can be updated on all peer nodes in one shot.
Broadcast is used when all peer nodes, as indicated by a new capability
flag TIPC_NAMED_BCAST, support reception of this message type.
Four problems need to be considered when introducing this feature.
1) When establishing a link to a new peer node we still update this by a
unicast 'bulk' update. This may lead to race conditions, where a later
broadcast publication/withdrawal bypass the 'bulk', resulting in
disordered publications, or even that a withdrawal may arrive before the
corresponding publication. We solve this by adding an 'is_last_bulk' bit
in the last bulk messages so that it can be distinguished from all other
messages. Only when this message has arrived do we open up for reception
of broadcast publications/withdrawals.
2) When a first legacy node is added to the cluster all distribution
will switch over to use the legacy 'replicast' method, while the
opposite happens when the last legacy node leaves the cluster. This
entails another risk of message disordering that has to be handled. We
solve this by adding a sequence number to the broadcast/replicast
messages, so that disordering can be discovered and corrected. Note
however that we don't need to consider potential message loss or
duplication at this protocol level.
3) Bulk messages don't contain any sequence numbers, and will always
arrive in order. Hence we must exempt those from the sequence number
control and deliver them unconditionally. We solve this by adding a new
'is_bulk' bit in those messages so that they can be recognized.
4) Legacy messages, which don't contain any new bits or sequence
numbers, but neither can arrive out of order, also need to be exempt
from the initial synchronization and sequence number check, and
delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
to all new messages so that those can be distinguished from legacy
messages and the latter delivered directly.
v1->v2:
- fix warning issue reported by kbuild test robot <lkp@intel.com>
- add santiy check to drop the publication message with a sequence
number that is lower than the agreed synch point
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17 13:56:05 +07:00
static inline u16 msg_named_seqno ( struct tipc_msg * m )
{
return msg_bits ( m , 4 , 0 , 0xffff ) ;
}
static inline void msg_set_named_seqno ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 4 , 0 , 0xffff , n ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_destport ( struct tipc_msg * m )
{
return msg_word ( m , 5 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_destport ( struct tipc_msg * m , u32 p )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 5 , p ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_mc_netid ( struct tipc_msg * m )
{
return msg_word ( m , 5 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_mc_netid ( struct tipc_msg * m , u32 p )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 5 , p ) ;
}
2010-11-30 12:00:53 +00:00
static inline int msg_short ( struct tipc_msg * m )
{
2011-05-31 15:03:18 -04:00
return msg_hdr_sz ( m ) = = SHORT_H_SIZE ;
2010-11-30 12:00:53 +00:00
}
static inline u32 msg_orignode ( struct tipc_msg * m )
{
if ( likely ( msg_short ( m ) ) )
return msg_prevnode ( m ) ;
return msg_word ( m , 6 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_orignode ( struct tipc_msg * m , u32 a )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 6 , a ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_destnode ( struct tipc_msg * m )
{
return msg_word ( m , 7 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_destnode ( struct tipc_msg * m , u32 a )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 7 , a ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_nametype ( struct tipc_msg * m )
{
return msg_word ( m , 8 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_nametype ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 8 , n ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_nameinst ( struct tipc_msg * m )
{
return msg_word ( m , 9 ) ;
}
static inline u32 msg_namelower ( struct tipc_msg * m )
{
return msg_nameinst ( m ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_namelower ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 9 , n ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_nameinst ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_namelower ( m , n ) ;
}
2010-11-30 12:00:53 +00:00
static inline u32 msg_nameupper ( struct tipc_msg * m )
{
return msg_word ( m , 10 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_nameupper ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 10 , n ) ;
}
/*
2011-04-08 10:50:52 -04:00
* Constants and routines used to read and write TIPC internal message headers
*/
2006-01-02 19:04:38 +01:00
2007-02-09 23:25:21 +09:00
/*
2011-04-08 10:50:52 -04:00
* Connection management protocol message types
2006-01-02 19:04:38 +01:00
*/
# define CONN_PROBE 0
# define CONN_PROBE_REPLY 1
# define CONN_ACK 2
2007-02-09 23:25:21 +09:00
/*
2011-04-08 10:50:52 -04:00
* Name distributor message types
2006-01-02 19:04:38 +01:00
*/
# define PUBLICATION 0
# define WITHDRAWAL 1
2011-04-08 11:04:15 -04:00
/*
* Segmentation message types
*/
# define FIRST_FRAGMENT 0
# define FRAGMENT 1
# define LAST_FRAGMENT 2
/*
* Link management protocol message types
*/
# define STATE_MSG 0
# define RESET_MSG 1
# define ACTIVATE_MSG 2
/*
* Changeover tunnel message types
*/
tipc: eliminate delayed link deletion at link failover
When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.
In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.
We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 09:33:01 -04:00
# define SYNCH_MSG 0
# define FAILOVER_MSG 1
2011-04-08 11:04:15 -04:00
/*
* Config protocol message types
*/
# define DSC_REQ_MSG 0
# define DSC_RESP_MSG 1
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:51 +01:00
# define DSC_TRIAL_MSG 2
# define DSC_TRIAL_FAIL_MSG 3
2011-04-08 11:04:15 -04:00
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
/*
* Group protocol message types
*/
# define GRP_JOIN_MSG 0
# define GRP_LEAVE_MSG 1
2017-10-13 11:04:26 +02:00
# define GRP_ADV_MSG 2
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:31 +02:00
# define GRP_ACK_MSG 3
tipc: add multipoint-to-point flow control
We already have point-to-multipoint flow control within a group. But
we even need the opposite; -a scheme which can handle that potentially
hundreds of sources may try to send messages to the same destination
simultaneously without causing buffer overflow at the recipient. This
commit adds such a mechanism.
The algorithm works as follows:
- When a member detects a new, joining member, it initially set its
state to JOINED and advertises a minimum window to the new member.
This window is chosen so that the new member can send exactly one
maximum sized message, or several smaller ones, to the recipient
before it must stop and wait for an additional advertisement. This
minimum window ADV_IDLE is set to 65 1kB blocks.
- When a member receives the first data message from a JOINED member,
it changes the state of the latter to ACTIVE, and advertises a larger
window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
continue sending with minimal disturbances to the data flow.
- The active members are kept in a dedicated linked list. Each time a
message is received from an active member, it will be moved to the
tail of that list. This way, we keep a record of which members have
been most (tail) and least (head) recently active.
- There is a maximum number (16) of permitted simultaneous active
senders per receiver. When this limit is reached, the receiver will
not advertise anything immediately to a new sender, but instead put
it in a PENDING state, and add it to a corresponding queue. At the
same time, it will pick the least recently active member, send it an
advertisement RECLAIM message, and set this member to state
RECLAIMING.
- The reclaimee member has to respond with a REMIT message, meaning that
it goes back to a send window of ADV_IDLE, and returns its unused
advertised blocks beyond that value to the reclaiming member.
- When the reclaiming member receives the REMIT message, it unlinks
the reclaimee from its active list, resets its state to JOINED, and
notes that it is now back at ADV_IDLE advertised blocks to that
member. If there are still unread data messages sent out by
reclaimee before the REMIT, the member goes into an intermediate
state REMITTED, where it stays until the said messages have been
consumed.
- The returned advertised blocks can now be re-advertised to the
pending member, which is now set to state ACTIVE and added to
the active member list.
- To be proactive, i.e., to minimize the risk that any member will
end up in the pending queue, we start reclaiming resources already
when the number of active members exceeds 3/4 of the permitted
maximum.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:34 +02:00
# define GRP_RECLAIM_MSG 4
# define GRP_REMIT_MSG 5
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 1
*/
static inline u32 msg_seq_gap ( struct tipc_msg * m )
{
2008-06-04 17:47:30 -07:00
return msg_bits ( m , 1 , 16 , 0x1fff ) ;
2006-01-02 19:04:38 +01:00
}
static inline void msg_set_seq_gap ( struct tipc_msg * m , u32 n )
{
2008-06-04 17:47:30 -07:00
msg_set_bits ( m , 1 , 16 , 0x1fff , n ) ;
2006-01-02 19:04:38 +01:00
}
2011-10-28 16:26:41 -04:00
static inline u32 msg_node_sig ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 0 , 0xffff ) ;
}
static inline void msg_set_node_sig ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 1 , 0 , 0xffff , n ) ;
}
2015-03-13 16:08:05 -04:00
static inline u32 msg_node_capabilities ( struct tipc_msg * m )
{
return msg_bits ( m , 1 , 15 , 0x1fff ) ;
}
static inline void msg_set_node_capabilities ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 1 , 15 , 0x1fff , n ) ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 2
*/
static inline u32 msg_dest_domain ( struct tipc_msg * m )
{
return msg_word ( m , 2 ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_dest_domain ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_word ( m , 2 , n ) ;
}
static inline u32 msg_bcgap_after ( struct tipc_msg * m )
{
return msg_bits ( m , 2 , 16 , 0xffff ) ;
}
static inline void msg_set_bcgap_after ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 2 , 16 , 0xffff , n ) ;
}
static inline u32 msg_bcgap_to ( struct tipc_msg * m )
{
return msg_bits ( m , 2 , 0 , 0xffff ) ;
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_bcgap_to ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 2 , 0 , 0xffff , n ) ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 4
*/
static inline u32 msg_last_bcast ( struct tipc_msg * m )
{
return msg_bits ( m , 4 , 16 , 0xffff ) ;
}
2015-10-22 08:51:41 -04:00
static inline u32 msg_bc_snd_nxt ( struct tipc_msg * m )
{
return msg_last_bcast ( m ) + 1 ;
}
2006-01-02 19:04:38 +01:00
static inline void msg_set_last_bcast ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 4 , 16 , 0xffff , n ) ;
}
tipc: fix changeover issues due to large packet
In conjunction with changing the interfaces' MTU (e.g. especially in
the case of a bonding) where the TIPC links are brought up and down
in a short time, a couple of issues were detected with the current link
changeover mechanism:
1) When one link is up but immediately forced down again, the failover
procedure will be carried out in order to failover all the messages in
the link's transmq queue onto the other working link. The link and node
state is also set to FAILINGOVER as part of the process. The message
will be transmited in form of a FAILOVER_MSG, so its size is plus of 40
bytes (= the message header size). There is no problem if the original
message size is not larger than the link's MTU - 40, and indeed this is
the max size of a normal payload messages. However, in the situation
above, because the link has just been up, the messages in the link's
transmq are almost SYNCH_MSGs which had been generated by the link
synching procedure, then their size might reach the max value already!
When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size
will exceed the link's MTU. As a result, the messages are dropped
silently and the failover procedure will never end up, the link will
not be able to exit the FAILINGOVER state, so cannot be re-established.
2) The same scenario above can happen more easily in case the MTU of
the links is set differently or when changing. In that case, as long as
a large message in the failure link's transmq queue was built and
fragmented with its link's MTU > the other link's one, the issue will
happen (there is no need of a link synching in advance).
3) The link synching procedure also faces with the same issue but since
the link synching is only started upon receipt of a SYNCH_MSG, dropping
the message will not result in a state deadlock, but it is not expected
as design.
The 1) & 3) issues are resolved by the last commit that only a dummy
SYNCH_MSG (i.e. without data) is generated at the link synching, so the
size of a FAILOVER_MSG if any then will never exceed the link's MTU.
For the 2) issue, the only solution is trying to fragment the messages
in the failure link's transmq queue according to the working link's MTU
so they can be failovered then. A new function is made to accomplish
this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the
original message size is too large, it will be fragmented & reassembled
at the receiving side.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-24 08:56:12 +07:00
static inline u32 msg_nof_fragms ( struct tipc_msg * m )
{
return msg_bits ( m , 4 , 0 , 0xffff ) ;
}
static inline void msg_set_nof_fragms ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 4 , 0 , 0xffff , n ) ;
}
static inline u32 msg_fragm_no ( struct tipc_msg * m )
{
return msg_bits ( m , 4 , 16 , 0xffff ) ;
}
2006-01-02 19:04:38 +01:00
static inline void msg_set_fragm_no ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 4 , 16 , 0xffff , n ) ;
}
2015-07-30 18:24:19 -04:00
static inline u16 msg_next_sent ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 4 , 0 , 0xffff ) ;
}
2015-07-30 18:24:19 -04:00
static inline void msg_set_next_sent ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 4 , 0 , 0xffff , n ) ;
}
static inline void msg_set_long_msgno ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 4 , 0 , 0xffff , n ) ;
}
static inline u32 msg_bc_netid ( struct tipc_msg * m )
{
return msg_word ( m , 4 ) ;
}
static inline void msg_set_bc_netid ( struct tipc_msg * m , u32 id )
{
msg_set_word ( m , 4 , id ) ;
}
static inline u32 msg_link_selector ( struct tipc_msg * m )
{
2017-01-18 13:50:52 -05:00
if ( msg_user ( m ) = = MSG_FRAGMENTER )
m = ( void * ) msg_data ( m ) ;
2006-01-02 19:04:38 +01:00
return msg_bits ( m , 4 , 0 , 1 ) ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 5
*/
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
static inline u16 msg_session ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 5 , 16 , 0xffff ) ;
}
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
static inline void msg_set_session ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 5 , 16 , 0xffff , n ) ;
}
static inline u32 msg_probe ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 0 , 1 ) ;
}
static inline void msg_set_probe ( struct tipc_msg * m , u32 val )
{
2011-05-25 13:28:27 -04:00
msg_set_bits ( m , 5 , 0 , 1 , val ) ;
2006-01-02 19:04:38 +01:00
}
static inline char msg_net_plane ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 1 , 7 ) + ' A ' ;
}
static inline void msg_set_net_plane ( struct tipc_msg * m , char n )
{
msg_set_bits ( m , 5 , 1 , 7 , ( n - ' A ' ) ) ;
}
static inline u32 msg_linkprio ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 4 , 0x1f ) ;
}
static inline void msg_set_linkprio ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 5 , 4 , 0x1f , n ) ;
}
static inline u32 msg_bearer_id ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 9 , 0x7 ) ;
}
static inline void msg_set_bearer_id ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 5 , 9 , 0x7 , n ) ;
}
static inline u32 msg_redundant_link ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 12 , 0x1 ) ;
}
2011-02-28 15:30:20 -05:00
static inline void msg_set_redundant_link ( struct tipc_msg * m , u32 r )
2006-01-02 19:04:38 +01:00
{
2011-02-28 15:30:20 -05:00
msg_set_bits ( m , 5 , 12 , 0x1 , r ) ;
2006-01-02 19:04:38 +01:00
}
tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 13:33:03 -04:00
static inline u32 msg_peer_stopping ( struct tipc_msg * m )
{
return msg_bits ( m , 5 , 13 , 0x1 ) ;
}
static inline void msg_set_peer_stopping ( struct tipc_msg * m , u32 s )
{
msg_set_bits ( m , 5 , 13 , 0x1 , s ) ;
}
2016-10-27 18:51:55 -04:00
static inline bool msg_bc_ack_invalid ( struct tipc_msg * m )
{
switch ( msg_user ( m ) ) {
case BCAST_PROTOCOL :
case NAME_DISTRIBUTOR :
case LINK_PROTOCOL :
return msg_bits ( m , 5 , 14 , 0x1 ) ;
default :
return false ;
}
}
static inline void msg_set_bc_ack_invalid ( struct tipc_msg * m , bool invalid )
{
msg_set_bits ( m , 5 , 14 , 0x1 , invalid ) ;
}
2011-10-07 15:19:11 -04:00
static inline char * msg_media_addr ( struct tipc_msg * m )
{
2015-02-27 08:56:57 +01:00
return ( char * ) & m - > hdr [ TIPC_MEDIA_INFO_OFFSET ] ;
2011-10-07 15:19:11 -04:00
}
2006-01-02 19:04:38 +01:00
2016-09-01 13:52:49 -04:00
static inline u32 msg_bc_gap ( struct tipc_msg * m )
{
return msg_bits ( m , 8 , 0 , 0x3ff ) ;
}
static inline void msg_set_bc_gap ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 8 , 0 , 0x3ff , n ) ;
}
2007-02-09 23:25:21 +09:00
/*
2006-01-02 19:04:38 +01:00
* Word 9
*/
2015-07-30 18:24:19 -04:00
static inline u16 msg_msgcnt ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
2015-07-30 18:24:19 -04:00
static inline void msg_set_msgcnt ( struct tipc_msg * m , u16 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
2019-07-24 08:56:11 +07:00
static inline u16 msg_syncpt ( struct tipc_msg * m )
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
static inline void msg_set_syncpt ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 11:58:47 -04:00
static inline u32 msg_conn_ack ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 11:58:47 -04:00
static inline void msg_set_conn_ack ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
2017-10-13 11:04:26 +02:00
static inline u16 msg_adv_win ( struct tipc_msg * m )
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 11:58:47 -04:00
{
return msg_bits ( m , 9 , 0 , 0xffff ) ;
}
2017-10-13 11:04:26 +02:00
static inline void msg_set_adv_win ( struct tipc_msg * m , u16 n )
tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 11:58:47 -04:00
{
msg_set_bits ( m , 9 , 0 , 0xffff , n ) ;
}
2007-02-09 23:25:21 +09:00
static inline u32 msg_max_pkt ( struct tipc_msg * m )
2006-01-02 19:04:38 +01:00
{
2010-09-22 20:43:57 +00:00
return msg_bits ( m , 9 , 16 , 0xffff ) * 4 ;
2006-01-02 19:04:38 +01:00
}
2007-02-09 23:25:21 +09:00
static inline void msg_set_max_pkt ( struct tipc_msg * m , u32 n )
2006-01-02 19:04:38 +01:00
{
msg_set_bits ( m , 9 , 16 , 0xffff , ( n / 4 ) ) ;
}
static inline u32 msg_link_tolerance ( struct tipc_msg * m )
{
return msg_bits ( m , 9 , 0 , 0xffff ) ;
}
static inline void msg_set_link_tolerance ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 9 , 0 , 0xffff , n ) ;
}
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
static inline u16 msg_grp_bc_syncpt ( struct tipc_msg * m )
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
static inline void msg_set_grp_bc_syncpt ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:31 +02:00
static inline u16 msg_grp_bc_acked ( struct tipc_msg * m )
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
static inline void msg_set_grp_bc_acked ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
tipc: add multipoint-to-point flow control
We already have point-to-multipoint flow control within a group. But
we even need the opposite; -a scheme which can handle that potentially
hundreds of sources may try to send messages to the same destination
simultaneously without causing buffer overflow at the recipient. This
commit adds such a mechanism.
The algorithm works as follows:
- When a member detects a new, joining member, it initially set its
state to JOINED and advertises a minimum window to the new member.
This window is chosen so that the new member can send exactly one
maximum sized message, or several smaller ones, to the recipient
before it must stop and wait for an additional advertisement. This
minimum window ADV_IDLE is set to 65 1kB blocks.
- When a member receives the first data message from a JOINED member,
it changes the state of the latter to ACTIVE, and advertises a larger
window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
continue sending with minimal disturbances to the data flow.
- The active members are kept in a dedicated linked list. Each time a
message is received from an active member, it will be moved to the
tail of that list. This way, we keep a record of which members have
been most (tail) and least (head) recently active.
- There is a maximum number (16) of permitted simultaneous active
senders per receiver. When this limit is reached, the receiver will
not advertise anything immediately to a new sender, but instead put
it in a PENDING state, and add it to a corresponding queue. At the
same time, it will pick the least recently active member, send it an
advertisement RECLAIM message, and set this member to state
RECLAIMING.
- The reclaimee member has to respond with a REMIT message, meaning that
it goes back to a send window of ADV_IDLE, and returns its unused
advertised blocks beyond that value to the reclaiming member.
- When the reclaiming member receives the REMIT message, it unlinks
the reclaimee from its active list, resets its state to JOINED, and
notes that it is now back at ADV_IDLE advertised blocks to that
member. If there are still unread data messages sent out by
reclaimee before the REMIT, the member goes into an intermediate
state REMITTED, where it stays until the said messages have been
consumed.
- The returned advertised blocks can now be re-advertised to the
pending member, which is now set to state ACTIVE and added to
the active member list.
- To be proactive, i.e., to minimize the risk that any member will
end up in the pending queue, we start reclaiming resources already
when the number of active members exceeds 3/4 of the permitted
maximum.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:34 +02:00
static inline u16 msg_grp_remitted ( struct tipc_msg * m )
{
return msg_bits ( m , 9 , 16 , 0xffff ) ;
}
static inline void msg_set_grp_remitted ( struct tipc_msg * m , u16 n )
{
msg_set_bits ( m , 9 , 16 , 0xffff , n ) ;
}
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
/* Word 10
*/
2017-10-13 11:04:25 +02:00
static inline u16 msg_grp_evt ( struct tipc_msg * m )
{
return msg_bits ( m , 10 , 0 , 0x3 ) ;
}
static inline void msg_set_grp_evt ( struct tipc_msg * m , int n )
{
msg_set_bits ( m , 10 , 0 , 0x3 , n ) ;
}
tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:
- Each time a unicast is sent, we set a the broadcast method for the
socket to "replicast" and "mandatory". This forces the first
subsequent broadcast message to follow the same network and data path
as the preceding unicast to a destination, hence preventing it from
overtaking the latter.
- In order to make the 'same data path' statement above true, we let
group unicasts pass through the multicast link input queue, instead
of as previously through the unicast link input queue.
- In the first broadcast following a unicast, we set a new header flag,
requiring all recipients to immediately acknowledge its reception.
- During the period before all the expected acknowledges are received,
the socket refuses to accept any more broadcast attempts, i.e., by
blocking or returning EAGAIN. This period should typically not be
longer than a few microseconds.
- When all acknowledges have been received, the sending socket will
open up for subsequent broadcasts, this time giving the link layer
freedom to itself select the best transmission method.
- The forced and/or abrupt transmission method changes described above
may lead to broadcasts arriving out of order to the recipients. We
remedy this by introducing code that checks and if necessary
re-orders such messages at the receiving end.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:31 +02:00
static inline u16 msg_grp_bc_ack_req ( struct tipc_msg * m )
{
return msg_bits ( m , 10 , 0 , 0x1 ) ;
}
static inline void msg_set_grp_bc_ack_req ( struct tipc_msg * m , bool n )
{
msg_set_bits ( m , 10 , 0 , 0x1 , n ) ;
}
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 11:04:23 +02:00
static inline u16 msg_grp_bc_seqno ( struct tipc_msg * m )
{
return msg_bits ( m , 10 , 16 , 0xffff ) ;
}
static inline void msg_set_grp_bc_seqno ( struct tipc_msg * m , u32 n )
{
msg_set_bits ( m , 10 , 16 , 0xffff , n ) ;
}
2015-07-30 18:24:19 -04:00
static inline bool msg_peer_link_is_up ( struct tipc_msg * m )
2015-07-16 16:54:30 -04:00
{
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
if ( likely ( msg_user ( m ) ! = LINK_PROTOCOL ) )
2015-07-16 16:54:30 -04:00
return true ;
2015-07-30 18:24:19 -04:00
if ( msg_type ( m ) = = STATE_MSG )
return true ;
return false ;
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
}
2015-07-30 18:24:19 -04:00
static inline bool msg_peer_node_is_up ( struct tipc_msg * m )
tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:
We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.
Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.
This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 16:54:31 -04:00
{
2015-07-30 18:24:19 -04:00
if ( msg_peer_link_is_up ( m ) )
return true ;
2015-07-16 16:54:30 -04:00
return msg_redundant_link ( m ) ;
}
2016-04-07 10:09:14 -04:00
static inline bool msg_is_reset ( struct tipc_msg * hdr )
{
return ( msg_user ( hdr ) = = LINK_PROTOCOL ) & & ( msg_type ( hdr ) = = RESET_MSG ) ;
}
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 07:51:21 +07:00
/* Word 13
*/
static inline void msg_set_peer_net_hash ( struct tipc_msg * m , u32 n )
{
msg_set_word ( m , 13 , n ) ;
}
static inline u32 msg_peer_net_hash ( struct tipc_msg * m )
{
return msg_word ( m , 13 ) ;
}
/* Word 14
*/
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 20:42:51 +01:00
static inline u32 msg_sugg_node_addr ( struct tipc_msg * m )
{
return msg_word ( m , 14 ) ;
}
static inline void msg_set_sugg_node_addr ( struct tipc_msg * m , u32 n )
{
msg_set_word ( m , 14 , n ) ;
}
static inline void msg_set_node_id ( struct tipc_msg * hdr , u8 * id )
{
memcpy ( msg_data ( hdr ) , id , 16 ) ;
}
static inline u8 * msg_node_id ( struct tipc_msg * hdr )
{
return ( u8 * ) msg_data ( hdr ) ;
}
2017-01-13 15:46:25 +01:00
struct sk_buff * tipc_buf_acquire ( u32 size , gfp_t gfp ) ;
2017-11-15 21:23:56 +01:00
bool tipc_msg_validate ( struct sk_buff * * _skb ) ;
tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence
if (msg_reverse())
tipc_link_xmit_skb()
at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.
In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 10:11:19 -04:00
bool tipc_msg_reverse ( u32 own_addr , struct sk_buff * * skb , int err ) ;
2017-10-13 11:04:20 +02:00
void tipc_skb_reject ( struct net * net , int err , struct sk_buff * skb ,
struct sk_buff_head * xmitq ) ;
2015-02-05 08:36:36 -05:00
void tipc_msg_init ( u32 own_addr , struct tipc_msg * m , u32 user , u32 type ,
2015-01-09 15:27:10 +08:00
u32 hsize , u32 destnode ) ;
tipc: split up function tipc_msg_eval()
The function tipc_msg_eval() is in reality doing two related, but
different tasks. First it tries to find a new destination for named
messages, in case there was no first lookup, or if the first lookup
failed. Second, it does what its name suggests, evaluating the validity
of the message and its destination, and returning an appropriate error
code depending on the result.
This is confusing, and in this commit we choose to break it up into two
functions. A new function, tipc_msg_lookup_dest(), first attempts to find
a new destination, if the message is of the right type. If this lookup
fails, or if the message should not be subject to a second lookup, the
already existing tipc_msg_reverse() is called. This function performs
prepares the message for rejection, if applicable.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:39 -05:00
struct sk_buff * tipc_msg_create ( uint user , uint type , uint hdr_sz ,
uint data_sz , u32 dnode , u32 onode ,
u32 dport , u32 oport , int errcode ) ;
2014-05-14 05:39:12 -04:00
int tipc_buf_append ( struct sk_buff * * headbuf , struct sk_buff * * buf ) ;
tipc: improve message bundling algorithm
As mentioned in commit e95584a889e1 ("tipc: fix unlimited bundling of
small messages"), the current message bundling algorithm is inefficient
that can generate bundles of only one payload message, that causes
unnecessary overheads for both the sender and receiver.
This commit re-designs the 'tipc_msg_make_bundle()' function (now named
as 'tipc_msg_try_bundle()'), so that when a message comes at the first
place, we will just check & keep a reference to it if the message is
suitable for bundling. The message buffer will be put into the link
backlog queue and processed as normal. Later on, when another one comes
we will make a bundle with the first message if possible and so on...
This way, a bundle if really needed will always consist of at least two
payload messages. Otherwise, we let the first buffer go its way without
any need of bundling, so reduce the overheads to zero.
Moreover, since now we have both the messages in hand, we can even
optimize the 'tipc_msg_bundle()' function, make bundle of a very large
(size ~ MSS) and small messages which is not with the current algorithm
e.g. [1400-byte message] + [10-byte message] (MTU = 1500).
Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 09:58:57 +07:00
bool tipc_msg_try_bundle ( struct sk_buff * tskb , struct sk_buff * * skb , u32 mss ,
u32 dnode , bool * new_bundle ) ;
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
bool tipc_msg_extract ( struct sk_buff * skb , struct sk_buff * * iskb , int * pos ) ;
tipc: fix changeover issues due to large packet
In conjunction with changing the interfaces' MTU (e.g. especially in
the case of a bonding) where the TIPC links are brought up and down
in a short time, a couple of issues were detected with the current link
changeover mechanism:
1) When one link is up but immediately forced down again, the failover
procedure will be carried out in order to failover all the messages in
the link's transmq queue onto the other working link. The link and node
state is also set to FAILINGOVER as part of the process. The message
will be transmited in form of a FAILOVER_MSG, so its size is plus of 40
bytes (= the message header size). There is no problem if the original
message size is not larger than the link's MTU - 40, and indeed this is
the max size of a normal payload messages. However, in the situation
above, because the link has just been up, the messages in the link's
transmq are almost SYNCH_MSGs which had been generated by the link
synching procedure, then their size might reach the max value already!
When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size
will exceed the link's MTU. As a result, the messages are dropped
silently and the failover procedure will never end up, the link will
not be able to exit the FAILINGOVER state, so cannot be re-established.
2) The same scenario above can happen more easily in case the MTU of
the links is set differently or when changing. In that case, as long as
a large message in the failure link's transmq queue was built and
fragmented with its link's MTU > the other link's one, the issue will
happen (there is no need of a link synching in advance).
3) The link synching procedure also faces with the same issue but since
the link synching is only started upon receipt of a SYNCH_MSG, dropping
the message will not result in a state deadlock, but it is not expected
as design.
The 1) & 3) issues are resolved by the last commit that only a dummy
SYNCH_MSG (i.e. without data) is generated at the link synching, so the
size of a FAILOVER_MSG if any then will never exceed the link's MTU.
For the 2) issue, the only solution is trying to fragment the messages
in the failure link's transmq queue according to the working link's MTU
so they can be failovered then. A new function is made to accomplish
this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the
original message size is too large, it will be fragmented & reassembled
at the receiving side.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-24 08:56:12 +07:00
int tipc_msg_fragment ( struct sk_buff * skb , const struct tipc_msg * hdr ,
int pktmax , struct sk_buff_head * frags ) ;
2015-02-05 08:36:36 -05:00
int tipc_msg_build ( struct tipc_msg * mhdr , struct msghdr * m ,
2015-01-09 15:27:10 +08:00
int offset , int dsz , int mtu , struct sk_buff_head * list ) ;
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 14:00:41 +01:00
int tipc_msg_append ( struct tipc_msg * hdr , struct msghdr * m , int dlen ,
int mss , struct sk_buff_head * txq ) ;
2015-07-22 10:11:20 -04:00
bool tipc_msg_lookup_dest ( struct net * net , struct sk_buff * skb , int * err ) ;
2017-11-30 16:47:25 +01:00
bool tipc_msg_assemble ( struct sk_buff_head * list ) ;
2015-10-22 08:51:39 -04:00
bool tipc_msg_reassemble ( struct sk_buff_head * list , struct sk_buff_head * rcvq ) ;
2017-01-18 13:50:52 -05:00
bool tipc_msg_pskb_copy ( u32 dst , struct sk_buff_head * msg ,
struct sk_buff_head * cpy ) ;
2020-05-26 16:38:37 +07:00
bool __tipc_skb_queue_sorted ( struct sk_buff_head * list , u16 seqno ,
2015-10-15 14:52:43 -04:00
struct sk_buff * skb ) ;
2018-09-28 20:23:22 +02:00
bool tipc_msg_skb_clone ( struct sk_buff_head * msg , struct sk_buff_head * cpy ) ;
2014-07-16 20:41:00 -04:00
2015-05-14 10:46:14 -04:00
static inline u16 buf_seqno ( struct sk_buff * skb )
{
return msg_seqno ( buf_msg ( skb ) ) ;
}
2017-11-15 21:23:56 +01:00
static inline int buf_roundup_len ( struct sk_buff * skb )
{
return ( skb - > len / 1024 + 1 ) * 1024 ;
}
2015-02-05 08:36:44 -05:00
/* tipc_skb_peek(): peek and reserve first buffer in list
* @ list : list to be peeked in
* Returns pointer to first buffer in list , if any
*/
static inline struct sk_buff * tipc_skb_peek ( struct sk_buff_head * list ,
spinlock_t * lock )
{
struct sk_buff * skb ;
spin_lock_bh ( lock ) ;
skb = skb_peek ( list ) ;
if ( skb )
skb_get ( skb ) ;
spin_unlock_bh ( lock ) ;
return skb ;
}
tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.
We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.
This solves the sequentiality problem, and seems to cause no measurable
performance degradation.
A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 08:36:41 -05:00
/* tipc_skb_peek_port(): find a destination port, ignoring all destinations
* up to and including ' filter ' .
* Note : ignoring previously tried destinations minimizes the risk of
* contention on the socket lock
* @ list : list to be peeked in
* @ filter : last destination to be ignored from search
* Returns a destination port number , of applicable .
*/
static inline u32 tipc_skb_peek_port ( struct sk_buff_head * list , u32 filter )
{
struct sk_buff * skb ;
u32 dport = 0 ;
bool ignore = true ;
spin_lock_bh ( & list - > lock ) ;
skb_queue_walk ( list , skb ) {
dport = msg_destport ( buf_msg ( skb ) ) ;
if ( ! filter | | skb_queue_is_last ( list , skb ) )
break ;
if ( dport = = filter )
ignore = false ;
else if ( ! ignore )
break ;
}
spin_unlock_bh ( & list - > lock ) ;
return dport ;
}
/* tipc_skb_dequeue(): unlink first buffer with dest 'dport' from list
* @ list : list to be unlinked from
* @ dport : selection criteria for buffer to unlink
*/
static inline struct sk_buff * tipc_skb_dequeue ( struct sk_buff_head * list ,
u32 dport )
{
struct sk_buff * _skb , * tmp , * skb = NULL ;
spin_lock_bh ( & list - > lock ) ;
skb_queue_walk_safe ( list , _skb , tmp ) {
if ( msg_destport ( buf_msg ( _skb ) ) = = dport ) {
__skb_unlink ( _skb , list ) ;
skb = _skb ;
break ;
}
}
spin_unlock_bh ( & list - > lock ) ;
return skb ;
}
2015-07-30 18:24:23 -04:00
/* tipc_skb_queue_splice_tail - append an skb list to lock protected list
* @ list : the new list to append . Not lock protected
* @ head : target list . Lock protected .
*/
static inline void tipc_skb_queue_splice_tail ( struct sk_buff_head * list ,
struct sk_buff_head * head )
{
spin_lock_bh ( & head - > lock ) ;
skb_queue_splice_tail ( list , head ) ;
spin_unlock_bh ( & head - > lock ) ;
}
/* tipc_skb_queue_splice_tail_init - merge two lock protected skb lists
* @ list : the new list to add . Lock protected . Will be reinitialized
* @ head : target list . Lock protected .
*/
static inline void tipc_skb_queue_splice_tail_init ( struct sk_buff_head * list ,
struct sk_buff_head * head )
{
struct sk_buff_head tmp ;
__skb_queue_head_init ( & tmp ) ;
spin_lock_bh ( & list - > lock ) ;
skb_queue_splice_tail_init ( list , & tmp ) ;
spin_unlock_bh ( & list - > lock ) ;
tipc_skb_queue_splice_tail ( & tmp , head ) ;
}
tipc: reduce duplicate packets for unicast traffic
For unicast transmission, the current NACK sending althorithm is over-
active that forces the sending side to retransmit a packet that is not
really lost but just arrived at the receiving side with some delay, or
even retransmit same packets that have already been retransmitted
before. As a result, many duplicates are observed also under normal
condition, ie. without packet loss.
One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9, when node2
receives packet #10, it puts into the deferdq. When the packet #5 comes
it sends NACK with gap [6 - 9]. However, shortly after that, when
packet #6 arrives, it pulls out packet #10 from the deferfq, but it is
still out of order, so it makes another NACK with gap [7 - 9] and so on
... Finally, node1 has to retransmit the packets 5 6 7 8 9 a number of
times, but in fact all the packets are not lost at all, so duplicates!
This commit reduces duplicates by changing the condition to send NACK,
also restricting the retransmissions on individual packets via a timer
of about 1ms. However, it also needs to say that too tricky condition
for NACKs or too long timeout value for retransmissions will result in
performance reducing! The criterias in this commit are found to be
effective for both the requirements to reduce duplicates but not affect
performance.
The tipc_link_rcv() is also improved to only dequeue skb from the link
deferdq if it is expected (ie. its seqno <= rcv_nxt).
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 11:09:52 +07:00
/* __tipc_skb_dequeue() - dequeue the head skb according to expected seqno
* @ list : list to be dequeued from
* @ seqno : seqno of the expected msg
*
* returns skb dequeued from the list if its seqno is less than or equal to
* the expected one , otherwise the skb is still hold
*
* Note : must be used with appropriate locks held only
*/
static inline struct sk_buff * __tipc_skb_dequeue ( struct sk_buff_head * list ,
u16 seqno )
{
struct sk_buff * skb = skb_peek ( list ) ;
if ( skb & & less_eq ( buf_seqno ( skb ) , seqno ) ) {
__skb_unlink ( skb , list ) ;
return skb ;
}
return NULL ;
}
2006-01-02 19:04:38 +01:00
# endif