kcm: Add description in Documentation
Add kcm.txt to desribe KCM and interfaces. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
parent
29152a34f7
commit
10016594f4
285
Documentation/networking/kcm.txt
Normal file
285
Documentation/networking/kcm.txt
Normal file
@ -0,0 +1,285 @@
|
||||
Kernel Connection Mulitplexor
|
||||
-----------------------------
|
||||
|
||||
Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
|
||||
interface over TCP for generic application protocols. With KCM an application
|
||||
can efficiently send and receive application protocol messages over TCP using
|
||||
datagram sockets.
|
||||
|
||||
KCM implements an NxM multiplexor in the kernel as diagrammed below:
|
||||
|
||||
+------------+ +------------+ +------------+ +------------+
|
||||
| KCM socket | | KCM socket | | KCM socket | | KCM socket |
|
||||
+------------+ +------------+ +------------+ +------------+
|
||||
| | | |
|
||||
+-----------+ | | +----------+
|
||||
| | | |
|
||||
+----------------------------------+
|
||||
| Multiplexor |
|
||||
+----------------------------------+
|
||||
| | | | |
|
||||
+---------+ | | | ------------+
|
||||
| | | | |
|
||||
+----------+ +----------+ +----------+ +----------+ +----------+
|
||||
| Psock | | Psock | | Psock | | Psock | | Psock |
|
||||
+----------+ +----------+ +----------+ +----------+ +----------+
|
||||
| | | | |
|
||||
+----------+ +----------+ +----------+ +----------+ +----------+
|
||||
| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
|
||||
+----------+ +----------+ +----------+ +----------+ +----------+
|
||||
|
||||
KCM sockets
|
||||
-----------
|
||||
|
||||
The KCM sockets provide the user interface to the muliplexor. All the KCM sockets
|
||||
bound to a multiplexor are considered to have equivalent function, and I/O
|
||||
operations in different sockets may be done in parallel without the need for
|
||||
synchronization between threads in userspace.
|
||||
|
||||
Multiplexor
|
||||
-----------
|
||||
|
||||
The multiplexor provides the message steering. In the transmit path, messages
|
||||
written on a KCM socket are sent atomically on an appropriate TCP socket.
|
||||
Similarly, in the receive path, messages are constructed on each TCP socket
|
||||
(Psock) and complete messages are steered to a KCM socket.
|
||||
|
||||
TCP sockets & Psocks
|
||||
--------------------
|
||||
|
||||
TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
|
||||
for each bound TCP socket, this structure holds the state for constructing
|
||||
messages on receive as well as other connection specific information for KCM.
|
||||
|
||||
Connected mode semantics
|
||||
------------------------
|
||||
|
||||
Each multiplexor assumes that all attached TCP connections are to the same
|
||||
destination and can use the different connections for load balancing when
|
||||
transmitting. The normal send and recv calls (include sendmmsg and recvmmsg)
|
||||
can be used to send and receive messages from the KCM socket.
|
||||
|
||||
Socket types
|
||||
------------
|
||||
|
||||
KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
|
||||
|
||||
Message delineation
|
||||
-------------------
|
||||
|
||||
Messages are sent over a TCP stream with some application protocol message
|
||||
format that typically includes a header which frames the messages. The length
|
||||
of a received message can be deduced from the application protocol header
|
||||
(often just a simple length field).
|
||||
|
||||
A TCP stream must be parsed to determine message boundaries. Berkeley Packet
|
||||
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
|
||||
BPF program must be specified. The program is called at the start of receiving
|
||||
a new message and is given an skbuff that contains the bytes received so far.
|
||||
It parses the message header and returns the length of the message. Given this
|
||||
information, KCM will construct the message of the stated length and deliver it
|
||||
to a KCM socket.
|
||||
|
||||
TCP socket management
|
||||
---------------------
|
||||
|
||||
When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and
|
||||
write space available (POLLOUT) events are handled by the multiplexor. If there
|
||||
is a state change (disconnection) or other error on a TCP socket, an error is
|
||||
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
|
||||
using the socket. When the application gets the error notification for a
|
||||
TCP socket, it should unattach the socket from KCM and then handle the error
|
||||
condition (the typical response is to close the socket and create a new
|
||||
connection if necessary).
|
||||
|
||||
KCM limits the maximum receive message size to be the size of the receive
|
||||
socket buffer on the attached TCP socket (the socket buffer size can be set by
|
||||
SO_RCVBUF). If the length of a new message reported by the BPF program is
|
||||
greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
|
||||
socket. The BPF program may also enforce a maximum messages size and report an
|
||||
error when it is exceeded.
|
||||
|
||||
A timeout may be set for assembling messages on a receive socket. The timeout
|
||||
value is taken from the receive timeout of the attached TCP socket (this is set
|
||||
by SO_RCVTIMEO). If the timer expires before assembly is complete an error
|
||||
(ETIMEDOUT) is posted on the socket.
|
||||
|
||||
User interface
|
||||
==============
|
||||
|
||||
Creating a multiplexor
|
||||
----------------------
|
||||
|
||||
A new multiplexor and initial KCM socket is created by a socket call:
|
||||
|
||||
socket(AF_KCM, type, protocol)
|
||||
|
||||
- type is either SOCK_DGRAM or SOCK_SEQPACKET
|
||||
- protocol is KCMPROTO_CONNECTED
|
||||
|
||||
Cloning KCM sockets
|
||||
-------------------
|
||||
|
||||
After the first KCM socket is created using the socket call as described
|
||||
above, additional sockets for the multiplexor can be created by cloning
|
||||
a KCM socket. This is accomplished by an ioctl on a KCM socket:
|
||||
|
||||
/* From linux/kcm.h */
|
||||
struct kcm_clone {
|
||||
int fd;
|
||||
};
|
||||
|
||||
struct kcm_clone info;
|
||||
|
||||
memset(&info, 0, sizeof(info));
|
||||
|
||||
err = ioctl(kcmfd, SIOCKCMCLONE, &info);
|
||||
|
||||
if (!err)
|
||||
newkcmfd = info.fd;
|
||||
|
||||
Attach transport sockets
|
||||
------------------------
|
||||
|
||||
Attaching of transport sockets to a multiplexor is performed by calling an
|
||||
ioctl on a KCM socket for the multiplexor. e.g.:
|
||||
|
||||
/* From linux/kcm.h */
|
||||
struct kcm_attach {
|
||||
int fd;
|
||||
int bpf_fd;
|
||||
};
|
||||
|
||||
struct kcm_attach info;
|
||||
|
||||
memset(&info, 0, sizeof(info));
|
||||
|
||||
info.fd = tcpfd;
|
||||
info.bpf_fd = bpf_prog_fd;
|
||||
|
||||
ioctl(kcmfd, SIOCKCMATTACH, &info);
|
||||
|
||||
The kcm_attach structure contains:
|
||||
fd: file descriptor for TCP socket being attached
|
||||
bpf_prog_fd: file descriptor for compiled BPF program downloaded
|
||||
|
||||
Unattach transport sockets
|
||||
--------------------------
|
||||
|
||||
Unattaching a transport socket from a multiplexor is straightforward. An
|
||||
"unattach" ioctl is done with the kcm_unattach structure as the argument:
|
||||
|
||||
/* From linux/kcm.h */
|
||||
struct kcm_unattach {
|
||||
int fd;
|
||||
};
|
||||
|
||||
struct kcm_unattach info;
|
||||
|
||||
memset(&info, 0, sizeof(info));
|
||||
|
||||
info.fd = cfd;
|
||||
|
||||
ioctl(fd, SIOCKCMUNATTACH, &info);
|
||||
|
||||
Disabling receive on KCM socket
|
||||
-------------------------------
|
||||
|
||||
A setsockopt is used to disable or enable receiving on a KCM socket.
|
||||
When receive is disabled, any pending messages in the socket's
|
||||
receive buffer are moved to other sockets. This feature is useful
|
||||
if an application thread knows that it will be doing a lot of
|
||||
work on a request and won't be able to service new messages for a
|
||||
while. Example use:
|
||||
|
||||
int val = 1;
|
||||
|
||||
setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
|
||||
|
||||
BFP programs for message delineation
|
||||
------------------------------------
|
||||
|
||||
BPF programs can be compiled using the BPF LLVM backend. For exmple,
|
||||
the BPF program for parsing Thrift is:
|
||||
|
||||
#include "bpf.h" /* for __sk_buff */
|
||||
#include "bpf_helpers.h" /* for load_word intrinsic */
|
||||
|
||||
SEC("socket_kcm")
|
||||
int bpf_prog1(struct __sk_buff *skb)
|
||||
{
|
||||
return load_word(skb, 0) + 4;
|
||||
}
|
||||
|
||||
char _license[] SEC("license") = "GPL";
|
||||
|
||||
Use in applications
|
||||
===================
|
||||
|
||||
KCM accelerates application layer protocols. Specifically, it allows
|
||||
applications to use a message based interface for sending and receiving
|
||||
messages. The kernel provides necessary assurances that messages are sent
|
||||
and received atomically. This relieves much of the burden applications have
|
||||
in mapping a message based protocol onto the TCP stream. KCM also make
|
||||
application layer messages a unit of work in the kernel for the purposes of
|
||||
steerng and scheduling, which in turn allows a simpler networking model in
|
||||
multithreaded applications.
|
||||
|
||||
Configurations
|
||||
--------------
|
||||
|
||||
In an Nx1 configuration, KCM logically provides multiple socket handles
|
||||
to the same TCP connection. This allows parallelism between in I/O
|
||||
operations on the TCP socket (for instance copyin and copyout of data is
|
||||
parallelized). In an application, a KCM socket can be opened for each
|
||||
processing thread and inserted into the epoll (similar to how SO_REUSEPORT
|
||||
is used to allow multiple listener sockets on the same port).
|
||||
|
||||
In a MxN configuration, multiple connections are established to the
|
||||
same destination. These are used for simple load balancing.
|
||||
|
||||
Message batching
|
||||
----------------
|
||||
|
||||
The primary purpose of KCM is load balancing between KCM sockets and hence
|
||||
threads in a nominal use case. Perfect load balancing, that is steering
|
||||
each received message to a different KCM socket or steering each sent
|
||||
message to a different TCP socket, can negatively impact performance
|
||||
since this doesn't allow for affinities to be established. Balancing
|
||||
based on groups, or batches of messages, can be beneficial for performance.
|
||||
|
||||
On transmit, there are three ways an application can batch (pipeline)
|
||||
messages on a KCM socket.
|
||||
1) Send multiple messages in a single sendmmsg.
|
||||
2) Send a group of messages each with a sendmsg call, where all messages
|
||||
except the last have MSG_BATCH in the flags of sendmsg call.
|
||||
3) Create "super message" composed of multiple messages and send this
|
||||
with a single sendmsg.
|
||||
|
||||
On receive, the KCM module attempts to queue messages received on the
|
||||
same KCM socket during each TCP ready callback. The targeted KCM socket
|
||||
changes at each receive ready callback on the KCM socket. The application
|
||||
does not need to configure this.
|
||||
|
||||
Error handling
|
||||
--------------
|
||||
|
||||
An application should include a thread to monitor errors raised on
|
||||
the TCP connection. Normally, this will be done by placing each
|
||||
TCP socket attached to a KCM multiplexor in epoll set for POLLERR
|
||||
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
|
||||
on the socket thus waking up the application thread. When the application
|
||||
sees the error (which may just be a disconnect) it should unattach the
|
||||
socket from KCM and then close it. It is assumed that once an error is
|
||||
posted on the TCP socket the data stream is unrecoverable (i.e. an error
|
||||
may have occurred in in the middle of receiving a messssge).
|
||||
|
||||
TCP connection monitoring
|
||||
-------------------------
|
||||
|
||||
In KCM there is no means to correlate a message to the TCP socket that
|
||||
was used to send or receive the message (except in the case there is
|
||||
only one attached TCP socket). However, the application does retain
|
||||
an open file descriptor to the socket so it will be able to get statistics
|
||||
from the socket which can be used in detecting issues (such as high
|
||||
retransmissions on the socket).
|
Loading…
Reference in New Issue
Block a user