linux/Documentation/networking/tcp.txt

TCP protocol
============

Last updated: 9 February 2008

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables are used in the tcp_sock for congestion control:
snd_cwnd		The size of the congestion window
snd_ssthresh		Slow start threshold. We are in slow start if
			snd_cwnd is less than this.
snd_cwnd_cnt		A counter used to slow down the rate of increase
			once we exceed slow start threshold.
snd_cwnd_clamp		This is the maximum size that snd_cwnd can grow to.
snd_cwnd_stamp		Timestamp for when congestion window last validated.
snd_cwnd_used		Used as a highwater mark for how much of the
			congestion window is in use. It is used to adjust
			snd_cwnd down when the link is limited by the
			application rather than the network.
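
As an illustration of how these fields interact, here is a minimal
user-space sketch of Reno-style window growth. It is not the kernel code:
struct cc_state and cc_on_ack() are hypothetical names, and only the field
names are taken from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified user-space stand-in for the congestion fields of tcp_sock. */
struct cc_state {
	uint32_t snd_cwnd;       /* congestion window, in segments */
	uint32_t snd_ssthresh;   /* slow start threshold */
	uint32_t snd_cwnd_cnt;   /* ACKs counted since the last linear increase */
	uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/* Hypothetical helper: account for one ACK using Reno-style growth. */
static void cc_on_ack(struct cc_state *s)
{
	assert(s && s->snd_cwnd > 0);

	if (s->snd_cwnd < s->snd_ssthresh) {
		/* Slow start: grow by one segment per ACK. */
		if (s->snd_cwnd < s->snd_cwnd_clamp)
			s->snd_cwnd++;
		return;
	}

	/* Past the threshold: grow by one segment per window of ACKs. */
	if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
		if (s->snd_cwnd < s->snd_cwnd_clamp)
			s->snd_cwnd++;
		s->snd_cwnd_cnt = 0;
	}
}
```

Starting from snd_cwnd = 2 and snd_ssthresh = 4, two ACKs bring the window
to 4; after that, four more ACKs are needed before it reaches 5.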

As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. At a minimum, name, ssthresh,
cong_avoid, and min_cwnd must be valid.
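
The minimum requirements can be modelled in user space as follows. This is
a sketch only: struct cc_ops_model and cc_register_model() are hypothetical
stand-ins, the callbacks take a plain integer instead of the kernel's
socket arguments, and the error codes are an assumption of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model of an ops table; not the kernel's tcp_congestion_ops. */
struct cc_ops_model {
	const char *name;
	unsigned int (*ssthresh)(unsigned int cwnd);
	unsigned int (*cong_avoid)(unsigned int cwnd);
	unsigned int (*min_cwnd)(unsigned int cwnd);
	struct cc_ops_model *next;
};

static struct cc_ops_model *cc_list; /* most recently registered first */

/* Reject tables that lack a required member or reuse a known name. */
static int cc_register_model(struct cc_ops_model *ca)
{
	struct cc_ops_model *e;

	assert(ca != NULL);
	if (!ca->name || !ca->name[0] ||
	    !ca->ssthresh || !ca->cong_avoid || !ca->min_cwnd)
		return -EINVAL;

	for (e = cc_list; e; e = e->next)
		if (strcmp(e->name, ca->name) == 0)
			return -EEXIST;

	ca->next = cc_list;
	cc_list = ca;
	return 0;
}

/* Trivial example callbacks for the model. */
static unsigned int half_cwnd(unsigned int cwnd)
{
	return cwnd / 2 > 2 ? cwnd / 2 : 2;
}

static unsigned int grow_cwnd(unsigned int cwnd) { return cwnd + 1; }
static unsigned int keep_cwnd(unsigned int cwnd) { return cwnd; }
```

A table that sets only name and ssthresh is rejected by this model, while
one that sets all four members is linked at the head of the list.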

Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space.  This space is preallocated, so
it is important to check that your private data fits in it. Alternatively,
space can be allocated elsewhere and a pointer to it stored here.

There are currently three kinds of congestion control algorithms. The
simplest ones are derived from TCP Reno (highspeed, scalable) and just
provide an alternative congestion window calculation. More complex
ones like BIC try to look at other events to provide better
heuristics.  There are also round trip time based algorithms like
Vegas and Westwood+.

Good TCP congestion control is a complex problem because the algorithm
needs to maintain fairness and performance. Please review current
research and RFCs before developing new modules.

The congestion control mechanism that is used is determined by the setting
of the sysctl net.ipv4.tcp_congestion_control.
The default congestion control will be the last one registered (LIFO);
so if you build everything as modules, the default will be reno. If you
build with the defaults from Kconfig, then CUBIC will be built in (not a
module) and it will end up the default.

If you really want a particular default value then you will need
to set it with the sysctl.  If you use the sysctl, the module will be
autoloaded if needed and you will get the expected protocol. If you ask for
an unknown congestion method, then the sysctl attempt will fail.

If you remove a TCP congestion control module, then you will get the next
available one. Since reno cannot be built as a module, and cannot be
deleted, it will always be available.
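
The LIFO default and the fallback on removal can be modelled with a simple
linked list. This is a user-space sketch under stated assumptions:
struct cc_entry, cc_add(), cc_remove() and cc_default() are hypothetical
names, and the real list handling in the kernel differs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cc_entry {
	const char *name;
	int removable;           /* 0 for the always-present fallback */
	struct cc_entry *next;
};

/* reno is built in, so it is on the list from the start. */
static struct cc_entry reno_entry = { "reno", 0, NULL };
static struct cc_entry *cc_head = &reno_entry;

/* Register at the head: the last one registered becomes the default. */
static void cc_add(struct cc_entry *e)
{
	assert(e && e->name);
	e->next = cc_head;
	cc_head = e;
}

/* Unlink by name; returns 0 on success, -1 if unknown or not removable. */
static int cc_remove(const char *name)
{
	struct cc_entry **pp;

	assert(name != NULL);
	for (pp = &cc_head; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			if (!(*pp)->removable)
				return -1;
			*pp = (*pp)->next;
			return 0;
		}
	}
	return -1;
}

/* The default is whatever currently sits at the head of the list. */
static const char *cc_default(void)
{
	return cc_head->name;
}
```

In this model, adding "cubic" and then "vegas" makes "vegas" the default;
removing "vegas" falls back to "cubic", and "reno" can never be removed.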

How the new TCP output machine [nyi] works
==========================================

Data is kept on a single queue. The skb->users flag tells us if the frame is
one that has been queued already. To add a frame we throw it on the end. Ack
walks down the list from the start.

We keep a set of control flags and queue pointers:


	sk->tcp_pend_event

		TCP_PEND_ACK			Ack needed
		TCP_ACK_NOW			Needed now
		TCP_WINDOW			Window update check
		TCP_WINZERO			Zero probing


	sk->transmit_queue		Start of the transmission frame queue
	sk->transmit_new		First new frame pointer
	sk->transmit_end		Where to add frames

	sk->tcp_last_tx_ack		Last ack seen
	sk->tcp_dup_ack			Dup ack count for fast retransmit
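
Since this design is not yet implemented, the following is only a
user-space sketch of how the single queue and its three pointers could fit
together. struct frame, struct txq and the txq_*() helpers are
hypothetical, and a plain "sent" flag stands in for the skb->users test
described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct frame {
	uint32_t end_seq;       /* sequence number just past this frame */
	int sent;               /* set once the frame has been transmitted */
	struct frame *next;
};

struct txq {
	struct frame *transmit_queue; /* oldest frame still held */
	struct frame *transmit_new;   /* first frame not yet sent */
	struct frame *transmit_end;   /* where to add frames */
};

/* To add a frame we put it on the end. Returns 0, or -1 if out of memory. */
static int txq_add(struct txq *q, uint32_t end_seq)
{
	struct frame *f = calloc(1, sizeof(*f));

	assert(q != NULL);
	if (!f)
		return -1;
	f->end_seq = end_seq;
	if (q->transmit_end)
		q->transmit_end->next = f;
	else
		q->transmit_queue = f;
	q->transmit_end = f;
	if (!q->transmit_new)
		q->transmit_new = f;
	return 0;
}

/* Mark up to n new frames as sent; returns how many were marked. */
static int txq_send(struct txq *q, int n)
{
	int sent = 0;

	while (q->transmit_new && sent < n) {
		q->transmit_new->sent = 1;
		q->transmit_new = q->transmit_new->next;
		sent++;
	}
	return sent;
}

/* An ACK walks from the start and frees every fully acknowledged frame. */
static int txq_ack(struct txq *q, uint32_t ack)
{
	int freed = 0;

	/* The signed difference keeps the comparison safe across wraparound. */
	while (q->transmit_queue && q->transmit_queue->sent &&
	       (int32_t)(ack - q->transmit_queue->end_seq) >= 0) {
		struct frame *f = q->transmit_queue;

		q->transmit_queue = f->next;
		if (!q->transmit_queue)
			q->transmit_end = NULL;
		free(f);
		freed++;
	}
	return freed;
}
```

Frames between transmit_queue and transmit_new are sent but unacknowledged;
frames from transmit_new onward are still waiting to be sent.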


Frames are queued for output by tcp_write. We do our best to send the frames
off immediately if possible, but otherwise queue and compute the body
checksum in the copy.

When a write is done we try to clear any pending events and piggyback them.
If the window is full we queue full sized frames. On the first timeout in
zero window we split this.

On a timer we walk the retransmit list to send any retransmits, update the
backoff timers, etc. A change of route table stamp causes a change of header
and a recompute. We add any new TCP level headers and refinish the checksum
before sending.