This is doc/design-thoughts/http2.txt.
2014/10/23 - design thoughts for HTTP/2

- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
  connection holds a compression context (headers table, etc...). We probably
  need to have an h2_conn struct.

- multiple transactions will be handled in parallel for a given h2_conn. They
  are called streams in HTTP/2 terminology.

- multiplexing : for a given client-side h2 connection, we can have multiple
  server-side h2 connections. And for a server-side h2 connection, we can have
  multiple client-side h2 connections. Streams circulate in N-to-N fashion.

- flow control : flow control will be applied between multiple streams. Special
  care must be taken so that an H2 client cannot block some H2 servers by
  sending requests spread over multiple servers to the point where one server
  response is blocked and prevents other responses from the same server from
  reaching their clients. H2 connection buffers must always be empty or nearly
  empty. The per-stream flow control needs to be respected as well as the
  connection's buffers. It is important to implement some fairness between all
  the streams so that it is not always the same stream that gets the bandwidth
  when the connection is congested.

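One possible way to get this fairness is a round-robin pass over the streams,
where each pick is bounded by a quantum, by the per-stream window and by the
connection window. The sketch below only illustrates that idea; every name in
it (fc_stream, fc_conn, fc_pick_stream, the quantum parameter) is hypothetical
and not an existing structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stream state. */
struct fc_stream {
	size_t pending;   /* bytes queued for this stream */
	size_t window;    /* remaining per-stream send window */
};

/* Hypothetical per-connection state. */
struct fc_conn {
	struct fc_stream *streams;
	size_t nb_streams;
	size_t window;    /* remaining connection-level send window */
	size_t next;      /* index where the next scan starts */
};

/* Picks the next stream allowed to send, scanning round-robin from
 * conn->next so that the same stream does not always win. On success,
 * returns the stream index, stores into *len the number of bytes it may
 * send (bounded by quantum, both windows and the pending data) and debits
 * them. Returns -1 when nothing can be sent.
 */
static int fc_pick_stream(struct fc_conn *conn, size_t quantum, size_t *len)
{
	size_t i;

	assert(conn && len && quantum > 0);
	if (!conn->window)
		return -1;

	for (i = 0; i < conn->nb_streams; i++) {
		size_t idx = (conn->next + i) % conn->nb_streams;
		struct fc_stream *s = &conn->streams[idx];
		size_t n = s->pending;

		if (n > s->window)
			n = s->window;
		if (n > conn->window)
			n = conn->window;
		if (n > quantum)
			n = quantum;
		if (!n)
			continue;   /* nothing to send or stream blocked */

		s->pending -= n;
		s->window -= n;
		conn->window -= n;
		conn->next = (idx + 1) % conn->nb_streams;
		*len = n;
		return (int)idx;
	}
	return -1;
}
```

A stream whose own window is exhausted is skipped without blocking the
others, and the quantum prevents a single large response from taking the
whole connection window in one pass.
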
- some clients can be H1 with an H2 server (is this really needed ?). Most of
  the initial use cases will be H2 clients to H1 servers. It is important to
  keep in mind that H1 servers do not do flow control and that we don't want
  them to block transfers (eg: post upload).

- internal tasks : some H2 clients will be internal tasks (eg: health checks).
  Some H2 servers will be internal tasks (eg: stats, cache). The model must be
  compatible with this use case.

- header indexing : headers are transported compressed, with a reference to a
  static or a dynamic header, or a literal, possibly Huffman-encoded. Indexing
  is specific to the H2 connection. This means there is no way any binary data
  can flow between both sides: headers will have to be decoded according to the
  incoming connection's context and re-encoded according to the outgoing
  connection's context, which can significantly differ. In order to avoid the
  parsing trouble we currently face, headers will have to be clearly split
  between name and value. It is worth noting that neither the incoming nor the
  outgoing connections' contexts will be of any use while processing the
  headers. At best we can have some shortcuts for well-known names that map
  well to the static ones (eg: use the first static entry with same name), and
  maybe have a few special cases for static name+value as well. Probably we can
  classify headers into the following categories :

    - static name + value
    - static name + other value
    - dynamic name + other value

  This will allow for better processing in some specific cases. Headers
  supporting a single value (:method, :status, :path, ...) should probably
  be stored in a single location with a direct access. That would allow us
  to retrieve a method using hdr[METHOD]. All such indexing must be performed
  while parsing. That also means that HTTP/1 will have to be converted to this
  representation very early in the parser and possibly converted back to H/1
  after processing.

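  As an illustration of the hdr[METHOD] idea, here is a minimal sketch where
  single-valued headers land in fixed slots while parsing and everything else
  goes to a generic list. The slot names, the hdr_list layout and hdr_add()
  are assumptions made up for this example only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical slots for headers that carry a single value. */
enum hdr_slot {
	HDR_METHOD, HDR_SCHEME, HDR_AUTHORITY, HDR_PATH, HDR_STATUS, HDR_SLOTS
};

#define HDR_MAX_OTHER 32

/* Name and value are always kept split. */
struct hdr_field {
	const char *name;
	const char *value;
};

static const char *const hdr_slot_names[HDR_SLOTS] = {
	":method", ":scheme", ":authority", ":path", ":status",
};

struct hdr_list {
	struct hdr_field single[HDR_SLOTS];     /* direct access by slot */
	struct hdr_field other[HDR_MAX_OTHER];  /* all remaining headers */
	int nb_other;
};

/* Stores one parsed (name, value) pair. Single-valued headers go to their
 * slot so that the method can later be read as l->single[HDR_METHOD].value.
 * Returns 0 on success, -1 when the generic list is full.
 */
static int hdr_add(struct hdr_list *l, const char *name, const char *value)
{
	int i;

	assert(l && name && value);
	for (i = 0; i < HDR_SLOTS; i++) {
		if (strcmp(name, hdr_slot_names[i]) == 0) {
			l->single[i].name = hdr_slot_names[i];
			l->single[i].value = value;
			return 0;
		}
	}
	if (l->nb_other >= HDR_MAX_OTHER)
		return -1;
	l->other[l->nb_other].name = name;
	l->other[l->nb_other].value = value;
	l->nb_other++;
	return 0;
}
```

  An HTTP/1 parser would fill the same slots from the request line, which is
  what makes the representation usable on both sides.
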
  Header names/values will have to be placed in a small memory area that will
  inevitably get fragmented as headers are rewritten. An automatic packing
  mechanism must be implemented so that when there's no more room, headers are
  simply defragmented/packed to a new table and the old one is released. Just
  like for the static chunks, we need to have a few such tables pre-allocated
  and ready to be swapped at any moment. Repacking must not change any index
  nor affect the way headers are compressed so that it can happen late after a
  retry (send-name-header for example).

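The packing mechanism can be sketched as follows, assuming a hypothetical
table made of a small byte area plus fixed-index entries pointing into it.
Rewriting an entry appends the new bytes and leaves a hole; repacking copies
the live bytes into a spare table without changing any index. The sizes and
names (ht_table, ht_set, ht_repack) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HT_AREA_SIZE   32   /* deliberately tiny for the example */
#define HT_MAX_ENTRIES 4

struct ht_entry {
	size_t off;   /* offset of the entry's bytes in the area */
	size_t len;   /* 0 means the slot is unused */
};

struct ht_table {
	char area[HT_AREA_SIZE];
	size_t used;                        /* bytes consumed, holes included */
	struct ht_entry e[HT_MAX_ENTRIES];
};

/* Rewrites entry <idx>. The new bytes are appended at the tail and the old
 * ones become a hole. Returns 0 on success, -1 when there is no more room,
 * in which case the caller is expected to repack into a spare table.
 */
static int ht_set(struct ht_table *t, int idx, const char *data, size_t len)
{
	assert(t && data && idx >= 0 && idx < HT_MAX_ENTRIES);
	if (len > HT_AREA_SIZE - t->used)
		return -1;
	memcpy(t->area + t->used, data, len);
	t->e[idx].off = t->used;
	t->e[idx].len = len;
	t->used += len;
	return 0;
}

/* Copies the live entries of <src> into <dst> without holes. Each entry
 * keeps its index, only its offset changes.
 */
static void ht_repack(struct ht_table *dst, const struct ht_table *src)
{
	int i;

	assert(dst && src && dst != src);
	dst->used = 0;
	for (i = 0; i < HT_MAX_ENTRIES; i++) {
		dst->e[i].off = dst->used;
		dst->e[i].len = src->e[i].len;
		if (!src->e[i].len)
			continue;
		memcpy(dst->area + dst->used, src->area + src->e[i].off,
		       src->e[i].len);
		dst->used += src->e[i].len;
	}
}
```

Since callers only ever hold indexes, swapping the old table for the repacked
one is invisible to them, which is the property needed to repack after a
retry.
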
- header processing : can still happen on a (header, value) basis. Reqrep/
  rsprep completely disappear and will have to be replaced with something else
  to support renaming headers and rewriting url/path/...

- push_promise : servers can push dummy requests+responses. They advertise
  the stream ID in the push_promise frame indicating the associated stream ID.
  This means that it is possible to initiate a client-server stream from the
  information coming from the server and make the data flow as if the client
  had made it. It's likely that we'll have to support two types of server
  connections: those which support push and those which do not. That way client
  streams will be distributed to existing server connections based on their
  capabilities. It's important to keep in mind that PUSH will not be rewritten
  in responses.

- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
  have to be mapped. Thus a given stream is an entity with two IDs (one per
  side). Or more precisely a stream has two end points, each one carrying an ID
  when it ends on an HTTP/2 connection. Also, for each stream ID we need to
  quickly find the associated transaction in progress. Using a small quick
  unique tree seems indicated considering the wide range of valid values.

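A rough sketch of a stream with two end points and of the per-connection
lookup. Note that it does not use a tree: a sorted array searched with the
standard bsearch() stands in for it here, only to keep the example
self-contained. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One end of a stream. <id> is 0 when this side is not an HTTP/2
 * connection (eg: an H1 server).
 */
struct strm_endpoint {
	uint32_t id;
};

/* A stream has two end points, hence up to two IDs. */
struct strm {
	struct strm_endpoint front;
	struct strm_endpoint back;
};

/* Per-connection index entry: stream ID on this connection -> stream. */
struct id_map_entry {
	uint32_t id;
	struct strm *strm;
};

static int id_cmp(const void *a, const void *b)
{
	uint32_t x = ((const struct id_map_entry *)a)->id;
	uint32_t y = ((const struct id_map_entry *)b)->id;

	return (x > y) - (x < y);
}

/* Looks up <id> in a map sorted by ascending ID. Returns the stream, or
 * NULL when no transaction is in progress for this ID.
 */
static struct strm *id_lookup(const struct id_map_entry *map, size_t n,
                              uint32_t id)
{
	struct id_map_entry key = { id, NULL };
	const struct id_map_entry *e;

	assert(map || n == 0);
	if (!n)
		return NULL;
	e = bsearch(&key, map, n, sizeof(*map), id_cmp);
	return e ? e->strm : NULL;
}
```

Each H2 connection would own one such index, so the same stream is reachable
through its front ID on the client connection and through its back ID on the
server connection.
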
- frame sizes : frames have to be remapped between both sides as multiplexed
  connections won't always have the same characteristics. Thus some frames
  might be spliced and others will be sliced.

- error processing : care must be taken to never break a connection unless it
  is dead or corrupt at the protocol level. Stats counters must exist to observe
  the causes. Timeouts are a great problem because silent connections might
  die out of inactivity. Ping frames should probably be scheduled a few seconds
  before the connection timeout so that an unused connection is verified before
  being killed. Abnormal requests must be dealt with using RST_STREAM.

- ALPN : ALPN must be observed on the client side, and transmitted to the server
  side.

- proxy protocol : proxy protocol makes little to no sense in a multiplexed
  protocol. A per-stream equivalent will surely be needed if implementations
  do not quickly generalize the use of the Forwarded header.

- simplified protocol for local devices (eg: haproxy->varnish in clear and
  without handshake, and possibly even with splicing if the connection's
  settings are shared)

- logging : logging must report extra information such as the stream ID, and
  whether the transaction was initiated by the client or by the server (which
  can be deduced from the stream ID's parity). In case of push, the number of
  the associated stream must also be reported.

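The parity rule is simple enough to be written down: in HTTP/2, streams
initiated by the client carry odd IDs, those initiated by the server carry
even IDs, and ID 0 designates the connection itself. The enum and function
names below are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum h2_initiator {
	H2_INIT_CONN,     /* ID 0: connection-level, not a stream */
	H2_INIT_CLIENT,   /* odd IDs */
	H2_INIT_SERVER,   /* even IDs (eg: pushed streams) */
};

/* Tells which side initiated the stream, based on its ID's parity. */
static enum h2_initiator h2_stream_initiator(uint32_t id)
{
	assert(id <= 0x7fffffffU);   /* stream IDs are 31 bits wide */
	if (id == 0)
		return H2_INIT_CONN;
	return (id & 1) ? H2_INIT_CLIENT : H2_INIT_SERVER;
}
```
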
- memory usage : H2 increases memory usage by mandating support for frames of
  at least 16384 bytes. That means slightly more than 16kB of buffer in each
  direction to process any frame. It will definitely have an impact on the
  deployed maxconn setting in places using less than this (4..8kB are common).
  Also, the header list is persistent per connection, so if we reach the same
  size as the request, that's another 16kB in each direction, resulting in
  about 48kB of memory where 8 were previously used. A more careful encoder
  can work with a much smaller set even if that implies evicting entries
  between multiple headers of the same message.

- HTTP/1.0 should very carefully be transported over H2. Since there's no way
  to pass version information in the protocol, the server could use some
  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
  ...).

- host / :authority : ":authority" is the norm, and "host" will be absent when
  H2 clients generate :authority. This probably means that a dummy Host header
  will have to be produced internally from :authority and removed when passing
  to H2 behind. This can cause some trouble when passing H2 requests to H1
  proxies, because there's no way to know if the request should contain scheme
  and authority in H1 or not based on the H2 request. Thus a "proxy" option
  will have to be explicitly mentioned on HTTP/1 server lines. One of the
  problems that it creates is that it's no longer possible to pass H/1 requests
  to H/1 proxies without an explicit configuration. Maybe a table of the
  various combinations is needed.

                        :scheme   :authority   host
    HTTP/2 request      present   present      absent
    HTTP/1 server req   absent    absent       present
    HTTP/1 proxy req    present   present      present

  So in the end the issue is only with H/2 requests passed to H/1 proxies.

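The table of combinations translates directly into a small helper that says
which of the three fields must be emitted for a given kind of outgoing
request. The "proxy" case corresponds to the explicit option on the server
line. The enum, struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Kind of request to produce on the outgoing side. */
enum req_target {
	TGT_H2,          /* HTTP/2 request */
	TGT_H1_SERVER,   /* HTTP/1 request to an origin server */
	TGT_H1_PROXY,    /* HTTP/1 request to a proxy ("proxy" option set) */
};

/* Which fields must be present in the outgoing request. */
struct req_fields {
	bool scheme;
	bool authority;
	bool host;
};

/* Returns the fields to emit, following the table of combinations. */
static struct req_fields req_fields_for(enum req_target t)
{
	struct req_fields f = { false, false, false };

	switch (t) {
	case TGT_H2:
		f.scheme = f.authority = true;
		break;
	case TGT_H1_SERVER:
		f.host = true;
		break;
	case TGT_H1_PROXY:
		f.scheme = f.authority = f.host = true;
		break;
	default:
		assert(0 && "unknown target");
		break;
	}
	return f;
}
```
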
- ping frames : they don't indicate any stream ID so by definition they cannot
  be forwarded to any server. The H2 connection should deal with them only.

There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
it over a framing layer to mux the streams: the frame type must be passed below
so that frames are properly arranged. Header encoding is connection-based and
all streams using the same connection will interact in the way their headers
are encoded. Thus the encoder *has* to be placed in the h2_conn entity, and
this entity has to know for each stream what its headers are.

We should probably remove *all* headers from transported data and move
them on the fly to a parallel structure that can be shared between H1 and H2
and consumed at the appropriate level. That means buffers only transport data.
Trailers have to be dealt with differently.

So if we consider an H1 request being forwarded between a client and a server,
it would look approximately like this :

  - request header + body land into a stream's receive buffer
  - headers are indexed and stripped out so that only the body and whatever
    follows remain in the buffer
  - both the header index and the buffer with the body stay attached to the
    stream
  - the sender can rebuild the whole headers. Since they're found in a table
    supposed to be stable, it can rebuild them as many times as desired and
    will always get the same result, so it's safe to build them into the trash
    buffer for immediate sending, just as we do for the PROXY protocol.
  - the upper protocol should probably provide a build_hdr() callback which
    when called by the socket layer, builds this header block based on the
    current stream's header list, ready to be sent.
  - the socket layer has to know how many bytes from the headers are left to be
    forwarded prior to processing the body.
  - the socket layer needs to consume only the acceptable part of the body and
    must not release the buffer if any data remains in it (eg: pipelining over
    H1). This is already handled by channel->o and channel->to_forward.
  - we could possibly have another optional callback to send a preamble before
    data, that could be used to send chunk sizes in H1. The danger is that it
    absolutely needs to be stable if it has to be retried. But it could
    considerably simplify de-chunking.

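The header-rebuilding steps above could look like the sketch below for the H1
case. The block is rebuilt from the stable header list on every attempt and
the only state kept between attempts is the number of bytes already sent.
The function names and the flat array used as header list are assumptions for
the example, they are not the build_hdr() prototype itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical entry of the stream's header list. */
struct h1_hdr {
	const char *name;
	const char *value;
};

/* Builds an H1 header block into <out>. Since the header list is stable,
 * calling it twice yields the same bytes. Returns the block's length, or
 * -1 if it does not fit.
 */
static int h1_build_hdr(char *out, size_t size, const char *start_line,
                        const struct h1_hdr *h, int nb)
{
	size_t len;
	int i, ret;

	assert(out && start_line && (h || nb == 0));
	ret = snprintf(out, size, "%s\r\n", start_line);
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	len = (size_t)ret;

	for (i = 0; i < nb; i++) {
		ret = snprintf(out + len, size - len, "%s: %s\r\n",
		               h[i].name, h[i].value);
		if (ret < 0 || (size_t)ret >= size - len)
			return -1;
		len += (size_t)ret;
	}

	ret = snprintf(out + len, size - len, "\r\n");
	if (ret < 0 || (size_t)ret >= size - len)
		return -1;
	return (int)(len + (size_t)ret);
}

/* Rebuilds the block into the trash buffer after a partial send and
 * returns the part which was not sent yet, with its length in <*left>.
 * Returns NULL on error.
 */
static const char *h1_hdr_resume(char *trash, size_t size,
                                 const char *start_line,
                                 const struct h1_hdr *h, int nb,
                                 size_t sent, size_t *left)
{
	int len = h1_build_hdr(trash, size, start_line, h, nb);

	assert(left);
	if (len < 0 || sent > (size_t)len)
		return NULL;
	*left = (size_t)len - sent;
	return trash + sent;
}
```

This only works because the output is deterministic. For H2 the encoding
depends on what was already sent, so the same trick does not apply there.
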
When the request is sent to an H2 server, an H2 stream request must be made
to the server: we find an existing connection whose settings are compatible
with our needs (eg: tls/clear, push/no-push), and with a spare stream ID. If
none is found, a new connection must be established, unless maxconn is reached.

Servers must have a maxstream setting just like they have a maxconn. The same
queue may be used for that.

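The selection logic described above might be sketched like this: reuse a
connection whose settings match and which still has a spare stream, otherwise
open a new one, otherwise queue. The flags, the structure and the function
are hypothetical, and matching the settings by strict equality is a
simplification.

```c
#include <assert.h>

/* Hypothetical connection settings. */
#define SC_TLS   0x1u   /* TLS (set) or clear (unset) */
#define SC_PUSH  0x2u   /* server push supported */

struct srv_conn {
	unsigned int flags;   /* SC_* settings of this connection */
	int nb_streams;       /* streams currently in use */
	int max_streams;      /* per-connection limit (maxstream) */
};

enum pick_result {
	PICK_REUSE,      /* *idx designates an existing connection */
	PICK_NEW_CONN,   /* nothing suitable, a new connection may be opened */
	PICK_QUEUE,      /* maxconn reached, the request has to wait */
};

/* Looks for a connection matching <need> with a spare stream among the
 * <nb_conns> connections of a server limited to <maxconn> connections.
 */
static enum pick_result srv_pick(const struct srv_conn *c, int nb_conns,
                                 int maxconn, unsigned int need, int *idx)
{
	int i;

	assert(c || nb_conns == 0);
	assert(idx);
	for (i = 0; i < nb_conns; i++) {
		if (c[i].flags == need && c[i].nb_streams < c[i].max_streams) {
			*idx = i;
			return PICK_REUSE;
		}
	}
	*idx = -1;
	return (nb_conns < maxconn) ? PICK_NEW_CONN : PICK_QUEUE;
}
```
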
The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
that becomes impossible (and useless). We still need something like the
"tcp-request session" hook to apply just after the SSL handshake is done.

It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them and block the head of
line. Thus if body analysis is required, it will have to use the stream's
buffer, which necessarily implies a copy. That means that with each H2 end we
necessarily have at least one copy. Sometimes we might be able to "splice" some
bytes from one side to the other without copying into the stream buffer (same
rules as for TCP splicing).

In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding data (H1: linear/chunks, H2: frames).
Maybe the same mechanism could be extrapolated to tunnels / TCP.

Since we'd use buffers only for data (and for receipt of headers), we need to
have dynamic buffer allocation.

Thus :
- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
  send something that we need to build and that needs to be persistent in case
  of partial send. H1 headers are built on the fly from the header table to a
  temporary buffer that is immediately sent and whose amount of sent bytes is
  the only information kept (like for PROXY protocol). H2 headers are more
  complex since the encoding depends on what was successfully sent. Thus we
  need to build them and put them into a temporary buffer that remains
  persistent in case send() fails. It is possible to have a limited pool of
  Tx buffers and refrain from sending if there is no more buffer available in
  the pool. In that case we need a wake-up mechanism once a buffer is
  available. Once the data are sent, the Tx buffer is then immediately recycled
  in its pool. Note that no tx buffer being used (eg: for hdr or control) means
  that we have to be able to serialize access to the connection and retry with
  the same stream. It also means that a stream that times out while waiting for
  the connector to read the second half of its request has to stay there, or at
  least needs to be handled gracefully. However if the connector cannot read
  the data to be sent, it means that the buffer is congested and the connection
  is dead, so that probably means it can be killed.

- Rx buffers have to be pre-allocated just before calling recv(). A connection
  will first try to pick a buffer and disable reception if it fails, then
  subscribe to the list of tasks waiting for an Rx buffer.

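A limited pool with a wake-up list, as described for both Tx and Rx buffers,
can be sketched as follows. Waiters are represented by plain integer IDs and
the pool sizes are arbitrary. All of this is hypothetical and only meant to
show the pick / subscribe / wake-up sequence.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BUFS     2       /* arbitrary, small on purpose */
#define POOL_WAITERS  8
#define POOL_BUF_SIZE 16384

struct pool_buf {
	char data[POOL_BUF_SIZE];
	int in_use;
};

struct buf_pool {
	struct pool_buf bufs[POOL_BUFS];
	int waiters[POOL_WAITERS];   /* FIFO of waiting connection IDs */
	int head;                    /* oldest waiter */
	int count;                   /* number of waiters */
};

/* Tries to pick a buffer for connection <conn_id>. On failure, returns NULL
 * and subscribes the connection to the waiters list (if there is room), so
 * the caller only has to disable reception.
 */
static struct pool_buf *pool_get(struct buf_pool *p, int conn_id)
{
	int i;

	assert(p);
	for (i = 0; i < POOL_BUFS; i++) {
		if (!p->bufs[i].in_use) {
			p->bufs[i].in_use = 1;
			return &p->bufs[i];
		}
	}
	if (p->count < POOL_WAITERS) {
		p->waiters[(p->head + p->count) % POOL_WAITERS] = conn_id;
		p->count++;
	}
	return NULL;
}

/* Recycles a buffer. Returns the ID of the connection to wake up (the
 * oldest waiter), or -1 if nobody is waiting.
 */
static int pool_put(struct buf_pool *p, struct pool_buf *b)
{
	int id;

	assert(p && b && b->in_use);
	b->in_use = 0;
	if (!p->count)
		return -1;
	id = p->waiters[p->head];
	p->head = (p->head + 1) % POOL_WAITERS;
	p->count--;
	return id;
}
```

Waking the oldest waiter first keeps some fairness between connections when
the pool is the bottleneck.
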
- full Rx buffers might sometimes be moved around to the next buffer instead of
  experiencing a copy. That means that channels and connectors must use the
  same format of buffer, and that only the channel will have to see its
  pointers adjusted.

- Tx of data should be made as much as possible without copying. That possibly
  means by directly looking into the connection buffer on the other side if
  the local Tx buffer does not exist and the stream buffer is not allocated, or
  even performing a splice() call between the two sides. One of the problems in
  doing this is that it requires proper ordering of the operations (eg: when
  multiple readers are attached to the same buffer). If the splitting occurs
  upon receipt, there's no problem. If we expect to retrieve data directly from
  the original buffer, it's harder since it contains various things in an order
  which does not even indicate what belongs to whom. Thus possibly the only
  mechanism to implement is the buffer permutation which guarantees zero-copy
  and only in the 100% safe case. Also it's atomic and does not cause HOL
  blocking.

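The buffer permutation amounts to swapping two buffer descriptors when it is
known to be safe. In the sketch below the "safe case" is assumed to be an
empty destination of the same size, which is only one possible definition.
The zc_buf descriptor is hypothetical and much simpler than a real buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer descriptor shared by connectors and channels. */
struct zc_buf {
	char *area;    /* storage */
	size_t size;   /* storage size */
	size_t data;   /* number of bytes present */
};

/* Moves the contents of <rx> to <chn> by exchanging the two descriptors.
 * Returns 1 when the swap was done (zero copy), 0 when it is not safe and
 * the caller has to fall back to a regular copy.
 */
static int zc_swap(struct zc_buf *rx, struct zc_buf *chn)
{
	struct zc_buf tmp;

	assert(rx && chn);
	if (chn->data != 0 || rx->size != chn->size)
		return 0;
	tmp = *rx;
	*rx = *chn;
	*chn = tmp;
	return 1;
}
```

After a successful swap the connector is left with an empty buffer, ready
for the next recv().
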
It makes sense to choose the frontend_accept() function right after the
handshake has ended. It is then possible to check the ALPN, the SNI, the ciphers
and to accept to switch to the h2_conn_accept handler only if everything is OK.
The h2_conn_accept handler will have to deal with the connection setup,
initialization of the header table, exchange of the settings frames and
preparing whatever is needed to fire new streams upon receipt of unknown
stream IDs. Note: most of the time it will not be possible to splice() because
we need to know in advance the amount of bytes to write the header, and here it
will not be possible.

H2 health checks must be seen as regular transactions/streams. The check runs a
normal client which seeks an available stream from a server. The server then
finds one on an existing connection or initiates a new H2 connection. The H2
checks will have to be configurable for sharing streams or not. Another option
could be to specify how many requests can be made over existing connections
before insisting on getting a separate connection. Note that such separate
connections might end up stacking up once released. So they probably need
to be recycled very quickly (eg: fix how many unused ones can exist max).