IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
There was a flag to skip the response payload on output, if any, by stating
it is bodyless. It is used for responses to HEAD requests or for 204/304
responses. This allow rewrites during analysis. For instance a HEAD request
can be rewrite to a GET request for any reason (ie, a server not supporting
HEAD requests). In this case, the server will send a response with a
payload. On frontend side, the payload will be skipped and a valid response
(without payload) will be sent to the client.
With this patch we introduce the corresponding flag for the request. It will
be used to skip the request payload. In addition, when payload must be
skipped for a request or a response, The zero-copy data forwarding is now
disabled.
Start-line flags for 303-See-Other response returned by the stats applet are
not properly set. Indeed, the reponse has a "content-length" header but both
HTX_SL_F_CHNK and HTX_SL_F_CLEN flags are set. Because of this bug, the
reponse is considered as chunked. So, let's remove HTX_SL_F_CHNK flag.
And also add HTX_SL_F_BODYLESS flag because there is no payload
("content-length" header is always set to 0).
This patch must be backported to all stable versions. On the 2.8 and lower
versions, the commit d0b04920d1 ("BUG/MINOR: htpp-ana/stats: Specify that
HTX redirect messages have a C-L header") must be backported first.
Tests have shown that switching nevlist to global.tune.maxpollevents
is totally unreliable when using evports, and that events seem to be
missed. A good reproducer seems to be QUIC. There are not enough
users of Solaris to warrant spending more time trying to get down to
this, and even the few that remain are by definition not interested
in performance, so let's just revert the commit that tried to lift the
value: e6662bf706 ("MEDIUM: evports: permit to report multiple events
at once").
No backport is needed.
In 97ea9c49f1 ("BUG/MEDIUM: fd: always align fdtab[] to 64 bytes"), the
patch doesn't do what the message says. The intent was only to align the
base fdtab addr on 64 bytes so that all fdtab entries are aligned and thus
don't share the same cache line. For that, fdtab pointer is adjusted from
fdtab_addr (unaligned) address after it is allocated. Thus, all we need
is an extra 64 bytes in the fdtab_addr array for the aligment. Because
we use calloc() to perform the allocation, a dumb mistake was made: the
'+64' was added on <size> calloc argument, which means EACH fdtab entry
is allocated with 64 extra bytes.
Given that a single fdtab entry is 64 bytes, since 97ea9c49f1 each fdtab
entry now takes 128 bytes! We doubled fdtab memory consumption.
To give you an idea, on my laptop, when looking at memory consumption
using 'ps -p `pidof haproxy` -o size' right after starting haproxy
process with default settings (no maxsock enforced):
before 97ea9c49f1:
-> 118440 (KB, ~= 118MB)
after 97ea9c49f1:
-> 183976 (KB, ~= 184MB)
To fix this, use calloc with 1 <nmemb> and manually provide the size with
<size> as we would do if we used malloc(). With this patch, we're back to
pre-97ea9c49f1 for fdtab memory consumption (with 64 extra bytes the
whole array, which is insignificant).
It should be backported to all stable versions.
In c614fd3b9 ("MINOR: log: add +cbor encoding option"), I wrongly used
strnlen() without noticing that the function is not portable (requires
_POSIX_C_SOURCE >= 2008) and that it was the first occurrence in the
entire project. In fact it is not a hard requirement since it's a pretty
simple function. Thus to restore build compatibility with minimal/older
build systems, let's actually get rid of it and use an equivalent portable
code where needed (we cannot simply rely on strlen() because the string
might not be NULL terminated, we must take upstream len into account).
No backport needed (unless c614fd3b9 gets backported)
getline() was used to read stats-file. However, this function is not
portable and may cause build issue on some systems. Replace it by
standard fgets().
No need to backport.
Since the ocsp-update is not strictly a tuning of the SSL stack, but a
feature of its own, lets rename the option.
The option was also missing from the index.
Increment glitch connection counter on every HTTP/3 or QPACK errors
which is a violation of the specification. This could be useful to get
rid early of bogus clients.
Implement basic support for glitches on QUIC multiplexer. This is mostly
identical too glitches for HTTP/2.
A new configuration option named tune.quic.frontend.glitches-threshold
is defined to limit the number of glitches on a connection before
closing it.
Glitches counter is incremented via qcc_report_glitch(). A new
qcc_app_ops callback <report_susp> is defined. On threshold reaching, it
allows to set an application error code to close the connection. For
HTTP/3, value H3_EXCESSIVE_LOAD is returned. If not defined, default
code INTERNAL_ERROR is used.
For the moment, no glitch are reported for QUIC or HTTP/3 usage. This
will be added in future patches as needed.
This commit is the second step to simplify HTTP/3 error management. This
times it deals with receive side on h3_rcv_buf().
Various internal HTTP/3 to HTX conversion functions does not set
H3_INTERNAL_ERROR on h3c err anymore. Only standard error code are set.
For every errors, both internal and protocol ones, a negative value is
returned. This ensure that h3_rcv_buf() looping is interrupted. This
function will then set H3_INTERNAL_ERROR only if no standard error is
registered via h3c or h3s.
Along the previous commit, this should better reflect internal errors
from protocol ones caused by a faulty client.
It's currently difficult to differentiate HTTP/3 standard protocol
violation from internal issues which use solely H3_INTERNAL_ERROR code.
This patch aims is the first step to simplify this. The objective is to
reduce H3_INTERNAL_ERROR. <err> field of h3c should be reserved
exclusively to other values.
Simplify error management in sending via h3_snd_buf(). Sending side is
straightforward as only internal errors can be encountered. Do not
manually set h3c.err to H3_INTERNAL_ERROR in HTX to HTTP/3 various
conversion function. Instead, just return a negative value which is
enough to break h3_snd_buf() loop. H3_INTERNAL_ERROR is thus positionned
on a single location in this function for all sending operations.
Rename enum values used for HTTP/3 and QPACK RFC defined codes. First
uses a prefix H3_ERR_* which serves as identifier between them. Also
separate QPACK values in a new dedicated enum qpack_err. This is deemed
cleaner.
There is two distinct enums both related to QPACK error management. The
first one is dedicated to RFC defined code. The other one is a set of
internal values returned by qpack_decode_fs(). There has been issues
discovered recently due to the confusion between them.
Rename internal values with the prefix QPACK_RET_*. The older name
QPACK_ERR_* will be used in a future commit for the first enum.
As stated in issue #2565, checks on the request target during H1 message
parsing are not good enough. Invalid paths, not starting by a slash are in
fact parsed as authorities. The same error is repeated at the sample fetch
level. This last point is annoying because routing rules may be fooled. It
is also an issue when the URI or the Host header are updated.
Because the error is repeated at different places, it must be fixed. We
cannot be lax by arguing it is the server's job to accept or reject invalid
request targets. With this patch, we strengthen the checks performed on the
request target during H1 parsing. Idea is to reject invalid requests at this
step to be sure it is safe to manipulate the path or the authority at other
places.
So now, the asterisk-form is only allowed for OPTIONS and OTHER methods.
This last point was added to not reject the H2 preface. In addition, we take
care to have only one asterisk and nothing more. For the CONNECT method, we
take care to have a valid authority-form. All other form are rejected. The
authority-form is now only supported for CONNECT method. No specific check
is performed on the origin-form (except for the CONNECT method). For the
absolute-form, we take care to have a scheme and a valid authority.
These checks are not perfect but should be good enough to properly identify
each part of the request target for a relative small cost. But, it is a
breaking change. Some requests are now be rejected while they was not on
older versions. However, nowadays, it is most probably not an issue. If it
turns out it's really an issue for legitimate use-cases, an option would be
to supports these kinds of requests when the "accept-invalid-http-request"
option is set, with the consequence of seeing some sample fetches having an
unexpected behavior.
This patch should fix the issue #2665. It MUST NOT be backported. First
because it is a breaking change. And then because by avoiding backporting
it, it remains possible to relax the parsing with the
"accept-invalid-http-request" option.
The target of a CONNECT request must not have scheme. However, this was not
checked during the message parsing. It is now rejected.
This patch may be backported as far as 2.4.
When a non-CONNECT H1 request is parsed, the authority is compared to the
host header value, to validate that they are the same. However there is an
issue here when a relative path is used (not begining with a '/'). In this
case, the path is considered as the authority and will be erroneously
compared to the host header value. It is observable with this kind of
request:
GET admin HTTP/1.1
Host: www.mysite.com
In this case "admin" is parsed as an authority while it is in fact a path.
At this step, it is not a big deal because it just happens on the very first
checks on the message during the parsing. However, the same happens when the
authority is updated. This will be fixed in another commit
Note this kind of request is invalid because the path does not start with a
'/'. But, till now, HAProxy does not reject it.
This patch is related to issue #2565. It must be backported as far as 2.4.
The ->takeover() is quite tricky. It didn't take care of the possibility
that the original thread's connection handler had been woken up to handle
an event (e.g. read0), failed to get a buffer, registered against its own
thread's buffer_wait queue and left the connection in an idle state.
A new thread could then come by, perform a takeover(), and when a buffer
was available, the new thread's tasklet would be woken up by the old one
via *_buf_available(), causing all sort of problems. These problems are
easy to reproduce, by running with shared backend connections and few
buffers (tune.buffers.limit=20, 8 threads, 500 connections, transfer
64kB objects and wait 2-5s for a crash to appear).
A first estimated solution consisted in removing the connection from the
idle list but it turns out that it would be worse for the delete stuff
(the connection no longer appearing as idle, making it impossible to find
it in order to close it). Also, idle counts wouldn't match anymore the
list's state, and the special case of private connections could be
difficult to handle as the connection could be forcefully re-added to the
idle list after allocation despite being private.
After multiple attempts to address the problem in various ways, it appears
that the only reliable solution for now (without starting to turn many
lists to mt_lists) is to have the takeover() function handle the buf_wait
detection or unregistration itself:
- when doing a regular takeover aiming at finding an idle connection
for a new request, connections that are blocked in a buffer_wait
queue are quite rare and not interesting at all (since not immediately
usable), so skipping them is sufficient. For this we detect that the
desired connection belongs to a buffer_wait list by checking its
buf_wait.list element. Note that this check is *not* thread-safe! The
LIST_DEL_INIT() is performed by __offer_buffers() after the callback
was called. But this is sufficient as it is now because the only way
for the element to be seen as not in a list is after the element was
last touched by __offer_buffers(), so the situation for this connection
will not change in a different way later.
- when doing a server delete, we're running under thread isolation.
The connection might get taken over to be killed. The only trick is
that private connections not belonging to any idle list may also
experience this, and in this case even the idle_conns lock will not
offer any protection against anything. But since we're run under
thread isolation, we're certain not to compete with the other thread,
so it's safe to directly unregister the connection from its owner
thread. Normally this is already handled by conn_release() in
cli_parse_delete_server(), which calls mux->destroy(), but this would
actually update the current thread's queue instead of the origin
thread's, thus we do need to perform an explicit dequeue before
completing the takeover.
With this, the problem now looks solved for HTTP/1, HTTP/2 and FCGI,
though extensive tests were essentially run on HTTP/1 and HTTP/2.
While the problem has been there for a very long time, there should be
no reason to backport it since buffer_wait didn't practically work
before 3.0-dev and the process used to freeze hard very quickly before
we'd even have a chance to meet that race.
In 2.4-dev8 with commit 5c7086f6b0 ("MEDIUM: connection: protect idle
conn lists with locks"), the idle conns list started to be protected
using the lock for takeover, and the SSL layer used to always take
that lock. Later in 2.4-dev11, with commit 4149168255 ("MEDIUM: ssl:
implement xprt_set_used and xprt_set_idle to relax context checks"), we
decided to relax this lock using TASK_F_USR1 just as is done in muxes.
However the xprt_set_used() call, that's supposed to clear the flag,
visibly suffered from a copy-paste and kept the OR operation instead of
the AND, resulting in the flag never being released, so that SSL on the
backend continues to take the lock on each and every I/O access even when
the connection is not idle.
The effect is only a reduced performance. This could be backported, but
given the non-zero risk of triggering another bug somewhere, it would
be prudent to wait for this fix to be sufficiently tested in new
versions first.
qpack_decode_fs() is used to decode QPACK field section on HTTP/3
headers parsing. Its return value is incoherent as it returns either
QPACK_DECOMPRESSION_FAILED defined in RFC 9204 or any other internal
values defined in qpack-dec.h. On failure, such return code is reused by
HTTP/3 layer to be reported via a CONNECTION_CLOSE frame. This is
incorrect if an internal error values was reported as it is not defined
by any specification.
Fir return values of qpack_decode_fs() in two ways. Firstly, fix invalid
usages of QPACK_DECOMPRESSION_FAILED when decoded content is too large
for the correct internal error QPACK_ERR_TOO_LARGE.
Secondly, adjust qpack_decode_fs() API to only returns internal code
values. A new internal enum QPACK_ERR_DECOMP is defined to replace
QPACK_DECOMPRESSION_FAILED. Caller is responsible to convert it to a
suitable error value. For other internal values, H3_INTERNAL_ERROR is
used. This is done through a set of convert functions.
This should be backported up to 2.6. Note that trailers are not
supported in 2.6 so chunk related to h3_trailers_to_htx() can be safely
skipped.
qcc_shutdown() is called whenever the connection must be closed. If
application protocol defined its owned shutdown callback, it is invoked
to use the correct error code. Else transport error code NO_ERROR is
used.
A bug occurs in the latter case as NO_ERROR is used with quic_err_app()
which is reserved for application errro codes. This will trigger the
emission of a CONNECTION_CLOSE of type 0x1d (Application) instead of
0x1c (Transport).
This bug is considered minor as it does not impact QUIC with HTTP/3. It
may only be visible when using experimental HTTP/0.9 protocol.
This should be backported up to 2.6. For 2.6, patch must be completed
rewritten due to code differences. Here is the change to apply :
diff --git a/src/mux_quic.c b/src/mux_quic.c
index 26fb70ddf..c48f82e27 100644
--- a/src/mux_quic.c
+++ b/src/mux_quic.c
@@ -1918,7 +1918,9 @@ static void qc_release(struct qcc *qcc)
qc_send(qcc);
}
else {
- qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+ /* Duplicate from qcc_emit_cc_app() for Transport error code. */
+ if (!(qcc->conn->handle.qc->flags & QUIC_FL_CONN_IMMEDIATE_CLOSE))
+ qcc->conn->handle.qc->err = quic_err_transport(QC_ERR_NO_ERROR);
}
}
Since the following commit, idle connections are cleared before a server
is deleted. This is better than blocking server deletion due to inactive
connections :
6e0afb2e27
MEDIUM: server: close idle conn on server deletion
A BUG_ON() has been added to ensure that server idle conn counter is nul
after these connections are removed. However, Willy managed to trigger
it easily by repeatedly and randomly delete servers accross a
single-thread haproxy using a server-template with 1000 instances. In
parallel, a h1load client is executed to generate traffic.
This BUG_ON() reflected that it some connections referencing the server
targetted for deletion remained, even though idle server list is empty.
In fact, this is caused by connections scheduled for purging. These
connections are moved from idle server list to a global toremove_list
while still being accounted by the server.
A first approach could be to decrement server idle counter while moving
connection to the purge list. However, this is functionnaly incorrect as
these purgeable connections still reference the server and it could
cause a crash if cleared after it.
The correct fix for this issue is simply to remove every purgeable
connections before a server is deleted. This is implemented by this
patch by extending cli_parse_delete_server(). It could be enough to only
remove connections targetted the deleted server, but as these
connections will be purged anyway it is justified to clear the whole
list.
This must not be backported, unless the above mentionned patch is.
Based on Willy's idea (from 3.0-dev6 announcement message): in this patch
we try to reduce the max latency that can be caused by running lua scripts
with default settings.
Indeed, by default, hlua engine is allowed to process up to 10k
instructions per batch. While this value was found to be the optimal one
for a single thread, it turns out that keeping a thread busy for 10k lua
instructions could increase thread contention. This is especially true
when the script is loaded with 'lua-load', because in that case the
current thread owns the main lua lock and prevent other threads from
making any progress if they're also waiting on the main lock.
Thanks to Thierry Fournier's work, we know that performance-wise we can
reach optimal performance by sticking between 500 and 10k instructions
per batch. Given that, when the script is loaded using 'lua-load', if no
"tune.lua.forced-yield" was set by the user, we automatically divide the
default value (10K) by the number of threads haproxy can use to reduce
thread contention (given that all threads could compete for the main lua
lock), however we make sure not to return a value below 500, because
Thierry's work showed that this would come with a significant performance
loss.
The historical behavior may still be enforced by setting
"tune.lua.forced-yield" to 10000 in the global config section.
When support for UUIDv7 was added in commit
aab6477b67
the specification still was a draft.
It has since been published as RFC 9562.
This patch updates all UUID references from the obsoleted RFC 4122 and the
draft for RFC 9562 to the published RFC 9562.
An attach-srv config line usually looks like this:
tcp-request session attach-srv be/srv name ssl_c_s_dn(CN)
while a rhttp server line usually looks like this:
server srv rhttp@ sni req.hdr(host)
The server sni argument is used as a key for looking up connection in
the connection pool. The attach-srv name argument is used as a key for
inserting connections into the pool. For it to work correctly they must
match. There was a check that either both the attach-srv and server
provide that key or neither does.
It also checked that SSL and SNI was activated on the server. However,
thanks to current connect_server() implementation, it appears that SNI
is usable even without SSL to identify a connection in the pool. Thus,
it can be diverted from its original intent in reverse HTTP case to
serve even without SSL activated. For example, this could be useful to
use `fc_pp_unique_id` as a name expression (DISCLAIMER: note that for
now PROXY protocol is not compatible with rhttp).
Error is still reported if either SNI or name is used without the other.
This patch adjust the message to a more helpful one.
Arguably it would be easier to understand if instead of using `name` and
`sni` for `attach-srv` and `server` rules it used the same term in both
places - like "conn-pool-key" or something. That would make it clear
that the two must match.
When a log directive is defined in the global section, each time we use
"log global" in a proxy section, the global log directives are duplicated
for the current proxy. This works by creating a new proxy logger struct
and duplicating every members for each global one.
However, smp_rgs logger member is a special pointer member that is
allocated when "range" is used on a log directive. Currently, we simply
copy the array pointer (from the global one), instead of creating our own
copy. Because of that, range log sampling may not work properly in some
situations prior to 3f1284560 ("MINOR: log: remove the unused curr_idx in
struct smp_log_range") when used in global log directives, for instance:
global
log 127.0.0.1:5114 format raw sample 1-2,3:4 local0 info # should receive 75% of all proxy logs
log 127.0.0.1:5115 format raw sample 4:4 local0 info # should receive 25% of all proxy logs
listen proxy1
log global
listen proxy2
log global
May not work as expected, because curr_idx was stored within smp_rgs array
member prior to 3f1284560, and due to this bug, it happens to be shared
between every log directive inherited from a "global" one. The result is
that curr_idx counter will not behave properly because the index will be
increased globally instead of per-log directive, and it could even suffer
from concurrent thread accesses under load since we don't own the global
log directive's lock when manipulating it.
Another issue that was revealed because of this bug is that the smp_rgs
array allocated during config parsing is never freed in free_logger(),
resulting in small memory leak during clean exit.
To fix these issues all at once, let's properly duplicate smp_rgs logger
struct member in dup_logger() like we already do for other special members
so that every log directive have its own sms_rgs copy, and then
systematically free it in free_logger().
While this bug affects all stable versions (including 2.4), it's probably
best to not backport this beyond 2.6 because of 211ea252d
("BUG/MINOR: logs: fix logsrv leaks on clean exit") prerequisite that
first appears in 2.6.
[ada: for versions prior to 2.9, 969e212
("MINOR: log: add dup_logsrv() helper function") and 76acde91
("BUG/MINOR: log: keep the ref in dup_logger()") must be backported
first.
Note: Some ctx adjustments should be performed because 'logger' struct
used to be named 'logsrv' in the past and 2.9 introduced logger target
struct member. Thus it's probably easier to manually apply 76acde91 and
the current bugfix by hand directly on top of 969e212.
]
If add_sample_to_logformat_list() fails to allocate new logformat_node,
then we directly jump to error_free label to cleanup the node using
free_logformat_node() before returning an error.
However if the node failed to allocate, then the sample expression that
was allocated just before (not yet assigned) isn't released
(free_logformat_node() is a no-op when NULL is provided). Thus if expr
wasn't assigned to the node during early failure, then it must be manually
released.
This bug was introduced by 2462e5bcc ("BUG/MINOR: log: fix potential
lf->name memory leak") which wasn't marked for backports. It only
affects 3.0.
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.
These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve but now we could simplify it by simply
pushing the limit (if set) to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.
At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
Now when trying to allocate a buffer, we can check if we've been notified
of availability via the callback, in which case we should not consult the
queue, or if we're doing a first allocation and check the queue. At this
point it still doesn't change much since the stream still doesn't make use
of it but some progress is expected.
Now when trying to allocate the work buffer, we can check if we've been
notified of availability via the buf_wait callback, in which case we
should not consult the queue, or if we're doing a first allocation and
check the queue.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag in addition to clearing the ALLOC one, for
each of the 3 levels where we may fail an allocation. The flag will be
cleared upon a successful allocation. This will soon be used to decide to
re-allocate without waiting again in the queue. For now it has no effect.
There's just a trick, we need to clear the various *_ALLOC flags before
testing h1_recv_allowed() otherwise it will return false!
When appctx_buf_available() is called, it now sets APPCTX_FL_IN_MAYALLOC
or APPCTX_FL_OUT_MAYALLOC depending on the reportedly permitted buffer
allocation, and these flags are cleared when the said buffers are
allocated. For now they're not used for anything else.
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag on the stream so that the stream knows it
is allowed to bypass the queue checks. For now this is not used.
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.
Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
While it could certainly still be improved, this first approach consists
in assigning buffers like this in the H1 mux:
- h1c->obuf : DB_MUX_TX
- h1c->ibuf : DB_MUX_RX
- h1s->rxbuf: DB_SE_RX
That's done via 3 distinct functions for better code clarity, and it
also allowed to move the missing buffer flags assignment there.
Among possible improvements would be to take into consideration the
state of the parser (i.e. no data yet vs data, or headers vs payload)
so that even server beginning of response or pure payload can be lowered
in priority.
Instead of having each caller of appctx_get_buf() think about setting
the blocking flag, better have the function do it, since it's already
handling the queue anyway. This way we're sure that both are consistent.
Now we don't want bufwait handlers to preallocate the resources they
were expecting since it contributes to the shortage. Let's just wake
the applet up and that's all.
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.
Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
The errors were not working fine anyway since we know that upon low memory
condition everything freezes. However we have a chance to do better now,
so let's start by re-enabling queueing when allocations fail.
Let's turn the buffer_wq into an array of 4 list heads. These are chosen
by criticality. The DB_CRIT_TO_QUEUE() macro maps each criticality level
into one of these 4 queues. The goal here clearly is to make it possible
to wake up the most critical queues in priority in order to let some tasks
finish their job and release buffers that others can use.
In order to avoid having to look up all queues, a bit map indicates which
queues are in use, which also allows to avoid looping in the most common
case where queues are empty..
The code places that were used to manipulate the buffer_wq manually
now just call b_queue() or b_requeue(). This will simplify the multiple
list management later.
The goal is to indicate how critical the allocation is, between the
least one (growing an existing buffer ring) and the topmost one (boot
time allocation for the life of the process).
The 3 tcp-based muxes (h1, h2, fcgi) use a common allocation function
to try to allocate otherwise subscribe. There's currently no distinction
of direction nor part that tries to allocate, and this should be revisited
to improve this situation, particularly when we consider that mux-h2 can
reduce its Tx allocations if needed.
For now, 4 main levels are planned, to translate how the data travels
inside haproxy from a producer to a consumer:
- MUX_RX: buffer used to receive data from the OS
- SE_RX: buffer used to place a transformation of the RX data for
a mux, or to produce a response for an applet
- CHANNEL: the channel buffer for sync recv
- MUX_TX: buffer used to transfer data from the channel to the outside,
generally a mux but there can be a few specificities (e.g.
http client's response buffer passed to the application,
which also gets a transformation of the channel data).
The other levels are a bit different in that they don't strictly need to
allocate for the first two ones, or they're permanent for the last one
(used by compression).
Abort reason code received on STOP_SENDING is notified to upper layer
since the following commit :
367ce1ebf3
MINOR: mux-quic: Set tha SE abort reason when a STOP_SENDING frame is received
However, this causes a crash when a STOP_SENDING is received on a QCS
instance without any stream instantiated. Fix this by checking first if
qcs->sd is not NULL before setting abort code.
This bug can easily be reproduced by emitting a STOP_SENDING as first
frame of a stream.
This should fix github issue #2563.
This does not need to be backported.
As reported by Tristan in GH #2561, syslog messages sent over rings are
malformed since commit 01aa0a05 ("MEDIUM: ring: change the ring reader
to use the new vector-based API now").
Indeed, take a look at the following log message produced prior to
01aa0a05:
181 <134>1 2024-05-07T09:45:21.543263+02:00 - haproxy 113700 - - 127.0.0.1:56136 [07/May/2024:09:45:21.491] front front/s1 0/0/21/30/51 404 369 - - ---- 1/1/0/0/0 0/0 "GET / HTTP/1.1"
Starting with 01aa0a05, here's the equivalent log message:
<134>1 2024-05-07T09:45:21.543263+02:00 - haproxy 112729 - - 127.0.0.1:56136 [07/May/2024:09:45:21.491] front front/s1 0/0/66/39/105 404 369 - - ---- 1/1/0/0/0 0/0 "GET / HTTP/1.1"-fwr
-> Message is missing octet counting header, and garbage bytes are found
at the end of the payload.
This bug is caused by a small mistake in syslog_applet_append_event():
when the function was refactored to use vector API instead of buffer
API, we used 'trash.area' as starting pointer to write the event instead
of 'trash.area + trash.data', causing existing octet counting prefix
(already written in trash) to be overwritten and trash.data to be
wrongly incremented.
No backport needed (01aa0a05 was introduced during 3.0 development)
Thanks to the previous fix, it is now possible to get the number of opened
streams for a connection and the negociated limit. Here, corresponding
sample feches are added, in fc_ and bc_ scopes.
On frontend side, the limit of streams is imposed by HAProxy. But on the
backend side, the limit is defined by the server. it may be useful for
debugging purpose because it may explain slow-downs on some processing.
There are 2 new ctl commands that may be used to retrieve the current number
of streams openned for a connection and its limit (the maximum number of
streams a mux connection supports).
For the PT and H1 muxes, the limit is always 1 and the current number of
streams is 0 for idle connections, otherwise 1 is returned.
For the H2 and the FCGI muxes, info are already available in the mux
connection.
For the QUIC mux, the limit is also directly available. It is the maximum
initial sub-ID of bidirectional stream allowed for the connection. For the
current number of streams, it is the number of SC attached on the connection
and the number of not already attached streams present in the "opening_list"
list.
Other muxes implement this callback function. It was not implemented for the
QUIC mux because it was useless. It will be used to retrieve the current/max
number of stream for a quic connection. So let's added it, adding the
default support for MUX_CTL_EXIT_STATUS command.
It is now possible to retrieve some info about the abort received for a
server or a client stream, if any.
* fs.aborted and bs.aborted can be used to know if an abort was received
on frontend or backend side. A boolean is returned.
* fs.rst_code and bs.rst_code return the code of the received RESET_STREAM
frame for a H2 stream or the code of the received STOP_SENDING frame for
a QUIC stream. In both cases, the error code attached to the frame is
returned. The sample fetch fails if no such frame was received or if the
stream is not an H2/QUIC stream.
When STOP_SENDING frame is received for a quic stream, the error code is now
saved in the SE abort reason. To do so, we use the QUIC source
(SE_ABRT_SRC_MUX_QUIC). For now, this code is only set but not used on the
opposite side.
When a H2 client sends a RST_STREAM(CANCEL) frame to abort a request, the
abort reason is now used on server side, in the H2 mux, to set the
RST_STREAM code. The main use case is to forward client cancellations to
gRPC applications.
This patch should fix the issue #172.
When RST_STREAM frame is received, the error code is now saved in the SE
abort reason. To do so, we use the H2 source (SE_ABRT_SRC_MUX_H2). For now,
this code is only set but not used on the opposite side.
A reason is now passed as parameter to muxes shutdowns to pass additional
info about the abort, if any. No info means no abort or only generic one.
For now, the reason is composed of 2 32-bits integer. The first on represents
the abort code and the other one represents the info about the code (for
instance the source). The code should be interpreted according to the associated
info.
One info is the source, encoding on 5 bits. Other bits are reserverd for now.
For now, the muxes are the only supported source. But we can imagine to extend
it to applets, streams, health-checks...
The current design is quite simple and will most probably evolved.. But the
idea is to let the opposite side forward some errors and let's a mux know
why its stream was aborted. At first glance, a abort reason must only be
evaluated if SE_SHW_SILENT flag is set.
The main goal at short term, is to forward some H2 RST_STREAM codes because
it is mandatory for gRPC applications, mainly to forward gRPC cancellation
from an H2 client to an H2 server. But we can imagine to alter this reason
at the applicative level to enrich it. It would also be used to report more
accurate errors in logs.
Previously check_config_validity() had its own curproxy variable. This
resulted in the acl() sample fetch being unable to determine which
proxy was in use when used from within log-format statements. This
change addresses the issue by having the check_config_validity()
function use the global variable instead.
This patch fixes the commit eea152ee68
("BUG/MINOR: signals/poller: ensure wakeup from signals").
There is some probability that run_poll_loop() becomes inifinite, if
TH_FL_SLEEPING is withdrawn from all threads in the second signal_queue_len
check, when a signal has received just after the first one.
In such particular case, the 'wake' variable, which is used to terminate
thread's poll loop is never reset to 0. So, we never enter to the "stopping"
part of the run_poll_loop() and threads, except the one with id 0 (tid 0
handles signals), will continue to call _do_poll() eternally and will never
sleep, as its TH_FL_SLEEPING flag was unset.
This flag needs to be removed only for the tid 0, as it was done in the first
signal_queue_len check.
This fixes an issue #2537 "infinite loop when shutting down".
This fix must be backported in every stable version.
In lf_buildctx_prepare(), we perform costly bitwise operations for every
nodes to resolve node options and check for incompatibilities with global
options.
In fact, all this logic may safely be performed during postparsing. This
is what we're doing in this commit. Doing so saves us from unnecessary
runtime checks and could help speedup sess_build_logline().
Since checks are not as costly as before (due to them being performed
during postparsing and not on log building path anymore), an complementary
check for OPT_HTTP vs OPT_ENCODE incompatibity was added:
encoding is ignored if HTTP option is set, unless HTTP option wasn't
set globally and encoding was set globally, which means encoding
takes the precedence
Thanks to this patch, lf_buildctx_prepare() now only takes care of
assigning proper typecast and options settings depending if it's used
from global or per-node context, and prepares CBOR-specific structure
members when CBOR encode option is set.
In sess_build_logline(), we have multiple fieds such as '%t' that build
a fixed-length string out of a date struct and then print it using
lf_rawtext(). In fact, printing it using lf_rawtext() is only mandatory
to deal with encoding options, but when no encoding is used we can output
the result to tmplog directly. Since most dates generate between 25 and 30
chars, doing so spares us from writing them twice and could help make
sess_build_logline() a bit faster when no encoding is used. (to match with
pre-encoding patch series performance).
Now that lf_buildctx isn't pushed on the stack anymore, let's take this
opportunity to store a small buffer of 256 bytes within it, and then use
this buffer as general purpose buffer to build fixed-length strings that
are then printed using lf_{raw}text() function. By doing so we stop
relying on temporary stack buffers.
Following previous commit's logic, let's move lf_buildctx ctx away from
sess_build_logline() to stop abusing from the stack to push large
structure each time sess_build_logline() is called. Also, don't memset
the structure for each invokation, but only reset members explicitly when
required.
For that we now declare one static lf_buildctx per thread (using
THREAD_LOCAL) and make sess_build_logline() refer to it using a pointer.
'empty' buffer used in sess_build_logline() inside a loop, and since it
is only being read from and not modified, until recently it ended up being
cached most of the time and didn't cause overhead due to systematic push
on the stack.
However, due recent encoding work and new added variables on the stack,
we're starting to reach a stack limit and declaring 'empty' buffer within
the loop seems to cause non-negligible CPU overhead.
Since the variable isn't modified during log generation, let's declare
'empty' buffer as a global variable outside from sess_build_logline()
to prevent pushing it on the stack for each node evaluation.
Christian reported in GH #2556 that since 3.0-dev double spaces may be
found in log messages on some cases where it was not the case before.
As we were able to easily reproduce, a quick bisect led us to c6a7138
("MINOR: log: simplify last_isspace in sess_build_logline()"). While
it is true that all switch cases set the last_isspace variable to 0,
there was a subtelty for some fields such as '%hr', '%hrl', '%hs' or
'%hsl' and I overlooked it. Indeed, for '%hr', last_isspace was only set
to 0 if data was emitted, else the assignment didn't occur.
But with c6a7138, last_isspace is always set to 0 as long as the current
node type is not a separator. Because of that, if no data is emitted for
the current node value, and a space was already emitted prior to the
current node, then an extra space could be emitted after the node,
resulting in two spaces being emitted.
Note that while c6a7138 introduces a slight behavior regression regarding
last_isspace logic with the specific fields mentionned above, this
behavior could already be triggered with a failing or empty logformat
node sample expression. Consider this logformat expression:
log-format "%{-M}o | %[str()] |"
str() will not print anything, and since we disabled mandatory option with
'-M', nothing gets printed for the node sample expression. As a result, we
have the following output:
"| |"
Instead of (when mandatory option is enabled):
"| - |"
Thus in order to stick to the historical behavior, systematically set
last_isspace to 0 for EXPR nodes, and only set last_isspace to 0 when
data was written for TAG nodes. This way, '%hr', '%hrl', '%hs' or
'%hsl' should behave as before.
No backport needed.
Instead of chaining 2 switchcases and performing encoding checks for all
nodes let's actually split the logic in 2: first handle simple node types
(text/separator), and then handle dynamic node types (tag, expr). Encoding
options are only evaluated for dynamic node types.
Also, last_isspace is always set to 0 after next_fmt label, since next_fmt
label is only used for dynamic nodes, thus != LOG_FMT_SEPARATOR.
Since LF_NODE_WITH_OPT() macro (which was introduced recently) is now
unused, let's get rid of it.
No functional change should be expected.
(Use diff -w to check patch changes since reindentation makes the patch
look heavy, but in fact it remains fairly small)
Split code related to proxies list looping in cli_parse_clear_counters()
to a new dedicated function. This function is placed in the new module
stats-proxy.
Create a new module stats-proxy. Move stats functions related to proxies
list looping in it. This allows to reduce stats source file dividing its
size by half.
Convert FN_AGE in stat_cols_px[] as generic columns. These values will
be automatically used for dump/preload of a stats-file.
Remove srv_lastsession() / be_lastsession() function which are now
useless as last_sess is calculated via me_generate_field().
Extend generic stat column support to be able to fully support age stats
type. Several changes were required.
On output, me_generate_field() has been updated to report the difference
between the current tick with the stored value for FN_AGE type. Also, if
an age stats is hidden in show stats, -1 is returned instead of an empty
metric, which is the value to mark an age as unset.
On counters preload, load_ctr() was updated to handled FN_AGE. A similar
substraction is performed to the current tick value.
last_change was a member present in both proxy and server struct. It is
used as an age statistics to report the last update of the object.
Move last_change into fe_counters/be_counters. This is necessary to be
able to manipulate it through generic stat column and report it into
stats-file.
Note that there is a change for proxy structure with now 2 different
last_change values, on frontend and backend side. Special care was taken
to ensure that the value is initialized only on the proxy side. The
other value is set to 0 unless a listen proxy is instantiated. For the
moment, only backend counter is reported in stats. However, with now two
distinct values, stats could be extended to report it on both side.
Convert every FN_RATE in stat_cols_px[] to generic column. Thanks to
prior patch, this allows to automatically dump their value into
stats-file and preload corresponding freq-ctr on process startup.
Implement support for FN_RATE stat column into stat-file.
For the output part, only minimal change is required. Reuse the function
read_freq_ctr() to print the same value in both stats output and
stats-file dump.
For counter preloading, define a new utility function
preload_freq_ctr(). This can be used to initialize a freq-ctr type by
preloading previous period value. Reuse this function in load_ctr()
during stats-file parsing.
At the moment, no rate column is defined as generic. Thus, this commit
does not have functional change. This will be changed as soon as FN_RATE
are converted to generic columns.
Move freq-ctr defined in proxy or server structures into their dedicated
fe_counters/be_counters struct.
Functionnaly no change here. This commit will allow to convert rate
stats column to generic one, which is mandatory to manipulate them in
the stats-file.
Currently, only FN_COUNTER are dumped and preloaded via a stats-file.
Thus in several places we relied on the assumption that only FN_COUNTER
are valid in stats-file context.
New stats types will soon be implemented as they are also eligilible to
statistics reloading on process startup. Thus, prepare stats-file
functions to remove any FN_COUNTER restriction.
As one of this change, generate_stat_tree() now uses stcol_is_generic()
for stats name tree indexing before stats-file parsing.
Also related to stats-file parsing, individual counter preloading step
as been extracted from line parsing in a dedicated new function
load_ctr(). This will allow to extend it to support multiple mechanism
of counter preloading depending on the stats type.
req_tot counter is a special case as it is not managed identically
between frontend and backend side.
For the backend side, this metric is available directly into
be_counters, which allows to use a generic stat column definition.
On the frontend side however, the metric value is an aggredate of
multiple fe_counters value. This is the case since the splitting between
HTTP version introduced in the following patch :
9969adbcdc
MINOR: stats: add by HTTP version cumulated number of sessions and requests
This difference cannot be handled automatically by me_generate_field().
Add a special case in the function to produce it on frontend side
reusing the aggregated value. This not done however for stats-file as
there is no counter to preload.
If 'namespace' keyword is used in the backend server settings or/and in the
bind string, it means that haproxy process will call setns() to change its
default namespace to the configured one and then, it will create a
socket in this new namespace. setns() syscall requires CAP_SYS_ADMIN
capability in the process Effective set (see man 2 setns). Otherwise, the
process must be run as root.
To avoid to run haproxy as root, let's add cap_sys_admin capability in the
same way as we already added the support for some other network capabilities.
As CAP_SYS_ADMIN belongs to CAP_SYS_* capabilities type, let's add a separate
flag LSTCHK_SYSADM for it. This flag is set, if the 'namespace' keyword was
found during configuration parsing. The flag may be unset only in
prepare_caps_for_setuid() or in prepare_caps_from_permitted_set(), which
inspect process EUID/RUID and Effective and Permitted capabilities sets.
If system doesn't support Linux capabilities or 'cap_sys_admin' was not set
in 'setcap', but 'namespace' keyword is presented in the configuration, we
keep the previous strict behaviour. Process, that has changed uid to the
non-priviledged user, will terminate with alert. This alert invites the user
to recheck its configuration.
In the case, when haproxy will start and run under a non-root user and
'cap_sys_admin' is not set, but 'namespace' keyword is presented, this patch
does not change previous behaviour as well. We'll still let the user to try
its configuration, but we inform via warning, that unexpected things, like
socket creation errors, may occur.
setns() may return EPERM if thread, that tries to move into different
namespace, do not have CAP_SYS_ADMIN capability in its Effective set.
So, extending sock_handle_system_err() with this error allows to send
appropriate log message and set SF_ERR_PRXCOND (SC termination
flag in log) as stream termination error code. This error code can be
simply checked with SF_ERR_MASK at protocol layer.
quic_connect_server(), tcp_connect_server(), uxst_connect_server() duplicate
same code to check different ERRNOs, that socket() and setns() may return.
They also duplicate some runtime condition checks, applied to the obtained
server socket fd.
So, in order to remove these duplications and to improve code readability,
let's encapsulate socket() and setns() ERRNOs handling in
sock_handle_system_err(). It must be called just before fd's runtime condition
checks, which we also move in sock_create_server_socket by the same reason.
SO_MARK, SO_USER_COOKIE, SO_RTABLE socket options (used to set the special
mark/ID on socket, in order to perform mark-based routing) are only supported
by AF_INET sockets. So, let's check socket address family, when we enter into
this function.
As UNIX Domain sockets could be attached to Linux namespaces (see more details
about it from the Linux kernel patch set below:
https://lore.kernel.org/netdev/m1hbl7hxo3.fsf@fess.ebiederm.org),
it is better to use my_socket_at() in order to create UNIX listener's socket.
my_socket_at() takes in account a network namespace, that may be configured
for a frontend in the bind line:
frontend fe_foo
...
bind uxst@frontend.sock user haproxy group haproxy mode 660 namespace frontend
Like this, namespace aware applications as netstat for example, will see this
listening socket in its 'frontend' namespace and not in the root namespace as
it was before.
It is important to mention, that fixes in Linux kernel referenced above allow
to connect to this listener's socket from the root and from any other
namespace. UNIX Domain socket is protected by its permission set, which must
be set with caution on its inode.
As UNIX Domain sockets could be attached to Linux namespaces (see more details
about it from the Linux kernel patch set below:
https://lore.kernel.org/netdev/m1hbl7hxo3.fsf@fess.ebiederm.org),
it is better to use sock_create_server_socket() in UNIX stream protocol
implementation, as this function calls my_socket_at() and the latter takes
in account server network namespace, which could be configured as in example
below:
backend be_bar
...
server rpicam0 /run/ustreamer.sock namespace foonet
So, for UNIX Domain socket, used as an address of some backend server, this
patch makes possible to perform connect() to this backend server from the same
network namespace, where the server is running, or where its listening socket
was created.
Using sock_create_server_socket() in UNIX stream protocol implementation also
makes the code of uxst_connect_server() more uniform with tcp_connect_server()
and quic_connect_server().
Renaming sock to sock_fd makes it more clear, that sock_create_server_socket
returns the fd of newly created server socket and then we check this fd.
As we heavily use "fd" variable name in all protocol implementations, let's
prefix this one with the name of its object file: sock.o.
Since the extension of the buffers API to applets in 3.0-dev, an applet
may find itself unable to allocate a buffer, and will block respectively
on APPCTX_FL_OUTBLK_ALLOC or APPCTX_FL_INBLK_ALLOC depending on the
direction. However the code in sc_applet_process() doesn't consider this
situation when deciding to wake up an applet, so when the condition
arises, the applet keeps ringing and is killed by the loop detector.
The fix is trivial and simply consists in checking for the flags above.
No backport is needed since this is new in 3.0.
In 3f2e8d0ed ("MEDIUM: log: lf_* build helpers now take a ctx argument")
I made a mistake, because starting with this commit it is no longer
possible from a node to disable global logformat options.
The result is that when an option is set globally, it cannot be disabled
anymore.
For instance, it is not possible to do this anymore:
log-format "%{+X}o %{-X}Ts"
The original intent was to prevent encoding options from being
disabled once enabled globally, because when encoding is enabled globally
we start the object enumeration right away (ie: in CBOR and JSON we
announce dynamic map, and for each node we announce the key..), thus it
doesn't make sense to mix encoding types there, unless encoding is only
used per-node, in which case only the value gets encoded, thus it remains
possible to print a value in JSON/CBOR-compatible format while the next
one shouldn't be printed as-is.
Thus, to restore the original behavior, slightly change the logic in
lf_buildctx_prepare() so that only global encoding options take the
precedence over node's options (instead of all options).
No backport needed.
The BUG_ON() statement that was added in 9bdea51 ("MINOR: log/cbor:
_lf_cbor_encode_byte() explicitly requires non-NULL ctx") isn't
sufficient as Coverity still thinks the lf_buildctx itself may be NULL
as shown in GH #2554. In fact the original reports complains about the
lf_buildctx itself and I didn't understand it properly, let's add another
check in the BUG_ON() to ensure both cbor_ctx and cbor_ctx->ctx are not
NULL since it is not expected if used properly.
In 98b44e8 ("BUG/MINOR: log: fix global lf_expr node options behavior"),
I properly restored global node options behavior for when encoding is
not used, however the fix is not optimal when encoding is involved:
Indeed, encoding logic in sess_build_logline() relies on global node
options to know if encoding must be handled expression-wide or
individually. However, because of the above fix, if an expression is
made of 1 or multiple nodes that all set an encoding option manually
(without '%o'), we consider that the option was set globally, but
that's probably not what the user intended. Instead we should only
evaluate global options from '%o', so that it remains possible to
skip global encoding when needed.
No backport needed.
When CBOR encoding was added in c614fd3b9 ("MINOR: log: add +cbor encoding
option"), in LOGCHAR(), we forgot to check that we don't assign the NULL
value to tmplog (as we assume that tmplog cannot be NULL at the end of
sess_build_logline())
No backport needed.
In sess_build_logline(), contrary to what's stated in the comment
"only consider global ctx for key encoding", we check for
LOG_OPT_ENCODE flag on the current ctx options instead of global
ones. Because of this, we could end up doing the wrong thing if the
previous node had encoding enabled but it isn't set globally for
instance.
To fix the issue, let's simply check the presence of the flag on
g_options before entering the "key encoding" block.
This bug was introduced with 3f7c8387 ("MINOR: log: add +json encoding
option"), no backport needed.
The ocsp_update.http_proxy global option allows to set an HTTP proxy
address which will be used to send the OCSP update request with an
absolute form URI.
LF_NODE_WITH_OPT(node) returns true if the node's option may be set and
thus should be considered. Logic is based on logformat node's type:
for now only TAG and FMT nodes can be configured.
In 507223d5 ("MINOR: log: global lf_expr node options"), a mistake was
made because it was assumed that only the last occurence of %o
(LOG_FMT_GLOBAL) should be kept as global node options.
However, although not documented, it is possible to have multiple %o
within a single logformat expression to change the global settings on the
fly.
For instance, consider this example:
log-format "%{+X}o test1=%ms %{-X}o test2=%ms %{+X}o test3=%ms"
Prior to 3f2e8d0ed ("MEDIUM: log: lf_* build helpers now take a ctx
argument"), this would output something like this:
test1=18B test2=395 test3=18B
This is because global options is properly updated as the lf_expr string
is parsed. But now due to 507223d5 and 3f2e8d0ed, only the last %o
occurence is considered. With the above example, this gives:
test1=18B test2=18B test3=18B
To restore historical behavior, let's partially revert 507223d5: to
compute global node options, we now start with all options enabled and
then for each configurable node in lf_expr_postcheck(), we keep options
common to the current node and previous nodes using AND masking, this way
we really end up with options common to all nodes.
No backport needed.
As shown in GH #2550, Coverity is tempted to think that NULL-dereference
can occur in _lf_cbor_encode_byte() due to user-ctx being dereferenced
from cbor_ctx, while coverity thinks that cbor_ctx may be NULL.
In practise this cannot happen, because _lf_cbor_encode_byte() is
only leveraged through a function pointer that is set in conjunction with
the function pointer ctx (which ain't NULL). All this logic is done inside
lf_buildctx_prepare() when LOG_OPT_ENCODE_CBOR is set.
Since coverity doesn't seem to understand the logic properly, then it
might as well confuse humans, so let's make it clear in
_lf_cbor_encode_byte() that we expect non-NULL ctx by adding a BUG_ON()
Rename e_byte_fct to e_fct_byte and e_fct_byte_ctx to e_fct_ctx, and
adjust some comments to make it clear that e_fct_ctx is here to provide
additional user-ctx to the custom cbor encode function pointers.
For now, only e_fct_byte function may be provided, but we could imagine
having e_fct_int{16,32,64}() one day to speed up the encoding when we
know we can encode multiple bytes at a time, but for now it's not worth
the hassle.
Update parse_stat_line() used during stats-file parsing. For each line,
GUID is extracted first to access to the object instance. obj_type()
is then invoked to retrieve the correct object type.
Replace objt_* by __objt_* macros to mark its result as safe and non
NULL.
This should fix coverity report from github issue #2550.
No need to backport.
If a response varies on anything other than accept-encoding (origin or
referer) but still contains an 'Encoding' header, the cached responses
were never sent back.
This is because of the 'set_secondary_key_encoding' call that always
filled the accept-encoding part of the secondary signature with the
response's actual encoding, regardless of whether the response varies on
this or not. This meant that the accept-encoding part of the signature
could be non-null in the cached entry which made the
'get_secondary_entry' calls in 'http_action_req_cache_use' always fail
because in those cases the request's secondary signature always had a
null accept-encoding part.
This patch can be backported up to branch 2.4.
As previously discussed, "srv-unused" is sufficiently ambiguous to cause
some trouble over the long term. Better use "srv-removable" to indicate
that the server is removable, and if the conditions to delete a server
change over time, the wait condition will be adjusted without renaming
it.
I just added a new setting to set the number of reserved buffer, to
discover we already had one... Let's move the parsing of this keyword
(tune.buffers.reserve) and tune.buffers.limit to dynbuf.c where they
should be.
CBOR in hex format as implemented in previous commit is convenient because
the produced output is portable and can easily be embedded in regular
syslog payloads.
However, one of the goal of CBOR implementation is to be able to produce
"Concise Binary" object representation. Here is an excerpt from cbor.io
website:
"Some applications also benefit from CBOR itself being encoded in
binary. This saves bulk and allows faster processing."
Currently we don't offer that with '+cbor', quite the opposite actually
since a text string encoded with '+cbor' option will be larger than a
text string encoded with '+json' or without encoding at all, because for
each CBOR binary byte, 2 characters will be emitted.
Hopefully, the sink/log API allows for binary data to be passed as
parameter, this is because all relevant functions in the chain don't rely
on the terminating NULL byte and take a string pointer + string length as
parameter. We can actually rely on this property to support the '+bin'
option when combined with '+cbor' to produce RAW binary CBOR output.
Be careful though, as this is only intended for use with set-var-fmt or to
send binary data to capable UDP/ring endpoints.
Example:
log-format "%{+cbor,+bin}o %(test)[bin(00AABB)]"
Will produce:
bf64746573745f4300aabbffff
(output was piped to `hexdump -ve '1/1 "%.2x"'` to dump raw bytes as HEX
characters)
With cbor.me pretty printer, it gives us:
BF # map(*)
64 # text(4)
74657374 # "test"
5F # bytes(*)
43 # bytes(3)
00AABB # "\u0000\xAA\xBB"
FF # primitive(*)
FF # primitive(*)
In this patch, we make use of the CBOR (RFC8949) encode helper functions
from the previous commit to implement '+cbor' encoding option for log-
formats. The logic behind it is pretty similar to '+json' encoding option,
except that the produced output is a CBOR payload written in HEX format so
that it remains compatible to use this with regular syslog endpoints.
Example:
log-format "%{+cbor}o %[int(4)] test %(named_field)[str(ok)]"
Will produce:
BF6B6E616D65645F6669656C64626F6BFF
Detailed view (from cbor.me):
BF # map(*)
6B # text(11)
6E616D65645F6669656C64 # "named_field"
62 # text(2)
6F6B # "ok"
FF # primitive(*)
If the option isn't set globally, but on a specific node instead, then
only the value will be encoded according to CBOR specification.
Example:
log-format "test cbor bool: %{+cbor}[bool(true)]"
Will produce:
test cbor bool: F5
Add cbor helpers to encode strings (bytes/text) and integers according to
RFC8949, also add cbor_encode_ctx struct to pass encoding options such as
how to encode a single byte.
In this patch, we add the "+json" log format option that can be set
globally or per log format node.
What it does, it that it sets the LOG_OPT_ENCODE_JSON flag for the
current context which is provided to all lf_* log building function.
This way, all lf_* are now aware of this option and try to comply with
JSON specification when the option is set.
If the option is set globally, then sess_build_logline() will produce a
map-like object with key=val pairs for named logformat nodes.
(logformat nodes that don't have a name are simply ignored).
Example:
log-format "%{+json}o %[int(4)] test %(named_field)[str(ok)]"
Will produce:
{"named_field": "ok"}
If the option isn't set globally, but on a specific node instead, then
only the value will be encoded according to JSON specification.
Example:
log-format "{ \"manual_key\": %(named_field){+json}[bool(true)] }"
Will produce:
{"manual_key": true}
When the option is set, +E option will be ignored, and partial numerical
values (ie: because of logasap) will be encoded as-is.
Support '+bin' option argument on logformat nodes to try to preserve
binary output type with binary sample expressions.
For this, we rely on the log/sink API which is capable of conveying binary
data since all related functions don't search for a terminating NULL byte
in provided log payload as they take a string pointer and a string length
as argument.
Example:
log-format "%{+bin}o %[bin(00AABB)]"
Will produce:
00aabb
(output was piped to `hexdump -ve '1/1 "%.2x"'` to dump raw bytes as HEX
characters)
This should be used carefully, because many syslog endpoints don't expect
binary data (especially NULL bytes). This is mainly intended for use with
set-var-fmt actions or with ring/udp log endpoints that know how to deal
with such binary payloads.
Also, this option is only supported globally (for use with '%o'), it will
not have any effect when set on an individual node. (it makes no sense to
have binary data in the middle of log payload that was started without
binary data option)
Providing no_escape_map as <map> argument to _lf_encode_bytes() function
will make the function skip escaping since the map is empty.
This is for convenience, as it might be useful to call lf_encode_chunk()
to encoding binary data without escaping it.
In sess_build_logline(), for sample expression nodes, instead of directly
calling sample_fetch_as_type(... SMP_T_STR), let's first process the
sample using sample_process(), and then proceed with the conversion to
str if required.
Doing so will allow us to implement type casting and preserving logic.
Add internal lf_buildctx struct that is only used inside
sess_build_logline() scope and is passed to lf_* log building helpers
to expose current building context. For now, node options and the in_text
counter are stored in the ctx struct. Thanks to this change, lf_* building
functions don't depend on a logformat_node struct pointer, and may be used
in a standalone manner as long as a build context is provided.
Also, global options are now handled explictly in sess_build_logline() to
make sure that global options are always considered even if they were not
duplicated on every nodes.
No functional change should be expected.
lf_encode_string() and lf_encode_chunk() function are pretty similar. The
only difference is the stopping behavior, encode_chunk stops at a given
position while encode_string stops when encountering '\0'. Moreover,
both functions leverage tools.c encode helpers, but because of the
LOG_OPT_ESC option, they reimplement those helpers with added logic.
Instead of having to deal with code duplication which makes both functions
harder to maintain, let's define a _lf_encode_bytes() helper function
which satisfies lf_encode_string() and lf_encode_chunk() needs while
keeping the function as simple as possible.
_lf_encode_bytes() itself is made of multiple static inline helper
functions, in the attempt to keep checks outside of core loop for
better performance.
There is no need to expose such functions since they are only involved in
the log building process that occurs inside sess_build_logline().
Making functions static and removing their public prototype to ease code
maintenance.
Rename LOGQUOTE_{START,END} macros to more generic LOG_VARTEXT_{START,END}
in order to prepare for new encoding types that rely on specific treatment
for variable-length texts. No functional change should be expected.
Build fixed-length strings for %ts and %tsc to be able to print them
using lf_rawtext_len(), this way it will be easier to encode them
when new encoding options will be added.
No functional change should be expected.
Same as the previous commit, but for ip and port oriented values when
+X option is provided.
No functional change should be expected.
Because of this patch, we add a little overhead because we first generate
the text into a temporary variable and then use lf_rawtext() to print it.
Thus we have a double-copy, and this could have some performance
implications that were not yet evaluated. Due to the small number of bytes
that can end up being copied twice, we could be lucky and have no visible
performance impact, but if we happen to see a significant impact, it could
be useful to add a passthrough mechanism (to keep historical behavior)
when no encoding is involved.
Make use of the previous commit to print strings that should not be
modified.
For instance, when +X option is provided, we have to print numerical
values in ASCII HEX form. For that, we used snprintf() to output the
result to the log output buffer directly, but now we build the string in
a temporary buffer of fixed-size and then print it using lf_rawtext()
which will take care of encoding options.
Because of this patch, we add a little overhead because we first generate
the text into a temporary variable and then use lf_rawtext() to print it.
Thus we have a double-copy, and this could have some performance
implications that were not yet evaluated. Due to the small number of bytes
that can end up being copied twice, we could be lucky and have no visible
performance impact, but if we happen to see a significant impact, it could
be useful to add a passthrough mechanism (to keep historical behavior)
when no encoding is involved.
Don't directly call functions that take date as argument and output the
string representation to the log output buffer under sess_build_logline(),
and instead build the strings in temporary buffers of fixed size
(hopefully such functions, such as date2str_log() and gmt2str_log()
procuce strings of known size), and then print the result using
lf_rawtext() helper function. This way, we will be able to encode them
automatically as regular string/text when new encoding methods are added.
Because of this patch, we add a little overhead because we first generate
the text into a temporary variable and then use lf_rawtext() to print it.
Thus we have a double-copy, and this could have some performance
implications that were not yet evaluated. Due to the small number of bytes
that can end up being copied twice (< 30), we could be lucky and have no
visible performance impact, but if we happen to see a significant impact,
it could be useful to add a passthrough mechanism (to keep historical
behavior) when no encoding is involved.
similar to lf_text_{len}, except that quoting and mandatory options are
ignored. Use this to print the input string without any modification (
except for encoding logic).
Wrap ltoa(), lltoa(), ultoa() and utoa_pad() functions that are used by
sess_build_logline() to print numerical values by implementing a dedicated
helper named lf_int() that takes <dft_hld> as argument to know how to
write the integer by default (when no encoding is specified).
LF_INT_UTOA_PAD_4 is used to emulate utoa_pad(x, 4) since it's found only
once under sess_build_logline(), thus there is no need to pass an extra
parameter to lf_int() function.
Reminder:
Since 3.0-dev4, we can optionally give a name to logformat nodes:
log-format "%(custom_name1)B %(custom_name2)[str(value)]"
But we may also optionally set the expected node type by appending
':type' after the name, type being either sint,str or bool, like this:
log-format "%(string_as_int:sint)[str(14)]"
However, it is currently not possible to provide a type without providing
a name that is a least 1 char long. But it could be useful to provide a
type without setting a name, like this, for typecasting purposes only:
log-format "%(:sint)[bool(true)]"
Thus in order to allow this usage, don't set node->name if node name is
not at least 1 character long. By doing so, node->name will remain NULL
and will not be considered, but the typecast setting will.
make sess_build_logline() switch case more readable by performing some
simplifications: complex values are first extracted in a temporary
variable so that it's easier to refer to them and at a single place.
Thanks to 8226e92eb ("BUG/MINOR: tools/log: invalid
encode_{chunk,string} usage"), we only need to check for NULL return
value from encode_{chunk,string}() and escape_string() to know if the
call failed.
Since the introduction of the automatic seamless reload using the
internal socketpair, there is no way of disabling the seamless reload.
Previously we just needed to remove -x from the startup command line,
and remove any "expose-fd" keyword on stats socket lines.
This was introduced in 2be557f7c ("MEDIUM: mworker: seamless reload use
the internal sockpairs").
The patch copy /dev/null again and pass it to the next exec so we never
try to get socket from the -x.
Must be backported as far as 2.6.
This commit is the final to implement preloading of haproxy internal
counters via stats-file parsing.
Define a global keyword "stats-file". It allows to specify the path to
the stats-file which will be parsed on process startup.
This patch implement parsing of counter values line from stats-file. It
reuses domain context previously set by the last header line. Each
value is separated by ',' character, relative to the list of column
names describe by the header line.
This is implemented via static function parse_stat_line(). It first
extract a GUID and retrieve the object instance. Then each numerical
value is parsed and object counters updated. For the moment, only U64
counters metrics is supported. parse_stat_line() is called on each line
until a new header line is found.
This patch implements parsing of headers line from stats-file.
A header line is defined as starting with '#' character. It is directly
followed by a domain name. For the moment, either 'fe' or 'be' is
allowed. The following lines will contain counters values relatives to
the domain context until the next header line.
This is implemented via static function parse_header_line(). It first
sets the domain context used during apply_stats_file(). A stats column
array is generated to contains the order on which column are stored.
This will be reused to parse following lines values.
If an invalid line is found and no header was parsed, considered the
stats-file as ill formatted and stop parsing. This allows to immediately
interrupt parsing if a garbage file was used without emitting a ton of
warnings to the user.
This commit is the first one of a serie to implement preloading of
haproxy counters via stats-file parsing.
This patch defines a basic apply_stats_file() function. It implements
reading line by line of a stats-file without any parsing for the moment.
It is called automatically on process startup via init().
Extract GUID format validation in a dedicated function named
guid_is_valid_fmt(). For the moment, it is only used on guid_insert().
This will be reused when parsing stats-file, to ensure GUID has a valid
format before tree lookup.
Define a new CLI command "dump stats-file" with its handler
cli_parse_dump_stat_file(). It will loop twice on proxies_list to dump
first frontend and then backend side. It reuses the common function
stats_dump_stat_to_buffer(), using STAT_F_BOUND to restrict on the
correct side.
A new module stats-file.c is added to regroup function specifics to
stats-file. It defines two main functions :
* stats_dump_file_header() to generate the list of column list prefixed
by the line context, either "#fe" or "#be"
* stats_dump_fields_file() to generate each stat lines. Object without
GUID are skipped. Each stat entry is separated by a comma.
For the moment, stats-file does not support statistics modules. As such,
stats_dump_*_line() functions are updated to prevent looping over stats
module on stats-file output.
Prepare stats function to handle a new format labelled "stats-file". Its
purpose is to generate a statistics dump with a format closed from the
CSV output. Such output will be then used to preload haproxy internal
counters on process startup.
stats-file output differs from a standard CSV on several points. First,
only an excerpt of all statistics is outputted. All values that does not
make sense to preload are excluded. For the moment, stats-file only list
stats fully defined via "struct stat_col" method. Contrary to a CSV, sll
columns of a stats-file will be filled. As such, empty field value is
used to mark stats which should not be outputted.
Some adaptation specifics to stats-file are necessary into
me_generate_field(). First, stats-file will output separatedly values
from frontend and backend sides with their own respective set of
columns. As such, an empty field value is returned if stat is not
defined for either frontend/listener, or backend/server when outputting
the other side. Also, as stats-file does not support empty column,
stcol_hide() is not used for it.
A minor adjustement was necessary for stats_fill_fe_line() to pass
context flags. This is necessary to detect stat output format. All other
listener/server/backend corresponding functions already have it.
Convert most of proxy counters statistics to new "struct stat_col"
definition. Remove their corresponding switch..case entries in
stats_fill_*_line() functions. Their value are automatically calculate
via me_generate_field() invocation.
Along with this, also complete stcol_hide() when some stats should be
hidden.
Only a few counters where not converted. This is because they rely on
values stored outside of fe/be_counters structure, which
me_generate_field() cannot use for now.
Metric style stats can be automatically calculate since the introduction
of metric_generate() when using "struct stat_col" as input. This would
allow to centralize statistics generation. However, some stats are not
outputted under specific condition. For example, health check failures
on a server are only reported if checks are active.
To support this, define a new function metric_hide(). It is called by
metric_generate(). If true, it will skip metric calcuation and return an
empty field value instead. This allows to define "stat_col" metrics and
calculate them with metric_generate() but hiding them under certain
circumstances.
This commit is a direct follow-up of the previous one which define a new
type "struct stat_col" to fully define a statistic entry.
Define a new function metric_generate(). For metrics statistics, it is
able to automatically calculate a stat value field for "offsets" from
"struct stat_col". Use it in stats_fill_*_stats() functions. Maintain a
fallback to previously used switch-case for old-style statistics.
This commit does not introduce functional change as currently no
statistic is defined as "struct stat_col". This will be the subject of a
future commit.
Previously, statistics were simply defined as a list of name_desc, as
for example "stat_cols_px" for proxy stats. No notion of type was fixed
for each stat definition. This correspondance was done individually
inside stats_fill_*_line() functions. This renders the process to
define new statistics tedious.
Implement a more expressive stat definition method via a new API. A new
type "struct stat_col" for stat column to replace name_desc usage is
defined. It contains a field to store the stat nature and format. A
<cap> field is also defined to be able to define a proxy stat only for
certain type of objects.
This new type is also further extended to include counter offsets. This
allows to define a method to automatically generate a stat value field
from a "struct stat_col". This will be the subject of a future commit.
New type "struct stat_col" is fully compatible full name_desc. This
allows to gradually convert stats definition. The focus will be first
for proxies counters to implement statistics preservation on reload.
The name "metrics" was chosen to represent the various list of haproxy
exposed statistics. However, it is deemed as ambiguous as some stats are
indeed metric in the true sense, but some are not, as highlighted by
various "enum field_origin" values.
Replace it by the new name "stat_cols" for statistic columns. Along with
the already existing notion of stat lines it should better reflect its
purpose.
When a resync is needed, a local resync is first tried and if it does not
work, a remote resync is tried. It happens when the worker is started for
instance. There is a timeout to wait for the local resync, except for the
first start. And if the local resync fails or times out, the same timeout
is applied to the remote resync. This one is always applied, even if there
is no remote peer.
On the other hand, on reload, if the old worker has never performed its
resync, it does not try to resync the new worker. And here there is an
issue. On the first reload, when there is no remote peer, we must wait for
the resync timeout expiration to have a chance to resync the new worker. If
the reload happens too early, there is no resync at all. Concretly, after a
fresh start, if a reload happens in the first 5 seconds, there is no resync
with the new worker. The issue only concerns the first reload and affects
the second worker.
To fix the issue, we must only skip the remote resync if there is no remote
peer. This way, on a fresh start, the worker is immediately considered as
resync. The local reynsc is skipped because it is the first worker and the
remote resync is skipped because there is no remote peer.
This patch must be backported to all stable versions.
In loops on the peer list in the code, the 'ps' variable was used as a
shortcut for the peer session. However, if mays be confusing with the peers
section too. So, all occurrences to 'ps' variable were renamed to 'peer'.
Peers flags are mainly used from the sync task. At least, it is only updated
by the sync task. However, there is one place where a peer may read these
flags, when the message marking the end of a synchro is sent.
So to be sure the value retrieved at this place is consistent, we must use
an atomic operation to read it. And of course, from the sync task, atomic
operations must be used to update peers flags. However, from the sync task,
there is no reason to use atomic operations to read flags because they
cannot be update from somewhere eles.
When a process is reloaded, the old process must performed a synchronisation
with the new process. To do so, the sync task notify the local peer to
proceed and waits. Internally, the sync task used PEERS_F_DONOTSTOP flag to
know it should wait. However, this flag was only set/unset in a single
function. There is no real reason to set a flag to do so. A static variable
set to 1 when the resync starts and to 0 when it is finished is enough.
Peers flags were renamed and reordered, mainly to move flags used for
debugging purpose at the end.
PEERS_F_RESYNC_LOCAL and PEERS_F_RESYNC_REMOTE were also renamed to
PEERS_F_RESYNC_LOCAL_FINISHED and PEERS_F_RESYNC_REMOTE_FINISHED to be clear
on the fact the operation is finished when the flag is set.
There are too many holes in peer flags. So let's reorder them. In addition,
PEER_F_RESYNC_REQUESTED flag was renamed to PEER_F_DBG_RESYNC_REQUESTED to
clearly state it is a flag set for debugging purpose.
Finally, PEER_TEACH_RESET was replaced by PEER_TEACH_FLAGS and the bitwise
complement operator is now used on lines updating the peer flags. It is a
far more common way to do (in HAProxy code at least) and less surprising.
PEERS_F_TEACH_COMPLETE flag is only used for the old local peer to let the
sync task know it can stop waiting during a soft-stop. So it is less
confusing to rename this flag to clearly state it concerns local peer only.
A local peer assigned for leaning can immediately start to learn, without
sending any request. So we can do that first, before receiving
messages. This way, only PEER_LR_ST_PROCESSING state is evaluating when
received messages are processed.
In addition, when the resync request is sent, we are sure it is for a remote
peer.
Some flags were used to define the learn state of a peer. It was a bit
confusing, especially because the learn state of a peer is manipulated from
the peer applet but also from the sync task. It is harder to understand the
transitions if it is based on flags than if it is based a dedicated state
based on an enum. It is the purpose of this patch.
Now, we can define the following rules regarding this learn state:
* A peer is assigned to learn by the sync task
* The learn state is then changed by the peer itself to notify the
learning is in progress and when it is finished.
* Finally, when the peer finished to learn, the sync task must acknowledge
it by unassigning the peer.
This patch is a cleanup of the recent change about the relation between a
peer and the applet used to deal with I/O. Three flags was introduced to
reflect the peer applet state as seen from outside (from the sync task in
fact). Using flags instead of true states was in fact a bad idea. This work
but it is confusing. Especially because it was mixed with LEARN and TEACH
peer flags.
So, now, to make it clearer, we are now using a dedicated state for this
purpose. From the outside, the peer may be in one of the following state
with respects of its applet:
* the peer has no applet, it is stopped (PEER_APP_ST_STOPPED).
* the peer applet was created with a validated connection from the protocol
perspective. But the sync task must synchronized it with the peers
section. It is in starting state (PEER_APP_ST_STARTING).
* The starting starting was acknowledged by the sync task, the peer applet
can start to process messages. It is in running state
(PEER_APP_ST_RUNNING).
* The last peer applet was released and the associated connection
closed. But the sync task must synchronized it with the peers section. It
is in stopping state (PEER_APP_ST_STOPPING).
Functionnaly speaking, there is no true change here. But it should be easier
to understand now.
In addition to these changes, __process_peer_state() function was renamed
sync_peer_app_state().
Recently, some peer flags were added to deal with the connection state
(PEER_F_ST_*). 3 states were added:
* RELEASED: Set when we forced to shutdown the peer session and no new
session was created yet.
* CONNECTED: Set when the peer has established connection and validated it
from the peer protocol point of view
* ACCEPTED: Set when the peer has accepted a connection and validated it
from the peer protocol point of view
However, management of these pseudo states is a bit confusing. And it
appears there is no reason to have 2 flags to express there is a validated
peer session. CONNECTED state was used for a peer session on the frontend
side while ACCEPTED state was used for a peer session on the backend side.
So, there is now only one "connected" state and we test if the applet was
created on the frontend or the backend side to decide what to do, in
addition to the fact the peer is local or remote.
It is a transitionnal patch. True states will be created to deal with all
this stuff and corresponding flags will be removed.
This patch depends on the commit "MINOR: applet: Add a function to know the
sidde where an applet was created".
When a learning process is finished, partially or not, the event must be
processed by the sync task. It is important for the peer applet to wait in
this case, especially if the same peer is teaching to another peer, to be
sure to send the right resync finished message (full or partial).
Thanks to the previous patch, we can set PEER_F_WAIT_SYNCTASK_ACK flag on
the peer when a PEER_MSG_CTRL_RESYNCPARTIAL or PEER_MSG_CTRL_RESYNCFINISHED
message is received to be sure to stop the processing. Of course, we must
also take care to wake the peer up after having acknowledged the learn
status from the sync task.
This patch depends on the commit "BUG/MEDIUM: peers: Wait for sync task ack
when a resynchro is finished". Both must be backported if commit 9425aeaffb
("BUG/MAJOR: peers: Update peers section state from a thread-safe manner")
is backported.
Since recent fixes on peers, some changes on a peer must be acknowledged
by the sync task before letting the peer applet processing messages.
Blocking conditions was based on a combination of flags. It was
errorprone. So, this patch introduces PEER_F_WAIT_SYNCTASK_ACK peer flag for
this purpose. This flag is set by the peer when it must wait for an ack from
the sync task. This sync task, on its side, must remove it and wake the peer
up.
The TEACH flags only concerns the peer applet. There is no reason to set it
from the sync task. It is confusing. And at the end, after some
refactoring/fixes, setting these flags directly from the peer applet will
allow us to immediatly performing the corresponding teach processing, while
for now we must wait the sync task acknowledges the changes.
This flag was used for debugging purpose to know a resync was requested at
least once in the process life. Since the last bunch of fixes about the
peers locking mechanism, this info is now set per-peer. There is no reason
to still have it on peers too. So, just remove it.
When a session is shut down, the peer is switched in released state
(PEER_F_ST_RELEASED) and the sync task must process it to eventually
perform some clean up, in case the peer was assigned to learn.
However, this was only true when the session was shut down from the peer
applet itself. This was not performed when it was shut down from the sync
task. It is now fixed.
The previous fix (c0b2015aae "BUG/MEDIUM: peers: Don't set
PEERS_F_RESYNC_PROCESS flag on a peer") was made due to lack of knowledge on
the peers. A local peer, when assigned to learn, must start to learn
immediately without sending any request. This happens on reload.
Thus, in this case, the PEER_F_LEARN_PROCESS flag must be set with
PEER_F_LEARN_ASSIGN flag from the sync task.
This patch must only be backported if the above commit is backported.
Instead of leaving the hard-coded non-trivial operations in the H1
parsing code, let's just rely on the new intops functions that do the
same and that are less prone to being accidentally touched. It was
verified that the resulting code is exactly the same.
In 1.7 with commit 5f10ea30f4 ("OPTIM: http: improve parsing performance
of long URIs") we improved the URI parser's performance on platforms
supporting unaligned accesses by reading 4 chars at a time in a 32-bit
word. However, as reported in GH issue #2545, there's a bug in the way
the top bytes are checked, as the parser will stop when all 4 of them
are above 7e instead of when one of them is, so certain patterns can be
accepted through if the last ones are all valid. The fix requires to
negate the value but on the other hand it allows to parallelize some of
the tests and fuse the masks, which could even end up slightly faster.
This needs to be backported to all stable versions, but be careful, this
code moved a lot over time, from proto_http.c to h1.c, to http_msg.c, to
h1.c again. Better just grep for "24242424" or "21212121" in each version
to find it.
Big kudos to Martijn van Oosterhout (@kleptog) for spotting this problem
while analyzing that piece of code, and reporting it.
From Linux 5.17, anonymous regions can be name via prctl/PR_SET_VMA
so caches can be identified when looking at HAProxy process memory
mapping.
The most possible error is lack of kernel support, as a result
we ignore it, if the naming fails the mapping of memory context
ought to still occur.
Since 3.0-dev7 with commit 1a088da7c2 ("MAJOR: stktable: split the keys
across multiple shards to reduce contention"), building without threads
yields a warning about the shard not being used. This is because the
locks API does nothing of its arguments, which is the only place where
the shard is being used. We cannot modify the lock API to pretend to
consume its argument because quite often it's not even instantiated.
Let's just pretend we consume shard using an explict ALREADY_CHECKED()
statement instead. While we're at it, let's make sure that XXH32() is
not called when there is a single bucket!
No backport is needed.
Unlike the muxes, the applets have the responsibility to notify the SC if
they have more data to deliver to the stream. The same is done to notify the
SC that applets must be woken up ASAP to continue some processing. When an
applet is woken up, we pretend it has no more data to deliver by setting
SE_FL_HAVE_NO_DATA flag. If the applet removes this flag, we must take care
to not set it again just after. Otherwise, the applet may remain blocked if
there is no other condition to wake it up.
It is an issue for the applets using their own buffers because
SE_FL_HAVE_NO_DATA is erroneously set in sc_applet_recv() function, after
the applet execution. For instance, it happens for the cli applet when a
huge map is cleared. No data are delivered to the stream but we pretend it
is the case to clear the map per batches.
This patch should fix the issue #2543. No Backported needed.
Some flags are defined during statistics generation and output. They use
the prefix STAT_* which is also used for other purposes. Rename them
with the new prefix STAT_F_* to differentiate them from the other
usages.
Several unique names were used for different purposes under statistics
implementation. This caused the code to be difficult to understand.
* stat/stats name is removed when a more specific name could be used
* restrict field usage to purely refer to <struct field> which
represents a raw stat value.
* use "line" naming to represent an array of <struct field>
Info are used to expose haproxy global metrics. It is similar to proxy
statistics and any other module. As such, rename info indexes using
SI_I_INF_* prefix. Also info variable is renamed stat_line_info.
Thanks to this, naming is now consistent between info and other
statistics. It will help to integrate it as a "global" statistics
module.
Statistics were extended with the introduction of stats module. This
mechanism allows to expose various metrics for several haproxy
components. As a consequence of this, some static variables were
transformed to dynamic ones to be able to regroup all statistics
definition.
Rename these variables with more explicit naming :
* stat_lines can be used to generate one line of statistics for any
module using struct field as value
* metrics and metrics_len are used to stored description of metrics
indexed by module
Note that info is not integrated in the statistics module mechanism.
However, it could be done in the future to better reflect its purpose.
This commit is the first one of a serie which adjust naming convention
for stats module. The objective is to remove ambiguity and better
reflect how stats are implemented, especially since the introduction of
stats module.
This patch renames elements related to proxies statistics. One of the
main change is to rename ST_F_* statistics indexes prefix with the new
name ST_I_PX_*. This remove the reference to field which represents
another concept in the stats module. In the same vein, global
stat_fields variable is renamed metrics_px.
This commit is part of a series to align counters usage between
frontends/listeners on one side and backends/servers on the other.
On frontend side, "stot" is the total count of sessions for both proxies
and listeners. For proxies, fe_counters <cum_sess> is correctely used.
The bug is on listeners where <cum_conn> value is returned, which
instead indicates a number of connection. This commit fixes this by
returning <cum_sess> counter value for "stot" metric.
Along this fixes, use the opportunity to report "conn_tot" for listeners
using <cum_conn> value, as for frontend proxies.
This commit fixes a bug but must not be backported as stats output is
changed.
This commit is part of a serie to align counters usage between
frontends/listeners on one side and backends/servers on the other.
"stot" metric refers to the total number of sessions. On backend side,
it is interpreted as a number of streams. Previously, this was accounted
using <cum_sess> be_counters field for servers, but <cum_conn> instead
for backend proxies.
Adjust this by using <cum_sess> for both proxies and servers. As such,
<cum_conn> field can be removed from be_counters.
Note that several diagnostic messages which reports total frontend and
backend connections were adjusted to use <cum_sess>. However, this is an
outdated and misleading information as it does reports streams count on
backend side. These messages should be fixed in a separate commit.
This should be backported to all stable releases.
This commit is the first one of a series which aims to align counters
usage between frontends/listeners on one side and backends/servers on
the other.
Remove <down_trans> field from proxy structure. Use instead the same
name field from be_counters structure, which is already used for
servers.
32bits build was broken because of wrong printf length modifier.
src/ssl_ckch.c:4144:66: error: format specifies type 'long' but the argument has type 'unsigned int' [-Werror,-Wformat]
4143 | memprintf(err, "parsing [%s:%d] : cannot parse '%s' value '%s', too long, max len is %ld.\n",
| ~~~
| %u
4144 | file, linenum, args[cur_arg], args[cur_arg + 1], sizeof(alias_name));
| ^~~~~~~~~~~~~~~~~~
src/ssl_ckch.c:4217:64: error: format specifies type 'long' but the argument has type 'unsigned int' [-Werror,-Wformat]
4216 | memprintf(err, "parsing [%s:%d] : cannot parse '%s' value '%s', too long, max len is %ld.\n",
| ~~~
| %u
4217 | file, linenum, args[cur_arg], args[cur_arg + 1], sizeof(alias_name));
| ^~~~~~~~~~~~~~~~~~
2 errors generated.
make: *** [Makefile:1034: src/ssl_ckch.o] Error 1
make: *** Waiting for unfinished jobs....
Replace %ld by %zd.
Should fix issue #2542.
Limiting total allocatable process memory (VSZ) via setting RLIMIT_AS limit is
no longer effective, in order to restrict memory consumption at run time.
We can see from process memory map below, that there are many holes within
the process VA space, which bumps its VSZ to 1.5G. These holes are here by
many reasons and could be explaned at first by the full randomization of
system VA space. Now it is usually enabled in Linux kernels by default. There
are always gaps around the process stack area to trap overflows. Holes before
and after shared libraries could be explained by the fact, that on many
architectures libraries have a 'preferred' address to be loaded at; putting
them elsewhere requires relocation work, and probably some unshared pages.
Repetitive holes of 65380K are most probably correspond to the header that
malloc has to allocate before asked a claimed memory block. This header is
used by malloc to link allocated chunks together and for its internal book
keeping.
$ sudo pmap -x -p `pidof haproxy`
127136: ./haproxy -f /home/haproxy/haproxy/haproxy_h2.cfg
Address Kbytes RSS Dirty Mode Mapping
0000555555554000 388 64 0 r---- /home/haproxy/haproxy/haproxy
00005555555b5000 2608 1216 0 r-x-- /home/haproxy/haproxy/haproxy
0000555555841000 916 64 0 r---- /home/haproxy/haproxy/haproxy
0000555555926000 60 60 60 r---- /home/haproxy/haproxy/haproxy
0000555555935000 116 116 116 rw--- /home/haproxy/haproxy/haproxy
0000555555952000 7872 5236 5236 rw--- [ anon ]
00007fff98000000 156 36 36 rw--- [ anon ]
00007fff98027000 65380 0 0 ----- [ anon ]
00007fffa0000000 156 36 36 rw--- [ anon ]
00007fffa0027000 65380 0 0 ----- [ anon ]
00007fffa4000000 156 36 36 rw--- [ anon ]
00007fffa4027000 65380 0 0 ----- [ anon ]
00007fffa8000000 156 36 36 rw--- [ anon ]
00007fffa8027000 65380 0 0 ----- [ anon ]
00007fffac000000 156 36 36 rw--- [ anon ]
00007fffac027000 65380 0 0 ----- [ anon ]
00007fffb0000000 156 36 36 rw--- [ anon ]
00007fffb0027000 65380 0 0 ----- [ anon ]
...
00007ffff7fce000 4 4 0 r-x-- [ anon ]
00007ffff7fcf000 4 4 0 r---- /usr/lib/x86_64-linux-gnu/ld-2.31.so
00007ffff7fd0000 140 140 0 r-x-- /usr/lib/x86_64-linux-gnu/ld-2.31.so
...
00007ffff7ffe000 4 4 4 rw--- [ anon ]
00007ffffffde000 132 20 20 rw--- [ stack ]
ffffffffff600000 4 0 0 --x-- [ anon ]
---------------- ------- ------- -------
total kB 1499288 75504 72760
This exceeded VSZ makes impossible to start an haproxy process with 200M
memory limit, set at its initialization stage as RLIMIT_AS. We usually
have in this case such cryptic output at stderr:
$ haproxy -m 200 -f haproxy_quic.cfg
(null)(null)(null)(null)(null)(null)
At the same time the process RSS (a memory really used) is only 75,5M.
So to make process memory accounting more realistic let's base the memory
limit, set by -m option, on RSS measurement and let's use RLIMIT_DATA instead
of RLIMIT_AS.
RLIMIT_AS was used before, because earlier versions of haproxy always allocate
memory buffers for new connections, but data were not written there
immediately. So these buffers were not instantly counted in RSS, but were
always counted in VSZ. Now we allocate new buffers only in the case, when we
will write there some data immediately, so using RLIMIT_DATA becomes more
appropriate.
The commit 9425aeaffb ("BUG/MAJOR: peers: Update peers section state from a
thread-safe manner") introduced regressions about state transitions of a
peer.
A peer may be in a connected, accepted or released state. Before, changes for
these states were performed synchronously. Since the commit above, changes
are mainly performed in the sync process task.
The first regression was about the released then accepted state transition,
called the renewed state. In reality the state was always crushed by the
accepted state. After some review, the state was just removed to always
perform the cleanup in the sync process task before acknowledging the
connected or accepted states.
Then, a wakeup of the peer applet was missing from the sync process task
after the ack of connected or accepted states, blocking the applet.
Finally, when a peer is in released, connected or accepted state, we must
take care to wait the sync process task wakeup before trying to receive or
send messages.
This patch must only be backported if the above commit is backported.
The bug was introduced by commit 9425aeaffb ("BUG/MAJOR: peers: Update peers
section state from a thread-safe manner"). A peers flags was set on a peer
by error. Just remove it.
This patch must only be backported if the above commit is backported.
We got a detailed report analysis showing that our optimization consisting
in using poll() to detect already closed FDs within a 1024 range has an
issue with the case where 1024 consecutive FDs are open (hence do not show
POLLNVAL) and none of them has any activity report. In this case poll()
returns zero update and we would just skip the loop that inspects all the
FDs to close the valid ones. One visible effect is that the called programs
might occasionally see some FDs being exposed in the low range of their fd
space, possibly making the process run out of FDs when trying to open a
file for example.
Note that this is actually a fix for commit b8e602cb1b ("BUG/MINOR: fd:
make sure my_closefrom() doesn't miss some FDs") that already faced a
more common form of this problem (incomplete but non-empty FDs reported).
This can be backported up to 2.0.
As reported on github issue #2491, there's a very strange situation where
epoll_wait() appears to be reported EPOLLERR only (and not IN/OUT/HUP etc
as normally happens with EPOLLERR), and when connect() is called again to
check the state of the ongoing connection, it returns EALREADY, basically
saying "no news, please wait". This obviously triggers a wakeup loop. For
now it has remained impossible to reproduce this issue outside of the
reporter's environment, but that's definitely something that is impossible
to get out from.
The workaround here is to address the lowest level cause we can act on,
which is to avoid returning to wait if EPOLLERR was returned. Indeed, in
this case we know it will loop, so we must definitely take this one into
account. We only do that after connect() asks us to wait, so that a
properly established connection with a queued error at the end of an
exchange will not be diverted and will be handled as usual.
This should be backported to approximately all versions, at least as far
as 2.4 according to the reporter who observed it there.
Thanks to @donnyxray for their useful captures isolating the problem.
mux-ops .shutr and .shutw callback functions are merged into a unique
functions, called .shut. The shutdown mode is still passed as argument,
muxes are responsible to test it. Concretly, .shut() function of each mux is
now the content of the old .shutw() followed by the content of the old
.shutr().
se_shutdown() function is now used to perform a shutdown on a connection
endpoint and an applet endpoint. The same function is used for
both. sc_conn_shut() function was removed and appctx_shut() function was
updated to only deal with the applet stuff.
It is the same than the previous patch but for applets. Here there is
already only one function. But with this patch, appctx_shut() function was
modified to explicitly get shutdown mode as parameter. In addition
appctx_shutw() was removed.
The SC API to perform shutdowns on connection endpoints was unified to have
only one function, sc_conn_shut(), with read/write shut modes passed
explicitly. It means sc_conn_shutr() and sc_conn_shutw() were removed. The
next step is to do the same at the mux level.
CO_SHR_* and CO_SHW_* modes are in fact used by the stream-connectors to
instruct the muxes how streams must be shut done. It is then the mux
responsibility to decide if it must be propagated to the connection layer or
not. And in this case, the modes above are only tested to pass a boolean
(clean or not).
So, it is not consistant to still use connection related modes for
information set at an upper layer and never used by the connection layer
itself.
These modes are thus moved at the sedesc level and merged into a single
enum. Idea is to add more modes, not necessarily mutually exclusive, to pass
more info to the muxes. For now, it is a one-for-one renaming.
In .shutr and .shutw callback functions, we must rely on the connection
flags (CO_FL_SOCK_RD_SH/WR_SH) to decide to fully close the connection
instead of using sedesc flags. At the end, for the PT multiplexer, it is
equivalent. But it is more logicial and consistent this way.
This patch implements the backend side of the keylog fetches.
The code was ready but needed the SSL message callbacks.
This could be used like this:
log-format "CLIENT_EARLY_TRAFFIC_SECRET %[ssl_bc_client_random,hex] %[ssl_bc_client_early_traffic_secret]\n
CLIENT_HANDSHAKE_TRAFFIC_SECRET %[ssl_bc_client_random,hex] %[ssl_bc_client_handshake_traffic_secret]\n
SERVER_HANDSHAKE_TRAFFIC_SECRET %[ssl_bc_client_random,hex] %[ssl_bc_server_handshake_traffic_secret]\n
CLIENT_TRAFFIC_SECRET_0 %[ssl_bc_client_random,hex] %[ssl_bc_client_traffic_secret_0]\n
SERVER_TRAFFIC_SECRET_0 %[ssl_bc_client_random,hex] %[ssl_bc_server_traffic_secret_0]\n
EXPORTER_SECRET %[ssl_bc_client_random,hex] %[ssl_bc_exporter_secret]\n
EARLY_EXPORTER_SECRET %[ssl_bc_client_random,hex] %[ssl_bc_early_exporter_secret]"
Backend SSL connections never used the ssl_sock_msg_callbacks() which
prevent the use of keylog on the server side.
The impact should be minimum, though it add a major callback system for
protocol analysis, which is the same used on frontend connections.
https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_msg_callback.html
The patch add a call to SSL_CTX_set_msg_callback() in
ssl_sock_prepare_srv_ssl_ctx() the same way it's done for bind lines in
ssl_sock_prepare_ctx().
Add support for crt-base and key-base local keywords for the crt-store.
current_crtbase and current_keybase are filed with a copy of the global
keyword argument when a crt-store is declared, and updated with a new
path when the keywords are in the crt-store section.
The ckch_conf_kws[] array was updated with ¤t_crtbase and
¤t_keybase instead of the global_ssl ones so the parser can use
them.
The keyword must be used before any "load" line in a crt-store section.
Example:
crt-store web
crt-base /etc/ssl/certs/
key-base /etc/ssl/private/
load crt "site3.crt" alias "site3"
load crt "site4.crt" key "site4.key"
frontend in2
bind *:443 ssl crt "@web/site3" crt "@web/site4.crt"
Extract functions related to HTML stats webpage from stats.c into a new
module named stats-html. This allows to reduce stats.c to roughly half
of its original size.
A static variable trash_chunk was used as implicit buffer in most of
stats output function. It was a oneline buffer uses as temporary storage
before emitting to the final applet or CLI buffer.
Replaces it by a buffer defined in show_stat_ctx structure. This allows
to retrieve it in most of stats output function. An additional parameter
was added for the function where context was not already used. This
renders the code cleaner and will allow to split stats.c in several
source files.
As a result of a new member into show_stat_ctx, per-command context max
size has increased. This forces to increase APPLET_MAX_SVCCTX to ensure
pool size is big enough. Increase it to 128 bytes which includes some
extra room for the future.
This patch introduces named crt-store section. A named crt-store allows
to add a scope to the crt name.
For example, a crt named "foo.crt" in a crt-store named "web" will
result in a certificate called "@web/foo.crt".
Now that peers entirely rely on peer->srv for connection settings, and
that it was confirmed that it works properly thanks to previous commit,
let's finish what we started in f6ae258 ("MINOR: peers: rely on srv->addr
and remove peer->addr") and stop using srv->addr to find out peers port
and instead rely on srv->svc_port as it's already done for other proxy
types.
A dumb mistake was made in f6ae25858 ("MINOR: peers: rely on srv->addr
and remove peer->addr"). I completely overlooked the part where the bind
address settings are used as implicit server's address settings when the
peers are declared using the new bind+server config style (which is the
new recommended method to declare peers as it follows the same logic as
the one used in other proxy sections).
As such, the peers synchro fails to work between previous and new process
(localpeer mechanism) upon reload when declaring peers with way:
global
localpeer local
peers mypeers
bind 127.0.0.1:10001
server local
And one has to use the 'old' config style to make it work:
global
localpeer local
peers mypeers
peer local 127.0.0.1:10001
--
To fix the issue, let's explicitly set the server's addr:port
according to the bind's address settings (only the first listener is
considered) when local peer was declared using the 'bind+server' method.
No backport needed.
When a peer applet is pushing updates, we limit the number of update sent at
once via a global parameter to not spend too much time in the applet. On
interrupt, we claimed for more room to be woken up quickly. However, this
statement is only true if something was pushed in the buffer. Otherwise,
with an empty buffer, if the stream itself is not woken up, the applet
remains also blocked because there is no send activity on the other side to
unblock it.
In this case, instead of requesting more room, it is sufficient to state the
applet have more data to send.
This patch must be backported as far as 2.6.
This bug is related to the previous one ("BUG/MEDIUM: spoe: Always retry
when an applet fails to send a frame"). applet_putblk() function retruns -1
on error and it should always be interpreted as a missing of room in the
buffer. However, on the spoe, this was processed as an I/O error.
This patch must be backported as far as 2.8.
The crt-store load line parser relies on offsets of member of the
ckch_conf struct. However the new "alias" keyword as an offset to
-1, because it does not need to be used. Plan was to handle it that way
in the parser, but it wasn't supported yet. So -1 was still used in an
offset computation which was not used, but ASAN could see the problem.
This patch fixes the issue by using a signed type for the offset value,
so any negative value would be skipped. It also introduced a
PARSE_TYPE_NONE for the parser.
No backport needed.
The crt-store load line now allows to put an alias. This alias is used
as the key in the ckch_tree instead of the certificate. This way an
alias can be referenced in the configuration with the '@/' prefix.
This can only be define with a crt-store.
Since the beginning in 2.0 the nevlist parameter was set to 1 before
calling port_getn(), which means that a single FD event will be reported
per polling loop. This is extremely inefficient, and all the code was
designed to use global.tune.maxpollevents. It looks like it's a leftover
of a temporary debugging change. No apparent issues were found by setting
it to a higher value, so better do that.
That code is not much used nowadays with Solaris disappearing from the
landscape, so even if this definitely was a bug, it's preferable not to
backport that fix as it could uncover other subtle bugs that were never
raised yet.
Since 2.0 with commit 0ba4f483d2 ("MAJOR: polling: add event ports
support (Solaris)"), the polling system on Solaris suffers from a
signal handling problem. It turns out that this API is very bizarre,
as reported events are automatically unregistered and their counter
is updated in the same variable that was used to pass the count on
input, making it difficult to handle certain error codes (how should
one handle ENOSYS for example?). And to complete everything, the API
is able to return both EINTR and an event if a signal is reported.
The code tries to deal with certain such cases (e.g. ETIME for timeout
can also report an event), otherwise it defaults to clearing the
event counter upon error. This has the effect that EINTR clears the
list of events, which are also automatically cleared from the set by
the system.
This is visible when using external checks where the SIGCHLD of the
leaving child causes a wakeup that ruins the event counter and causes
endless loops, apparently due to the queued inter-thread byte in the
pipe used to wake threads up that never gets removed in this case.
Note that extcheck would also deserve deeper investigation because it
can immediately re-trigger a check in such a case, which is not normal.
Removing the wiping of the nevlist variable fixes the problem.
This can be backported to all versions since it affects 2.0.
Gcc before 7 does really not like direct operations on cast pointers
such as "((struct a*)b)->c += d;". It turns our that we have exactly
that construct in 3.0 since commit 5baa9ea168 ("MEDIUM: cache: Save
body size of cached objects and track it on delivery").
It's generally sufficient to use an intermediary variable such as :
"({ (struct a*) _ = b; _; })->c +=d;" but that's ugly. Fortunately
DISGUISE() implicitly does something very similar and works fine, so
let's use that.
No backport is needed.
Once data are received and placed in a channel buffer, if it is possible,
outgoing data are immediately forwarded. But we must take care to not do so
if there is also pending input data and a filter registered on the
channel. It is especially important for HTX streams because the HTX may be
altered, especially the extra field. And it is indeed an issue with the HTTP
compression filter and the H1 multiplexer. The wrong chunk size may be
announced leading to an internal error.
This patch should fix the issue #2530. It must be backported to all stable
versions.
The last fixes on the peers to improve the locking mechanism introduced new
peer flags and the value of some old flags was changed. This was done in the
commit 9b78e33837 ("MINOR: peers: Add 2 peer flags about the peer learn
status"). But, to ease the debugging of the peers team, old values are
restored.
This patch must be backported with the commit above.
Thanks to all previous changes, it is now possible to stop locking all peers
at once in the resync process function. Peer are locked one after the
other. Wen a peer is locked, another one may be locked when all peer sharing
the same shard must be updated. Otherwise, at anytime, at most one peer is
locked. This should significantly improve the situation.
This patch depends on the following patchs:
* BUG/MAJOR: peers: Update peers section state from a thread-safe manner
* BUG/MINOR: peers: Report a resync was explicitly requested from a thread-safe manner
* MINOR: peers: Add functions to commit peer changes from the resync task
* MINOR: peers: sligthly adapt part processing the stopping signal
* MINOR: peers: Add flags to report the peer state to the resync task
* MINOR: peers: Add 2 peer flags about the peer learn status
* MINOR: peers: Split resync process function to separate running/stopping states
It may be good to backport it to 2.9. All the seris should fix the issue #2470.
It is the main part of this series. In the peer applet, only the peer flags
are updated. It is now the responsibility of the resync process function to
check changes on each peer to update the peers section state accordingly.
Concretly, changes on the connection state (accepted, connected, released or
renewed) are first reported at the peer level and then handled in
__process_peer_state() function.
In the same manner, when the learn status of a peer changes, the peers
section state is no longer updated immediately. The resync task is woken up
to deal with this changes.
Thanks to these changes, the peers should be now really thread-safe.
This patch relies on the following ones:
* BUG/MINOR: peers: Report a resync was explicitly requested from a thread-safe manner
* MINOR: peers: Add functions to commit peer changes from the resync task
* MINOR: peers: sligthly adapt part processing the stopping signal
* MINOR: peers: Add flags to report the peer state to the resync task
* MINOR: peers: Add 2 peer flags about the peer learn status
* MINOR: peers: Split resync process function to separate running/stopping states
No bug was reported about the thread-safety of peers. Only a performance
issue was encountered with a huge number of peers (> 50). So there is no
reason to backport all these patches further than 2.9.
Flags on the peers section state must be updated from a thread-safe manner.
It is not true today. With this patch we take care PEERS_F_RESYNC_REQUESTED
flag is only set by the resync task. To do so, a peer flag is used. This
flag is only set once and never removed. It is juste used for debugging
purpose. So it is enough to set it on a peer and be sure to report it on the
peers section when the sync task is executed.
This patch relies on previous ones:
* MINOR: peers: Add functions to commit peer changes from the resync task
* MINOR: peers: sligthly adapt part processing the stopping signal
* MINOR: peers: Add flags to report the peer state to the resync task
* MINOR: peers: Add 2 peer flags about the peer learn status
* MINOR: peers: Split resync process function to separate running/stopping states
For now, nothing is done in these functions. It is only a patch to prepare
the huge part of the refactoring about the locking mechanism of the peers.
These functions will be responsible to check peers state and their learn
status to update the peers section flags accordingly.
The signal and the PEERS_F_DONOTSTOP flag are now handled in the loop on peers
to force sessions shutdown. We will need to loop on all peers to update their
state. It is easier this way.
As the previous patch, this patch is also part of the refactoring of peer
locking mechanisme. Here we add flags to represent a transitional state for
a peer. It will be the resync task responsibility to update the peers state
accordingly.
A peer may be in 4 transitional states:
* accepted : a connection was accepted from a peer
* connected: a connection to a peer was established
* release : a peer session was released
* renewed : a peer session was released because it was replaced by a new
one. Concretly, this will be equivalent to released+accepted
If none of these flags is set, it means the transition, if any, was
processed by the resync task, or no transition happened.
PEER_F_LEARN_PROCESS and PEER_F_LEARN_FINISHED flags are added to help to
fix locking issue about peers. Indeed, a peer is able to update the peers
"section" state under its own lock. Because the resync task locks all peers
at once, there is no conflict at this level. But there is nothing to prevent
2 peers to update the peers state in same time. So it seems there is no real
issue here, but there is a theorical thread-safety issue here. And it means
the locking mechanism of the peers must be reviewed.
In this context, the 2 flags above will help to move all update of the peers
state in the scope of resync task. Each peer will be able to update its own
state and the resync task will be responsible to update the peers state
accordingly.
The function responsible to deal with resynchro between all peers is now split
in two subfunctions. The first one is used when HAProxy is running while the
other one is used in soft-stop case.
This patch is required to be able to refactor locking mechanism of the peers.
There were several places in grpc and its dependency protobuf where unaligned
accesses were done. Read accesses to 32 (resp. 64) bits values should be performed
by read_u32() (resp. read_u64()).
Replace these unligned read accesses by correct calls to these functions.
Same fixes for doubles and floats.
Such unaligned read accesses could lead to crashes with bus errors on CPU
archictectures which do not fix them at run time.
This patch depends on this previous commit:
861199fa71 MINOR: net_helper: Add support for floats/doubles.
Must be backported as far as 2.6.
The global 'key-base' keyword allows to read the 'key' parameter of a
crt-store load line using a path prefix.
This is the equivalent of the 'crt-base' keyword but for 'key'.
It only applies on crt-store.
Add crt-base support for "crt-store". It will be used by 'crt', 'ocsp',
'issuer', 'sctl' load line parameter.
In order to keep compatibility with previous configurations and scripts
for the CLI, a crt-store load line will save its ckch_store using the
absolute crt path with the crt-base as the ckch tree key. This way, a
`show ssl cert` on the CLI will always have the completed path.
In 3.0-dev, with commit 7c9ce715c9 ("MINOR: ring: make callers use
ring_data() and ring_size(), not ring->buf"), we made startup_logs_dup()
use ring_size() to get the old ring size and pass it to ring_new() to
create a new ring. But due to the ambiguity of the allocate vs usable
size, this resulted in slightly shrinking the buffer compared to the
previous one, occasionally causing crashes if the first one was already
full of warnings, as seen in GH issue #2529. We need to use the allocated
size instead, thanks to the function brought by previous commit.
No backport is needed, this only affects 3.0-dev. Thanks to @felipewd
for the detailed report that allowed to spot the problem.
There's currently an abiguity around ring_size(), it's said to return
the allocated size but returns the usable size. We can't change it as
it's used everywhere in the code like this. Let's fix the comment and
add ring_allocated_size() instead for anything related to allocation.
In 2.6, a build issue for LRU in standalone test mode was addressed by
commit bf9c07fd9 ("BUILD/DEBUG: lru: update the standalone code to
support the revision"), but using revision 1 while looking up rev 0
results in 100% misses. Let's fix this and commit with revision 0 as
well.
No backport is needed, this only happens when hacking on the code.
Frontend and listen sections allow unlimited number of bind statements, it is
often, when there is a bind statement per supported protocol, like below:
listen test
mode http
bind quic4@0.0.0.0:443 name quic ssl crt ...
bind 0.0.0.0:443 name https ssl alpn http/1.1,h2 crt ...
bind 0.0.0.0:8080 ...
...
It seems useful to show corresponded protocol name in alerts and warnings,
when problem occures with port binding, connection resuming or sharding. This
helps to figure out immediately, which bind statement has a wrong setting or
which protocol module is the root cause of the issue.
When the integrity check fails, it's useful to get a dump of the area
around the first faulty byte. That's what this patch does. For example
it now shows this before reporting info about the tag itself:
Contents around first corrupted address relative to pool item:.
Contents around address 0xe4febc0792c0+40=0xe4febc0792e8:
0xe4febc0792c8 [80 75 56 d8 fe e4 00 00] [.uV.....]
0xe4febc0792d0 [a0 f7 23 a4 fe e4 00 00] [..#.....]
0xe4febc0792d8 [90 75 56 d8 fe e4 00 00] [.uV.....]
0xe4febc0792e0 [d9 93 fb ff fd ff ff ff] [........]
0xe4febc0792e8 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc0792f0 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc0792f8 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc079300 [d9 93 fb ff ff ff ff ff] [........]
This may be backported to 2.9 and maybe even 2.8 as it does help spot
the cause of the memory corruption.
This function is particularly useful to dump unknown areas watching
for opportunistic symbols, so let's move it to tools.c so that we can
reuse it a little bit more.
When a corruption was detected in an object, it's often said that the
tag doesn't match the pool, but it should also check if it matches the
location of an earlier pool_free() call, which happens when -dMcaller
is used. That's what we're doing now.
In 2.9 with commit 7968fe3889 ("MEDIUM: stick-table: change the ref_cnt
atomically") we significantly relaxed the stick-tables locking when
dealing with peers by adjusting the ref_cnt atomically and moving it
out of the lock.
However it opened a tiny window that became problematic in 3.0-dev7
when the table's contention was lowered by commit 1a088da7c2 ("MAJOR:
stktable: split the keys across multiple shards to reduce contention").
What happens is that some peers may access the entry for reading at
the moment it's about to expire, and while the read accesses to push
the data remain unnoticed (possibly that from time to time we push
crap), but the releasing of the refcount causes a new write that may
damage anything else. The scenario is the following:
process_table_expire() peer_send_teachmsgs()
RDLOCK(&updt_lock);
tick_is_expired() != 0
ebmb_delete(ts->key);
if (ts->upd.node.leaf_p) {
HA_ATOMIC_INC(&ts->ref_cnt);
RDUNLOCK(&updt_lock);
WRLOCK(&updt_lock);
eb32_delete(&ts->upd);
}
__stksess_free(t, ts);
peer_send_updatemsg(ts);
RDLOCK(&updt_lock);
HA_ATOMIC_DEC(&ts->ref_cnt);
Here it's clear that the bottom part of peer_send_teachmsgs() believes
to be protected but may act on freed data.
This is more visible when enabling -dMtag,no-merge,integrity because
the ATOMIC_DEC(&ref_cnt) decrements one byte in the area, that makes
the eviction check fail while the tag has the address of the left
__stksess_free(), proving a completed pool_free() before the decrement,
and the anomaly there is pretty visible in the crash dump. Changing
INC()/DEC() with ADD(2)/DEC(2) shows that the byte is now off by two,
confirming that the operation happened there.
The solution is not very hard, it consists in checking for the ref_cnt
on the left after grabbing the lock, and doing both before deleting the
element, so that we have the guarantee that either the peer will not
take it or that it has already started taking it.
This was proven to be sufficient, as instead of crashing after 3s of
injection with 4 peers, 16 threads and 130k RPS, it survived for 15mn.
In order to stress the setup, a config involving 4+ peers, tracking
HTTP request with randoms and applying a bwlim-out filter with a
random key, with a client made of 160 h2 conns downloading 10 streams
of 4MB objects in parallel managed to trigger it within a few seconds:
frontend ft
http-request track-sc0 rand(100000) table tbl
filter bwlim-out lim-out limit 2047m key rand(100000000),ipmask(32) min-size 1 table tbl
http-request set-bandwidth-limit lim-out
use_backend bk
backend bk
server s1 198.18.0.30:8000
server s2 198.18.0.34:8000
backend tbl
stick-table type ip size 1000k expire 1s store http_req_cnt,bytes_in_rate(1s),bytes_out_rate(1s) peers peers
This seems to be very dependent on the timing and setup though.
This will need to be backported to 2.9. This part of the code was
reindented with shards but the block should remain mostly unchanged.
The logic to apply is the same.
Sending "trace peers event" on the CLI crashes because the event list
in the peers is not finished. This was introduced in 2.4 by commit
d865935f32 ("MINOR: peers: Add traces to peer_treat_updatemsg().")
so this must be backported to 2.4.
When adding the shards support to tables with commit 1a088da7c ("MAJOR:
stktable: split the keys across multiple shards to reduce contention"),
the condition to stop eliminating entries based on the batch size being
reached is based on a pre-decrement of the max_search counter, but now
it goes back into the outer loop which doesn't check it, so next time
it does it when entering the next shard, it will become even more
negative and will properly stop, but at first glance it looks like an
int overflow (which it is not). Let's make sure the outer loop stops
on this condition so that we don't continue searching when the limit
is reached.
While changing the stick-table indexing that led to commit 1a088da7c
("MAJOR: stktable: split the keys across multiple shards to reduce
contention"), I met a problem with the task's expiration date being
incorrectly updated, I fixed it and apparently I committed the wrong
version :-/
The effect is that the task's date is only correctly reset if the
table is empty, otherwise the task wakes up again and is queued at
the previous date, eating 100% CPU. The tick_isfirst() must not be
used when storing the last result.
No backport is needed as this was only merged in 3.0-dev7.
crt-list will be enhanced with ckch_conf keywords, however these keywords
does not fill the 'ssl_conf' structure. So we don't need to allocate the
ssl_conf for every options between [ ] but only when we found a relevant
one.
'crt-store' is a new section useful to define the struct ckch_store.
The "load" keyword in the "crt-store" section allows to define which
files you want to load for a specific certificate definition.
Ex:
crt-store
load crt "site1.crt" key "site1.key"
load crt "site2.crt" key "site2.key"
frontend in
bind *:443 ssl crt "site1.crt" crt "site2.crt"
This is part of the certificate loading which was discussed in #785.
When cache and stats applets were changed to use their own buffers, a change
was also performed to no longer access the stream from the I/O
handller. Among other things, the HTTP start-line of the request is now
retrieved to get the method. But, when these changes were brought, the inbuf
buffer allocation failures were not handled.
It is of course not so common. But if this happens, a crash may be
experienced. To fix the issue, we now check for inbuf allocation failures
before accessing it.
No backported needed.
We observed that a dynamic server which health check is down for longer
than slowstart delay at startup doesn't trigger the warmup phase, it
receives full traffic immediately. This has been confirmed by checking
haproxy UI, weight is immediately the full one (e.g. 75/75), without any
throttle applied. Further tests showed that it was similar if it was in
maintenance, and even when entering a down or maintenance state after
being up.
Another issue is that if the server is down for less time than
slowstart, when it comes back up, it briefly has a much higher weight
than expected for a slowstart.
An easy way to reproduce is to do the following:
- Add a server with e.g. a 20s slowstart and a weight of 10 in config
file
- Put it in maintenance using CLI (set server be1/srv1 state maint)
- Wait more than 20s, enable it again (set server be1/srv1 state ready)
- Observe UI, weight will show 10/10 immediately.
If server was down for less than 20s, you'd briefly see a weight and
throttle value that is inconsistent, e.g. 50% throttle value and a
weight of 5 if server comes back up after 10s before going back to
6% after a second or two.
Code analysis shows that the logic in server_recalc_eweight stops the
warmup task by setting server's next state to SRV_ST_RUNNING if it
didn't change state for longer than the slowstart duration, regardless
of its current state. As a consequence, a server being down or disabled
for longer than the slowstart duration will never enter the warmup phase
when it will be up again.
Regarding the weight when server comes back up, issue is that even if
the server is down, we still compute its next weight as if it was up,
hence when it comes back up, it can briefly have a much higher weight
than expected during slowstart, until the warmup task is called again
after last_change is updated.
This patch aims to fix both issues.
The CPU variable, when used, is almost always exclusively used with
"generic" to disable any CPU-specific optimizations, or "native" to
enable "-march=native". Other options are not used and are just making
CPU_CFLAGS more confusing.
This commit just drops all pre-configured variants and replaces them
with documentation about examples of supported options. CPU_CFLAGS is
preserved as it appears that it's mostly used as a proxy to inject the
distro's CFLAGS, and it's just empty by default.
The CPU variable is checked, and if set to anything but "generic", it
emits a warning about its deprecation and invites the user to read
INSTALL.
Users who would just set CPU_CFLAGS will be able to continue to do so,
those who were using CPU=native will have to pass CPU_CFLAGS=-march=native
and those who were passing one of the other options will find it in the
doc as well.
Note that this also removes the "CPU=" line from haproxy -vv, that most
users got used to seeing set to "generic" or occasionally "native"
anyway, thus that didn't provide any useful information.
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit to unset it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
Some compilers report this on the cache:
src/cache.c:235: warning: 'release_entry_locked' declared inline after being called
src/cache.c:235: warning: previous declaration of 'release_entry_locked' was here
And indeed, the function is first declared non-inline and later inline.
Let's just set the inline status from the beginning. It's not really
needed to backport this.
Using an invalid GUID for guid_insert() causes a crash. This is easily
reproducible using for example an invalid character with "guid" keyword.
Here is the provided backtrace :
Thread 1 "haproxy" received signal SIGSEGV, Segmentation fault.
0x00005555561fda95 in guid_insert (objt=0x520000002080, uid=0x519000002dac "@foo2", errmsg=0x7ffff4c0a7a0)
at src/guid.c:83
83 ha_free(&guid->node.key);
This error is present in guid_insert() cleanup parts. GUID node is not
allocated in case of an early error so it's impossible to dereference it
to free guid.node.key. Fix this simply by using an intermediary pointer
key.
This does not need to be backported.
In HTTP keep-alive, if we face a connection error to the server while
sending the request, the error should not be reported, and the client-side
connection should simply be closed, so that client knows it can retry.
If the error happens during the connection stage, there is two cases. We
have a connection timeout or an allocation error. In this case, the 503
response must be skipped if it is not the first request on the client-side
connection. Or we have a connection error. In this case, the 503 response
must be skipped if it is a reused server connection. Otherwise, during the
connection stage, the 503-Service-unavailable response is delivered to the
client. The part works properly.
If the error happens after this stage, the 502-Bad-gateway response
delivering should only be based on the server-side connection status. For a
reused server connection, the client-side connection must be closed with no
reponses. However, for a fresh server-side connection, a 502-Bad-gateway
response must be delivered to the client. Unfortunately, This part is
buggy. Only the client-side connection state is considered and the response
is skipped if it is not the first request for the same client connection.
The bug is not so visbile in HTTP/1.1 but in H2 and H3 it is pretty annoying
because for a connection, requests are multiplexed, in parallels. It means
there is no first request. So, because of this bug, for H2 and H3,
502-Bad-gateway responses because of a connection error before receiveing
the response are always skipped.
To fix the issue, in http_wait_for_response() analyser, we must only rely on
SF_SRV_REUSED stream flag to skip the 502 response or not. This flag is set
if the server connection was reused.
The bug is their since a while. SF_SRV_REUSED flag was added in the version
1.5 especially to fix this kind of bug. But only the 503 case was fixed.
This patch should fix the issue #2285. It must be backported to every stable
versions.
qc_send() is implemented as a loop to repeatedly invoke
qc_prep_pkts()/qc_send_ppkts(). This ensures that all data are emitted
even if bigger that a single Tx buffer instance. This is useful if
congestion window is empty but big enough for application data.
Looping is interrupted if qc_prep_pkts() returns a negative error
code, for example due to no space left in congestion window. It can also
returns 0 if no input data to sent, which also interrupt the loop.
To limit this last case, removed quic_enc_level from send_list each time
everything already send via qc_prep_pkts(). Loop can then be interrupted
as soon as send_list is empty, avoiding an extra superfluous call to
qc_prep_pkts().
qc_send() was systematically called by quic_conn IO handlers with all
instantiated quic_enc_level. Change this to only register quic_enc_level
for send if needed. Do not call at all qc_send() if no qel registered.
A new function qel_need_sending() is defined to detect if sending is
required. First, it checks if quic_enc_level has prepared frames or
probing is set. It can also returns true if ACK required either on
quic_enc_level itself or because of quic_conn ack timer fired. Finally,
a CONNECTION_CLOSE emission for quic_conn is also a valid case.
This should reduce the number of invocations of qc_send(). This could
improve slightly performance, as well as simplify traces debugging.
A series of previous patches have clean up sending function for
handshake case. Their new exposed API is now flexible enough to convert
app case to use the same functions.
As such, qc_send_hdshk_pkts() is renamed qc_send() and become the single
entry point for QUIC emission. It is used during application packets
emission in quic_conn_app_io_cb(), qc_send_mux(). Also the internal
function qc_prep_hpkts() is renamed qc_prep_pkts().
Remove the new unneeded qc_send_app_pkts() and qc_prep_app_pkts().
Also removed qc_send_app_probing(). It was a simple wrapper over other
application send functions. Now, default qc_send() can be reuse for such
cases with <old_data> argument set to true.
An adjustment was needed when converting qc_send_hdshk_pkts() to the
general qc_send() version. Previously, only a single packets
encoding/emission cycle was performed. This was enough as handshake
packets are always smaller than Tx buffer. However, it may be possible
to emit more application data. As such, a loop is necessary to perform
multiple encoding/emission cycles, as this was already the case in
qc_send_app_pkts().
No functional difference should happen with this commit. However, as
these are critcal functions with a lot of changes, this patch is
labelled as medium.
quic_conn_io_cb() manually implements emission by using lower level
functions qc_prep_pkts() and qc_send_ppkts(). Replace this by using the
higher level function qc_send_hdshk_pkts() which notably handle buffer
allocation and purging.
This allows to clean up send API by flagging qc_prep_pkts() and
qc_send_ppkts() as static. They are now used in a single location inside
qc_send_hdshk_pkts().
qc_send_hdshk_pkts() is a wrapper for qc_prep_hpkts() used on
retransmission. It was restricted to use two quic_enc_level pointers as
distinct arguments. Adapt it to directly use the same list of
quic_enc_level which is passed then to qc_prep_hpkts().
Now for retransmission quic_enc_level send list is built directly into
qc_dgrams_retransmit() which calls qc_send_hdshk_pkts().
Along this change, a new utility function qel_register_send() is
defined. It is an helper to build the quic_enc_level send list. It
enfores that each quic_enc_level instance is only registered in a single
list to prevent memory issues. It is both used in qc_dgrams_retransmit()
and quic_conn_io_cb().
Emission of packets during handshakes was implemented via an API which
uses two alternative ways to specify the list of frames.
The first one uses a NULL list of quic_enc_level as argument for
qc_prep_hpkts(). This was an implicit method to iterate on all qels
stored in quic_conn instance, with frames already inserted in their
corresponding quic_pktns.
The second method was used for retransmission. It uses a custom local
quic_enc_level list specified by the caller as input to qc_prep_hpkts().
Frames were accessible through <retransmit> list pointers of each
quic_enc_level used in an implicit mechanism.
This commit clarifies the API by using a single common method. Now
quic_enc_level list must always be specified by the caller. As for
frames list, each qels must set its new field <send_frms> pointer to the
list of frames to send. Callers of qc_prep_hpkts() are responsible to
always clear qels send list. This prevent a single instance of
quic_enc_level to be inserted while being attached to another list.
This allows notably to clean up some unnecessary code. First,
<retransmit> list of quic_enc_level is removed as it is replaced by new
<send_frms>. Also, it's now possible to use proper list_for_each_entry()
inside qc_prep_hpkts() to loop over each qels. Internal functions for
quic_enc_level selection is now removed.
Clean up trailer of qc_send_hdshk_pkts() by removing label "leave". Only
"out" label is now used. This operation is safe as LIST_DEL_INIT() is
idempotent. Caller of qc_send_hdshk_pkts() also ensures input frame
lists are freed, so it's better to always reset quic_enc_level
<retrans_frms> member.
Also take the opportunity to reset QUIC_FL_CONN_RETRANS_OLD_DATA only if
already set. This is considered more robust and will also remove
unneeded trace occurences.
No functional change. The main objective of this commit is to clean up
code in preparation of a refactoring on send functions.
In c83684519 ("MEDIUM: log: add the ability to include samples in logs")
we checked the return value of lf_text_len() as an integer instead of
comparing the pointer with NULL explicitly. Since this may be confusing,
let's test the return value against NULL.
[ada: for backports, the patch needs to be applied manually because of
c6a713842 ("MINOR: log: simplify last_isspace in sess_build_logline()")]
According to snprintf() man page:
The functions snprintf() and vsnprintf() do not write more than
size bytes (including the terminating null byte ('\0')). If the
output was truncated due to this limit, then the return value is
the number of characters (excluding the terminating null byte)
which would have been written to the final string if enough space
had been available. Thus, a return value of size or more means
that the output was truncated.
However, in sess_build_logline(), each time we need to check the return
value of snprintf(), here is how we proceed:
iret = snprintf(tmplog, max, ...);
if (iret < 0 || iret > max)
// error
// success
tmplog += iret;
Here is the issue: if snprintf() lacks 1 byte space to write the
terminating NULL byte, it will return max. Which means in this case
that we fail to know that snprintf() truncated the output in reality,
and we still add iret to tmplog pointer. Considering sess_build_logline()
should NOT write more than <maxsize> bytes (including the terminating NULL
byte) as per the function description, in this case the function would
write <maxsize>+1 byte (to write the terminating NULL byte upon return),
which may lead to invalid write if <dst> was meant to hold <maxsize> bytes
at maximum.
Hopefully, this bug wasn't triggered so far because sess_build_logline()
is called with logline as <dst> argument and <global.max_syslog_len> as
<maxsize> argument, logline being initialized with 1 extra byte upon
startup.
But we better fix this to comply with the function description and prevent
any side-effect since some sess_build_logline() helpers may assume that
'tmplog-dst < maxsize' is always true. Also sess_build_logline() users
probably don't expect NULL-byte to be accounted for in the produced
logline length.
This should be backported to all stable versions.
[ada: for backports, the patch needs to be applied manually because of
c6a713842 ("MINOR: log: simplify last_isspace in sess_build_logline()")]
encode_{chunk,string}() is often found to be used this way:
ret = encode_{chunk,string}(start, stop...)
if (ret == NULL || *ret != '\0') {
//error
}
//success
Indeed, encode_{chunk,string} will always try to add terminating NULL byte
to the output string, unless no space is available for even 1 byte.
However, it means that for the caller to be able to spot an error, then it
must provide a buffer (here: start) which is already initialized.
But this is wrong: not only this is very tricky to use, but since those
functions don't return NULL on failure, then if the output buffer was not
properly initialized prior to calling the function, the caller will
perform invalid reads when checking for failure this way. Moreover, even
if the buffer is initialized, we cannot reliably tell if the function
actually failed this way because if the buffer was previously initialized
with NULL byte, then the caller might think that the call actually
succeeded (since the function didn't return NULL and didn't update the
buffer).
Also, sess_build_logline() relies lf_encode_{chunk,string}() functions
which are in fact wrappers for encode_{chunk,string}() functions and thus
exhibit the same error handling mechanism. It turns out that
sess_build_logline() makes unsafe use of those functions because it uses
the error-checking logic mentionned above while buffer (tmplog) is not
guaranteed to be initialized when entering the function. This may
ultimately cause malfunctions or invalid reads if the output buffer is
lacking space.
To fix the issue once and for all and prevent similar bugs from being
introduced, we make it so encode_{string, chunk} and escape_string()
(based on encode_string()) now explicitly return NULL on failure
(when the function failed to write at least the ending NULL byte)
lf_encode_{string,chunk}() helpers had to be patched as well due to code
duplication.
This should be backported to all stable versions.
[ada: for 2.4 and 2.6 the patch won't apply as-is, it might be helpful to
backport ae1e14d65 ("CLEANUP: tools: removing escape_chunk() function")
first, considering it's not very relevant to maintain a dead function]
In c5bff8e550 ("BUG/MINOR: log: improper behavior when escaping log data")
we fixed lf_text_len() behavior with +E (escape) option.
However we introduced an inconsistency if output buffer is too small to
hold the whole output and truncation occurs: indeed without +E option up
to <size> bytes (including NULL byte) will be used whereas with +E option
only <size-1> bytes will be used. Fixing the function and related comment
so that the function behaves the same in regards to truncation whether +E
option is used or not.
This should be backported to all stable versions.
When sharded listeners were introdcued in 2.5 with commit 6dfbef4145
("MEDIUM: listener: add the "shards" bind keyword"), a point was
overlooked regarding how IDs are assigned to listeners: they are just
duplicated! This means that if a "option socket-stats" is set and a
shard is configured, or multiple thread groups are enabled, then a stats
dump will produce several lines with exactly the same socket name and ID.
This patch tries to address this by trying to assign consecutive numbers
to these sockets. The usual algo is maintained, but with a preference for
the next number in a shard. This will help users reserve ranges for each
socket, for example by using multiples of 100 or 1000 on each bind line,
leaving enough room for all shards to be assigned.
The mechanism however is quite tricky, because the configured listener
currently ends up being the last one of the shard. This helps insert them
before the current position without having to revisit them. But here it
causes a difficulty which is that we'd like to restart from the current
ID and assign new ones on top of it. What is done is that the number is
passed between shards and the current one is cleared (and removed from
the tree) so that we instead insert the new one. It's tricky because of
the situation which depends whether it's the listener that was already
assigned on the bind line or not. But overall, always removing the entry,
always adding the new one when the ID is not zero, and passing them from
the reference to the next one does the trick.
This may be backported to all versions till 2.6.
When a command is too big to fit in a buffer, a error is returned before
closing. However, the error is also returned if the command is small enough
but incomplete. It happens on abort. In this case, the error must not be
reported. The regression was introduced when a dedicated sn_buf callbac
function was added.
To fix the issue, both cases are now handled separately.
No backport needed.
There are several places where the QUIC low-level code performs unaligned
accesses by casting unaligned char* pointers to uint32_t, but this is
totally forbidden as it only works on machines that support unaligned
accesses, and either crashes on other ones (SPARC, MIPS), can result in
reading garbage (ARMv5) or be very slow due to the access being emulated
(RISC-V). We do have functions for this, such as read_u32() and write_u32()
that rely on the compiler's knowledge of the machine's capabilities to
either perform an unaligned access or do it one byte at a time.
This must be backported at least as far as 2.6. Some of the code moved a
few times since, so in order to figure the points that need to be fixed,
one may look for a forced pointer cast without having verified that either
the machine is compatible or that the pointer is aligned using this:
$ git grep 'uint[36][24]_t \*)'
Or build and run the code on a MIPS or SPARC and perform requests using
curl to see if they work or crash with a bus error. All the places fixed
in this commit were found thanks to an immediate crash on the first
request.
This was tagged medium because the affected archs are not the most common
ones where QUIC will be found these days.
Since the Linux capabilities support add-on (see the commit bd84387beb
("MEDIUM: capabilities: enable support for Linux capabilities")), we can also
check haproxy process effective and permitted capabilities sets, when it
starts and runs as non-root.
Like this, if needed network capabilities are presented only in the process
permitted set, we can get this information with capget and put them in the
process effective set via capset. To do this properly, let's introduce
prepare_caps_from_permitted_set().
First, it checks if binary effective set has CAP_NET_ADMIN or CAP_NET_RAW. If
there is a match, LSTCHK_NETADM is removed from global.last_checks list to
avoid warning, because in the initialization sequence some last configuration
checks are based on LSTCHK_NETADM flag and haproxy process euid may stay
unpriviledged.
If there are no CAP_NET_ADMIN and CAP_NET_RAW in the effective set, permitted
set will be checked and only capabilities given in 'setcap' keyword will be
promoted in the process effective set. LSTCHK_NETADM will be also removed in
this case by the same reason. In order to be transparent, we promote from
permitted set only capabilities given by user in 'setcap' keyword. So, if
caplist doesn't include CAP_NET_ADMIN or CAP_NET_RAW, LSTCHK_NETADM would not
be unset and warning about missing priviledges will be emitted at
initialization.
Need to call it before protocol_bind_all() to allow binding to priviledged
ports under non-root and 'setcap cap_net_bind_service' must be set in the
global section in this case.
Linux capabilities support and ability to preserve it for running process
after switching to a global.uid was added recently by the commit bd84387beb
("MEDIUM: capabilities: enable support for Linux capabilities")).
This new feature hasn't yet been taken into account by last config checks,
which are performed at initialization stage.
So, to update it, let's perform it after set_identity() call. Like this,
current EUID is already changed to a global.uid and prepare_caps_for_setuid()
would unset LSTCHK_NETADM flag, only if capabilities given in the 'setcap'
keyword in the configuration file were preserved.
Otherwise, if system doesn't support Linux capabilities or they were not set
via 'setcap', we keep the previous strict behaviour: process will terminate
with an alert, in order to insist that user: either needs to change
run UID (worst case: start and run as root), or he needs to set/recheck
capabilities listed as 'setcap' arguments.
In the case, when haproxy will start and run under a non-root user this patch
doesn't change the previous behaviour: we'll still let him try the
configuration, but we inform via warning that unexpected things may occur.
Need to be backported until v2.9, including v2.9.
This commit is similar with the two previous ones. Its purpose is to add
GUID support on listeners. Due to bind_conf and listeners configuration,
some specifities were required.
Its possible to define several listeners on a single bind line, for
example by specifying multiple addresses. As such, it's impossible to
support a "guid" keyword on a bind line. The problem is exacerbated by
the cloning of listeners when sharding is used.
To resolve this, a new keyword "guid-prefix" is defined for bind lines.
It allows to specify a string which will be used as a prefix for
automatically generated GUID for each listeners attached to a bind_conf.
Automatic GUID listeners generation is implemented via a new function
bind_generate_guid(). It is called on post-parsing, after
bind_complete_thread_setup(). For each listeners on a bind_conf, a new
GUID is generated with bind_conf prefix and the index of the listener
relative to other listeners in the bind_conf. This last value is stored
in a new bind_conf field named <guid_idx>. If a GUID cannot be inserted,
for example due to a non-unique value, an error is returned, startup is
interrupted with configuration rejected.
This commit is similar to previous one, except that it implements GUID
support for server instances. A guid_node field is inserted into server
structure. A new "guid" server keyword is defined.
Implement proxy identiciation through GUID. As such, a guid_node member
is inserted into proxy structure. A proxy keyword "guid" is defined to
allow user to fix its value.
GUID format is unspecified to allow users to choose the naming scheme.
Some restrictions however are added by this patch, mainly to ensure
coherence and memory usage.
The first restriction is on the length of GUID. No more than 127
characters can be used to prevent memory over consumption.
The second restriction is on the character set allowed in GUID. Utility
function invalid_char() is used for this : it allows alphanumeric
values and '-', '_', '.' and ':'.
Define a new module guid. Its purpose is to be able to attach a global
identifier for various objects such as proxies, servers and listeners.
A new type guid_node is defined. It will be stored in the objects which
can be referenced by such GUID. Several functions are implemented to
properly initialized, insert, remove and lookup GUID in a global tree.
Modification operations should only be conducted under thread isolation.
Currently, the way proxy-oriented logformat directives are handled is way
too complicated. Indeed, "log-format", "log-format-error", "log-format-sd"
and "unique-id-format" all rely on preparsing hints stored inside
proxy->conf member struct. Those preparsing hints include the original
string that should be compiled once the proxy parameters are known plus
the config file and line number where the string was found to generate
precise error messages in case of failure during the compiling process
that happens within check_config_validity().
Now that lf_expr API permits to compile a lf_expr struct that was
previously prepared (with original string and config hints), let's
leverage lf_expr_compile() from check_config_validity() and instead
of relying on individual proxy->conf hints for each logformat expression,
store string and config hints in the lf_expr struct directly and use
lf_expr helpers funcs to handle them when relevant (ie: original
logformat string freeing is now done at a central place inside
lf_expr_deinit(), which allows for some simplifications)
Doing so allows us to greatly simplify the preparsing logic for those 4
proxy directives, and to finally save some space in the proxy struct.
Also, since httpclient proxy has its "logformat" automatically compiled
in check_config_validity(), we now use the file hint from the logformat
expression struct to set an explicit name that will be reported in case
of error ("parsing [httpclient:0] : ...") and remove the extraneous check
in httpclient_precheck() (logformat was parsed twice previously..)
split parse_logformat_string() into two functions:
parse_logformat_string() sticks to the same behavior, but now becomes an
helper for lf_expr_compile() which uses explicit arguments so that it
becomes possible to use lf_expr_compile() without a proxy, but also
compile an expression which was previously prepared for compiling (set
string and config hints within the logformat expression to avoid manually
storing string and config context if the compiling step happens later).
lf_expr_dup() may be used to duplicate an expression before it is
compiled, lf_expr_xfer() now makes sure that the input logformat is
already compiled.
This is some prerequisite works for log-profiles implementation, no
functional change should be expected.
This patch tries to address a design flaw with how logformat expressions
are parsed from config. Indeed, some parse_logformat_string() calls are
performed during config parsing when the proxy mode is not yet known.
Here's a config example that illustrates the issue:
defaults
mode tcp
listen test
bind :8888
http-response set-header custom-hdr "%trl" # needs http
mode http
The above config should work, because the effective proxy mode is http,
yet haproxy fails with this error:
[ALERT] (99051) : config : parsing [repro.conf:6] : error detected in proxy 'test' while parsing 'http-response set-header' rule : format tag 'trl' is reserved for HTTP mode.
To fix the issue once and for all, let's implement smart postparsing for
logformat expressions encountered during config parsing:
- split parse_logformat_string() (and subfonctions) in order to create a
new lf_expr_postcheck() function that must be called to finish
preparing and checking the logformat expression once the proxy type is
known.
- save some config hints info during parse_logformat_string() to
generate more precise error messages during lf_expr_postcheck(), if
needed, we rely on curpx->conf.args.{file,line} hints for that because
parse_logformat_string() doesn't know about current file and line
number.
- lf_expr_postcheck() uses PR_FL_CHECKED proxy flag to know if the
function may try to make the proxy compatible with the expression, or
if it should simply fail as soon as an incompatibility is detected.
- if parse_logformat_string() is called from an unchecked proxy, then
schedule the expression for postparsing, else (ie: during runtime),
run the postcheck right away.
This change will also allow for some logformat expression error handling
simplifications in the future.
PR_FL_CHECKED is set on proxy once the proxy configuration was fully
checked (including postparsing checks).
This information may be useful to functions that need to know if some
config-related proxy properties are likely to change or not due to parsing
or postparsing/check logics. Also, during runtime, except for some rare cases
config-related proxy properties are not supposed to be changed.
log format expressions are broadly used within the code: once they are
parsed from input string, they are converted to a linked list of
logformat nodes.
We're starting to face some limitations because we're simply storing the
converted expression as a generic logformat_node list.
The first issue we're facing is that storing logformat expressions that
way doesn't allow us to add metadata alongside the list, which is part
of the prerequites for implementing log-profiles.
Another issue with storing logformat expressions as generic lists of
logformat_node elements is that it's starting to become really hard to
tell when we rely on logformat expressions or not in the code given that
there isn't always a comment near the list declaration or manipulation
to indicate that it's relying on logformat expressions under the hood,
so this adds some complexity for code maintenance.
This patch looks quite impressive due to changes in a lot of header and
source files (since logformat expressions are broadly used), but it does
a simple thing: it defines the lf_expr structure which itself holds a
generic list of logformat nodes, and then declares some helpers to
manipulate lf_expr elements and fixes the code so that we now exclusively
manipulate logformat_node lists as lf_expr elements outside of log.c.
For now, lf_expr struct only contains the list of logformat nodes (no
additional metadata), but now that we have dedicated type and helpers,
doing so in the future won't be problematic at all and won't require
extensive code changes.
This is a pretty simple patch despite requiring to make some visible
changes in the code:
When parsing a logformat string, log tags (ie: '%tag', AKA log tags) are
turned into logformat nodes with their type set to the type of the
corresponding logformat_tag element which was matched by name. Thus, when
"compiling" a logformat tag, we only keep a reference to the tag type
from the original logformat_tag.
For example, for "%B" log tag, we have the following logformat_tag
element:
{
.name = "B",
.type = LOG_FMT_BYTES,
.mode = PR_MODE_TCP,
.lw = LW_BYTES,
.config_callback = NULL
}
When parsing "%B" string, we search for a matching logformat tag
inside logformat_tags[] array using the provided name, once we find a
matching element, we craft a logformat node whose type will be
LOG_FMT_BYTES, but from the node itself, we no longer have access to
other informations that are set in the logformat_tag struct element.
Thus from a logformat_node resulting from a log tag, with current
implementation, we cannot easily get back to matching logformat_tag
struct element as it would require us to scan the whole logformat_tags
array at runtime using node->type to find the matching element.
Let's take a simpler path and consider all tag-specific LOG_FMT_*
subtypes as being part of the same logformat node type: LOG_FMT_TAG.
Thanks to that, we're now able to distinguish logformat nodes made
from logformat tag from other logformat nodes, and link them to
their corresponding logformat_tag element from logformat_tags[] array. All
it costs is a simple indirection and an extra pointer in logformat_node
struct.
While at it, all LOG_FMT_* types related to logformat tags were moved
inside log.c as they have no use outside of it since they are simply
lookup indexes for sess_build_logline() and could even be replaced by
function pointers some day...
rename logformat_type internal struct to logformat_tag to to make it less
confusing, then expose logformat_tag struct through header file so that it
can be referenced in other structs.
also rename logformat_keywords[] to logformat_tags[] for better
consistency.
What we use to call logformat variable in the code is referred as
log-format tag in the documentation. Having both 'var' and 'tag' labels
referring to the same thing is really confusing. Let's make the code
comply with the documentation by replacing all logformat var/variable/VAR
occurences with either tag or TAG.
No functional change should be expected, the only visible side-effect from
user point of view is that "variable" was replaced by "tag" in some error
messages.
When support for dynamic names was added for use_backend rules in
702d44f2f ("MEDIUM: proxy: support use_backend with dynamic names"), the
sample expression resulting from parse_logformat_string() was only freed
for non dynamic rules (when the expression resolved to a simple string
node). But for complex expressions (ie: multiple nodes), rule->dynamic
was set but the expression was never released, resulting in a small
memory leak when freeing the parent proxy.
To fix the issue, in free_proxy(), we free the switching rule expression
if the switching rule is dynamic.
This should be backported to every stable versions.
[ada: prior to 2.9, free_logformat_list() helper did not exist: we may
use the same manual sample expr freeing logic as in server_rules pruning
right above it]
As per the `sd_notify` manual:
> A field carrying the monotonic timestamp (as per CLOCK_MONOTONIC) formatted
> in decimal in μs, when the notification message was generated by the client.
> This is typically used in combination with "RELOADING=1", to allow the
> service manager to properly synchronize reload cycles. See systemd.service(5)
> for details, specifically "Type=notify-reload".
Thus this change allows users with a recent systemd to switch to
`Type=notify-reload`, should they desire to do so. Correct behavior was
verified with a Fedora 39 VM.
see systemd/systemd#25916
[wla: the service file should be updated this way:]
diff --git a/admin/systemd/haproxy.service.in b/admin/systemd/haproxy.service.in
index 22a53d8aab..8c6dadb5e5 100644
--- a/admin/systemd/haproxy.service.in
+++ b/admin/systemd/haproxy.service.in
@@ -8,12 +8,11 @@ EnvironmentFile=-/etc/default/haproxy
EnvironmentFile=-/etc/sysconfig/haproxy
Environment="CONFIG=/etc/haproxy/haproxy.cfg" "PIDFILE=/run/haproxy.pid" "EXTRAOPTS=-S /run/haproxy-master.sock"
ExecStart=@SBINDIR@/haproxy -Ws -f $CONFIG -p $PIDFILE $EXTRAOPTS
-ExecReload=@SBINDIR@/haproxy -Ws -f $CONFIG -c $EXTRAOPTS
-ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always
SuccessExitStatus=143
-Type=notify
+Type=notify-reload
+ReloadSignal=SIGUSR2
# The following lines leverage SystemD's sandboxing options to provide
# defense in depth protection at the expense of restricting some flexibility
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
This bug arrived with this commit:
MAJOR: stktable: split the keys across multiple shards to reduce contention
At this time, there are no callers which call stktable_get_entry() without checking
the nullity of <key> passed as parameter. But the documentation of this function
says it supports this case where the <key> passed as parameter could be null.
Move the nullity test on <key> at first statement of this function.
Thanks to @chipitsine for having reported this issue in GH #2518.
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees. That counts for keys and
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.
The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):
function dump_stk()
local dmp = core.proxies['tbl'].stktable:dump({});
local count = 0
for _, __ in pairs(dmp) do
count = count + 1
end
core.Info('Total entries: ' .. count)
end
core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);
##
global
tune.lua.log.stderr on
lua-load-per-thread lua-cnttbl.lua
listen front
bind :8001
http-request lua.dump_stk if { path_beg /stk }
http-request track-sc1 rand(),upper,hex table tbl
http-request redirect location /
backend tbl
stick-table size 100k type string len 12 store http_req_cnt
##
$ h2load -c 16 -n 10000 0:8001/
$ curl 0:8001/stk
## A count close to 100k appears on haproxy's stderr
## On the CLI, "show table tbl" | wc will show the same.
Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.
The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.
The stksess already have a shard number, it's the one used to decide which
peer connection to send the entry. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though it
does not happen that often. The operation is done by hashing the key using
XXH32().
The peers also take and release the table's lock but the way it's used
it not very clear yet, so at this point it's sure this will not work.
At this point, this allowed to completely unlock the performance on a
80-thread setup:
before: 5.4 Gbps, 150k RPS, 80 cores
52.71% haproxy [.] stktable_lookup_key
36.90% haproxy [.] stktable_get_entry.part.0
0.86% haproxy [.] ebmb_lookup
0.18% haproxy [.] process_stream
0.12% haproxy [.] process_table_expire
0.11% haproxy [.] fwrr_get_next_server
0.10% haproxy [.] eb32_insert
0.10% haproxy [.] run_tasks_from_lists
after: 36 Gbps, 980k RPS, 80 cores
44.92% haproxy [.] stktable_get_entry
5.47% haproxy [.] ebmb_lookup
2.50% haproxy [.] fwrr_get_next_server
0.97% haproxy [.] eb32_insert
0.92% haproxy [.] process_stream
0.52% haproxy [.] run_tasks_from_lists
0.45% haproxy [.] conn_backend_get
0.44% haproxy [.] __pool_alloc
0.35% haproxy [.] process_table_expire
0.35% haproxy [.] connect_server
0.35% haproxy [.] h1_headers_to_hdr_list
0.34% haproxy [.] eb_delete
0.31% haproxy [.] srv_add_to_idle_list
0.30% haproxy [.] h1_snd_buf
WIP: uint64_t -> long
WIP: ulong -> uint
code is much smaller
Thanks to the previous commit, we can now simply perform an atomic read
on stksess->seen and take the write lock to recreate the entry only if
at least one peer has seen it, otherwise leave it untouched. On a test
on 40 cores, the performance used to drop from 2.10 to 1.14M RPS when
one peer was connected, now it drops to 2.05, thus there's basically
no impact of connecting a peer vs ~45% previously, all spent in the
read lock. This can be particularly important when often updating the
same entries (user-agent, source address during an attack etc).
Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly because even taking the read lock is sufficient to provoke a
cache line write, while when under load or attack it's frequent that
the update has not yet been propagated and wouldn't require anything.
This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect that the entry must be
updated again so that this peer can see it. Otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
A great care is taken to avoid writes whenever possible.
Some bwlim error messages at parsing time were missing the trailing '\n'
in commit 2b6777021d ("MEDIUM: bwlim: Add support of bandwith limitation
at the stream level"). This commit can be backported wherever the commit
above is (likely as far as 2.7).
Thanks to previous commit, we can now build with USE_SYSTEMD=1 on any
system without requiring any parts from systemd. It just turns our that
there was one remaining include in haproxy.c that needed to be replaced
with haproxy/systemd.h to build correctly. That's what this commit does.
Given the xz drama which allowed liblzma to be linked to openssh, lets remove
libsystemd to get rid of useless dependencies.
The sd_notify API seems to be stable and is now documented. This patch replaces
the sd_notify() and sd_notifyf() function by a reimplementation inspired by the
systemd documentation.
This should not change anything functionnally. The function will be built when
haproxy is built using USE_SYSTEMD=1.
References:
https://github.com/systemd/systemd/issues/32028https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes
Before:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffcfaf65000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)
After:
wla@kikyo:~% ldd /usr/sbin/haproxy
linux-vdso.so.1 (0x00007ffd51901000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)
A backport to all stable versions could be considered at some point.
Since faa8c3e ("MEDIUM: lb-chash: Deterministic node hashes based on
server address") the following configuration will cause haproxy to crash:
backend test1
mode http
balance hash int(1)
server s1 haproxy.org:80
This is because lbprm.update_server_eweight() method is now systematically
called in _srv_set_inetaddr_port() upon srv addr/port change (and with the
above config it happens during startup after initial dns resolution).
However, depending on the chosen lbprm algo, update_server_eweight function
may not be set (it is not a mandatory method, some lb implementations don't
define it).
Thus, using 'balance hash' with map-based hashing or 'balance sticky' will
cause a crash due to a NULL de-reference in _srv_set_inetaddr_port(). To
fix the issue, we first check that the update_server_eweight() method is
set before using it.
No backport needed unless faa8c3e ("MEDIUM: lb-chash: Deterministic node
hashes based on server address") gets backported.
This issue arrived with this commit:
"MINOR: quic: HyStart++ implementation (RFC 9406)"
Thanks to @chipitsine for having reported this issue in GH #2513.
Should be backported where the previous commit will be backported.
In peer_send_msg(), we take a lock before calling
peer_send_teach_process_msgs because of the check on the flags and update
indexes, and the function then drops it then takes it again just to resume
in the same situation, so that on return we can drop it again! Not only
this is absurd because it doubles the costs of taking the lock, it's also
totally inefficient because it takes a write lock while the only usage
that is done with it is to read the indexes! Let's drop the lock from
peer_send_teach_process_msgs() and move it explicitly in its only caller
around the condition, and turn it into a read lock only.
The MAX() macro was used to limit the count of bytes to be transferred
in appctx_raw_rcv_buf() by commit ee53d8421f ("MEDIUM: applet: Simplify
a bit API to exchange data with applets") instead of MIN(). It didn't
seem to have any consequences until commit f37ddbeb4b ("MAJOR: cli:
Update the CLI applet to handle its own buffers") that triggers a BUG_ON()
in __b_putblk() when the other side is slow to read, because we're trying
to append a full buffer on top of a non-empty one. A way to reproduce it
is to dump a heavy stick table on the CLI with a screen scrolling.
No backport is needed since this was introduced in 3.0-dev3 and revealed
after dev5 only.
In 2.9, the stick-tables' locking was split between the lock used to
manipulate the contents (->lock) and the lock used to manipulate the
list of updates and the update indexes (->updt_lock). This was done
with commit 87e072eea5 ("MEDIUM: stick-table: use a distinct lock for
the updates tree"). However a part was overlooked in the peers code,
the parts that consult (and update) the indexes use the table's lock
instead of the update lock. It's surprising that it hasn't caused more
trouble. It's likely due to the fact that the tree nodes are not often
immediately freed and that their memory area remains connected to valid
nodes in the tree during peer_stksess_lookup(), while other parts only
check or update indexes, thus are not that critical.
This needs to be backported wherever the commit above is, thus logically
2.9.
It is only an issue when the kernel splicing is used. The zero-copy
forwarding via the buffers is not affected. When a shutdown is received on
the producer side and some data are blocked in the pipe for a while, the
shutdown may be forwarded to the other side. Usually, in this case, the
shutdown must be scheduled, waiting all output data (from the channel and
the consumer's iobuf) are sent. But only the channel was considered.
The bug was introduced by commit 20c463955d ("MEDIUM: channel: don't look at
iobuf to report an empty channel"). To fix the issue, we must also check
data blocked in the consummer iobuf.
This patch should solve the issue #2505. It must be backported to 2.9.
This is a simple algorithm to replace the classic slow start phase of the
congestion control algorithms. It should reduce the high packet loss during
this step.
Implemented only for Cubic.
According to the documentation, "option redispatch 0" is expected to
disable redispatch just like "no option redispatch", but due to the
fact that it keeps PR_O_REDISP set, it doesn't actually work. Let's
make sure value 0 is properly handled and drops PR_O_REDISP. This can
be backported to all versions since it seems it has been broken since
its introduction in 1.6 with commit 726ab7145c ("MEDIUM: backend: Allow
redispatch on retry intervals").
As a workaround, "no option redispatch" does work though.
In 2.7 we addressed a race condition in the stick tables expiration task
with commit fbb934d ("BUG/MEDIUM: stick-table: fix a race condition when
updating the expiration task"). The issue was that the task could be
running on another thread which would destroy its expiration timer
while one had just recalculated it and prepares to queue it, causing
a bug due to the attempt to queue an expired task. The fix consisted in
enclosing the change into the stick-table's lock, which had a very low
cost since it's done only after having checked that the date changed,
i.e. no more than once every millisecond.
But as reported by Ricardo and Felipe from Taghos in github issue #2508,
a tiny race remained after the fix: the unlock() was done before the call
to task_queue(), leaving a tiny window for another thread to run between
unlock() and task_queue() and erase the timer. As confirmed, it's
sufficient to also protect the task_queue() call.
But overall this raises a point regarding the task_queue() API on tasks
that may run anywhere. A while ago an attempt was made at removing the
timer for woken up tasks, but something like this would be deserved
with more atomicity on the timer manipulation (e.g. atomically use
task_schedule() instead maybe). This should be backported to all
stable branches.
Motivation: When services are discovered through DNS resolution, the order in
which DNS records get resolved and assigned to servers is arbitrary. Therefore,
even though two HAProxy instances using chash balancing might agree that a
particular request should go to server3, it is likely the case that they have
assigned different IPs and ports to the server in that slot.
This patch adds a server option, "hash-key <key>" which can be set to "id" (the
existing behaviour, default), "addr", or "addr-port". By deriving the keys for
the chash tree nodes from a server's address and port we ensure that independent
HAProxy instances will agree on routing decisions. If an address is not known
then the key is derived from the server's puid as it was previously.
When adjusting a server's weight, we now check whether the server's hash has
changed. If it has, we have to remove all its nodes first, since the node keys
will also have to change.
A compilation error occurs when using DEBUG_MEM_STATS due to a variable
now being unused in debug_iohandler_memstats() :
src/debug.c: In function ‘debug_iohandler_memstats’:
src/debug.c:1862:24: error: unused variable ‘sc’ [-Werror=unused-variable]
1862 | struct stconn *sc = appctx_sc(appctx);
| ^~
This is caused since the following commit :
94b8ed446f
MEDIUM: cli/applet: Stop to test opposite SC in I/O handler of CLI commands
This must not be backported.
log load-balancing implementation was not seamlessly integrated within
lbprm API. The consequence is that it could become harder to maintain
over time since it added some specific cases just for the log backend.
Moreover, it resulted in some code duplication since balance algorithms
that are common to logs and regular (tcp, http) backends were specifically
rewritten for log backends.
Thanks to the previous commit, we now have all the prerequisites to make
log load-balancing fully leverage lbprm logic. Thus in this patch we make
__do_send_log_backend() use existing lbprm algorithms, and we no longer
require log-specific lbprm initialization in cfgparse.c and in
postcheck_log_backend().
As a bonus, for log backends this allows weighed algorithms to properly
support weights (ie: roundrobin, random and log-hash) since we now
leverage the same lb algorithms that we use for tcp/http backends
(doc was updated).
As previously mentioned in cd352c0db ("MINOR: log/balance: rename
"log-sticky" to "sticky""), let's define a sticky algorithm that may be
used from any protocol. Sticky algorithm sticks on the same server as
long as it remains available.
The documentation was updated accordingly.
b61147fd ("MEDIUM: log/balance: merge tcp/http algo with log ones")
introduced some ambiguities, because while it shares some algos with the
ones from mode {tcp,http}, we forgot report an error when the user tries
to use an algorithm that is not available in this mode (as per the doc).
Because of that, haproxy would silently drop log messages during runtime.
To fix that, we ensure that algo is one of the supported ones during log
backend postparsing. If the algo is not supported, we raise an error.
This should be backported in 2.9 with b61147fd
The CLI applet is now using its own snd_buf callback function. Instead of
copying as most output data as possible, only one command is copied at a
time.
To do so, a new state CLI_ST_PARSEREQ is added for the CLI applet. In this
state, the CLI I/O handle knows a full command was copied into its input
buffer and it must parse this command to evaluate it.
This flag can be use by endpoints to know the data to send, via .snd_buf
callback function are the last ones. It is useful to know a shutdown is
pending but it cannot be delivered while sedning data are not consumed.
It is now the responsbility of applets .snd_buf callback function to notify
the input buffer is full. This will allow the applets to not consume all
data waiting for more data. Of course, it is only useful for applets using a
custom .snd_buf callback function.
It is the third applet to be refactored to use its own buffers. In addition to
the CLI applet, some I/O handlers of CLI commands were also updated, especially
the stats ones.
Some command I/O handlers were updated to use applet's buffers instead of
channels ones.
It is an harmless bug for now because only stats and cache applets are using
their own buffers and it is not possible to trigger this bug with these
applets. However, it remains important to try a receive if EOI, EOS or ERROR
is reached by the applet while no data was produced. Otherwise, it is not
possible to ack these events at the SE level.
No backport needed.
The main CLI I/O handle is responsible to interrupt the processing on
shutdown/abort. It is not the responsibility of the I/O handler of CLI
commands to take care of it.
Instead of using connection versions, we now use generic versions. It means
we will also perfom sync receives and sync sends on applets too, but only
for applets using their own buffers. Old applets are not concerned.
sc_sync_recv() and sc_sync_send() were added to use connection or applet
versions, depending on the endpoint type. For now these functions are not
used. But this will be used by process_stream() to replace the connection
version.
Too big command, larger than a buffer, was silently rejected by the CLI
applet. It was handled as an error and the connection was closed, but no
error message was reported to user to notify him. Now an error is reported
before closing. It is only displayed if the chunk buffer used by the CLI
applet is full and no delimiter (\n or ;) is found to mark the end of the
command. It works for a simple command but also for a command with a huge
payload.
This patch could be backported to all stable versions.
This commit allows "cookie" keyword for dynamic servers. After code
review, nothing was found which could prevent a dynamic server to use
it. An extra warning is added under cli_parse_add_server() if cookie
value is ignored due to a non HTTP backend.
This patch is not considered a bugfix. However, it may backported if
needed as its impact seems minimal.
When adding a server dynamically, we observe that when a backend has a
dynamic persistence cookie, the new server has no cookie as we receive
the following HTTP header:
set-cookie: test-cookie=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
Whereas we were expecting to receive something like the following, which
is what we receive for a server added in the config file:
set-cookie: test-cookie=abcdef1234567890; path=/
After investigating code path, srv_set_dyncookie() is never called when
adding a server through CLI, it is only called when parsing config file
or using "set server bkd1/srv1 addr".
To fix this, call srv_set_dyncookie() inside cli_parse_add_server().
This patch must be backported up to 2.4.
Since their first implementation, dynamic servers are created into
maintenance state. This has been done purposely to avoid immediate
activation of a newly inserted server.
However, this principle is incompatible if "enabled" keyword is used on
"add server". The newly created instance will be unreacheable as proxy
load-balancing algorithm is not informed of its presence via
srv_lb_propagate(). The new server could be unblocked by toggling its
state with "disable server" / "enable server" commands, which will
trigger srv_lb_propagate() invocation.
To avoid this unexpected state, simply forbid "enabled" keyword for
dynamic servers. In the long-term, it could be possible to re authorize
it but at least this requires to call srv_lb_propagate() on dynamic
server creation.
This should fix github issue #2497.
This patch should not be backported as-is, to avoid breaking dynamic
servers API on stable versions. "enabled" should instead be ignored for
them. This will be implemented in a dedicated patch on top of 2.9.
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows to activate ocsp-update on
certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
The inconsistencies in 'ocsp-update' parameter were only checked when
parsing a crt-list line so if a certificate was used on a bind line
after being used in a crt-list with 'ocsp-update' set to 'on', then no
error would be raised. This patch helps detect such inconsistencies.
This patch can be backported up to branch 2.8.
In a crt-list such as the following:
foo.pem [ocsp-update off] foo.com
foo.pem bar.com
we would get a wrong "Incompatibilities found in OCSP update mode ..."
error message during init when the two lines are actually saying the
same thing since the default for 'ocsp-update' option is 'off'.
This patch can be backported up to branch 2.8.
A recent issue was uncovered by the CI which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turn
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring, only its storage
area needs to be unmapped, and the ring must always be freed separately.
Thanks to Christopher and William for their help at trying to reproduce
it and figure the circumstances that triggers it.
No backport is needed.
Sebastien Gross reported that 'interface' keyword ('source' subargument)
is silently ignored when used from 'default-server' directive despite the
documentation implicitly stating that the keyword should be supported
there.
When support for 'source' keyword was added to 'default-server' directive
in dba97077 ("MINOR: server: Make 'default-server' support 'source'
keyword."), we properly duplicated the conn iface_name from the default-
server but we forgot to copy the conn iface_len which must be set as well
since it is used as setsockopt()'s 'optlen' argument in
tcp_connect_server().
It should be backported to all stable versions.
OpenSSL 3.2 triggers the code part added by commit 25da217 ("MINOR: ssl:
Update ssl_fc_curve/ssl_bc_curve to use SSL_get0_group_name") which
contains a variable declaration in the for() statement and breaks on
older compilers, as reported in GH issues #2501.
Let's just declare it normally to fix the problem. This must be
backported wherever the commit above is (at least 2.9).
We don't care in what order the threads are released, so we can write
their sent value using relaxed atomic stores. This brings a 3-5% perf
boost on ARM with 80 cores, reaching 7.25M/s, and doesn't change
anything on x86 since it keeps using strict ordering.
It has been found that performing a first pass consisting in copying
all messages, and a second one to notify about releases is more efficient
on AMD than updating all of them on the fly using a CAS, despite making
writers wait longer to be released.
Maybe it's related to the ability for the CPU to prefetch the contents
during a simple load while it wouldn't do it for an XCHG, it's unsure
at this point. This will also mater permit to use relaxed stores to
release threads.
On ARM the performance increased to 7.0M/s. If this patch is applied
before the dropping of the intermediary step, instead it drops to
3.9M/s. This shows the dependency between such changes that strive to
limit the number of writes on the fast path.
On x86_64, the EPYC at 3C6T saw a small drop from 4.57M to 4.45M, but
the 24C48T setup saw a nice 33% boost from 3.33M to 4.44M, i.e. we
get stable perf at 3 and 24 cores, despite having 8 CCX involved and
fighting with each other.
Other possibilities are:
- use of HA_ATOMIC_XCHG() instead of FETCH_OR()
=> slightly faster (4.62/7.37 vs 4.58/7.34). Pb: requires to
modify the readers to wait much longer since the tail value
won't be valid in this case during updates, and it will have
to wait by looping over it.
- use other conditions to release a cell
=> to be tested
Archs relying on CAS benefit from a read prior to FETCH_OR, so it's
not just x86 that benefits from this. Let's just change the condition
to only exclude __ARM_FEATURE_ATOMICS which is the only one faster
without.
The loop was cleaned up a little bit so that the inner loops are more
readable and that the ifdef'd parts are whole blocks and not just an
"if" condition. A few conditions were adjusted to benefit from "break"
and "continue".
This is mostly a cleanup in that it turns the two-level loop into a
single one, but it also simplifies the code a little bit and brings
some performance savings again, which are mostly noticeable on ARM,
but don't change anything for x86.
x86_64 doesn't have a native atomic FETCH_OR(), it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits. This brings the performance back from 3.33M to 3.75M at 24C48T
while doing no change at 3C6T.
By doing that and placing the cpu_relax at the right places, the ARM
reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees a small
increase from 4.45M to 4.57M but at 24C48T it sees a drop from 3.82M
to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), that we'll address next.
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room is
needed to be released.
It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.
By doing so we're cascading contention areas so that multiple parts can
work in parallel.
Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.
The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.
This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern as so few threads need less
bandwidth than bigger counts.
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.
Tests were run with different numbers and it was showed that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.
This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.
It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
Code disassembly shows that ring->storage->tail and ring->queue are
accessed a lot and reloaded a lot due to aliasing. Let's just have
variables for them in the local stack. It makes the code smaller and
slightly faster.
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.
There used to be a few cases of watchdogs before this on a 24-core AMD
EPYC platform on the list iteration those never appeared anymore.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
If there's nothing to read, it's pointless for a reader to try to update
the offset pointer, that's two atomic ops to replace a value by itself
twice. Let's just stop this.
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the oredered tail updates and updates to the readers
count to protect the area.
The message rate on small thread counts (up to 12) saw a boost of
roughly 5% while on large counts while for large counts it lost
about 2% due to some contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
The writer is using tags 0xFF instead of readers count at the front of
messages that are undergoing an update, while the tail has already been
updated. The reader needs to take care of this because it can face these
messages and mistakenly parse data that's still being written, leading
to corruption (especially if this happens while the size is changing).
Let's just stop reading when facing reserved codes, since they indicate
that the end of usable messages was reached.
Since we're going to remove the lock, there's no more way to prevent the
ring from being fed while we're attaching a client to it. We need to
freeze the buffer while looking at its head so that we can attach there
and have a trustable one. We could do it by setting the lock bit on the
tail offset but quite frankly we don't need to bother with that, attaching
a client is rare enough to permit a thread_isolate().
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
There's no point looking for freshly attached readers if there are none,
taking this lock requires an atomic write to a shared area, something we
clearly want to avoid.
A general test with 213-byte messages on different thread counts shows
how the performance degrades across CCX and how this patch improves the
situation:
Before After
3C6T/1CCX: 6.39 Mmsg/s 6.35 Mmsg/s
6C12T/2CCX: 2.90 Mmsg/s 3.16 Mmsg/s
12C24T/4CCX: 2.14 Mmsg/s 2.33 Mmsg/s
24C48T/8CCX: 1.75 Mmsg/s 1.92 Mmsg/s
This tends to confirm that the queues will really be needed and that
they'll have to be per-ccx hence per thread-group. They will amortize
the number of updates on head & tail (one per multiple messages).
We know we can continue to protect the message area so we can unlock the
tail as soon as we know its new value. Now we're seeing ~6.4M msg/s vs
5.4M previously on 3C6T of a 3rd gen EPYC, and 1.88M vs 1.54M for 24C48T
threads, which is a significant gain!
This requires to carefully write the new head counter before releasing
the writers, and to change the calculation of the work area from
tail..head to tail...new_tail while writing the message.
Now the lock is only taken around the readers list. With careful
ordering of writes to head/tail, the ring remains protected.
The perf is a bit better, though (1.54M msg/s vs 1.4M at 48T on
a 3rd gen EPYC, and 5.4M vs 5.3M for a 3C6T setup).
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practise,
good luck for allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is les good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
The reader now needs to protect the positions it's reading. This is
already done via the readers counter at the beginning of messages,
but as long as the lock is present, this counter is decremented
before starting to parse messages, and incremented at the end.
We must now do that in reverse, first protect the end of the messages,
and only then remove ourselves from the already processed messages, so
that at no point could a writer pass over and possibly overwrite data
we're currently watching.
The goal here is to start to protect the writing area inside the area
itself so that we'll later be able to release the ring's lock. We're not
there yet, but at least the tail is marked as protected for as long as the
message is not fully written.
We'll want to reserve some special values for the readers count to
temporary lock the following message, but for this it will be mandatory
that readers check for them before incrementing/decrementing the counter.
Let'sdo that using a CAS. The readers performance is not as critical as
the writer's anyway so the slight overhead is not a problem.
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool will thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
This is the start of the replacement of the buffer API calls. Only the
ring_write() function was touched. Instead of manipulating a buffer all
along, we now extract the ring buffer's head and tail upon entry, store
them locally and use them using the vec<->ring API until the last moment
where we can update the buffer with the new values. One subtle point is
that we must never fill the buffer past the last byte otherwise the
vec-to-ring conversion gets lost and there's no more possibility to know
where's the beginning nor the end (just like when dealing with head+tail
in fact), because it then becomes impossible to distinguish between an
empty and a full buffer.
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
The ring resizing was already quite tricky, but when facing atomic
writes it will no longer be possible and we definitely do not want to
have to deal with a lock there. Since it's only done at boot time, and
possibly later from the CLI, let's just do it under thread isolation.
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.