It's better to see "make" entering a subdir than seeing nothing, so
let's use a command name for make. Since make 3.81, "+" needs to be
prepended to the command to pass the job server to the subdir.
The $(Q), $(V), $(cmd_xx) handling needs to be reused in sub-project
makefiles and it's a pain to maintain inside the main makefile. Let's
just move that into a new subdir include/make/ with a dedicated file
"verbose.mk". This also slightly cleans up the main makefile.
When building with DEBUG_MEM_STATS, we only see b_alloc() and b_free() as
users of the "buffer" pool, because all call places rely on these more
convenient functions. It's annoying because it makes it very hard to see
which parts of the code are consuming buffers.
By switching the b_alloc() and b_free() inline functions to macros, we
can now finally track the users of struct buffer, e.g.:
mux_h1.c:513 P_FREE size: 1275002880 calls: 38910 size/call: 32768 buffer
mux_h1.c:498 P_ALLOC size: 1912438784 calls: 58363 size/call: 32768 buffer
stream.c:763 P_FREE size: 4121493504 calls: 125778 size/call: 32768 buffer
stream.c:759 P_FREE size: 2061697024 calls: 62918 size/call: 32768 buffer
stream.c:742 P_ALLOC size: 3341123584 calls: 101963 size/call: 32768 buffer
stream.c:632 P_FREE size: 1275068416 calls: 38912 size/call: 32768 buffer
stream.c:631 P_FREE size: 637435904 calls: 19453 size/call: 32768 buffer
channel.h:850 P_ALLOC size: 4116480000 calls: 125625 size/call: 32768 buffer
channel.h:850 P_ALLOC size: 720896 calls: 22 size/call: 32768 buffer
dynbuf.c:55 P_FREE size: 65536 calls: 2 size/call: 32768 buffer
Let's do this since it doesn't change anything for the generated code
(beyond adding the call places). Interestingly, the code even got
slightly smaller.
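For illustration, a minimal sketch of the pattern with hypothetical
names (raw_buffer_alloc() and the struct layout are invented here; the
real code relies on haproxy's DEBUG_MEM_STATS infrastructure): as a
macro, __FILE__ and __LINE__ expand at the caller, so every call place
gets its own counter, while an inline function would only ever report
its own location.

  #include <stddef.h>

  struct mem_stats {               /* one static instance per call place */
          const char *file;
          int line;
          size_t calls;
  };

  void *raw_buffer_alloc(void);    /* hypothetical underlying allocator */

  /* GCC statement expression: record the caller's location, then
   * forward to the real allocator
   */
  #define buffer_alloc_tracked() ({                                  \
          static struct mem_stats _st = { __FILE__, __LINE__, 0 };   \
          _st.calls++;                                               \
          raw_buffer_alloc();                                        \
  })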
This macro just serves as an intermediary for __pool_alloc() and forwards
the flag. When DEBUG_MEM_STATS is set, it will be used to collect all
pool allocations including those which need to pass an explicit flag.
It's now used by b_alloc() which previously couldn't be tracked by
DEBUG_MEM_STATS, causing some free() calls to have no corresponding
allocations.
Similarly to the previous patch, it's better to keep a local copy of
the new node's key instead of accessing it every time. This slightly
reduces the code's size in the descent and further improves the load
time to 7.45s.
Looking at a perf profile while loading a conf with a huge map, it
appeared that there was a hot spot on the access to the new node's
prefix, which was unexpectedly being reloaded for each visited node
during the tree descent. Better to keep a copy of it, because with
large trees that don't fit into the L3 cache, memory bandwidth is
scarce. Doing so reduces the load time from 8.0 to 7.5 seconds.
Once in a while we spot a bug in the deinit code, which is complex,
especially when it has to deal with incomplete initializations, and the
ability to bypass this step has regularly been requested. In addition,
for fast-reloading setups it could theoretically save some time. Tests
have shown that very large configs can barely save ~100-150ms by
skipping the deinit step. However the ability not to crash if a bug is
encountered can occasionally help.
This patch adds an option to do exactly this. It's obviously not
enabled by default and the documentation discourages using it, but this
might be useful in the future.
The ->exp_next field of the stick-table was probably useful in 1.5 but
it currently only carries a copy of what the future value of the
table's task's expire value will be, and it's systematically copied
over there immediately after being assigned. As such it provides
exactly what a local variable would. Let's remove it, as it costs
atomic operations.
Coverity raised a potential overflow issue in these new functions that
work on unsigned long long objects. They were added in commit 9b25982
"BUG/MEDIUM: ssl: Verify error codes can exceed 63".
This patch needs to be backported alongside 9b25982.
When the code is preprocessed first and compiled later, such as when
built under distcc, a lot of fallthrough warnings are emitted because
the preprocessor has already stripped the comments.
As an alternative, a "fallthrough" attribute was added by the same
compilers that started emitting those warnings. However it's not
portable to older compilers. Let's just define a __fallthrough
statement that corresponds to this attribute on supported compilers
and only switches to the classical empty do {} while (0) on other ones.
This way the code will support being cleaned up using __fallthrough.
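A minimal sketch of the approach (the exact guards in the real
compiler.h may differ; handle_a()/handle_b() are placeholders):

  /* pick the attribute where supported, else an empty statement */
  #if defined(__has_attribute)
  #  if __has_attribute(fallthrough)
  #    define __fallthrough __attribute__((fallthrough))
  #  endif
  #endif
  #ifndef __fallthrough
  #  define __fallthrough do { } while (0)
  #endif

  void handle_a(void);
  void handle_b(void);

  void demo(int c)
  {
          switch (c) {
          case 'a':
                  handle_a();
                  __fallthrough;   /* silences -Wimplicit-fallthrough */
          case 'b':
                  handle_b();
                  break;
          }
  }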
It happens that gcc since 5.x has had the __has_attribute() macro,
which is only mentioned once in the doc, associated with
__builtin_has_attribute(). Clang has had it at least since 3.0. In
addition it validates #ifdef when present, so it's easy to detect.
Here we're providing a fallback to another macro __has_attribute_<name>
so that it's possible to define that macro to the value 1 for older
compilers when the attribute is supported.
In order to simplify compiler-specific checks, we'll need to check if
some attributes exist. In order to ease declarations, we'll only focus
on those that exist and will set them to 1. Let's first add a macro
aimed at doing this. Given a macro name as argument, it will return 1
if the macro is defined and equals 1, otherwise it will return 0. This
is based on the concatenation of the macro's value with a name to form
the name of a macro which contains one comma, resulting in some other
macro arguments being shifted by one when the macro is defined. As such
it's only a matter of pushing both a 1 and a 0 and picking the correct
argument to see the desired one. It was verified to work since at least
gcc-3.4 so it should be portable enough.
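A sketch of the trick with illustrative names (the real macro names
differ):

  /* _one_comma_1 only exists for the value 1 and expands to "0," ,
   * shifting the following arguments by one position
   */
  #define _one_comma_1 0,
  #define _second_arg(a, b, ...)    b
  #define _defined_to_1_step2(tok)  _second_arg(tok 1, 0)
  #define _defined_to_1_step1(val)  _defined_to_1_step2(_one_comma_##val)
  #define DEFINED_TO_1(x)           _defined_to_1_step1(x)

  /* with "#define FOO 1": DEFINED_TO_1(FOO) -> _second_arg(0, 1, 0) -> 1
   * with FOO undefined or another value:
   *   DEFINED_TO_1(FOO) -> _second_arg(_one_comma_FOO 1, 0) -> 0
   */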
When the code is preprocessed first and compiled later, such as when
built under distcc, the "fall through" comments are dropped and
warnings are emitted. Let's use the alternative "fallthrough" attribute
instead, which is supported by the same versions of gcc and clang that
produce this warning.
This is libslz upstream commit 0fdf8ae218f3ecb0b7f22afd1a6b35a4f94053e2.
This is the latest released version and a minor update on top of the
current one (0.8.0). It addresses a few build issues (some for which
patches were already backported), and particularly the fallthrough
issue, by using an attribute instead of a comment.
HA_WEAK() is supposed to take a symbol as argument, not a string,
since the asm statements it produces already quote the argument.
Having it quoted twice doesn't work on older compilers, and this was
the only reason why DEBUG_MEM_STATS didn't work on them.
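For illustration, a weak-symbol macro of this shape performs the
stringification itself, so it must be given the bare symbol (a sketch,
not the exact haproxy definition):

  /* the #sym stringification adds the quotes itself */
  #define HA_WEAK_SKETCH(sym) __asm__(".weak " #sym)

  void my_init_func(void);
  HA_WEAK_SKETCH(my_init_func);   /* emits:  .weak my_init_func */
  /* HA_WEAK_SKETCH("my_init_func") would emit: .weak "my_init_func" */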
The CLI 'add server' handler relies on usermsgs_ctx to display errors
from internal functions on the CLI output. This may also be extended to
other handlers.
However, to avoid cluttering stderr from other contexts, usermsgs_ctx
must be reset when it is not needed anymore. This operation cannot be
conducted in the CLI parse handler as the display happens after it.
To achieve this, define new CLI states CLI_ST_PRINT_UMSG /
CLI_ST_PRINT_UMSGERR. Their principle is nearly identical to the states
for dynamic message printing.
Rename CLI_ST_PRINT_FREE to CLI_ST_PRINT_DYNERR.
Most notably, this highlights that it is reserved for error printing.
This is done to ensure consistency between CLI_ST_PRINT/CLI_ST_PRINT_DYN
and CLI_ST_PRINT_ERR/CLI_ST_PRINT_DYNERR. The name is also consistent
with the function cli_dynerr() which activates it.
The ca-ignore-err and crt-ignore-err directives are now able to use the
openssl X509_V_ERR constant names instead of the numerical values.
This allows a configuration to survive an OpenSSL upgrade, because the
numerical IDs can change between versions. For example,
X509_V_ERR_INVALID_CA was 24 in OpenSSL 1 and is 79 in OpenSSL 3.
The list of errors must be updated when a new major OpenSSL version is
released.
The CRT and CA verify error codes were stored in 6 bits each in the
xprt_st field of the ssl_sock_ctx, meaning that only error codes up to
63 could be stored. Likewise, the ca-ignore-err and crt-ignore-err
options relied on two unsigned long longs that were used as bitfields
for all the ignored error codes. On the latest OpenSSL 1.1.1 and with
OpenSSL 3 and newer, verify errors have exceeded this value so these
two storages had to be increased. The error codes are now stored on 7
bits each and the ignore-err bitfields are replaced by a big enough
array and dedicated bit get and set functions.
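A sketch of the array-based storage under assumed names and bounds
(SSL_MAX_VFY_ERR is an illustrative limit; 7 bits allow codes up to
127):

  #include <limits.h>

  #define SSL_MAX_VFY_ERR 128
  #define BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)

  struct vfy_err_bits {
          unsigned long bits[(SSL_MAX_VFY_ERR + BITS_PER_LONG - 1) / BITS_PER_LONG];
  };

  static inline void vfy_err_set(struct vfy_err_bits *bf, unsigned int code)
  {
          bf->bits[code / BITS_PER_LONG] |= 1UL << (code % BITS_PER_LONG);
  }

  static inline int vfy_err_get(const struct vfy_err_bits *bf, unsigned int code)
  {
          return !!(bf->bits[code / BITS_PER_LONG] & (1UL << (code % BITS_PER_LONG)));
  }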
It can be backported on all stable branches.
[wla: let it be tested a little while before backport]
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
Add a new counter "quic_rxbuf_full". It is incremented each time
quic_sock_fd_iocb() is interrupted on a full buffer.
This should help to debug github issue #1903. It is suspected that
QUIC receiver buffers are full, which in turn causes quic_sock_fd_iocb()
to be called repeatedly, resulting in high CPU consumption.
Subscribing was not properly designed between the quic-conn and quic
MUX layers. Align this with other haproxy components: the <subs> field
is moved from the MUX to the quic-conn structure. All mention of the
qcc MUX is cleaned up in quic_conn_subscribe()/quic_conn_unsubscribe().
Thanks to this change, ACK reception notification has been simplified.
It's now unnecessary to check for the MUX existence before waking it.
Instead, if the <subs> quic-conn field is set, just wake up the upper
layer tasklet without mentioning the MUX. This should probably be
extended to other parts of the quic-conn code.
This should be backported up to 2.6.
Add "shards" new keyword for "peers" section to configure the number
of peer shards attached to such secions. This impact all the stick-tables
attached to the section.
Add "shard" new "server" parameter to configure the peers which participate to
all the stick-tables contents distribution. Each peer receive the stick-tables updates
only for keys with this shard value as distribution hash. The "shard" value
is stored in ->shard new server struct member.
cfg_parse_peers() which is the function which is called to parse all
the lines of a "peers" section is modified to parse the "shards" parameter
stored in ->nb_shards new peers struct member.
Add srv_parse_shard() new callback into server.c to pare the "shard"
parameter.
Implement stksess_getkey_hash() to compute the distribution hash for a
stick-table key as the 64-bits xxhash of the key concatenated to the stick-table
name. This function is called by stksess_setkey_shard(), itself
called by the already implemented function which create a new stick-table
key (stksess_new()).
Add ->idlen new stktable struct member to store the stick-table name length
to not have to compute it each time a stick-table key hash is computed.
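A sketch of the assumed hash shape (function and parameter names
invented here; shard selection by modulo is an assumption):

  #include <stddef.h>
  #include <stdint.h>
  #include "xxhash.h"

  /* 64-bit xxhash over the table name (length cached in ->idlen)
   * followed by the key, then mapped onto the configured shards
   */
  static uint64_t key_shard_sketch(const char *table_id, size_t idlen,
                                   const void *key, size_t keylen,
                                   unsigned int nb_shards)
  {
          XXH64_state_t st;

          XXH64_reset(&st, 0);
          XXH64_update(&st, table_id, idlen);
          XXH64_update(&st, key, keylen);
          return XXH64_digest(&st) % nb_shards;
  }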
This patch completes the previous incomplete commit. The new counter
sendto_err_unknown is now displayed on the stats page/CLI show stats.
This is related to github issue #1903.
This should be backported up to 2.6.
Remove the ABORT_NOW() statement on unhandled sendto errors. Instead
use a dedicated counter sendto_err_unknown to report these cases.
If this counter is seen incrementing, strace can be used to determine
the errno value:
$ strace -p $(pidof haproxy) -f -e trace=sendto -Z
This should be backported up to 2.6.
This should help to debug github issue #1903.
Max stream data was not enforced nor respected for local/remote uni
streams. Previously, qcs instances incorrectly reused the limit defined
for bidirectional ones.
This is now fixed. Two fields are added to the qcc connection
structure:
* the value for local flow control to enforce on remote uni streams
* the value for remote flow control to respect on local uni streams
These two values can be reused to properly initialize the msd field of
a qcs instance in qcs_new(). The rest of the code is similar.
This must be backported up to 2.6.
Add a new mt_list macro: MT_LIST_APPEND_LOCKED.
This macro may be used to append an item to an existing list, like
MT_LIST_APPEND.
But here the item will be forced into the locked/busy state prior to
appending, so that it is already referenced in the list while still
preventing concurrent accesses until we decide to unlock it.
The macro returns a struct mt_list "np", which is needed at unlock time
by the regular MT_LIST_UNLOCK_ELT macro.
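A hypothetical usage sketch (struct work_item and its fields are
invented for illustration):

  #include <haproxy/list.h>   /* mt_list API, as found in the haproxy tree */

  struct work_item {
          struct mt_list list;
          int state;          /* hypothetical payload */
  };

  struct mt_list work_queue = MT_LIST_HEAD_INIT(work_queue);

  void publish_item(struct work_item *item)
  {
          /* append in locked/busy state: the element is already in
           * the list but concurrent accesses are held off
           */
          struct mt_list np = MT_LIST_APPEND_LOCKED(&work_queue, &item->list);

          item->state = 1;    /* finish initialization while unreachable */

          /* publish it using the saved <np> value */
          MT_LIST_UNLOCK_ELT(&item->list, np);
  }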
The MT_LIST_LOCK_ELT macro was documented with an ambiguous usage
restriction, implying that concurrent list deletion was not supported.
But it seems that either the code has evolved, or the comment is wrong,
because the locking behavior implemented here is exactly the same one
used in MT_LIST_DELETE, and no such restriction is described for
MT_LIST_DELETE.
I made some tests to make sure concurrent MT_LIST_DELETE (or deletion
from mt_list_for_each_entry_safe) don't cause unexpected results.
At the present time, this macro is not used, this fix only
targets upcoming developments that might rely on this.
No backport needed.
A minor typo was made in MT_LIST_LOCK_ELT, preventing
haproxy from compiling if MT_LIST_LOCK_ELT is
used in the code.
Today, the macro is unused, and that's the reason why
the typo has remained unnoticed for such a long time.
Fixing it so it can be used in upcoming developments.
No backport required.
When the lua task finishes before the httpclients that are associated
with it, there is a risk that an httpclient tries to task_wakeup() the
lua task which does not exist anymore.
To fix this issue, the httpclients used in a lua task are stored in a
list, and are destroyed at the end of the lua task.
Must be backported in 2.5 and 2.6.
Received packets are treated slightly differently depending on whether
they are the first one of the encapsulating datagram or not.
Previously, this was indicated via a function argument. Simplify this
by defining a new Rx packet flag named QUIC_FL_RX_PACKET_DGRAM_FIRST.
This change does not have any functional impact. It will simplify the
API when qc_lstnr_pkt_rcv() is broken into several functions: their
number of arguments will be reduced thanks to this patch.
This should be backported up to 2.6.
The pn_offset field was only set if header protection could not be
removed. Extend the usage of this field: it is now set every time on
packet parsing in qc_lstnr_pkt_rcv().
This change helps to clean up the API of the Rx functions by removing
unnecessary variables and function arguments.
This change has no functional impact. It is part of a refactoring
series on qc_lstnr_pkt_rcv(). The objective is to facilitate the
integration of the FD-owned socket patches.
This should be backported up to 2.6.
Add a new <version> field to the quic_rx_packet structure. It is set
during header parsing in the qc_lstnr_pkt_rcv() function.
This change has no functional impact. It is part of a refactoring
series on qc_lstnr_pkt_rcv(). The objective is to facilitate the
integration of the FD-owned socket patches.
This should be backported up to 2.6.
Right now the QUIC thread mapping derives the thread ID from the CID
by dividing by global.nbthread. This is a problem because it makes
QUIC work on all threads and ignore the "thread" directive on the
bind lines. In addition, only 8 bits are used, which is no longer
compatible with the up to 4096 threads we may have in a configuration.
Let's modify it this way:
- the CID now dedicates 12 bits to the thread ID
- on output we continue to place the TID directly there.
- on input, the value is extracted. If it corresponds to a valid
thread number of the bind_conf, it's used as-is.
- otherwise it's used as a rank within the current bind_conf's
thread mask so that in the end we still get a valid thread ID
for this bind_conf.
The extraction function now requires a bind_conf in order to get the
group and thread mask. It was better to use bind_confs now as the goal
is to make them support multiple listeners sooner or later.
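An illustrative sketch of the extraction logic (not the exact code;
the mask is assumed non-empty and limited to a single long):

  #include <stdint.h>

  static unsigned int cid_to_tid_sketch(uint16_t cid_tid,
                                        unsigned long mask,
                                        unsigned int nbthr)
  {
          unsigned int rank, tid;

          /* direct hit: the 12-bit value names an enabled thread */
          if (cid_tid < nbthr && cid_tid < sizeof(mask) * 8 &&
              (mask & (1UL << cid_tid)))
                  return cid_tid;

          /* otherwise use it as a rank among the enabled threads */
          rank = cid_tid % __builtin_popcountl(mask);
          for (tid = 0; ; tid++) {
                  if ((mask & (1UL << tid)) && !rank--)
                          return tid;
          }
  }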
When compiled with USE_SHM_OPEN=1, the startup-logs are now able to
use a shm which is used to keep the logs when switching to mworker wait
mode. This allows keeping the logs of a failed reload.
When allocating the startup-logs at the first start of the process,
haproxy does a shm_open with a unique path using the PID of the
process; the file is unlinked immediately so we don't leave unwelcome
files behind. The fd resulting from this shm is stored in the
HAPROXY_STARTUPLOGS_FD environment variable so it can be mmapped again
when switching to wait mode.
When forking children, the process copies the mmapped area to a
mallocated ring so we never share the same memory section between the
master and the workers. When switching to wait mode, the shm is not
used anymore as it is also copied to a mallocated structure.
This allows using the "show startup-logs" command over the master CLI
to get the logs of the latest startup or reload. This way the logs of
the latest failed reload are also kept.
This is only activated on the linux-glibc target for now.
Split b_force_xfer() into b_ncat() and b_force_xfer().
The previous b_force_xfer() implementation was basically a copy with a
b_del on the src buffer. Keep this implementation as b_ncat(), and make
b_force_xfer() simply call b_ncat() + b_del().
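A sketch of the resulting layering (prototypes assumed to match
buf.h):

  static inline size_t b_force_xfer_sketch(struct buffer *dst,
                                           struct buffer *src, size_t count)
  {
          size_t ret = b_ncat(dst, src, count);  /* copy without consuming */

          b_del(src, ret);                       /* then consume from <src> */
          return ret;
  }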
Cast a unified ring + storage area to a ring from the area, without
reinitializing the data buffer. Reinitialize the waiters and the lock.
This helps retrieve a previously allocated ring, from an mmap for
example.
QUIC datagrams are read from a random thread. They are then
redispatched to the connection thread according to the first packet's
DCID. These operations are implemented through a special buffer
designed to avoid locking.
Refactor this code with the following changes:
* the <rxbuf> type is renamed <quic_receiver_buf>. Its list element is
also renamed to highlight its attach point to a receiver.
* the <quic_dgram> and <quic_receiver_buf> definitions are moved to
quic_sock-t.h. This helps to reduce the size of quic_conn-t.h.
* the <quic_dgram> list elements are renamed to highlight their attach
points into a <quic_receiver_buf> and a <quic_dghdlr>.
This should be backported up to 2.6.
rxbuf is the structure used to store QUIC datagrams and redispatch
them to the connection thread.
Each receiver manages a list of rxbuf. This was stored both as an array
and an mt_list. Currently, only the mt_list is needed, so the <rxbufs>
member is removed from the receiver structure.
This should be backported up to 2.6.
Implement quic_tls_secrets_keys_alloc()/quic_tls_secrets_keys_free()
to allocate the memory for only one direction (RX or TX).
Modify ha_quic_set_encryption_secrets() to call these functions for one
of these directions (or both). So, from now on we can rely on the value
of the secret keys to know if they were derived.
Remove the QUIC_FL_TLS_SECRETS_SET flag which is no longer useful.
Consequently, the secrets are dumped by the traces only if derived.
Must be backported to 2.6.
This issue was reproduced with the -Q picoquic client option to split
a big ClientHello message into two Initial packets, and haproxy as
server without any knowledge of any previous 0RTT session (restarted
after a first 0RTT session). The received 0RTT packets were removed
from their queue when the second Initial packet was parsed, and the
QUIC handshake state never progressed and remained at the Initial
state.
To avoid such situations, after having treated some Initial packets we
always check if there are 0RTT packets to parse and we never remove
them from their queue. This will be done after the handshake is
completed or upon idle timeout expiration.
Also add more traces to be able to analyze the handshake progression.
Tested with ngtcp2 and picoquic.
Must be backported to 2.6.
Implement quic_get_ncbuf() to dynamically allocate a new ncbuf to be
attached to any quic_cstream struct which needs such a buffer. Note
that there is no quic_cstream for the 0RTT encryption level.
quic_free_ncbuf() is added to release the memory allocated for a
non-contiguous buffer.
Modify qc_handle_crypto_frm() to call this function and allocate an
ncbuf for crypto data which are not received in order. The crypto data
which are received in order are not buffered but provided to the TLS
stack (calling qc_provide_cdata()).
Modify qc_treat_rx_crypto_frms(), which is called after having provided
the in-order received crypto data to the TLS stack, to provide again
the remaining crypto data which have been buffered, if possible (if
they are in order). Each time buffered CRYPTO data are consumed, we try
to release the memory allocated for the non-contiguous buffer (ncbuf).
Also move the rx.crypto.offset quic_enc_level struct member to the
rx.offset quic_cstream struct member.
Must be backported to 2.6.
Add a new quic_cstream struct definition to implement the CRYPTO data
stream. This is a simplification of the qcs object (QUIC streams) for
the CRYPTO data, without any information about flow control. They are
not attached to any tree, but to a QUIC encryption level, one per
encryption level except for the early data encryption level (for
0RTT). A stream descriptor is also allocated for each CRYPTO data
stream.
Must be backported to 2.6.
The CPU usage pattern was found to be high (5%) on a machine with
48 threads and only 100 servers checked every second. That was
supposed to be only 100 connections per second, which should be very
cheap. It was figured that due to the check tasks unbinding from any
thread when going back to sleep, they're queued into the shared queue.
Not only does this require manipulating the global queue lock, but in
addition it means that all threads have to check the global queue
before going to sleep (hence take a lock again) to figure out how long
to sleep, and that they would all sleep only for the shortest amount
of time to the next check; one would pick it up and all the other ones
would go back to sleep waiting for the next check.
That's perfectly visible in time-to-first-byte measurements. A quick
test consisting in retrieving the stats page in CSV over a 48-thread
process checking 200 servers every 2 seconds shows the following tail:
percentile ttfb(ms)
99.98 2.43
99.985 5.72
99.99 32.96
99.995 82.176
99.996 82.944
99.9965 83.328
99.997 83.84
99.9975 84.288
99.998 85.12
99.9985 86.592
99.999 88
99.9995 89.728
99.9999 100.352
One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, will cause trouble for dynamic servers,
and may cause some skew in the load depending on some server patterns.
Instead, here we take a different approach. A check remains bound to
its thread for as long as possible, but upon every wakeup, the thread's
load is compared with another random thread's load. If it's found that
the other thread's load is less than half of the current one's, the
task is bounced to that thread. In order to prevent that new thread
from doing the same, we set a flag "CHK_ST_SLEEPING" that indicates
that the task just woke up, and we're bouncing the task only on this
condition.
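An illustrative sketch of the decision logic (thread_load(),
migrate_check(), the struct and the flag value are all stand-ins for
the real primitives):

  #include <stdlib.h>

  #define CHK_ST_SLEEPING 0x0800      /* value assumed for illustration */

  struct check_sketch {
          unsigned int state;
  };

  unsigned int thread_load(int tid);                    /* hypothetical */
  void migrate_check(struct check_sketch *c, int tid);  /* hypothetical */

  void check_wakeup_sketch(struct check_sketch *check, int my_tid,
                           int nbthread)
  {
          if (check->state & CHK_ST_SLEEPING) {
                  int other = rand() % nbthread;  /* one random thread */

                  /* clear the flag so the destination won't bounce it again */
                  check->state &= ~CHK_ST_SLEEPING;

                  /* bounce only if the other thread is at most half as loaded */
                  if (thread_load(other) < thread_load(my_tid) / 2)
                          migrate_check(check, other);
          }
  }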
Tests have shown that the initial load was very unfair before, with a
few check threads having a load of 15-20 and the vast majority having
zero. With this modification, after two "inter" delays, the load is
either zero or one everywhere when checks start. The same test shows a
CPU usage that significantly drops, to between 0.5 and 1%. The same
latency tail measurement is much better, roughly 10 times smaller:
percentile ttfb(ms)
99.98 1.647
99.985 1.773
99.99 4.912
99.995 8.76
99.996 8.88
99.9965 8.944
99.997 9.016
99.9975 9.104
99.998 9.224
99.9985 9.416
99.999 9.8
99.9995 10.04
99.9999 10.432
In fact one difference here is that many threads work, while in the
past they were waking up and going back to sleep after having perturbed
the shared lock. Thus it is anticipated that this will scale much more
smoothly than before. Under strace it's clearly visible that all
threads are sleeping for the time it takes to relaunch a check; there
are no more thundering-herd wakeups.
However it is also possible that in some rare cases, such as very
short check intervals smaller than a scheduler's timeslice (e.g. 4ms),
some users might have benefited from the work being concentrated on
fewer threads, and would instead observe a small increase of apparent
CPU usage due to more total threads waking up, even if that's for less
work each and less total work. That's visible with 200 servers at 4ms,
where "show activity" shows that a few threads were overloaded and
others doing nothing. It's not a problem though, as in practice checks
are not supposed to eat much CPU nor to wake up often enough to
represent a significant load anyway, and the main issue they could have
been causing (aside from the global lock) is an increase in
last-percentile latency.
The new functions fconn_show_flags() and fstrm_show_flags() decode the
flags state into a string, and are used by dev/flags:
$ ./dev/flags/flags fconn 0x3100
fconn->flags = FCGI_CF_GET_VALUES | FCGI_CF_KEEP_CONN | FCGI_CF_MPXS_CONNS
$ ./dev/flags/flags fstrm 0x3300
fstrm->flags = FCGI_SF_WANT_SHUTW | FCGI_SF_WANT_SHUTR | FCGI_SF_OUTGOING_DATA | FCGI_SF_BEGIN_SENT
The same was performed for the H2 and H1 multiplexers. FCGI connection
and stream flags are moved to a dedicated header file. It will mainly
be used to be able to decode mux-fcgi flags from the flags utility.
In this patch, we move the flags and enums to mux_fcgi-t.h, as well as
the two state-decoding inline functions.
We're generalizing the change performed in the previous commit
"MEDIUM: stick-table: requeue the expiration task out of the exclusive
lock" to stktable_requeue_exp() so that it can also be used by callers
of __stktable_store(). At the moment there's still no visible change
since it's still called under the write lock. However, the previous
code in stktable_touch_with_exp() was updated to use this function.
stream_store_counters() calls stksess_kill_if_expired() for each
active counter. And this one takes an exclusive lock on the table
before checking if it has any work to do (hint: it almost never has,
since it only wants to delete expired entries). However a lock is still
needed for now to protect the ref_cnt, but we can do it atomically
under the read lock.
Let's change the mechanism. Now what we do is to check, without the
lock, whether the entry is expired. If it is, we take the write lock,
expire it, and decrement the refcount. Otherwise we just decrement the
refcount under a read lock. With this change alone, the config based on
3 trackers without the previous patches saw a 2.6x improvement, but
here it doesn't yet change anything because some heavy contention
remains on the lookup part.
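A sketch of the resulting release path under assumed haproxy names
(__stksess_kill() as the internal killer; lock macros and tick helpers
as found in the tree):

  #include <haproxy/stick_table.h>   /* types, lock labels, tick helpers */

  void stksess_release_sketch(struct stktable *t, struct stksess *ts)
  {
          if (!tick_is_expired(ts->expire, now_ms)) {
                  /* common path: drop the reference under the read lock */
                  HA_RWLOCK_RDLOCK(STK_TABLE_LOCK, &t->lock);
                  HA_ATOMIC_DEC(&ts->ref_cnt);
                  HA_RWLOCK_RDUNLOCK(STK_TABLE_LOCK, &t->lock);
          }
          else {
                  /* rare path: really expired, take the write lock */
                  HA_RWLOCK_WRLOCK(STK_TABLE_LOCK, &t->lock);
                  if (!--ts->ref_cnt)
                          __stksess_kill(t, ts);  /* assumed internal killer */
                  HA_RWLOCK_WRUNLOCK(STK_TABLE_LOCK, &t->lock);
          }
  }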