Commit Graph

7539 Commits

Aurelien DARRAGON
56d8074798 MINOR: proxy: add PR_FL_CHECKED flag
PR_FL_CHECKED is set on a proxy once its configuration has been fully
checked (including postparsing checks).

This information may be useful to functions that need to know whether
some config-related proxy properties are likely to change due to
parsing or postparsing/check logic. Also, during runtime, except for
some rare cases, config-related proxy properties are not supposed to
change.
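
For illustration, a consumer could guard on the flag like this (a
minimal sketch; the guard itself is hypothetical, only the flag comes
from this patch):

  /* hypothetical consumer: only trust config-related fields once checked */
  if (!(px->flags & PR_FL_CHECKED))
      return; /* properties may still be rewritten by postparsing */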
2024-04-04 19:10:01 +02:00
Aurelien DARRAGON
6810c41f8e MEDIUM: tree-wide: add logformat expressions wrapper
Log format expressions are broadly used within the code: once they are
parsed from an input string, they are converted to a linked list of
logformat nodes.

We're starting to face some limitations because we're simply storing the
converted expression as a generic logformat_node list.

The first issue we're facing is that storing logformat expressions that
way doesn't allow us to add metadata alongside the list, which is part
of the prerequisites for implementing log-profiles.

Another issue with storing logformat expressions as generic lists of
logformat_node elements is that it's becoming really hard to tell when
the code relies on logformat expressions, given that there isn't always
a comment near the list declaration or manipulation to indicate it.
This adds some complexity to code maintenance.

This patch looks quite impressive due to changes in a lot of header and
source files (since logformat expressions are broadly used), but it does
a simple thing: it defines the lf_expr structure which itself holds a
generic list of logformat nodes, and then declares some helpers to
manipulate lf_expr elements and fixes the code so that we now exclusively
manipulate logformat_node lists as lf_expr elements outside of log.c.

For now, lf_expr struct only contains the list of logformat nodes (no
additional metadata), but now that we have dedicated type and helpers,
doing so in the future won't be problematic at all and won't require
extensive code changes.
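
The shape of the wrapper can be pictured like this (a simplified sketch
of the idea, not the exact definition):

  struct lf_expr {
      struct list nodes;  /* generic list of logformat_node */
      /* room for future metadata (e.g. for log-profiles) */
  };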
2024-04-04 19:10:01 +02:00
Aurelien DARRAGON
7d8f45b647 MEDIUM: log: carry tag context in logformat node
This is a pretty simple patch despite requiring to make some visible
changes in the code:

When parsing a logformat string, log tags (ie: '%tag') are turned into
logformat nodes with their type set to the type of the corresponding
logformat_tag element which was matched by name. Thus, when "compiling"
a logformat tag, we only keep a reference to the tag type from the
original logformat_tag.

For example, for "%B" log tag, we have the following logformat_tag
element:

  {
    .name = "B",
    .type = LOG_FMT_BYTES,
    .mode = PR_MODE_TCP,
    .lw = LW_BYTES,
    .config_callback = NULL
  }

When parsing "%B" string, we search for a matching logformat tag
inside logformat_tags[] array using the provided name, once we find a
matching element, we craft a logformat node whose type will be
LOG_FMT_BYTES, but from the node itself, we no longer have access to
other informations that are set in the logformat_tag struct element.

Thus, with the current implementation, from a logformat_node resulting
from a log tag we cannot easily get back to the matching logformat_tag
struct element, as it would require us to scan the whole logformat_tags
array at runtime using node->type to find the matching element.

Let's take a simpler path and consider all tag-specific LOG_FMT_*
subtypes as being part of the same logformat node type: LOG_FMT_TAG.

Thanks to that, we're now able to distinguish logformat nodes made
from logformat tags from other logformat nodes, and link them to their
corresponding logformat_tag element in the logformat_tags[] array. All
it costs is a simple indirection and an extra pointer in the
logformat_node struct.
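
In other words, the node ends up looking roughly like this (a sketch;
member names besides the type are illustrative assumptions):

  struct logformat_node {
      int type;                        /* LOG_FMT_TAG for tag-derived nodes */
      const struct logformat_tag *tag; /* back-reference into logformat_tags[] */
      /* ... other members unchanged ... */
  };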

While at it, all LOG_FMT_* types related to logformat tags were moved
inside log.c as they have no use outside of it since they are simply
lookup indexes for sess_build_logline() and could even be replaced by
function pointers some day...
2024-04-04 19:10:01 +02:00
Aurelien DARRAGON
8cf5c3d7f0 MINOR: log: expose logformat_tag struct
Rename the logformat_type internal struct to logformat_tag to make it
less confusing, then expose the logformat_tag struct through a header
file so that it can be referenced in other structs.

Also rename logformat_keywords[] to logformat_tags[] for better
consistency.
2024-04-04 19:10:01 +02:00
Aurelien DARRAGON
c85cbc1061 MEDIUM: log: rename logformat var to logformat tag
What we used to call a logformat variable in the code is referred to as
a log-format tag in the documentation. Having both 'var' and 'tag'
labels referring to the same thing is really confusing. Let's make the
code comply with the documentation by replacing all logformat
var/variable/VAR occurrences with either tag or TAG.

No functional change should be expected; the only visible side-effect
from the user's point of view is that "variable" was replaced by "tag"
in some error messages.
2024-04-04 19:10:01 +02:00
Willy Tarreau
1a088da7c2 MAJOR: stktable: split the keys across multiple shards to reduce contention
In order to reduce the contention on the table when keys expire
quickly, we're spreading the load over multiple trees. This applies to
both keys and expiration dates. The shard number is calculated from the
key value itself, both when looking it up and when setting it.

The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):

  function dump_stk()
    local dmp = core.proxies['tbl'].stktable:dump({});
    local count = 0
    for _, __ in pairs(dmp) do
        count = count + 1
    end
    core.Info('Total entries: ' .. count)
  end

  core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);

  ##
  global
    tune.lua.log.stderr on
    lua-load-per-thread lua-cnttbl.lua

  listen front
    bind :8001
    http-request lua.dump_stk if { path_beg /stk }
    http-request track-sc1 rand(),upper,hex table tbl
    http-request redirect location /

  backend tbl
    stick-table size 100k type string len 12 store http_req_cnt

  ##
  $ h2load -c 16 -n 10000 0:8001/
  $ curl 0:8001/stk

  ## A count close to 100k appears on haproxy's stderr
  ## On the CLI, "show table tbl" | wc will show the same.

Some large parts were reindented only to add a top-level loop to
iterate over shards (e.g. process_table_expire()). It's better to check
the diff using git show -b.

The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.

The stksess entries already have a shard number: it's the one used to
decide which peer connection the entry is sent over. Maybe we should
also store the one associated with the entry itself instead of
recalculating it, though it does not happen that often. The operation
is done by hashing the key using XXH32().
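
For illustration, the shard derivation boils down to something like
this (a sketch; the helper name is hypothetical, while XXH32() and
CONFIG_HAP_TBL_BUCKETS come from this patch):

  /* map a key to one of the table's shards */
  static inline unsigned int stksess_calc_shard(const void *key, size_t len)
  {
      return XXH32(key, len, 0) % CONFIG_HAP_TBL_BUCKETS;
  }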

The peers also take and release the table's lock, but the way it's
used is not very clear yet, so at this point it's certain this will not
work.

At this point, this allowed us to completely unlock the performance on
an 80-thread setup:

 before: 5.4 Gbps, 150k RPS, 80 cores
  52.71%  haproxy    [.] stktable_lookup_key
  36.90%  haproxy    [.] stktable_get_entry.part.0
   0.86%  haproxy    [.] ebmb_lookup
   0.18%  haproxy    [.] process_stream
   0.12%  haproxy    [.] process_table_expire
   0.11%  haproxy    [.] fwrr_get_next_server
   0.10%  haproxy    [.] eb32_insert
   0.10%  haproxy    [.] run_tasks_from_lists

 after: 36 Gbps, 980k RPS, 80 cores
  44.92%  haproxy    [.] stktable_get_entry
   5.47%  haproxy    [.] ebmb_lookup
   2.50%  haproxy    [.] fwrr_get_next_server
   0.97%  haproxy    [.] eb32_insert
   0.92%  haproxy    [.] process_stream
   0.52%  haproxy    [.] run_tasks_from_lists
   0.45%  haproxy    [.] conn_backend_get
   0.44%  haproxy    [.] __pool_alloc
   0.35%  haproxy    [.] process_table_expire
   0.35%  haproxy    [.] connect_server
   0.35%  haproxy    [.] h1_headers_to_hdr_list
   0.34%  haproxy    [.] eb_delete
   0.31%  haproxy    [.] srv_add_to_idle_list
   0.30%  haproxy    [.] h1_snd_buf

WIP: uint64_t -> long

WIP: ulong -> uint

code is much smaller
2024-04-03 17:34:47 +02:00
Willy Tarreau
4c1480f13b MINOR: stick-tables: mark the seen stksess with a flag "seen"
Right now we're taking the stick-tables update lock for reads just for
the sake of checking if the update index is past it or not. That's
costly, because even taking the read lock is sufficient to provoke a
cache line write, while under load or attack it's frequent that the
update has not yet been propagated and wouldn't require anything.

This commit brings a new field to the stksess, "seen", which is zeroed
when the entry is updated, and set to one as soon as at least one peer
starts to consult it. This way it will reflect whether the entry must
be updated again so that this peer can see it; otherwise no update will
be necessary. For now the flag is only set/reset but not exploited.
Great care is taken to avoid writes whenever possible.
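
The write-avoidance idea can be sketched as follows (illustrative; the
exact accessors in the tree may differ):

  /* peer side: mark the entry as consulted, but avoid dirtying the
   * cache line again if the flag is already set */
  if (!HA_ATOMIC_LOAD(&ts->seen))
      HA_ATOMIC_STORE(&ts->seen, 1);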
2024-04-03 17:34:47 +02:00
William Lallemand
aa3632962f MEDIUM: mworker: get rid of libsystemd
Given the xz drama which allowed liblzma to be linked into openssh,
let's remove libsystemd to get rid of useless dependencies.

The sd_notify API seems to be stable and is now documented. This patch
replaces the sd_notify() and sd_notifyf() functions with a
reimplementation inspired by the systemd documentation.

This should not change anything functionally. The function will be
built when haproxy is built using USE_SYSTEMD=1.
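
The core of such a reimplementation fits in a few lines: send a
datagram like "READY=1" to the unix socket advertised in
$NOTIFY_SOCKET. A sketch following the documented protocol (the
function name is hypothetical and error handling is trimmed):

  #include <stddef.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  static int my_sd_notify(int unset_env, const char *message)
  {
      const char *path = getenv("NOTIFY_SOCKET");
      struct sockaddr_un sun = { .sun_family = AF_UNIX };
      socklen_t len;
      int fd, ret = 0;

      if (!path || !*message)
          return 0; /* not running under systemd supervision */

      strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);
      len = offsetof(struct sockaddr_un, sun_path) + strlen(sun.sun_path);
      if (sun.sun_path[0] == '@')
          sun.sun_path[0] = 0; /* abstract socket namespace */

      fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
      if (fd < 0)
          return -1;
      if (sendto(fd, message, strlen(message), 0,
                 (struct sockaddr *)&sun, len) < 0)
          ret = -1;
      close(fd);
      if (unset_env)
          unsetenv("NOTIFY_SOCKET");
      return ret; /* usage: my_sd_notify(0, "READY=1") */
  }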

References:
  https://github.com/systemd/systemd/issues/32028
  https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes

Before:

wla@kikyo:~% ldd /usr/sbin/haproxy
	linux-vdso.so.1 (0x00007ffcfaf65000)
	libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x000074637fef4000)
	libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000074637fe4f000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000074637f400000)
	liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x000074637fe0d000)
	libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x000074637f92a000)
	libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x000074637f365000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074637f000000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074637f27a000)
	libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x000074637fdff000)
	libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x000074637eeb8000)
	liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000074637fdcd000)
	libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x000074637ee01000)
	liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x000074637fda8000)
	/lib64/ld-linux-x86-64.so.2 (0x000074637ff5d000)
	libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x000074637f904000)

After:

wla@kikyo:~% ldd /usr/sbin/haproxy
	linux-vdso.so.1 (0x00007ffd51901000)
	libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f758d6c0000)
	libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f758d61b000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f758ca00000)
	liblua5.4.so.0 => /lib/x86_64-linux-gnu/liblua5.4.so.0 (0x00007f758d5d9000)
	libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f758d365000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f758d5ba000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f758c600000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f758c915000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f758d729000)

A backport to all stable versions could be considered at some point.
2024-04-03 15:53:18 +02:00
Frederic Lecaille
a305bb92b9 MINOR: quic: HyStart++ implementation (RFC 9406)
This is a simple algorithm to replace the classic slow start phase of
congestion control algorithms. It should reduce the high packet loss
observed during this phase.

Implemented only for Cubic.
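
The heart of HyStart++ is the RTT-increase test that makes the sender
leave classic slow start for Conservative Slow Start (CSS). A sketch of
the RFC 9406 logic, in microseconds (names are illustrative, not the
actual implementation):

  #include <stdint.h>

  #define HS_MIN_RTT_THRESH   4000U  /* 4 ms */
  #define HS_MAX_RTT_THRESH  16000U  /* 16 ms */
  #define HS_N_RTT_SAMPLE        8U

  static int hystart_enter_css(uint32_t curr_round_min_rtt,
                               uint32_t last_round_min_rtt,
                               unsigned int rtt_samples)
  {
      uint32_t eta = last_round_min_rtt / 8;

      if (rtt_samples < HS_N_RTT_SAMPLE)
          return 0;
      /* clamp the RTT-increase threshold between 4 and 16 ms */
      if (eta < HS_MIN_RTT_THRESH)
          eta = HS_MIN_RTT_THRESH;
      else if (eta > HS_MAX_RTT_THRESH)
          eta = HS_MAX_RTT_THRESH;
      /* RTT inflated beyond the threshold: exit classic slow start */
      return curr_round_min_rtt >= last_round_min_rtt + eta;
  }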
2024-04-02 18:47:19 +02:00
Anthony Deschamps
faa8c3e024 MEDIUM: lb-chash: Deterministic node hashes based on server address
Motivation: When services are discovered through DNS resolution, the
order in which DNS records get resolved and assigned to servers is
arbitrary. Therefore, even though two HAProxy instances using chash
balancing might agree that a particular request should go to server3,
it is likely that they have assigned different IPs and ports to the
server in that slot.

This patch adds a server option, "hash-key <key>", which can be set to
"id" (the existing behaviour, and the default), "addr", or "addr-port".
By deriving the keys for the chash tree nodes from a server's address
and port, we ensure that independent HAProxy instances will agree on
routing decisions. If an address is not known, then the key is derived
from the server's puid as before.

When adjusting a server's weight, we now check whether the server's hash has
changed. If it has, we have to remove all its nodes first, since the node keys
will also have to change.
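
For example, two independently configured instances could agree on
routing with something like this (illustrative snippet, made-up
addresses):

  backend app
    balance uri
    hash-type consistent
    server s1 10.0.0.11:8080 hash-key addr-port
    server s2 10.0.0.12:8080 hash-key addr-port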
2024-04-02 07:00:10 +02:00
Aurelien DARRAGON
3c6dfa618a MEDIUM: log/balance: leverage lbprm api for log load-balancing
The log load-balancing implementation was not seamlessly integrated
within the lbprm API. The consequence is that it could become harder to
maintain over time since it added some specific cases just for the log
backend. Moreover, it resulted in some code duplication since balance
algorithms that are common to logs and regular (tcp, http) backends
were specifically rewritten for log backends.

Thanks to the previous commit, we now have all the prerequisites to make
log load-balancing fully leverage lbprm logic. Thus in this patch we make
__do_send_log_backend() use existing lbprm algorithms, and we no longer
require log-specific lbprm initialization in cfgparse.c and in
postcheck_log_backend().

As a bonus, for log backends this allows weighted algorithms to
properly support weights (ie: roundrobin, random and log-hash) since we
now leverage the same lb algorithms that we use for tcp/http backends
(doc was updated).
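
In practice this means a weighted log backend can now be written like
any other backend (illustrative snippet, made-up addresses):

  backend mylogs
    mode log
    balance roundrobin
    server s1 udp@10.0.0.1:514 weight 2
    server s2 udp@10.0.0.2:514 weight 1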
2024-03-29 17:08:37 +01:00
Aurelien DARRAGON
9aea6df81f MINOR: lbprm: implement true "sticky" balance algo
As previously mentioned in cd352c0db ("MINOR: log/balance: rename
"log-sticky" to "sticky""), let's define a sticky algorithm that may be
used from any protocol. The sticky algorithm sticks to the same server
as long as it remains available.

The documentation was updated accordingly.
2024-03-29 17:08:37 +01:00
Christopher Faulet
87426e82ec MAJOR: cli: Use a custom .snd_buf function to only copy the current command
The CLI applet is now using its own snd_buf callback function. Instead
of copying as much output data as possible, only one command is copied
at a time.

To do so, a new state CLI_ST_PARSEREQ is added for the CLI applet. In
this state, the CLI I/O handler knows a full command was copied into
its input buffer and it must parse this command to evaluate it.
2024-03-28 17:32:55 +01:00
Christopher Faulet
838fb54de6 MINOR: stconn: Add a connection flag to notify sending data are the last ones
This flag can be used by endpoints to know that the data to send, via
the .snd_buf callback function, are the last ones. It is useful to know
a shutdown is pending but cannot be delivered while the data to send
are not consumed.
2024-03-28 17:32:55 +01:00
Christopher Faulet
2c6321842b MEDIUM: applet: Handle applets with their own buffers in put functions
applet_putchk() and other similar functions are now testing the
applet's type to use the applet's outbuf instead of the channel's
buffer. This will ease applet conversion because most of them rely on
these functions.
2024-03-28 17:28:20 +01:00
Christopher Faulet
1380b97285 MEDIUM: buf: Add b_getline() and b_getdelim() functions
These functions are very similar to co_getline() and co_getdelim():
b_getdelim() retrieves the longest part of the buffer that is composed
exclusively of characters not in a delimiter set, while b_getline()
stops on LF only and thus returns a line.
2024-03-28 17:28:20 +01:00
Christopher Faulet
5056cbdb86 MINOR: sc_strm: Add generic version to perform sync receives and sends
sc_sync_recv() and sc_sync_send() were added to use the connection or
applet versions, depending on the endpoint type. For now these
functions are not used, but they will be used by process_stream() to
replace the connection versions.
2024-03-28 17:28:20 +01:00
Remi Tricot-Le Breton
7359c0c7f4 MEDIUM: ssl: Add 'tune.ssl.ocsp-update.mode' global option
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows activating ocsp-update on
certificates without the need to create separate crt-lists. It can
still be superseded by the crt-list 'ocsp-update' option. It takes
either "on" or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, including certificates that
don't have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We no longer systematically raise an error when trying to
enable ocsp-update on a certificate that does not have an OCSP URI, be
it via the global option or the crt-list one. We will still raise an
error when a user tries to load a certificate that does have an OCSP
URI but a missing issuer certificate (if ocsp-update is enabled).
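
Usage boils down to a single global directive (illustrative snippet):

  global
    tune.ssl.ocsp-update.mode on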
2024-03-27 11:38:28 +01:00
Willy Tarreau
6c1b29d06f MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.

Tests were run with different numbers and it was shown that the value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close, and
anything around these values degrades quickly. The value has been left
tunable in the global section.

This commit only introduces everything needed to set up the queue
count so that it's easier to adjust in the forthcoming patches; it was
initially added after the series, which made comparisons harder.

It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
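
The resulting tunable can then be set from the global section
(illustrative snippet; the keyword name tune.ring.queues is an
assumption, check the documentation for the final name):

  global
    tune.ring.queues 6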
2024-03-25 17:34:19 +00:00
Willy Tarreau
e3f101a19a MINOR: ring: add the definition of a ring waiting cell
This is what will be used to describe one waiting thread, its message
in the queues, and the aggregation of pending messages after it.
2024-03-25 17:34:19 +00:00
Willy Tarreau
c7bd7a68e4 OPTIM: ring: have only one thread at a time wake up all readers
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.

There used to be a few cases of watchdog triggers on the list
iteration before this, on a 24-core AMD EPYC platform; those never
appeared anymore. The perf has dropped a bit on 3C6T on the EPYC, from
6.61M to 6.0M msg/s, but remains unchanged at 24C48T.
2024-03-25 17:34:19 +00:00
Willy Tarreau
9e99cfbeb6 MAJOR: ring: drop the now unneeded lock
It was only used to protect the list, which is now an mt_list, so it
doesn't provide any required protection anymore. It obviously also used
to provide strict ordering between the writer and the reader when the
writer started to update the messages, but that's now covered by the
ordered tail updates and the updates to the readers count that protect
the area.

The message rate on small thread counts (up to 12) saw a boost of
roughly 5%, while for large counts it lost about 2% due to some
contention now becoming visible elsewhere. Typical measures are
6.13M -> 6.61M msg/s at 3C6T, and 1.88M -> 1.92M at 24C48T on the
EPYC.
2024-03-25 17:34:19 +00:00
Willy Tarreau
a2d2dbf210 MEDIUM: ring/applet: turn the wait_entry list to an mt_list instead
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
2024-03-25 17:34:19 +00:00
Willy Tarreau
eb3d5f464d MEDIUM: ring: use the topmost bit of the tail as a lock
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers
will be limited to 2GB instead of 4GB on 32-bit machines (though in
practice, good luck allocating more than 2GB contiguous on 32-bit).
But since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there anyway.
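
The locking scheme can be sketched as follows (illustrative names; the
atomic macros are the ones the code base already provides):

  #define RING_TAIL_LOCK (1UL << ((sizeof(size_t) * 8) - 1))

  /* spin until we own the tail; returns the unlocked tail value */
  static size_t ring_lock_tail(size_t *tail)
  {
      size_t t;

      while (1) {
          t = HA_ATOMIC_FETCH_OR(tail, RING_TAIL_LOCK);
          if (!(t & RING_TAIL_LOCK))
              return t;     /* lock acquired */
          __ha_cpu_relax(); /* another writer holds it */
      }
  }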

For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.

Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and performs worse on ARMv8.1. XADD could also be
considered, with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
2024-03-25 17:34:19 +00:00
Willy Tarreau
dd8ea5d928 MEDIUM: ring: align the head and tail fields in the ring_storage structure
We really want to let the readers and writers act on different areas, so
we want to have the tail and the head on separate cache lines, themselves
separate from the rest of the ring. Doing so improves the performance from
2.15 to 2.35M msg/s at 48 threads on a 24-core EPYC.

This increases the header space from 32 to 192 bytes when threads are
enabled. But since we already have the header size available in the file,
haring remains able to detect the aligned vs unaligned formats and call
dump_v2a() when aligned is detected.
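
The layout change amounts to something like this (a sketch; the real
macros and member names may differ):

  struct ring_storage {
      size_t head __attribute__((aligned(64)));    /* read side */
      size_t tail __attribute__((aligned(64)));    /* write side */
      char   area[0] __attribute__((aligned(64))); /* messages */
  };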
2024-03-25 17:34:19 +00:00
Willy Tarreau
bf3dead20c MEDIUM: ring: remove the struct buffer from the ring
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.

The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.

The haring tool was thus updated based on this: it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2()
function instead of dump_v1(). The new function also creates a buffer
from the ring's area, size, head and tail and calls the generic one so
that no other code had to be adapted.
2024-03-25 17:34:19 +00:00
Willy Tarreau
01aa0a057c MEDIUM: ring: change the ring reader to use the new vector-based API now
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
2024-03-25 17:34:19 +00:00
Willy Tarreau
03816ccfa9 MAJOR: ring: insert an intermediary ring_storage level
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once and for all.

This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.

It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
2024-03-25 17:34:19 +00:00
Willy Tarreau
01abdcb307 MINOR: ring: add a flag to indicate a mapped file
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better to assign a flag to clarify
this, because it's going to become difficult otherwise.
2024-03-25 17:34:19 +00:00
Willy Tarreau
b30fd8cc2d MINOR: ring: also add ring_area(), ring_head(), ring_tail()
These will essentially be used to simplify the conversion to a new API.
2024-03-25 17:34:19 +00:00
Willy Tarreau
dc4836c15c MINOR: ring: add ring_dup() to copy a ring into another one
This will mostly be used during reallocation and boot-time duplicates,
the purpose is simply to save the caller from having to know the details
of the internal representation.
2024-03-25 17:34:19 +00:00
Willy Tarreau
a185d3d90d MINOR: ring: add ring_size() to return the ring's size
This is just to ease conversion so that callers stop accessing the ring's
buffer.
2024-03-25 17:34:19 +00:00
Willy Tarreau
4c41fcd0da MINOR: ring: add ring_data() to report the amount of data in a ring
This will be used as an accessor for the few functions that need this
outside of ring.c.
2024-03-25 17:34:19 +00:00
Willy Tarreau
c222cb8389 MINOR: vecpair: add necessary functions to use vecpairss from/to ring APIs
Many ring-based APIs need a tail and a head, with some extra assumption
that the user takes care of not filling the ring so that tail==head is
unambiguous. Vectors are particularly suited to this usage so here we
create 4 functions to create vectors representing free room or data
from a ring, as well as updating rings based on a pair of vectors that
represents either free space or data.
2024-03-25 17:34:19 +00:00
Willy Tarreau
63261aae39 MINOR: vecpair: add new vector pair based data manipulation mechanisms
The buffers API defines both a storage layout and how to handle the
data. The storage is shared with the chunks API which only deals with
non-wrapping messages while buffers support wrapping both of the data
and of the free space. As such, most of the buffers code already makes
special cases of two parts in a buffer, the first one before wrapping
and the optional second one after the wrapping occurred.

The thing is, there are plenty of other places (e.g. rings) where the
code dealing with wrapping is desirable but with a different storage
layout. Let's export the existing buffer handling code related to
reading/writing wrapping data and make it work with arbitrary vector
pairs instead. This will handle wrapping and holes in messages if
desired, and it will be up to the caller to decide how its messages
are arranged and to pass the relevant ptr,len elements.

The code is limited to two vectors because this is sufficient to deal
with wrapping without making the code needlessly complex. I.e. this will
not reassemble an iovec. For vectors, since we already had the ist type,
there's no point inventing a new type, and it's even possible that over
time some callers will find benefits in using this unified API (i.e. no
NOP translation layer). It also allows to pass inputs as direct arguments
and outputs as pointers. Not only this is more efficient code-wise, but
it also avoids the accidental use of a wrong function. It was indeed
found that naming functions is even harder than with the buffer as the
notion of from/to is even fuzzier here.

The API will likely continue to evolve and some functions might get
renamed to more explicit ones over time to limit confusion. For now
the code provides anything needed to reset/create/fill/erase/read/peek
or measure vector pairs and to manipulate chars/blocks/varints to/from
there.
2024-03-25 17:34:19 +00:00
Willy Tarreau
0b1c17a2dd MINOR: ring: reserve one special value for the readers count
In order to support concurrent writers we'll need to lock areas in the
buffer. For this we'll use one special value of the single-byte readers
count. Let's reserve it now and use the macro instead of the hardcoded
255.
2024-03-25 17:34:19 +00:00
Willy Tarreau
63242a59c4 MINOR: buf: add b_getblk_ofs() that works relative to area and not head
For some concurrently accessed buffers we can't rely on head/data etc,
but sometimes the access patterns guarantee that the buffer contents
are there. Let's implement a function to read contents from a fixed
offset, which never checks head nor data, only the area and its size.
It's the caller's job to get this offset.
2024-03-25 17:34:19 +00:00
Willy Tarreau
2f28981546 MINOR: buf: add b_putblk_ofs() to copy a block at a specific position
This new function b_putblk_ofs() puts one full block of data of length
<len> from <blk> into the buffer, starting from absolute offset
<offset> after the buffer's area. As a convenience to avoid complex
checks in callers, the offset is allowed to exceed a valid one by no
more than one buffer size, and will automatically be wrapped. The
caller is responsible for ensuring that <len> doesn't exceed the known
length of the available room at this position, otherwise data may be
overwritten. The buffer's length is *not* updated, so generally the
caller will have updated it before calling this function. This is
meant to be used on concurrently accessed buffers, so that a writer
can append data while a reader is blocked by other means from reaching
the current area. The function guarantees never to use ->head nor
->data.
2024-03-25 17:34:19 +00:00
Willy Tarreau
c5004ccb36 MINOR: buf: add b_rel_ofs() to turn an absolute offset into a relative one
It basically does the opposite of b_peek_ofs(). If x=b_peek_ofs(y), then
y=b_rel_ofs(x).
2024-03-25 17:34:19 +00:00
Willy Tarreau
15e47b6a59 MINOR: buf: add b_add_ofs() to add a count to an absolute position
This function is used to compute a new absolute buffer offset by adding a
length to an existing valid offset. It will check for wrapping.
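
Its behavior can be pictured like this (a simplified sketch of the
wrapping rule, assuming the count doesn't exceed the buffer size):

  static inline size_t b_add_ofs(const struct buffer *b, size_t ofs, size_t count)
  {
      ofs += count;
      if (ofs >= b->size)
          ofs -= b->size; /* wrap past the end of the area */
      return ofs;
  }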
2024-03-25 17:34:19 +00:00
Willy Tarreau
c62a2d540d MEDIUM: ring: move the ring reader code to ring_dispatch_messages()
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that it does not necessarily depend on
an appctx.

The current "show ring" code was ported to this and it continues to
work as expected.
2024-03-25 17:34:19 +00:00
Willy Tarreau
ad31e53287 REORG: dns/ring: split the ring between the generic one and the DNS one
A ring is used for the DNS code but slightly differently from the generic
one, which prevents some important changes from being made to the generic
code without breaking DNS. As the use cases differ, it's better to just
split them apart for now and have the DNS code use its own ring that we
rename dns_ring and let the generic code continue to live on its own.

The unused parts such as CLI registration, resizing, and allocation
from a mapped area were dropped. dns_ring_detach_appctx() was kept
despite not being used, so as to stay consistent with the comments that
say it must be called, despite the DNS code explicitly mentioning that
it skips it for now (i.e. this may change in the future).

Hopefully after the generic rings are converted the DNS code can migrate
back to them, though this is really not necessary.
2024-03-25 17:34:19 +00:00
Willy Tarreau
201c706330 MINOR: log/applet: add new function syslog_applet_append_event()
This function takes a buffer on input, an offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. Contrary to its sibling applet_append_line(), instead of just
appending an LF at the end of the line, it prepends the message size
in decimal and a space before the message, as expected by syslog TCP
implementations. This will be used to simplify the ring reader code.
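
This framing is the octet-counting scheme from RFC 6587: each message
is sent as "<len> <msg>". A sketch of the idea (illustrative, not the
actual function):

  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>

  /* frame one syslog message for TCP transport: "<len> <msg>" */
  static ssize_t syslog_frame(char *out, size_t outsz,
                              const char *msg, size_t len)
  {
      int hlen = snprintf(out, outsz, "%zu ", len);

      if (hlen < 0 || (size_t)hlen + len > outsz)
          return -1; /* not enough room */
      memcpy(out + hlen, msg, len);
      return (ssize_t)(hlen + len);
  }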
2024-03-25 17:34:19 +00:00
Willy Tarreau
6ae41dc510 MINOR: applet: add new function applet_append_line()
This function takes a buffer on input, an offset and a length, and
consumes the block from that buffer to send it to the appctx's output
buffer. This will be used to simplify the ring reader code.
2024-03-25 17:34:19 +00:00
Willy Tarreau
c038ca8e8c MINOR: atomic: add a read-specific variant of __ha_cpu_relax()
Tests on various systems show that x86 prefers not to wait at all inside
read loops while aarch64 prefers to wait a little bit. Instead of having
to stuff ifdefs around __ha_cpu_relax() inside plenty of such loops
waiting for a condition to appear, better implement a new variant that
we call __ha_cpu_relax_for_read() which honors each architecture's
preferences and is the same as __ha_cpu_relax() for other ones.
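
The dispatch can be pictured like this (a hedged sketch; the
per-architecture instructions are assumptions, only the intent comes
from this commit):

  #if defined(__x86_64__) || defined(__i386__)
  /* x86 prefers not to wait at all in read loops */
  #define __ha_cpu_relax_for_read() ({ 1; })
  #elif defined(__aarch64__)
  /* aarch64 prefers to wait a little bit */
  #define __ha_cpu_relax_for_read() ({ asm volatile("isb" ::: "memory"); 1; })
  #else
  #define __ha_cpu_relax_for_read() __ha_cpu_relax()
  #endif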
2024-03-25 17:34:19 +00:00
Aurelien DARRAGON
3de1acfb23 BUILD: server: fix build regression on old compilers (<= gcc-4.4)
Willy reported that since 3ac79b504 ("MEDIUM: server:
make server_set_inetaddr() updater serializable"), haproxy fails to
compile on some older compilers such as gcc-4.4 with this kind of error:

  src/server.c: In function 'snr_resolution_cb':
  src/server.c:4471: error: unknown field 'dns_resolver' specified in initializer
  compilation terminated due to -Wfatal-errors.
  make: *** [Makefile:1006: src/server.o] Error 1

This is due to referencing a member inside an anonymous union from a
compound literal assignment. Apparently such use of anonymous unions
wasn't properly supported back then on older compilers. To fix the
issue, we give the name "u" to the parent union and use this name to
explicitly refer to the union where relevant in the code (only a few
changes fortunately).

The fix itself was verified to restore build compatibility with gcc 4.4
(and even 4.2).

As 3ac79b504 is used as a prerequisite for 64c9c8ef3 ("BUG/MINOR:
server/dns: use server_set_inetaddr() to unset srv addr from DNS"), please
consider backporting this patch too if 64c9c8ef3 happens to be backported
in 2.9.
2024-03-25 16:23:37 +01:00
Amaury Denoyelle
0d4273f04b MEDIUM: server: close private idle connection before server deletion
This commit is similar to the following one:
  65ae241dcfe710e1cdd3ec4e7a9bde38d2e4c116
  MEDIUM: server: close idle conn before server deletion

This patch implements a similar logic, this time to close private idle
connections stored in sessions. The principle is identical to the one
in the above commit: conn_release() is used on idle connections after
a takeover to ensure thread safety.

An extra change was required to be able to execute takeover on such
connections. Their original thread ID was unknown, contrary to
non-private connections which are stored in sharded lists. As such, a
new tid member has been added under the sess_priv_conns chaining
element.
2024-03-22 17:12:27 +01:00
Amaury Denoyelle
f3862a9bc7 MINOR: connection: extend takeover with release option
Extend takeover API both for MUX and XPRT with a new boolean argument
<release>. Its purpose is to signal if the connection will be freed
immediately after the takeover, rendering new resources allocation
unnecessary.

For the moment, the release argument is always false. However, it will
be set to true in the delete server CLI handler to proactively close
server idle connections.
2024-03-22 16:12:36 +01:00
Amaury Denoyelle
ff2e71ae24 MINOR: connection: implement conn_release()
Several places reuse the same code to ensure a connection is properly
freed, either via its MUX or by calling the proper set of functions.
Factorize all of this in a new function conn_release().

This new function is now called via session_free() and
session_accept_fd(). It will also be reused on delete server to
proactively close idle connections.
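
The factorized logic amounts to the following (a sketch based on the
description above; the actual implementation may differ in details):

  void conn_release(struct connection *conn)
  {
      if (conn->mux) {
          /* the mux owns the teardown of everything it allocated */
          conn->mux->destroy(conn->ctx);
      }
      else {
          conn_stop_tracking(conn);
          conn_full_close(conn);
          conn_free(conn);
      }
  }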
2024-03-22 16:12:36 +01:00
Remi Tricot-Le Breton
5c25c577a0 BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing
The CLI command "update ssl ocsp-response" was forcefully removing an
OCSP response from the update tree regardless of whether it used to be
in it beforehand or not. But since the main OCSP update task works by
removing the entry being currently updated from the update tree and
then reinserting it when the update process is over, it meant that in
the CLI command code we were modifying a structure that was already
being used.

These concurrent accesses were not properly locked on the "regular"
update case because it was assumed that once an entry was removed from
the update tree, the update task was the only one able to work on it.

Rather than locking the whole update process, an "updating" flag was
added to the certificate_ocsp in order to prevent the "update ssl
ocsp-response" command from trying to update a response already being
updated.

An easy way to reproduce this crash was to perform two "simultaneous"
calls to "update ssl ocsp-response" on the same certificate. It would
then crash on an eb64_delete call in the main ocsp update task function.

This patch can be backported up to 2.8. Wait a little bit before
backporting.
2024-03-20 16:12:10 +01:00