BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server

An interesting bug was revealed by commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()"). When shutting
down a server to redistribute its connections, no check is made on the
backend's queue. If we're turning off the last server and the backend
has pending connections, they will wait there until the queue timeout
expires. Worse, since the commit above, we can enter an endless loop
in the following situation:

  - streams are present in the backend's queue
  - streams are purged on the last server via srv_shutdown_streams()
  - that one calls pendconn_redistribute(srv) which does not purge
    the backend's pendconns
  - a stream performs some load balancing and enters assign_server_and_queue()
  - assign_server() is called in turn
  - the LB algo is non-deterministic and there are entries in the
    backend's queue. The function notices it and returns SRV_STATUS_FULL
  - assign_server_and_queue() calls pendconn_add() to add the connection
    to the backend's queue
  - on return, pendconn_must_try_again() is called, it figures there's
    no stream served anymore on the server nor the proxy, so it removes
    the pendconn from the queue and returns 1
  - assign_server_and_queue() loops back to the beginning to try again,
    while the conditions have not changed, resulting in an endless loop
    (see the simplified sketch right after this list).

Ideally a change count should be used in the queues so that it's possible
to detect that some dequeuing happened and/or that a last stream has left.
But that wouldn't completely solve the real problem, which is that we
must never ever add to a queue when there are no server streams left to
dequeue the new entries.
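
Purely as an illustration of that idea (and not of what this patch does),
such a change counter could look like the following sketch, with made-up
names:

  /* Each dequeue operation bumps a per-queue counter; a caller whose
   * entry was just removed only retries if the counter moved since the
   * snapshot it took when it queued itself, i.e. if some dequeuing
   * actually happened in between.
   */
  #include <stdatomic.h>

  struct queue_ctr {
          atomic_ulong dequeues;            /* bumped on every dequeue */
  };

  unsigned long queue_snapshot(struct queue_ctr *q)
  {
          return atomic_load(&q->dequeues);
  }

  void queue_note_dequeue(struct queue_ctr *q)
  {
          atomic_fetch_add(&q->dequeues, 1);
  }

  /* worth retrying only if the queue actually moved since <snap> */
  int worth_trying_again(struct queue_ctr *q, unsigned long snap)
  {
          return queue_snapshot(q) != snap;
  }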

The current solution consists in making pendconn_redistribute() take care
of the proxy after the server in case there's no more server available on
the proxy. It at least ensures that no pending streams are left in the
backend's queue when shutting streams down or when the last server goes
down. The try_again loop remains necessary to deal with inevitable races
during pendconn additions. It could be limited to a few rounds, though;
it should never trigger as long as the conditions are sufficient to let
it converge.

One way to reproduce the issue is to run a config with a single server
with maxconn 1 and plenty of threads (a minimal example config is
sketched further below), then run in a loop series of:

 "disable server px/s;shutdown sessions server px/s;
  wait 100ms server-removable px/s; show servers conn px;
  enable server px/s"

on the CLI at ~10/s while injecting with around 40 concurrent conns at
40-100k RPS. In this case the crash can appear within 10s to 1mn, with
a backtrace like this one for at least one thread:

  #0  pendconn_add (strm=strm@entry=0x17f2ce0) at src/queue.c:487
  #1  0x000000000064797d in assign_server_and_queue (s=s@entry=0x17f2ce0) at src/backend.c:1064
  #2  0x000000000064a928 in srv_redispatch_connect (s=s@entry=0x17f2ce0) at src/backend.c:1962
  #3  0x000000000064ac54 in back_handle_st_req (s=s@entry=0x17f2ce0) at src/backend.c:2287
  #4  0x00000000005ae1d5 in process_stream (t=t@entry=0x17f4ab0, context=0x17f2ce0, state=<optimized out>) at src/stream.c:2336

It's worth noting that other threads may often appear waiting after the
poller, and one in server_atomic_sync() waiting for isolation, because
the event that is processed when shutting the server down is consumed
under isolation. Having fewer threads available to dequeue the remaining
requests increases the probability of triggering the problem, though it
is not at all necessary (some less common traces never show them).
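
For completeness, the config referred to above can be as simple as the
following sketch (the px/s names match the CLI commands; the bind address,
ports, timeouts and thread count are purely illustrative):

  global
      nbthread 16
      stats socket /tmp/haproxy.sock level admin

  defaults
      mode http
      timeout connect 5s
      timeout client  30s
      timeout server  30s
      timeout queue   30s

  listen px
      bind :8000
      option redispatch
      server s 127.0.0.1:8080 maxconn 1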

This should carefully be backported wherever the commit above was
backported.

(cherry picked from commit 53f52e67a08fc2cd5ae64caf4148a5de884f9959)
Signed-off-by: Willy Tarreau <w@1wt.eu>

--- a/src/queue.c
+++ b/src/queue.c
@@ -495,6 +495,8 @@ int pendconn_redistribute(struct server *s)
 {
         struct pendconn *p;
         struct eb32_node *node, *nodeb;
+        struct proxy *px = s->proxy;
+        int px_xferred = 0;
         int xferred = 0;
 
         /* The REDISP option was specified. We will ignore cookie and force to
@@ -502,7 +504,7 @@ int pendconn_redistribute(struct server *s)
          */
         if (!(s->cur_admin & SRV_ADMF_MAINT) &&
             (s->proxy->options & (PR_O_REDISP|PR_O_PERSIST)) != PR_O_REDISP)
-                return 0;
+                goto skip_srv_queue;
 
         HA_SPIN_LOCK(QUEUE_LOCK, &s->queue.lock);
         for (node = eb32_first(&s->queue.head); node; node = nodeb) {
@@ -526,7 +528,33 @@ int pendconn_redistribute(struct server *s)
                 _HA_ATOMIC_SUB(&s->queue.length, xferred);
                 _HA_ATOMIC_SUB(&s->proxy->totpend, xferred);
         }
-        return xferred;
+
+ skip_srv_queue:
+        if (px->lbprm.tot_wact || px->lbprm.tot_wbck)
+                goto done;
+
+        HA_SPIN_LOCK(QUEUE_LOCK, &px->queue.lock);
+        for (node = eb32_first(&px->queue.head); node; node = nodeb) {
+                nodeb = eb32_next(node);
+
+                p = eb32_entry(node, struct pendconn, node);
+
+                /* force-persist streams may occasionally appear in the
+                 * proxy's queue, and we certainly don't want them here!
+                 */
+                p->strm_flags &= ~SF_FORCE_PRST;
+                __pendconn_unlink_prx(p);
+                task_wakeup(p->strm->task, TASK_WOKEN_RES);
+                px_xferred++;
+        }
+        HA_SPIN_UNLOCK(QUEUE_LOCK, &px->queue.lock);
+
+        if (px_xferred) {
+                _HA_ATOMIC_SUB(&px->queue.length, px_xferred);
+                _HA_ATOMIC_SUB(&px->totpend, px_xferred);
+        }
+ done:
+        return xferred + px_xferred;
 }
 
 /* Check for pending connections at the backend, and assign some of them to