BUG/MEDIUM: wdt: fix wrong thread being checked for sleeping

In 2.7, the method used to check for a sleeping thread changed with
commit e7475c8e7 ("MEDIUM: tasks/fd: replace sleeping_thread_mask with
a TH_FL_SLEEPING flag"). Previously there was a global sleeping mask
and now there is a flag per thread. The commit above partially broke
the watchdog by looking at the current thread's flags via th_ctx
instead of the reported thread's flags, and by using an AND condition
instead of an OR to decide whether to update and leave. This can cause
the wrong thread to be killed when the load is uneven. For example,
when enabling busy polling and sending traffic over a single
connection, all threads have their run time grow, and if the one
receiving the signal is also processing some traffic, it will not
match the sleeping/harmless condition and will set the stuck flag,
then die upon the next invocation. While it's reproducible in tests,
it's unlikely to be met in the field.
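
The difference between the two conditions can be illustrated with a
minimal standalone sketch. It is not the watchdog code: every sk_*
name is hypothetical, the atomic loads and thread groups are left out,
and SK_FL_SLEEPING only stands in for TH_FL_SLEEPING. "cur" is assumed
to be the thread running the handler and "thr" the reported one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_THREADS 4        /* hypothetical size, for the sketch only */
#define SK_FL_SLEEPING 0x1u     /* stands in for TH_FL_SLEEPING */

/* Simplified per-thread context: only the flags matter here. */
struct sk_thread_ctx {
	uint32_t flags;
};

static struct sk_thread_ctx sk_thread_ctx[SK_MAX_THREADS];
static uint64_t sk_threads_harmless;    /* one bit per thread */

/* Pre-fix logic: looks at the flags of the thread running the handler
 * (cur) and requires both conditions before treating the reported
 * thread (thr) as idle.
 */
static bool sk_idle_broken(int cur, int thr)
{
	assert(cur >= 0 && cur < SK_MAX_THREADS);
	assert(thr >= 0 && thr < SK_MAX_THREADS);

	uint64_t thr_bit = 1ULL << thr;

	return (sk_thread_ctx[cur].flags & SK_FL_SLEEPING) &&
	       (sk_threads_harmless & thr_bit);
}

/* Post-fix logic: looks at the reported thread's own flags, and either
 * condition is enough to treat it as idle.
 */
static bool sk_idle_fixed(int thr)
{
	assert(thr >= 0 && thr < SK_MAX_THREADS);

	uint64_t thr_bit = 1ULL << thr;

	return (sk_thread_ctx[thr].flags & SK_FL_SLEEPING) ||
	       (sk_threads_harmless & thr_bit);
}
```

With thread 1 sleeping and thread 0 busy while handling the signal,
sk_idle_broken(0, 1) reports the sleeping thread as not idle, whereas
sk_idle_fixed(1) reports it as idle.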

This fix should be backported to 2.7.
Willy Tarreau 2023-02-17 14:55:41 +01:00
parent 91fe0bc77a
commit 5405c9cdf3


@@ -83,7 +83,7 @@ void wdt_handler(int sig, siginfo_t *si, void *arg)
 		if (!p || n - p < 1000000000UL)
 			goto update_and_leave;
-		if ((_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_SLEEPING) &&
+		if ((_HA_ATOMIC_LOAD(&ha_thread_ctx[thr].flags) & TH_FL_SLEEPING) ||
 		    (_HA_ATOMIC_LOAD(&ha_tgroup_ctx[tgrp-1].threads_harmless) & thr_bit)) {
 			/* This thread is currently doing exactly nothing
 			 * waiting in the poll loop (unlikely but possible),