2016-04-26 05:31:43 +03:00
Writing CTDB cluster mutex helpers
==================================
CTDB uses cluster-wide mutexes to protect against a "split brain",
which could occur if the cluster becomes partitioned due to network
failure or similar.
2022-01-10 11:18:14 +03:00
CTDB uses a cluster-wide mutex for its "cluster lock", which is used
2016-04-26 05:31:43 +03:00
to ensure that only one database recovery can happen at a time. For
2022-01-10 11:18:14 +03:00
an overview of cluster lock configuration see the CLUSTER LOCK
2016-04-26 05:31:43 +03:00
section in ctdb(7). CTDB tries to ensure correct operation of the
2022-01-10 11:18:14 +03:00
cluster lock by attempting to take the cluster lock when CTDB knows
2016-04-26 05:31:43 +03:00
that it should already be held.
By default, CTDB uses a supplied mutex helper that uses a fcntl(2)
lock on a specified file in the cluster filesystem.
However, a user supplied mutex helper can be used as an alternative.
The rest of this document describes the API for mutex helpers.
A mutex helper is an external executable
----------------------------------------
A mutex helper is an external executable that can be run by CTDB.
There are no CTDB-specific compilation dependencies. This means that
a helper could easily be scripted around existing commands. Mutex
helpers are run relatively rarely and are not time critical.
Therefore, reliability is preferred over high performance.
Taking a mutex with a helper
----------------------------
1. Helper is executed with helper-specific arguments
2. Helper attempts to take mutex
3. On success, the helper writes ASCII 0 to standard output
4. Helper stays running, holding mutex, awaiting termination by CTDB
5. When a helper receives SIGTERM it must release any mutex it is
holding and then exit.
Status codes
------------
CTDB ignores the exit code of a helper. Instead, CTDB reacts to a
single ASCII character that is sent to it via a helper's standard
output.
Valid status codes are:
0 - The helper took the mutex and is holding it, awaiting termination.
1 - The helper was unable to take the mutex due to contention.
2 - The helper took too long to take the mutex.
Helpers do not need to implement this status code. CTDB
already implements any required timeout handling.
3 - An unexpected error occurred.
If a 0 status code is sent then it the helper should periodically
check if the (original) parent processes still exists while awaiting
termination. If the parent process disappears then the helper should
2016-08-05 07:17:01 +03:00
release the mutex and exit. This avoids stale mutexes. Note that a
helper should never wait for parent process ID 1!
2016-04-26 05:31:43 +03:00
If a non-0 status code is sent then the helper can exit immediately.
However, if the helper does not exit then it must terminate if it
receives SIGTERM.
Logging
-------
Anything written to standard error by a helper is incorporated into
CTDB's logs. A helper should generally only output to stderr for
unexpected errors and avoid output to stderr on success or on mutex
contention.