1
0
mirror of https://github.com/samba-team/samba.git synced 2025-01-08 21:18:16 +03:00
Commit Graph

2651 Commits

Author SHA1 Message Date
Martin Schwenke
97a1714ee9 ctdb-mutex: open() and fstat() when testing lock file
This makes a file descriptor available for other I/O.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
c07e81abf0 ctdb-mutex: Factor out function fcntl_lock_fd()
Allows blocking mode and start offset to be specified.  Always locks a
1-byte range.

Make the lock structure static to avoid initialising the whole
structure each time.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
9daf22a5c9 ctdb-mutex: Handle pings from lock checking child to parent
The ping timeout is specified by passing an extra argument to the
mutex helper, representing the ping timeout in seconds.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
b5db286791 ctdb-mutex: Do inode checks in a child process
In future this will allow extra I/O tests and a timeout in the parent
to (hopefully) release the lock if the child gets wedged.  For
simplicity, use tmon only to detect when either parent or child goes
away.  Plumbing a timeout for pings from child to parent will be done
later.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
2ecdbcb22c ctdb-mutex: Rename wait_for_lost to lock_io_check
This will be generalised to do more I/O-based checks.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
7ab2e8f127 ctdb-mutex: Rename recheck_time to recheck_interval
There will be more timeouts so clarify the intent of this one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
c396b61504 ctdb-mutex: Consistently use progname in error messages
To avoid error messages having ridiculously long paths, set progname
to basename(argv[0]).

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-28 10:09:34 +00:00
Martin Schwenke
3efa56aa61 ctdb-daemon: Fix printing of tickle ACKs
Commit f5a2037734 arguably got this
back-to-front:

  2022-07-27T09:50:01.985857+10:00 testn1 ctdbd[17820]: ../../ctdb/server/ctdb_takeover.c:514 sending TAKE_IP for '10.0.1.173'
  2022-07-27T09:50:01.990601+10:00 testn1 ctdbd[17820]: Send TCP tickle ACK: 10.0.1.77:33004 -> 10.0.1.173:2049
  2022-07-27T09:50:01.991323+10:00 testn1 ctdb-takeover[19758]: TAKEOVER_IP 10.0.1.173 succeeded on node 0

Unfortunately there is an inconsistency somewhere in the connection
tracking code used for tickle ACKs, making this less than obvious.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Thu Jul 28 09:02:08 UTC 2022 on sn-devel-184
2022-07-28 09:02:08 +00:00
Martin Schwenke
c77a4fde7a ctdb-daemon: Modernise debug in ctdb_add_public_address()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
d62fcba7dc ctdb-daemon: Avoid spurious error sending ARPs for released IP
A public IP address can be released in between (and probably before)
attempts to send ARPs.  One situation when this can occur is when a
cluster is shutting down: node A shuts down first, public IPs from
node A are taken over by node B, node B is shutdown.

Notice this when it occurs and cancel further attempts to send ARPs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
f5a2037734 ctdb-daemon: Modernise debug in ctdb_control_send_arp()
For the tickle ACK logging, render the connection in a buffer.  This
produces more complete information.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
9898e7c555 ctdb-recoverd: Clean up banning culprit code
Make this fully self-contained in the recovery daemon and avoid
indexing by PNN.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
19fbc2da38 ctdb-recoverd: Add pnn field to banning state structure
This structure is now standalone, so indexing by PNN can be avoided
via a subsequent commit.  Index by culprit here to make this commit
simple.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
0b5dd07604 ctdb-recoverd: Add function node_flags() and use it in elections
Indexing a node map by PNN is suboptimal.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-07-22 16:09:31 +00:00
Martin Schwenke
e752f841e6 ctdb-daemon: Use DEBUG() macro for child logging
Directly using dbgtext() with file logging results in a log entry with
no header, which is wrong.  This is a regression, introduced in commit
10d15c9e5d.  Prior to this, CTDB's
callback for file logging would always add a header.

Use DEBUG() instead dbgtext().  Note that DEBUG() effectively compares
the passed script_log_level with DEBUGLEVEL, so an explicit check is
no longer necessary.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=15090

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>

Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Thu Jun 16 13:33:10 UTC 2022 on sn-devel-184
2022-06-16 13:33:10 +00:00
Martin Schwenke
88f35cf862 ctdb-daemon: Drop unused prefix, logfn, logfn_private
These aren't set anywhere in the code.

Drop the log argument because it is also no longer used.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=15090

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>
2022-06-16 12:42:35 +00:00
Martin Schwenke
90a96f06a9 ctdb-recoverd: Do not ban on unknown error when taking cluster lock
If the cluster filesystem is unavailable then I/O errors may occur.
This is no worse than contention, so don't ban.  This avoids having
services unavailable for longer than necessary.

Update the associated test to simply confirm that this results in a
leaderless cluster, and leadership is restored when the lock can once
again be taken.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-05-31 05:06:29 +00:00
Martin Schwenke
da9decfc5e ctdb-daemon: Remove unused #includes of rb_tree.h
ctdb_takeover.c and eventscript.c no longer use this.
ipalloc_common.c has never used it.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-05-31 05:06:29 +00:00
Martin Schwenke
80de84d36e ctdb-daemon: Log per-database summary of resent calls
After a recovery that takes a significant amount of time the logs are
flooded with messages about every resent call.

Log a summary instead and demote per-call messages to INFO level.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-05-31 05:06:29 +00:00
Pavel Filipenský
8cb6565011 ctdb: Covscan: unchecked return value for trbt_traversearray32()
Signed-off-by: Pavel Filipenský <pfilipen@redhat.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
2022-05-14 03:49:32 +00:00
Martin Schwenke
d52b497d11 ctdb-locking: Don't pass NULL to tevent_req_is_unix_error()
If there is an error then this pointer is unconditionally
dereferenced.

However, the only possible error appears to be ENOMEM, where a crash
caused by dereferencing a NULL pointer isn't a terrible outcome.  In
the absence of a security issue this is probably not worth
backporting.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-05-03 09:19:31 +00:00
Martin Schwenke
490e5f4d4c ctdb-mutex: Don't pass NULL to tevent_req_is_unix_error()
If there is an error then this pointer is unconditionally
dereferenced.

However, the only possible error appears to be ENOMEM, where a crash
caused by dereferencing a NULL pointer isn't a terrible outcome.  In
the absence of a security issue this is probably not worth
backporting.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-05-03 09:19:31 +00:00
Martin Schwenke
6fb08a6580 ctdb-daemon: Don't release all public IPs during shutdown sequence
This further untangles public IP handling from the main daemon.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-04-06 06:34:37 +00:00
Martin Schwenke
f49446cb1e ctdb-daemon: Load tunables from ctdb.tunables
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-04-06 06:34:37 +00:00
Martin Schwenke
a509ee059e ctdb-daemon: New function ctdb_tunables_load()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-04-06 06:34:37 +00:00
Martin Schwenke
0e74e03c9c ctdb-recoverd: Consistently log start of election
Elections should now be quite rare, so always log when one begins.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-02-14 01:47:31 +00:00
Martin Schwenke
bf55a0117d ctdb-recoverd: Always send unknown leader broadcast when starting election
This is currently missed when the cluster lock is lost.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-02-14 01:47:31 +00:00
Martin Schwenke
9b3fab052b ctdb-recoverd: Consistently have caller set election-in-progress
The problem here is that election-in-progress must be set to
potentially avoid restarting the election broadcast timeout in
main_loop(), so this is already done by leader_handler().

Have force_election() set election-in-progress for all election types
and do not bother setting it in cluster_lock_election().

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-02-14 01:47:31 +00:00
Martin Schwenke
188a902156 ctdb-recoverd: Always cancel election in progress
Election-in-progress is set by unknown leader broadcast, so needs to
be cleared in all cases when election completes.

This was seen in a case where the leader node stalled, so didn't send
leader broadcasts for some time.  The node continued to hold the
cluster lock, so another node could not become leader.  However, after
the node returned to normal it still did not send leader broadcasts
because election-in-progress was never cleared.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-02-14 01:47:31 +00:00
Martin Schwenke
34d2ca0ae6 ctdb-config: Add configuration option [cluster] leader timeout
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
1dfb266038 ctdb-config: [legacy] recmaster capability -> [cluster] leader capability
Rename this configuration item and move it into the [cluster]
configuration section.

Update documentation to match.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
f5a39058f0 ctdb-config: [cluster] recovery lock -> [cluster] cluster lock
Retain "recovery lock" and mark as deprecated for backward
compatibility.

Some documentation is still inconsistent.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
73555e8248 ctdb-recoverd: Use race for cluster lock as election when lock is enabled
If the cluster is partitioned then nodes in one partition can not take
the lock anyway, so election is pointless.  It just introduces
unnecessary corner cases.

Instead just race for the lock.

When a node notices a lack of leader and notifies other nodes of an
election via an unknown leader broadcast, the cluster lock election is
hooked into this broadcast.

The test needs to be updated because losing the cluster lock can now
result in a leadership change.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
a76374070d ctdb-daemon: Drop implementation of {GET,SET}_RECMASTER controls
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
16efbca003 ctdb-daemon: Drop unused old client recmaster functions
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
c68267b2a6 ctdb-recoverd: Drop calls to ctdb_ctrl_setrecmaster()
Nothing fetches this value anymore.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
58d7fcdf7c ctdb-recoverd: Drop recovery master verification
This doesn't make sense if leader broadcasts are used.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
958746f947 ctdb-recoverd: Simplify some stopped/banned checks to inactive checks
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
358c59f51a ctdb-recoverd: No longer take cluster lock during recovery
Confirm instead that it is already held.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
36ffaaa691 ctdb-recoverd: Add and use function cluster_lock_enabled()
Now all references to ctdb->recovery_lock are encapsulated in the
cluster lock code.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
5ee664ee17 ctdb-recoverd: Terminology change: recovery lock -> cluster lock
No functional changes, just name changes for clarity.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
0f2250f4f9 ctdb-recoverd: Take cluster lock when election completes
It is no longer just a recovery lock but is always held by the cluster
leader.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
011e880002 ctdb-recoverd: Factor out function cluster_lock_take()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:33 +00:00
Martin Schwenke
b029ca4d51 ctdb-recoverd: Drop leader validation
The introduction of the leader broadcast timeout provides an
alternative to the current leader validation.  Using the leader
broadcast may not be as fast but it is more correct.

When the leader node is stopped or banned, the only way of triggering
an election is currently to fetch the leader's node map to check
whether the it is still active.  This is because the leader will no
longer push the node map to other nodes.  However, having all nodes
fetch the node map from an inactive leader may be unreliable.

Most of the other cases are also handled more reliably by the leader
broadcast timeout.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
7e53fab0a3 ctdb-recoverd: Drop special case for elected-before-connected
This no longer occurs at startup due to the leader broadcast timeout.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
ef4b8c13c0 ctdb-recoverd: Handle leader broadcast timeout
If no leader broadcasts have been received from the leader for more
than 5s then trigger an election.

Apart from being sane behaviour, this avoids elected-before-connected
bugs at startup, where a node elects itself leader before it is
connected to other nodes.

When a node processes a leader broadcast timeout it sends an unknown
leader broadcast to all nodes.  That causes cancellation of the leader
broadcast timeout across the cluster.  This is particular important at
startup, since nodes may be started in a staggered fashion.  Without
this cluster-wide cancellation, a node might notice the lack of
leader, win an election and complete a recovery before other nodes
notice the lack of leader.  When the leader broadcast timeout finally
occurs on the other nodes then they'll put the cluster back into an
unnecessary recovery.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
5c7f6da0f0 ctdb-recoverd: Send leader broadcasts
These are triggered on 1 second timer, but are only sent if the node
is the current leader and there is no election underway.

If this node can not be the leader then ensure it releases the
recovery lock.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
789a75abfa ctdb-recoverd: Process leader broadcasts
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
c2cfd9c21a ctdb-recoverd: Add an explicit flag for election in progress
An alternate election method will be added that doesn't use the
election timeout, so this provides a common way for recognising when
an election is in progress.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
ac5a3ca063 ctdb-recoverd: Only start election if node can be leader
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
7baadfe27e ctdb-recoverd: Add and use function this_node_can_be_leader()
This makes the code self-documenting.

In ctdb_election_data() there is a slight behaviour change.  An
inactive node will now try to lose an election.  This case should not happen
because:

* An inactive node can't win an election round and then send a reply.

* Any inactive node should never start an election.  There are
  currently places where this happens and they will be fixed later.

There is an instance where this could be used in
validate_recovery_master() but this involves a more serious logic
change.  Overhaul this function later.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
94b546c268 ctdb-recoverd: Logging/comments: recovery master -> leader
There are some remaining instances in this file but they will be
removed in subsequent commits.

Modernise debug macros as appropriate.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
dd79e9bd14 ctdb-recoverd: Rename recmaster field to leader
Recovery master is being renamed to leader.  This follows clustering
best practice (e.g. RAFT).

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
2ee6763c7d ctdb-recoverd: Use rec->pnn everywhere
This is currently referenced in a number of inconsistent
ways, including:

* pnn
* rec->ctdb->pnn
* ctdb->pnn
* ctdb_get_pnn(ctdb)
* ctdb_get_pnn(rec->ctdb)

The first of these always requires some thought about the context - is
this the node PNN or some other PNN (e.g. argument to function)?

rec->pnn is now always used when referring to the recovery daemon's
PNN.

Doing this also reduces reliance on struct ctdb_context internals.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
4af3b10a37 ctdb-recoverd: Change argument to srvid_disable_and_reply()
Reduce dependency on struct ctdb_context internals, enable a
subsequent change.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
b7c138ca99 ctdb-recoverd: Simplify arguments to ctdb_ban_node()
ban_time argument is always ctdb->tunable.recovery_ban_period, so
build this in and make the calling code more readable.

ctdb_ban_node() already logs how long a node is banned for, so don't
repeatedly log this.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
a5e0ddac62 ctdb-recoverd: Simplify arguments to verify_local_ip_allocation()
All other arguments are available via rec, so simplify.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
67b5191640 ctdb-recoverd: Simplify arguments to do_recovery()
pnn and nodemap are both available via the rec context, so simplify.
vnnmap is unused.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
57882beb16 ctdb-recoverd: Simplify arguments to some election functions
The pnn and nodemap arguments to force_election() and
send_election_request() are always effectively rec->pnn and
rec->nodemap, so simplify.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
9dbe7cc85e ctdb-recoverd: Add PNN to recovery daemon context
This is currently referenced in a number of inconsistent
ways, including:

* pnn
* rec->ctdb->pnn
* ctdb->pnn
* ctdb_get_pnn(ctdb)
* ctdb_get_pnn(rec->ctdb)

The first of these always requires some thought about the context - is
this the node PNN or some other PNN (e.g. argument to function)?

The intention is to always use rec->pnn when referring to the recovery
daemon's PNN.

Doing this also reduces reliance on struct ctdb_context internals.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
ff0140e470 ctdb-recoverd: Use this_node_is_leader() in an extra context
This is arguably clearer.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
c8721d01c6 ctdb-recoverd: Factor out and use function this_node_is_leader()
Make the code self-documenting.

This preempts an upcoming change to terminology but doing it now saves
a lot of churn.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 10:21:32 +00:00
Martin Schwenke
57a32cebdd ctdb-recoverd: Pass SIGHUP to running helper
The recovery and takeover helpers can run for a while and generate
non-trivial logs, so have them reopen their logs to support log
rotation.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Jan 17 04:36:30 UTC 2022 on sn-devel-184
2022-01-17 04:36:30 +00:00
Martin Schwenke
8e949a6082 ctdb-recoverd: Record helper PID in recovery daemon context
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
97a45f6f25 ctdb-recoverd: Add log reopening on SIGHUP to helpers
Recovery and takeover helpers can run for a while and generate
non-trivial logs.  They should support log reopening.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
51f0380e83 ctdb-daemon: Enable log reopening for event daemon
Add and call hook to pass on SIGHUP to eventd.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
c554a325fe ctdb-daemon: Enable log reopening for recovery daemon
Pass on a SIGHUP to the recovery daemon, which will then reopen its
logs.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
4acfefed61 ctdb-recoverd: Add basic log reopening
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
4ed37de82b ctdb-daemon: Add basic top-level log reopening
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
9e7d2d9794 ctdb-daemon: Don't mark a node as unhealthy when connecting to it
Remote nodes are already initialised as UNHEALTHY when the node list
is initialised at startup (ctdb_load_nodes_file() calls
convert_node_map_to_list()) and when disconnected (ctdb_node_dead()).
So, drop this code.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Thu Sep  9 02:38:34 UTC 2021 on sn-devel-184
2021-09-09 02:38:34 +00:00
Martin Schwenke
7f697b1938 ctdb-daemon: Ignore flag changes for disconnected nodes
If this node is not connected to a node then we shouldn't know
anything about it.  The state will be pushed later by the recovery
master.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
ae10a8a4b7 ctdb-daemon: Simplify ctdb_control_modflags()
Now that there are separate disable/enable controls used by the ctdb
tool this control can ignore any flag updates for the current nodes.
These only come from the recovery master, which depends on being able
to fetch flags for all nodes.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
916c5ee131 ctdb-recoverd: Mark CTDB_SRVID_SET_NODE_FLAGS obsolete
CTDB_SRVID_SET_NODE_FLAGS is no longer sent so drop monitor_handler()
and replace with srvid_not_implemented().  Mark the SRVID obsolete in
its comment.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
e75256767f ctdb-daemon: Don't bother sending CTDB_SRVID_SET_NODE_FLAGS
The code that handles this message is
ctdb_recoverd.c:monitor_handler().  Although it appears to do
something potentially useful, it only logs the flags changes.  All
changes made are to local structures - there are no actual
side-effects.

It used to trigger a takeover run when the DISABLED flag changed.
This was dropped back in commit
662f06de9f.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
0132bd5a22 ctdb-daemon: Modernise remaining debug macro in this function
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
b6d25d079e ctdb-daemon: Update logging for flag changes
When flags change, promote the message to NOTICE level and switch the
message to the style that is currently generated by
ctdb-recoverd.c:monitor_handler().  This will allow monitor_handler()
to go away in future.

Drop logging when flags do not change.  The recovery master now logs
when it pushes flags for a node, so the lack of a corresponding
"changed flags" message here indicates that no update was required.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
eec44e2862 ctdb-daemon: Correct the condition for logging unchanged flags
Don't trust the old flags from the recovery master.

Surrounding code will change in future comments, including the use of
old-style debug macros, so just make this change clear.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
15a6489c28 ctdb_daemon: Implement controls DISABLE_NODE/ENABLE_NODE
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
60c1ef1465 ctdb-daemon: Start as disabled means PERMANENTLY_DISABLED
DISABLED is UNHEALTHY | PERMANENTLY_DISABLED, which is not what is
intended here.  Luckily, it doesn't do any harm because nodes are
marked unhealthy at startup anyway.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
1ac7bc7532 ctdb-daemon: Factor out a function to get node structure from PNN
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
e0a7b5a9e8 ctdb-daemon: Add a helper variable
Simplifies a subsequent change.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
8305f6a7f1 ctdb-recoverd: Push flags for a node if any remote node disagrees
This will usually happen if flags on the node in question change, so
keeping the code simple and pushing to all nodes won't hurt.  When all
nodes come up there might be differences in connected nodes, causing
such "fix ups".  Receiving nodes will ignore no-op pushes.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
620d078714 ctdb-recoverd: Update the local node map before pushing out flags
The resulting code structure looks a little weird.  However, there is
another condition that requires the flags to be pushed that will be
inserted before the continue statement in a subsequent commit..

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
82a075d4d7 ctdb-recoverd: Add a helper variable
Improves readability and simplifies subsequent changes.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14784
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-09-09 01:46:49 +00:00
Martin Schwenke
e40d452722 ctdb-daemon: Close server socket when switching to client
The socket is set close-on-exec but that doesn't help for processes
that do not exec().  This should be done for all child processes.

This has been seen in testing where "ctdb shutdown" waits for the
socket to close before succeeding.  It appears that lingering
vacuuming processes have not closed the socket when becoming clients
so they cause "ctdb shutdown" to hang even though the main daemon
process has exited.  The cause of the lingering vacuuming processes
has been previously examined but still isn't understood.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2021-06-25 09:16:31 +00:00
Amitay Isaacs
d07875330a ctdb-locking: Pass additional arguments to debug locks script
1. PID of lock helper waiting for lock
2. Scope of lock: "record" or "db"
3. Path to database that lock helper is trying to lock
4. Whether the database uses mutexes: "mutex" or "fcntl"

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2021-05-28 06:46:29 +00:00
Volker Lendecke
06b740e2fb ctdb: Fix a typo
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>
2021-03-09 22:36:28 +00:00
Martin Schwenke
65ab8cb014 ctdb-daemon: Do not attempt to chown Unix domain socket in test mode
If run with UID wrapper and UID_WRAPPER_ROOT=1 then securing the
socket will fail.

Test mode means that local daemons are in use, so securing the socket
is not important.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
2020-11-02 08:58:31 +00:00
Martin Schwenke
78c3b5b6a8 ctdb-daemon: Clean up call to bind socket
Variable res is only used once and ret is re-used many times.  Drop
res, use ret, which doesn't need to be initialised.  Modernise debug
macro.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
2020-11-02 08:58:31 +00:00
Martin Schwenke
9404f8631e ctdb-daemon: Clean up socket bind/secure/listen
Obey the coding style, modernise debug macros, clean up whitespace.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
2020-11-02 08:58:31 +00:00
Martin Schwenke
4b01f54041 ctdb-recoverd: Drop unnecessary and broken code
update_flags() has already updated the recovery master's canonical
node map, based on the flags from each remote node, and pushed out
these flags to all nodes.

If i == j then the node map has already been updated from this remote
node's flags, so simply drop this case.

Although update_flags() has updated flags for all nodes, it did not
update each node map in remote_nodemaps[] to reflect this.  This means
that remote_nodemaps[] may contain inconsistent flags for some nodes
so it should not be used to check consistency when i != j.

Further, a meaningful difference in flags can only really occur if
update_flags() failed.  In that case this code is never reached.

These observations combine to imply that this whole loop should be
dropped.

This leaves potential sub-second inconsistencies due to out-of-band
healthy/unhealthy flag changes pushed via CTDB_SRVID_PUSH_NODE_FLAGS.
These updates could be dropped (takeover run asks each node for
available IPs rather than making centralised decisions based on node
flags) but for now they will be fixed in the next iteration of
main_loop().

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14513
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-10-06 03:12:35 +00:00
Martin Schwenke
3ab52b5286 ctdb-recoverd: Drop unnecessary code
This has already been done in update_flags().

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14513
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-10-06 03:12:35 +00:00
Martin Schwenke
d98f68f918 ctdb-daemon: Drop implementation of old-style database pull/push controls
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Sep 11 06:29:32 UTC 2020 on sn-devel-184
2020-09-11 06:29:32 +00:00
Martin Schwenke
2efce7d477 ctdb-recovery: Simplify database push function names
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
f4e2206e88 ctdb-recovery: Drop unnecessary database push wrapper
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
225a699633 ctdb-recovery: Drop passing of capabilities into database pull
This is no longer necessary because the capability new style database
pull is assumed to always be available.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
595c1a7c0f ctdb-recovery: Simplify database pull function names
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
f968576642 ctdb-recovery: Remove use of old pull and push controls
Removes use of the old controls without cleaning up the code.  Clean
up can be done later.

After this change the CTDB_CAP_FRAGMENTED_CONTROLS capability is no
longer checked.  This capability can be removed along with the
controls.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
8bb6a6607d ctdb-recoverd: Broadcast takeover run message when verifying IPs
This makes it consistent with the monitoring code.  If the master has
changed then this means the master will always get the message.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Aug 18 06:24:11 UTC 2020 on sn-devel-184
2020-08-18 06:24:11 +00:00
Martin Schwenke
4aa8e72d60 ctdb-recoverd: Rename update_local_flags() -> update_flags()
This also updates remote flags so the name is misleading.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14466
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-08-18 05:02:25 +00:00