Valid as of 1.0.66, may/will change in the future


RECMASTER
=========
Recovery Master, this is one of the nodes in the cluster that has been designated to
be the "recovery master".
The recovery master is responsible for performing full checks of cluster and cluster
node consistency and is also responsible for performing the actual database recovery
procedure.

Only one node at a time can be the recovery master.
This is ensured by CTDB using a lock on a single file in the shared GPFS filesystem:

/etc/sysconfig/ctdb :
...
# Options to ctdbd. This is read by /etc/init.d/ctdb
# you must specify the location of a shared lock file across all the
# nodes. This must be on shared storage
# there is no default here
CTDB_RECOVERY_LOCK=/gpfs/.ctdb/shared
...

In order to prevent two nodes from becoming recovery master at the same time (== split brain),
CTDB relies on GPFS to guarantee coherent locking across the cluster.
Thus CTDB relies on GPFS to allow only one ctdb process on one node to take out and
hold this lock.
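The following is a minimal stand-alone sketch of this kind of single-holder
lock, assuming fcntl() byte-range locking on the shared file. It is
illustrative only and not the actual CTDB implementation:

/* Sketch: try to take a cluster-wide exclusive lock on the recovery lock
 * file.  Only one process on one node can hold the write lock, provided
 * the cluster filesystem implements coherent byte-range locking. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct flock lock = {
                .l_type   = F_WRLCK,    /* exclusive write lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,          /* lock the first byte */
                .l_len    = 1,
        };
        int fd = open("/gpfs/.ctdb/shared", O_RDWR | O_CREAT, 0600);

        if (fd == -1) {
                perror("open");
                return 1;
        }
        if (fcntl(fd, F_SETLK, &lock) != 0) {
                /* some other node or process already holds the lock */
                printf("lock is held elsewhere - not becoming recovery master\n");
                return 1;
        }
        printf("got the recovery lock - this node can act as recovery master\n");
        /* the fd must stay open for as long as the lock is needed */
        pause();
        return 0;
}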

The recovery master is designated through an election process.
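As a loose illustration only (the real election logic lives in
server/ctdb_recoverd.c and takes more state into account), one can picture
the election as every eligible node announcing itself and the lowest-numbered
eligible node winning:

/* Loose illustration of electing a recovery master: the lowest pnn among
 * nodes that are neither disconnected nor banned wins.  The flag values
 * and the rule itself are simplified for illustration. */
#include <stdint.h>
#include <stdio.h>

#define FLAG_DISCONNECTED 0x01  /* illustrative values only */
#define FLAG_BANNED       0x08

struct node { uint32_t pnn; uint32_t flags; };

static int elect_recmaster(const struct node *nodes, int n)
{
        int i, winner = -1;

        for (i = 0; i < n; i++) {
                if (nodes[i].flags & (FLAG_DISCONNECTED | FLAG_BANNED))
                        continue;
                if (winner == -1 || nodes[i].pnn < (uint32_t)winner)
                        winner = (int)nodes[i].pnn;
        }
        return winner;  /* -1 means no eligible node */
}

int main(void)
{
        struct node cluster[] = {
                { 0, FLAG_BANNED }, { 1, 0 }, { 2, 0 }, { 3, FLAG_DISCONNECTED },
        };

        printf("elected recovery master: node %d\n",
               elect_recmaster(cluster, 4));    /* node 1 in this example */
        return 0;
}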


VNNMAP
======
The VNNMAP is a list of all nodes in the cluster that are currently part of the cluster
and participate in hosting the cluster databases.
All nodes that are CONNECTED but not BANNED should be present in the VNNMAP.

The VNNMAP is the list of LMASTERs for the cluster as reported by 'ctdb status':
...
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
...
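To give a rough idea of how the VNNMAP is used (simplified, and with a
stand-in hash function rather than CTDB's own), the LMASTER for a record is
found by hashing the record key and using that hash to index into the VNNMAP:

/* Simplified sketch: lmaster = vnnmap[hash(key) % vnnmap_size].
 * The hash function below is a stand-in for illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t toy_hash(const unsigned char *key, size_t len)
{
        uint32_t h = 5381;

        while (len--)
                h = h * 33 + *key++;
        return h;
}

int main(void)
{
        /* vnnmap as in the 'ctdb status' excerpt above: size 3, lmasters 0,1,2 */
        uint32_t vnnmap[] = { 0, 1, 2 };
        uint32_t size = 3;
        const char *key = "some-record-key";

        uint32_t lmaster = vnnmap[toy_hash((const unsigned char *)key,
                                           strlen(key)) % size];
        printf("key '%s' hashes to lmaster node %u\n", key, lmaster);
        return 0;
}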


CLUSTER MONITORING
==================
All nodes in the cluster monitor their own health and their own consistency with regard
to the recovery master. How and what the nodes monitor differs between the node that is
the recovery master and normal nodes.
This monitoring is to ensure that the cluster is healthy and consistent.
This is not related to monitoring of individual node health, a.k.a. eventscript monitoring.

At the end of each step in the process, some of the most common and important
error messages that can be generated during that step are listed.


NORMAL NODE CLUSTER MONITORING
------------------------------
Monitoring is performed in the dedicated recovery daemon process.
The implementation can be found in server/ctdb_recoverd.c:monitor_cluster()
This is an overview of the more important tasks during monitoring.
These tests are to verify that the local node is consistent with the recovery master.

Once every second the following monitoring loop is performed:

1, Verify that the parent ctdb daemon on the local node is still running.
If it is not, the recovery daemon logs an error and terminates.
"CTDB daemon is no longer available. Shutting down recovery daemon"

2, Check if any of the nodes has been recorded to have misbehaved too many times.
If so, we ban the node and log a message:
"Node %u has caused %u failures in %.0f seconds - banning it for %u seconds"

3, Check that there is a recovery master.
If not, we initiate a clusterwide election process and log:
"Initial recovery master set - forcing election"
and we restart monitoring from 1.

4, Verify that the recovery daemon and the local ctdb daemon agree on all the
node BANNING flags.
If the recovery daemon and the local ctdb daemon disagree on these flags, we update
the local ctdb daemon, log one of two messages and restart monitoring from 1 again.
"Local ctdb daemon on recmaster does not think this node is BANNED but the recovery master disagrees. Unbanning the node"
"Local ctdb daemon on non-recmaster does not think this node is BANNED but the recovery master disagrees. Re-banning the node"

5, Verify that the node designated to be recovery master exists in the local list of all nodes.
If the recovery master is not in the list of all cluster nodes, a new recovery master
election is triggered and monitoring restarts from 1.
"Recmaster node %u not in list. Force reelection"

6, Check if the recovery master has become disconnected.
If it has, log an error message, force a new election and restart monitoring from 1.
"Recmaster node %u is disconnected. Force reelection"

7, Read the node flags off the recovery master and verify that it has not become banned.
If it has, log an error message, force a new election and restart monitoring from 1.
"Recmaster node %u no longer available. Force reelection"

8, Verify that the recmaster and the local node agree on the flags (BANNED/DISABLED/...)
for the local node.
If there is an inconsistency, push the flags for the local node out to all other nodes.
"Recmaster disagrees on our flags flags:0x%x recmaster_flags:0x%x Broadcasting out flags."

9, Verify that the local node hosts all public ip addresses it should host and that it does
NOT host any public addresses it should not host.
If there is an inconsistency, we log an error, trigger a recovery and restart
monitoring from 1 again.
"Public address '%s' is missing and we should serve this ip"
"We are still serving a public address '%s' that we should not be serving."

These are all the checks we perform during monitoring for a normal node.
These tests are performed on all nodes in the cluster, which is why the loop is optimized
to make as few network calls to other nodes as possible.
Each node only performs 1 call to the recovery master in each loop and to no other nodes.
The overall shape of this loop is sketched below.
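Every check function in this sketch is a stub invented purely for
illustration; the real logic lives in server/ctdb_recoverd.c:monitor_cluster():

/* Illustrative-only sketch of the per-second monitoring loop: each failed
 * check either terminates the recovery daemon or makes the loop start
 * again from check 1 on the next iteration. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* stubbed checks - always "healthy" in this sketch */
static bool ctdbd_is_alive(void)              { return true; }
static bool have_recmaster(void)              { return true; }
static bool flags_agree_with_recmaster(void)  { return true; }
static bool public_ips_consistent(void)       { return true; }

int main(void)
{
        for (;;) {
                sleep(1);

                if (!ctdbd_is_alive()) {                /* check 1 */
                        fprintf(stderr, "CTDB daemon is no longer available. "
                                        "Shutting down recovery daemon\n");
                        return 1;
                }
                if (!have_recmaster()) {                /* check 3 */
                        /* force an election, then restart from check 1 */
                        continue;
                }
                if (!flags_agree_with_recmaster()) {    /* checks 4 and 8 */
                        /* re-sync the flags, then restart from check 1 */
                        continue;
                }
                if (!public_ips_consistent()) {         /* check 9 */
                        /* trigger a recovery, then restart from check 1 */
                        continue;
                }
                /* all checks passed - next iteration in one second */
        }
}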


RECOVERY MASTER CLUSTER MONITORING
----------------------------------
The recovery master performs a much more extensive test. In addition to tests 1-9 above
the recovery master also performs the following tests:

10, Read the list of nodes and flags from all other CONNECTED nodes in the cluster.
If there is a failure to read this list from one of the nodes, then log an
error, mark this node as a candidate to become BANNED and restart monitoring from 1.
"Unable to get nodemap from remote node %u"

11, Verify that the local recovery master and the remote node agree on the flags
for the remote node. If there is an inconsistency for the BANNING flag,
log an error, trigger a new recmaster election and restart monitoring from 1.
"Remote node %u had different BANNED flags 0x%x, local had 0x%x - trigger a re-election"
"Remote node %u had flags 0x%x, local had 0x%x - updating local"

12, Verify that the local recovery master and the remote node agree on the flags
for the remote node. If one of the flags other than the BANNING flag is inconsistent,
just update the set of flags for the local recovery daemon, log an informational message
and continue monitoring.
"Remote node %u had flags 0x%x, local had 0x%x - updating local"

13, Read the list of public ip addresses from all of the CONNECTED nodes and merge them into a
single clusterwide list.
If we fail to read the list of ips from a node, log an error and restart monitoring from 1.
"Failed to read public ips from node : %u"

14, Verify that all other nodes agree that this node is the recovery master.
If one of the other nodes disagrees that this is the recovery master, log an error,
force a new election and restart monitoring from 1.
"Node %d does not agree we are the recmaster. Need a new recmaster election"

15, Check if the previous attempt to run a recovery failed, and if it did, try a new recovery.
After the recovery, restart monitoring from 1.
"Starting do_recovery"

16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL.
If one of the nodes is in recovery mode ACTIVE, force a new recovery and restart
monitoring from 1.
"Node:%u was in recovery mode. Restart recovery process"

17, Verify that the filehandle to the recovery lock file is valid.
If it is not, this may mean a split brain and is a critical error.
Try a new recovery and restart monitoring from 1.
"recovery master doesn't have the recovery lock"

18, Verify that GPFS allows us to read from the recovery lock file.
If not, there is a critical GPFS issue and we may have a split brain.
Try forcing a new recovery and restart monitoring from 1.
"failed read from recovery_lock_fd - %s"

19, Read the list of all nodes and flags from all CONNECTED nodes in the cluster.
If we fail to read the nodemap from one of the remote nodes, log an error and restart
monitoring from 1.
"Unable to get nodemap from remote node %u"

20, If the nodemap differs between the local node and the remote node, log an error
and force a recovery (a sketch of this comparison follows after this list).
This would happen if the /etc/ctdb/nodes file differs across nodes in the cluster.
It is unlikely that the recovery will rectify the situation.
This is a critical error; most likely the entire cluster will be unavailable
until the files are fixed or the offending nodes have become banned.
"Remote node:%u has different node count. %u vs %u of the local node"

21, If a remote node disagrees on the content of the nodes list, try a recovery and restart
monitoring from 1.
It is unlikely that the recovery will rectify the situation.
This is a critical error; most likely the entire cluster will be unavailable
until the files are fixed or the offending nodes have become banned.
"Remote node:%u has different nodemap pnn for %d (%u vs %u)."

22, If a remote node disagrees on the node flags in the list, try a recovery to re-sync
the flags and restart monitoring from 1.
"Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)"

23, Verify that all active nodes are part of the VNNMAP.
If not, this would be a new node that has become CONNECTED but does not yet participate
in the cluster.
Perform a recovery to merge the new node into the cluster and restart monitoring from 1.
"The vnnmap count is different from the number of active nodes. %u vs %u"
or
"Node %u is active in the nodemap but did not exist in the vnnmap"

24, Read the VNNMAP from all CONNECTED nodes.
Verify that all nodes have the same VNNMAP content and that all nodes are in the same
generation instance of the databases.
If not, force a recovery to re-synchronize the vnnmap and the databases across the cluster
and restart monitoring from 1.
"Remote node %u has different generation of vnnmap. %u vs %u (ours)"
"Remote node %u has different size of vnnmap. %u vs %u (ours)"
"Remote node %u has different vnnmap."

25, If there have been changes to the cluster that require a reallocation of public ip
addresses, run the "startrecovery" event on all nodes, then run the "releaseip" and
"takeip" events to reassign the ips across the cluster, and finally run the "recovered" event.

Finished monitoring, continue monitoring from 1.
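As referenced in check 20 above, the following is a sketch of the kind of
nodemap comparison performed in checks 20-22. The structures and the helper
are hypothetical, invented for illustration, but the messages match the ones
quoted above:

/* Compare the local nodemap with one read from a remote node: node count
 * (check 20), per-slot pnn (check 21) and per-slot flags (check 22). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node_entry { uint32_t pnn; uint32_t flags; };
struct nodemap    { uint32_t num; struct node_entry nodes[16]; };

static bool nodemaps_consistent(uint32_t remote_pnn,
                                const struct nodemap *local,
                                const struct nodemap *remote)
{
        uint32_t i;

        if (local->num != remote->num) {
                printf("Remote node:%u has different node count. %u vs %u of the local node\n",
                       remote_pnn, remote->num, local->num);
                return false;
        }
        for (i = 0; i < local->num; i++) {
                if (local->nodes[i].pnn != remote->nodes[i].pnn) {
                        printf("Remote node:%u has different nodemap pnn for %d (%u vs %u).\n",
                               remote_pnn, (int)i, remote->nodes[i].pnn, local->nodes[i].pnn);
                        return false;
                }
                if (local->nodes[i].flags != remote->nodes[i].flags) {
                        printf("Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)\n",
                               remote_pnn, (int)i, remote->nodes[i].flags, local->nodes[i].flags);
                        return false;
                }
        }
        return true;
}

int main(void)
{
        struct nodemap local  = { 2, { { 0, 0x0 }, { 1, 0x0 } } };
        struct nodemap remote = { 2, { { 0, 0x0 }, { 1, 0x2 } } };      /* flag mismatch */

        if (!nodemaps_consistent(1, &local, &remote))
                printf("-> a recovery would be triggered\n");
        return 0;
}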


CLUSTER RECOVERY
================
Recoveries are driven by the recovery daemon on the node that is currently the recovery
master.
Most of the logging that is performed during recovery is only logged on the node that
is the recovery master.
Make sure to find which node is the recovery master and check the log for that node.

Example log entries that start in column 1 are expected to be present in the
log. Example log entries that are indented 3 columns are optional and might
only be present if an error occurred.


1, Log that recovery has been initiated.
"Starting do_recovery"

It might log an informational message:
   "New recovery culprit %u"
This is only semi-accurate and may not mean that there is any problem
at all with the node indicated.

2, Check if a node has caused too many failed recoveries and if so ban it from
the cluster, giving the other nodes in the cluster a chance to complete the recovery
operation.
"Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds"

3, Verify that the recovery daemon can lock the recovery lock file.
At this stage this node should be the recovery master.
If this operation fails it means we have a split brain and have to abort recovery.
   "ctdb_recovery_lock: Unable to open %s - (%s)"
   "ctdb_recovery_lock: Failed to get recovery lock on '%s'"
   "Unable to get recovery lock - aborting recovery"
"ctdb_recovery_lock: Got recovery lock on '%s'"

4, Log which node caused the recovery to be initiated.
This is a semi-accurate information message only.
This line does NOT mean that there has to be something wrong with the node listed.
"Recovery initiated due to problem with node %u"

5, Pull the names of all databases from all nodes and verify that these databases also
exist locally.
If a database is missing locally, just create it.
It is not an error if a database is missing locally. Databases are created on demand and
this could happen if it is a database which samba has never tried to access on the
local node.

6, Check the list of databases on each remote node and create any databases that may be missing
on the remote node.
"Recovery - created remote databases"

7, Set recovery mode to ACTIVE on all remote nodes.

8, Run the "startrecovery" eventscript on all nodes.

At this stage you will also get a few additional log entries, these are not
from the recovery daemon but from the main ctdb daemon due to running
the eventscript:
"startrecovery eventscript has been invoked"
"Monitoring has been disabled"
"Executing event script ..."
...

9, Create a new generation id and update the generation id and the VNNMAP on the local node
only.
This guarantees that the generation id will now be inconsistent across the cluster and
that if recovery fails a new recovery is attempted in the next iteration of the monitoring
loop.

10, Start a TDB TRANSACTION on all nodes for all databases.
This is to ensure that if recovery is aborted or fails, we do not
modify any databases on only some of the nodes.
"started transactions on all nodes"

11, For each database, pull the content from all CONNECTED nodes and merge it into
the TDB databases on the local node.
This merges the records from the remote nodes based on their serial numbers so we
only keep the most recent record found (see the sketch after step 22).
"Recovery - pulled remote database 0x%x"

12, For each database, perform a fast TDB WIPE operation to delete the entire TDB under the
transaction started above.

13, For each database, drop all empty records.
Force the DMASTER field of all records to point to the recovery master.
Push the database out to all other nodes.

The PUSH process lists some additional log entries for each database of the
form:
"Recovery - pulled remote database 0x..."
"Recovery - pushed remote database 0x... of size ..."

14, Commit all changes to all TDB databases.
"Recovery - starting database commits"
"Recovery - committed databases"

15, Create a new VNNMAP of all CONNECTED nodes, create a new generation number
and push this new VNNMAP out to all nodes.
"Recovery - updated vnnmap"

16, Update all nodes that the local node is the recovery master.
"Recovery - updated recmaster"

17, Synchronize node flags across the cluster.
"Recovery - updated flags"

18, Change recovery mode back to NORMAL.
"Recovery - disabled recovery mode"

19, Re-allocate all public ip addresses across the cluster.
"Deterministic IPs enabled. Resetting all ip allocations"

If the IP address allocation on the local node changes you might get
   "Takeover of IP 10.0.0.201/24 on interface eth0"
   "Release of IP 10.0.0.204/24 on interface eth0"

"Recovery - takeip finished"

20, Run the "recovered" eventscript on all nodes.
"Recovery - finished the recovered event"

You will also get an entry from the local ctdb daemon itself that it has
switched back to recovery mode NORMAL.
"Recovery has finished"

21, Broadcast a message to all samba daemons in the cluster that the databases have been
recovered. Samba will now do some additional checking/cleanup of the content in the stored
records.
"Recovery complete"

22, Finished. At this stage a 10 second timeout (ctdb listvars : rerecoverytimeout) is
initiated. The cluster will not allow a new recovery to be performed until this timeout
has expired.
"New recoveries supressed for the rerecovery timeout"
"Rerecovery timeout elapsed. Recovery reactivated."


Example : RECOVERY LOG ON RECMASTER
====================================
2008/12/01 09:57:28.110732 [ 4933]: 10.0.0.21:4379: node 10.0.0.24:4379 is dead: 2 connected
2008/12/01 09:57:28.110838 [ 4933]: Tearing down connection to dead node :3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:2682 The vnnmap count is different from the number of active nodes. 4 vs 3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1327 Starting do_recovery
2008/12/01 09:57:28.967297 [ 4935]: ctdb_recovery_lock: Got recovery lock on '/gpfs/.ctdb/shared'
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1355 Recovery initiated due to problem with node 0
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1381 Recovery - created remote databases
2008/12/01 09:57:28.973543 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to ACTIVE
2008/12/01 09:57:28.974823 [ 4933]: server/ctdb_recover.c:904 startrecovery eventscript has been invoked
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1431 started transactions on all nodes
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x42fe72c5
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x42fe72c5 of size 0
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x1421fb78
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x1421fb78 of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xc0bdde6a
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xc0bdde6a of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x17055d90
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x17055d90 of size 8
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x7bbbd26c
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x7bbbd26c of size 1
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xf2a58948
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xf2a58948 of size 51
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x92380e87
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x92380e87 of size 17
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x63501287
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x63501287 of size 1
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xe98e08b6
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xe98e08b6 of size 4
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x2672a57f
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x2672a57f of size 28
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xb775fff6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xb775fff6 of size 6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1440 Recovery - starting database commits
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1452 Recovery - committed databases
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1502 Recovery - updated vnnmap
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1511 Recovery - updated recmaster
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1522 Recovery - updated flags
2008/12/01 09:57:29.305235 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to NORMAL
2008/12/01 09:57:29.307245 [ 4935]: server/ctdb_recoverd.c:1531 Recovery - disabled recovery mode
2008/12/01 09:57:29.307245 [ 4935]: Deterministic IPs enabled. Resetting all ip allocations
2008/12/01 09:57:29.311071 [ 4933]: takeoverip called for an ip '10.0.0.201' that is not a public address
2008/12/01 09:57:29.311186 [ 4933]: takeoverip called for an ip '10.0.0.202' that is not a public address
2008/12/01 09:57:29.311204 [ 4933]: takeoverip called for an ip '10.0.0.203' that is not a public address
2008/12/01 09:57:29.311299 [ 4933]: takeoverip called for an ip '10.0.0.204' that is not a public address
2008/12/01 09:57:29.537210 [ 4935]: server/ctdb_recoverd.c:1542 Recovery - takeip finished
2008/12/01 09:57:29.545404 [ 4933]: Recovery has finished
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1551 Recovery - finished the recovered event
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1557 Recovery complete
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1565 New recoveries supressed for the rerecovery timeout
2008/12/01 09:57:39.815648 [ 4935]: server/ctdb_recoverd.c:1567 Rerecovery timeout elapsed. Recovery reactivated.