samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2025-02-04 17:47:26 +03:00

Author	SHA1	Message	Date
Martin Schwenke	e77d5f99e3	ctdb/recoverd: Do not refuse disabling takeover runs on inactive nodes Failure might be expected when disabling takeover runs on banned nodes, since they might be suffering from performance problems or similar. More broadly, administrators who reconfigure a cluster that isn't in a happy state aren't necessarily doing something sensible. However, allowing takeover runs to be disabled on inactive nodes stops reconfiguration of stopped nodes. This is probably an unreasonable limitation, so drop it. Signed-off-by: Martin Schwenke <martin@meltin.net> Reviewed-by: Amitay Isaacs <amitay@gmail.com>	2014-01-17 17:59:19 +11:00
Martin Schwenke	44a0466ac1	ctdb-recoverd: Only respond to currently queued ipreallocated requests Otherwise new requests can come in during the latter parts of the takeover run when the IP allocation algorithm has already run, and the new requests will be dequeued even though they haven't really been processed. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-27 18:46:16 +01:00
Martin Schwenke	efc77ba6ac	ctdb-recoverd: For persistent databases a sequence number of 0 is valid Otherwise recovery ends up done by RSN when it is unnecessary. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-27 18:46:16 +01:00
Martin Schwenke	028fe930b6	ctdb-recoverd: Fix backward compatibility for CTDB_SRVID_TAKEOVER_RUN When running a mixed version cluster, compatibility with older versions was broken during recent refactorisation. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-27 18:46:16 +01:00
Martin Schwenke	6fbf399191	ctdb-recoverd: A node refuses to play against itself Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-27 18:46:16 +01:00
Martin Schwenke	2038d166ad	ctdb-recoverd: Remove duplicate code to update flags during recovery This also happens earlier in do_recovery() and the nodemap is not updated after that, so this update is redundant. Signed-off-by: Martin Schwenke <martin@meltin.net> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-27 18:46:16 +01:00
Amitay Isaacs	6d1b74f052	ctdb-server: Coverity fixes Signed-off-by: Amitay Isaacs <amitay@gmail.com> Reviewed-by: Michael Adam <obnox@samba.org>	2013-11-19 17:13:03 +01:00
Martin Schwenke	62076d3089	recoverd: Rebalancing should be done regardless tunable Rebalance target nodes should be set even if a deferred rebalance is not configured. The user can explicitly cause a takeover run. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit afd9b51644af074752d74c412cb4e7ec2eba2c69)	2013-10-30 12:19:49 +11:00
Martin Schwenke	6b42805717	recoverd: Improve an error message in the election code Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 275ed9ebe287e39d891888c13810c70f347af8ac)	2013-10-30 11:34:56 +11:00
Martin Schwenke	5f80f4255c	Revert "if a new node enters the cluster, that node will already be frozen at start" This is unnecessary due to 03e2e436db5cfd29a56d13f5d2101e42389bfc94. Furthermore, if a node doesn't force an election but wins it then it can fail to record that it is the new recovery master. This can lead to a reverse split brain where there is no recovery master. This reverts commit c5035657606283d2e35bea40992505e84ca8e7be. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> Conflicts: server/ctdb_recoverd.c (This used to be ctdb commit c8b542e059a54b8d524bd430cad9d82e5edd864d)	2013-10-30 11:34:56 +11:00
Martin Schwenke	f88cf2d013	Revert "recoverd: Disable takeover runs on other nodes for 5 minutes" 5 minutes is too long to leave the cluster in limbo if the recovery daemon dies during a takeover run, even though this is quite unlikely. We need a new recover master to be able to do takeover runs fairly quickly. This reverts commit 71080676bb4acbd0d9b595a30cf7fe6dddbf426f. (This used to be ctdb commit 3e41170c78fc7a2bf526129c9b7db3739b61c6bf)	2013-10-29 17:14:55 +11:00
Martin Schwenke	fbd2617cb8	recoverd: Remove function reload_nodes_file() It is a 1 line wrapper around ctdb_load_nodes_file(), so use that instead. We need less code... :-) Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4a5d5935f4410a93a3343d85a24dbcddae2c4c20)	2013-10-22 14:34:03 +11:00
Martin Schwenke	a93361fca2	Revert "null out the pointer before we reload the nodes file" This reverts commit 4b0f32047e8bece0a052bdbe2209afe91b7e8ce3. This is not necessary. It just causes a memory leak. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 25fd05505f61dc595c0ef25bb6e332274d5530e8)	2013-10-22 14:34:03 +11:00
Amitay Isaacs	e63232e974	recoverd: Ignore failed flag updates on inactive nodes Signed-off-by: Amitay Isaacs <amitay@gmail.com> Pair-programmed-with: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 484c46eaae056480baf050fd91868f2fd0537985)	2013-10-22 14:34:03 +11:00
Martin Schwenke	4812291ff8	recoverd: Fix the VNN lmaster consistency check It does cope with nodes that don't have the lmaster capability. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 588172bcb6bf267339e2bd09e23d2c4904a27a41)	2013-10-22 11:49:54 +11:00
Martin Schwenke	430ae84877	recoverd: Disable takeover runs on other nodes for 5 minutes 60 seconds might not be long enough to kill all connections and release IPs. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 71080676bb4acbd0d9b595a30cf7fe6dddbf426f)	2013-09-19 12:58:32 +10:00
Martin Schwenke	07d3a1b234	recoverd: Improve logging for takeover runs Takeover runs are currently silent when they succeed. However, they are important, so log something by default. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit b39aa2e401fbb581207d986bac93778e9c01acdc)	2013-09-19 12:57:36 +10:00
Martin Schwenke	566d66e6ab	recoverd: Be careful about freeing the list of IP rebalance target nodes It can change during a takeover run. If it does then don't free it. There are potentially fancier solutions (e.g. check what PNNs are new to the list) to this issue but this is the simplest. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e81589b7084c661adf617e166cc2c25b4939f841)	2013-09-19 12:54:31 +10:00
Martin Schwenke	b33ee7a2a4	recoverd: Fix the implementation of CTDB_SRVID_REBALANCE_NODE The current implementation has a few flaws: * A takeover run is called unconditionally when the timer goes even if the recovery master role has moved. This means a node other than the recovery master can incorrectly do a takeover run. * The rebalancing target nodes are cleared in the setup for a takeover run, regardless of whether the takeover run succeeds. * The timer to force a rebalance isn't cleared if another takeover run occurs before the deadline. Any forced rebalancing will happen in the first takeover run and when the timer expires some time later then an unnecessary takeover run will occur. * If the recovery master role moves then the rebalancing data will stay on the original node and affect the next takeover run to occur if the recovery master role should come back to the original node. Instead, store an array of rebalance target nodes in the recovery master context. This is passed as an extra argument to ctdb_takeover_run() each time it is called and is cleared when a takeover run succeeds. The timer hangs off the array of rebalance target nodes, which is cleared if the node isn't the recovery master. This means that it is possible to lose rebalance data if the recovery master role moves. However, that's a difficult problem to solve. The best way of approaching it is probably to try to stop the recovery master role from jumping around unnecessarily when inactive nodes join the cluster. The long term solution is to avoid this nonsense completely. The IP allocation algorithm needs to cache state between runs so that it knows which nodes have just become healthy. This also needs recovery master stability. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c51c1efe5fc7fa668597f2acd435dee16e410fc9)	2013-09-19 12:54:31 +10:00
Martin Schwenke	1793412de2	recoverd: Remove unused CTDB_SRVID_RELOAD_ALL_IPS and handler Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4cd727439a0824ebb8dbcf737d9888ffc3c41184)	2013-09-19 12:54:31 +10:00
Martin Schwenke	e7cc998570	recoverd: Defer ipreallocated requests when takeover runs are disabled The takeover run will fail anyway but deferring seems like a cleaner option. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 428f800bcdf3dbfe19de8bb36099fbf01ebeaab4)	2013-09-19 12:54:31 +10:00
Martin Schwenke	2f472b4573	recoverd: Reimplement CTDB_SRVID_DISABLE_IP_CHECK Use disable_takeover_runs_handler() instead of maintaining duplicate logic. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 0a51a85915486b2a8fded7ba6444b18c6c1ee8e8)	2013-09-19 12:54:31 +10:00
Martin Schwenke	5f0913d321	recoverd: New SRVID message CTDB_SRVID_DISABLE_TAKEOVER_RUNS This implements a superset of CTDB_SRVID_DISABLE_IP_CHECK. It stops the IP checks but also causes any attempted takeover runs to fail and be rescheduled. This is meant to completely stop IP movements. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 00db4de53a0d86013e79e6577e7e6cf3ef864e56)	2013-09-19 12:54:31 +10:00
Martin Schwenke	0ba7e2ce31	recoverd: Factor out the SRVID handling code The code that handles IP reallocate requests can be reused. This also changes the result back to a SRVID caller to the PNN on success or a negative error code on failure. None of the callers currently look at the result so this is harmless... but it will be useful later. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e4eae6e3291baa299a1d0f733ab11b138ee699a3)	2013-09-19 12:54:30 +10:00
Martin Schwenke	4c3f8dc3bb	recoverd: Make the SRVID request structure generic No need for a separate one for each SRVID. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit d9c22b04d5aa7938a3965bd3144568664eb772ce)	2013-09-19 12:54:30 +10:00
Martin Schwenke	c503997746	recoverd: Move disabling of IP checks into do_takeover_run() Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 48b603fbf16311daa47b01e7a33d477ed51da56d)	2013-09-19 12:54:30 +10:00
Martin Schwenke	bbbb55eef9	recoverd: do_takeover_run() should mark when a takeover run is in progress Nested takeover runs should never happen so they should fail. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 8ed29c60c0a7dd29f2a6efdf694d38e94281e1c4)	2013-09-19 12:54:30 +10:00
Martin Schwenke	a1f915f6b5	recoverd: takeover_fail_callback() doesn't need to set rec->need_takeover_run It is set on every failure anyway. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e5f94c7857405bdeac233069003c3769b3dc3616)	2013-09-19 12:54:30 +10:00
Martin Schwenke	e167e2e7c7	recoverd: New function do_takeover_run() Factor the calling sequence for ctdb_takeover_run() into a new function and call it instead. This changes rec->need_takeover_run to false for each successful takeover run and that seems to be the right thing to do. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 9a3f0c0e61ca5c17e020c6e0463d73c7cf4f7c09)	2013-09-19 12:54:30 +10:00
Martin Schwenke	30a50c6e1e	recoverd: Stabilise the recovery master role On rare occasions a node that has been inactive will trigger an election when it becomes active again. If that node has been up for the longest then it will win the election and the recovery master role will spuriously move. While a node remains inactive we reset the priority time to discourage it from winning elections. The priority time will now reflect roughly how long the node has been active rather than how long it has been up. That means the most stable node is more likely to win elections. Having a stable recovery master means that disabling takeover runs while reloading IPs is more likely to succeed. It also improves the chances of being able to cache information in the recovery master - for example, between takeover runs. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f0f48f22f45e4c82eba2582efae307e25385de81)	2013-09-19 12:54:29 +10:00
Martin Schwenke	3afcc53516	recoverd: Remove an unused temporary talloc context Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit da22d5e60dc023009854025cc9e6bc4b0a84c60e)	2013-08-22 17:00:20 +10:00
Martin Schwenke	e657f75484	recoverd: Log more information when interfaces change Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 3ef93a1a3e60cdf5d8954e7a16a988ea6126916b)	2013-08-22 17:00:20 +10:00
Amitay Isaacs	cb8310ddb6	recoverd: Improve log message when nodes disagree on recmaster Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 7b7aa7b599536cd60ebb84d363607bb4e953248a)	2013-08-14 16:55:51 +10:00
Amitay Isaacs	de6b97ce4f	Revert "recoverd: Use correct tdb flags when creating missing databases" This reverts commit 10a057d8e15c8c18e540598a940d3548c731b0b4. This approach would not work when creating local databases since currently there is no control to receive TDB flags for remote databases. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ca61eb776ab862bd269e45ee0f9f96e7e1e0e001)	2013-08-14 14:15:33 +10:00
Amitay Isaacs	f15e1a28a7	recoverd: Use correct tdb flags when creating missing databases When creating missing databases either locally or remotely, make sure to use the correct tdb flags from other nodes. Without this, volatile databases can get attached without TDB_INCOMPATIBLE_HASH flag. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 10a057d8e15c8c18e540598a940d3548c731b0b4)	2013-08-01 11:08:25 +10:00
Amitay Isaacs	5ba280d8ce	recoverd: Make sure to use jenkins hash for recovery databases Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 32c83e209823e9a4d6306bb7fd63d4500f3e2668)	2013-08-01 10:51:14 +10:00
Amitay Isaacs	f1f787ccac	recoverd: Assemble up-to-date node flags information from remote nodes Currently nodemap used by recovery master is the one obtained from the local node. This information may have been updated while processing main loop. Before comparing node flags on all the nodes, create up-to-date node flags information based on the information received from all the nodes. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit fcf77dec5af973a0e32f3999bc012053a6f47a96)	2013-07-30 15:34:32 +10:00
Martin Schwenke	ca13f28eef	recoverd: Really fix bogus info in message about changed flags Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this. However, this was wrong because old_flags and new_flags were confused. The latter has since been fixed in commit 7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed properly. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 40f2825d6e818dc8c745b6385a545969dfb45fbc)	2013-07-11 15:18:06 +10:00
Sumit Bose	157f1cfefd	Fixes for various issues found by Coverity Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 05bfdbbd0d4abdfbcf28e3930086723508b35952)	2013-07-11 15:16:55 +10:00
Martin Schwenke	a86f1f109a	recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c6fded59fa4da67f738a90fdacb51900e41801f9)	2013-07-10 15:19:27 +10:00
Amitay Isaacs	1c21f37e57	ctdbd: Set process names for child processes This helps distinguish processes in process list in top, perf, etc. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2493f57ce268d6fe7e4c40a87852c347fd60d29e)	2013-07-10 14:33:19 +10:00
Martin Schwenke	0108e8ff10	recoverd: Minor style improvements for ctdb_reload_remote_public_ips() * Add a variable to the loop to make the code more readable and have it generally fit into 80 columns. * Improve comments. * Improve log messages. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 0a292fa8939a1343e44cadaa8ed9f3c0f18ca82f)	2013-07-05 15:52:33 +10:00
Martin Schwenke	7290798a41	recoverd: Clean up log messages in remote IP verification The log messages in verify_remote_ip_allocation() are confusing because they don't include the PNN of the problem node, because it is not known in this function. Add the PNN of the node being verified as a function argument and then shuffle the log messages around to make them clearer. Also fold 3 nested if statements into just one. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f0942fa01cd422133fc9398f56b4855397d7bc86)	2013-07-05 15:52:33 +10:00
Martin Schwenke	15115becef	recoverd: Fix an unclear log message - "Restart recovery process" When the recovery master notices a node in recovery mode it starts the recovery process, it doesn't restart it. Update documentation to match. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 298c4d2c3b4ea3d900c91f5a0a5aca2952a13d61)	2013-07-05 15:52:33 +10:00
Martin Schwenke	bfe0b93652	recoverd: Fix an incorrect comment Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 9f6cd8b0bea619991c9f3bf35188c5950dabf8f4)	2013-07-05 15:52:33 +10:00
Amitay Isaacs	f032c60cd5	recoverd: Send the result from child process only once The result has been sent before the child keeps waiting for parent ctdbd process. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 9aa13bcedd83d463c871e3cf1f3a65da3cd83992)	2013-07-04 20:43:52 +10:00
Michael Adam	3c65197b7a	recoverd: when the recmaster is banned, use that information when forcing an election When we trigger an election because the recmaster considers itself inactive, update our local nodemap with the recmaster's flags before calling force_election(). This way, we don't send the inactive node freeze commands (e.g.) that may fail and then lead to ourselves getting banned. The theory is that this should help avoiding banning loops. Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 932360992b08a5483d90c0590218ba0fd756119e)	2013-07-02 12:59:09 +10:00
Michael Adam	082da536cb	recoverd: fix a comment typo Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 741944f118e98f178b860194eecb215180949d18)	2013-07-02 12:59:09 +10:00
Michael Adam	159b9a2989	recoverd: fix a comment in main_loop Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit ac06c46e4a80c635f6094b5ac6f0bf3e3a02db95)	2013-07-02 12:59:09 +10:00
Michael Adam	26365f2a5f	recoverd: eliminate some trailing spaces from ctdb_election_win() Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit df30c0a05ed908fc2a997c56ff5484736b23b70f)	2013-07-02 12:59:09 +10:00

322 Commits