samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2024-12-22 13:34:15 +03:00

Author	SHA1	Message	Date
Martin Schwenke	950e23f664	ctdbd: Make ctdb_reloadips_child send controls asynchronously Deleting IPs can take a while because IPs are released and connections are killed. This can take a while so do them in parallel. In fact, since the set of IPs being added and deleted will be disjoint, send all the adds/deletes at the same time and then wait. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 85a5b544ec032173e98c9cc3b5402a76b961aa3b)	2013-09-19 12:54:31 +10:00
Martin Schwenke	b33ee7a2a4	recoverd: Fix the implementation of CTDB_SRVID_REBALANCE_NODE The current implementation has a few flaws: * A takeover run is called unconditionally when the timer goes even if the recovery master role has moved. This means a node other than the recovery master can incorrectly do a takeover run. * The rebalancing target nodes are cleared in the setup for a takeover run, regardless of whether the takeover run succeeds. * The timer to force a rebalance isn't cleared if another takeover run occurs before the deadline. Any forced rebalancing will happen in the first takeover run and when the timer expires some time later then an unnecessary takeover run will occur. * If the recovery master role moves then the rebalancing data will stay on the original node and affect the next takeover run to occur if the recovery master role should come back to the original node. Instead, store an array of rebalance target nodes in the recovery master context. This is passed as an extra argument to ctdb_takeover_run() each time it is called and is cleared when a takeover run succeeds. The timer hangs off the array of rebalance target nodes, which is cleared if the node isn't the recovery master. This means that it is possible to lose rebalance data if the recovery master role moves. However, that's a difficult problem to solve. The best way of approaching it is probably to try to stop the recovery master role from jumping around unnecessarily when inactive nodes join the cluster. The long term solution is to avoid this nonsense completely. The IP allocation algorithm needs to cache state between runs so that it knows which nodes have just become healthy. This also needs recovery master stability. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c51c1efe5fc7fa668597f2acd435dee16e410fc9)	2013-09-19 12:54:31 +10:00
Martin Schwenke	c503997746	recoverd: Move disabling of IP checks into do_takeover_run() Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 48b603fbf16311daa47b01e7a33d477ed51da56d)	2013-09-19 12:54:30 +10:00
Martin Schwenke	701c450e90	recoverd: Fail takeover run if "ipreallocated" fails Previously flagging a failure was probably avoided because of attempts to run "ipreallocated" events on stopped and banned nodes, which would fail because they are in recovery. Given the change to a new control and that fallback only retries the old method on active nodes, this should never fail in reasonable circumstances. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 53722430ad35f80935aabd12fa07654126443b8b)	2013-09-19 12:54:30 +10:00
Martin Schwenke	630196423a	recoverd: Banned nodes should not be told to run "ipreallocated" event They will reject it because they are in recovery. This can result in extra banning credits being applied to banned nodes. This corresponds to commit 9132e6814ed927fa317f333f03dedb18f75d0e5b from the 1.2.40 branch. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 403938804caf1322f9773d63197e4303a7b2a788)	2013-09-18 17:16:35 +10:00
Martin Schwenke	8d11da3546	recoverd: Remove an orphaned comment This should have been removed with the associated code in commit 14bd0b6961ef1294e9cba74ce875386b7dfbf446. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 36de63843de10a1f2a9ccdbbee24cc1d08542984)	2013-09-11 15:35:16 +10:00
Martin Schwenke	4e62553fcb	recoverd: Update a comment to use current terminology Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit ea5576071b22e1877903ec0921d375626a23e13b)	2013-09-11 15:35:10 +10:00
Martin Schwenke	1ae731198a	recoverd: Move struct ctdb_public_ip_list back into ctdb_takeover.c This is an internal structure. It was moved into ctdb_private.h a long time ago to allow unit testing. Unit test compilation was changed shortly afterwards to make this unnecessary. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit db57261d7dc264e161659a8c547f44fbd9e88eeb)	2013-08-22 17:00:20 +10:00
Martin Schwenke	a5cb72cac3	ctdbd: Kill client process without checking for tracked child Commit f73a4b1495830bcdd094a93732a89dd53b3c2f78 added a safety check to ensure that CTDB never kills unrelated processes. However, client processes are unrelated. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 782814288bb560099ee44b607bf35f3eddf37f82)	2013-07-29 15:58:51 +10:00
Martin Schwenke	f46ab595d1	recoverd: Call takeover fail callback only once per node Currently the fail callback is called once per (takeip/releaseip) control failure. This is overkill and can get a node banned much too quickly. Instead, keep track of control failures per node and only call fail callback once per failed node. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit bf4a7c1ad87e0e848296d15d63eb8cd901ca5335)	2013-07-29 15:48:48 +10:00
Amitay Isaacs	1c21f37e57	ctdbd: Set process names for child processes This helps distinguish processes in process list in top, perf, etc. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2493f57ce268d6fe7e4c40a87852c347fd60d29e)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	bcb64aa55f	recoverd: Fix buffer overflow error in reloadips Signed-off-by: Amitay Isaacs <amitay@gmail.com> Pair-Programmed-With: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 41182623891d74a7e9e9c453183411a161201e67)	2013-07-05 15:52:34 +10:00
Martin Schwenke	dcdae86dc7	ctdbd: Log something when releasing all IPs At the moment this is silent and it can be confusing to see IPs just disappear. Also, this message: Been in recovery mode for too long. Dropping all IPS can cause anxiety when all IPs should already have been dropped. Adding a comforting message saying that 0 IPs were dropped relieves such anxiety. :-) Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4d0f26b306fc465d551d340b0e7dce4412eae3fd)	2013-07-05 15:52:33 +10:00
Martin Schwenke	7290798a41	recoverd: Clean up log messages in remote IP verification The log messages in verify_remote_ip_allocation() are confusing because they don't include the PNN of the problem node, because it is not known in this function. Add the PNN of the node being verified as a function argument and then shuffle the log messages around to make them clearer. Also fold 3 nested if statements into just one. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f0942fa01cd422133fc9398f56b4855397d7bc86)	2013-07-05 15:52:33 +10:00
Martin Schwenke	26b161156a	ctdbd: Release IP callback should fail if the IP is still hosted At the moment there are (at least) 2 bugs that cause rogue IPs: * A race where release_ip_callback() runs after a "subsequent" take IP has completed. The IP is back on an interface but we unset vnn->iface in the callback. * A "releaseip" eventscript times out. We ignore the timeout and call it success, deleting the VNN even if the IP is still hosted. We could decide not to ignore the timeout and ban the node, but killing TCP connections can take a long time and that might result in a lot of banning. We probably won't reinstate banning on "releaseip" until killing TCP connections has been optimised. In both cases, a rogue IP can be avoided by leaving vnn->iface set and simply failing the control. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit c5797f2942e83da24df548ea07196fbbac0eab20)	2013-07-05 15:52:32 +10:00
Martin Schwenke	793233f6b6	ctdbd: Log warnings in release IP when unexpected interface is encountered Previous code changes work around potential problems but do not provide useful information when a problem occurs. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit f1f1b0c24b9b6cd24b83a4e4da16e179287ec6ac)	2013-07-05 15:52:32 +10:00
Amitay Isaacs	6391f61fbc	build: Fix compiler warnings for uninitialized variables Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 5408c5c4050539e5aa06a5e82ceb63a6cb5cef0c)	2013-07-04 20:43:52 +10:00
Mathieu Parent	d82b9ae410	build: Fix tdb.h path to enable building with system TDB library (This used to be ctdb commit f8bf99de3a5f56be67aaa67ed836458b1cf73e86)	2013-06-14 16:45:27 +10:00
Martin Schwenke	1ab2bbb349	recoverd: Backward compatibility for nodes without IPREALLOCATED control Consider the case of upgrading a cluster node by node, where some nodes are still running older versions of CTDB without the IPREALLOCATED control. If a "new" node takes over as recovery master and a failover occurs, then it will attempt to send IPREALLOCATED controls to all nodes. The "old" nodes will fail in a fairly nondescript way (result == -1). To try to handle this situation, fall back to the EVENTSCRIPT control to handle "ipreallocated". Only do this on the failed nodes. However, do not do this on nodes that timed out (they've probably implemented the control and we should call the regular fail_callback to get those nodes banned) or for stopped nodes (since they can't actually run the "ipreallocated" event via the EVENTSCRIPT control). Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit b2654853ce9b7c18c5874b080bc94d3118078a5d)	2013-05-27 15:15:25 +10:00
Martin Schwenke	f35e9bba9b	recoverd: Nodes can only takeover IPs if they are in runstate RUNNING Currently the order of the first IP allocation, including the first "ipreallocated" event, and the "startup" event is undefined. Both of these events can (re)start services. This stops IPs being hosted before the "startup" event has completed. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit f15dd562fd8c08cafd957ce9509102db7eb49668)	2013-05-24 16:27:55 +10:00
Martin Schwenke	7f03618ae4	recoverd: Handle errors carefully when fetching tunables If a tunable is not implemented on a remote node then this should not be fatal. In this case the takeover run can continue using benign defaults for the tunables. However, timeouts and any unexpected errors should be fatal. These should abort the takeover run because they can lead to unexpected IP movements. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c0c27762ea728ed86405b29c642ba9e43200f4ae)	2013-05-24 16:27:55 +10:00
Martin Schwenke	116f62a7b3	recoverd: Set explicit default value when getting tunable from nodes Both of the current defaults are implicitly 0. It is better to make the defaults obvious. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 1190bb0d9c14dc5889c2df56f6c8986db23d81a1)	2013-05-24 16:04:57 +10:00
Martin Schwenke	e78b064dcc	recoverd: Whitespace improvements Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 473cfcb019f0cb4a094bf10397f7414f7923ee57)	2013-05-24 15:55:11 +10:00
Martin Schwenke	1a181a4284	recoverd: Use talloc_array_length() for simpler code Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f6792f478197774d2f3b2258c969b67c83e017ab)	2013-05-24 15:55:10 +10:00
Martin Schwenke	63577c96db	ctdbd: Replace ctdb->done_startup with ctdb->runstate This allows states, including startup and shutdown states, to be clearly tracked. This doesn't include regular runtime "states", which are handled by node flags. Introduce new functions ctdb_set_runstate(), runstate_to_string() and runstate_from_string(). Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 8076773a9924dcf8aff16f7d96b2b9ac383ecc28)	2013-05-24 14:08:06 +10:00
Martin Schwenke	5fdf71b898	recoverd: takeover_run_core() should not use modified node flags Modifying the node flags with IP-allocation-only flags is not necessary. It causes breakage if the flags are not cleared after use. ctdb_takeover_run() no longer needs the general node flags - it only needs the IP flags. Instead of modifying the node flags in nodemap, construct a custom IP flags list and have takeover_run_core() use that instead of node flags. As well as being safer, this makes the IP allocation code more self contained and a little bit clearer. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 14bd0b6961ef1294e9cba74ce875386b7dfbf446)	2013-05-23 16:18:23 +10:00
Martin Schwenke	e769f8575a	ctdbd: Log add and delete of IPs At the moment, when someone deletes all the IPs on a node, all we see are the release IP messages and we have to guess why. Some would argue that add/release are more significant than take/release so they should be logged. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 3c3df1d6afec7e3e721f9bcd4e8b8e008fd6e50b)	2013-05-22 14:24:22 +10:00
Martin Schwenke	0baefba368	ctdbd: Removed bogus comment in ctdb_find_iface() Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4a8d90d0812a3242f58a2a0e2aa0f528f60f7013)	2013-05-22 14:24:21 +10:00
Martin Schwenke	54e91df60d	recoverd: Move IP flags into ctdb_takeover.c These should never be seen outside the IP allocation code. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e143abd16ccde2e0edfe103673d31a5fb06b6aef)	2013-05-09 12:55:42 +10:00
Martin Schwenke	50f19b5bd4	recoverd: Clear IP flags after IP allocation algorithm has run If these flags are left set they will confuse other recovery daemon code. Factor the clearing code into new function clear_ipflags(). Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 45c776958017ea7001f061842c9e0f60e4a25f23)	2013-05-09 12:55:42 +10:00
Martin Schwenke	530020d83b	recoverd: Remove unused mask argument and initial mask calculation This has been replaced by set_ipflags() and associated functionality. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit d0a3822573db296e73cc897835f783c8abc084b3)	2013-05-07 16:20:47 +10:00
Martin Schwenke	ee7357de51	recoverd: When calculating rebalance candidates don't consider flags This is really a check to see if a node is already hosting IPs. If so, we assume it was previously healthy so it isn't considered as a rebalance candidate. There's no need to limit this to healthy nodes, since this is checked elsewhere. Due to this the variable newly_healthy is renamed everywhere to rebalance_candidates. The mask argument is now completely unused. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 65e0ea6c2c0629e19349ba4b9affa221fde2b070)	2013-05-07 16:20:47 +10:00
Martin Schwenke	c9056b4f88	recoverd: Remove unused mask argument from IP allocation functions This is a no-op and is in a separate commit to make the previous commit less cumbersome. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 107e656bbe24f9d21fbaf886a3e9417da4effe5a)	2013-05-07 16:20:47 +10:00
Martin Schwenke	0445c988e2	recoverd: Fix tunable NoIPTakeoverOnDisabled, rename to NoIPHostOnAllDisabled This really needs to be per-node. The rename is because nodes with this tunable switched on should drop IPs if they become unhealthy (or disabled in some other way). * Add new flag NODE_FLAGS_NOIPHOST, only used in recovery daemon. * Enhance set_ipflags_internal() and set_ipflags() to set up NODE_FLAGS_NOIPHOST depending on setting of NoIPHostOnAllDisabled and/or whether nodes are disabled/inactive. * Replace can_node_servce_ip() with functions can_node_host_ip() and can_node_takeover_ip(). These functions are the only ones that need to look at NODE_FLAGS_NOIPTAKEOVER and NODE_FLAGS_NOIPHOST. They can make the decision without looking at any other flags due to previous setup. * Remove explicit flag checking in IP allocation functions (including unassign_unsuitable_ips()) and just call can_node_host_ip() and can_node_takeover_ip() as appropriate. * Update test code to handle CTDB_SET_NoIPHostOnAllDisabled. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 1308a51f73f2e29ba4dbebb6111d9309a89732cc)	2013-05-07 16:20:46 +10:00
Martin Schwenke	ac80824709	recoverd: Factor out new function all_nodes_are_disabled() Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 12aef10e9889760d98f58c8d916f19d069fa381a)	2013-05-07 16:20:46 +10:00
Martin Schwenke	657162fb34	recoverd: Refactor code to get NoIPTakeover tunable from all nodes Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 1fb5352d2b6918fcc6f630db49275d25a3eebe8d)	2013-05-07 16:20:46 +10:00
Martin Schwenke	17521b31b2	recoverd: Add debug message when dropping IPs in IP allocation Update tests accordingly. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 91405282ba4abad4ad8e8c5f7ee4c83c75f38280)	2013-05-07 16:20:46 +10:00
Martin Schwenke	745c6bc363	recoverd: ctdb_takeover_run() uses CTDB_CONTROL_IPREALLOCATED This means "ipreallocated" is now run on stopped nodes. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 83b61f7414b1f7a3424497ac987ca0724fba9eaa)	2013-05-06 13:38:21 +10:00
Martin Schwenke	2e59cd5428	ctdbd: New control CTDB_CONTROL_IPREALLOCATED This is an alternative to using ctdb_run_eventscripts() that can be used when in recovery. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 27a44685f0d7a88804b61a1542bb42adc8f88cb1)	2013-05-06 13:38:21 +10:00
Amitay Isaacs	77a29b3733	recoverd/takeover: Use IP->node mapping info from nodes hosting that IP When collating IP information for IP layout, only trust the nodes that are hosting an IP, to have correct information about that IP. Ignore what all the other nodes think. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 1c7adbccc69ac276d2b957ad16c3802fdb8868ca)	2013-04-08 11:14:32 +10:00
Martin Schwenke	53bd183683	recoverd: Separate each IP allocation algorithm into its own function This makes the code much more readable and maintainable. As a side effect, fix a memory leak in LCP2. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 6a1d88a17321f7e1dc84b4823d5e7588516a6904)	2013-01-08 10:16:11 +11:00
Martin Schwenke	2e8df43561	recoverd: New function unassign_unsuitable_ips() Move the code into a new function so it can be called from a number of places. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 8adb255e62dbe60d1e983047acd7b9c941231d11)	2013-01-08 10:16:11 +11:00
Martin Schwenke	bcefb76884	recoverd: Move failback retry loop into basic_failback() and lcp2_failback() The retry loop is currently in ctdb_takeover_run_core(). Pushing it into each function will make it possible to put each algorithm into a separate top-level function. This will make the code much clearer and more maintainable. Also keep associated test code compatible. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f6ce18d011dd9043b04256690d826deb2640cd89)	2013-01-08 10:16:11 +11:00
Martin Schwenke	443fbb9e01	recoverd: Trying to failback more IPs no longer allocates unassigned IPs Neither basic_failback() nor lcp2_failback() unassign IPs anymore, so there's no point looping back that far. Also fix a unit test that now fails because looping back to handle unassigned IPs is no longer logged. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c09aeaecad7d3232b1c07bab826b96818756f5e0)	2013-01-08 10:16:11 +11:00
Martin Schwenke	dfa7ce7b73	recoverd: basic_failback() can call find_takeover_node() directly Instead of unassigning, looping back and depending on basic_allocate_unassigned. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4dc08e37dec464c8785a2ddae15c7c69d3c81ac3)	2013-01-08 10:16:11 +11:00
Martin Schwenke	326328d520	recoverd: Don't do failback at all when deterministic IPs are in use This seems to be the right thing to do instead of calling into the failback code and continually skipping the release of an IP. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4c87e7cb3fa2cf2e034fa8454364e0a7fe0c8f81)	2013-01-08 10:16:11 +11:00
Martin Schwenke	ef403f70f2	recoverd: Move the test for both 'DeterministicIPs' and 'NoIPFailback' set If this is done earlier then some other logic can be improved. Also, this should be a warning since no error condition is set. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e06476e07197b7327b8bdac9c0b2e7281798ffec)	2013-01-08 10:16:11 +11:00
Martin Schwenke	a3911ed7bf	recoverd: Fix a memory leak in IP allocation Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit bcd5f587aff3ba536cb0b5ef00d2d802352bae25)	2013-01-08 10:16:11 +11:00
Martin Schwenke	4f0d68cba6	ctdbd: Clean up orphaned interfaces when an IP is deleted Add a new function ctdb_remove_orphaned_ifaces() and call it in ctdb_control_del_public_address(). ctdb_remove_orphaned_ifaces() uses a naive implementation that does things in a very obvious way. There are many ways to improve the performance - some are mentioned in a comment in the code. However, I doubt that this will be a bottleneck even with a large number of public IPs. Running the eventscript is likely to outweigh the cost of this cleanup. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a)	2013-01-07 12:19:33 +11:00
Martin Schwenke	0f1bcebc80	ctdbd: Make the link status of new interfaces more flexible Neither up nor down is a good default value for the link status of a new interface. Up means that IPs can be assigned to interfaces before the true state is known and they can move away quickly if the interface is actually down. Down means that IPs can't be assigned to an interface for a variable amount of time - until a monitor cycle occurs - and this can result in imbalanced IPs. This is a neat compromise. Before the startup event completes, IPs can't be assigned to interfaces because all interfaces begin in a down state. As soon as the startup event completes, IPs can be allocated to any interface that has been marked up by the eventscript. Later, during normal operation, newly added IPs can be assigned to new interfaces immediately. The IPs will still move away if an interface is noticed to be down in the next monitor cycle, but that is the exception rather than the rule. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 9275a69a414482f1053ae14528d5972575b9214e)	2012-11-19 15:53:13 +11:00

237 Commits