samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2025-01-10 01:18:15 +03:00

Author	SHA1	Message	Date
Martin Schwenke	430ae84877	recoverd: Disable takeover runs on other nodes for 5 minutes 60 seconds might not be long enough to kill all connections and release IPs. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 71080676bb4acbd0d9b595a30cf7fe6dddbf426f)	2013-09-19 12:58:32 +10:00
Martin Schwenke	07d3a1b234	recoverd: Improve logging for takeover runs Takeover runs are currently silent when they succeed. However, they are important, so log something by default. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit b39aa2e401fbb581207d986bac93778e9c01acdc)	2013-09-19 12:57:36 +10:00
Martin Schwenke	566d66e6ab	recoverd: Be careful about freeing the list of IP rebalance target nodes It can change during a takeover run. If it does then don't free it. There are potentially fancier solutions (e.g. check what PNNs are new to the list) to this issue but this is the simplest. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e81589b7084c661adf617e166cc2c25b4939f841)	2013-09-19 12:54:31 +10:00
Martin Schwenke	4fb0d4a301	recoverd: reloadips should rebalance target nodes for new IPs Otherwise, if existing IPs are added to extra nodes (that have, perhaps, been disconnected) then those IPs will not be rebalanced across the extra nodes. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit ceb30432a9a550778aed0b422a654fc5287b82a3)	2013-09-19 12:54:31 +10:00
Martin Schwenke	950e23f664	ctdbd: Make ctdb_reloadips_child send controls asynchronously Deleting IPs can take a while because IPs are released and connections are killed, so do the deletions in parallel. In fact, since the set of IPs being added and deleted will be disjoint, send all the adds/deletes at the same time and then wait. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 85a5b544ec032173e98c9cc3b5402a76b961aa3b)	2013-09-19 12:54:31 +10:00
Martin Schwenke	b33ee7a2a4	recoverd: Fix the implementation of CTDB_SRVID_REBALANCE_NODE The current implementation has a few flaws: * A takeover run is called unconditionally when the timer goes even if the recovery master role has moved. This means a node other than the recovery master can incorrectly do a takeover run. * The rebalancing target nodes are cleared in the setup for a takeover run, regardless of whether the takeover run succeeds. * The timer to force a rebalance isn't cleared if another takeover run occurs before the deadline. Any forced rebalancing will happen in the first takeover run and when the timer expires some time later then an unnecessary takeover run will occur. * If the recovery master role moves then the rebalancing data will stay on the original node and affect the next takeover run to occur if the recovery master role should come back to the original node. Instead, store an array of rebalance target nodes in the recovery master context. This is passed as an extra argument to ctdb_takeover_run() each time it is called and is cleared when a takeover run succeeds. The timer hangs off the array of rebalance target nodes, which is cleared if the node isn't the recovery master. This means that it is possible to lose rebalance data if the recovery master role moves. However, that's a difficult problem to solve. The best way of approaching it is probably to try to stop the recovery master role from jumping around unnecessarily when inactive nodes join the cluster. The long term solution is to avoid this nonsense completely. The IP allocation algorithm needs to cache state between runs so that it knows which nodes have just become healthy. This also needs recovery master stability. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c51c1efe5fc7fa668597f2acd435dee16e410fc9)	2013-09-19 12:54:31 +10:00
Martin Schwenke	1793412de2	recoverd: Remove unused CTDB_SRVID_RELOAD_ALL_IPS and handler Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4cd727439a0824ebb8dbcf737d9888ffc3c41184)	2013-09-19 12:54:31 +10:00
Martin Schwenke	e7cc998570	recoverd: Defer ipreallocated requests when takeover runs are disabled The takeover run will fail anyway but deferring seems like a cleaner option. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 428f800bcdf3dbfe19de8bb36099fbf01ebeaab4)	2013-09-19 12:54:31 +10:00
Martin Schwenke	2f472b4573	recoverd: Reimplement CTDB_SRVID_DISABLE_IP_CHECK Use disable_takeover_runs_handler() instead of maintaining duplicate logic. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 0a51a85915486b2a8fded7ba6444b18c6c1ee8e8)	2013-09-19 12:54:31 +10:00
Martin Schwenke	5f0913d321	recoverd: New SRVID message CTDB_SRVID_DISABLE_TAKEOVER_RUNS This implements a superset of CTDB_SRVID_DISABLE_IP_CHECK. It stops the IP checks but also causes any attempted takeover runs to fail and be rescheduled. This is meant to completely stop IP movements. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 00db4de53a0d86013e79e6577e7e6cf3ef864e56)	2013-09-19 12:54:31 +10:00
Martin Schwenke	0ba7e2ce31	recoverd: Factor out the SRVID handling code The code that handles IP reallocate requests can be reused. This also changes the result back to a SRVID caller to the PNN on success or a negative error code on failure. None of the callers currently look at the result so this is harmless... but it will be useful later. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e4eae6e3291baa299a1d0f733ab11b138ee699a3)	2013-09-19 12:54:30 +10:00
Martin Schwenke	4c3f8dc3bb	recoverd: Make the SRVID request structure generic No need for a separate one for each SRVID. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit d9c22b04d5aa7938a3965bd3144568664eb772ce)	2013-09-19 12:54:30 +10:00
Martin Schwenke	c503997746	recoverd: Move disabling of IP checks into do_takeover_run() Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 48b603fbf16311daa47b01e7a33d477ed51da56d)	2013-09-19 12:54:30 +10:00
Martin Schwenke	bbbb55eef9	recoverd: do_takeover_run() should mark when a takeover run is in progress Nested takeover runs should never happen, so they should fail. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 8ed29c60c0a7dd29f2a6efdf694d38e94281e1c4)	2013-09-19 12:54:30 +10:00
Martin Schwenke	a1f915f6b5	recoverd: takeover_fail_callback() doesn't need to set rec->need_takeover_run It is set on every failure anyway. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit e5f94c7857405bdeac233069003c3769b3dc3616)	2013-09-19 12:54:30 +10:00
Martin Schwenke	701c450e90	recoverd: Fail takeover run if "ipreallocated" fails Previously flagging a failure was probably avoided because of attempts to run "ipreallocated" events on stopped and banned nodes, which would fail because they are in recovery. Given the change to a new control and that fallback only retries the old method on active nodes, this should never fail in reasonable circumstances. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 53722430ad35f80935aabd12fa07654126443b8b)	2013-09-19 12:54:30 +10:00
Martin Schwenke	e167e2e7c7	recoverd: New function do_takeover_run() Factor the calling sequence for ctdb_takeover_run() into a new function and call it instead. This changes rec->need_takeover_run to false for each successful takeover run and that seems to be the right thing to do. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 9a3f0c0e61ca5c17e020c6e0463d73c7cf4f7c09)	2013-09-19 12:54:30 +10:00
Martin Schwenke	30a50c6e1e	recoverd: Stabilise the recovery master role On rare occasions a node that has been inactive will trigger an election when it becomes active again. If that node has been up for the longest then it will win the election and the recovery master role will spuriously move. While a node remains inactive we reset the priority time to discourage it from winning elections. The priority time will now reflect roughly how long the node has been active rather than how long it has been up. That means the most stable node is more likely to win elections. Having a stable recovery master means that disabling takeover runs while reloading IPs is more likely to succeed. It also improves the chances of being able to cache information in the recovery master - for example, between takeover runs. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f0f48f22f45e4c82eba2582efae307e25385de81)	2013-09-19 12:54:29 +10:00
Martin Schwenke	630196423a	recoverd: Banned nodes should not be told to run "ipreallocated" event They will reject it because they are in recovery. This can result in extra banning credits being applied to banned nodes. This corresponds to commit 9132e6814ed927fa317f333f03dedb18f75d0e5b from the 1.2.40 branch. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 403938804caf1322f9773d63197e4303a7b2a788)	2013-09-18 17:16:35 +10:00
Martin Schwenke	8d11da3546	recoverd: Remove an orphaned comment This should have been removed with the associated code in commit 14bd0b6961ef1294e9cba74ce875386b7dfbf446. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 36de63843de10a1f2a9ccdbbee24cc1d08542984)	2013-09-11 15:35:16 +10:00
Martin Schwenke	4e62553fcb	recoverd: Update a comment to use current terminology Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit ea5576071b22e1877903ec0921d375626a23e13b)	2013-09-11 15:35:10 +10:00
Michael Adam	18f17aaa33	server: standardize formatting of comment block for ctdb_reply_dmaster() while I'm at it.. This was the comment block I was touching and meant to adapt in commit 00d3bf092e2f72eda330978c75ec85f17e870553. My search was apparently not unique... Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 09940255011b119dc6af3304f5d3e9568e6006fd)	2013-08-26 13:24:32 +02:00
Martin Schwenke	3afcc53516	recoverd: Remove an unused temporary talloc context Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit da22d5e60dc023009854025cc9e6bc4b0a84c60e)	2013-08-22 17:00:20 +10:00
Martin Schwenke	1ae731198a	recoverd: Move struct ctdb_public_ip_list back into ctdb_takeover.c This is an internal structure. It was moved into ctdb_private.h a long time ago to allow unit testing. Unit test compilation was changed shortly afterwards to make this unnecessary. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit db57261d7dc264e161659a8c547f44fbd9e88eeb)	2013-08-22 17:00:20 +10:00
Martin Schwenke	e657f75484	recoverd: Log more information when interfaces change Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 3ef93a1a3e60cdf5d8954e7a16a988ea6126916b)	2013-08-22 17:00:20 +10:00
Amitay Isaacs	58e96eb178	traverse: Log when database traverse is started Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 256b157232c60bc432c94e54b1fae9699f737557)	2013-08-22 17:00:19 +10:00
Amitay Isaacs	e850a6d2ca	ctdbd: Finish eventscript callback processing before debugging hung script This ensures that the result of eventscripts is updated and callback is processed before debugging hung script. So "ctdb scriptstatus" output will be useful from debug hung script. Signed-off-by: Amitay Isaacs <amitay@gmail.com> Pair-Programmed-With: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4ed2efb838d2ac97746666f614ebef5fdf3cdd5e)	2013-08-22 17:00:19 +10:00
Amitay Isaacs	19444f7c3d	ctdbd: Make sure call data is freed if doing an early return This should avoid memory bloat when a request bounces between nodes. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 7677fb263f06a97398e2c546e32273fb96edca69)	2013-08-22 16:59:49 +10:00
Amitay Isaacs	1467b666f2	Revert "LACOUNT: Add back lacount mechanism to defer migrating a fetched/read copy until after default of 20 consecutive requests from the same node" This reverts commit 035c0d981bde8c0eee8b3f24ba8e2dc817e5b504. This is a premature optimization. Record can bounce between nodes very quickly if it is a contended record. There is no need to hold a record on a node unnecessarily. In case record contention becomes bad, enabling sticky records on a database is a better idea. Conflicts: include/ctdb_private.h server/ctdb_tunables.c Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ac417b0003f0116f116834ad2ac51482d25cfa0d)	2013-08-22 14:08:52 +10:00
Amitay Isaacs	59dae19f5a	ctdbd: Print a log message when a key becomes hot Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 48f40985f4592c28402303ccbb458756f4914f75)	2013-08-22 14:08:52 +10:00
Michael Adam	621bfe8b0d	server: standardize formatting of comment block for ctdb_reply_dmaster() while I'm at it.. Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 00d3bf092e2f72eda330978c75ec85f17e870553)	2013-08-19 17:12:33 +02:00
Michael Adam	922246de73	server: fix wording and punctuation in comment block for ctdb_reply_dmaster(). Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit cb3a1c5af3b796dba30cae07118670d3c9e57df7)	2013-08-19 17:12:32 +02:00
Amitay Isaacs	cb8310ddb6	recoverd: Improve log message when nodes disagree on recmaster Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 7b7aa7b599536cd60ebb84d363607bb4e953248a)	2013-08-14 16:55:51 +10:00
Amitay Isaacs	ae30b61255	vacuuming: Fix vacuuming bug where requests keep bouncing between nodes (part 2) This is caused by corruption of a record header such that the records on two nodes point to each other as dmaster. This makes a request for that record bounce between nodes endlessly. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit f0853013655ac3bedf1b793de128fb679c6db6c6)	2013-08-14 16:55:51 +10:00
Amitay Isaacs	ee8d573069	vacuuming: Fix vacuuming bug where requests keep bouncing between nodes (part 1) This is caused by corruption of a record header such that the records on two nodes point to each other as dmaster. This makes a request for that record bounce between nodes endlessly. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit a610bc351f0754c84c78c27d02f9a695e60c5b0f)	2013-08-14 16:55:51 +10:00
Amitay Isaacs	de6b97ce4f	Revert "recoverd: Use correct tdb flags when creating missing databases" This reverts commit 10a057d8e15c8c18e540598a940d3548c731b0b4. This approach would not work when creating local databases since currently there is no control to receive TDB flags for remote databases. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ca61eb776ab862bd269e45ee0f9f96e7e1e0e001)	2013-08-14 14:15:33 +10:00
Amitay Isaacs	a98baa539e	ctdbd: When a record is made sticky, log only once Instead of logging from ctdb_request_call(), log the message from ctdb_make_record_sticky(). That way if the record is already sticky, the message is not repeated unnecessarily. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 44a64d1c388bfe3c3388b191edfaedecfb7bb831)	2013-08-09 11:07:37 +10:00
Amitay Isaacs	d42cea6efe	ctdbd: Improve high hopcount log messages when request is redirected Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 9cde47e1a5bf1b9ca3b4da8c2db94caac2b1aa5e)	2013-08-09 11:07:37 +10:00
Amitay Isaacs	ded2f28954	ctdbd: Avoid leaking file descriptor if talloc fails Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit d7f6bc3fed2dc61e6e587b4c0ec0ac27d533bbbe)	2013-08-09 11:04:55 +10:00
Amitay Isaacs	a030b938ca	eventscript: Wait for debug hung script to finish or timeout before continuing Currently if the debug hung script takes a long time to finish, the subsequent monitor event can collide with the previous event which is not yet finished. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 9e99e0eb072e2b845914ee3896acbc66b96138d7)	2013-08-09 11:04:55 +10:00
Amitay Isaacs	477a51aba5	locking: Do not create multiple lock processes for the same key If there are multiple lock helper processes waiting for the same record, then it will cause a thundering herd when that record has been unlocked. So avoid scheduling lock contexts for the same record. This will also mean that multiple requests will get queued up behind the same lock context and can be processed quickly once the lock has been obtained. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ebecc3a18f1cb397a78b56eaf8f752dd5495bcc9)	2013-08-09 11:04:55 +10:00
Amitay Isaacs	9ba793a80f	locking: Move function find_lock_context() before ctdb_lock_schedule() So that ctdb_lock_schedule() can call this function without requiring extra prototype declaration. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 68af5405acc123b5a90decd2123e2a02961a8fcf)	2013-08-09 11:04:42 +10:00
Amitay Isaacs	b77fec9381	ctdbd: Print set db sticky message after it's set Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 824dcec35ec461d78e22b2ea109473b32bfe3972)	2013-08-01 11:08:26 +10:00
Amitay Isaacs	f15e1a28a7	recoverd: Use correct tdb flags when creating missing databases When creating missing databases either locally or remotely, make sure to use the correct tdb flags from other nodes. Without this, volatile databases can get attached without TDB_INCOMPATIBLE_HASH flag. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 10a057d8e15c8c18e540598a940d3548c731b0b4)	2013-08-01 11:08:25 +10:00
Amitay Isaacs	5ba280d8ce	recoverd: Make sure to use jenkins hash for recovery databases Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 32c83e209823e9a4d6306bb7fd63d4500f3e2668)	2013-08-01 10:51:14 +10:00
Amitay Isaacs	f1f787ccac	recoverd: Assemble up-to-date node flags information from remote nodes Currently nodemap used by recovery master is the one obtained from the local node. This information may have been updated while processing main loop. Before comparing node flags on all the nodes, create up-to-date node flags information based on the information received from all the nodes. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit fcf77dec5af973a0e32f3999bc012053a6f47a96)	2013-07-30 15:34:32 +10:00
Amitay Isaacs	0993387f4a	ctdbd: Don't consider a hot record if the hopcount is zero Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ab35773518ad15588013f4d859f7bee790437450)	2013-07-30 15:34:32 +10:00
Amitay Isaacs	054d8727ed	ctdbd: Fix updating of hot keys in database statistics Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit fde4b4db5a57f75c5efa5647c309f33e0d5a68f3)	2013-07-29 16:00:46 +10:00
Amitay Isaacs	d8fc36781c	ctdbd: Remove incomplete ctdb_db_statistics_wire structure Instead of maintaining another structure, add an element as place holder for marshall buffer of hot keys. This avoids duplication of the structure. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit e73b2e12adc9db1dedb48d32bba3a8406a80f4cd)	2013-07-29 16:00:46 +10:00
Amitay Isaacs	854216236b	Revert "ctdbd: Remove incomplete ctdb_db_statistics_wire structure" The structure cannot be removed without adding support for marshalling keys for hot records. This reverts commit 26a4653df594d351ca0dc1bd5f5b2f5b0eb0a9a5. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 023ca2e84f5ed064a288526b9c2bc7e06674dd81)	2013-07-29 16:00:46 +10:00
Martin Schwenke	a5cb72cac3	ctdbd: Kill client process without checking for tracked child Commit f73a4b1495830bcdd094a93732a89dd53b3c2f78 added a safety check to ensure that CTDB never kills unrelated processes. However, client processes are unrelated. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 782814288bb560099ee44b607bf35f3eddf37f82)	2013-07-29 15:58:51 +10:00
Martin Schwenke	f46ab595d1	recoverd: Call takeover fail callback only once per node Currently the fail callback is called once per (takeip/releaseip) control failure. This is overkill and can get a node banned much too quickly. Instead, keep track of control failures per node and only call fail callback once per failed node. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit bf4a7c1ad87e0e848296d15d63eb8cd901ca5335)	2013-07-29 15:48:48 +10:00
Martin Schwenke	6cbcc4a8d9	ctdbd: Pass event name to hung script debugger Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit e0f3fa1020e13b84bdd672538168d148f1847d57)	2013-07-23 11:28:07 +10:00
Martin Schwenke	88ba32b787	ctdbd: Sleep at exit to allow time for log messages to flush Register print_exit_message() earlier so that it covers most of the early exits. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 90d792cf28d6a823141e4c417b6978f02a9cf596)	2013-07-19 15:40:59 +10:00
Martin Schwenke	84f5528d9b	ctdbd: Exit if something is already listening on CTDB socket Don't blindly remove the socket. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 3dd5b925dcf0e9a5b877638e471c5ecf36b46c58)	2013-07-19 15:40:43 +10:00
Martin Schwenke	a3bef911f3	ctdbd: Allow extra recovery to repair persistent DBs during first recovery Commit 8076773a9924dcf8aff16f7d96b2b9ac383ecc28 introduced a potential regression because a node may not have completed the "recovered" event (so might still be in CTDB_RUNSTATE_FIRST_RECOVERY) when another node becomes healthy. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 57ef5d3827ea3417a32703e259a53ce6fd10ac45)	2013-07-19 15:35:41 +10:00
Martin Schwenke	ca13f28eef	recoverd: Really fix bogus info in message about changed flags Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this. However, this was wrong because old_flags and new_flags were confused. The latter has since been fixed in commit 7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed properly. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 40f2825d6e818dc8c745b6385a545969dfb45fbc)	2013-07-11 15:18:06 +10:00
Sumit Bose	157f1cfefd	Fixes for various issues found by Coverity Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 05bfdbbd0d4abdfbcf28e3930086723508b35952)	2013-07-11 15:16:55 +10:00
Sumit Bose	d039f799ac	Check return value of tdb_delete() Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 5cdcc3d45d358ddbcd7e864898eed9cbd9935429)	2013-07-11 15:16:55 +10:00
Martin Schwenke	a86f1f109a	recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit c6fded59fa4da67f738a90fdacb51900e41801f9)	2013-07-10 15:19:27 +10:00
Amitay Isaacs	14c49eabe4	ctdbd: Print tdb flags when logging attached to database message Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 846109169ee5e3d03135156e45c8dac93aa2e95b)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	1c21f37e57	ctdbd: Set process names for child processes This helps distinguish processes in process list in top, perf, etc. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2493f57ce268d6fe7e4c40a87852c347fd60d29e)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	4357aebdb9	traverse: Remove unused start_time field Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit dc834d5e78c3fb97ae15cddf1139b3c4a4051a7c)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	bf3dd9488e	traverse: Send records directly from traverse child to srcnode Currently CTDB daemon reads records from a child process and then sends them to srcnode via TRAVERSE_DATA control. This ties up main CTDB daemon and also requires an extra copy of the record in the CTDB daemon. Instead send records directly from traverse child process. The control from child process still goes via local CTDB daemon as there is no infrastructure currently to open a TCP socket to the srcnode. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 1a74192aa7d51ed99553e7292860027f06b6ef37)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	557b92fc88	traverse: Pass reqid and srcnode information to local database traverse So that traverse child process can directly send the TRAVERSE_DATA control to the srcnode without first sending it to local node. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit faabce1b99fb3de9ff03bf54d303e7656538fee3)	2013-07-10 14:33:19 +10:00
Amitay Isaacs	d46c24f4d0	ctdbd: No need for DeadlockTimeout tunable The code for deadlock detection and killing smbd process causing deadlock has been removed and replaced with external debug script. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2211cd94bea266547d3e6f167d3160a6b23bec88)	2013-07-10 14:33:18 +10:00
Amitay Isaacs	c620457c0b	locking: Use external script to debug locking issues Use an external script to parse /proc/locks and log useful debugging information about locks rather than doing that in C code. To use this feature, add configuration variable to /etc/sysconfig/ctdb: CTDB_DEBUG_LOCKS=/etc/ctdb/debug_locks.sh Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2bfb8499366d530f16515b08928056bbda40f781)	2013-07-10 14:33:18 +10:00
Amitay Isaacs	9ae379c91a	locking: Update locking bucket intervals Bucket 0: < 1 ms; 1: < 10 ms; 2: < 100 ms; 3: < 1 s; 4: < 2 s; 5: < 4 s; 6: < 8 s; 7: < 16 s; 8: < 32 s; 9: < 64 s; 10: >= 64 s. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 6fc36a7036933237d09151a0baf4d8ccd2bc2c99)	2013-07-10 14:33:18 +10:00
Amitay Isaacs	1afb7fccb2	locking: Update locks latency in CTDB statistics only for RECORD or DB locks Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit dcc42a75b4638b3aa40c44ed9e0aaae26483e2b0)	2013-07-10 14:33:18 +10:00
Amitay Isaacs	d36aa928fd	ctdbd: Remove incomplete ctdb_db_statistics_wire structure Send the ctdb_db_statistics directly instead of first copying it to duplicate ctdb_db_statistics_wire structure. This simplifies the implementation of the control to get database statistics. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 26a4653df594d351ca0dc1bd5f5b2f5b0eb0a9a5)	2013-07-10 14:33:18 +10:00
Amitay Isaacs	c0798dfb64	ctdbd: Update debug messages for setting readonly property on database Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 545a46437dfb2b755bb2fddb11dea8c4ccce3ed7)	2013-07-10 14:32:52 +10:00
Amitay Isaacs	bcb64aa55f	recoverd: Fix buffer overflow error in reloadips Signed-off-by: Amitay Isaacs <amitay@gmail.com> Pair-Programmed-With: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 41182623891d74a7e9e9c453183411a161201e67)	2013-07-05 15:52:34 +10:00
Martin Schwenke	dcdae86dc7	ctdbd: Log something when releasing all IPs At the moment this is silent and it can be confusing to see IPs just disappear. Also, this message: Been in recovery mode for too long. Dropping all IPS can cause anxiety when all IPs should already have been dropped. Adding a comforting message saying that 0 IPs were dropped relieves such anxiety. :-) Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 4d0f26b306fc465d551d340b0e7dce4412eae3fd)	2013-07-05 15:52:33 +10:00
Martin Schwenke	0108e8ff10	recoverd: Minor style improvements for ctdb_reload_remote_public_ips() * Add a variable to the loop to make the code more readable and have it generally fit into 80 columns. * Improve comments. * Improve log messages. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 0a292fa8939a1343e44cadaa8ed9f3c0f18ca82f)	2013-07-05 15:52:33 +10:00
Martin Schwenke	7290798a41	recoverd: Clean up log messages in remote IP verification The log messages in verify_remote_ip_allocation() are confusing because they don't include the PNN of the problem node, because it is not known in this function. Add the PNN of the node being verified as a function argument and then shuffle the log messages around to make them clearer. Also fold 3 nested if statements into just one. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit f0942fa01cd422133fc9398f56b4855397d7bc86)	2013-07-05 15:52:33 +10:00
Martin Schwenke	15115becef	recoverd: Fix an unclear log message - "Restart recovery process" When the recovery master notices a node in recovery mode it starts the recovery process, it doesn't restart it. Update documentation to match. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 298c4d2c3b4ea3d900c91f5a0a5aca2952a13d61)	2013-07-05 15:52:33 +10:00
Martin Schwenke	bfe0b93652	recoverd: Fix an incorrect comment Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 9f6cd8b0bea619991c9f3bf35188c5950dabf8f4)	2013-07-05 15:52:33 +10:00
Martin Schwenke	9c8cc863f7	ctdbd: Use ctdb_die() on "setup" event failure This is slightly easier to read because it all fits on 1 line. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 035bf3eecf99337c84d4ad16cdbf297b1fa037db)	2013-07-05 15:52:33 +10:00
Martin Schwenke	c327c91490	ctdbd: Avoid a core dump when "init" event fails The "init" event only really fails in the scripts, which should log something useful on failure. Therefore, a core dump isn't terribly useful and sometimes attracts unwanted attention. Signed-off-by: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit 3af2d833b63af9931792106db71797f3692669a8)	2013-07-05 15:52:33 +10:00
Martin Schwenke	26b161156a	ctdbd: Release IP callback should fail if the IP is still hosted At the moment there are (at least) 2 bugs that cause rogue IPs: * A race where release_ip_callback() runs after a "subsequent" take IP has completed. The IP is back on an interface but we unset vnn->iface in the callback. * A "releaseip" eventscript times out. We ignore the timeout and call it success, deleting the VNN even if the IP is still hosted. We could decide not to ignore the timeout and ban the node, but killing TCP connections can take a long time and that might result in a lot of banning. We probably won't reinstate banning on "releaseip" until killing TCP connections has been optimised. In both cases, a rogue IP can be avoided by leaving vnn->iface set and simply failing the control. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit c5797f2942e83da24df548ea07196fbbac0eab20)	2013-07-05 15:52:32 +10:00
Martin Schwenke	793233f6b6	ctdbd: Log warnings in release IP when unexpected interface is encountered Previous code changes work around potential problems but do not provide useful information when a problem occurs. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit f1f1b0c24b9b6cd24b83a4e4da16e179287ec6ac)	2013-07-05 15:52:32 +10:00
Amitay Isaacs	6391f61fbc	build: Fix compiler warnings for uninitialized variables Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 5408c5c4050539e5aa06a5e82ceb63a6cb5cef0c)	2013-07-04 20:43:52 +10:00
Amitay Isaacs	f032c60cd5	recoverd: Send the result from child process only once The result has been sent before the child keeps waiting for parent ctdbd process. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 9aa13bcedd83d463c871e3cf1f3a65da3cd83992)	2013-07-04 20:43:52 +10:00
Amitay Isaacs	c944a589ca	ctdbd: Don't ban self if init or shutdown event fails There is no point in banning the node if init or shutdown event times out since it's going to quit anyway. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ef1c4e99ca66e7a990bc557f34abb624c315e6ba)	2013-07-02 12:59:09 +10:00
Michael Adam	3c65197b7a	recoverd: when the recmaster is banned, use that information when forcing an election When we trigger an election because the recmaster considers itself inactive, update our local nodemap with the recmaster's flags before calling force_election(). This way, we don't send the inactive node freeze commands (e.g.) that may fail and then lead to ourselves getting banned. The theory is that this should help avoiding banning loops. Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 932360992b08a5483d90c0590218ba0fd756119e)	2013-07-02 12:59:09 +10:00
Michael Adam	082da536cb	recoverd: fix a comment typo Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit 741944f118e98f178b860194eecb215180949d18)	2013-07-02 12:59:09 +10:00
Michael Adam	159b9a2989	recoverd: fix a comment in main_loop Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit ac06c46e4a80c635f6094b5ac6f0bf3e3a02db95)	2013-07-02 12:59:09 +10:00
Michael Adam	26365f2a5f	recoverd: eliminate some trailing spaces from ctdb_election_win() Signed-off-by: Michael Adam <obnox@samba.org> (This used to be ctdb commit df30c0a05ed908fc2a997c56ff5484736b23b70f)	2013-07-02 12:59:09 +10:00
Martin Schwenke	aa79a656a7	recoverd: Don't continue if the current node gets banned Can not continue with recovery or monitoring cluster. Signed-off-by: Martin Schwenke <martin@meltin.net> Pair-programmed-with: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 14399de1dd0bd8dabf1f48b1457e3ccb37589d8a)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	b29b6ae39e	recoverd: Refactor code to ban misbehaving nodes Since we have nodemap information, there is no need to hardcode the limit of 20. Signed-off-by: Amitay Isaacs <amitay@gmail.com> Pair-Programmed-With: Martin Schwenke <martin@meltin.net> (This used to be ctdb commit aea12dce83ef385e9fb3bc03ac7ace0874a0e3fe)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	c22de8d1c0	recoverd: Move code to ban other nodes after we get local node flags If a node gets banned first, then it should not ban other nodes. This code was moved up in main_loop to avoid waiting for nodemap from other nodes (commit 83b0261f2cb453195b86f547d360400103a8b795). To prevent a banned node from banning other nodes, we need to first get nodemap information from local node, so trying to ban other nodes can fail if we are already banned. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit ae1693905036ecdbc4594fde1f12500faae4a554)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	32f9d7c0d4	recoverd: Delay the initial election if node is started in stopped state Since there is an early exit if a node is stopped or banned, we can wait till the node becomes active to start initial election. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 593a17678fbd3109e118154b034d43b852659518)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	d2411e74f1	recoverd: Update capabilities only if the current node is active Since we do an early return if a node is stopped or banned, move update capabilities code below the early return and just before we check the capabilities of current recovery master. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 93bcb6617e1024f810533e12390a572f51703ca0)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	73e6cc765d	recoverd: No need to check if node is recovery master when inactive If a node is stopped or banned, it will cause early return from the main_loop, so this check is redundant. The election will be called by an active node. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 815ddd3341b7e9db39e05a3a3fcd9a1420f053bc)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	870409ed1c	recoverd: Always do an early exit from main_loop if node is stopped or banned A stopped or banned node cannot do anything useful. So do not participate in any cluster activity and do not cause any unnecessary network traffic. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 2396981c4bcf30530aeb7f4395093cc202105b50)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	7b761c4b97	recoverd: Do not set banning credits on a node if current node is inactive If the current node is banned or stopped, then it should not assign banning credits to other nodes since the current node will not have up-to-date flags of other nodes. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 38304f88e0c634e97d4687c25adef975f71537b8)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	5deebd3b75	banning: Do not come out of ban if databases are not frozen Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit a60f228f8380f222f838eb619d2ab55f96f11ac2)	2013-07-02 12:59:09 +10:00
Amitay Isaacs	9a944d71dc	banning: No need to check if banned pnn is for local node If the banned pnn is not the local node, the function returns early. So no need for additional check. Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 297d93cecc3c0655e72ecac38508e113bdbeab9c)	2013-07-02 12:59:08 +10:00
Amitay Isaacs	c6914e3891	banning: Make ctdb_local_node_got_banned() a void function When this function is called, we are already committed to banning and there is no point in failing this function. In case freezing of databases fails, it will be fixed from the recovery daemon. (This used to be ctdb commit bb178338658b4ae32382a1f62f7c21cee1d4878f)	2013-07-02 12:59:08 +10:00
Amitay Isaacs	cf1d4bfde3	recoverd: Also check if current node is in recovery when it is banned Signed-off-by: Amitay Isaacs <amitay@gmail.com> (This used to be ctdb commit 6a9dbb8fb0f1f6e8c206189cdc2d33bb371ea2a8)	2013-07-02 12:59:08 +10:00

1386 Commits