print a full "pstree -p" to the log.
Example:
|-ctdbd(29826)-+-ctdbd(29862)
| `-ctdbd(31897)-+-00.ctdb(31898)---sleep(31908)
change the default timeout to 60 seconds for eventscripts
(This used to be ctdb commit a3406c10d70f89d332eab25d481083142dff987d)
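For reference, this timeout is an ordinary ctdb tunable, so it can be inspected and changed at runtime with the standard ctdb getvar/setvar commands. The tunable name EventScriptTimeout is assumed here:

    ctdb getvar EventScriptTimeout
    ctdb setvar EventScriptTimeout 60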
Remove the explicit vacuum/repack commands from the 00.ctdb eventscript
and implement this in the ctdb daemon.
Combine vacuuming and repacking into one cheap read traverse that enumerates all candidate records, and one write traverse that both repacks the database and deletes records locally where we are the lmaster and the records have already been deleted remotely.
This code also adds initial autotuning heuristics for the vacuum intervals and for how many records to delete in each iteration.
Minor stylistic changes made by Ronnie S.
(This used to be ctdb commit 95a3ee551241aa164967991fe5efe078e1714bde)
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Wolfgang Mueller-Friedt <wolfmuel@de.ibm.com>
(This used to be ctdb commit 30cdad97706a9e9bb210120699aa939f6b16e8ca)
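A minimal, self-contained sketch of the two-pass idea described above, using illustrative types rather than ctdb's real tdb API: one cheap read pass enumerates vacuum candidates, then a single write pass repacks the live records and drops the candidates where we are lmaster and the remote copies are already gone.

    /* Hypothetical sketch of the combined vacuum+repack scheme.
     * All types and names are illustrative, not ctdb's real code. */
    #include <stdbool.h>
    #include <stdio.h>

    struct record {
        const char *key;
        bool we_are_lmaster;   /* this node is location master for the key */
        bool deleted_remotely; /* remote copies already deleted */
        bool empty;            /* tombstone: candidate for vacuuming */
    };

    /* Pass 1: cheap read-only traverse, just enumerate candidates. */
    static size_t collect_candidates(const struct record *db, size_t n,
                                     size_t *cand, size_t max_cand)
    {
        size_t found = 0;
        for (size_t i = 0; i < n && found < max_cand; i++) {
            if (db[i].empty && db[i].we_are_lmaster && db[i].deleted_remotely)
                cand[found++] = i;
        }
        return found;
    }

    /* Pass 2: one write traverse that repacks the database (copies live
     * records into fresh storage) and skips the vacuum candidates, which
     * deletes them locally as a side effect of the rewrite. */
    static size_t repack_and_vacuum(const struct record *db, size_t n,
                                    const size_t *cand, size_t ncand,
                                    struct record *out)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++) {
            bool drop = false;
            for (size_t j = 0; j < ncand; j++)
                if (cand[j] == i) { drop = true; break; }
            if (!drop)
                out[kept++] = db[i];
        }
        return kept;
    }

    int main(void)
    {
        struct record db[] = {
            { "a", true,  true,  true  },   /* vacuumable */
            { "b", false, true,  true  },   /* not lmaster: keep */
            { "c", true,  false, false },   /* live: keep */
        };
        size_t cand[8];
        struct record out[8];
        size_t nc = collect_candidates(db, 3, cand, 8);
        size_t kept = repack_and_vacuum(db, 3, cand, nc, out);
        printf("candidates=%zu kept=%zu\n", nc, kept);
        return 0;
    }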
Log this in "ctdb statistics".
Also add a variable "RecLockLatencyMs" that will log an error every time it takes longer than this to access the reclock file.
(This used to be ctdb commit 042377ed803bb8f7ca9d6ea1a387427b7b8ba45a)
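A minimal sketch of the latency check, assuming a monotonic clock and an illustrative 1000 ms threshold standing in for the RecLockLatencyMs tunable:

    /* Time an operation and complain when it exceeds a millisecond
     * threshold. Names and the 1000 ms default are assumptions. */
    #include <stdio.h>
    #include <time.h>

    static double elapsed_ms(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000.0 +
               (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void)
    {
        const double rec_lock_latency_ms = 1000.0; /* tunable threshold */
        struct timespec t0, t1;
        struct timespec delay = { 1, 500000000 };  /* 1.5 s stand-in for
                                                      accessing the reclock
                                                      file */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        nanosleep(&delay, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = elapsed_ms(t0, t1);
        if (ms > rec_lock_latency_ms)
            fprintf(stderr, "ERROR: reclock access took %.0f ms "
                            "(threshold %.0f ms)\n", ms, rec_lock_latency_ms);
        return 0;
    }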
This now defaults to 60 seconds.
This is useful if a split brain occurs due to network partitioning, since it will make sure that the "other half" of the cluster, the one that does not contain the recovery master, will eventually release all IPs, thus avoiding a duplicate-IP situation for the public addresses.
(This used to be ctdb commit 70f21428c9eec96bcc787be191e7478ad68956dc)
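A minimal sketch of the behaviour described above, assuming a 60 second limit: if a node has been stuck in recovery for longer than the limit, it releases all public addresses so the partition without the recovery master cannot keep serving duplicate IPs.

    /* Illustrative code; names are assumptions, not ctdb's real API. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static bool must_drop_all_ips(time_t entered_recovery, time_t now,
                                  int limit_seconds)
    {
        return now - entered_recovery > limit_seconds;
    }

    int main(void)
    {
        time_t now = time(NULL);
        if (must_drop_all_ips(now - 61, now, 60))
            printf("releasing all public addresses\n");
        return 0;
    }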
Rename the variable to SeqnumInterval because:
1, it is an interval and not a 1/interval unit
2, we then catch when people use the old variable name, and can update the sysconfig file instead of silently changing the semantics of this variable
This is a really dodgy variable.
(This used to be ctdb commit 68eac459e5d2b6b534f72821036675ffe5d7a350)
log the type of operation and the database name for all latencies higher
than a threshold
(This used to be ctdb commit 1d581dcd507e8e13d7ae085ff4d6a9f3e2aaeba5)
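A minimal sketch of the logging described above; the function name, message format, and the example operation/database names are illustrative assumptions:

    #include <stdio.h>

    /* Log only when a latency crosses the threshold, tagging the
     * operation type and the database name. */
    static void log_latency(const char *op, const char *db,
                            double latency_s, double threshold_s)
    {
        if (latency_s > threshold_s)
            fprintf(stderr, "High latency %.6fs for operation %s on db %s\n",
                    latency_s, op, db);
    }

    int main(void)
    {
        log_latency("lockwait", "locking.tdb", 0.25, 0.1); /* logged */
        log_latency("fetch",    "brlock.tdb",  0.01, 0.1); /* quiet  */
        return 0;
    }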
Monitor the recovery daemon correctly by measuring how long it has been since the last successful communication with the recovery daemon was recorded.
After a certain timeout the ctdb daemon would deem the recovery daemon inoperable and shut down.
If the system clock is suddenly changed forward by many (60 or more) seconds, this could cause the timeout to trigger prematurely/immediately, where ctdb would incorrectly think that more than 60 seconds had passed since the last successful communication and thus abort.
Instead of checking for one timeout occurring, only deem the recovery daemon to be "down" and trigger a shutdown if communications have timed out for three intervals in a row.
(This used to be ctdb commit 196968c552e6ebcb57389d769a4b25f42fa8bc5d)
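A minimal sketch of the three-strikes rule, with illustrative names: a single missed interval (for example, one caused by a clock jump) no longer shuts the daemon down; only three consecutive misses do.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_MISSED 3

    struct recd_monitor {
        int missed;   /* consecutive timed-out intervals */
    };

    /* Called once per monitoring interval. Returns true if the main
     * daemon should shut down. */
    static bool recd_check(struct recd_monitor *m, bool ping_ok)
    {
        if (ping_ok) {
            m->missed = 0;          /* any success resets the counter */
            return false;
        }
        m->missed++;
        fprintf(stderr, "recovery daemon ping timed out (%d/%d)\n",
                m->missed, MAX_MISSED);
        return m->missed >= MAX_MISSED;
    }

    int main(void)
    {
        struct recd_monitor m = { 0 };
        bool pings[] = { false, false, true, false, false, false };
        for (int i = 0; i < 6; i++)
            if (recd_check(&m, pings[i]))
                fprintf(stderr, "shutting down main daemon\n");
        return 0;
    }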
We currently only monitor that the daemons are running, by kill(pid, 0) and by verifying that the domain socket between them is OK.
This is not sufficient, since we can have a situation where the recovery daemon is hung.
This new code monitors that the recovery daemon is operating; if the recovery daemon hangs, we log this and shut down the main daemon.
(This used to be ctdb commit cd69d292292eaab3aac0e9d9fc57cb621597c63c)
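For contrast, a minimal sketch of the old-style existence check: kill() with signal 0 only proves that the pid exists, not that the daemon is making progress, which is why the operational monitoring above is needed. The helper name is illustrative.

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static int pid_is_running(pid_t pid)
    {
        if (kill(pid, 0) == 0)
            return 1;               /* process exists (but may be hung!) */
        return errno != ESRCH;      /* EPERM still means it exists */
    }

    int main(void)
    {
        printf("self alive: %d\n", pid_is_running(getpid()));
        return 0;
    }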
If the event script that timed out was for the "monitor" event, then even though it timed out we still return SUCCESS back to the caller invoking the eventscript.
Only consider the eventscript for "monitor" to have failed if it actually terminated with an error, or if it timed out 5 times in a row and hung.
(This used to be ctdb commit 60f3c04bd8b20ecbe937ffed08875cdc6898b422)
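A minimal sketch of this policy, with illustrative names and a single global counter: a "monitor" timeout is tolerated and reported as success until it has happened 5 times in a row, while any other event fails on its first timeout.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define MONITOR_TIMEOUT_LIMIT 5

    static int monitor_timeouts; /* consecutive "monitor" timeouts */

    /* Returns 0 for success, -1 for failure. */
    static int eventscript_status(const char *event, bool timed_out,
                                  int exit_code)
    {
        if (!timed_out) {
            if (strcmp(event, "monitor") == 0)
                monitor_timeouts = 0;
            return exit_code == 0 ? 0 : -1;
        }
        if (strcmp(event, "monitor") != 0)
            return -1;                         /* non-monitor: hard fail */
        if (++monitor_timeouts >= MONITOR_TIMEOUT_LIMIT)
            return -1;                         /* hung: now report failure */
        return 0;                              /* tolerated timeout */
    }

    int main(void)
    {
        for (int i = 1; i <= 6; i++)
            printf("monitor timeout %d -> %d\n", i,
                   eventscript_status("monitor", true, 0));
        return 0;
    }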
Add a ctdb command to pull the talloc memory map from a recovery daemon:
ctdb rddumpmemory
(This used to be ctdb commit d23950be7406cf288f48b660c0f57a9b8d7bdd05)
When this tunable is set, IP addresses will only be failed over when a node fails, and only those IP addresses held by the failed node will be reallocated in the cluster.
When a node becomes active again, this will not lead to any failback of IP addresses.
This can reduce the number of "IP address movements" in the cluster, since we don't automatically fail an IP address back, but it can also lead to an unbalanced cluster, since we no longer attempt to spread the IP addresses out evenly across the active nodes.
This tunable can NOT be active at the same time as DeterministicIPs is used.
(This used to be ctdb commit d3b8a461b15bc584fa1785eb5922de6d49d8f6c4)
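A minimal sketch of the no-failback behaviour, with illustrative types; the real daemon's placement logic is more involved, but the two entry points show the asymmetry: failures move addresses, recoveries do not.

    #include <stdio.h>

    #define NUM_IPS 4

    struct ipalloc {
        int holder[NUM_IPS];  /* node currently hosting each address */
    };

    /* A node died: reassign only its addresses to a healthy node.
     * (Real placement would spread them; this is the minimal idea.) */
    static void node_failed(struct ipalloc *a, int dead, int healthy)
    {
        for (int i = 0; i < NUM_IPS; i++)
            if (a->holder[i] == dead)
                a->holder[i] = healthy;
    }

    /* A node came back: deliberately do nothing, i.e. no failback. */
    static void node_returned(struct ipalloc *a, int node)
    {
        (void)a;
        (void)node;
    }

    int main(void)
    {
        struct ipalloc a = { .holder = { 0, 0, 1, 1 } };
        node_failed(&a, 1, 0);      /* node 1 dies: its IPs move to node 0 */
        node_returned(&a, 1);       /* node 1 returns: nothing moves back */
        for (int i = 0; i < NUM_IPS; i++)
            printf("ip%d -> node %d\n", i, a.holder[i]);
        return 0;
    }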
Once every such interval:
* the recovery master on each node will update the "connected" count in the reclock count file (ctdb getreclock)
* if the node thinks it is a recovery master but it detects another node that is DISCONNECTED and which still holds a lock to the reclock count file, this may mean that we have a split cluster.
If that other node, the one that is DISCONNECTED but still holds the lock on the reclock pnn count file, is MORE connected than the local node, yield the recmaster role and let the other half of the cluster take over.
This adds a second, last-chance mechanism to detect split clusters.
If the cluster is split but GPFS is not yet split, this mechanism makes the largest half of the cluster become the active half.
(This used to be ctdb commit 07af425f444531942cce8abff112c1524228d287)
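A minimal sketch of the yield decision, with illustrative names; the "connected" counts come from the reclock count file as described above.

    #include <stdbool.h>
    #include <stdio.h>

    static bool should_yield_recmaster(bool other_disconnected,
                                       bool other_holds_reclock,
                                       int other_connected,
                                       int local_connected)
    {
        return other_disconnected &&
               other_holds_reclock &&
               other_connected > local_connected;
    }

    int main(void)
    {
        /* Our half sees 2 connected nodes, the other half sees 3: yield. */
        if (should_yield_recmaster(true, true, 3, 2))
            printf("yielding recmaster to the larger half\n");
        return 0;
    }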
Add a new tunable, DeterministicIPs, that makes the allocation of public addresses to nodes deterministic.
Activate it by adding CTDB_SET_DeterministicIPs=1 in /etc/sysconfig/ctdb
When this is set, the first entry in /etc/ctdb/public_addresses will
always be hosted by node 0 when that node is available, the second
entry by node 1, and so on.
This tunable allows the allocation of addresses to become very
unbalanced and is only for debugging/testing use.
Beware, this feature requires that /etc/ctdb/public_addresses are
identical on all the nodes in the cluster.
(This used to be ctdb commit f0ca221f235731542090d8a6c86f2b7cd2ce2f96)
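A minimal sketch of the deterministic mapping; the modulo wrap-around and the next-available fallback for a down node are assumptions for illustration, and they show why the result can become unbalanced.

    #include <stdbool.h>
    #include <stdio.h>

    /* Entry i in public_addresses prefers node i (modulo node count),
     * falling back to the next available node. */
    static int deterministic_node(int addr_index, const bool *available,
                                  int num_nodes)
    {
        for (int off = 0; off < num_nodes; off++) {
            int node = (addr_index + off) % num_nodes;
            if (available[node])
                return node;
        }
        return -1; /* no node available */
    }

    int main(void)
    {
        bool up[3] = { true, false, true };  /* node 1 is down */
        for (int i = 0; i < 4; i++)
            printf("address %d -> node %d\n", i,
                   deterministic_node(i, up, 3));
        return 0;
    }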
There is an array for each node/public address that contains TCP tickles.
We send a TCP_ADD as a broadcast to all nodes when a client is added.
If TCP tickles are removed, they are only removed immediately from the local node.
Once every 20 seconds a node will push/broadcast out the tickle list for all public addresses it manages. This will remove any deleted tickles from the remote nodes.
(This used to be ctdb commit e3c432a915222e1392d91835bc7a73a96ab61ac9)
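A minimal sketch of why the periodic push propagates deletions, with illustrative types: the receiver replaces its stored tickle list wholesale rather than merging, so anything the sender removed locally disappears remotely on the next push.

    #include <stdio.h>

    #define MAX_TICKLES 8

    struct tickle_list {
        int count;
        const char *conn[MAX_TICKLES]; /* "srcip:port dstip:port" strings */
    };

    /* Receiver side of the broadcast: replace, don't merge. */
    static void receive_tickle_update(struct tickle_list *stored,
                                      const struct tickle_list *pushed)
    {
        *stored = *pushed;
    }

    int main(void)
    {
        struct tickle_list remote = { 2, { "10.0.0.5:445", "10.0.0.6:445" } };
        struct tickle_list pushed = { 1, { "10.0.0.5:445" } }; /* .6 deleted */
        receive_tickle_update(&remote, &pushed);
        printf("remote now has %d tickle(s): %s\n",
               remote.count, remote.conn[0]);
        return 0;
    }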
- added DatabaseHashSize tunable
- added logging of events inside recovery (for timing)
(This used to be ctdb commit 3593cdb928b91e217faf1b3c537fa28dc82cdace)
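A hedged usage note: assuming DatabaseHashSize is exposed like the other tunables, it can be read and set with the standard commands; the value below is only an example.

    ctdb getvar DatabaseHashSize
    ctdb setvar DatabaseHashSize 100001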