samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2024-12-25 23:21:54 +03:00

Author	SHA1	Message	Date
Ronnie Sahlberg	94a56ea410	rewrite the handling of flag updates across the cluster to eliminate a race between the ctdb tool and the recovery daemon both at once trying to push flag changes across the cluster. (This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)	2008-11-20 12:43:18 +11:00
Ronnie Sahlberg	beed899c4f	null out the pointer before we reload the nodes file (This used to be ctdb commit 4b0f32047e8bece0a052bdbe2209afe91b7e8ce3)	2008-10-17 21:38:42 +11:00
Ronnie Sahlberg	a924ef78b6	when we reload the nodes file, we may need to reload the nodes file inside the recovery daemon as well. (This used to be ctdb commit 82fd2b6b5cd8e988c38fa6b74121a048757bdeef)	2008-10-17 21:18:06 +11:00
Ronnie Sahlberg	ad56356005	fix a slow memory leak in the recovery daemon in the error paths for the memdump function (This used to be ctdb commit 5e641ef9d6cca286061138a9680dcf2495736e8b)	2008-09-16 09:00:48 +10:00
Ronnie Sahlberg	7b718fffd7	fix some slow memory leaks in the vacuuming handler in the recovery daemon (This used to be ctdb commit 95bf36559d62f29e6f538f3a173b504ef3258341)	2008-09-16 07:55:57 +10:00
Ronnie Sahlberg	ab3649155a	From Volker L Fix a slow memory leak in the recovery daemon if there is a recovery triggered during the public ip reassignment process (This used to be ctdb commit 0aca4daf908b76d6013ff3dfad41beb9114fc1a3)	2008-09-16 06:50:28 +10:00
Ronnie Sahlberg	3bedb7f6d1	lower the debug level for when printing that the nodeflags have changed (This used to be ctdb commit a89977f8cb2463a87147dcc0ad936cb5d4131670)	2008-09-09 13:55:31 +10:00
Ronnie Sahlberg	6474f3278d	additional monitoring between the two daemons. we currently only monitor that the daemons are running by kill(0, pid) and verifying the domain socket between them is ok. this is not sufficient since we can have a situation where the recovery daemon is hung. this new code monitors that the recovery daemon is operating. if the recovery hangs, we log this and shut down the main daemon (This used to be ctdb commit cd69d292292eaab3aac0e9d9fc57cb621597c63c)	2008-09-09 13:44:46 +10:00
Ronnie Sahlberg	ef997d344f	initial ipv6 patch Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com> (This used to be ctdb commit 1f131f21386f428bbbbb29098d56c2f64596583b)	2008-08-19 14:58:29 +10:00
Andrew Tridgell	76528cfc6b	fixed a memory leak in the recovery daemon thanks to vl for spotting this (This used to be ctdb commit 96df98d9f86ecc6bb1a458eb2101e5c1bc0f96e6)	2008-08-11 23:33:05 +10:00
Ronnie Sahlberg	b9d8bb23af	remove the reclock file we store pnn counts in. This file creates additional locking stress on the backend filesystem and we may not need it anyway. (This used to be ctdb commit 84236e03e40bcf46fa634d106903277c149a734f)	2008-08-06 11:52:26 +10:00
Andrew Tridgell	cf739ac892	renamed the pulldb structure to a ctdb_marshall_buffer (This used to be ctdb commit bad53b2d342bb9760497e6f4a61e64ca50d6e771)	2008-07-30 19:59:18 +10:00
Andrew Tridgell	abe0232818	rename the structure we use for marshalling multiple records (This used to be ctdb commit 4d205476d286570a6e1f52b59af42858ce051106)	2008-07-30 14:24:56 +10:00
Ronnie Sahlberg	1bfcca524d	From Michael Adams, change one element from private to private_data Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com> (This used to be ctdb commit 0de79352c9b36c118e36905f08ebbe38ecbb957e)	2008-07-22 09:07:42 +10:00
Ronnie Sahlberg	d0707c98c0	if a new node enters the cluster, that node will already be frozen at start but the rest of the nodes are not frozen. at this stage an election is called by the new node. Since in this case the nodes are not frozen, we can not modify the recmaster of the nodes so it is expected that this control would fail. Add a boolean to send_election_request() to make it not try to set the recmaster locally for the case where we are in an election phase while not frozen. (This used to be ctdb commit c5035657606283d2e35bea40992505e84ca8e7be)	2008-07-18 12:07:25 +10:00
Ronnie Sahlberg	6d5f96c249	lower a debug statement (This used to be ctdb commit 3d58f9b524a40c7b43a2a855212db090e9becefa)	2008-07-18 10:41:18 +10:00
Ronnie Sahlberg	334db8ccba	proper waitpid() fix. remove all waitpid() calls and use the event system to trap sigchld (This used to be ctdb commit 77458b2b6b51b2970c12b0e5b097088d3fb9d358)	2008-07-09 14:02:54 +10:00
Ronnie Sahlberg	522830dea8	Revert "waitpid() can block if it takes a long time before the child terminates" This reverts commit bfba5c7249eff8a10a43b53c1b89dd44b625fd10. revert the waitpid changes. we need to waitpid for some children so should refactor the approach completely (This used to be ctdb commit 702ced6c2fe569c01fe96c60d0f35a7e61506a96)	2008-07-08 17:41:31 +10:00
Ronnie Sahlberg	79425ddec5	Revert "set sigchild to SIG_IGN instead of SIG_DFL" This reverts commit b1f1e80d3ad50280a300f2ed021513cf0a6f3a76. (This used to be ctdb commit 2030e9ff2ca044181b72c3b87d513bf27057b5a2)	2008-07-08 17:40:53 +10:00
Ronnie Sahlberg	71d2315eee	set sigchild to SIG_IGN instead of SIG_DFL (This used to be ctdb commit b1f1e80d3ad50280a300f2ed021513cf0a6f3a76)	2008-07-08 16:31:23 +10:00
Ronnie Sahlberg	d67de4a7d2	waitpid() can block if it takes a long time before the child terminates so we should not call it from the main daemon. 1, set SIGCHLD to SIG_DFL to make sure we ignore this signal 2, get rid of all waitpid() calls 3, change reporting of event script status code from _exit()/waitpid() to write()/read() one byte across the pipe. (This used to be ctdb commit bfba5c7249eff8a10a43b53c1b89dd44b625fd10)	2008-07-08 03:48:11 +10:00
Ronnie Sahlberg	f25fd04f73	in the destructor for the lock-wait child, make sure that we cancel any pending transactions. (This used to be ctdb commit 45b6ff64f6ddf037b810c4e5f8b9f04d71067b98)	2008-07-07 08:50:12 +10:00
Andrew Tridgell	50cd520c6a	don't use mmap in tdb if --nosetsched is set. That makes valgrind happier (it doesn't like the mmap/msync calls in tdb) (This used to be ctdb commit f3a729998ce67f5d2e3b2ad41d96e8f04c0d18d8)	2008-07-04 17:32:21 +10:00
Ronnie Sahlberg	64c4639ce9	we dont need to explicitly thaw the databases from the recovery daemon since this is already done implicitly when we changed recovery mode back to normal (This used to be ctdb commit af1f6cf7561fe9cb5c97f940d4458c83bdd8e2a0)	2008-07-03 12:46:09 +10:00
Ronnie Sahlberg	ef769e7237	track both when we last started and ended a recovery. make ctdb uptime print how long the recovery took in the recovery daemon when we check that the public ip address allocation on the local node is correct (we have the ips we should have and we dont have any we shouldnt have) use ctdb uptime and check the recovery start/stop times and make sure we dont check for ip allocation inconsistencies during a recovery where the ip address allocation is in flux. (This used to be ctdb commit f86551580349b7f662f9a07e4eb0c1189e38e429)	2008-07-02 13:55:59 +10:00
Ronnie Sahlberg	1ccc4a8e2b	test (This used to be ctdb commit 4f2d722cf29175c3c207e6ebb6d4f9e370767249)	2008-06-26 14:14:37 +10:00
Ronnie Sahlberg	c5de452dca	reduce loglevel of the info message we are updating the flags on all nodes (This used to be ctdb commit 9a98a21979558dcd6421b3fcb97d21ab82b792d8)	2008-06-26 13:15:41 +10:00
Ronnie Sahlberg	c5e7e0b2fd	force an update of the flags from the recmaster after each monitoring run (This used to be ctdb commit 251aeadc8b16a9c27a4bae78c97ad6e93e6cfdf4)	2008-06-26 13:08:37 +10:00
Ronnie Sahlberg	97f8bf16c5	verify that the recmaster has the correct flags for us and if not tell the recmaster what the flags should be (This used to be ctdb commit 3387597926ad71e4140cc504b828486d99a3ec8e)	2008-06-26 11:08:09 +10:00
Ronnie Sahlberg	e6d1d766c5	make it possible to re-start a recovery without marking the current node as the culprit. (This used to be ctdb commit 3a69fad0b1dee4a482461680c556358409e53c4d)	2008-06-13 11:47:42 +10:00
Ronnie Sahlberg	4b6b094860	add a callback for failed nodes to the async control helper. this callback is called for every node where the control failed (or timed out) when we issue the start recovery control from recovery master, set any node that fails as a culprit so it will eventually be banned (This used to be ctdb commit 72f89bac13cbe8c3ca3e7a942469cd2ff25abba2)	2008-06-12 16:53:36 +10:00
Ronnie Sahlberg	1c88f422d5	add a parameter for the tdb-flags to the client function ctdb_attach() so that we can pass TDB_NOSYNC when we attach to a persistent database and want fast unsafe writes instead of slow but safe tdb_transaction writes. enhance the ctdb_persistent test suite to test both safe and unsafe writes (This used to be ctdb commit 4948574f5a290434f3edd0c052cf13f3645deec4)	2008-06-04 10:46:20 +10:00
Ronnie Sahlberg	37b681627e	dont check whether the "recovered" event was successful or not since this event wont run unless the recovery mode is normal but we can not know what the recovery mode will be in the future on a remote node so since we issue these commands that will execute in the future at some other node it is pointless to try to check if it worked or not in particular if "failure to successfully run the eventscript" would then trigger a full new recovery which is disruptive and expensive. (This used to be ctdb commit 2c292039a0139dcf5bb2bd964eb6f8902d094c50)	2008-05-15 15:01:01 +10:00
Ronnie Sahlberg	f2661ec859	remove some unnecessary tests if ->vnn is null or not (This used to be ctdb commit f0169ac8166a19d65ce254496e21d095aed87c2f)	2008-05-15 13:28:19 +10:00
Ronnie Sahlberg	09cc3ccff5	Update some debug statements. Dont say that recovery failed if the failed function was invoked from outside of recovery (This used to be ctdb commit 3038d0b74895b51af4f85f2f304508ed16d245f4)	2008-05-15 12:28:52 +10:00
Andrew Tridgell	e465110f95	Fix the chicken and egg problem with ctdb/samba and a registry smb.conf This attempts to fix the problem of ctdb event scripts blocking due to attempted access to the ctdb databases during recovery. The changes are: - now only the 'shutdown' and 'startrecovery' events can be called with the databases locked in recovery. The event scripts must ensure that for these two events no database access is attempted - the recovered, takeip and releaseip events could previously be called inside a recovery. The code now ensures that this doesn't happen, delaying the events till after recovery has finished - the 50.samba event script now avoids using testparm unless it is really needed This needs extensive testing. (This used to be ctdb commit e3cdb8f2be6a44ec877efcd75c7297edb008a80b)	2008-05-14 20:57:04 +10:00
Ronnie Sahlberg	adf40341a7	ctdb->methods becomes NULL when we shutdown the transport. If we shutdown the transport and CTDB later decides to send a command out for queueing, the call to ctdb->methods->allocate_pkt() will SEGV. This could trigger for example when we are in the process of shutting down CTDBD and have already shutdown the transport but we are still waiting for the "shutdown" eventscripts to finish. If the event scripts now take much much longer to execute for some reason, this race condition becomes much more probable. Decorate all dereferencing of ctdb->methods-> with a check that ctdb->methods is non-NULL (This used to be ctdb commit c4c2c53918da6fb566d6e9cbd6b02e61ae2921e7)	2008-05-11 14:28:33 +10:00
Ronnie Sahlberg	f196afd58b	fix a bug where the public ip addresses of the cluster would not be redistributed across the cluster after a recovery was performed. Remove a bogus check inside the recovery daemon that ONLY redistributed public addresses IFF the local node had/served public addresses. This was a valid optimization long ago when we enforced that all nodes must use the same public addresses file but is invalid today where we can have different public addresses configs on all nodes and even have some nodes that do NOT use public addresses at all. (This used to be ctdb commit 5833e6b99d9afaf35dc8354df8676b9115418b23)	2008-05-09 13:41:31 +10:00
Andrew Tridgell	abe6d816bb	fixed realloc bug Should always use type safe talloc functions when possible. In this case we were allocating bytes instead of uint32_t (This used to be ctdb commit cb14ee57dd0a589242da1ac2830bb7939df460a5)	2008-05-08 19:59:24 +10:00
Ronnie Sahlberg	92b61cd7d5	Expand the client async framework so that it can take a callback function. This allows us to use the async framework also for controls that return outdata. Add a "capabilities" field to the ctdb_node structure. This field is only initialized and kept valid inside the recovery daemon context and not inside the main ctdb daemon. change the GET_CAPABILITIES control to return the capabilities in outdata instead of in the res return variable. When performing a recovery inside the recovery daemon, read the capabilities from all connected nodes and update the ctdb->nodes list of nodes. when building the new vnnmap after the database rebuild in recovery, do not include any nodes which lack the LMASTER capability in the new vnnmap. Unless there are no available connected node that sports the LMASTER capability in which case we let the local node (recmaster) take on the lmaster role temporarily (i.e. become a member of the vnnmap list) (This used to be ctdb commit 0f1883c69c689b28b0c04148774840b2c4081df6)	2008-05-06 15:42:59 +10:00
Ronnie Sahlberg	2c23959616	make sure we lose all elections for recmaster role if we do not have the recmaster capability. (unless there are no other node at all available with this capability) (This used to be ctdb commit 8556e9dc897c6b9b9be0b52f391effb1f72fbd80)	2008-05-06 13:56:56 +10:00
Ronnie Sahlberg	6863c8f573	close and reopen the reclock pnn file at regular intervals. handle failure to get/hold the reclock pnn file better and just treat it as a transient backend filesystem error and try again later instead of shutting down the recovery daemon when we have lost the pnn file and if we are recmaster release the recmaster role so that someone else can become recmaster instead (This used to be ctdb commit e513277fb09b951427be8351d04c877e0a15359d)	2008-05-06 13:27:17 +10:00
Ronnie Sahlberg	80f85dc390	Monitor that the recovery daemon is still running from the main ctdb daemon and if it has terminated, then we shut down the main daemon as well (This used to be ctdb commit 7e587acaf8006254e89ff9b4bf48454821c85863)	2008-05-06 11:19:17 +10:00
Ronnie Sahlberg	073f4a7cb4	when a node disagrees with us re who is recmaster make it mark that node as a culprit so it eventually gets banned (This used to be ctdb commit eff3f326f8ce6070c9f3c430cd14d1b71a8db220)	2008-04-22 00:56:27 +10:00
Ronnie Sahlberg	27a7f854f5	add improvements to tracking memory usage in ctdbd and the recovery daemon and a ctdb command to pull the talloc memory map from a recovery daemon ctdb rddumpmemory (This used to be ctdb commit d23950be7406cf288f48b660c0f57a9b8d7bdd05)	2008-04-01 15:34:54 +11:00
Ronnie Sahlberg	57d29f1011	add a num_connected field to the rec structure that holds the number of connected nodes num_active only contains the number of active nodes and would thus not count banned nodes (This used to be ctdb commit 06d3ce470766ef0b60d68ccd84de5437146cc147)	2008-03-03 10:24:17 +11:00
Ronnie Sahlberg	f6f7f54bd6	add a new tunable : reclockpingperiod once every such interval : * the recovery master on each node will update the "connected" count in the reclock count file (ctdb getreclock) * if the node thinks it is a recovery master but it detects another node that is DISCONNECTED but which still holds a lock to the reclock count file this may mean that we have a split cluster. if that other node that is DISCONNECTED but still holds the lock on the reclock pnn count file, is MORE connected than the local node, yield the recmaster role and let the other half of the cluster take over this adds a second, last chance mechanism to detect split clusters. IF the cluster is split but GPFS is not yet split, this mechanism makes the largest half of the cluster become the active half. (This used to be ctdb commit 07af425f444531942cce8abff112c1524228d287)	2008-03-03 09:19:30 +11:00
Ronnie Sahlberg	cadd95263f	change recmaster from being a local variable in monitor_cluster() to be a member of the ctdb_recoverd structure (This used to be ctdb commit b7f955338f50c92374b4f559268fb3a1a516aefa)	2008-03-03 07:53:46 +11:00
Ronnie Sahlberg	814570f904	update the reclock pnn count for how many nodes are connected to the current node once every 60 seconds (This used to be ctdb commit bf1863cc9e2539b2c3e53c664b493b459ebfcc8b)	2008-02-29 13:14:47 +11:00
Ronnie Sahlberg	efa29c6c98	store the num_active variable (number of connected/active nodes) inside the rec structure and avoid passing this as an extra parameter to do_recovery() (This used to be ctdb commit 8bb229aa3b4bd41e48d4e4e2e148d8680c8ba436)	2008-02-29 12:55:20 +11:00

135 Commits