This can cause ctdbd to spin at 100% in the event system,
creating a timed event that will immediately trigger again
and again.
On uniprocessors this causes the eventscript we are actually waiting for to
become CPU-starved and never complete.
(This used to be ctdb commit 92c8408fba957a8ded13f7e285da290502735234)
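A minimal sketch of the problematic pattern described above, using plain tevent rather than the ctdb sources; the handler and counter are illustrative only. A timed event that re-arms itself with a zero offset fires again on every pass through the event loop, so the loop never sleeps and other pending work is starved:

#include <stdio.h>
#include <talloc.h>
#include <tevent.h>

/* Illustrative only: a timed event that re-arms itself for "now".
 * Every pass through the event loop fires it again immediately, so the
 * process spins at 100% CPU and starves other pending work. */
static void spin_handler(struct tevent_context *ev, struct tevent_timer *te,
                         struct timeval now, void *private_data)
{
        unsigned int *count = (unsigned int *)private_data;

        (*count)++;
        /* zero offset: trigger again on the very next loop iteration */
        tevent_add_timer(ev, ev, tevent_timeval_current_ofs(0, 0),
                         spin_handler, count);
}

int main(void)
{
        TALLOC_CTX *mem_ctx = talloc_new(NULL);
        struct tevent_context *ev = tevent_context_init(mem_ctx);
        unsigned int count = 0;

        tevent_add_timer(ev, ev, tevent_timeval_current_ofs(0, 0),
                         spin_handler, &count);

        /* a handful of iterations is enough to see it fire every time */
        for (int i = 0; i < 5; i++) {
                tevent_loop_once(ev);
        }
        printf("timer fired %u times in 5 loop iterations\n", count);

        talloc_free(mem_ctx);
        return 0;
}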
This means we can distinguish which child is logging, especially via syslog where we have no pid.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 68b3761a0874429b90731741f0531f76dcfbb081)
In Samba this is now called "tevent", and while we use the backwards
compatibility wrappers they don't offer EVENT_FD_AUTOCLOSE: that is now
a separate tevent_fd_set_auto_close() function.
This is based on Samba version 7f29f817fa.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 85e5e760cc91eb3157d3a88996ce474491646726)
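For reference, a hedged sketch (not ctdb code) of how the replacement call is used with plain tevent; the handler and wrapper names are invented:

#include <stdint.h>
#include <tevent.h>

static void fd_readable(struct tevent_context *ev, struct tevent_fd *fde,
                        uint16_t flags, void *private_data)
{
        /* ... read from the fd here ... */
}

/* Illustrative wrapper: plain tevent has no AUTOCLOSE flag for
 * tevent_add_fd(); the separate tevent_fd_set_auto_close() call makes
 * the fd be close()d when the tevent_fd event is freed. */
static struct tevent_fd *watch_fd(struct tevent_context *ev, int fd)
{
        struct tevent_fd *fde = tevent_add_fd(ev, ev, fd, TEVENT_FD_READ,
                                              fd_readable, NULL);
        if (fde == NULL) {
                return NULL;
        }
        tevent_fd_set_auto_close(fde);  /* replaces EVENT_FD_AUTOCLOSE */
        return fde;
}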
The extra recovery interval wait was introduced in 821333afb458 but no
explanation was provided in that message. Nonetheless, if starting
the entire cluster for the first time, it should be safe to skip this.
We use the command-line argument --sloppy-start, which should discourage
people from using it outside testing.
Seconds between ctdbd first log message and node healthy:
BEFORE: 16.10
AFTER: 4.03
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 509e2e89ae233a0e91998d95267bf62f296a73cd)
Seconds between ctdbd first log message and node healthy:
BEFORE: 17.08
AFTER: 16.10
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 372201d418f041d69646793105f6898ab12a7d91)
Once we've done a startup, we need to run a monitor event successfully
to be marked as healthy. Rather than wait the usual 5 seconds, run it
immediately (which will then reset next_interval to 5 seconds).
Seconds between ctdbd first log message and node healthy:
BEFORE: 23.58
AFTER: 18.09
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit c8651494febcb1c9e558b2002e2a72c2bf547c06)
This is needed because the "startup" event runs after the initial recovery,
but we need to do some actions before the initial recovery.
metze
(This used to be ctdb commit e953808449c102258abb6cba6f4abf486dda3b82)
Depending on --max-persistent-check-errors we allow ctdb
to start with unhealthy persistent databases.
The default is 0, which means a startup with unhealthy dbs is rejected.
The health of the persistent databases is checked after each
recovery. Node monitoring and the "startup" event are deferred
until all persistent databases are healthy.
Databases can become healthy automatically by a completely
HEALTHY node joining the cluster, or by an administrator
using "ctdb backupdb/restoredb" or "ctdb wipedb".
metze
(This used to be ctdb commit 15f133d5150ed1badb4fef7d644f10cd08a25cb5)
since we no longer ban nodes when dodgy scripts continue to hang.
We now only mark nodes as unhealthy if monitor events fail or time out; we never ban.
(This used to be ctdb commit 5c8e56fc7a518e115bceac257867739283cf6a1e)
There is no rational need for a setting where we permanently mark nodes as disabled every time an eventscript fails.
(This used to be ctdb commit 68a8ee99b128a5ec883600735626bdb3bbc9c503)
If we've timed out, but we've not timed out more than
ctdb->tunable.script_ban_count times, we pretend we haven't.
There's a logic bug in the way this is done: if we were unhealthy before,
this would set us to "healthy" again (status == 0). I don't think this
would happen in real life, but it's a little surprising.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e6488c0e05bab5c4c2c0a6370930b0b27e5ed56e)
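A simplified, hypothetical sketch of the check being described; the function and parameter names are invented, not taken from the ctdb source:

#include <errno.h>

/* Hypothetical sketch: forgive a monitor-event timeout until
 * script_ban_count consecutive timeouts have occurred, by rewriting the
 * status to 0.  The surprise noted above: status == 0 also reads as
 * "healthy", even if the node was unhealthy before. */
static int maybe_forgive_timeout(int status, unsigned int timeouts,
                                 unsigned int script_ban_count)
{
        if (status == -ETIME && timeouts <= script_ban_count) {
                return 0;       /* pretend we have not timed out */
        }
        return status;
}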
Currently the timeout handler in eventscript.c does the banning if a
timeout happens. However, because monitor events are different, it has
to special-case them.
As we call the callback anyway in this case, we should make that handle
-ETIME as it sees fit: for everyone but the monitor event, we simply ban
ourselves. The more complicated monitor event banning logic is now in
ctdb_monitor.c where it belongs.
Note: I wrapped the other bans in "if (status == -ETIME)", though they
should probably ban themselves on any error. This change should be a
no-op.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9ecee127e19a9e7cae114a66f3514ee7a75276c5)
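As a rough illustration of the resulting shape (names invented, not the ctdb implementation): each event type's callback decides what -ETIME means, the generic case simply bans the node, and the monitor event installs its own callback with the more involved policy:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct node_state {
        bool banned;
};

static void ban_node(struct node_state *node)
{
        node->banned = true;
        fprintf(stderr, "banning node after event script timeout\n");
}

/* generic callback: everyone except the monitor event just bans itself */
static void generic_script_callback(struct node_state *node, int status)
{
        if (status == -ETIME) {
                ban_node(node);
        }
}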
and until we have gone through a full re-recovery timeout without triggering
any pending recoveries before we start up the services and start monitoring
the node.
(This used to be ctdb commit 821333afb458358f90446062b0242790695e5060)
Rather than doing strcmp everywhere, pass an explicit enum around. This
also subtly documents what options are available. The "options" arg
is now used for extra arguments only.
Unfortunately, gcc complains on empty format strings, so we make
ctdb_event_script() take no varargs, and add ctdb_event_script_args(). We
leave ctdb_event_script_callback() taking varargs, which means callers
have to do "%s", "".
For the moment, we have CTDB_EVENT_UNKNOWN for handling forced scripts
from the ctdb tool.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8001488be4f2beb25e943fe01b2afc2e8779930d)
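The empty-format-string issue can be illustrated outside ctdb: with a printf-style format attribute, gcc's -Wformat-zero-length warns about calls that pass "" as the whole format, so callers with nothing extra to say pass "%s", "" instead (the names below are invented):

#include <stdarg.h>
#include <stdio.h>

static void notify(const char *fmt, ...)
        __attribute__((format(printf, 1, 2)));

static void notify(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
}

int main(void)
{
        notify("extra detail: %d\n", 42);  /* normal use */
        notify("%s", "");                  /* "no extra args" without the warning */
        return 0;
}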
Everyone uses the same timeout value, so just remove it from the API.
If we ever need variable timeouts, that might as well be central too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 533c3e053293941d2a9484b495e78d45f478bb08)
The problem was this:
When the monitor event fails, node->flags gets updated,
and an update (containing the old and new flags) is sent to
the recovery master.
If the recovery master sends the update to itself (the same process),
it was comparing the node->flags variable with the received new flags.
This check always found both flag values to be equal
and never set the rec->need_takeover_run variable to true.
There were two problems: first, the push_flags_handler() function
didn't pass the received old flags,
and second, the ctdb_control_modflags() function ignored the received old flags.
metze
(This used to be ctdb commit 8ec633b64a05a2d903c2b9639909f15f6375548f)
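A hedged, simplified illustration of the comparison bug described above; the names are invented and this is not the ctdb implementation. When the recovery master sends the update to itself, the local flags have already been updated, so comparing them with the received new flags always matches; comparing the received old flags with the received new flags detects the change:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flag_update {
        uint32_t old_flags;
        uint32_t new_flags;
};

static bool need_takeover_run_buggy(uint32_t node_flags,
                                    const struct flag_update *u)
{
        /* node_flags was already set to u->new_flags: never differs */
        return node_flags != u->new_flags;
}

static bool need_takeover_run_fixed(const struct flag_update *u)
{
        return u->old_flags != u->new_flags;
}

int main(void)
{
        struct flag_update u = { .old_flags = 0, .new_flags = 1 };
        uint32_t node_flags = u.new_flags;      /* already updated locally */

        printf("buggy: %d, fixed: %d\n",
               need_takeover_run_buggy(node_flags, &u),
               need_takeover_run_fixed(&u));
        return 0;
}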
The waiters reference the lockwait handle in order to remove themselves from the
linked list, which caused a SEGV.
We don't actually need to remove ourselves from this list here, since
if the parent freeze_handle holding the list is freed, then all waiters are
released as well. The only place we actually need to unlink the waiter is in
ctdb_freeze_lock_handler, where we want to respond back to the clients and release
the waiters but still keep the freeze_handle hanging around.
(This used to be ctdb commit e01ab46bafad09a5e320d420734db129d35863bc)
The databases can become frozen a while before we do the actual recovery
since we have the re-recovery timeout.
There is no point in doing much monitoring if we are waiting for a recovery,
or if we are banned.
This will eliminate some annoying log entries where certain tests will fail if the databases are locked.
(This used to be ctdb commit ff824676fab94168707aada7423ae766bc0f711c)
master to perform an explicit ip reallocation.
This is more reliable and faster than having the recovery daemon track these
changes, and since we now have an explicit method to ask the recovery daemon
to perform an explicit ip reallocation, we should use this.
(This used to be ctdb commit 3807681e74f4bfe92befdae6ed616ff5f1a99880)
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.
(This used to be ctdb commit ce534a83a05dbd40238e4eee0669d60ff396f935)
race between the ctdb tool and the recovery daemon both trying at once
to push flag changes across the cluster.
(This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)
just disable the monitoring during the "startrecovery" event and enable it again once recovery has completed
(This used to be ctdb commit 68029894f80804c9f31fc90ed0c1b58f75812c3d)
nodes into two separate files.
move the monitoring of keepalives for detecting connected/disconnected
remote nodes into ctdb_keepalive.c
(This used to be ctdb commit 23a57b20c314d5f11a433cf251eb9d9de743849a)
flag.
change calling of the recovered/takeip/releaseip event scripts to use
these enable/disable functions instead of stopping/starting monitoring.
when we disable monitoring we want all events to still be running,
in particular the events to monitor for dead nodes; we only want to
suppress running the monitor event scripts
(This used to be ctdb commit a006dcc4f75aba950dd701ad7d1a84e89df285e8)
Check whether monitoring is enabled or not before creating new events
and log why the event is not set up otherwise
(This used to be ctdb commit 2f352b2606c04a65ce461fc2e99e6d6251ac4f20)
control, instead call ctdb_start/stop_monitoring()
ctdb_stop_monitoring(): don't allocate a new monitoring context, leave it
NULL. Also set the monitoring_mode in this function so that
ctdb_stop/start_monitoring() and ->monitoring_mode are kept in sync.
Add a debug message to log that we have stopped monitoring.
ctdb_start_monitoring(): check whether monitoring is already active and
make the function idempotent.
Create the monitoring context when monitoring is started.
Update ->monitoring_mode once the monitoring has been started.
Add a debug message to log that we have started monitoring.
When we temporarily stop monitoring while running an event script,
restart monitoring after the event script wrapper returns instead of in
the event script callback.
Let monitoring_mode start out as DISABLED and let it be enabled once we call ctdb_start_monitoring().
Don't check for MONITORING_DISABLED in check_fore_dead_nodes(). If
monitoring is disabled, this event handler will not be called.
(This used to be ctdb commit 3a93ae8bdcffb1adbd6243844f3058fc742f76aa)
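A hedged sketch of the start/stop pattern this describes, with invented names rather than the real ctdb functions; the context pointer merely stands in for the monitoring context:

#include <stdio.h>
#include <stdlib.h>

enum monitor_mode { MONITORING_DISABLED, MONITORING_ENABLED };

struct monitor_state {
        enum monitor_mode mode;
        void *monitor_context;          /* stands in for a talloc/tevent ctx */
};

/* stop: free the context, leave it NULL, keep the mode flag in sync */
static void stop_monitoring(struct monitor_state *st)
{
        free(st->monitor_context);
        st->monitor_context = NULL;     /* do not allocate a new context */
        st->mode = MONITORING_DISABLED;
        fprintf(stderr, "monitoring stopped\n");
}

/* start: idempotent, only creates the context when not already active */
static void start_monitoring(struct monitor_state *st)
{
        if (st->mode == MONITORING_ENABLED) {
                return;                 /* already active */
        }
        st->monitor_context = malloc(1);
        st->mode = MONITORING_ENABLED;
        fprintf(stderr, "monitoring started\n");
}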
of the startup event scripts after the point where recovery has
started and the node is in normal operation.
This makes the 'startup' script just a special type of the 'monitor'
script which is called first.
(This used to be ctdb commit 7424c30a5fd04aea0137c466b4318c3f185280d8)