This can cause ctdbd to spin at 100% in the event system,
creating a timed event that will immediately trigger again
and again.
On uniprocessors this causes the eventscript we are actually waiting for to
become CPU-starved and never complete.
(This used to be ctdb commit 92c8408fba957a8ded13f7e285da290502735234)
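A minimal sketch of the problematic pattern described above, using plain tevent rather than the ctdb sources; the handler and counter are illustrative only. A timed event that re-arms itself with a zero offset fires again on every pass through the event loop, so the loop never sleeps and other pending work is starved:

#include <stdio.h>
#include <talloc.h>
#include <tevent.h>

/* Illustrative only: a timed event that re-arms itself for "now".
 * Every pass through the event loop fires it again immediately, so the
 * process spins at 100% CPU and starves other pending work. */
static void spin_handler(struct tevent_context *ev, struct tevent_timer *te,
                         struct timeval now, void *private_data)
{
        unsigned int *count = (unsigned int *)private_data;

        (*count)++;
        /* zero offset: trigger again on the very next loop iteration */
        tevent_add_timer(ev, ev, tevent_timeval_current_ofs(0, 0),
                         spin_handler, count);
}

int main(void)
{
        TALLOC_CTX *mem_ctx = talloc_new(NULL);
        struct tevent_context *ev = tevent_context_init(mem_ctx);
        unsigned int count = 0;

        tevent_add_timer(ev, ev, tevent_timeval_current_ofs(0, 0),
                         spin_handler, &count);

        /* a handful of iterations is enough to see it fire every time */
        for (int i = 0; i < 5; i++) {
                tevent_loop_once(ev);
        }
        printf("timer fired %u times in 5 loop iterations\n", count);

        talloc_free(mem_ctx);
        return 0;
}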
This means we can distinguish which child is logging, especially via syslog where we have no pid.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 68b3761a0874429b90731741f0531f76dcfbb081)
In Samba this is now called "tevent", and while we use the backwards
compatibility wrappers they don't offer EVENT_FD_AUTOCLOSE: that is now
a separate tevent_fd_set_auto_close() function.
This is based on Samba version 7f29f817fa.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 85e5e760cc91eb3157d3a88996ce474491646726)
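For reference, a hedged sketch (not ctdb code) of how the replacement call is used with plain tevent; the handler and wrapper names are invented:

#include <stdint.h>
#include <tevent.h>

static void fd_readable(struct tevent_context *ev, struct tevent_fd *fde,
                        uint16_t flags, void *private_data)
{
        /* ... read from the fd here ... */
}

/* Illustrative wrapper: plain tevent has no AUTOCLOSE flag for
 * tevent_add_fd(); the separate tevent_fd_set_auto_close() call makes
 * the fd be close()d when the tevent_fd event is freed. */
static struct tevent_fd *watch_fd(struct tevent_context *ev, int fd)
{
        struct tevent_fd *fde = tevent_add_fd(ev, ev, fd, TEVENT_FD_READ,
                                              fd_readable, NULL);
        if (fde == NULL) {
                return NULL;
        }
        tevent_fd_set_auto_close(fde);  /* replaces EVENT_FD_AUTOCLOSE */
        return fde;
}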
The extra recovery interval wait was introduced in 821333afb458 but no
explanation was provided in that message. Nonetheless, if starting
the entire cluster for the first time, it should be safe to skip this.
We use the command-line argument --sloppy-start, which should discourage
people from using it outside testing.
Seconds between ctdbd first log message and node healthy:
BEFORE: 16.10
AFTER: 4.03
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 509e2e89ae233a0e91998d95267bf62f296a73cd)
Seconds between ctdbd first log message and node healthy:
BEFORE: 17.08
AFTER: 16.10
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 372201d418f041d69646793105f6898ab12a7d91)
Once we've done a startup, we need to run a monitor event successfully
to be marked as healthy. Rather than wait the usual 5 seconds, run it
immediately (which will then reset next_interval to 5 seconds).
Seconds between ctdbd first log message and node healthy:
BEFORE: 23.58
AFTER: 18.09
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit c8651494febcb1c9e558b2002e2a72c2bf547c06)
This is needed because the "startup" event runs after the initial recovery,
but we need to do some actions before the initial recovery.
metze
(This used to be ctdb commit e953808449c102258abb6cba6f4abf486dda3b82)
Depending on --max-persistent-check-errors we allow ctdb
to start with unhealthy persistent databases.
The default is 0, which means a startup with unhealthy dbs is rejected.
The health of the persistent databases is checked after each
recovery. Node monitoring and the "startup" event are deferred
until all persistent databases are healthy.
Databases can become healthy automatically by a completely
HEALTHY node joining the cluster, or by an administrator
using "ctdb backupdb/restoredb" or "ctdb wipedb".
metze
(This used to be ctdb commit 15f133d5150ed1badb4fef7d644f10cd08a25cb5)
since we no longer ban nodes when dodgy scripts continue to hang.
We now only mark nodes as unhealthy if monitor events fail or time out; we never ban.
(This used to be ctdb commit 5c8e56fc7a518e115bceac257867739283cf6a1e)
There is no rational need for a setting where we permanently mark nodes as disabled every time an eventscript fails.
(This used to be ctdb commit 68a8ee99b128a5ec883600735626bdb3bbc9c503)
If we've timed out, but we've not timed out more than
ctdb->tunable.script_ban_count times, we pretend we haven't.
There's a logic bug in the way this is done: if we were unhealthy before,
this would set us to "healthy" again (status == 0). I don't think this
would happen in real life, but it's a little surprising.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e6488c0e05bab5c4c2c0a6370930b0b27e5ed56e)
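A simplified, hypothetical sketch of the check being described; the function and parameter names are invented, not taken from the ctdb source:

#include <errno.h>

/* Hypothetical sketch: forgive a monitor-event timeout until
 * script_ban_count consecutive timeouts have occurred, by rewriting the
 * status to 0.  The surprise noted above: status == 0 also reads as
 * "healthy", even if the node was unhealthy before. */
static int maybe_forgive_timeout(int status, unsigned int timeouts,
                                 unsigned int script_ban_count)
{
        if (status == -ETIME && timeouts <= script_ban_count) {
                return 0;       /* pretend we have not timed out */
        }
        return status;
}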
Currently the timeout handler in eventscript.c does the banning if a
timeout happens. However, because monitor events are different, it has
to special-case them.
As we call the callback anyway in this case, we should make that handle
-ETIME as it sees fit: for everyone but the monitor event, we simply ban
ourselves. The more complicated monitor event banning logic is now in
ctdb_monitor.c where it belongs.
Note: I wrapped the other bans in "if (status == -ETIME)", though they
should probably ban themselves on any error. This change should be a
no-op.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9ecee127e19a9e7cae114a66f3514ee7a75276c5)
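As a rough illustration of the resulting shape (names invented, not the ctdb implementation): each event type's callback decides what -ETIME means, the generic case simply bans the node, and the monitor event installs its own callback with the more involved policy:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct node_state {
        bool banned;
};

static void ban_node(struct node_state *node)
{
        node->banned = true;
        fprintf(stderr, "banning node after event script timeout\n");
}

/* generic callback: everyone except the monitor event just bans itself */
static void generic_script_callback(struct node_state *node, int status)
{
        if (status == -ETIME) {
                ban_node(node);
        }
}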
and until we have gone through a full re-recovery timeout without triggering
any pending recoveries before we start up the services and start monitoring
the node.
(This used to be ctdb commit 821333afb458358f90446062b0242790695e5060)
Rather than doing strcmp everywhere, pass an explicit enum around. This
also subtly documents what options are available. The "options" arg
is now used for extra arguments only.
Unfortunately, gcc complains on empty format strings, so we make
ctdb_event_script() take no varargs, and add ctdb_event_script_args(). We
leave ctdb_event_script_callback() taking varargs, which means callers
have to do "%s", "".
For the moment, we have CTDB_EVENT_UNKNOWN for handling forced scripts
from the ctdb tool.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8001488be4f2beb25e943fe01b2afc2e8779930d)
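The empty-format-string issue can be illustrated outside ctdb: with a printf-style format attribute, gcc's -Wformat-zero-length warns about calls that pass "" as the whole format, so callers with nothing extra to say pass "%s", "" instead (the names below are invented):

#include <stdarg.h>
#include <stdio.h>

static void notify(const char *fmt, ...)
        __attribute__((format(printf, 1, 2)));

static void notify(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
}

int main(void)
{
        notify("extra detail: %d\n", 42);  /* normal use */
        notify("%s", "");                  /* "no extra args" without the warning */
        return 0;
}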
Everyone uses the same timeout value, so just remove it from the API.
If we ever need variable timeouts, that might as well be central too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 533c3e053293941d2a9484b495e78d45f478bb08)
The problem was this:
When the monitor event fails, node->flags gets updated,
and an update (containing the old and new flags) is sent to
the recovery master.
If the recovery master sends the update to itself (the same process),
it was comparing the node->flags variable with the received new flags.
This check always found both flag values to be equal
and never set the rec->need_takeover_run variable to true.
There were two problems: first, the push_flags_handler() function
didn't pass the received old flags,
and second, the ctdb_control_modflags() function ignored the received old flags.
metze
(This used to be ctdb commit 8ec633b64a05a2d903c2b9639909f15f6375548f)
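A hedged, simplified illustration of the comparison bug described above; the names are invented and this is not the ctdb implementation. When the recovery master sends the update to itself, the local flags have already been updated, so comparing them with the received new flags always matches; comparing the received old flags with the received new flags detects the change:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flag_update {
        uint32_t old_flags;
        uint32_t new_flags;
};

static bool need_takeover_run_buggy(uint32_t node_flags,
                                    const struct flag_update *u)
{
        /* node_flags was already set to u->new_flags: never differs */
        return node_flags != u->new_flags;
}

static bool need_takeover_run_fixed(const struct flag_update *u)
{
        return u->old_flags != u->new_flags;
}

int main(void)
{
        struct flag_update u = { .old_flags = 0, .new_flags = 1 };
        uint32_t node_flags = u.new_flags;      /* already updated locally */

        printf("buggy: %d, fixed: %d\n",
               need_takeover_run_buggy(node_flags, &u),
               need_takeover_run_fixed(&u));
        return 0;
}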
The waiters reference the lockwait handle in order to remove themselves from the
linked list, which caused a SEGV.
We don't actually need to remove ourselves from this list here, since
if the parent freeze_handle holding the list is freed, then all waiters are
released as well. The only place we actually need to unlink the waiter is in
ctdb_freeze_lock_handler, where we want to respond back to the clients and release
the waiters but still keep the freeze_handle hanging around.
(This used to be ctdb commit e01ab46bafad09a5e320d420734db129d35863bc)
The databases can become frozen a while before we do the actual recovery
since we have the re-recovery timeout.
There is no point in doing much monitoring if we are waiting for a recovery,
or if we are banned.
This will eliminate some annoying log entries where certain tests will fail if the databases are locked.
(This used to be ctdb commit ff824676fab94168707aada7423ae766bc0f711c)
master to perform an explicit ip reallocation.
This is more reliable and faster than having the recovery daemon track these
changes, and since we now have an explicit method to ask the recovery daemon
to perform an explicit ip reallocation, we should use this.
(This used to be ctdb commit 3807681e74f4bfe92befdae6ed616ff5f1a99880)
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.
(This used to be ctdb commit ce534a83a05dbd40238e4eee0669d60ff396f935)
race between the ctdb tool and the recovery daemon both trying at once
to push flag changes across the cluster.
(This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)
just disable the monitoring during the "startrecovery" event and enable it again once recovery has completed
(This used to be ctdb commit 68029894f80804c9f31fc90ed0c1b58f75812c3d)
nodes into two separate files.
move the monitoring of keepalives for detecting connected/disconnected
remote nodes into ctdb_keepalive.c
(This used to be ctdb commit 23a57b20c314d5f11a433cf251eb9d9de743849a)
flag.
change calling of the recovered/takeip/releaseip event scripts to use
these enable/disable functions instead of stopping/starting monitoring.
when we disable monitoring we want all events to still be running,
in particular the events to monitor for dead nodes; we only want to
suppress running the monitor event scripts
(This used to be ctdb commit a006dcc4f75aba950dd701ad7d1a84e89df285e8)
Check whether monitoring is enabled or not before creating new events
and log why the event is not set up otherwise
(This used to be ctdb commit 2f352b2606c04a65ce461fc2e99e6d6251ac4f20)
control, instead call ctdb_start/stop_monitoring()
ctdb_stop_monitoring(): don't allocate a new monitoring context, leave it
NULL. Also set the monitoring_mode in this function so that
ctdb_stop/start_monitoring() and ->monitoring_mode are kept in sync.
Add a debug message to log that we have stopped monitoring.
ctdb_start_monitoring(): check whether monitoring is already active and
make the function idempotent.
Create the monitoring context when monitoring is started.
Update ->monitoring_mode once the monitoring has been started.
Add a debug message to log that we have started monitoring.
When we temporarily stop monitoring while running an event script,
restart monitoring after the event script wrapper returns instead of in
the event script callback.
Let monitoring_mode start out as DISABLED and let it be enabled once we call ctdb_start_monitoring().
Don't check for MONITORING_DISABLED in check_fore_dead_nodes(). If
monitoring is disabled, this event handler will not be called.
(This used to be ctdb commit 3a93ae8bdcffb1adbd6243844f3058fc742f76aa)
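A hedged sketch of the start/stop pattern this describes, with invented names rather than the real ctdb functions; the context pointer merely stands in for the monitoring context:

#include <stdio.h>
#include <stdlib.h>

enum monitor_mode { MONITORING_DISABLED, MONITORING_ENABLED };

struct monitor_state {
        enum monitor_mode mode;
        void *monitor_context;          /* stands in for a talloc/tevent ctx */
};

/* stop: free the context, leave it NULL, keep the mode flag in sync */
static void stop_monitoring(struct monitor_state *st)
{
        free(st->monitor_context);
        st->monitor_context = NULL;     /* do not allocate a new context */
        st->mode = MONITORING_DISABLED;
        fprintf(stderr, "monitoring stopped\n");
}

/* start: idempotent, only creates the context when not already active */
static void start_monitoring(struct monitor_state *st)
{
        if (st->mode == MONITORING_ENABLED) {
                return;                 /* already active */
        }
        st->monitor_context = malloc(1);
        st->mode = MONITORING_ENABLED;
        fprintf(stderr, "monitoring started\n");
}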
of the startup event scripts after the point where recovery has
started and the node is in normal operation.
This makes the 'startup' script just a special type of the 'monitor'
script which is called first.
(This used to be ctdb commit 7424c30a5fd04aea0137c466b4318c3f185280d8)