"statd-callout notify" currently complains until an add-client or
del-client is done.
Given that we might use ctdb.tdb for something else in the future it
makes sense to attach to it in the "startup" event. This could be done
in the background but it should be so lightweight that a timeout will
indicate serious problems.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This feature was added quite a while ago but was not enabled by
default. It is a useful feature so enable it to dump stack traces of
up to 5 stuck processes by default.
This can be disabled by setting:
CTDB_NFS_DUMP_STUCK_THREADS=0
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Feb 25 04:06:45 CET 2014 on sn-devel-104
This comment was true when 50.samba was spaghetti because it tried to
automatically manage both smbd (and nmbd) and winbind. It isn't true
anymore.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Wed Feb 19 04:07:12 CET 2014 on sn-devel-104
* Add stack dumps for "interesting" processes. These sometimes get
stuck, so try to print stack traces for them if they appear in the
pstree output.
* Add new configuration variables CTDB_DEBUG_HUNG_SCRIPT_LOGFILE and
CTDB_DEBUG_HUNG_SCRIPT_STACKPAT. These are primarily for testing
but the latter may be useful for live debugging.
* Load CTDB configuration so that above configuration variables can be
set/changed without restarting ctdbd.
Add a test that tries to ensure that all of this is working.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
If a primary IP address is being deleted from an interface, the
secondaries are remembered and added back after the primary is
deleted. This is done under a lock shared by the add/del script code.
It is necessary because, by default, Linux deletes secondaries when
the corresponding primary is deleted.
There is a race here between ctdbd and the scripts, since ctdbd
doesn't know about the lock. If ctdbd receives a release IP control
and the IP address is not on an interface then it is regarded as a
"Redundant release of IP" so no "releaseip" event is generated. This
can occur if the IP address in question is a secondary that has been
temporarily dropped. It is more likely if the number of secondaries
is large.
Since Linux 2.6.12 (i.e. 2005) Linux has supported a
promote_secondaries option on interfaces. This option is currently
undocumented but that will change in Linux 3.14. With
promote_secondaries enabled the kernel will not drop secondaries but
will promote a corresponding secondary instead. The kernel does all
necessary locking.
Use promote_secondaries to simplify the code, avoid re-adding
secondaries, avoid re-adding routes and provide improved performance.
This could be done conditionally, with a fallback to legacy
secondary-re-adding code, but no supported Linux distribution is
running a pre-2.6.12 kernel so this is unnecessary.
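
A minimal sketch of enabling the option for one interface, assuming the
usual per-interface sysctl file under /proc/sys/net/ipv4/conf/. The
function name and the PROC_ROOT override are hypothetical and exist only
so the sketch can be tried without root; this is not the code from the
patch.

```shell
# Hypothetical helper: enable promote_secondaries for one interface.
# PROC_ROOT is an illustrative override; on a real system it would be
# left at /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

enable_promote_secondaries ()
{
    _iface="$1"
    _f="${PROC_ROOT}/sys/net/ipv4/conf/${_iface}/promote_secondaries"

    # Nothing to do if the kernel does not expose the option
    [ -f "$_f" ] || return 1

    echo 1 >"$_f"
}
```

With the option set, deleting a primary address should leave the
secondaries in place, so no re-adding is needed afterwards.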
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This adds new files for Ganesha's recovery. myreleaseip_* are used by
the recovery thread on the node where the IP is released. The releaseip_*
and takeip_* files are used by the recovery thread where the IP is taken over.
Signed-off-by: Srikrishan Malik <srimalik@in.ibm.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Srikrishan Malik <srimalik@in.ibm.com>
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Thu Jan 30 11:18:19 CET 2014 on sn-devel-104
Services can be flagged for reconfigure when they release IPs at
shutdown. The flag is never removed and the service is prematurely
reconfigured during the first "ipreallocated" event, before any IPs
are hosted and before the "startup" event has actually started the
services.
$CTDB_VARDIR/state directly contained the service state subdirectories
and is already removed in the "init" event. Just push the service
state subdirectories down a level and put everything else in a
subdirectory.
This way all the eventscript state gets cleaned up every time CTDB
starts up.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Jan 17 09:58:26 CET 2014 on sn-devel-104
Currently the lock is held until the corresponding eventscript
completes, since the process still exists. If the regular part of an
eventscript hangs then the lock might unnecessarily be held for a long
time. The pathological case is when a monitor event gets stuck in
D-wait state and the script times out but can't be killed so the lock
is still held. This can cause an unwanted monitor replay.
Change this so that the lock is released immediately after the
reconfiguration is complete.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
"monitor" events can be cancelled. If a reconfigure action does a
service restart then the "monitor" event can be cancelled at the
inconvenient moment after the service is stopped. In this case the
service stays down and the node may become unhealthy (depending on
whether there are any repair actions in the monitor event).
A long time ago we did service reconfiguration in "monitor" events
following failovers. Service reconfiguration was then moved to the
"ipreallocated" event. However, reconfiguration in "monitor" events
has been kept as a last resort in case an "ipreallocated" event does
not occur. The only important case that this covers is "ctdb
deleteip", where "releaseip" events are generated without a
corresponding "ipreallocated". Therefore, IPs can be deleted without
running the required service reconfiguration.
The supported way of removing IP addresses is now via "ctdb
reloadips", which always causes a takeover run with a corresponding
"ipreallocated" event.
This means that service reconfiguration in "monitor" events is no
longer required and should be removed because it is unsafe.
Also update the associated tests. Make the first confirm that the
monitor event no longer does reconfiguration. Change the others to
test that monitor status is correctly replayed when something else is
doing a reconfigure and currently holds the reconfigure lock.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Dec 17 06:32:35 CET 2013 on sn-devel-104
If these configuration variables are not defined, then there should
be a default fallback. This is a workaround until CTDB compile-time
configuration can be accessed at runtime.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
If NFS RPC checks do restart Ganesha, then it's possible that the share
check can fail prematurely.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
If $statd_state is empty then the loop will run once and print
spurious errors.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Michael Adam <obnox@samba.org>
This prevents spamming of logs if multiple lock requests are waiting
and keep timing out.
Also, improve the logging format with separators.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
This is naive and assumes no performance problems when updating
persistent DBs. It also does no error handling.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
That is, don't use fixed paths.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
The background update is never guaranteed to complete before the cache
is used, so don't bother trying it at the beginning. Instead, put a
timeout on a foreground update.
If the foreground update fails:
* If there's no available cache file then die.
* If there is a previous cache file then use it and log a warning.
* Do a background update at the end of the monitor event.
Also remove commas in the "smb ports" list before use, since (newer?)
testparm seems to insert commas into the default value. Update the
associated test to add a comma.
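
The fallback logic above can be sketched roughly as follows. This is an
illustration only: update_cache and normalise_ports are hypothetical
names, the use of timeout(1) is an assumption, and the final background
update is left out.

```shell
# Hypothetical sketch: refresh a cache file in the foreground with a
# timeout, falling back to a stale cache if the refresh fails.
# Usage: update_cache <cache-file> <timeout-seconds> <command> [args...]
update_cache ()
{
    _cache="$1" ; _limit="$2" ; shift 2

    # Write to a temporary file so a failed update never clobbers the cache
    if timeout "$_limit" "$@" >"${_cache}.new" 2>/dev/null ; then
        mv "${_cache}.new" "$_cache"
        return 0
    fi
    rm -f "${_cache}.new"

    if [ -f "$_cache" ] ; then
        echo "WARNING: cache update failed, using previous cache" >&2
        return 0
    fi

    echo "ERROR: cache update failed and no previous cache exists" >&2
    return 1
}

# Commas in a port list can be turned into spaces before use
normalise_ports ()
{
    echo "$1" | tr ',' ' ' | tr -s ' '
}
```

A caller would treat a non-zero return from update_cache as fatal and
carry on otherwise.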
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 8c6f511254ecb0381a609b37e3a0ee6e5ec5d562)
Elsewhere we're moving the socket to /var/run/ctdb. We might end up
with PID files and sockets for other daemons later, so let's call the
directory "ctdb" instead of "ctdbd".
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b63f6fd2d295c8e18cbf3420ab05fce07b727f31)
Use /var/run/ctdb/ctdbd.socket because there might be other daemons
that need sockets in the future.
The local daemons test code to create a link for the default
convenience socket has to be removed because the link can't be created
as a regular user in the new location. This should be OK since all
calls to the ctdb tool in the test code should be wrapped in onnode.
When debugging tests, a developer will have to set CTDB_SOCKET by
hand.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-programmed-with: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit dc67a4e24af9d07aead2a1710eeaf5d6cc409201)
* It should run on "ipreallocated" instead of "recovered"
* Variable name NODE -> ip since that's what it is
* Simplify some logic
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 45e2bc66abf9fcfeadcc279a656ed7fd1838920a)
Routes only need to be updated when IPs have moved. IP takeover runs
will generate "ipreallocated", which is enough. "recovered" always
follows "ipreallocated" anyway, so avoid the redundancy.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1152215fc69217e4292762e28d193b7ea0e06ee3)
Any time a node changes flags in any significant way there will be a
takeover run, which will generate an "ipreallocated" event. The
"recovered" event always happens straight after a takeover run so we
update the NAT gateway twice.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 542c70d6281d636ecd51502fbbf219f418bfac66)
There is no reconfigure code for these scripts so no need to check for
reconfiguration.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 41df1637c1d8a7b2f5a9974408db71b1f74cb2f2)
Nothing ever (or has ever) set the "needs reconfigure" flag, so this
code is unnecessary.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5b77fd95bda5f1960aca952e1b759231890b56f3)
A generic framework is no longer needed now that the "ctdb" checker is
the only one left. Simplify the code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 044d302b41a2040642355401e3236fcecc3a620a)
"ctdb checktcpport" is no longer experimental so the other checkers
are no longer required.
Remove tests related to the removed checkers.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 50e330d0679614bee2e7bab028436e929f74ca50)
The current setting is inconsistent with settings on most systems,
putting /bin before /sbin. Use of /usr/local/bin, which may be
required on some systems, is also overridden. This can make it
difficult to do interactive debugging of script problems.
Rely on the system PATH instead.
If system-specific changes need to be made then this can be done in a
configuration file.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cfbff39e22e42f3997f637290748290833525714)
Reduce the complexity, including the depth of background processes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 49f077c475b078889ff0492fe7d567a64d6cb87c)
Otherwise calls to "ctdb natgwlist" will not behave as expected if a
non-standard file is used, since that command will use the default
file location.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e574b30257126679704b088c4334a8e7a53a9c3f)
The old logic was actually wrong. If CTDB_LOGFILE is unset then a
default is used, not syslog.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 79e2029f9bc078126e865aa715100a3870c7604b)
Allowing people to put random options in CTDB_OPTIONS complicates some
logic (particularly around use of syslog). If we're going to have
variables for options then let's make sure we have a variable for each
option and make people use them.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e55f3a1577eff0182802b0341d865d961aeae1c7)
All CTDB configuration variables should start with CTDB_.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f12658aff125996ae45eea23241d8c3d0567b893)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 8f660d0dd52013e5876806be908e8e603aa6e968)
This uses potentially insecure temporary files and is not referenced
anywhere else.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 4b914d7e217202f3d11a8e95f9f74bc17869475b)
Anecdotal evidence suggests that most nfsd RPC check failures are due
to cluster filesystem or storage problems. Apparently these are rarely
helped by attempting to restart the NFS service because the restart
tends to hang.
Fail after 2 nfsd RPC check failures, instead of waiting for 6
failures. Restart on every 10th failure to try to bring the node back
to good health.
Update unit tests to match.
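
As a rough illustration of the thresholds described above, the decision
could look like the sketch below. The function name and the action
strings are hypothetical, and the assumption that a restart attempt
leaves the node unhealthy is not taken from the patch.

```shell
# Hypothetical decision function: given the number of consecutive nfsd
# RPC check failures, print the action to take.
nfsd_failure_action ()
{
    _count="$1"

    if [ "$_count" -lt 2 ] ; then
        echo "none"
    elif [ $(($_count % 10)) -eq 0 ] ; then
        # Every 10th failure: assumed still unhealthy, but also try a restart
        echo "restart"
    else
        echo "unhealthy"
    fi
}
```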
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e9ef93f7b6dad59eabaa32124df81f3e74c651ef)
It should print the actual number of consecutive failures rather than
the limit.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ff5f0d1e29af2b293e30cdc54bed03a644be7038)
This makes the gaps in the logs more obvious.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 11fbf4789d783dd0bac22754b374dd9ea4b03bad)
Passing "localhost" to the rpcinfo command causes overheads, like
reading /etc/services multiple times.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 1d61988af9e4fa3621a3e2d06a859bcb53df2d67)
Also add it to the corresponding eventscript unit test infrastructure.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f4ef83a256f59eeb00b9a5bc10c28347e1ad1031)
While doing this:
* Explicitly assign RPC program and version information in
_nfs_check_rpc_common(). This is more lines of code but is easier
to read.
* Don't print the options when starting a service. Trying to print them
makes the code messy for little benefit.
Update the eventscript unit testing code and a Ganesha test to
reflect this.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e8b531405665885196c95fe1608db33a255bf761)
They're hard to maintain and provide very little benefit.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1a1be43f8466d46913dcdfe6dcedb94316cd28ad)
That is, /dev/null the "stop" output. This is consistent with the way
CTDB generally deals with the output when stopping a service.
It also makes updating the eventscript unit tests easier.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c7332526b1b488abefeb4be78a7cd3f2f9abc451)
The CTDB daemon is not ready to accept clients in the INIT runstate ("init" event).
It starts accepting connections in the SETUP runstate ("setup" event)
and later.
Also, minor log formatting changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 81d7ce03b28d592a1337639e14d9ea141e20bfff)
On a cluster where the recovery lock file is not being used, asking the CTDB
daemon is unnecessary overhead. And if CTDB is using a recovery lock file, then
changing the configuration without restarting is *stupid*.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 44eb86e6042adb6efe75d2a5528b82a0f21d496d)
This ensures that any invocation of the ctdb tool (within the wrapper)
gets the desired value. This at least ensures that ctdbd will be
started.
If a non-standard value is set for CTDB_SOCKET then command-line users
will still need the variable in their environment.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 37ccc7c6cc43a80aaa92291aea7a438f4225488a)
This avoids issuing multiple "ctdb killtcp" commands to terminate tcp
connections, one per connection. This will considerably reduce the
time when there is a large number of tcp connections. This also makes
it possible to avoid calling "ctdb killtcp" when there are no connections.
Add a couple of unit tests for killtcp and update eventscript unit
test infrastructure to support them.
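
The batching idea can be sketched as below, assuming the kill command
accepts the whole connection list on standard input. The helper name is
hypothetical and the command is passed in as an argument, so nothing
here depends on the exact interface of the ctdb tool.

```shell
# Hypothetical sketch: pass all connections to a single command on stdin
# instead of running the command once per connection.
# Usage: kill_connections_batch <command> [args...]  (list read from stdin)
kill_connections_batch ()
{
    _conns=$(cat)

    # Avoid running the command at all when there is nothing to kill
    [ -n "$_conns" ] || return 0

    echo "$_conns" | "$@"
}
```

The saving comes from one process start instead of one per connection,
which matters most when the list is long.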
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit a20d94717d2e4ab866d8a002cdf39c0669b74c6a)
The timeout information printed by ctdbd is less than useful because
it refers to the cumulative time taken by the eventscripts run so far.
Adding scriptstatus output indicates where time was actually spent.
Since there is now quite a bit of output, serialise the calls to this
script using flock.
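
A minimal sketch of this kind of serialisation, using flock(1) on a
file descriptor held by a subshell. The wrapper name and the lock file
path are assumptions made for illustration.

```shell
# Hypothetical wrapper: run a command while holding an exclusive lock, so
# output from concurrent invocations is not interleaved.
# The lock file path is an assumption made for illustration.
LOCKFILE="${LOCKFILE:-${TMPDIR:-/tmp}/debug-hung-script.lock}"

run_serialised ()
{
    (
        # File descriptor 9 refers to the lock file for the lifetime of
        # this subshell; the lock is dropped when the subshell exits.
        flock -x 9 || exit 1
        "$@"
    ) 9>"$LOCKFILE"
}
```

A second caller blocks in flock until the first subshell exits, so each
block of debug output is written as a unit.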
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 1b016b2dfc5d7d3f2a42ce4dfe569608e90eb714)
A missing interface is at least as bad as an interface with a link
that is down so should have a similar effect.
This couldn't be done previously because orphaned interfaces used to
be listed for monitoring. This was worked around in 10.interface in
commit 49b2d1bd9554461ed8edbfc21e777c0eca9e1443 and fixed in ctdbd in
commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.
If $CTDB_PARTIALLY_ONLINE_INTERFACES="yes" then monitoring won't
actually fail but the interface is still marked as down.
While we're touching this code, use "ip link" instead of "ip addr".
It is marginally cheaper but not enough for a separate patch. ;-)
This effectively reverts d67955b42f7627be9dae995230c8fcbb8a948ec2.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 501f19b16fd6d67fbb754248868c38ee5bcf79ef)
This was previously changed because ctdbd didn't garbage collect
orphaned interfaces. This was fixed in commit
cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c6ab0f9405d5fa5b0b1693bc92e59da0d555a9d7)
It can be very disconcerting when logging to syslog is expected but
nothing is being logged there.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 412bc0e20bef694d4e911dc9c984fd7716231f1f)
Based on an original patch by Sumit Bose <sbose@redhat.com>.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e43a4b7b69a21c4cec2453dcac436b64bf5d7f06)
Currently the initscript is very complex. This makes it hard to read
and hard to add support for new init systems, such as systemd.
Create a wrapper called ctdbd_wrapper to be installed alongside ctdbd.
This is called by the initscript to start and stop ctdbd. It does the
ctdbd option construction and waits until ctdbd is properly initialised
before it exits.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit e3abc7eebab5cceddc4ce7817890dd5db9be3450)
This allows 60.ganesha to be unit tested, except for the core Ganesha
monitoring code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f606df4f2db754592e6d1a16c26e155cacb2beef)
Support for this was removed in commit
77302dbfd85754e02559eccb2dd6c090db0b6b9f and I overlooked its use in
60.ganesha.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 520914e7ee1b879c1080e5857fda18ed5b973fd6)
The "setup" event isn't called until ctdbd is in CTDB_RUNSTATE_SETUP
anyway...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9ea57af557028b1d2e5c560e7bcf4d014b9a8b1e)
This essentially reverts d4621277240721e6d130a930b0100506b64467ea.
This was added for testing but the test code was actually broken.
CTDB itself will only process public IPs if $CTDB_PUBLIC_ADDRESSES is
set, so no code should try to be more flexible than that!
The test code has been fixed instead.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3b11b27f3e22e99947bc2d6c49c4427bd7a0e332)
It makes sense to do this in the "init" event and make the initscript
less complicated.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3bc93f312b8464fbfa2b2c44fffedc591fe5a3e0)
It makes sense to do this in the "init" event and make the initscript
less complicated.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0b77cceb49a30a181063adc7868d42d2851318e8)
Otherwise secondary addresses that aren't owned by CTDB could be
dropped.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5ffce65a1ad659b198ddf647622b899bdde45c72)
Change all callers to maintain current behaviour.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0b67397ef5419c781a35916575151da7b7e7cc27)
If some nfsd threads are still alive after a shutdown during a restart
then this indicates the maximum number of threads for which a stack
trace should be dumped. This can be useful for trying to determine
why nfsd is stuck.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 2503245db10d567af708a04edd3a3b488c24f401)
Consider the following example:
1. There are 256 nfsd threads configured.
2. 200 threads are "stuck" in system calls, perhaps waiting for the
underlying filesystem when an attempt is made to restart NFS.
3. 56 threads exit when NFS is stopped.
4. 56 new threads are started when NFS is started.
5. 200 "stuck" threads exit leaving only 56 threads running.
Setting this option to "yes" makes the 60.nfs monitor event look for
this situation and try to correct it.
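
The condition being looked for can be sketched with the numbers from
the example. The function name is hypothetical; on a real system the
running count would presumably be read from /proc/fs/nfsd/threads, but
that is an assumption here and the sketch takes both values as
arguments.

```shell
# Hypothetical check: succeed if fewer nfsd threads are running than are
# configured, which is the situation left behind by the steps above.
nfsd_threads_short ()
{
    _configured="$1"
    _running="$2"

    [ "$_running" -lt "$_configured" ]
}

# With the numbers from the example: 256 configured, 56 left running
if nfsd_threads_short 256 56 ; then
    echo "Attempting to correct number of nfsd threads: 56 -> 256"
fi
```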
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 99b0d8b8ecc36dfc493775b9ebced54539c182d2)
60.nfs and 60.ganesha touch $statd_update_trigger every time they're
run. This stops the statd-callout updates from ever being called.
Make this logic self-contained and move it to new function
nfs_statd_update() in the functions file. Call this in 60.nfs and
60.ganesha with the appropriate update period as the only argument.
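
A sketch of a self-contained periodic trigger of this kind follows. It
is not the nfs_statd_update() implementation: the helper name is
hypothetical, and storing a timestamp inside the trigger file is an
assumption chosen to keep the example short.

```shell
# Hypothetical sketch of a self-contained periodic trigger.
# Usage: periodic_update <trigger-file> <period-seconds> <command> [args...]
periodic_update ()
{
    _trigger="$1" ; _period="$2" ; shift 2

    _now=$(date +%s)
    if [ -f "$_trigger" ] ; then
        _last=$(cat "$_trigger")
    else
        _last=0
    fi

    # Too soon since the last update: leave the trigger untouched
    [ $(($_now - $_last)) -ge "$_period" ] || return 0

    # Only record a new timestamp when the update is actually run
    echo "$_now" >"$_trigger"
    "$@"
}
```

The important property is that the trigger is only refreshed when the
update runs, so calling this on every event cannot starve the update.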
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reported-by: Poornima Gupte <poornima.gupte@in.ibm.com>
(This used to be ctdb commit 1b5968f6be084590667f4f15ff3bef13ed9a2973)
Every time a node that wasn't the NAT gateway master gets reconfigured
something like this appears in the log:
ctdbd: 11.natgw: Failed to del 10.0.1.139 on dev eth1
Since this usually fails it is better to mute the error than to have
it pollute the log.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0ca7a98ffef50cbd06849cfbf65fb4a3d668b7bd)
This is needed for AIX and possibly others.
Also provide a cheaper mktemp function, which is needed in the run_tests
script.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b2b572e9049c7138bd223226475bef8fe3e01f10)
The current code calls "ctdb setnatgwstate ..." on every event.
However, calling the ctdb tool in the "init" event is not permitted.
Instead, update the capability when it is needed and at regular
intervals via the "monitor" event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 39a43feae7c7de07ddaf2d6cb962f923d47d0c19)
This adds more serialisation to the startup, ensuring that the
"startup" event runs after everything to do with the first recovery
(including the "recovered" event).
Given that it now takes longer to get to the "startup" state, the
initscript needs to wait until ctdbd gets to "first_recovery".
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit ed6814ff0a59ddbb1c1b3128b505380f60d7aeb7)
If one or more run states are specified then "ctdb runstate" succeeds
only if ctdbd is in one of those run states.
At the moment, if the "setup" event fails then the initscript succeeds
but ctdbd exits almost immediately. This behaviour isn't very
friendly.
The initscript now waits until ctdbd is in "startup" or "running" run
state via the use of "ctdb runstate startup running", meaning that ctdbd
has successfully passed the "setup" event.
The "setup" event code in 00.ctdb now waits until ctdbd is in the
"setup" run state before proceeding via the use of "ctdb runstate setup".
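
The waiting described above amounts to polling a check command until it
succeeds. A minimal sketch, with a hypothetical helper name and an
assumed one-second polling interval:

```shell
# Hypothetical helper: poll a check command until it succeeds or a
# timeout (in seconds) expires.
# Usage: wait_until <timeout> <command> [args...]
wait_until ()
{
    _timeout="$1" ; shift

    _elapsed=0
    while ! "$@" >/dev/null 2>&1 ; do
        [ "$_elapsed" -lt "$_timeout" ] || return 1
        sleep 1
        _elapsed=$(($_elapsed + 1))
    done
    return 0
}

# Intended use, per the description above (not run here):
#   wait_until 10 ctdb runstate startup running
```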
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 4a2effcc455be67ff4a779a59ca81ba584312cd6)
This makes it easier to add notification handlers.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d29e9a420b133088bf23a847c8d1dbce56c25eb0)
fff88940f71058e4eefd65f50a6701389c005c17 introduced a regression.
Without $service_name set by default, the CTDB configuration is no
longer loaded when loadconfig() is called without any arguments.
That's bad.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f1619a36c1beba11533052dc5728fa3adaa08870)
No longer used, support removed from test infrastructure.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0eb351ff4c7ee096de7c5e0a59561067091fa32e)
* New directory nfs-rpc-checks.d/ replaces hardcoded rules in 60.nfs
* Installation and packaging additions to handle nfs-rpc-checks.d/
* Unit test updates, including deleting 1 test that sanity checked
test infrastructure
* Test infrastructure changes to use nfs-rpc-checks.d/
Note that this removes support for $CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK in
60.nfs. To get the equivalent behaviour, edit 20.nfsd.check and
remove/comment all lines.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 7e792d6768d9ca420ce3713cb122e63afd594b15)
Want nfs_check_rpc_services() to support filenames without the 'k'.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d9775fcbd6e30eef8382bea68e2f9bad2309f2c1)
This is intended to replace nfs_check_rpc_service(), which builds
configuration into eventscripts.
nfs_check_rpc_services() uses a directory of configuration checks that
can be edited by an administrator. The files have one limit check and
a set of actions per line. The program name is extracted from the
file name.
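
As an illustration of extracting the program name from a file name such
as 20.nfsd.check, a sketch using shell parameter expansion follows. The
function name and the "NN.<program>.check" naming pattern are
assumptions based on that one example.

```shell
# Hypothetical: derive the RPC program name from a check file name such
# as "20.nfsd.check" by stripping the numeric prefix and the suffix.
rpc_program_from_file ()
{
    _base="${1##*/}"      # drop any leading directory
    _base="${_base#*.}"   # drop the "NN." ordering prefix
    echo "${_base%.check}"
}
```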
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9bc8fbee6550ed2814fb35c70d57fab21ef1b8fd)
This creates new function _nfs_check_rpc_common().
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cc3bb42e48bbdabd19187c231846b98589b4f4f3)
This is unused so doesn't need to be maintained. An attempt to use it
now will explicitly fail rather than implicitly fail via bitrot.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 887733dd7be53158bfe07b30ef31b611d0f8122f)
This reverts commit 92f74fd589467b46c758e116e97417edfe8773d7.
This change is unused and is just complicating the function.
Conflicts:
config/functions
(This used to be ctdb commit 77302dbfd85754e02559eccb2dd6c090db0b6b9f)
The code in 60.nfs is going to be genericised, so make all the checks
look the same.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 15b0f78cbf8d6ba481b7eba9e4fe3f4270214c72)
ctdb_check_counter_limit() can soon be removed...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit bb2cdff77e8ec79e7d319159b9c9848ecfaaa0f1)
It is in the background but it still might cause the counter to be
reset before it is checked.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ef2cf75e95ff382c65524a4d77eb00ab8411d2fc)
That way we don't even check the counter...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 136abd4604dc68f7c696704bac708bae53cf1940)
This has 2 advantages:
1. It uses get_tcp_connections_for_ip() to check for leftover
connections, instead of custom code.
2. It checks for the timeout condition before sleeping. The current
code sleeps and then checks, so wastes a second.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 60a08eb96e1d97aab31e9bd4af01683c650541c2)
Uses new function get_tcp_connections_for_ip(). This avoids using a
temporary file and running netstat twice.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a621622903c7ef17764b15293d6ea8df5a53c7e1)
... using kill_tcp_connections()
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 10e4db8f796d1e3259733180494db3b4bbad291a)
This change is a no-op. However, in a subsequent commit we'll merge
kill_tcp_connections_local_only() with this function.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 23c0f5f48e3e5a0c1a3254c582299f7893cf0d33)
Setting these variables spawns lots of unnecessary processes, which
would surely slow down these functions on a busy system.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3eae161472e6352f7f656851c73dc056f95113eb)
* Command is now multiple arguments, preserving quoting
* $service_name no longer printed, no longer an argument
* Debug output from failed command
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9e25fb261447a196de05937052779b36e75e7215)
The documentation comments are wrong... and remove the optional
$service_name argument.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d9e6cb945c5edac9ca6405c9228bf647fab814f5)
Complicated argument handling was introduced to deal with multiple
services per eventscript. This was a failure and we split 50.samba.
This simplifies several functions to use global $service_name
unconditionally instead of having an optional argument.
$service_name is no longer automatically set in the functions file.
This means it needs to be explicitly set in 13.per_ip_routing because
this script uses ctdb_service_check_reconfigure().
Eventscript unit test infrastructure needs to set $service_name during
fake service setup, and policy routing tests need to be updated
accordingly.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 27aab8783898a50da8c4bc887b512d8f0c0d842c)
Complicated argument handling was introduced to deal with multiple
services per eventscript. This was a failure and we split 50.samba.
This simplifies several functions to use global $service_name
unconditionally instead of having an optional argument.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b5802c4735e1c719a5cf9ce69489d5947bd5e8c5)
Complicated argument handling was introduced to deal with multiple
services per eventscript. This was a failure and we split 50.samba.
This simplifies several functions to use global $service_name
unconditionally instead of having an optional argument.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e24baac0d2952e86d5ff31235901f06e2f2b2449)
Complicated argument handling was introduced to deal with multiple
services per eventscript. This was a failure and we split 50.samba.
This simplifies several functions to use global $service_name
unconditionally instead of having an optional argument.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c2ea72ff565222f9edab408638bd45dbba6e8ff7)
5940a2494e9e43a83f2bca098bd04dfc1a8f2e93 makes script_log() always
pass a message to logger, so script_log() can no longer log stdin.
Put all the tag fu in the actual tag so the message argument is empty
if no message was passed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9dee4c84273633b9ad82e94dabbf0e6f86edbcef)
It isn't used, superseded by "ipreallocated".
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c2bb8596a8af6406ef50e53953884df9d6246a96)
Use "ipreallocated" instead. The "stopped" event pre-dates the
"ipreallocated" event. The only way of stopping a node is via the
ctdb tool, which explicitly causes a takeover run to occur after the
node is stopped. The takeover run will generate an "ipreallocated"
event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 978d4a0d6d8c9877b23f72e3a7b78c1245d16908)
Our practice is to search logs for "ctdbd:". We want to make sure we
find everything.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5940a2494e9e43a83f2bca098bd04dfc1a8f2e93)
Previous commits stopped the top level of the script from creating
certain directories but some functions assume that required
directories exist.
Create those directories instead.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0076cfc4666e5a96eb2c8affb59585b090840e00)
The current logic is horrible and creates an unnecessary file. Let's
make the script debug level independent of ctdbd's debug level.
* Have debug() use $CTDB_SCRIPT_DEBUGLEVEL directly
* Remove ctdb_set_current_debuglevel()
* Remove the "getdebug" command from ctdb stub in eventscript unit
tests
* Update relevant eventscript unit tests to use
$CTDB_SCRIPT_DEBUGLEVEL
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 85efa446c7f5c5af1c3a960001aa777775ae562f)
Move the use of the service command below inclusion of functions file,
which sets $PATH.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d254d03f69cbdc3e473202b759af6e1392cbb59c)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit e7a4b7e35a1e4b826846e2494a3803abb57065ee)
"ctdb ping" can time out. How many times should we try?
Instead, depend on the initscript to implement something sane.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit 90cb337e5ccf397b69a64298559a428ff508f196)
Using "ctdb ping" and "ctdb status" is fraught with danger. These
commands can time out when ctdbd is running, leading callers to believe
that ctdbd is not running. Timeouts could be increased but we would
still have to handle potential timeouts.
Everything else in the world implements the "status" option by
checking if the relevant process is running. This change makes CTDB
do the same thing and uses standard distro functions.
This change is backward compatible in the sense that a missing
/var/run/ctdb/ directory means that we don't do a PID file check but
just depend on the distro's checking method. Therefore, if CTDB was
started with an older version of this script then "service ctdb
status" will still work.
This script does not support changing the value of CTDB_VALGRIND
between calls. If you start with CTDB_VALGRIND=yes then you need to
check status with the same setting. CTDB_VALGRIND is a debug
variable, so this is acceptable.
This also adds sourcing of /lib/lsb/init-functions to make the Debian
function status_of_proc() available.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit 687e2eace4f48400cf5029914f62b6ddabb85378)
In RHEL 6+, rpc.statd runs as "rpcuser" instead of root as on RHEL 5. This
prevents CTDB tool commands from talking to the daemon since "rpcuser" cannot access
the CTDB socket.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit fe8c4880b371492a38554868d4ca10918c54e412)
This is an artifact from older versions of Samba. In the newer versions of
Samba, "smbstatus -np" command does not do anything useful, but causes a
traverse in CTDB which is expensive and causes CPU utilization to shoot up.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 053b89c6dbce47001505524606889334559d2ec4)
This means it can be set like any other configuration option in the
configuration file, without needing to export it there.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a0ef73e197dc9147f7718e0813fe803ff0b3d54d)
Use an environment variable instead. This just means that the
initscript exports CTDB_DEBUG_HUNG_SCRIPT and the code checks for the
environment variable.
The justification for this simplification is that more debug options
will be arriving soon and we want to handle them consistently without
needing to add a command-line option for each. So, the convention
will be to use an environment variable for each debug option.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0581f9a84e58764d194f4e04064c2c5b393c348b)
Unobtrusive recovery: Ganesha will not be restarted on failovers.
Ganesha health: Use the counters in /var/lib/nfs/ganesha_local to track progress
instead of the null call which can time out if the server is too busy.
Signed-off-by: Srikrishan Malik <srimalik@in.ibm.com>
Signed-off-by: Lance Russell <lancerus@us.ibm.com>
(This used to be ctdb commit 0e651e9da0f1f3c836b4474612ab13d0ccd272d9)
Currently it silently continues without attempting to set tunables.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 735ec99b99c7bb579851ce8293011aaf1dcc552a)
When using syslog any provided message arguments are ignored and not
passed to logger. This means that logger blocks waiting on stdin.
That's bad.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 50abf597cefe6f8ea2a2ff7694bf84641344a9b1)
This improves maintainability.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e2aaa64925cca359c71520e01a18fc9461b0da4d)
Incorporate some of the logic from ctdb-crash-cleanup.sh that ensures
IPs are deleted even if they have the wrong netmask or are on the
wrong interface.
Factoring out some of the code will allow it to be used elsewhere.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 03356fd5ae7a3ac35fde0289cbea7c71ecf07367)
This makes it easier to run the scripts externally.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 740ea8ea5084149c8b552a01ee1c98c558b12384)
... so it can be improved and used elsewhere.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b23c30253cc9eb274b895cac0f8c65245ba0a200)
A default action of restarting the service doesn't obey the principle
of least surprise. It caused the NFS service to be implicitly
reintroduced.
This allows no-op functions to be removed from some eventscripts and
service restart functions to be added to others.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c75b5e5b4d000f5c7dab403df8238ceed390c1c0)
It looks like this restart was accidentally reintroduced in commit
fc0678d351187cfa4c71123f97c0f493aacd5d16 when $service_reconfigure
became unset so the default action of restarting the service would
occur. From there cleanups have explicitly reintroduced it and
carried it through the code.
Also update the unit tests affected by this change.
The restart was originally removed in commit
bc481c3f1a44c50648488c4f8a7f15ec395d446f.
The default reconfigure action of restarting a service is clearly
suboptimal and will be addressed in a separate patch.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 2629de72e1f37b5e46772c2ef8d8d0012fc4ed37)
At the moment the caller has no idea why it thinks CTDB isn't running
and we can't debug failures...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 776590bf84d221092298346a28d7fc0552a67c9d)
Creating the smb.conf cache with "-v" results in a cache file
that fails to load with "testparm -s ..." later on due to
"copy = " not being processable. (Copying the empty service name fails).
Signed-off-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit 81788cfabe960497b050c5ee4e4e487ee061012a)
The current code lists available interfaces. If IPs are configured in
some other way than the public addresses file (e.g. ctdb addip) and their
interfaces default to being marked down then, since down interfaces are
not available, these interfaces can never be marked up.
The configured interfaces should be listed instead.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d8f010355b715e49709836e057a5d0f110919897)
Provided that monitor_interfaces() sets the state of each interface,
there's no need to mark all interfaces as up before running
monitor_interfaces() in the startup event. monitor_interfaces() will
set the true status of each interface anyway. The duplication is
unnecessary and may cause extra action in the recovery daemon because
the state of some interfaces is changed an extra time.
Instead, add a comment at the top of the loop in monitor_interfaces()
to warn against early loop exits.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f243a916ee71013f7402b9c396c2ead88eb3aab0)
This file is #!/bin/sh. On sn-devel at least, with this /bin/sh the
shell does not like == for string equality.
(This used to be ctdb commit e2213db479129ce9c2b2fb88ec8c53cbd33d54b3)
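To illustrate the portability point in the commit above: POSIX only defines "=" for string comparison in test, while "==" is an extension that some /bin/sh implementations reject. A small sketch, with a hypothetical helper name:

```shell
# Portable string equality for #!/bin/sh scripts.
# is_yes is a hypothetical helper used only for this example.
is_yes ()
{
    [ "$1" = "yes" ]
}

if is_yes "yes" ; then
    echo "enabled"
fi
```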
This reverts commit 88f88d86b0d08240f749fb721b8c401c2eeb1099.
This is dangerous and, on reflection, I can't see it being useful.
There are often permanent IPs on interfaces that CTDB shares with its
public IPs.
(This used to be ctdb commit 16aba4eb620844626a1c71c58b51658caf44dea6)
The recovery process has no protection against the "recovered" event
failing, so this can cause a recovery loop.
Instead of failing the "recovered" event, add a "monitor" event and
fail that instead. In this case the failure semantics are well
defined.
A separate patch should ban nodes if the "recovered" event fails for
an unknown reason.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit eaa7c165f58abd7e259c37d76b7dd37c91e13d9f)
We've seen this function report "Unknown family, 0" and then CTDB
disappeared without a trace. If we can reproduce it then this might
help us to debug it.
The idea is that you do something like the following in /etc/sysconfig/ctdb:
export CTDB_EXTERNAL_TRACE="/etc/ctdb/config/gcore_trace.sh"
When we hit this error then we call out to gcore to get a core file so
we can do forensics. This might block CTDB for a few seconds.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 7895bc003f087ab2f3181df3c464386f59bfcc39)
ctdb_check_counter_limits does not fail but succeeds if count >= limit
Signed-off-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit af540ef728303b4a0a188b17c695e9aefab34489)
The tunable variables defined in CTDB configuration file are currently
set up from init script as well as part of "setup" event in 00.ctdb
eventscript. Remove the duplication of this code and set tunable
variables only from setup event. During the "setup" event, it's possible
that ctdb tool commands can time out if CTDB daemon is not ready. To guard
against such an eventuality, wait till "ctdb ping" command succeeds before
executing any other ctdb tool commands.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 632c1b9c1cc2e242376358ce49fd2022b3f27aa2)
This rebuilds all policy routes and can be used if the configuration
changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c185ffd2822fcee26d07398464c59b66c61f53fa)
If $CTDB_SERVICE_AUTOSTARTSTOP="yes" then service start/stop is done
in the background with logging.
Fix some unit tests for samba and winbind.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3a3dae4cb5ec8b4b8381a4013adda25b87641f3a)
winbind and samba can be separately managed. This makes the service
starting and stopping code way too complicated, and even adds a small
amount of complexity to the monitoring code. The sensible option is
to split this eventscript in two.
There are two potentially backward incompatible changes here:
* Functionality has been removed that allowed 50.samba to manage
winbind when CTDB_MANAGES_WINBIND was unset but the smb.conf
"security" parameter was set to "ADS" or "DOMAIN".
Maintaining this functionality would have required moving the
testparm-related code to the functions file, deciding where the
cache file should go, and then calling it from both 49.winbind and
50.samba. This feature wasn't of great value and asking
administrators to set an extra variable in exchange for code
simplicity seems like a reasonable deal.
* External code will need to be changed if it calls 50.samba directly
with winbind-related expectations. This is fairly obvious!
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 34535ae64420926b9a3bf7d453fed4e6f4c90115)
Initialising a new ctdbd will destroy the Unix domain socket so
existing processes will be useless anyway.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 043ef77086797a703aec436a26a05c56a1bcbf2b)
This puts it under the umbrella of the previous warning that should
also have been printed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5c3be8f26dcde0b1b3d86928953e74d4a8b35958)
del_routing_for_ip() currently fails silently, which could hide real
errors.
In add_routing_for_ip() we don't want to see any error when calling
del_routing_for_ip(), since we don't expect the rule to be there.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 30d69defa7e97ab5e3ba0492a27868dde2616494)
Currently, if the configuration file is specified by
$CTDB_PER_IP_ROUTING_CONF but is missing, takeip fails but (the
absent) monitor event "succeeds", so the state of a node will
flip-flop.
Instead of this, if the configuration file is missing then fail early
on for all events.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c64c6c77c3f6aa2898e5a575547b587bea868c76)
When the configuration file is missing this causes the node to
flip-flop between unhealthy (when takeip fails) and healthy (no monitor
event here).
Will reimplement this properly.
This reverts commit 351ca413eec460330571ca8b01ad269728fe15df.
(This used to be ctdb commit 5277d749c9111716fd723647d5421907476422bf)
The loops can all be done without cat or grep.
The pair of loops in updateip is combined into a single loop.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 96fdda124f5511fb76190e7c7a7f0b98e6b01a31)
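As a sketch of what a loop "without cat or grep" can look like, the following reads a file directly with the shell's read builtin and filters comments with case. The function name and the assumed file layout (whitespace-separated fields, "#" comment lines) are illustrative assumptions, not taken from the commit:

```shell
# Hypothetical helper: print the first field of each non-comment,
# non-blank line of file $1, using only shell builtins.
list_first_fields ()
{
    while read -r _first _rest ; do
        case "$_first" in
            ""|\#*) continue ;;   # skip blank lines and comments
        esac
        echo "$_first"
    done <"$1"
}
```

Redirecting the file into the loop avoids one process for cat and one for grep on every call.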
This makes the case implicit where $CTDB_PUBLIC_ADDRESSES is unset.
This is OK because that's not an interesting code path.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5b2725d1ae052e848c2487cb10c5393a877d118c)
It is just meant to be even, so divided *and* multiplied by 2. Use
$(( )) to make it more readable.
While touching this code, make the related calculation a bit more
readable too.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 25d45e69f4ffc2b26061ac13038d52a353e79e61)
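The "divided and multiplied by 2" idiom from the commit above can be written with shell arithmetic expansion. The commit does not name the value involved, so the function below is a hypothetical stand-in:

```shell
# Round $1 down to the nearest even number. Integer division by 2
# discards the remainder; multiplying by 2 restores the magnitude.
round_down_to_even ()
{
    echo $(( $1 / 2 * 2 ))
}

round_down_to_even 7    # prints 6
round_down_to_even 10   # prints 10
```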
At the moment routes from 11.routing can fail to be added because they
conflict with the default route added by 11.natgw.
NAT gateway is meant to be a last resort, so routes from 11.routing
should override it.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 624f4677e99ed1710a0ace76201150349b1a0335)
* "ctdb natgw" is run twice when it doesn't need to be.
* Tweak the parsing of "ctdb natgw" output so that it is done by the
shell instead of a bunch of external processes.
* Make default NAT gateway be -1, even on error. If the process
failed entirely then it could previously be empty.
* Streamline the error handling using die() for when there is no NAT
gateway.
* Downcase script-local variable names.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 630cfe6451ba23d959fa4907fbba42702337ed3b)
It can be built without forking unnecessary processes.
Also downcase variable name because it is local to script.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 34f58a0773618c4508a55ad75fc4602dad5a5f4c)
aeb70c7e7822854eb87873a5c7783e27e6e72318 said it moved these but it
redundantly duplicated them instead. That commit also fixed the
problem because it moved the rules after delete_all() not out of the
startup event as claimed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 07149edaecb3caa672163e5a3b89715557d5205a)
$CTDB_NATGW_PUBLIC_IP can be split into $_ip and $_maskbits without
forking lots of processes.
Also "local" isn't supported by POSIX.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e20fdb974158061f4627d6f360c168d764690e6f)
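The split described in the commit above can be done with POSIX parameter expansion alone. In this sketch the address is an assumed example value:

```shell
# Assumed example value in <ip>/<maskbits> form.
CTDB_NATGW_PUBLIC_IP="10.1.1.121/24"

# Suffix removal: drop the shortest trailing match of "/*".
_ip="${CTDB_NATGW_PUBLIC_IP%/*}"
# Prefix removal: drop the shortest leading match of "*/".
_maskbits="${CTDB_NATGW_PUBLIC_IP#*/}"

echo "ip=$_ip maskbits=$_maskbits"   # prints: ip=10.1.1.121 maskbits=24
```

Both expansions happen inside the shell, so no external process is forked.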
This currently causes a warning in the logs.
This change is not SLES10-compatible but we already have some other
non-SLES10-compatible changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 7640352c6697f9d4e0d13afbc8523afc64e7d462)
Break this debug and data collection out into an external script to make it easier to modify what data we need to collect.
For now we only collect a pstree so we can see what part of the script we hung in.
S1037271
(This used to be ctdb commit 6e68797af67bee36f2bad045f94806e7e98f27e9)
Originally from Srikrishan Malik <srikrishan.malik@in.ibm.com> with
some style changes by me.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 637cab6304dae66b85668506028c76ea1ee88980)
Sometimes the restart can hang when there are I/O problems. Then the
eventscript times out and gets killed so the node is never marked as
unhealthy.
Restarting in the background avoids this.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 13acd58c41fba1a33894fbd654fed69ea0eac322)
This can be optional because the 1st item of each action-triple is a
test comparison that starts with '-'.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 92f74fd589467b46c758e116e97417edfe8773d7)
This can be optional because the 1st item of each action-triple is a
test comparison that starts with '-'.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1957d53b78f101cd0cd37d9705a225deef5174a2)
I fixed one of these previously but didn't notice this one... :-(
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0c674efd19368d41d9cc28909d2b16c1af54c86c)
Corrupt non-persistent databases never get analysed because ctdbd
zeroes them at startup.
Modify the initscript so that corrupt non-persistent databases are
moved aside to a backup. If the number of backups for a particular
database exceeds $CTDB_MAX_CORRUPT_DB_BACKUPS (default 10) then the
oldest excess backups are garbage collected.
Abstracts from and cleans up the code for checking persistent
databases.
Logging of related messages is done to syslog or a log file as
specified.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 00cd75595685dae829758abf1a4cb644af7ed50e)
Currently it spews out random messages about the file being missing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 351ca413eec460330571ca8b01ad269728fe15df)
Make add_ip_to_iface() and delete_ip_from_iface() do their own locking
so the external script is no longer required.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 93f90caf91246074d9359bf31a39b26212cccc42)
This is no longer used by 13.per_ip_routing or anything else.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 2a2ea6c61a05af2d0765e964abcc7ef04047431e)
The relevant functions are now in that script.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 45c3476d12bf0f52966b72d286f101fce1382cd2)
The current version is quite difficult to read. This one is hopefully
clearer.
Major changes:
* The configuration file has a more forgiving syntax. Items can be
separated by arbitrary whitespace.
* Mappings between IP addresses and table IDs are no longer stored in
files in a state directory. Instead they are stored in
/etc/iproute2/rt_tables as mappings between table IDs and labels, as
allowed by the ip command. The current structure of the labels is
ctdb.<source-ip>. This means that once the labels are set up the
routing tables can be referenced by just knowing the source IP. As
with the old state directory, mappings in this file owned by CTDB
are deleted when CTDB shuts down.
* There are no release or re-add scripts.
- Release scripts are not necessary as an optimisation because of
the previous improvement (i.e. use of rt_tables). No lookup is
necessary to delete rules or flush tables.
- Re-add scripts are no longer used. Routes can still go missing
when removal of a primary IP from an interface (or similar)
causes removal of all other addresses (i.e. secondaries) and also
all associated routes. However, any missing routes are now
re-added in the "ipreallocated" event. This happens shortly after
takeip/releaseip/updateip and means that the routes will only be
re-added once. The window for missing routes is slightly bigger
but is not expected to be significant.
* The magic "__auto_link_local__" configuration value no longer causes
a dynamic configuration file to be maintained in a state directory.
The link local configuration is now generated when needed from the
public_addresses file. This greatly simplifies the code. This
approach is slightly less efficient but should not be significant.
The above changes mean that, apart from maintaining mappings in the
rt_tables file, there are no state files kept anymore.
Some utility functions only used by this script have been rewritten
and moved into this script. They will be removed from the functions
file by a future commit.
The route re-add code will also be removed from interface_modify.sh by
a future commit. It is currently harmless.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0f7cbbb55f26cf3c953e98fe5e7eaa12f59fbf78)
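To make the rt_tables labelling scheme from the commit above concrete, here is a small sketch that looks up the table ID for a source IP in a file using the rt_tables layout ("<id> <label>" per line). The helper name and the file argument are hypothetical. The sketch only reads the file and does not show how CTDB adds or removes entries:

```shell
# Hypothetical helper: print the table ID mapped to the label
# "ctdb.<source-ip>" in rt_tables-format file $2, or return 1 if
# there is no such mapping.
table_id_for_ip ()
{
    _label="ctdb.$1"
    while read -r _id _name _rest ; do
        if [ "$_name" = "$_label" ] ; then
            echo "$_id"
            return 0
        fi
    done <"$2"
    return 1
}
```

The ip command accepts a table label wherever it accepts a numeric ID. A command such as "ip route flush table ctdb.10.0.0.1" therefore needs no lookup at all, which is the optimisation the commit describes.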
Args:
1. Error message to be printed.
2. Optional exit code (default 1)
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 97b0c138cb97e30db27c40b4ee1481109ae90c78)
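A function with the two arguments listed in the commit above might look like the following sketch. The name die appears elsewhere in this log, but writing the message to stderr is an assumption here:

```shell
# Sketch of a die() helper: print the message and exit with the
# optional second argument, defaulting to 1.
# Assumption: the message goes to stderr.
die ()
{
    _msg="$1"
    _rc="${2:-1}"
    echo "$_msg" >&2
    exit "$_rc"
}
```

A typical call would be: [ -r "$conf" ] || die "Configuration file is missing" 2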
Track and produce audit logs when someone runs "service ctdb <something>"
S1033891
(This used to be ctdb commit 4f4fbd4080a3a7226d3b82637f803c4b71217d39)
For a number of reasons (delip failure, admin stupidity, ...) an
interface that hosts public addresses can also contain spurious,
unmanaged addresses.
Add functionality to 10.interfaces, controlled by new configuration
variable CTDB_DELETE_UNEXPECTED_IPS, to delete these addresses when
encountered as part of a monitor event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 88f88d86b0d08240f749fb721b8c401c2eeb1099)
The script name is now prepended to output by ctdbd.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit bfa0fe70db195413a6d7a98f46f7a1270aba678c)
* $fs can be parsed using shell prefix and suffix removal.
* df output can be parsed with a single call to sed.
Failure is indicated by empty output from sed, so we check for that
as the error condition, changing the associated message
appropriately.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c5ef0d1440f1d952784cc67946c414d149722d01)
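A rough sketch of the two parsing steps named in the commit above. The entry format "<filesystem>:<threshold>", the variable names and the helper name are assumptions made for illustration. The sed expression assumes the POSIX "df -P" layout, where the capacity column looks like "50%":

```shell
# Assumed entry format "<filesystem>:<percent-threshold>".
fs="/var/lib/ctdb:90"
fs_mount="${fs%:*}"       # suffix removal: /var/lib/ctdb
fs_threshold="${fs#*:}"   # prefix removal: 90

# Hypothetical helper: read "df -P" output on stdin and print the
# use percentage (digits only). Prints nothing if no line contains
# a "<n>%" field, which a caller can treat as the error condition.
df_use_percent ()
{
    sed -n -e 's@.*[[:space:]]\([0-9][0-9]*\)%[[:space:]].*@\1@p'
}
```

A caller would then run something like df -P "$fs_mount" | df_use_percent and compare the result with $fs_threshold, reporting an error if the output is empty.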
... on Debian systems and derivatives.
(ctdb_diagnostics still hardcodes /etc/sysconfig/)
(This used to be ctdb commit 1341329f6125d491b82c873f793af819e677f714)
Also, add -P to df, to avoid multi-line output on Linux when the device name is long (this is the case with LVM).
(This used to be ctdb commit f4d5a5810f1a840a41c3541a3b822fce44d41e9a)
Print useful output and return a suitable exit code.
The DISABLED and TIMEDOUT statuses use fake negative return codes, and
these can't be faked from the shell. So we map DISABLED to OK and
TIMEDOUT to ERROR - this should avoid nearly all surprises. When we
do this we add a note to the beginning of the output. The alternative
is to "fix" ctdbd to use only codes that can actually be returned by
shell scripts. However, the reason for using negative codes is
probably to distinguish them from real ones...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit dda44d026e0c1b02feb02185b8c200a542be341a)
In the current code services can only be reconfigured asynchronously.
This means that configuration file changes can be made, an asynchronous
reconfigure event can be triggered, and it always succeeds. Some time
later when a service is actually reconfigured then a failure may be
seen.
This adds a synthetic reconfigure event that reconfigures a service
synchronously so that any failure is reported on exit.
ctdb_service_check_reconfigure() is essentially reimplemented.
If a reconfigure event is in flight and an ipreallocated or monitor
event occurs then any scheduled asynchronous reconfigure is deferred
until the next monitor cycle. This is to avoid reconfigures trampling
on each other. In this case a monitor event will also replay the
previous status to try to avoid exposing any temporary instability.
If a reconfigure event collides with another reconfigure event it will
exit with status 2, indicating that the reconfigure should be retried.
The reconfigure event is implemented using a subprocess to control the
exit from the synthetic event.
As before, if a monitor event causes a scheduled synchronous
reconfigure to occur then it will replay the previous status for the
service, given that a reconfigure can cause temporary instability.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 220578bfd3507152b29ba4c28942f9d5e8733886)
This is the first eventscript. Sanity check as early as possible and
everyone benefits.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0564717fcc1e21688ae5dacbd437fd493bcb8853)
Pass this "$@" to do common eventscript argument checking.
For regular use putting this in 00.ctdb would be enough. However, for
developer testing it can be useful to call this in other eventscripts.
For example, 10.interfaces and 13.per_ip_routing currently check these
by hand.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 36de7e7fd6dfeed61ef9977b8d5b568f90a9707b)
Some of the current auto-start/stop logic is broken, particularly for
Samba. Fixing it is non-trivial.
If $CTDB_SERVICE_AUTOSTARTSTOP is "yes" then auto-start/stop services
when told to newly manage or no longer manage them. This defaults to
"yes".
However, if using a canned configuration file that doesn't set
$CTDB_SERVICE_AUTOSTARTSTOP then this stops the auto-start-stop logic
from working. Therefore, this works around CQ S1026685 - on the
system in question another daemon controls service auto-start/stop and
CTDB just gets in the way.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ef71b8290ae49117d7bcc7166598b77cb64cc8a0)
There are sites that have multiple entries for the same export. This
optimises the share check in this case.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1ccdae79b64b236fc27f4653606429d73c9c3595)
New function ctdb_check_tcp_ports_ctdb(). This should be fast... and
is now the default checker. If it fails in an unexpected way we fall
back to the nmap and netstat checkers.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a1e16a707ce204817531a61455000361f972080a)
Split the netstat-specific parts of ctdb_check_tcp_ports() into new
function ctdb_check_tcp_ports_netstat().
Implement new ctdb_check_tcp_ports_nmap() function that uses
"nmap -PS" to check if the desired ports are listening.
ctdb_check_ctdb_ports() now uses new configuration variable
CTDB_TCP_PORT_CHECKERS to decide which port checkers to try. Default
value is currently "nmap netstat". If nmap is not found then this
will fall back to netstat - if logging is at debug level this will
also fill the logs with messages saying the nmap checker failed. This
indicates that either nmap should be installed or the default value of
CTDB_TCP_PORT_CHECKERS should be changed (in a configuration file) to
avoid trying to use nmap.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d9651175b40b9454e7d4e98291955fcf1445085e)
Use the new debug function to conditionally print the netstat output.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 44c14aeeb11080980fe07c7396d06843a4870747)
Sometimes smbd and other services can take a while to start,
especially when there is a lot of activity after ctdbd has just
started. The TCP port check can then pollute the logs with lots of
"ERROR" messages and possibly extra debug.
This creates a flag file when a service is started (but not restarted)
and this flag is removed the first time that TCP port checks succeed
for that service. When a port check fails and the flag file still
exists, a less extreme "INFO" message is printed rather than the usual
"ERROR" message. This means that until the node actually becomes
healthy we see more friendly messages.
The subtext is that we're hearing false positive reports ("recreates")
of CQ S1024874 (samba stopped responding on port 445) quite often when
ctdbd is started. This reduces the chances of people reporting such
false recreates...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 571865eb6ef847857129d0b1e2ba5fa7254bfe8c)
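The flag-file mechanism in the commit above could be sketched as follows. The state directory, the function names and the message wording are all hypothetical. Only the idea of a flag created at service start and removed on the first successful port check comes from the commit:

```shell
# Hypothetical state directory and helper names, for illustration.
_flag_dir="${TMPDIR:-/tmp}/service-started-flags.$$"

# Called when a service is started (but not when it is restarted).
service_started ()
{
    mkdir -p "$_flag_dir" && touch "$_flag_dir/$1"
}

# Called the first time the TCP port checks succeed for the service.
port_check_succeeded ()
{
    rm -f "$_flag_dir/$1"
}

# $1 = service name, $2 = port that is not listening.
port_check_failed ()
{
    if [ -f "$_flag_dir/$1" ] ; then
        echo "INFO: $1 is not yet listening on port $2"
    else
        echo "ERROR: $1 is not listening on port $2"
    fi
}
```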
ctdb_check_tcp_ports() runs "netstat -a -t -n" in a loop for each
port. There are 2 problems with this:
* Netstat is run on each loop iteration when it need only be run once.
* The -a option is used to list all connections but the function only
cares about the listening ports. There may be many thousands of
non-listening ports to grep through.
This changes ctdb_check_tcp_ports() to run netstat with the -l option
instead of the -a option. It also only runs netstat once before the
main loop.
When a port is found to not be listening the output of the netstat
command is now dumped to help with debugging.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 830355a8b18c53cfcc3ad1e3009bbb1a7a681fa0)
The debug function passes its arguments to echo if
$CTDB_CURRENT_DEBUGLEVEL is >= 4 (i.e. DEBUG). If no args are given
then use stdin - this allows the function to be used with here
documents.
To ensure $CTDB_CURRENT_DEBUGLEVEL is set,
ctdb_set_current_debuglevel() is called near the end of the functions
file.
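As a rough sketch, a function with this behaviour could look like the following. It is an approximation written from the description above and may differ from the function in the functions file (for example, it adds no prefix to the output).

```shell
# Print arguments (or stdin when there are none) only at DEBUG level.
debug ()
{
    if [ "${CTDB_CURRENT_DEBUGLEVEL:-0}" -ge 4 ] ; then
        if [ $# -gt 0 ] ; then
            echo "$@"
        else
            # No arguments: copy stdin, e.g. a here document or a pipe.
            cat
        fi
    fi
}

# Example use with piped input:
#   netstat -l -t -n | debug
```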
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 6143483d9f87322578c00f12081e381f425226ca)
This function ensures that CTDB_CURRENT_DEBUGLEVEL is set. It works
like this:
1. If it is already set then do nothing, since it might have been set
some other way.
The recommended "other way" would be to add a file in rc.local.d/.
2. If it is not set then set it by sourcing
/var/ctdb/eventscript_debuglevel.
3. If this file does not exist then create it using output from "ctdb
getdebug".
If the optional 1st argument is set to "create" then don't source an
existing file but create a new one instead - this is useful for
creating the file just once in each event run in, say, 00.ctdb.
If there's a problem getting the debug level from ctdb then it is
silently set to 0 - no use spamming logs if our debug code is
broken...
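The steps above might be sketched like this. The file location and the query_debug_level() helper are placeholders invented for the example. The real function uses /var/ctdb/eventscript_debuglevel and output from "ctdb getdebug", whose parsing is not shown here.

```shell
# Hypothetical stand-ins so the sketch runs without ctdbd.
debuglevel_file="${debuglevel_file:-/tmp/eventscript_debuglevel.$$}"
query_debug_level ()
{
    # Placeholder: must print a numeric level or fail.
    echo 2
}

set_current_debuglevel ()
{
    # 1. Already set some other way: nothing to do.
    [ -z "${CTDB_CURRENT_DEBUGLEVEL:-}" ] || return 0

    # 3. Create the file if asked to, or if it does not exist yet.
    if [ "${1:-}" = "create" ] || [ ! -r "$debuglevel_file" ] ; then
        # Silently fall back to 0 if the level cannot be obtained.
        _l=$(query_debug_level 2>/dev/null) || _l=0
        echo "export CTDB_CURRENT_DEBUGLEVEL=\"${_l:-0}\"" \
            >"$debuglevel_file"
    fi

    # 2. Set the variable by sourcing the file.
    . "$debuglevel_file"
}
```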
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 93910921c8a25f2b029733cd938069ff7c7bdab7)
See the comment in the code for details.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8ee9856996a8ec738e9d3ea7f1561605da526b8c)
This potentially masks errors and was basically included by accident.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e7e4a1b4f31118027fd13a6223192f9957cf2e74)
The startup event intends to mark interfaces up. However, it doesn't
actually do that because $INTERFACES is empty.
This uses the function get_all_interfaces() to list the
interfaces... and then mark them up.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit fc62bf0975c6059ee467285565d0dc3b4daaf238)
Interfaces are currently marked down. Mark them up instead, as per
the comment... and discussion with Ronnie.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 35942841229cc72ce363a7236aec708f1a33136b)
Move existing interface listing code to new function in preparation
for using it in startup event.
While we're here change the "sort | uniq" into "sort -u" and save some
complexity.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cd1442531ad079b11c60f46ee9d34f5104bef219)
* sed can read files, it doesn't need a file piped to it
* use $() subshells instead of `` - they seem to quote better in dash
* tweak the uniquifying code so that it is easier to read
* add comments
* remove some extraneous semicolons at ends of lines
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5f49537889a92c3cb68d9203912188bedf00ecd4)
This effectively reverts 953dbfbddad656a64e30a6aca115cb1479d11573 and
is a policy decision.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 380c9263eb37db5a250264316e250c2160908263)
Change most of the uppercase variable names to lowercase for
consistency with other variables, readability and so they can be
easily distinguished from environment/configuration variables. Change
the name of 2 of the variables to add some clarity. Changes are as
follows:
INTERFACES -> all_interfaces
IFACES -> ctdb_interfaces
IFACE -> iface
I -> i
REALIFACE -> realiface
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 7b201c1087b1433cfbc95de76cb4205e484ccd6f)
The logic in the monitor event itself is very complex. Nearly all of
it can go away by adding a single check of
$CTDB_PARTIALLY_ONLINE_INTERFACES to the return logic of
monitor_interfaces() and reversing the sense of the corresponding
check.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit fa93177442c65c2a4eb2d5d5dba0a0da1c486969)
The name of variable $ok gives no clue to its meaning/use so this
changes that variable to be named $up_interfaces_found.
The return logic relating to $ok and $fail is difficult to read, so
these variables are given true/false values, allowing the return logic
to be simplified.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3402930319d462eab5525410f6a676952e120182)
The same few lines of logic are used every time an interface is marked up or down.
This encapsulates those few lines in 2 new functions.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ab443c4d7d282f282792abc6a6ac224ab06abe30)
We reduce the number of failures before attempting a restart.
However, after 6 failures we mark the cluster unhealthy and no longer
try to restart. If the previous 2 attempts didn't work then there
isn't any use in bogging the system down with an attempted restart on
every monitor event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f654739080b40b7ac1b7f998cacc689d3d4e3193)
This adds a helper function called nfs_check_rpc_service() and uses it
to make the monitor event much more readable. An example of usage is
as follows:
nfs_check_rpc_service "mountd" \
-ge 10 "verbose restart:b unhealthy" \
-eq 5 "restart:b"
The first argument to nfs_check_rpc_service() is the name of the RPC
service to be checked. The RPC service corresponding to this command
is checked for availability using the rpcinfo command. If the service
is available then the function succeeds and subsequent arguments are
ignored.
If the rpcinfo check fails then a failure counter for that particular
RPC service is incremented and subsequent arguments are processed in
groups of 3:
1. An integer comparison operator supported by test.
2. An integer failure limit.
3. An action string.
The value of the failure counter is checked using (1) and (2) above.
The first check that succeeds has its action string processed - note
that this explains the somewhat curious reverse ordering of checks.
In the example above:
* If the counter is >= 10 then a verbose message is printed
describing the failure, the service is restarted in the background
and the node is marked as unhealthy (via an "exit 1" from the
function).
* If the counter is == 5 then the service is restarted in the
background.
For more action options please see the code.
This also changes the ctdb_check_rpc() function so that it no longer
takes a program number to check. It now just takes a real RPC program
name that rpcinfo can resolve via /etc/rpc.
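The "groups of 3" argument handling described above can be illustrated with a small stand-alone sketch. Everything here is an assumption made for the example: the rpcinfo check is replaced by a stub that always fails, a single variable stands in for the per-service failure counter, restarts are only echoed, and "unhealthy" returns 1 instead of exiting.

```shell
# Hypothetical stand-ins so the sketch runs without rpcinfo.
rpc_service_available () { return 1; }   # pretend the service is down
fail_count=0                             # real code counts per service

check_rpc_service ()
{
    _name="$1" ; shift

    if rpc_service_available "$_name" ; then
        return 0
    fi
    fail_count=$(($fail_count + 1))

    # Remaining arguments come in groups of 3: operator, limit, actions.
    while [ $# -ge 3 ] ; do
        _op="$1" ; _limit="$2" ; _actions="$3" ; shift 3
        if [ "$fail_count" "$_op" "$_limit" ] ; then
            _rc=0
            for _a in $_actions ; do
                case "$_a" in
                verbose)   echo "ERROR: $_name failed RPC check" ;;
                restart*)  echo "would restart $_name ($_a)" ;;
                unhealthy) _rc=1 ;;
                esac
            done
            # Only the first matching check is acted on.
            return $_rc
        fi
    done
    return 0
}
```

Because the loop stops at the first matching group, the larger limit has to be listed before the smaller one when ">=" style checks are used, which is the reverse ordering mentioned above.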
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9b66057964756a6245bafb436eb6106fb6a2866e)
Commit 35a60a63a9b5c7d98dde514ae552239506b691c9 introduced a
regression, reported by "Jonathan Buzzard" <J.Buzzard@dundee.ac.uk>,
as follows:
Basically the use of sed in the following code snippet does not work
for long exports where exportfs wraps the host or network onto the
next line.
exportfs | grep -v '^#' | grep '^/' |
sed -e 's/[[:space:]]*[^[:space:]]*$//' |
ctdb_check_directories
The result is that the you get lots of blank lines being sent to
ctdb_check_directories which causes the host to be marked as
unhealthy and then thrashing sets in of the managed IP's making the
whole cluster unusable.
This tightens up the sed expression so that it is less likely to
produce a spurious empty line. It also removes an unnecessary "grep -v".
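To see why the original expression produces blank lines, the following sketch feeds made-up, wrapped exportfs-style output through it and through one possible tighter variant. The tighter expression is only an illustration of the idea and is not necessarily the one used in this commit.

```shell
# Sample of what exportfs prints when a long path pushes the client
# onto the next line (made-up paths and clients).
sample_exports ()
{
    printf '/short\t\t<world>\n'
    printf '/a/very/long/export/path/name\n'
    printf '\t\t192.168.1.0/24\n'
}

# Original expression: the whitespace is optional, so a line holding
# only a path is matched in full and becomes empty.
strip_client_loose ()
{
    grep '^/' | sed -e 's/[[:space:]]*[^[:space:]]*$//'
}

# Tighter variant: at least one whitespace character must precede the
# client, so a bare path is left alone.
strip_client_tight ()
{
    grep '^/' | sed -e 's/[[:space:]][[:space:]]*[^[:space:]]*$//'
}

# Compare the two:
#   sample_exports | strip_client_loose
#   sample_exports | strip_client_tight
```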
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit bd39b91ad12fd05271a7fced0e6f9d8c4eba92e6)
This means that it now occurs on every reconfigure event. As a result
the ipreallocated event is removed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c45a89418ba733ff91d48340d72bdb6d2ef80051)
* Make this function applicable to "ipreallocated" event too.
* Monitor event should not always succeed just because we reconfigure.
If the service was unhealthy before the reconfigure and we end the
reconfigure with "exit 0" then we can cause the node's health status
to flip-flop.
To avoid this we return the status of the service from the previous
monitor event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 21dfcbbdccd906fcd6ab7bba81418ce565bf63aa)
Samba doesn't need to do anything for configuration changes. It will
notice configuration changes and reload automatically.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit de13350c17261032a7468c2cf4d2cf4a8d66a840)
* Reduce the failure counts so that restart attempts happen sooner.
* Use service_start() and service_stop() for the restart.
ctdb_service_start() resets the failure count, which isn't very
useful in this context.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 01776b9f29af9ad5c8534649ece1bd100e450434)
This should eventually be able to replace ctdb_check_counter_limit()
and ctdb_check_counter_equal(), although it doesn't issue warnings
like the former.
It takes 4 optional arguments:
1. _msg - If "error" then over limit causes an error message and an
exit 1. Anything else fails silently but the function returns 1.
Default is "error".
2. _op - An integer operator supported by test (e.g. -eq, -ge, -gt).
Default is -ge.
3. _limit - Limit for the counter to be used in comparison. Default is
$service_fail_limit.
4. _service_name - Used to identify the counter. Default is
$service_name.
For example:
ctdb_check_counter error -ge 5 foo
will print a message and exit 1 if the counter for foo is >= 5,
whereas
ctdb_check_counter check -ge 5 foo
will just return 1 if the counter for foo is >= 5, and
ctdb_check_counter
will print a message and exit 1 if the counter for $service_name is >=
$service_fail_limit.
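A sketch of how the four optional arguments and their defaults could be handled. The counter storage (one file per service in a temporary directory), the counter_incr() helper and the message text are assumptions made so the example is self-contained.

```shell
# Hypothetical counter storage: one file per service holding an integer.
counter_dir="${TMPDIR:-/tmp}/ctdb-counters.$$"
mkdir -p "$counter_dir"
service_name="demo"
service_fail_limit=3

counter_incr ()
{
    _f="$counter_dir/${1:-$service_name}"
    _c=$(cat "$_f" 2>/dev/null) || _c=0
    echo $(($_c + 1)) >"$_f"
}

check_counter ()
{
    _msg="${1:-error}"
    _op="${2:--ge}"
    _limit="${3:-$service_fail_limit}"
    _name="${4:-$service_name}"

    _c=$(cat "$counter_dir/$_name" 2>/dev/null) || _c=0
    if [ "$_c" "$_op" "$_limit" ] ; then
        if [ "$_msg" = "error" ] ; then
            echo "ERROR: $_c failures for $_name"
            exit 1
        fi
        # Any other value of _msg: fail silently.
        return 1
    fi
    return 0
}
```

Note that the "error" case exits the calling script, so a caller that only wants a status should pass something else (such as "check") as the first argument.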
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5b01b7233515669e995e037205796e265643b176)
When stopping (as opposed to restarting) it is useful to see this
information.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a9ab1937239761dc32b143c9d225447bc6f090b4)
d362be7d32079ac1390d67056ce107bfbca2c937 wasn't well thought out.
Subsequent commits depend on ctdb_counter_init() taking an argument,
so this makes those cases work.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 05a8fcfbac3da2b5843b31e0fe258255cc761190)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0f003f05e28037eefdce3a686fcb52cd2289af9d)
The state directory basename becomes "nfs" rather than "statd". One
line of code is moved from the "startup" event to service_start().
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cc4c5c19af7efe01c48f73bb5ec5e607ed79db4c)
To simplify we also remove the reconfigure from the recovered event
because the monitor event will handle this very quickly anyway.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit da3aedd1a472b430b75989d3c157efedd382e327)
* Add an optional service name argument to existing reconfigure
functions.
* Use function service_reconfigure() instead of variable
$service_reconfigure to specify how a service is reconfigured.
* New function ctdb_service_check_reconfigure() reconfigures a service
if it is flagged for reconfigure.
* Remove $service_reconfigure settings from 40.vsftpd and 41.httpd -
they're the defaults.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 15d4111d0761d82f57d5d4f0b1227812d14e4d7c)
Move flagging of managed or unmanaged services into
ctdb_service_start() and ctdb_service_stop(). That way services will
be correctly flagged if they are started from the startup and shutdown
events.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8675744cbd90b5a5095ed6fff7b36ae82004a457)
service_start is currently a variable. This makes passing arguments
hard. We change it to be a function and put default definitions into
the functions file.
We use a convention that if a service name argument is passed to a
redefined version of service_start() or service_stop() then it will
act unconditionally. If no argument is passed then it can use
internal logic to decide if services should really be started. This
is useful when a single eventscript handles multiple services.
This is a cherry-pick of ae38895 that needed to be reset mid-stream.
There is still some breakage following this commit.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 86e4aefed9fd1028660c98e3ea758c2b75ffc1d8)
This function generates a lot of trace when running under "set -x".
This is due to the backward compatibility code.
This adds 3 optimisations:
1. Before invoking the backward compatibility code,
is_ctdb_managed_service() returns early if the service is listed in
$CTDB_MANAGED_SERVICES.
2. ctdb_compat_managed_service() actually now updates
$CTDB_MANAGED_SERVICES instead of temporary variable $t.
This means that a subsequent call to is_ctdb_managed_service() will
short circuit due to optimisation (1).
3. ctdb_compat_managed_service() only adds a service to
$CTDB_MANAGED_SERVICES if it is the service being checked by
is_ctdb_managed_service().
This stops irrelevant services being added to
$CTDB_MANAGED_SERVICES multiple times by multiple calls to
is_ctdb_managed_service().
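The three optimisations can be sketched as below. The function names are shortened and the argument layout of the compatibility helper is invented for the example; only the variable names come from the text above.

```shell
# Fold a legacy variable into $CTDB_MANAGED_SERVICES, but only for the
# service that is being asked about (optimisations 2 and 3).
# $1 = value of the legacy variable, $2 = service it controls,
# $3 = service being checked
compat_managed_service ()
{
    if [ "$1" = "yes" ] && [ "$2" = "$3" ] ; then
        CTDB_MANAGED_SERVICES="${CTDB_MANAGED_SERVICES:-} $2"
    fi
}

is_managed_service ()
{
    _name="${1:-$service_name}"

    # Optimisation 1: return early if the service is already listed.
    case " ${CTDB_MANAGED_SERVICES:-} " in
    *" $_name "*) return 0 ;;
    esac

    # Backward compatibility: consult the old per-service variables.
    compat_managed_service "${CTDB_MANAGES_SAMBA:-}"   "samba"   "$_name"
    compat_managed_service "${CTDB_MANAGES_WINBIND:-}" "winbind" "$_name"

    case " ${CTDB_MANAGED_SERVICES:-} " in
    *" $_name "*) return 0 ;;
    esac
    return 1
}
```

After the first successful check for a service, later calls for the same service return from the first case statement without touching the compatibility code.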
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 758f4667c60089e09a0439c1eb74f5e426ca5e2e)
Currently it checks $CTDB_MANAGES_WINBIND directly in several places.
This doesn't work when someone sets $CTDB_MANAGED_SERVICES directly.
This modifies check_ctdb_manages_winbind() so that it returns a
condition rather than modifying $CTDB_MANAGES_WINBIND. This makes
some code more readable.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 538902fbc1e74134a03987b36b3733ad641f8971)
Currently it checks $CTDB_MANAGES_SAMBA directly. This doesn't work
when someone sets $CTDB_MANAGED_SERVICES directly.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d8f0f8948abd340088720718fef7dc858661ba23)
When the value of $CTDB_MANAGES_SAMBA or $CTDB_MANAGES_WINBIND changes (or
corresponding changes are made to $CTDB_MANAGED_SERVICES), the
associated service should be started or stopped as necessary.
This adds calls to ctdb_start_stop_service() to manage
starting/stopping samba and winbind.
An associated cleanup is made to the initial checks that one of
$CTDB_MANAGES_SAMBA or $CTDB_MANAGES_WINBIND is set, replacing them
with calls to is_ctdb_managed_service().
To handle the winbind cases ctdb_start_stop_service() and
is_ctdb_managed_service() are updated to take an optional service name
parameter.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Conflicts:
config/events.d/50.samba
Most of this merged elsewhere. This just removes a check that
this is the monitor event.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 257a2e350280c0b76ed2fac588cad167381fda52)
In dash, this fails gracefully with nothing to stderr:
t=$(cat /does_not_exist) 2>/dev/null
In bash the error from cat is still printed due to different order of
evaluation.
This works everywhere:
t=$(cat /does_not_exist 2>/dev/null)
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a6e61867c7a58d5a77cd8641d8df0b105cddff77)
Also remove some unnecessary absolute paths for commands, which were
making the code slightly difficult to read.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1b3f2dd62efb240f8486016fe0f8dfb73d6ccc66)
This also fixes a bug where update_config_from_tdb() used an incorrect
filename in one place.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a5ce2adaa39f077f56582072a97bb64d0eba4b4d)
Without this you can get into a situation where ctdbd can not start.
If the active file for a service exists but the service is not
running, then trying to stop the service may fail, causing the
eventscript to exit from ctdb_start_stop_service().
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 28379ca0f747c5952d690a451834ce7421adfd34)
This includes a comment about using POSIX Bourne shell, including a
suggestion not to use "local" variables.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5ae002c7513b1b2aa5136437a1a19f8cd179b869)
To be used by eventscripts to create a per-service directory for their
own state data. $service_state_dir is set to point to the new
directory.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a273554791c2a5281aee28f8e2be0c514e14c91e)
This was done ad hoc and was badly named.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9a084a121f629b2c1bcefc1e4c4a4a5cacf53987)
The "ip" command is currently run as "/sbin/ip". This makes it
impossible to replace with a stub in unit testing. The functions file
controls $PATH, so we don't need absolute paths.
This replaces the absolute paths...
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5b4c712aab3edc0059f2e5a6730b7fdcf7e5f4ec)
POSIX sh doesn't have local variables. Debian's dash doesn't behave
the same way as bash on this construct:
local var=`command that produces multiple words`
It only assigns the 1st word and may print an error.
Just remove the use of the "local" keyword in monitor_interfaces() to
solve this. It isn't actually limiting the scope of any variables
that are used outside the function.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 95d9a1e19655461288a2c7e52abf9d01ab23e05a)
Another unit testing hook. This is easier than dropping files into
rc.local.d/ and then removing them.
The file has to be executable.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b13ac3bdaf326a6cdfd87da9195eb9630806c418)
Call call_proc(), put the output into a variable and then use it.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 2dfdc997f432d522034922b43cb6f8f878d11ba7)
For eventscript unit testing it will be necessary to override external
commands to allow stub implementations to be used. If absolute paths
aren't used then this can be done using either a fake bin/
subdirectory or by using shell functions.
This removes all of the simple cases of absolute paths.
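As an illustration of the shell-function approach, the sketch below stubs out "ip" for a small function under test. Both interface_has_link() and the stub's output are invented for the example.

```shell
# Code under test calls "ip" by name rather than as /sbin/ip.
interface_has_link ()
{
    ip link show "$1" 2>/dev/null | grep -q 'state UP'
}

# A test can then replace the command with a shell function.
# Stub: eth0 is up, everything else is down.
ip ()
{
    case "$*" in
    "link show eth0")
        echo "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP"
        ;;
    *)
        echo "3: $3: <BROADCAST,MULTICAST> mtu 1500 state DOWN"
        ;;
    esac
}

# The fake bin/ alternative would instead put a stub script first in
# $PATH, for example:  PATH="$PWD/stubs/bin:$PATH"
```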
Signed-off-by: Martin Schwenke <martin@meltin.net>
Conflicts:
config/ctdb.init
config/events.d/50.samba
Keep old code but remove absolute paths.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 05851d50b0078de8bf4691442d718825adca6fe8)
These provide a thin layer around writing and reading files in /proc.
They can be easily replaced by stubs for unit testing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 637f9d8af517b73c72ed8f3cc2a2661f11eb2126)
These haven't been used for a long time.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f5fd361cadb3ea18d29e2d7215a7853718e48d00)
* $CTDB_ETCDIR defaults to /etc but can be changed for testing. All
hard-coded instances of /etc have been changed to $CTDB_ETCDIR.
This includes references to /etc/init.d and /etc/sysconfig.
* service() and nice_service() functions now call new function
_service(). This makes it easier to override these functions (say,
in rc.local) for testing and call most of the existing functionality
using _service().
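The wrapper arrangement might look roughly like this. It is a simplified sketch: the real _service() may handle more cases than an init script under $CTDB_ETCDIR/init.d.

```shell
CTDB_ETCDIR="${CTDB_ETCDIR:-/etc}"

# Common implementation; $_nice is set by the wrappers below.
_service ()
{
    _service_name="$1"
    _op="$2"
    $_nice "$CTDB_ETCDIR/init.d/$_service_name" "$_op"
}

service ()
{
    _nice=""
    _service "$@"
}

nice_service ()
{
    _nice="nice"
    _service "$@"
}

# A test can redefine service() (say, in rc.local) and still reach the
# original behaviour through _service() when it needs to.
```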
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f43c9a7604b779bb6257ddb2bf3cbe266d496a63)
This will be needed when eventscripts that use it are called
externally.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ebd53b66b0cc66d9d04830781886234167fc2164)
If the last IP address on an interface is removed then that
interface should no longer be checked by 10.interfaces. However,
"ctdb ifaces" still lists such interfaces so they are currently
checked.
The problem really needs to be addressed in ctdbd but a neat quick
eventscript fix will be minimally invasive...
This changes the code to use "ctdb -Y ip -v" instead of "ctdb -Y
ifaces". The former includes details of all public addresses and
associated interfaces, so when an address is removed there is no
output for it. This avoids orphaned interfaces from being listed.
The logic is also slightly improved so that $IFACES includes just a
(non-uniquified) list of interfaces, allowing an existing loop to be
removed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 49b2d1bd9554461ed8edbfc21e777c0eca9e1443)
Adding/removing IP addresses might cause routes to be dropped by the system.
The easiest workaround for this is to unconditionally try to reapply
all static routes for all interfaces once ipreallocation has finished,
not just adding them back on the affected interface.
This works around a funky issue in
CQ S1023538
(This used to be ctdb commit 84600d1f53632d5fe76c308727f31f61b5ec1010)
in use by public addresses. this can happen when we have removed existing interfaces/ip addresses and prevents us from verifying the status of other interfaces
(This used to be ctdb commit d67955b42f7627be9dae995230c8fcbb8a948ec2)
script if/when we have for example NATGW configured but no public addresses defined on that interface
CQ S1023378
(This used to be ctdb commit 8837daa424732aeb5a20814b1709c345a97a0e09)
We cannot just check whether MII Status is up for bonding mode 4, since the kernel will always report the bond device as UP
even if all cables are disconnected.
For mode 4, ignore the status of the bond device and instead check whether at least one slave interface is up
when determining if the device is good or bad.
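A rough sketch of such a check, reading a bonding status file in the usual /proc/net/bonding/<iface> layout. The function name and the exact patterns are assumptions and are not taken from the eventscript.

```shell
# $1 = path to a bonding status file (normally /proc/net/bonding/<iface>)
bond_is_up ()
{
    if grep -q '^Bonding Mode: IEEE 802.3ad' "$1" ; then
        # Mode 4: ignore the bond's own status, look only at slaves.
        sed -n -e '/^Slave Interface:/,$p' "$1" |
            grep -q '^MII Status: up'
    else
        # Other modes: the first MII Status line is the bond itself.
        grep '^MII Status:' "$1" | head -n 1 | grep -q 'up'
    fi
}

# Example:
#   bond_is_up /proc/net/bonding/bond0 || echo "ERROR: bond0 is down"
```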
(This used to be ctdb commit a6930cec6d9503dba18b9d4839d87a1c1a8ddba2)
Simplify the handling of setting the links in the 10.interface eventscript
and remove the optimization to only call setifacelink on state change
to make the code simpler to read.
If a take ip event fails, flag the node as unhealthy.
Add a check to the interface script to check if the interface exists
or if it has been deleted,
so that we can detect this and become UNHEALTHY if someone deletes an interface
we are using to host public addresses.
(This used to be ctdb commit 4ab63d2a7262aff30d5eced184c294c9c9dd4974)
* continous -> continuous
* activete -> activate
(thanks to lintian)
See https://bugzilla.samba.org/show_bug.cgi?id=6935
Signed-off-by: Michael Adam <obnox@samba.org>
(This used to be ctdb commit fb6987c2f747d6dbf9bb3899a480124d1c242a90)
Don't update the statd settings that often.
When we have very many nodes and very many IPs, this would generate
a lot of unnecessary load on the system.
(This used to be ctdb commit 0c030c9384500f340d8382c20e1e91b11aa377e9)
We were potentially leaving a node unable to serve requests for too
long.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 5be8610ffa33db49e33949560d0ef2fa5f3c0c73)
This was defaulting to just "service nfs restart", which doesn't have
the workarounds we need.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0f462e9e9fe12b595f3c7452123db8e69548abd6)
Otherwise we might short-circuit events that are run only once and
actually need to do something.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c4f9e8a43540bc049b2771e0a2d76d37b9d17331)
Otherwise there can be strange error messages from services
stopping/starting, without any context.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8bcf7ab164429ddc0ae530133e114f186a8146dd)
"service nfs restart" can fail. To stop nfsd it sends a SIGINT and
nfsd might take a while to process it if the system is loaded.
Starting nfsd may then fail because resources are still in use.
This does some /proc magic to tell nfsd to do no more processing. It
then runs service stop, kills nfsd with SIGKILL, and then runs service
start. This is much less likely to fail.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a9bf4f82852975b0b627f61ceb2d23401f630805)
availability at all (since we can't restart it, there is no point checking
if it is alive)
(This used to be ctdb commit 6075e85ba6c0f58fd1ab2ce3b09dd3d6ff491365)
Httpd can be very slow to start on some platforms,
wait 5 monitor intervals before we try to restart it if
it has not bound to port 80 yet.
After 10 failed intervals, flag the node as unhealthy.
(This used to be ctdb commit 6ec1993aa5f2778b8227ce5f6eca0d19e4ae9788)
Try to restart LOCKD after 10 failures and
flag the node as unhealthy after 15 failures
(This used to be ctdb commit 5a67889c9166835aef3443051812d14af07dfca5)
Net serverid wipe can take a bit of time sometimes so background it.
Only perform auto start/stop of the managed service on the monitor event
(This used to be ctdb commit deba5cbbf7703a1a24ce88a06c73fca056e05521)
make changes to ctdb event scripts to support NFS-Ganesha.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
(This used to be ctdb commit 7298588ed54492f106954c893dd86b0a36783470)
We put the ip on loopback just to make sure we would still interoperate with
non-standard configurations on unix-KDC, that are configured to verify the optional
HostAddresses field.
This is not required for AD, since AD does not use this field, and is replaced in
unix land with other/better mechanisms than this "dodgy" check.
This makes it "easier" for applications that have bound to the natgw address
to detect a socket problem and try to reconnect/recover if the ip address
is completely missing from the system.
At the same time, use the winbind-specific hook that exists to explicitly tell winbindd: this address is gone, so if you have bound to it, this is a good time to close and rebind your socket.
cq 1020333
(This used to be ctdb commit 0da94869d2912b2a412ba3fbd2137d88ce4e4389)
ctdb_service_start() currently succeeds if ctdb_counter_init()
succeeds.
This changes it to fail when a service start fails.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ddb73962d72d933bf0edc28be0dbb45bea7e5ef4)
When the value of $CTDB_MANAGES_SAMBA or $CTDB_MANAGES_WINBIND changes (or
corresponding changes are made to $CTDB_MANAGED_SERVICES), the
associated service should be started or stopped as necessary.
This adds calls to ctdb_start_stop_service() to manage
starting/stopping samba and winbind.
An associated cleanup is made to the initial checks that one of
$CTDB_MANAGES_SAMBA or $CTDB_MANAGES_WINBIND is set, replacing them
with calls to is_ctdb_managed_service().
To handle the winbind cases ctdb_start_stop_service() and
is_ctdb_managed_service() are updated to take an optional service name
parameter.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d98f175e8420d921a123ae9c0ce00945350b1537)
update nfs to try to restart the service after 10 consecutive failures
and to flag the node unhealthy after 15
add similar function to mountd
(This used to be ctdb commit 1569a54bb82fc433895ed68f816cf48399ad9d40)
Rename loadconfig() to _loadconfig(). Add a new loadconfig() that
simply calls _loadconfig().
This makes it easy for the test suite to override loadconfig().
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1d77a3adfff893b3c01b87f791e72c0d3148425c)
These failures are sometimes the result of slow restarts so we want to
avoid dirtying the logs or marking a node unhealthy because of them,
unless they are excessive.
For these 2 cases we use the existing fail counting code but hack a
temporary service_name in a subshell to allow separate fail counts.
We also update ctdb_check_rpc() so that it captures the error output
from rpcinfo and we add a message including the service name to the
beginning. The error is printed to stdout but is also stored in
ctdb_check_rpc_out to allow it to be conditionally used by the caller.
This function also now returns non-zero rather than exiting on
failure.
Other direct rpcinfo calls are replaced by calls to ctdb_check_rpc()
for consistency.
Option handling code for service restarts is cleaned up so that it fits
in 80 columns. A more informative restart message is now used in all
cases, printing the exact command being used to start a service.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 79c25fe241cf5d8f92e23d3736823ebaf4e1769d)
on any kind of tiny unexpected error
unconditionally try to remove ip addresses from both old and new interface
before trying to add it to the new interface to make it less
fragile
(This used to be ctdb commit 80acca2c91c9053c799365bae918db7ed8bdc56f)
this stops the script from failing with an error if
both interfaces are specified as the same, which otherwise breaks and leads to an infinite recovery loop
(This used to be ctdb commit 565de03a784ed441490f8cd0b137b5cec8716d55)
CTDB can also be configured not to check whether knfsd is alive.
In that situation, no attempt will be made to restart nfs, and since nfs is not running, lockd cannot be restarted either.
To work around this, every time we try to restart the lock manager, also try to restart nfsd.
(This used to be ctdb commit 953dbfbddad656a64e30a6aca115cb1479d11573)
even if we are not currently the natgw master.
This adds extra reliability in case we have stopped previously without removing it properly,
but does add spam messages to syslog every time we shut down.
Stop these spam messages from polluting the syslog upon normal shutdown.
(This used to be ctdb commit cd84da6f247ee46bbab8318298d1cd3cfc87aba9)
Normally, the config.tdb database would not exist, so we do not need
to spam syslog with a "config.tdb does not exist" message every time we start ctdb
(This used to be ctdb commit 5792809b72e534161c5ca9ef5c9897abcb3b899c)
network connectivity outside of the cluster to still be able to
participate in a natgw group.
These nodes can not become natgw master since they lack external network
connectivity.
These nodes are configured just the same way as for any other node with
NATGW, with the following two exceptions :
* we do NOT set CTDB_NATGW_PUBLIC_IFACE at all on these nodes.
since these nodes lack external network we should not check the interface
for link.
* we must set CTDB_NATGW_SLAVE_ONLY=yes to flag that this is a node that
can not become natgw master.
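As a configuration sketch, the two exceptions would appear in such a node's CTDB configuration file roughly as follows. Only the two variables named above are shown; the rest of the NATGW settings are assumed to be the same as on the other nodes.

```shell
# NATGW settings on a node without external connectivity (sketch).

# This node can never become natgw master.
CTDB_NATGW_SLAVE_ONLY=yes

# CTDB_NATGW_PUBLIC_IFACE is deliberately NOT set on this node, so no
# link check is attempted on an external interface it does not have.
```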
(This used to be ctdb commit ab7b00a37e55beffc074be95b55d8a5c7cb9eef2)
since that will usually be /etc/ctdb/state and storing this under /etc is just
wrong.
Add a new variable CTDB_VARDIR that defaults to /var/ctdb and store the data there instead.
(This used to be ctdb commit 516423c25afa9861d9988096efa8a4a2b12b31b1)
the clusterwide persistent data associated with the lock manager and
statd notifications.
Use persistent databases to store this data instead of a shared directory.
(This used to be ctdb commit fc0678d351187cfa4c71123f97c0f493aacd5d16)
This is called every time a reallocation is performed.
While STARTRECOVERY/RECOVERED events are only called when
we do ipreallocation as part of a full database/cluster recovery,
this new event can be used to trigger on when we just do a light
failover due to a node becoming unhealthy.
I.e. situations where we do a failover but we do not perform a full
cluster recovery.
Use this to trigger for natgw so we select a new natgw master node
when failover happens and not just when cluster rebuilds happen.
(This used to be ctdb commit 7f4c591388adae20e98984001385cba26598ec67)
This adds a new function update_tickles() that tracks tickles for a
given port using the new ctdb addtickle/deltickle commands. This
function is used in events.d/60.nfs to handle NFS tickles.
events.d/61.nfstickle is removed. The
/proc/sys/net/ipv4/tcp_tw_recycle setup is also moved to
events.d/60.nfs.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit dca4c4ebf3c35f8db3ae208efb7a83abbf726ed6)
This database can be used, as an option, to store
the public address assignment instead of editing the /etc/ctdb/public-addresses file manually.
This configuration is stored in one record per key, with a key-name of
public-addresses:node#<pnn>
where <pnn> is the node number.
The content of this record is the same syntax as the /etc/ctdb/public-addresses file.
When ctdbd starts, if this key exists and contains data, it is extracted from the database and compared with the normal file /etc/ctdb/public-addresses.
If the content differs, the config database "wins" and is used to overwrite/update the /etc/ctdb/public-addresses file, after which ctdbd is restarted.
The main benefit with this option is that it can be used to update the public address configuration for nodes that are offline/unreachable by updating their configuration in the persistent database.
Once the offline node is available again, it will resync its databases with the rest of the cluster, find out that the config has changed, apply the changes and restart ctdbd automatically.
The command to store the public address configuration for a node into the persistent database is:
ctdb pstore config.tdb public-addresses:node#<pnn> <filename>
where <pnn> is the node# we wish to update the config for, and <filename> is a file containing the new content for that node's public address configuration.
(This used to be ctdb commit 292d7435a360efd7f15a7a99f658a605e07c0a81)
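The "config database wins" rule from the commit above can be sketched as a comparison between two files. This is an illustration only: sync_public_addresses is a hypothetical helper, and fetching the record from config.tdb into a file is assumed to have happened already.

```shell
# Hypothetical helper.
# $1: file holding the record fetched from the config database
# $2: the local public-addresses file
# Prints "restart" when the local file was overwritten and
# "unchanged" otherwise.
sync_public_addresses ()
{
    _db_copy="$1"
    _local="$2"

    # A missing or empty record means there is nothing to apply.
    if [ ! -s "$_db_copy" ] ; then
        echo "unchanged"
        return 0
    fi

    if cmp -s "$_db_copy" "$_local" ; then
        echo "unchanged"
    else
        # The content differs, so the database copy replaces the file.
        cp "$_db_copy" "$_local" && echo "restart"
    fi
}
```

A caller would restart ctdbd only when "restart" is printed.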
sometimes (very rarely) fails to restart the service.
Add a function to restart NFSd on SLES and RHEL-like systems.
If we detect the system is unhealthy due to kNFSd not running,
try to restart the service again with "service nfs restart" and
hope for the best.
CQ1019372
(This used to be ctdb commit 25c4ce7e919f13226219f036bcffd2be76b2f06c)
The existing code wasn't working as designed in the start event. It
should work here.
BZ: 62613
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit aeb70c7e7822854eb87873a5c7783e27e6e72318)
Currently we do a "sleep 1" after starting and before running
set_ctdb_variables to set the tunables. This is too arbitrary and
might fail if the system is heavily loaded. This, for example, could
result in some nodes running with DeterministicIPs and some without,
in which case a different IP allocation algorithm would run depending
on who is the recmaster!
This makes the start function wait until "ctdb ping" succeeds (with a 10
second timeout) before trying to run set_ctdb_variables. If a timeout
occurs then the start function attempts to kill ctdbd before exiting
with a failure.
It also cleans up the status reporting code for Red Hat and SUSE so
that the final status code is reported. Currently there are cases
where a correct status is prematurely reported before a failure
occurs.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cdcd05662a30b51caaeeab4ac44138cac2474e0a)
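Replacing the fixed "sleep 1" with a bounded wait, as the commit above describes, could be sketched like this. wait_until is a hypothetical helper, not necessarily the function used in the init script.

```shell
# Hypothetical helper: retry a command once per second until it
# succeeds or the given number of attempts has failed.
# Usage: wait_until <attempts> <command> [<arg>...]
wait_until ()
{
    _attempts="$1" ; shift
    _count=0
    while ! "$@" ; do
        _count=$((_count + 1))
        if [ "$_count" -ge "$_attempts" ] ; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Intended use in a start function (not executed here):
#   wait_until 10 ctdb ping >/dev/null 2>&1 || <kill ctdbd and fail>
```

Polling for readiness means the tunables are only set once the daemon answers, however loaded the system is.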
Currently the file for each IP address is reopened to append the
details of each source socket.
This optimisation puts all the logic into awk, including the matching
of output lines from netstat. The source sockets for each
destination IP are written into an array entry and then each array
entry is written to the corresponding file in a single operation.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 6549e9b01538998d51a5f72bfc569776d232b024)
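The awk approach from the commit above, collecting the source sockets per destination IP in an array and writing each entry once, might look roughly like this. The input layout (six "netstat -tn" style columns), the IPv4-only address split and the function name write_tickles are assumptions made for this sketch.

```shell
# Hypothetical helper.
# $1: output directory, $2: local port of interest
# stdin: lines of the form
#   tcp <recv-q> <send-q> <local-ip:port> <remote-ip:port> ESTABLISHED
write_tickles ()
{
    awk -v dir="$1" -v port="$2" '
        $1 == "tcp" && $6 == "ESTABLISHED" {
            # IPv4 only: split "ip:port" into its two parts.
            n = split($4, dst, ":")
            if (n != 2 || dst[2] != port) next
            # Append this source socket to the entry for its
            # destination IP.
            out[dst[1]] = out[dst[1]] $5 "\n"
        }
        END {
            # One write per destination IP instead of one per socket.
            for (ip in out) {
                file = dir "/" ip
                printf "%s", out[ip] > file
                close(file)
            }
        }'
}
```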
For non bash shells $_s_script might end with '/*'.
We do the workaround this way, because it makes sense to check
that a script is executable, before trying to execute it.
metze
[ This actually applies to any shell -- Rusty Russell ]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e665cfde03fc9ec2264e99512ed5470872a2fd04)
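The problem in the commit above is that a glob matching nothing is left as the literal pattern, so the loop variable ends in '/*'. A minimal sketch of the executable check, with run_scripts as a hypothetical helper name:

```shell
# Hypothetical helper: run every executable regular file in a directory.
run_scripts ()
{
    for _s_script in "$1"/* ; do
        # An unmatched glob stays as the literal ".../*". That is not
        # an executable file, so the same test skips it too.
        if [ ! -f "$_s_script" ] || [ ! -x "$_s_script" ] ; then
            continue
        fi
        "$_s_script"
    done
}
```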
When doing a releaseip event, we do them in parallel for all the separate
IPs. This creates a problem for iptables, which isn't reentrant, giving
the strange message:
iptables encountered unknown error "18446744073709551615" while initializing table "filter"
The worst possible symptom of this is that releaseip won't remove the rule
which prevents us listening to clients during releaseip, and the node will be
healthy but non-responsive.
The simple workaround is to flock-wrap iptables. Better would be to rework
the code so we didn't need to use iptables in these paths.
CQ:S1018353
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 72d6914ee913272312d7b68f1be5ad05ad06587d)
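The flock workaround from the commit above can be sketched as a shell function that shadows the iptables command. The lock file path and the with_lock helper are assumptions for illustration. flock(1) from util-linux is used in its "flock <file> <command>" form.

```shell
# Assumed lock file location; any path writable by the caller works.
IPTABLES_LOCK="${IPTABLES_LOCK:-/var/lock/iptables.lock}"

# Hypothetical helper: run a command while holding an exclusive lock.
with_lock ()
{
    flock -x "$IPTABLES_LOCK" "$@"
}

# Shadow the command so that parallel event scripts take turns.
iptables ()
{
    with_lock /sbin/iptables "$@"
}
```

Callers keep writing plain "iptables ..." and are serialised without further changes.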
same ip address as a normal public-address,
check for this in the natgw script and warn the user.
Also prevent ctdb from starting up since this configuration will not work.
BZ60933
(This used to be ctdb commit 480af69b63b9162c85d8e04461ca9e4a083c04a4)
If the driver is virtio_net then we assume that the link is up rather
than ignoring the check altogether.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 3044d07da2a58260fa06bf489890b279bcf3ec39)
Skip the link test for this type of device
Signed-off-by: Ralph Wuerthner <ralph.wuerthner@de.ibm.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 2ea0a9f1a93781a0d036feb9fcc0d120b182922f)
As the basename of the script will be used for the readd script
from setup_iface_ip_readd_script, it's now easier to identify
which script is called by delete_ip_from_iface() while readding
ips to the interface.
metze
(This used to be ctdb commit 3ee225b0b6ed37c22478bd145ced56b1b9b86842)
This is needed because we need to set up the routing table again when
the delete_ip_from_iface() function readds the ip to the interface.
metze
(This used to be ctdb commit ea87185ec9977006ef72d5a68c875154e4c84099)
This combines the logic into a shell function which can be used by the
"takeip" and "updateip" hooks.
We check the return values of the "ip" commands now
instead of ignoring them.
We now create a setup_script.sh similar to the release_script.sh
which makes it easier to analyze problems.
metze
(This used to be ctdb commit 624e8878851b4957cc7c02e922ec86926d6927ee)
This also initializes the variables correctly for the
shutdown|removenatgw code path to delete_all.
metze
(This used to be ctdb commit 2c2cbed4fcbc868a990fa6b32fc96126ffc61bb5)
This adds a generic infrastructure to register scripts which will
be called when the delete_ip_from_iface() function needs to readd
secondary ips to an interface.
metze
(This used to be ctdb commit ac97d65f44e8dc8bf2ec8f68e4db3448521755a2)
to control whether or not to check if we are swapping, and produce
useful output into the logfile if we are.
For production systems with dedicated nas-heads we should never swap.
But for developer/test systems we often use smaller nondedicated systems where
we can no longer guarantee that we will not be using swap.
(This used to be ctdb commit db87849bf3380914a63a626412bec209dbea7d20)
We should never enter swap; if we do, show the memory state of the machine and the process list. This will help us diagnose what caused the condition before it's too late and the box starts OOM-killing processes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 627a6d67a0e9e61f8713e62695b3518c51909230)
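The swap check described in the two commits above could be sketched as follows. The helper names are hypothetical and the real script may detect swapping differently. This version reads SwapTotal and SwapFree from /proc/meminfo.

```shell
# Hypothetical helper: given /proc/meminfo on stdin, print the amount
# of swap in use in kB.
swap_used_kb ()
{
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }'
}

# Hypothetical check: log memory state and process list when swapping.
check_swap ()
{
    _used=$(swap_used_kb < /proc/meminfo)
    if [ "$_used" -gt 0 ] ; then
        echo "WARNING: ${_used} kB of swap in use"
        cat /proc/meminfo
        ps auxfww
    fi
}
```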
If "$1" was empty then loadconfig would load the ctdb config twice.
This stops that from happening.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 0406d406da70aaee7ad6aac236114905c5d03ed2)
Proper fix for 085d1bea78fabf754ef6dd6d323f74a1d361e45c's workaround.
$NFS_TICKLE_SHARED_DIRECTORY was being used before it is set via
loadconfig.
Ronnie actually spotted this one. :-)
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ee8b2e298351d05197a2e1494f3331433644c1e6)
Also, change the order of the comparison so it is consistent with
others in the script.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 44696e15cdb23e7656d3bb0ead54f509495738a7)
This puts single quotes around everything and uses eval on the
command-lines that actually start ctdbd. The eval causes the single
quotes to be interpreted.
The "redhat" init style no longer uses the Red Hat daemon function.
It loses the quoting and re-splits on spaces. Instead we add an extra
line that uses the success/failure functions to keep things pretty.
Note that this means that we don't respect daemon's
$DAEMON_COREFILE_LIMIT variable but we do our own core file handling
with $CTDB_SUPPRESS_COREFILE anyway. daemon's core file handling was
probably overriding what we were doing anyway, so this can be regarded
as a bug fix.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 522fbb012524fe41a67dbe43589a282dda6bcbe2)
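The effect of the single quotes plus eval described in the commit above can be seen with a small stand-in command. count_args and the option names are made up for this illustration and are not ctdbd options.

```shell
# Hypothetical stand-in for the daemon: print how many arguments arrive.
count_args ()
{
    echo "$#"
}

# Option string built with single quotes around each value.
opts="--opt-a='two words' --opt-b='x'"

# Plain expansion re-splits on spaces and keeps the quote characters.
count_args $opts          # prints 3

# eval causes the single quotes to be interpreted.
eval count_args "$opts"   # prints 2
```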
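This is very useful for testing. I use a script like this:
cat ~/bin/ethtool
#!/bin/sh
IFACE=$1
case "$IFACE" in
    # Interfaces matched here fall through and are forced down below.
    Neth2)
        ;;
    Neth3)
        ;;
    Neth4)
        ;;
    Neth5)
        ;;
    *)
        # Everything else goes straight to the real ethtool.
        exec /usr/sbin/ethtool "$@"
        ;;
esac
ip link set down "$IFACE"
exec /usr/sbin/ethtool "$@"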
metze
(This used to be ctdb commit 3bab985cf615720eded4d47b4f9f37a9c28840aa)
With this option set to "yes", we don't become unhealthy
as long as at least one interface is still available.
metze
(This used to be ctdb commit d054eb33c6ae92560cddb40732e5dcf622591a3c)
With this script it's possible to generate routing tables
per public ip address.
metze
(This used to be ctdb commit ff5678fbec2daef461143acf00cef3f94d7655fc)
When two releaseip events run in parallel it's possible that the 2nd script
readds a secondary ip that was removed by the 1st script.
metze
(This used to be ctdb commit e02417b2a55c45ac2c125b1b3463c9c39e7bc07a)