mirror of https://github.com/samba-team/samba.git synced 2024-12-24 21:34:56 +03:00
Commit Graph

505 Commits

Author SHA1 Message Date
Martin Schwenke
5d655ac6f2 ctdb-recoverd: Only check for LMASTER nodes in the VNN map
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14085

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-08-21 11:50:30 +00:00
Martin Schwenke
6fe963c3f7 ctdb-recoverd: Periodically log recovery master of incomplete cluster
Only do this if the recovery lock is unset.  Log every minute for the
first 10 minutes, then every 10 minutes, then every hour.

This is useful for determining whether a split brain occurred.  It is
particularly useful if logging failed or was throttled at startup, so
there is no evidence of the split brain when it began.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-07-26 03:34:16 +00:00
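The logging schedule described in 6fe963c3f7 (every minute for the first 10 minutes, then every 10 minutes, then every hour) can be sketched roughly as below. This is not the code from the commit: the names `example_log_interval` and `example_should_log` are hypothetical, and the switch from 10-minute to hourly logging at the one-hour mark is an assumption, since the commit message does not say when that change happens.

```c
#include <stdbool.h>

#define EXAMPLE_MINUTE 60u
#define EXAMPLE_HOUR   3600u

/*
 * Hypothetical helper: return the gap (in seconds) to wait before the
 * next log message, given how long the cluster has been incomplete.
 * The one-hour threshold for switching to hourly logging is an
 * assumption, not something stated in the commit message.
 */
unsigned int example_log_interval(unsigned int elapsed_secs)
{
	if (elapsed_secs < 10 * EXAMPLE_MINUTE) {
		return EXAMPLE_MINUTE;        /* first 10 minutes: every minute */
	}
	if (elapsed_secs < EXAMPLE_HOUR) {
		return 10 * EXAMPLE_MINUTE;   /* then: every 10 minutes */
	}
	return EXAMPLE_HOUR;                  /* afterwards: every hour */
}

/*
 * Hypothetical helper: decide whether a message is due now.  Both
 * arguments are seconds since the cluster was first seen incomplete.
 */
bool example_should_log(unsigned int elapsed_secs, unsigned int last_log_secs)
{
	if (elapsed_secs < last_log_secs) {
		return false;                 /* guard against unsigned wrap */
	}
	return elapsed_secs - last_log_secs >=
	       example_log_interval(last_log_secs);
}
```

A periodic check in a monitoring loop could call `example_should_log()` with the current elapsed time and the time of the previous message, and skip the whole check when a recovery lock is configured.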
Martin Schwenke
f2559ef8ce ctdb-recoverd: Log the master at the end of elections
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-07-26 03:34:16 +00:00
Martin Schwenke
35368d871d ctdb-recovery: Avoid -1 as a PNN, use CTDB_UNKNOWN_PNN instead
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-06-05 10:25:50 +00:00
Martin Schwenke
978c7dbd55 ctdb-recovery: Fix signed/unsigned comparison by casting
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-06-05 10:25:50 +00:00
Martin Schwenke
fa7bd35b6a ctdb-recovery: Fix signed/unsigned comparisons by declaring as unsigned
Simple cases where variables need to be declared as an unsigned type
instead of an int.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-06-05 10:25:50 +00:00
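The kind of change described in fa7bd35b6a can be illustrated with a small, self-contained sketch. It is not taken from the commit: `example_pnn_in_map` is a hypothetical function, and the only point is that the loop index has the same unsigned type as the size it is compared against, so no signed/unsigned comparison occurs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical example: search an array of node numbers.  The index is
 * declared as uint32_t (not int) so that "i < size" compares two
 * unsigned values and compilers have nothing to warn about.
 */
bool example_pnn_in_map(const uint32_t *map, uint32_t size, uint32_t pnn)
{
	uint32_t i;   /* was "int i" in the problematic pattern */

	for (i = 0; i < size; i++) {
		if (map[i] == pnn) {
			return true;
		}
	}
	return false;
}
```

With an `int` index, compilers that enable `-Wsign-compare` would typically flag the `i < size` comparison; declaring the index as unsigned avoids the warning without a cast.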
Martin Schwenke
6a2941e2a9 ctdb-recoverd: Fix memory leak
state is always freed before exiting this function, so allocate fde
off it instead of long-lived ctdb context.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13943

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-05-14 07:25:37 +00:00
Martin Schwenke
13a1a48089 ctdb-recoverd: Time out attempt to take recovery lock after 120s
Currently this will wait forever.  It really needs a timeout in case
the cluster filesystem (or other lock mechanism) is completely wedged.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13800

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-02-25 02:12:17 +01:00
Martin Schwenke
45a77d65b2 ctdb-recoverd: Ban node on unknown error when taking recovery lock
We really shouldn't see unknown errors.  They probably represent a
misconfigured recovery lock or similar.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13800

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-02-25 02:12:17 +01:00
Martin Schwenke
c0fb62ed39 ctdb-recoverd: Make recoverd context available in recovery lock handle
BUG: https://bugzilla.samba.org/show_bug.cgi?id=13800

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-02-25 02:12:16 +01:00
Martin Schwenke
7e4aae6943 ctdb-recoverd: Clean up logging on failure to take recovery lock
Add an explicit case for a timeout and clean up the other messages.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13800

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-02-25 02:12:16 +01:00
Martin Schwenke
621658cbed ctdb-recoverd: Free cluster mutex handler on failure to take lock
If nested events occur while the file descriptor handler is still
active then chaos can ensue.  For example, if a node is banned and the
lock is explicitly cancelled (e.g. due to election loss) then
double-talloc-free()s abound.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13800

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-02-25 02:12:16 +01:00
Martin Schwenke
da8aaf2aee ctdb-recoverd: Call an election when the recovery lock is lost
The lock may have been lost due to a failure in the underlying locking
mechanism.  This could be due to quorum loss or similar.  It is best
to call an election to confirm that this node should still be master.
At worst, the node will reelect itself, fail to take the lock and then
ban itself.  This is a suitable outcome for a node that has been
partitioned from others in the cluster.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-12-18 02:02:03 +01:00
Andreas Schneider
2d512b278e debug: Use debuglevel_(get|set) function
Signed-off-by: Andreas Schneider <asn@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>

Autobuild-User(master): Andreas Schneider <asn@cryptomilk.org>
Autobuild-Date(master): Thu Nov  8 11:03:11 CET 2018 on sn-devel-144
2018-11-08 11:03:11 +01:00
Martin Schwenke
486022ef8f ctdb-recoverd: Set recovery lock handle at start of attempt
This allows the attempt to be cancelled if an election is lost and an
unlock is done before the attempt is completed.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Tue Sep 18 02:18:30 CEST 2018 on sn-devel-144
2018-09-18 02:18:30 +02:00
Martin Schwenke
b1dc568784 ctdb-recoverd: Handle cancellation when releasing recovery lock
If the recovery lock is in the process of being taken then free the
cluster mutex handle but leave the recovery lock handle in place.
This allows ctdb_recovery_lock() to fail.

Note that this isn't yet live because rec->recovery_lock_handle is
still only set at the completion of the attempt to take the lock.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
a755d060c1 ctdb-recoverd: Return early when the recovery lock is not held
This makes upcoming changes simpler.

Update to modern debug macro while touching relevant line.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
c52216740b ctdb-recoverd: Store recovery lock handle
... not just cluster mutex handle.

This makes the recovery lock handle long-lived and will allow the
releasing code to cancel an in-progress attempt to take the recovery
lock.

The cluster mutex handle is now allocated off the recovery lock
handle.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
a53b264aee ctdb-recoverd: Use talloc() to allocate recovery lock handle
At the moment this is still local and is freed after the mutex is
successfully taken.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
af22f03dbe ctdb-recoverd: Rename hold_reclock_state to ctdb_recovery_lock_handle
This will be a longer lived structure.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
c516e58ce9 ctdb-recoverd: Re-check master on failure to take recovery lock
If the master changed while trying to take the lock then fail gracefully.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
59fc01646c ctdb-recoverd: Clean up taking of recovery lock
No functional changes, just coding style cleanups and debug message
tweaks.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-09-17 22:58:20 +02:00
Martin Schwenke
929634126a ctdb-config: Switch tunable DisableIPFailover to a config option
Use the "failover:disabled" option instead.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13589

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-08-24 10:59:21 +02:00
Martin Schwenke
914e9f22d8 ctdb-daemon: Pass DisableIPFailover tunable via environment variable
Preparation for obsoleting this tunable.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13589

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-08-24 10:59:21 +02:00
Martin Schwenke
b318cf22ba ctdb-recoverd: Set the process name correctly
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-07-02 08:51:22 +02:00
Martin Schwenke
57834c64be ctdb-common: Rename system utility files
system_socket.[ch] will contain all the raw socket code and other
functions that use ctdb_sock_addr.  system.[ch] will contain other
platform dependent functions.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-07-02 08:51:20 +02:00
Amitay Isaacs
6e588913dd ctdb-recoverd: Abort recovery/takeover if recmaster changes
Recovery and takeover are run via helper from recovery daemon.  While the
helpers are running, it's possible for the current node to lose election.
If that happens, abort the currently running recovery/takeover helper.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-09-12 12:23:19 +02:00
Amitay Isaacs
1f7f112317 ctdb-client: Fix ctdb_attach() to use database flags
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12978

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Fri Aug 25 13:32:58 CEST 2017 on sn-devel-144
2017-08-25 13:32:58 +02:00
Amitay Isaacs
9987fe7209 ctdb-client: Optionally return database id from ctdb_ctrl_createdb()
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12978

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-08-25 09:41:26 +02:00
Amitay Isaacs
4bd0a20a75 ctdb-client: Fix ctdb_ctrl_createdb() to use database flags
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12978

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-08-25 09:41:25 +02:00
Amitay Isaacs
ea91967b0d ctdb-client: Drop tdb_flags argument to ctdb_attach()
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-26 15:47:24 +02:00
Amitay Isaacs
ea46699b27 ctdb-recovery: Do not run local ip verification when in recovery
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12857

If we drop public IPs because CTDB is in recovery for too long, then
avoid spamming logs "Trigger takeoverrun" every second.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-24 10:28:21 +02:00
Amitay Isaacs
2fd2ccd4c8 ctdb-recovery: Get recmode unconditionally in the main_loop
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12857

This can be used later in the main_loop to avoid the local ip check.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-24 10:28:21 +02:00
Chris Lamb
f7dc9f1e12 Correct "supressed" typo.
Signed-off-by: Chris Lamb <chris@chris-lamb.co.uk>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
Reviewed-by: Garming Sam <garming@catalyst.net.nz>
2017-02-22 08:26:21 +01:00
Martin Schwenke
f2485d3ab9 ctdb-recoverd: Integrate takeover helper
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-12-19 04:07:08 +01:00
Martin Schwenke
5b60414265 ctdb-recoverd: Generalise helper state, handler and launching
These can also be used for takeover handler.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-12-19 04:07:08 +01:00
Amitay Isaacs
41c964fdbc ctdb-recovery: Start recovery helper with ctdb_vfork_exec
The recovery helper does its own logging, so there is no need to
pass logfd.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Mon Dec  5 11:59:42 CET 2016 on sn-devel-144
2016-12-05 11:59:42 +01:00
Amitay Isaacs
d53dbd0dcc ctdb-daemon: Initialize logging in recovery daemon
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-12-05 08:09:22 +01:00
Amitay Isaacs
74ccc7280a ctdb-recoverd: Log a message when terminating
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-12-05 08:09:22 +01:00
Amitay Isaacs
3d6860b275 ctdb-daemon: Remove setting of debug_extra from switch_from_server_to_client()
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-12-05 08:09:22 +01:00
Martin Schwenke
bdc049dfce ctdb-common: Drop CTDB's copy of sys_read() and sys_write()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Nov 29 11:22:40 CET 2016 on sn-devel-144
2016-11-29 11:22:40 +01:00
Amitay Isaacs
2a9584dc0a ctdb-daemon: Remove unused code cmdline.[ch]
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-11-25 04:19:23 +01:00
Amitay Isaacs
67351e61ee ctdb-recoverd: Drop code to freeze databases from set_recovery_mode()
This function is called only once from force_election() and does not
require freezing of databases.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-09-14 08:39:28 +02:00
Martin Schwenke
abe5445c24 ctdb-recoverd: Don't directly release rogue IP addresses
This is inconsistent with the rest of the local IP verification.  It
should notice problems but not try to fix them directly.  Like other
cases, it should use an IP takeover run to try to fix the problem.  In
this case the address might have just been added and an out-of-band
RELEASE_IP might cause conflicts (i.e. "another change is in flight")
with a scheduled IP takeover run.

This effectively reverts commit 694c1b269e.  Not sure why this was
needed after c7e648c2d1.  More recently commit 6471541d6d moves
responsibility for determining interface/netmask to 10.interface so
this should continue to work just fine.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-08-17 23:00:26 +02:00
Amitay Isaacs
6693fa59dc ctdb-recoverd: Remove code that updates database priorities during recovery
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-07-25 21:29:42 +02:00
Amitay Isaacs
9338443a92 ctdb-recovery: Remove serial database recovery code
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-07-25 21:29:42 +02:00
Martin Schwenke
a26d39e5ce ctdb-recoverd: Drop code to change the IP assignment tree
The tree is no longer used in verification.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-07-04 15:42:24 +02:00
Martin Schwenke
35644d0d82 ctdb-ipalloc: Drop remote IP verification
It is only run during a takeover run and only logs errors.  It doesn't
actually do anything to fix potential errors.  The takeover run should
fix any inconsistencies anyway.

Instead, leave a comment in the recovery daemon's monitoring loop to
add proper remote IP verification later.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-07-04 15:42:24 +02:00
Amitay Isaacs
ecb74721e7 ctdb-recoverd: Avoid duplicate recoverd event in parallel recovery
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11956

In do_recovery, after the recovery and takeover is complete, recoverd
event is triggered.  When the parallel database recovery was separated,
ctdb_recovery_helper implemented sending END_RECOVERY control which
causes recoverd event to be triggered.  So when there is parallel database
recovery, recoverd event is triggered twice.

Instead, move the call to run_recovered_eventscript() explicitly into
the serial recovery code path.  This avoids triggering the recoverd
event twice.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-06-08 10:33:19 +02:00
Martin Schwenke
174449c1e0 ctdb-recoverd: Release recovery lock on exit
The recovery lock helper must exit when it notices its parent is gone.
However, that can take a few seconds.

The usual way of terminating the recovery daemon is for the main ctdbd
to send it a SIGTERM.  Installing a handler is nice and simple.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2016-06-08 00:51:29 +02:00
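The approach described in 174449c1e0, installing a SIGTERM handler so the lock can be released promptly on exit, can be sketched with plain POSIX `sigaction()`. This is only an illustration under stated assumptions: the commit's actual handler is not shown in this log, and every `example_` name below is hypothetical. The handler only sets a flag; the main loop is assumed to notice the flag, release the lock and exit.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stddef.h>

/* Set from the signal handler, read from the main loop. */
static volatile sig_atomic_t example_got_sigterm = 0;

/* Hypothetical handler: only record that SIGTERM arrived. */
static void example_sigterm_handler(int signum)
{
	(void)signum;
	example_got_sigterm = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int example_install_sigterm_handler(void)
{
	struct sigaction sa;

	sa.sa_handler = example_sigterm_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;

	return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Hypothetical check for the main loop: non-zero means the daemon
 * should release its lock and terminate.
 */
int example_should_release_lock(void)
{
	return example_got_sigterm != 0;
}
```

Keeping the handler down to a single flag assignment avoids calling functions that are not async-signal-safe from signal context; the actual release work happens in normal program flow.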