IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The database generation for each database is updated only during recovery.
After recovery is complete the database generation would be the same as
the global generation.
The database generation is required for parallel database recovery.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
The same structure is required in new controls for database transactions.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
If a node was previously recovery master (say, 20 years ago) and it
becomes recovery master again then, if IP assignments have changed,
verify_remote_ip_allocation() can produce messages like the following
when called during recovery:
ctdbd: recoverd:Inconsistent IP allocation - node 0 thinks 10.1.1.1 is held by node 0 while it is assigned to node 1
When a node loses an election it should clear all data specific to it
being the recovery master.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
For all the records listed in VACUUM_FETCH, migration requests are sent
immediately without waiting. This means there can only be a single
VACUUM_FETCH processing active at a time.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
With the processing of one element factored out,
it is more natural to have the actual loop inside the
handler function. This also makes the talloc/free
bracked around the loop more obvious.
Signed-off-by: Michael Adam <obnox@samba.org>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It isn't possible to hold the recovery lock without having a lock file
set.
This is part of a goal to generalise the recovery lock mechanism to
just use a helper program, which may use a lock file or may use
something else.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Election packets from the current node are ignored at the beginning of
the function, so this does not need to be checked.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Databases are only pulled by the recovery master, so it can compare
with current node PNN.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Unless this node is the recovery master then this is not needed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It isn't used anywhere else and is always re-initialised to 0.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
They are initialised and updated but the values are never used.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Simplify update_capabilities() using the capabilities API and store
the capabilities in new field rec->caps rather than scattered around
ctdb->nodes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The potential resulting recovery won't run anyway. Also recoveries
may have been disabled by "reloadnodes" and if the nodemaps are
inconsistent between nodes then avoid triggering an unnecessary
recovery.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Factor out new function srvid_disable_and_reply(), which can be
re-used.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This can be used to disable and re-enable an operation, and do all the
relevant sanity checking.
Most of this is from existing functions
disable_takeover_runs_handler(), clear_takeover_runs_disable() and
reenable_takeover_runs().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Just continue to hold it, otherwise a broken node might win an
election and grab the lock.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Have it just silently take or fail to take the lock, except on an
unexpected failure (where it should log an error).
This means that when it is called we need to keep the old behaviour
and explicitly release the lock. In do_recovery() the lock is
released and a message is printed before attempting to take the lock.
In the daemon sanity check the lock must be released in the error path
if it is actually taken.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This has not done anything useful since commit
b9d8bb23af8abefb2d967e9b4e9d6e60c4a3b520. Instead, just check that
the lock is held.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Unlock the recovery lock file. This way knowledge of the file
descriptor isn't sprinkled throughout the code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It is pointless having a recovery lock but not sanity checking that it
is working. Also, the logic that uses this tunable is confusing. In
some places the recovery lock is released unnecessarily because the
tunable isn't set.
Simplify the logic by assuming that if a recovery lock is specified
then it should be verified.
Update documentation that references this tunable.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Processing one migration request at a time is very slow and processing
a batch of records can take longer than VacuumInterval. This causes
subsequent vacuum fetch requests to be dropped. The dropped records
can accumulate quickly and will cause the vacuum database traverse to
be quite expensive.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Dec 5 17:06:58 CET 2014 on sn-devel-104
As far as we know, nobody uses this and it just complicates the
logging subsystem.
Remove all ringbuffer code and documentation. Update the local
daemons startup code correspondingly.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>
When ctdb daemon starts up, it considers itself the recovery master
and tries to do first recovery. However, it's possible that there is
already a recovery master and the current node has not yet heard from it.
So do not ban ourselves immediately if ctdb_recovery_lock() fails when
doing first recovery.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
This makes it consistent with Samba, to ease transition.
Update unit test code to link to with tdb_wrap instead of including
db_wrap.c.
There are some potential whitespace fixes in this commit that have
been ignored. CTDB's lib/tdb_wrap will be deleted after the
transition to Samba's lib/tdb_wrap, so there's no point polishing it
too much.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This makes it consistent with the rest of the code and avoids problems
when some variant of lib/util isn't in the include path.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Sometimes the recovery daemon fails to get the recovery lock on one
node so that node is banned. This seems to always happen during an
election. The recovery is triggered because other nodes are found to
have recovery mode enabled. They have recovery mode enabled because
an election has been forced.
The recovery daemon's main_loop() only does an initial check for an
election. After that, a node can force an election and, in the
process, set itself to be the current winner. In this situation,
verify_recmode() will always return MONITOR_RECOVERY_NEEDED so
do_recovery() is called. If the previous recovery master hasn't
admitted defeat and released the recovery lock, then do_recovery()
will rightly fail. However, it would be better if it failed a little
more gracefully, since this case is not that unusual.
Instead of trying to take the recovery lock, return early with an
error if there is an election in progress. Note that the race is
still there but it is now much narrower.
There are probably more subtle ways of avoiding this issue, including
something like this in main_loop():
- if (pnn != rec->recmaster) {
+ if (pnn != rec->recmaster || rec->election_timeout) {
return;
}
However, this check is done earlier so it leaves the race window open
a little wider.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Jul 21 06:57:07 CEST 2014 on sn-devel-104
Setting recovery mode to active is the only correct way to inform recovery
daemon to run database recovery. Only freezing databases without setting
recovery mode should not trigger database recovery, as this mechanism
is used in tool to implement wipedb/restoredb commands.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
That makes people think there's a problem (and report bugs) so say
something a bit less scary instead...
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It is a non-trivial event and will make it easier to debug recovery
lock issues.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is unnecessary since ctdbd_pid is set very early in the code before
creating any other processes including recovery daemon.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Sat Jul 5 09:20:27 CEST 2014 on sn-devel-104
As part of vacuuming, recoverd attaches to databases to migrate records.
When detaching a database from main daemon, it should be removed from
recovery daemon also.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Michael Adam <obnox@samba.org>
Autobuild-User(master): Michael Adam <obnox@samba.org>
Autobuild-Date(master): Wed Apr 23 17:05:45 CEST 2014 on sn-devel-104
Database priority is a global property and all the nodes should have the
priority set for the databases. Just setting priority on one node can
lead to problems in the recovery as a database can be frozen at wrong
priority and then freezing database would not succeed.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
Autobuild-User(master): David Disseldorp <ddiss@samba.org>
Autobuild-Date(master): Mon Apr 7 14:06:26 CEST 2014 on sn-devel-104