1
0
mirror of https://github.com/samba-team/samba.git synced 2024-12-22 13:34:15 +03:00
Commit Graph

55 Commits

Author SHA1 Message Date
Martin Schwenke
97a45f6f25 ctdb-recoverd: Add log reopening on SIGHUP to helpers
Recovery and takeover helpers can run for a while and generate
non-trivial logs.  They should support log reopening.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2022-01-17 03:43:30 +00:00
Martin Schwenke
2efce7d477 ctdb-recovery: Simplify database push function names
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
f4e2206e88 ctdb-recovery: Drop unnecessary database push wrapper
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
225a699633 ctdb-recovery: Drop passing of capabilities into database pull
This is no longer necessary because the capability new style database
pull is assumed to always be available.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
595c1a7c0f ctdb-recovery: Simplify database pull function names
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Martin Schwenke
f968576642 ctdb-recovery: Remove use of old pull and push controls
Removes use of the old controls without cleaning up the code.  Clean
up can be done later.

After this change the CTDB_CAP_FRAGMENTED_CONTROLS capability is no
longer checked.  This capability can be removed along with the
controls.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-09-11 05:06:42 +00:00
Ralph Boehme
2327471756 lib: relicense smb_strtoul(l) under LGPLv3
Signed-off-by: Ralph Boehme <slow@samba.org>
Reviewed-by: Swen Schillig <swen@linux.ibm.com>
Reviewed-by: Volker Lendecke <vl@samba.org>

Autobuild-User(master): Jeremy Allison <jra@samba.org>
Autobuild-Date(master): Mon Aug  3 22:21:04 UTC 2020 on sn-devel-184
2020-08-03 22:21:02 +00:00
Martin Schwenke
76a8174279 ctdb-recovery: Create database on nodes where it is missing
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-03-23 23:45:38 +00:00
Martin Schwenke
e6e63f8fb8 ctdb-recovery: Fetch database name from all nodes where it is attached
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-03-23 23:45:38 +00:00
Martin Schwenke
1bdfeb3fdc ctdb-recovery: Pass db structure for each database recovery
Instead of db_id and db_flags.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-03-23 23:45:38 +00:00
Martin Schwenke
c6f74e590f ctdb-recovery: GET_DBMAP from all nodes
This builds a complete list of databases across the cluster so it can
be used to create databases on the nodes where they are missing.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-03-23 23:45:38 +00:00
Martin Schwenke
4c0b9c3605 ctdb-recovery: Replace use of ctdb_dbid_map with local db_list
This will be used to build a merged list of databases from all nodes,
allowing the recovery helper to create missing databases.

It would be possible to also include the db_name field in this
structure but that would cause a lot of churn.  This field is used
locally in the recovery of each database so can continue to live in
the relevant state structure(s).

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2020-03-23 23:45:38 +00:00
Amitay Isaacs
1c56d6413f ctdb-recovery: Refactor banning a node into separate computation
If a node is marked for banning, confirm that it's not become inactive
during the recovery.  If yes, then don't ban the node.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2020-03-23 23:45:37 +00:00
Amitay Isaacs
c6a0ff1bed ctdb-recovery: Don't trust nodemap obtained from local node
It's possible to have a node stopped, but recovery master not yet
updated flags on the local ctdb daemon when recovery is started.  So do
not trust the list of active nodes obtained from the local node.  Query
the connected nodes to calculate the list of active nodes.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2020-03-23 23:45:37 +00:00
Amitay Isaacs
6e2f8756f1 ctdb-recovery: Consolidate node state
This avoids passing multiple arguments to async computation.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2020-03-23 23:45:37 +00:00
Amitay Isaacs
072ff4d12b ctdb-recovery: Fetched vnnmap is never used, so don't fetch it
New vnnmap is constructed using the information from all the connected
nodes.  So there is no need to fetch the vnnmap from recovery master.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14294

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2020-03-23 23:45:37 +00:00
Swen Schillig
73640b8ad8 ctdb: Update all consumers of strtoul_err(), strtoull_err() to new API
Signed-off-by: Swen Schillig <swen@linux.ibm.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Reviewed-by: Christof Schmitt <cs@samba.org>
2019-06-30 11:32:18 +00:00
Martin Schwenke
90622ab901 ctdb-recovery: Fix signed/unsigned comparisons by declaring as unsigned
Simple cases where variables and function parameters need to be
declared as an unsigned type instead of an int.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2019-06-05 10:25:50 +00:00
Amitay Isaacs
278eb236ae ctdb-daemon: Fix maybe-uninitialized error with picky developer
263/386] Compiling ctdb/server/ctdb_recovery_helper.c
In file included from ../../server/ctdb_recovery_helper.c:24:0:
../../server/ctdb_recovery_helper.c: In function ‘main’:
../../../lib/talloc/talloc.h:911:34: error: ‘mem_ctx’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 #define TALLOC_FREE(ctx) do { if (ctx != NULL) { talloc_free(ctx); ctx=NULL; } } while(0)

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
2019-03-01 17:21:15 +00:00
Swen Schillig
55acae774a ctdb-server: Use wrapper for string to integer conversion
In order to detect an value overflow error during
the string to integer conversion with strtoul/strtoull,
the errno variable must be set to zero before the execution and
checked after the conversion is performed. This is achieved by
using the wrapper function strtoul_err and strtoull_err.

Signed-off-by: Swen Schillig <swen@linux.ibm.com>
Reviewed-by: Ralph Böhme <slow@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>
2019-03-01 00:32:11 +00:00
Martin Schwenke
27df4f002a ctdb-recovery: Ban a node that causes recovery failure
... instead of applying banning credits.

There have been a couple of cases where recovery repeatedly takes just
over 2 minutes to fail.  Therefore, banning credits expire between
failures and a continuously problematic node is never banned,
resulting in endless recoveries.  This is because it takes 2
applications of banning credits before a node is banned, which
generally involves 2 recovery failures.

The recovery helper makes up to 3 attempts to recover each database
during a single run.  If a node causes 3 failures then this is really
equivalent to 3 recovery failures in the model that existed before the
recovery helper added retries.  In that case the node would have been
banned after 2 failures.

So, instead of applying banning credits to the "most failing" node,
simply ban it directly from the recovery helper.

If multiple nodes are causing recovery failures then this can cause a
node to be banned more quickly than it might otherwise have been, even
pre-recovery-helper.  However, 90 seconds (i.e. 3 failures) is a long
time to be in recovery, so banning earlier seems like the best
approach.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13670

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Nov  5 06:52:33 CET 2018 on sn-devel-144
2018-11-05 06:52:33 +01:00
Martin Schwenke
7dbf833697 ctdb: Fix some -Werror=strict-overflow issues
All quite obvious.  For the LCP2 one, we're not actually counting so
use a bool instead of an int.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
2018-04-27 06:53:16 +02:00
Amitay Isaacs
de3f0d889b ctdb-recovery-helper: Deregister message handler in error paths
BUG: https://bugzilla.samba.org/show_bug.cgi?id=13188

If PULL_DB control times out but the remote node is still sending the
data, then the tevent_req for pull_database_send will be freed without
removing the message handler.  So when the data is received, srvid
handler will be called and it will try to access tevent_req which will
result in use-after-free and abort.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-12-13 08:48:18 +01:00
Amitay Isaacs
676df8770b ctdb-protocol: Fix marshalling for ctdb_rec_buffer
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-08-30 14:59:23 +02:00
Amitay Isaacs
b8a0420d10 ctdb-daemon: Add implementation for CTDB_CONTROL_DB_ATTACH_REPLICATED control
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-29 10:34:27 +02:00
Amitay Isaacs
1e10f224ff ctdb-recovery: Use db_flags instead of a boolean persistent flag
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-29 10:34:27 +02:00
Amitay Isaacs
c9d9f56bff ctdb-recovery: Assign banning credits if database fails to freeze
https://bugzilla.samba.org/show_bug.cgi?id=12857

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2017-06-24 10:28:21 +02:00
Amitay Isaacs
6ebcba49d0 ctdb-recovery: Delete empty records during recovery
Persistent databases are now always recovered by sequence number.  So
there is no need to keep the empty records in the database since they
will never be recovered record-by-record using RSN.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Sat Jun 17 16:47:55 CEST 2017 on sn-devel-144
2017-06-17 16:47:55 +02:00
Amitay Isaacs
40cc7a1eb3 ctdb-recovery: Log messages at various debug levels
This avoids spamming the logs during recovery at NOTICE level.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Tue Jun 13 13:22:09 CEST 2017 on sn-devel-144
2017-06-13 13:22:09 +02:00
Amitay Isaacs
41c964fdbc ctdb-recovery: Start recovery helper with ctdb_vfork_exec
The recovery helper does it's own logging, so there is no need to
pass logfd.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Mon Dec  5 11:59:42 CET 2016 on sn-devel-144
2016-12-05 11:59:42 +01:00
Martin Schwenke
bdc049dfce ctdb-common: Drop CTDB's copy of sys_read() and sys_write()
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Nov 29 11:22:40 CET 2016 on sn-devel-144
2016-11-29 11:22:40 +01:00
Amitay Isaacs
f2414841f2 ctdb-daemon: Mark RecoverPDBBySeqNum tunable deprecated
Persistent databases are now always recovered by sequence number, so
there is no need for this tunable.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Fri Nov 25 08:13:59 CET 2016 on sn-devel-144
2016-11-25 08:13:59 +01:00
Amitay Isaacs
54e392b385 ctdb-recovery: Avoid NULL dereference in failure case
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12434

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Volker Lendecke <vl@samba.org>

Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Mon Nov 21 12:26:04 CET 2016 on sn-devel-144
2016-11-21 12:26:04 +01:00
Amitay Isaacs
6b93b57921 ctdb-recovery-helper: Add missing initialisation of ban_credits
BUG: https://bugzilla.samba.org/show_bug.cgi?id=12275

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-09-19 08:23:22 +02:00
Amitay Isaacs
f1a8fb11dd ctdb-recovery-helper: Fix format-nonliteral warning
... and printf format errors.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=12137

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Uri Simchoni <uri@samba.org>
2016-08-10 08:18:16 +02:00
Amitay Isaacs
600cec4d44 ctdb-recovery: Terminate if recovery fails without any banning credits
In case of database recovery failure, if there are no banning credits
assigned, then the async computation is never terminated.  The else
condition is missing in (max_credits >= NUM_RETRIES) check.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Fri Jun 24 09:56:23 CEST 2016 on sn-devel-144
2016-06-24 09:56:23 +02:00
Amitay Isaacs
1847556562 ctdb-recovery-helper: Fix a comment
The sequence of events are incorrectly documented.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-06-24 05:59:08 +02:00
Amitay Isaacs
93dcca2a5f ctdb-recovery: Update timeout and number of retries during recovery
The timeout RecoverTimeout (default 120) is used for control messages
sent during the recovery.  If any of the nodes does not respond to any
of the recovery control messages for RecoverTimeout seconds, then it
will cause a failure of recovery of a database.  Recovery helper will
retry the recovery for a database 5 times.

In the worst case, if a database could not be recovered within 5 attempts,
a total of 600 seconds would have passed.  During this time period other
timeouts will be triggered causing unnecessary failures as follows:

1. During the recovery, even though recoverd is processing events,
   it does not send a ping message to ctdb daemon.  If a ping message is
   not received for RecdPingTimeout (default 60) seconds, then ctdb will
   count it as unresponsive recovery daemon.  If the recovery daemon
   fails for RecdFailCount (default 10) times, then ctdb daemon will
   restart recovery daemon.  So after 600 seconds, ctdb daemon will
   restart recovery daemon.

2. If ctdb daemon stays in recovery for RecoveryDropAllIPs (default 120),
   then it will drop all the public addresses.  This will cause all
   SMB client to be disconnected unnecessarily.  The released public
   addresses will not be taken over till the recovery is complete.

To avoid dropping of IPs and restarting recovery daemon during a delayed
recovery, adjust RecoverTimeout to 30 seconds and limit number of
retries for recovering a database to 3.  If we don't hear from a node
for more than 25 seconds, then the node is considered disconnected.
So 30 seconds is sufficient timeout for controls during recovery.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Mon Jun  6 08:49:15 CEST 2016 on sn-devel-144
2016-06-06 08:49:15 +02:00
Amitay Isaacs
c51b8c2234 ctdb-recovery-helper: Add banning to parallel recovery
If one or more nodes are misbehaving during recovery, keep track of
failures as ban_credits.  If the node with the highest ban_credits exceeds
5 ban credits, then tell recovery daemon to assign banning credits.

This will ban only a single node at a time in case of recovery failure.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Fri Mar 25 06:57:32 CET 2016 on sn-devel-144
2016-03-25 06:57:32 +01:00
Amitay Isaacs
ad7a407a13 ctdb-recovery-helper: Introduce new #define variable
... instead of hardcoding number of retries.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:16 +01:00
Amitay Isaacs
e5a714a3c2 ctdb-recovery-helper: Improve log message
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:16 +01:00
Amitay Isaacs
ffea827bae ctdb-recovery-helper: Introduce push database abstraction
This abstraction uses capabilities of the remote nodes to either send
older PUSH_DB controls or newer DB_PUSH_START and DB_PUSH_CONFIRM
controls.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
b96a4759b3 ctdb-recovery-helper: Introduce pull database abstraction
This abstraction depending on the capability of the remote node either
uses older PULL_DB control or newer DB_PULL control.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
e1fdfdd1c1 ctdb-recovery-helper: Write recovery records to a recovery file
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
9058fe06df ctdb-recovery-helper: Re-factor function to retain records from recdb
Also, rename traverse function and traverse state for recdb_records
consistently.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
a80ff09ed3 ctdb-recovery-helper: Create accessors for recdb structure fields
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
70011a1bfb ctdb-recovery-helper: Rename pnn to dmaster in recdb_records()
This variable is used to set the dmaster value for each record in
recdb_traverse().

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
5b926d882e ctdb-recovery-helper: Pass capabilities to database recovery functions
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
5f43f92796 ctdb-recovery-helper: Factor out generic recv function
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-25 03:26:15 +01:00
Amitay Isaacs
700f39372a ctdb-recovery-helper: Get tunables first, so control timeout can be set
During the recovery process, the timeout value for sending all controls
is decided by RecoverTimeout tunable.  So in the recovery process,
first get the tunables, so the control timeout gets set correctly.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
2016-03-10 03:34:18 +01:00