mirror of https://github.com/samba-team/samba.git synced 2024-12-23 17:34:34 +03:00

292 Commits

Author SHA1 Message Date
Martin Schwenke
3afcc53516 recoverd: Remove an unused temporary talloc context
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit da22d5e60dc023009854025cc9e6bc4b0a84c60e)
2013-08-22 17:00:20 +10:00
Martin Schwenke
e657f75484 recoverd: Log more information when interfaces change
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 3ef93a1a3e60cdf5d8954e7a16a988ea6126916b)
2013-08-22 17:00:20 +10:00
Amitay Isaacs
cb8310ddb6 recoverd: Improve log message when nodes disagree on recmaster
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 7b7aa7b599536cd60ebb84d363607bb4e953248a)
2013-08-14 16:55:51 +10:00
Amitay Isaacs
de6b97ce4f Revert "recoverd: Use correct tdb flags when creating missing databases"
This reverts commit 10a057d8e15c8c18e540598a940d3548c731b0b4.

This approach would not work when creating local databases since currently
there is no control to receive TDB flags for remote databases.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit ca61eb776ab862bd269e45ee0f9f96e7e1e0e001)
2013-08-14 14:15:33 +10:00
Amitay Isaacs
f15e1a28a7 recoverd: Use correct tdb flags when creating missing databases
When creating missing databases either locally or remotely, make sure
to use the correct tdb flags from other nodes.  Without this, volatile
databases can get attached without TDB_INCOMPATIBLE_HASH flag.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 10a057d8e15c8c18e540598a940d3548c731b0b4)
2013-08-01 11:08:25 +10:00
Amitay Isaacs
5ba280d8ce recoverd: Make sure to use jenkins hash for recovery databases
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 32c83e209823e9a4d6306bb7fd63d4500f3e2668)
2013-08-01 10:51:14 +10:00
Amitay Isaacs
f1f787ccac recoverd: Assemble up-to-date node flags information from remote nodes
Currently nodemap used by recovery master is the one obtained from the local
node.  This information may have been updated while processing main loop.
Before comparing node flags on all the nodes, create up-to-date node flags
information based on the information received from all the nodes.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit fcf77dec5af973a0e32f3999bc012053a6f47a96)
2013-07-30 15:34:32 +10:00
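The idea in commit f1f787ccac can be illustrated with a small C sketch. This is not the CTDB implementation: the function name `assemble_node_flags` and the flat array layout are hypothetical, and the sketch assumes that each node's report about itself is the most current source for that node's flags. A NULL entry stands for a node that did not answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: flags[j] is the local view of node j's flags.
 * remote_flags[j] is the flag array reported by node j (NULL if node j
 * did not respond).  Assumption: node j's own entry, remote_flags[j][j],
 * is the most up-to-date value for node j.
 *
 * Returns the number of local entries that were updated.
 */
size_t assemble_node_flags(uint32_t *flags,
                           const uint32_t *const *remote_flags,
                           size_t num_nodes)
{
    size_t changed = 0;

    assert(num_nodes == 0 || (flags != NULL && remote_flags != NULL));

    for (size_t j = 0; j < num_nodes; j++) {
        if (remote_flags[j] == NULL) {
            continue; /* no report from node j, keep the local view */
        }
        if (flags[j] != remote_flags[j][j]) {
            flags[j] = remote_flags[j][j];
            changed++;
        }
    }
    return changed;
}
```

Comparing node flags across the cluster would then start from the refreshed `flags` array rather than from a local copy that may have gone stale during the main loop.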
Martin Schwenke
ca13f28eef recoverd: Really fix bogus info in message about changed flags
Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this.
However, this was wrong because old_flags and new_flags were confused.
The latter has since been fixed in commit
7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed
properly.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 40f2825d6e818dc8c745b6385a545969dfb45fbc)
2013-07-11 15:18:06 +10:00
Sumit Bose
157f1cfefd Fixes for various issues found by Coverity
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 05bfdbbd0d4abdfbcf28e3930086723508b35952)
2013-07-11 15:16:55 +10:00
Martin Schwenke
a86f1f109a recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit c6fded59fa4da67f738a90fdacb51900e41801f9)
2013-07-10 15:19:27 +10:00
Amitay Isaacs
1c21f37e57 ctdbd: Set process names for child processes
This helps distinguish processes in process list in top, perf, etc.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 2493f57ce268d6fe7e4c40a87852c347fd60d29e)
2013-07-10 14:33:19 +10:00
Martin Schwenke
0108e8ff10 recoverd: Minor style improvements for ctdb_reload_remote_public_ips()
* Add a variable to the loop to make the code more readable and have
  it generally fit into 80 columns.

* Improve comments.

* Improve log messages.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 0a292fa8939a1343e44cadaa8ed9f3c0f18ca82f)
2013-07-05 15:52:33 +10:00
Martin Schwenke
7290798a41 recoverd: Clean up log messages in remote IP verification
The log messages in verify_remote_ip_allocation() are confusing
because they don't include the PNN of the problem node, because it is
not known in this function.

Add the PNN of the node being verified as a function argument and then
shuffle the log messages around to make them clearer.

Also fold 3 nested if statements into just one.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit f0942fa01cd422133fc9398f56b4855397d7bc86)
2013-07-05 15:52:33 +10:00
Martin Schwenke
15115becef recoverd: Fix an unclear log message - "Restart recovery process"
When the recovery master notices a node in recovery mode it starts the
recovery process, it doesn't restart it.

Update documentation to match.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 298c4d2c3b4ea3d900c91f5a0a5aca2952a13d61)
2013-07-05 15:52:33 +10:00
Martin Schwenke
bfe0b93652 recoverd: Fix an incorrect comment
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 9f6cd8b0bea619991c9f3bf35188c5950dabf8f4)
2013-07-05 15:52:33 +10:00
Amitay Isaacs
f032c60cd5 recoverd: Send the result from child process only once
The result has been sent before the child keeps waiting for parent
ctdbd process.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 9aa13bcedd83d463c871e3cf1f3a65da3cd83992)
2013-07-04 20:43:52 +10:00
Michael Adam
3c65197b7a recoverd: when the recmaster is banned, use that information when forcing an election
When we trigger an election because the recmaster considers itself inactive,
update our local nodemap with the recmaster's flags before calling
force_election(). This way, we don't send the inactive node freeze commands
(e.g.) that may fail and then lead to ourselves getting banned.

The theory is that this should help avoiding banning loops.

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 932360992b08a5483d90c0590218ba0fd756119e)
2013-07-02 12:59:09 +10:00
Michael Adam
082da536cb recoverd: fix a comment typo
Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 741944f118e98f178b860194eecb215180949d18)
2013-07-02 12:59:09 +10:00
Michael Adam
159b9a2989 recoverd: fix a comment in main_loop
Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit ac06c46e4a80c635f6094b5ac6f0bf3e3a02db95)
2013-07-02 12:59:09 +10:00
Michael Adam
26365f2a5f recoverd: eliminate some trailing spaces from ctdb_election_win()
Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit df30c0a05ed908fc2a997c56ff5484736b23b70f)
2013-07-02 12:59:09 +10:00
Martin Schwenke
aa79a656a7 recoverd: Don't continue if the current node gets banned
Can not continue with recovery or monitoring cluster.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 14399de1dd0bd8dabf1f48b1457e3ccb37589d8a)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
b29b6ae39e recoverd: Refactor code to ban misbehaving nodes
Since we have nodemap information, there is no need to hardcode the
limit of 20.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit aea12dce83ef385e9fb3bc03ac7ace0874a0e3fe)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
c22de8d1c0 recoverd: Move code to ban other nodes after we get local node flags
If a node gets banned first, then it should not ban other nodes.

This code was moved up in main_loop to avoid waiting for nodemap
from other nodes (commit 83b0261f2cb453195b86f547d360400103a8b795).

To prevent a banned node from banning other nodes, we need to first get
nodemap information from local node, so trying to ban other nodes can
fail if we are already banned.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit ae1693905036ecdbc4594fde1f12500faae4a554)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
32f9d7c0d4 recoverd: Delay the initial election if node is started in stopped state
Since there is an early exit if a node is stopped or banned, we can wait till
the node becomes active to start initial election.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 593a17678fbd3109e118154b034d43b852659518)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
d2411e74f1 recoverd: Update capabilities only if the current node is active
Since we do an early return if a node is stopped or banned, move update
capabilities code below the early return and just before we check the
capabilities of current recovery master.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 93bcb6617e1024f810533e12390a572f51703ca0)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
73e6cc765d recoverd: No need to check if node is recovery master when inactive
If a node is stopped or banned, it will cause early return from the
main_loop, so this check is redundant.  The election will be called by an
active node.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 815ddd3341b7e9db39e05a3a3fcd9a1420f053bc)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
870409ed1c recoverd: Always do an early exit from main_loop if node is stopped or banned
A stopped or banned node cannot do anything useful.  So do not participate
in any cluster activity and do not cause any unnecessary network traffic.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 2396981c4bcf30530aeb7f4395093cc202105b50)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
7b761c4b97 recoverd: Do not set banning credits on a node if current node is inactive
If the current node is banned or stopped, then it should not assign banning
credits to other nodes since the current node will not have up-to-date flags
of other nodes.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 38304f88e0c634e97d4687c25adef975f71537b8)
2013-07-02 12:59:09 +10:00
Amitay Isaacs
cf1d4bfde3 recoverd: Also check if current node is in recovery when it is banned
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 6a9dbb8fb0f1f6e8c206189cdc2d33bb371ea2a8)
2013-07-02 12:59:08 +10:00
Amitay Isaacs
3052006bf9 recoverd: Set node_flags information as soon as we get nodemap
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 8d622660a14c929e365d306147b378ea6ab92175)
2013-07-02 12:59:08 +10:00
Amitay Isaacs
36d8d25b6c recovered: Remove old comment as the code corresponding to that has gone away
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 34af2cdf686d5d77854cbaa7bbcd8f878e9171c7)
2013-07-02 12:59:08 +10:00
Amitay Isaacs
d439aa05a8 recoverd: Print banning message only after verifying pnn
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 4be8dff3a4451192f838497b4747273685959bed)
2013-06-28 14:20:12 +10:00
Amitay Isaacs
6960bf78ff recoverd: When updating flags on nodes, send updated flags and not old flags
This was broken by commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa.
Instead of a SRVID_SET_NODE_FLAGS message to recovery daemon, a control
was sent to the local daemon which in turn informed the recovery daemon.
And while doing this change old flags were sent via CONTROL_MODIFY_FLAGS.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 7eb2f89979360b6cc98ca9b17c48310277fa89fc)
2013-06-28 14:20:12 +10:00
Martin Schwenke
7513f0ba61 recoverd: Log node that causes takeover run to fail
Extend takeover_fail_callback() to just log (and not do any ban
processing) when the callback data is NULL.  Always call
ctdb_takeover_run() with the callback so that useful errors are always
logged.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit c429394afbabaee09f9216dc743419adddf523ea)
2013-06-13 15:55:48 +10:00
Martin Schwenke
58772d600b recoverd: Interface reference count changes should not cause takeover runs
At the moment a naive comparison of all the interface data is done.
So, if any IPs move then the reference counts for the relevant
interfaces change, interfaces appear to have changed and another
takeover run is initiated by each node that took/released IPs.

This change stops the spurious takeover runs by changing the interface
comparison to ignore the reference counts.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 0b7257642f62ebd83c05b6e2922f0dc2737f175c)
2013-05-02 17:11:43 +10:00
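The comparison change in commit 58772d600b can be sketched as follows. The struct `iface_info` and the function `ifaces_changed` are hypothetical stand-ins, not the CTDB definitions; the sketch only assumes that each interface record carries a name, a link state, and a reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical interface record; field names are assumptions. */
struct iface_info {
    char     name[16];
    uint16_t link_state;
    uint32_t references;  /* number of public IPs currently using it */
};

/*
 * Report a change only when the number of interfaces, a name, or a
 * link state differs.  The reference counts are deliberately ignored,
 * because they change whenever an IP moves.
 */
bool ifaces_changed(const struct iface_info *a, size_t na,
                    const struct iface_info *b, size_t nb)
{
    assert(na == 0 || a != NULL);
    assert(nb == 0 || b != NULL);

    if (na != nb) {
        return true;
    }
    for (size_t i = 0; i < na; i++) {
        if (strncmp(a[i].name, b[i].name, sizeof(a[i].name)) != 0) {
            return true;
        }
        if (a[i].link_state != b[i].link_state) {
            return true;
        }
    }
    return false;
}
```

With a comparison of this shape, moving an IP from one interface to another alters only `references`, so it would not be reported as an interface change, while a link going down still would be.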
Michael Adam
ca1f3de8b4 recoverd: remove bogus comment "qqq" from "add prototype new banning code"
Signed-off-by: Michael Adam <obnox@samba.org>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 9f01b8db72780acf2f88f1392bc0a796dd4c6176)
2013-04-17 12:43:48 +02:00
Martin Schwenke
2476d8a9fd recoverd: update_capabilities() should use connected nodes
... as the comment says... not just active nodes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 4f71dca8df19a63f198e2d6d59e605b49ec5e803)
2013-02-20 14:51:24 +11:00
Martin Schwenke
689384a7b4 Logging: Fix breakage when freeing the log ringbuffer
Commit a82d3ec12f0fda16d6bfa8442a07595de897c10e broke fetching from
the log ringbuffer.  The solution there is still generally good: there
is no need to keep the ringbuffer in children created by
ctdb_fork()... except for those special children that are created to
fetch data from the ringbuffer!

Introduce a new function ctdb_fork_no_free_ringbuffer() that does
everything ctdb_fork() needs to do except free the ringbuffer (i.e. it
is the old ctdb_fork() function).  The new ctdb_fork() function just
calls that function and then frees the ringbuffer in the child.

This means all callers of ctdb_fork() have the convenience of having
the ringbuffer freed.  There are 3 special cases:

* Forking the recovery daemon.  We want to be able to fetch from the
  ringbuffer there.

* The ringbuffer fetching code.  Change the 2 calls in this code (main
  daemon, recovery daemon) to call ctdb_fork_no_free_ringbuffer()
  instead.

While we're here, clear the log ringbuffer when the recovery daemon is
forked, since it will contain a copy of the messages from the main
daemon.

Note to self: always test... even the most obvious patches...  ;-)

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 00db5fa00474f8a83f1aa3b603fd756cc9b49ff4)
2013-02-07 11:26:29 +11:00
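The wrapper arrangement described in commit 689384a7b4 can be sketched in plain POSIX C. None of this is CTDB code: `log_ringbuffer`, `sketch_fork()`, `sketch_fork_no_free_ringbuffer()`, and `child_has_ringbuffer()` are hypothetical names, and a heap buffer stands in for the real log ringbuffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the in-memory log ringbuffer. */
static char *log_ringbuffer;

void log_ringbuffer_init(size_t size)
{
    free(log_ringbuffer);
    log_ringbuffer = calloc(1, size);
    assert(log_ringbuffer != NULL);
}

int log_ringbuffer_present(void)
{
    return log_ringbuffer != NULL;
}

/* Base fork: the child keeps its copy of the ringbuffer. */
pid_t sketch_fork_no_free_ringbuffer(void)
{
    return fork();
}

/* Default fork: same as above, but the child drops the ringbuffer. */
pid_t sketch_fork(void)
{
    pid_t pid = sketch_fork_no_free_ringbuffer();

    if (pid == 0) {
        free(log_ringbuffer);
        log_ringbuffer = NULL;
    }
    return pid;
}

/*
 * Demonstration helper: fork with fork_fn and report whether the child
 * still had the ringbuffer (1), had none (0), or an error occurred (-1).
 */
int child_has_ringbuffer(pid_t (*fork_fn)(void))
{
    int status;
    pid_t pid = fork_fn();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(log_ringbuffer_present() ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

In this arrangement ordinary callers use the default wrapper and get the buffer freed for them, while the few children that need the log contents call the base function directly.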
Amitay Isaacs
385325ad90 recoverd: Fix printing of node flags from local information
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 124e2a471aeda9c900fd898178a30522d7d74221)
2013-01-23 16:56:03 +11:00
Amitay Isaacs
96ba396697 recoverd: Create recoverd monitoring timed events off recoverd context
This ensures that when shutting down CTDB, all the timed events
associated with monitoring recoverd are destroyed and recoverd
is not restarted.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 7393e2b290f9879ff72d5c5a9ce933034129f0e8)
2013-01-09 16:22:39 +11:00
Amitay Isaacs
30299c387f daemon: On shutdown, destroy timed events that check if recoverd is active
When CTDB is shutting down, recovery daemon is stopped, but the
event that checks if recovery daemon is still alive is not destroyed.
So recovery master is restarted during shutdown if CTDB daemon takes
longer to shut down.

There are two processes that check if recovery daemon is working.

1. ctdb_check_recd() - which checks every 30 seconds if the recovery
   daemon process exists.

2. ctdb_recd_ping_timeout() - which is triggered when recovery daemon
   fails to ping CTDB daemon.

Both the events are periodic and need to be destroyed when shutting down.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 746168df2e691058e601016110fae818c6a265c3)
2013-01-09 13:20:26 +11:00
Michael Adam
8732e2356f recovery: data corruption of persistent DBs after recoveries: don't delete empty records
The record-by-record mode of recovery deletes empty records.
For persistent databases, this can lead to data corruption
by deleting records that should be there:

- Assume the cluster has been running for a while.

- A record R in a persistent database has been created and
  deleted a couple of times, the last operation being deletion,
  leaving an empty record with a high RSN, say 10.

- Now a node N is turned off.

- This leaves the local database copy of D on N with the empty
  copy of R and RSN 10. On all other nodes, the recovery has deleted
  the copy of record R.

- Now the record is created again while node N is turned off.
  This creates R with RSN = 1 on all nodes except for N.

- Now node N is turned on again. The following recovery will choose
  the older empty copy of R due to RSN 10 > RSN 1.

==> Hence the record is gone after the recovery.

On databases like Samba's registry, this can damage the higher-level
data structures built from the various tdb-level records.

This patch fixes that problem by not deleting empty records in recoveries
for persistent databases.

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 6860c79aea416f56cfd7a6af790bbdf495dbc54e)
2012-11-20 00:48:24 +01:00
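The failure scenario in commit 8732e2356f can be traced with a small model. The following C sketch is not CTDB code: the `record_copy` struct and the functions `pick_winner`, `recover_record`, and `store_record` are hypothetical. It assumes that a write to an existing record bumps its RSN by one and that a write to a missing record starts at RSN 1, as in the scenario in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one node's copy of a record. */
struct record_copy {
    bool     exists;    /* false: this node has no copy at all */
    uint64_t rsn;       /* record sequence number */
    size_t   data_len;  /* 0 means an empty (deleted) record */
};

/* Index of the existing copy with the highest RSN, or -1 if none. */
int pick_winner(const struct record_copy *copies, size_t n)
{
    int best = -1;

    assert(n == 0 || copies != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!copies[i].exists) {
            continue;
        }
        if (best < 0 || copies[i].rsn > copies[(size_t)best].rsn) {
            best = (int)i;
        }
    }
    return best;
}

/*
 * Result of recovering one record.  With delete_empty set (the old
 * behaviour), an empty winning copy is dropped entirely.  Without it
 * (the behaviour for persistent databases after the fix), the empty
 * copy survives together with its RSN.
 */
struct record_copy recover_record(const struct record_copy *copies,
                                  size_t n, bool delete_empty)
{
    struct record_copy none = { false, 0, 0 };
    int w = pick_winner(copies, n);

    if (w < 0) {
        return none;
    }
    if (delete_empty && copies[w].data_len == 0) {
        return none;
    }
    return copies[w];
}

/* Assumption: a write bumps the RSN, or starts at 1 for a new record. */
struct record_copy store_record(struct record_copy old, size_t data_len)
{
    struct record_copy r = { true, old.exists ? old.rsn + 1 : 1, data_len };
    return r;
}
```

Under these assumptions, keeping the empty record means a later re-creation continues from RSN 10 to RSN 11, so the stale empty copy on node N (RSN 10) no longer wins the next recovery. When the empty record is deleted instead, the re-created record restarts at RSN 1 and loses to the stale copy.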
Michael Adam
9c65a7ef81 recoverd: fix a comment typo
Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 909269a4a3690e1245117ca1af935401455785e6)
2012-11-20 00:48:23 +01:00
Amitay Isaacs
85c8deca3f recoverd: Track the nodes that fail takeover run and set culprit count
If any of the nodes fail takeover run (either due to timeout or failure
to complete within takeover_timeout interval) from main loop, recovery
master will give up trying takeover run with following message:

  "Unable to setup public takeover addresses. Try again later"

And as a side-effect the monitoring is disabled on all the nodes. Before
ctdb_takeover_run() is called from main loop, monitoring gets disabled via
startrecovery event. Since ctdb_takeover_run() fails, it never runs
recovered event and monitoring does not get re-enabled.

In main_loop, ctdb_takeover_run() is called with a takeover_fail_callback.
This callback will get called if any of the nodes fail in handling
takeip/releaseip/ipreallocated events in ctdb_takeover_run().

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit a5c6bb1fffb8dc3960af113957a1fd080cc7c245)
2012-11-14 10:59:54 +11:00
Martin Schwenke
db5dfe891c recoverd: Add CTDB_SRVID_GETLOG and CTDB_SRVID_CLEARLOG
These support getting and clearing logs from the ring-buffer in the
recovery daemon.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit cbca233d1e03b2410e0bb63b936328d4a8b3c7b4)
2012-10-22 11:15:36 +11:00
Martin Schwenke
bfbcdea610 recoverd: Clarify some misleading log messages
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 14589bf7c16ba017fe00d4e8bea8cc501546c60f)
2012-10-18 20:05:43 +11:00
Martin Schwenke
a884c8c453 recoverd: Verifying local IPs should only check for unhosted available IPs
Currently it checks for unhosted IPs among the known IPs rather than
available IPs.  This means that a takeover run can be flagged even
when that takeover run will be unable to assign a known, unhosted IP.

Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 3cc878bc97fdac764a60ed805f64d649eaab06e8)
2012-10-18 20:05:42 +11:00
Martin Schwenke
4719df62d6 recoverd: Track failure of "recovered" event, banning culprits
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 9550c497e6d6ef5ee44826c4bd9ed5ad65174263)
2012-10-11 12:10:45 +11:00
Martin Schwenke
62046a8a4c recoverd: When starting a takeover run disable IP verification
Disable for TakeoverTimeout seconds.

Otherwise the recovery daemon can get overzealous and start trying
to add/delete addresses that it thinks are missing but where the
eventscript just hasn't finished.  This didn't use to matter so much
but it is more important now that concurrent takeip/releaseip/updateip
generate errors - we want to avoid spamming the log.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 56fcee3c7730cb12fa666072d5400949af6e5f7c)
2012-10-11 12:10:45 +11:00
Martin Schwenke
735c9107e1 recoverd: All inactive nodes should yield recovery master role
Not just stopped nodes.  In reality, this means that banned nodes will
also yield, since nodes in the other inactive states won't be running
a daemon.

This seems sensible since if another node notices that an inactive
node is the recovery master then it will force an election anyway.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit fc18188b7b63eb0dafbc47e3abf80e306e1dfc31)
2012-08-08 16:15:03 +10:00