1
0
mirror of https://github.com/samba-team/samba.git synced 2024-12-24 21:34:56 +03:00
Commit Graph

292 Commits

Author SHA1 Message Date
Andrew Tridgell
4115492992 - catch ESTALE in the recovery lock by trying a read()
- priortise nodes that are unbanned and healthy in the election

(This used to be ctdb commit 929feb475dfdf7283f0e99b50b179e1c91d3a39f)
2007-10-05 13:28:21 +10:00
Andrew Tridgell
fb48f2d5a2 we are the culprit if we can't get the reclock
(This used to be ctdb commit 1d320e113c6134ff6822b985a47131d8204af35a)
2007-10-05 12:01:40 +10:00
Ronnie Sahlberg
72379ee3eb change async.private to async.private_data since private is a reserved
work in c++

(This used to be ctdb commit 79eb28f6cd5dcc30b04966d202a050eaf98a2552)
2007-09-26 14:25:32 +10:00
Ronnie Sahlberg
359448ff00 when we have a public ip address mismatch (i.e. we hold addresses we
shouldnt   or we are not holding addresses wqe should)
we must first freeze the local node before we set the recovery mode

(This used to be ctdb commit a77a77e8b5180f6a4a1f3d7d4ff03811f3b71b56)
2007-09-24 10:52:26 +10:00
Andrew Tridgell
c60988325d added support for persistent databases in ctdbd
(This used to be ctdb commit 3115090a0d882beca9d70761130b74bb0821f201)
2007-09-21 12:24:02 +10:00
Andrew Tridgell
955d4d8615 make sure all public IPs are removed at startup
(This used to be ctdb commit b16f33787f2a9471285037f4a6d470e826536570)
2007-09-14 11:56:40 +10:00
Ronnie Sahlberg
6052078b53 let each node verify that they have a correct assignment of public ip
addresses (i.e. htey hold those they should hold   and they dont hold 
any of those they shouldnt hold)

if an inconsistency is found, mark the local node as recovery mode 
active
and wait for the recovery master to trigger a full blown recovery

(This used to be ctdb commit 55a5bfc8244c5b9cdda3f11992f384f00566b5dc)
2007-09-14 10:16:36 +10:00
Andrew Tridgell
42fc00bda9 - merge from ronnie
- add a flag to check that recovery completed correctly. If not, re-trigger it in monitoring

(This used to be ctdb commit d5ed941d9bab4af30d8b5f9b77bdf43d9218d69b)
2007-09-14 09:49:12 +10:00
Ronnie Sahlberg
4186d8eaba when a ctdb_takeover_run has failed we must make sure that
need_takeover_run is set to true  or else we might forget to rerun it 
again during the next recovery


othervise,  need_takeover_run is only set to true IFF the node flags for 
a remote node and the local nodes differ.
It is possible that a takeover run fails  and thus the reassignment of 
ip addresses is incomplete  but before we get back to the test in    
monitor_cluster()  that all the node flags of all nodes have converged 
and they now match each others again.   and thus causing 
monitor_cluster() to fail to realize that a takeover run is needed.

(This used to be ctdb commit ae7e866787cebd14394983ce1834387c959d1022)
2007-09-13 14:51:37 +10:00
Andrew Tridgell
9d50595b8a prevent recursion in the calling of ctdb_takeover_run
(This used to be ctdb commit 0fbdeb7c91b965d9bc5ecc7b24e31070378d8f1d)
2007-09-13 14:08:18 +10:00
Andrew Tridgell
30de14fe79 force recovery if unable to tell a node to release an IP
(This used to be ctdb commit 6895788d2499344a03357e5c1103cb8383e9eaf7)
2007-09-13 11:19:49 +10:00
Ronnie Sahlberg
77ec4d5248 allow different nodes in the cluster to use different public_addresses
files
so that we can partition the cluster into different subsets of nodes 
which each serve a different subset of the public addresses

(This used to be ctdb commit 889e0fe69e4c88c6166282b12843b8d9727552d6)
2007-09-04 23:15:23 +10:00
Ronnie Sahlberg
8f819c6a0e get rid of the ctdb_vnn_list structure and just use a single list of
ctdb_vnn

(This used to be ctdb commit 7b9fd06321af17043136b1420b57284450ae7ba5)
2007-09-04 18:20:29 +10:00
Ronnie Sahlberg
157be530dd change ctdb_ctrl_getvnn to ctdb_ctrl_getpnn
(This used to be ctdb commit ef47cc4cd416065c69382e4d9e76c30a0a34e42f)
2007-09-04 10:38:48 +10:00
Ronnie Sahlberg
211b497818 change ctdb_node_flags_change.vnn to ctdb_node_flags_changed.pnn
change ctdb_ban_info.vnn to ctdb_ban_info.pnn

(This used to be ctdb commit fcedd40e0493948829e1c921d4fe30e9196e398a)
2007-09-04 10:33:10 +10:00
Ronnie Sahlberg
583b6e6ba6 change ctdb_get_vnn to ctdb_get_pnn
(This used to be ctdb commit 1e19930198c2bcc7ccb755e0ee51555fb823029a)
2007-09-04 10:18:44 +10:00
Ronnie Sahlberg
fc9d39c3a6 change ctdb_validate_vnn to ctdb_validate_pnn
(This used to be ctdb commit a4a1f41b69475b9dc16d8fd7f8965c32e96c32f0)
2007-09-04 10:09:58 +10:00
Ronnie Sahlberg
eb4cf6a686 change ctdb->vnn to ctdb->pnn
(This used to be ctdb commit 8c776e5707e503ec6586aae39ac6b3ea5a2fd2bc)
2007-09-04 10:06:36 +10:00
Ronnie Sahlberg
12ebb74838 change how we do public addresses and takeover so that we can have
multiple public addresses spread across multiple interfaces on each 
node.

this is a massive patch since we have previously made the assumtion that 
we only have one public address per node.

get rid of the public_interface argument.  the public addresses file 
now explicitely lists which interface the address belongs to

(This used to be ctdb commit 462ebbc791e906a6b874c862defea43235597ca8)
2007-09-04 09:50:07 +10:00
Ronnie Sahlberg
7f02e16143 add async versions of the freeze node control and freeze all nodes in
parallell 

(This used to be ctdb commit f34e89f54d9f4380e76eb1b5b2385a4d8500b505)
2007-08-27 10:31:22 +10:00
Ronnie Sahlberg
a9c45b2562 change the monitoring of recmode in the recovery daemon to use a fully
async eventdriven api for controls

(This used to be ctdb commit 8d0e43428c507967a0d96e6a4c5c821ac269c546)
2007-08-27 09:40:10 +10:00
Ronnie Sahlberg
495a6403da change the api for managing callbacks to controls so that isntead of
passing it as a parameter we set the callback function explicitely from 
the caller if the ..._send() function returned a valid state pointer.

(This used to be ctdb commit aa939570662786455f63299b62c99882cff29d42)
2007-08-24 10:42:06 +10:00
Ronnie Sahlberg
62a03ef9d5 get rid of the explicit global timeout used in the previous example and
try this time by relying on the timeouts for the individual controls

(This used to be ctdb commit 448a0eb4fd896dc545aa0b4bb2ba4628491578be)
2007-08-23 19:38:54 +10:00
Ronnie Sahlberg
f854b5f876 try out a slightly different api for controls where you provide a
callback function which is called upon completion (or timeout) of the 
control.

modify scanning of recmaster in the monitoring_cluster code to try the 
api out

(This used to be ctdb commit c37843f1d97b169afec910e7ddb4e5ac12c3015c)
2007-08-23 19:27:09 +10:00
Ronnie Sahlberg
4c13bf0c5f break checking that the recoverymode on all nodes are ok out into its
own function

(This used to be ctdb commit 813cf9a252af96da24122b80f24aabeed2911939)
2007-08-23 13:48:39 +10:00
Ronnie Sahlberg
8fd3df2553 hang the ctdb_req_control structure off the ctdb_client_control_state
struct  so that if we timeout a control we can print debug info such as 
what opcode failed and to which node

we dont need the *status parameter to ctdb_client_control_state

create async versions of the getrecmaster control

pass a memory context to getrecmaster

(This used to be ctdb commit 558b680c82f830fba82c283c78c2de8a0b150b75)
2007-08-23 13:00:10 +10:00
Ronnie Sahlberg
f6e0336b23 create a define to represent the 'invalid' generation id we used in two
places.

create a new helper function to generate new generation id values that 
know about the invalid id and avoids generating it.

update the ctdb status tool to know about the invalid generation id and 
print the string INVALID instead

(This used to be ctdb commit 4fbcd189543cb8a92227fdcd3d158472e558ccda)
2007-08-22 12:38:31 +10:00
Ronnie Sahlberg
8b06fc7284 change the structure used for node flag change messages so that we can
see both the old flags as well as the new flags (so we can tell which 
flags changed)

send the CTDB_SRVID_RECONFIGURE messages to connected nodes only, not to 
every node, connected or not, in the cluster.


in the handler inside the recovery daemon which is invoked for node flag 
change messages, only do a takeover_run() and redistribute the ip addresses IF it was the 
disabled or the unhealthy flags that changed. Also send out the cluster 
reconfigured message in this case.
If any of the other flags changed we dont need to do the takeover_run(0 
here since that will be done during recovery.

(This used to be ctdb commit 5549b2058e2c148a8ca9d419123acf3247bb8829)
2007-08-21 17:25:15 +10:00
Andrew Tridgell
32de198fd3 update lib/replace from samba4
(This used to be ctdb commit f0555484105668c01c21f56322992e752e831109)
2007-07-10 15:29:31 +10:00
Ronnie Sahlberg
a859723912 nicer handling of DISCONNECTED flag when we update the node flags from
a remote message

(This used to be ctdb commit 9a50ad22be61a09761ffda89de91ef3221917c84)
2007-07-09 17:40:15 +10:00
Ronnie Sahlberg
69f3a09e6f when a remote node has sent us a message to update the flags for a node,
dont let those messages modify the DISCONNECTED flag.

the DISCONNECTED flag must be managed locally since it describes whether 
the local node can communicate with the remote node or not

(This used to be ctdb commit 5650673205d335a32d4f27f66847ea66752a00f0)
2007-07-09 13:21:17 +10:00
Ronnie Sahlberg
b871c3e365 a better way to fix the DISCONNECT|BANNED vs DISCONNECT bug
(This used to be ctdb commit 5c638d7731c5a268de02d3a37828ac7aec9a12de)
2007-07-09 12:55:15 +10:00
Ronnie Sahlberg
3499c8c673 when checking the nodemap flags for consitency while monitoring the
cluster,   we cant check that both the BANNED and the DISCONNECTED flags 
are both set at the same time   since if a node becomes banned just 
before it is DISCONNECTED   there is no guarantee that all other nodes 
will have seen the BANNED flag.

So we must first check the DISCONNECTED flag only   and only if the 
DISCONNECTED flag is not set should we check the BANNED flag.


othervise this can cause a recovery loop while some nodes thing the 
disconnected node is DISCONNECTED|BANNED and other think it is just 
DISCONNECTED

(This used to be ctdb commit 0967b2fff376ead631d98e78b3a97253fc109c69)
2007-07-09 12:33:00 +10:00
Ronnie Sahlberg
1cd8bc0c64 add a tuneable to control how long we wait after a successful recovery
before we alow another recovery to be initiated

(This used to be ctdb commit f3b43519423b7a73e6a2dd986bdf11203b8653cf)
2007-07-04 08:36:59 +10:00
Andrew Tridgell
732353de5f - merged ctdb_store test from ronnie
- added DatabaseHashSize tunable
- added logging of events inside recovery (for timing)

(This used to be ctdb commit 3593cdb928b91e217faf1b3c537fa28dc82cdace)
2007-06-17 23:31:44 +10:00
Andrew Tridgell
91362083a1 make sure we start the freeze process quickly on all nodes when we are going to do recovery - this prevents serialisation of freeze, which can take a long time
(This used to be ctdb commit 52675c19e420d83d9556a3e73d9a4b490078aa9c)
2007-06-11 23:03:23 +10:00
Andrew Tridgell
a31ece536c more detail in recovery message
(This used to be ctdb commit bc18a39efcf1fa5edfadc4c2f842f7cf035e4fbd)
2007-06-11 21:37:09 +10:00
Andrew Tridgell
18ae6e56f0 propogate flag changes to all connected nodes
(This used to be ctdb commit 711d1f7e20f1e98caaf08a57df0b1825ff6e97a0)
2007-06-09 21:58:50 +10:00
Ronnie Sahlberg
40585aed37 should be sufficient to unban nodes when we unbecome recmaster
(This used to be ctdb commit 8a6c4e675b4b877a9d0a7a3701973573ff0b71e8)
2007-06-09 20:13:25 +10:00
Ronnie Sahlberg
5458196b3f unban all nodes when we release recmaster role or when we win an
election

(This used to be ctdb commit 48fb7483b3fe391e2d0b78718af29f69a641525e)
2007-06-09 20:11:51 +10:00
Ronnie Sahlberg
9a0d7a688f add code to unban when we become/unbecome recmaster
(This used to be ctdb commit a22cf9b8a6fd46128faca958f33a75cb3fc1ee12)
2007-06-09 19:42:41 +10:00
Andrew Tridgell
ae3d54094b start splitting the code into separate client and server pieces
(This used to be ctdb commit 603cd77988c181525946cd5eb0f4d0d646b58059)
2007-06-07 22:06:19 +10:00