samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2024-12-25 23:21:54 +03:00


start working on a recovery daemon change ctdb_control so it takes a timeval pointer as argument. this is the timeout. if the node has not responded within hte timeout ctdb_control will return an error instead of hanging. if the timeval pointer is NULL then the call will block indefinitely if there is no response. this is used for now in the createdb control but all the helpers ctdb_ctrl_* should probably be updated to take a timeout parameter as well. (This used to be ctdb commit 1fe64b04869b17dbf123851b0fe09df8d28a6211) 2007-05-04 02:30:18 +04:00			`/*`
			`ctdb recovery daemon`

			`Copyright (C) Ronnie Sahlberg 2007`

			`This library is free software; you can redistribute it and/or`
			`modify it under the terms of the GNU Lesser General Public`
			`License as published by the Free Software Foundation; either`
			`version 2 of the License, or (at your option) any later version.`

			`This library is distributed in the hope that it will be useful,`
			`but WITHOUT ANY WARRANTY; without even the implied warranty of`
			`MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU`
			`Lesser General Public License for more details.`

			`You should have received a copy of the GNU Lesser General Public`
			`License along with this library; if not, write to the Free Software`
			`Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA`
			`*/`

			`#include "includes.h"`
			`#include "lib/events/events.h"`
			`#include "system/filesys.h"`
better timeout handling for calls, controls and traverses (This used to be ctdb commit 63346a6c59d4821b4c443939b5d88db8cd20f5fe) 2007-05-10 08:06:48 +04:00			`#include "system/time.h"`
start working on a recovery daemon change ctdb_control so it takes a timeval pointer as argument. this is the timeout. if the node has not responded within hte timeout ctdb_control will return an error instead of hanging. if the timeval pointer is NULL then the call will block indefinitely if there is no response. this is used for now in the createdb control but all the helpers ctdb_ctrl_* should probably be updated to take a timeout parameter as well. (This used to be ctdb commit 1fe64b04869b17dbf123851b0fe09df8d28a6211) 2007-05-04 02:30:18 +04:00			`#include "popt.h"`
			`#include "cmdline.h"`
			`#include "../include/ctdb.h"`
			`#include "../include/ctdb_private.h"`

change the signature for ctdb_ctrl_getnodemap() so that a timeout parameter is added. change ctdb_get_connected_nodes in the same way (This used to be ctdb commit d85f23bcf4c1230225abb2f4a053c70b68d469aa) 2007-05-04 03:01:01 +04:00			`static int timed_out = 0;`
start working on a recovery daemon change ctdb_control so it takes a timeval pointer as argument. this is the timeout. if the node has not responded within hte timeout ctdb_control will return an error instead of hanging. if the timeval pointer is NULL then the call will block indefinitely if there is no response. this is used for now in the createdb control but all the helpers ctdb_ctrl_* should probably be updated to take a timeout parameter as well. (This used to be ctdb commit 1fe64b04869b17dbf123851b0fe09df8d28a6211) 2007-05-04 02:30:18 +04:00
recovery daemon this program is a client to the local ctdb daemon every second it pulls all vnnmap and nodemaps from all nodes that are available and checks if a recovery is required a recovery is required if : * all nodes do NOT have an identical vnnmap and generation * all nodes do NOT have an identical nodemap * there are active nodes that are NOT in the nodemap * there are nodes in the nodemap that are NOT active During recovery, the recovery tool will also make sure that all nodes know about and have created all databases. (This used to be ctdb commit 2f2650467bac7e8954de7c17cb34f46b0bdbcd26) 2007-05-04 09:21:40 +04:00			`static void timeout_func(struct event_context *ev, struct timed_event *te,`
start working on a recovery daemon change ctdb_control so it takes a timeval pointer as argument. this is the timeout. if the node has not responded within hte timeout ctdb_control will return an error instead of hanging. if the timeval pointer is NULL then the call will block indefinitely if there is no response. this is used for now in the createdb control but all the helpers ctdb_ctrl_* should probably be updated to take a timeout parameter as well. (This used to be ctdb commit 1fe64b04869b17dbf123851b0fe09df8d28a6211) 2007-05-04 02:30:18 +04:00			`struct timeval t, void *private_data)`
			`{`
change the signature for ctdb_ctrl_getnodemap() so that a timeout parameter is added. change ctdb_get_connected_nodes in the same way (This used to be ctdb commit d85f23bcf4c1230225abb2f4a053c70b68d469aa) 2007-05-04 03:01:01 +04:00			`timed_out = 1;`
start working on a recovery daemon change ctdb_control so it takes a timeval pointer as argument. this is the timeout. if the node has not responded within hte timeout ctdb_control will return an error instead of hanging. if the timeval pointer is NULL then the call will block indefinitely if there is no response. this is used for now in the createdb control but all the helpers ctdb_ctrl_* should probably be updated to take a timeout parameter as well. (This used to be ctdb commit 1fe64b04869b17dbf123851b0fe09df8d28a6211) 2007-05-04 02:30:18 +04:00			`}`
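The commit that started this file describes the timeout contract for `ctdb_control()`: a `struct timeval` pointer gives the deadline, a NULL pointer blocks until a reply arrives, and an expired deadline becomes an error instead of a hang. `timeout_func()` above is the timer callback that flags the expiry. A minimal, self-contained sketch of that contract, assuming a simple polling loop in place of the real event loop; `sketch_wait_reply`, `sketch_current_ofs` and `countdown_ready` are hypothetical names, not ctdb API:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Set when the deadline passes, in the spirit of timed_out/timeout_func. */
static int sketch_timed_out;

/* Hypothetical stand-in for timeval_current_ofs(): "now" plus an offset. */
static struct timeval sketch_current_ofs(long secs, long usecs)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	tv.tv_sec += secs;
	tv.tv_usec += usecs;
	tv.tv_sec += tv.tv_usec / 1000000;   /* keep tv_usec in [0, 1000000) */
	tv.tv_usec %= 1000000;
	return tv;
}

static int sketch_expired(const struct timeval *deadline)
{
	struct timeval now;

	gettimeofday(&now, NULL);
	return now.tv_sec > deadline->tv_sec ||
	       (now.tv_sec == deadline->tv_sec && now.tv_usec >= deadline->tv_usec);
}

/*
 * Poll reply_ready() until it reports a reply.
 * timeout == NULL: wait indefinitely; otherwise give up at the deadline.
 * Returns 0 when a reply arrived, -1 when the deadline expired first.
 */
static int sketch_wait_reply(int (*reply_ready)(void *), void *arg,
			     const struct timeval *timeout)
{
	struct timespec nap = { 0, 1000000 };  /* 1 ms between polls */

	assert(reply_ready != NULL);
	sketch_timed_out = 0;
	while (!reply_ready(arg)) {
		if (timeout != NULL && sketch_expired(timeout)) {
			sketch_timed_out = 1;
			return -1;
		}
		nanosleep(&nap, NULL);
	}
	return 0;
}

/* Example callback: *arg counts down to a reply; a negative value never replies. */
static int countdown_ready(void *arg)
{
	int *left = arg;

	if (*left > 0) {
		(*left)--;
	}
	return *left == 0;
}
```

Passing NULL as the timeout reproduces the blocking behaviour the commit message warns about, which is why the recovery helpers below pass `CONTROL_TIMEOUT()` rather than NULL.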

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`#define CONTROL_TIMEOUT() timeval_current_ofs(5, 0)`
tweak timeouts (This used to be ctdb commit 54a90797469f56d796efd82e9294efff3c5dabcc) 2007-05-27 03:43:25 +04:00			`#define MONITOR_TIMEOUT() timeval_current_ofs(1, 0)`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00
break out the setting/clearing of recovery mode into a dedicated helper function (This used to be ctdb commit dba4e4f8aa4f2fde1e9f8d93bdf3a33f7de8ce18) 2007-05-06 03:53:12 +04:00			`static int set_recovery_mode(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap, uint32_t rec_mode)`
			`{`
			`int j, ret;`

			`/* set the requested recovery mode on all nodes */`
			`for (j=0; j<nodemap->num; j++) {`
			`/* don't change it for nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

separate out the freeze/thaw handling from recovery (This used to be ctdb commit 0b0640bd8b8334961f240e0cf276ac112cd6e616) 2007-05-12 09:15:27 +04:00			`if (rec_mode == CTDB_RECOVERY_ACTIVE) {`
			`ret = ctdb_ctrl_freeze(ctdb, timeval_current_ofs(5, 0), nodemap->nodes[j].vnn);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to freeze node %u\n", nodemap->nodes[j].vnn));`
			`return -1;`
			`}`
			`}`

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_setrecmode(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, rec_mode);`
break out the setting/clearing of recovery mode into a dedicated helper function (This used to be ctdb commit dba4e4f8aa4f2fde1e9f8d93bdf3a33f7de8ce18) 2007-05-06 03:53:12 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to set recmode on node %u\n", nodemap->nodes[j].vnn));`
break out the setting/clearing of recovery mode into a dedicated helper function (This used to be ctdb commit dba4e4f8aa4f2fde1e9f8d93bdf3a33f7de8ce18) 2007-05-06 03:53:12 +04:00			`return -1;`
			`}`
separate out the freeze/thaw handling from recovery (This used to be ctdb commit 0b0640bd8b8334961f240e0cf276ac112cd6e616) 2007-05-12 09:15:27 +04:00
			`if (rec_mode == CTDB_RECOVERY_NORMAL) {`
			`ret = ctdb_ctrl_thaw(ctdb, timeval_current_ofs(5, 0), nodemap->nodes[j].vnn);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to thaw node %u\n", nodemap->nodes[j].vnn));`
			`return -1;`
			`}`
			`}`
break out the setting/clearing of recovery mode into a dedicated helper function (This used to be ctdb commit dba4e4f8aa4f2fde1e9f8d93bdf3a33f7de8ce18) 2007-05-06 03:53:12 +04:00			`}`

			`return 0;`
			`}`
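`set_recovery_mode()` skips nodes without `NODE_FLAGS_CONNECTED`, freezes a node before switching it to `CTDB_RECOVERY_ACTIVE`, and thaws it only after switching it back to `CTDB_RECOVERY_NORMAL`. The sketch below models just that ordering, assuming a hypothetical in-memory node table and an operation log in place of the real `ctdb_ctrl_freeze()`, `ctdb_ctrl_setrecmode()` and `ctdb_ctrl_thaw()` controls; the `sketch_*` names and the flag value are illustrative, not ctdb definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_FLAG_CONNECTED 0x1u   /* assumed stand-in for NODE_FLAGS_CONNECTED */

enum sketch_rec_mode { SKETCH_REC_NORMAL, SKETCH_REC_ACTIVE };

struct sketch_node {
	uint32_t vnn;
	uint32_t flags;
};

/* Records each simulated control as "<op><vnn> ": F=freeze, A/N=set mode, T=thaw. */
static char sketch_log[256];

static void sketch_send(char op, uint32_t vnn)
{
	size_t used = strlen(sketch_log);

	snprintf(sketch_log + used, sizeof(sketch_log) - used, "%c%u ", op, (unsigned)vnn);
}

static int sketch_set_recovery_mode(const struct sketch_node *nodes, int num,
				    enum sketch_rec_mode mode)
{
	assert(nodes != NULL || num == 0);
	sketch_log[0] = '\0';
	for (int j = 0; j < num; j++) {
		/* parentheses matter: test the flag first, then negate */
		if (!(nodes[j].flags & SKETCH_FLAG_CONNECTED)) {
			continue;
		}
		if (mode == SKETCH_REC_ACTIVE) {
			sketch_send('F', nodes[j].vnn);   /* freeze before going active */
		}
		sketch_send(mode == SKETCH_REC_ACTIVE ? 'A' : 'N', nodes[j].vnn);
		if (mode == SKETCH_REC_NORMAL) {
			sketch_send('T', nodes[j].vnn);   /* thaw after returning to normal */
		}
	}
	return 0;
}
```

With nodes 0 and 2 connected and node 1 disconnected, entering recovery logs `F0 A0 F2 A2 ` and leaving it logs `N0 T0 N2 T2 `.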

recovery daemon with recovery master election election is primitive, it elects the lowest vnn as the recovery master two new controls, to get/set recovery master for a node to use recovery daemon, start one ./bin/recoverd --socket=ctdb.socket* for each ctdb daemon it has been briefly tested by deleting and adding nodes to a 4 node cluster but needs more testing (This used to be ctdb commit 541d1cc49d46d44042a31a8404d521412ef2fdb3) 2007-05-07 00:51:58 +04:00			`static int set_recovery_master(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap, uint32_t vnn)`
			`{`
			`int j, ret;`

			`/* set recovery master to vnn on all nodes */`
			`for (j=0; j<nodemap->num; j++) {`
			`/* don't change it for nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, vnn);`
recovery daemon with recovery master election election is primitive, it elects the lowest vnn as the recovery master two new controls, to get/set recovery master for a node to use recovery daemon, start one ./bin/recoverd --socket=ctdb.socket* for each ctdb daemon it has been briefly tested by deleting and adding nodes to a 4 node cluster but needs more testing (This used to be ctdb commit 541d1cc49d46d44042a31a8404d521412ef2fdb3) 2007-05-07 00:51:58 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to set recmaster on node %u\n", nodemap->nodes[j].vnn));`
recovery daemon with recovery master election election is primitive, it elects the lowest vnn as the recovery master two new controls, to get/set recovery master for a node to use recovery daemon, start one ./bin/recoverd --socket=ctdb.socket* for each ctdb daemon it has been briefly tested by deleting and adding nodes to a 4 node cluster but needs more testing (This used to be ctdb commit 541d1cc49d46d44042a31a8404d521412ef2fdb3) 2007-05-07 00:51:58 +04:00			`return -1;`
			`}`
			`}`

			`return 0;`
			`}`

add a helper function to create all missing remote databases detected during recovery (This used to be ctdb commit 04758c6f7d8f61260be6d2472380cb7904984427) 2007-05-06 04:04:37 +04:00			`static int create_missing_remote_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap, uint32_t vnn, struct ctdb_dbid_map *dbmap, TALLOC_CTX *mem_ctx)`
update getvnnmap control to take a timeout parameter dont explicitely free the vnnmap pointer in the getvnnmap control this is freed by the mem_ctx instead add code to the recoverd to detect when/if recovery is required veiry that the number of active nodes, the nodemap and the vnn map is consistent across the entire cluster and if not trigger a recovery (which right now just prints "we need to do recovery" to the screen. (This used to be ctdb commit 2b0a207a3748bdb3394dc9fd0d1c344ee1bb0bb5) 2007-05-04 03:45:53 +04:00			`{`
recovery daemon this program is a client to the local ctdb daemon every second it pulls all vnnmap and nodemaps from all nodes that are available and checks if a recovery is required a recovery is required if : * all nodes do NOT have an identical vnnmap and generation * all nodes do NOT have an identical nodemap * there are active nodes that are NOT in the nodemap * there are nodes in the nodemap that are NOT active During recovery, the recovery tool will also make sure that all nodes know about and have created all databases. (This used to be ctdb commit 2f2650467bac7e8954de7c17cb34f46b0bdbcd26) 2007-05-04 09:21:40 +04:00			`int i, j, db, ret;`
			`struct ctdb_dbid_map *remote_dbmap;`

update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`/* verify that all other nodes have all our databases */`
			`for (j=0; j<nodemap->num; j++) {`
			`/* we don't need to check ourselves */`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`
			`/* don't check nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_getdbmap(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, &remote_dbmap);`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to get dbids from node %u\n", nodemap->nodes[j].vnn));`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`return -1;`
			`}`

			`/* step through all local databases */`
			`for (db=0; db<dbmap->num;db++) {`
			`const char *name;`


			`for (i=0;i<remote_dbmap->num;i++) {`
			`if (dbmap->dbids[db] == remote_dbmap->dbids[i]) {`
			`break;`
			`}`
			`}`
			`/* the remote node already has this database */`
			`if (i!=remote_dbmap->num) {`
			`continue;`
			`}`
			`/* ok so we need to create this database */`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_getdbname(ctdb, CONTROL_TIMEOUT(), vnn, dbmap->dbids[db], mem_ctx, &name);`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to get dbname from node %u\n", vnn));`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`return -1;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_createdb(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, name);`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to create remote db:%s\n", name));`
update to rhe recovery daemon ctdb_ctrl_ calls are timedout due to nodes arriving or leaving the cluster it crashes the recovery daemon afterwards with a SEGV but no useful stack backtrace (This used to be ctdb commit cd3abc7349e86555ccd87cd47a1dcc2adad2f46c) 2007-05-06 00:58:01 +04:00			`return -1;`
			`}`
			`}`
			`}`
recovery daemon this program is a client to the local ctdb daemon every second it pulls all vnnmap and nodemaps from all nodes that are available and checks if a recovery is required a recovery is required if : * all nodes do NOT have an identical vnnmap and generation * all nodes do NOT have an identical nodemap * there are active nodes that are NOT in the nodemap * there are nodes in the nodemap that are NOT active During recovery, the recovery tool will also make sure that all nodes know about and have created all databases. (This used to be ctdb commit 2f2650467bac7e8954de7c17cb34f46b0bdbcd26) 2007-05-04 09:21:40 +04:00
add a helper function to create all missing remote databases detected during recovery (This used to be ctdb commit 04758c6f7d8f61260be6d2472380cb7904984427) 2007-05-06 04:04:37 +04:00			`return 0;`
			`}`


create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`static int create_missing_local_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap, uint32_t vnn, struct ctdb_dbid_map **dbmap, TALLOC_CTX *mem_ctx)`
			`{`
			`int i, j, db, ret;`
			`struct ctdb_dbid_map *remote_dbmap;`

			`/* verify that we have every database that any other node has */`
			`for (j=0; j<nodemap->num; j++) {`
			`/* we don't need to check ourselves */`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`
			`/* don't check nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_getdbmap(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, &remote_dbmap);`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to get dbids from node %u\n", nodemap->nodes[j].vnn));`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`return -1;`
			`}`

			`/* step through all databases on the remote node */`
			`for (db=0; db<remote_dbmap->num;db++) {`
			`const char *name;`

			`for (i=0;i<(*dbmap)->num;i++) {`
			`if (remote_dbmap->dbids[db] == (*dbmap)->dbids[i]) {`
			`break;`
			`}`
			`}`
			`/* we already have this db locally */`
			`if (i!=(*dbmap)->num) {`
			`continue;`
			`}`
			`/* ok so we need to create this database and`
			`rebuild dbmap`
			`*/`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_getdbname(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, remote_dbmap->dbids[db], mem_ctx, &name);`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to get dbname from node %u\n", nodemap->nodes[j].vnn));`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`return -1;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_createdb(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, name);`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to create local db:%s\n", name));`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`return -1;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_getdbmap(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, dbmap);`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to reread dbmap on node %u\n", vnn));`
create a helper function to make sure the local node that does recovery has all the databases that exist on any other remote node (This used to be ctdb commit 0f436e3d40fea6e6a146019b0c664e80e81e88b4) 2007-05-06 04:12:42 +04:00			`return -1;`
			`}`
			`}`
			`}`

			`return 0;`
			`}`

create a helper function for recovery that pulls and merges all remote databases onto the local node (This used to be ctdb commit 5cecc47449c369f91e83389a94b987ac32b1e3f4) 2007-05-06 04:16:48 +04:00
			`static int pull_all_remote_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap, uint32_t vnn, struct ctdb_dbid_map *dbmap, TALLOC_CTX *mem_ctx)`
			`{`
			`int i, j, ret;`

			`/* pull all records from all other nodes across onto this node`
			`(this merges based on rsn)`
			`*/`
			`for (i=0;i<dbmap->num;i++) {`
			`for (j=0; j<nodemap->num; j++) {`
			`/* we don't need to merge with ourselves */`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`
			`/* don't merge from nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_copydb(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, vnn, dbmap->dbids[i], CTDB_LMASTER_ANY, mem_ctx);`
create a helper function for recovery that pulls and merges all remote databases onto the local node (This used to be ctdb commit 5cecc47449c369f91e83389a94b987ac32b1e3f4) 2007-05-06 04:16:48 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to copy db from node %u to node %u\n", nodemap->nodes[j].vnn, vnn));`
create a helper function for recovery that pulls and merges all remote databases onto the local node (This used to be ctdb commit 5cecc47449c369f91e83389a94b987ac32b1e3f4) 2007-05-06 04:16:48 +04:00			`return -1;`
			`}`
			`}`
			`}`

			`return 0;`
			`}`



added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`static int update_dmaster_on_all_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap,`
			`uint32_t vnn, struct ctdb_dbid_map *dbmap, TALLOC_CTX *mem_ctx)`
break the code that repoints dmaster for all local and remote records into a separate helper function (This used to be ctdb commit d5ab30d0ac21e736eb34eaa19bccfee5f0ce7cfb) 2007-05-06 04:22:13 +04:00			`{`
			`int i, j, ret;`

			`/* update dmaster to point to this node for all databases/nodes */`
			`for (i=0;i<dbmap->num;i++) {`
			`for (j=0; j<nodemap->num; j++) {`
			`/* don't repoint nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_setdmaster(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, dbmap->dbids[i], vnn);`
break the code that repoints dmaster for all local and remote records into a separate helper function (This used to be ctdb commit d5ab30d0ac21e736eb34eaa19bccfee5f0ce7cfb) 2007-05-06 04:22:13 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to set dmaster for node %u db:0x%08x\n", nodemap->nodes[j].vnn, dbmap->dbids[i]));`
break the code that repoints dmaster for all local and remote records into a separate helper function (This used to be ctdb commit d5ab30d0ac21e736eb34eaa19bccfee5f0ce7cfb) 2007-05-06 04:22:13 +04:00			`return -1;`
			`}`
			`}`
			`}`

			`return 0;`
			`}`

added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`/*`
			`vacuum one database`
			`*/`
			`static int vacuum_db(struct ctdb_context *ctdb, uint32_t db_id, struct ctdb_node_map *nodemap)`
			`{`
			`uint64_t max_rsn;`
			`int ret, i;`

			`/* find max rsn on our local node for this db */`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_get_max_rsn(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE, db_id, &max_rsn);`
added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`if (ret != 0) {`
			`return -1;`
			`}`

			`/* set rsn on non-empty records to max_rsn+1 */`
			`for (i=0;i<nodemap->num;i++) {`
			`if (!(nodemap->nodes[i].flags & NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_set_rsn_nonempty(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[i].vnn,`
added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`db_id, max_rsn+1);`
			`if (ret != 0) {`
			`DEBUG(0,(__location__ " Failed to set rsn on node %u to %llu\n",`
			`nodemap->nodes[i].vnn, (unsigned long long)(max_rsn+1)));`
			`return -1;`
			`}`
			`}`

			`/* delete records with rsn < max_rsn+1 on all nodes */`
			`for (i=0;i<nodemap->num;i++) {`
			`if (!(nodemap->nodes[i].flags & NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_delete_low_rsn(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[i].vnn,`
added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`db_id, max_rsn+1);`
			`if (ret != 0) {`
			`DEBUG(0,(__location__ " Failed to delete records on node %u with rsn below %llu\n",`
			`nodemap->nodes[i].vnn, (unsigned long long)(max_rsn+1)));`
			`return -1;`
			`}`
			`}`

add an extra blank line (This used to be ctdb commit 75096dde58df6532abbf5b9ebd771e8810156483) 2007-05-06 04:30:18 +04:00
added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`return 0;`
			`}`


			`static int vacuum_all_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap,`
			`struct ctdb_dbid_map *dbmap)`
			`{`
			`int i;`

			`/* vacuum every database listed in the dbmap */`
			`for (i=0;i<dbmap->num;i++) {`
			`if (vacuum_db(ctdb, dbmap->dbids[i], nodemap) != 0) {`
			`return -1;`
			`}`
			`}`
			`return 0;`
			`}`


			`static int push_all_local_databases(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap,`
			`uint32_t vnn, struct ctdb_dbid_map *dbmap, TALLOC_CTX *mem_ctx)`
create a helper function for recovery to push all local databases out onto the remote nodes (This used to be ctdb commit 1ba76d374652cfa29e56fb77c7190349e42d3bcc) 2007-05-06 04:38:44 +04:00			`{`
			`int i, j, ret;`

			`/* push all records out to the nodes again */`
			`for (i=0;i<dbmap->num;i++) {`
			`for (j=0; j<nodemap->num; j++) {`
			`/* we don't need to push to ourselves */`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`
			`/* don't push to nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_copydb(ctdb, CONTROL_TIMEOUT(), vnn, nodemap->nodes[j].vnn, dbmap->dbids[i], CTDB_LMASTER_ANY, mem_ctx);`
create a helper function for recovery to push all local databases out onto the remote nodes (This used to be ctdb commit 1ba76d374652cfa29e56fb77c7190349e42d3bcc) 2007-05-06 04:38:44 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to copy db from node %u to node %u\n", vnn, nodemap->nodes[j].vnn));`
create a helper function for recovery to push all local databases out onto the remote nodes (This used to be ctdb commit 1ba76d374652cfa29e56fb77c7190349e42d3bcc) 2007-05-06 04:38:44 +04:00			`return -1;`
			`}`
			`}`
			`}`

			`return 0;`
			`}`


added automatic vacuuming of empty records during recovery (This used to be ctdb commit f9181a784ac7009df5e9c996f4e0c3e99098b59a) 2007-05-23 11:21:14 +04:00			`static int update_vnnmap_on_all_nodes(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap,`
			`uint32_t vnn, struct ctdb_vnn_map *vnnmap, TALLOC_CTX *mem_ctx)`
break out the code to update all nodes to the new vnnmap into a helper function (This used to be ctdb commit 81d39177949b54715710907d14ddc888dc09b064) 2007-05-06 04:42:18 +04:00			`{`
			`int j, ret;`

			`/* push the new vnn map out to all the nodes */`
			`for (j=0; j<nodemap->num; j++) {`
			`/* dont push to nodes that are unavailable */`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

raise the control timeout in recovery (This used to be ctdb commit 43424ff66daf28c202c12982f20a9f662b6fb125) 2007-05-24 07:49:27 +04:00			`ret = ctdb_ctrl_setvnnmap(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, vnnmap);`
break out the code to update all nodes to the new vnnmap into a helper function (This used to be ctdb commit 81d39177949b54715710907d14ddc888dc09b064) 2007-05-06 04:42:18 +04:00			`if (ret != 0) {`
added lockwait child code for entering recovery mode. A child processes holds lockall locks for the entire recovery process (This used to be ctdb commit f892f30def75b0d964c35eae38c4cf675597dd28) 2007-05-12 08:34:21 +04:00			`DEBUG(0, (__location__ " Unable to set vnnmap for node %u\n", vnn));`
break out the code to update all nodes to the new vnnmap into a helper function (This used to be ctdb commit 81d39177949b54715710907d14ddc888dc09b064) 2007-05-06 04:42:18 +04:00			`return -1;`
			`}`
			`}`

			`return 0;`
			`}`
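The map that update_vnnmap_on_all_nodes() distributes is built in do_recovery() by copying the vnn of every connected node into an array sized from the caller-supplied num_active. The sketch below shows the same construction with plain malloc() instead of talloc and with hypothetical stand-in types (`struct node`, `struct vnn_map`, `NODE_CONNECTED`, `build_vnn_map()`). It is a defensive variation rather than the ctdb code: it counts the connected nodes itself, so the array cannot be overrun even if an externally supplied count disagrees with the node map.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for ctdb's node map and vnn map types. */
#define NODE_CONNECTED 0x1u

struct node {
	uint32_t vnn;
	uint32_t flags;
};

struct vnn_map {
	uint32_t generation;
	uint32_t size;
	uint32_t *map;
};

/* Build a vnn map listing every connected node, in node-map order.
   The size is derived from the node map itself, so map[] is always
   exactly large enough. Returns NULL on allocation failure. */
struct vnn_map *build_vnn_map(const struct node *nodes, uint32_t num_nodes,
			      uint32_t generation)
{
	struct vnn_map *m;
	uint32_t i, j, active = 0;

	for (i = 0; i < num_nodes; i++) {
		if (nodes[i].flags & NODE_CONNECTED) {
			active++;
		}
	}

	m = malloc(sizeof(*m));
	if (m == NULL) {
		return NULL;
	}
	m->generation = generation;
	m->size = active;
	/* allocate at least one slot: calloc(0, ...) may return NULL, which
	   would be indistinguishable from an allocation failure */
	m->map = calloc(active ? active : 1, sizeof(uint32_t));
	if (m->map == NULL) {
		free(m);
		return NULL;
	}
	for (i = j = 0; i < num_nodes; i++) {
		if (nodes[i].flags & NODE_CONNECTED) {
			m->map[j++] = nodes[i].vnn;
		}
	}
	assert(j == active);
	return m;
}

void free_vnn_map(struct vnn_map *m)
{
	if (m != NULL) {
		free(m->map);
		free(m);
	}
}
```

For a node map of vnns 0, 1, 2 and 5 where vnn 1 is disconnected, the resulting map has size 3 and contains 0, 2 and 5, tagged with whatever generation number the caller picked.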


static int do_recovery(struct ctdb_context *ctdb,
		       TALLOC_CTX *mem_ctx, uint32_t vnn, uint32_t num_active,
		       struct ctdb_node_map *nodemap, struct ctdb_vnn_map *vnnmap)
{
	int i, j, ret;
	uint32_t generation;
	struct ctdb_dbid_map *dbmap;
	struct flock lock;

	/* set recovery mode to active on all nodes */
	ret = set_recovery_mode(ctdb, nodemap, CTDB_RECOVERY_ACTIVE);
	if (ret!=0) {
		DEBUG(0, (__location__ " Unable to set recovery mode to active on cluster\n"));
		return -1;
	}

	/* get the recmaster lock: a non-blocking write lock on the
	   first byte of the node list file */
	if (ctdb->node_list_fd != -1) {
		close(ctdb->node_list_fd);
	}

	ctdb->node_list_fd = open(ctdb->node_list_file, O_RDWR);
	if (ctdb->node_list_fd == -1) {
		DEBUG(0,("Unable to open %s - aborting recovery (%s)\n",
			 ctdb->node_list_file, strerror(errno)));
		return -1;
	}

	lock.l_type = F_WRLCK;
	lock.l_whence = SEEK_SET;
	lock.l_start = 0;
	lock.l_len = 1;
	lock.l_pid = 0;

	if (fcntl(ctdb->node_list_fd, F_SETLK, &lock) != 0) {
		DEBUG(0,("Unable to lock %s - aborting recovery (%s)\n",
			 ctdb->node_list_file, strerror(errno)));
		return -1;
	}

	DEBUG(0, (__location__ " Recovery initiated\n"));

	/* pick a new generation number */
	generation = random();

	/* Change the vnnmap on this node only (not on any other node) to
	   use the new generation number.
	   This guarantees that if we abort the recovery prematurely for
	   some reason (for example, a node stops responding), we can just
	   return immediately and will re-enter recovery shortly afterwards.
	   In other words, we deliberately leave the cluster with an
	   inconsistent generation id so that recovery can be aborted at any
	   stage and restarted from scratch.
	 */
	vnnmap->generation = generation;
	ret = ctdb_ctrl_setvnnmap(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, vnnmap);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to set vnnmap for node %u\n", vnn));
		return -1;
	}

	/* get a list of all databases */
	ret = ctdb_ctrl_getdbmap(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, &dbmap);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to get dbids from node %u\n", vnn));
		return -1;
	}

	/* verify that all other nodes have all our databases */
	ret = create_missing_remote_databases(ctdb, nodemap, vnn, dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to create missing remote databases\n"));
		return -1;
	}

	/* verify that we have all the databases any other node has */
	ret = create_missing_local_databases(ctdb, nodemap, vnn, &dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to create missing local databases\n"));
		return -1;
	}

	/* verify again that all other nodes have all our databases,
	   including any that were just created locally */
	ret = create_missing_remote_databases(ctdb, nodemap, vnn, dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to create missing remote databases\n"));
		return -1;
	}

	/* pull all remote databases onto the local node */
	ret = pull_all_remote_databases(ctdb, nodemap, vnn, dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to pull remote databases\n"));
		return -1;
	}

	/* push all local databases to the remote nodes */
	ret = push_all_local_databases(ctdb, nodemap, vnn, dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to push local databases\n"));
		return -1;
	}
	/* build a new vnn map with all the currently active nodes,
	   using a fresh generation number */
	generation = random();
	vnnmap = talloc(mem_ctx, struct ctdb_vnn_map);
	CTDB_NO_MEMORY(ctdb, vnnmap);
	vnnmap->generation = generation;
	vnnmap->size = num_active;
	vnnmap->map = talloc_array(vnnmap, uint32_t, vnnmap->size);
	CTDB_NO_MEMORY(ctdb, vnnmap->map);
	for (i=j=0;i<nodemap->num;i++) {
		if (nodemap->nodes[i].flags&NODE_FLAGS_CONNECTED) {
			vnnmap->map[j++] = nodemap->nodes[i].vnn;
		}
	}

	/* update to the new vnnmap on all nodes */
	ret = update_vnnmap_on_all_nodes(ctdb, nodemap, vnn, vnnmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to update vnnmap on all nodes\n"));
		return -1;
	}


	/* update recmaster to point to us for all nodes */
	ret = set_recovery_master(ctdb, nodemap, vnn);
	if (ret!=0) {
		DEBUG(0, (__location__ " Unable to set recovery master\n"));
		return -1;
	}

	/* repoint all local and remote database records to the local
	   node as being dmaster
	 */
	ret = update_dmaster_on_all_databases(ctdb, nodemap, vnn, dbmap, mem_ctx);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to update dmaster on all databases\n"));
		return -1;
	}

	/*
	  run a vacuum operation on empty records
	 */
	ret = vacuum_all_databases(ctdb, nodemap, dbmap);
	if (ret != 0) {
		DEBUG(0, (__location__ " Unable to vacuum all databases\n"));
		return -1;
	}

	/*
	  if enabled, tell nodes to take over their public IPs
	 */
	if (ctdb->takeover.enabled) {
		ret = ctdb_takeover_run(ctdb, nodemap);
		if (ret != 0) {
			DEBUG(0, (__location__ " Unable to set up public takeover addresses\n"));
			return -1;
		}
	}

	/* disable recovery mode */
	ret = set_recovery_mode(ctdb, nodemap, CTDB_RECOVERY_NORMAL);
	if (ret!=0) {
		DEBUG(0, (__location__ " Unable to set recovery mode to normal on cluster\n"));
		return -1;
	}

	/* send a message to all clients telling them that the cluster
	   has been reconfigured */
	ctdb_send_message(ctdb, CTDB_BROADCAST_ALL, CTDB_SRVID_RECONFIGURE, tdb_null);

	DEBUG(0, (__location__ " Recovery complete\n"));
	return 0;
}
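do_recovery() proceeds only if it can take a non-blocking write lock (`F_SETLK` with `F_WRLCK`) on the first byte of the node list file. The purpose is presumably to keep two would-be recovery masters from running recovery at the same time. The following sketch isolates that step with plain POSIX calls. The helper names `try_recmaster_lock()` and `child_can_lock()` are hypothetical. Because `fcntl()` record locks are owned by a process, a second attempt from the same process would succeed, so the sketch demonstrates contention from a forked child.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>	/* mkstemp() for the usage example */
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to take a non-blocking write lock on byte 0 of the open file fd,
   mirroring the lock taken in do_recovery(). Returns 0 if the lock was
   obtained, or -1 with errno set (typically EACCES or EAGAIN when another
   process already holds a conflicting lock). */
int try_recmaster_lock(int fd)
{
	struct flock lock;

	memset(&lock, 0, sizeof(lock));
	lock.l_type = F_WRLCK;
	lock.l_whence = SEEK_SET;
	lock.l_start = 0;
	lock.l_len = 1;

	return fcntl(fd, F_SETLK, &lock);
}

/* Fork a child that tries the same lock on its own descriptor for path.
   Returns 1 if the child got the lock, 0 if it was refused, -1 on error. */
int child_can_lock(const char *path)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1) {
		return -1;
	}
	if (pid == 0) {
		int fd = open(path, O_RDWR);
		if (fd == -1) {
			_exit(2);
		}
		/* the child's lock, if any, is released when it exits */
		_exit(try_recmaster_lock(fd) == 0 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) {
		return -1;
	}
	switch (WEXITSTATUS(status)) {
	case 0:
		return 1;
	case 1:
		return 0;
	default:
		return -1;
	}
}
```

POSIX releases all of a process's `fcntl()` locks on a file when any descriptor for that file is closed. Closing and reopening `node_list_fd`, as do_recovery() does, therefore drops any previously held lock until `F_SETLK` succeeds again.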

struct election_message {
	uint32_t vnn;
};

			`static int send_election_request(struct ctdb_context ctdb, TALLOC_CTX mem_ctx, uint32_t vnn)`
			`{`
			`int ret;`
			`TDB_DATA election_data;`
			`struct election_message emsg;`
			`uint64_t srvid;`

			`srvid = CTDB_SRVTYPE_RECOVERY;`

			`emsg.vnn = vnn;`

			`election_data.dsize = sizeof(struct election_message);`
			`election_data.dptr = (unsigned char *)&emsg;`


			`/* first we assume we will win the election and set the`
			`recovery master to be ourselves on the current node`
			`*/`
			`ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(), vnn, vnn);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " failed to set local node as recmaster\n"));`
			`return -1;`
			`}`


			`/* send an election message to all active nodes */`
			`ctdb_send_message(ctdb, CTDB_BROADCAST_ALL, srvid, election_data);`

			`return 0;`
			`}`
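send_election_request() wraps the address of a stack-allocated struct election_message in the dptr/dsize pair of a TDB_DATA, and election_handler() below casts data.dptr straight back to the struct without checking data.dsize. The following is a minimal, self-contained sketch of that encode/decode pair, with a size check added on the receiving side. struct blob is a hypothetical stand-in for TDB_DATA, and election_to_blob()/election_from_blob() are illustrative helpers, not ctdb functions. Like the original, it moves the struct's raw bytes, so it assumes every node agrees on the struct layout and byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for TDB_DATA: a pointer/length pair. */
struct blob {
	unsigned char *dptr;
	size_t dsize;
};

struct election_message {
	uint32_t vnn;
};

/* Wrap a caller-owned message without copying it; the blob is only
   valid while *emsg is alive. */
static struct blob election_to_blob(struct election_message *emsg)
{
	struct blob b;

	b.dptr = (unsigned char *)emsg;
	b.dsize = sizeof(*emsg);
	return b;
}

/* Decode a received payload. Returns 0 on success, or -1 if the
   payload is missing or does not have the expected size. */
static int election_from_blob(struct blob b, struct election_message *out)
{
	if (b.dptr == NULL || b.dsize != sizeof(*out)) {
		return -1;
	}
	/* memcpy avoids relying on the alignment of b.dptr */
	memcpy(out, b.dptr, sizeof(*out));
	return 0;
}
```

Copying into a local struct, instead of casting the pointer as election_handler() does, also sidesteps alignment assumptions about the received buffer.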


			`/*`
			`handler for recovery master elections`
			`*/`
			`static void election_handler(struct ctdb_context *ctdb, uint64_t srvid,`
			`TDB_DATA data, void *private_data)`
			`{`
			`int ret;`
			`struct election_message *em = (struct election_message *)data.dptr;`
			`TALLOC_CTX *mem_ctx;`

			`mem_ctx = talloc_new(ctdb);`

			`/* someone called an election. check their election data`
			`and if we disagree and we would rather be the elected node,`
			`send a new election message to all other nodes`
			`*/`
			`/* for now we just check the vnn number and allow the lowest`
			`vnn number to become recovery master`
			`*/`
			`if (em->vnn > ctdb_get_vnn(ctdb)) {`
			`ret = send_election_request(ctdb, mem_ctx, ctdb_get_vnn(ctdb));`
			`if (ret!=0) {`
			`DEBUG(0, (__location__ " failed to initiate recmaster election\n"));`
			`}`
			`talloc_free(mem_ctx);`
			`return;`
			`}`

			`/* release the recmaster lock */`
			`if (ctdb->node_list_fd != -1) {`
			`close(ctdb->node_list_fd);`
			`ctdb->node_list_fd = -1;`
			`}`

			`/* ok, let that guy become recmaster then */`
			`ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(), ctdb_get_vnn(ctdb), em->vnn);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " failed to set recmaster to node %u\n", em->vnn));`
			`talloc_free(mem_ctx);`
			`return;`
			`}`

			`talloc_free(mem_ctx);`
			`return;`
			`}`
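The election rule in election_handler() is "lowest vnn wins": a node that sees a candidate with a higher vnn than its own calls its own election, and otherwise accepts the candidate as recmaster. The sketch below simulates that rule on an in-memory FIFO of broadcasts. It is a model under simplifying assumptions (every broadcast reaches every node, including the sender, in order, and nothing is lost); the sim_* names are hypothetical. Each objection carries a strictly smaller vnn than the message that triggered it, so the queue drains, and because the lowest node objects to every higher candidate, the last message delivered is always the lowest vnn.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_NODES 8
#define SIM_MAX_MSGS 1024

/* Each simulated node records only whom it believes is recmaster. */
struct sim_node {
	uint32_t vnn;
	uint32_t recmaster;
};

struct sim_cluster {
	struct sim_node nodes[SIM_MAX_NODES];
	size_t num_nodes;
	uint32_t queue[SIM_MAX_MSGS]; /* pending broadcast candidates */
	size_t head, tail;
};

/* Like send_election_request(): claim recmaster locally, then queue a
   broadcast of our own vnn. Returns -1 if the queue is full. */
static int sim_call_election(struct sim_cluster *c, size_t idx)
{
	if (c->tail == SIM_MAX_MSGS) {
		return -1;
	}
	c->nodes[idx].recmaster = c->nodes[idx].vnn;
	c->queue[c->tail++] = c->nodes[idx].vnn;
	return 0;
}

/* Deliver queued broadcasts to every node, applying the rule from
   election_handler(), until no messages remain. */
static int sim_run(struct sim_cluster *c)
{
	while (c->head < c->tail) {
		uint32_t cand = c->queue[c->head++];
		size_t i;

		for (i = 0; i < c->num_nodes; i++) {
			if (cand > c->nodes[i].vnn) {
				/* we have a lower vnn: object */
				if (sim_call_election(c, i) != 0) {
					return -1;
				}
			} else {
				/* accept the candidate */
				c->nodes[i].recmaster = cand;
			}
		}
	}
	return 0;
}
```

For example, with four nodes holding vnns 3, 2, 1 and 0, an election started by node 3 drains after eight broadcasts, and every node ends with recmaster 0. The message count in this naive scheme grows quickly with the number of objecting nodes, which is one reason the real code freezes internode traffic while an election runs.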


			`static void force_election(struct ctdb_context *ctdb, TALLOC_CTX *mem_ctx, uint32_t vnn, struct ctdb_node_map *nodemap)`
			`{`
			`int ret;`

			`/* set all nodes to recovery mode to stop all internode traffic */`
			`ret = set_recovery_mode(ctdb, nodemap, CTDB_RECOVERY_ACTIVE);`
			`if (ret!=0) {`
			`DEBUG(0, (__location__ " Unable to set recovery mode to active on cluster\n"));`
			`return;`
			`}`

			`ret = send_election_request(ctdb, mem_ctx, vnn);`
			`if (ret!=0) {`
			`DEBUG(0, (__location__ " failed to initiate recmaster election\n"));`
			`return;`
			`}`

			`/* wait for a few seconds to collect all responses */`
			`timed_out = 0;`
			`event_add_timed(ctdb->ev, mem_ctx, timeval_current_ofs(3, 0),`
			`timeout_func, ctdb);`
			`while (!timed_out) {`
			`event_loop_once(ctdb->ev);`
			`}`
			`}`
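force_election() waits for responses by arming a 3-second timer whose callback (timeout_func, defined outside this excerpt) sets the file-scope timed_out flag, then calling event_loop_once() until the flag flips. Waiting this way keeps the event loop servicing incoming election messages instead of sleeping through them. The sketch below reproduces the pattern with a hypothetical one-shot timer and a loop_once() function standing in for ctdb's event library; it assumes a POSIX system providing clock_gettime() with CLOCK_MONOTONIC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdbool.h>
#include <time.h>

/* Monotonic clock in milliseconds. */
static long long mono_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* A one-shot timer that fires cb(arg) once its deadline has passed. */
struct one_shot_timer {
	long long deadline_ms;
	void (*cb)(void *arg);
	void *arg;
	bool armed;
};

static void timer_add(struct one_shot_timer *t, long delay_ms,
		      void (*cb)(void *), void *arg)
{
	t->deadline_ms = mono_ms() + delay_ms;
	t->cb = cb;
	t->arg = arg;
	t->armed = true;
}

/* Hypothetical analogue of event_loop_once(): service at most one
   event. It only checks the timer; a real loop would block in
   select()/poll() on sockets and timers instead of spinning. */
static void loop_once(struct one_shot_timer *t)
{
	if (t->armed && mono_ms() >= t->deadline_ms) {
		t->armed = false;
		t->cb(t->arg);
	}
}

/* Timer callback playing the role of timeout_func. */
static void set_flag(void *arg)
{
	*(int *)arg = 1;
}

/* Keep running the loop for delay_ms, as force_election() does. */
static void wait_for_responses(long delay_ms)
{
	int timed_out = 0;
	struct one_shot_timer t;

	timer_add(&t, delay_ms, set_flag, &timed_out);
	while (!timed_out) {
		loop_once(&t);
	}
}
```

A monotonic clock is used so that a wall-clock adjustment during the wait cannot shorten or stretch it.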

			`void monitor_cluster(struct ctdb_context *ctdb)`
			`{`
			`uint32_t vnn, num_active, recmode, recmaster;`
			`TALLOC_CTX *mem_ctx=NULL;`
			`struct ctdb_node_map *nodemap=NULL;`
			`struct ctdb_node_map *remote_nodemap=NULL;`
			`struct ctdb_vnn_map *vnnmap=NULL;`
			`struct ctdb_vnn_map *remote_vnnmap=NULL;`
			`int i, j, ret;`

			`again:`
			`if (mem_ctx) {`
			`talloc_free(mem_ctx);`
			`mem_ctx = NULL;`
			`}`
			`mem_ctx = talloc_new(ctdb);`
			`if (!mem_ctx) {`
			`DEBUG(0,("Failed to create temporary context\n"));`
			`exit(-1);`
			`}`

			`/* we only check for recovery once every second */`
			`timed_out = 0;`
			`event_add_timed(ctdb->ev, mem_ctx, MONITOR_TIMEOUT(), timeout_func, ctdb);`
			`while (!timed_out) {`
			`event_loop_once(ctdb->ev);`
			`}`

			`vnn = ctdb_ctrl_getvnn(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE);`
			`if (vnn == (uint32_t)-1) {`
			`DEBUG(0,("Failed to get local vnn - retrying\n"));`
			`goto again;`
			`}`

			`ctdb->vnn = vnn;`

			`/* get the vnnmap */`
			`ret = ctdb_ctrl_getvnnmap(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, &vnnmap);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get vnnmap from node %u\n", vnn));`
			`goto again;`
			`}`


			`/* get number of nodes */`
			`ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), vnn, mem_ctx, &nodemap);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get nodemap from node %u\n", vnn));`
			`goto again;`
			`}`

			`/* count how many active nodes there are */`
			`num_active = 0;`
			`for (i=0; i<nodemap->num; i++) {`
			`if (nodemap->nodes[i].flags&NODE_FLAGS_CONNECTED) {`
			`num_active++;`
			`}`
			`}`


			`/* check which node is the recovery master */`
			`ret = ctdb_ctrl_getrecmaster(ctdb, CONTROL_TIMEOUT(), vnn, &recmaster);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get recmaster from node %u\n", vnn));`
			`goto again;`
			`}`

			`if (recmaster == (uint32_t)-1) {`
			`DEBUG(0,(__location__ " Recovery master not yet set - forcing election\n"));`
			`force_election(ctdb, mem_ctx, vnn, nodemap);`
			`goto again;`
			`}`

			`/* verify that the recmaster node is still active */`
			`for (j=0; j<nodemap->num; j++) {`
			`if (nodemap->nodes[j].vnn==recmaster) {`
			`break;`
			`}`
			`}`

			`if (j == nodemap->num) {`
			`DEBUG(0, ("Recmaster node %u not in list. Force reelection\n", recmaster));`
			`force_election(ctdb, mem_ctx, vnn, nodemap);`
			`goto again;`
			`}`

			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`DEBUG(0, ("Recmaster node %u no longer available. Force reelection\n", nodemap->nodes[j].vnn));`
			`force_election(ctdb, mem_ctx, vnn, nodemap);`
			`goto again;`
			`}`


			`/* if we are not the recmaster then we do not need to check`
			`if recovery is needed`
			`*/`
			`if (vnn!=recmaster) {`
			`goto again;`
			`}`


			`/* verify that all active nodes agree that we are the recmaster */`
			`for (j=0; j<nodemap->num; j++) {`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`

			`ret = ctdb_ctrl_getrecmaster(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, &recmaster);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get recmaster from node %u\n", nodemap->nodes[j].vnn));`
			`goto again;`
			`}`

			`if (recmaster!=vnn) {`
			`DEBUG(0, ("Node %u does not agree we are the recmaster. Force reelection\n", nodemap->nodes[j].vnn));`
			`force_election(ctdb, mem_ctx, vnn, nodemap);`
			`goto again;`
			`}`
			`}`


			`/* verify that all active nodes are in normal mode`
			`and not in recovery mode`
			`*/`
			`for (j=0; j<nodemap->num; j++) {`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`

			`ret = ctdb_ctrl_getrecmode(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, &recmode);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get recmode from node %u\n", nodemap->nodes[j].vnn));`
			`goto again;`
			`}`
			`if (recmode!=CTDB_RECOVERY_NORMAL) {`
			`DEBUG(0, (__location__ " Node:%u was in recovery mode. Restart recovery process\n", nodemap->nodes[j].vnn));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`
			`}`


			`/* get the nodemap for all active remote nodes and verify`
			`they are the same as for this node`
			`*/`
			`for (j=0; j<nodemap->num; j++) {`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`

			`ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, &remote_nodemap);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get nodemap from remote node %u\n", nodemap->nodes[j].vnn));`
			`goto again;`
			`}`

			`/* if the nodes disagree on how many nodes there are`
			`then this is a good reason to try recovery`
			`*/`
			`if (remote_nodemap->num != nodemap->num) {`
			`DEBUG(0, (__location__ " Remote node:%u has different node count. %u vs %u of the local node\n", nodemap->nodes[j].vnn, remote_nodemap->num, nodemap->num));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`

			`/* if the nodes disagree on which nodes exist and are`
			`active, then that is also a good reason to do recovery`
			`*/`
			`for (i=0;i<nodemap->num;i++) {`
			`if ((remote_nodemap->nodes[i].vnn != nodemap->nodes[i].vnn)`
			`|| (remote_nodemap->nodes[i].flags != nodemap->nodes[i].flags)) {`
			`DEBUG(0, (__location__ " Remote node:%u has different nodemap.\n", nodemap->nodes[j].vnn));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`
			`}`

			`}`


			`/* there better be the same number of lmasters in the vnn map`
			`as there are active nodes or we will have to do a recovery`
			`*/`
			`if (vnnmap->size != num_active) {`
			`DEBUG(0, (__location__ " The vnnmap count is different from the number of active nodes. %u vs %u\n", vnnmap->size, num_active));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`

			`/* verify that all active nodes in the nodemap also exist in`
			`the vnnmap.`
			`*/`
			`for (j=0; j<nodemap->num; j++) {`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`

			`for (i=0; i<vnnmap->size; i++) {`
			`if (vnnmap->map[i] == nodemap->nodes[j].vnn) {`
			`break;`
			`}`
			`}`
			`if (i==vnnmap->size) {`
			`DEBUG(0, (__location__ " Node %u is active in the nodemap but did not exist in the vnnmap\n", nodemap->nodes[j].vnn));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`
			`}`


			`/* verify that all other nodes have the same vnnmap`
			`and are from the same generation`
			`*/`
			`for (j=0; j<nodemap->num; j++) {`
			`if (!(nodemap->nodes[j].flags&NODE_FLAGS_CONNECTED)) {`
			`continue;`
			`}`
			`if (nodemap->nodes[j].vnn == vnn) {`
			`continue;`
			`}`

			`ret = ctdb_ctrl_getvnnmap(ctdb, CONTROL_TIMEOUT(), nodemap->nodes[j].vnn, mem_ctx, &remote_vnnmap);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Unable to get vnnmap from remote node %u\n", nodemap->nodes[j].vnn));`
			`goto again;`
			`}`

			`/* verify the vnnmap generation is the same */`
			`if (vnnmap->generation != remote_vnnmap->generation) {`
			`DEBUG(0, (__location__ " Remote node %u has different generation of vnnmap. %u vs %u (ours)\n", nodemap->nodes[j].vnn, remote_vnnmap->generation, vnnmap->generation));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`

			`/* verify the vnnmap size is the same */`
			`if (vnnmap->size != remote_vnnmap->size) {`
			`DEBUG(0, (__location__ " Remote node %u has different size of vnnmap. %u vs %u (ours)\n", nodemap->nodes[j].vnn, remote_vnnmap->size, vnnmap->size));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`

			`/* verify the vnnmap is the same */`
			`for (i=0;i<vnnmap->size;i++) {`
			`if (remote_vnnmap->map[i] != vnnmap->map[i]) {`
			`DEBUG(0, (__location__ " Remote node %u has different vnnmap.\n", nodemap->nodes[j].vnn));`
			`do_recovery(ctdb, mem_ctx, vnn, num_active, nodemap, vnnmap);`
			`goto again;`
			`}`
			`}`
			`}`

			`goto again;`

			`}`

			`static void ctdb_recoverd_parent(struct event_context *ev, struct fd_event *fde,`
			`uint16_t flags, void *private_data)`
			`{`
			`DEBUG(0,("recovery daemon parent died - exiting\n"));`
			`_exit(1);`
			`}`

			`int ctdb_start_recoverd(struct ctdb_context *ctdb)`
			`{`
			`int ret;`
			`uint64_t srvid;`
			`int fd[2];`
			`pid_t child;`

			`if (pipe(fd) != 0) {`
			`return -1;`
			`}`

			`child = fork();`
			`if (child == -1) {`
			`return -1;`
			`}`

			`if (child != 0) {`
			`close(fd[0]);`
			`return 0;`
			`}`

			`close(fd[1]);`

			`event_add_fd(ctdb->ev, ctdb, fd[0], EVENT_FD_READ|EVENT_FD_AUTOCLOSE,`
			`ctdb_recoverd_parent, &fd[0]);`

			`close(ctdb->daemon.sd);`
			`ctdb->daemon.sd = -1;`

			`srandom(getpid() ^ time(NULL));`

			`/* initialise ctdb */`
			`ret = ctdb_socket_connect(ctdb);`
			`if (ret != 0) {`
			`DEBUG(0, (__location__ " Failed to init ctdb\n"));`
			`exit(1);`
			`}`

			`/* register a message port for recovery elections */`
			`srvid = CTDB_SRVTYPE_RECOVERY;`
			`ctdb_set_message_handler(ctdb, srvid, election_handler, NULL);`

			`monitor_cluster(ctdb);`

			`DEBUG(0,("ERROR: ctdb_recoverd finished!?\n"));`
			`return -1;`
			`}`
