mirror of https://github.com/samba-team/samba.git synced 2024-12-24 21:34:56 +03:00

307 Commits

Author SHA1 Message Date
Andrew Tridgell
dc15a9c1f6 - accept an optional set of tdb_flags from clients when opening a database,
thus allowing the client to pass through the TDB_NOSYNC flag

- ensure that tdb_store() operations on persistent databases that don't
  have TDB_NOSYNC set happen inside a transaction wrapper, thus making
  them crash safe

(This used to be ctdb commit 49330f97c78ca0669615297ac3d8498651831214)
2008-04-10 15:25:48 +10:00
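The second point of commit dc15a9c1f6 can be pictured with a toy model. This is a minimal Python sketch, not ctdb's C code: `ToyDB` and its `log` list are hypothetical, and the value given to `TDB_NOSYNC` is only illustrative. It shows a store that goes through a transaction wrapper only when the database is persistent and the client did not pass `TDB_NOSYNC`.

```python
TDB_NOSYNC = 64  # illustrative flag value; check tdb.h for the real constant


class ToyDB:
    """Hypothetical in-memory stand-in for a tdb database handle."""

    def __init__(self, persistent, flags=0):
        self.persistent = persistent
        self.flags = flags
        self.data = {}
        self.log = []  # records which path each store took

    def store(self, key, value):
        needs_transaction = self.persistent and not (self.flags & TDB_NOSYNC)
        if needs_transaction:
            # Crash-safe path: the write is bracketed by a transaction.
            self.log.append("transaction_start")
            self.data[key] = value
            self.log.append("transaction_commit")
        else:
            # Fast path: plain store, no transaction wrapper.
            self.data[key] = value
            self.log.append("plain_store")


safe = ToyDB(persistent=True)
safe.store(b"k", b"v")

fast = ToyDB(persistent=True, flags=TDB_NOSYNC)
fast.store(b"k", b"v")
```

In this sketch `safe.log` records the transaction bracket while `fast.log` records a plain store, which is the trade-off the flag lets a client choose.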
Ronnie Sahlberg
cd1858d126 fix compiler warning during a fatal error failing to lock down the socket
(This used to be ctdb commit 0ad22de1a614dc2d1926546027be5f5eea3381ed)
2008-04-10 09:56:49 +10:00
Ronnie Sahlberg
2da3fe1b17 From Chris Cowan
secure the domain socket and set permissions properly

(This used to be ctdb commit ac6a362fc2fc4a56b4c310478a96eb12daace176)
2008-04-10 06:51:53 +10:00
Ronnie Sahlberg
6b797f148c From Chris Cowan
Add support in AIX to track the PID of a client that connects to the unix domain socket

(This used to be ctdb commit 4c006c675d577d4a45f4db2929af6d50bc28dd9e)
2008-04-03 10:58:51 +11:00
Ronnie Sahlberg
e8e67ef576 add a mechanism to force a node to run the eventscripts with arbitrary arguments
ctdb eventscript "command argument argument ..."

(This used to be ctdb commit 118a16e763d8332c6ce4d8b8e194775fb874c8c8)
2008-04-02 11:13:30 +11:00
Ronnie Sahlberg
03d30f405d decorate the memdump output with a nice field for ctdb_client structures to show the pid of the client that attached
(This used to be ctdb commit 0d9314302d0b988b6ab5d533deef40c5b343c249)
2008-04-01 17:17:21 +11:00
Ronnie Sahlberg
27a7f854f5 add improvements to tracking memory usage in ctdbd and the recovery daemon
and a ctdb command to pull the talloc memory map from a recovery daemon
ctdb rddumpmemory

(This used to be ctdb commit d23950be7406cf288f48b660c0f57a9b8d7bdd05)
2008-04-01 15:34:54 +11:00
Ronnie Sahlberg
0d7b34c9e5 Add two new controls to add/delete public ip address from a node at runtime.
The controls only modify the runtime setting of which public addresses a node
can serve and do not modify /etc/ctdb/public_addresses.
To make the change permanent you also need to edit /etc/ctdb/public_addresses
manually.

After ip addresses have been added/deleted you need to invoke a recovery
for the ip addresses to be redistributed.

(This used to be ctdb commit f8294d103fdd8a720d0b0c337d3973c7fdf76b5c)
2008-03-27 09:23:27 +11:00
Ronnie Sahlberg
26ec64a571 fix a memory leak
allocate the memory to the 'call' context and not off the 'ctdb' context

(This used to be ctdb commit be89005bd5d13409e377d425db2aad1c0d5b3826)
2008-03-25 11:11:13 +11:00
Ronnie Sahlberg
2863d2cfd1 From M Dietz,
Add back the controls to enable/disable monitoring we used to have for debugging but removed a while ago

(This used to be ctdb commit 8477f6a079e2beb8c09c19702733c4e17f5032fe)
2008-03-25 08:27:38 +11:00
Ronnie Sahlberg
d53424731f in ctdb_call_local() we can not talloc_steal() the returned data and hang it off ctdb.
This can cause a memory leak if the call is terminated before we have managed to respond to the client.
(and the call is talloc_free()d but the data is still hanging off ctdb)

instead we must talloc_steal() the data and hang it off the call structure to avoid the memory leak.

In order to do this we must also change the call structure that is passed into ctdb_call_local() to be allocated through talloc().

This structure was previously either a static variable, or an element of a larger talloc()ed structure (ctdb_call_state or ctdb_client_call_state) so
we must change all creations of a ctdb_call into explicitly creating it through talloc()

(This used to be ctdb commit 4becf32aea088a25686e8bc330eb47d85ae0ef8f)
2008-03-19 13:54:17 +11:00
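The ownership problem described in commit d53424731f can be modelled with a toy context tree. The sketch below is Python, not talloc: the `Ctx` class and its `steal`/`free` methods are hypothetical stand-ins that only imitate the parent/child semantics of `talloc_steal()` and `talloc_free()`. It shows why data hung off the long-lived `ctdb` context survives when the call is freed, while data hung off the call goes away with it.

```python
class Ctx:
    """Hypothetical stand-in for a talloc context (parent owns children)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = None
        self.children = []
        self.freed = False
        if parent is not None:
            parent.steal(self)

    def steal(self, child):
        """Re-parent child under self, like talloc_steal(self, child)."""
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)
        return child

    def free(self):
        """Free self and, recursively, everything it owns."""
        for child in list(self.children):
            child.free()
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        self.freed = True


ctdb = Ctx("ctdb")               # lives as long as the daemon
call = Ctx("call", parent=ctdb)  # lives for one call

leaky = ctdb.steal(Ctx("reply_data"))  # old behaviour: data owned by ctdb
fixed = call.steal(Ctx("reply_data"))  # new behaviour: data owned by the call

call.free()  # call is terminated before the reply is sent
```

After `call.free()`, `fixed` has been released along with the call, but `leaky` is still attached to `ctdb`, which is the leak the commit removes.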
Ronnie Sahlberg
e19264ea26 change the log level for the message when someone connects to a non-public ip
(This used to be ctdb commit bc9c4f0d52e9b06aceb08cea99ed3fd20b44616c)
2008-03-13 07:54:55 +11:00
Ronnie Sahlberg
74d57f8d51 Redo the vacuuming process to make it scalable.
Vacuuming used to delete one record at a time on all nodes, that was
m*n behaviour and would require a huge storm of ctdb->ctdb controls and just wouldn't scale at all.

The new vacuuming process collects all records to be deleted locally and then only sends 1 control to the other nodes. This control contains a list of all records to be deleted.

(This used to be ctdb commit 9e625ece19a91f362c9539fa73b6b2108f0d9c53)
2008-03-13 07:53:29 +11:00
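The scaling argument in commit 74d57f8d51 comes down to counting controls. A rough Python sketch follows, under two assumptions: the old scheme sent one control per record to every other node, and the records to delete are the zero-length ones. The function names are hypothetical and not taken from ctdb.

```python
def controls_one_record_at_a_time(num_records, num_nodes):
    """Old scheme: every record costs one control to each other node."""
    return num_records * (num_nodes - 1)


def controls_batched(num_records, num_nodes):
    """New scheme: one control per other node, carrying the whole list."""
    return (num_nodes - 1) if num_records > 0 else 0


def build_delete_list(local_records):
    """Collect locally the keys of all records that should be deleted.

    Assumption for this sketch: a record is deletable when its value is empty.
    """
    return [key for key, value in local_records.items() if len(value) == 0]


records = {b"a": b"", b"b": b"payload", b"c": b""}
batch = build_delete_list(records)
old_cost = controls_one_record_at_a_time(len(batch), num_nodes=4)
new_cost = controls_batched(len(batch), num_nodes=4)
```

With 1000 deletable records on a 4-node cluster the first function gives 3000 controls and the second gives 3, so the batched cost no longer grows with the number of records.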
Ronnie Sahlberg
a89ed0fdc2 add a new tunable 'NoIPFailback'
when this tunable is set, ip addresses will only be failed over when a node
fails. And only those ip addresses held by the failed node will be reallocated
in the cluster.

When a node becomes active again, this will not lead to any failback of ip addresses.

This can reduce the number of "ip address movements" in the cluster since we don't automatically fail an ip address back, but can also lead to an unbalanced cluster since we no longer attempt to spread the ip addresses out evenly across the active nodes.

This tunable can NOT be active at the same time as DeterministicIPs are used.

(This used to be ctdb commit d3b8a461b15bc584fa1785eb5922de6d49d8f6c4)
2008-03-03 12:52:16 +11:00
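The difference NoIPFailback makes (commit a89ed0fdc2) can be sketched as follows. This is a simplified Python model and not ctdb's allocation code: `reallocate` is a hypothetical function, and the plain round-robin in the rebalancing branch only stands in for whatever balancing ctdb really performs.

```python
def reallocate(assignment, active_nodes, no_ip_failback):
    """Return a new {ip: node} mapping after a change in active nodes.

    assignment     -- current {ip: node} mapping
    active_nodes   -- set of node numbers currently able to host addresses
    no_ip_failback -- when True, only addresses on inactive nodes are moved
    """
    active = sorted(active_nodes)
    if not active:
        return {ip: None for ip in assignment}

    if no_ip_failback:
        new = dict(assignment)
        load = {node: 0 for node in active}
        for node in new.values():
            if node in load:
                load[node] += 1
        # Move only the addresses whose current holder is not active.
        for ip in sorted(new):
            if new[ip] not in load:
                target = min(active, key=lambda n: (load[n], n))
                new[ip] = target
                load[target] += 1
        return new

    # Rebalancing branch: spread all addresses evenly (simplified round-robin).
    return {ip: active[i % len(active)] for i, ip in enumerate(sorted(assignment))}


start = {"10.0.0.1": 0, "10.0.0.2": 1, "10.0.0.3": 2}
after_failure = reallocate(start, {0, 1}, no_ip_failback=True)
after_return = reallocate(after_failure, {0, 1, 2}, no_ip_failback=True)
rebalanced = reallocate(after_failure, {0, 1, 2}, no_ip_failback=False)
```

When node 2 fails, only its address moves. When node 2 returns, `after_return` is unchanged, so node 2 hosts nothing and the cluster stays unbalanced, whereas `rebalanced` gives it an address back at the cost of another move.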
Ronnie Sahlberg
e08519b74d when we reallocate the ip addresses for nodes, we must make sure that
a node that has been allocated to serve an ip actually CAN serve that ip
(if we use differing public_addresses files on each node)

(This used to be ctdb commit fdaf7cb2d7682507fbf4c6c2b833b327c93fac08)
2008-03-03 10:53:23 +11:00
Ronnie Sahlberg
57d29f1011 add a num_connected field to the rec structure that holds the number
of connected nodes

num_active only contains the number of active nodes and would thus not count
banned nodes

(This used to be ctdb commit 06d3ce470766ef0b60d68ccd84de5437146cc147)
2008-03-03 10:24:17 +11:00
Ronnie Sahlberg
f6f7f54bd6 add a new tunable : reclockpingperiod
once every such interval :
* the recovery master on each node will update the "connected" count in the
reclock count file (ctdb getreclock)
* if the node thinks it is a recovery master but it detects another node
  that is DISCONNECTED but which still holds a lock to the reclock count file
  this may mean that we have a split cluster.
  if that other node that is DISCONNECTED but still holds the lock on the reclock
  pnn count file, is MORE connected than the local node,
  yield the recmaster role and let the other half of the cluster take over

this adds a second, last chance mechanism to detect split clusters.
IF the cluster is split but GPFS is not yet split, this mechanism makes
the largest half of the cluster become the active half.

(This used to be ctdb commit 07af425f444531942cce8abff112c1524228d287)
2008-03-03 09:19:30 +11:00
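The yield rule in commit f6f7f54bd6 reduces to a three-part condition. A minimal Python sketch, with a hypothetical function name and parameters; how ctdb actually reads the other node's count from the reclock count file is not shown here.

```python
def should_yield_recmaster(local_connected, other_is_disconnected,
                           other_holds_lock, other_connected):
    """Decide whether a node that believes it is recmaster should step down.

    Yield only if the other node looks DISCONNECTED from here, still holds
    its lock on the reclock count file, and reports more connected nodes.
    """
    return (other_is_disconnected
            and other_holds_lock
            and other_connected > local_connected)


# Cluster of 5 split 2/3: the local half sees 2 nodes, the other half sees 3.
minority_yields = should_yield_recmaster(2, True, True, 3)
majority_yields = should_yield_recmaster(3, True, True, 2)
```

Here the recmaster in the 2-node half yields and the one in the 3-node half keeps the role, so the larger half becomes the active one. An exactly even split leaves both sides unwilling to yield under this rule.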
Ronnie Sahlberg
cadd95263f change recmaster from being a local variable in monitor_cluster() to be a member of the ctdb_recoverd structure
(This used to be ctdb commit b7f955338f50c92374b4f559268fb3a1a516aefa)
2008-03-03 07:53:46 +11:00
Ronnie Sahlberg
814570f904 update the reclock pnn count for how many nodes are connected to the current node once every 60 seconds
(This used to be ctdb commit bf1863cc9e2539b2c3e53c664b493b459ebfcc8b)
2008-02-29 13:14:47 +11:00
Ronnie Sahlberg
efa29c6c98 store the num_active variable (number of connected/active nodes) inside the rec
structure and avoid passing this as an extra parameter to do_recovery()

(This used to be ctdb commit 8bb229aa3b4bd41e48d4e4e2e148d8680c8ba436)
2008-02-29 12:55:20 +11:00
Ronnie Sahlberg
e0036942bc add a new file <reclock>.pnn where each recovery daemon can lock that byte at offset==pnn to offer an alternative way to detect which nodes are active instead of relying on CONNECTED being accurate.
(This used to be ctdb commit 21d3319eaf463e2a00637d440ee2d4d15f53bf09)
2008-02-29 12:37:42 +11:00
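Locking a single byte at offset == pnn, as commit e0036942bc describes, can be sketched with POSIX byte-range locks. This Python example uses `fcntl.lockf` on Unix and is not ctdb's implementation; `lock_pnn_byte` is a hypothetical helper name.

```python
import fcntl
import os


def lock_pnn_byte(fd, pnn):
    """Try to take an exclusive, non-blocking lock on the byte at offset pnn.

    Returns True if the lock was obtained, False if another process holds it.
    """
    try:
        # Lock exactly 1 byte, starting pnn bytes from the start of the file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, pnn, os.SEEK_SET)
        return True
    except OSError:
        return False
```

Each recovery daemon would lock only its own byte, so another node can probe a byte to learn whether its owner is still alive. POSIX record locks are per process, so two calls from the same process never conflict; a conflict is only visible from a different process.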
Ronnie Sahlberg
4adeafef11 add a control to get the name of the reclock file from the daemon
(This used to be ctdb commit 9effb22cc1616d684352d7ebabb359e69adb0f52)
2008-02-29 10:03:39 +11:00
Ronnie Sahlberg
7bc8007f93 add a new tunable DisableWhenUnhealthy which when set will cause a node to automatically become DISABLED anytime monitoring fails and the node becomes UNHEALTHY.
Use with caution.

(This used to be ctdb commit c20293360db67f9876b0c84e5e9e12a5868964cb)
2008-02-22 10:33:09 +11:00
Ronnie Sahlberg
f3b474cffb Add debug output to indicate why a node starts up in DISABLED state
(This used to be ctdb commit 8df75775966ead36e1073896fedeff674a6e0587)
2008-02-22 09:52:57 +11:00
Ronnie Sahlberg
39539f6044 Add a new parameter to /etc/sysconfig/ctdb
CTDB_START_AS_DISABLED="yes"

and command line argument
--start-as-disabled

When set, this makes the ctdb node always start in DISABLED mode, so it will not host any public ip addresses.
The administrator must manually "ctdb enable" the node after it has started when the administrator wants the node to start hosting public ip addresses.

Using this option it is possible to start ctdb on a node without causing any reallocation of ip addresses when it is starting. The node will still merge with the cluster and there will still be a recovery phase but the ip address allocations will not change in the cluster.

(This used to be ctdb commit b93d29f43f5306c244c887b54a77bca8a061daf2)
2008-02-22 09:42:52 +11:00
Ronnie Sahlberg
9f99b44fd1 to make it easier/less disruptive to add nodes to a running cluster
add a new control that causes the node to drop the current nodes list
and reread it from the nodes file.
During this operation, the node will also drop the tcp layer and restart it.

When we drop the tcp layer, by talloc_free()ing the ctcp structure
add a destructor to ctcp so that we also can clean up and remove the references in the ctdb structure to the transport layer

add two new commands for the ctdb tool.
one to list all nodes in the nodesfile and the second a command to trigger a node to drop the transport and reinitialize it with the new nodes file

(This used to be ctdb commit 4bc20ac73e9fa94ffd43cccb6eeb438eeff9963c)
2008-02-19 14:44:48 +11:00
Ronnie Sahlberg
bef60e8200 read the current debuglevel in each loop in the recovery daemon so that we
pick up when they change in the parent daemon

(This used to be ctdb commit 792d5471ff0c2947b6e66183925860de27f30eaf)
2008-02-18 19:38:04 +11:00
Ronnie Sahlberg
3f56526037 Specify and print debuglevels by name and not by number
(This used to be ctdb commit 79ad830294b8b677fbd0c5ad7ed6fbde71f74f8d)
2008-02-05 10:26:23 +11:00
Andrew Tridgell
f6e53f433b merge from ronnie
(This used to be ctdb commit e7b57d38cf7255be823a223cf15b7526285b4f1c)
2008-02-04 20:07:15 +11:00
Andrew Tridgell
9d6ac0cf55 added debug constants to allow for better mapping to syslog levels
(This used to be ctdb commit 7ba8f1dde318eab03f4257e5a89fd23e7281e502)
2008-02-04 17:44:24 +11:00
Andrew Tridgell
feb7c05734 removed dependence on dprintf
(This used to be ctdb commit c156db449218bf9432e3a6cb3ce0f617197c9069)
2008-01-29 14:31:51 +11:00
Andrew Tridgell
146d4b0db7 merge async recovery changes from Ronnie
(This used to be ctdb commit 576e317640d25f8059114f15c6f1ebcee5e5b6e2)
2008-01-29 13:59:28 +11:00
Andrew Tridgell
eb044bb1d6 make ctdb dumpmemory work remotely, and dump the talloc
memory tree to stdout. This is much more useful than putting it in the log, and also fixes
a bug where the pipe would overflow internally and cause ctdbd to lockup

(This used to be ctdb commit e236979e2162d9bd7a495086342168a696cf76c5)
2008-01-22 14:22:41 +11:00
Andrew Tridgell
d945b1af03 merge from ronnie
(This used to be ctdb commit 5f6d59b9d18c694d82591238bc7a6bb98726a3ed)
2008-01-17 16:46:56 +11:00
Ronnie Sahlberg
9625483c2d add ctdb_uptime.c
(This used to be ctdb commit 4c7153681ed4d68d601720d043f9ff95ac7647a9)
2008-01-17 16:37:05 +11:00
Ronnie Sahlberg
9055978b46 add a ctdb uptime command that prints when ctdb was started and when the
last recovery occurred

(This used to be ctdb commit b86e8ccbdac044bb949c4fc2ebb27635126272a9)
2008-01-17 11:33:23 +11:00
Andrew Tridgell
5683a8d1e1 cope better with large debug dumps
(This used to be ctdb commit fc3733f8e966376f50799fd1aa7b0a8e1cf66e0e)
2008-01-16 23:06:37 +11:00
Andrew Tridgell
be9594c156 fixed handling of \r from stdout of subprocesses
(This used to be ctdb commit f1acec5db4948d8e48412a8546bb181b08a2c5fd)
2008-01-16 22:40:01 +11:00
Andrew Tridgell
0080683da8 fixed two 64bit warnings
(This used to be ctdb commit c61fe240713ae2e917f69f827c6927405f02f5d4)
2008-01-16 22:16:15 +11:00
Andrew Tridgell
97ede94e40 The recovery daemon does not need to be a realtime task
(This used to be ctdb commit f552acf7c1f9dd37eb35d9716ea3fb02304aae8f)
2008-01-16 22:08:33 +11:00
Andrew Tridgell
b62b7fcde8 added syslog support, and use a pipe to catch logging from child processes to the ctdbd logging functions
(This used to be ctdb commit 1306b04cd01e996fd1aa1159a9521f2ff7b06165)
2008-01-16 22:03:01 +11:00
Ronnie Sahlberg
5b7838d768 ctdb_control_send() does not need to take an outdata parameter
remove the outdata parameter from the function and all callers

(This used to be ctdb commit e3951337f8df2ae19cce61c954036590c7a03582)
2008-01-16 10:23:26 +11:00
Andrew Tridgell
bf9e33d4cf - catch a case where the client disconnects during a call
- track all talloc memory, using NULL context

(This used to be ctdb commit bf89c56002f5311520e91cb367753bc46e5dddc9)
2008-01-16 09:44:48 +11:00
Andrew Tridgell
6c56e9d347 fixed a memory leak in the recovery daemon
(This used to be ctdb commit 73c27cf4c62cbe44b2b8fd00f907974d0808500c)
2008-01-15 20:11:44 +11:00
Ronnie Sahlberg
ba31feaec0 split node health monitoring and checking for connected/disconnected
nodes into two separate files.

move the monitoring of keepalives for detecting connected/disconnected 
remote nodes into ctdb_keepalive.c

(This used to be ctdb commit 23a57b20c314d5f11a433cf251eb9d9de743849a)
2008-01-15 08:42:12 +11:00
Andrew Tridgell
b866a147d2 get rid of monitor_retry as well
(This used to be ctdb commit c957cf9c1d99d5d3f4ca726f7a867c829660a2b7)
2008-01-10 14:49:43 +11:00
Andrew Tridgell
538f519dba exponential backoff in health monitoring for faster startup
(This used to be ctdb commit 1b04a1f675f73b48366ba98803a58c3d8df1b6e1)
2008-01-10 14:40:56 +11:00
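Commit 538f519dba gives no numbers, so the following is only a generic Python sketch of exponential backoff applied to a monitoring interval. The starting value, the doubling factor and the cap are all assumptions made for illustration.

```python
from itertools import islice


def monitor_intervals(initial, maximum):
    """Yield successive delays between health checks.

    Start with a short delay so a freshly started node is checked quickly,
    then double the delay each time until it reaches the steady-state cap.
    """
    interval = initial
    while True:
        yield interval
        interval = min(interval * 2, maximum)


first_six = list(islice(monitor_intervals(1, 15), 6))
# first_six == [1, 2, 4, 8, 15, 15]
```

The early checks come close together, which is what speeds up startup, and the interval then settles at the cap.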
Andrew Tridgell
3b3fceacbe block alarm signals during critical sections of vacuum
(This used to be ctdb commit cfb14ae76f00f10d27b56c034b2247ab12d63065)
2008-01-10 09:43:14 +11:00
Andrew Tridgell
59d69bb709 only match vacuum list if on the same database
(This used to be ctdb commit 27e56955e93027534780cc7549ddb224670d82b6)
2008-01-09 10:22:20 +11:00
Andrew Tridgell
9559249e15 ensure the main daemon doesn't use a blocking lock on the freelist
(This used to be ctdb commit 73f8257906b09e6516f675883d8e7a3c455ad869)
2008-01-08 22:31:48 +11:00
Andrew Tridgell
1c91398aef ensure the recovery daemon is not clagged up by vacuum calls
(This used to be ctdb commit ff7e80e247bf5a86adda0ef850d901478449675b)
2008-01-08 21:28:42 +11:00
Andrew Tridgell
96100fcae6 added two new ctdb commands:
ctdb vacuum   : vacuums all the databases, deleting any zero length
                 ctdb records

 ctdb repack   : repacks all the databases, resulting in a perfectly
                 packed database with no freelist entries

(This used to be ctdb commit 3532119c84ab3247051ed6ba21ba3243ae2f6bf4)
2008-01-08 17:23:27 +11:00
Andrew Tridgell
25bb60f112 show start/stop time of recovery on all nodes
(This used to be ctdb commit 9f7662279c367eb3e8a58e6f4aeca521e6f1f1d0)
2008-01-08 09:30:11 +11:00
Andrew Tridgell
37861932ce merge from ronnie
(This used to be ctdb commit 0aa6e04438aa5ec727815689baa19544df042cf7)
2008-01-07 16:17:22 +11:00
Andrew Tridgell
d38fbaa38b nicer onnode output
(This used to be ctdb commit ac5c1e090d007bc2e3965589731620b87c0217fb)
2008-01-07 14:31:13 +11:00
Andrew Tridgell
4258098e98 catch internal traversal errors
(This used to be ctdb commit 8caa85ad71be5d20a8d6f0cb3d52aff6905657a4)
2008-01-07 14:08:25 +11:00
Andrew Tridgell
528e4d7a2b more efficient traversal in pulldb control
(This used to be ctdb commit fe614b10868e63b70e081b5bbfb74bf16fdf5716)
2008-01-07 14:07:01 +11:00
Andrew Tridgell
748843a3c6 added paranoid transaction ids
(This used to be ctdb commit afc1da53873cdbd31fcc8c6b22fae262e344cf6e)
2008-01-06 13:24:55 +11:00
Andrew Tridgell
c08f2616cd new simpler and much faster recovery code based on tdb transactions
(This used to be ctdb commit 9ef2268a1674b01f60c58fed72af8ac982fe77a3)
2008-01-06 12:38:01 +11:00
Andrew Tridgell
4f5b717aa3 change default tunables to cope with larger dbs
(This used to be ctdb commit d91a2d43d1f0562cc3a12e6e1e2767f75d888f72)
2008-01-06 12:36:58 +11:00
Andrew Tridgell
108aafcdb2 non-persistent databases don't need sync transactions
(This used to be ctdb commit 52fd86addd23e4d6e0af2c716bd83d19675b1f5a)
2008-01-06 12:36:30 +11:00
Andrew Tridgell
9311f7fb7e fixed the bug that made "onnode N service ctdb start" hang
(This used to be ctdb commit b50dcb16f30a60abce42f491f9b0aae7948b8206)
2008-01-05 12:09:29 +11:00
Andrew Tridgell
e4aefbc66d a new tunable DatabaseMaxDead that enables the tdb max dead cache logic
(This used to be ctdb commit 01c519c3658a8fcb9545b507b597e723658e4c4e)
2008-01-05 09:36:53 +11:00
Andrew Tridgell
023a230d9c a useful hack for checking correct behaviour of recovery
(This used to be ctdb commit d88b95a5407b53ead47ca0638ee60653ea3d3d07)
2008-01-05 09:36:21 +11:00
Andrew Tridgell
f79dfd04c0 convert much of the recovery logic to be async and parallel across all nodes
(This used to be ctdb commit 8b72a02bf1045d8befb342a4111ca1316889262e)
2008-01-05 09:35:43 +11:00
Andrew Tridgell
9a625534c1 this fixes the non-dmaster bug that has plagued us for months
(This used to be ctdb commit 2acf6c6201862debfca054a09262f75c066d2deb)
2008-01-05 09:34:47 +11:00
Andrew Tridgell
fc21f78231 make some specific cases of the non-dmaster bug non-fatal
(This used to be ctdb commit 7b516ab06c7ba7ffe9ecf3f76720df5360176b2c)
2008-01-05 09:32:29 +11:00
Andrew Tridgell
e9987cf236 fixed a warning
(This used to be ctdb commit f34d0f9351c1cda3327efb14e173f249f7854570)
2008-01-05 09:30:49 +11:00
Andrew Tridgell
afc7275c16 fixed a warning
(This used to be ctdb commit d6255438d63943736b24a7a6da190b6933379a61)
2008-01-04 12:42:10 +11:00
Andrew Tridgell
2509821503 prevent a re-ban loop for single node clusters
(This used to be ctdb commit b20a3369655bcba274c99091157ba7466994e848)
2008-01-04 12:11:29 +11:00
Andrew Tridgell
41fb8e283b add randrec to Makefile
(This used to be ctdb commit ded1f7903e8a6525ab1888e8c4f50c71fa23cc19)
2008-01-04 09:19:06 +11:00
Andrew Tridgell
bb06e831a0 more optimisations to recovery
(This used to be ctdb commit 9a41ad0a842cd4f3792d6e84b5c809b7ff6f342e)
2008-01-02 22:44:46 +11:00
Andrew Tridgell
2a2f1e3d91 fixed segv on failed ctdb_ctrl_getnodemap
(This used to be ctdb commit 5daf9a72f0e60a9af7cf32ae6d759be7d94857ec)
2007-12-27 10:07:01 +11:00
Andrew Tridgell
6ef3bff4ed merge from ronnie
(This used to be ctdb commit 072ef744951d3aa59dd8be70578b99b18c37d988)
2007-12-04 15:20:40 +11:00
Andrew Tridgell
a55c3709ea make DeterministicIPs the default
(This used to be ctdb commit e7d077e98a40a62dbd6bfd174f29afba7b5529ef)
2007-12-04 15:18:27 +11:00
Ronnie Sahlberg
7cef33b40a rework banning/unbanning nodes
ctdb_recoverd.c
Always handle banning/unbanning locally on the node that is being 
banned/unbanned instead of on the recovery master.
This means that if a ban request comes in to the recovery master for a 
remote node, we pass the request on to the remote node instead of 
setting up the ban and ban timeouts locally.

ctdb.c
send ban/unban requests to the node being banned/unbanned instead of to 
the recmaster

(This used to be ctdb commit 880dd9f5fd0b91e450da93e195cc5c62cb1dcd6e)
2007-12-03 15:45:53 +11:00
Ronnie Sahlberg
64008e28bb for the banned status, we should allocate this structure as a child of
the banned_nodes array and not the rec structure so that  ban_state is 
destroyed when the banned_nodes array gets destroyed
(and so that when this struct is destroyed, that any pending 
ctdb_ban_timeout events are also destroyed.)

otherwise we may end up with multiple ban_timeout timed events going in 
parallel since we destroy/recreate the banned_nodes structure during 
election   but we never destroy/recreate the rec structure.

(This used to be ctdb commit fbd663d56a2a4421a5c0e541962c87e2e9c7cd82)
2007-12-03 11:39:17 +11:00
Andrew Tridgell
7edb41692e merge from ronnie
(This used to be ctdb commit 6653a0b67381310236e548e5fc0a9e27209b44e0)
2007-12-03 10:19:24 +11:00
Ronnie Sahlberg
2f1baf34d3 up the loglevel for the enable/disable monitoring to level 1
(This used to be ctdb commit 5043a0afeedbd30c7f64c2733c8ae5bf75479a98)
2007-12-01 10:06:42 +11:00
Ronnie Sahlberg
07dd0f6ff0 log that monitoring has been "disabled" not that it has been "stopped"
when monitoring is disabled

(This used to be ctdb commit e7c92f661a523deae9544b679d412ae79cc0ede7)
2007-11-30 10:53:35 +11:00
Ronnie Sahlberg
975fbc8e22 always set up a new monitoring event regardless of whether monitoring is
enabled or not

(This used to be ctdb commit c3035f46d1a65d2d97c8be7e679d59e471c092c2)
2007-11-30 10:14:43 +11:00
Ronnie Sahlberg
50573c5391 add ctdb_disable/enable_monitoring() that only modifies the monitoring
flag.
change calling of the recovered/takeip/releaseip event scripts to use 
these enable/disable functions instead of stopping/starting monitoring.

when we disable monitoring we want all events to still be running
in particular the events to monitor for dead nodes  and we only want to 
suppress running the monitor event scripts

(This used to be ctdb commit a006dcc4f75aba950dd701ad7d1a84e89df285e8)
2007-11-30 10:09:54 +11:00
Ronnie Sahlberg
0eb6c04dc1 get rid of the control to set the monitoring mode.
monitoring should always be enabled
(though a node may want to temporarily disable running the "monitor"
event scripts but can do so internally without the need for this 
control)

(This used to be ctdb commit e3a33618026823e6af845fd8513cddb08e6b5584)
2007-11-30 10:00:04 +11:00
Ronnie Sahlberg
192ba82b73 ->monitor_context is NULL when monitoring is disabled.
Check whether monitoring is enabled or not before creating new events
and log why the event is not set up otherwise

(This used to be ctdb commit 2f352b2606c04a65ce461fc2e99e6d6251ac4f20)
2007-11-30 09:02:37 +11:00
Ronnie Sahlberg
8ac8cce487 dont manipulate ctdb->monitoring_mode directly from the SET_MON_MODE
control, instead call ctdb_start/stop_monitoring()

ctdb_stop_monitoring() dont allocate a new monitoring context, leave it 
NULL. Also set the monitoring_mode in this function so that 
ctdb_stop/start_monitoring() and ->monitoring_mode are kept in sync.
Add a debug message to log that we have stopped monitoring.

ctdb_start_monitoring()  check whether monitoring is already active and 
make the function idempotent.
Create the monitoring context when monitoring is started.
Update ->monitoring_mode once the monitoring has been started.
Add a debug message to log that we have started monitoring.

When we temporarily stop monitoring while running an event script,
restart monitoring after the event script wrapper returns instead of in 
the event script callback.

Let monitoring_mode start out as DISABLED and let it be enabled once we call ctdb_start_monitoring.

dont check for MONITORING_DISABLED in check_fore_dead_nodes(). If 
monitoring is disabled, this event handler will not be called.

(This used to be ctdb commit 3a93ae8bdcffb1adbd6243844f3058fc742f76aa)
2007-11-30 08:44:34 +11:00
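The start/stop behaviour that commit 8ac8cce487 describes can be modelled as a small state machine. This Python sketch uses a hypothetical `Monitor` class; the attribute names mirror those in the commit message, but the mode strings and log texts are assumptions.

```python
class Monitor:
    """Hypothetical model of the monitoring state described in the commit."""

    def __init__(self):
        # Monitoring starts out disabled until start() is called.
        self.monitoring_mode = "DISABLED"
        self.monitor_context = None
        self.log = []

    def start(self):
        # Idempotent: a second start while active changes nothing.
        if self.monitor_context is not None:
            return
        self.monitor_context = object()  # context is created on start
        self.monitoring_mode = "ACTIVE"  # mode is updated once started
        self.log.append("monitoring has been started")

    def stop(self):
        # No new context is allocated on stop; it is simply left empty.
        self.monitor_context = None
        self.monitoring_mode = "DISABLED"
        self.log.append("monitoring has been stopped")


mon = Monitor()
mon.start()
mon.start()  # no effect, and no second log entry
mon.stop()
```

Because the mode is set inside `start()` and `stop()`, it cannot drift out of step with whether a context exists.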
Ronnie Sahlberg
5c3a270991 move ctdb_set_culprit higher up in the file
when we are the recmaster and we update the local flags for all the 
nodes, if one of the nodes fails to respond and give us its flags,
set that node as a "culprit"

as one of the first things to do in the monitor_cluster loop, check if 
the current culprit has caused too many (20) failures and if so ban that 
node.


this is for the situation where a remote node may still be CONNECTED but 
it fails to respond to the getnodemap control  causing the recovery 
master to loop in monitor_cluster   aborting the monitoring when the 
node fails to respond   but before anything will trigger a call to 
do_recovery().
If one or more of the databases or nodes are frozen at this stage, this 
would lead to smbd being blocked for potentially a longish time.

(This used to be ctdb commit 83b0261f2cb453195b86f547d360400103a8b795)
2007-11-28 15:04:20 +11:00
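The culprit rule in commit 5c3a270991 can be sketched as a counter with a threshold. This is illustrative Python: the threshold of 20 comes from the commit message, while the `CulpritTracker` class, the reset when the culprit changes, and the `>=` comparison are assumptions.

```python
BAN_THRESHOLD = 20  # "too many (20) failures" per the commit message


class CulpritTracker:
    """Hypothetical tracker for the node currently blamed for failures."""

    def __init__(self):
        self.culprit = None
        self.count = 0

    def set_culprit(self, pnn):
        # Assumption: blaming a different node restarts the count.
        if pnn != self.culprit:
            self.culprit = pnn
            self.count = 0
        self.count += 1

    def node_to_ban(self):
        """Checked at the top of each monitoring loop iteration."""
        if self.culprit is not None and self.count >= BAN_THRESHOLD:
            return self.culprit
        return None


tracker = CulpritTracker()
for _ in range(19):
    tracker.set_culprit(2)  # node 2 failed to answer getnodemap
before = tracker.node_to_ban()
tracker.set_culprit(2)
after = tracker.node_to_ban()
```

Checking the counter at the start of the loop lets the recovery master ban an unresponsive node even when each iteration aborts before anything triggers a recovery.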
Ronnie Sahlberg
9e73dc87cc Add a --node-ip argument so that one can specify which ip address a
specific instance of ctdbd should bind to. This helps when running a
"virtual" cluster on a single machine where all instances bind to 
different alias interfaces.

If --node-ip is specified, then we will try to bind to this ip 
address only. Otherwise we fall back to the original method trying the
ip addresses in /etc/ctdb/nodes one by one until we find one we can bind 
to.

No variable in /etc/sysconfig/ctdb added since this parameter only makes 
sense in a virtual test/debug cluster.

(This used to be ctdb commit d96cb02c2c24f9eabbc53d3d38e90dea49cff3e0)
2007-11-26 10:52:55 +11:00
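The address selection described in commit 9e73dc87cc can be sketched like this. It is a Python illustration, not ctdbd's code: `choose_bind_address` and the injectable `try_bind` callable are hypothetical, and the default `try_bind` simply attempts a real TCP bind on an ephemeral port.

```python
import socket


def _real_try_bind(ip):
    """Return True if a TCP socket can be bound to ip (ephemeral port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, 0))
        return True
    except OSError:
        return False
    finally:
        sock.close()


def choose_bind_address(node_ip, nodes_file_ips, try_bind=_real_try_bind):
    """Pick the address this daemon instance should bind to.

    If node_ip is given, it is the only candidate. Otherwise the addresses
    from the nodes file are tried one by one until a bind succeeds.
    """
    candidates = [node_ip] if node_ip else list(nodes_file_ips)
    for ip in candidates:
        if try_bind(ip):
            return ip
    return None


# Pretend only 10.0.0.2 is configured on this machine.
local_only = lambda ip: ip == "10.0.0.2"
fallback = choose_bind_address(None, ["10.0.0.1", "10.0.0.2"], local_only)
pinned = choose_bind_address("10.0.0.1", ["10.0.0.1", "10.0.0.2"], local_only)
```

With no explicit address the scan settles on 10.0.0.2. With an explicit address that cannot be bound, the result is `None` and the nodes file is not consulted.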
Ronnie Sahlberg
0597be3386 when monitoring the node from the recovery daemon, check that the
recovery daemon and the ctdb daemon both agree on whether the node is 
banned or not   and if they disagree then reban the node again after 
logging an error to the debug log

(This used to be ctdb commit 6cd6e534493066edd4bb2c6ae5be0e9a9d495aa0)
2007-11-23 12:41:29 +11:00
Ronnie Sahlberg
a260145f9f check for recursive bans in ctdb_ban_node() and remove the previous ban
if this is an attempt to ban an already banned node

(This used to be ctdb commit 214f2d7b04d0a491d466fc85c8d016efde416f9e)
2007-11-23 12:38:37 +11:00
Ronnie Sahlberg
6b284e5905 add log output for when ctdb_ban_node() and ctdb_unban_node() are called
when these functions are called to ban or unban a node make sure we 
update the CTDB_NODE_BANNED flag in rec->node_flags since this field and
flag are checked during the election process

(This used to be ctdb commit 740c632ae96a2d34327d1b575780aaf079d93f4f)
2007-11-23 12:36:14 +11:00
Ronnie Sahlberg
b5e79fb06f If update_local_flags() finds that a node has changed its BANNED status
so it differs from what the local ctdb daemon on the recovery master 
thinks it should be  we should call for a re-election

(This used to be ctdb commit 21ad6039c31ef5cc0e40a35a41220f91943947cb)
2007-11-23 11:53:06 +11:00
Ronnie Sahlberg
b2a81fb6b1 when we as the recovery daemon on the recovery master detect that the
flags differ between the local ctdb daemon and the remote node
we can force a flags update on all nodes and not just the local daemon

(This used to be ctdb commit a924eb89c966ecbae029ca137e06cffd40cc70fd)
2007-11-23 11:31:42 +11:00
Ronnie Sahlberg
af5bc9b915 add an extra log if we get a modflags control but it doesn't change any
flags


in update_local_flags()
(this is only called if we are or we believe we are the recmaster)
when we detect that the flags of a remote node are different from what 
our local node thinks the flags should be for that remote node
we should send a node-flag-changed message to the local daemon so 
that it updates the flags for that node.

(This used to be ctdb commit 36225e4e271f7a4065398253747fb20054f99a53)
2007-11-23 10:52:29 +11:00
Ronnie Sahlberg
c36ce05d08 if we get a modflag control but the flags remain unchanged, log this
(This used to be ctdb commit 5a0cd9b37b21665054bd35facd87f0a6ff4dcd55)
2007-11-23 10:31:51 +11:00
Ronnie Sahlberg
e95a4b5cdb when we print "Remote node had flags xx local had flags xx"
we swapped the flags when printing them to the log

(This used to be ctdb commit 9fc8831a7fcd34763567227d61cd525ec441ebf2)
2007-11-23 09:54:38 +11:00
Andrew Tridgell
45f0fdfc20 make election handling much more scalable
(This used to be ctdb commit 05938d462b92bd9ecb8e35f53651bded47c48675)
2007-11-13 10:27:44 +11:00
Andrew Tridgell
3427793f01 don't do the first startup event until we are out of recovery
(This used to be ctdb commit 689940eb6e23f16ee063331caf3986613a8963ea)
2007-11-12 13:10:15 +11:00
Andrew Tridgell
bde886988b prevent a deadly embrace between smbd and ctdbd by moving the calling
of the startup event scripts after the point where recovery has
started and the node is in normal operation

This makes the 'startup' script just a special type of the 'monitor'
script which is called first

(This used to be ctdb commit 7424c30a5fd04aea0137c466b4318c3f185280d8)
2007-11-12 10:53:11 +11:00
Ronnie Sahlberg
1d6a74f943 when shutting down, we should stop monitoring
(This used to be ctdb commit 325683ef8f326f0565a827ff2c493adcab6e0d64)
2007-10-22 12:34:51 +10:00
Ronnie Sahlberg
4a97876fb7 when we are shutting down, we should first shut down the recovery daemon
(This used to be ctdb commit 39ade6b329adcd3234124d6a8daaa6181abf739b)
2007-10-22 12:34:08 +10:00