mirror of https://github.com/samba-team/samba.git synced 2025-01-13 13:18:06 +03:00

344 Commits

Author SHA1 Message Date
Ronnie Sahlberg
98a54c4675 Track how long it takes to take out the recovery lock from both the main daemon and the recovery daemon.
Log this in "ctdb statistics".

Also add a variable "RecLockLatencyMs" that will log an error every time it takes longer than this to access the reclock file.
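
Illustration only, not ctdb code: the idea of timing an operation and logging when it exceeds a millisecond threshold can be sketched as below. The names `timed_call`, `threshold_ms`, `log` and `clock` are hypothetical and chosen for this example; the real daemon measures access to the reclock file.

```python
import time


def timed_call(operation, threshold_ms, log, clock=time.monotonic):
    """Run operation(), return (result, latency_ms), and log slow calls.

    `clock` is injectable so the behaviour can be checked without sleeping.
    """
    start = clock()
    result = operation()
    latency_ms = (clock() - start) * 1000.0
    if latency_ms > threshold_ms:
        # Only calls slower than the threshold are reported.
        log(f"operation took {latency_ms:.0f} ms (limit {threshold_ms} ms)")
    return result, latency_ms
```

A monotonic clock is used here so that wall-clock adjustments do not distort the measured latency.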

(This used to be ctdb commit 042377ed803bb8f7ca9d6ea1a387427b7b8ba45a)
2009-05-14 10:33:25 +10:00
root
af25fa38f3 fixed a problem with clients disconnecting during a traverse
When a client (such as smbstatus) is killed, it may have outstanding
traverse children on remote nodes. We need to catch the client
disconnect in ctdbd and send a control to all nodes telling them to
kill those outstanding traverse children.

(This used to be ctdb commit f2fb2df4619a14f7f6c11f9132ee7d793028042c)
2009-05-06 07:32:25 +10:00
root
6793f077a8 Add a new variable VerifyRecoveryLock which can be used to disable the test that the recovery daemon holds the lock properly when performing a recovery
(This used to be ctdb commit 329df9e47e6ca8ab5143985a999e68f37c6d88a5)
2009-05-01 01:17:59 +10:00
Ronnie Sahlberg
38ea6708dd add a tunable RecoveryDropAllIPs that controls how long a node that has been stuck in recovery waits before it yields all public addresses.
this now defaults to 60 seconds

This is useful if a split brain occurs due to network partitioning, since it makes sure that the "other half" of the cluster that does not contain the recovery master will eventually release all ips, thus avoiding a duplicate ip situation for the public addresses

(This used to be ctdb commit 70f21428c9eec96bcc787be191e7478ad68956dc)
2009-04-24 18:28:08 +10:00
Ronnie Sahlberg
d94917ec49 Change the (dodgy) seqnumfrequency variable to have ms resolution instead of second resolution.
Rename the variable to SeqnumInterval because
1, it is an interval and not a 1/interval unit
2, we then catch it when people still use the old variable, so they can update the sysconfig file, instead of silently changing the semantics of this variable

this is a real dodgy variable

(This used to be ctdb commit 68eac459e5d2b6b534f72821036675ffe5d7a350)
2009-04-01 17:21:38 +11:00
Ronnie Sahlberg
297ab50173 remove a prototype for a function no longer used
(This used to be ctdb commit 9ac9745ba9296d01e3b18148ae8c3240e51cf090)
2009-04-01 17:13:48 +11:00
Ronnie Sahlberg
ad40ee25f9 add a mechanism where the ctdb daemon will run a user-controlled script when the node status changes to/from UNHEALTHY state.
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.

(This used to be ctdb commit ce534a83a05dbd40238e4eee0669d60ff396f935)
2009-03-31 14:23:31 +11:00
Ronnie Sahlberg
689f76f0b0 Merge branch 'obnox'
(This used to be ctdb commit 972036a5d510fb9b399f1ee34a8861dee4221267)
2009-03-24 17:49:55 +11:00
Ronnie Sahlberg
7265c713db we need to set the port properly in the parse_ip helper
(This used to be ctdb commit 43fe18d86995744ba61c7a6405b70edcb265930a)
2009-03-24 13:45:11 +11:00
Michael Adam
a83ed1d743 Merge commit 'ctdb-ronnie/master'
(This used to be ctdb commit 39a972b0d6d0d70282c25c54a124b67431467e77)
2009-03-23 10:07:44 +01:00
root
629d5ee1fa add a new command "ctdb scriptstatus"
this command shows which eventscripts were executed during the last monitoring cycle and the status from each eventscript.

If an eventscript timed out or returned an error we also
show the output from the eventscript.

Example :
[root@rcn1 ctdb-git]# ./bin/ctdb scriptstatus
6 scripts were executed last monitoring cycle
00.ctdb              Status:OK    Duration:0.021 Mon Mar 23 19:04:32 2009
10.interface         Status:OK    Duration:0.048 Mon Mar 23 19:04:32 2009
20.multipathd        Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
40.vsftpd            Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
41.httpd             Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
50.samba             Status:ERROR    Duration:0.057 Mon Mar 23 19:04:33 2009
   OUTPUT:ERROR: Samba tcp port 445 is not responding

Add a new helper function "switch_from_server_to_client()" which can be used both
by the recovery daemon and by the child process we start for running the actual eventscripts.

Create several new controls, both for the eventscript child process to inform the master daemon of the current status of the scripts as well as for the ctdb tool to extract this information from the running daemon.

(This used to be ctdb commit c98f90ad61c9b1e679116fbed948ddca4111968d)
2009-03-23 19:07:45 +11:00
Michael Adam
839dec1b12 move common code of system_linux.c and system_aix.c into new system_common.c
Michael

(This used to be ctdb commit 124874847e5e03ce2a44bddfe778f01dfb0a7a03)
2009-02-28 03:08:31 +01:00
Michael Adam
3cca0f75e4 Fix treatment of link local ipv6 addresses: set the scope id.
metze / Michael

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 9d12de1ca6107801dada927729e755c0949d73bf)
2009-01-19 22:50:53 +01:00
root
321866dbba finish the ipv6 support.
allow clients to register either ipv4 or ipv6 client connections to the tickles list

(This used to be ctdb commit d9b44d7c3255b0fd7359b9afeb613e6ff4c4eaac)
2009-01-13 16:17:20 +11:00
Ronnie Sahlberg
edb7241c05 redesign how reloadnodes is implemented.
modify the transport methods to allow to restart individual connections
and set up destructors properly.

only tear down/set-up tcp connections to nodes removed from the cluster
or nodes added to the cluster.
Leave tcp connections to unchanged nodes connected.

make "ctdb reloadnodes" explicitly cause a recovery of the cluster once
the files have been reloaded
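
Illustration only: the "touch only what changed" rule described above amounts to a set difference between the old and new node lists. `plan_reload` and the dictionary keys are hypothetical names for this sketch; the real implementation works on the transport's connection structures.

```python
def plan_reload(old_nodes, new_nodes):
    """Decide which connections to set up, tear down, or leave alone."""
    old, new = set(old_nodes), set(new_nodes)
    return {
        "connect": sorted(new - old),     # nodes added to the cluster
        "disconnect": sorted(old - new),  # nodes removed from the cluster
        "keep": sorted(old & new),        # unchanged nodes stay connected
    }
```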

(This used to be ctdb commit d1057ed6de7de9f2a64d8fa012c52647e89b515b)
2008-12-02 13:26:30 +11:00
Ronnie Sahlberg
a782bdbacd new version 1.0.66

(This used to be ctdb commit 499a01fece2a5f24f1b2943cf3dc6e9a3a8ca3b5)
2008-11-24 19:06:02 +11:00
Ronnie Sahlberg
94a56ea410 rewrite the handling of flag updates across the cluster to eliminate a
race between the ctdb tool and the recovery daemon both at once
trying to push flag changes across the cluster.

(This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)
2008-11-20 12:43:18 +11:00
Ronnie Sahlberg
e1b0cea427 add control and logging of very high latencies.
log the type of operation and the database name for all latencies higher
than a threshold

(This used to be ctdb commit 1d581dcd507e8e13d7ae085ff4d6a9f3e2aaeba5)
2008-10-30 12:49:53 +11:00
Ronnie Sahlberg
b9bd20ce55 add a context and a timed event so that once we have been in recovery
mode for too long we drop all public ip addresses

(This used to be ctdb commit 403c68f96e1380dd07217c688de2730464f77ea0)
2008-10-22 11:04:41 +11:00
Ronnie Sahlberg
ce66008e08 specify a "script log level" on the commandline to set the log
level at which all output from eventscripts will be logged

(This used to be ctdb commit cdc79d4f22f1a6aec5c34115969421f93663932a)
2008-10-17 07:56:12 +11:00
Ronnie Sahlberg
cb300382b0 update TAKEIP/RELEASEIP/GETPUBLICIP/GETNODEMAP controls so we retain an
older ipv4-only version of these controls.

We need this so that we are backward compatible with old versions of ctdb
and so that we can interoperate with an ipv4-only recmaster during a
rolling upgrade.

(This used to be ctdb commit 6b76c520f97127099bd9fbaa0fa7af1c61947fb7)
2008-10-14 10:40:29 +11:00
Ronnie Sahlberg
a3bbe238c9 The ctdb daemon keeps track of whether the recovery process is running
correctly by measuring how long it was since the last successful
communication with the recovery daemon was recorded.

After a certain timeout the ctdb daemon would deem the recovery daemon
as inoperable and shut down.

If the system clock is suddenly changed forward by many (60 or more)
seconds this could cause the timeout to trigger prematurely/immediately
where ctdb would incorrectly think that more than 60 seconds had passed
since last successful communications and thus abort.

Instead of checking for one timeout occurring, only deem the recovery
daemon to be "down" and trigger a shutdown if communications have
timed out for three intervals in a row.
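
Illustration only: counting consecutive missed intervals, rather than acting on a single timeout, can be sketched as below. `RecoveryDaemonWatchdog` and `on_interval` are hypothetical names; only the "three in a row" rule comes from the commit message.

```python
class RecoveryDaemonWatchdog:
    """Trigger only after `max_missed` consecutive intervals without a reply."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def on_interval(self, got_reply):
        """Record one check interval; return True when shutdown should trigger."""
        if got_reply:
            # Any successful communication resets the streak.
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= self.max_missed
```

A single clock jump can make at most one interval look expired, so it no longer reaches the threshold on its own.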

(This used to be ctdb commit 196968c552e6ebcb57389d769a4b25f42fa8bc5d)
2008-09-17 14:17:41 +10:00
Ronnie Sahlberg
6474f3278d additional monitoring between the two daemons.
we currently only monitor that the daemons are running by kill(0, pid)
and verifying that the domain socket between them is ok.

this is not sufficient since we can have a situation where the recovery
daemon is hung.

this new code monitors that the recovery daemon is operating.
if the recovery hangs, we log this and shut down the main daemon

(This used to be ctdb commit cd69d292292eaab3aac0e9d9fc57cb621597c63c)
2008-09-09 13:44:46 +10:00
Ronnie Sahlberg
a35fa0aa8f rename ctdb_tcp_client back to the original name ctdb_control_tcp
(This used to be ctdb commit 4d1c0418cfe6170bc081684dbe45908a5d285f0b)
2008-08-27 10:24:35 +10:00
Ronnie Sahlberg
5193caec6d make the function to canonicalize a sockaddr structure public
(This used to be ctdb commit 1157d61a0bc557d8ffc453c518dfc48473492bfd)
2008-08-20 11:58:27 +10:00
Ronnie Sahlberg
ef997d344f initial ipv6 patch
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>

(This used to be ctdb commit 1f131f21386f428bbbbb29098d56c2f64596583b)
2008-08-19 14:58:29 +10:00
Andrew Tridgell
aa1bc0abba added a new control CTDB_CONTROL_TRANS2_COMMIT_RETRY so we can tell
the difference between an initial commit attempt and a retry, which
allows us to get the persistent updates counter right for retries

(This used to be ctdb commit 7f29c50ccbc7789bfbc20bcb4b65758af9ebe6c5)
2008-08-08 13:11:28 +10:00
Andrew Tridgell
5a0249d34c return a more detailed error code from a trans2 commit error
(This used to be ctdb commit 6915661a460cd589b441ac7cd8695f35c4e83113)
2008-08-08 09:58:49 +10:00
Ronnie Sahlberg
b9d8bb23af remove the reclock file we store pnn counts in.
This file creates additional locking stress on the backend filesystem and we may not need it anyway.

(This used to be ctdb commit 84236e03e40bcf46fa634d106903277c149a734f)
2008-08-06 11:52:26 +10:00
Andrew Tridgell
98502135e7 added new multi-record transaction commit code
(This used to be ctdb commit 9ff3380099fe6f4d39de126db0826971a10ee692)
2008-07-30 19:57:00 +10:00
Andrew Tridgell
abe0232818 rename the structure we use for marshalling multiple records
(This used to be ctdb commit 4d205476d286570a6e1f52b59af42858ce051106)
2008-07-30 14:24:56 +10:00
Ronnie Sahlberg
1bfcca524d From Michael Adam,
change one element from private to private_data

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>

(This used to be ctdb commit 0de79352c9b36c118e36905f08ebbe38ecbb957e)
2008-07-22 09:07:42 +10:00
Ronnie Sahlberg
6eb4e46fe1 Add two new controls to start and cancel a persistent update.
This allows ctdb to automatically start a new full blown recovery
if a client has started updating the local tdb for a persistent database
but is killed with kill -9 before it has ensured the update is distributed clusterwide.

(This used to be ctdb commit 1ffccb3e0b3b5bd376c5302304029af393709518)
2008-07-17 13:50:55 +10:00
Ronnie Sahlberg
ab8535eaa5 make LVS a capability so that we can see which nodes are configured with
LVS and which are not using LVS.

"ctdb getcapabilities"

(This used to be ctdb commit 172d01fb34f032e098b1c77a7b0f17bf11301640)
2008-07-10 10:37:22 +10:00
Andrew Tridgell
9999f18369 an extraordinarily ugly patch!
This is a hack to allow backtraces under valgrind to show what opcode
is getting uninitialised bytes

(This used to be ctdb commit 67bb12c8f0af5914efb44b76bc6ddbb11fc0fcdf)
2008-07-04 18:00:24 +10:00
Andrew Tridgell
8be67e0e09 CTDB_NO_MEMORY_VOID() needs to return on error
(This used to be ctdb commit 6d21fd57bedffce2298ce7fe4c7d889c858ba7fa)
2008-07-04 16:58:29 +10:00
Ronnie Sahlberg
ef769e7237 track both when we last started and ended a recovery.
make ctdb uptime print how long the recovery took

in the recovery daemon when we check that the public ip address
allocation on the local node is correct (we have the ips we should have
and we don't have any we shouldn't have) use ctdb uptime and check the
recovery start/stop times and make sure we don't check for ip allocation
inconsistencies during a recovery, where the ip address allocation is in flux.

(This used to be ctdb commit f86551580349b7f662f9a07e4eb0c1189e38e429)
2008-07-02 13:55:59 +10:00
Ronnie Sahlberg
05b50ebe0a print the opcode when an async callback detects an error
(This used to be ctdb commit 423934629704683d3a3042570577fb4e04b17a6d)
2008-07-02 12:21:53 +10:00
Ronnie Sahlberg
779468ab3f if the event scripts hang EventScriptsBanCount times in a row
the node will ban itself for the default recovery ban period

(This used to be ctdb commit 7239d7ecd54037b11eddf47328a3129d281e7d4a)
2008-06-13 13:18:06 +10:00
Ronnie Sahlberg
4b6b094860 add a callback for failed nodes to the async control helper.
this callback is called for every node where the control failed (or timed out)

when we issue the start recovery control from recovery master,
set any node that fails as a culprit so it will eventually be banned
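
Illustration only: a per-node failure callback on a broadcast helper. This sketch is sequential and synchronous, which the real async control helper is not; `broadcast_control`, `send_control`, `on_success` and `on_failure` are hypothetical names.

```python
def broadcast_control(nodes, send_control, on_success=None, on_failure=None):
    """Send a control to every node and report per-node outcomes via callbacks.

    Returns the list of nodes for which the control failed or timed out.
    """
    failed = []
    for node in nodes:
        try:
            outdata = send_control(node)
        except OSError as exc:  # TimeoutError is a subclass of OSError
            failed.append(node)
            if on_failure is not None:
                on_failure(node, exc)
        else:
            if on_success is not None:
                on_success(node, outdata)
    return failed
```

A caller could use `on_failure` to increment a culprit counter for the node, which is the role the commit gives the callback during the start recovery control.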

(This used to be ctdb commit 72f89bac13cbe8c3ca3e7a942469cd2ff25abba2)
2008-06-12 16:53:36 +10:00
Ronnie Sahlberg
d8433cacb2 first cut to convert takeover_callback_state{}
to use ctdb_sock_addr instead of sockaddr_in

(This used to be ctdb commit 5444ebd0815e335a75ef4857546e23f490a22338)
2008-06-04 17:12:57 +10:00
Ronnie Sahlberg
7d39ac131b convert handling of gratuitous arps and their controls and helpers to
use the ctdb_sock_addr structure so they work for both ipv4 and ipv6

(This used to be ctdb commit 86d6f53512d358ff68b58dac737ffa7576c3cce6)
2008-06-04 15:13:00 +10:00
Ronnie Sahlberg
ceaf488f05 do persistent writes in a child process
(This used to be ctdb commit 2da3d1f876f5d654f849af8a3e588f5a61300c3d)
2008-05-28 13:04:25 +10:00
Ronnie Sahlberg
ed2cf0291d second try for safe transaction stores into persistent tdb databases
for stores into persistent databases, ALWAYS use a lockwait child to take out the lock for the record and never the daemon itself.

(This used to be ctdb commit 7fb6cf549de1b5e9ac5a3e4483c7591850ea2464)
2008-05-22 12:47:33 +10:00
Ronnie Sahlberg
909ff219e0 Start implementing support for ipv6.
This enhances the framework for sending tcp tickles to be able to send ipv6 tickles as well.

Since we cannot use one single RAW socket to send both handcrafted ipv4 and ipv6 packets, and rather than always opening TWO sockets (one ipv4 and one ipv6), we get rid of the helper ctdb_sys_open_sending_socket() and just open (and close) a raw socket of the appropriate type inside ctdb_sys_send_tcp().
We know which type of socket v4/v6 to use based on the sin_family of the destination address.

Since ctdb_sys_send_tcp() opens its own socket we no longer need to pass a socket
descriptor as a parameter. Get rid of this redundant parameter and fix up all callers.
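
Illustration only: choosing the socket family from the destination address. The commit does this in C by looking at sin_family; this sketch derives the family from a textual address instead. `raw_socket_family` is a hypothetical helper, and no raw socket is opened here because that requires elevated privileges.

```python
import ipaddress
import socket


def raw_socket_family(destination):
    """Return the socket address family that matches `destination`."""
    address = ipaddress.ip_address(destination)
    # One family per call, decided by the destination alone.
    return socket.AF_INET6 if address.version == 6 else socket.AF_INET
```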

(This used to be ctdb commit 406a2a1e364cf71eb15e5aeec3b87c62f825da92)
2008-05-14 15:47:47 +10:00
Ronnie Sahlberg
2bc0e5a69f add a new container to hold a socketaddr for either ipv4 or ipv6
(This used to be ctdb commit 93b98838824fae5f47e4ed6b95ae9e4e7597bec3)
2008-05-14 15:40:44 +10:00
Ronnie Sahlberg
b8eb5925cf Try to use tdb transactions when updating a record and record header inside the ctdb daemon.
If a transaction could be started, do safe transaction store when updating the record inside the daemon.
If the transaction could not be started (maybe another samba process has a lock on the database?) then just do a normal store rather than blocking the ctdb daemon.
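
Illustration only: the "try a transaction, otherwise fall back to a plain store" pattern described above. `MemoryDb`, `LockBusy` and `store_record` are hypothetical stand-ins written for this sketch; they model the behaviour and are not the tdb or ctdb API.

```python
class LockBusy(Exception):
    """Raised when a transaction cannot be started without blocking."""


class MemoryDb:
    """Tiny in-memory stand-in for a database with optional transactions."""

    def __init__(self, busy=False):
        self.busy = busy
        self.data = {}
        self.pending = None

    def transaction_start(self):
        if self.busy:
            raise LockBusy("database is locked by another process")
        self.pending = {}

    def store(self, key, value):
        target = self.data if self.pending is None else self.pending
        target[key] = value

    def transaction_commit(self):
        self.data.update(self.pending)
        self.pending = None

    def transaction_cancel(self):
        self.pending = None


def store_record(db, key, value):
    """Store inside a transaction when possible, else do a normal store."""
    try:
        db.transaction_start()
    except LockBusy:
        db.store(key, value)  # fallback: do not block the caller
        return "plain"
    try:
        db.store(key, value)
        db.transaction_commit()
    except Exception:
        db.transaction_cancel()
        raise
    return "transaction"
```

The fallback trades durability guarantees for responsiveness, which matches the stated goal of never blocking the daemon.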

The client can "signal" ctdb that updates to this database should, if possible, be done using safe transactions by specifying the TDB_NOSYNC flag when attaching to the database.
The TDB flags are passed to ctdb in the "srvid" field of the control header when attaching using the CTDB_CONTROL_DB_ATTACH_PERSISTENT.

Currently, samba3.2 does not yet tell ctdbd to handle any persistent databases using safe transactions.

If samba3.2 wants a particular persistent database to be handled using
safe transactions inside the ctdbd daemon, it should pass
TDB_NOSYNC as the flags to the call to attach to a persistent database
in ctdbd_db_attach(); it currently specifies 0 as the srvid

(This used to be ctdb commit 8d6ecf47318188448d934ab76e40da7e4cece67d)
2008-05-12 13:37:31 +10:00
Ronnie Sahlberg
92b61cd7d5 Expand the client async framework so that it can take a callback function.
This allows us to use the async framework also for controls that return
outdata.

Add a "capabilities" field to the ctdb_node structure. This field is
only initialized and kept valid inside the recovery daemon context and not
inside the main ctdb daemon.

change the GET_CAPABILITIES control to return the capabilities in outdata instead of in the res return variable.

When performing a recovery inside the recovery daemon, read the capabilities from all connected nodes and update the ctdb->nodes list of nodes.
When building the new vnnmap after the database rebuild in recovery, do not include any nodes which lack the LMASTER capability in the new vnnmap.
The exception is when no available connected node has the LMASTER capability, in which case we let the local node (recmaster) take on the lmaster role temporarily (i.e. become a member of the vnnmap list)
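
Illustration only: the vnnmap selection rule described above. The node representation (a dict keyed by pnn) and the function name `build_vnnmap` are assumptions made for this sketch, and the flag values are illustrative constants rather than values taken from the ctdb headers.

```python
CAP_RECMASTER = 0x1  # illustrative flag values
CAP_LMASTER = 0x2


def build_vnnmap(nodes, local_pnn):
    """Return the pnns that should be members of the new vnnmap.

    `nodes` maps pnn -> {"connected": bool, "capabilities": int}.
    """
    lmasters = sorted(
        pnn
        for pnn, node in nodes.items()
        if node["connected"] and node["capabilities"] & CAP_LMASTER
    )
    # With no LMASTER-capable node available, the local node
    # (the recmaster) takes on the role temporarily.
    return lmasters or [local_pnn]
```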

(This used to be ctdb commit 0f1883c69c689b28b0c04148774840b2c4081df6)
2008-05-06 15:42:59 +10:00
Ronnie Sahlberg
a9c45f9513 Add a capabilities field to the ctdb structure
Define two capabilities :
can be recmaster
can be lmaster
Default both capabilities to YES

Update the ctdb tool to read capabilities off a node

(This used to be ctdb commit 50f1255ea9ed15bb8fa11cf838b29afa77e857fd)
2008-05-06 10:02:27 +10:00
Ronnie Sahlberg
0e1a20b603 Revert "Revert "Revert "- accept an optional set of tdb_flags from clients on open a database,"""
remove the transaction stuff and push so that the git tree will work

This reverts commit 539bbdd9b0d0346b42e66ef2fcfb16f39bbe098b.

(This used to be ctdb commit 876d3aca18c27c2239116c8feb6582b3a68c6571)
2008-04-10 15:59:51 +10:00