mirror of https://github.com/samba-team/samba.git synced 2025-01-26 10:04:02 +03:00

410 Commits

Author SHA1 Message Date
root
08492a524b change the talloc hierarchy for the main transaction_start context and the individual transaction_all handles
(This used to be ctdb commit 919b29850671b59bcf748aec25658ea09d8b4f1c)
2009-05-06 07:33:07 +10:00
root
af25fa38f3 fixed a problem with clients disconnecting during a traverse
When a client (such as smbstatus) is killed, it may have outstanding
traverse children on remote nodes. We need to catch the client
disconnect in ctdbd and send a control to all nodes telling them to
kill those outstanding traverse children.

(This used to be ctdb commit f2fb2df4619a14f7f6c11f9132ee7d793028042c)
2009-05-06 07:32:25 +10:00
root
bfea570af4 when tracking the ctdb statistics, only decrement num_clients and pending_calls IFF the counter is >0
Otherwise there is a chance that the statistics are reset to zero after the counter has been incremented (a client connects), and when the client disconnects we decrement it to a negative number.

This is a purely cosmetic patch with no operational impact on ctdb.

(This used to be ctdb commit 72f1c696ee77899f7973878f2568a60d199d4fea)
2009-05-01 12:30:26 +10:00
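The guard described in commit bfea570af4 above can be sketched in C as follows. This is a minimal illustration, not the actual ctdb code: the struct `stats_sketch` and the helper `stat_decrement` are hypothetical names introduced here, and only the counter names `num_clients` and `pending_calls` come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the statistics structure. Only the two
 * counter names are taken from the commit message. */
struct stats_sketch {
    int32_t num_clients;
    int32_t pending_calls;
};

/* Decrement a counter only while it is still positive, so that a
 * statistics reset between the increment and the decrement cannot
 * push the value below zero. */
static void stat_decrement(int32_t *counter)
{
    assert(counter != NULL);
    if (*counter > 0) {
        (*counter)--;
    }
}
```

With this guard, a counter that was reset to zero while a client was still connected simply stays at zero when that client disconnects.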
root
6793f077a8 Add a new variable VerifyRecoveryLock which can be used to disable the test that the recovery daemon holds the lock properly when performing a recovery
(This used to be ctdb commit 329df9e47e6ca8ab5143985a999e68f37c6d88a5)
2009-05-01 01:17:59 +10:00
Ronnie Sahlberg
3a6ace330e we only need to have transaction nesting disabled when we start the new transaction for the recovery
(This used to be ctdb commit bf8dae63d10498e6b6179bbacdd72f1ff0fc60be)
2009-04-26 08:48:15 +10:00
Ronnie Sahlberg
d20bb2498d set the TDB_NO_NESTING flag for the tdb before we start a transaction from within recovery
(This used to be ctdb commit 1b2029dbb055ff07367ebc1f307f5241320227b2)
2009-04-26 08:42:54 +10:00
Ronnie Sahlberg
38ea6708dd add a tuneable RecoveryDropAllIPs that controls how long a node that has been stuck in recovery waits before it yields all public addresses.
this now defaults to 60 seconds

This is useful if a split brain occurs due to network partitioning, since it makes sure that the "other half" of the cluster that does not contain the recovery master will eventually release all ips, thus avoiding a duplicate ip situation for the public addresses

(This used to be ctdb commit 70f21428c9eec96bcc787be191e7478ad68956dc)
2009-04-24 18:28:08 +10:00
Ronnie Sahlberg
ce3283f7cb increase the loglevel for the message we print when we automatically release all ips when we have been in recovery for too long
(This used to be ctdb commit 7af060ded5113a49832f6a08a942523a202586b3)
2009-04-24 18:11:10 +10:00
Ronnie Sahlberg
3363480da4 tweak some timeouts so that we do trigger a banning even if the control hangs/times out
(This used to be ctdb commit 1860a365e6ba8212e15c33016c80a2adcf8d10f4)
2009-04-24 14:45:07 +10:00
Ronnie Sahlberg
e5532b6f26 If we can not pull a database from a node during recovery, mark this node as a "culprit" so that it will eventually become banned.
(This used to be ctdb commit 69dc3bf60b86d8df6dc5c7c6ebf303e847fb2ba9)
2009-04-24 14:44:57 +10:00
Ronnie Sahlberg
a87e6f56ae we only need to switch into client mode from the eventscript child if we are running the monitor event
(This used to be ctdb commit 13e2c9044950f21918e4610726e73ed3d8f76920)
2009-04-06 14:03:09 +10:00
Ronnie Sahlberg
e5e2f6f8f7 increase the listen queue. Now that the eventscripts may become clients and connect back to the server we do get a lot more concurrent connection attempts (takeip/releaseip are performed in parallel)
(This used to be ctdb commit 018f8b0b1823ef59b46f1a671aec5309d10628f4)
2009-04-06 14:00:41 +10:00
Ronnie Sahlberg
1f87ee85bc use _exit() and not exit() when we terminate a failed eventscript child process
(This used to be ctdb commit 33b296cee177adc61edc911caec8c24b3efa8441)
2009-04-06 13:16:36 +10:00
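The difference that commit 1f87ee85bc above relies on can be sketched as follows. In a forked child, `exit()` runs `atexit()` handlers and flushes stdio buffers that were copied from the parent, while `_exit()` terminates the child immediately. The helper name `run_failing_child` is hypothetical and this is not the ctdb eventscript code, only a small POSIX illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that "fails" and terminates with _exit().
 * Returns the child's exit status, or -1 on error. */
static int run_failing_child(int status)
{
    pid_t pid;
    int wstatus = 0;

    assert(status >= 0 && status <= 255);

    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: _exit() skips atexit() handlers and does not flush
         * stdio buffers inherited from the parent process. */
        _exit(status);
    }

    /* Parent: collect the child's exit status. */
    if (waitpid(pid, &wstatus, 0) != pid || !WIFEXITED(wstatus)) {
        return -1;
    }
    return WEXITSTATUS(wstatus);
}
```

The parent still sees the intended exit status, but none of the parent's cleanup state is replayed in the child.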
Ronnie Sahlberg
2e1208e648 We don't need to verify the nodemap on remote nodes that are banned
(This used to be ctdb commit 7f8f9385deee6eff2b7303147bc6412bbdc122df)
2009-04-06 12:00:22 +10:00
Ronnie Sahlberg
2393df3989 if we can't pull the remote nodemap off a node we should mark it as a culprit so it eventually becomes banned.
(This used to be ctdb commit 0889ae3c237bdb3bd72d45f2f64f5e5d8420870c)
2009-04-02 14:50:43 +11:00
Ronnie Sahlberg
d94917ec49 Change the (dodgy) seqnumfrequency variable to have ms resolution instead of second resolution.
Rename the variable to SeqnumInterval for
1, it is an interval and not a 1/interval unit
2, so that we catch when people use this old variable and can update the sysconfig file instead of silently changing the semantics of this variable

this is a real dodgy variable

(This used to be ctdb commit 68eac459e5d2b6b534f72821036675ffe5d7a350)
2009-04-01 17:21:38 +11:00
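To illustrate the unit change in commit d94917ec49 above, an interval given in milliseconds can be converted to a `struct timeval` for a timer as sketched below. The function name `interval_ms_to_timeval` is hypothetical; how ctdb itself consumes SeqnumInterval is not shown in the commit message, so treat this as an assumption-laden example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Convert an interval in milliseconds (the resolution SeqnumInterval
 * is described as using) into seconds plus microseconds. */
static struct timeval interval_ms_to_timeval(uint32_t interval_ms)
{
    struct timeval tv;

    tv.tv_sec  = interval_ms / 1000;
    tv.tv_usec = (interval_ms % 1000) * 1000;

    /* The microsecond part must stay below one second. */
    assert(tv.tv_usec < 1000000);
    return tv;
}
```

A value of 1500 therefore means one and a half seconds, where the same number in the old second-resolution variable would have meant 25 minutes; this is the kind of silent change in meaning that the rename guards against.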
Ronnie Sahlberg
ad40ee25f9 add a mechanism where the ctdb daemon will run a user-controlled script when the node status changes to/from the UNHEALTHY state.
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.

(This used to be ctdb commit ce534a83a05dbd40238e4eee0669d60ff396f935)
2009-03-31 14:23:31 +11:00
Ronnie Sahlberg
7265c713db we need to set the port properly in the parse_ip helper
(This used to be ctdb commit 43fe18d86995744ba61c7a6405b70edcb265930a)
2009-03-24 13:45:11 +11:00
root
629d5ee1fa add a new command "ctdb scriptstatus"
this command shows which eventscripts were executed during the last monitoring cycle and the status from each eventscript.

If an eventscript timed out or returned an error we also
show the output from the eventscript.

Example :
[root@rcn1 ctdb-git]# ./bin/ctdb scriptstatus
6 scripts were executed last monitoring cycle
00.ctdb              Status:OK    Duration:0.021 Mon Mar 23 19:04:32 2009
10.interface         Status:OK    Duration:0.048 Mon Mar 23 19:04:32 2009
20.multipathd        Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
40.vsftpd            Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
41.httpd             Status:OK    Duration:0.011 Mon Mar 23 19:04:33 2009
50.samba             Status:ERROR    Duration:0.057 Mon Mar 23 19:04:33 2009
   OUTPUT:ERROR: Samba tcp port 445 is not responding

Add a new helper function "switch_from_server_to_client()" which can be used
both by the recovery daemon and by the child process we start for running the actual eventscripts.

Create several new controls, both for the eventscript child process to inform the master daemon of the current status of the scripts and for the ctdb tool to extract this information from the running daemon.

(This used to be ctdb commit c98f90ad61c9b1e679116fbed948ddca4111968d)
2009-03-23 19:07:45 +11:00
root
dc05c1b80c create a helper function that converts a ctdb instance in daemon mode to become
a ctdb client instance.

use this from the recovery daemon child process to switch to client mode
and connect back to the main daemon

(This used to be ctdb commit 16f31786a031255ab5b3099a0a3c745de973347a)
2009-03-23 12:37:30 +11:00
Mathieu PARENT
f0d585217e build: Make log-directory configurable independently of VARDIR
This adds a new configure option "--with-logdir".
logdir defaults to "${localstatedir}/log" .
It is important to have logdir configurable for debian systems,
where localstatedir is set to "/var/lib" and not "/var".

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit b0c6854d1e886456fabdc8f1c3bd21c89311c601)
2009-02-04 00:19:22 +01:00
Michael Adam
3cca0f75e4 Fix treatment of link local ipv6 addresses: set the scope id.
metze / Michael

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 9d12de1ca6107801dada927729e755c0949d73bf)
2009-01-19 22:50:53 +01:00
Stefan Metzmacher
23b550d6fc Fix segfault in ip takeover fallback code.
metze

Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit 3b88f3dec5227e8579672974f7028fb356ee1d94)
2009-01-16 07:22:59 +11:00
root
321866dbba finish the ipv6 support.
allow clients to register either ipv4 or ipv6 client connections to the tickles list

(This used to be ctdb commit d9b44d7c3255b0fd7359b9afeb613e6ff4c4eaac)
2009-01-13 16:17:20 +11:00
Ronnie Sahlberg
28bbe2f407 don't call ctdb_fatal() just because we are asked to restart a connection
to a remote node and ctdb->methods is NULL.

This can happen when we are in the middle of a normal shutdown of the
daemon and we have already shut down the transport layer (thus setting
ctdb->methods == NULL in the transport layer destructor)
and there is some unprocessed data related to a remote node.

This prevents an ugly race condition where ctdb might sometimes (rarely)
cause a core dump during "ctdb shutdown".

(This used to be ctdb commit fc4e8b5a5d3699221620a8d76701c8589f2b4ff1)
2008-12-17 12:04:41 +11:00
root
8241d3f9cf update to the flags handling
make sure to abort the monitoring and restart if we failed to get the nodemap from a remote node

(This used to be ctdb commit 4eac0214e732e6c2f867d66ec71d4406680dbb94)
2008-12-09 10:45:14 +11:00
root
e54347fa4e redo and update how we synchronize flags across the cluster.
this simplifies the code and should close a race condition between the local recovery daemon and a remote node when flags are changing.

(This used to be ctdb commit 32d460b8469eb53145f04161a5d01166f9b5f09e)
2008-12-05 16:32:30 +11:00
Ronnie Sahlberg
edb7241c05 redesign how reloadnodes is implemented.
modify the transport methods to allow restarting individual connections
and set up destructors properly.

only tear down/set-up tcp connections to nodes removed from the cluster
or nodes added to the cluster.
Leave tcp connections to unchanged nodes connected.

make "ctdb reloadnodes" explicitly cause a recovery of the cluster once
the files have been reloaded

(This used to be ctdb commit d1057ed6de7de9f2a64d8fa012c52647e89b515b)
2008-12-02 13:26:30 +11:00
Ronnie Sahlberg
a782bdbacd new version 1.0.66

(This used to be ctdb commit 499a01fece2a5f24f1b2943cf3dc6e9a3a8ca3b5)
2008-11-24 19:06:02 +11:00
Ronnie Sahlberg
1e2831898c allow changing the recmaster even if the database is not frozen
(This used to be ctdb commit 03e2e436db5cfd29a56d13f5d2101e42389bfc94)
2008-11-21 16:24:12 +11:00
Andrew Tridgell
59b6a9a9e6 fixed problem with looping ctdb recoveries
After a node failure, GPFS can get into a state where non-blocking
fcntl() locks can take a long time. This causes the ctdb set_recmode
test to time out, which leads to a recovery failure and a new
recovery. The recovery loop can last a long time.

The fix is to consider a fcntl timeout as a success of this test. The
test is to see that we can't lock the shared reclock file, so a
timeout is fine for a success.

(This used to be ctdb commit 6579a6a2a7161214adedf0f67dce62f4a4ad1afe)
2008-11-21 10:24:13 +11:00
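The decision rule described in commit 59b6a9a9e6 above can be written down as a small C sketch. The enum `lock_attempt` and the function `reclock_check_passed` are hypothetical names for illustration; the real check in ctdb involves an actual fcntl() attempt with a timeout, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of trying to take the shared reclock file
 * (illustrative, not ctdb's own types). */
enum lock_attempt {
    LOCK_OBTAINED,   /* we got the lock, so nobody else was holding it */
    LOCK_REFUSED,    /* another holder has it, which is what we expect */
    LOCK_TIMED_OUT   /* fcntl() did not answer in time */
};

/* The test passes when we can NOT lock the file. After the fix, a
 * timeout is also counted as a pass instead of a failure. */
static bool reclock_check_passed(enum lock_attempt outcome)
{
    switch (outcome) {
    case LOCK_OBTAINED:
        return false;
    case LOCK_REFUSED:
    case LOCK_TIMED_OUT:
        return true;
    }
    assert(!"unknown lock_attempt value");
    return false;
}
```

Counting the timeout as a pass is what breaks the loop: a slow fcntl() no longer turns into a failed recovery that triggers yet another recovery.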
Ronnie Sahlberg
331b9bdb5f dont override/change CTDB_BASE if it is already set by the shell
(This used to be ctdb commit 0a6f9326cb99f14b5c9edd0d8854d8229df49910)
2008-11-20 16:39:56 +11:00
Ronnie Sahlberg
a2a5904f66 Keepalive packets were only sent every KeepaliveInterval if the socket
had been completely idle during that interval.
If we had been sending other packets such as Messages, Calls or Controls
there wouldn't be any need for an explicit keepalive and thus we didn't
send one.

This does make it somewhat awkward when analyzing traces since it is
non-intuitive when keepalives are sent and when they are not sent.

Change the keepalive logic to always send a keepalive regardless of
whether the link is idle or not.

(This used to be ctdb commit 7a18f33ec7512100dd067c65f0470889ff8fd591)
2008-11-20 13:35:08 +11:00
Ronnie Sahlberg
94a56ea410 rewrite the handling of flag updates across the cluster to eliminate a
race between the ctdb tool and the recovery daemon both at once
trying to push flag changes across the cluster.

(This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)
2008-11-20 12:43:18 +11:00
Ronnie Sahlberg
06728fdac9 we actually need a ctdb_db variable
(This used to be ctdb commit aba984f1b85f5a2d370b093061cf15843ee53758)
2008-11-03 21:54:52 +11:00
Ronnie Sahlberg
d7007793ea latency is measured in us, not ms
use an explicit ctdb_db variable instead of dereferencing state

(This used to be ctdb commit 8c6a02fb423a8cbcbfc706767e3d353cd48073c3)
2008-10-30 13:34:10 +11:00
Ronnie Sahlberg
e1b0cea427 add control and logging of very high latencies.
log the type of operation and the database name for all latencies higher
than a threshold

(This used to be ctdb commit 1d581dcd507e8e13d7ae085ff4d6a9f3e2aaeba5)
2008-10-30 12:49:53 +11:00
Ronnie Sahlberg
b9bd20ce55 add a context and a timed event so that once we have been in recovery
mode for too long we drop all public ip addresses

(This used to be ctdb commit 403c68f96e1380dd07217c688de2730464f77ea0)
2008-10-22 11:04:41 +11:00
Ronnie Sahlberg
beed899c4f null out the pointer before we reload the nodes file
(This used to be ctdb commit 4b0f32047e8bece0a052bdbe2209afe91b7e8ce3)
2008-10-17 21:38:42 +11:00
Ronnie Sahlberg
a924ef78b6 when we reload the nodes file, we may need to reload the nodes file
inside the recovery daemon as well.

(This used to be ctdb commit 82fd2b6b5cd8e988c38fa6b74121a048757bdeef)
2008-10-17 21:18:06 +11:00
Ronnie Sahlberg
ce66008e08 specify a "script log level" on the commandline to set the log
level at which any/all output from eventscripts will be logged

(This used to be ctdb commit cdc79d4f22f1a6aec5c34115969421f93663932a)
2008-10-17 07:56:12 +11:00
Ronnie Sahlberg
5808a7be96 allow multiple eventscripts using the same prefix.
this eases the pain for users that use out of tree eventscripts

(This used to be ctdb commit 8313dfb6fc5404cd2d065af6620412f8664ada11)
2008-10-16 17:57:50 +11:00
Ronnie Sahlberg
233b0e5cbb lower the loglevel for the informational message that a TCP_ADD operation
described an ip address not known to be a public address.

This could happen if someone for genuine reasons accesses a share
through a static ip address.
It can also happen if non-homogeneous public address configurations are
used and when a tcp description is pushed out to a different node that
does not serve/know the specific ip address.

(This used to be ctdb commit 9b1d089c99413f3681440f3cf33c293d118c9108)
2008-10-15 03:02:09 +11:00
Ronnie Sahlberg
41d19e650c Revert "from Mathieu Parent <math.parent@gmail.com>"
This reverts commit dc9cd4779db4a89697731e4cf415be51067a07c1.

Conflicts:

(This used to be ctdb commit d13da2e8fe2fab619540525d98a5502a23ab7d20)
2008-10-15 01:08:29 +11:00
Ronnie Sahlberg
cb300382b0 update TAKEIP/RELEASEIP/GETPUBLICIP/GETNODEMAP controls so we retain an
older ipv4-only version of these controls.

We need this so that we are backward compatible with old versions of ctdb
and so that we can interoperate with an ipv4-only recmaster during a
rolling upgrade.

(This used to be ctdb commit 6b76c520f97127099bd9fbaa0fa7af1c61947fb7)
2008-10-14 10:40:29 +11:00
Ronnie Sahlberg
e5a3a73e64 from Mathieu Parent <math.parent@gmail.com>
Hi,

I have attached a patch necessary as debian log dir (/var/log) is not
a subdir of VARDIR (/var/lib on rpm systems, /var/lib/ctdb on debian).
As I don't know much about autotools and friends, this patch may be
hacky.

This is part of the process to minimize diff between distributions.

(This used to be ctdb commit dc9cd4779db4a89697731e4cf415be51067a07c1)
2008-10-13 08:27:33 +11:00
Ronnie Sahlberg
3411e98e14 skip empty lines in the public addresses file, instead of skipping
all non-empty lines

(This used to be ctdb commit dc108adada33bb713f71a2859eda3b439ed0cd1a)
2008-10-07 19:34:34 +11:00
Ronnie Sahlberg
374906860c from Michael Adam : allow #-style comments in the nodes and public
addresses file

(This used to be ctdb commit 5f96b33a379c80ed8a39de1ee41f254cf48733f9)
2008-10-07 19:25:10 +11:00
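The line filtering touched by commits 3411e98e14 and 374906860c above can be sketched as a small predicate. The name `line_is_skippable` is hypothetical, and skipping leading whitespace before the check is an assumption made for this example rather than something the commit messages state.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true for lines a nodes or public addresses file parser
 * should ignore: empty lines and #-style comment lines.
 * Assumption: leading whitespace is skipped before the check. */
static bool line_is_skippable(const char *line)
{
    assert(line != NULL);

    while (*line != '\0' && isspace((unsigned char)*line)) {
        line++;
    }
    return *line == '\0' || *line == '#';
}
```

Getting the sense of this test backwards gives exactly the bug fixed in 3411e98e14, where every non-empty line was skipped instead of every empty one.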
Ronnie Sahlberg
46187433ca remove an unused variable
(This used to be ctdb commit 4237bd3753dcb024c17461e974414bef1b609416)
2008-10-07 18:14:44 +11:00
Ronnie Sahlberg
1778280d50 When we reload the nodes file
instead of shutting down/restarting the entire tcp layer
just bounce all outgoing connections and reconnect

(This used to be ctdb commit e701a531868149f16561011e65794a4a46ee6596)
2008-10-07 18:12:54 +11:00