IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In include/ctdb.h, ctdb_callback_t and ctdb_rrl_callback_t were
defined with a void *private variable. The variable name was
changed to void *private_data to avoid issues encountered in
the Samba autoconf script.
Evan Kinney <evan.kinney@sas.com>
(This used to be ctdb commit 1f453aa4b5e749468c7788afac09c6f0900ea18f)
We've been seeing "Invalid packet of length 0" errors, but we don't know
what is sending them. Add a name for each queue, and print nread.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e6cf0e8f14f4263fbd8b995418909199924827e9)
The extra recovery interval wait was introduced in 821333afb458 but no
explanation was provided in that message. Nonetheless, if starting
the entire cluster for the first time, it should be safe to skip this.
We use the commandline arg --sloppy-start which should discourage
people from using it outside testing.
Seconds between ctdbd first log message and node healthy:
BEFORE: 16.10
AFTER: 4.03
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 509e2e89ae233a0e91998d95267bf62f296a73cd)
Because this doesn't use a generic callback, it's not quite as trivial
as the other sync wrappers.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 1f20b938d46d4fcd50d2b473c1ab8dc31d178d2d)
These are important for testing, since we can easily tell if we
leak memory if there are outstanding allocations after calling
these.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 18a212aa40d0ff9ff59775c6fcf9dc973e991460)
Ronnie and I tracked down a bug which seems to be caused by a node
running so slowly that we timed out the request and reused the request
id before it responded.
The result was that we unlocked the wrong record, leading to the
following:
ctdbd: tdb_unlock: count is 0
ctdbd: tdb_chainunlock failed
smbd[1630912]: [2010/06/08 15:32:28.251716, 0] lib/util_sock.c:1491(get_peer_addr_internal)
ctdbd: Could not find idr:43
ctdbd: server/ctdb_call.c:492 reqid 43 not found
This exact problem is now detected, but in general we want to delay
id reuse as long as possible to make our system more robust.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9eb9c53ef29f4871ae2fe62fc5cb6145fca89eed)
I missed some int->bool conversions previously, particularly the
return of ctdb_writerecord().
By always handing functions ctdb_connection or ctdb_db, we keep it
consistent with the rest of the API and can do extra lock consistency
checks.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 3f939956ddd693cba6ea5c655288f4f5ca95f768)
Now we have more messages, it seems to make sense to document their usage
and make them consistent.
In particular, LOG_CRIT for internal libctdb problems, LOG_ALERT for
API misuse.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit a6fed3f577c7ec51df38ed15ecb9db6ea2ae7c8f)
This allows us to keep the datastructure valid after the lock has been released by the application and we can trap and warn when the application is accessing the lock after it has been released. I.e. application bugs.
(This used to be ctdb commit 463a266205f145cd9c4c36b9c59d3747eeef0e2e)
Full documentation for all the functions.
This looks longer than it is, because it sorts them into async and
sync parts, and also renames some formal parameters.
Added TODO to libctdb directory to track our plans.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 108e9c2450876a9f8821aa7efd5be971eee5afd3)
We're best off including ctdb_protocol.h to get these, even if we
document the important ones in ctdb.h.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit cdc19dc73032470d57f38bf825d8113b3a0c8cd1)
Return bool instead of -1/0; that's what the young kids are doing
these days!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit e285b5d5a9d4fbc4f75dbb237d2fcdbd84f2d605)
This is based on Ronnie's work, merged with mine. That means
errors are all my fault.
Differences from Ronnie's:
1) use syslog's LOG_ levels directly.
2) typesafe arg to log function, and use it (eg stderr) in helper function.
3) store fn in ctdb context, and expose ctdb_log_level directly thru API.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 86259aa395555aaf7b2fae7326caa2ea62961092)
This is going to help for logging, since we want it there.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 0786152472bc43efae4c896f7c6c07c6e080b9b2)
After discussion with Ronnie, we decided to revisit this interface. We use
the name ctdb_readrecordlock_async, as it is *not* always a send, and we
use a specific callback to avoid the "fake request" creation on the fast
path.
The request itself is never exposed: this means it can't be cancelled,
but we can revisit that later if need be.
This makes both use and implementation simpler.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 03b5546ae45a60ab41eb4f7159a45bfdbf959888)
PDUs are padded to 8 byte boundary. If padding is used, memset it to 0
to keep valgrind happy.
(This used to be ctdb commit 8818d5c483558c0faa6a3923ed5e675fdcfc13af)
and print the time startistics was taken and for how long the statistics have been collected to the "ctdb statistics" output.
(This used to be ctdb commit 1bdfe0cd3370a335b960ce1ef97eade93b0cd2fa)
Previously we could hang in poll with the callback pending (since we
fake it): explicitly call it immediately.
Note: I experienced corruption using DLIST_ADD_END (ctdb->pnn was blatted
when adding to the message_handler list). I switched them all to DLIST_ADD,
but maybe I'm using it wrong?
(This used to be ctdb commit 3727165f0d206999d2cfc2800ff8868640868c7c)
This is a bit tricky for those cases where we need to do multiple or
zero I/Os (eg. attachdb and readrecordlock), but works well for the
simple cases.
(This used to be ctdb commit ebe4dd724338c156423cfdcc10a75b68c2084cde)
These simplifications mostly came up due to the implementation.
o Rename ctdb_context to ctdb_connection.
We already have a ctdb_context internally in ctdbd; don't confuse them!
o Rename ctdb_handle to struct ctdb_request.
From the user POV it's a request, and it's also useful internally to
avoid implicit cast to/from void *.
o Rename ctdb_db_context to ctdb_db.
o Introduce ctdb_lock.
This provides an explicit "lock object" you get from readrecordlock
and have to hand to those functions which need you to hold a lock.
o status args are "int" not int32_t.
Should this be a bool?
o Remove last traces on generic callback.
Without semi-sync API, this doesn't help anything and loses type safety.
o Remove the semi-async API.
We can add this later, but I think a sync and async API is enough for
our poor users for the moment :)
o Registering a message handler also takes a callback.
This way you can tell if it failed. Not sure if this is overkill, but it's
consistent.
o ctdb_service() takes an revents arg
Strictly not necessary for a nonblocking fd, but nice to know if a
read or write is possible.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 86e1f93df856f9627182ed0e18bfcff6866c0954)
This imports ctdb.h and tst.c from Ronnie's work: it's a separate commit
for now to make the changes obvious.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 09f05cbfc883e5aac33d3781b163cde178ece4cf)
ctdb_client.h is the existing internal client interface (which was mainly
in ctdb.h), and ctdb_protocol.h is the information needed for the wire
protocol only.
ctdb.h will be the new, shiny, libctdb API.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 4bba6b8cd47b352f98d41f9f06258d5ac3c9adef)
This resolves a problem with huge numbers of requests which could overflow
16 bits. Fortunately, the IDR should scale reasonably well, so we can simply
hold all the requests.
Although noone checks for failure, I added a constant for that.
BZ: 60540
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 72efc4122e37798227c3420a65ed1f706ca9ebe7)
verify that all nodes agree on the most recent ip address assignments
broke "ctdb moveip ..." since that call would never trigger
a full takeover run and thus would immediately trigger an inconsistency.
Add a new message to the recovery daemon where we can tell the recovery daemon to update its assignments.
BZ62782
(This used to be ctdb commit e7069082e5f0380dcddee247db8754218ce18cab)
In the case of a timeout, we dump a log of what's happening to a file
in /tmp. We do it from the signal handler, which is an unreliable hack
(BZ58365).
Instead, create another (lower-priority) child to do the dump, then
kill the timedout script.
Note that this doesn't quite work as intended (the dump is often run
after the script has been killed), so the next patch resolves this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 7ee5ecc8d53e78e2dec21197b74a74cc4ae1834c)
addresses and verify that the remote nodes have/keep a consistent view of
assigned addresses.
If a remote node has an inconsistent view of addresses visavi the recovery
master this will trigger a full ip reallocation.
(This used to be ctdb commit f3bf2ab61f8dbbc806ec23a68a87aaedd458e712)
We know ask for the known and available interfaces.
This means a node gets a RELEASE_IP event for all interfaces
it "knows", but doesn't serve and a node only gets a TAKE_IP event
for "available" interfaces.
metze
(This used to be ctdb commit a695a38e49e7c3e15a9706392dc920eeab1f11ba)
This is needed because the "startup" event runs after the initial recovery,
but we need to do some actions before the initial recovery.
metze
(This used to be ctdb commit e953808449c102258abb6cba6f4abf486dda3b82)
configureable using --log-ringbuf-size=<num-entries>.
Add an entry in the sysconfig file to set this persistently.
(This used to be ctdb commit c79c2da69bc352f509e7fca4b9172a4b7f23c0f8)
We don't want ctdb stalling due to paging; this can be far worse than
scheduling delays. But if we simply do mlockall(MCL_FUTURE), it
increases the risk that mmap (ie. tdb open) or malloc will fail,
causing us to abort.
This patch is a compromise: we mlock all current pages (including
10k of future stack for expansion) and then relock when a client
asks us to open a TDB. We warn, but don't exit, if it fails.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 82f778e85440bc713d3f87c08ddc955d3cfce926)
1) It's buggy. Code needs to be carefully written (ie. no busy
loops) to handle running with it, and we fork and run scripts.[1]
2) It makes debugging harder. If ctdbd loops (as has happened recently)
it can be extremely hard to get in and see what's happening. We've already
seen the valgrind hacks.
3) We have seen recent scheduler problems. Perhaps they are unrelated,
but removing this very unusual setup is unlikely to hurt.
4) It doesn't make anything faster. Under all but the most perverse of
circumstances, 99% of the cpu gives the same performance as 100%, and
we will always preempt normal processes anyway.
[1] I made this worse in 0fafdcb8d353 "eventscript: fork() a child for
each script" by removing the switch_from_server_to_client() which
restored it, but even that was only for monitor scripts. Others were
run with RT priority.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 482c302d46e2162d0cf552f8456bc49573ae729d)
The do_setsched was being tested for whether to mmap tdbs: let's make it
explicit. We can also happily move the kill-child eventscript hack under
this flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 2ee86cc1f311d7b7504c7b14d142b9c4f6f4b469)
Depending on --max-persistent-check-errors we allow ctdb
to start with unhealthy persistent databases.
The default is 0 which means to reject a startup with
unhealthy dbs.
The health of the persistent databases is checked after each
recovery. Node monitoring and the "startup" is deferred
until all persistent databases are healthy.
Databases can become healthy automaticly by a completely
HEALTHY node joining the cluster. Or by an administrator
with "ctdb backupdb/restoredb" or "ctdb wipedb".
metze
(This used to be ctdb commit 15f133d5150ed1badb4fef7d644f10cd08a25cb5)
This reverts commit 401f421fa003d9515df15e759b50b56e0c67d69c.
Conflicts:
include/ctdb_private.h
server/ctdb_tunables.c
(This used to be ctdb commit b883d19a495a41a22db37f9c2cf6250fee529de0)
since we no longer ban nodes when dodgy scripts continue to hang.
We now only mark nodes as unhealthy if monitor events fail or timeout. Never ban.
(This used to be ctdb commit 5c8e56fc7a518e115bceac257867739283cf6a1e)
there is no rational need for a setting where we permanently mark nodes as disabled everytime an eventscript fails
(This used to be ctdb commit 68a8ee99b128a5ec883600735626bdb3bbc9c503)
This patch improves the handling of the fetch_lock operation on non-persistent
databases that ctdb clients have to do very frequently.
The normal flow how this goes is the following:
1. Client does a local fetch_lock on the database
2. Client looks if the local node is dmaster.
If yes, everything is fine
If no, continue here
3. Client unlocks the local record
4. Client issues a "get me the record" call to ctdbd
5. ctdbd goes out and fetches the dmaster role
6. ctdbd tells the client to retry
7. Client starts over again
The problem is between step 6 and 7: Before the client has had the chance to
retry (i.e. catch the record with a fetch_locked), another node might have come
asking ctdbd to migrate away the record again. This is a real problem, I've
seen >20 loops of this kind in real workloads.
This patch does the following: Whenever ctdb receives a record as result of
step 5, it puts the key on a "holdback list". As long as a key is on this list,
a request to migrate away the dmaster is put on hold. It is the client's duty
to issue the "CTDB_CONTROL_GOTIT" control when it has successfully done step 2
after having asked ctdb to fetch the record. This will release the key from the
"holdback list" and re-issue all dmaster migration requests.
As a safeguard against malicious clients, once a second (default 1000msecs,
tunable "HoldbackCleanupInterval" in milliseconds) ctdbd goes over the list of
held back keys, deletes them and releases all held back migration requests.
(This used to be ctdb commit 5736e17c139c9a8049e235429aeae0c6c9d0e93d)
This is a simplified version of the trans2 commit control:
It just rolls out the marshall buffer to all active nodes.
It is the main ctdbd part of the re-implementation of the
persistent transactions. The client code is changed to
take a global lock to start a transactions and store into
the marshal buffer instead of writing to the local tdb
under a local transaction.
The old transaction implementation is going to be
removed in a later commit.
Michael
(This used to be ctdb commit f66428f9d2013080a414404c1ba6117888352fd6)
Date: Wed, 9 Dec 2009 22:45:12 +0100
Subject: [PATCH] Revert an accidential commit
(This used to be ctdb commit af6656f2844d8fd72204a70358c9d589dbe1bd34)
This might be a bit less efficient, but experience in winbind has shown that
event callbacks can trigger changes in the socket state in very hard to
diagnose ways.
(This used to be ctdb commit a78b8ea7168e5fdb2d62379ad3112008b2748576)
We also no longer return an error before scripts have been run; a special
zero-length data means we have never run the scripts.
"ctdb scriptstatus all" returns all event script results.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9b90d671581e390e2892d3a68f3ca98d58bef4df)
We're going to need this so ctdb can query non-monitor status.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 53bc5ca23ca55a3ac63a440051f16716944a2a51)
Rather than only tranferring to last_status for monitor events, do
it for every event (ctdb->last_status is now an array).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit c73ea56275d4be76f7ed983d7565b20237dbdce3)
We're going to allow fetching status of all script runs, so this
name is no longer appropriate.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit f5cb41ecf3fa986b8af243e8546eb3b985cd902a)
A new helper functions which sets up an event attached to the child's
stdout/stderr which gets routed to the logging callback after being
placed in the normal logs.
This is a generalization of the previous code which was hardcoded to
call ctdb_log_event_script_output.
The only subtlety is that we hang the child fds off the output buffer;
the destructor for that will flush, which means it has to be destroyed
before the output buffer is.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 32cfdc3aec34272612f43a3588e4cabed9c85b68)
The child no longer uses ctdb_ctrl_event_script_init or
ctdb_ctrl_event_script_finished, and the others are redundant: it
doesn't need to tell us it's starting a script when it only runs one.
We move start and stop calls to the parent, and eliminate the RPC
infrastructure altogether.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 391926a87a7af73840f10bb314c0a2f951a0854c)
We put a "scripts" member in ctdb_event_script_state, rather than using
a special struct for monitor events. This will fit better as we further
unify the different events, and holds the reports from the child process
running each monitor script.
Rather than making the monitor state a child of current_monitor_status_ctx,
we just point current_monitor directly at it. This means we need to reset
that pointer in the destructor for ctdb_event_script_state.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 9a2b4f6b17e54685f878d75bad27aa5090b4571f)
We have monitor_event_script_ctx and other_event_script_ctx, and
current_monitor_status_ctx in struct ctdb_context. This seems more
complex than it needs to be.
We use a single "event_script_ctx" as parent for all event script
state structures. Then we explicitly reparent monitor events under
current_monitor_status_ctx: this is freed every script invocation to
kill off any running scripts anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 0d925e6f2767691fa561f15bbb857a2aec531143)
eventscript.c uses this now, but our next patch makes others use it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit a305cb7743c24386e464f6b2efab7e2108bb1e7e)
This unifies code paths and simplifies things: we just hand -ENOEXEC to
ctdb_ctrl_event_script_stop().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit eadf5e44ef97d7703a7d3bce0e7ea0f21cb11f14)
This starts the move toward more expressive encoding of return values:
positive values mean the script ran, negative means we had a problem with
the script (and the value is the errno).
This does timeout, but changes the ctdb tool to recognize it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 0eb1d0aa14e68b598d9e281c8a02b8f94a042fd9)
This simplifies the code a little: last_status is now read to go
(it's only used by the scriptstatus command at the moment).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 6be931266a4e41fd0253f760936ad9707dd97c47)
It is unlikely we will need something this verbose for normal troubleshooting.
This allows us to keep a significantly longer time interval of log messages
in the 500k slots available in the ringbuffer.
(This used to be ctdb commit cc99c05c0c6484ad574039a454e6133852cb41fa)
This controls is only used by samba when samba wants to check if a subrecord held by a <node-id>:<smbd-pid> is still valid or if it can be reclaimed.
If the node is banned or stopped, we kill the smbd process and return that the process does not exist to the caller. This allows us to recover subrecords from stopped/banned nodes where smbd is hung waiting for the databases to thaw.
bz58185
(This used to be ctdb commit 157807af72ed4f7314afbc9c19756f9787b92c15)
Add the mapping to the list everytime we accept() a new client connection
and set it up to remove in the destructor when the client structure is freed.
(This used to be ctdb commit f75d379377f5d4abbff2576ddc5d58d91dc53bf4)
Now we're doing checking, we might as well make sure the commands from
"ctdb eventscripts" are valid.
This gets rid of the "UNKNOWN" event type.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 1d24a3869fe89fc9a109fd9e9b69df5fc665a5f6)
Rather than doing strcmp everywhere, pass an explicit enum around. This
also subtly documents what options are available. The "options" arg
is now used for extra arguments only.
Unfortunately, gcc complains on empty format strings, so we make
ctdb_event_script() take no varargs, and add ctdb_event_script_args(). We
leave ctdb_event_script_callback() taking varargs, which means callers
have to do "%s", "".
For the moment, we have CTDB_EVENT_UNKNOWN for handling forced scripts
from the ctdb tool.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 8001488be4f2beb25e943fe01b2afc2e8779930d)
Everyone uses the same timeout value, so just remove it from the API.
If we ever need variable timeouts, that might as well be central too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This used to be ctdb commit 533c3e053293941d2a9484b495e78d45f478bb08)
command.
Use the existing context used for non-monitor events
Multiple concurrent uses of "ctdb eventscript ..." could otherwise lead to a SEGV
(This used to be ctdb commit 80a8d728e9680040e00d24361dfc9367dd372a56)
This allows running the actual monitoring asynchronously from ctdbd
and only using "status" to pick up the actual results.
(This used to be ctdb commit 1908bac812650ca25151051f5d86815e0b8ed319)
use a udp socket on the ctdbd port to send messages to teh syslog child process for loggign.
we need this when syslog becomes "slow", like very slow, and on boxes where syslog is limited to 100 lines per second and starts to block after that
(This used to be ctdb commit 1446f4c247310e2ff2d522055bd8927d1a78d017)
Otherwise a node can lock itself out, e.g. when a commit control times out...
Michael
(This used to be ctdb commit cb432e30351d5e5a41e98da3c7b1c2a4d400a3a2)
This aske the daemon wheter a transaction is currently active on a
given DB on that node. More precisely this asks for the transaction_active
flag in the ctdb_db_context that is set in the CTDB_TRANS2_COMMIT
control and cleared in the CTDB_TRANS2_ERROR or CTDB_TRANS2_FINISHED controls.
This will be useful for fixing race conditions in the transaction code.
Michael
(This used to be ctdb commit 8d430ae6968dfe566614379436fc3c56003fcd88)
add a global variable holding the pid of the main daemon.
change the tracking of time() in the event loop to only check/warn when called from the main daemon
(This used to be ctdb commit a10fc51f4c30e85ada6d4b7347b0f9a8ebc76637)
The way to use this is from a client to :
1, first create a message handle and bind it to a SRVID
A special prefix for the srvid space has been set aside for samba :
Only samba is allowed to use srvid's with the top 32 bits set like this.
The lower 32 bits are for samba to use internally.
2, register a "notification" using the new control :
CTDB_CONTROL_REGISTER_NOTIFY = 114,
This control takes as indata a structure like this :
struct ctdb_client_notify_register {
uint64_t srvid;
uint32_t len;
uint8_t notify_data[1];
};
srvid is the srvid used in the space set aside above.
len and notify_data is an arbitrary blob.
When notifications are later sent out to all clients, this is the payload of that notification message.
If a client has registered with control 114 and then disconnects from ctdbd, ctdbd will broadcast a message to that srvid to all nodes/listeners in the cluster.
A client can resister itself with as many different srvid's it want, but this is handled through a linked list from the client structure so it mainly designed for "few notifications per client".
3, a client that no longer wants to have a notification set up can deregister using control
CTDB_CONTROL_DEREGISTER_NOTIFY = 115,
which takes this as arguments :
struct ctdb_client_notify_deregister {
uint64_t srvid;
};
When a client deregisters, there will no longer be sent a message to all other clients when this client disconnects from ctdbd.
(This used to be ctdb commit f1b6ee4a55cdca60f93d992f0431d91bf301af2c)
Add a new tunable to control the maximum queue size we allow to a blocked client before we start discarding REQ_MESSAGES instead of queueing them for delivery.
This avoids having queued up very very large number of MESSAGES that samba semds
between eachother to nodes that are blocked/banned/stopped for extended periods
.
(This used to be ctdb commit f76d6fed8f9630450263b9fa4b5fdf3493fb1e11)
Add a tuneable so that when scripts starts to hang/timeout, we can make the node unhealthy instead of banned
(This used to be ctdb commit 2e9fc6f0609833c6d8146196011ef780669d615d)
master to perform an explicit ip reallocation.
This is more reliable and faster than having the recovery dameon track these
changes, and since we now have an explicit method to ask the recovery daemon
to perform an explicit ip reallocation, we should use this.
(This used to be ctdb commit 3807681e74f4bfe92befdae6ed616ff5f1a99880)
transactions we start across all tdb databased during the recovery.
this allows us to properly clean up and delete these tdb transactions on a
recovery failure.
(This used to be ctdb commit b2ce8b900a7d00944c84e0574fea5b371064a06d)
database priorities will be used to control in which order databases are locked during recovery in.
(This used to be ctdb commit 67741c0ee01916d94cace8e9462ef02507e06078)
This is useful when we are moving addresses using moveip in the cluster since otherwise if we collide with the recovery daemons own check we could cause a recovery
(This used to be ctdb commit 9c63858c0b22c81eaccb9865a414af0bbb2833d4)
Remove the explicit vacuum/repack commands from the 00.ctdb eventscript
and implement this in the ctdb daemon.
Combine vacuuming and repacking into one
cheap read traverse to enumerate all candidate records
and one write traverse that both repacks the database and also deletes the record locally where we are lmaster and where the records have already been deleted remotely.
this code also adds initial autotuning heuristics for the vacuum intervals and how many records to delete in each iteration.
minor stylish changes made by ronnie s
(This used to be ctdb commit 95a3ee551241aa164967991fe5efe078e1714bde)
In ctdb_client.c:ctdb_transaction_commit(), after a failed
TRANS2_COMMIT control call (for instance due to the 1-second
being exceeded waiting for a busy node's reply), there is a
1-second gap between the transaction_cancel() and
replay_transaction() calls in which there is no lock on the
persistent db. And due to the lack of global state
indicating that a transaction is in progress in ctdbd, other nodes
may succeed to start transactions on the db in this gap and
even worse work on top of the possibly already pushed changes.
So the data diverges on the several nodes.
This change fixes this by introducing global state for a transaction
commit being active in the ctdb_db_context struct and in a db_id field
in the client so that a client keeps track of _which_ tdb it as
transaction commit running on. These data are set by ctdb upon
entering the trans2_commit control and they are cleared in the
trans2_error or trans2_finished controls. This makes it impossible
to start a nother transaction or migrate a record to a different
node while a transaction is active on a persistent tdb, including
the retry loop.
This approach is dead lock free and still allows recovery process
to be started in the retry-gap between cancel and replay.
Also note, that this solution does not require any change in the
client side.
This was debugged and developed together with
Stefan Metzmacher <metze@samba.org> - thanks!
Michael
(This used to be ctdb commit f88103516e5ad723062fb95fcb07a128f1069d69)
This event is called when a node is stopped and is used by eventscripts that need to do certain cleanup and removal of configuration or ip addresses or routing ...
Note that a STOPPED node is considered "inactive" and as such will not be running the "recovered" event when the rest of the cluster has recovered.
(This used to be ctdb commit 65e9309564611bf937ded3c74a79abff895d7c59)
This node flag means the node is DISABLED and that all its public ip addresses
are failed over, but also that it has been removed from the VNNmap.
A STOPPED node should be in recovery mode active untill restarted using the continue command.
Adding two new commands "ctdb stop" "ctdb continue"
(This used to be ctdb commit d47dab1026deba0554f21282a59bd172209ea066)
validate the input values used and refuse setting the debug level to an unknown value
(This used to be ctdb commit daec49cea1790bcc64599959faf2159dec2c5929)
This is used to mark nodes as being DELETED internally in ctdb
so that nodes are not renumbered if / when they are removed from the nodes file.
This is used to be able to do "ctdb reloadnodes" at runtime without
causing nodes to be renumbered.
To do this, instead of deleting a node from the nodes file, just comment it out like
1.0.0.1
#1.0.0.2
1.0.0.3
After removing 1.0.0.2 from the cluster, the remaining nodes retain their
pnn's from prior to the deletion, namely 0 and 2
Any line in the nodes file that is commented out represents a DELETED pnn
(This used to be ctdb commit 6a5e4fd7fa391206b463bb4e976502f3ac5bd343)
Log this in "ctdb statistics".
Also add a varaible "RecLockLatencyMs" that will log an error everytime it takes longer than this to access the reclock file.
(This used to be ctdb commit 042377ed803bb8f7ca9d6ea1a387427b7b8ba45a)
When a client (such as smbstatus) is killed, it may have outstanding
traverse children on remote nodes. We need to catch the client
disconnect in ctdbd and send a control to all nodes telling them to
kill those outstanding traverse children.
(This used to be ctdb commit f2fb2df4619a14f7f6c11f9132ee7d793028042c)
this now defaults to 60 seconds
This is useful if a split brain occurs due to network partitioning since it will make sure that the "other half" of the cluster that does not contain the recovery master will eventually release all ips and thus avoiding a duplicate ip situation for the public addresses
(This used to be ctdb commit 70f21428c9eec96bcc787be191e7478ad68956dc)
Rename the variable to SeqnumInterval for
1, it is an interval and not a 1/interval unit
2, so that we catch when people use this old variable and can update the sysconfig file instead of silently changin semantics of this variable
this is a real dodgy variable
(This used to be ctdb commit 68eac459e5d2b6b534f72821036675ffe5d7a350)
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.
(This used to be ctdb commit ce534a83a05dbd40238e4eee0669d60ff396f935)
this command shows which eventscripts were executed during the last monitoring cycle and the status from each eventscript.
If an eventscript timedout or returned an error we also
show the output from the eventscript.
Example :
[root@rcn1 ctdb-git]# ./bin/ctdb scriptstatus
6 scripts were executed last monitoring cycle
00.ctdb Status:OK Duration:0.021 Mon Mar 23 19:04:32 2009
10.interface Status:OK Duration:0.048 Mon Mar 23 19:04:32 2009
20.multipathd Status:OK Duration:0.011 Mon Mar 23 19:04:33 2009
40.vsftpd Status:OK Duration:0.011 Mon Mar 23 19:04:33 2009
41.httpd Status:OK Duration:0.011 Mon Mar 23 19:04:33 2009
50.samba Status:ERROR Duration:0.057 Mon Mar 23 19:04:33 2009
OUTPUT:ERROR: Samba tcp port 445 is not responding
Add a new helper function "switch_from_server_to_client()" which both
the recovery daemon can use as well as in the child process we start for running the actual eventscripts.
Create several new controls, both for the eventscript child process to inform the master daemon of the current status of the scripts as well as for the ctdb tool to extract this information from the runninc daemon.
(This used to be ctdb commit c98f90ad61c9b1e679116fbed948ddca4111968d)
a ctdb client instance.
use this from the recovery daemon child process to switch to client mode
and connect back to the main daemon
(This used to be ctdb commit 16f31786a031255ab5b3099a0a3c745de973347a)
This is not portable.
The ctdb build includes the necessary headers from includes.h.
And users of ctdb should cope with including the necessary
prerequisite headers themselves.
Michael
(This used to be ctdb commit fedc6983f5dee39152e6f400f89a3e07eab57f0c)
(for struct sockaddr to be defined)
Thanks to William Jojo <w.jojo@hvcc.edu> for reporting.
Michael
(This used to be ctdb commit 7558bca1e99884c02747adb7cbea799d04ee24d5)
allow clients to register either ipv4 or ipv6 client connections to the tickles list
(This used to be ctdb commit d9b44d7c3255b0fd7359b9afeb613e6ff4c4eaac)
modify the transport methods to allow to restart individual connections
and set up destructors properly.
only tear down/set-up tcp connections to nodes removed from the cluster
or nodes added to the cluster.
Leave tcp connections to unchanged nodes connected.
make "ctdb reloadnodes" explicitely cause a recovery of the cluster once
the files have been realoaded
(This used to be ctdb commit d1057ed6de7de9f2a64d8fa012c52647e89b515b)
race between the ctdb tool and the recovery daemon both at once
trying to push flag changes across the cluster.
(This used to be ctdb commit a9a1156ea4e10483a4bf4265b8e9203f0af033aa)
log the type of operation and the database name for all latencies higher
than a treshold
(This used to be ctdb commit 1d581dcd507e8e13d7ae085ff4d6a9f3e2aaeba5)
fallback to the old-style ipv4-only controls if the new-style ipv4/ipv6
control fails.
this allows a 1.0.59+ (ipv4/ipv6) ctdb daemon being recmaster to be
compatible with
pre-1.0.59 versions of ctdb that are ipv4 only.
(This used to be ctdb commit 8e912abc2c68f5fe7b06c600ba6fec1a6900127c)
older ipv4-only version of these controls.
We need this so that we are backwardcompatible with old versions of ctdb
and so that we can interoperate with a ipv4-only recmaster during a
rolling upgrade.
(This used to be ctdb commit 6b76c520f97127099bd9fbaa0fa7af1c61947fb7)
correctly by measuring how long it was since the last successful
communication with the recovery daemon was recorded.
After a certain timeout the ctdb daemon would deem the recovery daemon
as inoperable and shut down.
If the system clock is suddenly changed forward by many (60 or more)
seconds this could cause the timeout to trigger prematurely/immediately
where ctdb would incorrectly think that more than 60 seconds had passed
since last successful communications and thus abort.
Instead of cehcking for one timeout occuring, only deem the recovery
daemon to be "down" and trigger a shutdown if communications have
timedout for three intervals in a row.
(This used to be ctdb commit 196968c552e6ebcb57389d769a4b25f42fa8bc5d)
we currently only monitor that the dameons are running by kill(0, pid)
and verifying the the domain socket between them is ok.
this is not sufficient since we can have a situation where the recovery
daemon is hung.
this new code monitors that the recovery daemon is operating.
if the recovery hangs, we log this and shut down the main daemon
(This used to be ctdb commit cd69d292292eaab3aac0e9d9fc57cb621597c63c)
the difference between a initial commit attempt and a retry, which
allows us to get the persistent updates counter right for retries
(This used to be ctdb commit 7f29c50ccbc7789bfbc20bcb4b65758af9ebe6c5)
This file creates additional locking stress on the backend filesystem and we may not need it anyway.
(This used to be ctdb commit 84236e03e40bcf46fa634d106903277c149a734f)
change one element from private to private_data
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
(This used to be ctdb commit 0de79352c9b36c118e36905f08ebbe38ecbb957e)
This allows ctdb to automatically start a new full blown recovery
if a client has started updating the local tdb for a persistent database
but is kill -9ed before it has ensured the update is distributed clusterwide.
(This used to be ctdb commit 1ffccb3e0b3b5bd376c5302304029af393709518)
This is a hack to allow backtraces under valgrind to show what opcode
is getting uninitialised bytes
(This used to be ctdb commit 67bb12c8f0af5914efb44b76bc6ddbb11fc0fcdf)
make ctdb uptime print how long the recovery took
in the recovery daemon when we check that the public ip address
allocation on the local node is correct (we have the ips we should have
and we dont have any we shouldnt have) use ctdb uptime and check the
recovery start/stop times and make sure we dont check for ip allocation
inconsistencies during a recovery where the ip address allocation is in flux.
(This used to be ctdb commit f86551580349b7f662f9a07e4eb0c1189e38e429)
this callback is called for every node where the control failed (or timed out)
when we issue the start recovery control from recovery master,
set any node that fails as a culprit so it will eventually be banned
(This used to be ctdb commit 72f89bac13cbe8c3ca3e7a942469cd2ff25abba2)
ctdb_attach() so that we can pass TDB_NOSYNC when we attach to
a persistent database and want fast unsafe writes instead of
slow but safe tdb_transaction writes.
enhance the ctdb_persistent test suite to test both safe and unsafe writes
(This used to be ctdb commit 4948574f5a290434f3edd0c052cf13f3645deec4)
for stores into persistent databases, ALWAYS use a lockwait child take out the lock for the record and never the daemon itself.
(This used to be ctdb commit 7fb6cf549de1b5e9ac5a3e4483c7591850ea2464)
This enhances the framework for sending tcp tickles to be able to send ipv6 tickles as well.
Since we can not use one single RAW socket to send both handcrafted ipv4 and ipv6 packets, instead of always opening TWO sockets, one ipv4 and one ipv6 we get rid of the helper ctdb_sys_open_sending_socket() and just open (and close) a raw socket of the appropriate type inside ctdb_sys_send_tcp().
We know which type of socket v4/v6 to use based on the sin_family of the destination address.
Since ctdb_sys_send_tcp() opens its own socket we no longer nede to pass a socket
descriptor as a parameter. Get rid of this redundant parameter and fixup all callers.
(This used to be ctdb commit 406a2a1e364cf71eb15e5aeec3b87c62f825da92)
If a transaction could be started, do safe transaction store when updating the record inside the daemon.
If the transaction could not be started (maybe another samba process has a lock on the database?) then just do a normal store instead (instead of blocking the ctdb daemon).
The client can "signal" ctdb that updates to this database should, if possible, be done using safe transactions by specifying the TDB_NOSYNC flag when attaching to the database.
The TDB flags are passed to ctdb in the "srvid" field of the control header when attaching using the CTDB_CONTROL_DB_ATTACH_PERSISTENT.
Currently, samba3.2 does not yet tell ctdbd to handle any persistent databases using safe transactions.
If samba3.2 wants a particular persistent database to be handled using
safe transactions inside the ctdbd daemon, it should pass
TDB_NOSYNC as the flags to the call to attach to a persistent database
in ctdbd_db_attach() it currently specifies 0 as the srvid
(This used to be ctdb commit 8d6ecf47318188448d934ab76e40da7e4cece67d)
This allows us to use the async framework also for controls that return
outdata.
Add a "capabilities" field to the ctdb_node structure. This field is
only initialized and kept valid inside the recovery daemon context and not
inside the main ctdb daemon.
change the GET_CAPABILITIES control to return the capabilities in outdata instead of in the res return variable.
When performing a recovery inside the recovery daemon, read the capabilities from all connected nodes and update the ctdb->nodes list of nodes.
when building the new vnnmap after the database rebuild in recovery, do not include any nodes which lack the LMASTER capability in the new vnnmap.
Unless there are no available connected node that sports the LMASTER capability in which case we let the local node (recmaster) take on the lmaster role temporarily (i.e. become a member of the vnnmap list)
(This used to be ctdb commit 0f1883c69c689b28b0c04148774840b2c4081df6)
Define two capabilities :
can be recmaster
can be lmaster
Default both capabilities to YES
Update the ctdb tool to read capabilities off a node
(This used to be ctdb commit 50f1255ea9ed15bb8fa11cf838b29afa77e857fd)
remove the transaction stuff and push so that the git tree will work
This reverts commit 539bbdd9b0d0346b42e66ef2fcfb16f39bbe098b.
(This used to be ctdb commit 876d3aca18c27c2239116c8feb6582b3a68c6571)
thus allowing the client to pass through the TDB_NOSYNC flag
- ensure that tdb_store() operations on persistent databases that don't
have TDB_NOSYNC set happen inside a transaction wrapper, thus making
them crash safe
(This used to be ctdb commit 49330f97c78ca0669615297ac3d8498651831214)
and a ctdb command to pull the talloc memory map from a recovery daemon
ctdb rddumpmemory
(This used to be ctdb commit d23950be7406cf288f48b660c0f57a9b8d7bdd05)
The controls only modify the runtime setting of which public addresses a node
can server and does not modify /etc/ctdb/public_addresses.
To make the change permanent you also need to edit /etc/ctdb/public_addresses
manually.
After ip addresses have been added/deleted you need to invoke a recovery
for the ip addresses to be redistributed.
(This used to be ctdb commit f8294d103fdd8a720d0b0c337d3973c7fdf76b5c)
Add back the controls to enable/disable monitoring we used to have for debugging but removed a while ago
(This used to be ctdb commit 8477f6a079e2beb8c09c19702733c4e17f5032fe)
This can cause a memory leak if the call is terminated before we have managed to respond to the client.
(and the call is talloc_free()d but the data is still hanging off ctdb)
instead we must talloc_steal() the data and hang it off the call structure to avoid the memory leak.
In order to do this we must also change the call structure that is passed into ctdb_call_local() to be allocated through talloc().
This structure was previously either a static variable, or an element of a larger talloc()ed structure (ctdb_call_state or ctdb_client_call_state) so
we must change all creations of a ctdb_call into explicitely creating it through talloc()
(This used to be ctdb commit 4becf32aea088a25686e8bc330eb47d85ae0ef8f)
Vacumming used to delete one record at a time on all nodes, that was
m*n behaviour and would require a huge storm of ctdb->ctdb controls and just wouldnt scale at all.
The new vacuming process collects all records to be deleted locally and then only sends 1 control to the other nodes. This control contains a list of all records to be deleted.
(This used to be ctdb commit 9e625ece19a91f362c9539fa73b6b2108f0d9c53)
when this tunable is set, ip addresses will only be failed over when a node
fails. And only those ip addresses held by the failed node will be reallocated
in the cluster.
When a node becomes active again, this will not lead to any failback of ip addresses.
This can reduce the number of "ip address movements" in the cluster since we dont automatically fail an ip address back, but can also lead to an unbalanced cluster since we no longer attempt to spread the ip addresses out evenly across the active nodes.
This tuneable can NOT be active at the same time as DeterministicIPs are used.
(This used to be ctdb commit d3b8a461b15bc584fa1785eb5922de6d49d8f6c4)
once every such interval :
* the recovery master on each node will uppdate the "connected" count in the
reclock count file (ctdb getreclock)
* if the node thinks it is a recovery master but it detects another node
that is DISCONNECTED but which still holds a lock to the reclock count file
this may mean that we have a split cluster.
if that other node that is DISCONNECTED but still holds the lock on hte reclock
pnn count file, is MORE connected than the local node,
yield the recmaster role and let the other half of the lcuster take over
this add a second, last chance mechanism to detect split clusters.
IF the cluster is split but GPFS is not yet split, this mechanism makes
the largest half of the cluster become the active half.
(This used to be ctdb commit 07af425f444531942cce8abff112c1524228d287)
CTDB_START_AS_DISABLED="yes"
and command line argument
--start-as-disabled
When set, this makes the ctdb node to always start in DISABLED mode and will thus not host any public ip addresses.
The administrator must manually "ctdb enable" the node after it has started when the administrator wants the node to start hosting public ip addresses.
Using this option it is possible to start ctdb on a node without causing any reallocation of ip addresses when it is starting. The node will still merge with the cluster and there will still be a recovery phase but the ip address allocations will not change in the cluster.
(This used to be ctdb commit b93d29f43f5306c244c887b54a77bca8a061daf2)
add a new control that causes the node to drop the current nodes list
and reread it from the nodes file.
During this operation, the node will also drop the tcp layer and restart it.
When we drop the tcp layer, by talloc_free()ing the ctcp structure
add a destructor to ctcp so that we also can clean up and remove the references in the ctdb structure to the transport layer
add two new commands for the ctdb tool.
one to list all nodes in the nodesfile and the second a command to trigger a node to drop the transport and reinitialize it with the nde nodes file
(This used to be ctdb commit 4bc20ac73e9fa94ffd43cccb6eeb438eeff9963c)
nodes into two separate files.
move the monitoring of keepalives for detecting connected/disconnected
remote nodes into ctdb_keepalive.c
(This used to be ctdb commit 23a57b20c314d5f11a433cf251eb9d9de743849a)
ctdb vacuum : vacuums all the databases, deleting any zero length
ctdb records
ctdb repack : repacks all the databases, resulting in a perfectly
packed database with no freelist entries
(This used to be ctdb commit 3532119c84ab3247051ed6ba21ba3243ae2f6bf4)
flag.
change calling of the recovered/takeip/releaseip event scripts to use
these enable/disable functions instead of stopping/starting monitoring.
when we disable monitoring we want all events to still be running
in particular the events to monitor for dead nodes and we only want to
supress running the monitor event scripts
(This used to be ctdb commit a006dcc4f75aba950dd701ad7d1a84e89df285e8)
monitoring should always be enabled
(though a node may want to temporarily disable running the "monitor"
event scripts but can do so internally without the need for this
control)
(This used to be ctdb commit e3a33618026823e6af845fd8513cddb08e6b5584)
specific instance of ctdbd should bind to. This helps when running a
"virtual" cluster on a single machine where all instcances bind to
different alias interfaces.
If --node-ip is specified, then we will only try to bind to this ip
address only. Othervise we fall back to the original method trying the
ip addresses in /etc/ctdb/nodes one by one until we find one we can bind
to.
No variable in /etc/sysconfig/ctdb added since this parameter only makes
sense in a virtual test/debug cluster.
(This used to be ctdb commit d96cb02c2c24f9eabbc53d3d38e90dea49cff3e0)
of the startup event scripts after the point where recovery has
started and the node is in normal operation
This makes the 'startup' script just a special type of the 'monitor'
script which is called first
(This used to be ctdb commit 7424c30a5fd04aea0137c466b4318c3f185280d8)
shut down and restart the transport
othervise, if we use the tcp transport the tcp connection might try to
retransmit the queued data during the time the node is unavailable.
this together with the exponential backoff for tcp means that the tcp
connection quickly reaches the maximum backoff rto which is often 60 or
120 seconds. this would mean that it could take up to 60/120 seconds
before the tcp layer detects that the connection is dead and it has to
be reestablished.
(This used to be ctdb commit 0256db470879ce556b0f00070f7ebeaf37e529ab)
public addresses to nodes deterministic.
Activate it by adding CTDB_SET_DeterministicIPs=1 in /etc/sysconfig/ctdb
When this is set, the first entry in /etc/ctdb/public_addresses will
always be hosted by node 0, when that node is available, the second
entry by node1 and so on.
This tunable allows the allocation of addresses to become very
unbalanced and is only for debugging/testing use.
Beware, this feature requires that /etc/ctdb/public_addresses are
identical on all the nodes in the cluster.
(This used to be ctdb commit f0ca221f235731542090d8a6c86f2b7cd2ce2f96)
used in single public ip address mode.
when using this argument, --public-interface must also be used.
add a vnn structure to the ctdb context to describe the single public ip
address
update the killtcp control in the daemon that if a socketpair that is to
be killed does not match a normal public address it checks if the
destination address maches the single public ip address and if so uses
that vnn structure from the ctdb context
this allows killtcp to kill also connections to the single public ip
instead of only normal public addresses
(This used to be ctdb commit 5661ba17b91f62821dec1c76056c78b99752a90b)
a bool that specifies whether the ip was held by a loopback adaptor or
not
the name of the interface where the ip was held
when we release an ip address from an interface, move the ip address
over to the loopback interface
when we release an ip address after we have move it onto loopback,
use 60.nfs to kill off the server side (the local part) of the tcp
connection so that the tcp connections dont survive a
failover/failback
61.nfstickle, since we kill hte tcp connections when we release an ip
address we no longer need to restart the nfs service in 61.nfstickle
update ctdb_takeover to use the new signature for ctdb_sys_have_ip
when we add a tcp connection to kill in ctdb_killtcp_add_connection()
check if either the srouce or destination address match a known public
address
(This used to be ctdb commit f9fd2a4719c50f6b8e01d0a1b3a74b76b52ecaf3)
files
so that we can partition the cluster into different subsets of nodes
which each serve a different subset of the public addresses
(This used to be ctdb commit 889e0fe69e4c88c6166282b12843b8d9727552d6)
everytime we release an ip.
this context is used to hold all resources needed when sending out
gratious arps and tcp tickles during ip takeover.
we hang it off the vnn structure that manages that particular ip address
instead so that we can have multiple ones going in parallell
this bug (or the same bug in different shape) has probably been in ctdb
for very very long but is likely to be hard to trigger
(This used to be ctdb commit c58db1cadaba253b2659573673b28c235ef7db76)