ctdb_attach() so that we can pass TDB_NOSYNC when we attach to
a persistent database and want fast unsafe writes instead of
slow but safe tdb_transaction writes.
enhance the ctdb_persistent test suite to test both safe and unsafe writes
(This used to be ctdb commit 4948574f5a290434f3edd0c052cf13f3645deec4)
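A minimal sketch of the two write paths, using the public tdb API directly; the wrapper functions and their use are illustrative, not the actual ctdb code:

    #include <tdb.h>

    /* fast but unsafe: on a database opened with TDB_NOSYNC the store
     * is not fsync()ed, so a crash may lose recent writes */
    static int store_unsafe(struct tdb_context *tdb, TDB_DATA key, TDB_DATA val)
    {
        return tdb_store(tdb, key, val, TDB_REPLACE);
    }

    /* slow but safe: wrap the store in a tdb transaction, which
     * journals the change and syncs it to disk on commit */
    static int store_safe(struct tdb_context *tdb, TDB_DATA key, TDB_DATA val)
    {
        if (tdb_transaction_start(tdb) != 0) {
            return -1;
        }
        if (tdb_store(tdb, key, val, TDB_REPLACE) != 0) {
            tdb_transaction_cancel(tdb);
            return -1;
        }
        return tdb_transaction_commit(tdb);
    }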
since this event won't run unless the recovery mode is normal, but we
cannot know what the recovery mode will be in the future on a remote node.
Since we issue these commands to execute in the future on some other node,
it is pointless to try to check whether they worked or not,
in particular if "failure to successfully run the eventscript" would then trigger a full new recovery, which is disruptive and expensive.
(This used to be ctdb commit 2c292039a0139dcf5bb2bd964eb6f8902d094c50)
This attempts to fix the problem of ctdb event scripts blocking due to
attempted access to the ctdb databases during recovery. The changes are:
- now only the 'shutdown' and 'startrecovery' events can be called
with the databases locked in recovery. The event scripts must ensure
that for these two events no database access is attempted
- the recovered, takeip and releaseip events could previously be called
inside a recovery. The code now ensures that this doesn't happen, delaying
the events until after recovery has finished
- the 50.samba event script now avoids using testparm unless it is really
needed
This needs extensive testing.
(This used to be ctdb commit e3cdb8f2be6a44ec877efcd75c7297edb008a80b)
If we shut down the transport and CTDB later decides to send a command out
for queueing, the call to ctdb->methods->allocate_pkt() will SEGV.
This could trigger, for example, when we are in the process of shutting down
CTDBD and have already shut down the transport but are still waiting for the
"shutdown" event scripts to finish.
If the event scripts take much longer than usual to execute for some reason,
this race condition becomes much more probable.
Decorate all dereferences of ctdb->methods with a check that ctdb->methods is non-NULL
(This used to be ctdb commit c4c2c53918da6fb566d6e9cbd6b02e61ae2921e7)
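A minimal sketch of the guard, with simplified stand-ins for the real ctdb structures:

    #include <stddef.h>

    struct ctdb_methods {
        void *(*allocate_pkt)(void *ctdb, size_t len);
    };

    struct ctdb_context {
        struct ctdb_methods *methods;
    };

    /* every dereference of ctdb->methods is guarded: after transport
     * shutdown the pointer is NULL and calling through it would SEGV */
    static void *allocate_pkt_checked(struct ctdb_context *ctdb, size_t len)
    {
        if (ctdb->methods == NULL) {
            return NULL; /* transport already shut down */
        }
        return ctdb->methods->allocate_pkt(ctdb, len);
    }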
Remove a bogus check inside the recovery daemon that ONLY redistributed public addresses IFF the local node had/served public addresses.
This was a valid optimization long ago, when we enforced that all nodes must use the same public addresses file, but it is invalid today, when nodes can have different public addresses configurations and some nodes may not use public addresses at all.
(This used to be ctdb commit 5833e6b99d9afaf35dc8354df8676b9115418b23)
We should always use type-safe talloc functions when possible. In this case we were allocating bytes instead of uint32_t.
(This used to be ctdb commit cb14ee57dd0a589242da1ac2830bb7939df460a5)
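A minimal sketch of the difference, using the real talloc API (the function and its arguments are illustrative):

    #include <talloc.h>
    #include <stdint.h>

    static uint32_t *alloc_ids(TALLOC_CTX *mem_ctx, unsigned count)
    {
        /* untyped: talloc_size(mem_ctx, count) would allocate
         * 'count' BYTES, not 'count' uint32_t elements */

        /* type-safe: allocates count * sizeof(uint32_t) and returns
         * a correctly typed pointer */
        return talloc_array(mem_ctx, uint32_t, count);
    }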
This allows us to use the async framework also for controls that return
outdata.
Add a "capabilities" field to the ctdb_node structure. This field is
only initialized and kept valid inside the recovery daemon context and not
inside the main ctdb daemon.
change the GET_CAPABILITIES control to return the capabilities in outdata instead of in the res return variable.
When performing a recovery inside the recovery daemon, read the capabilities from all connected nodes and update the ctdb->nodes list of nodes.
when building the new vnnmap after the database rebuild in recovery, do not include any nodes which lack the LMASTER capability in the new vnnmap.
Unless no available connected node sports the LMASTER capability, in which case we let the local node (the recmaster) take on the lmaster role temporarily (i.e. become a member of the vnnmap).
(This used to be ctdb commit 0f1883c69c689b28b0c04148774840b2c4081df6)
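A minimal sketch of the vnnmap-building rule described above, with simplified node structures and an illustrative capability constant:

    #include <stdint.h>

    #define CTDB_CAP_LMASTER 0x00000001 /* illustrative value */

    struct node {
        uint32_t pnn;
        uint32_t capabilities;
        int connected;
    };

    /* include only connected nodes that have the LMASTER capability;
     * if none qualify, fall back to the local recmaster node so the
     * vnnmap is never empty */
    static unsigned build_vnnmap(const struct node *nodes, unsigned num_nodes,
                                 uint32_t my_pnn, uint32_t *map)
    {
        unsigned n = 0, i;

        for (i = 0; i < num_nodes; i++) {
            if (!nodes[i].connected) {
                continue;
            }
            if (!(nodes[i].capabilities & CTDB_CAP_LMASTER)) {
                continue;
            }
            map[n++] = nodes[i].pnn;
        }
        if (n == 0) {
            map[n++] = my_pnn; /* recmaster takes the lmaster role */
        }
        return n;
    }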
handle failure to get/hold the reclock pnn file better: just
treat it as a transient backend filesystem error and try again later
instead of shutting down the recovery daemon.
When we have lost the pnn file and we are the recmaster,
release the recmaster role so that someone else can become recmaster instead
(This used to be ctdb commit e513277fb09b951427be8351d04c877e0a15359d)
and a ctdb command to pull the talloc memory map from a recovery daemon
ctdb rddumpmemory
(This used to be ctdb commit d23950be7406cf288f48b660c0f57a9b8d7bdd05)
of connected nodes
num_active only contains the number of active nodes and would thus not count
banned nodes
(This used to be ctdb commit 06d3ce470766ef0b60d68ccd84de5437146cc147)
once every such interval:
* the recovery master on each node will update the "connected" count in the
reclock count file (ctdb getreclock)
* if the node thinks it is a recovery master but it detects another node
that is DISCONNECTED but which still holds a lock on the reclock count file,
this may mean that we have a split cluster.
If that other node, which is DISCONNECTED but still holds the lock on the
reclock pnn count file, is MORE connected than the local node,
yield the recmaster role and let the other half of the cluster take over.
This adds a second, last-chance mechanism to detect split clusters.
If the cluster is split but GPFS is not yet split, this mechanism makes
the largest half of the cluster become the active half.
(This used to be ctdb commit 07af425f444531942cce8abff112c1524228d287)
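A minimal sketch of the yield decision; the helper and its inputs are hypothetical, standing in for the counts read from the reclock count file:

    /* yield the recmaster role when a node we consider DISCONNECTED
     * still holds the reclock and reports MORE connected nodes than
     * we can see: the larger half of the split cluster wins */
    static int should_yield_recmaster(unsigned my_connected,
                                      unsigned other_connected,
                                      int other_is_disconnected,
                                      int other_holds_reclock)
    {
        if (!other_is_disconnected || !other_holds_reclock) {
            return 0; /* no evidence of a split cluster */
        }
        return other_connected > my_connected;
    }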
ctdb vacuum : vacuums all the databases, deleting any zero length
ctdb records
ctdb repack : repacks all the databases, resulting in a perfectly
packed database with no freelist entries
(This used to be ctdb commit 3532119c84ab3247051ed6ba21ba3243ae2f6bf4)
ctdb_recoverd.c
Always handle banning/unbanning locally on the node that is being
banned/unbanned instead of on the recovery master.
This means that if a ban request comes in to the recovery master for a
remote node, we pass the request on to the remote node instead of
setting up the ban and ban timeouts locally.
ctdb.c
send ban/unban requests to the node being banned/unbanned instead of to
the recmaster
(This used to be ctdb commit 880dd9f5fd0b91e450da93e195cc5c62cb1dcd6e)
the banned_nodes array and not the rec structure, so that ban_state is
destroyed when the banned_nodes array gets destroyed
(and so that when this struct is destroyed, any pending
ctdb_ban_timeout events are also destroyed).
Otherwise we may end up with multiple ban_timeout timed events running in
parallel, since we destroy/recreate the banned_nodes structure during an
election but never destroy/recreate the rec structure.
(This used to be ctdb commit fbd663d56a2a4421a5c0e541962c87e2e9c7cd82)
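A minimal sketch of the talloc-parenting idea: hang child state off the structure whose lifetime should bound it, so freeing the parent also destroys everything beneath it. Structure and field names are simplified:

    #include <talloc.h>
    #include <stdint.h>

    struct ban_state {
        uint32_t pnn;
        /* pending timed events are allocated as children of this
         * struct and are destroyed together with it */
    };

    struct recovery {
        struct ban_state **banned_nodes; /* a talloc array */
    };

    static void ban_node(struct recovery *rec, uint32_t pnn)
    {
        /* parent on banned_nodes, NOT on rec: banned_nodes is
         * destroyed and recreated at each election, so the ban_state
         * and its ban_timeout events are cleaned up with it */
        rec->banned_nodes[pnn] = talloc_zero(rec->banned_nodes,
                                             struct ban_state);
        rec->banned_nodes[pnn]->pnn = pnn;
    }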
control; instead call ctdb_start/stop_monitoring()
ctdb_stop_monitoring() doesn't allocate a new monitoring context; it leaves
it NULL. Also set the monitoring_mode in this function so that
ctdb_stop/start_monitoring() and ->monitoring_mode are kept in sync.
Add a debug message to log that we have stopped monitoring.
ctdb_start_monitoring() checks whether monitoring is already active,
making the function idempotent.
Create the monitoring context when monitoring is started.
Update ->monitoring_mode once monitoring has been started.
Add a debug message to log that we have started monitoring.
When we temporarily stop monitoring while running an event script,
restart monitoring after the event script wrapper returns instead of in
the event script callback.
Let monitoring_mode start out as DISABLED and let it be enabled once we call ctdb_start_monitoring().
Don't check for MONITORING_DISABLED in check_for_dead_nodes(). If
monitoring is disabled, this event handler will not be called.
(This used to be ctdb commit 3a93ae8bdcffb1adbd6243844f3058fc742f76aa)
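A minimal sketch of the idempotent start/stop pair, assuming a simplified ctdb_context; the real functions in the daemon differ in detail:

    #include <talloc.h>

    enum monitor_mode { MONITORING_DISABLED, MONITORING_ACTIVE };

    struct ctdb_context {
        enum monitor_mode monitoring_mode; /* starts out DISABLED */
        TALLOC_CTX *monitor_context;
    };

    static void ctdb_start_monitoring(struct ctdb_context *ctdb)
    {
        if (ctdb->monitoring_mode == MONITORING_ACTIVE) {
            return; /* already active: idempotent */
        }
        ctdb->monitor_context = talloc_new(ctdb);
        ctdb->monitoring_mode = MONITORING_ACTIVE;
    }

    static void ctdb_stop_monitoring(struct ctdb_context *ctdb)
    {
        /* free the context (cancelling its timed events) and leave
         * it NULL; do not allocate a new one here */
        talloc_free(ctdb->monitor_context);
        ctdb->monitor_context = NULL;
        ctdb->monitoring_mode = MONITORING_DISABLED;
    }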
when we are the recmaster and we update the local flags for all the
nodes, if one of the nodes fails to respond and give us its flags,
mark that node as a "culprit".
As one of the first things in the monitor_cluster loop, check whether
the current culprit has caused too many (20) failures and, if so, ban that
node.
This is for the situation where a remote node may still be CONNECTED but
fails to respond to the getnodemap control, causing the recovery
master to loop in monitor_cluster, aborting the monitoring when the
node fails to respond but before anything triggers a call to
do_recovery().
If one or more of the databases or nodes are frozen at this stage, this
would leave smbd blocked for a potentially long time.
(This used to be ctdb commit 83b0261f2cb453195b86f547d360400103a8b795)
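A minimal sketch of the culprit counting; the threshold of 20 comes from the commit text, the rest is simplified:

    #include <stdint.h>

    #define MAX_CULPRIT_FAILURES 20

    struct culprit_state {
        uint32_t culprit_pnn;
        unsigned culprit_failures;
    };

    static void set_culprit(struct culprit_state *cs, uint32_t pnn)
    {
        if (cs->culprit_pnn != pnn) {
            cs->culprit_pnn = pnn;
            cs->culprit_failures = 0; /* new culprit, restart count */
        }
        cs->culprit_failures++;
    }

    /* checked early in each monitor_cluster iteration */
    static int culprit_should_be_banned(const struct culprit_state *cs)
    {
        return cs->culprit_failures >= MAX_CULPRIT_FAILURES;
    }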
recovery daemon and the ctdb daemon both agree on whether the node is
banned or not, and if they disagree, reban the node again after
logging an error to the debug log
(This used to be ctdb commit 6cd6e534493066edd4bb2c6ae5be0e9a9d495aa0)
when these functions are called to ban or unban a node, make sure we
update the CTDB_NODE_BANNED flag in rec->node_flags, since this field and
flag are checked during the election process
(This used to be ctdb commit 740c632ae96a2d34327d1b575780aaf079d93f4f)
so it differs from what the local ctdb daemon on the recovery master
thinks it should be, we should call for a re-election
(This used to be ctdb commit 21ad6039c31ef5cc0e40a35a41220f91943947cb)
flags differ between the local ctdb daemon and the remote node,
we can force a flags update on all nodes, not just the local daemon
(This used to be ctdb commit a924eb89c966ecbae029ca137e06cffd40cc70fd)
flags
in update_local_flags()
(this is only called if we are, or believe we are, the recmaster)
when we detect that the flags of a remote node differ from what
our local node thinks the flags should be for that remote node,
we should send a node-flag-changed message to the local daemon so
that it updates the flags for that node.
(This used to be ctdb commit 36225e4e271f7a4065398253747fb20054f99a53)
make sure we read and update the flags from all remote nodes before we
reach the first codepath that can call do_recovery()
since during do_recovery() we need to know what the flags are.
(This used to be ctdb commit e85f3806483ea420559d449e0e4d81bec996740f)
shouldn't, or we are not holding addresses we should)
we must first freeze the local node before we set the recovery mode
(This used to be ctdb commit a77a77e8b5180f6a4a1f3d7d4ff03811f3b71b56)
addresses (i.e. they hold those they should hold and they don't hold
any of those they shouldn't hold)
if an inconsistency is found, mark the local node as recovery mode
active
and wait for the recovery master to trigger a full-blown recovery
(This used to be ctdb commit 55a5bfc8244c5b9cdda3f11992f384f00566b5dc)
- add a flag to check that recovery completed correctly. If not, re-trigger it in monitoring
(This used to be ctdb commit d5ed941d9bab4af30d8b5f9b77bdf43d9218d69b)
need_takeover_run is set to true, or else we might forget to rerun it
during the next recovery.
Otherwise, need_takeover_run is only set to true IFF the node flags for
a remote node and the local node differ.
It is possible that a takeover run fails, leaving the reassignment of
ip addresses incomplete, but that by the time we get back to the test in
monitor_cluster() the node flags of all nodes have converged and match
each other again, causing monitor_cluster() to fail to realize that a
takeover run is needed.
(This used to be ctdb commit ae7e866787cebd14394983ce1834387c959d1022)
files
so that we can partition the cluster into different subsets of nodes
which each serve a different subset of the public addresses
(This used to be ctdb commit 889e0fe69e4c88c6166282b12843b8d9727552d6)
multiple public addresses spread across multiple interfaces on each
node.
this is a massive patch since we have previously made the assumption that
we only have one public address per node.
get rid of the public_interface argument. the public addresses file
now explicitly lists which interface each address belongs to
(This used to be ctdb commit 462ebbc791e906a6b874c862defea43235597ca8)
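A sketch of what a public addresses file might look like after this change, with the interface listed explicitly per address (addresses, masks and interface names are examples):

    # one entry per line: <address>/<mask> <interface>
    10.1.1.1/24 eth1
    10.1.1.2/24 eth1
    10.1.2.1/24 eth2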
passing it as a parameter, we set the callback function explicitly from
the caller if the ..._send() function returned a valid state pointer.
(This used to be ctdb commit aa939570662786455f63299b62c99882cff29d42)
callback function which is called upon completion (or timeout) of the
control.
modify scanning of recmaster in the monitoring_cluster code to try the
api out
(This used to be ctdb commit c37843f1d97b169afec910e7ddb4e5ac12c3015c)
struct so that if a control times out we can print debug info such as
which opcode failed and to which node it was sent
we don't need the *status parameter to ctdb_client_control_state
create async versions of the getrecmaster control
pass a memory context to getrecmaster
(This used to be ctdb commit 558b680c82f830fba82c283c78c2de8a0b150b75)
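A minimal sketch of the send/recv split; the signatures are simplified and illustrative, not the exact ctdb client API:

    #include <talloc.h>
    #include <stdint.h>
    #include <stddef.h>

    struct ctdb_context;
    struct ctdb_client_control_state; /* remembers opcode and node so
                                       * a timeout can be logged */

    struct ctdb_client_control_state *
    ctdb_ctrl_getrecmaster_send(struct ctdb_context *ctdb,
                                TALLOC_CTX *mem_ctx, uint32_t destnode);

    int ctdb_ctrl_getrecmaster_recv(struct ctdb_client_control_state *state,
                                    uint32_t *recmaster);

    /* fan the control out to every node, then collect the answers */
    static void scan_recmasters(struct ctdb_context *ctdb, TALLOC_CTX *mem_ctx,
                                const uint32_t *pnns, unsigned n)
    {
        struct ctdb_client_control_state *state[n];
        uint32_t recmaster;
        unsigned i;

        for (i = 0; i < n; i++) {
            state[i] = ctdb_ctrl_getrecmaster_send(ctdb, mem_ctx, pnns[i]);
        }
        for (i = 0; i < n; i++) {
            if (state[i] == NULL ||
                ctdb_ctrl_getrecmaster_recv(state[i], &recmaster) != 0) {
                /* debug info (opcode, node) is available in state */
            }
        }
    }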
places.
create a new helper function to generate new generation id values that
knows about the invalid id and avoids generating it.
update the ctdb status tool to know about the invalid generation id and
print the string INVALID instead
(This used to be ctdb commit 4fbcd189543cb8a92227fdcd3d158472e558ccda)
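A minimal sketch of such a helper, assuming a reserved invalid value of 1 (the actual reserved value is defined by ctdb; this is illustrative):

    #include <stdint.h>
    #include <stdlib.h>

    #define INVALID_GENERATION 1

    /* generate a new generation id, never the reserved invalid one */
    static uint32_t new_generation(void)
    {
        uint32_t generation;

        do {
            generation = (uint32_t)random();
        } while (generation == INVALID_GENERATION);

        return generation;
    }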
see both the old flags as well as the new flags (so we can tell which
flags changed)
send the CTDB_SRVID_RECONFIGURE messages to connected nodes only, not to
every node, connected or not, in the cluster.
in the handler inside the recovery daemon which is invoked for node flag
change messages, only do a takeover_run() and redistribute the ip addresses IF it was the
disabled or the unhealthy flags that changed. Also send out the cluster
reconfigured message in this case.
If any of the other flags changed we don't need to do the takeover_run()
here since that will be done during recovery.
(This used to be ctdb commit 5549b2058e2c148a8ca9d419123acf3247bb8829)
don't let those messages modify the DISCONNECTED flag.
the DISCONNECTED flag must be managed locally since it describes whether
the local node can communicate with the remote node or not
(This used to be ctdb commit 5650673205d335a32d4f27f66847ea66752a00f0)
cluster, we can't check that both the BANNED and the DISCONNECTED flags
are set at the same time, since if a node becomes banned just
before it is DISCONNECTED there is no guarantee that all other nodes
will have seen the BANNED flag.
So we must first check the DISCONNECTED flag only, and only if the
DISCONNECTED flag is not set should we check the BANNED flag.
Otherwise this can cause a recovery loop, with some nodes thinking the
disconnected node is DISCONNECTED|BANNED and others thinking it is just
DISCONNECTED
(This used to be ctdb commit 0967b2fff376ead631d98e78b3a97253fc109c69)
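A minimal sketch of the checking order; the flag values are illustrative, the real ones are defined by ctdb:

    #include <stdint.h>

    #define NODE_FLAGS_DISCONNECTED 0x00000001
    #define NODE_FLAGS_BANNED       0x00000008

    static int node_is_unavailable(uint32_t flags)
    {
        /* check DISCONNECTED first: a ban set just before a node
         * disconnects may not have been seen by every other node */
        if (flags & NODE_FLAGS_DISCONNECTED) {
            return 1;
        }
        /* only for connected nodes is the BANNED flag reliable */
        return (flags & NODE_FLAGS_BANNED) != 0;
    }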
- added DatabaseHashSize tunable
- added logging of events inside recovery (for timing)
(This used to be ctdb commit 3593cdb928b91e217faf1b3c537fa28dc82cdace)