IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
ctdb_sys_find_ifname() doesn't work for IPv6 addresses so don't use
it.
Trust the eventscript to do sanity checking on the interface. Current
warnings are replaced with equivalents generated by the eventscript.
The unlikely message:
Public IP %s is hosted on interface %s but we have no VNN
will be replaced by:
WARNING: Public IP %s hosted on interface %s but VNN says __none__
which is clear enough.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is part of a migration to Samba's lib/util. CTDB always passes 0
(i.e. no max_size) so use a simple assert() to enforce this, rather
than changing a lot of code that will be discarded anyway.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This was useful for debugging the race fixed by commit
4f79fa6c7c843502fcdaa2dead534ea3719b9f69. It might be useful again.
Also fix a nearby comment typo.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Jun 20 02:07:48 CEST 2014 on sn-devel-104
It might as well be near where it is used. Add a comment explaining
it.
Also add/update comments at the top of the RELEASE_IP and TAKEOVER_IP
loops to explain what is happening.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon May 5 06:20:39 CEST 2014 on sn-devel-104
Signed-off-by: Gregor Beck <gbeck@sernet.de>
Reviewed-by: David Disseldorp <ddiss@samba.org>
Reviewed-by: Michael Adam <obnox@samba.org>
Autobuild-User(master): Michael Adam <obnox@samba.org>
Autobuild-Date(master): Tue Apr 1 02:59:05 CEST 2014 on sn-devel-104
Previous commits maintained the ordering between
ctdb_remove_orphaned_ifaces() and ctdb_vnn_unassign_iface(). This
meant that ctdb_remove_orphaned_ifaces() needed to steal the orphaned
interfaces and they would be freed later.
Unassign the interface first and things get simpler.
ctdb_remove_orphaned_ifaces() is now self-contained.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Sun Mar 23 06:20:43 CET 2014 on sn-devel-104
reloadips really expects deleted IPs to be released before completing.
Otherwise the recovery daemon starts failing the local IP check. The
races that follow can cause a node to be banned.
To make the error handling simple, do the actual deletion in
release_ip_callback().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Commit 0723fedcedd4a97870f7b1224945f1587363c9bf added a cheap
implemention of ctdb_control_startup() that simply flags the recipient
node as needing to send updates for each IP when the tickle update
loop next fires. Commit 026996550d726836091ff5ebd1ebf925bf237bb0
ensures that a node only sends tickle updates once being flagged to do
so.
CTDB_CONTROL_STARTUP is broadcast to all nodes, so this is a good
start. However, the tickle updates are only broadcast to connected
nodes. A recently started node may not yet be considered to be
connected because the keepalive monitoring loop may not yet have
marked the node as connected. This means that the tickle update loop
races with the keepalive monitoring loop. If the tickle update loop
wins then updates will not be sent to the recently started node.
The simplest improvement is to stop the tickle update from depending
on whether a node is connected or not. So instead of broadcasting
tickle updates to connected nodes, they are broadcast to all nodes.
Since no reply is expected, this should work just fine.
While looking at this code, ctdb_ctrl_set_tcp_tickles() is named like
a client function. It isn't a client function. Also, 2 of the
arguments are ignored. So rename this function to
ctdb_send_set_tcp_tickles_for_ip() and remove the ignored arguments.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
CTDB tracks connections to be able to send tickle ACKs and gratuitous
ARPs. When there are no public IPs, there is no need for tickle ACKs
and gratuitous ARPs.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Tue Mar 4 03:01:38 CET 2014 on sn-devel-104
Fix suggested by by Kevin Osborn <kosborn@overlandstorage.com>.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Thu Feb 27 13:54:59 CET 2014 on sn-devel-104
tcp_update_flag is set to true whenever tickles are added or deleted.
This flag is used to determine whether or not to send tickles list to
other nodes. Once tickles list is sent to other nodes successfully,
set tcp_update_flag to false, so ctdbd does not keep sending same tickles
list every TickleUpdateInterval (20 seconds).
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
This doesn't implement what was recommended. That would require
careful error handling, probably with a fallback to this code anyway.
This is simple and does no worse that the current code. That is, the
new node is updated on the next call to tdb_update_tcp_tickles().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This fixes ctdb crash reported in bug #10366.
Fix suggested by Kevin Osborn <kosborn@overlandstorage.com>.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
* Remove unnecessary candimbl parameter.
This parameter can be cheaply calculated in
lcp2_failback_candidate(). The compiler will probably do an
excellent job optimising it. :-)
* Clarify a debug statement
This is much clearer than doing a complex recalculation of a known
value.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Currently this can be checked many times. However, there's no point
calling the rebalance/failback code at all if there are no rebalance
candidates.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
srcimbl gets changed on every iteration of the loop. The value that
should be stored for the new imbalance of the source node is
minsrcimbl.
To help diagnose this, added some extra debug that can be left in.
The extra debug changes the output of a couple of tests. Note that
the resulting IP allocations in those tests is unchanged - only the
debug output is changed.
Also add some new tests that illustrates the bug.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Currently timeouts for controls to inactive nodes can cause banning
credits to be applied. This should not happen.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This was added to support external monitoring using CTDB event scripts.
However, it was never used.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Also get rid of ctdb_set_event_script_dir(). It creates an
unnecessary copy of something that will be around for the lifetime of
the process.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 21b4d1aba00902f1eee0cbf4f082b0794fd5b738)
Otherwise, if existing IPs are added to extra nodes (that have,
perhaps, been disconnected) then those IPs will not be rebalanced
across the extra nodes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit ceb30432a9a550778aed0b422a654fc5287b82a3)
Deleting IPs can take a while because IPs are released and connections
are killed. This can take a while so do them in parallel. In fact,
since the set of IPs being added and deleted will be disjoint, send
all the adds/deletes at the same time and then wait.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 85a5b544ec032173e98c9cc3b5402a76b961aa3b)
The current implementation has a few flaws:
* A takeover run is called unconditionally when the timer goes even if
the recovery master role has moved. This means a node other than
the recovery master can incorrectly do a takeover run.
* The rebalancing target nodes are cleared in the setup for a takeover
run, regardless of whether the takeover run succeeds.
* The timer to force a rebalance isn't cleared if another takeover run
occurs before the deadline. Any forced rebalancing will happen in
the first takeover run and when the timer expires some time later
then an unnecessary takeover run will occur.
* If the recovery master role moves then the rebalancing data will
stay on the original node and affect the next takeover run to occur
if the recovery master role should come back to the original node.
Instead, store an array of rebalance target nodes in the recovery
master context. This is passed as an extra argument to
ctdb_takeover_run() each time it is called and is cleared when a
takeover run succeeds. The timer hangs off the array of rebalance
target nodes, which is cleared if the node isn't the recovery master.
This means that it is possible to lose rebalance data if the recovery
master role moves. However, that's a difficult problem to solve. The
best way of approaching it is probably to try to stop the recovery
master role from jumping around unnecesarily when inactive nodes join
the cluster.
The long term solution is to avoid this nonsense completely. The IP
allocation algorithm needs to cache state between runs so that it
knows which nodes have just become healthy. This also needs recovery
master stability.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c51c1efe5fc7fa668597f2acd435dee16e410fc9)
Previously flagging a failure was probably avoided because of attempts
to run "ipreallocated" events on stopped and banned nodes, which would
fail because they are in recovery. Given the change to a new control
and that fallback only retries the old method on active nodes, this
should never fail in reasonable circumstances.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 53722430ad35f80935aabd12fa07654126443b8b)
They will reject it because they are in recovery. This can result in
extra banning credits being applied to banned nodes.
This corresponds to commit 9132e6814ed927fa317f333f03dedb18f75d0e5b
from the 1.2.40 branch.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 403938804caf1322f9773d63197e4303a7b2a788)
This should have been removed with the associated code in commit
14bd0b6961ef1294e9cba74ce875386b7dfbf446.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 36de63843de10a1f2a9ccdbbee24cc1d08542984)
This is an internal structure. It was moved into ctdb_private.h a
long time ago to allow unit testing. Unit test compilation was
changed shortly afterwards to make this unnecessary.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit db57261d7dc264e161659a8c547f44fbd9e88eeb)
Commit f73a4b1495830bcdd094a93732a89dd53b3c2f78 added a safety check
to ensure that CTDB never kills unrelated processes. However, client
processes are unrelated.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 782814288bb560099ee44b607bf35f3eddf37f82)
Currently the fail callback is called once per (takeip/releaseip) control
failure. This is overkill and can get a node banned much too quickly.
Instead, keep track of control failures per node and only call fail
callback once per failed node.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit bf4a7c1ad87e0e848296d15d63eb8cd901ca5335)
This helps distinguish processes in process list in top, perf, etc.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 2493f57ce268d6fe7e4c40a87852c347fd60d29e)
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 41182623891d74a7e9e9c453183411a161201e67)
At the moment this is silent and it can be confusing to see IPs just
disappear.
Also, this message:
Been in recovery mode for too long. Dropping all IPS
can cause anxiety when all IPs should already have been dropped.
Adding a comforting message saying that 0 IPs were dropped relieves
such anxiety. :-)
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 4d0f26b306fc465d551d340b0e7dce4412eae3fd)
The log messages in verify_remote_ip_allocation() are confusing
because they don't include the PNN of the problem node, because it is
not known in this function.
Add the PNN of the node being verified as a function argument and then
shuffle the log messages around to make them clearer.
Also fold 3 nested if statements into just one.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f0942fa01cd422133fc9398f56b4855397d7bc86)
At the moment there (at least) are 2 bugs that cause rogue IPs:
* A race where release_ip_callback() runs after a "subsequent" take IP
has completed. The IP is back on an interface but we unset
vnn->iface in the callback.
* A "releaseip" eventscript times out. We ignore the timeout and call
it success, deleting the VNN even if the IP is still hosted.
We could decide not to ignore the timeout and ban the node, but
killing TCP connections can take a long time and that might result
in a lot of manning. We probably won't reinstate banning on
"releaseip" until killing TCP connections has been optimised.
In both cases, a rogue IP can be avoided by leaving vnn->iface set and
simply failing the control.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit c5797f2942e83da24df548ea07196fbbac0eab20)
Previous code changes work around a potential problems but do not
provide useful information when the a problem occurs.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit f1f1b0c24b9b6cd24b83a4e4da16e179287ec6ac)
Consider the case of upgrading a cluster node by node, where some
nodes are still running older versions of CTDB without the
IPREALLOCATED control. If a "new" node takes over as recovery master
and a failover occurs, then it will attempt to send IPREALLOCATED
controls to all nodes. The "old" nodes will fail in a fairly
nondescript way (result == -1).
To try to handle this situation, fall back to the EVENTSCRIPT control
to handle "ipreallocated". Only do this on the failed nodes.
However, do not do this on nodes that timed out (they've probably
implemented the control and we should call the regular fail_callback
to get those nodes banned) or for stopped nodes (since they can't
actually run the "ipreallocated" event via the EVENTSCRIPT control).
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit b2654853ce9b7c18c5874b080bc94d3118078a5d)
Currently the order of the first IP allocation, including the first
"ipreallocated" event, and the "startup" event is undefined. Both of
these events can (re)start services.
This stops IPs being hosted before the "startup" event has completed.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit f15dd562fd8c08cafd957ce9509102db7eb49668)
If a tunable is not implemented on a remote node then this should not
be fatal. In this case the takeover run can continue using benign
defaults for the tunables.
However, timeouts and any unexpected errors should be fatal. These
should abort the takeover run because they can lead to unexpected IP
movements.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c0c27762ea728ed86405b29c642ba9e43200f4ae)
Both of the current defaults are implicitly 0. It is better to make
the defaults obvious.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 1190bb0d9c14dc5889c2df56f6c8986db23d81a1)