IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This is a no-op and is in a separate commit to make the previous
commit less cumbersome.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 107e656bbe24f9d21fbaf886a3e9417da4effe5a)
This really needs to be per-node. The rename is because nodes with
this tunable switched on should drop IPs if they become unhealthy (or
disabled in some other way).
* Add new flag NODE_FLAGS_NOIPHOST, only used in recovery daemon.
* Enhance set_ipflags_internal() and set_ipflags() to setup
NODE_FLAGS_NOIPHOST depending on setting of NoIPHostOnAllDisabled
and/or whether nodes are disabled/inactive.
* Replace can_node_servce_ip() with functions can_node_host_ip() and
can_node_takeover_ip(). These functions are the only ones that need
to look at NODE_FLAGS_NOIPTAKEOVER and NODE_FLAGS_NOIPHOST. They
can make the decision without looking at any other flags due to
previous setup.
* Remove explicit flag checking in IP allocation functions (including
unassign_unsuitable_ips()) and just call can_node_host_ip() and
can_node_takeover_ip() as appropriate.
* Update test code to handle CTDB_SET_NoIPHostOnAllDisabled.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 1308a51f73f2e29ba4dbebb6111d9309a89732cc)
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 1fb5352d2b6918fcc6f630db49275d25a3eebe8d)
This means "ipreallocated" is now run on stopped nodes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 83b61f7414b1f7a3424497ac987ca0724fba9eaa)
This is an alternative to using ctdb_run_eventscripts() that can be
used when in recovery.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 27a44685f0d7a88804b61a1542bb42adc8f88cb1)
When collating IP information for IP layout, only trust the nodes that are
hosting an IP, to have correct information about that IP. Ignore what all the
other nodes think.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 1c7adbccc69ac276d2b957ad16c3802fdb8868ca)
This makes the code much more readable and maintainable.
As a side effect, fix a memory leak in LCP2.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 6a1d88a17321f7e1dc84b4823d5e7588516a6904)
Move the code into a new function so it can be called from a number of
places.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 8adb255e62dbe60d1e983047acd7b9c941231d11)
The retry loop is currently in ctdb_takeover_run_core(). Pushing it
into each function will make it possible to put each algorithm into a
separate top-level function. This will make the code much clearer and
more maintainable.
Also keep associated test code compatible.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit f6ce18d011dd9043b04256690d826deb2640cd89)
Neither basic_failback() nor lcp2_failback() unassign IPs anymore, so
there's no point looping back that far.
Also fix a unit test that now fails because looping back to handle
unassigned IPs is no longer logged.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit c09aeaecad7d3232b1c07bab826b96818756f5e0)
Instead of unassigning, looping back and depending on
basic_allocate_unassigned.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 4dc08e37dec464c8785a2ddae15c7c69d3c81ac3)
This seems to be the right thing to do instead of calling into the
failback code and continually skipping the release of an IP.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 4c87e7cb3fa2cf2e034fa8454364e0a7fe0c8f81)
If this is done earlier then some other logic can be improved. Also,
this should be a warning since no error condition is set.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit e06476e07197b7327b8bdac9c0b2e7281798ffec)
Add a new function ctdb_remove_orphaned_ifaces() and call it in
ctdb_control_del_public_address().
ctdb_remove_orphaned_ifaces() uses a naive implementation that does
things in a very obvious way. There are many ways to improve the
performance - some are mentioned in a comment in the code. However, I
doubt that this will be a bottleneck even with a large number of
public IPs. Running the eventscript is likely to outweigh the cost of
this cleanup.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a)
Neither up nor down is a good default value for the link status of a
new interface. Up means that IPs can be assigned to interfaces before
the true state is known and they can move away quickly if the interface
is actually down. Down means that IPs can't be assigned to an interface
for a variable amount of time - until a monitor cycle occurs - and this
can result in imbalanced IPs.
This is a neat compromise. Before the startup event completes, IPs
can't be assigned to interfaces because all interfaces begin in a down
state. As soon as the startup event completes, IPs can be allocated
to any interface that has been marked up by the eventscript. Later,
during normal operation, newly added IPs can be assigned to new
interfaces immediately. The IPs will still move away if an interface
is noticed to be down in the next monitor cycle, but that is the
exception rather than the rule.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 9275a69a414482f1053ae14528d5972575b9214e)
If any of the nodes fail takeover run (either due to timeout or failure
to complete within takeover_timeout interval) from main loop, recovery
master will give up trying takeover run with following message:
"Unable to setup public takeover addresses. Try again later"
And as a side-effect the monitoring is disabled on all the nodes. Before
ctdb_takeover_run() is called from main loop, monitoring get disabled via
startrecovery event. Since ctdb_takeover_run() fails, it never runs
recovered event and monitoring does not get re-enabled.
In main_loop, ctdb_takeover_run() is called with a takeover_fail_callback.
This callback will get called if any of the nodes fail in handling
takeip/releaseip/ipreallocated events in ctdb_takeover_run().
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit a5c6bb1fffb8dc3960af113957a1fd080cc7c245)
Disable for TakeoverTimeout seconds.
Otherwise the the recovery daemon can get overzealous and start trying
to add/delete addresses that it thinks are missing but where the
eventscript just hasn't finished. This didn't used to matter so much
but it is more important now that concurrent takeip/releaseip/updateip
generate error - we want to avoid spamming the log.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 56fcee3c7730cb12fa666072d5400949af6e5f7c)
There's a race here where release and takeover events for an IP can
run at the same time. For example, a "ctdb deleteip" and a takeover
initiated by the recovery daemon. The timeline is as follows:
1. The release code registers a callback to update the VNN. The
callback is executed *after* the eventscripts run the releaseip
event.
2. The release code calls the eventscripts for the releaseip event,
removing IP from its interface.
The takeover code "updates" the VNN saying that IP is on some
iface.... even if/though the address is already there.
3. The release callback runs, removing the iface associated with IP in
the VNN.
The takeover code calls the eventscripts for the takeip event,
adding IP to an interface.
As a result, CTDB doesn't think it should be hosting IP but IP is on
an interface. The recovery daemon fixes this later... but it
shouldn't happen.
This patch can cause some additional noise in the logs:
Release of IP 10.0.2.133/24 on interface eth2 node:2
recoverd:We are still serving a public address '10.0.2.133' that we should not be serving. Removing it.
Release of IP 10.0.2.133/24 rejected update for this IP already in flight
recoverd:client/ctdb_client.c:2455 ctdb_control for release_ip failed
recoverd:Failed to release local ip address
In this case the node has started releasing an IP when the recovery
daemon notices the addresses is still hosted and initiates another
release. This noise is harmless but annoying.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit bfe16cf69bf2eee93c0d831f76d88bba0c2b96c2)
Stops the behaviour where unhealthy nodes can host IPs when there are
no healthy nodes. Set this to 1 when an immediate complete outage is
preferred when all nodes are unhealthy. The alternative
(i.e. default) can lead to undefined behaviour when the shared
filesystem is unavailable.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit a555940fb5c914b7581667a05153256ad7d17774)
The existing code makes one fatally bad assumption:
vnn->iface->references can never be -1 (or max-unit32_t in this case).
Right now the reference counting is broken so a reference count of -1
is possible and causes a spurious updateip when vnn->iface is the same
as best_face. This can occur frequently because we get a lot of
redundant takeovers, especially when each IP can only be hosted on one
interface.
This makes the code much more defensive by noting that when best_iface
is the same as vnn->iface there is never a need for an updateip event.
This effectively neuters the updateip code path when IPs can only be
hosted by a single interface.
This should obsolete 6a74515f0a1e24d97cee3ba05d89133aac7ad2b7.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 7054e4ded59c6b8f254dcfefaef64da05f25aecd)
This reverts commit 4308935ba48ac7a29e7523315acf580019715f0f.
This fixes 16_ctdb_config_add_ip.sh test when run against local daemons. When
running against local daemons, if the interface is assigned as soon as an IP is
added, then takeover would never assign this IP address.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
(This used to be ctdb commit 06dfd13604d08910e07cbf927c338d7b9fce9a2f)
This message used to be correct because the ipreallocated event only
handled updating the NAT gateway. However, that has changed so the
message needs to be updated.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit cc9d96f4248e45ea99c5f00db1526426ac26fbc2)
When running on local daemons no IPs are actually assigned to
interfaces. Commit 9a806dec8687e2ec08a308853b61af6aed5e5d1e broke
ctdb_control_release_ip() for local daemons because it asks the system
which interface the given IP is on, instead of the old behaviour of
trusting CTDB's internal records.
For local deamons (i.e. !ctdb->do_checkpublicip) revert to the old
behaviour of looking up the interface internally. This is good
enough, given that the tests don't tend to misconfigure the addresses.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 38e8651b955afdbaf0ae87c24c55c052f8209290)
instead of using the interface where ctdb thinks the ip is hosted at.
The difference is that this now allows us to handle cases where we want to release an ip but ctdbd does not know which interface the ip is assigned on.
(user has used 'ip addr add...' and manually assigned an ip to the wrong interface)
(This used to be ctdb commit c6bf22ba5c01001b7febed73dd16a03bd3fd2bed)
Wrap all creation of child processes inside ctdb_fork() which is used to track all processes we have spawned.
Capture SIGCHLD to track also which child processes have terminated.
Wrap kill() inside ctdb_kill() and make sure that we never send a !0 signal to a child process pid that has already terminated (and might have been replaced with a
(This used to be ctdb commit f73a4b1495830bcdd094a93732a89dd53b3c2f78)
and also update the "read public address file" to not check if the address exists already locally when we read if from the child process, to stop it
from spamming the logs with "We already host ..."
messages
(This used to be ctdb commit 334ea830f1bf33419f4a1e78f23afd41a852d0f4)
Also add a method to use the recovery master/daemon to reload the public ips on all nodes in the cluster.
Reloading the public ips on all node sin the cluster is only suported if all nodes in the cluster are available and healthy.
(This used to be ctdb commit 05603e914f8c12618d7e06943c0f7df207f645b0)
Remove the old global setting for this unused tunable and add it as a new node flag. This node flag is only valid/defined within the takeover subsystem in the recovery daemon. Add async functions to collec the NoIPFailback settings for each node.
This will later e used to disqualify certain nodes from being takeover targets when we perform reallocation.
(This used to be ctdb commit 668f3e88a9e5f598706952b7140547640c85a5ed)
the referenced VNN structure is.
Also, remove the circular reference between the two objects KIPPCTP and VNN
(This used to be ctdb commit 02b62482164a3c69715949074feb7f191a29d534)
These were intentionally not static so they could be linked to in unit
test programs. However, using the CCAN-style unit tests where
relevant code is just included, this is no longer necessary.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit d0e9e8554614bd49ffb9ec3509feaa0e80d0f65d)
This patch changes the callback signature for traversal
functions to allow a client to abort a traverse before it finishes.
Updates to all callers and examples as well as rb-test tool.
(This used to be ctdb commit 8ab0c63ad36cfbbb1e5fed46a1f4c47b1fdb581f)
There's a bug in LCP2. Selecting the node with the highest imbalance
doesn't always work. Some nodes can have a high imbalance metric
because they have a lot of IPs. However, these nodes can be part of a
group that is perfectly balanced. Nodes in another group with less
IPs might actually be imbalanced.
Instead of just trying the source node with the highest imbalance this
tries them in descending order of imbalance until it finds one where
an IP can be moved to another node.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 574091d5aced5e87aefad52f8bc47aa75c25fbf6)
There's a bug in LCP2. Selecting the node with the highest imbalance
doesn't always work. Some nodes can have a high imbalance metric
because they have a lot of IPs. However, these nodes can be part of a
group that is perfectly balanced. Nodes in another group with less
IPs might actually be imbalanced.
Factor out the code from lcp2_failback() that actually takes a node
and decides which address should be moved to which node.
This is the first step in fixing the above bug.
Signed-off-by: Martin Schwenke <martin@meltin.net>
(This used to be ctdb commit 75718c5768b5bb5c0bcd7dd90e0327c6ed22a63d)
this triggered a check for "only run the eventscript if we host the address" to trigger and shortcir=cuit calling the eventscript.
An effect of this would be that 'ctdb delip' would remove the ip from ctdb, but fail to delete it from the interface.
S1028798
(This used to be ctdb commit b82524f240bf21769dd7624ca6026763d38b9396)
cant talloc off vnn since it is not yet initialized and might not always be NULL
(This used to be ctdb commit 3d37be3e2bfb61ede824028aeebaa18ba304faae)
This will make it much easier to root-cause problems such as
S1029023
when an external application deleted the interface while it is still is in use by ctdbd.
(This used to be ctdb commit 9abf9c919a7e6789695490e2c3de56c21b63fa57)