mirror of https://github.com/samba-team/samba.git synced 2024-12-23 17:34:34 +03:00
210 Commits

Author SHA1 Message Date
Martin Schwenke
0baefba368 ctdbd: Removed bogus comment in ctdb_find_iface()
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 4a8d90d0812a3242f58a2a0e2aa0f528f60f7013)
2013-05-22 14:24:21 +10:00
Martin Schwenke
54e91df60d recoverd: Move IP flags into ctdb_takeover.c
These should never be seen outside the IP allocation code.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit e143abd16ccde2e0edfe103673d31a5fb06b6aef)
2013-05-09 12:55:42 +10:00
Martin Schwenke
50f19b5bd4 recoverd: Clear IP flags after IP allocation algorithm has run
If these flags are left set they will confuse other recovery daemon
code.

Factor the clearing code into new function clear_ipflags().

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 45c776958017ea7001f061842c9e0f60e4a25f23)
2013-05-09 12:55:42 +10:00
Martin Schwenke
530020d83b recoverd: Remove unused mask argument and initial mask calculation
This has been replaced by set_ipflags() and associated functionality.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit d0a3822573db296e73cc897835f783c8abc084b3)
2013-05-07 16:20:47 +10:00
Martin Schwenke
ee7357de51 recoverd: When calculating rebalance candidates don't consider flags
This is really a check to see if a node is already hosting IPs.  If
so, we assume it was previously healthy so it isn't considered as a
rebalance candidate.  There's no need to limit this to healthy nodes,
since this is checked elsewhere.

Due to this the variable newly_healthy is renamed everywhere to
rebalance_candidates.

The mask argument is now completely unused.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 65e0ea6c2c0629e19349ba4b9affa221fde2b070)
2013-05-07 16:20:47 +10:00
Martin Schwenke
c9056b4f88 recoverd: Remove unused mask argument from IP allocation functions
This is a no-op and is in a separate commit to make the previous
commit less cumbersome.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 107e656bbe24f9d21fbaf886a3e9417da4effe5a)
2013-05-07 16:20:47 +10:00
Martin Schwenke
0445c988e2 recoverd: Fix tunable NoIPTakeoverOnDisabled, rename to NoIPHostOnAllDisabled
This really needs to be per-node.  The rename is because nodes with
this tunable switched on should drop IPs if they become unhealthy (or
disabled in some other way).

* Add new flag NODE_FLAGS_NOIPHOST, only used in recovery daemon.

* Enhance set_ipflags_internal() and set_ipflags() to set up
  NODE_FLAGS_NOIPHOST depending on setting of NoIPHostOnAllDisabled
  and/or whether nodes are disabled/inactive.

* Replace can_node_servce_ip() with functions can_node_host_ip() and
  can_node_takeover_ip().  These functions are the only ones that need
  to look at NODE_FLAGS_NOIPTAKEOVER and NODE_FLAGS_NOIPHOST.  They
  can make the decision without looking at any other flags due to
  previous setup.

* Remove explicit flag checking in IP allocation functions (including
  unassign_unsuitable_ips()) and just call can_node_host_ip() and
  can_node_takeover_ip() as appropriate.

* Update test code to handle CTDB_SET_NoIPHostOnAllDisabled.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 1308a51f73f2e29ba4dbebb6111d9309a89732cc)
2013-05-07 16:20:46 +10:00
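The split between the two predicates named in the commit above can be sketched roughly as follows. This is an illustration only, not the CTDB code: the struct, the flag values and the simplified signatures are assumptions, and only the names NODE_FLAGS_NOIPTAKEOVER, NODE_FLAGS_NOIPHOST, can_node_host_ip() and can_node_takeover_ip() come from the commit message. The sketch also assumes that a node which may not host an IP may not take one over either.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for illustration only. */
#define NODE_FLAGS_NOIPTAKEOVER 0x01u
#define NODE_FLAGS_NOIPHOST     0x02u

/* Hypothetical per-node state as seen by the IP allocation code. */
struct ipflags_sketch {
	uint32_t flags;
};

/* A node may keep hosting an IP unless NOIPHOST has been set for it. */
static bool can_node_host_ip(const struct ipflags_sketch *node)
{
	assert(node != NULL);
	return (node->flags & NODE_FLAGS_NOIPHOST) == 0;
}

/* Taking over a new IP additionally requires NOIPTAKEOVER to be clear. */
static bool can_node_takeover_ip(const struct ipflags_sketch *node)
{
	assert(node != NULL);
	if (node->flags & NODE_FLAGS_NOIPTAKEOVER) {
		return false;
	}
	return can_node_host_ip(node);
}
```

Because the flags are set up beforehand, callers in this sketch never need to look at health or disabled state directly; they only ask one of the two questions.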
Martin Schwenke
ac80824709 recoverd: Factor out new function all_nodes_are_disabled()
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 12aef10e9889760d98f58c8d916f19d069fa381a)
2013-05-07 16:20:46 +10:00
Martin Schwenke
657162fb34 recoverd: Refactor code to get NoIPTakeover tunable from all nodes
Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 1fb5352d2b6918fcc6f630db49275d25a3eebe8d)
2013-05-07 16:20:46 +10:00
Martin Schwenke
17521b31b2 recoverd: Add debug message when dropping IPs in IP allocation
Update tests accordingly.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 91405282ba4abad4ad8e8c5f7ee4c83c75f38280)
2013-05-07 16:20:46 +10:00
Martin Schwenke
745c6bc363 recoverd: ctdb_takeover_run() uses CTDB_CONTROL_IPREALLOCATED
This means "ipreallocated" is now run on stopped nodes.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 83b61f7414b1f7a3424497ac987ca0724fba9eaa)
2013-05-06 13:38:21 +10:00
Martin Schwenke
2e59cd5428 ctdbd: New control CTDB_CONTROL_IPREALLOCATED
This is an alternative to using ctdb_run_eventscripts() that can be
used when in recovery.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 27a44685f0d7a88804b61a1542bb42adc8f88cb1)
2013-05-06 13:38:21 +10:00
Amitay Isaacs
77a29b3733 recoverd/takeover: Use IP->node mapping info from nodes hosting that IP
When collating IP information for IP layout, only trust the nodes that are
hosting an IP, to have correct information about that IP.  Ignore what all the
other nodes think.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 1c7adbccc69ac276d2b957ad16c3802fdb8868ca)
2013-04-08 11:14:32 +10:00
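The collation rule in the commit above (trust only the node that claims to host an IP itself) might look something like the sketch below. All names here (ip_report_sketch, collate_ip_owners, NO_NODE) and the flat report layout are hypothetical and do not come from CTDB.

```c
#include <assert.h>
#include <stddef.h>

#define NO_NODE (-1)   /* hypothetical "unassigned" marker */

/* Hypothetical report: what one node believes about one public IP. */
struct ip_report_sketch {
	int reporter;   /* node that sent this report */
	int ip_index;   /* index of the public IP in a shared table */
	int host;       /* node the reporter believes hosts the IP */
};

/*
 * Fill owner[] (num_ips entries) from the reports, accepting a report
 * only when the reporting node claims to host the IP itself.
 */
static void collate_ip_owners(const struct ip_report_sketch *reports,
			      size_t num_reports,
			      int *owner, size_t num_ips)
{
	size_t i;

	assert(owner != NULL);
	for (i = 0; i < num_ips; i++) {
		owner[i] = NO_NODE;
	}
	for (i = 0; i < num_reports; i++) {
		const struct ip_report_sketch *r = &reports[i];

		if (r->ip_index < 0 || (size_t)r->ip_index >= num_ips) {
			continue;   /* ignore reports about unknown IPs */
		}
		if (r->host == r->reporter) {
			owner[r->ip_index] = r->reporter;
		}
	}
}
```

In this sketch, if two nodes both claim the same IP the later report wins; how the real code resolves such a conflict is not stated in the commit message.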
Martin Schwenke
53bd183683 recoverd: Separate each IP allocation algorithm into its own function
This makes the code much more readable and maintainable.

As a side effect, fix a memory leak in LCP2.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 6a1d88a17321f7e1dc84b4823d5e7588516a6904)
2013-01-08 10:16:11 +11:00
Martin Schwenke
2e8df43561 recoverd: New function unassign_unsuitable_ips()
Move the code into a new function so it can be called from a number of
places.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 8adb255e62dbe60d1e983047acd7b9c941231d11)
2013-01-08 10:16:11 +11:00
Martin Schwenke
bcefb76884 recoverd: Move failback retry loop into basic_failback() and lcp2_failback()
The retry loop is currently in ctdb_takeover_run_core().  Pushing it
into each function will make it possible to put each algorithm into a
separate top-level function.  This will make the code much clearer and
more maintainable.

Also keep associated test code compatible.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit f6ce18d011dd9043b04256690d826deb2640cd89)
2013-01-08 10:16:11 +11:00
Martin Schwenke
443fbb9e01 recoverd: Trying to failback more IPs no longer allocates unassigned IPs
Neither basic_failback() nor lcp2_failback() unassign IPs anymore, so
there's no point looping back that far.

Also fix a unit test that now fails because looping back to handle
unassigned IPs is no longer logged.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit c09aeaecad7d3232b1c07bab826b96818756f5e0)
2013-01-08 10:16:11 +11:00
Martin Schwenke
dfa7ce7b73 recoverd: basic_failback() can call find_takeover_node() directly
Instead of unassigning, looping back and depending on
basic_allocate_unassigned.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 4dc08e37dec464c8785a2ddae15c7c69d3c81ac3)
2013-01-08 10:16:11 +11:00
Martin Schwenke
326328d520 recoverd: Don't do failback at all when deterministic IPs are in use
This seems to be the right thing to do instead of calling into the
failback code and continually skipping the release of an IP.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 4c87e7cb3fa2cf2e034fa8454364e0a7fe0c8f81)
2013-01-08 10:16:11 +11:00
Martin Schwenke
ef403f70f2 recoverd: Move the test for both 'DeterministicIPs' and 'NoIPFailback' set
If this is done earlier then some other logic can be improved.  Also,
this should be a warning since no error condition is set.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit e06476e07197b7327b8bdac9c0b2e7281798ffec)
2013-01-08 10:16:11 +11:00
Martin Schwenke
a3911ed7bf recoverd: Fix a memory leak in IP allocation
Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit bcd5f587aff3ba536cb0b5ef00d2d802352bae25)
2013-01-08 10:16:11 +11:00
Martin Schwenke
4f0d68cba6 ctdbd: Clean up orphaned interfaces when an IP is deleted
Add a new function ctdb_remove_orphaned_ifaces() and call it in
ctdb_control_del_public_address().

ctdb_remove_orphaned_ifaces() uses a naive implementation that does
things in a very obvious way.  There are many ways to improve the
performance - some are mentioned in a comment in the code.  However, I
doubt that this will be a bottleneck even with a large number of
public IPs.  Running the eventscript is likely to outweigh the cost of
this cleanup.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a)
2013-01-07 12:19:33 +11:00
Martin Schwenke
0f1bcebc80 ctdbd: Make the link status of new interfaces more flexible
Neither up nor down is a good default value for the link status of a
new interface.  Up means that IPs can be assigned to interfaces before
the true state is known and they can move away quickly if the interface
is actually down.  Down means that IPs can't be assigned to an interface
for a variable amount of time - until a monitor cycle occurs - and this
can result in imbalanced IPs.

This is a neat compromise.  Before the startup event completes, IPs
can't be assigned to interfaces because all interfaces begin in a down
state.  As soon as the startup event completes, IPs can be allocated
to any interface that has been marked up by the eventscript.  Later,
during normal operation, newly added IPs can be assigned to new
interfaces immediately.  The IPs will still move away if an interface
is noticed to be down in the next monitor cycle, but that is the
exception rather than the rule.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 9275a69a414482f1053ae14528d5972575b9214e)
2012-11-19 15:53:13 +11:00
Amitay Isaacs
85c8deca3f recoverd: Track the nodes that fail takeover run and set culprit count
If any of the nodes fail takeover run (either due to timeout or failure
to complete within takeover_timeout interval) from main loop, recovery
master will give up trying takeover run with following message:

  "Unable to setup public takeover addresses. Try again later"

And as a side-effect the monitoring is disabled on all the nodes. Before
ctdb_takeover_run() is called from main loop, monitoring gets disabled via
the startrecovery event. Since ctdb_takeover_run() fails, it never runs the
recovered event and monitoring does not get re-enabled.

In main_loop, ctdb_takeover_run() is called with a takeover_fail_callback.
This callback will get called if any of the nodes fail in handling
takeip/releaseip/ipreallocated events in ctdb_takeover_run().

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit a5c6bb1fffb8dc3960af113957a1fd080cc7c245)
2012-11-14 10:59:54 +11:00
Martin Schwenke
62046a8a4c recoverd: When starting a takeover run disable IP verification
Disable for TakeoverTimeout seconds.

Otherwise the recovery daemon can get overzealous and start trying
to add/delete addresses that it thinks are missing but where the
eventscript just hasn't finished.  This didn't use to matter so much
but it is more important now that concurrent takeip/releaseip/updateip
generate errors - we want to avoid spamming the log.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 56fcee3c7730cb12fa666072d5400949af6e5f7c)
2012-10-11 12:10:45 +11:00
Martin Schwenke
4b4e4d8870 ctdbd: Stop takeovers and releases from colliding in mid-air
There's a race here where release and takeover events for an IP can
run at the same time.  For example, a "ctdb deleteip" and a takeover
initiated by the recovery daemon.  The timeline is as follows:

1. The release code registers a callback to update the VNN.  The
   callback is executed *after* the eventscripts run the releaseip
   event.

2. The release code calls the eventscripts for the releaseip event,
   removing IP from its interface.

   The takeover code "updates" the VNN saying that IP is on some
   iface.... even if/though the address is already there.

3. The release callback runs, removing the iface associated with IP in
   the VNN.

   The takeover code calls the eventscripts for the takeip event,
   adding IP to an interface.

As a result, CTDB doesn't think it should be hosting IP but IP is on
an interface.  The recovery daemon fixes this later... but it
shouldn't happen.

This patch can cause some additional noise in the logs:

  Release of IP 10.0.2.133/24 on interface eth2  node:2
  recoverd:We are still serving a public address '10.0.2.133' that we should not be serving. Removing it.
  Release of IP 10.0.2.133/24 rejected update for this IP already in flight
  recoverd:client/ctdb_client.c:2455 ctdb_control for release_ip failed
  recoverd:Failed to release local ip address

In this case the node has started releasing an IP when the recovery
daemon notices the address is still hosted and initiates another
release.  This noise is harmless but annoying.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit bfe16cf69bf2eee93c0d831f76d88bba0c2b96c2)
2012-10-11 12:10:45 +11:00
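One way to picture the guard that produces the "rejected update for this IP already in flight" message is a per-IP flag that is set when a takeover or release starts and cleared when its callback completes. The sketch below assumes exactly that; vnn_sketch, begin_ip_update() and end_ip_update() are hypothetical names, and the real patch may track the in-flight state differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IP state; the real VNN structure holds much more. */
struct vnn_sketch {
	bool update_in_flight;
};

/*
 * Called before starting a takeip/releaseip for this IP.
 * Returns 0 if the caller may proceed, -1 if another update for the
 * same IP has started and not yet finished.
 */
static int begin_ip_update(struct vnn_sketch *vnn)
{
	assert(vnn != NULL);
	if (vnn->update_in_flight) {
		return -1;
	}
	vnn->update_in_flight = true;
	return 0;
}

/* Called from the completion callback once the eventscript has run. */
static void end_ip_update(struct vnn_sketch *vnn)
{
	assert(vnn != NULL);
	vnn->update_in_flight = false;
}
```

With such a guard, the second of two overlapping operations is rejected up front instead of interleaving with the first, which is the source of the extra log noise described above.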
Martin Schwenke
79ea15bf96 ctdbd: New tunable NoIPTakeoverOnDisabled
Stops the behaviour where unhealthy nodes can host IPs when there are
no healthy nodes.  Set this to 1 when an immediate complete outage is
preferred when all nodes are unhealthy.  The alternative
(i.e. default) can lead to undefined behaviour when the shared
filesystem is unavailable.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit a555940fb5c914b7581667a05153256ad7d17774)
2012-10-11 12:10:45 +11:00
Martin Schwenke
9aa9abcc19 ctdbd: Avoid unnecessary updateip event
The existing code makes one fatally bad assumption:
vnn->iface->references can never be -1 (or max-uint32_t in this case).
Right now the reference counting is broken so a reference count of -1
is possible and causes a spurious updateip when vnn->iface is the same
as best_iface.  This can occur frequently because we get a lot of
redundant takeovers, especially when each IP can only be hosted on one
interface.

This makes the code much more defensive by noting that when best_iface
is the same as vnn->iface there is never a need for an updateip event.
This effectively neuters the updateip code path when IPs can only be
hosted by a single interface.

This should obsolete 6a74515f0a1e24d97cee3ba05d89133aac7ad2b7.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 7054e4ded59c6b8f254dcfefaef64da05f25aecd)
2012-10-10 14:54:53 +11:00
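The two points in the commit above can be illustrated with a small sketch: an unsigned reference count that is decremented from 0 wraps to the maximum uint32_t value, and comparing the interfaces first avoids depending on the count at all. The struct and function names are hypothetical, and any other conditions the real code checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface record with an unsigned reference count. */
struct iface_sketch {
	const char *name;
	uint32_t references;
};

/*
 * Drop one reference and return the new count.  Unsigned arithmetic is
 * modular, so dropping from 0 yields UINT32_MAX rather than -1.
 */
static uint32_t drop_reference(struct iface_sketch *iface)
{
	assert(iface != NULL);
	iface->references--;
	return iface->references;
}

/*
 * Defensive check: if the best interface is the one already in use,
 * no updateip is needed, whatever the reference counts say.
 */
static bool needs_updateip(const struct iface_sketch *current,
			   const struct iface_sketch *best)
{
	if (current == best) {
		return false;
	}
	return true;
}
```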
Amitay Isaacs
3c1f656764 Revert "when creating/adding a public ip, set the initial interface to be the first interface specified"
This reverts commit 4308935ba48ac7a29e7523315acf580019715f0f.

This fixes 16_ctdb_config_add_ip.sh test when run against local daemons. When
running against local daemons, if the interface is assigned as soon as an IP is
added, then takeover would never assign this IP address.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 06dfd13604d08910e07cbf927c338d7b9fce9a2f)
2012-10-07 15:25:34 +11:00
Martin Schwenke
7df1da1c91 recoverd: Update a log message that has bit-rotted
This message used to be correct because the ipreallocated event only
handled updating the NAT gateway.  However, that has changed so the
message needs to be updated.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit cc9d96f4248e45ea99c5f00db1526426ac26fbc2)
2012-08-08 16:11:11 +10:00
Martin Schwenke
75a0041567 ctdbd: Fix ctdb_control_release_ip() on local daemons
When running on local daemons no IPs are actually assigned to
interfaces.  Commit 9a806dec8687e2ec08a308853b61af6aed5e5d1e broke
ctdb_control_release_ip() for local daemons because it asks the system
which interface the given IP is on, instead of the old behaviour of
trusting CTDB's internal records.

For local daemons (i.e. !ctdb->do_checkpublicip) revert to the old
behaviour of looking up the interface internally.  This is good
enough, given that the tests don't tend to misconfigure the addresses.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit 38e8651b955afdbaf0ae87c24c55c052f8209290)
2012-07-26 22:10:54 +10:00
Amitay Isaacs
e379fc3ea5 Fix compiler warnings.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit d29e1880c8ce7219e065d31b47b0e8ad9e83146d)
2012-07-13 14:50:56 +10:00
Ronnie Sahlberg
c7e648c2d1 When we release an ip, get the interface name from the kernel
instead of using the interface where ctdb thinks the ip is hosted.
The difference is that this now allows us to handle cases where we want to release an ip but ctdbd does not know which interface the ip is assigned on.
(e.g. the user has used 'ip addr add...' and manually assigned an ip to the wrong interface)

(This used to be ctdb commit c6bf22ba5c01001b7febed73dd16a03bd3fd2bed)
2012-06-20 15:11:56 +10:00
Amitay Isaacs
7631830152 server: Replace BOOL datatype with bool, True/False with true/false
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 6e5cbe8fff71985e5a2fc16b7e9f2b868011ff5d)
2012-05-28 11:22:25 +10:00
Ronnie Sahlberg
a57eba2bb4 Track all child process so we never send a signal to an unrelated process (our child died and kernel wrapped the pid-space and reused the pid for a different process
Wrap all creation of child processes inside ctdb_fork() which is used to track all processes we have spawned.
Capture SIGCHLD to track also which child processes have terminated.

Wrap kill() inside ctdb_kill() and make sure that we never send a !0 signal to a child process pid that has already terminated (and might have been replaced with a

(This used to be ctdb commit f73a4b1495830bcdd094a93732a89dd53b3c2f78)
2012-05-03 14:03:26 +10:00
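The idea behind wrapping fork() and kill() can be sketched as a small table of live child pids plus a checked kill. This is not the CTDB implementation: track_pid(), untrack_pid(), is_tracked(), checked_kill() and the fixed-size table are hypothetical, and returning ESRCH for an untracked pid is an assumption of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACKED 64   /* arbitrary limit for the sketch */

static pid_t tracked[MAX_TRACKED];
static size_t num_tracked;

/* Record a child pid, e.g. right after fork() returns in the parent. */
static bool track_pid(pid_t pid)
{
	assert(pid > 0);
	if (num_tracked == MAX_TRACKED) {
		return false;
	}
	tracked[num_tracked++] = pid;
	return true;
}

/* Forget a pid once the child is known to have terminated. */
static void untrack_pid(pid_t pid)
{
	size_t i;

	for (i = 0; i < num_tracked; i++) {
		if (tracked[i] == pid) {
			tracked[i] = tracked[--num_tracked];
			return;
		}
	}
}

static bool is_tracked(pid_t pid)
{
	size_t i;

	for (i = 0; i < num_tracked; i++) {
		if (tracked[i] == pid) {
			return true;
		}
	}
	return false;
}

/* Refuse to deliver a real signal to a pid we no longer track. */
static int checked_kill(pid_t pid, int signum)
{
	if (signum != 0 && !is_tracked(pid)) {
		errno = ESRCH;
		return -1;
	}
	return kill(pid, signum);
}
```

Signal 0 is passed through because it only probes for existence and cannot harm an unrelated process. A real implementation would also need to consider how the table is updated safely from the SIGCHLD path, which this sketch leaves out.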
Ronnie Sahlberg
a367fa6138 RELOADIPS: simplify the reloadips code a bit
and also update the "read public address file" code to not check if the address already exists locally when we read it from the child process, to stop it
from spamming the logs with "We already host ..."
messages

(This used to be ctdb commit 334ea830f1bf33419f4a1e78f23afd41a852d0f4)
2012-05-01 15:34:26 +10:00
Ronnie Sahlberg
7a1aa560e7 Add new control to reload the public ip address file on a node
Also add a method to use the recovery master/daemon to reload the public ips on all nodes in the cluster.
Reloading the public ips on all nodes in the cluster is only supported if all nodes in the cluster are available and healthy.

(This used to be ctdb commit 05603e914f8c12618d7e06943c0f7df207f645b0)
2012-05-01 10:48:08 +10:00
Ronnie Sahlberg
db411aaada Merge remote branch 'amitay/tevent-sync'
(This used to be ctdb commit 17ff3f240b0d72c72ed28d70fb9aeb3b20c80670)
2012-04-26 08:09:23 +10:00
Amitay Isaacs
4392591555 Remove explicit include of lib/tevent/tevent.h.
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 0681014ca5ed2a9b56f63fdace7f894beccf8a9a)
2012-04-13 17:28:14 +10:00
Amitay Isaacs
b3d098ced7 ctdbd: Fix spurious warnings when running with --nopublicipcheck
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 67b909a0718d6cfce82ffce0830da3a6ff1f6c4b)
2012-04-13 15:38:11 +10:00
Amitay Isaacs
425b8768ee ctdbd: Fix the error message string
Signed-off-by: Amitay Isaacs <amitay@gmail.com>

(This used to be ctdb commit 15f63ebab9686734f41a6adf38d4a7faa919ac66)
2012-04-13 14:51:13 +10:00
Ronnie Sahlberg
2456f77ca6 NoIPTakeover: change the tunable name for the "don't allow failing addresses over onto the node" to NoIPTakeover
(This used to be ctdb commit 35592e618cfd827b6978af6332f80504f232c46a)
2012-03-22 11:05:15 +11:00
Ronnie Sahlberg
9f31f76805 NoIPFailback: Exclude nodes which have NoIPFailback as failback targets during reallocation
(This used to be ctdb commit c262c29773d1608e7ce04bdfb7f4469df0a9637b)
2012-03-22 09:24:32 +11:00
Ronnie Sahlberg
befa9df152 Make NoIPFailback a node local setting. Nodes that have NoIPFailback set to !0 can not takeover new ip addresses during failover.
Remove the old global setting for this unused tunable and add it as a new node flag. This node flag is only valid/defined within the takeover subsystem in the recovery daemon. Add async functions to collect the NoIPFailback settings for each node.

This will later be used to disqualify certain nodes from being takeover targets when we perform reallocation.

(This used to be ctdb commit 668f3e88a9e5f598706952b7140547640c85a5ed)
2012-03-22 09:09:57 +11:00
Ronnie Sahlberg
ef2bd0b016 When adding ips to nodes, set up a deferred rebalance for the whole node to trigger after 60 seconds in case the normal ipreallocated is not sufficient to trigger rebalance.
(This used to be ctdb commit 4340263b219d75c39f8de22abe3f6f1c1ee63ea2)
2012-02-28 06:56:04 +11:00
Ronnie Sahlberg
91c9371f2d Make KILLTCP structure a child of VNN so that it is freed at the same time
the referenced VNN structure is.

Also, remove the circular reference between the two objects KILLTCP and VNN

(This used to be ctdb commit 02b62482164a3c69715949074feb7f191a29d534)
2012-02-27 07:21:26 +11:00
Volker Lendecke
5e3b13a32a FreeBSD does not define s6_addr32, only s6_addr
Signed-off-by: Michael Adam <obnox@samba.org>

(This used to be ctdb commit d657af4fb68ce3f7c462856f2934f6bf169e120b)
2012-02-13 16:20:12 +01:00
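The commit message above does not say how the code was changed. One portable approach, shown here purely as an assumption, is to compare IPv6 addresses through the byte array s6_addr, which POSIX requires, rather than through the 32-bit view s6_addr32, which is a platform extension. The helper name same_ip6() is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare two IPv6 addresses using only the portable s6_addr member. */
static bool same_ip6(const struct in6_addr *a, const struct in6_addr *b)
{
	assert(a != NULL && b != NULL);
	return memcmp(a->s6_addr, b->s6_addr, sizeof(a->s6_addr)) == 0;
}
```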
Martin Schwenke
3ae8273d86 Make some ctdb_takeover.c functions static
These were intentionally not static so they could be linked to in unit
test programs.  However, using the CCAN-style unit tests where
relevant code is just included, this is no longer necessary.

Signed-off-by: Martin Schwenke <martin@meltin.net>

(This used to be ctdb commit d0e9e8554614bd49ffb9ec3509feaa0e80d0f65d)
2011-11-11 14:41:47 +11:00
Ronnie Sahlberg
8db9b73920 Merge remote branch 'martins/lcp2fix'
(This used to be ctdb commit 7c02d242af552aa732f5c70ea4eeefbc8a8542e2)
2011-11-08 14:06:30 +11:00
Ronnie Sahlberg
0f92fa224c RB_TREE: Add mechanism to abort a traverse
This patch changes the callback signature for traversal
functions to allow a client to abort a traverse before it finishes.
Updates to all callers and examples as well as rb-test tool.

(This used to be ctdb commit 8ab0c63ad36cfbbb1e5fed46a1f4c47b1fdb581f)
2011-11-08 13:40:28 +11:00
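An abortable traverse is commonly built by letting the callback return an int and stopping as soon as it returns non-zero. The sketch below shows that pattern on a plain binary tree; it is not the CTDB RB_TREE API, and node_sketch, traverse_cb, traverse(), find_state and stop_at_key() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; balancing information is omitted. */
struct node_sketch {
	int key;
	struct node_sketch *left;
	struct node_sketch *right;
};

/* Callback returns 0 to continue, non-zero to abort the traverse. */
typedef int (*traverse_cb)(void *param, const struct node_sketch *node);

/* In-order traverse that propagates the first non-zero callback result. */
static int traverse(const struct node_sketch *n, traverse_cb cb, void *param)
{
	int ret;

	assert(cb != NULL);
	if (n == NULL) {
		return 0;
	}
	ret = traverse(n->left, cb, param);
	if (ret != 0) {
		return ret;
	}
	ret = cb(param, n);
	if (ret != 0) {
		return ret;
	}
	return traverse(n->right, cb, param);
}

/* Example callback state: count visits and stop at a wanted key. */
struct find_state {
	int wanted;
	int visited;
};

static int stop_at_key(void *param, const struct node_sketch *node)
{
	struct find_state *st = param;

	st->visited++;
	return node->key == st->wanted ? 1 : 0;
}
```

A caller that wants the old behaviour simply always returns 0 from its callback, so the whole tree is visited.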