With a file "home_nodes" next to "public_addresses" you can assign
public IPs to specific nodes when using the deterministic allocation
algorithm. Whenever the "home node" is up, the IP address will be
assigned to that node, independent of any other deterministic
calculation. The line
192.168.21.254 2
in the file "home_nodes" assigns the IP address to node 2. Only when
node 2 is not able to host IP addresses, 192.168.21.254 undergoes the
normal deterministic IP allocation algorithm.
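The selection rule described above can be sketched in C. This is a minimal illustration under stated assumptions, not CTDB's implementation: `struct home_node`, `parse_home_node_line()` and `choose_node()` are hypothetical names, and the result of the normal deterministic algorithm is simply passed in by the caller.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical representation of one "home_nodes" entry: "<ip> <pnn>" */
struct home_node {
	char ip[64];
	unsigned int pnn;
};

/* Parse a line such as "192.168.21.254 2".  Returns true on success. */
static bool parse_home_node_line(const char *line, struct home_node *hn)
{
	return sscanf(line, "%63s %u", hn->ip, &hn->pnn) == 2;
}

/*
 * If the home node is able to host IPs, it always wins.  Otherwise the
 * IP goes to whichever node the normal deterministic algorithm chose
 * (hypothetically supplied here as deterministic_pnn).
 */
static unsigned int choose_node(const struct home_node *hn,
				const bool *can_host, unsigned int num_nodes,
				unsigned int deterministic_pnn)
{
	if (hn->pnn < num_nodes && can_host[hn->pnn]) {
		return hn->pnn;
	}
	return deterministic_pnn;
}
```

The home node check happens before, and independently of, the deterministic calculation, which is why the fallback value is only consulted when the home node cannot host the address.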
Signed-off-by: Volker Lendecke <vl@samba.org>
add home_nodes
Reviewed-by: Ralph Boehme <slow@samba.org>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Tue Oct 10 14:17:19 UTC 2023 on atb-devel-224
This effectively provides simple testing for the threshold-based
approach.
Add new script option CTDB_VSFTPD_MONITOR_THRESHOLDS.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Oct 3 04:53:38 UTC 2023 on atb-devel-224
This can be used for simple failure counting, without restarts, as
used in the 40.vsftpd event script. That case will subsequently be
converted and this functionality can also be used elsewhere.
Add documentation to ctdb-script.options(5) to allow parameters that
use this to be more easily described.
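Simple failure counting without restarts can be sketched as a counter that is reset on success and compared against thresholds on failure. The warn/error threshold pair and every name below are hypothetical; the real threshold syntax accepted by CTDB_VSFTPD_MONITOR_THRESHOLDS is documented in ctdb-script.options(5) and is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical outcome of a monitor check after failure counting */
enum fc_status { FC_OK, FC_WARN, FC_ERROR };

/* Hypothetical counter of consecutive failures */
struct failure_counter {
	unsigned int count;
	unsigned int warn_threshold;  /* 0 disables warnings */
	unsigned int error_threshold; /* assumed to be >= 1 */
};

/* Record one check result and report whether a threshold was reached */
static enum fc_status failure_count_update(struct failure_counter *fc,
					   bool check_ok)
{
	if (check_ok) {
		/* Any success clears the failure history */
		fc->count = 0;
		return FC_OK;
	}

	fc->count++;
	if (fc->count >= fc->error_threshold) {
		return FC_ERROR;
	}
	if (fc->warn_threshold != 0 && fc->count >= fc->warn_threshold) {
		return FC_WARN;
	}
	return FC_OK;
}
```

In an event script the counter would live in a file between monitor events; keeping it in a struct here just makes the logic easy to test.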
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Uninitialised counters are treated as 0, but still produce an error.
The redirect to stderr needs to come before the redirect for a missing
counter file.
The seemingly saner alternative of moving it outside the subshell
works when dash is /bin/sh (e.g. on Debian) but does not work when
bash is /bin/sh (e.g. on Fedora).
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
A subsequent commit will add a new section, which looks out of place
without these new sections.
Best reviewed with "git show -w".
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This will allow Unicode characters to be used, resulting in more
readable source files.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Commit 19c82c19c0 changed the behaviour
of prctl_set_comment() so it now calls setproctitle(3bsd) by default.
In some Linux distributions (e.g. Rocky Linux 8.8), this results in
messages like this spamming the logs:
ctdbd: setproctitle not initialized, please either call setproctitle_init() or link against libbsd-ctor.
Most Samba daemons seem to call setproctitle_init(), so do it here.
In the longer term CTDB should also switch to using lib/util's
process_set_title(), like the rest of Samba, for more flexible process
names.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15479
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Thu Sep 21 00:46:50 UTC 2023 on atb-devel-224
Fix a problem where ctdb_killtcp (almost always) fails to capture
packets with --enable-pcap and libpcap ≥ 1.9.1. The problem is due to
a gradual change in libpcap semantics when using
pcap_get_selectable_fd(3PCAP) to get a file descriptor and then using
that file descriptor in non-blocking mode.
pcap_set_immediate_mode(3PCAP) says:
pcap_set_immediate_mode() sets whether immediate mode should be set
on a capture handle when the handle is activated. In immediate
mode, packets are always delivered as soon as they arrive, with no
buffering.
and
On Linux, with previous releases of libpcap, capture devices are
always in immediate mode; however, in 1.5.0 and later, they are, by
default, not in immediate mode, so if pcap_set_immediate_mode() is
available, it should be used.
However, it wasn't until libpcap commit
2ade7676101366983bd4f86bc039ffd25da8c126 (before libpcap 1.9.1) that
it became a requirement to use pcap_set_immediate_mode(), even with a
timeout of 0.
More explanation in this libpcap issue comment:
https://github.com/the-tcpdump-group/libpcap/issues/860#issuecomment-541204548
Do a configure check for pcap_set_immediate_mode() even though it has
existed for 10 years. It is easy enough.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15451
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Aug 15 10:53:52 UTC 2023 on atb-devel-224
A subsequent commit will insert an additional call before
pcap_activate().
This sequence of calls is taken from the source for pcap_open_live(),
so there should be no change in behaviour.
Given the defaults set by pcap_create_common(), it would be possible
to omit the calls to pcap_set_promisc() and pcap_set_timeout().
However, those defaults don't seem to be well documented, so continue
to explicitly set everything that was set before.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15451
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Factor out a failure label, which will get more use in subsequent
commits, and only set private_data when success is certain.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15451
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
>>> CID 1539212: Control flow issues (NO_EFFECT)
>>> This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "p >= 0UL".
216 while (p >= 0 && output[p] == '\n') {
This is a real problem in the unlikely event that the output contains
only newlines.
Fix the issue by using a pointer and add a test to cover this case.
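The pointer-based approach can be sketched as a standalone function. This is an illustration rather than the patched CTDB code, and `strip_trailing_newlines()` is a hypothetical name. Comparing a pointer against the start of the buffer, instead of testing an unsigned index with `p >= 0`, terminates correctly even when the output consists only of newlines.

```c
#include <string.h>

/* Remove all trailing '\n' characters from output, in place */
static void strip_trailing_newlines(char *output)
{
	/* p starts one past the last character of the string */
	char *p = output + strlen(output);

	/* Never step before the start of the buffer */
	while (p > output && p[-1] == '\n') {
		p--;
	}
	*p = '\0';
}
```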
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15438
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Multi-line output currently prints like this:
OUTPUT: aaa
bbb
ccc
This is less beautiful than it could be.
Instead, print multi-line output with no inlining and each line
indented:
OUTPUT:
aaa
bbb
ccc
However, continue to inline single line output:
OUTPUT: foo
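The formatting rule can be sketched as follows. This is a hypothetical `print_output()`, not the actual test code; the two-space indent is an assumption, and trailing newlines are assumed to have been stripped already so that single-line output contains no '\n'.

```c
#include <stdio.h>
#include <string.h>

/*
 * Inline single-line output; otherwise start a new line and indent
 * every line of the output (two spaces, an assumed width).
 */
static void print_output(FILE *fp, const char *output)
{
	const char *line = output;

	if (strchr(output, '\n') == NULL) {
		fprintf(fp, "OUTPUT: %s\n", output);
		return;
	}

	fprintf(fp, "OUTPUT:\n");
	while (*line != '\0') {
		/* Length of the current line, excluding its newline */
		size_t len = strcspn(line, "\n");

		fprintf(fp, "  %.*s\n", (int)len, line);
		line += len;
		if (*line == '\n') {
			line++;
		}
	}
}
```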
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
When event scripts succeed they generally produce no output. However,
when a script succeeds and produces output, such output almost
certainly contains warnings. So, always print script output.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Errors logged when testing statd-callout don't currently go anywhere.
This is because arguments to the hacked version of script_log() are
ignored.
Remove the hack and configure logging to stderr.
This could go in the local statd-callout.sh setup script. However,
make it available for other script tests.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Wed Jul 19 09:57:37 UTC 2023 on atb-devel-224
Logging in statd-callout tests is currently useless. This will
provide a way of seeing errors in those tests.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
usecs is going to be passed as a uint32_t. There is no need to
calculate it as a time_t.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
On some platforms, egrep prints a deprecation warning to stderr:
egrep: warning: egrep is obsolescent; using grep -E
Use grep -E instead.
This is nice and simple, so there is no point splitting this commit into
2 separate commits, one each for tools and test.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Loading tunables is now done in ctdbd, so find another example for the
"setup" event.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It will be in the git history if we ever decide to use SCSI persistent
reservations as a cluster lock.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This fixes a little thinko in commit
80de84d36e, where this was overlooked.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Jul 10 15:15:06 UTC 2023 on atb-devel-224
DEBUG level logging in ctdb_killtcp is very noisy. The most important
messages when debugging are those for tickle ACKs and TCP RSTs. TCP
RSTs are already logged at INFO level, so promote tickle ACKs to INFO
level too.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
NOTICE level debug messages in common/run_event.c are not logged by
default.
Currently eventd ends up using ERROR, since this is specified as
LOGGING_LOG_LEVEL_DEFAULT. It doesn't inherit the debug level from
ctdbd and only uses NOTICE level when interactive.
Change the real logging default to NOTICE and use it everywhere.
Followups might be:
* Remove the default_log_level argument to logging_conf_init()
* Kick eventd to update debug level when "ctdb setdebug" is used
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Wed Jul 5 12:16:57 UTC 2023 on atb-devel-224
These are all trivial, so handle them in bulk.
* Change code to avoid (approximately sorted by frequency):
SC2004 $/${} is unnecessary on arithmetic variables.
SC2086 Double quote to prevent globbing and word splitting.
SC2162 read without -r will mangle backslashes.
SC2254 Quote expansions in case patterns to match literally rather than as a glob.
SC2154 (warning): <variable> is referenced but not assigned.
SC3037 (warning): In POSIX sh, echo flags are undefined.
SC2016 (info): Expressions don't expand in single quotes, use double quotes for that.
SC2069 (warning): To redirect stdout+stderr, 2>&1 must be last (or use '{ cmd > file; } 2>&1' to clarify).
SC2124 (warning): Assigning an array to a string! Assign as array, or use * instead of @ to concatenate.
SC2166 (warning): Prefer [ p ] && [ q ] as [ p -a q ] is not well defined.
SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.
* Locally disable checks:
SC2034 (warning): <variable> appears unused. Verify use (or export if used externally).
SC2086 (info): Double quote to prevent globbing and word splitting. [once]
SC2120 (warning): <function> references arguments, but none are ever passed.
SC2317 (info): Command appears to be unreachable. Check usage (or ignore if invoked indirectly).
While touching reads for SC2162, switch unused variables to "_"
instead of "_x", which seems to be preferred by ShellCheck.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
SC2059 (info): Don't use variables in the printf format string. Use printf '..%s..' "$foo".
Move the format string to the function and just parameterise the share
type.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
In ./tests/UNIT/eventscripts/scripts/local.sh line 328:
echo $(ctdb ifaces -X | awk -F'|' 'FNR > 1 {print $2}')
^-- SC2046 (warning): Quote this to prevent word splitting.
^-- SC2005 (style): Useless echo? Instead of 'echo $(cmd)', just use 'cmd'.
Use xargs to get output on 1 line.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
This generates ShellCheck warnings:
In ./tests/UNIT/eventscripts/scripts/60.nfs.sh line 412:
if [ -n "$service_check_cmd" ]; then
^----------------^ SC2031 (info): service_check_cmd was modified in a subshell. That change might be lost.
In ./tests/UNIT/eventscripts/scripts/60.nfs.sh line 413:
if eval "$service_check_cmd"; then
^----------------^ SC2031 (info): service_check_cmd was modified in a subshell. That change might be lost.
service_check_cmd will never be set here because it is only set in a
sub-shell in rpc_set_service_failure_response().
This reverts some of commit 713ec21750.
If testcases requiring use of service_check_cmd are later added then
this will need to be redone properly. This would probably start by
renaming this function nfs_iterate_rpc_test().
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
This is unused since loading tunables was moved to ctdbd.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
SC2086 Double quote to prevent globbing and word splitting.
Apparently ShellCheck is more picky about some of these than it used
to be.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
New in ShellCheck 0.9.0:
SC2317 (info): Command appears to be unreachable. Check usage (or ignore if invoked indirectly).
Also:
SC2086 (info): Double quote to prevent globbing and word splitting.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
New in ShellCheck 0.9.0:
SC2317 (info): Command appears to be unreachable. Check usage (or ignore if invoked indirectly).
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Andreas Schneider <asn@samba.org>
If this codepath is hit, ctdb aborts with:
ctdb/server/ctdb_recovery_helper.c:2687: Type mismatch: name[struct ban_node_state] expected[struct node_ban_state]")
at ../../lib/talloc/talloc.c:505
Fix this by using the correct type.
Signed-off-by: Christof Schmitt <cs@samba.org>
Reviewed-by: Volker Lendecke <vl@samba.org>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Wed May 3 08:04:09 UTC 2023 on atb-devel-224
Best reviewed with: `git show --word-diff`
Signed-off-by: Andreas Schneider <asn@samba.org>
Reviewed-by: Martin Schwenke <mschwenke@ddn.com>
Autobuild-User(master): Andreas Schneider <asn@cryptomilk.org>
Autobuild-Date(master): Fri Mar 24 07:57:37 UTC 2023 on atb-devel-224
When testparm processes the output of "testparm -v" (which includes
default values) it appears to do global checks (or some other sort of
initialisation logic) for all specified values. This includes a DNS
lookup for the node's hostname, as a side-effect of a libldap
ldap_set_option() call when processing "ldap debug level". If DNS
servers are down then this can induce timeouts, possibly resulting in
monitor timeouts.
Avoid this by using sed to extract configuration values from the
testparm cache file.
This is already shown to work when retrieving share paths, where
testparm is basically used as cat. Update the sed pattern to avoid
matching empty values on the right-hand side of the equals ('=') -
this avoids the default empty path value (and "smb ports" never has an
empty value).
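The rule that the updated sed pattern enforces (take the value on the right-hand side of '=' but treat an empty value as no match) can be sketched in C. This is a swapped-in analogue for illustration, not the sed expression used in the script; `conf_get_value()` is hypothetical and assumes the `name = value` line layout of testparm output.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * If line has the form "<ws>name<ws>=<ws>value<ws>" for the given name,
 * copy value into buf and return true.  Empty values do not match.
 */
static bool conf_get_value(const char *line, const char *name,
			   char *buf, size_t size)
{
	size_t nlen = strlen(name);
	const char *p = line;
	const char *end;

	while (isspace((unsigned char)*p)) {
		p++;
	}
	if (strncmp(p, name, nlen) != 0) {
		return false;
	}
	p += nlen;
	while (isspace((unsigned char)*p)) {
		p++;
	}
	if (*p != '=') {
		/* Different (longer) parameter name, or not an assignment */
		return false;
	}
	p++;
	while (isspace((unsigned char)*p)) {
		p++;
	}

	/* Trim trailing whitespace, including the newline */
	end = p + strlen(p);
	while (end > p && isspace((unsigned char)end[-1])) {
		end--;
	}
	if (end == p) {
		return false; /* empty right-hand side */
	}

	snprintf(buf, size, "%.*s", (int)(end - p), p);
	return true;
}
```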
Corresponding test changes:
* 50.samba.monitor.111.sh no longer expects a failure from being
unable to set smb ports, since testparm is no longer used in that
code path.
* smb ports needs to be set in fake smb.conf so it is in the default
output and can be extracted using sed.
* Although testparm --parameter-name is no longer used in
50.samba.script, update the stub implementation (in case it is ever
used again) to extract from fake smb.conf, since "smb ports" is now
set there. The change from $parameter to $param allows a long line
to stay below 80 columns.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Tue Feb 14 08:43:53 UTC 2023 on atb-devel-224
The list changed back to space-separated in commit
93448f4be9, so simplify the code a
little.
Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
Removes:
* waf pydoctor
* waf wafdocs
* make pydoctor
There is no "make wafdocs"; it only appears to be in wscript.
The reasoning is that these are broken and appear not to have been run for some time.
Signed-off-by: Rob van der Linde <rob@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>
Autobuild-User(master): Jeremy Allison <jra@samba.org>
Autobuild-Date(master): Thu Feb 2 21:15:54 UTC 2023 on atb-devel-224
One of the changes is somewhat interesting: the "tfork waiter proces"
process title in tfork.c. I wonder why no one noticed this before.
There's another similar process title in there, "tfork waiter process(%d)".
Hopefully no one greps for "proces$" (and there's no reason to).
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Rowland Penny <rpenny@samba.org>
Autobuild-User(master): Jeremy Allison <jra@samba.org>
Autobuild-Date(master): Thu Jan 26 20:46:11 UTC 2023 on atb-devel-224
"basename" is defined in libgen.h, which is included from system/dir.h.
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
If you happen to talloc_free(run_ctx) before all the tevent_req's
hanging off it, you run into the following:
==495196== Invalid read of size 8
==495196== at 0x10D757: run_proc_state_destructor (run_proc.c:413)
==495196== by 0x488F736: _tc_free_internal (talloc.c:1158)
==495196== by 0x488FBDD: _talloc_free_internal (talloc.c:1248)
==495196== by 0x4890F41: _talloc_free (talloc.c:1792)
==495196== by 0x48538B1: tevent_req_received (tevent_req.c:293)
==495196== by 0x4853429: tevent_req_destructor (tevent_req.c:129)
==495196== by 0x488F736: _tc_free_internal (talloc.c:1158)
==495196== by 0x4890AF6: _tc_free_children_internal (talloc.c:1669)
==495196== by 0x488F967: _tc_free_internal (talloc.c:1184)
==495196== by 0x488FBDD: _talloc_free_internal (talloc.c:1248)
==495196== by 0x4890F41: _talloc_free (talloc.c:1792)
==495196== by 0x10DE62: main (run_proc_test.c:86)
==495196== Address 0x55b77f8 is 152 bytes inside a block of size 160 free'd
==495196== at 0x48399AB: free (vg_replace_malloc.c:538)
==495196== by 0x488FB25: _tc_free_internal (talloc.c:1222)
==495196== by 0x488FBDD: _talloc_free_internal (talloc.c:1248)
==495196== by 0x4890F41: _talloc_free (talloc.c:1792)
==495196== by 0x10D315: run_proc_context_destructor (run_proc.c:329)
==495196== by 0x488F736: _tc_free_internal (talloc.c:1158)
==495196== by 0x488FBDD: _talloc_free_internal (talloc.c:1248)
==495196== by 0x4890F41: _talloc_free (talloc.c:1792)
==495196== by 0x10DE62: main (run_proc_test.c:86)
==495196== Block was alloc'd at
==495196== at 0x483877F: malloc (vg_replace_malloc.c:307)
==495196== by 0x488EAD9: __talloc_with_prefix (talloc.c:783)
==495196== by 0x488EC73: __talloc (talloc.c:825)
==495196== by 0x488F0FC: _talloc_named_const (talloc.c:982)
==495196== by 0x48925B1: _talloc_zero (talloc.c:2421)
==495196== by 0x10C8F2: proc_new (run_proc.c:61)
==495196== by 0x10D4C9: run_proc_send (run_proc.c:381)
==495196== by 0x10DDF6: main (run_proc_test.c:79)
This happens because run_proc_context_destructor() directly does a
talloc_free() on the struct proc_context's and not the enclosing
tevent_req's. run_proc_kill() makes sure that we don't follow
proc->req, but it forgets the "state->proc", which is free()'ed, but
later dereferenced in run_proc_state_destructor().
This is an attempt at a quick fix. I believe we should convert
run_proc_context->plist into an array of tevent_req's, so that we can
properly TALLOC_FREE() according to the "natural" hierarchy and not
just pull an arbitrary thread out of that heap.
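The quick fix amounts to clearing the request state's pointer when the context frees a proc directly. The pattern is sketched here with plain malloc()/free() standing in for talloc destructors; the types and functions are hypothetical stand-ins, not run_proc.c code, and this illustrates the quick fix rather than the suggested conversion of plist to an array of tevent_req's.

```c
#include <stdlib.h>

struct req_state;

/* Hypothetical stand-in for the per-process structure in the context */
struct proc {
	struct req_state *state; /* back-pointer, may be NULL */
};

/* Hypothetical stand-in for the tevent_req state */
struct req_state {
	struct proc *proc; /* must be cleared if proc is freed first */
};

/* Context frees a proc directly: break the link from the state first */
static void proc_free(struct proc *proc)
{
	if (proc->state != NULL) {
		proc->state->proc = NULL;
	}
	free(proc);
}

/* State "destructor": only touch proc if it is still alive */
static void req_state_free(struct req_state *state)
{
	if (state->proc != NULL) {
		state->proc->state = NULL;
	}
	free(state);
}
```

Without the first `if` in proc_free(), the state would keep a dangling pointer and req_state_free() would read freed memory, which is the kind of invalid read valgrind reports above.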
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Thu Oct 6 15:10:20 UTC 2022 on sn-devel-184
Add simple support for IPoIB via DLT_LINUX_SLL and DLT_LINUX_SLL2.
This seems to work, even when an IB interface is specified.
If this is later found to be insufficient, support for DLT_IPOIB can
be implemented. See https://www.tcpdump.org/linktypes.html for a
starting point.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This uses Linux cooked capture link-layer headers. See:
https://www.tcpdump.org/linktypes/LINKTYPE_LINUX_SLL.html
https://www.tcpdump.org/linktypes/LINKTYPE_LINUX_SLL2.html
The header type needs to be checked to ensure the protocol
type (i.e. ether type, for the protocols we might be interested in) is
meaningful. The size of the header needs to be known so it can be
skipped, allowing the IP header to be found and parsed.
It would be possible to define support for DLT_LINUX_SLL2 if it is
missing. However, if a platform is missing support in the header file
then it is almost certainly missing in the run-time library too.
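Finding the IP header therefore depends on the link-layer type: where the protocol field sits and how many header bytes to skip. The sketch below is a simplified illustration, not CTDB's code. The DLT values are defined locally (normally they come from the pcap headers), the offsets follow the layouts documented at tcpdump.org, and it deliberately skips the header-type (ARPHRD) check and VLAN tags that real code would need to handle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Link-layer type values, defined locally for this sketch */
#define SKETCH_DLT_EN10MB 1
#define SKETCH_DLT_LINUX_SLL 113
#define SKETCH_DLT_LINUX_SLL2 276

#define SKETCH_ETHERTYPE_IPV4 0x0800
#define SKETCH_ETHERTYPE_IPV6 0x86DD

/* Offset of the protocol field and total header size per link type */
static bool linktype_layout(int dlt, size_t *proto_offset, size_t *hdr_len)
{
	switch (dlt) {
	case SKETCH_DLT_EN10MB:
		/* dst MAC (6), src MAC (6), ethertype (2) */
		*proto_offset = 12;
		*hdr_len = 14;
		return true;
	case SKETCH_DLT_LINUX_SLL:
		/* packet type, ARPHRD type, addr len, addr (8), protocol */
		*proto_offset = 14;
		*hdr_len = 16;
		return true;
	case SKETCH_DLT_LINUX_SLL2:
		/* protocol type comes first in the 20-byte header */
		*proto_offset = 0;
		*hdr_len = 20;
		return true;
	default:
		return false;
	}
}

/* Return a pointer to the IPv4/IPv6 header, or NULL */
static const uint8_t *find_ip_header(int dlt, const uint8_t *pkt, size_t len)
{
	size_t off, hlen;
	uint16_t proto;

	if (!linktype_layout(dlt, &off, &hlen) || len <= hlen) {
		return NULL;
	}
	/* Protocol field is in network byte order */
	proto = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
	if (proto != SKETCH_ETHERTYPE_IPV4 && proto != SKETCH_ETHERTYPE_IPV6) {
		return NULL;
	}
	return pkt + hlen;
}
```

The switch statement keeps unsupported link types out early, in the same spirit as rejecting them when the capture is opened.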
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The current code will almost certainly generate ENOMSG for
non-ethernet packets, and even for ethernet packets when the "any"
interface is used.
pcap_datalink(3PCAP) says:
Do NOT assume that the packets for a given capture or ``savefile``
will have any given link-layer header type, such as DLT_EN10MB for
Ethernet. For example, the "any" device on Linux will have a
link-layer header type of DLT_LINUX_SLL or DLT_LINUX_SLL2 even if
all devices on the system at the time the "any" device is opened
have some other data link type, such as DLT_EN10MB for Ethernet.
So, pcap_datalink() must be used.
Detect pcap packet types that are supported (currently only ethernet)
in the open code. There is no use continuing if the read code can't
parse packets. The pattern of using switch statements supports future
addition of other packet types.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
In particular, knowing the reason fetching the packet fails can help
with debugging unsupported protocols in the pcap code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is preferred because it will fail for devices that do not support
epoll_wait() and similar.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This forces the use of pcap for packet capture on Linux.
It appears that using a raw socket for capture does not work with
infiniband - pcap support for that to come.
Don't (yet?) change the default capture method to pcap. On some
platforms (e.g. my personal Intel NUC, running Debian testing), pcap
is much less reliable than the raw socket. However, pcap seems fine
on most other platforms.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The build currently fails on AIX, which can't find the pcap headers
because they're installed in a non-standard place. However, there is
a pcap-config script available.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Although this is a test stub, it is complicated enough to encourage
ShellCheck cleanliness.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
VLAN configuration on Linux often uses a convention of naming a VLAN
on <iface> with VLAN ID <tag> as <iface>.<tag>. To be able to monitor
the underlying interface, the original 10.interface code naively
simply stripped off the '.' and everything after (i.e. ".*", as a glob
pattern).
Some users do not use the above convention. A VLAN can be named
without including the underlying interface, but still with a
tag (e.g. vlan<tag> - the word "vlan" followed by the tag) or, more
generally, perhaps without a tag (e.g. <vlan> - an arbitrary name).
The ip(8) command lists a VLAN as <vlan>@<iface>. The underlying
interface can be found by stripping everything up to and including an
'@' (i.e. "*@").
Commit bc71251433 added support for
stripping "*@". However, on suspicion, it kept support for the case
where there is no '@', falling back to stripping ".*". If ip(8) ever
did this then it was a long time ago - it has been printing a format
including '@' since at least 2004.
Stripping ".*" interferes with interesting administrative decisions,
like having '.' in interface names.
So, drop the fallback to stripping ".*" because it appears to be
unnecessary and can cause inconvenience.
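The resulting rule is simple enough to sketch in C. The event script itself is shell, so this hypothetical `underlying_iface()` only illustrates the string handling; it strips through the last '@', which is an assumption about how a name containing more than one '@' should be treated.

```c
#include <string.h>

/*
 * Given a name as printed by ip(8), e.g. "vlan10@eth0", return the
 * underlying interface ("eth0").  Names without '@' are returned
 * unchanged: there is no fallback to stripping ".*", so names such
 * as "eth0.10" or "my.iface" are left alone.
 */
static const char *underlying_iface(const char *name)
{
	const char *at = strrchr(name, '@');

	return (at != NULL) ? at + 1 : name;
}
```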
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Jeremy Allison <jra@samba.org>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Jeremy Allison <jra@samba.org>
Autobuild-Date(master): Mon Sep 12 02:29:32 UTC 2022 on sn-devel-184
Mostly
SC2086: Double quote to prevent globbing and word splitting.
Use ctdb_onnode() where it simplifies code. No behaviour changes
intended.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Thu Aug 25 16:15:45 UTC 2022 on sn-devel-184
Use a new function and wait_until() to simplify.
get_test_ip_mask_and_iface() is not needed here because
select_test_node_and_ips() sets $test_ip, and neither $mask nor $iface
is used.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
These lines are just wrong:
try_command_on_node -v $test_node "ip addr show to ${test_node}"
if -n "$out"; then
The 2nd variable referenced should be $test_ip. The 2nd line causes
"-n: command not found" because it is missing [] test command
brackets.
Both typos would probably make the test pass unconditionally.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Fix typo in error checking. While here adjust the bottom of the
range, making errno 0 invalid.
Add corresponding test cases using an alternative syntax for errno packets
(#nnn[;] - trailing ';' is optional).
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Aug 1 09:19:55 UTC 2022 on sn-devel-184
Block the locker helper child by taking a lock on the 2nd byte of the
lock file. This will cause a ping timeout if the process is blocked
for long enough.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Thu Jul 28 11:10:54 UTC 2022 on sn-devel-184
Allows blocking mode and start offset to be specified. Always locks a
1-byte range.
Make the lock structure static to avoid initialising the whole
structure each time.
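A helper with the described interface (start offset and blocking mode are parameters, the range is always 1 byte, and the lock structure is static) might look like the following. It is a sketch using POSIX fcntl() record locks; the name `lock_one_byte()` and the choice of a write lock are assumptions, and the static structure makes it unsuitable for concurrent use from multiple threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Take a write lock on the single byte at offset start.  If blocking
 * is true, wait for the lock (F_SETLKW), otherwise fail immediately
 * (F_SETLK).  Returns 0 on success, otherwise an errno value.
 */
static int lock_one_byte(int fd, off_t start, bool blocking)
{
	/* Static: fields that never change are initialised only once */
	static struct flock lock = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_len = 1,
	};
	int ret;

	lock.l_start = start;

	ret = fcntl(fd, blocking ? F_SETLKW : F_SETLK, &lock);
	if (ret == -1) {
		return errno;
	}
	return 0;
}
```

Locking byte 1 (the 2nd byte) from another process would then block a helper that needs that byte, which is how the test induces a ping timeout.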
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The ping timeout is specified by passing an extra argument to the
mutex helper, representing the ping timeout in seconds.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
In future this will allow extra I/O tests and a timeout in the parent
to (hopefully) release the lock if the child gets wedged. For
simplicity, use tmon only to detect when either parent or child goes
away. Plumbing a timeout for pings from child to parent will be done
later.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
There will be more timeouts so clarify the intent of this one.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
To avoid error messages having ridiculously long paths, set progname
to basename(argv[0]).
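A minimal sketch of the idea follows; `short_progname()` is hypothetical. POSIX basename() from <libgen.h> may modify its argument and may return a pointer into it, so the sketch works on a copy and duplicates the result before freeing that copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated short program name (caller frees), or NULL */
static char *short_progname(const char *argv0)
{
	char *copy = strdup(argv0);
	char *result;

	if (copy == NULL) {
		return NULL;
	}
	result = strdup(basename(copy));
	free(copy);
	return result;
}
```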
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
tmon_ping_test covers complex 2-way interaction between processes
using tmon_ping_send(), including via a socketpair(). tmon_test
covers the more general functionality of tmon_send() but uses a
simpler 1-way harness with wide coverage.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
A convention when testing members of ctdb-util is to include the .c
file so that static functions can potentially be tested. This means
that such tests can't be linked against ctdb-util or duplicate symbols
will be encountered.
ctdb-tests-common depends on ctdb-client, which depends in turn on
ctdb-util, so this can't be used to pull in backtrace support.
Instead, make ctdb-tests-backtrace its own subsystem.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Also, rename ctdb_unit_tests to ctdb_util_tests. The sorting makes
it clear that only items from ctdb-util are tested here.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Commit f5a2037734 arguably got this
back-to-front:
2022-07-27T09:50:01.985857+10:00 testn1 ctdbd[17820]: ../../ctdb/server/ctdb_takeover.c:514 sending TAKE_IP for '10.0.1.173'
2022-07-27T09:50:01.990601+10:00 testn1 ctdbd[17820]: Send TCP tickle ACK: 10.0.1.77:33004 -> 10.0.1.173:2049
2022-07-27T09:50:01.991323+10:00 testn1 ctdb-takeover[19758]: TAKEOVER_IP 10.0.1.173 succeeded on node 0
Unfortunately there is an inconsistency somewhere in the connection
tracking code used for tickle ACKs, making this less than obvious.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Thu Jul 28 09:02:08 UTC 2022 on sn-devel-184
The include isn't strictly necessary, since it is included via
common/reqid.c anyway. However, it is a useful hint.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Jul 22 17:01:00 UTC 2022 on sn-devel-184
root can read files for which the mode prohibits reading, so this test
case fails when run as root. Work around this when running as root.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Some versions of nfs-utils (e.g. recent CentOS 7) use /etc/nfs.conf
but do not include the nfsconf utility to extract values from the
file. However, git has an excellent conf file parser, so use it as a
last resort.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
For example:
In /home/martins/samba/samba/ctdb/tools/onnode line 304:
[ "$nodes" != "${nodes%[ ${nl}]*}" ] && verbose=true
^---^ SC2295 (info): Expansions inside ${..} need to be quoted separately, otherwise they match as patterns.
Did you mean:
[ "$nodes" != "${nodes%[ "${nl}"]*}" ] && verbose=true
For more information:
https://www.shellcheck.net/wiki/SC2295 -- Expansions inside ${..} need to b...
Who knew? Thanks ShellCheck!
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This works as an unprivileged user, so avoids unnecessary errors when
running in test mode (and not as root):
2022-02-18T12:21:12.436491+11:00 node.0 ctdbd[6958]: ctdb_sys_check_iface_exists: Failed to open raw socket
2022-02-18T12:21:12.436534+11:00 node.0 ctdbd[6958]: ctdb_sys_check_iface_exists: Failed to open raw socket
2022-02-18T12:21:12.436557+11:00 node.0 ctdbd[6958]: ctdb_sys_check_iface_exists: Failed to open raw socket
2022-02-18T12:21:12.436577+11:00 node.0 ctdbd[6958]: ctdb_sys_check_iface_exists: Failed to open raw socket
The corresponding porting test would now become pointless because it
would just confirm that "fake" does not exist. Attempt to make it
useful by using a less likely name than "fake" and attempting to
detect the loopback interface.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
A public IP address can be released in between (and probably before)
attempts to send ARPs. One situation where this can occur is cluster
shutdown: node A shuts down first, its public IPs are taken over by
node B, and then node B is shut down.
Notice this when it occurs and cancel further attempts to send ARPs.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
For the tickle ACK logging, render the connection in a buffer. This
produces more complete information.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Make this fully self-contained in the recovery daemon and avoid
indexing by PNN.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This structure is now standalone, so indexing by PNN can be avoided
via a subsequent commit. Index by culprit here to make this commit
simple.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
When memory usage is too high, there is no need to dump all of this
data on every failed monitor event. The first dump will be enough to
diagnose the problem. The node will then go unhealthy and drop
clients, so memory usage should then drop.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Jul 22 07:32:54 UTC 2022 on sn-devel-184
If filesystem usage exceeds the unhealthy threshold then memory usage
checking is not done. Always do both checks.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
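Sketched in C for illustration, with a hypothetical `check_usage()`: each check records its result and neither returns early, so an unhealthy filesystem no longer hides a memory problem.

```c
#include <stdbool.h>

/* Hypothetical outcome of one monitor run. */
struct usage_result {
	bool fs_unhealthy;
	bool mem_unhealthy;
	int checks_run;
};

/*
 * Both checks are always performed: an unhealthy filesystem no longer
 * short-circuits the memory check.
 */
struct usage_result check_usage(int fs_pct, int fs_limit, int mem_pct, int mem_limit)
{
	struct usage_result r = { false, false, 0 };

	r.fs_unhealthy = fs_pct > fs_limit;
	r.checks_run++;

	r.mem_unhealthy = mem_pct > mem_limit;
	r.checks_run++;

	return r;
}
```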
Use printf to allow easier line breaks and use some early returns.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
About to modify this file, so reformat first as per recent Samba
convention.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
SC2164 (warning): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
A problem can only occur if /etc/ctdb/ or an important subdirectory is
removed, which means the script itself would not be found. Use && to
silence ShellCheck.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
When a node is starting, CTDB reports remote nodes as unhealthy by
default. This can be misleading.
To hide this, report an "UNKNOWN" pseudo state when a remote node is
not disconnected and the runstate is less than or equal to
"FIRST_RECOVERY".
Signed-off-by: Vinit Agnihotri <vagnihotri@ddn.com>
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
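A small C sketch of the state selection. The runstate enum below is a hypothetical mirror ordered from earliest to latest, and `node_state_string()` is an illustrative name; only the condition itself comes from the description above.

```c
#include <stdbool.h>

/* Hypothetical runstates, ordered as in the startup sequence. */
enum runstate {
	RUNSTATE_UNKNOWN,
	RUNSTATE_INIT,
	RUNSTATE_SETUP,
	RUNSTATE_FIRST_RECOVERY,
	RUNSTATE_STARTUP,
	RUNSTATE_RUNNING,
	RUNSTATE_SHUTDOWN,
};

/* Hypothetical: choose the state string to display for a node. */
const char *node_state_string(bool is_remote, bool disconnected,
			      bool healthy, enum runstate rs)
{
	if (disconnected) {
		return "DISCONNECTED";
	}
	if (is_remote && rs <= RUNSTATE_FIRST_RECOVERY) {
		/* Health not yet known: avoid a misleading UNHEALTHY */
		return "UNKNOWN";
	}
	return healthy ? "OK" : "UNHEALTHY";
}
```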
These would be unintended errors. The block should be omitted to keep
the default value.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
eval is not required and causes the following ShellCheck warning:
SC2294 (warning): eval negates the benefit of arrays. Drop eval to
preserve whitespace/symbols (or eval as string).
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Fri Jun 24 10:40:50 UTC 2022 on sn-devel-184
The current code requires the use of eval in the NFS callout handling
to facilitate testing. Improve the code to remove this need.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The current code works in all current cases but is lazy and wrong.
Fix it to avoid breaking on code changes involving different thread
setups.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Tests can be run by hand using different distro styles, such as:
CTDB_NFS_DISTRO_STYLE=systemd-debian \
./tests/run_tests.sh ./tests/UNIT/eventscripts/{06,60}.nfs.*
This fixes known problems for Debian styles, so the tests now pass for
the following values of CTDB_NFS_DISTRO_STYLE:
systemd-redhat
sysvinit-redhat
systemd-debian
sysvinit-debian
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
At the moment test results can be influenced by real system
configuration files.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
For example, in Sys-V init "rquotad" is started by the main "nfs"
service. At the moment the call-out can't distinguish between this
case and "should never be run". Services set to "AUTO" are
hand-stopped/started via service_stop()/service_start() on failure via
restart_after.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This logic needs improving, so factor the decision making into new
functions service_or_manual_stop() and service_or_manual_start().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Drop the argument. These now just stop/start the overall NFS service,
so rename them appropriately.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
These are only called in one place and should be done inline, since
that is less confusing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Samba is reformatting shell scripts using
shfmt -w -p -i 0 -fn
so update this one before editing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Directly using dbgtext() with file logging results in a log entry with
no header, which is wrong. This is a regression, introduced in commit
10d15c9e5d. Prior to this, CTDB's
callback for file logging would always add a header.
Use DEBUG() instead of dbgtext(). Note that DEBUG() effectively compares
the passed script_log_level with DEBUGLEVEL, so an explicit check is
no longer necessary.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15090
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Thu Jun 16 13:33:10 UTC 2022 on sn-devel-184
These aren't set anywhere in the code.
Drop the log argument because it is also no longer used.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15090
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>
This allows ctdb_set_child_logging() to work.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15090
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Volker Lendecke <vl@samba.org>
If the cluster filesystem is unavailable then I/O errors may occur.
This is no worse than contention, so don't ban. This avoids having
services unavailable for longer than necessary.
Update the associated test to simply confirm that this results in a
leaderless cluster, and leadership is restored when the lock can once
again be taken.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
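One way to express the new policy, as a C sketch with hypothetical enums: contention and I/O errors both lead to a retry (leaving the cluster leaderless in the meantime) rather than a ban. How other failures are handled is an assumption here.

```c
/* Hypothetical results of an attempt to take the cluster lock. */
enum lock_result {
	LOCK_TAKEN,
	LOCK_CONTENDED, /* another node holds the lock */
	LOCK_IO_ERROR,  /* e.g. cluster filesystem unavailable */
	LOCK_FAILED,    /* any other failure */
};

enum lock_action {
	BECOME_LEADER,
	RETRY_LATER,
	BAN_SELF,
};

enum lock_action cluster_lock_action(enum lock_result r)
{
	switch (r) {
	case LOCK_TAKEN:
		return BECOME_LEADER;
	case LOCK_CONTENDED:
	case LOCK_IO_ERROR:
		/* An I/O error is no worse than contention: don't ban */
		return RETRY_LATER;
	case LOCK_FAILED:
	default:
		/* Assumption: other failures keep their existing handling */
		return BAN_SELF;
	}
}
```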
ctdb_takeover.c and eventscript.c no longer use this.
ipalloc_common.c has never used it.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
After a recovery that takes a significant amount of time the logs are
flooded with messages about every resent call.
Log a summary instead and demote per-call messages to INFO level.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
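A C sketch of summary logging, using a hypothetical `log_at()` helper in place of CTDB's debug macros: per-call messages are emitted at INFO, which is suppressed by default, and a single summary line is logged at NOTICE.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical debug levels; higher numbers are more verbose. */
enum { LVL_ERR = 0, LVL_NOTICE = 2, LVL_INFO = 3 };

/* Default threshold: NOTICE is logged, INFO is suppressed. */
static int log_threshold = LVL_NOTICE;

static void log_at(int level, const char *fmt, ...)
{
	va_list ap;

	if (level > log_threshold) {
		return;
	}
	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
}

/* Hypothetical: resend calls that were queued during recovery. */
int resend_calls(const unsigned int *reqids, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		/* Per-call detail, only visible at INFO level */
		log_at(LVL_INFO, "Resending call reqid=%u\n", reqids[i]);
	}
	if (count > 0) {
		/* One summary line at the default level */
		log_at(LVL_NOTICE, "Resent %d calls\n", count);
	}
	return count;
}
```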
Signed-off-by: Pavel Filipenský <pfilipen@redhat.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Signed-off-by: Pavel Filipenský <pfilipen@redhat.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
These are easier to debug with a backtrace.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue May 3 10:13:23 UTC 2022 on sn-devel-184
Some tests make generous use of assert() and it can be difficult to
guess the cause of failures without resorting to GDB. This provides
some help.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
None of these include any files from the include/ sub-directory.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
If there is an error then this pointer is unconditionally
dereferenced.
However, the only possible error appears to be ENOMEM, where a crash
caused by dereferencing a NULL pointer isn't a terrible outcome. In
the absence of a security issue this is probably not worth
backporting.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
If there is an error then this pointer is unconditionally
dereferenced.
However, the only possible error appears to be ENOMEM, where a crash
caused by dereferencing a NULL pointer isn't a terrible outcome. In
the absence of a security issue this is probably not worth
backporting.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The only value this now provides is use of a notification script to
log when start/stop are called. This was used for debugging strange
start/stop failures, which have not been recently seen. Also, systemd
does a good job of logging start/stop.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
IPs are dropped in the shutdown event.
If a watchdog is necessary to ensure public IPs aren't on interfaces
when CTDB isn't running, then see ctdb-crash-cleanup.sh.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This further untangles public IP handling from the main daemon.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is functionally the same as ctdb_release_all_ips().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This was added to be able to notice startup failures when unknown
tunables were present in the configuration. Tunables are now set by
the daemon, so this is no longer necessary.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This aims to test ctdb_tunable_load_file() but also exercises
ctdb_tunable_names() and ctdb_tunable_get_value().
ctdb_tunable_set_value() is indirectly exercised via
ctdb_tunable_load_file().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Instead of documenting test cases with a comment, this allows them to
be documented via an argument to a function that is printed when the
test case is run. This makes it easier to locate test case failures
when commands used by test cases look similar.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This allows the provided output to be specified a little more
carelessly. As per the comment, trailing newlines can't be matched
anyway, so this is notionally a bug fix.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Samba is reformatting shell scripts using
shfmt -w -p -i 0 -fn
so update this one before editing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
We used to use this for building test packages for standalone CTDB.
However, our testing has now changed to use binary tarballs. We
believe we were the only users of this spec file and expect CTDB to
only be installed as part of a top-level Samba build, especially in
RPM form.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
./configure && make && make install will always work.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
Replace the deprecated network commands (ifconfig, netstat) with the
newer commands (ip addr, ss), respectively.
Signed-off-by: Archana Chidirala <archana.chidirala.chidirala@ibm.com>
Reviewed-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Martin Schwenke <martin@meltin.net>
Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Tue Mar 8 12:30:53 UTC 2022 on sn-devel-184
Issue is reported here:
853 case CTDB_CONTROL_DB_VACUUM: {
854 struct ctdb_db_vacuum db_vacuum;
855
>>> CID 1499395: Uninitialized variables (UNINIT)
>>> Using uninitialized value "db_vacuum.full_vacuum_run" when calling "ctdb_db_vacuum_len".
856 CHECK_CONTROL_DATA_SIZE(ctdb_db_vacuum_len(&db_vacuum));
857 return ctdb_control_db_vacuum(ctdb, c, indata, async_reply);
858 }
The problem is that ctdb_bool_len() unnecessarily dereferences its
argument, which in this case is &db_vacuum.full_vacuum_run. Not a
security issue because the value copied by dereferencing is not used.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Wed Feb 23 02:02:06 UTC 2022 on sn-devel-184
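The fix amounts to computing the length from the type alone, never reading the value. A C sketch, with `bool_len()` and `struct db_vacuum` as hypothetical stand-ins for `ctdb_bool_len()` and `struct ctdb_db_vacuum`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * The serialised length of a bool depends only on its type, so the
 * argument is never dereferenced. This keeps length checks safe when
 * the structure has not been filled in yet.
 */
size_t bool_len(const bool *in)
{
	(void)in; /* intentionally unused */
	return sizeof(bool);
}

/* Hypothetical structure, loosely shaped like struct ctdb_db_vacuum. */
struct db_vacuum {
	uint32_t db_id;
	bool full_vacuum_run;
};

size_t db_vacuum_len(const struct db_vacuum *in)
{
	/* Neither term reads the structure's contents */
	return sizeof(in->db_id) + bool_len(&in->full_vacuum_run);
}
```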
Debugging a test failure here without GDB is not possible. Dumping a
stack trace gives a good hint.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Instead of repeatedly running a test binary.
Run time for these tests reduces from ~90s to ~75s.
When run under valgrind, the run time for protocol_test_001.sh reduces
from ~390s to <1s.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Feb 14 04:32:29 UTC 2022 on sn-devel-184
The current method of repeatedly running a binary has huge overhead,
especially with valgrind.
protocol_test_iterate_tag() allows the output that is usually used to
hint where a test failure occurred to be replaced with a tag stored in
a buffer. The tag is only printed on test failure.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
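A C sketch of the tag idea, using hypothetical `iterate_tag()` and `tag_check()` rather than the real `protocol_test_iterate_tag()`: a description of the current iteration is formatted into a static buffer and printed only when a check fails.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical buffer describing the current test iteration. */
static char iter_tag[256];

/* Record the current iteration instead of printing it every time. */
void iterate_tag(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(iter_tag, sizeof(iter_tag), fmt, ap);
	va_end(ap);
}

const char *current_tag(void)
{
	return iter_tag;
}

/* Print the stored tag only when a check fails. */
int tag_check(int condition)
{
	if (!condition) {
		fprintf(stderr, "Test failed at: %s\n", iter_tag);
	}
	return condition;
}
```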
A stalled node probably continues to hold the cluster lock, so confirm
elections work in this case.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Feb 14 02:46:01 UTC 2022 on sn-devel-184
Elections should now be quite rare, so always log when one begins.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is currently missed when the cluster lock is lost.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The problem here is that election-in-progress must be set to
potentially avoid restarting the election broadcast timeout in
main_loop(), so this is already done by leader_handler().
Have force_election() set election-in-progress for all election types
and do not bother setting it in cluster_lock_election().
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Election-in-progress is set by unknown leader broadcast, so needs to
be cleared in all cases when election completes.
This was seen in a case where the leader node stalled, so didn't send
leader broadcasts for some time. The node continued to hold the
cluster lock, so another node could not become leader. However, after
the node returned to normal it still did not send leader broadcasts
because election-in-progress was never cleared.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
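A minimal C sketch of the invariant, using a hypothetical `struct election_state`: the flag is set on every election trigger and cleared unconditionally when the election completes, so a leader that recovers from a stall resumes sending leader broadcasts.

```c
#include <stdbool.h>

/* Hypothetical recovery daemon election state. */
struct election_state {
	bool in_progress;
};

/* Set on every election trigger, e.g. an unknown leader broadcast. */
void election_start(struct election_state *e)
{
	e->in_progress = true;
}

/* Called when the election completes, whatever the outcome. */
void election_done(struct election_state *e)
{
	e->in_progress = false; /* cleared in all cases */
}

/* Leader broadcasts are suppressed while an election is in progress. */
bool should_send_leader_broadcast(const struct election_state *e, bool is_leader)
{
	return is_leader && !e->in_progress;
}
```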
This is many years out of date and recent changes make it worse. It
is unlikely that anyone has the time to fix this in the near future,
so remove it because it is misleading.
Database recovery steps are well documented in comments in the
recovery helper. Cluster monitoring documentation can be re-added
when things stop changing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Rename test, clean up node selection. Duplicate for banning and
removing leader capability cases. Repeat all 3 tests without cluster
lock.
All of the standard election triggers are now tested, with and without
cluster lock. Due to test cluster configuration limitations, the
tests without cluster lock are skipped on a real cluster.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Can be used to disable default options, such as cluster lock.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Rename this configuration item and move it into the [cluster]
configuration section.
Update documentation to match.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Retain "recovery lock" and mark as deprecated for backward
compatibility.
Some documentation is still inconsistent.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
If the cluster is partitioned then nodes in one partition can not take
the lock anyway, so election is pointless. It just introduces
unnecessary corner cases.
Instead just race for the lock.
When a node notices a lack of leader and notifies other nodes of an
election via an unknown leader broadcast, the cluster lock election is
hooked into this broadcast.
The test needs to be updated because losing the cluster lock can now
result in a leadership change.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
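A C sketch of racing for the lock, where a process-local C11 `atomic_flag` stands in for the real cluster lock (an assumption for illustration only) and `race_for_cluster_lock()` is a hypothetical name. There is no vote: the first node to take the lock becomes leader.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Stand-in for the cluster lock. A real cluster lock is shared
 * between nodes; this flag is only shared within one process.
 */
static atomic_flag cluster_lock = ATOMIC_FLAG_INIT;

struct node {
	int pnn;
	bool is_leader;
};

/* Hypothetical handler for an unknown leader broadcast. */
bool race_for_cluster_lock(struct node *n)
{
	/* test_and_set returns the previous value: false means we got it */
	n->is_leader = !atomic_flag_test_and_set(&cluster_lock);
	return n->is_leader;
}

void cluster_lock_release(struct node *n)
{
	if (n->is_leader) {
		n->is_leader = false;
		atomic_flag_clear(&cluster_lock);
	}
}
```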
This doesn't make sense if leader broadcasts are used.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The following command names are changed:
recmaster -> leader
setrecmasterrole -> setleaderrole
Command output changed for the following commands:
status
getcapabilities
Documentation and tests are updated to reflect these changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This seems pointless but it localises a subsequent change and also
starts a terminology change in the tool code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Now all references to ctdb->recovery_lock are encapsulated in the
cluster lock code.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
It is no longer just a recovery lock but is always held by the cluster
leader.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
ctdb_test_init() doesn't actually pass arguments to local_daemons.sh.
This needs to be done using ctdb_nodes_start_custom().
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The introduction of the leader broadcast timeout provides an
alternative to the current leader validation. Using the leader
broadcast may not be as fast but it is more correct.
When the leader node is stopped or banned, the only way of triggering
an election is currently to fetch the leader's node map to check
whether it is still active. This is because the leader will no
longer push the node map to other nodes. However, having all nodes
fetch the node map from an inactive leader may be unreliable.
Most of the other cases are also handled more reliably by the leader
broadcast timeout.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This no longer occurs at startup due to the leader broadcast timeout.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
If no leader broadcasts have been received from the leader for more
than 5s then trigger an election.
Apart from being sane behaviour, this avoids elected-before-connected
bugs at startup, where a node elects itself leader before it is
connected to other nodes.
When a node processes a leader broadcast timeout it sends an unknown
leader broadcast to all nodes. That causes cancellation of the leader
broadcast timeout across the cluster. This is particularly important at
startup, since nodes may be started in a staggered fashion. Without
this cluster-wide cancellation, a node might notice the lack of
leader, win an election and complete a recovery before other nodes
notice the lack of leader. When the leader broadcast timeout finally
occurs on the other nodes then they'll put the cluster back into an
unnecessary recovery.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
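A polling-style C sketch of the timeout logic; the real code is presumably timer driven, and `struct leader_watch` and its functions are hypothetical. Re-arming the timeout when a leader broadcast arrives is an assumption.

```c
#include <stdbool.h>
#include <time.h>

#define LEADER_BROADCAST_TIMEOUT 5 /* seconds, as described above */

/* Hypothetical per-node view of leader broadcasts. */
struct leader_watch {
	time_t last_broadcast; /* when the last leader broadcast arrived */
	bool armed;            /* false once an election is underway */
};

void leader_broadcast_received(struct leader_watch *w, time_t now)
{
	w->last_broadcast = now;
	w->armed = true; /* assumption: a leader broadcast re-arms */
}

/* An unknown leader broadcast from any node cancels the timeout. */
void unknown_leader_received(struct leader_watch *w)
{
	w->armed = false;
}

/* Returns true if this node should start an election now. */
bool leader_broadcast_timed_out(struct leader_watch *w, time_t now)
{
	if (!w->armed) {
		return false;
	}
	if (now - w->last_broadcast <= LEADER_BROADCAST_TIMEOUT) {
		return false;
	}
	/* This node will send the unknown leader broadcast */
	w->armed = false;
	return true;
}
```

In the staggered-startup case, node `a` times out first and its unknown leader broadcast disarms node `b`, so `b` does not later force an unnecessary recovery.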
These are triggered by a 1 second timer, but are only sent if the node
is the current leader and there is no election underway.
If this node can not be the leader then ensure it releases the
recovery lock.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
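A C sketch of the 1 second timer callback, with a hypothetical `struct node_state`. Releasing the lock from the same callback is an assumption; the text only says the lock must be released.

```c
#include <stdbool.h>

/* Hypothetical view of this node's leadership state. */
struct node_state {
	bool is_leader;
	bool election_in_progress;
	bool can_be_leader;
	bool holds_cluster_lock;
	int broadcasts_sent;
};

/* Hypothetical callback, run on a 1 second timer. */
void leader_broadcast_tick(struct node_state *n)
{
	if (!n->can_be_leader) {
		/* Assumption: give up the lock from this same timer */
		if (n->holds_cluster_lock) {
			n->holds_cluster_lock = false; /* stand-in for release */
		}
		return;
	}
	if (n->is_leader && !n->election_in_progress) {
		/* stand-in for broadcasting CTDB_SRVID_LEADER */
		n->broadcasts_sent++;
	}
}
```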
CTDB_SRVID_LEADER will be regularly broadcast to all connected nodes
by the leader.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
An alternate election method will be added that doesn't use the
election timeout, so this provides a common way for recognising when
an election is in progress.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This makes the code self-documenting.
In ctdb_election_data() there is a slight behaviour change. An
inactive node will now try to lose an election. This case should not happen
because:
* An inactive node can't win an election round and then send a reply.
* Any inactive node should never start an election. There are
currently places where this happens and they will be fixed later.
There is an instance where this could be used in
validate_recovery_master() but this involves a more serious logic
change. Overhaul this function later.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
There are some remaining instances in this file but they will be
removed in subsequent commits.
Modernise debug macros as appropriate.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Recovery master is being renamed to leader. This follows clustering
best practice (e.g. Raft).
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is currently referenced in a number of inconsistent
ways, including:
* pnn
* rec->ctdb->pnn
* ctdb->pnn
* ctdb_get_pnn(ctdb)
* ctdb_get_pnn(rec->ctdb)
The first of these always requires some thought about the context - is
this the node PNN or some other PNN (e.g. argument to function)?
rec->pnn is now always used when referring to the recovery daemon's
PNN.
Doing this also reduces reliance on struct ctdb_context internals.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
ban_time argument is always ctdb->tunable.recovery_ban_period, so
build this in and make the calling code more readable.
ctdb_ban_node() already logs how long a node is banned for, so don't
repeatedly log this.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
All other arguments are available via rec, so simplify.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
pnn and nodemap are both available via the rec context, so simplify.
vnnmap is unused.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The pnn and nodemap arguments to force_election() and
send_election_request() are always effectively rec->pnn and
rec->nodemap, so simplify.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
This is currently referenced in a number of inconsistent
ways, including:
* pnn
* rec->ctdb->pnn
* ctdb->pnn
* ctdb_get_pnn(ctdb)
* ctdb_get_pnn(rec->ctdb)
The first of these always requires some thought about the context - is
this the node PNN or some other PNN (e.g. argument to function)?
The intention is to always use rec->pnn when referring to the recovery
daemon's PNN.
Doing this also reduces reliance on struct ctdb_context internals.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Make the code self-documenting.
This preempts an upcoming change to terminology but doing it now saves
a lot of churn.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
The recovery and takeover helpers can run for a while and generate
non-trivial logs, so have them reopen their logs to support log
rotation.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Mon Jan 17 04:36:30 UTC 2022 on sn-devel-184