mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-01-06 17:18:00 +03:00

715 Commits

Author SHA1 Message Date
Thomas Lamprecht
1280368d31 fix variable name typo
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-07-22 07:25:02 +02:00
Thomas Lamprecht
066fd01670 fix spreading out services if source node isn't operational but otherwise ok
as it's the case for going into maintenance mode

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-07-21 18:14:33 +02:00
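
The accounting problem addressed by 066fd01670, and exercised by the test added in 6756e14aed, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the project's actual code: the names `online_node_usage` and `pick_target`, the dictionary layout, and the `migrate` state label are hypothetical and only serve the illustration.

```python
def online_node_usage(online_nodes, services, count_target_if_source_offline=True):
    """Count services per online node (toy model, hypothetical names)."""
    usage = {node: 0 for node in online_nodes}
    for svc in services.values():
        source = svc["node"]
        target = svc.get("target")
        if source in usage:
            usage[source] += 1
            # a service on the move also puts load on its target
            if svc["state"] == "migrate" and target in usage:
                usage[target] += 1
        elif count_target_if_source_offline and target in usage:
            # source is not operational (e.g. maintenance mode):
            # still account for the target that was already chosen
            usage[target] += 1
    return usage


def pick_target(usage):
    """Pick the least-used node, breaking ties by node name."""
    return min(usage, key=lambda node: (usage[node], node))
```

With one guest already migrating from a node in maintenance to `node2`, ignoring the target keeps `node2` at a count of zero, so the next guest would pick `node2` again; counting the target lets the next guest go to `node3`.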
Thomas Lamprecht
6756e14aed tests: add shutdown policy scenario with multiple guests to spread out
currently wrong as online_node_usage doesn't consider counting the
target node if the source node isn't considered online (=
operational) anymore

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-07-21 18:09:42 +02:00
Thomas Lamprecht
c00c44818a bump version to 3.3-4
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-04-27 14:02:22 +02:00
Fabian Grünbichler
ad6456997e lrm: fix getting stuck on restart
run_workers is responsible for updating the state after workers have
exited. if the current LRM state is 'active', but a shutdown_request was
issued in 'restart' mode (like on package upgrades), this call is the
only one made in the LRM work() loop.

skipping it if there are active services means the following sequence of
events effectively keeps the LRM from restarting or making any progress:

- start HA migration on node A
- reload LRM on node A while migration is still running

even once the migration is finished, the service count is still >= 1
since the LRM never calls run_workers (directly or via
manage_resources), so the service having been migrated is never noticed.

maintenance mode (i.e., rebooting the node with shutdown policy migrate)
does call manage_resources and thus run_workers, and will proceed once
the last worker has exited.

reported by a user:

https://forum.proxmox.com/threads/lrm-hangs-when-updating-while-migration-is-running.108628

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2022-04-27 13:57:37 +02:00
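
The stuck-restart sequence described in ad6456997e can be reproduced with a small toy model. This is only a sketch: the class `ToyLRM` and its methods are hypothetical stand-ins, and only the idea that `run_workers` is what notices exited workers is taken from the commit message.

```python
class ToyLRM:
    """Toy model of the restart path; names are hypothetical."""

    def __init__(self, running_workers):
        # sid -> True while the worker (e.g. a migration) is still running
        self.workers = dict(running_workers)
        self.active_services = len(self.workers)

    def finish_worker(self, sid):
        # the worker process exits; the bookkeeping is not updated yet
        self.workers[sid] = False

    def run_workers(self):
        # notice exited workers and update the service count
        for sid in [s for s, running in self.workers.items() if not running]:
            del self.workers[sid]
            self.active_services -= 1

    def restart_round(self, skip_when_active):
        """Return True once the restart may proceed."""
        if not (skip_when_active and self.active_services > 0):
            self.run_workers()
        return self.active_services == 0
```

If `run_workers` is skipped while services are active, the count never drops, so `restart_round(True)` keeps returning `False` even after the migration has finished.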
Thomas Lamprecht
fe3781e8ab buildsys: track and upload debug package
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 18:08:27 +01:00
Thomas Lamprecht
c15a8b803e bump version to 3.3-3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 18:05:37 +01:00
Thomas Lamprecht
eef4f86338 lrm: increase run_worker loop-time partition
every LRM round is scheduled to run for 10s but we spend only half
of that actively trying to run workers (within the max_worker limit).

Raise that to 80% duty cycle.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 16:17:28 +01:00
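
The effect of the duty-cycle change in eef4f86338 can be worked through with a short sketch. The 10s round and the 50% and 80% shares come from the commit message; the function `workers_started` and the per-worker start cost are hypothetical and only serve the arithmetic.

```python
def workers_started(start_costs_ms, round_ms=10_000, duty_percent=80):
    """Count how many workers get started before the time budget is used up."""
    budget_ms = round_ms * duty_percent // 100
    elapsed_ms = 0
    started = 0
    for cost_ms in start_costs_ms:
        if elapsed_ms >= budget_ms:
            break  # budget exhausted, remaining workers wait for the next round
        elapsed_ms += cost_ms
        started += 1
    return started
```

Assuming each start costs one second, a 50% duty cycle fits five starts into a round, while 80% fits eight.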
Thomas Lamprecht
65c1fbac99 lrm: avoid job starvation on huge workloads
If a setup has a lot of VMs we may run into the time limit from the
run_worker loop before processing all workers, which can easily
happen if an admin did not increase their default of max_workers in
the setup, but even with a bigger max_worker setting one can run into
it.

That combined with the fact that we sorted just by the $sid
alpha-numerically means that CTs were preferred over VMs (C comes
before V) and additionally lower VMIDs were preferred too.

That means that a set of SIDs had a lower chance of ever actually
getting run, which is naturally not ideal at all.
Improve on that behavior by adding a counter to the queued worker and
preferring those that have a higher one, i.e., spent more time
waiting on getting actively run.

Note, due to the way the stop state is enforced, i.e., always
enqueued as new worker, its start-try counter will be reset every
round and thus have a lower priority compared to other request
states. We probably want to differentiate between a stop request when the
service is/was in another state just before and the time a stop is
just re-requested even if a service was already stopped for a while.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 16:14:03 +01:00
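
The prioritization described in 65c1fbac99 can be sketched as follows. This is a simplified illustration, not the actual queue handling: the function `pick_workers`, the plain dictionary of wait counters, and the `max_starts` limit are assumptions made for the example.

```python
def pick_workers(queue, max_starts):
    """Pick the sids to start this round.

    queue maps sid -> number of rounds the worker has already waited.
    Longer-waiting workers go first; the sid only breaks ties.
    """
    order = sorted(queue, key=lambda sid: (-queue[sid], sid))
    chosen = order[:max_starts]
    for sid in order[max_starts:]:
        queue[sid] += 1  # left waiting, gains priority for the next round
    for sid in chosen:
        del queue[sid]  # started, no longer queued
    return chosen
```

With a purely alphanumeric sort, a steady stream of newly queued `ct:` entries could keep a `vm:` entry waiting indefinitely; with the counter, the waiting entry overtakes newer ones.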
Thomas Lamprecht
b538340c9d lrm: code/style cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 14:40:27 +01:00
Thomas Lamprecht
f613e426ce lrm: run worker: avoid an indentation level
best viewed with the `-w` flag to ignore whitespace change itself

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 13:42:15 +01:00
Thomas Lamprecht
a25a516ac6 lrm: log actual error if fork fails
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 13:39:35 +01:00
Thomas Lamprecht
2deff1ae35 manager: refactor fence processing and rework fence-but-no-service log
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 13:31:04 +01:00
Thomas Lamprecht
0179818f48 d/changelog: s/nodes/services/
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-20 10:10:27 +01:00
Thomas Lamprecht
ccf328a833 bump version to 3.3-2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 14:30:19 +01:00
Fabian Ebner
7dc927033f manage: handle edge case where a node gets stuck in 'fence' state
If all services in 'fence' state are gone from a node (e.g. by
removing the services) before fence_node() was successful, a node
would get stuck in the 'fence' state. Avoid this by calling
fence_node() if the node is in 'fence' state, regardless of service
state.

Reported in the community forum:
https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
[ T: track test change of new test ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 13:50:47 +01:00
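
The edge case fixed in 7dc927033f can be sketched with a small selection function. This is an illustration under stated assumptions: `nodes_needing_fence` and the two status dictionaries are hypothetical and do not mirror the manager's real data structures.

```python
def nodes_needing_fence(node_status, service_status):
    """Return the sorted nodes the manager should try to fence this round."""
    # nodes referenced by at least one service in 'fence' state
    nodes = {
        svc["node"] for svc in service_status.values() if svc["state"] == "fence"
    }
    # the edge case: the node itself is in 'fence' state although no
    # service points at it anymore (e.g. all services were removed)
    nodes |= {node for node, state in node_status.items() if state == "fence"}
    return sorted(nodes)
```

Looking at services alone would return an empty list once the admin has removed them, leaving the node in `fence` state forever.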
Thomas Lamprecht
30fc7ceedb lrm: also check CRM node-status for determining fence-request
This fixes point 2. of commit 3addeeb - avoiding that a LRM goes
active as long as the CRM still has it in (pending) `fence` state,
which can happen after a watchdog reset + fast boot. This avoids that
we interfere with the CRM acquiring the lock, which is all the more
important once a future commit gets added that ensures a node isn't
stuck in `fence` state if there's no service configured (anymore) due
to admin manually removing them during fencing.

We explicitly fix the startup first to better show how it works in
the test framework, but as the test/sim hardware can now delay the
CRM while keeping LRM running, the second test (i.e.,
test-service-command9) should still trigger after the next commit, if
this one would be reverted or broken otherwise.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 13:48:57 +01:00
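
The extended check from 30fc7ceedb can be sketched as a predicate that looks at both the services and the CRM's node status. The function name `has_pending_fence_request` and the dictionary layout are hypothetical; only the two conditions are taken from the commit message.

```python
def has_pending_fence_request(node, service_status, node_status):
    """Return True if this node must not go active yet."""
    # a service located on this node is marked for fencing
    for svc in service_status.values():
        if svc["node"] == node and svc["state"] == "fence":
            return True
    # the CRM still tracks the node itself as (pending) 'fence',
    # even if no service is configured on it anymore
    return node_status.get(node) == "fence"
```

A node that rebooted quickly after a watchdog reset, and whose services were removed in the meantime, is still held back by the second condition.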
Thomas Lamprecht
303490d8f1 lrm: factor out fence-request check into own helper
we'll extend that a bit in a future commit

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 13:48:57 +01:00
Thomas Lamprecht
ca2e547a76 test: cover case where all services get removed from in-progress fenced node
this test's log is showing up two issues we'll fix in later commits

1. If a node gets fenced and an admin removes all services before the
   fencing completes, the manager will ignore that node's state and
   thus never make the "fence" -> "unknown" transition required by
   the state machine

2. If a node is marked as "fence" in the manager's node status, but
   has no service, its LRM's check for "pending fence request"
   returns a false negative and the node starts trying to acquire its
   LRM work lock. This can even succeed in practice, e.g. the events:
    1. Node A gets fenced (for whatever reason), CRM is working on
       acquiring its lock while Node A reboots
    2. Admin is present and removes all services of Node A from HA
    2. Node A booted up fast again, LRM is already starting before
       CRM could ever get the lock (<< 2 minutes)
    3. Service located on Node A gets added to HA (again)
    4. LRM of Node A will actively try to get lock as it has no
       service in fence state and is (currently) not checking the
       manager's node state, so is ignorant of the not yet processed
       fence -> unknown transition
    (note: above uses 2. twice as the order of those points doesn't matter)

    As a result the CRM may never get to acquire the lock of Node A's
    LRM, and thus cannot finish the fence -> unknown transition,
    resulting in user confusion and possible weird effects.

In the current log one can observe 1. by the missing fence tries of
the master and 2. can be observed by the LRM acquiring the lock while
still being in "fence" state from the master's POV.

We use two tests so that point 2. is better covered later on

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 13:48:21 +01:00
Thomas Lamprecht
1b21e7e651 sim: implement skip-round command for crm/lrm
This allows simulating situations where there's some asymmetry
required in service type scheduling, e.g., if the master should
not pick up LRM changes just yet - something that can happen quite
often in the real world due to scheduling not being predictable,
especially across different hosts.

The implementation is pretty simple for now, that also means we just
do not care about watchdog updates for the skipped service, meaning
that one is limited to skip two 20s rounds max before self-fencing
kicks in.

This can be made more advanced once required.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 11:19:34 +01:00
Thomas Lamprecht
214b70f45a sim: test hw: small code cleanups and whitespace fixes
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 11:19:34 +01:00
Thomas Lamprecht
0e13a6c123 sim: service add command: allow to override state
Until now we had at most one extra param, so let's get all the
remaining params in an array and use that, fallback stayed the same.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 11:19:34 +01:00
Thomas Lamprecht
1323ef6ec5 sim: add service: set type/name in config
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 11:19:34 +01:00
Thomas Lamprecht
fe19c9b412 test/sim: also log delay commands
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-19 11:19:34 +01:00
Thomas Lamprecht
a0a7d11ed6 sim/hardware: sort and split use statements
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-17 15:57:43 +01:00
Thomas Lamprecht
4ee32601b9 lrm: fix comment typos
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-17 15:57:43 +01:00
Thomas Lamprecht
0dcb6597aa crm: code/style cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-17 12:28:22 +01:00
Thomas Lamprecht
8a25bf2969 d/postinst: fix restarting LRM/CRM when triggered
We wrongly dropped the semi-manual postinst in favor of a fully
auto-generated one, but we always need to generate the trigger
actions ourselves - it cannot work otherwise.

Fix 3166752 ("postinst: use auto generated postinst")
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-17 11:30:49 +01:00
Thomas Lamprecht
b7fb934810 d/lintian: update repeated-trigger override
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-01-17 11:30:08 +01:00
Thomas Lamprecht
a31c6fe591 lrm: fix log call on wrong module
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-07 15:19:30 +02:00
Thomas Lamprecht
a2d12984b5 bump version to 3.3-1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
719883e9a5 recovery: allow disabling an in-recovery service
Mostly for convenience for the admin, to avoid the need for removing
it completely, which is always frowned upon by most users.

Follows the same logic and safety criteria as the transition to
`stopped` on getting into the `disabled` state in the
`next_state_error`.

As we previously had a rather immediate transition from recovery ->
error (not anymore) this is actually restoring a previous feature and
does not add new implications or the like.

Still, add a test which also covers that the recovery state does not
allow things like stop or migrate to happen.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
6104d9e76e tests: cover request-state changes and crm-cmds for in-recovery services
Add a test which covers that the recovery state does not allow
things like stop or migrate to happen.

Also add one for disabling at the end, this is currently blocked too
but will change in the next patch, as it can be a safe way out for
the admin to reset the service without removing it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
feea391367 recompute_online_node_usage: show state on internal error
makes debugging easier, also throw in some code cleanup

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
90a247552c fix #3415: never switch in error state on recovery, try harder
With the new 'recovery' state introduced a commit previously we get a
clean transition, and thus actual difference, between to-be-fenced and
fenced.

Use that to avoid going into the error state when we did not find any
possible new node we could recover the service to.
That can happen if the user uses the HA manager for local services,
which is an OK use-case as long as the service is restricted to a
group with only that node. But previous to that we could never
recover such services if their node failed, as they got always put
into the "error" dummy/final state.
But that's just artificially limiting ourselves to get a false sense of
safety.

Nobody touches the service while it's in the recovery state, neither the LRM
nor anything else (as any normal API call gets just routed to the HA
stack anyway) so there's just no chance that we get a bad
double-start of the same services, with resource access collisions
and all the bad stuff that could happen (and note, this will in
practice only matter for restricted services, which are normally only
using local resources, so here it wouldn't even matter if it wasn't
safe already - but it is, double time!).

So, the usual transition guarantees still hold:
* only the current master does transitions
* there needs to be an OK quorate partition to have a master

And, for getting into recovery the following holds:
* the old node's lock was acquired by the master, which means it was
  (self-)fenced -> resource not running

So as "recovery" is a no-op state we only got into once the node was
fenced we can continue recovery, i.e., try to find a new node for
the failed services.

Tests:
* adapt the existing recovery test output to match the endless retry for
  finding a new node (vs. the previous "go into error immediately")
* add a test where the node comes up eventually, so that we cover
  also the recovery to the same node it was on, previous to a failure
* add a test with a non-empty start-state, the restricted failed node
  is online again. This ensures that the service won't get started
  until the HA manager actively recovered it, even if it's staying on
  that node.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
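
The retry behavior introduced by 90a247552c can be sketched as a single state-machine step. This is a simplified model, not the manager's real transition code: `next_state_recovery`, the `allowed_nodes` argument (standing in for a restricted group), and the choice of `started` as the follow-up state are assumptions for illustration.

```python
def next_state_recovery(service, online_nodes, allowed_nodes):
    """Return the updated service for one manager round (toy model)."""
    candidates = sorted(n for n in online_nodes if n in allowed_nodes)
    if not candidates:
        # no usable node yet: stay in 'recovery' and retry next round
        # instead of dropping into the final 'error' state
        return dict(service, state="recovery")
    # a node is available, possibly the same one the service ran on before
    return dict(service, state="started", node=candidates[0])
```

A service restricted to `node1` simply stays in `recovery` while `node1` is down and is recovered onto it once the node is back online.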
Thomas Lamprecht
bdbd9b2ba5 gitignore: add test status output directory's content to ignored files
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
d54c04bdcc tests: add one for service set to be & stay ignored from the start
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
21051707f6 LRM: release lock and close watchdog if no service configured for >10min
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
abc1499bc6 LRM: factor out closing watchdog local helper
It's not much but repeated a few times, and as a next commit will add
another such time let's just refactor it to a local private helper
with a very explicit name and comment about what implications calling
it has.

Take the chance and add some more safety comments too.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:08:12 +02:00
Thomas Lamprecht
c259b1a814 manager: make recovery actual state in FSM
This basically makes recovery just an active state transition, as can
be seen from the regression tests - no other semantic change is
caused.

For the admin this is much better to grasp than services still marked
as "fence" when the failed node is already fenced or even already up
again.

Code-wise it makes sense too, to make the recovery part not so hidden
anymore, but show it for what it is: an actual part of the FSM

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-02 20:04:38 +02:00
Thomas Lamprecht
3458a0e377 manager: indentation/code-style cleanups
we now allow for a longer text-width in general and adapt some lines
for that

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-01 16:00:51 +02:00
Thomas Lamprecht
c98a3acfec ha-tester: allow one to suppress the actual test output
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-01 16:00:51 +02:00
Thomas Lamprecht
bcc057fa6d ha-tester: report summary count of run/passed tests and list failed ones
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-01 16:00:51 +02:00
Thomas Lamprecht
dd4ab3f532 ha-tester: allow to continue harness on test failure
To see if just a bit or many tests are broken it is useful to
sometimes run all, and not just exit after first failure.

Allow this as opt-in feature.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-01 16:00:51 +02:00
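
The opt-in behavior from dd4ab3f532, together with the summary reporting from bcc057fa6d, can be sketched as a tiny harness loop. This is not the ha-tester script itself: the function `run_tests`, its `continue_on_failure` parameter, and the result dictionary are hypothetical names chosen for the example.

```python
def run_tests(tests, continue_on_failure=False):
    """Run (name, callable) pairs; a callable returns True on pass."""
    run = 0
    failed = []
    for name, test in tests:
        run += 1
        if not test():
            failed.append(name)
            if not continue_on_failure:
                break  # default: stop at the first failure
    return {"run": run, "passed": run - len(failed), "failed": failed}
```

Stopping at the first failure stays the default; passing `continue_on_failure=True` runs everything and lists all failed tests in the summary.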
Thomas Lamprecht
a5d48ae190 sim: hardware: update & reformat comment for available commands
The service addition and deletion, and also the artificial delay
(useful to force continuation of the HW) commands were missing
completely.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-07-01 16:00:51 +02:00
Thomas Lamprecht
d8b4714873 buildsys: change upload/repo dist to bullseye
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-24 11:40:39 +02:00
Thomas Lamprecht
19265402bc bump version to 3.2-2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-24 11:38:46 +02:00
Thomas Lamprecht
b7e7495bed d/rules: update to systemd dh changes
both, `override_dh_systemd_enable` and `override_dh_systemd_start`
are ignored with current compat level 12, and will become an error in
level >= 13, so drop them and use `override_dh_installsystemd` for
both of the previous uses.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-24 11:37:01 +02:00
Thomas Lamprecht
8a35366fa6 bump version to 3.2-1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-12 20:56:03 +02:00
Thomas Lamprecht
5868b0cf16 d/control: bump debhelper compat level to 12
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-12 20:54:22 +02:00