mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-03-14 04:58:16 +03:00

820 Commits

Maximiliano Sandoval
5a82a001a5 d/postinst trigger: run systemctl only if systemd is available
Check if systemd is active by testing whether the /run/systemd/system
directory exists, just like debhelper-generated code does, before
running systemctl.

Allows for setting up a package's build dependencies in containers not
managed by systemd.
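
The actual d/postinst is a shell maintainer script; purely as an illustration of the guard described above, here is a minimal Perl sketch of the same check (only the /run/systemd/system path and the systemctl call come from the commit message, the rest is assumed):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only talk to systemd if it actually manages this system, indicated by the
# /run/systemd/system directory existing (the same check debhelper-generated
# maintainer scripts use).
if (-d '/run/systemd/system') {
    system('systemctl', 'daemon-reload') == 0
        or warn "systemctl daemon-reload failed: $?\n";
} else {
    # e.g. a plain container used only to install build dependencies
    print "systemd not running, skipping systemctl\n";
}
```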

Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
 [TL: extend commit message and note that this fixes setting up the
  build env, not the build itself]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2025-03-03 09:44:44 +01:00
Thomas Lamprecht
f3e1f04475 tests: fence config parser: drop unused imports
Similar to commit eec4fab ("test: ha tester: drop unused import")

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2025-01-22 16:20:16 +01:00
Fiona Ebner
e481095f76 test: ha tester: remove trailing whitespace
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2025-01-22 16:11:59 +01:00
Fiona Ebner
eec4fab5f8 test: ha tester: drop unused import
There is no user of File::Path remaining after commit 787b66e
("SimCluster: setup status dir inside new") which was the only user
of remove_tree(). make_path() was not used at all according to git
history.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2025-01-22 16:11:59 +01:00
Thomas Lamprecht
34fe8e59ea bump version to 4.0.6
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
977ae28849 crm: get active if there are nodes that need to leave maintenance
This is mostly cosmetic, because as long as there are configured
services a CRM would get active anyway. But it can happen that a
maintenance mode is left over while all services got removed; a fresh
cluster start will then keep all CRMs idle and thus never clear the
maintenance state.

This can be especially confusing now, as a recent pve-manager commit
993d05abc ("api/ui: include the node ha status in resources call and
show as icon") started to show the maintenance mode as an icon in the
web UI, thus making this blip much more prominent.
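
A minimal sketch of how such an activation check could look; the 'maintenance' node state follows the manager_status conventions mentioned in these commits, but the sub and variable names are assumptions, not the actual CRM code:

```perl
# Illustrative only: become (or stay) active if any node is still marked as
# being in maintenance, so the state gets cleared even when no services are
# configured anymore. Names are assumptions.
sub nodes_need_to_leave_maintenance {
    my ($manager_status) = @_;

    my $node_status = $manager_status->{node_status} // {};
    for my $node (sort keys %$node_status) {
        return 1 if $node_status->{$node} eq 'maintenance';
    }
    return 0;
}
```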

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
73f93a4f6b crm: get active if there are pending CRM commands
But favor the last active CRM to avoid all CRMs trying to get out of
idle at once.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
d0979e6dd0 env: add any_pending_crm_command method
Add a helper that returns whether the CRM command queue holds any
commands without altering the state of the queue at all, unlike the
existing read_crm_commands method does.

This will be used to check if a CRM needs to become active when there
are pending CRM commands but no master seems to process them as all
are idle/offline.
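
A sketch of the non-destructive "peek" such a helper provides, in contrast to read_crm_commands() which also consumes the queue; the file path and names below are assumptions for illustration:

```perl
# Return whether any CRM command is queued, without consuming the queue.
# Path and names are assumptions, not the actual environment code.
sub any_pending_crm_command {
    my ($self) = @_;

    my $filename = '/etc/pve/ha/crm_commands';
    return 0 if !-e $filename;

    open(my $fh, '<', $filename) or return 0;
    my $raw = do { local $/; <$fh> };
    close($fh);

    return (defined($raw) && $raw =~ /\S/) ? 1 : 0;
}
```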

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
afbfa9bafc tests: add more crm idle situations
To test the behavior for when a CRM should get active or stay active
(for a bit longer).

These cases show the status quo, which will be improved on in the next
commits.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
ddd56db346 fix #5243: make CRM go idle after ~15 min of no service being configured
This is mostly for convenience for when one does a quick HA evaluation
and then removes all services again; the biggest visible effect is
that there will be no status updates once the CRM is idle, reducing
some moderately frequent updates to pmxcfs.

As the CRM never lets the watchdog run out pro-actively to trigger
fencing, this won't make much of a difference w.r.t. "accidental"
self-fencing in the common outage situations that happen in practice,
i.e. quorum loss due to a network outage or corosync getting
misconfigured; that's also why this was not considered when adding the
auto-idling for the LRM back in commit 2105170 ("LRM: release lock and
close watchdog if no service configured for >10min").
In short, the watchdog for the CRM is mostly here to avoid a situation
where the process of the currently active CRM hangs, or does not get
scheduled for a while, such that another CRM becomes active, only for
the previous one to then resume and still think it is the active one
and, e.g., write out an outdated manager_status file; there are some
other situations, but the reason is always similar.

Compared to the LRM idle mechanism, we require more rounds for the CRM
to go idle (90 for the CRM vs 60 for the LRM); the reason is that the
LRM needs an active CRM for some operations to progress in the FSM, so
just waiting a bit longer for the CRM is enough to ensure that this
can happen.
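
To illustrate the round counting described above, a rough sketch, assuming one CRM round roughly every 10 seconds so that 90 rounds amount to about 15 minutes; all names here are made up for illustration:

```perl
# Only give up the active role after enough consecutive rounds without any
# configured service or pending work (LRM uses 60 rounds, CRM waits longer).
my $idle_rounds = 0;
my $MAX_IDLE_ROUNDS = 90;

sub update_idle_state {
    my ($has_services, $has_pending_work) = @_;

    if ($has_services || $has_pending_work) {
        $idle_rounds = 0; # there is work, stay active
    } else {
        $idle_rounds++;
    }
    return $idle_rounds >= $MAX_IDLE_ROUNDS; # true => CRM may go idle
}
```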

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
e857e63846 tests: add scenario for CRM going idle
To better show what changes with the future implementation of this
feature in a following commit.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
e5f9b5f6e3 crm: factor out base check if cluster and ha is healthy
This will be reused for the auto-idling mechanism; factor out getting
the manager status too, as it will be used for more specific checks
about being able to go idle or when a CRM should be active.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
af71e0049f crm: factor out giving up the watchdog protection
See the added comment for full details on why the watchdog protection
of the CRM needs less strict safety requirements compared to that of
an (active) LRM.

In short, the CRM does not manage services itself but directs them
through the manager_status state file. This means the watchdog mainly
protects against a hung system where locks would time out before the
state is written out, causing a race with the new CRM. So the CRM can
basically always give up the watchdog safely when it stops being the
active CRM anyway.
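
A minimal sketch of what "giving up the watchdog protection" amounts to; the ha_agent_wd / watchdog_close names loosely mirror the HA environment interface and are assumptions here:

```perl
# Close the CRM's watchdog as soon as this node stops being the active CRM;
# since the CRM only directs services via the manager_status state file, it
# does not need to keep the protection once it gives up the active role.
sub give_up_watchdog_protection {
    my ($self) = @_;

    if ($self->{ha_agent_wd}) {
        $self->{haenv}->watchdog_close($self->{ha_agent_wd});
        delete $self->{ha_agent_wd};
    }
}
```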

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
b4a7a92d57 crm: style clean-up to early line-wrapping
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
3e18cf3405 simulator: document newer crm commands
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
77cfe89238 api: status: code style clean-up
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Aaron Lauterer
f80628066d tools: adapt line lengths of verbose descriptions
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2024-11-11 21:31:01 +01:00
Aaron Lauterer
601d2c542c tools: group verbose desc: mention higher number is higher priority
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2024-11-11 21:31:01 +01:00
Wolfgang Bumiller
800a0c3e48 bump version to 4.0.5
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
2024-06-04 11:10:11 +02:00
Lukas Wagner
f43a6009ff env: notify: use named templates instead of passing template strings
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Tested-by: Max Carrara <m.carrara@proxmox.com>
Reviewed-by: Max Carrara <m.carrara@proxmox.com>
2024-06-03 14:16:35 +02:00
Thomas Lamprecht
822def8250 bump version to 4.0.4
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-22 13:47:22 +02:00
Fabian Grünbichler
8bac62a877 d/postinst: make deb-systemd-invoke non-fatal
else this can break an upgrade for unrelated reasons.

this also more closely mimics the debhelper behaviour (which we only
don't use here because of the lack of reload support) - the snippet was
restructured to be more similar, with an explicit `if`, as well.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2024-04-17 16:56:02 +02:00
Thomas Lamprecht
2db44501bc bump version to 4.0.3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:49:08 +01:00
Lukas Wagner
868d3cd4bb env: switch to matcher-based notification system
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-11-17 14:47:55 +01:00
Thomas Lamprecht
07284f1194 usage stats: tiny code style clean-up
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:47:12 +01:00
Thomas Lamprecht
56d4c7a50a watchdog-mux: code indentation and style cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:46:49 +01:00
Thomas Lamprecht
6548300e33 buildsys: use dpkg default makefile snippet
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:45:35 +01:00
Fiona Ebner
1c61138341 crs: avoid auto-vivification when adding node to service usage
Part of what caused bug #4984. Make the code future-proof and warn
when the node was never registered in the plugin, similar to what the
'static' usage plugin already does.
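
A sketch of the defensive check described above, warning instead of silently creating a counter for an unknown node; the method signature and fields are simplified assumptions, not the actual 'basic' plugin code:

```perl
# 'basic' usage plugin style: one counter per registered node. Refuse to
# auto-create an entry for a node that was never added via add_node().
sub add_service_usage_to_node {
    my ($self, $nodename, $sid, $service_node) = @_;

    if (!defined($self->{nodes}->{$nodename})) {
        warn "node '$nodename' was never registered with add_node() - ignoring usage of '$sid'\n";
        return;
    }
    $self->{nodes}->{$nodename}++;
}
```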

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: rework commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:27:04 +02:00
Fiona Ebner
c7843a315d fix #4984: manager: add service to migration-target usage only if online
Otherwise, when using the 'basic' plugin, this would lead to
auto-vivification of the $target node in the Perl hash tracking the
usage and it would wrongly be considered online when selecting the
recovery node.

The 'static' plugin was not affected, because it would check and warn
before adding usage to a node that was not registered with add_node()
first. Doing the same in the 'basic' plugin will be done by another
patch.
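
A sketch of the manager-side guard this fix adds, with all names assumed for illustration: only account the migration target when it is online, so a bare increment cannot silently create an entry for an offline node:

```perl
# Names are assumptions, not the actual manager code: only account a service's
# migration target when that node is known to be online, so a bare increment
# cannot silently create ("auto-vivify") an entry for an offline node.
sub account_migration_target {
    my ($usage, $online_nodes, $sid, $sd) = @_;

    my $target = $sd->{target};
    return if !defined($target);
    return if !$online_nodes->{$target};

    $usage->{$target}++; # 'basic' style per-node service count
}
```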

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: shorten commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:22:53 +02:00
Lukas Wagner
4cb3b2cf9b manager: send notifications via new notification module
... instead of using sendmail directly.

If the new 'notify.target-fencing' parameter from datacenter config
is set, we use it as a target for notifications. If it is not set,
we send the notification to the default target (mail-to-root).

There is also a new 'notify.fencing' parameter which controls whether
notifications should be sent at all. If it is not set, we
default to the old behavior, which is to send.

Also add a dependency on the `libpve-notify-perl` package to d/control.
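
A sketch of the described fallback logic, using the two datacenter config keys and the mail-to-root default named above; the helper name and the boolean treatment of notify.fencing are simplifications, not the actual PVE::Notify usage:

```perl
# Decide whether and where to send a fencing notification.
# Treating 'notify.fencing' as a simple truthy flag is a simplification.
sub fence_notification_settings {
    my ($datacenter_cfg) = @_;

    my $notify = $datacenter_cfg->{notify} // {};

    my $enabled = !defined($notify->{fencing}) || $notify->{fencing}; # default: send
    my $target = $notify->{'target-fencing'} // 'mail-to-root';       # default target

    return ($enabled, $target);
}
```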

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-08-03 17:34:52 +02:00
Thomas Lamprecht
dfe080bab1 bump version to 4.0.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-13 08:35:56 +02:00
Fiona Ebner
17c6cbeab9 manager: clear stale maintenance node caused by simultaneous cluster shutdown
Currently, the maintenance node for a service is only cleared when the
service is started on another node. In the edge case of a simultaneous
cluster shutdown, however, it might be that the service was never
started anywhere else after the maintenance node was recorded, because
the other nodes were already in the process of being shut down too.

If a user ends up in this edge case, it would be rather surprising
that the service would be automatically migrated back to the
"maintenance node" which actually is not in maintenance mode anymore
after a migration away from it.
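
A sketch of the clean-up idea, with the field and status names assumed for illustration: drop the recorded maintenance node once that node is no longer in maintenance mode, so the service is not migrated back for no reason:

```perl
# $sd is the per-service state, $node_status maps node names to their state.
# All names here are assumptions, not the actual manager code.
sub clear_stale_maintenance_node {
    my ($sd, $node_status) = @_;

    my $fallback = $sd->{maintenance_node};
    return if !defined($fallback);

    if (($node_status->{$fallback} // '') ne 'maintenance') {
        delete $sd->{maintenance_node};
    }
}
```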

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
Fiona Ebner
a1b9918d30 tests: simulate stale maintenance node caused by simultaneous cluster shutdown
In the test log, it can be seen that the service will unexpectedly be
migrated back. This is caused by the service's maintenance node
property being set by the initial shutdown, but never cleared, because
that currently happens only when the service is started on a different
node. The next commit will address the issue.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
Thomas Lamprecht
eee63557bc bump version to 4.0.1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:41:59 +02:00
Thomas Lamprecht
bf5d92725e d/control: bump versioned dependency for pve-container & qemu-server
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:33:40 +02:00
Fiona Ebner
e0346eccaf resources: pve: avoid relying on internal configuration details
Instead, use the new get_derived_property() method to get the same
information in a way that is robust regarding changes in the
configuration structure.
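
For context, a sketch of what querying through that interface could look like; the property names ('max-cpu', 'max-memory') and the exact call signature are assumptions about get_derived_property(), not verified against the qemu-server/pve-container API:

```perl
use PVE::QemuConfig; # hypothetical usage, signature assumed

# Ask the guest config class for derived values instead of reading internal
# configuration keys (cores/sockets/memory variants) directly.
sub get_static_stats {
    my ($vmid, $node) = @_;

    my $conf = PVE::QemuConfig->load_config($vmid, $node);

    return {
        maxcpu => PVE::QemuConfig->get_derived_property($conf, 'max-cpu'),
        maxmem => PVE::QemuConfig->get_derived_property($conf, 'max-memory'),
    };
}
```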

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-09 07:28:29 +02:00
Fiona Ebner
afa1aa9cb8 api: fix/add return description for status endpoint
The fact that no 'items' was specified made the api-viewer throw a
JavaScript exception: retinf.items is undefined
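
For reference, the usual shape of such a return schema, with an 'items' member so the api-viewer has something to render; the property names below are placeholders, not the actual status endpoint schema:

```perl
use strict;
use warnings;

# Placeholder properties only; the point is the 'items' schema for the array,
# whose absence made the api-viewer fail with "retinf.items is undefined".
my $returns = {
    type => 'array',
    items => {
        type => 'object',
        properties => {
            id     => { type => 'string' },
            type   => { type => 'string' },
            status => { type => 'string', optional => 1 },
        },
    },
};
```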

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-07 17:40:48 +02:00
Fiona Ebner
5a9c3a2808 lrm: do not migrate if service already running upon rebalance on start
As reported in the community forum[0], currently, a newly added
service that's already running is shut down, offline migrated and
started again if rebalance selects a new node for it. This is
unexpected.

An improvement would be online migrating the service, but rebalance
is only supposed to happen for a stopped->start transition[1], so the
service should not be migrated at all.

The cleanest solution would be for the CRM to use the state 'started'
instead of 'request_start' for newly added services that are already
running, i.e. restore the behavior from before commit c2f2b9c
("manager: set new request_start state for services freshly added to
HA") for such services. But currently, there is no mechanism for the
CRM to check if the service is already running, because it could be on
a different node. For now, avoiding the migration has to be handled in
the LRM instead. If the CRM ever has access to the necessary
information in the future, the solution mentioned above can be
reconsidered.

Note that the CRM log message relies on the fact that the LRM only
returns the IGNORED status in this case, but it's more user-friendly
than using a generic message like "migration ignored (check LRM
log)".

[0]: https://forum.proxmox.com/threads/125597/
[1]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_crs_scheduling_points
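
A sketch of the LRM-side behavior described above; the IGNORED code referenced here is the one added in the "tools: add IGNORED return code" commit listed further down, while all other names and the constant values are assumptions:

```perl
use constant { SUCCESS => 0, IGNORED => 2 }; # values are assumptions

# If a migration request stems from rebalance-on-start but the service is
# already running locally, skip it and report IGNORED so the CRM can log a
# specific, user-friendly message.
sub handle_rebalance_on_start {
    my ($sid, $is_running) = @_;

    if ($is_running) {
        warn "ignoring rebalance-on-start for '$sid' - service already running\n";
        return IGNORED;
    }
    # ... otherwise perform the regular (offline) relocation ...
    return SUCCESS;
}
```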

Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ T: split out adding the test into a previous commit so that one can
   see in git what the original bad behavior was and how it behaves now ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:08:00 +02:00
Thomas Lamprecht
c1aaa05b85 tests: simulate adding running services to HA with rebalance-on-start
Split out from Fiona's original series, to better show what actually
changes with her fix.

Currently, a newly added service that's already running is shut down,
offline migrated and started again if rebalance selects a new node
for it. This is unexpected and should be fixed; encode that behavior
as a test now, still showing the undesired behavior, and fix it in
the next commit.

Originally-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
c0dbab3c32 tools: add IGNORED return code
Will be used to ignore rebalance-on-start when an already running
service is newly added to HA.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
81e8e7d000 sim: hardware: commands: make it possible to add already running service
Will be used in a test for balance on start, where it should make a
difference whether the service is running or not.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
b8d86ec48c sim: hardware: commands: fix documentation for add
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Thomas Lamprecht
973bf0324f bump version to 4.0.0
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:27:04 +02:00
Thomas Lamprecht
3de087a57b buildsys: derive upload dist automatically
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
c1b4249bde d/control: raise standards version compliance to 4.6.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
cfe9011673 buildsys: improve DSC target & add sbuild convenience target
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
1b91242ae9 buildsys: make build-dir generation atomic
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
576ae6e7d5 buildsys: rework doc-gen cleanup and makefile inclusion
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
df0c583fc3 buildsys: use full DEB_VERSION and correct DEB_HOST_ARCH
DEB_HOST_ARCH is the architecture the package is actually built for,
while DEB_BUILD_ARCH is that of the build host; having this correct
makes cross-building easier, but otherwise it makes no difference.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
69e37516e9 makefile: convert to use simple parenthesis
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00