mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-01-31 05:47:19 +03:00

802 Commits

Author SHA1 Message Date
Wolfgang Bumiller
800a0c3e48 bump version to 4.0.5
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
2024-06-04 11:10:11 +02:00
Lukas Wagner
f43a6009ff env: notify: use named templates instead of passing template strings
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Tested-by: Max Carrara <m.carrara@proxmox.com>
Reviewed-by: Max Carrara <m.carrara@proxmox.com>
2024-06-03 14:16:35 +02:00
Thomas Lamprecht
822def8250 bump version to 4.0.4
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-22 13:47:22 +02:00
Fabian Grünbichler
8bac62a877 d/postinst: make deb-systemd-invoke non-fatal
else this can break an upgrade for unrelated reasons.

this also mimics the debhelper behaviour more closely (we only avoid
debhelper here because it lacks reload support) - the snippet was also
restructured to be more similar, with an explicit `if`.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2024-04-17 16:56:02 +02:00
Thomas Lamprecht
2db44501bc bump version to 4.0.3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:49:08 +01:00
Lukas Wagner
868d3cd4bb env: switch to matcher-based notification system
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-11-17 14:47:55 +01:00
Thomas Lamprecht
07284f1194 usage stats: tiny code style clean-up
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:47:12 +01:00
Thomas Lamprecht
56d4c7a50a watchdog-mux: code indentation and style cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:46:49 +01:00
Thomas Lamprecht
6548300e33 buildsys: use dpkg default makefile snippet
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:45:35 +01:00
Fiona Ebner
1c61138341 crs: avoid auto-vivification when adding node to service usage
Part of what caused bug #4984. Make the code future-proof and warn
when the node was never registered in the plugin, similar to what the
'static' usage plugin already does.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: rework commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:27:04 +02:00
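
A minimal Perl sketch of the guard described above; the method shape and
field names are assumptions, not necessarily the real usage-plugin API:

    sub add_service_usage_to_node {
        my ($self, $nodename, $sid, $service_node) = @_;

        # Warn instead of silently autovivifying an entry for a node that
        # was never registered via add_node() beforehand.
        if (!exists($self->{nodes}->{$nodename})) {
            warn "did not get usage information for node '$nodename'\n";
            return;
        }

        $self->{nodes}->{$nodename}++;
    }
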
Fiona Ebner
c7843a315d fix #4984: manager: add service to migration-target usage only if online
Otherwise, when using the 'basic' plugin, this would lead to
auto-vivification of the $target node in the Perl hash tracking the
usage and it would wrongly be considered online when selecting the
recovery node.

The 'static' plugin was not affected, because it would check and warn
before adding usage to a node that was not registered with add_node()
first. Doing the same in the 'basic' plugin will be done by another
patch.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: shorten commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:22:53 +02:00
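
An illustrative caller-side version of the same idea, with assumed helper
and variable names rather than the manager's real ones:

    # Only record usage on the migration target if that node is online;
    # a plain hash lookup would otherwise autovivify an entry and the
    # target could wrongly look like a valid recovery candidate later.
    if ($ns->node_is_online($target)) {
        $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node});
    }
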
Lukas Wagner
4cb3b2cf9b manager: send notifications via new notification module
... instead of using sendmail directly.

If the new 'notify.target-fencing' parameter from datacenter config
is set, we use it as a target for notifications. If it is not set,
we send the notification to the default target (mail-to-root).

There is also a new 'notify.fencing' parameter which controls whether
notifications should be sent at all. If it is not set, we
default to the old behavior, which is to send.

Also add dependency to the `libpve-notify-perl` package to d/control.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-08-03 17:34:52 +02:00
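
A rough Perl sketch of the decision logic described above; the two
datacenter-config keys are taken from the message, the surrounding helper
and variable names are assumptions:

    my $notify_cfg = $dc_conf->{notify} // {};

    # 'notify.fencing' controls whether to notify at all; default is to send.
    my $send_notification = $notify_cfg->{fencing} // 1;

    if ($send_notification) {
        # 'notify.target-fencing' overrides the default target (mail-to-root).
        my $target = $notify_cfg->{'target-fencing'} // 'mail-to-root';
        send_fence_notification($target, $subject, $body); # assumed helper
    }
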
Thomas Lamprecht
dfe080bab1 bump version to 4.0.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-13 08:35:56 +02:00
Fiona Ebner
17c6cbeab9 manager: clear stale maintenance node caused by simultaneous cluster shutdown
Currently, the maintenance node for a service is only cleared when the
service is started on another node. In the edge case of a simultaneous
cluster shutdown however, it might be that the service never was
started anywhere else after the maintenance node was recorded, because
the other nodes were already in the process of being shut down too.

If a user ends up in this edge case, it would be rather surprising
that the service would be automatically migrated back to the
"maintenance node" which actually is not in maintenance mode anymore
after a migration away from it.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
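
Sketched in Perl with assumed field and variable names: once the recorded
maintenance node reports a non-maintenance LRM mode again, the manager can
simply forget it so the service is not migrated back:

    if (my $maintenance_node = $sd->{maintenance_node}) {
        my $lrm_mode = $lrm_modes->{$maintenance_node} // '';
        # The node is active again and no longer in maintenance mode, so
        # the recorded maintenance node is stale and must not trigger a
        # migration back to it.
        delete $sd->{maintenance_node} if $lrm_mode ne 'maintenance';
    }
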
Fiona Ebner
a1b9918d30 tests: simulate stale maintenance node caused by simultaneous cluster shutdown
In the test log, it can be seen that the service will unexpectedly be
migrated back. This is caused by the service's maintenance node
property being set by the initial shutdown, but never cleared, because
that currently happens only when the service is started on a different
node. The next commit will address the issue.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
Thomas Lamprecht
eee63557bc bump version to 4.0.1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:41:59 +02:00
Thomas Lamprecht
bf5d92725e d/control: bump versioned dependency for pve-container & qemu-server
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:33:40 +02:00
Fiona Ebner
e0346eccaf resources: pve: avoid relying on internal configuration details
Instead, use the new get_derived_property() method to get the same
information in a way that is robust regarding changes in the
configuration structure.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-09 07:28:29 +02:00
Fiona Ebner
afa1aa9cb8 api: fix/add return description for status endpoint
The fact that no 'items' was specified made the api-viewer throw a
JavaScript exception: retinf.items is undefined

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-07 17:40:48 +02:00
Fiona Ebner
5a9c3a2808 lrm: do not migrate if service already running upon rebalance on start
As reported in the community forum[0], currently, a newly added
service that's already running is shut down, offline migrated and
started again if rebalance selects a new node for it. This is
unexpected.

An improvement would be online migrating the service, but rebalance
is only supposed to happen for a stopped->start transition[1], so the
service should not be migrated at all.

The cleanest solution would be for the CRM to use the state 'started'
instead of 'request_start' for newly added services that are already
running, i.e. restore the behavior from before commit c2f2b9c
("manager: set new request_start state for services freshly added to
HA") for such services. But currently, there is no mechanism for the
CRM to check if the service is already running, because it could be on
a different node. For now, avoiding the migration has to be handled in
the LRM instead. If the CRM ever has access to the necessary
information in the future, the solution mentioned above can be
reconsidered.

Note that the CRM log message relies on the fact that the LRM only
returns the IGNORED status in this case, but it's more user-friendly
than using a generic message like "migration ignored (check LRM
log)".

[0]: https://forum.proxmox.com/threads/125597/
[1]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_crs_scheduling_points

Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ T: split out adding the test to a previous commit so that one can
   see in git what the original bad behavior was and how it's now ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:08:00 +02:00
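
A condensed Perl sketch of the LRM-side check; the command, helper and
constant names are assumptions about the shape of the code, not the actual
identifiers:

    # Rebalance-on-start is only meant for the stopped->start transition;
    # if the freshly added service is already running, skip the migration
    # and report that back to the CRM with the dedicated IGNORED code.
    if ($rebalance_on_start && service_is_running($sid)) {
        $haenv->log('info', "service $sid already running, ignoring rebalance-on-start request");
        return $IGNORED;
    }
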
Thomas Lamprecht
c1aaa05b85 tests: simulate adding running services to HA with rebalance-on-start
Split out from Fiona's original series, to better show what actually
changes with her fix.

Currently, a newly added service that's already running is shut down,
offline migrated and started again if rebalance selects a new node
for it. This is unexpected and should be fixed; encode that behavior
as a test now, still showing the undesired behavior, and fix it in
the next commit.

Originally-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
c0dbab3c32 tools: add IGNORED return code
Will be used to ignore rebalance-on-start when an already running
service is newly added to HA.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
81e8e7d000 sim: hardware: commands: make it possible to add already running service
Will be used in a test for balance on start, where it should make a
difference if the service is running or not.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
b8d86ec48c sim: hardware: commands: fix documentation for add
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Thomas Lamprecht
973bf0324f bump version to 4.0.0
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:27:04 +02:00
Thomas Lamprecht
3de087a57b buildsys: derive upload dist automatically
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
c1b4249bde d/control: raise standards version compliance to 4.6.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
cfe9011673 buildsys: improve DSC target & add sbuild convenience target
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
1b91242ae9 buildsys: make build-dir generation atomic
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
576ae6e7d5 buildsys: rework doc-gen cleanup and makefile inclusion
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
df0c583fc3 buildsys: use full DEB_VERSION and correct DEB_HOST_ARCH
The DEB_HOST_ARCH is the one the package is actually built for, the
DEB_BUILD_ARCH is the one of the build host; having this correct
makes cross-building easier, but otherwise it makes no difference.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
69e37516e9 makefile: convert to use simple parenthesis
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
b0274c4acf bump version to 3.6.1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:16:18 +02:00
Thomas Lamprecht
f129138cb0 lrm: keep manual maintenance mode independent of shutdown policy
We did not handle being in maintenance mode explicitly with shutdown
policies. In practice this is not often an issue, as most users of the
maintenance mode also switched the shutdown policy to 'migrate', which
keeps the maintenance mode. But for all those evaluating HA or only
using the manual maintenance mode, it meant that on shutdown the mode
was set to 'restart' or 'shutdown', which made the active manager think
that the node left the maintenance state again and marked it as online.
As the node wasn't really online (it was on its way to shutdown), this
not only cleared the maintenance mode by mistake, it also had a chance
to cause fencing if any service was still on the node – i.e., if
maintenance mode wasn't reached yet but was still in the process of
moving HA services (guests) away.

Fix that by checking if maintenance mode is requested, or already
active (we currently don't differ those two explicitly, but could be
determined from active service count if required), and avoid changing
the mode in the shutdown and restart case. Log that also explicitly
so admins can understand what happened and why.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:12:07 +02:00
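
Sketched below in Perl with assumed variable names: before applying the
shutdown policy, the LRM checks whether maintenance mode is already
requested or active and, if so, leaves the mode untouched:

    if ($shutdown) { # shutdown or reboot was requested
        if ($self->{mode} eq 'maintenance') {
            # A maintenance request is already pending or active; do not
            # overwrite it with 'restart'/'shutdown', otherwise the active
            # CRM would consider the node online again.
            $haenv->log('info', "shutdown during maintenance mode, keeping the mode");
        } elsif ($shutdown_policy eq 'migrate') {
            $self->{mode} = 'maintenance';
        } else {
            $self->{mode} = $reboot_requested ? 'restart' : 'shutdown';
        }
    }
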
Thomas Lamprecht
f12abfe072 test behavior of maintenance mode with another shutdown policy
Encode what happens if a node is in maintenance and gets shutdown
with a shutdown policy other than 'migrate' (= maintenance mode)
active.

Currently it causes the maintenance mode to be disabled and might even
make fencing possible (if not all services got moved away already).
This will be addressed in the next commit.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:10:18 +02:00
Thomas Lamprecht
26bbff0d55 manager: ensure node-request state transferred to new active CRM
We do not simply take over the full CRM status of the old master when
a new one becomes active; we only take over the most relevant parts
like the node state. But the relatively new node_request entry is also
important, as without that a maintenance state request may get lost
if a new CRM becomes the active master.

Simply copy it over on initial manager construction, if it exists.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:10:18 +02:00
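
A minimal sketch of that copy-over during manager construction, assuming
the previous manager status is available as $old_ms (names are
illustrative):

    # Take over pending node requests (e.g. a requested maintenance mode)
    # from the previously active CRM, so they do not get lost on handover.
    $ms->{node_request} = $old_ms->{node_request}
        if defined($old_ms->{node_request});
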
Thomas Lamprecht
6925144443 test behavior of shutdown with maintenance mode on active master
This encodes the current bad behavior of the maintenance mode getting
lost on an active CRM switch, due to the node-request state not being
transferred. Will be fixed in the next commit.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:10:18 +02:00
Thomas Lamprecht
ef2c0f29f6 lrm: add maintenance to comment about available modes
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:10:18 +02:00
Thomas Lamprecht
3361156205 ha config: code style/indentation cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-20 14:10:18 +02:00
Thomas Lamprecht
f6c61fe8a3 cli: assert that node exists when changing CRS request state
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-04-06 14:09:01 +02:00
Thomas Lamprecht
03f825dbc7 bump version to 3.6.0
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 13:45:36 +01:00
Thomas Lamprecht
4600bf8998 cli: expose new "crm-command node-maintenance enable/disable" commands
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 13:38:23 +01:00
Thomas Lamprecht
989c4c4929 add CRM command to switch an online node manually into maintenance without reboot
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 13:38:23 +01:00
Thomas Lamprecht
279d91c2ec lrm: always give up lock if node went successfully into maintenance
the change as of now is a no-op, as we only ever switched to
maintenance mode on shutdown-request, and there we exited immediately
if no active service and worker were around anyway.

So this is mostly preparing for a manual maintenance mode without any
pending shutdown.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 13:07:45 +01:00
Thomas Lamprecht
73faade519 lrm: factor out check for maintenance-request
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 13:04:11 +01:00
Thomas Lamprecht
0916918022 manager: some code style cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 11:09:01 +01:00
Thomas Lamprecht
314ef2579e request start: allow to auto-rebalance on a new start request
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 11:09:01 +01:00
Thomas Lamprecht
2fdf40f282 manager: select service node: allow to force best-score selection without try-next
useful for re-balancing on start, where we do not want to exclude
the current node like setting the $try_next param does, but also
don't want to favor it like not setting the $try_next param does.

We might want to transform both, `try_next` and `best_scored` into a
single `mode` parameter to reduce complexity and make it more
explicit what we want here.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 11:09:01 +01:00
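
A hedged sketch of the parameter semantics described above; the argument
list is an assumption and the real select_service_node() signature may
differ:

    # $try_next:    exclude the current node and try the next-best one.
    # $best_scored: just pick the best-scoring node, neither excluding nor
    #               favoring the current one (used for rebalance-on-start).
    my $node = select_service_node(
        $groups, $online_node_usage, $sid, $service_conf, $current_node,
        $try_next, $tried_nodes, $maintenance_fallback, $best_scored,
    );
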
Thomas Lamprecht
c2f2b9c62c manager: set new request_start state for services freshly added to HA
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 11:09:01 +01:00
Thomas Lamprecht
4931b58659 manager: add new intermediate state for stop->start transitions
We always check for re-starting a service if it's in the started
state, but for those that go from a (request_)stop to the stopped
state it can be useful to explicitly have a separate transition.

The newly introduced `request_start` state can also be used for CRS
to opt-into starting a service up on a load-wise better suited node
in the future.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 11:09:01 +01:00
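
An illustrative Perl fragment of such a transition; the state-change helper
and the surrounding variable names are assumptions:

    # stopped -> request_start -> started: the intermediate state gives the
    # CRS a hook to pick a (load-wise) better suited node before the
    # service actually gets started.
    if ($last_state eq 'stopped' && $cd->{state} eq 'started') {
        &$change_service_state($self, $sid, 'request_start', node => $sd->{node});
    }
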