mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-03-14 04:58:16 +03:00

820 Commits

Maximiliano Sandoval
5a82a001a5 d/postinst trigger: run systemctl only if systemd is available
Check if systemd is active by testing whether the /run/systemd/system
directory exists, just like debhelper-generated code does, before
running systemctl.

Allows for setting up a package's build dependencies in containers not
managed by systemd.
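
The actual d/postinst is a shell maintainer script; purely as an illustration of the guard described above, here is a minimal Perl sketch of the same check (only the /run/systemd/system path and the systemctl call come from the commit message, the rest is assumed):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only talk to systemd if it actually manages this system, indicated by the
# /run/systemd/system directory existing (the same check debhelper-generated
# maintainer scripts use).
if (-d '/run/systemd/system') {
    system('systemctl', 'daemon-reload') == 0
        or warn "systemctl daemon-reload failed: $?\n";
} else {
    # e.g. a plain container used only to install build dependencies
    print "systemd not running, skipping systemctl\n";
}
```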

Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
 [TL: extend commit message and note that this fixes setting up the
  build env, not the build itself]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2025-03-03 09:44:44 +01:00
Thomas Lamprecht
f3e1f04475 tests: fence config parser: drop unused imports
Similar to commit eec4fab ("test: ha tester: drop unused import")

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2025-01-22 16:20:16 +01:00
Fiona Ebner
e481095f76 test: ha tester: remove trailing whitespace
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2025-01-22 16:11:59 +01:00
Fiona Ebner
eec4fab5f8 test: ha tester: drop unused import
There is no user of File::Path remaining after commit 787b66e
("SimCluster: setup status dir inside new") which was the only user
of remove_tree(). make_path() was not used at all according to git
history.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2025-01-22 16:11:59 +01:00
Thomas Lamprecht
34fe8e59ea bump version to 4.0.6
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
977ae28849 crm: get active if there are nodes that need to leave maintenance
This is mostly cosmetic, because as long as there are configured
services a CRM would get active anyway. But it can happen that a
maintenance mode is left over while all services got removed; a fresh
cluster start will then keep all CRMs idle and thus never clear the
maintenance state.

This can be especially confusing now, as a recent pve-manager commit
993d05abc ("api/ui: include the node ha status in resources call and
show as icon") started to show the maintenance mode as an icon in the
web UI, thus making this blip much more prominent.
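
A minimal sketch of how such an activation check could look; the 'maintenance' node state follows the manager_status conventions mentioned in these commits, but the sub and variable names are assumptions, not the actual CRM code:

```perl
# Illustrative only: become (or stay) active if any node is still marked as
# being in maintenance, so the state gets cleared even when no services are
# configured anymore. Names are assumptions.
sub nodes_need_to_leave_maintenance {
    my ($manager_status) = @_;

    my $node_status = $manager_status->{node_status} // {};
    for my $node (sort keys %$node_status) {
        return 1 if $node_status->{$node} eq 'maintenance';
    }
    return 0;
}
```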

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
73f93a4f6b crm: get active if there are pending CRM commands
But favor the last active CRM to avoid all CRMs trying to get out of
idle at once.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
d0979e6dd0 env: add any_pending_crm_command method
Add a helper that returns whether the CRM command queue holds any
commands without altering the state of the queue at all, unlike the
existing read_crm_commands method does.

This will be used to check if a CRM needs to become active when there
are pending CRM commands but no master seems to process them as all
are idle/offline.
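
A sketch of the non-destructive "peek" such a helper provides, in contrast to read_crm_commands() which also consumes the queue; the file path and names below are assumptions for illustration:

```perl
# Return whether any CRM command is queued, without consuming the queue.
# Path and names are assumptions, not the actual environment code.
sub any_pending_crm_command {
    my ($self) = @_;

    my $filename = '/etc/pve/ha/crm_commands';
    return 0 if !-e $filename;

    open(my $fh, '<', $filename) or return 0;
    my $raw = do { local $/; <$fh> };
    close($fh);

    return (defined($raw) && $raw =~ /\S/) ? 1 : 0;
}
```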

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
afbfa9bafc tests: add more crm idle situations
To test the behavior for when a CRM should get active or stay active
(for a bit longer).

These cases show the status quo, which will be improved on in the next
commits.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
ddd56db346 fix #5243: make CRM go idle after ~15 min of no service being configured
This is mostly for convenience for when one does a quick HA evaluation
and then removes all services again; the biggest visible effect is
that there will be no status updates once the CRM is idle, reducing
some moderately frequent updates to pmxcfs.

As the CRM never lets the watchdog run out pro-actively to trigger
fencing, this won't make much of a difference w.r.t. "accidental"
self-fencing in the common outage situations that happen in practice,
i.e. quorum loss due to a network outage or corosync getting
misconfigured; that's also why this was not considered when adding the
auto-idling for the LRM back in commit 2105170 ("LRM: release lock and
close watchdog if no service configured for >10min").
In short, the watchdog for the CRM is mostly here to avoid a situation
where the process of the currently active CRM hangs, or does not get
scheduled for a while, such that another CRM becomes active, only for
the previous one to then resume and still think it is the active one
and, e.g., write out an outdated manager_status file; there are some
other situations, but the reason is always similar.

Compared to the LRM idle mechanism, we require more rounds for the CRM
to go idle (90 for the CRM vs 60 for the LRM); the reason is that the
LRM needs an active CRM for some operations to progress in the FSM, so
just waiting a bit longer for the CRM is enough to ensure that this
can happen.
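
To illustrate the round counting described above, a rough sketch, assuming one CRM round roughly every 10 seconds so that 90 rounds amount to about 15 minutes; all names here are made up for illustration:

```perl
# Only give up the active role after enough consecutive rounds without any
# configured service or pending work (LRM uses 60 rounds, CRM waits longer).
my $idle_rounds = 0;
my $MAX_IDLE_ROUNDS = 90;

sub update_idle_state {
    my ($has_services, $has_pending_work) = @_;

    if ($has_services || $has_pending_work) {
        $idle_rounds = 0; # there is work, stay active
    } else {
        $idle_rounds++;
    }
    return $idle_rounds >= $MAX_IDLE_ROUNDS; # true => CRM may go idle
}
```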

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
e857e63846 tests: add scenario for CRM going idle
To better show what changes with the future implementation of this
feature in a following commit.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
e5f9b5f6e3 crm: factor out base check if cluster and ha is healthy
This will be reused for the auto-idling mechanism; factor out getting
the manager status too, as it will be used for more specific checks
about being able to go idle or when a CRM should be active.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
af71e0049f crm: factor out giving up the watchdog protection
See the added comment for full details on why the watchdog protection
of the CRM needs less strict safety requirements compared to that of
an (active) LRM.

In short, the CRM does not manage services itself but directs them
through the manager_status state file. This means the watchdog mainly
protects against a hung system where locks would time out before the
state is written out, causing a race with the new CRM. So the CRM can
basically always give up the watchdog safely when it stops being the
active CRM anyway.
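
A minimal sketch of what "giving up the watchdog protection" amounts to; the ha_agent_wd / watchdog_close names loosely mirror the HA environment interface and are assumptions here:

```perl
# Close the CRM's watchdog as soon as this node stops being the active CRM;
# since the CRM only directs services via the manager_status state file, it
# does not need to keep the protection once it gives up the active role.
sub give_up_watchdog_protection {
    my ($self) = @_;

    if ($self->{ha_agent_wd}) {
        $self->{haenv}->watchdog_close($self->{ha_agent_wd});
        delete $self->{ha_agent_wd};
    }
}
```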

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
b4a7a92d57 crm: style clean-up to early line-wrapping
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
3e18cf3405 simulator: document newer crm commands
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Thomas Lamprecht
77cfe89238 api: status: code style clean-up
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-20 21:11:24 +01:00
Aaron Lauterer
f80628066d tools: adapt line lengths of verbose descriptions
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2024-11-11 21:31:01 +01:00
Aaron Lauterer
601d2c542c tools: group verbose desc: mention higher number is higher priority
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2024-11-11 21:31:01 +01:00
Wolfgang Bumiller
800a0c3e48 bump version to 4.0.5
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
2024-06-04 11:10:11 +02:00
Lukas Wagner
f43a6009ff env: notify: use named templates instead of passing template strings
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Tested-by: Max Carrara <m.carrara@proxmox.com>
Reviewed-by: Max Carrara <m.carrara@proxmox.com>
2024-06-03 14:16:35 +02:00
Thomas Lamprecht
822def8250 bump version to 4.0.4
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-22 13:47:22 +02:00
Fabian Grünbichler
8bac62a877 d/postinst: make deb-systemd-invoke non-fatal
else this can break an upgrade for unrelated reasons.

this also more closely mimics the debhelper behaviour (which we only
don't use here because of the lack of reload support) - the snippet was
restructured to be more similar, with an explicit `if`, as well.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2024-04-17 16:56:02 +02:00
Thomas Lamprecht
2db44501bc bump version to 4.0.3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:49:08 +01:00
Lukas Wagner
868d3cd4bb env: switch to matcher-based notification system
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-11-17 14:47:55 +01:00
Thomas Lamprecht
07284f1194 usage stats: tiny code style clean-up
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:47:12 +01:00
Thomas Lamprecht
56d4c7a50a watchdog-mux: code indentation and style cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:46:49 +01:00
Thomas Lamprecht
6548300e33 buildsys: use dpkg default makefile snippet
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-17 14:45:35 +01:00
Fiona Ebner
1c61138341 crs: avoid auto-vivification when adding node to service usage
Part of what caused bug #4984. Make the code future-proof and warn
when the node was never registered in the plugin, similar to what the
'static' usage plugin already does.
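
A sketch of the defensive check described above, warning instead of silently creating a counter for an unknown node; the method signature and fields are simplified assumptions, not the actual 'basic' plugin code:

```perl
# 'basic' usage plugin style: one counter per registered node. Refuse to
# auto-create an entry for a node that was never added via add_node().
sub add_service_usage_to_node {
    my ($self, $nodename, $sid, $service_node) = @_;

    if (!defined($self->{nodes}->{$nodename})) {
        warn "node '$nodename' was never registered with add_node() - ignoring usage of '$sid'\n";
        return;
    }
    $self->{nodes}->{$nodename}++;
}
```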

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: rework commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:27:04 +02:00
Fiona Ebner
c7843a315d fix #4984: manager: add service to migration-target usage only if online
Otherwise, when using the 'basic' plugin, this would lead to
auto-vivification of the $target node in the Perl hash tracking the
usage and it would wrongly be considered online when selecting the
recovery node.

The 'static' plugin was not affected, because it would check and warn
before adding usage to a node that was not registered with add_node()
first. Doing the same in the 'basic' plugin will be done by another
patch.
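
A sketch of the manager-side guard this fix adds, with all names assumed for illustration: only account the migration target when it is online, so a bare increment cannot silently create an entry for an offline node:

```perl
# Names are assumptions, not the actual manager code: only account a service's
# migration target when that node is known to be online, so a bare increment
# cannot silently create ("auto-vivify") an entry for an offline node.
sub account_migration_target {
    my ($usage, $online_nodes, $sid, $sd) = @_;

    my $target = $sd->{target};
    return if !defined($target);
    return if !$online_nodes->{$target};

    $usage->{$target}++; # 'basic' style per-node service count
}
```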

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ TL: shorten commit message subject ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-10-06 12:22:53 +02:00
Lukas Wagner
4cb3b2cf9b manager: send notifications via new notification module
... instead of using sendmail directly.

If the new 'notify.target-fencing' parameter from datacenter config
is set, we use it as a target for notifications. If it is not set,
we send the notification to the default target (mail-to-root).

There is also a new 'notify.fencing' parameter which controls whether
notifications should be sent at all. If it is not set, we
default to the old behavior, which is to send.

Also add a dependency on the `libpve-notify-perl` package to d/control.
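
A sketch of the described fallback logic, using the two datacenter config keys and the mail-to-root default named above; the helper name and the boolean treatment of notify.fencing are simplifications, not the actual PVE::Notify usage:

```perl
# Decide whether and where to send a fencing notification.
# Treating 'notify.fencing' as a simple truthy flag is a simplification.
sub fence_notification_settings {
    my ($datacenter_cfg) = @_;

    my $notify = $datacenter_cfg->{notify} // {};

    my $enabled = !defined($notify->{fencing}) || $notify->{fencing}; # default: send
    my $target = $notify->{'target-fencing'} // 'mail-to-root';       # default target

    return ($enabled, $target);
}
```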

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
2023-08-03 17:34:52 +02:00
Thomas Lamprecht
dfe080bab1 bump version to 4.0.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-13 08:35:56 +02:00
Fiona Ebner
17c6cbeab9 manager: clear stale maintenance node caused by simultaneous cluster shutdown
Currently, the maintenance node for a service is only cleared when the
service is started on another node. In the edge case of a simultaneous
cluster shutdown, however, it might be that the service was never
started anywhere else after the maintenance node was recorded, because
the other nodes were already in the process of being shut down too.

If a user ends up in this edge case, it would be rather surprising
that the service would be automatically migrated back to the
"maintenance node" which actually is not in maintenance mode anymore
after a migration away from it.
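
A sketch of the clean-up idea, with the field and status names assumed for illustration: drop the recorded maintenance node once that node is no longer in maintenance mode, so the service is not migrated back for no reason:

```perl
# $sd is the per-service state, $node_status maps node names to their state.
# All names here are assumptions, not the actual manager code.
sub clear_stale_maintenance_node {
    my ($sd, $node_status) = @_;

    my $fallback = $sd->{maintenance_node};
    return if !defined($fallback);

    if (($node_status->{$fallback} // '') ne 'maintenance') {
        delete $sd->{maintenance_node};
    }
}
```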

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
Fiona Ebner
a1b9918d30 tests: simulate stale maintenance node caused by simultaneous cluster shutdown
In the test log, it can be seen that the service will unexpectedly be
migrated back. This is caused by the service's maintenance node
property being set by the initial shutdown, but never cleared, because
that currently happens only when the service is started on a different
node. The next commit will address the issue.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-13 08:33:52 +02:00
Thomas Lamprecht
eee63557bc bump version to 4.0.1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:41:59 +02:00
Thomas Lamprecht
bf5d92725e d/control: bump versioned dependency for pve-container & qemu-server
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-09 10:33:40 +02:00
Fiona Ebner
e0346eccaf resources: pve: avoid relying on internal configuration details
Instead, use the new get_derived_property() method to get the same
information in a way that is robust regarding changes in the
configuration structure.
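
For context, a sketch of what querying through that interface could look like; the property names ('max-cpu', 'max-memory') and the exact call signature are assumptions about get_derived_property(), not verified against the qemu-server/pve-container API:

```perl
use PVE::QemuConfig; # hypothetical usage, signature assumed

# Ask the guest config class for derived values instead of reading internal
# configuration keys (cores/sockets/memory variants) directly.
sub get_static_stats {
    my ($vmid, $node) = @_;

    my $conf = PVE::QemuConfig->load_config($vmid, $node);

    return {
        maxcpu => PVE::QemuConfig->get_derived_property($conf, 'max-cpu'),
        maxmem => PVE::QemuConfig->get_derived_property($conf, 'max-memory'),
    };
}
```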

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-09 07:28:29 +02:00
Fiona Ebner
afa1aa9cb8 api: fix/add return description for status endpoint
The fact that no 'items' was specified made the api-viewer throw a
JavaScript exception: retinf.items is undefined
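
For reference, the usual shape of such a return schema, with an 'items' member so the api-viewer has something to render; the property names below are placeholders, not the actual status endpoint schema:

```perl
use strict;
use warnings;

# Placeholder properties only; the point is the 'items' schema for the array,
# whose absence made the api-viewer fail with "retinf.items is undefined".
my $returns = {
    type => 'array',
    items => {
        type => 'object',
        properties => {
            id     => { type => 'string' },
            type   => { type => 'string' },
            status => { type => 'string', optional => 1 },
        },
    },
};
```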

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-06-07 17:40:48 +02:00
Fiona Ebner
5a9c3a2808 lrm: do not migrate if service already running upon rebalance on start
As reported in the community forum[0], currently, a newly added
service that's already running is shut down, offline migrated and
started again if rebalance selects a new node for it. This is
unexpected.

An improvement would be online migrating the service, but rebalance
is only supposed to happen for a stopped->start transition[1], so the
service should not be migrated at all.

The cleanest solution would be for the CRM to use the state 'started'
instead of 'request_start' for newly added services that are already
running, i.e. restore the behavior from before commit c2f2b9c
("manager: set new request_start state for services freshly added to
HA") for such services. But currently, there is no mechanism for the
CRM to check if the service is already running, because it could be on
a different node. For now, avoiding the migration has to be handled in
the LRM instead. If the CRM ever has access to the necessary
information in the future, the solution mentioned above can be
reconsidered.

Note that the CRM log message relies on the fact that the LRM only
returns the IGNORED status in this case, but it's more user-friendly
than using a generic message like "migration ignored (check LRM
log)".

[0]: https://forum.proxmox.com/threads/125597/
[1]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_crs_scheduling_points
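
A sketch of the LRM-side behavior described above; the IGNORED code referenced here is the one added in the "tools: add IGNORED return code" commit listed further down, while all other names and the constant values are assumptions:

```perl
use constant { SUCCESS => 0, IGNORED => 2 }; # values are assumptions

# If a migration request stems from rebalance-on-start but the service is
# already running locally, skip it and report IGNORED so the CRM can log a
# specific, user-friendly message.
sub handle_rebalance_on_start {
    my ($sid, $is_running) = @_;

    if ($is_running) {
        warn "ignoring rebalance-on-start for '$sid' - service already running\n";
        return IGNORED;
    }
    # ... otherwise perform the regular (offline) relocation ...
    return SUCCESS;
}
```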

Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
 [ T: split out adding the test into a previous commit so that one can
   see in git what the original bad behavior was and how it behaves now ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:08:00 +02:00
Thomas Lamprecht
c1aaa05b85 tests: simulate adding running services to HA with rebalance-on-start
Split out from Fiona's original series, to better show what actually
changes with her fix.

Currently, a newly added service that's already running is shut down,
offline migrated and started again if rebalance selects a new node
for it. This is unexpected and should be fixed; encode that behavior
as a test now, still showing the undesired behavior, and fix it in
the next commit.

Originally-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
c0dbab3c32 tools: add IGNORED return code
Will be used to ignore rebalance-on-start when an already running
service is newly added to HA.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
81e8e7d000 sim: hardware: commands: make it possible to add already running service
Will be used in a test for balance on start, where it should make a
difference whether the service is running or not.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Fiona Ebner
b8d86ec48c sim: hardware: commands: fix documentation for add
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-06-06 19:05:22 +02:00
Thomas Lamprecht
973bf0324f bump version to 4.0.0
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:27:04 +02:00
Thomas Lamprecht
3de087a57b buildsys: derive upload dist automatically
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
c1b4249bde d/control: raise standards version compliance to 4.6.2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
cfe9011673 buildsys: improve DSC target & add sbuild convenience target
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
1b91242ae9 buildsys: make build-dir generation atomic
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
576ae6e7d5 buildsys: rework doc-gen cleanup and makefile inclusion
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
df0c583fc3 buildsys: use full DEB_VERSION and correct DEB_HOST_ARCH
DEB_HOST_ARCH is the architecture the package is actually built for,
while DEB_BUILD_ARCH is that of the build host; having this correct
makes cross-building easier, but otherwise it makes no difference.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00
Thomas Lamprecht
69e37516e9 makefile: convert to use simple parenthesis
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-24 19:26:27 +02:00