We ignored whether the cluster state update failed and happily worked
with an empty state, resulting in strange actions, e.g., the removal
of all (not so) "stale" services or setting the state of every node
but the master's to unknown.
Check the update result: if it failed, either do not get active, or,
if already active, skip the current round, knowing that we only got
here because the update failed while our lock renew worked => the cfs
is already back in a working and quorate state (probably just a
restart).
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
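A minimal Perl sketch of the control flow described above; the
subroutine and its return values are made up for illustration and are
not the actual CRM code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # hypothetical stand-in for one CRM work round
    sub run_crm_round {
        my ($state_update_ok, $is_active) = @_;

        if (!$state_update_ok) {
            # not active yet: do not get active without a consistent state
            return 'stay idle' if !$is_active;
            # already active: our lock renew worked, so the cfs is quorate
            # again (probably just a pmxcfs restart) - skip this round
            return 'skip round';
        }
        return 'manage services';
    }

    print run_crm_round(0, 0), "\n";   # stay idle
    print run_crm_round(0, 1), "\n";   # skip round
    print run_crm_round(1, 1), "\n";   # manage services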
We updated the CRM and LRM view of the cluster state only in the PVE2
environment, outside of all regression testing and simulation scope.
Further, we ignored whether this update failed and happily worked with
an empty state, resulting in strange actions, e.g., the removal of all
(not so) "stale" services or setting the state of every node but the
master's to unknown.
This patch tries to improve this by moving the update out into its own
environment method, cluster_update_state, calling it in the LRM and
CRM and saving its result.
With the newly introduced functionality to simulate cfs rw or update
errors we can also simulate failures of this state update with the RT
system.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
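A rough Perl sketch of the environment-method pattern; only the method
name cluster_update_state follows the commit text, the package name
and fields are invented:

    #!/usr/bin/perl
    use strict;
    use warnings;

    package My::Sim::Env;

    sub new { return bless { cfs_update_works => 1 }, shift }

    # returns a true value only if refreshing the cluster state view worked
    sub cluster_update_state {
        my ($self) = @_;
        return $self->{cfs_update_works} ? 1 : 0;
    }

    package main;

    my $haenv = My::Sim::Env->new();

    # CRM/LRM side: run the update once per round and save its result
    my $state_ok = $haenv->cluster_update_state();
    print $state_ok ? "cluster state updated\n" : "cluster state update failed\n";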
We called them at similar times anyway, and with this change they are
covered by the regression tests.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Mainly addresses a problem where we read the manager status without
catching any possible exceptions.
This read was done only to check if our node has active fencing jobs,
which would tell us that it makes no sense to even try to acquire the
manager lock - we will be fenced soon anyway.
Besides this check we always checked if we're quorate and if there
are services configured, so move both checks into the new
'can_get_active' method, which replaces the check_pending_fencing and
has_services methods.
Move the quorum check to the front and catch a possible error from
the following manager status read.
As a side effect the state transition code gets a bit shorter without
hiding the intention of the checks.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
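A Perl sketch of a can_get_active-style check as described above; the
manager status layout and the 'fence' node state used here are
simplifying assumptions, not the real PVE::HA data structures:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub can_get_active {
        my ($quorate, $manager_status, $nodename, $service_count) = @_;

        return 0 if !$quorate;          # quorum check comes first
        return 0 if !$service_count;    # nothing configured to manage

        # reading the manager status may die (cfs unavailable), so guard it
        my $pending_fence = eval {
            my $node_status = $manager_status->{node_status} // {};
            return 0 if ($node_status->{$nodename} // '') ne 'fence';
            return 1;
        };
        return 0 if $@;              # could not read the status - stay passive
        return 0 if $pending_fence;  # we get fenced soon, do not get active

        return 1;
    }

    print can_get_active(1, { node_status => { node1 => 'online' } }, 'node1', 2), "\n";
    print can_get_active(1, { node_status => { node1 => 'fence'  } }, 'node1', 2), "\n";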
we may get an error here if the cluster filesystem is (temporarily)
unavailable. This error resulted in stopping the whole CRM service
immediately, which then triggered a node reset (if it happened on the
current master), even if we still had time left to retry and thus, for
example, handle an update of pve-cluster gracefully.
Add a method which wraps the status read in an eval and logs a
possible error, but does not abort the service. Instead we rely on
our get_protected_ha_agent_lock method to detect a problem and switch
to the lost_agent_lock state.
To cover the case where the pmxcfs outage was so short that the
manager status read failed but the lock update already worked again,
we also always update before doing real work in the 'active' state.
If this update fails we return from the eval and try again next
round, as there is no point in doing anything without a consistent
state.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
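A small Perl sketch of the "log but do not abort" wrapper; the reader
callback stands in for the real cfs-backed manager status read:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub read_manager_status_guarded {
        my ($reader) = @_;

        my $status = eval { $reader->() };
        if (my $err = $@) {
            # a temporarily unavailable pmxcfs must not stop the whole CRM;
            # the protected lock update will notice a real problem anyway
            warn "could not read manager status: $err";
            return undef;
        }
        return $status;
    }

    my $ok   = read_manager_status_guarded(sub { return { master => 'node1' } });
    my $fail = read_manager_status_guarded(sub { die "cfs not available\n" });

    print defined($ok)   ? "got status\n" : "no status\n";
    print defined($fail) ? "got status\n" : "skipping this round\n";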
Add simulated hardware commands for the cluster file system.
This allows telling the regression test or simulator system that a
certain node's calls to methods accessing the CFS should fail, i.e.,
die.
With this we can cover situations which mainly happen during a
cluster file system update.
For now allow defining whether the CFS is read-/writable (state rw)
and whether updates of the CFS (state update) should work or fail.
Add 'can read/write' assertions to all relevant methods.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
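A toy Perl model of the simulated CFS state; the rw/update flags
follow the commit text, while the hash layout and assertion name are
made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # per-node simulated cfs state: can it be read/written, do updates work
    my $sim_cfs_state = {
        node1 => { rw => 1, update => 1 },
        node2 => { rw => 0, update => 0 },   # simulate a cfs outage on node2
    };

    sub assert_cfs_can_rw {
        my ($node) = @_;
        die "cfs connection refused - not mounted?\n"
            if !$sim_cfs_state->{$node}{rw};
    }

    for my $node (qw(node1 node2)) {
        eval { assert_cfs_can_rw($node) };
        print $@ ? "$node: cfs access failed: $@" : "$node: cfs access ok\n";
    }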
This was introduced to clean up a possible left-over systemd watchdog
mux enable link, which is gone for good now.
It was then extended with trigger targets, as the HA Manager services
now restart when the pve-api-update trigger fires.
As the autogenerated postinst already does the same unconditionally
for pve-ha-lrm.service and pve-ha-crm.service, we may remove it too.
The only difference is that the auto-generated script uses
try-restart instead of reload-or-try-restart, but this does not
matter, as the HA services currently have no reload ability.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This was copied by accident when adding the transitional code for
removing the leftovers of the systemd-managed watchdog mux in
commit f8a3fc80af299e613c21c9b67e29aee8cc807018
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This transitional code was first added with
commit f8a3fc80af299e613c21c9b67e29aee8cc807018
and fixed up with
commit ecc145c9724f056549e5458f17d7714ac8c83459
during Proxmox VE 4.1 and 4.2 to remove the problematic
systemd-managed watchdog mux socket.
As each system going for a distribution upgrade must first upgrade
to 4.4, where this is handled, we can remove it now.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
as it's a static unit, starting it via dh_systemd_start is not
possible - but it gets pulled in and started by pve-ha-crm/pve-ha-lrm
anyway.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
otherwise it gets confused and enables pve-ha-crm twice in the postinst.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Wrap the calls to the cfs_read_file method, which may now also die
if there was a grave problem reading the file, into eval in all
methods which are used by the HA services.
The ones only used by API calls or CLI helpers are not wrapped, as
there the error can be handled more gracefully (i.e., no watchdog is
running). Further, this is intended as a temporary workaround until
we handle such an exception explicitly in the services - which is a
somewhat bigger change, so let's just go back to the old behavior for
now.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
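A Perl sketch of the eval wrapping; cfs_read_file is stubbed here (in
the real code it comes from PVE::Cluster), and the fallback to an
empty config is an assumption about the old behavior:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # stand-in for PVE::Cluster::cfs_read_file, which may die on read problems
    sub cfs_read_file_stub {
        my ($filename) = @_;
        die "unable to read '$filename' - cfs not available\n"
            if $filename eq 'ha/groups.cfg';   # simulate a grave read problem
        return {};
    }

    sub read_config_for_service {
        my ($filename) = @_;

        # used by the HA services: fall back to an empty config instead of
        # letting the exception propagate and kill the service
        my $cfg = eval { cfs_read_file_stub($filename) };
        if ($@) {
            warn "ignoring error while reading '$filename': $@";
            $cfg = {};
        }
        return $cfg;
    }

    read_config_for_service('ha/resources.cfg');
    read_config_for_service('ha/groups.cfg');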
The check whether a service is configured takes precedence over the
check whether a service is already processed by the manager.
This fixes a bug where a service could be shown as queued even if it
was meant to be ignored.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
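A tiny Perl illustration of the corrected check order; the helper and
its return values are invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub service_status_text {
        my ($configured, $managed) = @_;

        return 'ignored' if !$configured;   # this check now has precedence
        return 'queued'  if !$managed;      # only then report it as queued
        return 'managed';
    }

    print service_status_text(0, 0), "\n";   # ignored, not wrongly "queued"
    print service_status_text(1, 0), "\n";   # queued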
In this state the resource will not get touched by us; all commands
(like start/stop/migrate) go directly to the VM/CT itself and not
through the HA stack.
The resource will not get recovered if its node fails.
Achieve that by simply removing the respective service from the
manager_status service status hash if it is in the ignored state.
Also add the state to the test and simulator hardware.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
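A Perl sketch of the 'ignored' handling; the service configuration and
manager_status layout are simplified for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $service_config = {
        'vm:100' => { state => 'started' },
        'vm:101' => { state => 'ignored' },
    };

    my $manager_status = {
        service_status => {
            'vm:100' => { state => 'started', node => 'node1' },
            'vm:101' => { state => 'started', node => 'node2' },
        },
    };

    for my $sid (keys %{$service_config}) {
        next if $service_config->{$sid}{state} ne 'ignored';
        # resource is ignored: commands go directly to the VM/CT, no recovery
        delete $manager_status->{service_status}{$sid};
    }

    print join(', ', sort keys %{$manager_status->{service_status}}), "\n";  # vm:100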
To ensure that the LRM and CRM services get reloaded when the
pve-api-update trigger gets activated.
Important, as we directly use Perl API modules from qemu-server,
pve-container and pve-common and really want to avoid running
outdated, possibly problematic or deprecated code.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
we must shut all services down when stopping the LRM for a host
shutdown; this can take longer than 95 seconds and should not get
interrupted, to ensure a graceful poweroff.
The watchdog stays active until all services are stopped, so we are
still safe from a freeze or equivalent failure.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
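One way to express such a "do not interrupt the stop job" requirement
in systemd terms is a stop-timeout override; the directive below is
standard systemd, but whether the actual fix used exactly this drop-in
is an assumption:

    # hypothetical drop-in: /etc/systemd/system/pve-ha-lrm.service.d/stop-timeout.conf
    [Service]
    # let the LRM take as long as it needs to cleanly stop all services;
    # the watchdog still protects against a real hang
    TimeoutStopSec=infinity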
Using the nodename in $mailto is not correct and can lead to mails not
being forwarded in restrictive mail server configurations.
Also change $mailfrom to 'root' instead of 'root@localhost', which
results in postfix appending the proper FQDN there, too. As a result
the Delivered-To header reads something like 'root@host.domain.tld'
instead of 'root@localhost', which is more informative and more
consistent.
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
when a stopped VM managed by HA got backed up, the HA stack
continuously tried to shut it down, as check_running returns true as
long as a PID for the VM exists.
As the VM was locked the shutdown tries were blocked, but still a lot
of annoying messages and task spawns happened during the backup
period.
As querying the VM status through the VM monitor is not cheap, first
check if the VM is locked with the backup lock - the config is cached,
so this is quite cheap - and only then query the VM's status over QMP
and check if the VM is in the 'prelaunch' state.
This state is only set if KVM was started with the `-S` option and
has not yet continued guest operation.
Some performance results; I repeated each check 1000 times, the first
number is the total time spent just on the check, the second is the
time per single check:
old check (vm runs): 87.117 ms/total => 87.117 us/loop
new check (runs, no backup): 107.744 ms/total => 107.744 us/loop
new check (runs, backup): 760.337 ms/total => 760.337 us/loop
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
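A Perl sketch of the cheap-check-first order; the cached config hash
and the QMP status callback are stand-ins for the real qemu-server
calls:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub vm_is_really_running {
        my ($conf, $query_qmp_status) = @_;

        # cheap: the cached config tells us if a backup holds the lock
        if (($conf->{lock} // '') eq 'backup') {
            # expensive: only now ask the monitor; 'prelaunch' means KVM was
            # started with -S and has not yet continued guest operation
            return 0 if $query_qmp_status->() eq 'prelaunch';
        }
        return 1;
    }

    my $running = vm_is_really_running({ lock => 'backup' }, sub { 'prelaunch' });
    print $running ? "running\n" : "stopped VM under backup - do not shut down\n";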
Without syncing, the journal could lose logs for a small interval
(ca. 10-60 seconds), but these last seconds are really interesting for
analyzing the cause of a triggered watchdog.
Also, without this the
> "client did not stop watchdog - disable watchdog updates"
message often wasn't flushed to persistent storage, and so some users
had a hard time figuring out why the machine reset.
Use the '--sync' switch of journalctl, which - to quote its man page -
"guarantees that any log messages written before its invocation are
safely stored on disk at the time it returns."
Use execl to call `journalctl --sync` in a child process; do not care
about any error checks or recovery, as we will be reset anyway. This
is just a hit-or-miss attempt to log the situation more consistently;
if it fails we cannot really do anything anyhow.
We call the function at two points:
a) if we exit with active connections, as here the watchdog will be
triggered soon and we want to ensure that this is logged.
b) if a client closes the connection without sending the magic close
byte, as here the watchdog would trigger while we hang in epoll at
the beginning of the loop, so sync the log here as well.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
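The watchdog-mux itself does this in C via fork()/execl(); the same
fire-and-forget idea is expressed here as a Perl sketch for
illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub sync_journal_in_child {
        my $pid = fork();
        return if !defined($pid);   # fork failed - nothing we can do anyway
        if ($pid == 0) {
            # child: flush pending journal messages to persistent storage
            exec('journalctl', '--sync');
            exit(1);                # only reached if exec itself failed
        }
        # parent: no waitpid/error handling on purpose - hit-or-miss logging
    }

    sync_journal_in_child();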
Commit 61ae38eb6fc5ab351fb61f2323776819e20538b7, which ensured that
services get frozen on a node reboot, had a side effect where running
services did not get gracefully shut down on node reboot.
This may lead to data loss, as the services then get hard-killed, or
it may even prevent a node reboot because a storage cannot get
unmounted while a service still accesses it.
This commit addresses the issue but does not change the behavior of
the freeze logic for now; we should evaluate whether a freeze really
makes sense here, or at least make it configurable.
The changed regression test is a result of the fact that we did not
adopt the correct behavior for the is_node_shutdown command in the
problematic commit. The simulation environment returned true on every
node shutdown (reboot and poweroff), while the real-world environment
only returned true if a poweroff happened, not on a reboot.
Now the simulation acts the same way as the real environment.
Further, I moved the simulation implementation to the base class so
that both the simulator and the regression test system behave the
same.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
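A toy Perl version of the aligned is_node_shutdown behavior; the real
implementation inspects the actual shutdown request, not a string
argument:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # report a node shutdown only for a poweroff, as the real-world
    # environment does, and not for a reboot
    sub is_node_shutdown {
        my ($requested_action) = @_;
        return $requested_action eq 'poweroff' ? 1 : 0;
    }

    print "reboot:   ", is_node_shutdown('reboot'),   "\n";   # 0
    print "poweroff: ", is_node_shutdown('poweroff'), "\n";   # 1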
It is important that all storages stop after pve-ha-lrm.
If the storages stop too early, the VM loses its disks and cannot
shut down.
This can end in the node getting fenced.
If a service is in the error state, the only state change command
that makes sense is requesting the disabled state.
Thus abort early on all other commands to improve the user
experience.
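A small Perl sketch of the early abort; the helper name and the error
message wording are invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub check_service_state_change {
        my ($current_state, $requested_state) = @_;

        if ($current_state eq 'error' && $requested_state ne 'disabled') {
            die "service is in an error state and needs manual intervention - "
                . "only a request for the 'disabled' state is allowed\n";
        }
        return 1;
    }

    eval { check_service_state_change('error', 'started') };
    print $@ if $@;
    check_service_state_change('error', 'disabled');
    print "requesting 'disabled' for an errored service is fine\n";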
Change the old enabled/disabled GTK "Switch" element to a ComboBox and
add all possible service states, so we can better simulate the
real-world behaviour with its new states.
As we no longer need to map the boolean switch value to our states,
we may drop the set_service_state method from the RTHardware class
and use the one from the Hardware base class instead.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Most things done by sim_hardware_cmd are already abstracted and
available in both the TestHardware and the RTHardware class.
Abstract out the CRM and LRM control to allow unifying the
sim_hardware_cmd of both classes.
As over the last year it was mostly the regression test system's
TestHardware class that saw new features, use it as the base.
We now return the current status out of the locked context; this
allows updating the simulator's GUI outside of the locked context.
This change increases the power of the HA Simulator, but the newly
possible actions still must be implemented in its GUI. This will be
done in future patches.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Do not allocate the HA Environment every time we fork a new CRM or
LRM, but once for all nodes at the start of the Simulator.
This can be done as the Env does not save any state and thus can be
reused; we do the same in the TestHardware class.
Making the behavior of both Hardware classes more similar allows us
to refactor out some common code in the following commits.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>