mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-01-31 05:47:19 +03:00

802 Commits

Author SHA1 Message Date
Thomas Lamprecht
77bcb60a9d api/status: extra handling of maintenance mode
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-30 19:46:47 +01:00
Thomas Lamprecht
1388fcc1d3 do not mark nodes in maintenance as unknown
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-30 19:31:50 +01:00
Thomas Lamprecht
edd2cee9d6 bump LRM stop_wait_time to an hour
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-29 14:15:11 +01:00
Thomas Lamprecht
1c4cf42733 bump version to 3.0-6
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-26 18:03:32 +01:00
Thomas Lamprecht
1694ce69f9 lrm.service: add after ordering for SSH and pveproxy
To avoid early disconnects during shutdown, ensure we order After
them; on shutdown the ordering is reversed, so we're stopped before
those two - this keeps it possible to check out the node's status and
do SSH work if something fails.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:48:34 +01:00
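
A sketch of the resulting ordering (excerpt only; the exact unit file contents are assumed, not copied from the patch):

    # /lib/systemd/system/pve-ha-lrm.service (excerpt, assumed layout)
    [Unit]
    # systemd reverses After= ordering on shutdown, so the LRM stops
    # before ssh and pveproxy - both stay reachable for debugging
    # while the LRM still winds down services.
    After=pve-cluster.service ssh.service pveproxy.service
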
Thomas Lamprecht
2167dd1e60 do simple fallback if node comes back online from maintenance
We simply remember the node we were on if we got moved for
maintenance. This record gets dropped once we move to _any_ other
node, be it:
* our previous node, as it came back from maintenance
* another node, due to manual migration, group priority changes, or
  fencing

The first point is handled explicitly by this patch. In
select_service_node we check for an old fallback node; if that one is
found in the online node list with top priority, we _always_ move to
it - even if there's no other reason for a move.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:48:34 +01:00
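
A minimal sketch of that fallback check; the field name 'maintenance_node' and the argument layout are assumptions, not the actual pve-ha-manager internals:

    use strict;
    use warnings;

    # Return the remembered maintenance-fallback node, but only if it
    # is online again among the top-priority candidates.
    sub check_maintenance_fallback {
        my ($sd, $top_prio_nodes) = @_;

        my $fallback = $sd->{maintenance_node};
        return undef if !defined($fallback);

        # move back only if the old node is back with top priority
        return $fallback if grep { $_ eq $fallback } @$top_prio_nodes;

        return undef;
    }
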
Thomas Lamprecht
99278e06a8 add 'migrate' node shutdown policy
This adds handling for a new shutdown policy, namely 'migrate'. If
that is set, the LRM doesn't queue stop jobs, but transitions to a
new mode, namely 'maintenance'.

The LRM modes now get passed from the CRM in the NodeStatus update
method; this allows detecting such a mode and making node-status
state transitions. Effectively, we only allow the transition if we're
currently online, else it is ignored. Note that 'maintenance' does
not protect from fencing.

The moving then gets done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and thus also not in
the online_node_usage used to re-calculate whether a service needs to
be moved. Only started services will get moved; this can be done
almost entirely by leveraging existing behavior, the
next_state_started FSM transition method just needs to be taught not
to return early for nodes which are not online but in maintenance
mode.

A few tests, adapted from the other policy tests, are added to
showcase the behavior with reboot, shutdown, and shutdown of the
current manager. They also show the behavior when a service cannot be
migrated, albeit as our test system is limited to simulating at most
9 migration failures, it "seems" to succeed after that. But note that
the maximum retry count would have been hit much earlier, so this is
just an artifact of our test system.

Besides some implementation details, two questions are still not
solved by this approach:
* What if a service cannot be moved away, either due to errors or
  because no alternative node is found by select_service_node?
  - Retry indefinitely; this is what happens currently, and the user
    set this up like that in the first place. We will order SSH and
    pveproxy after the LRM service to ensure there's still the
    possibility for manual intervention.
  - An idea would be to track the time and see if we're stuck (this
    is not too hard); in such a case we could stop the services after
    X minutes and continue.
* A full cluster shutdown. But that is not too ideal even without
  this mode: nodes already get fenced once no partition is quorate
  anymore. And as long as it's just a central setting in the DC
  config, an admin has a single switch to flip to make it work, so
  it's unclear how much handling we want here - once we're past the
  point of having no quorum we're dead anyhow. So this is at least
  not really an issue of this series; orthogonally related, yes, but
  not more.

For real-world usability the datacenter.cfg schema needs to be
changed to allow the 'migrate' shutdown policy, but that's trivial.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:46:09 +01:00
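
For illustration, the datacenter.cfg side of this ends up looking roughly as follows (the 'ha' key with a shutdown_policy sub-option is how it appears in later PVE releases; treat it as an assumption at this stage of the series):

    # /etc/pve/datacenter.cfg
    ha: shutdown_policy=migrate
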
Thomas Lamprecht
5c2eef4b9e account service to source and target during move
As the service load often still happens on the source, and the target
may feel the performance impact of an incoming migration, account the
service to both nodes during that time.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 18:07:38 +01:00
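
A sketch of that double accounting; hash layout and state names are illustrative, not the real manager code:

    use strict;
    use warnings;

    # While a service migrates or relocates, count it against both
    # the source and (if online) the target node.
    sub account_service {
        my ($online_node_usage, $sd) = @_;

        $online_node_usage->{$sd->{node}}++;

        if ($sd->{state} eq 'migrate' || $sd->{state} eq 'relocate') {
            my $target = $sd->{target};
            $online_node_usage->{$target}++
                if defined($target) && exists $online_node_usage->{$target};
        }
    }
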
Thomas Lamprecht
3ac4cc879f manager select_service_node: code cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 17:53:03 +01:00
Thomas Lamprecht
54d808ad80 lrm.service: sort After statements
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 17:53:03 +01:00
Thomas Lamprecht
ad8a5e123a bump version to 3.0-5
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-20 20:14:11 +01:00
Thomas Lamprecht
63a39498f0 d/control: re-add CT/VM dependency
this was an issue for 5.x and initial pre-6.0 releases, and should
now work again as expected.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-20 20:13:36 +01:00
Stefan Reiter
5aac17c6c5 refactor: vm_qmp_command was moved to PVE::QemuServer::Monitor
Also change to the mon_cmd helper, avoiding direct qmp_cmd calls.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2019-11-20 18:25:33 +01:00
Thomas Lamprecht
32ea51ddbe fix #1339: remove more locks from services IF the node got fenced
Remove further locks from a service after it was recovered from a
fenced node. This can be done because the node was fenced, and thus
the operation the service was locked for was interrupted anyway. We
note in the syslog that we removed a lock.

Notably, we still disallow removing the 'create' lock, as that is the
only case where we know the service was not yet in a runnable state
before.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-19 14:13:05 +01:00
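
A sketch of such recovery-time lock cleanup; the exact set of removable locks here is illustrative, only the 'create' exclusion is taken from the commit message:

    use strict;
    use warnings;

    # Locks that only guarded an operation which fencing interrupted
    # anyway may be dropped on recovery; 'create' stays, as the
    # service may never have been in a runnable state yet.
    my %removable_lock = map { ($_ => 1) } qw(backup migrate snapshot rollback);

    sub lock_removable_after_fence {
        my ($lock) = @_;
        return $removable_lock{$lock} // 0;
    }
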
Fabian Grünbichler
6225c47c96 bump version to 3.0-4
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-18 12:17:18 +01:00
Fabian Grünbichler
ef39a1ca5d use PVE::DataCenterConfig
to make sure that the corresponding cfs_read_file() call works.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2019-11-18 12:14:49 +01:00
Thomas Lamprecht
26be7ceaf4 cli stop cmd: fix property desc. indentation
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-14 14:39:30 +01:00
Thomas Lamprecht
2378f1c1b3 bump version to 3.0-3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 17:04:40 +01:00
Thomas Lamprecht
396eb6f0d2 followup, adapt stop request log messages; include SID
it's always good to say that we request it, lest people think the
task should have already been started.

Also include the service ID (SID), so people know what we want(ed) to
stop.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 16:50:39 +01:00
Fabian Ebner
55b5d4ef46 Introduce crm-command to CLI and add stop as a subcommand
This should reduce confusion between the old 'set <sid> --state
stopped' and the new 'stop' command, by making it explicit that the
latter is sent as a CRM command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-11-11 15:56:19 +01:00
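
Usage of the new command then looks like this (the --timeout option only lands with the patches further down; SID and value are placeholders):

    # request a stop through the CRM rather than editing the state
    ha-manager crm-command stop vm:100 --timeout 60
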
Fabian Ebner
21caf0db81 Add crm command 'stop'
Not every command parameter is 'target' anymore, so
it was necessary to modify the parsing of $sd->{cmd}.

Just changing the state to request_stop is not enough,
we need to actually update the service configuration as well.

Also add a simple test for the stop command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 15:55:48 +01:00
Fabian Ebner
e4ef317d1f Add timeout parameter for shutdown
Introduces a timeout parameter for shutting a resource down.
If the parameter is 0, we perform a hard stop instead of a shutdown.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-11 12:25:53 +02:00
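
A sketch of the timeout semantics; the method names on $haenv are illustrative, not the real HA environment interface:

    use strict;
    use warnings;

    # Timeout 0 means an immediate hard stop, anything else a
    # graceful, time-bounded shutdown.
    sub shutdown_resource {
        my ($haenv, $sid, $timeout) = @_;

        if (defined($timeout) && $timeout == 0) {
            $haenv->hard_stop($sid);          # hard stop, no grace period
        } else {
            $haenv->shutdown($sid, $timeout); # graceful shutdown
        }
    }
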
Fabian Ebner
76b83c7207 Add update_service_config to the HA environment interface and simulation
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-11 12:25:53 +02:00
Thomas Lamprecht
c9b21b5a0b followup: s/ss/sc/
fixes: dcb4a2a48404a8bf06df41e071fea348d0c971a4

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 20:25:34 +02:00
Thomas Lamprecht
6e8b0c2254 fix #2241: VM resource: allow migration with local device when not running
qemu-server ignores the flag if the VM is running, so just hardcode
it to true.

People have identical hosts with the same hardware and want to be
able to relocate VMs in such cases, so allow it here - qemu-server
knows to complain if it cannot work, and as nothing bad happens then
(the VM just stays where it is) we can only win, so do it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 20:17:29 +02:00
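
For comparison, the manual equivalent is qm migrate's force flag, which qemu-server likewise only honors for stopped VMs (VM ID and node name are placeholders):

    # offline-migrate a VM that uses local devices
    qm migrate 100 targetnode --force
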
Thomas Lamprecht
dcb4a2a484 get_verbose_service_state: render removal transition as 'deleting'
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 19:11:46 +02:00
Thomas Lamprecht
fbda265807 fix #1919, #1920: improve handling zombie (without node) services
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 18:58:26 +02:00
Thomas Lamprecht
b6da6101a4 read_and_check_resources_config: remove dead if branch
we only reach the if (!$vmd) check if the previous
if (my $vmd = $vmlist->{ids}->{$name}) branch was taken, which means
$vmd is always true there.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 18:58:26 +02:00
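
Condensed, the removed branch has this shape (structure as per the commit message, data hypothetical):

    use strict;
    use warnings;

    my $vmlist = { ids => { 100 => { node => 'node1' } } };
    my $name = 100;

    if (my $vmd = $vmlist->{ids}->{$name}) {
        # we only get here if $vmd is true, so this can never fire
        if (!$vmd) {
            die "unreachable";
        }
    }
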
Thomas Lamprecht
41236dcf61 LRM shutdown: factor out shutdown type to reuse message
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 17:55:09 +02:00
Thomas Lamprecht
a19f2576aa LRM shutdown request: propagate if we could not write out LRM status
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 17:51:21 +02:00
Fabian Ebner
1d9316ef54 factor out resource config update from api to HA::Config
This makes it easier to update the resource configuration from within the CRM/LRM stack,
which is needed for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-04 17:20:25 +02:00
Fabian Ebner
b94b478580 Rename target to param in simulation
In preparation for introducing a stop command with a timeout parameter.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-30 17:14:10 +02:00
Fabian Ebner
3ac1ee6b24 Make parameters for LRM resource commands more flexible
This will allow new parameters besides 'target' to be used, in
preparation for a 'timeout' parameter for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-30 16:49:08 +02:00
Fabian Ebner
3d42b01bf0 Cleanup
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-26 15:11:13 +02:00
Fabian Ebner
014cf130a9 Whitespace cleanup
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-26 15:09:46 +02:00
Thomas Lamprecht
58500679bc bump version to 3.0-2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-07-11 19:27:27 +02:00
Thomas Lamprecht
9db3786bad buildsys: use DEB_VERSION_UPSTREAM for builddir
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-07-11 19:23:51 +02:00
Rhonda D'Vine
3cf61f860a Add missing Dependencies to pve-ha-simulator
Adding these two missing dependencies makes it possible to install
the package on a stock Debian system (without PVE).

Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
2019-06-27 22:10:22 +02:00
Christian Ebner
993af4280e fix #2234: fix typo in service description
replace Ressource by Resource

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
2019-06-12 10:50:46 +02:00
Thomas Lamprecht
b9828b7552 services: update PIDFile to point directly to /run
fixes a complaint from systemd:
> PIDFile= references path below legacy directory /var/run/

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-26 15:16:15 +02:00
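
The fix per unit is a one-line change (pid file name assumed for illustration):

    # before: triggers the legacy-directory complaint
    PIDFile=/var/run/pve-ha-lrm.pid
    # after: reference /run directly
    PIDFile=/run/pve-ha-lrm.pid
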
Thomas Lamprecht
e7958dd420 buildsys: switch upload dist over to buster
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-23 18:18:16 +02:00
Thomas Lamprecht
bd29ad2938 bump version to 3.0-1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-22 19:18:40 +02:00
Thomas Lamprecht
e72075d069 buildsys: use dpkg-dev makefile helpers for pkg info
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-22 19:11:29 +02:00
Thomas Lamprecht
d9906847c2 handle the case where a node gets fully removed
If an admin removes a node, they may also remove /etc/pve/nodes/NODE
quite soon after that. If the "node really deleted" logic of our
NodeStatus module has not triggered by then (it waits an hour), the
current manager still tries to read the gone node's LRM status, which
results in an exception. Demote this exception to a warning and
return a node state of 'unknown' in such a case.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-10 12:41:19 +02:00
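
A sketch of that demotion from exception to warning; method and state names are stand-ins for the real manager code:

    use strict;
    use warnings;

    # Catch the read failure for a fully removed node and degrade it
    # to a warning plus an 'unknown' node state.
    sub get_node_lrm_status {
        my ($haenv, $node) = @_;

        my $lrm_status = eval { $haenv->read_lrm_status($node) };
        if (my $err = $@) {
            warn "unable to read LRM status of node '$node' - $err";
            return { state => 'unknown' };
        }
        return $lrm_status;
    }
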
Thomas Lamprecht
5d880e15ef coding style cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-10 12:29:49 +02:00
Thomas Lamprecht
42294dfd6b bump version to 2.0-9
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-04 16:27:49 +02:00
Thomas Lamprecht
ea998b07ef service data: only set failed_nodes key if needed
Currently we always set this, and thus each service gets a
  "failed_nodes": null,
entry in the written-out JSON ha/manager_status.

So only set it if needed, which can reduce manager_status quite a bit
with a lot of services.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:52:49 +01:00
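
A sketch of the conditional key; the hash layout is illustrative, not the real status structure:

    use strict;
    use warnings;

    # Only add 'failed_nodes' when there is something to record, so
    # untouched services don't serialize a null entry.
    sub service_status_entry {
        my ($sd) = @_;

        my $entry = { node => $sd->{node}, state => $sd->{state} };
        $entry->{failed_nodes} = $sd->{failed_nodes}
            if defined($sd->{failed_nodes});

        return $entry;
    }
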
Thomas Lamprecht
32ae610b9c partially revert previous unclean commit
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:21:03 +01:00
Thomas Lamprecht
31c1bd1f40 make clean: also clean source tar ball
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:17:03 +01:00
Thomas Lamprecht
5bb0c3daf4 d/control: remove obsolete dh-systemd dependency
We do not need to depend explicitly on dh-systemd as we have a
versioned debhelper dependency with >= 10~, and lintian on buster for
this .dsc even warns:

> build-depends-on-obsolete-package build-depends: dh-systemd => use debhelper (>= 9.20160709)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:03:53 +01:00