mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-01-31 05:47:19 +03:00

802 Commits

Author SHA1 Message Date
Thomas Lamprecht
77bcb60a9d api/status: extra handling of maintenance mode
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-30 19:46:47 +01:00
Thomas Lamprecht
1388fcc1d3 do not mark nodes in maintenance as unknown
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-30 19:31:50 +01:00
Thomas Lamprecht
edd2cee9d6 bump LRM stop_wait_time to an hour
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-29 14:15:11 +01:00
Thomas Lamprecht
1c4cf42733 bump version to 3.0-6
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-26 18:03:32 +01:00
Thomas Lamprecht
1694ce69f9 lrm.service: add after ordering for SSH and pveproxy
To avoid early disconnects during shutdown, ensure we order After
them; on shutdown the ordering is reversed, so we're stopped before
those two - this keeps it possible to check out the node's status and
do SSH work if something fails.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:48:34 +01:00
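
A sketch of the resulting ordering (excerpt only; the exact unit file contents are assumed, not copied from the patch):

    # /lib/systemd/system/pve-ha-lrm.service (excerpt, assumed layout)
    [Unit]
    # systemd reverses After= ordering on shutdown, so the LRM stops
    # before ssh and pveproxy - both stay reachable for debugging
    # while the LRM still winds down services.
    After=pve-cluster.service ssh.service pveproxy.service
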
Thomas Lamprecht
2167dd1e60 do simple fallback if node comes back online from maintenance
We simply remember the node we were on if we got moved for
maintenance. This record gets dropped once we move to _any_ other
node, be it:
* our previous node, as it came back from maintenance
* another node, due to manual migration, group priority changes, or
  fencing

The first point is handled explicitly by this patch. In
select_service_node we check for an old fallback node; if that one is
found in the online node list with top priority, we _always_ move to
it - even if there's no other reason for a move.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:48:34 +01:00
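
A minimal sketch of that fallback check; the field name 'maintenance_node' and the argument layout are assumptions, not the actual pve-ha-manager internals:

    use strict;
    use warnings;

    # Return the remembered maintenance-fallback node, but only if it
    # is online again among the top-priority candidates.
    sub check_maintenance_fallback {
        my ($sd, $top_prio_nodes) = @_;

        my $fallback = $sd->{maintenance_node};
        return undef if !defined($fallback);

        # move back only if the old node is back with top priority
        return $fallback if grep { $_ eq $fallback } @$top_prio_nodes;

        return undef;
    }
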
Thomas Lamprecht
99278e06a8 add 'migrate' node shutdown policy
This adds handling for a new shutdown policy, namely 'migrate'. If
that is set, the LRM doesn't queue stop jobs, but transitions to a
new mode, namely 'maintenance'.

The LRM modes now get passed from the CRM in the NodeStatus update
method; this allows detecting such a mode and making node-status
state transitions. Effectively, we only allow the transition if we're
currently online, else it is ignored. Note that 'maintenance' does
not protect from fencing.

The moving then gets done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and thus also not in
the online_node_usage used to re-calculate whether a service needs to
be moved. Only started services will get moved; this can be done
almost entirely by leveraging existing behavior, the
next_state_started FSM transition method just needs to be taught not
to return early for nodes which are not online but in maintenance
mode.

A few tests, adapted from the other policy tests, are added to
showcase the behavior with reboot, shutdown, and shutdown of the
current manager. They also show the behavior when a service cannot be
migrated, albeit as our test system is limited to simulating at most
9 migration failures, it "seems" to succeed after that. But note that
the maximum retry count would have been hit much earlier, so this is
just an artifact of our test system.

Besides some implementation details, two questions are still not
solved by this approach:
* What if a service cannot be moved away, either due to errors or
  because no alternative node is found by select_service_node?
  - Retry indefinitely; this is what happens currently, and the user
    set this up like that in the first place. We will order SSH and
    pveproxy after the LRM service to ensure there's still the
    possibility for manual intervention.
  - An idea would be to track the time and see if we're stuck (this
    is not too hard); in such a case we could stop the services after
    X minutes and continue.
* A full cluster shutdown. But that is not too ideal even without
  this mode: nodes already get fenced once no partition is quorate
  anymore. And as long as it's just a central setting in the DC
  config, an admin has a single switch to flip to make it work, so
  it's unclear how much handling we want here - once we're past the
  point of having no quorum we're dead anyhow. So this is at least
  not really an issue of this series; orthogonally related, yes, but
  not more.

For real-world usability the datacenter.cfg schema needs to be
changed to allow the 'migrate' shutdown policy, but that's trivial.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:46:09 +01:00
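
For illustration, the datacenter.cfg side of this ends up looking roughly as follows (the 'ha' key with a shutdown_policy sub-option is how it appears in later PVE releases; treat it as an assumption at this stage of the series):

    # /etc/pve/datacenter.cfg
    ha: shutdown_policy=migrate
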
Thomas Lamprecht
5c2eef4b9e account service to source and target during move
As the service load often still happens on the source, and the target
may feel the performance impact of an incoming migration, account the
service to both nodes during that time.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 18:07:38 +01:00
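
A sketch of that double accounting; hash layout and state names are illustrative, not the real manager code:

    use strict;
    use warnings;

    # While a service migrates or relocates, count it against both
    # the source and (if online) the target node.
    sub account_service {
        my ($online_node_usage, $sd) = @_;

        $online_node_usage->{$sd->{node}}++;

        if ($sd->{state} eq 'migrate' || $sd->{state} eq 'relocate') {
            my $target = $sd->{target};
            $online_node_usage->{$target}++
                if defined($target) && exists $online_node_usage->{$target};
        }
    }
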
Thomas Lamprecht
3ac4cc879f manager select_service_node: code cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 17:53:03 +01:00
Thomas Lamprecht
54d808ad80 lrm.service: sort After statements
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 17:53:03 +01:00
Thomas Lamprecht
ad8a5e123a bump version to 3.0-5
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-20 20:14:11 +01:00
Thomas Lamprecht
63a39498f0 d/control: re-add CT/VM dependency
this was an issue for 5.x and initial pre-6.0 releases, and should
now work again as expected.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-20 20:13:36 +01:00
Stefan Reiter
5aac17c6c5 refactor: vm_qmp_command was moved to PVE::QemuServer::Monitor
Also change to the mon_cmd helper, avoiding direct qmp_cmd calls.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2019-11-20 18:25:33 +01:00
Thomas Lamprecht
32ea51ddbe fix #1339: remove more locks from services IF the node got fenced
Remove further locks from a service after it was recovered from a
fenced node. This can be done because the node was fenced, and thus
the operation the service was locked for was interrupted anyway. We
note in the syslog that we removed a lock.

Notably, we still disallow removing the 'create' lock, as that is the
only case where we know the service was not yet in a runnable state
before.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-19 14:13:05 +01:00
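
A sketch of such recovery-time lock cleanup; the exact set of removable locks here is illustrative, only the 'create' exclusion is taken from the commit message:

    use strict;
    use warnings;

    # Locks that only guarded an operation which fencing interrupted
    # anyway may be dropped on recovery; 'create' stays, as the
    # service may never have been in a runnable state yet.
    my %removable_lock = map { ($_ => 1) } qw(backup migrate snapshot rollback);

    sub lock_removable_after_fence {
        my ($lock) = @_;
        return $removable_lock{$lock} // 0;
    }
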
Fabian Grünbichler
6225c47c96 bump version to 3.0-4
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-18 12:17:18 +01:00
Fabian Grünbichler
ef39a1ca5d use PVE::DataCenterConfig
to make sure that the corresponding cfs_read_file() call works.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2019-11-18 12:14:49 +01:00
Thomas Lamprecht
26be7ceaf4 cli stop cmd: fix property desc. indentation
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-14 14:39:30 +01:00
Thomas Lamprecht
2378f1c1b3 bump version to 3.0-3
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 17:04:40 +01:00
Thomas Lamprecht
396eb6f0d2 followup, adapt stop request log messages; include SID
it's always good to say that we request it, lest people think the
task should have already been started.

Also include the service ID (SID), so people know what we want(ed) to
stop.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 16:50:39 +01:00
Fabian Ebner
55b5d4ef46 Introduce crm-command to CLI and add stop as a subcommand
This should reduce confusion between the old 'set <sid> --state
stopped' and the new 'stop' command, by making it explicit that the
latter is sent as a CRM command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-11-11 15:56:19 +01:00
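
Usage of the new command then looks like this (the --timeout option only lands with the patches further down; SID and value are placeholders):

    # request a stop through the CRM rather than editing the state
    ha-manager crm-command stop vm:100 --timeout 60
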
Fabian Ebner
21caf0db81 Add crm command 'stop'
Not every command parameter is 'target' anymore, so
it was necessary to modify the parsing of $sd->{cmd}.

Just changing the state to request_stop is not enough,
we need to actually update the service configuration as well.

Also add a simple test for the stop command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-11 15:55:48 +01:00
Fabian Ebner
e4ef317d1f Add timeout parameter for shutdown
Introduces a timeout parameter for shutting a resource down.
If the parameter is 0, we perform a hard stop instead of a shutdown.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-11 12:25:53 +02:00
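
A sketch of the timeout semantics; the method names on $haenv are illustrative, not the real HA environment interface:

    use strict;
    use warnings;

    # Timeout 0 means an immediate hard stop, anything else a
    # graceful, time-bounded shutdown.
    sub shutdown_resource {
        my ($haenv, $sid, $timeout) = @_;

        if (defined($timeout) && $timeout == 0) {
            $haenv->hard_stop($sid);          # hard stop, no grace period
        } else {
            $haenv->shutdown($sid, $timeout); # graceful shutdown
        }
    }
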
Fabian Ebner
76b83c7207 Add update_service_config to the HA environment interface and simulation
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-11 12:25:53 +02:00
Thomas Lamprecht
c9b21b5a0b followup: s/ss/sc/
fixes: dcb4a2a48404a8bf06df41e071fea348d0c971a4

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 20:25:34 +02:00
Thomas Lamprecht
6e8b0c2254 fix #2241: VM resource: allow migration with local device when not running
qemu-server ignores the flag if the VM is running, so just hardcode
it to true.

People have identical hosts with the same hardware and want to be
able to relocate VMs in such cases, so allow it here - qemu-server
knows to complain if it cannot work, and as nothing bad happens then
(the VM just stays where it is) we can only win, so do it.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 20:17:29 +02:00
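
For comparison, the manual equivalent is qm migrate's force flag, which qemu-server likewise only honors for stopped VMs (VM ID and node name are placeholders):

    # offline-migrate a VM that uses local devices
    qm migrate 100 targetnode --force
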
Thomas Lamprecht
dcb4a2a484 get_verbose_service_state: render removal transition as 'deleting'
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 19:11:46 +02:00
Thomas Lamprecht
fbda265807 fix #1919, #1920: improve handling zombie (without node) services
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 18:58:26 +02:00
Thomas Lamprecht
b6da6101a4 read_and_check_resources_config: remove dead if branch
we only reach the if (!$vmd) check if the previous
if (my $vmd = $vmlist->{ids}->{$name}) branch was taken, which means
$vmd is always true there.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 18:58:26 +02:00
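
Condensed, the removed branch has this shape (structure as per the commit message, data hypothetical):

    use strict;
    use warnings;

    my $vmlist = { ids => { 100 => { node => 'node1' } } };
    my $name = 100;

    if (my $vmd = $vmlist->{ids}->{$name}) {
        # we only get here if $vmd is true, so this can never fire
        if (!$vmd) {
            die "unreachable";
        }
    }
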
Thomas Lamprecht
41236dcf61 LRM shutdown: factor out shutdown type to reuse message
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 17:55:09 +02:00
Thomas Lamprecht
a19f2576aa LRM shutdown request: propagate if we could not write out LRM status
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-10-05 17:51:21 +02:00
Fabian Ebner
1d9316ef54 factor out resource config update from api to HA::Config
This makes it easier to update the resource configuration from within the CRM/LRM stack,
which is needed for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-10-04 17:20:25 +02:00
Fabian Ebner
b94b478580 Rename target to param in simulation
In preparation for introducing a stop command with a timeout parameter.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-30 17:14:10 +02:00
Fabian Ebner
3ac1ee6b24 Make parameters for LRM resource commands more flexible
This will allow new parameters besides 'target' to be used, in
preparation for a 'timeout' parameter for the new 'stop' command.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-30 16:49:08 +02:00
Fabian Ebner
3d42b01bf0 Cleanup
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-26 15:11:13 +02:00
Fabian Ebner
014cf130a9 Whitespace cleanup
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2019-09-26 15:09:46 +02:00
Thomas Lamprecht
58500679bc bump version to 3.0-2
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-07-11 19:27:27 +02:00
Thomas Lamprecht
9db3786bad buildsys: use DEB_VERSION_UPSTREAM for builddir
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-07-11 19:23:51 +02:00
Rhonda D'Vine
3cf61f860a Add missing Dependencies to pve-ha-simulator
Adding these two missing dependencies makes it possible to install
the package on a stock Debian system (without PVE).

Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
2019-06-27 22:10:22 +02:00
Christian Ebner
993af4280e fix #2234: fix typo in service description
replace Ressource by Resource

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
2019-06-12 10:50:46 +02:00
Thomas Lamprecht
b9828b7552 services: update PIDFile to point directly to /run
fixes a complaint from systemd:
> PIDFile= references path below legacy directory /var/run/

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-26 15:16:15 +02:00
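
The fix per unit is a one-line change (pid file name assumed for illustration):

    # before: triggers the legacy-directory complaint
    PIDFile=/var/run/pve-ha-lrm.pid
    # after: reference /run directly
    PIDFile=/run/pve-ha-lrm.pid
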
Thomas Lamprecht
e7958dd420 buildsys: switch upload dist over to buster
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-23 18:18:16 +02:00
Thomas Lamprecht
bd29ad2938 bump version to 3.0-1
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-22 19:18:40 +02:00
Thomas Lamprecht
e72075d069 buildsys: use dpkg-dev makefile helpers for pkg info
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-05-22 19:11:29 +02:00
Thomas Lamprecht
d9906847c2 handle the case where a node gets fully removed
If an admin removes a node, they may also remove /etc/pve/nodes/NODE
quite soon after that. If the "node really deleted" logic of our
NodeStatus module has not triggered by then (it waits an hour), the
current manager still tries to read the gone node's LRM status, which
results in an exception. Demote this exception to a warning and
return a node state of 'unknown' in such a case.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-10 12:41:19 +02:00
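
A sketch of that demotion from exception to warning; method and state names are stand-ins for the real manager code:

    use strict;
    use warnings;

    # Catch the read failure for a fully removed node and degrade it
    # to a warning plus an 'unknown' node state.
    sub get_node_lrm_status {
        my ($haenv, $node) = @_;

        my $lrm_status = eval { $haenv->read_lrm_status($node) };
        if (my $err = $@) {
            warn "unable to read LRM status of node '$node' - $err";
            return { state => 'unknown' };
        }
        return $lrm_status;
    }
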
Thomas Lamprecht
5d880e15ef coding style cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-10 12:29:49 +02:00
Thomas Lamprecht
42294dfd6b bump version to 2.0-9
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-04-04 16:27:49 +02:00
Thomas Lamprecht
ea998b07ef service data: only set failed_nodes key if needed
Currently we always set this, and thus each service gets a
  "failed_nodes": null,
entry in the written-out JSON ha/manager_status.

So only set it if needed, which can reduce manager_status quite a bit
with a lot of services.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:52:49 +01:00
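
A sketch of the conditional key; the hash layout is illustrative, not the real status structure:

    use strict;
    use warnings;

    # Only add 'failed_nodes' when there is something to record, so
    # untouched services don't serialize a null entry.
    sub service_status_entry {
        my ($sd) = @_;

        my $entry = { node => $sd->{node}, state => $sd->{state} };
        $entry->{failed_nodes} = $sd->{failed_nodes}
            if defined($sd->{failed_nodes});

        return $entry;
    }
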
Thomas Lamprecht
32ae610b9c partially revert previous unclean commit
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:21:03 +01:00
Thomas Lamprecht
31c1bd1f40 make clean: also clean source tar ball
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:17:03 +01:00
Thomas Lamprecht
5bb0c3daf4 d/control: remove obsolete dh-systemd dependency
We do not need to depend explicitly on dh-systemd as we have a
versioned debhelper dependency with >= 10~, and lintian on buster for
this .dsc even warns:

> build-depends-on-obsolete-package build-depends: dh-systemd => use debhelper (>= 9.20160709)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-03-30 19:03:53 +01:00