Thomas Lamprecht 99278e06a8 add 'migrate' node shutdown policy
This adds handling for a new shutdown policy, namely "migrate".
If that is set, the LRM doesn't queue stop jobs, but transitions
to a new mode, namely 'maintenance'.

The LRM modes now get passed from the CRM in the NodeStatus update
method; this allows us to detect such a mode and make node-status
state transitions. Effectively, we only allow the transition if the
node is currently online, else it is ignored. Note that 'maintenance'
does not protect from fencing.

The moving then gets done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and thus also not in
online_node_usage, which is used to re-calculate whether a service
needs to be moved. Only started services will get moved; this can be
done almost entirely by leveraging existing behavior, as the
next_state_started FSM state transition method just needs to be taught
not to return early for nodes which are not online but in maintenance
mode.
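
To illustrate the idea only (a rough sketch, not the actual patch; the
helper names and data layout are assumed):

  # next_state_started normally returns early when the service's node is
  # not online; for a node in 'maintenance' we fall through instead, so
  # select_service_node() can pick a target and the migration is queued.
  sub next_state_started_sketch {
      my ($self, $sid, $cd, $sd) = @_;

      my $node_state = $self->{ns}->get_node_state($sd->{node});

      # node gone for real -> wait for fencing, handled elsewhere
      return if $node_state ne 'online' && $node_state ne 'maintenance';

      # node usable or draining: (re)select a node; for a node in
      # maintenance this yields a different, online node
      # ... call select_service_node() and record the migration ...
  }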

A few tests, adapted from the other policy tests, are added to
showcase behavior with reboot, shutdown, and shutdown of the current
manager. They also show the behavior when a service cannot be
migrated; as our test system is limited to simulating at most 9
migration failures, it only "seems" to succeed after that. Note that
the maximum number of retries would have been hit much earlier, so
this is just an artifact of our test system.

Besides some implementation details, two questions are still not
solved by this approach:
* what if a service cannot be moved away, either due to errors or
  because no alternative node is found by select_service_node?
  - retry indefinitely; this is what happens currently, and the user
    set it up like this in the first place. We will order SSH and
    pveproxy after the LRM service to ensure that there is still the
    possibility for manual intervention
  - an idea would be to track the time and see if we're stuck (this is
    not too hard); in such a case we could stop the services after X
    minutes and continue.
* a full cluster shutdown. That is not ideal even without this mode,
  as nodes will already get fenced once no partition is quorate
  anymore. And as long as it's just a central setting in the DC
  config, an admin has a single switch to flip to make it work, so
  it's not clear how much handling we want to do here; once we pass
  the point where we have no quorum we're dead anyhow. So this is at
  least not really an issue of this series; it is orthogonally
  related, but not more.

For real-world usability, the datacenter.cfg schema needs to be
changed to allow the 'migrate' shutdown policy, but that's trivial.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-11-25 19:46:09 +01:00

= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to Pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even to a totally
different cluster stack. So we want:

- possibility to run with any distributed key/value store which provides
  some kind of locking with timeouts (zookeeper, consul, etcd, ..) 

- self fencing using a Linux watchdog device

- implemented in Perl, so that we can use the PVE framework

- only works with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because
we think this is already handled by the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
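
To illustrate the property we rely on, a minimal usage sketch (the
lock helper names here are assumptions, not the real pmxcfs API):

  # The lock is granted for a limited time and must be renewed
  # regularly; if the holder crashes or loses quorum it simply expires,
  # so another node can take it over without manual cleanup.
  sub acquire_cluster_lock { ... }   # placeholder: returns a guard or undef
  sub renew_cluster_lock   { ... }   # placeholder: must run well before expiry

  my $timeout = 120;                 # seconds until an unrenewed lock expires

  if (my $guard = acquire_cluster_lock('ha_manager_lock', $timeout)) {
      # we are allowed to act as long as we keep renewing the lock
      renew_cluster_lock($guard);
  } else {
      # somebody else holds the lock; try again later
  }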

=== Watchdog ===

We need a reliable watchdog mechanism, which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the specified
timeout if we do not update the watchdog. It seems that neither
systemd nor the standard watchdog(8) daemon provides such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
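
For illustration, a client of such a daemon could look roughly like
this (the socket path and the one-byte keep-alive protocol are
assumptions, not the actual implementation):

  use strict;
  use warnings;
  use IO::Socket::UNIX;

  # connect to the local watchdog multiplexer socket (path assumed)
  my $sock = IO::Socket::UNIX->new(
      Type => SOCK_STREAM(),
      Peer => '/run/watchdog-mux.sock',
  ) or die "unable to connect to watchdog daemon: $!\n";

  # keep our slot alive; if we stop sending updates, the daemon lets
  # the hardware watchdog expire and the node reboots
  while (1) {
      $sock->print("U") or die "lost connection to watchdog daemon: $!\n";
      sleep(5);   # update well within the hard watchdog timeout
  }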

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
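
Sketched in simplified pseudo-Perl (the helper functions are
placeholders for the real checks):

  my $nodename = 'node1';   # example node name

  # update the watchdog only while it is safe to keep running services
  while (1) {
      if (node_is_quorate() && hold_ha_agent_lock($nodename)) {
          update_watchdog();
      } else {
          # deliberately do NOT update: if HA resources are still
          # running here, the hardware watchdog reboots the node
          # (self fencing) and the lock can be taken over afterwards
      }

      # once no HA resources run locally, the lock may be given up
      release_ha_agent_lock($nodename) if !count_running_ha_resources();

      sleep(10);
  }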

=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it easier
to learn how the system behaves. We also need a way to run regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure that only one CRM daemon acts in the 'master' role.
That 'master' daemon reads the service configuration file and requests
new service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
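
A rough sketch of one iteration of that control flow (method and
helper names are assumptions):

  # become/stay master via the cluster wide manager lock, read the
  # input, compute new CRM service states, publish them for the LRMs
  if ($haenv->quorate() && $haenv->get_ha_manager_lock()) {
      my $sc = $haenv->read_service_config();    # what the admin configured
      my $ms = $haenv->read_manager_status();    # current global view

      # state machine: derive the next state per service
      # (started, request_stop, migrate, fence, ...)
      my $new_status = compute_service_states($sc, $ms);

      $ms->{service_status} = $new_status;
      $haenv->write_manager_status($ms);         # the LRMs act on this
  }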

=== Service Relocation ===

Some services, like QEMU virtual machines, support live migration, so
the LRM can migrate those services without stopping them (CRM service
state 'migrate').

Most other service types require the service to be stopped and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).

=== Service ordering and colocation constraints ===

So far there are no plans to implement this (although it would be possible).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for 
	      confirmation from LRM.

started:      Service is active and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside
	      quorate cluster partition).

freeze:       Do not touch. We use this state while we reboot a node,
	      or when we restart the LRM daemon.

migrate:      Migrate (live) service to other node.

error:        Service disabled because of LRM errors.
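
In Perl, such a state machine is typically driven by a dispatch table;
a sketch (the handler names are illustrative):

  my $next_state = {
      stopped      => \&next_state_stopped,
      request_stop => \&next_state_request_stop,
      started      => \&next_state_started,
      fence        => \&next_state_fence,
      freeze       => sub { },                   # do not touch the service
      migrate      => \&next_state_migrate,
      error        => \&next_state_error,
  };

  my $handler = $next_state->{ $cd->{state} }
      or die "unknown CRM service state '$cd->{state}'\n";

  # may update $cd->{state}, which the LRMs then see via 'manager_status'
  $handler->($sid, $cd, $sd);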


== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.
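
A simplified sketch of that loop (the helper names are assumptions,
not the actual module):

  my $nodename = 'node1';                          # example node name
  my $ms = read_manager_status();                  # written by the CRM master

  # act on every service the CRM assigned to this node
  for my $sid (keys %{ $ms->{service_status} }) {
      my $req = $ms->{service_status}->{$sid};
      next if $req->{node} ne $nodename;           # not ours

      # run the matching resource command (start/stop/migrate) and
      # report the result back for the CRM to read
      my $result = run_resource_command($sid, $req->{state});
      write_service_status($nodename, $sid, $result);
  }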

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files 
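
A rough sketch of what such a plugin can look like (the method names
below are illustrative, not the exact interface):

  package My::HA::Env::Sketch;

  use strict;
  use warnings;

  sub new { my ($class) = @_; return bless {}, $class }

  sub quorate               { ... }   # node membership / quorum information
  sub get_ha_agent_lock     { ... }   # cluster wide locks with timeouts
  sub release_ha_agent_lock { ... }
  sub get_time              { ... }   # system time (mockable in the test env)
  sub watchdog_open         { ... }   # watchdog interface
  sub watchdog_update       { ... }
  sub read_manager_status   { ... }   # read/write cluster wide status files
  sub write_manager_status  { ... }

  1;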

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster