pve-ha-manager/README

= Proxmox HA Manager =

== Motivation ==

The current HA manager has a bunch of drawbacks:

- no more development (Red Hat moved to pacemaker)

- depends heavily on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack, cman)

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even a totally different
cluster stack. So we want a manager that:

- can run with any distributed key/value store that provides
  some kind of locking with timeouts

- does self-fencing using the Linux watchdog device

- is implemented in Perl, so that we can use the PVE framework

- only handles simple resources like VMs

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
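
To make the requirement concrete, here is a minimal Python sketch of a
lock table with timeouts. It is an in-memory illustration only, not how
'pmxcfs' implements locking; the class name, the method names and the
120 second default timeout are assumptions made for this example.

```python
import time


class TimedLockTable:
    """In-memory stand-in for a cluster wide lock service (hypothetical)."""

    def __init__(self, timeout=120.0, clock=time.monotonic):
        self.timeout = timeout  # seconds a lock stays valid without renewal (assumed value)
        self.clock = clock      # injectable so that tests can control time
        self.locks = {}         # lock name -> (owner, expiry time)

    def acquire(self, name, owner):
        """Acquire or renew `name` for `owner`; return True on success."""
        now = self.clock()
        holder = self.locks.get(name)
        if holder is not None:
            current_owner, expires = holder
            # Another owner holds a lock that has not timed out yet.
            if current_owner != owner and expires > now:
                return False
        self.locks[name] = (owner, now + self.timeout)
        return True

    def release(self, name, owner):
        """Release `name` if `owner` currently holds it; return True on success."""
        holder = self.locks.get(name)
        if holder is not None and holder[0] == owner:
            del self.locks[name]
            return True
        return False
```

The important property is that a lock which is not renewed becomes
available to other nodes once its timeout has passed, even if the
original holder never releases it.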

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
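
The rule can be sketched as a small simulation. This is a simplified
Python model, not the real daemon: the class name, the 'tick' method and
the 60 second watchdog timeout are assumptions, and a real watchdog
resets the machine in hardware instead of setting a flag.

```python
class SelfFencingNode:
    """Toy model of one node's watchdog handling (hypothetical)."""

    def __init__(self, name, watchdog_timeout=60):
        self.name = name
        self.watchdog_timeout = watchdog_timeout  # assumed value, in seconds
        self.last_update = None  # time of the last watchdog update, None if never updated
        self.fenced = False      # True once the simulated watchdog has fired

    @property
    def lock_name(self):
        # One separate lock per node, named as described above.
        return f"ha_agent_{self.name}_lock"

    def tick(self, now, quorate, got_lock):
        """Run one loop iteration at time `now`."""
        if self.fenced:
            return
        if quorate and got_lock:
            # The node is quorate and holds its lock: keep the watchdog alive.
            self.last_update = now
        elif (self.last_update is not None
              and now - self.last_update >= self.watchdog_timeout):
            # No update for a full timeout period: the watchdog resets the node.
            self.fenced = True
```

In this model a node that loses quorum, or fails to get its lock, simply
stops updating the watchdog, and the reset follows once the timeout has
elapsed.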

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it easier
to learn how the system behaves. We also need a way to run regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure only one CRM daemon acts in the 'master' role. That
'master' daemon reads the service configuration file, and requests new
service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for 
	      confirmation from LRM.

started:      Service is active and the LRM should start it as soon as possible.

fence:        Wait for node fencing (service node is not inside
	      quorate cluster partition).

migrate:      Migrate VM to other node

error:        Service disabled because of LRM errors.
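
For illustration, the states above can be written down as a Python
enumeration. This is only a sketch for readers; the names 'ServiceState',
'WAITING_STATES' and 'parse_state' are hypothetical and do not exist in
the Perl code.

```python
from enum import Enum


class ServiceState(Enum):
    """The CRM service states listed above, keyed by their names."""
    STOPPED = "stopped"
    REQUEST_STOP = "request_stop"
    STARTED = "started"
    FENCE = "fence"
    MIGRATE = "migrate"
    ERROR = "error"


# States described above as waiting for something:
# an LRM confirmation (request_stop) or node fencing (fence).
WAITING_STATES = {ServiceState.REQUEST_STOP, ServiceState.FENCE}


def parse_state(name):
    """Convert a state name into a ServiceState, rejecting unknown names."""
    try:
        return ServiceState(name)
    except ValueError:
        raise ValueError(f"unknown CRM service state: {name!r}") from None
```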

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. Each LRM holds a cluster wide
'ha_agent_${node}_lock' lock, and the CRM is not allowed to assign
the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to the 'service_${node}_status', and can be
read by the CRM.
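
One iteration of that loop might look like the following Python sketch.
The layout of 'manager_status' (a mapping from a service ID to its node
and requested state), the service IDs, the 'run_command' callback and
the mapping from states to commands are all assumptions made for this
example, not the real data format.

```python
def lrm_step(node, manager_status, run_command):
    """Apply the requested states for `node` and return its service status."""
    # Assumed mapping from a requested CRM state to the command the LRM runs.
    commands = {"started": "start", "request_stop": "stop", "migrate": "migrate"}
    service_status = {}
    for sid, entry in sorted(manager_status.items()):
        if entry["node"] != node:
            continue  # only handle services assigned to the local node
        command = commands.get(entry["state"])
        if command is None:
            continue  # nothing to do, e.g. for 'stopped', 'fence' or 'error'
        ok = run_command(command, sid)
        service_status[sid] = {"command": command, "ok": ok}
    # The caller would write this result to 'service_${node}_status'.
    return service_status
```

Passing the command runner in as a callback keeps the step easy to test,
because a test can record the requested commands without touching any
real VM.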

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files 

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
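
The shape of such a pluggable interface can be sketched in Python as an
abstract base class with one toy implementation. The method names below
are hypothetical and were chosen to mirror the list of capabilities
above; they are not the method names of PVE::HA::Env, and 'InMemoryEnv'
is not one of the real plugins.

```python
from abc import ABC, abstractmethod


class HAEnv(ABC):
    """Hypothetical Python rendering of the capabilities listed above."""

    @abstractmethod
    def get_node_info(self):
        """Return (list of member node names, quorate flag)."""

    @abstractmethod
    def get_lock(self, name):
        """Try to get the cluster wide lock `name`; return True on success."""

    @abstractmethod
    def release_lock(self, name):
        """Release the cluster wide lock `name` if this node holds it."""

    @abstractmethod
    def get_time(self):
        """Return the current time as seen by this environment."""

    @abstractmethod
    def watchdog_update(self):
        """Update the watchdog."""

    @abstractmethod
    def read_status(self, name):
        """Read the cluster wide status file `name`."""

    @abstractmethod
    def write_status(self, name, data):
        """Write `data` to the cluster wide status file `name`."""


class InMemoryEnv(HAEnv):
    """Toy environment that keeps all state in memory, for tests only."""

    def __init__(self, node, members, now=0):
        self.node = node
        self.members = list(members)
        self.now = now               # simulated time, set by the test
        self.quorate = True
        self.locks = {}              # lock name -> owning node
        self.status = {}             # status file name -> data
        self.watchdog_updates = []   # times at which the watchdog was updated

    def get_node_info(self):
        return list(self.members), self.quorate

    def get_lock(self, name):
        # Take the lock if it is free, then report whether we own it.
        return self.locks.setdefault(name, self.node) == self.node

    def release_lock(self, name):
        if self.locks.get(name) == self.node:
            del self.locks[name]

    def get_time(self):
        return self.now

    def watchdog_update(self):
        self.watchdog_updates.append(self.now)

    def read_status(self, name):
        return self.status.get(name)

    def write_status(self, name, data):
        self.status[name] = data
```

Because the time and the watchdog are part of the interface, a test or
simulator environment can control them and never has to wait for real
time or touch a real watchdog device.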