mirror of
git://git.proxmox.com/git/pve-ha-manager.git
synced 2025-01-18 10:03:53 +03:00
de225e04c4
E.g., pve-ha-manager is our current HA manager, so talking about the "current HA stack" being EOL without mentioning the actually meant `rgmanager` one, got taken up the wrong way by some potential users. Correct that and a few other things, but as there are definitively stuff still out-of-date, or will be in a few months, mention that this is an older readme and refer to the HA reference docs at the top. Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
176 lines
6.3 KiB
Plaintext
176 lines
6.3 KiB
Plaintext
= Proxmox HA Manager =
|
|
|
|
Note that this README got written as early development planning/documentation
|
|
in 2015, even though small updates where made in 2023 it might be a bit out of
|
|
date. For usage documentation see the official reference docs shipped with your
|
|
Proxmox VE installation or, for your convenience, hosted at:
|
|
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html
|
|
|
|
== History & Motivation ==
|
|
|
|
The `rgmanager` HA stack used in Proxmox VE 3.x has a bunch of drawbacks:
|
|
|
|
- no more development (redhat moved to pacemaker)
|
|
- highly depend on old version of corosync
|
|
- complicated code (cause by compatibility layer with older cluster stack
|
|
(cman)
|
|
- no self-fencing
|
|
|
|
For Proxmox VE 4.0 we thus required a new HA stack and also wanted to make HA
|
|
easier for our users while also making it possible to move to newest corosync,
|
|
or even a totally different cluster stack. So, the following core requirements
|
|
got set out:
|
|
|
|
- possibility to run with any distributed key/value store which provides some
|
|
kind of locking with timeouts (zookeeper, consul, etcd, ..)
|
|
- self fencing using Linux watchdog device
|
|
- implemented in Perl, so that we can use PVE framework
|
|
- only work with simply resources like VMs
|
|
|
|
We dropped the idea to assemble complex, dependent services, because we think
|
|
this is already done with the VM/CT abstraction.
|
|
|
|
== Architecture ==
|
|
|
|
Cluster requirements.
|
|
|
|
=== Cluster wide locks with timeouts ===
|
|
|
|
The cluster stack must provide cluster-wide locks with timeouts.
|
|
The Proxmox 'pmxcfs' implements this on top of corosync.
|
|
|
|
=== Watchdog ===
|
|
|
|
We need a reliable watchdog mechanism, which is able to provide hard
|
|
timeouts. It must be guaranteed that the node reboots within the specified
|
|
timeout if we do not update the watchdog. For me it looks that neither
|
|
systemd nor the standard watchdog(8) daemon provides such guarantees.
|
|
|
|
We could use the /dev/watchdog directly, but unfortunately this only
|
|
allows one user. We need to protect at least two daemons, so we write
|
|
our own watchdog daemon. This daemon work on /dev/watchdog, but
|
|
provides that service to several other daemons using a local socket.
|
|
|
|
=== Self fencing ===
|
|
|
|
A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
|
|
lock for each node) before starting HA resources, and the node updates
|
|
the watchdog device once it get that lock. If the node loose quorum,
|
|
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
|
|
longer updated. The node can release the lock if there are no running
|
|
HA resources.
|
|
|
|
This makes sure that the node holds the 'ha_agent_${node}_lock' as
|
|
long as there are running services on that node.
|
|
|
|
The HA manger can assume that the watchdog triggered a reboot when he
|
|
is able to acquire the 'ha_agent_${node}_lock' for that node.
|
|
|
|
==== Problems with "two_node" Clusters ====
|
|
|
|
This corosync options depends on a fence race condition, and only
|
|
works using reliable HW fence devices.
|
|
|
|
Above 'self fencing' algorithm does not work if you use this option!
|
|
|
|
Note that you can use a QDevice, i.e., a external simple (no full corosync
|
|
membership, so relaxed networking) note arbiter process.
|
|
|
|
=== Testing Requirements ===
|
|
|
|
We want to be able to simulate and test the behavior of a HA cluster, using
|
|
either a GUI or a CLI. This makes it easier to learn how the system behaves. We
|
|
also need a way to run regression tests.
|
|
|
|
== Implementation details ==
|
|
|
|
=== Cluster Resource Manager (class PVE::HA::CRM) ====
|
|
|
|
The Cluster Resource Manager (CRM) daemon runs one each node, but
|
|
locking makes sure only one CRM daemon act in 'master' role. That
|
|
'master' daemon reads the service configuration file, and request new
|
|
service states by writing the global 'manager_status'. That data
|
|
structure is read by the Local Resource Manager, which performs the
|
|
real work (start/stop/migrate) services.
|
|
|
|
==== Service Relocation ====
|
|
|
|
Some services, like a QEMU Virtual Machine, supports live migration.
|
|
So the LRM can migrate those services without stopping them (CRM service state
|
|
'migrate'),
|
|
|
|
Most other service types requires the service to be stopped, and then restarted
|
|
at the other node. Stopped services are moved by the CRM (usually by simply
|
|
changing the service configuration).
|
|
|
|
==== Service Ordering and Co-location Constraints ====
|
|
|
|
There are no to implement this for the initial version but although it would be
|
|
possible and probably should be done for later versions.
|
|
|
|
==== Possible CRM Service States ====
|
|
|
|
stopped: Service is stopped (confirmed by LRM)
|
|
|
|
request_stop: Service should be stopped. Waiting for
|
|
confirmation from LRM.
|
|
|
|
started: Service is active an LRM should start it asap.
|
|
|
|
fence: Wait for node fencing (service node is not inside
|
|
quorate cluster partition).
|
|
|
|
recovery: Service node gets recovered to a new node as it current one was
|
|
fenced. Note that a service might be stuck here depending on the
|
|
group/priority configuration
|
|
|
|
freeze: Do not touch. We use this state while we reboot a node,
|
|
or when we restart the LRM daemon.
|
|
|
|
migrate: Migrate (live) service to other node.
|
|
|
|
relocate: Migrate (stop. move, start) service to other node.
|
|
|
|
error: Service disabled because of LRM errors.
|
|
|
|
|
|
There's also a `ignored` state which tells the HA stack to ignore a service
|
|
completely, i.e., as it wasn't under HA control at all.
|
|
|
|
=== Local Resource Manager (class PVE::HA::LRM) ===
|
|
|
|
The Local Resource Manager (LRM) daemon runs one each node, and
|
|
performs service commands (start/stop/migrate) for services assigned
|
|
to the local node. It should be mentioned that each LRM holds a
|
|
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
|
|
to assign the service to another node while the LRM holds that lock.
|
|
|
|
The LRM reads the requested service state from 'manager_status', and
|
|
tries to bring the local service into that state. The actual service
|
|
status is written back to the 'service_${node}_status', and can be
|
|
read by the CRM.
|
|
|
|
=== Pluggable Interface for Cluster Environment (class PVE::HA::Env) ===
|
|
|
|
This class defines an interface to the actual cluster environment:
|
|
|
|
* get node membership and quorum information
|
|
|
|
* get/release cluster wide locks
|
|
|
|
* get system time
|
|
|
|
* watchdog interface
|
|
|
|
* read/write cluster wide status files
|
|
|
|
We have plugins for several different environments:
|
|
|
|
* PVE::HA::Sim::TestEnv: the regression test environment
|
|
|
|
* PVE::HA::Sim::RTEnv: the graphical simulator
|
|
|
|
* PVE::HA::Env::PVE2: the real Proxmox VE cluster
|
|
|
|
|