= Proxmox HA Manager =

== Motivation ==

The current HA manager has a number of drawbacks:

- no further development (Red Hat moved to Pacemaker)

- highly depends on an old version of corosync

- complicated code (caused by the compatibility layer for the older
  cluster stack, cman)

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even to a totally
different cluster stack. So we want:

- the possibility to run with any distributed key/value store which
  provides some kind of locking with timeouts (zookeeper, consul, etcd, ..)

- self fencing using the Linux watchdog device

- an implementation in Perl, so that we can use the PVE framework

- support for simple resources like VMs only

We dropped the idea to assemble complex, dependent services, because we
think this is already covered by the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.
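
The required semantics are: only one holder at a time, and the lock
expires when the holder stops renewing it within the timeout. The
following stand-alone sketch only illustrates these semantics, using a
local directory and its mtime as a stand-in for the cluster wide store;
the real implementation is provided by 'pmxcfs' on top of corosync:

  #!/usr/bin/perl
  # Illustrative stand-in for a cluster wide lock with timeout.
  use strict;
  use warnings;

  my $lockdir = "/tmp/demo-cluster-lock";  # stand-in for a cluster wide path
  my $timeout = 120;                       # seconds until a stale lock expires

  sub acquire_lock {
      # take over a lock whose holder stopped renewing it in time
      if (-d $lockdir && (time() - (stat($lockdir))[9]) > $timeout) {
          rmdir($lockdir);
      }
      return mkdir($lockdir);              # atomic - only one caller succeeds
  }

  sub renew_lock   { utime(undef, undef, $lockdir) }  # reset the expiry clock
  sub release_lock { rmdir($lockdir) }

  if (acquire_lock()) {
      print "got lock - safe to act as long as we keep renewing it\n";
      renew_lock();
      release_lock();
  } else {
      print "lock busy - another holder is still renewing it\n";
  }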

=== Watchdog ===

We need a reliable watchdog mechanism which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. Neither systemd
nor the standard watchdog(8) daemon seems to provide such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
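
For reference, the sketch below shows the basic /dev/watchdog protocol
such a daemon has to speak on behalf of its clients (keep-alive writes,
plus the 'magic close' before an orderly shutdown). It is not the actual
daemon, which additionally multiplexes this service over a local socket:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Fcntl qw(O_WRONLY);

  my $dev = '/dev/watchdog';

  # Opening the device arms the (hardware or softdog) timer.
  sysopen(my $wd, $dev, O_WRONLY) or die "unable to open $dev - $!\n";

  # Each write resets the timer; if the writes stop, the node reboots
  # once the configured timeout expires.
  for (1 .. 10) {
      syswrite($wd, "\0") or die "watchdog update failed - $!\n";
      sleep(1);
  }

  # 'Magic close': tell the driver this is an orderly shutdown, so that
  # closing the device does not still end in a reboot.
  syswrite($wd, "V");
  close($wd);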

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it holds that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated. The node can release the lock if there are no running
HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
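
A minimal sketch of that loop, with the cluster and watchdog calls
stubbed out; the helper names are illustrative, not the real code:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Stubs standing in for the cluster stack and the watchdog daemon.
  sub have_quorum        { return 1 }
  sub get_ha_agent_lock  { my ($node) = @_; return 1 }
  sub local_ha_resources { return ('vm:100') }
  sub update_watchdog    { print "watchdog updated\n" }

  my $node = 'node1';

  for (1 .. 5) {
      if (have_quorum() && get_ha_agent_lock($node)) {
          # we own our agent lock, so it is safe to run HA resources,
          # and we keep the node alive
          update_watchdog();
      } elsif (!local_ha_resources()) {
          # nothing to protect: release the lock and stop using the
          # watchdog cleanly instead of risking a needless reboot
          print "idle - lock released, watchdog disarmed\n";
      } else {
          # quorum or lock lost while resources are active: stop updating
          # the watchdog and let the node self-fence after the timeout
          print "lost lock or quorum - letting the watchdog expire\n";
      }
      sleep(1);
  }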

=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The above 'self fencing' algorithm does not work if you use this option!

== Testing requirements ==

We want to be able to simulate an HA cluster using a GUI. This makes it
easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure that only one CRM daemon acts in the 'master' role.
That 'master' daemon reads the service configuration file and requests
new service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).
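
Purely as an illustration, the data the CRM writes into 'manager_status'
could look roughly like the following Perl structure; the real schema may
differ in detail:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # illustrative only - not the actual PVE::HA::CRM schema
  my $manager_status = {
      master_node    => 'node1',
      service_status => {
          'vm:100' => { state => 'started', node => 'node1' },
          'vm:101' => { state => 'migrate', node => 'node2', target => 'node3' },
          'vm:102' => { state => 'stopped', node => 'node2' },
      },
  };

  # an LRM would only act on the entries assigned to its own node
  for my $sid (sort keys %{$manager_status->{service_status}}) {
      my $s = $manager_status->{service_status}->{$sid};
      print "$sid: requested state '$s->{state}' on $s->{node}\n";
  }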

=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration, so
the LRM can migrate those services without stopping them (CRM service
state 'migrate').

Most other service types require the service to be stopped and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).
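
A tiny, purely illustrative sketch of that decision; the helper and the
returned keys are made up and are not the actual CRM code:

  use strict;
  use warnings;

  sub relocation_request {
      my ($service_type, $target) = @_;

      # Qemu VMs can be moved while running; other service types get
      # stopped and started again on the target node.
      return $service_type eq 'vm'
          ? { state => 'migrate',      target => $target }
          : { state => 'request_stop', target => $target };
  }

  my $req = relocation_request('vm', 'node3');
  print "requested state: $req->{state}\n";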

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by the LRM).

request_stop: Service should be stopped. Waiting for
              confirmation from the LRM.

started:      Service is active, and the LRM should start it asap
              if it is not already running.

fence:        Wait for node fencing (the service's node is not inside
              the quorate cluster partition).

freeze:       Do not touch. We use this state while we reboot a node,
              or when we restart the LRM daemon.

migrate:      Migrate (live) the service to another node.

error:        Service disabled because of LRM errors.

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node and
performs service commands (start/stop/migrate) for services assigned
to the local node. It should be mentioned that each LRM holds a
cluster wide 'ha_agent_${node}_lock' lock, and the CRM is not allowed
to assign the service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status' and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.
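
A sketch of that reconciliation step, with the service commands stubbed
out; the helper names are illustrative, not the real PVE::HA::LRM API:

  #!/usr/bin/perl
  use strict;
  use warnings;

  sub local_state { return 'stopped' }           # stub: query the local service
  sub run_command { print "executing: @_\n" }    # stub: start/stop/migrate

  sub reconcile {
      my ($sid, $requested) = @_;

      my $current = local_state($sid);
      return if $current eq $requested;          # nothing to do

      if ($requested eq 'started') {
          run_command('start', $sid);
      } elsif ($requested eq 'stopped' || $requested eq 'request_stop') {
          run_command('stop', $sid);
      }
      # the result would then be written back to 'service_${node}_status'
  }

  reconcile('vm:100', 'started');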

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment (see
the sketch after the list):

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files
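
A rough sketch of such an interface as a Perl base class; the package
and method names below are illustrative and are not claimed to match
the real PVE::HA::Env API:

  #!/usr/bin/perl
  use strict;
  use warnings;

  package My::HA::EnvBase;

  sub new {
      my ($class, $nodename) = @_;
      return bless { nodename => $nodename }, $class;
  }

  # node membership and quorum information
  sub quorate { die "implement in subclass\n" }

  # cluster wide locks
  sub get_ha_agent_lock     { die "implement in subclass\n" }
  sub release_ha_agent_lock { die "implement in subclass\n" }

  # time source (a test environment may use simulated time instead)
  sub get_time { return time() }

  # watchdog interface
  sub watchdog_open   { die "implement in subclass\n" }
  sub watchdog_update { die "implement in subclass\n" }
  sub watchdog_close  { die "implement in subclass\n" }

  # cluster wide status files
  sub read_manager_status  { die "implement in subclass\n" }
  sub write_manager_status { die "implement in subclass\n" }

  package main;

  print "current time: ", My::HA::EnvBase->new('node1')->get_time(), "\n";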

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster