mirror of
git://git.proxmox.com/git/pve-ha-manager.git
synced 2025-01-04 09:17:59 +03:00
c15a8b803e
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
= Proxmox HA Manager =

== Motivation ==

The current HA manager has a number of drawbacks:

- no more development (Red Hat moved to pacemaker)

- strong dependency on an old version of corosync

- complicated code (caused by the compatibility layer with the
  older cluster stack (cman))

- no self-fencing

In the future, we want to make HA easier for our users, and it should
be possible to move to the newest corosync, or even to a totally
different cluster stack. So we want:

- the possibility to run with any distributed key/value store which
  provides some kind of locking with timeouts (zookeeper, consul,
  etcd, ..)

- self-fencing using the Linux watchdog device

- an implementation in Perl, so that we can use the PVE framework

- to only work with simple resources like VMs

We dropped the idea of assembling complex, dependent services, because
we think this is already done with the VM abstraction.

= Architecture =

== Cluster requirements ==

=== Cluster wide locks with timeouts ===

The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

=== Watchdog ===

We need a reliable watchdog mechanism which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. To me, it looks
like neither systemd nor the standard watchdog(8) daemon provides
such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.

== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
lock for each node) before starting HA resources, and the node updates
the watchdog device once it gets that lock. If the node loses quorum,
or is unable to get the 'ha_agent_${node}_lock', the watchdog is no
longer updated.
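
The rule described so far (update the watchdog only while quorate and
holding the per-node lock) can be sketched as follows. This is a
minimal Python illustration, not the project's Perl code: 'FakeWatchdog',
'lrm_tick' and the 'try_lock' callback are hypothetical names, and the
fake watchdog merely counts updates instead of talking to a watchdog
daemon.

```python
class FakeWatchdog:
    """Hypothetical stand-in for the watchdog service; it only counts updates."""

    def __init__(self):
        self.updates = 0

    def update(self):
        self.updates += 1


def lrm_tick(node, has_quorum, try_lock, watchdog):
    """Run one loop iteration of the (assumed) self-fencing rule.

    The watchdog is updated only if the node is quorate AND it can
    acquire (or renew) its own 'ha_agent_<node>_lock'. Otherwise the
    watchdog is left alone, so a real hardware timer would eventually
    reboot the node.
    """
    lock_name = f"ha_agent_{node}_lock"
    if has_quorum and try_lock(lock_name):
        watchdog.update()
        return True
    return False


wd = FakeWatchdog()
# Quorate and lock granted: the watchdog gets updated.
lrm_tick("node1", True, lambda name: True, wd)
# Quorum lost: the watchdog is no longer updated.
lrm_tick("node1", False, lambda name: True, wd)
```

In this sketch, skipping the update is the whole fencing mechanism:
nothing has to reach the failed node from outside, it simply stops
feeding its own watchdog.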
The node can release the lock if there are no running HA resources.

This makes sure that the node holds the 'ha_agent_${node}_lock' as
long as there are running services on that node.

The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.

=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only
works with reliable HW fence devices.

The 'self fencing' algorithm above does not work if you use this
option!

== Testing requirements ==

We want to be able to simulate an HA cluster, using a GUI. This makes
it easier to learn how the system behaves. We also need a way to run
regression tests.

= Implementation details =

== Cluster Resource Manager (class PVE::HA::CRM) ==

The Cluster Resource Manager (CRM) daemon runs on each node, but
locking makes sure that only one CRM daemon acts in the 'master' role.
That 'master' daemon reads the service configuration file, and requests
new service states by writing the global 'manager_status'. That data
structure is read by the Local Resource Manager, which performs the
real work (start/stop/migrate services).

=== Service Relocation ===

Some services, like Qemu Virtual Machines, support live migration.
So the LRM can migrate those services without stopping them (CRM
service state 'migrate').

Most other service types require the service to be stopped, and then
restarted on the other node. Stopped services are moved by the CRM
(usually by simply changing the service configuration).

=== Service ordering and colocation constraints ===

So far there are no plans to implement this (although it would be
possible).

=== Possible CRM Service States ===

stopped:      Service is stopped (confirmed by LRM)

request_stop: Service should be stopped. Waiting for
              confirmation from LRM.

started:      Service is active and the LRM should start it asap.

fence:        Wait for node fencing (service node is not inside
              quorate cluster partition).

freeze:       Do not touch.
              We use this state while we reboot a node, or when we
              restart the LRM daemon.

migrate:      Migrate (live) service to other node.

error:        Service disabled because of LRM errors.

== Local Resource Manager (class PVE::HA::LRM) ==

The Local Resource Manager (LRM) daemon runs on each node, and
performs service commands (start/stop/migrate) for services assigned
to the local node. Note that each LRM holds a cluster wide
'ha_agent_${node}_lock' lock, and the CRM is not allowed to assign the
service to another node while the LRM holds that lock.

The LRM reads the requested service state from 'manager_status', and
tries to bring the local service into that state. The actual service
status is written back to 'service_${node}_status', and can be
read by the CRM.

== Pluggable Interface for cluster environment (class PVE::HA::Env) ==

This class defines an interface to the actual cluster environment:

* get node membership and quorum information

* get/release cluster wide locks

* get system time

* watchdog interface

* read/write cluster wide status files

We have plugins for several different environments:

* PVE::HA::Sim::TestEnv: the regression test environment

* PVE::HA::Sim::RTEnv: the graphical simulator

* PVE::HA::Env::PVE2: the real Proxmox VE cluster
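
To illustrate why such an interface makes regression testing easy,
here is a minimal Python sketch of a pluggable environment with an
in-memory test plugin. It is an illustration under stated assumptions,
not the PVE::HA::Env API: the class and method names ('Env', 'SimEnv',
'get_lock', ...) and the 120 second lock timeout are hypothetical, and
the watchdog part of the interface is left out for brevity.

```python
from abc import ABC, abstractmethod


class Env(ABC):
    """Hypothetical environment interface (names are assumptions)."""

    @abstractmethod
    def quorate(self): ...

    @abstractmethod
    def get_time(self): ...

    @abstractmethod
    def get_lock(self, name, owner): ...

    @abstractmethod
    def release_lock(self, name, owner): ...

    @abstractmethod
    def read_status(self, name): ...

    @abstractmethod
    def write_status(self, name, data): ...


class SimEnv(Env):
    """In-memory test environment with a manually advanced clock."""

    LOCK_TIMEOUT = 120  # seconds; an assumed value for this sketch

    def __init__(self):
        self.now = 0            # simulated time, advanced by the test
        self.is_quorate = True
        self.locks = {}         # lock name -> (owner, time of last renewal)
        self.status = {}        # status file name -> data

    def quorate(self):
        return self.is_quorate

    def get_time(self):
        return self.now

    def get_lock(self, name, owner):
        """Acquire or renew a lock; a foreign lock blocks until it times out."""
        held = self.locks.get(name)
        if held is not None:
            holder, renewed = held
            if holder != owner and self.now - renewed < self.LOCK_TIMEOUT:
                return False
        self.locks[name] = (owner, self.now)
        return True

    def release_lock(self, name, owner):
        """Drop the lock, but only if 'owner' currently holds it."""
        held = self.locks.get(name)
        if held is not None and held[0] == owner:
            del self.locks[name]

    def read_status(self, name):
        return self.status.get(name)

    def write_status(self, name, data):
        self.status[name] = data


env = SimEnv()
env.get_lock("ha_agent_node1_lock", "node1")   # node1 holds its agent lock
env.now += SimEnv.LOCK_TIMEOUT                 # node1 stops renewing it
# The lock has expired, so another party can now take it over.
took_over = env.get_lock("ha_agent_node1_lock", "master")
```

Because time is just a counter here, a test can step through a lock
expiry instantly instead of waiting for real timeouts.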