mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2024-12-22 17:34:22 +03:00

add thoughts about watchdog implementation

Dietmar Maurer 2015-02-21 13:42:06 +01:00
parent 71bf7e6b96
commit f02ff212a7

README

@@ -18,7 +18,7 @@ be possible to move to newest corosync, or even a totally different
cluster stack. So we want:
- possible to run with any distributed key/value store which provides
  some kind of locking with timeouts (zookeeper, consul, etcd, ..)
- self fencing using Linux watchdog device
@@ -35,6 +35,18 @@ cluster stack. So we want:
The cluster stack must provide cluster wide locks with timeouts.
The Proxmox 'pmxcfs' implements this on top of corosync.

=== Watchdog ===

We need a reliable watchdog mechanism which is able to provide hard
timeouts. It must be guaranteed that the node reboots within the
specified timeout if we do not update the watchdog. As far as I can
tell, neither systemd nor the standard watchdog(8) daemon provides
such guarantees.

We could use /dev/watchdog directly, but unfortunately this only
allows one user. We need to protect at least two daemons, so we write
our own watchdog daemon. This daemon works on /dev/watchdog, but
provides that service to several other daemons using a local socket.
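
The idea of sharing one watchdog device between several daemons can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the actual daemon: the class name WatchdogMux, the
client names and the 10 second client timeout are made up, the local
socket is left out entirely, and an in-memory buffer stands in for
/dev/watchdog. The only device behaviour relied on is that a write to
/dev/watchdog resets the hardware timer.

```python
import io


class WatchdogMux:
    """Share one watchdog device between several clients (illustrative only)."""

    def __init__(self, device, client_timeout=10.0):
        self.device = device              # file-like stand-in for /dev/watchdog
        self.client_timeout = client_timeout  # assumed value, not from the README
        self.last_update = {}             # client name -> time of last keep-alive
        self.expired = False              # once set, the device is never updated again

    def update(self, client, now):
        """Record a keep-alive from a client; the first call registers it."""
        self.last_update[client] = now

    def remove(self, client):
        """A client that shuts down cleanly is no longer watched."""
        self.last_update.pop(client, None)

    def tick(self, now):
        """Update the hardware watchdog only while every client is on time."""
        if any(now - t > self.client_timeout for t in self.last_update.values()):
            self.expired = True
        if self.expired:
            # Stop updating: the hardware timer runs out and resets the node.
            return False
        self.device.write(b"\0")          # any write resets the hardware timer
        self.device.flush()
        return True


# Hypothetical usage with two protected daemons.
dev = io.BytesIO()                        # stand-in for the real device
mux = WatchdogMux(dev, client_timeout=10.0)
mux.update("daemon-a", now=0.0)
mux.update("daemon-b", now=0.0)
on_time = mux.tick(now=5.0)               # both clients are on time
mux.update("daemon-a", now=8.0)
after_miss = mux.tick(now=12.0)           # daemon-b missed its deadline
```

Note that a late client cannot recover in this sketch: once one
deadline is missed the device is never updated again, which is what
makes the timeout hard.
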
== Self fencing ==

A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
@@ -50,7 +62,6 @@ long as there are running services on that node.
The HA manager can assume that the watchdog triggered a reboot when it
is able to acquire the 'ha_agent_${node}_lock' for that node.
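
The reasoning above can be sketched with a small in-memory lock table.
This is only a model under assumptions: LockTable, node_is_fenced, the
owner name "manager" and the 120 second lock timeout are hypothetical,
and the real locks live in the cluster stack. The sketch also assumes
that the watchdog timeout is shorter than the lock timeout, so that a
node which stopped renewing its lock has already been reset by the
time the lock expires.

```python
class LockTable:
    """In-memory stand-in for cluster wide locks with timeouts (illustrative)."""

    def __init__(self, lock_timeout):
        self.lock_timeout = lock_timeout  # assumed value, chosen by the caller
        self.locks = {}                   # lock name -> (owner, time of last renewal)

    def acquire(self, name, owner, now):
        """Take or renew a lock; fail while another owner's lock is still valid."""
        holder = self.locks.get(name)
        if holder is not None:
            current_owner, renewed = holder
            if current_owner != owner and now - renewed < self.lock_timeout:
                return False
        self.locks[name] = (owner, now)
        return True


def node_is_fenced(table, node, now):
    """Manager side: getting the agent lock means the node stopped renewing it."""
    return table.acquire(f"ha_agent_{node}_lock", "manager", now)


# Hypothetical timeline: node1 takes its lock at t=0 and then stops renewing.
table = LockTable(lock_timeout=120.0)
table.acquire("ha_agent_node1_lock", "node1", now=0.0)
too_early = node_is_fenced(table, "node1", now=60.0)    # lock still valid
fenced = node_is_fenced(table, "node1", now=130.0)      # lock has timed out
```
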
=== Problems with "two_node" Clusters ===

This corosync option depends on a fence race condition, and only