add thoughts about watchdog implementation

2024-12-22 17:34:22 +03:00 · 2015-02-21 13:42:06 +01:00 · 2015-02-21 13:42:06 +01:00 · f02ff212a7
commit f02ff212a7
parent 71bf7e6b96
1 changed files with 13 additions and 2 deletions
--- a/15
+++ b/15
@@ -18,7 +18,7 @@ be possible to move to newest corosync, or even a totally different
 cluster stack. So we want:
 - possible to run with any distributed key/value store which provides
-  some kind of locking with timeouts.
+  some kind of locking with timeouts (zookeeper, consul, etcd, ..) 
 - self fencing using Linux watchdog device
@@ -35,6 +35,18 @@ cluster stack. So we want:
 The cluster stack must provide cluster wide locks with timeouts.
 The Proxmox 'pmxcfs' implements this on top of corosync.
 === Watchdog ===
 We need a reliable watchdog mechanism, which is able to provide hard
 timeouts. It must be guaranteed that the node reboots within the specified
 timeout if we do not update the watchdog. It seems to me that neither
 systemd nor the standard watchdog(8) daemon provides such guarantees.
 We could use /dev/watchdog directly, but unfortunately this only
 allows one user. We need to protect at least two daemons, so we write
 our own watchdog daemon. This daemon works on /dev/watchdog, but
 provides that service to several other daemons using a local socket.
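
The multiplexing idea described above can be sketched roughly as follows.
This is only an illustration of the bookkeeping, not the daemon itself: the
class name WatchdogMux, its method names and the client_timeout parameter are
hypothetical, and the local socket handling and the actual writes to
/dev/watchdog are left out. The assumed rule is that the hardware watchdog is
only updated while every registered client has checked in recently.

```python
import time


class WatchdogMux:
    """Hypothetical bookkeeping for sharing one watchdog among several daemons."""

    def __init__(self, client_timeout, clock=time.monotonic):
        self.client_timeout = client_timeout  # seconds a client may stay silent
        self.clock = clock                    # injectable so the logic is testable
        self.last_seen = {}                   # client id -> time of last keep-alive

    def register(self, client_id):
        # A new client (one connection on the local socket) starts being tracked.
        self.last_seen[client_id] = self.clock()

    def keep_alive(self, client_id):
        # Called whenever a registered client sends its periodic update.
        if client_id not in self.last_seen:
            raise KeyError(client_id)
        self.last_seen[client_id] = self.clock()

    def unregister(self, client_id):
        # Clean shutdown of a client: stop tracking it.
        self.last_seen.pop(client_id, None)

    def expired_clients(self):
        # Clients that have been silent for longer than client_timeout.
        now = self.clock()
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now - seen > self.client_timeout
        )

    def should_update_hardware(self):
        # Only keep the hardware watchdog alive while no client has expired.
        # A real daemon would write to /dev/watchdog when this returns True
        # and simply stop writing otherwise, so the hardware timer fires.
        return not self.expired_clients()
```

A daemon built on this would call should_update_hardware() from its main
loop at an interval shorter than the hardware timeout. Whether a cleanly
disconnected client should be forgotten, as unregister() does here, is a
design choice the notes above do not settle.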
 == Self fencing ==
 A node needs to acquire a special 'ha_agent_${node}_lock' (one separate
@ -50,7 +62,6 @@ long as there are running services on that node.
 The HA manager can assume that the watchdog triggered a reboot when it
 is able to acquire the 'ha_agent_${node}_lock' for that node.
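
As a rough illustration of that reasoning, here is a minimal in-memory sketch
of a lock with a timeout. TimedLockStore, agent_lock and the timeout value
are hypothetical names and parameters chosen for the example; a real setup
would use the cluster-wide locks of the underlying stack instead. The point
shown is only that the manager's acquire succeeds once the node has stopped
renewing its lock for a full timeout.

```python
import time


class TimedLockStore:
    """Hypothetical stand-in for a cluster-wide lock service with timeouts."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds until an unrenewed lock expires
        self.clock = clock      # injectable so the logic is testable
        self.locks = {}         # lock name -> (owner, expiry time)

    def acquire(self, name, owner):
        # Succeeds if the lock is free, expired, or already held by this owner.
        # Acquiring again as the current owner renews the lock.
        now = self.clock()
        holder = self.locks.get(name)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False
        self.locks[name] = (owner, now + self.timeout)
        return True


def agent_lock(node):
    # Lock name pattern taken from the text: 'ha_agent_${node}_lock'.
    return f"ha_agent_{node}_lock"
```

In this sketch the node keeps calling acquire() on its own lock to renew it,
and the manager treats a successful acquire() of the same lock as the signal
that the node is no longer renewing.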
 === Problems with "two_node" Clusters ===
 This corosync option depends on a fence race condition, and only