From 7cdfa499633eeb08daf9242d826e584261d673c8 Mon Sep 17 00:00:00 2001
From: Dietmar Maurer <dietmar@proxmox.com>
Date: Wed, 3 Dec 2014 06:58:38 +0100
Subject: [PATCH] update docu

---
 README | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/README b/README
index 372182d..6b56dd6 100644
--- a/README
+++ b/README
@@ -1,6 +1,50 @@
-= Experimental implementation of a simple HA Manager =
+= Proxmox HA Manager =
 
-- should run with any distributed key/value store (consul, ...)
+== Motivation ==
+
+The current HA manager has a bunch of drawbacks:
+
+- no longer developed (Red Hat moved to Pacemaker)
+
+- highly dependent on an old version of corosync
+
+- complicated code (caused by the compatibility layer with the
+  older cluster stack, cman)
+
+- no self-fencing
+
+In the future, we want to make HA easier for our users, and it should
+be possible to move to the newest corosync, or even a totally different
+cluster stack. So we want:
+
+- the ability to run with any distributed key/value store that provides
+  some kind of locking (with timeouts)
+
+- self-fencing using the Linux watchdog device
+
+- implemented in Perl, so that we can use the PVE framework
 
 - only works with simply resources like VMs
 
+= Architecture =
+
+== Cluster requirements ==
+
+=== Cluster wide locks with timeouts ===
+
+The cluster stack must provide cluster wide locks with timeouts.
+The Proxmox 'pmxcfs' implements this on top of corosync.
+
+== Self fencing ==
+
+A node needs to acquire a special 'agent_lock' (one separate lock for
+each node) before starting HA resources, and the node updates the
+watchdog device once it gets that lock. If the node loses quorum, or is
+unable to get the 'agent_lock', the watchdog is no longer updated. The
+node can release the lock if there are no running HA resources.
+
+This makes sure that the node holds the 'agent_lock' as long as there
+are running services on that node.
+
+The HA manager can assume that the watchdog triggered a reboot when it
+is able to acquire the 'agent_lock' for that node.