[[chapter_ha_manager]]
ifdef::manvolnum[]
ha-manager(1)
=============
:pve-toplevel:

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
High Availability
=================
:pve-toplevel:
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices amplified that dependency,
because people can access the network any time from anywhere. If you
provide such services, it is very important that they are available
most of the time.

We can mathematically define the availability as the ratio of (A) the
total time a service is capable of being used during a given interval
to (B) the length of the interval. It is normally expressed as a
percentage of uptime in a given year.

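The downtime values in the following table follow directly from that
definition; as a short worked example (assuming a 365-day year):

----
downtime per year = (1 - availability) * 365 * 24 * 60 minutes

e.g. 99.99% availability:
(1 - 0.9999) * 365 * 24 * 60 = 52.56 minutes
----
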
.Availability - Downtime per Year
[width="60%",cols="<d,d",options="header"]
|===========================================================
|Availability % |Downtime per year
|99 |3.65 days
|99.9 |8.76 hours
|99.99 |52.56 minutes
|99.999 |5.26 minutes
|99.9999 |31.5 seconds
|99.99999 |3.15 seconds
|===========================================================

There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a
way to detect errors and do failover. This is relatively easy if you
just want to serve read-only web pages. But in general this is
complex, and sometimes impossible, because you cannot modify the
software yourself. The following solutions work without modifying the
software:

* Use reliable ``server'' components

NOTE: Computer components with the same functionality can have
varying reliability numbers, depending on the component quality. Most
vendors sell components with higher reliability as ``server''
components - usually at a higher price.

* Eliminate single points of failure (redundant components)
** use an uninterruptible power supply (UPS)
** use redundant power supplies on the main boards
** use ECC-RAM
** use redundant network hardware
** use RAID for local storage
** use distributed, redundant storage for VM data

* Reduce downtime
** rapidly accessible administrators (24/7)
** availability of spare parts (other nodes in a {pve} cluster)
** automatic error detection (provided by `ha-manager`)
** automatic failover (provided by `ha-manager`)

Virtualization environments like {pve} make it much easier to reach
high availability because they remove the ``hardware'' dependency. They
also make it easy to set up and use redundant storage and network
devices, so if one host fails, you can simply start those services on
another host within your cluster.

Even better, {pve} provides a software stack called `ha-manager`,
which can do that automatically for you. It is able to detect errors
automatically and do automatic failover.

{pve} `ha-manager` works like an ``automated'' administrator. First, you
configure what resources (VMs, containers, ...) it should
manage. `ha-manager` then observes correct functionality, and handles
service failover to another node in case of errors. `ha-manager` can
also handle normal user requests which may start, stop, relocate and
migrate a service.

But high availability comes at a price. High quality components are
more expensive, and making them redundant at least doubles the costs.
Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those
additional costs.

TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. `ha-manager` has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.

Requirements
------------

* at least three cluster nodes (to get reliable quorum)

* shared storage for VMs and containers

* hardware redundancy (everywhere)

* hardware watchdog - if not available we fall back to the
  Linux kernel software watchdog (`softdog`)

* optional hardware fencing devices

[[ha_manager_resources]]
Resources
---------

We call the primary management unit handled by `ha-manager` a
resource. A resource (also called ``service'') is uniquely
identified by a service ID (SID), which consists of the resource type
and a type specific ID, e.g.: `vm:100`. That example would be a
resource of type `vm` (virtual machine) with the ID 100.

For now we have two important resource types - virtual machines and
containers. One basic idea here is that we can bundle related software
into such a VM or container, so there is no need to compose one big
service from other services, as was done with `rgmanager`. In
general, a HA enabled resource should not depend on other resources.

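For example, the VM from the example above can be put under HA
management by registering its SID with `ha-manager` (a minimal sketch;
see the `ha-manager` man page for the exact command syntax):

[source,bash]
ha-manager add vm:100
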
How It Works
------------

This section provides a detailed description of the {pve} HA manager
internals. It describes how the CRM and the LRM work together.

To provide High Availability two daemons run on each node:

`pve-ha-lrm`::

The local resource manager (LRM), which controls the services running on
the local node. It reads the requested states for its services from
the current manager status file and executes the respective commands.

`pve-ha-crm`::

The cluster resource manager (CRM), which makes the cluster wide
decisions. It sends commands to the LRM, processes the results,
and moves resources to other nodes if something fails. The CRM also
handles node fencing.

.Locks in the LRM & CRM
[NOTE]
Locks are provided by our distributed configuration file system (pmxcfs).
They are used to guarantee that each LRM is active once and working. As an
LRM only executes actions when it holds its lock, we can mark a failed node
as fenced if we can acquire its lock. This lets us then recover any failed
HA services securely, without any interference from the now unknown failed
node. This all gets supervised by the CRM, which currently holds the manager
master lock.

Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) is started as a daemon on
boot and waits until the HA cluster is quorate and thus cluster wide
locks are working.

It can be in three states:

wait for agent lock::

The LRM waits for our exclusive lock. This is also used as the idle state
if no service is configured.

active::

The LRM holds its exclusive lock and has services configured.

lost agent lock::

The LRM lost its lock, this means a failure happened and quorum was lost.

After the LRM gets in the active state, it reads the manager status
file in `/etc/pve/ha/manager_status` and determines the commands it
has to execute for the services it owns.
For each command a worker gets started. These workers are running in
parallel and are limited to at most 4 by default. This default setting
may be changed through the datacenter configuration key `max_worker`.
When finished, the worker process gets collected and its result saved for
the CRM.

.Maximum Concurrent Worker Adjustment Tips
[NOTE]
The default value of at most 4 concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the same
time, which can lead to network congestion with slower networks and/or
big (memory wise) services. Ensure that also in the worst case no congestion
happens, and lower the `max_worker` value if needed. On the contrary, if you
have a particularly powerful, high end setup you may also want to increase it.

Each command requested by the CRM is uniquely identifiable by a UID. When
the worker finishes, its result will be processed and written in the LRM
status file `/etc/pve/nodes/<nodename>/lrm_status`. There the CRM may collect
it and let its state machine act on the command's output.

The actions on each service between CRM and LRM are normally always synced.
This means that the CRM requests a state uniquely marked by a UID, the LRM
then executes this action *one time* and writes back the result, which is
also identifiable by the same UID. This is needed so that the LRM does not
execute an outdated command.
The only exceptions are the `stop` and the `error` commands;
these two do not depend on the result produced and are executed
always in the case of the stopped state and once in the case of
the error state.

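Both status files live on the clustered file system and can be inspected
directly on any node when debugging (a small sketch; this assumes the node
name equals the host name, which is the usual setup):

[source,bash]
----
cat /etc/pve/ha/manager_status
cat /etc/pve/nodes/$(hostname)/lrm_status
----
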
.Read the Logs
[NOTE]
The HA Stack logs every action it makes. This helps to understand what
and also why something happens in the cluster. Here it is important to see
what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is and
the same command for the `pve-ha-crm` on the node which is the current master.

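For example, to read both HA daemon logs while testing (using only the
commands named above):

[source,bash]
----
journalctl -u pve-ha-lrm      # on the node(s) where the service runs
journalctl -u pve-ha-crm      # on the node which is the current CRM master
----
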
Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager (`pve-ha-crm`) starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

wait for agent lock::

The CRM waits for our exclusive lock. This is also used as the idle state
if no service is configured.

active::

The CRM holds its exclusive lock and has services configured.

lost agent lock::

The CRM lost its lock, this means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be highly
available and to always try to enforce the wanted state. For example, an
enabled service will be started if it's not running; if it crashes, it will
be started again. Thus the CRM dictates the actions the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out; this happens after 60 seconds.

Configuration
-------------

The HA stack is well integrated into the {pve} API. So, for example,
HA can be configured via the `ha-manager` command line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

All HA configuration files are within `/etc/pve/ha/`, so they get
automatically distributed to the cluster nodes, and all nodes share
the same HA configuration.

Resources
~~~~~~~~~

The resource configuration file `/etc/pve/ha/resources.cfg` stores
the list of resources managed by `ha-manager`. A resource configuration
inside that list looks like this:

----
<sid>:
        <property> <value>
        ...
----

It starts with the service ID followed by a colon. The next lines
contain additional properties:

include::ha-resources-opts.adoc[]

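To inspect the currently configured resources and their runtime state from
the command line, something like the following can be used (the subcommand
names are listed in the `ha-manager` man page):

[source,bash]
----
ha-manager config      # show the HA resource configuration
ha-manager status      # show the current HA manager status
----
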
Groups
~~~~~~

The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
define groups of cluster nodes. A resource can be restricted to run
only on the members of such a group. A group configuration looks like
this:

----
group: <group>
        nodes <node_list>
        <property> <value>
        ...
----

include::ha-groups-opts.adoc[]

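Groups are normally created via the web interface or the CLI rather than by
editing the file directly. A minimal sketch, assuming the `groupadd`
subcommand (analogous to the `groupset` calls shown in the group examples
below):

[source,bash]
ha-manager groupadd mygroup -nodes "node1,node2,node3"
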
Node Power Status
-----------------

If a node needs maintenance, you should first migrate and/or relocate all
services which need to keep running to another node. After that you can
stop the LRM and CRM services. But note that the watchdog triggers if you
stop the LRM while it still has active services.

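A minimal sketch of such a maintenance preparation, assuming one HA managed
VM with the SID `vm:100` (repeat the migration for every HA service on the
node):

[source,bash]
----
ha-manager migrate vm:100 node2       # move the HA service away first
systemctl stop pve-ha-lrm pve-ha-crm  # then stop the HA daemons
----
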
Package Updates
---------------

When updating the ha-manager, you should do one node after the other, never
all at once, for various reasons. First, while we test our software
thoroughly, a bug affecting your specific setup cannot totally be ruled out.
Updating one node after the other and checking the functionality of each node
after finishing the update helps to recover from eventual problems, while
updating all at once could leave you with a broken cluster state and is
generally not good practice.

Also, the {pve} HA stack uses a request acknowledge protocol to perform
actions between the cluster and the local resource manager. For restarting,
the LRM makes a request to the CRM to freeze all its services. This prevents
them from being touched by the cluster during the short time the LRM is
restarting. After that, the LRM may safely close the watchdog during a
restart.
Such a restart happens during an update and, as already stated, an active
master CRM is needed to acknowledge the requests from the LRM. If this is not
the case, the update process can take too long which, in the worst case, may
result in a watchdog reset.

[[ha_manager_fencing]]
Fencing
-------

What is Fencing
~~~~~~~~~~~~~~~

Fencing ensures that, on a node failure, the failed node is rendered unable
to do any damage and that no resource runs twice when it gets recovered from
the failed node. This is a really important task and one of the base
principles to make a system Highly Available.

If a node were not fenced, it would be in an unknown state where it may
still have access to shared resources. This is really dangerous!
Imagine that every network but the storage one broke. Now, while not
reachable from the public network, the VM still runs and writes to the
shared storage. If we would not fence the node and just start up this VM on
another node, we would get dangerous race conditions and atomicity
violations; the whole VM could be rendered unusable. The recovery could also
simply fail if the storage protects against multiple mounts, and thus defeat
the purpose of HA.

How {pve} Fences
~~~~~~~~~~~~~~~~

There are different methods to fence a node, for example fence devices which
cut off the power from the node or disable their communication completely.

Those are often quite expensive and bring additional critical components
into a system, because if they fail you cannot recover any service.

We thus wanted to integrate a simpler method into the HA Manager first,
namely self fencing with watchdogs.

Watchdogs are widely used in critical and dependable systems since the
beginning of microcontrollers. They are often independent and simple
integrated circuits which programs can open and then must report to
periodically. If, for whatever reason, a program becomes unable to do so,
the watchdog triggers a reset of the whole server.

Server motherboards often already include such hardware watchdogs; these
need to be configured. If no watchdog is available or configured, we fall
back to the Linux kernel softdog. While still reliable, it is not
independent of the server's hardware and thus has a lower reliability than
a hardware watchdog.

Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, all watchdog modules are blocked for security reasons, as they
are like a loaded gun if not correctly initialized.
If you have a hardware watchdog available, remove its kernel module from the
blacklist, load it, and restart the `watchdog-mux` service or reboot the
node.

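A minimal sketch of these steps (the module name is only an example for an
Intel TCO watchdog - use the module matching your mainboard and adjust the
blacklist file of your setup first; `modprobe` is used here instead of
`insmod` so module dependencies get resolved):

[source,bash]
----
modprobe iTCO_wdt                 # load the hardware watchdog module
systemctl restart watchdog-mux    # let the watchdog multiplexer pick it up
----
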
Recover Fenced Services
~~~~~~~~~~~~~~~~~~~~~~~

After a node failed and its fencing was successful, we start to recover
services to other available nodes and restart them there so that they can
provide service again.

The selection of the node on which the services get recovered is influenced
by the user's group settings, the currently active nodes, and their
respective active service count.
First we build a set out of the intersection between user selected nodes and
available nodes. Then the subset with the highest priority of those nodes
gets chosen as possible nodes for recovery. We select the node with the
currently lowest active service count as the new node for the service.
That minimizes the possibility of an overload, which otherwise could cause
an unresponsive node and, as a result, a chain reaction of node failures in
the cluster.

[[ha_manager_groups]]
Groups
------

A group is a collection of cluster nodes which a service may be bound to.

Group Settings
~~~~~~~~~~~~~~

nodes::

List of group node members where a priority can be given to each node.
A service bound to this group will run on the available nodes with the
highest priority. If more nodes are in the highest priority class, the
services will get distributed to those nodes if not already there. The
priorities have a relative meaning only.
Example;;
You want to run all services from a group on `node1` if possible. If this
node is not available, you want them to run equally split on `node2` and
`node3`, and if those fail it should use `node4`.
To achieve this you could set the node list to:
[source,bash]
ha-manager groupset mygroup -nodes "node1:2,node2:1,node3:1,node4"

restricted::

Resources bound to this group may only run on nodes defined by the
group. If no group node member is available, the resource will be
placed in the stopped state.
Example;;
Let's say a service uses resources only available on `node1` and `node2`,
so we need to make sure that the HA manager does not use other nodes.
We need to create a 'restricted' group with said nodes:
[source,bash]
ha-manager groupset mygroup -nodes "node1,node2" -restricted

nofailback::

The resource won't automatically fail back when a more preferred node
(re)joins the cluster.
Examples;;
* You need to migrate a service to a node which currently does not have
the highest priority in the group. To tell the HA manager not to move this
service back instantly, set the 'nofailback' option and the service will
stay on the current node.

* A service was fenced and it got recovered to another node. The admin
repaired the node and brought it online again, but does not want the
recovered services to move straight back to the repaired node, as he wants
to first investigate the failure cause and check that it runs stably. He
can use the 'nofailback' option to achieve this.

Start Failure Policy
--------------------

The start failure policy comes into effect if a service failed to start on
a node one or more times. It can be used to configure how often a restart
should be triggered on the same node and how often a service should be
relocated, so that it gets a chance to be started on another node.
The aim of this policy is to circumvent temporary unavailability of shared
resources on a specific node. For example, if a shared storage isn't
available on a quorate node anymore, e.g. due to network problems, but is
still available on other nodes, the relocate policy allows the service to
get started nonetheless.

There are two service start recovery policy settings which can be
configured specifically for each resource (an example of setting both
follows below).

max_restart::

Maximum number of tries to restart a failed service on the actual
node. The default is set to one.

max_relocate::

Maximum number of tries to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
actual node. The default is set to one.

NOTE: The relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.

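Both values can be set per resource, for example via the CLI (a sketch,
assuming the `set` subcommand and an existing resource `vm:100`; the same
properties can also be placed in `resources.cfg`):

[source,bash]
ha-manager set vm:100 -max_restart 2 -max_relocate 2
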
Error Recovery
--------------

If, after all tries, the service state could not be recovered, it gets
placed in an error state. In this state, the service won't get touched
by the HA stack anymore. To recover from this state you should follow
these steps:

* bring the resource back into a safe and consistent state (e.g.,
killing its process)

* disable the HA resource to place it into the stopped state

* fix the error which led to these failures

* *after* you fixed all errors you may enable the service again (see the
CLI sketch after this list)

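A minimal sketch of the disable and enable steps, assuming the `disable`
and `enable` subcommands (see the service operations described below) and
a resource `vm:100`:

[source,bash]
----
ha-manager disable vm:100   # place the resource in the stopped state
# ... fix the underlying problem ...
ha-manager enable vm:100    # let the HA stack manage it again
----
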
[[ha_manager_service_operations]]
Service Operations
------------------

This is how the basic user-initiated service operations (via
`ha-manager`) work.

enable::

The service will be started by the LRM if not already running.

disable::

The service will be stopped by the LRM if running.

migrate/relocate::

The service will be relocated (live) to another node (see the example
below).

remove::

The service will be removed from the HA managed resource list. Its
current state will not be touched.

start/stop::

`start` and `stop` commands can be issued to the resource specific tools
(like `qm` or `pct`). They will forward the request to the
`ha-manager`, which then will execute the action and set the resulting
service state (enabled, disabled).

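For example, a migration request for an HA managed VM is issued like this
(a sketch; `node2` is just a placeholder for the target node):

[source,bash]
ha-manager migrate vm:100 node2
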
Service States
--------------

stopped::

Service is stopped (confirmed by LRM). If detected running, it will get
stopped again.

request_stop::

Service should be stopped. Waiting for confirmation from LRM.

started::

Service is active and the LRM should start it ASAP if not already running.
If the service fails and is detected to be not running, the LRM restarts it.

fence::

Wait for node fencing (service node is not inside the quorate cluster
partition).
As soon as the node gets fenced successfully, the service will be recovered
to another node, if possible.

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon.

migrate::

Migrate service (live) to another node.

error::

Service is disabled because of LRM errors. Needs manual intervention.

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]