mirror of
git://git.proxmox.com/git/pve-docs.git
synced 2025-01-10 01:17:51 +03:00
6d3c0b3479
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
[[chapter_pvecm]]
|
|
ifdef::manvolnum[]
|
|
pvecm(1)
|
|
========
|
|
:pve-toplevel:
|
|
|
|
NAME
|
|
----
|
|
|
|
pvecm - Proxmox VE Cluster Manager
|
|
|
|
SYNOPSIS
|
|
--------
|
|
|
|
include::pvecm.1-synopsis.adoc[]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
endif::manvolnum[]
|
|
|
|
ifndef::manvolnum[]
|
|
Cluster Manager
|
|
===============
|
|
:pve-toplevel:
|
|
endif::manvolnum[]
|
|
|
|
The {PVE} cluster manager `pvecm` is a tool to create a group of
|
|
physical servers. Such a group is called a *cluster*. We use the
|
|
http://www.corosync.org[Corosync Cluster Engine] for reliable group
|
|
communication, and such clusters can consist of up to 32 physical nodes
|
|
(probably more, dependent on network latency).
|
|
|
|
`pvecm` can be used to create a new cluster, join nodes to a cluster,
|
|
leave the cluster, get status information, and do various other
cluster-related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
|
|
is used to transparently distribute the cluster configuration to all cluster
|
|
nodes.
|
|
|
|
Grouping nodes into a cluster has the following advantages:
|
|
|
|
* Centralized, web-based management
|
|
|
|
* Multi-master clusters: each node can do all management tasks
|
|
|
|
* `pmxcfs`: database-driven file system for storing configuration files,
|
|
replicated in real-time on all nodes using `corosync`.
|
|
|
|
* Easy migration of virtual machines and containers between physical
|
|
hosts
|
|
|
|
* Fast deployment
|
|
|
|
* Cluster-wide services like firewall and HA
|
|
|
|
|
|
Requirements
|
|
------------
|
|
|
|
* All nodes must be able to connect to each other via UDP ports 5404 and 5405
|
|
for corosync to work.
|
|
|
|
* Date and time have to be synchronized.
|
|
|
|
* An SSH tunnel on TCP port 22 is used between nodes.
|
|
|
|
* If you are interested in High Availability, you need to have at
|
|
least three nodes for reliable quorum. All nodes should have the
|
|
same version.
|
|
|
|
* We recommend a dedicated NIC for the cluster traffic, especially if
|
|
you use shared storage.
|
|
|
|
* Root password of a cluster node is required for adding nodes.
|
|
|
|
NOTE: It is not possible to mix {pve} 3.x and earlier with {pve} 4.x cluster
nodes.

NOTE: While it's possible to mix {pve} 4.4 and {pve} 5.0 nodes, doing so is not
supported as a production configuration and should only be done temporarily,
while upgrading the whole cluster from one major version to another.
|
|
|
|
NOTE: Running a cluster of {pve} 6.x with earlier versions is not possible. The
|
|
cluster protocol (corosync) between {pve} 6.x and earlier versions changed
|
|
fundamentally. The corosync 3 packages for {pve} 5.4 are only intended for the
|
|
upgrade procedure to {pve} 6.0.
|
|
|
|
|
|
Preparing Nodes
|
|
---------------
|
|
|
|
First, install {PVE} on all nodes. Make sure that each node is
|
|
installed with the final hostname and IP configuration. Changing the
|
|
hostname and IP is not possible after cluster creation.
|
|
|
|
Currently, the cluster can be created either on the console (log in via
`ssh`) or through the API, for which a GUI implementation is available
(__Datacenter -> Cluster__).
|
|
|
|
While it's common to reference all nodenames and their IPs in `/etc/hosts` (or
|
|
make their names resolvable through other means), this is not necessary for a
|
|
cluster to work. It may be useful however, as you can then connect from one node
|
|
to the other with SSH via the easier to remember node name (see also
|
|
xref:pvecm_corosync_addresses[Link Address Types]). Note that we always
recommend referencing nodes by their IP addresses in the cluster configuration.
|
|
|
|
|
|
[[pvecm_create_cluster]]
|
|
Create the Cluster
|
|
------------------
|
|
|
|
Log in via `ssh` to the first {pve} node. Use a unique name for your cluster.
|
|
This name cannot be changed later. The cluster name follows the same rules as
|
|
node names.
|
|
|
|
----
|
|
hp1# pvecm create CLUSTERNAME
|
|
----
|
|
|
|
NOTE: It is possible to create multiple clusters in the same physical or logical
|
|
network. Use unique cluster names if you do so. To avoid human confusion, it is
|
|
also recommended to choose different names even if clusters do not share the
|
|
cluster network.
|
|
|
|
To check the state of your cluster use:
|
|
|
|
----
|
|
hp1# pvecm status
|
|
----
|
|
|
|
Multiple Clusters In Same Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
It is possible to create multiple clusters in the same physical or logical
|
|
network. Each such cluster must have a unique name. This not only helps admins
to distinguish which cluster they are currently operating on, it is also
required to avoid possible clashes in the cluster communication stack.
|
|
|
|
While the bandwidth requirement of a corosync cluster is relatively low, the
latency of packets and the packets per second (PPS) rate are the limiting
factors. Different clusters in the same network can compete with each other for
|
|
these resources, so it may still make sense to use separate physical network
|
|
infrastructure for bigger clusters.
|
|
|
|
[[pvecm_join_node_to_cluster]]
|
|
Adding Nodes to the Cluster
|
|
---------------------------
|
|
|
|
Log in via `ssh` to the node you want to add.
|
|
|
|
----
|
|
hp2# pvecm add IP-ADDRESS-CLUSTER
|
|
----
|
|
|
|
For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
|
|
An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).
|
|
|
|
CAUTION: A new node cannot hold any VMs, because you would get
conflicts about identical VM IDs. Also, all existing configuration in
`/etc/pve` is overwritten when you join a new node to the cluster. As a
workaround, use `vzdump` to back up the guests and restore them under a
different VMID after adding the node to the cluster.
|
|
|
|
To check the state of the cluster use:
|
|
|
|
----
|
|
# pvecm status
|
|
----
|
|
|
|
.Cluster status after adding 4 nodes
----
hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
----
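
For scripted health checks, the `Quorate:` line of this output can be tested
directly. The following is only a minimal sketch: the helper name `is_quorate`
is hypothetical (it is not part of `pvecm`), and it assumes the output format
shown above.

```shell
# is_quorate: read `pvecm status` output on stdin and succeed only if the
# "Quorate:" line reports "Yes". Hypothetical helper, not shipped with pvecm.
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") }
         END { exit(found ? 0 : 1) }'
}

# Intended use on a cluster node (not executed here):
#   pvecm status | is_quorate && echo "cluster is quorate"
```

Because the helper only reads standard input, it can also be tried against a
saved copy of the status output.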
|
|
|
|
If you only want the list of all nodes use:
|
|
|
|
----
|
|
# pvecm nodes
|
|
----
|
|
|
|
.List nodes in a cluster
----
hp2# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
----
|
|
|
|
[[pvecm_adding_nodes_with_separated_cluster_network]]
|
|
Adding Nodes With Separated Cluster Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When adding a node to a cluster with a separated cluster network, you need to
use the 'link0' parameter to set the node's address on that network:
|
|
|
|
[source,bash]
|
|
----
|
|
pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
|
|
----
|
|
|
|
If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
|
|
kronosnet transport layer, also use the 'link1' parameter.
|
|
|
|
|
|
Remove a Cluster Node
|
|
---------------------
|
|
|
|
CAUTION: Read the procedure carefully before proceeding, as it may
not be what you want or need.
|
|
|
|
Move all virtual machines from the node. Make sure you have no local
|
|
data or backups you want to keep, or save them accordingly.
|
|
In the following example we will remove the node hp4 from the cluster.
|
|
|
|
Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
|
|
command to identify the node ID to remove:
|
|
|
|
----
hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4
----
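
If the lookup needs to be scripted, the node ID can be extracted from this
table. This is a minimal sketch under the assumption that the output has the
three columns shown above; the helper name `node_id_for` is hypothetical.

```shell
# node_id_for NAME: read `pvecm nodes` output on stdin and print the Nodeid
# of the row whose Name column equals NAME. Hypothetical helper.
node_id_for() {
    # Only rows with a numeric Votes column are membership rows; this skips
    # the header and the section title.
    awk -v name="$1" '$2 ~ /^[0-9]+$/ && $3 == name { print $1 }'
}

# Intended use on a remaining cluster node (not executed here):
#   pvecm nodes | node_id_for hp4
```

The helper prints nothing if the name is not found, which makes a typo in the
node name easy to notice before anything is removed.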
|
|
|
|
|
|
At this point you must power off hp4 and
|
|
make sure that it will not power on again (in the network) as it
|
|
is.
|
|
|
|
IMPORTANT: As said above, it is critical to power off the node
|
|
*before* removal, and make sure that it will *never* power on again
|
|
(in the existing cluster network) as it is.
|
|
If you power on the node as it is, the cluster could end up broken, and
it could be difficult to restore it to a clean state.
|
|
|
|
After powering off the node hp4, we can safely remove it from the cluster.
|
|
|
|
----
|
|
hp1# pvecm delnode hp4
|
|
----
|
|
|
|
If the operation succeeds, no output is returned. Check the node
list again with `pvecm nodes` or `pvecm status`. You should see
something like:
|
|
|
|
----
hp1# pvecm status

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:44:28 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92
----
|
|
|
|
If, for whatever reason, you want this server to join the same cluster again,
|
|
you have to
|
|
|
|
* reinstall {pve} on it from scratch
|
|
|
|
* then join it, as explained in the previous section.
|
|
|
|
NOTE: After removal of the node, its SSH fingerprint will still reside in the
|
|
'known_hosts' of the other nodes. If you receive an SSH error after rejoining
|
|
a node with the same IP or hostname, run `pvecm updatecerts` once on the
|
|
re-added node to update its fingerprint cluster wide.
|
|
|
|
[[pvecm_separate_node_without_reinstall]]
|
|
Separate A Node Without Reinstalling
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
CAUTION: This is *not* the recommended method, proceed with caution. Use the
|
|
above mentioned method if you're unsure.
|
|
|
|
You can also separate a node from a cluster without reinstalling it from
|
|
scratch. But after removing the node from the cluster it will still have
|
|
access to the shared storages! This must be resolved before you start removing
|
|
the node from the cluster. A {pve} cluster cannot share the exact same
storage with another cluster, as storage locking doesn't work across cluster
boundaries. Furthermore, it may also lead to VMID conflicts.
|
|
|
|
It's suggested that you create a new storage to which only the node you want
to separate has access. This can be a new export on your NFS or a new Ceph
pool, to name a few examples. It's just important that the exact same storage
is not accessed by multiple clusters. After setting up this storage, move
all data from the node and its VMs to it. Then you are ready to separate the
node from the cluster.
|
|
|
|
WARNING: Ensure all shared resources are cleanly separated! Otherwise you will
|
|
run into conflicts and problems.
|
|
|
|
First, stop the corosync and pve-cluster services on the node:
[source,bash]
----
systemctl stop pve-cluster
systemctl stop corosync
----

Start the cluster filesystem again in local mode:
[source,bash]
----
pmxcfs -l
----

Delete the corosync configuration files:
[source,bash]
----
rm /etc/pve/corosync.conf
rm /etc/corosync/*
----

You can now start the filesystem again as a normal service:
[source,bash]
----
killall pmxcfs
systemctl start pve-cluster
----

The node is now separated from the cluster. You can delete it from any
remaining node of the cluster with:
[source,bash]
----
pvecm delnode oldnode
----

If the command fails because the remaining node in the cluster lost quorum
when the now separate node exited, you may set the expected votes to 1 as a
workaround:
[source,bash]
----
pvecm expected 1
----

Then repeat the 'pvecm delnode' command.

Now switch back to the separated node and delete all the remaining files left
over from the old cluster. This ensures that the node can be added to another
cluster again without problems.

[source,bash]
----
rm /var/lib/corosync/*
----

As the configuration files from the other nodes are still in the cluster
filesystem, you may want to clean those up too. Simply remove the whole
directory '/etc/pve/nodes/NODENAME' recursively, but check three times that
you used the correct one before deleting it.
|
|
|
|
CAUTION: The node's SSH keys will remain in the 'authorized_keys' file. This
means that the nodes can still connect to each other with public key
authentication. You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.
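
One cautious way to do this is to generate a filtered copy of the file and
review it before replacing anything. The sketch below assumes that each key
line ends with a comment of the form `root@NODENAME`; check that this is true
for your file first. The helper name `keys_without_node` is hypothetical.

```shell
# keys_without_node FILE NODENAME: print FILE without the public keys whose
# last field is root@NODENAME (assumed comment format). FILE is not modified.
keys_without_node() {
    awk -v tag="root@$2" '$NF != tag' "$1"
}

# Intended use (not executed here; the output path is only an example):
#   keys_without_node /etc/pve/priv/authorized_keys hp4 > /root/authorized_keys.new
#   diff /etc/pve/priv/authorized_keys /root/authorized_keys.new
```

Writing to a separate file first means a wrong node name only produces an
unexpected diff, not a changed key file.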
|
|
|
|
|
|
Quorum
|
|
------
|
|
|
|
{pve} uses a quorum-based technique to provide a consistent state among
|
|
all cluster nodes.
|
|
|
|
[quote, from Wikipedia, Quorum (distributed computing)]
|
|
____
|
|
A quorum is the minimum number of votes that a distributed transaction
|
|
has to obtain in order to be allowed to perform an operation in a
|
|
distributed system.
|
|
____
|
|
|
|
In case of network partitioning, state changes require that a
|
|
majority of nodes are online. The cluster switches to read-only mode
|
|
if it loses quorum.
|
|
|
|
NOTE: {pve} assigns a single vote to each node by default.
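
With one vote per node, the required majority is easy to work out: more than
half of all votes. The following sketch shows the arithmetic; the function
names are illustrative only. The results match the `Quorum:` values in the
`pvecm status` examples in this chapter (3 for four votes, 2 for three votes).

```shell
# quorum_for VOTES: smallest number of votes that is a strict majority.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures VOTES: votes that can be lost while staying quorate.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4 5; do
    echo "votes=$n quorum=$(quorum_for "$n") may_fail=$(tolerated_failures "$n")"
done
```

Note that a two node cluster cannot lose any node without losing quorum, which
is why at least three nodes are recommended for reliable quorum.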
|
|
|
|
|
|
Cluster Network
|
|
---------------
|
|
|
|
The cluster network is the core of a cluster. All messages sent over it have to
|
|
be delivered reliably to all nodes in their respective order. In {pve} this
|
|
part is done by corosync, an implementation of a high performance, low overhead
|
|
high availability development toolkit. It serves our decentralized
|
|
configuration file system (`pmxcfs`).
|
|
|
|
[[pvecm_cluster_network_requirements]]
|
|
Network Requirements
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
Corosync needs a reliable network with latencies under 2 milliseconds (LAN
performance) to work properly. The network should not be used heavily by other
members; ideally, corosync runs on its own network. Do not use a shared network
for corosync and storage (except as a potential low-priority fallback in a
xref:pvecm_redundancy[redundant] configuration).
|
|
|
|
Before setting up a cluster, it is good practice to check if the network is fit
|
|
for that purpose. To make sure the nodes can connect to each other on the
|
|
cluster network, you can test the connectivity between them with the `ping`
|
|
tool.
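
To compare the measured latency with the 2 millisecond guideline, the summary
line that `ping` prints can be evaluated in a script. This is a rough sketch:
the helper names are hypothetical, and it assumes a summary line of the form
`... min/avg/max... = a/b/c... ms`, as printed by common Linux `ping`
implementations.

```shell
# avg_rtt_ms: read `ping` output on stdin and print the average round-trip
# time in milliseconds, taken from the "min/avg/max" summary line.
avg_rtt_ms() {
    awk '/min\/avg\/max/ { split($0, a, "= "); split(a[2], b, "/"); print b[2] }'
}

# rtt_ok LIMIT_MS: succeed if the average RTT read from stdin is below LIMIT_MS.
rtt_ok() {
    avg=$(avg_rtt_ms)
    [ -n "$avg" ] && awk -v a="$avg" -v l="$1" 'BEGIN { exit((a + 0 < l + 0) ? 0 : 1) }'
}

# Intended use from one node to another (not executed here):
#   ping -c 10 10.10.10.2 | rtt_ok 2 && echo "latency looks fine"
```

A single short run only gives a snapshot; latency under load is what matters
for corosync, so repeat the measurement while the network is busy.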
|
|
|
|
If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
|
|
be generated - no manual action is required.
|
|
|
|
NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
|
|
Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
|
|
communication, which, for now, only supports regular UDP unicast.
|
|
|
|
CAUTION: You can still enable Multicast or legacy unicast by setting your
|
|
transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
|
|
but keep in mind that this will disable all cryptography and redundancy support.
|
|
This is therefore not recommended.
|
|
|
|
Separate Cluster Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When creating a cluster without any parameters the corosync cluster network is
|
|
generally shared with the Web UI and the VMs and their traffic. Depending on
|
|
your setup, even storage traffic may get sent over the same network. It's
recommended to change that, as corosync is a time-critical, real-time
application.
|
|
|
|
Setting Up A New Network
|
|
^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
First you have to set up a new network interface. It should be on a physically
|
|
separate network. Ensure that your network fulfills the
|
|
xref:pvecm_cluster_network_requirements[cluster network requirements].
|
|
|
|
Separate On Cluster Creation
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
This is possible via the 'linkX' parameters of the 'pvecm create'
|
|
command used for creating a new cluster.
|
|
|
|
If you have set up an additional NIC with a static address on 10.10.10.1/25,
|
|
and want to send and receive all cluster communication over this interface,
|
|
you would execute:
|
|
|
|
[source,bash]
|
|
----
|
|
pvecm create test --link0 10.10.10.1
|
|
----
|
|
|
|
To check if everything is working properly execute:
|
|
[source,bash]
|
|
----
|
|
systemctl status corosync
|
|
----
|
|
|
|
Afterwards, proceed as described above to
|
|
xref:pvecm_adding_nodes_with_separated_cluster_network[add nodes with a separated cluster network].
|
|
|
|
[[pvecm_separate_cluster_net_after_creation]]
|
|
Separate After Cluster Creation
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
You can do this if you have already created a cluster and want to switch
|
|
its communication to another network, without rebuilding the whole cluster.
|
|
This change may lead to short durations of quorum loss in the cluster, as nodes
|
|
have to restart corosync and come up one after the other on the new network.
|
|
|
|
Check how to xref:pvecm_edit_corosync_conf[edit the corosync.conf file] first.
|
|
Then, open it and you should see a file similar to:
|
|
|
|
----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----
|
|
|
|
NOTE: `ringX_addr` actually specifies a corosync *link address*, the name "ring"
|
|
is a remnant of older corosync versions that is kept for backwards
|
|
compatibility.
|
|
|
|
The first thing you want to do is add the 'name' properties in the node entries
|
|
if you do not see them already. Those *must* match the node name.
|
|
|
|
Then replace all addresses from the 'ring0_addr' properties of all nodes with
|
|
the new addresses. You may use plain IP addresses or hostnames here. If you use
hostnames, ensure that they are resolvable from all nodes (see also
xref:pvecm_corosync_addresses[Link Address Types]).
|
|
|
|
In this example, we want to switch the cluster communication to the
10.10.10.1/25 network, so we change the 'ring0_addr' of each node accordingly.
|
|
|
|
NOTE: The exact same procedure can be used to change other 'ringX_addr' values
as well. However, we recommend only changing one link address at a time, so
that it's easier to recover if something goes wrong.
|
|
|
|
After we increase the 'config_version' property, the new configuration file
|
|
should look like:
|
|
|
|
----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
----
|
|
|
|
Then, after a final check if all changed information is correct, we save it and
|
|
once again follow the xref:pvecm_edit_corosync_conf[edit corosync.conf file]
|
|
section to bring it into effect.
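
Forgetting to increase 'config_version' is an easy mistake to make. As a
minimal sketch, the following helper (the name `bump_config_version` is
hypothetical) prints a copy of a configuration file with the version increased
by one. It leaves the original file untouched, so the result can be reviewed
before it is used.

```shell
# bump_config_version FILE: print FILE with the 'config_version' value
# increased by one. FILE itself is not modified.
bump_config_version() {
    # On the config_version line the first run of digits is the version
    # number, so replacing it keeps the original indentation intact.
    awk '$1 == "config_version:" { sub(/[0-9]+/, $2 + 1) } { print }' "$1"
}

# Intended use (not executed here; the output path is only an example):
#   bump_config_version /etc/pve/corosync.conf > /root/corosync.conf.new
```

All other lines, including the node addresses, pass through unchanged.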
|
|
|
|
The changes will be applied live, so restarting corosync is not strictly
|
|
necessary. If you changed other settings as well, or notice corosync
|
|
complaining, you can optionally trigger a restart.
|
|
|
|
On a single node execute:
|
|
|
|
[source,bash]
|
|
----
|
|
systemctl restart corosync
|
|
----
|
|
|
|
Now check if everything is fine:
|
|
|
|
[source,bash]
|
|
----
|
|
systemctl status corosync
|
|
----
|
|
|
|
If corosync is running correctly again, restart it on all other nodes as well.
They will then join the cluster membership one by one on the new network.
|
|
|
|
[[pvecm_corosync_addresses]]
|
|
Corosync addresses
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
A corosync link address (for backwards compatibility denoted by 'ringX_addr' in
|
|
`corosync.conf`) can be specified in two ways:
|
|
|
|
* **IPv4/v6 addresses** will be used directly. They are recommended, since they
|
|
are static and usually not changed carelessly.
|
|
|
|
* **Hostnames** will be resolved using `getaddrinfo`, which means that per
|
|
default, IPv6 addresses will be used first, if available (see also
|
|
`man gai.conf`). Keep this in mind, especially when upgrading an existing
|
|
cluster to IPv6.
|
|
|
|
CAUTION: Hostnames should be used with care, since the address they
|
|
resolve to can be changed without touching corosync or the node it runs on -
|
|
which may lead to a situation where an address is changed without thinking
|
|
about implications for corosync.
|
|
|
|
A separate, static hostname specifically for corosync is recommended, if
hostnames are preferred. Also, make sure that every node in the cluster can
resolve all hostnames correctly.

Since {pve} 5.1, hostnames, while still supported, are resolved at the time of
entry. Only the resolved IP is then saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their
unresolved hostname in `corosync.conf`. It might be a good idea to replace
them with IPs or a separate hostname, as mentioned above.
|
|
|
|
|
|
[[pvecm_redundancy]]
|
|
Corosync Redundancy
|
|
-------------------
|
|
|
|
Corosync supports redundant networking via its integrated kronosnet layer by
|
|
default (it is not supported on the legacy udp/udpu transports). It can be
|
|
enabled by specifying more than one link address, either via the '--linkX'
|
|
parameters of `pvecm` (while creating a cluster or adding a new node) or by
|
|
specifying more than one 'ringX_addr' in `corosync.conf`.
|
|
|
|
NOTE: To provide useful failover, every link should be on its own
|
|
physical network connection.
|
|
|
|
Links are used according to a priority setting. You can configure this priority
|
|
by setting 'knet_link_priority' in the corresponding interface section in
|
|
`corosync.conf`, or, preferably, using the 'priority' parameter when creating
|
|
your cluster with `pvecm`:
|
|
|
|
----
|
|
# pvecm create CLUSTERNAME --link0 10.10.10.1,priority=20 --link1 10.20.20.1,priority=15
|
|
----
|
|
|
|
This would cause 'link0' to be used first, since it has the higher priority.
|
|
|
|
If no priorities are configured manually (or two links have the same priority),
|
|
links will be used in order of their number, with the lower number having higher
|
|
priority.
|
|
|
|
Even if all links are working, only the one with the highest priority will see
|
|
corosync traffic. Link priorities cannot be mixed, i.e. links with different
|
|
priorities will not be able to communicate with each other.
|
|
|
|
Since lower priority links will not see traffic unless all higher priorities
|
|
have failed, it becomes a useful strategy to specify even networks used for
|
|
other tasks (VMs, storage, etc...) as low-priority links. If worst comes to
|
|
worst, a higher-latency or more congested connection might be better than no
|
|
connection at all.
|
|
|
|
Adding Redundant Links To An Existing Cluster
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To add a new link to a running configuration, first check how to
|
|
xref:pvecm_edit_corosync_conf[edit the corosync.conf file].
|
|
|
|
Then, add a new 'ringX_addr' to every node in the `nodelist` section. Make
|
|
sure that your 'X' is the same for every node you add it to, and that it is
|
|
unique for each node.
|
|
|
|
Lastly, add a new 'interface', as shown below, to your `totem`
|
|
section, replacing 'X' with your link number chosen above.
|
|
|
|
Assuming you added a link with number 1, the new configuration file could look
|
|
like this:
|
|
|
|
----
logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
----
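
Since a link address has to be added to every node entry, it can be worth
checking an edited file for a forgotten node before applying it. The following
is a sketch only: the helper name `all_nodes_have_link` is hypothetical, and it
assumes the layout shown above, with `node {` and the closing `}` each on their
own line.

```shell
# all_nodes_have_link FILE X: succeed if every 'node { ... }' block in FILE
# contains a 'ringX_addr' entry for link number X.
all_nodes_have_link() {
    awk -v key="ring${2}_addr:" '
        $1 == "node" && $2 == "{" { in_node = 1; seen = 0; next }
        in_node && $1 == key      { seen = 1 }
        in_node && $1 == "}"      { in_node = 0; nodes++; if (!seen) missing++ }
        END { exit((nodes > 0 && missing == 0) ? 0 : 1) }
    ' "$1"
}

# Intended use on an edited copy (not executed here; path is only an example):
#   all_nodes_have_link /root/corosync.conf.new 1 || echo "a node lacks ring1_addr"
```

The check is purely textual. It does not verify that the addresses are
reachable or that a matching 'interface' section exists.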
|
|
|
|
The new link will be enabled as soon as you follow the last steps to
|
|
xref:pvecm_edit_corosync_conf[edit the corosync.conf file]. A restart should not
|
|
be necessary. You can check that corosync loaded the new link using:
|
|
|
|
----
|
|
journalctl -b -u corosync
|
|
----
|
|
|
|
It might be a good idea to test the new link by temporarily disconnecting the
|
|
old link on one node and making sure that its status remains online while
|
|
disconnected:
|
|
|
|
----
|
|
pvecm status
|
|
----
|
|
|
|
If you see a healthy cluster state, it means that your new link is being used.
|
|
|
|
|
|
Corosync External Vote Support
|
|
------------------------------
|
|
|
|
This section describes a way to deploy an external voter in a {pve} cluster.
|
|
When configured, the cluster can sustain more node failures without
|
|
violating safety properties of the cluster communication.
|
|
|
|
For this to work there are two services involved:
|
|
|
|
* a so called qdevice daemon which runs on each {pve} node
|
|
|
|
* an external vote daemon which runs on an independent server.
|
|
|
|
As a result you can achieve higher availability even in smaller setups (for
|
|
example 2+1 nodes).
|
|
|
|
QDevice Technical Overview
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
node. It provides a configured number of votes to the cluster's quorum
subsystem, based on an externally running third-party arbitrator's decision.
|
|
Its primary use is to allow a cluster to sustain more node failures than
|
|
standard quorum rules allow. This can be done safely as the external device
|
|
can see all nodes and thus choose only one set of nodes to give its vote.
|
|
This will only be done if said set of nodes can have quorum (again) when
|
|
receiving the third-party vote.
|
|
|
|
Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
|
|
a daemon which provides a vote to a cluster partition if it can reach the
|
|
partition members over the network. It will only give votes to one partition
|
|
of a cluster at any time.
|
|
It's designed to support multiple clusters and is almost configuration and
|
|
state free. New clusters are handled dynamically and no configuration file
|
|
is needed on the host running a QDevice.
|
|
|
|
The only requirements for the external host are that it has network access to
the cluster and that a corosync-qnetd package is available. We provide such a
package for Debian based hosts, and other Linux distributions should also have
a package available through their respective package manager.
|
|
|
|
NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
|
|
TCP/IP. The daemon may even run outside of the cluster's LAN and can have longer
|
|
latencies than 2 ms.
|
|
|
|
Supported Setups
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
We support QDevices for clusters with an even number of nodes and recommend
it for 2 node clusters, if they should provide higher availability.
For clusters with an odd node count, we currently discourage the use of
QDevices. The reason for this is the difference in the votes which the QDevice
provides for each cluster type. Even numbered clusters get a single additional
vote, which can only increase availability, because if the QDevice
itself fails, you are in the same situation as with no QDevice at all.

With an odd numbered cluster size, on the other hand, the QDevice provides
'(N-1)' votes -- where 'N' corresponds to the cluster node count. This
difference makes sense: if it had only one additional vote, the cluster could
get into a split-brain situation.
This algorithm allows all nodes but one (and naturally the
QDevice itself) to fail.
However, there are two drawbacks to this:
|
|
|
|
* If the QNet daemon itself fails, no other node may fail or the cluster
|
|
immediately loses quorum. For example, in a cluster with 15 nodes 7
|
|
could fail before the cluster becomes inquorate. But, if a QDevice is
|
|
configured here and said QDevice fails itself **no single node** of
|
|
the 15 may fail. The QDevice acts almost as a single point of failure in
|
|
this case.
|
|
|
|
* The fact that all but one node plus QDevice may fail sounds promising at
first, but this may result in a mass recovery of HA services, which could
overload the single remaining node. Furthermore, a Ceph server will stop
providing services if only '((N-1)/2)' nodes or fewer remain online.
|
|
|
|
If you understand the drawbacks and implications you can decide yourself if
|
|
you should use this technology in an odd numbered cluster setup.
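
The vote arithmetic behind these statements can be reproduced with a few lines
of shell. This is a simplified model, not the actual algorithm used by
corosync: it assumes one vote per node, the QDevice vote counts given above,
and a quorum of more than half of all expected votes. The function names are
illustrative only.

```shell
# qdevice_votes N: votes the QDevice contributes to a cluster of N nodes,
# following the text above: 1 for an even N, N-1 for an odd N.
qdevice_votes() {
    if [ $(( $1 % 2 )) -eq 0 ]; then echo 1; else echo $(( $1 - 1 )); fi
}

# node_failures_tolerated N QDEVICE_UP: nodes that may fail while the rest
# stays quorate. QDEVICE_UP is 1 if the arbitrator is reachable, 0 if not.
node_failures_tolerated() {
    qv=$(qdevice_votes "$1")
    total=$(( $1 + qv ))          # expected votes including the QDevice
    quorum=$(( total / 2 + 1 ))   # strict majority of the expected votes
    avail=$(( $1 + qv * $2 ))     # votes present while all nodes are up
    echo $(( avail - quorum ))
}

for n in 2 4 15; do
    echo "nodes=$n qdevice_up=$(node_failures_tolerated "$n" 1) qdevice_down=$(node_failures_tolerated "$n" 0)"
done
```

For 15 nodes, this model yields 14 tolerated node failures while the QDevice is
reachable and none once it has failed, which is the first drawback described
above.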
|
|
|
|
QDevice-Net Setup
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
We recommend to run any daemon which provides votes to corosync-qdevice as an
|
|
unprivileged user. {pve} and Debian provide a package which is already
|
|
configured to do so.
|
|
The traffic between the daemon and the cluster must be encrypted to ensure a
|
|
safe and secure QDevice integration in {pve}.
|
|
|
|
First install the 'corosync-qnetd' package on your external server and
|
|
the 'corosync-qdevice' package on all cluster nodes.
|
|
|
|
After that, ensure that all your nodes on the cluster are online.
|
|
|
|
You can now easily set up your QDevice by running the following command on one
|
|
of the {pve} nodes:
|
|
|
|
----
|
|
pve# pvecm qdevice setup <QDEVICE-IP>
|
|
----
|
|
|
|
The SSH key from the cluster will be automatically copied to the QDevice. You
|
|
might need to enter an SSH password during this step.
|
|
|
|
After you enter the password and all the steps are successfully completed, you
|
|
will see "Done". You can check the status now:
|
|
|
|
----
pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.22.180 (local)
0x00000002          1    A,V,NMW 192.168.22.181
0x00000000          1            Qdevice

----
|
|
|
|
which means the QDevice is set up.
|
|
|
|
Frequently Asked Questions
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Tie Breaking
|
|
^^^^^^^^^^^^
|
|
|
|
In case of a tie, where two same-sized cluster partitions cannot see each other
but can see the QDevice, the QDevice chooses one of those partitions randomly
and provides a vote to it.
|
|
|
|
Possible Negative Implications
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
For clusters with an even node count, there are no negative implications when
setting up a QDevice. If the QDevice fails to work, the cluster behaves as if it
had no QDevice at all.
|
|
|
|
Adding/Deleting Nodes After QDevice Setup
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
If you want to add a new node or remove an existing one from a cluster with a
|
|
QDevice setup, you need to remove the QDevice first. After that, you can add or
|
|
remove nodes normally. Once you have a cluster with an even node count again,
|
|
you can set up the QDevice again as described above.
|
|
|
|
Removing the QDevice
|
|
^^^^^^^^^^^^^^^^^^^^
|
|
|
|
If you used the official `pvecm` tool to add the QDevice, you can remove it
|
|
trivially by running:
|
|
|
|
----
|
|
pve# pvecm qdevice remove
|
|
----
|
|
|
|
//Still TODO
|
|
//^^^^^^^^^^
|
|
//There is still stuff to add here
|
|
|
|
|
|
Corosync Configuration
|
|
----------------------
|
|
|
|
The `/etc/pve/corosync.conf` file plays a central role in a {pve} cluster. It
|
|
controls the cluster membership and its network.
|
|
For further information about it, check the corosync.conf man page:
|
|
[source,bash]
|
|
----
|
|
man corosync.conf
|
|
----
|
|
|
|
For node membership you should always use the `pvecm` tool provided by {pve}.
|
|
You may have to edit the configuration file manually for other changes.
|
|
Here are a few best practice tips for doing this.
|
|
|
|
[[pvecm_edit_corosync_conf]]
|
|
Edit corosync.conf
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
Editing the corosync.conf file is not always straightforward. There are two
copies on each cluster node, one in `/etc/pve/corosync.conf` and the other in
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
propagate the changes to the local one, but not vice versa.
|
|
|
|
The configuration will get updated automatically as soon as the file changes.
|
|
This means changes which can be integrated in a running corosync will take
|
|
effect immediately. So you should always make a copy and edit that instead, to
avoid triggering unwanted changes by an in-between save.
|
|
|
|
[source,bash]
|
|
----
|
|
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
|
|
----
|
|
|
|
Then open the copied config file with your favorite editor. For example, `nano`
and `vim.tiny` are preinstalled on every {pve} node.
|
|
|
|
NOTE: Always increment the 'config_version' number on configuration changes;
omitting this can lead to problems.
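
Incrementing the version by hand is easy to forget. As a minimal sketch,
assuming GNU `sed` and a single `config_version: <number>` line in the file, a
small helper could do it on the edited copy. The name `bump_config_version` is
made up for this example; it is not a {pve} command.

[source,bash]
----
# Hypothetical helper: increment config_version in the given file in place.
# Intended for the edited copy, not for the live /etc/pve/corosync.conf.
bump_config_version() {
    conf=$1
    # Read the current version number from the "config_version:" line.
    current=$(awk '$1 == "config_version:" { print $2; exit }' "$conf")
    [ -n "$current" ] || return 1
    # Rewrite that line with the version increased by one.
    sed -i "s/config_version: *${current}/config_version: $((current + 1))/" "$conf"
}

# Example call on the copy created above:
# bump_config_version /etc/pve/corosync.conf.new
----

It is still worth comparing the copy with the current file before moving it
into place, to confirm that only the intended lines changed.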
|
|
|
|
After making the necessary changes, create another copy of the current working
configuration file. This serves as a backup if the new configuration fails to
apply or causes problems in other ways.
|
|
|
|
[source,bash]
|
|
----
|
|
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
|
|
----
|
|
|
|
Then move the new configuration file over the old one:
|
|
[source,bash]
|
|
----
|
|
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
|
|
----
|
|
|
|
You can check whether the change was applied automatically with the following
commands:
[source,bash]
----
systemctl status corosync
journalctl -b -u corosync
----

If it was not, you may have to restart the corosync service via:
|
|
[source,bash]
|
|
----
|
|
systemctl restart corosync
|
|
----
|
|
|
|
If errors occur, check the troubleshooting section below.
|
|
|
|
Troubleshooting
|
|
~~~~~~~~~~~~~~~
|
|
|
|
Issue: 'quorum.expected_votes must be configured'
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
When corosync starts to fail and you get the following message in the system log:
|
|
|
|
----
|
|
[...]
|
|
corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
|
|
corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
|
|
'configuration error: nodelist or quorum.expected_votes must be configured!'
|
|
[...]
|
|
----
|
|
|
|
It means that the hostname you set for corosync 'ringX_addr' in the
|
|
configuration could not be resolved.
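
One way to check this is to try resolving every 'ringX_addr' value from the
configuration. The following is a minimal sketch, not an official {pve} tool.
The function name `check_ring_addrs` is hypothetical, and the sketch assumes the
usual `key: value` layout with one address per line and a `getent` that
supports the `ahosts` database (as on Debian).

[source,bash]
----
# Hypothetical helper: report every ringX_addr value that does not resolve.
check_ring_addrs() {
    conf=$1
    status=0
    # Collect the values of ring0_addr, ring1_addr, ... from the file.
    for addr in $(awk '$1 ~ /^ring[0-9]+_addr:$/ { print $2 }' "$conf"); do
        # "ahosts" accepts both host names and literal IP addresses.
        if getent ahosts "$addr" > /dev/null; then
            echo "ok: $addr"
        else
            echo "unresolved: $addr"
            status=1
        fi
    done
    return $status
}

# Example call:
# check_ring_addrs /etc/corosync/corosync.conf
----

Any address reported as unresolved is a candidate for the error above, for
example a host name that is missing from `/etc/hosts`.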
|
|
|
|
Write Configuration When Not Quorate
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
If you need to change '/etc/pve/corosync.conf' on a node with no quorum, and you
know what you are doing, use:
|
|
[source,bash]
|
|
----
|
|
pvecm expected 1
|
|
----
|
|
|
|
This sets the expected vote count to 1 and makes the cluster quorate. You can
now fix your configuration, or revert it to the last working backup.
|
|
|
|
This is not enough if corosync cannot start anymore. In that case, it is best to
edit the local copy of the corosync configuration in
'/etc/corosync/corosync.conf' so that corosync can start again. Ensure that this
configuration has the same content on all nodes to avoid a split-brain
situation. If you are not sure what went wrong, it's best to ask the Proxmox
Community to help you.
|
|
|
|
|
|
[[pvecm_corosync_conf_glossary]]
|
|
Corosync Configuration Glossary
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
ringX_addr::
|
|
This names the different link addresses for the kronosnet connections between
|
|
nodes.
|
|
|
|
|
|
Cluster Cold Start
|
|
------------------
|
|
|
|
It is obvious that a cluster is not quorate when all nodes are
|
|
offline. This is a common case after a power failure.
|
|
|
|
NOTE: It is always a good idea to use an uninterruptible power supply
|
|
(``UPS'', also called ``battery backup'') to avoid this state, especially if
|
|
you want HA.
|
|
|
|
On node startup, the `pve-guests` service is started and waits for
|
|
quorum. Once quorate, it starts all guests which have the `onboot`
|
|
flag set.
|
|
|
|
When you turn on nodes, or when power comes back after a power failure,
it is likely that some nodes will boot faster than others. Please keep in
mind that guest startup is delayed until you reach quorum.
|
|
|
|
|
|
Guest Migration
|
|
---------------
|
|
|
|
Migrating virtual guests to other nodes is a useful feature in a
|
|
cluster. There are settings to control the behavior of such
|
|
migrations. This can be done via the configuration file
|
|
`datacenter.cfg` or for a specific migration via API or command line
|
|
parameters.
|
|
|
|
It makes a difference if a guest is online or offline, or if it has
|
|
local resources (like a local disk).
|
|
|
|
For details about virtual machine migration, see the
|
|
xref:qm_migration[QEMU/KVM Migration Chapter].
|
|
|
|
For details about container migration, see the
|
|
xref:pct_migration[Container Migration Chapter].
|
|
|
|
Migration Type
|
|
~~~~~~~~~~~~~~
|
|
|
|
The migration type defines if the migration data should be sent over an
|
|
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
|
|
Setting the migration type to insecure means that the RAM content of a
virtual guest is also transferred unencrypted, which can lead to
|
|
information disclosure of critical data from inside the guest (for
|
|
example passwords or encryption keys).
|
|
|
|
Therefore, we strongly recommend using the secure channel if you do
|
|
not have full control over the network and cannot guarantee that no
|
|
one is eavesdropping on it.
|
|
|
|
NOTE: Storage migration does not follow this setting. Currently, it
|
|
always sends the storage content over a secure channel.
|
|
|
|
Encryption requires a lot of computing power, so this setting is often
|
|
changed to `insecure` to achieve better performance. The impact on
|
|
modern systems is lower because they implement AES encryption in
|
|
hardware. The performance impact is particularly evident in fast
|
|
networks where you can transfer 10 Gbps or more.
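
As a sketch only, and assuming the migration traffic runs over a fully trusted,
isolated network, the type could be changed for all migrations in the cluster
through the `migration` property in `/etc/pve/datacenter.cfg`:

----
# assumption: nobody can eavesdrop on the network used for migration
migration: insecure
----

If that assumption does not hold, keep the `secure` type, since the RAM content
of the guest would otherwise be readable on the wire.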
|
|
|
|
Migration Network
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
By default, {pve} uses the network in which cluster communication
|
|
takes place to send the migration traffic. This is not optimal because
|
|
sensitive cluster traffic can be disrupted and this network may not
|
|
have the best bandwidth available on the node.
|
|
|
|
Setting the migration network parameter allows the use of a dedicated
|
|
network for the entire migration traffic. In addition to the memory,
|
|
this also affects the storage traffic for offline migrations.
|
|
|
|
The migration network is set as a network in the CIDR notation. This
|
|
has the advantage that you do not have to set individual IP addresses
|
|
for each node. {pve} can determine the real address on the
|
|
destination node from the network specified in the CIDR form. To
|
|
enable this, the network must be specified so that each node has exactly one
IP in the respective network.
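
The requirement can be checked per node by counting how many of its addresses
fall into the chosen network. The following is a minimal sketch in plain shell
arithmetic for IPv4 only. All function names are hypothetical, and this is not
how {pve} itself determines the address.

[source,bash]
----
# Hypothetical helpers, POSIX shell arithmetic, IPv4 only.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside network $2 (CIDR notation).
in_cidr() {
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Print how many of the given addresses lie inside the network $1.
count_in_cidr() {
    cidr=$1
    shift
    count=0
    for ip in "$@"; do
        if in_cidr "$ip" "$cidr"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Illustrative addresses of one node; only 10.1.2.1 is in 10.1.2.0/24.
count_in_cidr 10.1.2.0/24 192.0.2.57 10.1.1.1 10.1.2.1
----

The call prints 1, which is the required result. A result of 0 or of 2 and more
on any node means the network is not suitable as a migration network in this
form.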
|
|
|
|
Example
|
|
^^^^^^^
|
|
|
|
We assume that we have a three-node setup with three separate
|
|
networks. One for public communication with the Internet, one for
|
|
cluster communication and a very fast one, which we want to use as a
|
|
dedicated network for migration.
|
|
|
|
A network configuration for such a setup might look as follows:
|
|
|
|
----
|
|
iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
    address 192.X.Y.57
    netmask 255.255.255.0
    gateway 192.X.Y.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

# cluster network
auto eno2
iface eno2 inet static
    address 10.1.1.1
    netmask 255.255.255.0

# fast network
auto eno3
iface eno3 inet static
    address 10.1.2.1
    netmask 255.255.255.0
|
|
----
|
|
|
|
Here, we will use the network 10.1.2.0/24 as a migration network. For
|
|
a single migration, you can do this using the `migration_network`
|
|
parameter of the command line tool:
|
|
|
|
----
|
|
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
|
|
----
|
|
|
|
To configure this as the default network for all migrations in the
|
|
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
|
|
file:
|
|
|
|
----
|
|
# use dedicated migration network
|
|
migration: secure,network=10.1.2.0/24
|
|
----
|
|
|
|
NOTE: The migration type must always be set when the migration network
is set in `/etc/pve/datacenter.cfg`.
|
|
|
|
|
|
ifdef::manvolnum[]
|
|
include::pve-copyright.adoc[]
|
|
endif::manvolnum[]
|