mirror of
git://git.proxmox.com/git/pve-docs.git
synced 2025-01-21 18:03:45 +03:00
1003 lines
29 KiB
Plaintext
1003 lines
29 KiB
Plaintext
[[chapter_pvecm]]
|
|
ifdef::manvolnum[]
|
|
pvecm(1)
|
|
========
|
|
:pve-toplevel:
|
|
|
|
NAME
|
|
----
|
|
|
|
pvecm - Proxmox VE Cluster Manager
|
|
|
|
SYNOPSIS
|
|
--------
|
|
|
|
include::pvecm.1-synopsis.adoc[]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
endif::manvolnum[]
|
|
|
|
ifndef::manvolnum[]
|
|
Cluster Manager
|
|
===============
|
|
:pve-toplevel:
|
|
endif::manvolnum[]
|
|
|
|
The {PVE} cluster manager `pvecm` is a tool to create a group of
|
|
physical servers. Such a group is called a *cluster*. We use the
|
|
http://www.corosync.org[Corosync Cluster Engine] for reliable group
|
|
communication, and such clusters can consist of up to 32 physical nodes
|
|
(probably more, dependent on network latency).
|
|
|
|
`pvecm` can be used to create a new cluster, join nodes to a cluster,
|
|
leave the cluster, get status information and do various other cluster
|
|
related tasks. The **P**rox**m**o**x** **C**luster **F**ile **S**ystem (``pmxcfs'')
|
|
is used to transparently distribute the cluster configuration to all cluster
|
|
nodes.
|
|
|
|
Grouping nodes into a cluster has the following advantages:
|
|
|
|
* Centralized, web based management
|
|
|
|
* Multi-master clusters: each node can do all management task
|
|
|
|
* `pmxcfs`: database-driven file system for storing configuration files,
|
|
replicated in real-time on all nodes using `corosync`.
|
|
|
|
* Easy migration of virtual machines and containers between physical
|
|
hosts
|
|
|
|
* Fast deployment
|
|
|
|
* Cluster-wide services like firewall and HA
|
|
|
|
|
|
Requirements
|
|
------------
|
|
|
|
* All nodes must be in the same network as `corosync` uses IP Multicast
|
|
to communicate between nodes (also see
|
|
http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP
|
|
ports 5404 and 5405 for cluster communication.
|
|
+
|
|
NOTE: Some switches do not support IP multicast by default and must be
|
|
manually enabled first.
|
|
|
|
* Date and time have to be synchronized.
|
|
|
|
* SSH tunnel on TCP port 22 between nodes is used.
|
|
|
|
* If you are interested in High Availability, you need to have at
|
|
least three nodes for reliable quorum. All nodes should have the
|
|
same version.
|
|
|
|
* We recommend a dedicated NIC for the cluster traffic, especially if
|
|
you use shared storage.
|
|
|
|
* Root password of a cluster node is required for adding nodes.
|
|
|
|
NOTE: It is not possible to mix Proxmox VE 3.x and earlier with
|
|
Proxmox VE 4.0 cluster nodes.
|
|
|
|
|
|
Preparing Nodes
|
|
---------------
|
|
|
|
First, install {PVE} on all nodes. Make sure that each node is
|
|
installed with the final hostname and IP configuration. Changing the
|
|
hostname and IP is not possible after cluster creation.
|
|
|
|
Currently the cluster creation can either be done on the console(login via `ssh`) or the GUI.
|
|
|
|
[[pvecm_create_cluster]]
|
|
Create the Cluster
|
|
------------------
|
|
|
|
Login via `ssh` to the first {pve} node. Use a unique name for your cluster.
|
|
This name cannot be changed later. The cluster name follows the same rules as node names.
|
|
|
|
hp1# pvecm create YOUR-CLUSTER-NAME
|
|
|
|
CAUTION: The cluster name is used to compute the default multicast
|
|
address. Please use unique cluster names if you run more than one
|
|
cluster inside your network.
|
|
|
|
To check the state of your cluster use:
|
|
|
|
hp1# pvecm status
|
|
|
|
Multiple Clusters In Same Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
It is possible to create multiple clusters in the same physical or logical
|
|
network. Each cluster must have a unique name, which is used to generate the
|
|
cluster's multicast group address. As long as no duplicate cluster names are
|
|
configured in one network segment, the different clusters won't interfere with
|
|
each other.
|
|
|
|
If multiple clusters operate in a single network it may be beneficial to setup
|
|
an IGMP querier and enable IGMP Snooping in said network. This may reduce the
|
|
load of the network significantly because multicast packets are only delivered
|
|
to endpoints of the respective member nodes.
|
|
|
|
|
|
[[pvecm_join_node_to_cluster]]
|
|
Adding Nodes to the Cluster
|
|
---------------------------
|
|
|
|
Login via `ssh` to the node you want to add.
|
|
|
|
hp2# pvecm add IP-ADDRESS-CLUSTER
|
|
|
|
For `IP-ADDRESS-CLUSTER` use the IP from an existing cluster node.
|
|
|
|
CAUTION: A new node cannot hold any VMs, because you would get
|
|
conflicts about identical VM IDs. Also, all existing configuration in
|
|
`/etc/pve` is overwritten when you join a new node to the cluster. To
|
|
workaround, use `vzdump` to backup and restore to a different VMID after
|
|
adding the node to the cluster.
|
|
|
|
To check the state of cluster:
|
|
|
|
# pvecm status
|
|
|
|
.Cluster status after adding 4 nodes
|
|
----
|
|
hp2# pvecm status
|
|
Quorum information
|
|
~~~~~~~~~~~~~~~~~~
|
|
Date: Mon Apr 20 12:30:13 2015
|
|
Quorum provider: corosync_votequorum
|
|
Nodes: 4
|
|
Node ID: 0x00000001
|
|
Ring ID: 1928
|
|
Quorate: Yes
|
|
|
|
Votequorum information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Expected votes: 4
|
|
Highest expected: 4
|
|
Total votes: 4
|
|
Quorum: 2
|
|
Flags: Quorate
|
|
|
|
Membership information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Nodeid Votes Name
|
|
0x00000001 1 192.168.15.91
|
|
0x00000002 1 192.168.15.92 (local)
|
|
0x00000003 1 192.168.15.93
|
|
0x00000004 1 192.168.15.94
|
|
----
|
|
|
|
If you only want the list of all nodes use:
|
|
|
|
# pvecm nodes
|
|
|
|
.List nodes in a cluster
|
|
----
|
|
hp2# pvecm nodes
|
|
|
|
Membership information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Nodeid Votes Name
|
|
1 1 hp1
|
|
2 1 hp2 (local)
|
|
3 1 hp3
|
|
4 1 hp4
|
|
----
|
|
|
|
[[adding-nodes-with-separated-cluster-network]]
|
|
Adding Nodes With Separated Cluster Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When adding a node to a cluster with a separated cluster network you need to
|
|
use the 'ringX_addr' parameters to set the nodes address on those networks:
|
|
|
|
[source,bash]
|
|
----
|
|
pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0
|
|
----
|
|
|
|
If you want to use the Redundant Ring Protocol you will also want to pass the
|
|
'ring1_addr' parameter.
|
|
|
|
|
|
Remove a Cluster Node
|
|
---------------------
|
|
|
|
CAUTION: Read carefully the procedure before proceeding, as it could
|
|
not be what you want or need.
|
|
|
|
Move all virtual machines from the node. Make sure you have no local
|
|
data or backups you want to keep, or save them accordingly.
|
|
In the following example we will remove the node hp4 from the cluster.
|
|
|
|
Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
|
|
command to identify the node ID to remove:
|
|
|
|
----
|
|
hp1# pvecm nodes
|
|
|
|
Membership information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Nodeid Votes Name
|
|
1 1 hp1 (local)
|
|
2 1 hp2
|
|
3 1 hp3
|
|
4 1 hp4
|
|
----
|
|
|
|
|
|
At this point you must power off hp4 and
|
|
make sure that it will not power on again (in the network) as it
|
|
is.
|
|
|
|
IMPORTANT: As said above, it is critical to power off the node
|
|
*before* removal, and make sure that it will *never* power on again
|
|
(in the existing cluster network) as it is.
|
|
If you power on the node as it is, your cluster will be screwed up and
|
|
it could be difficult to restore a clean cluster state.
|
|
|
|
After powering off the node hp4, we can safely remove it from the cluster.
|
|
|
|
hp1# pvecm delnode hp4
|
|
|
|
If the operation succeeds no output is returned, just check the node
|
|
list again with `pvecm nodes` or `pvecm status`. You should see
|
|
something like:
|
|
|
|
----
|
|
hp1# pvecm status
|
|
|
|
Quorum information
|
|
~~~~~~~~~~~~~~~~~~
|
|
Date: Mon Apr 20 12:44:28 2015
|
|
Quorum provider: corosync_votequorum
|
|
Nodes: 3
|
|
Node ID: 0x00000001
|
|
Ring ID: 1992
|
|
Quorate: Yes
|
|
|
|
Votequorum information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Expected votes: 3
|
|
Highest expected: 3
|
|
Total votes: 3
|
|
Quorum: 3
|
|
Flags: Quorate
|
|
|
|
Membership information
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
Nodeid Votes Name
|
|
0x00000001 1 192.168.15.90 (local)
|
|
0x00000002 1 192.168.15.91
|
|
0x00000003 1 192.168.15.92
|
|
----
|
|
|
|
If, for whatever reason, you want that this server joins the same
|
|
cluster again, you have to
|
|
|
|
* reinstall {pve} on it from scratch
|
|
|
|
* then join it, as explained in the previous section.
|
|
|
|
[[pvecm_separate_node_without_reinstall]]
|
|
Separate A Node Without Reinstalling
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
CAUTION: This is *not* the recommended method, proceed with caution. Use the
|
|
above mentioned method if you're unsure.
|
|
|
|
You can also separate a node from a cluster without reinstalling it from
|
|
scratch. But after removing the node from the cluster it will still have
|
|
access to the shared storages! This must be resolved before you start removing
|
|
the node from the cluster. A {pve} cluster cannot share the exact same
|
|
storage with another cluster, as storage locking doesn't work over cluster
|
|
boundary. Further, it may also lead to VMID conflicts.
|
|
|
|
Its suggested that you create a new storage where only the node which you want
|
|
to separate has access. This can be an new export on your NFS or a new Ceph
|
|
pool, to name a few examples. Its just important that the exact same storage
|
|
does not gets accessed by multiple clusters. After setting this storage up move
|
|
all data from the node and its VMs to it. Then you are ready to separate the
|
|
node from the cluster.
|
|
|
|
WARNING: Ensure all shared resources are cleanly separated! You will run into
|
|
conflicts and problems else.
|
|
|
|
First stop the corosync and the pve-cluster services on the node:
|
|
[source,bash]
|
|
----
|
|
systemctl stop pve-cluster
|
|
systemctl stop corosync
|
|
----
|
|
|
|
Start the cluster filesystem again in local mode:
|
|
[source,bash]
|
|
----
|
|
pmxcfs -l
|
|
----
|
|
|
|
Delete the corosync configuration files:
|
|
[source,bash]
|
|
----
|
|
rm /etc/pve/corosync.conf
|
|
rm /etc/corosync/*
|
|
----
|
|
|
|
You can now start the filesystem again as normal service:
|
|
[source,bash]
|
|
----
|
|
killall pmxcfs
|
|
systemctl start pve-cluster
|
|
----
|
|
|
|
The node is now separated from the cluster. You can deleted it from a remaining
|
|
node of the cluster with:
|
|
[source,bash]
|
|
----
|
|
pvecm delnode oldnode
|
|
----
|
|
|
|
If the command failed, because the remaining node in the cluster lost quorum
|
|
when the now separate node exited, you may set the expected votes to 1 as a workaround:
|
|
[source,bash]
|
|
----
|
|
pvecm expected 1
|
|
----
|
|
|
|
And the repeat the 'pvecm delnode' command.
|
|
|
|
Now switch back to the separated node, here delete all remaining files left
|
|
from the old cluster. This ensures that the node can be added to another
|
|
cluster again without problems.
|
|
|
|
[source,bash]
|
|
----
|
|
rm /var/lib/corosync/*
|
|
----
|
|
|
|
As the configuration files from the other nodes are still in the cluster
|
|
filesystem you may want to clean those up too. Remove simply the whole
|
|
directory recursive from '/etc/pve/nodes/NODENAME', but check three times that
|
|
you used the correct one before deleting it.
|
|
|
|
CAUTION: The nodes SSH keys are still in the 'authorized_key' file, this means
|
|
the nodes can still connect to each other with public key authentication. This
|
|
should be fixed by removing the respective keys from the
|
|
'/etc/pve/priv/authorized_keys' file.
|
|
|
|
Quorum
|
|
------
|
|
|
|
{pve} use a quorum-based technique to provide a consistent state among
|
|
all cluster nodes.
|
|
|
|
[quote, from Wikipedia, Quorum (distributed computing)]
|
|
____
|
|
A quorum is the minimum number of votes that a distributed transaction
|
|
has to obtain in order to be allowed to perform an operation in a
|
|
distributed system.
|
|
____
|
|
|
|
In case of network partitioning, state changes requires that a
|
|
majority of nodes are online. The cluster switches to read-only mode
|
|
if it loses quorum.
|
|
|
|
NOTE: {pve} assigns a single vote to each node by default.
|
|
|
|
Cluster Network
|
|
---------------
|
|
|
|
The cluster network is the core of a cluster. All messages sent over it have to
|
|
be delivered reliable to all nodes in their respective order. In {pve} this
|
|
part is done by corosync, an implementation of a high performance low overhead
|
|
high availability development toolkit. It serves our decentralized
|
|
configuration file system (`pmxcfs`).
|
|
|
|
[[cluster-network-requirements]]
|
|
Network Requirements
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
This needs a reliable network with latencies under 2 milliseconds (LAN
|
|
performance) to work properly. While corosync can also use unicast for
|
|
communication between nodes its **highly recommended** to have a multicast
|
|
capable network. The network should not be used heavily by other members,
|
|
ideally corosync runs on its own network.
|
|
*never* share it with network where storage communicates too.
|
|
|
|
Before setting up a cluster it is good practice to check if the network is fit
|
|
for that purpose.
|
|
|
|
* Ensure that all nodes are in the same subnet. This must only be true for the
|
|
network interfaces used for cluster communication (corosync).
|
|
|
|
* Ensure all nodes can reach each other over those interfaces, using `ping` is
|
|
enough for a basic test.
|
|
|
|
* Ensure that multicast works in general and a high package rates. This can be
|
|
done with the `omping` tool. The final "%loss" number should be < 1%.
|
|
+
|
|
[source,bash]
|
|
----
|
|
omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...
|
|
----
|
|
|
|
* Ensure that multicast communication works over an extended period of time.
|
|
This uncovers problems where IGMP snooping is activated on the network but
|
|
no multicast querier is active. This test has a duration of around 10
|
|
minutes.
|
|
+
|
|
[source,bash]
|
|
----
|
|
omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
|
|
----
|
|
|
|
Your network is not ready for clustering if any of these test fails. Recheck
|
|
your network configuration. Especially switches are notorious for having
|
|
multicast disabled by default or IGMP snooping enabled with no IGMP querier
|
|
active.
|
|
|
|
In smaller cluster its also an option to use unicast if you really cannot get
|
|
multicast to work.
|
|
|
|
Separate Cluster Network
|
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When creating a cluster without any parameters the cluster network is generally
|
|
shared with the Web UI and the VMs and its traffic. Depending on your setup
|
|
even storage traffic may get sent over the same network. Its recommended to
|
|
change that, as corosync is a time critical real time application.
|
|
|
|
Setting Up A New Network
|
|
^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
First you have to setup a new network interface. It should be on a physical
|
|
separate network. Ensure that your network fulfills the
|
|
<<cluster-network-requirements,cluster network requirements>>.
|
|
|
|
Separate On Cluster Creation
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
This is possible through the 'ring0_addr' and 'bindnet0_addr' parameter of
|
|
the 'pvecm create' command used for creating a new cluster.
|
|
|
|
If you have setup an additional NIC with a static address on 10.10.10.1/25
|
|
and want to send and receive all cluster communication over this interface
|
|
you would execute:
|
|
|
|
[source,bash]
|
|
----
|
|
pvecm create test --ring0_addr 10.10.10.1 --bindnet0_addr 10.10.10.0
|
|
----
|
|
|
|
To check if everything is working properly execute:
|
|
[source,bash]
|
|
----
|
|
systemctl status corosync
|
|
----
|
|
|
|
Afterwards, proceed as descripted in the section to
|
|
<<adding-nodes-with-separated-cluster-network,add nodes with a separated cluster network>>.
|
|
|
|
[[separate-cluster-net-after-creation]]
|
|
Separate After Cluster Creation
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
You can do this also if you have already created a cluster and want to switch
|
|
its communication to another network, without rebuilding the whole cluster.
|
|
This change may lead to short durations of quorum loss in the cluster, as nodes
|
|
have to restart corosync and come up one after the other on the new network.
|
|
|
|
Check how to <<edit-corosync-conf,edit the corosync.conf file>> first.
|
|
The open it and you should see a file similar to:
|
|
|
|
----
|
|
logging {
|
|
debug: off
|
|
to_syslog: yes
|
|
}
|
|
|
|
nodelist {
|
|
|
|
node {
|
|
name: due
|
|
nodeid: 2
|
|
quorum_votes: 1
|
|
ring0_addr: due
|
|
}
|
|
|
|
node {
|
|
name: tre
|
|
nodeid: 3
|
|
quorum_votes: 1
|
|
ring0_addr: tre
|
|
}
|
|
|
|
node {
|
|
name: uno
|
|
nodeid: 1
|
|
quorum_votes: 1
|
|
ring0_addr: uno
|
|
}
|
|
|
|
}
|
|
|
|
quorum {
|
|
provider: corosync_votequorum
|
|
}
|
|
|
|
totem {
|
|
cluster_name: thomas-testcluster
|
|
config_version: 3
|
|
ip_version: ipv4
|
|
secauth: on
|
|
version: 2
|
|
interface {
|
|
bindnetaddr: 192.168.30.50
|
|
ringnumber: 0
|
|
}
|
|
|
|
}
|
|
----
|
|
|
|
The first you want to do is add the 'name' properties in the node entries if
|
|
you do not see them already. Those *must* match the node name.
|
|
|
|
Then replace the address from the 'ring0_addr' properties with the new
|
|
addresses. You may use plain IP addresses or also hostnames here. If you use
|
|
hostnames ensure that they are resolvable from all nodes.
|
|
|
|
In my example I want to switch my cluster communication to the 10.10.10.1/25
|
|
network. So I replace all 'ring0_addr' respectively. I also set the bindnetaddr
|
|
in the totem section of the config to an address of the new network. It can be
|
|
any address from the subnet configured on the new network interface.
|
|
|
|
After you increased the 'config_version' property the new configuration file
|
|
should look like:
|
|
|
|
----
|
|
|
|
logging {
|
|
debug: off
|
|
to_syslog: yes
|
|
}
|
|
|
|
nodelist {
|
|
|
|
node {
|
|
name: due
|
|
nodeid: 2
|
|
quorum_votes: 1
|
|
ring0_addr: 10.10.10.2
|
|
}
|
|
|
|
node {
|
|
name: tre
|
|
nodeid: 3
|
|
quorum_votes: 1
|
|
ring0_addr: 10.10.10.3
|
|
}
|
|
|
|
node {
|
|
name: uno
|
|
nodeid: 1
|
|
quorum_votes: 1
|
|
ring0_addr: 10.10.10.1
|
|
}
|
|
|
|
}
|
|
|
|
quorum {
|
|
provider: corosync_votequorum
|
|
}
|
|
|
|
totem {
|
|
cluster_name: thomas-testcluster
|
|
config_version: 4
|
|
ip_version: ipv4
|
|
secauth: on
|
|
version: 2
|
|
interface {
|
|
bindnetaddr: 10.10.10.1
|
|
ringnumber: 0
|
|
}
|
|
|
|
}
|
|
----
|
|
|
|
Now after a final check whether all changed information is correct we save it
|
|
and see again the <<edit-corosync-conf,edit corosync.conf file>> section to
|
|
learn how to bring it in effect.
|
|
|
|
As our change cannot be enforced live from corosync we have to do an restart.
|
|
|
|
On a single node execute:
|
|
[source,bash]
|
|
----
|
|
systemctl restart corosync
|
|
----
|
|
|
|
Now check if everything is fine:
|
|
|
|
[source,bash]
|
|
----
|
|
systemctl status corosync
|
|
----
|
|
|
|
If corosync runs again correct restart corosync also on all other nodes.
|
|
They will then join the cluster membership one by one on the new network.
|
|
|
|
[[pvecm_rrp]]
|
|
Redundant Ring Protocol
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
To avoid a single point of failure you should implement counter measurements.
|
|
This can be on the hardware and operating system level through network bonding.
|
|
|
|
Corosync itself offers also a possibility to add redundancy through the so
|
|
called 'Redundant Ring Protocol'. This protocol allows running a second totem
|
|
ring on another network, this network should be physically separated from the
|
|
other rings network to actually increase availability.
|
|
|
|
RRP On Cluster Creation
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The 'pvecm create' command provides the additional parameters 'bindnetX_addr',
|
|
'ringX_addr' and 'rrp_mode', can be used for RRP configuration.
|
|
|
|
NOTE: See the <<corosync-conf-glossary,glossary>> if you do not know what each parameter means.
|
|
|
|
So if you have two networks, one on the 10.10.10.1/24 and the other on the
|
|
10.10.20.1/24 subnet you would execute:
|
|
|
|
[source,bash]
|
|
----
|
|
pvecm create CLUSTERNAME -bindnet0_addr 10.10.10.1 -ring0_addr 10.10.10.1 \
|
|
-bindnet1_addr 10.10.20.1 -ring1_addr 10.10.20.1
|
|
----
|
|
|
|
RRP On Existing Clusters
|
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
You will take similar steps as described in
|
|
<<separate-cluster-net-after-creation,separating the cluster network>> to
|
|
enable RRP on an already running cluster. The single difference is, that you
|
|
will add `ring1` and use it instead of `ring0`.
|
|
|
|
First add a new `interface` subsection in the `totem` section, set its
|
|
`ringnumber` property to `1`. Set the interfaces `bindnetaddr` property to an
|
|
address of the subnet you have configured for your new ring.
|
|
Further set the `rrp_mode` to `passive`, this is the only stable mode.
|
|
|
|
Then add to each node entry in the `nodelist` section its new `ring1_addr`
|
|
property with the nodes additional ring address.
|
|
|
|
So if you have two networks, one on the 10.10.10.1/24 and the other on the
|
|
10.10.20.1/24 subnet, the final configuration file should look like:
|
|
|
|
----
|
|
totem {
|
|
cluster_name: tweak
|
|
config_version: 9
|
|
ip_version: ipv4
|
|
rrp_mode: passive
|
|
secauth: on
|
|
version: 2
|
|
interface {
|
|
bindnetaddr: 10.10.10.1
|
|
ringnumber: 0
|
|
}
|
|
interface {
|
|
bindnetaddr: 10.10.20.1
|
|
ringnumber: 1
|
|
}
|
|
}
|
|
|
|
nodelist {
|
|
node {
|
|
name: pvecm1
|
|
nodeid: 1
|
|
quorum_votes: 1
|
|
ring0_addr: 10.10.10.1
|
|
ring1_addr: 10.10.20.1
|
|
}
|
|
|
|
node {
|
|
name: pvecm2
|
|
nodeid: 2
|
|
quorum_votes: 1
|
|
ring0_addr: 10.10.10.2
|
|
ring1_addr: 10.10.20.2
|
|
}
|
|
|
|
[...] # other cluster nodes here
|
|
}
|
|
|
|
[...] # other remaining config sections here
|
|
|
|
----
|
|
|
|
Bring it in effect like described in the
|
|
<<edit-corosync-conf,edit the corosync.conf file>> section.
|
|
|
|
This is a change which cannot take live in effect and needs at least a restart
|
|
of corosync. Recommended is a restart of the whole cluster.
|
|
|
|
If you cannot reboot the whole cluster ensure no High Availability services are
|
|
configured and the stop the corosync service on all nodes. After corosync is
|
|
stopped on all nodes start it one after the other again.
|
|
|
|
Corosync Configuration
|
|
----------------------
|
|
|
|
The `/etc/pve/corosync.conf` file plays a central role in {pve} cluster. It
|
|
controls the cluster member ship and its network.
|
|
For reading more about it check the corosync.conf man page:
|
|
[source,bash]
|
|
----
|
|
man corosync.conf
|
|
----
|
|
|
|
For node membership you should always use the `pvecm` tool provided by {pve}.
|
|
You may have to edit the configuration file manually for other changes.
|
|
Here are a few best practice tips for doing this.
|
|
|
|
[[edit-corosync-conf]]
|
|
Edit corosync.conf
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
Editing the corosync.conf file can be not always straight forward. There are
|
|
two on each cluster, one in `/etc/pve/corosync.conf` and the other in
|
|
`/etc/corosync/corosync.conf`. Editing the one in our cluster file system will
|
|
propagate the changes to the local one, but not vice versa.
|
|
|
|
The configuration will get updated automatically as soon as the file changes.
|
|
This means changes which can be integrated in a running corosync will take
|
|
instantly effect. So you should always make a copy and edit that instead, to
|
|
avoid triggering some unwanted changes by an in between safe.
|
|
|
|
[source,bash]
|
|
----
|
|
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
|
|
----
|
|
|
|
Then open the Config file with your favorite editor, `nano` and `vim.tiny` are
|
|
preinstalled on {pve} for example.
|
|
|
|
NOTE: Always increment the 'config_version' number on configuration changes,
|
|
omitting this can lead to problems.
|
|
|
|
After making the necessary changes create another copy of the current working
|
|
configuration file. This serves as a backup if the new configuration fails to
|
|
apply or makes problems in other ways.
|
|
|
|
[source,bash]
|
|
----
|
|
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
|
|
----
|
|
|
|
Then move the new configuration file over the old one:
|
|
[source,bash]
|
|
----
|
|
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
|
|
----
|
|
|
|
You may check with the commands
|
|
[source,bash]
|
|
----
|
|
systemctl status corosync
|
|
journalctl -b -u corosync
|
|
----
|
|
|
|
If the change could applied automatically. If not you may have to restart the
|
|
corosync service via:
|
|
[source,bash]
|
|
----
|
|
systemctl restart corosync
|
|
----
|
|
|
|
On errors check the troubleshooting section below.
|
|
|
|
Troubleshooting
|
|
~~~~~~~~~~~~~~~
|
|
|
|
Issue: 'quorum.expected_votes must be configured'
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
When corosync starts to fail and you get the following message in the system log:
|
|
|
|
----
|
|
[...]
|
|
corosync[1647]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
|
|
corosync[1647]: [SERV ] Service engine 'corosync_quorum' failed to load for reason
|
|
'configuration error: nodelist or quorum.expected_votes must be configured!'
|
|
[...]
|
|
----
|
|
|
|
It means that the hostname you set for corosync 'ringX_addr' in the
|
|
configuration could not be resolved.
|
|
|
|
|
|
Write Configuration When Not Quorate
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
If you need to change '/etc/pve/corosync.conf' on an node with no quorum, and you
|
|
know what you do, use:
|
|
[source,bash]
|
|
----
|
|
pvecm expected 1
|
|
----
|
|
|
|
This sets the expected vote count to 1 and makes the cluster quorate. You can
|
|
now fix your configuration, or revert it back to the last working backup.
|
|
|
|
This is not enough if corosync cannot start anymore. Here its best to edit the
|
|
local copy of the corosync configuration in '/etc/corosync/corosync.conf' so
|
|
that corosync can start again. Ensure that on all nodes this configuration has
|
|
the same content to avoid split brains. If you are not sure what went wrong
|
|
it's best to ask the Proxmox Community to help you.
|
|
|
|
|
|
[[corosync-conf-glossary]]
|
|
Corosync Configuration Glossary
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
ringX_addr::
|
|
This names the different ring addresses for the corosync totem rings used for
|
|
the cluster communication.
|
|
|
|
bindnetaddr::
|
|
Defines to which interface the ring should bind to. It may be any address of
|
|
the subnet configured on the interface we want to use. In general its the
|
|
recommended to just use an address a node uses on this interface.
|
|
|
|
rrp_mode::
|
|
Specifies the mode of the redundant ring protocol and may be passive, active or
|
|
none. Note that use of active is highly experimental and not official
|
|
supported. Passive is the preferred mode, it may double the cluster
|
|
communication throughput and increases availability.
|
|
|
|
|
|
Cluster Cold Start
|
|
------------------
|
|
|
|
It is obvious that a cluster is not quorate when all nodes are
|
|
offline. This is a common case after a power failure.
|
|
|
|
NOTE: It is always a good idea to use an uninterruptible power supply
|
|
(``UPS'', also called ``battery backup'') to avoid this state, especially if
|
|
you want HA.
|
|
|
|
On node startup, the `pve-guests` service is started and waits for
|
|
quorum. Once quorate, it starts all guests which have the `onboot`
|
|
flag set.
|
|
|
|
When you turn on nodes, or when power comes back after power failure,
|
|
it is likely that some nodes boots faster than others. Please keep in
|
|
mind that guest startup is delayed until you reach quorum.
|
|
|
|
|
|
Guest Migration
|
|
---------------
|
|
|
|
Migrating virtual guests to other nodes is a useful feature in a
|
|
cluster. There are settings to control the behavior of such
|
|
migrations. This can be done via the configuration file
|
|
`datacenter.cfg` or for a specific migration via API or command line
|
|
parameters.
|
|
|
|
It makes a difference if a Guest is online or offline, or if it has
|
|
local resources (like a local disk).
|
|
|
|
For Details about Virtual Machine Migration see the
|
|
xref:qm_migration[QEMU/KVM Migration Chapter]
|
|
|
|
For Details about Container Migration see the
|
|
xref:pct_migration[Container Migration Chapter]
|
|
|
|
Migration Type
|
|
~~~~~~~~~~~~~~
|
|
|
|
The migration type defines if the migration data should be sent over an
|
|
encrypted (`secure`) channel or an unencrypted (`insecure`) one.
|
|
Setting the migration type to insecure means that the RAM content of a
|
|
virtual guest gets also transferred unencrypted, which can lead to
|
|
information disclosure of critical data from inside the guest (for
|
|
example passwords or encryption keys).
|
|
|
|
Therefore, we strongly recommend using the secure channel if you do
|
|
not have full control over the network and can not guarantee that no
|
|
one is eavesdropping to it.
|
|
|
|
NOTE: Storage migration does not follow this setting. Currently, it
|
|
always sends the storage content over a secure channel.
|
|
|
|
Encryption requires a lot of computing power, so this setting is often
|
|
changed to "unsafe" to achieve better performance. The impact on
|
|
modern systems is lower because they implement AES encryption in
|
|
hardware. The performance impact is particularly evident in fast
|
|
networks where you can transfer 10 Gbps or more.
|
|
|
|
|
|
Migration Network
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
By default, {pve} uses the network in which cluster communication
|
|
takes place to send the migration traffic. This is not optimal because
|
|
sensitive cluster traffic can be disrupted and this network may not
|
|
have the best bandwidth available on the node.
|
|
|
|
Setting the migration network parameter allows the use of a dedicated
|
|
network for the entire migration traffic. In addition to the memory,
|
|
this also affects the storage traffic for offline migrations.
|
|
|
|
The migration network is set as a network in the CIDR notation. This
|
|
has the advantage that you do not have to set individual IP addresses
|
|
for each node. {pve} can determine the real address on the
|
|
destination node from the network specified in the CIDR form. To
|
|
enable this, the network must be specified so that each node has one,
|
|
but only one IP in the respective network.
|
|
|
|
|
|
Example
|
|
^^^^^^^
|
|
|
|
We assume that we have a three-node setup with three separate
|
|
networks. One for public communication with the Internet, one for
|
|
cluster communication and a very fast one, which we want to use as a
|
|
dedicated network for migration.
|
|
|
|
A network configuration for such a setup might look as follows:
|
|
|
|
----
|
|
iface eno1 inet manual
|
|
|
|
# public network
|
|
auto vmbr0
|
|
iface vmbr0 inet static
|
|
address 192.X.Y.57
|
|
netmask 255.255.250.0
|
|
gateway 192.X.Y.1
|
|
bridge_ports eno1
|
|
bridge_stp off
|
|
bridge_fd 0
|
|
|
|
# cluster network
|
|
auto eno2
|
|
iface eno2 inet static
|
|
address 10.1.1.1
|
|
netmask 255.255.255.0
|
|
|
|
# fast network
|
|
auto eno3
|
|
iface eno3 inet static
|
|
address 10.1.2.1
|
|
netmask 255.255.255.0
|
|
----
|
|
|
|
Here, we will use the network 10.1.2.0/24 as a migration network. For
|
|
a single migration, you can do this using the `migration_network`
|
|
parameter of the command line tool:
|
|
|
|
----
|
|
# qm migrate 106 tre --online --migration_network 10.1.2.0/24
|
|
----
|
|
|
|
To configure this as the default network for all migrations in the
|
|
cluster, set the `migration` property of the `/etc/pve/datacenter.cfg`
|
|
file:
|
|
|
|
----
|
|
# use dedicated migration network
|
|
migration: secure,network=10.1.2.0/24
|
|
----
|
|
|
|
NOTE: The migration type must always be set when the migration network
|
|
gets set in `/etc/pve/datacenter.cfg`.
|
|
|
|
|
|
ifdef::manvolnum[]
|
|
include::pve-copyright.adoc[]
|
|
endif::manvolnum[]
|