mirror of
git://git.proxmox.com/git/pve-docs.git
synced 2025-01-21 18:03:45 +03:00
f226da0ef4
Signed-off-by: Matthias Heiserer <m.heiserer@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
1050 lines
37 KiB
Plaintext
[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Deploy Hyper-Converged Ceph Cluster
===================================
:pve-toplevel:
endif::manvolnum[]
|
||
|
||
[thumbnail="screenshot/gui-ceph-status-dashboard.png"]
|
||
|
||
{pve} unifies your compute and storage systems, that is, you can use the same
|
||
physical nodes within a cluster for both computing (processing VMs and
|
||
containers) and replicated storage. The traditional silos of compute and
|
||
storage resources can be wrapped up into a single hyper-converged appliance.
|
||
Separate storage networks (SANs) and connections via network attached storage
|
||
(NAS) disappear. With the integration of Ceph, an open source software-defined
|
||
storage platform, {pve} has the ability to run and manage Ceph storage directly
|
||
on the hypervisor nodes.
|
||
|
||
Ceph is a distributed object store and file system designed to provide
|
||
excellent performance, reliability and scalability.
|
||
|
||
.Some advantages of Ceph on {pve} are:
|
||
- Easy setup and management via CLI and GUI
|
||
- Thin provisioning
|
||
- Snapshot support
|
||
- Self healing
|
||
- Scalable to the exabyte level
|
||
- Setup pools with different performance and redundancy characteristics
|
||
- Data is replicated, making it fault tolerant
|
||
- Runs on commodity hardware
|
||
- No need for hardware RAID controllers
|
||
- Open source
|
||
|
||
For small to medium-sized deployments, it is possible to install a Ceph server for
|
||
RADOS Block Devices (RBD) directly on your {pve} cluster nodes (see
|
||
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]). Recent
|
||
hardware has a lot of CPU power and RAM, so running storage services
|
||
and VMs on the same node is possible.
|
||
|
||
To simplify management, we provide 'pveceph' - a tool for installing and
|
||
managing {ceph} services on {pve} nodes.
|
||
|
||
.For use as RBD storage, Ceph consists of multiple daemons:
|
||
- Ceph Monitor (ceph-mon)
|
||
- Ceph Manager (ceph-mgr)
|
||
- Ceph OSD (ceph-osd; Object Storage Daemon)
|
||
|
||
TIP: We highly recommend getting familiar with Ceph
|
||
footnote:[Ceph intro {cephdocs-url}/start/intro/],
|
||
its architecture
|
||
footnote:[Ceph architecture {cephdocs-url}/architecture/]
|
||
and vocabulary
|
||
footnote:[Ceph glossary {cephdocs-url}/glossary].
|
||
|
||
|
||
Precondition
|
||
------------
|
||
|
||
To build a hyper-converged Proxmox + Ceph Cluster, you must use at least
three, preferably identical, servers for the setup.
|
||
|
||
Also check the recommendations from
|
||
{cephdocs-url}/start/hardware-recommendations/[Ceph's website].
|
||
|
||
.CPU
|
||
A high CPU core frequency reduces latency and should be preferred. As a simple
|
||
rule of thumb, you should assign a CPU core (or thread) to each Ceph service to
|
||
provide enough resources for stable and durable Ceph performance.
|
||
|
||
.Memory
|
||
Especially in a hyper-converged setup, the memory consumption needs to be
|
||
carefully monitored. In addition to the predicted memory usage of virtual
|
||
machines and containers, you must also account for having enough memory
|
||
available for Ceph to provide excellent and stable performance.
|
||
|
||
As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
by an OSD, especially during recovery, rebalancing or backfilling.
|
||
|
||
The daemon itself will use additional memory. The Bluestore backend of the
|
||
daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
|
||
legacy Filestore backend uses the OS page cache and the memory consumption is
|
||
generally related to PGs of an OSD daemon.
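
The two rules of thumb above can be combined into a rough per-node estimate.
The following is only a sketch: the function name is made up for this example,
and the 4 GiB BlueStore allowance is simply a value picked from the 3-5 GiB
range, not a measured figure.

[source,bash]
----
#!/usr/bin/env bash
# Rough per-node memory estimate for Ceph OSDs (sketch, not a sizing tool).
# Assumptions: 1 GiB per TiB of OSD capacity (rule of thumb from the text)
# plus a fixed BlueStore allowance per OSD (default 4 GiB, an assumed value).
estimate_osd_memory_gib() {
    local osd_count=$1 tib_per_osd=$2 bluestore_gib=${3:-4}
    echo $(( osd_count * (tib_per_osd + bluestore_gib) ))
}

# Example: 4 OSDs of 2 TiB each on one node
estimate_osd_memory_gib 4 2
----

The result is in GiB and covers the OSDs only. Memory for virtual machines,
containers and the other Ceph services has to be added on top.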
|
||
|
||
.Network
|
||
We recommend a network bandwidth of at least 10 GbE, used
exclusively for Ceph. A meshed network setup
|
||
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
|
||
is also an option if there are no 10 GbE switches available.
|
||
|
||
The volume of traffic, especially during recovery, will interfere with other
|
||
services on the same network and may even break the {pve} cluster stack.
|
||
|
||
Furthermore, you should estimate your bandwidth needs. While one HDD might not
|
||
saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will
|
||
even saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even
|
||
more bandwidth will ensure that this isn't your bottleneck and won't be anytime
|
||
soon. 25, 40 or even 100 Gbps are possible.
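
To get a feeling for these numbers, you can compare the combined throughput of
the OSD disks in one node with the link speed. The sketch below uses made-up
helper names, and the per-disk throughput is a value you have to supply
yourself, for example from a benchmark of your own disks.

[source,bash]
----
#!/usr/bin/env bash
# Aggregate throughput of all OSD disks on a node in Mbit/s.
# mbyte_per_sec is the (assumed) sustained throughput of a single disk.
aggregate_mbit() {
    local osd_count=$1 mbyte_per_sec=$2
    echo $(( osd_count * mbyte_per_sec * 8 ))
}

# Succeeds (exit 0) if the disks can saturate a link of the given speed.
can_saturate_link() {
    local osd_count=$1 mbyte_per_sec=$2 link_gbit=$3
    (( $(aggregate_mbit "$osd_count" "$mbyte_per_sec") >= link_gbit * 1000 ))
}

# Example: four disks at a hypothetical 100 MB/s each on a 1 Gbit/s link
if can_saturate_link 4 100 1; then
    echo "disks can saturate the link"
fi
----

This ignores replication and protocol overhead, so treat it as a lower bound
for the network capacity you should plan for.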
|
||
|
||
.Disks
|
||
When planning the size of your Ceph cluster, it is important to take the
|
||
recovery time into consideration. Especially with small clusters, recovery
|
||
might take long. It is recommended that you use SSDs instead of HDDs in small
|
||
setups to reduce recovery time, minimizing the likelihood of a subsequent
|
||
failure event during recovery.
|
||
|
||
In general, SSDs will provide more IOPS than spinning disks. With this in mind,
|
||
in addition to the higher cost, it may make sense to implement a
|
||
xref:pve_ceph_device_classes[class based] separation of pools. Another way to
|
||
speed up OSDs is to use a faster disk as a journal or
|
||
DB/**W**rite-**A**head-**L**og device, see
|
||
xref:pve_ceph_osds[creating Ceph OSDs].
|
||
If a faster disk is used for multiple OSDs, a proper balance between OSD
|
||
and WAL / DB (or journal) disk must be selected, otherwise the faster disk
|
||
becomes the bottleneck for all linked OSDs.
|
||
|
||
Aside from the disk type, Ceph performs best with an evenly sized and evenly
distributed number of disks per node. For example, 4 x 500 GB disks within each
node is better than a mixed setup with a single 1 TB and three 250 GB disks.
|
||
|
||
You also need to balance OSD count and single OSD capacity. More capacity
|
||
allows you to increase storage density, but it also means that a single OSD
|
||
failure forces Ceph to recover more data at once.
|
||
|
||
.Avoid RAID
|
||
As Ceph handles data object redundancy and multiple parallel writes to disks
|
||
(OSDs) on its own, using a RAID controller normally doesn’t improve
|
||
performance or availability. On the contrary, Ceph is designed to handle whole
|
||
disks on its own, without any abstraction in between. RAID controllers are not
|
||
designed for the Ceph workload and may complicate things and sometimes even
|
||
reduce performance, as their write and caching algorithms may interfere with
|
||
the ones from Ceph.
|
||
|
||
WARNING: Avoid RAID controllers. Use a host bus adapter (HBA) instead.
|
||
|
||
NOTE: The above recommendations should be seen as rough guidance for choosing
hardware. It is still essential to adapt them to your specific needs.
|
||
You should test your setup and monitor health and performance continuously.
|
||
|
||
[[pve_ceph_install_wizard]]
|
||
Initial Ceph Installation & Configuration
|
||
-----------------------------------------
|
||
|
||
Using the Web-based Wizard
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
[thumbnail="screenshot/gui-node-ceph-install.png"]
|
||
|
||
With {pve} you have the benefit of an easy to use installation wizard
|
||
for Ceph. Click on one of your cluster nodes and navigate to the Ceph
|
||
section in the menu tree. If Ceph is not already installed, you will see a
|
||
prompt offering to do so.
|
||
|
||
The wizard is divided into multiple sections, where each needs to
|
||
finish successfully, in order to use Ceph.
|
||
|
||
First, you need to choose which Ceph version you want to install. Prefer the one
from your other nodes, or the newest if this is the first node on which you
install Ceph.
|
||
|
||
After starting the installation, the wizard will download and install all the
|
||
required packages from {pve}'s Ceph repository.
|
||
[thumbnail="screenshot/gui-node-ceph-install-wizard-step0.png"]
|
||
|
||
After finishing the installation step, you will need to create a configuration.
|
||
This step is only needed once per cluster, as this configuration is distributed
|
||
automatically to all remaining cluster members through {pve}'s clustered
|
||
xref:chapter_pmxcfs[configuration file system (pmxcfs)].
|
||
|
||
The configuration step includes the following settings:
|
||
|
||
* *Public Network:* You can set up a dedicated network for Ceph. This
|
||
setting is required. Separating your Ceph traffic is highly recommended.
|
||
Otherwise, it could cause trouble with other latency-dependent services, for
example cluster communication, or decrease Ceph's performance.
|
||
|
||
[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]
|
||
|
||
* *Cluster Network:* As an optional step, you can go even further and
|
||
separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
|
||
as well. This will relieve the public network and could lead to
|
||
significant performance improvements, especially in large clusters.
|
||
|
||
You have two more options which are considered advanced and therefore
|
||
should only be changed if you know what you are doing.
|
||
|
||
* *Number of replicas*: Defines how often an object is replicated
|
||
* *Minimum replicas*: Defines the minimum number of required replicas
|
||
for I/O to be marked as complete.
|
||
|
||
Additionally, you need to choose your first monitor node. This step is required.
|
||
|
||
That's it. You should now see a success page as the last step, with further
|
||
instructions on how to proceed. Your system is now ready to start using Ceph.
|
||
To get started, you will need to create some additional xref:pve_ceph_monitors[monitors],
|
||
xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].
|
||
|
||
The rest of this chapter will guide you through getting the most out of
|
||
your {pve} based Ceph setup. This includes the aforementioned tips and
|
||
more, such as xref:pveceph_fs[CephFS], which is a helpful addition to your
|
||
new Ceph cluster.
|
||
|
||
[[pve_ceph_install]]
|
||
CLI Installation of Ceph Packages
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
As an alternative to the recommended {pve} Ceph installation wizard available
|
||
in the web-interface, you can use the following CLI command on each node:
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph install
|
||
----
|
||
|
||
This sets up an `apt` package repository in
|
||
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
|
||
|
||
|
||
Initial Ceph configuration via CLI
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Use the {pve} Ceph installation wizard (recommended) or run the
|
||
following command on one node:
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph init --network 10.10.10.0/24
|
||
----
|
||
|
||
This creates an initial configuration at `/etc/pve/ceph.conf` with a
|
||
dedicated network for Ceph. This file is automatically distributed to
|
||
all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
|
||
creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
|
||
Thus, you can simply run Ceph commands without the need to specify a
|
||
configuration file.
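
If Ceph commands cannot find their configuration, it can help to check that
the symbolic link is in place. The following is a minimal sketch. The function
name is hypothetical, and both paths are parameters that default to the
locations named above.

[source,bash]
----
#!/usr/bin/env bash
# Check that a Ceph config path is a symlink resolving to the cluster-wide
# file. Defaults are the two paths mentioned in the text.
check_ceph_conf_link() {
    local link=${1:-/etc/ceph/ceph.conf} target=${2:-/etc/pve/ceph.conf}
    [ -L "$link" ] && [ -e "$target" ] &&
        [ "$(readlink -f "$link")" = "$(readlink -f "$target")" ]
}

if check_ceph_conf_link; then
    echo "ceph.conf symlink OK"
else
    echo "ceph.conf symlink missing or pointing elsewhere"
fi
----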
|
||
|
||
|
||
[[pve_ceph_monitors]]
|
||
Ceph Monitor
|
||
------------
|
||
|
||
[thumbnail="screenshot/gui-ceph-monitor.png"]
|
||
|
||
The Ceph Monitor (MON)
|
||
footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
|
||
maintains a master copy of the cluster map. For high availability, you need at
|
||
least 3 monitors. One monitor will already be installed if you
|
||
used the installation wizard. You won't need more than 3 monitors, as long
|
||
as your cluster is small to medium-sized. Only really large clusters will
|
||
require more than this.
|
||
|
||
[[pveceph_create_mon]]
|
||
Create Monitors
|
||
~~~~~~~~~~~~~~~
|
||
|
||
On each node where you want to place a monitor (three monitors are recommended),
|
||
create one by using the 'Ceph -> Monitor' tab in the GUI or run:
|
||
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph mon create
|
||
----
|
||
|
||
[[pveceph_destroy_mon]]
|
||
Destroy Monitors
|
||
~~~~~~~~~~~~~~~~
|
||
|
||
To remove a Ceph Monitor via the GUI, first select a node in the tree view and
|
||
go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
|
||
button.
|
||
|
||
To remove a Ceph Monitor via the CLI, first connect to the node on which the MON
|
||
is running. Then execute the following command:
|
||
[source,bash]
|
||
----
|
||
pveceph mon destroy
|
||
----
|
||
|
||
NOTE: At least three Monitors are needed for quorum.
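
The reasoning behind this number is that monitors form a quorum by majority.
The sketch below assumes this simple majority rule and uses made-up function
names. It shows how many monitors must be up and how many may fail for a given
monitor count.

[source,bash]
----
#!/usr/bin/env bash
# Monitors needed for a majority, and how many may fail, for N monitors.
mon_majority()  { echo $(( $1 / 2 + 1 )); }
mon_tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
    echo "monitors=$n majority=$(mon_majority "$n") tolerated_failures=$(mon_tolerated "$n")"
done
----

With one or two monitors, no failure can be tolerated. Three is the smallest
count that survives the loss of a single monitor.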
|
||
|
||
|
||
[[pve_ceph_manager]]
|
||
Ceph Manager
|
||
------------
|
||
|
||
The Manager daemon runs alongside the monitors. It provides an interface to
|
||
monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
|
||
footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
|
||
required.
|
||
|
||
[[pveceph_create_mgr]]
|
||
Create Manager
|
||
~~~~~~~~~~~~~~
|
||
|
||
Multiple Managers can be installed, but only one Manager is active at any given
|
||
time.
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph mgr create
|
||
----
|
||
|
||
NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
|
||
high availability, install more than one manager.
|
||
|
||
|
||
[[pveceph_destroy_mgr]]
|
||
Destroy Manager
|
||
~~~~~~~~~~~~~~~
|
||
|
||
To remove a Ceph Manager via the GUI, first select a node in the tree view and
|
||
go to the **Ceph -> Monitor** panel. Select the Manager and click the
|
||
**Destroy** button.
|
||
|
||
To remove a Ceph Manager via the CLI, first connect to the node on which the
|
||
Manager is running. Then execute the following command:
|
||
[source,bash]
|
||
----
|
||
pveceph mgr destroy
|
||
----
|
||
|
||
NOTE: While a manager is not a hard-dependency, it is crucial for a Ceph cluster,
|
||
as it handles important features like PG-autoscaling, device health monitoring,
|
||
telemetry and more.
|
||
|
||
[[pve_ceph_osds]]
|
||
Ceph OSDs
|
||
---------
|
||
|
||
[thumbnail="screenshot/gui-ceph-osd-status.png"]
|
||
|
||
Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
|
||
network. It is recommended to use one OSD per physical disk.
|
||
|
||
[[pve_ceph_osd_create]]
|
||
Create OSDs
|
||
~~~~~~~~~~~
|
||
|
||
You can create an OSD either via the {pve} web-interface or via the CLI using
|
||
`pveceph`. For example:
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph osd create /dev/sd[X]
|
||
----
|
||
|
||
TIP: We recommend a Ceph cluster with at least three nodes and at least 12
|
||
OSDs, evenly distributed among the nodes.
|
||
|
||
If the disk was in use before (for example, for ZFS or as an OSD) you first need
|
||
to zap all traces of that usage. To remove the partition table, boot sector and
|
||
any other OSD leftover, you can use the following command:
|
||
|
||
[source,bash]
|
||
----
|
||
ceph-volume lvm zap /dev/sd[X] --destroy
|
||
----
|
||
|
||
WARNING: The above command will destroy all data on the disk!
|
||
|
||
.Ceph Bluestore
|
||
|
||
Starting with the Ceph Kraken release, a new Ceph OSD storage type was
|
||
introduced called Bluestore
|
||
footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
|
||
This is the default when creating OSDs since Ceph Luminous.
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph osd create /dev/sd[X]
|
||
----
|
||
|
||
.Block.db and block.wal
|
||
|
||
If you want to use a separate DB/WAL device for your OSDs, you can specify it
|
||
through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
|
||
not specified separately.
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
|
||
----
|
||
|
||
You can directly choose the size of those with the '-db_size' and '-wal_size'
|
||
parameters respectively. If they are not given, the following values (in order)
|
||
will be used:
|
||
|
||
* bluestore_block_{db,wal}_size from Ceph configuration...
|
||
** ... database, section 'osd'
|
||
** ... database, section 'global'
|
||
** ... file, section 'osd'
|
||
** ... file, section 'global'
|
||
* 10% (DB)/1% (WAL) of OSD size
|
||
|
||
NOTE: The DB stores BlueStore’s internal metadata, and the WAL is BlueStore’s
|
||
internal journal or write-ahead log. It is recommended to use a fast SSD or
|
||
NVRAM for better performance.
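
As a quick check of the last fallback in the list above, the following sketch
computes the 10% (DB) and 1% (WAL) values for a given OSD size. The function
names are made up for this example, and sizes are handled in whole GiB with
integer division, which is a simplification.

[source,bash]
----
#!/usr/bin/env bash
# Fallback sizes used when neither -db_size/-wal_size nor a
# bluestore_block_{db,wal}_size setting is present.
# Input and output are in GiB; integer division rounds down.
default_db_size_gib()  { echo $(( $1 * 10 / 100 )); }
default_wal_size_gib() { echo $(( $1 * 1 / 100 )); }

osd_gib=4000
echo "DB:  $(default_db_size_gib "$osd_gib") GiB"
echo "WAL: $(default_wal_size_gib "$osd_gib") GiB"
----

This helps when planning how many OSDs can share one fast DB/WAL device.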
|
||
|
||
.Ceph Filestore
|
||
|
||
Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs.
|
||
Starting with Ceph Nautilus, {pve} does not support creating such OSDs with
|
||
'pveceph' anymore. If you still want to create filestore OSDs, use
|
||
'ceph-volume' directly.
|
||
|
||
[source,bash]
|
||
----
|
||
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
|
||
----
|
||
|
||
[[pve_ceph_osd_destroy]]
|
||
Destroy OSDs
|
||
~~~~~~~~~~~~
|
||
|
||
To remove an OSD via the GUI, first select a {pve} node in the tree view and go
|
||
to the **Ceph -> OSD** panel. Then select the OSD to destroy and click the **OUT**
|
||
button. Once the OSD status has changed from `in` to `out`, click the **STOP**
|
||
button. Finally, after the status has changed from `up` to `down`, select
|
||
**Destroy** from the `More` drop-down menu.
|
||
|
||
To remove an OSD via the CLI, run the following commands:
|
||
|
||
[source,bash]
|
||
----
|
||
ceph osd out <ID>
|
||
systemctl stop ceph-osd@<ID>.service
|
||
----
|
||
|
||
NOTE: The first command instructs Ceph not to include the OSD in the data
|
||
distribution. The second command stops the OSD service. Up to this point, no
|
||
data is lost.
|
||
|
||
The following command destroys the OSD. Specify the '-cleanup' option to
|
||
additionally destroy the partition table.
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph osd destroy <ID>
|
||
----
|
||
|
||
WARNING: The above command will destroy all data on the disk!
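
Because the last step is destructive, it can be useful to print the whole
sequence for a given OSD and review it before running anything. The helper
below is a hypothetical dry-run sketch. It only echoes the commands from this
section and does not execute them.

[source,bash]
----
#!/usr/bin/env bash
# Print the removal steps for one OSD in the order described above.
# Nothing is executed; review the output before running it by hand.
# Second argument: 1 to append the '-cleanup' option to the destroy step.
osd_removal_plan() {
    local id=$1 cleanup=${2:-0}
    [[ $id =~ ^[0-9]+$ ]] || { echo "OSD ID must be a number" >&2; return 1; }
    echo "ceph osd out $id"
    echo "systemctl stop ceph-osd@$id.service"
    if [ "$cleanup" = "1" ]; then
        echo "pveceph osd destroy $id -cleanup"
    else
        echo "pveceph osd destroy $id"
    fi
}

osd_removal_plan 7 1
----

Between the steps, wait for the status changes described for the GUI
(`in` to `out`, then `up` to `down`) before continuing.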
|
||
|
||
|
||
[[pve_ceph_pools]]
|
||
Ceph Pools
|
||
----------
|
||
|
||
[thumbnail="screenshot/gui-ceph-pools.png"]
|
||
|
||
A pool is a logical group for storing objects. It holds a collection of objects,
|
||
known as **P**lacement **G**roups (`PG`, `pg_num`).
|
||
|
||
|
||
Create and Edit Pools
|
||
~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
You can create and edit pools from the command line or the web-interface of any
|
||
{pve} host under **Ceph -> Pools**.
|
||
|
||
When no options are given, we set a default of **128 PGs**, a **size of 3
|
||
replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if
|
||
any OSD fails.
|
||
|
||
WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1
|
||
allows I/O on an object when it has only 1 replica, which could lead to data
|
||
loss, incomplete PGs or unfound objects.
|
||
|
||
It is advised that you either enable the PG-Autoscaler or calculate the PG
|
||
number based on your setup. You can find the formula and the PG calculator
|
||
footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/] online. From Ceph Nautilus
|
||
onward, you can change the number of PGs
|
||
footnoteref:[placement_groups,Placement Groups
|
||
{cephdocs-url}/rados/operations/placement-groups/] after the setup.
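
If you want a quick estimate without the calculator, a commonly cited rule of
thumb is `(OSDs * 100) / size`, rounded to a power of two. The sketch below
assumes this rule, a target of 100 PGs per OSD, a single pool using all OSDs,
and rounding up. The function name is made up, and the linked calculator may
round differently.

[source,bash]
----
#!/usr/bin/env bash
# Suggested pg_num for a single pool: (OSDs * target PGs per OSD) / size,
# rounded up to the next power of two (rounding direction is an assumption).
suggest_pg_num() {
    local osds=$1 size=$2 target_per_osd=${3:-100}
    local raw=$(( osds * target_per_osd / size ))
    local pg=1
    while (( pg < raw )); do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# Example: 12 OSDs, replicated pool with size 3
suggest_pg_num 12 3
----

With several pools sharing the same OSDs, the result has to be divided among
them according to the expected share of data in each pool.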
|
||
|
||
The PG autoscaler footnoteref:[autoscaler,Automated Scaling
|
||
{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
|
||
automatically scale the PG count for a pool in the background. Setting the
|
||
`Target Size` or `Target Ratio` advanced parameters helps the PG-Autoscaler to
|
||
make better decisions.
|
||
|
||
.Example for creating a pool over the CLI
|
||
[source,bash]
|
||
----
|
||
pveceph pool create <pool-name> --add_storages
|
||
----
|
||
|
||
TIP: If you would also like to automatically define a storage for your
|
||
pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the
|
||
command line option '--add_storages' at pool creation.
|
||
|
||
Pool Options
|
||
^^^^^^^^^^^^
|
||
|
||
[thumbnail="screenshot/gui-ceph-pool-create.png"]
|
||
|
||
The following options are available on pool creation, and partially also when
|
||
editing a pool.
|
||
|
||
Name:: The name of the pool. This must be unique and can't be changed afterwards.
|
||
Size:: The number of replicas per object. Ceph always tries to have this many
|
||
copies of an object. Default: `3`.
|
||
PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
|
||
the pool. If set to `warn`, it produces a warning message when a pool
|
||
has a non-optimal PG count. Default: `warn`.
|
||
Add as Storage:: Configure a VM or container storage using the new pool.
|
||
Default: `true` (only visible on creation).
|
||
|
||
.Advanced Options
|
||
Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
|
||
the pool if a PG has less than this many replicas. Default: `2`.
|
||
Crush Rule:: The rule to use for mapping object placement in the cluster. These
|
||
rules define how data is placed within the cluster. See
|
||
xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
|
||
device-based rules.
|
||
# of PGs:: The number of placement groups footnoteref:[placement_groups] that
|
||
the pool should have at the beginning. Default: `128`.
|
||
Target Ratio:: The ratio of data that is expected in the pool. The PG
|
||
autoscaler uses the ratio relative to other ratio sets. It takes precedence
|
||
over the `target size` if both are set.
|
||
Target Size:: The estimated amount of data expected in the pool. The PG
|
||
autoscaler uses this size to estimate the optimal PG count.
|
||
Min. # of PGs:: The minimum number of placement groups. This setting is used to
|
||
fine-tune the lower bound of the PG count for that pool. The PG autoscaler
|
||
will not merge PGs below this threshold.
|
||
|
||
Further information on Ceph pool handling can be found in the Ceph pool
|
||
operation footnote:[Ceph pool operation
|
||
{cephdocs-url}/rados/operations/pools/]
|
||
manual.
|
||
|
||
|
||
[[pve_ceph_ec_pools]]
|
||
Erasure Coded Pools
|
||
~~~~~~~~~~~~~~~~~~~
|
||
|
||
Erasure coding (EC) is a form of `forward error correction' codes that allows
you to recover from a certain amount of data loss. Erasure coded pools can offer
more usable space compared to replicated pools, but they do so at the cost
of performance.
|
||
|
||
For comparison: in classic, replicated pools, multiple replicas of the data
|
||
are stored (`size`) while in erasure coded pools, data is split into `k` data
|
||
chunks with additional `m` coding (checking) chunks. Those coding chunks can be
|
||
used to recreate data should data chunks be missing.
|
||
|
||
The number of coding chunks, `m`, defines how many OSDs can be lost without
|
||
losing any data. The total number of chunks stored per object is `k + m`.
|
||
|
||
Creating EC Pools
|
||
^^^^^^^^^^^^^^^^^
|
||
|
||
Erasure coded (EC) pools can be created with the `pveceph` CLI tooling.
|
||
Planning an EC pool needs to account for the fact that they work differently
|
||
than replicated pools.
|
||
|
||
The default `min_size` of an EC pool depends on the `m` parameter. If `m = 1`,
|
||
the `min_size` of the EC pool will be `k`. The `min_size` will be `k + 1` if
|
||
`m > 1`. The Ceph documentation recommends a conservative `min_size` of `k + 2`
|
||
footnote:[Ceph Erasure Coded Pool Recovery
|
||
{cephdocs-url}/rados/operations/erasure-code/#erasure-coded-pool-recovery].
|
||
|
||
If there are fewer than `min_size` OSDs available, any IO to the pool will be
|
||
blocked until there are enough OSDs available again.
|
||
|
||
NOTE: When planning an erasure coded pool, keep an eye on the `min_size` as it
|
||
defines how many OSDs need to be available. Otherwise, IO will be blocked.
|
||
|
||
For example, an EC pool with `k = 2` and `m = 1` will have `size = 3`,
|
||
`min_size = 2` and will stay operational if one OSD fails. If the pool is
|
||
configured with `k = 2`, `m = 2`, it will have a `size = 4` and `min_size = 3`
|
||
and stay operational if one OSD is lost.
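
The same arithmetic can be written down as a small helper to try other `k` and
`m` combinations. This is a sketch of the default rules described above, with a
made-up function name. It does not query a cluster.

[source,bash]
----
#!/usr/bin/env bash
# Derive size and default min_size of an EC pool from k and m:
#   size     = k + m
#   min_size = k      if m = 1
#   min_size = k + 1  if m > 1
ec_pool_layout() {
    local k=$1 m=$2
    local size=$(( k + m ))
    local min_size
    if (( m == 1 )); then
        min_size=$k
    else
        min_size=$(( k + 1 ))
    fi
    echo "size=$size min_size=$min_size io_survives_osd_failures=$(( size - min_size )) data_survives_osd_failures=$m"
}

ec_pool_layout 2 1
ec_pool_layout 2 2
ec_pool_layout 4 2
----

Note the difference between the two counters: `m` OSDs can be lost without
losing data, but I/O is blocked as soon as fewer than `min_size` OSDs remain.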
|
||
|
||
To create a new EC pool, run the following command:
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph pool create <pool-name> --erasure-coding k=2,m=1
|
||
----
|
||
|
||
Optional parameters are `failure-domain` and `device-class`. If you
|
||
need to change any EC profile settings used by the pool, you will have to
|
||
create a new pool with a new profile.
|
||
|
||
This will create a new EC pool plus the needed replicated pool to store the RBD
|
||
omap and other metadata. In the end, there will be a `<pool name>-data` and
|
||
`<pool name>-metadata` pool. The default behavior is to create a matching storage
|
||
configuration as well. If that behavior is not wanted, you can disable it by
|
||
providing the `--add_storages 0` parameter. When configuring the storage
|
||
configuration manually, keep in mind that the `data-pool` parameter needs to be
|
||
set. Only then will the EC pool be used to store the data objects.
|
||
|
||
NOTE: The optional parameters `--size`, `--min_size` and `--crush_rule` will be
|
||
used for the replicated metadata pool, but not for the erasure coded data pool.
|
||
If you need to change the `min_size` on the data pool, you can do it later.
|
||
The `size` and `crush_rule` parameters cannot be changed on erasure coded
|
||
pools.
|
||
|
||
If there is a need to further customize the EC profile, you can do so by
|
||
creating it with the Ceph tools directly footnote:[Ceph Erasure Code Profile
|
||
{cephdocs-url}/rados/operations/erasure-code/#erasure-code-profiles], and
|
||
specify the profile to use with the `profile` parameter.
|
||
|
||
For example:
|
||
[source,bash]
|
||
----
|
||
pveceph pool create <pool-name> --erasure-coding profile=<profile-name>
|
||
----
|
||
|
||
Adding EC Pools as Storage
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
You can add an already existing EC pool as storage to {pve}. It works the same
|
||
way as adding an `RBD` pool but requires the extra `data-pool` option.
|
||
|
||
[source,bash]
|
||
----
|
||
pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool>
|
||
----
|
||
|
||
TIP: Do not forget to add the `keyring` and `monhost` option for any external
|
||
Ceph clusters, not managed by the local {pve} cluster.
|
||
|
||
Destroy Pools
|
||
~~~~~~~~~~~~~
|
||
|
||
To destroy a pool via the GUI, select a node in the tree view and go to the
|
||
**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
|
||
button. To confirm the destruction of the pool, you need to enter the pool name.
|
||
|
||
Run the following command to destroy a pool. Specify the '-remove_storages' option to
|
||
also remove the associated storage.
|
||
|
||
[source,bash]
|
||
----
|
||
pveceph pool destroy <name>
|
||
----
|
||
|
||
NOTE: Pool deletion runs in the background and can take some time.
|
||
You will notice the data usage in the cluster decreasing throughout this
|
||
process.
|
||
|
||
|
||
PG Autoscaler
|
||
~~~~~~~~~~~~~
|
||
|
||
The PG autoscaler allows the cluster to consider the amount of (expected) data
|
||
stored in each pool and to choose the appropriate pg_num values automatically.
|
||
It is available since Ceph Nautilus.
|
||
|
||
You may need to activate the PG autoscaler module before adjustments can take
|
||
effect.
|
||
|
||
[source,bash]
|
||
----
|
||
ceph mgr module enable pg_autoscaler
|
||
----
|
||
|
||
The autoscaler is configured on a per pool basis and has the following modes:
|
||
|
||
[horizontal]
|
||
warn:: A health warning is issued if the suggested `pg_num` value differs too
|
||
much from the current value.
|
||
on:: The `pg_num` is adjusted automatically with no need for any manual
|
||
interaction.
|
||
off:: No automatic `pg_num` adjustments are made, and no warning will be issued
|
||
if the PG count is not optimal.
|
||
|
||
The scaling factor can be adjusted to facilitate future data storage with the
|
||
`target_size`, `target_size_ratio` and the `pg_num_min` options.
|
||
|
||
WARNING: By default, the autoscaler considers tuning the PG count of a pool if
|
||
it is off by a factor of 3. This will lead to a considerable shift in data
|
||
placement and might introduce a high load on the cluster.
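
To judge in advance whether a pool is close to this threshold, you can compare
its current and suggested PG counts. The sketch below uses a hypothetical
helper name and assumes that a difference of exactly a factor of 3 already
counts. Whether the boundary value itself triggers an adjustment is an
assumption here.

[source,bash]
----
#!/usr/bin/env bash
# Succeeds (exit 0) if the suggested pg_num differs from the current one by
# at least the given factor (default 3) in either direction.
autoscaler_would_adjust() {
    local current=$1 suggested=$2 factor=${3:-3}
    (( suggested >= current * factor || current >= suggested * factor ))
}

# Example: a pool at 128 PGs with a suggestion of 512
if autoscaler_would_adjust 128 512; then
    echo "expect a PG count change and data movement"
fi
----

The two input values have to be taken from your cluster, for example from the
pool overview in the web-interface.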
|
||
|
||
You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog -
|
||
https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in
|
||
Nautilus: PG merging and autotuning].
|
||
|
||
|
||
[[pve_ceph_device_classes]]
|
||
Ceph CRUSH & device classes
|
||
---------------------------
|
||
|
||
[thumbnail="screenshot/gui-ceph-config.png"]
|
||
|
||
The footnote:[CRUSH
|
||
https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] (**C**ontrolled
|
||
**R**eplication **U**nder **S**calable **H**ashing) algorithm is at the
|
||
foundation of Ceph.
|
||
|
||
CRUSH calculates where to store and retrieve data from. This has the
|
||
advantage that no central indexing service is needed. CRUSH works using a map of
|
||
OSDs, buckets (device locations) and rulesets (data replication) for pools.
|
||
|
||
NOTE: Further information can be found in the Ceph documentation, under the
|
||
section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].
|
||
|
||
This map can be altered to reflect different replication hierarchies. The object
|
||
replicas can be separated (e.g., failure domains), while maintaining the desired
|
||
distribution.
|
||
|
||
A common configuration is to use different classes of disks for different Ceph
|
||
pools. For this reason, Ceph introduced device classes with Luminous, to
|
||
accommodate the need for easy ruleset generation.
|
||
|
||
The device classes can be seen in the 'ceph osd tree' output. These classes
|
||
represent their own root bucket, which can be seen with the below command.
|
||
|
||
[source, bash]
|
||
----
|
||
ceph osd crush tree --show-shadow
|
||
----
|
||
|
||
Example output from the above command:
|
||
|
||
[source, bash]
|
||
----
|
||
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
|
||
----
|
||
|
||
To instruct a pool to only distribute objects on a specific device class, you
|
||
first need to create a ruleset for the device class:
|
||
|
||
[source, bash]
|
||
----
|
||
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
|
||
----
|
||
|
||
[frame="none",grid="none", align="left", cols="30%,70%"]
|
||
|===
|
||
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|
||
|<root>|which crush root it should belong to (default Ceph root "default")
|
||
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|
||
|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
|
||
|===
|
||
|
||
Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
|
||
|
||
[source, bash]
|
||
----
|
||
ceph osd pool set <pool-name> crush_rule <rule-name>
|
||
----
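
To see how the placeholders fit together, the following sketch fills them in
for a concrete case and prints the two commands instead of running them. The
helper name, the pool name `vm-pool` and the rule naming scheme
`replicated_<class>` are assumptions made for this example.

[source,bash]
----
#!/usr/bin/env bash
# Print the two commands that bind a pool to a device class.
# Defaults: CRUSH root "default" and failure domain "host", as in the table.
# The rule name "replicated_<class>" is only a convention chosen here.
crush_class_commands() {
    local pool=$1 class=$2 root=${3:-default} domain=${4:-host}
    local rule="replicated_${class}"
    echo "ceph osd crush rule create-replicated $rule $root $domain $class"
    echo "ceph osd pool set $pool crush_rule $rule"
}

crush_class_commands vm-pool nvme
----

Printing the commands first makes it easier to double-check the argument order
before changing the CRUSH map of a production cluster.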
|
||
|
||
TIP: If the pool already contains objects, these must be moved accordingly.
|
||
Depending on your setup, this may introduce a big performance impact on your
|
||
cluster. As an alternative, you can create a new pool and move disks separately.
|
||
|
||
|
||
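As an illustration of the two commands above, the following sketch creates a
rule for the `nvme` class and assigns it to a pool. The rule name `nvme-only`
and the pool name `vm-nvme` are hypothetical, while `default` and `host` are
the usual root and failure domain from the table. The `run` wrapper is an
assumption of this sketch: it only prints the commands unless `DRY_RUN=0` is
set, so the example can be tried without touching a cluster.

[source,bash]
----
#!/bin/bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a replicated rule limited to one device class and assign it to a pool.
# Arguments: <pool-name> <rule-name> <class>
# The root 'default' and the failure domain 'host' are assumed here.
use_device_class() {
    local pool="$1" rule="$2" class="$3"
    run ceph osd crush rule create-replicated "$rule" default host "$class" &&
    run ceph osd pool set "$pool" crush_rule "$rule"
}

# Hypothetical names, for illustration only.
use_device_class vm-nvme nvme-only nvme
----

Review the printed commands first. Keep the TIP above in mind before running
them against a pool that already holds data.
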
Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

Following the setup from the previous sections, you can configure {pve} to use
such pools to store VM and Container images. Simply use the GUI to add a new
`RBD` storage (see section
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
done automatically.

NOTE: The filename needs to be `<storage_id>` + `.keyring`, where `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`. In the following example,
`my-ceph-storage` is the `<storage_id>`:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

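The naming rule from the note can be captured in a small helper. This is only
a sketch: the function names `keyring_path` and `install_keyring` are
hypothetical, and the default source path is the admin keyring used in the
example above.

[source,bash]
----
#!/bin/bash
# Build the keyring path expected for an external RBD storage:
# /etc/pve/priv/ceph/<storage_id>.keyring
keyring_path() {
    local storage_id="$1"
    echo "/etc/pve/priv/ceph/${storage_id}.keyring"
}

# Copy a keyring into place for the given storage ID.
# Arguments: <storage_id> [source-keyring]
install_keyring() {
    local storage_id="$1"
    local src="${2:-/etc/ceph/ceph.client.admin.keyring}"
    # -p avoids an error if the directory already exists.
    mkdir -p /etc/pve/priv/ceph &&
    cp "$src" "$(keyring_path "$storage_id")"
}

# Usage on a node (not executed here):
#   install_keyring my-ceph-storage
----
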
[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, which runs on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map the
RADOS backed objects to files and directories, allowing Ceph to provide a
POSIX-compliant, replicated filesystem. This allows you to easily configure a
clustered, highly available, shared filesystem. Ceph's Metadata Servers
guarantee that files are evenly distributed over the entire Ceph cluster. As a
result, even cases of high load will not overwhelm a single host, which can be
an issue with traditional shared filesystem approaches, for example `NFS`.

[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]

{pve} supports both creating a hyper-converged CephFS and using an existing
xref:storage_cephfs[CephFS as storage] to save backups, ISO files, and container
templates.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running, in order
to function. You can create an MDS through the {pve} web GUI's
`Node -> CephFS` panel or from the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings, only one can be active at a time. If an MDS or its node becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
You can speed up the handover between the active and standby MDS by using
the 'hotstandby' parameter on creation, or if you have already created it
you may set/add:

----
mds standby replay = true
----

in the respective MDS section of `/etc/pve/ceph.conf`. With this enabled, the
specified MDS will remain in a `warm` state, polling the active one, so that it
can take over faster in case of any issues.

NOTE: This active polling will have an additional performance impact on your
system and the active `MDS`.

.Multiple Active MDS

Since Luminous (12.2.x) you can have multiple active metadata servers
running at once, but this is normally only useful if you have a large number of
clients running in parallel. Otherwise the `MDS` is rarely the bottleneck in a
system. If you want to set this up, please refer to the Ceph documentation.
footnote:[Configuring multiple active MDS daemons
{cephdocs-url}/cephfs/multimds/]

[[pveceph_fs_create]]
Create CephFS
~~~~~~~~~~~~~

With {pve}'s integration of CephFS, you can easily create a CephFS using the
web interface, CLI or an external API. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages] - if this was already done some
  time ago, you may want to rerun it on an up-to-date system to
  ensure that all CephFS related packages get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osd_create[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this is complete, you can simply create a CephFS through
either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
for example:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named 'cephfs', using a pool for its data named
'cephfs_data' with '128' placement groups and a pool for its metadata named
'cephfs_metadata' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding an appropriate placement group
number (`pg_num`) for your setup footnoteref:[placement_groups].
Additionally, the '--add-storage' parameter will add the CephFS to the {pve}
storage configuration after it has been created successfully.

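The relationship between the two placement group counts can be checked with
shell arithmetic. The helper name `metadata_pg_num` is hypothetical, and the
sketch assumes a plain integer division by four, as described above. It makes
no claim about how `pveceph` rounds values that are not a multiple of four.

[source,bash]
----
#!/bin/bash
# Metadata pool PG count as described above: one quarter of the data pool's.
# Argument: <data-pool-pg_num>
metadata_pg_num() {
    local data_pg_num="$1"
    echo $(( data_pg_num / 4 ))
}

metadata_pg_num 128   # prints 32
----
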
Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all of its data unusable. This cannot be
undone!

To completely and gracefully remove a CephFS, the following steps are
necessary:

* Disconnect every non-{PVE} client (e.g. unmount the CephFS in guests).
* Disable all related CephFS {PVE} storage entries (to prevent them from being
  automatically mounted).
* Remove all used resources from guests (e.g. ISOs) that are on the CephFS you
  want to destroy.
* Unmount the CephFS storages on all cluster nodes manually with
+
----
umount /mnt/pve/<STORAGE-NAME>
----
+
Where `<STORAGE-NAME>` is the name of the CephFS storage in your {PVE}.

* Now make sure that no metadata server (`MDS`) is running for that CephFS,
  either by stopping or destroying them. This can be done through the web
  interface or via the command line interface. For the latter, issue
  the following command:
+
----
pveceph stop --service mds.NAME
----
+
to stop them, or
+
----
pveceph mds destroy NAME
----
+
to destroy them.
+
Note that standby servers will automatically be promoted to active when an
active `MDS` is stopped or removed, so it is best to first stop all standby
servers.

* Now you can destroy the CephFS with
+
----
pveceph fs destroy NAME --remove-storages --remove-pools
----
+
This will automatically destroy the underlying Ceph pools as well as remove
the storages from the {pve} configuration.

After these steps, the CephFS should be completely removed and if you have
other CephFS instances, the stopped metadata servers can be started again
to act as standbys.

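The command-line part of the steps above can be sketched as a helper that only
prints the commands in the described order, so they can be reviewed before
anything is removed. The function name `cephfs_destroy_plan` and the example
names are hypothetical. Disconnecting clients, disabling the storage entries
and removing guest resources are not covered and still have to be done first.

[source,bash]
----
#!/bin/bash
# Print the removal commands for a CephFS in the order described above.
# Nothing is executed; review the output and run the lines by hand.
# Arguments: <fs-name> <storage-name> <mds-name>...
# List standby MDS first, so they are stopped before the active one.
cephfs_destroy_plan() {
    local fs="$1" storage="$2" mds
    shift 2
    echo "umount /mnt/pve/${storage}    # repeat on every cluster node"
    for mds in "$@"; do
        echo "pveceph stop --service mds.${mds}"
    done
    echo "pveceph fs destroy ${fs} --remove-storages --remove-pools"
}

# Hypothetical setup: file system and storage 'cephfs', standby MDS 'pve2',
# active MDS 'pve1'.
cephfs_destroy_plan cephfs cephfs pve2 pve1
----
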
Ceph maintenance
----------------

Replace OSDs
~~~~~~~~~~~~

One of the most common maintenance tasks in Ceph is to replace the disk of an
OSD. If a disk is already in a failed state, then you can go ahead and run
through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
the copies that were stored on it on the remaining OSDs if possible. This
rebalancing will start as soon as an OSD failure is detected or an OSD was
actively stopped.

NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
`size + 1` nodes are available. The reason for this is that the Ceph object
balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
`failure domain'.

To replace a functioning disk from the GUI, go through the steps in
xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.

On the command line, first mark the OSD as out:

----
ceph osd out osd.<id>
----

You can check with the command below if the OSD can be safely removed.

----
ceph osd safe-to-destroy osd.<id>
----

Once the above check tells you that it is safe to remove the OSD, you can
continue with the following commands:

----
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>
----

Replace the old disk with the new one and use the same procedure as described
in xref:pve_ceph_osd_create[Create OSDs].

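The command-line steps above can be combined into one function that waits for
the safety check before removing the OSD. This is a minimal sketch: the
function name `osd_remove_when_safe` and the default polling interval of 60
seconds are assumptions. It relies on `ceph osd safe-to-destroy` exiting with
a non-zero status while the OSD cannot yet be removed safely.

[source,bash]
----
#!/bin/bash
# Mark an OSD out, wait until Ceph reports it safe to destroy, then remove it.
# Arguments: <id> [poll-interval-seconds]
osd_remove_when_safe() {
    local id="$1" interval="${2:-60}"
    ceph osd out "osd.${id}" || return 1
    # Poll until the check succeeds, i.e. no data depends on this OSD anymore.
    until ceph osd safe-to-destroy "osd.${id}"; do
        sleep "$interval"
    done
    systemctl stop "ceph-osd@${id}.service" &&
    pveceph osd destroy "$id"
}

# Usage on the node that hosts the OSD (not executed here):
#   osd_remove_when_safe 12
----

Depending on how much data has to be moved, the loop can run for a long time.
Watching the cluster status in parallel helps to spot stalled recovery.
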
Trim/Discard
~~~~~~~~~~~~

It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
This releases data blocks that the filesystem isn’t using anymore. It reduces
data usage and resource load. Most modern operating systems issue such discard
commands to their disks regularly. You only need to ensure that the Virtual
Machines enable the xref:qm_hard_disk_discard[disk discard option].

[[pveceph_scrub]]
Scrub & Deep Scrub
~~~~~~~~~~~~~~~~~~

Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
object in a PG for its health. There are two forms of scrubbing: daily
cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
the objects and uses checksums to ensure data integrity. If a running scrub
interferes with business (performance) needs, you can adjust the time when
scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing]
are executed.


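As a sketch of how such a time window could be set, the helper below prints
the commands that restrict scrubbing to the hours between a begin and an end
hour. The option names `osd_scrub_begin_hour` and `osd_scrub_end_hour` are
taken from the Ceph OSD configuration reference linked above, not from this
chapter. The function name `scrub_window_cmds` and the example window from
22:00 to 06:00 are assumptions.

[source,bash]
----
#!/bin/bash
# Print the 'ceph config set' commands that limit scrubbing to a time window.
# Arguments: <begin-hour> <end-hour>, both 0-23. Nothing is executed.
scrub_window_cmds() {
    local begin="$1" end="$2" h
    for h in "$begin" "$end"; do
        case "$h" in
            ''|*[!0-9]*) echo "hour must be a number: $h" >&2; return 1 ;;
        esac
        [ "$h" -le 23 ] || { echo "hour out of range: $h" >&2; return 1; }
    done
    echo "ceph config set osd osd_scrub_begin_hour ${begin}"
    echo "ceph config set osd osd_scrub_end_hour ${end}"
}

# Example window: scrub only between 22:00 and 06:00.
scrub_window_cmds 22 6
----
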
Ceph Monitoring and Troubleshooting
-----------------------------------

It is important to continuously monitor the health of a Ceph deployment from the
beginning, either by using the Ceph tools or by accessing
the status through the {pve} link:api-viewer/index.html[API].

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----

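For scripted checks, the three health states can be mapped to exit codes. The
sketch below assumes that the health string comes from `ceph health`, whose
output starts with one of the states named above. The function name
`health_to_exit_code` and the choice of exit codes (0, 1, 2) are hypothetical.

[source,bash]
----
#!/bin/bash
# Map a Ceph health string to an exit code:
# 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR or anything unexpected.
health_to_exit_code() {
    case "$1" in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        *)            return 2 ;;
    esac
}

# Usage on a node (not executed here):
#   health_to_exit_code "$(ceph health)"; echo "exit code: $?"
----

Treating unknown output as an error is a deliberate choice here, so that a
failing `ceph` call is not mistaken for a healthy cluster.
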
To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`. If more detail is required, the log level can be
adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
a Ceph cluster on the official website.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]