[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph cluster there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn’t improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers, use a host bus adapter (HBA) instead.


[[pve_ceph_install]]
Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

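If you want to verify the result, you can, for example, inspect the repository
file mentioned above and query the installed Ceph version (a quick optional
sanity check, not part of the setup itself):

[source,bash]
----
# show the repository that 'pveceph install' configured
cat /etc/apt/sources.list.d/ceph.list
# print the Ceph version that was installed
ceph --version
----
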
Creating initial Ceph configuration
-----------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated for Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.

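For example, you can confirm on a node that the symbolic link is in place and
that it points to the cluster-wide configuration (purely an optional check):

[source,bash]
----
# the link should point to the pmxcfs-managed configuration
ls -l /etc/ceph/ceph.conf
# show the generated configuration
cat /etc/pve/ceph.conf
----
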
[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

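Once monitors are created on all intended nodes, you can check that they have
formed a quorum with the standard Ceph status commands, for example:

[source,bash]
----
# overall cluster health, including the monitor quorum
ceph -s
# compact monitor quorum summary
ceph mon stat
----
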
[[pve_ceph_manager]]
Creating Ceph Manager
----------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----

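To see which manager is currently active and how many are on standby, you can,
for example, query Ceph's manager status:

[source,bash]
----
# shows the active manager and the available standbys
ceph mgr stat
----
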
[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

You can create an OSD via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your at least three nodes (4 OSDs on each node).

If the disk was used before (e.g. ZFS/RAID/OSD), the following commands should
be sufficient to remove the partition table, boot sector and any other OSD
leftover.

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph luminous, Filestore was used as the storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD, afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data. So
if you want to overwrite a disk you should remove existing data first. You can
do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.

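After the OSDs have been created, you can verify that they are reported as `up`
and `in` by the cluster, for example with:

[source,bash]
----
# OSD count and state summary
ceph osd stat
# OSD placement per host, including device classes
ceph osd tree
----
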
[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **128 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.


You can create pools through the command line or in the GUI on each PVE host under
**Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' on pool creation.

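As an illustration only, a pool that spells out the defaults described above and
is added as a storage in one step could look like this (assuming `pveceph
createpool` accepts `--size`, `--min_size` and `--pg_num` options matching
those defaults):

[source,bash]
----
# hypothetical example: 3 replicas, 2 required for I/O, 128 PGs, add storage entry
pveceph createpool vm-pool --size 3 --min_size 2 --pg_num 128 --add_storages
----
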
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data to and where to retrieve it from; this has
the advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (eg. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced the device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset for the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (eg. nvme, ssd, hdd)
|===

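For instance, a hypothetical rule that keeps replicas on NVMe-backed OSDs only,
spread across hosts, would be created by filling in the template above:

[source, bash]
----
# rule "nvme-only" on the default root, host as failure domain, nvme device class
ceph osd crush rule create-replicated nvme-only default host nvme
----
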
Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

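To illustrate where the `<storage_id>` comes from, a matching `rbd:` entry in
`/etc/pve/storage.cfg` might look roughly like the following sketch; the option
values are placeholders for your setup:

----
rbd: my-ceph-storage
        content images
        krbd 0
        pool rbd
----
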
[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, running on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
the RADOS backed objects to files and directories, allowing to provide a
POSIX-compliant replicated filesystem. This allows one to have a clustered
highly available shared filesystem in an easy way if Ceph is already used. Its
Metadata Servers guarantee that files get balanced out over the whole Ceph
cluster, this way even high load will not overload a single host, which can
be an issue with traditional shared filesystem approaches, like `NFS`, for
example.

{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates and creating a
hyper-converged CephFS itself.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running to be
able to work. One can simply create one through the {pve} web GUI's `Node ->
CephFS` panel or on the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster. But with the default
settings only one can be active at any time. If an MDS, or its node, becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
One can speed up the hand-over between the active and a standby MDS by using
the 'hotstandby' parameter option on create, or if you have already created it
you may set/add:

----
mds standby replay = true
----

in the respective MDS section of ceph.conf. With this enabled, this specific MDS
will always poll the active one, so that it can take over faster as it is in a
`warm' state. But naturally, the active polling will cause some additional
performance impact on your system and active `MDS`.

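For example, an additional MDS that starts directly in hot-standby mode could be
created with the 'hotstandby' option mentioned above (a sketch, assuming the
boolean flag form of that option):

----
pveceph mds create --hotstandby 1
----
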
Multiple Active MDS
^^^^^^^^^^^^^^^^^^^

Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high count of parallel clients,
as otherwise the `MDS` is seldom the bottleneck. If you want to set this up,
please refer to the ceph documentation. footnote:[Configuring multiple active MDS
daemons http://docs.ceph.com/docs/mimic/cephfs/multimds/]

[[pveceph_fs_create]]
Create a CephFS
~~~~~~~~~~~~~~~

With {pve}'s CephFS integration, you can create a CephFS easily over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages], if this was already done some
  time ago you might want to rerun it on an up to date system to ensure that
  also all CephFS related packages get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osds[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this got all checked and done you can simply create a CephFS through
either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
for example with:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named `'cephfs'' using a pool for its data named
`'cephfs_data'' with `128` placement groups and a pool for its metadata named
`'cephfs_metadata'' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it was created successfully.

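Afterwards you can verify that the filesystem and its pools exist, for example
with the standard Ceph commands:

----
# list existing CephFS instances with their data and metadata pools
ceph fs ls
# check that at least one MDS is active
ceph mds stat
----
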
Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all its data unusable, this cannot be
undone!

If you really want to destroy an existing CephFS you first need to stop, or
destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:

----
pveceph mds destroy NAME
----
on each {pve} node hosting an MDS daemon.

Then, you can remove (destroy) the CephFS by issuing a:

----
ceph fs rm NAME --yes-i-really-mean-it
----
on a single node hosting Ceph. After this you may want to remove the created
data and metadata pools, this can be done either over the Web GUI or the CLI
with:

----
pveceph pool destroy NAME
----

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]