mirror of
git://git.proxmox.com/git/pve-docs.git
synced 2025-01-24 02:03:59 +03:00
04ba9b2466
remove duplicated 'Easy management' point. Remove 'no signle point of failure' as it's very easy to setup a ceph cluster with lot's of single point of failures, ceph allows to avoid them, but does not magically do so - so do not lull an unknown user into false hope. Use {pve} makro and rewor heading to a more "what's cool on ceph _with_ pve" one. Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
419 lines
13 KiB
Plaintext
[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Set up pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, you should have at least three, preferably
identical, servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, that is, to combine
independent disks into one or more logical units. Their caching methods,
algorithms (RAID modes; incl. JBOD), and disk or write/read optimisations are
targeted towards those logical units and not towards Ceph.

WARNING: Avoid RAID controllers. Use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installing the packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.
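
As a quick sanity check after `pveceph init`, the following sketch tests
whether a link resolves to the cluster-wide configuration file. The function
name `conf_link_ok` is hypothetical and not part of any Proxmox or Ceph tool.
The two default paths are the ones named above.

```shell
# conf_link_ok [LINK] [TARGET]
# Succeeds if LINK resolves to the same file as TARGET.
# The defaults are the paths named in the text above.
conf_link_ok() (
    link="${1:-/etc/ceph/ceph.conf}"
    target="${2:-/etc/pve/ceph.conf}"
    # Both paths must exist, otherwise there is nothing to compare.
    [ -e "$link" ] && [ -e "$target" ] || exit 1
    [ "$(readlink -f "$link")" = "$(readlink -f "$target")" ]
)
```

For example, `conf_link_ok && echo "ceph.conf is linked"` prints the message
only on a node where the link is in place.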


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
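
The recommendation of three monitors follows from the fact that monitors need
a strict majority to form a quorum. The following sketch works through that
arithmetic. The helper names `quorum_size` and `tolerated_mon_failures` are
hypothetical and only illustrate the calculation.

```shell
# Smallest strict majority of N monitors.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that may fail while a majority remains.
tolerated_mon_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "$n monitors: quorum $(quorum_size "$n"), tolerates $(tolerated_mon_failures "$n") failure(s)"
done
```

With one or two monitors no failure is tolerated, while three monitors
tolerate one. On a running cluster, `ceph mon stat` and `ceph quorum_status`
show the current monitors and the quorum.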


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among at least three nodes (4 OSDs on each node).

If the disk was used before (e.g. ZFS/RAID/OSD), the following commands should
be sufficient to remove the partition table, boot sector and any OSD leftovers.

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
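
Because these commands are destructive, it can help to wrap them in a small
guard. The sketch below is only an illustration: the function name
`wipe_disk_for_osd` and the confirmation prompt are assumptions, while the
`dd` and `ceph-disk zap` calls are the ones shown above.

```shell
# wipe_disk_for_osd DEVICE
# Refuses to run unless DEVICE is a block device and the user confirms.
wipe_disk_for_osd() (
    dev="${1:-}"
    if [ -z "$dev" ] || [ ! -b "$dev" ]; then
        echo "usage: wipe_disk_for_osd /dev/sdX (must be a block device)" >&2
        exit 1
    fi
    printf 'Really destroy ALL data on %s? Type "yes" to continue: ' "$dev"
    read -r answer
    [ "$answer" = "yes" ] || { echo "aborted" >&2; exit 1; }
    # Same commands as in the text above.
    dd if=/dev/zero of="$dev" bs=1M count=200
    ceph-disk zap "$dev"
)
```

Calling it without an argument, or with a path that is not a block device,
returns an error before anything is written.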

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: To be more failsafe, a disk can only be selected in the GUI if it has
a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD. Afterwards, it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk you should remove existing data first. You
can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph reports a
"HEALTH_WARN" status if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup. You can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
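
As a rough illustration of such a calculation, the sketch below applies a
commonly cited rule of thumb from the Ceph documentation: about 100 PGs per
OSD, divided by the replica count, rounded up to the next power of two. The
function name `pg_num_estimate` is hypothetical, and the rule assumes a single
pool using all OSDs. Treat the result as a starting point and compare it with
the PG calculator.

```shell
# pg_num_estimate OSDS SIZE [TARGET_PGS_PER_OSD]
# Rule of thumb: (OSDs * target) / replicas, rounded up to a power of two.
pg_num_estimate() (
    osds="$1"
    size="$2"
    target="${3:-100}"
    raw=$(( osds * target / size ))
    pg=1
    while [ "$pg" -lt "$raw" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
)

pg_num_estimate 12 3   # 12 OSDs, 3 replicas: prints 512
```

Since the PG count cannot be reduced later, it is safer to verify the value
before creating the pool than to correct it afterwards.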

You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' on pool creation.
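
As an illustration, creating a pool with explicit values might look like the
following. The helper `create_pool_cmd`, the pool name `vm-pool` and the
numbers are made up for this example, and the option names other than
`--add_storages` are assumed from the `pveceph` synopsis, so check
`pveceph help createpool` on your installation before relying on them. The
helper only prints the command so that it can be reviewed first.

```shell
# create_pool_cmd NAME PG_NUM SIZE MIN_SIZE
# Prints a pveceph command line instead of running it.
create_pool_cmd() {
    echo "pveceph createpool $1 --pg_num $2 --size $3 --min_size $4 --add_storages 1"
}

create_pool_cmd vm-pool 128 3 2
# prints: pveceph createpool vm-pool --pg_num 128 --size 3 --min_size 2 --add_storages 1
```

After the pool has been created, `ceph osd pool get vm-pool size` shows the
replica count that is in effect.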

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.


Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data to and retrieve it from. This has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.
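
To make the idea of computed placement concrete, here is a deliberately
simplified toy. It is not the CRUSH algorithm: it merely hashes an object name
with `cksum` and takes the result modulo the number of PGs. The function name
`toy_pg_for_object` and the object name are hypothetical. The point is only
that every client can compute the same location from the name alone, without
asking a central index.

```shell
# toy_pg_for_object NAME PG_NUM
# Toy stand-in for computed placement; real Ceph uses a different hash and CRUSH.
toy_pg_for_object() (
    name="$1"
    pg_num="$2"
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    echo $(( crc % pg_num ))
)

toy_pg_for_object vm-100-disk-1 64
toy_pg_for_object vm-100-disk-1 64   # same input, same result
```

Both calls print the same number, which is always in the range 0 to 63 for
64 PGs.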

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced the device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the command below.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.
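
The two steps above can be combined into a small helper. This is a sketch
under assumptions: the function name `pin_pool_to_class` and the rule naming
scheme `replicated-<class>` are hypothetical, and the root `default` and
failure domain `host` are simply the usual values mentioned in the table. The
`ceph` commands themselves are the ones shown above.

```shell
# pin_pool_to_class POOL CLASS
# Creates a replicated rule for CLASS under the "default" root with "host"
# as failure domain, then assigns that rule to POOL.
pin_pool_to_class() (
    pool="${1:-}"
    class="${2:-}"
    if [ -z "$pool" ] || [ -z "$class" ]; then
        echo "usage: pin_pool_to_class <pool-name> <class>" >&2
        exit 1
    fi
    rule="replicated-$class"    # hypothetical naming scheme
    ceph osd crush rule create-replicated "$rule" default host "$class" || exit 1
    ceph osd pool set "$pool" crush_rule "$rule"
)
```

For instance, `pin_pool_to_class vm-pool nvme` would restrict a pool named
`vm-pool` to NVMe-backed OSDs. Afterwards, `ceph osd crush rule ls` lists the
new rule and `ceph osd pool get vm-pool crush_rule` shows which rule the pool
uses.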


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring`, where
`<storage_id>` is the expression after 'rbd:' in `/etc/pve/storage.cfg`, which
is `my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
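
The naming rule from the note can be expressed as a one-line helper. The
function name `keyring_path` is hypothetical; the directory and the `.keyring`
suffix are the ones used in the example above.

```shell
# keyring_path STORAGE_ID
# Prints where the keyring for the given storage ID is expected.
keyring_path() {
    echo "/etc/pve/priv/ceph/$1.keyring"
}

keyring_path my-ceph-storage
# prints: /etc/pve/priv/ceph/my-ceph-storage.keyring
```

It can then be used in the copy step, for example
`cp /etc/ceph/ceph.client.admin.keyring "$(keyring_path my-ceph-storage)"`.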


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]