[[chapter_zfs]]
ZFS on Linux
------------
ifdef::wiki[]
:pve-toplevel:
endif::wiki[]

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. Starting with {pve} 3.4, the native Linux kernel port
of the ZFS file system is introduced as an optional file system and
also as an additional selection for the root file system. There is no
need to manually compile ZFS modules - all packages are included.

By using ZFS, it's possible to achieve maximum enterprise features with
low budget hardware, but also high performance systems by leveraging
SSD caching or even SSD only setups. ZFS can replace expensive hardware
RAID cards with moderate CPU and memory load, combined with easy
management.

.General ZFS advantages

* Easy configuration and management with {pve} GUI and CLI.

* Reliable

* Protection against data corruption

* Data compression on file system level

* Snapshots

* Copy-on-write clone

* Various RAID levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3

* Can use SSD for cache

* Self healing

* Continuous integrity checking

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source

* Encryption

* ...


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has
its own cache management. ZFS needs to communicate directly with the
disks. An HBA adapter, or something like an LSI controller flashed in
``IT'' mode, is the way to go.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for the disks of that VM,
as they are not supported by ZFS. Use IDE or SCSI instead (this also
works with the `virtio` SCSI controller type).


Installation as Root File System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you install using the {pve} installer, you can choose ZFS for the
root file system. You need to select the RAID type at installation
time:

[horizontal]
RAID0:: Also called ``striping''. The capacity of such a volume is the sum
of the capacities of all disks. But RAID0 does not add any redundancy,
so the failure of a single drive makes the volume unusable.

RAID1:: Also called ``mirroring''. Data is written identically to all
disks. This mode requires at least 2 disks of the same size. The
resulting capacity is that of a single disk.

RAID10:: A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1:: A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2:: A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3:: A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool
called `rpool`, and installs the root file system on the ZFS subvolume
`rpool/ROOT/pve-1`.

Another subvolume called `rpool/data` is created to store VM
images. In order to use that with the {pve} tools, the installer
creates the following configuration entry in `/etc/pve/storage.cfg`:

----
zfspool: local-zfs
    pool rpool/data
    sparse
    content images,rootdir
----
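
To verify that the new storage is known to {pve} and active, you can, for
example, list all configured storages with the `pvesm` tool:

----
# pvesm status
----
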
After installation, you can view your ZFS pool status using the
`zpool` command:

----
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
----

The `zfs` command is used to configure and manage your ZFS file
systems. The following command lists all file systems after
installation:

----
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
----


Bootloader
~~~~~~~~~~

Depending on whether the system is booted in EFI or legacy BIOS mode, the
{pve} installer sets up either `grub` or `systemd-boot` as the main
bootloader. See the chapter on xref:sysboot[{pve} host bootloaders] for
details.
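
To check which mode the running system was booted in, you can, for example,
test for the presence of the EFI variables directory, which only exists when
the system was booted via UEFI:

----
# ls /sys/firmware/efi
----
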
ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

----
# man zpool
# man zfs
----

.Create a new zpool

To create a new pool, at least one disk is needed. The `ashift` should
match the sector size of the underlying disk (2 to the power of `ashift`
is the sector size in bytes), or be larger.

----
# zpool create -f -o ashift=12 <pool> <device>
----
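
To pick a suitable value, you can, for example, check the physical and logical
sector sizes reported by your disks (`ashift=9` corresponds to 512 byte
sectors, `ashift=12` to 4 KiB sectors):

----
# lsblk -o NAME,PHY-SEC,LOG-SEC
----
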
To activate compression (see section <<zfs_compression,Compression in ZFS>>):

----
# zfs set compression=lz4 <pool>
----

.Create a new pool with RAID-0

Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

.Create a new pool with RAID-1

Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

.Create a new pool with RAID-10

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

.Create a new pool with RAIDZ-1

Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

.Create a new pool with RAIDZ-2

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

.Create a new pool with cache (L2ARC)

It is possible to use a dedicated cache drive partition to increase
the performance (use SSD).

For `<device>`, it is possible to use more than one device, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
----

.Create a new pool with log (ZIL)

It is possible to use a dedicated drive partition as log device to increase
the performance (use SSD).

For `<device>`, it is possible to use more than one device, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> log <log_device>
----

.Add cache and log to an existing pool

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk`.

IMPORTANT: Always use GPT partition tables.

The maximum size of a log device should be about half the size of
physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----

.Changing a failed device

----
# zpool replace -f <pool> <old device> <new device>
----
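
ZFS then resilvers the data onto the new device. You can, for example, monitor
the progress of that process with:

----
# zpool status -v <pool>
----
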
.Changing a failed bootable device when using systemd-boot

----
# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
# pve-efiboot-tool format <new disk's ESP>
# pve-efiboot-tool init <new disk's ESP>
----

NOTE: `ESP` stands for EFI System Partition, which is set up as partition #2 on
bootable disks set up by the {pve} installer since version 5.4. For details, see
xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].


Activate E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send emails on ZFS events like
pool errors. Newer ZFS packages ship the daemon in a separate package,
and you can install it using `apt-get`:

----
# apt-get install zfs-zed
----

To activate the daemon, it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting:

--------
ZED_EMAIL_ADDR="root"
--------
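
For the change to take effect, you may need to restart the event daemon, for
example via the systemd unit shipped by the `zfs-zed` package:

----
# systemctl restart zfs-zed.service
----
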
Please note that {pve} forwards mails addressed to `root` to the email
address configured for the root user.

IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
other settings are optional.


Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

It is good to use at most 50 percent (which is the default) of the
system memory for the ZFS ARC, to prevent performance degradation of the
host. Use your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

--------
options zfs zfs_arc_max=8589934592
--------

This example setting limits the usage to 8GB.
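
`zfs_arc_max` is specified in bytes; the value above corresponds to
8 * 1024 * 1024 * 1024. To calculate the value for a different limit, you can,
for example, let the shell do the arithmetic:

----
# echo $((8 * 1024 * 1024 * 1024))
8589934592
----
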
[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u
----
====


[[zfs_swap]]
SWAP on ZFS
~~~~~~~~~~~

Swap space created on a zvol may cause some problems, such as blocking the
server or generating a high IO load, which is often seen when starting a
backup to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap
device. You can leave some space free for this purpose in the advanced
options of the installer. Additionally, you can lower the
``swappiness'' value. A good value for servers is 10:

----
# sysctl -w vm.swappiness=10
----

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

--------
vm.swappiness = 10
--------
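
You can verify the currently active value with, for example:

----
# sysctl vm.swappiness
----
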
.Linux kernel `swappiness` parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value | Strategy
| vm.swappiness = 0 | The kernel will swap only to avoid
an 'out of memory' condition
| vm.swappiness = 1 | Minimum amount of swapping without
disabling it entirely.
| vm.swappiness = 10 | This value is sometimes recommended to
improve performance when sufficient memory exists in a system.
| vm.swappiness = 60 | The default value.
| vm.swappiness = 100 | The kernel will swap aggressively.
|===========================================================

[[zfs_encryption]]
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~

ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:

----
# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  disabled         local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  enabled          local
----

WARNING: There is currently no support for booting from pools with encrypted
datasets using Grub, and only limited support for automatically unlocking
encrypted datasets on boot. Older versions of ZFS without encryption support
will not be able to decrypt stored data.

NOTE: It is recommended to either unlock storage datasets manually after
booting, or to write a custom unit to pass the key material needed for
unlocking on boot to `zfs load-key`.
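
A minimal sketch of such a custom unit, assuming the key material is stored in
keyfiles referenced by the datasets' `keylocation` property (the referenced
unit and service names come with the standard `zfsutils-linux` package and
should be verified on your system), could be saved as
`/etc/systemd/system/zfs-load-key.service` and enabled with
`systemctl enable zfs-load-key.service`:

--------
[Unit]
Description=Load ZFS encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
# load the keys of all datasets whose keylocation points to a readable keyfile
ExecStart=/sbin/zfs load-key -a

[Install]
WantedBy=zfs-mount.service
--------
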
WARNING: Establish and test a backup procedure before enabling encryption of
production data. If the associated key material/passphrase/keyfile has been
lost, accessing the encrypted data is no longer possible.

Encryption needs to be set up when creating datasets/zvols, and is inherited by
default by child datasets. For example, to create an encrypted dataset
`tank/encrypted_data` and configure it as storage in {pve}, run the following
commands:

----
# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
Enter passphrase:
Re-enter passphrase:

# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data
----

All guest volumes/disks created on this storage will be encrypted with the
shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be loaded
with `zfs load-key`:

----
# zfs load-key tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':
----

It is also possible to use a (random) keyfile instead of prompting for a
passphrase by setting the `keylocation` and `keyformat` properties, either at
creation time or with `zfs change-key` on existing datasets:

----
# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1

# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
----

WARNING: When using a keyfile, special care needs to be taken to secure the
keyfile against unauthorized access or accidental loss. Without the keyfile, it
is not possible to access the plaintext data!

A guest volume created underneath an encrypted dataset will have its
`encryptionroot` property set accordingly. The key material only needs to be
loaded once per encryptionroot to be available to all encrypted datasets
underneath it.
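
You can, for example, inspect the `encryptionroot` and the current key status
of a volume with `zfs get` (the volume name below is just an example):

----
# zfs get encryptionroot,keystatus tank/encrypted_data/vm-100-disk-0
----
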
See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
`keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
change-key` commands and the `Encryption` section from `man zfs` for more
details and advanced usage.


[[zfs_compression]]
Compression in ZFS
~~~~~~~~~~~~~~~~~~

When compression is enabled on a dataset, ZFS tries to compress all *new*
blocks before writing them and decompresses them on reading. Already
existing data will not be compressed retroactively.

You can enable compression with:

----
# zfs set compression=<algorithm> <dataset>
----

We recommend using the `lz4` algorithm, because it adds very little CPU
overhead. Other algorithms like `lzjb` and `gzip-N`, where `N` is an
integer from `1` (fastest) to `9` (best compression ratio), are also
available. Depending on the algorithm and how compressible the data is,
having compression enabled can even increase I/O performance.
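
To see which algorithm is in use and how well the existing data compresses, you
can, for example, query the `compression` and `compressratio` properties:

----
# zfs get compression,compressratio <dataset>
----
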
You can disable compression at any time with:

----
# zfs set compression=off <dataset>
----

Again, only new blocks will be affected by this change.


ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

IMPORTANT: The redundancy of the `special` device should match the one of the
pool, since the `special` device is a point of failure for the whole pool.

WARNING: Adding a `special` device to a pool cannot be undone!

.Create a pool with `special` device and RAID-1:

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
----

.Add a `special` device to an existing pool with RAID-1:

----
# zpool add <pool> special mirror <device1> <device2>
----
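
After adding it, the `special` device shows up as its own vdev class in the
pool layout, which you can, for example, verify with:

----
# zpool status <pool>
----
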
ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range between `512B` and `128K`. After setting the property, new file
blocks smaller than `size` will be allocated on the `special` device.

IMPORTANT: If the value for `special_small_blocks` is greater than or equal to
the `recordsize` (default `128K`) of the dataset, *all* data will be written to
the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all containers
in the pool will opt in for small file blocks).

.Opt in for all file blocks smaller than 4K pool-wide:

----
# zfs set special_small_blocks=4K <pool>
----

.Opt in for small file blocks for a single dataset:

----
# zfs set special_small_blocks=4K <pool>/<filesystem>
----

.Opt out from small file blocks for a single dataset:

----
# zfs set special_small_blocks=0 <pool>/<filesystem>
----