mirror of git://sourceware.org/git/lvm2.git (synced 2024-12-21 13:34:40 +03:00)
doc: Update dm kernel files.
v4.0-9804-gdb4fd9c
parent 2fea720881
commit 81d03b46b0
@@ -30,28 +30,48 @@ multiqueue

 This policy is the default.

-The multiqueue policy has two sets of 16 queues: one set for entries
-waiting for the cache and another one for those in the cache.
+The multiqueue policy has three sets of 16 queues: one set for entries
+waiting for the cache and another two for those in the cache (a set for
+clean entries and a set for dirty entries).
+
 Cache entries in the queues are aged based on logical time. Entry into
 the cache is based on variable thresholds and queue selection is based
 on hit count on entry. The policy aims to take different cache miss
 costs into account and to adjust to varying load patterns automatically.

 Message and constructor argument pairs are:
-    'sequential_threshold <#nr_sequential_ios>' and
-    'random_threshold <#nr_random_ios>'.
+    'sequential_threshold <#nr_sequential_ios>'
+    'random_threshold <#nr_random_ios>'
+    'read_promote_adjustment <value>'
+    'write_promote_adjustment <value>'
+    'discard_promote_adjustment <value>'

 The sequential threshold indicates the number of contiguous I/Os
-required before a stream is treated as sequential. The random threshold
+required before a stream is treated as sequential. Once a stream is
+considered sequential it will bypass the cache. The random threshold
 is the number of intervening non-contiguous I/Os that must be seen
 before the stream is treated as random again.

 The sequential and random thresholds default to 512 and 4 respectively.

-Large, sequential ios are probably better left on the origin device
-since spindles tend to have good bandwidth. The io_tracker counts
-contiguous I/Os to try to spot when the io is in one of these sequential
-modes.
+Large, sequential I/Os are probably better left on the origin device
+since spindles tend to have good sequential I/O bandwidth. The
+io_tracker counts contiguous I/Os to try to spot when the I/O is in one
+of these sequential modes. But there are use-cases for wanting to
+promote sequential blocks to the cache (e.g. fast application startup).
+If sequential threshold is set to 0 the sequential I/O detection is
+disabled and sequential I/O will no longer implicitly bypass the cache.
+Setting the random threshold to 0 does _not_ disable the random I/O
+stream detection.
+
+Internally the mq policy determines a promotion threshold. If the hit
+count of a block not in the cache goes above this threshold it gets
+promoted to the cache. The read, write and discard promote adjustment
+tunables allow you to tweak the promotion threshold by adding a small
+value based on the io type. They default to 4, 8 and 1 respectively.
+If you're trying to quickly warm a new cache device you may wish to
+reduce these to encourage promotion. Remember to switch them back to
+their defaults after the cache fills though.

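The promote adjustments added in this hunk are set with dmsetup messages, like the other policy tunables. The sketch below is an illustration and not part of the commit. It only prints the commands, so it can be run without root. The device name my_cache, the helper name mq_promote_cmds and the warm-up value 1 are assumptions; the defaults 4, 8 and 1 come from the text above.

```shell
#!/bin/sh
# Print (do not run) the dmsetup messages that set the three mq promote
# adjustments on a cache device.
# mq_promote_cmds is a hypothetical helper name, not part of dmsetup.
mq_promote_cmds() {
    dev=$1
    read_adj=$2
    write_adj=$3
    discard_adj=$4
    echo "dmsetup message $dev 0 read_promote_adjustment $read_adj"
    echo "dmsetup message $dev 0 write_promote_adjustment $write_adj"
    echo "dmsetup message $dev 0 discard_promote_adjustment $discard_adj"
}

# Warming a new cache: lower the adjustments (1 is an arbitrary small value).
mq_promote_cmds my_cache 1 1 1

# Once the cache has filled: restore the documented defaults.
mq_promote_cmds my_cache 4 8 1
```

After reviewing the printed commands, they could be piped to sh as root to apply them.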
 cleaner
 -------
@@ -50,14 +50,16 @@ other parameters detailed later):
 which are dirty, and extra hints for use by the policy object.
 This information could be put on the cache device, but having it
 separate allows the volume manager to configure it differently,
-e.g. as a mirror for extra robustness.
+e.g. as a mirror for extra robustness. This metadata device may only
+be used by a single cache device.

 Fixed block size
 ----------------

 The origin is divided up into blocks of a fixed size. This block size
 is configurable when you first create the cache. Typically we've been
-using block sizes of 256k - 1024k.
+using block sizes of 256KB - 1024KB. The block size must be between 64
+(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).

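The block size limits added here can be checked before a table is loaded. The following is a minimal sketch of such a check, assuming the size is given in 512-byte sectors as in the constructor; valid_cache_block_size is a hypothetical helper name, and the sample sizes are arbitrary.

```shell
#!/bin/sh
# Check a dm-cache block size (in 512-byte sectors) against the limits
# stated above: between 64 and 2097152 sectors, and a multiple of 64.
# valid_cache_block_size is a hypothetical helper name.
valid_cache_block_size() {
    sectors=$1
    case $sectors in
        ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric input
    esac
    [ "$sectors" -ge 64 ] &&
        [ "$sectors" -le 2097152 ] &&
        [ $((sectors % 64)) -eq 0 ]
}

for size in 32 64 100 512 2048 2097152 4194304; do
    if valid_cache_block_size "$size"; then
        echo "$size sectors ($((size / 2)) KB): ok"
    else
        echo "$size sectors: rejected"
    fi
done
```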
 Having a fixed block size simplifies the target a lot. But it is
 something of a compromise. For instance, a small part of a block may be
@@ -66,10 +68,11 @@ So large block sizes are bad because they waste cache space. And small
 block sizes are bad because they increase the amount of metadata (both
 in core and on disk).

-Writeback/writethrough
-----------------------
+Cache operating modes
+---------------------

-The cache has two modes, writeback and writethrough.
+The cache has three operating modes: writeback, writethrough and
+passthrough.

 If writeback, the default, is selected then a write to a block that is
 cached will go only to the cache and the block will be marked dirty in
@@ -79,15 +82,38 @@ If writethrough is selected then a write to a cached block will not
 complete until it has hit both the origin and cache devices. Clean
 blocks should remain clean.

+If passthrough is selected, useful when the cache contents are not known
+to be coherent with the origin device, then all reads are served from
+the origin device (all reads miss the cache) and all writes are
+forwarded to the origin device; additionally, write hits cause cache
+block invalidates. To enable passthrough mode the cache must be clean.
+Passthrough mode allows a cache device to be activated without having to
+worry about coherency. Coherency that exists is maintained, although
+the cache will gradually cool as writes take place. If the coherency of
+the cache can later be verified, or established through use of the
+"invalidate_cblocks" message, the cache device can be transitioned to
+writethrough or writeback mode while still warm. Otherwise, the cache
+contents can be discarded prior to transitioning to the desired
+operating mode.
+
 A simple cleaner policy is provided, which will clean (write back) all
-dirty blocks in a cache. Useful for decommissioning a cache.
+dirty blocks in a cache. Useful for decommissioning a cache or when
+shrinking a cache. Shrinking the cache's fast device requires all cache
+blocks, in the area of the cache being removed, to be clean. If the
+area being removed from the cache still contains dirty blocks the resize
+will fail. Care must be taken to never reduce the volume used for the
+cache's fast device until the cache is clean. This is of particular
+importance if writeback mode is used. Writethrough and passthrough
+modes already maintain a clean cache. Future support to partially clean
+the cache, above a specified threshold, will allow for keeping the cache
+warm and in writeback mode during resize.

 Migration throttling
 --------------------

 Migrating data between the origin and cache device uses bandwidth.
 The user can set a throttle to prevent more than a certain amount of
-migration occuring at any one time. Currently we're not taking any
+migration occurring at any one time. Currently we're not taking any
 account of normal io traffic going to the devices. More work needs
 doing here to avoid migrating during those peak io moments.

@@ -98,12 +124,11 @@ the default being 204800 sectors (or 100MB).
 Updating on-disk metadata
 -------------------------

-On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
-written. If no such requests are made then commits will occur every
-second. This means the cache behaves like a physical disk that has a
-write cache (the same is true of the thin-provisioning target). If
-power is lost you may lose some recent writes. The metadata should
-always be consistent in spite of any crash.
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second. This
+means the cache behaves like a physical disk that has a volatile write
+cache. If power is lost you may lose some recent writes. The metadata
+should always be consistent in spite of any crash.

 The 'dirty' state for a cache block changes far too frequently for us
 to keep updating it on the fly. So we treat it as a hint. In normal
@@ -159,7 +184,7 @@ Constructor
 block size      : cache unit size in sectors

 #feature args   : number of feature arguments passed
-feature args    : writethrough. (The default is writeback.)
+feature args    : writethrough or passthrough (The default is writeback.)

 policy          : the replacement policy to use
 #policy args    : an even number of arguments corresponding to
@@ -175,6 +200,13 @@ Optional feature arguments are:
                  back cache block contents later for performance reasons,
                  so they may differ from the corresponding origin blocks.

+   passthrough : a degraded mode useful for various cache coherency
+                 situations (e.g., rolling back snapshots of
+                 underlying storage). Reads and writes always go to
+                 the origin. If a write goes to a cached origin
+                 block, then the cache block is invalidated.
+                 To enable passthrough mode the cache must be clean.
+
 A policy called 'default' is always registered. This is an alias for
 the policy we currently think is giving best all round performance.

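The operating mode is selected by a single feature argument in the table line. As a rough sketch (not part of the commit), the helper below assembles such a line in the field order used by the dmsetup example in this document's Examples section. The name cache_table is hypothetical, and the device paths are the placeholders from that example.

```shell
#!/bin/sh
# Build a dm-cache table line with one feature argument selecting the mode.
# cache_table is a hypothetical helper name.
cache_table() {
    sectors=$1
    metadata=$2
    fast=$3
    origin=$4
    block_size=$5
    mode=$6
    case $mode in
        writeback|writethrough|passthrough) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # "1 $mode"   = one feature argument
    # "default 0" = the default policy with no policy arguments
    echo "0 $sectors cache $metadata $fast $origin $block_size 1 $mode default 0"
}

cache_table 41943040 /dev/mapper/metadata /dev/mapper/ssd \
    /dev/mapper/origin 512 passthrough
```

The printed line could then be passed to dmsetup create via --table. Note that, per the text above, passthrough requires the cache to be clean.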
@@ -184,36 +216,43 @@ the characteristics of a specific policy, always request it by name.
 Status
 ------

-<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
-<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
-<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
-<policy args>*
+<metadata block size> <#used metadata blocks>/<#total metadata blocks>
+<cache block size> <#used cache blocks>/<#total cache blocks>
+<#read hits> <#read misses> <#write hits> <#write misses>
+<#demotions> <#promotions> <#dirty> <#features> <features>*
+<#core args> <core args>* <policy name> <#policy args> <policy args>*

-#used metadata blocks  : Number of metadata blocks used
-#total metadata blocks : Total number of metadata blocks
-#read hits             : Number of times a READ bio has been mapped
+metadata block size    : Fixed block size for each metadata block in
+                         sectors
+#used metadata blocks  : Number of metadata blocks used
+#total metadata blocks : Total number of metadata blocks
+cache block size       : Configurable block size for the cache device
+                         in sectors
+#used cache blocks     : Number of blocks resident in the cache
+#total cache blocks    : Total number of cache blocks
+#read hits             : Number of times a READ bio has been mapped
                          to the cache
 #read misses           : Number of times a READ bio has been mapped
                          to the origin
 #write hits            : Number of times a WRITE bio has been mapped
                          to the cache
 #write misses          : Number of times a WRITE bio has been
                          mapped to the origin
 #demotions             : Number of times a block has been removed
                          from the cache
 #promotions            : Number of times a block has been moved to
                          the cache
-#blocks in cache       : Number of blocks resident in the cache
-#dirty                 : Number of blocks in the cache that differ
+#dirty                 : Number of blocks in the cache that differ
                          from the origin
 #feature args          : Number of feature args to follow
 feature args           : 'writethrough' (optional)
 #core args             : Number of core arguments (must be even)
 core args              : Key/value pairs for tuning the core
                          e.g. migration_threshold
-#policy args           : Number of policy arguments to follow (must be even)
-policy args            : Key/value pairs
-                         e.g. 'sequential_threshold 1024
+policy name            : Name of the policy
+#policy args           : Number of policy arguments to follow (must be even)
+policy args            : Key/value pairs
+                         e.g. sequential_threshold

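Because this hunk reorders the status fields, scripts that read them by position need updating. The sketch below shows one way to pick fields out of the new layout. It assumes the argument is the target-specific part of the line only (dmsetup status prints "<start> <length> cache" before it); cache_status_summary is a hypothetical name and the sample values are made up.

```shell
#!/bin/sh
# Summarise a dm-cache status line in the new field order.
# cache_status_summary is a hypothetical helper name.
cache_status_summary() {
    # Split the line into positional parameters on whitespace.
    # shellcheck disable=SC2086
    set -- $1
    md_usage=$2         # <#used metadata blocks>/<#total metadata blocks>
    cache_usage=$4      # <#used cache blocks>/<#total cache blocks>
    read_hits=$5
    read_misses=$6
    dirty=${11}
    reads=$((read_hits + read_misses))
    if [ "$reads" -gt 0 ]; then
        hit_pct=$((100 * read_hits / reads))
    else
        hit_pct=0
    fi
    echo "metadata_used=${md_usage%/*} metadata_total=${md_usage#*/}"
    echo "cache_used=${cache_usage%/*} cache_total=${cache_usage#*/}"
    echo "read_hit_pct=$hit_pct dirty=$dirty"
}

# Made-up sample values, for illustration only.
sample='8 72/4096 512 100/2048 900 100 50 50 3 103 7 1 writethrough 2 migration_threshold 2048 mq 0'
cache_status_summary "$sample"
```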
 Messages
 --------
@@ -229,12 +268,28 @@ The message format is:
 E.g.
     dmsetup message my_cache 0 sequential_threshold 1024

+
+Invalidation is removing an entry from the cache without writing it
+back. Cache blocks can be invalidated via the invalidate_cblocks
+message, which takes an arbitrary number of cblock ranges. Each cblock
+range's end value is "one past the end", meaning 5-10 expresses a range
+of values from 5 to 9. Each cblock must be expressed as a decimal
+value; in the future a variant message that takes cblock ranges
+expressed in hexadecimal may be needed to better support efficient
+invalidation of larger caches. The cache must be in passthrough mode
+when invalidate_cblocks is used.
+
+    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
+
+E.g.
+    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
+
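The "one past the end" convention is easy to get wrong when the blocks to invalidate are known as inclusive ranges. The sketch below, which is not part of the commit, converts an inclusive first..last pair into the argument form described above and prints the resulting command instead of running it. cblock_arg is a hypothetical helper name.

```shell
#!/bin/sh
# Turn an inclusive cblock range (first..last) into the half-open
# "<begin>-<end>" form that invalidate_cblocks takes, where <end> is one
# past the last block. cblock_arg is a hypothetical helper name.
cblock_arg() {
    first=$1
    last=${2:-$1}
    if [ "$first" -eq "$last" ]; then
        echo "$first"                   # single cblock
    else
        echo "$first-$((last + 1))"     # end value is one past the end
    fi
}

# Invalidate cblocks 5..9 and cblock 2345; this prints the command only.
echo "dmsetup message my_cache 0 invalidate_cblocks $(cblock_arg 5 9) $(cblock_arg 2345)"
```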
 Examples
 ========

 The test suite can be found here:

-https://github.com/jthornber/thinp-test-suite
+https://github.com/jthornber/device-mapper-test-suite

 dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
     /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
@@ -4,12 +4,15 @@ dm-crypt
 Device-Mapper's "crypt" target provides transparent encryption of block devices
 using the kernel crypto API.

+For a more detailed description of supported parameters see:
+https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
+
 Parameters: <cipher> <key> <iv_offset> <device path> \
               <offset> [<#opt_params> <opt_params>]

 <cipher>
     Encryption cipher and an optional IV generation mode.
-    (In format cipher[:keycount]-chainmode-ivopts:ivmode).
+    (In format cipher[:keycount]-chainmode-ivmode[:ivopts]).
     Examples:
        des
        aes-cbc-essiv:sha256
@@ -19,7 +22,11 @@ Parameters: <cipher> <key> <iv_offset> <device path> \

 <key>
     Key used for encryption. It is encoded as a hexadecimal number.
-    You can only use key sizes that are valid for the selected cipher.
+    You can only use key sizes that are valid for the selected cipher
+    in combination with the selected iv mode.
+    Note that for some iv modes the key string can contain additional
+    keys (for example IV seed) so the key contains more parts concatenated
+    into a single string.

 <keycount>
     Multi-key compatibility mode. You can define <keycount> keys and
@@ -44,7 +51,7 @@ Parameters: <cipher> <key> <iv_offset> <device path> \
     Otherwise #opt_params is the number of following arguments.

     Example of optional parameters section:
-        1 allow_discards
+        3 allow_discards same_cpu_crypt submit_from_crypt_cpus

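The optional parameters section is a count followed by that many parameters, so the count must be kept in step with the list. A minimal sketch of building it follows, assuming only the three parameter names that appear in this diff are accepted. crypt_opt_params is a hypothetical helper name; with no arguments it prints nothing, matching the bracketed (optional) section in the Parameters line.

```shell
#!/bin/sh
# Build the optional-parameters section of a dm-crypt table line: the
# number of optional parameters followed by the parameters themselves.
# crypt_opt_params is a hypothetical helper name.
crypt_opt_params() {
    for opt in "$@"; do
        case $opt in
            allow_discards|same_cpu_crypt|submit_from_crypt_cpus) ;;
            *) echo "unknown optional parameter: $opt" >&2; return 1 ;;
        esac
    done
    # No parameters: omit the section entirely.
    if [ $# -eq 0 ]; then
        return 0
    fi
    echo "$# $*"
}

crypt_opt_params allow_discards same_cpu_crypt submit_from_crypt_cpus
```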
 allow_discards
     Block discard requests (a.k.a. TRIM) are passed through the crypt device.
@@ -56,11 +63,24 @@ allow_discards
     used space etc.) if the discarded blocks can be located easily on the
     device later.

+same_cpu_crypt
+    Perform encryption using the same cpu that IO was submitted on.
+    The default is to use an unbound workqueue so that encryption work
+    is automatically balanced between available CPUs.
+
+submit_from_crypt_cpus
+    Disable offloading writes to a separate thread after encryption.
+    There are some situations where offloading write bios from the
+    encryption threads to a single thread degrades performance
+    significantly. The default is to offload write bios to the same
+    thread because it benefits CFQ to have writes submitted using the
+    same context.
+
 Example scripts
 ===============
 LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
 encryption with dm-crypt using the 'cryptsetup' utility, see
-http://code.google.com/p/cryptsetup/
+https://gitlab.com/cryptsetup/cryptsetup

 [[
 #!/bin/sh
108
doc/kernel/era.txt
Normal file
@@ -0,0 +1,108 @@
+Introduction
+============
+
+dm-era is a target that behaves similarly to the linear target. In
+addition it keeps track of which blocks were written within a user
+defined period of time called an 'era'. Each era target instance
+maintains the current era as a monotonically increasing 32-bit
+counter.
+
+Use cases include tracking changed blocks for backup software, and
+partially invalidating the contents of a cache to restore cache
+coherency after rolling back a vendor snapshot.
+
+Constructor
+===========
+
+    era <metadata dev> <origin dev> <block size>
+
+    metadata dev : fast device holding the persistent metadata
+    origin dev   : device holding data blocks that may change
+    block size   : block size of origin data device, granularity that is
+                   tracked by the target
+
+Messages
+========
+
+None of the dm messages take any arguments.
+
+checkpoint
+----------
+
+Possibly move to a new era. You shouldn't assume the era has
+incremented. After sending this message, you should check the
+current era via the status line.
+
+take_metadata_snap
+------------------
+
+Create a clone of the metadata, to allow a userland process to read it.
+
+drop_metadata_snap
+------------------
+
+Drop the metadata snapshot.
+
+Status
+======
+
+<metadata block size> <#used metadata blocks>/<#total metadata blocks>
+<current era> <held metadata root | '-'>
+
+metadata block size    : Fixed block size for each metadata block in
+                         sectors
+#used metadata blocks  : Number of metadata blocks used
+#total metadata blocks : Total number of metadata blocks
+current era            : The current era
+held metadata root     : The location, in blocks, of the metadata root
+                         that has been 'held' for userspace read
+                         access. '-' indicates there is no held root
+
+Detailed use case
+=================
+
+The scenario of invalidating a cache when rolling back a vendor
+snapshot was the primary use case when developing this target:
+
+Taking a vendor snapshot
+------------------------
+
+- Send a checkpoint message to the era target
+- Make a note of the current era in its status line
+- Take vendor snapshot (the era and snapshot should be forever
+  associated now).
+
+Rolling back to a vendor snapshot
+---------------------------------
+
+- Cache enters passthrough mode (see: dm-cache's docs in cache.txt)
+- Rollback vendor storage
+- Take metadata snapshot
+- Ascertain which blocks have been written since the snapshot was taken
+  by checking each block's era
+- Invalidate those blocks in the caching software
+- Cache returns to writeback/writethrough mode
+
+Memory usage
+============
+
+The target uses a bitset to record writes in the current era. It also
+has a spare bitset ready for switching over to a new era. Other than
+that it uses a few 4k blocks for updating metadata.
+
+   (4 * nr_blocks) bytes + buffers
+
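To get a feel for the formula above, the estimate can be worked through for a concrete device. The sketch below assumes that nr_blocks is the origin size divided by the era block size, rounded up; the text does not state this. era_mem_bytes is a hypothetical helper name, and the metadata buffers are not included.

```shell
#!/bin/sh
# Estimate dm-era's in-core footprint from the formula above:
# (4 * nr_blocks) bytes, excluding the metadata buffers.
# Assumption: nr_blocks = origin size / block size, rounded up.
# era_mem_bytes is a hypothetical helper name.
era_mem_bytes() {
    origin_sectors=$1
    block_sectors=$2
    nr_blocks=$(( (origin_sectors + block_sectors - 1) / block_sectors ))
    echo $((4 * nr_blocks))
}

# A 41943040-sector (20 GiB) origin tracked in 8-sector (4 KiB) blocks.
era_mem_bytes 41943040 8
```

Under these assumptions a larger block size reduces memory use proportionally, at the cost of coarser change tracking.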
+Resilience
+==========
+
+Metadata is updated on disk before a write to a previously unwritten
+block is performed. As such dm-era should not be affected by a hard
+crash such as power failure.
+
+Userland tools
+==============
+
+Userland tools are found in the increasingly poorly named
+thin-provisioning-tools project:
+
+    https://github.com/jthornber/thin-provisioning-tools
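The "Taking a vendor snapshot" steps in era.txt rely on reading the current era back from the status line after a checkpoint. A minimal sketch of that read follows. It assumes the full line as printed by dmsetup status for a named device, which prefixes the target status with "<start> <length> <target type>". era_from_status and the device name my_era are hypothetical, and the sample line is made up.

```shell
#!/bin/sh
# Extract the current era from a full status line of an era target.
# With the "<start> <length> era" prefix, the era is the sixth field.
# era_from_status is a hypothetical helper name.
era_from_status() {
    # shellcheck disable=SC2086
    set -- $1
    [ "$3" = "era" ] || { echo "not an era target" >&2; return 1; }
    echo "$6"
}

# On a live system (as root) the procedure would start with:
#   dmsetup message my_era 0 checkpoint
#   era=$(era_from_status "$(dmsetup status my_era)")
# Made-up status line, for illustration only:
era_from_status '0 41943040 era 8 72/1024 5 -'
```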
140
doc/kernel/log-writes.txt
Normal file
@@ -0,0 +1,140 @@
+dm-log-writes
+=============
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to. This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_write_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log. The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+============
+
+We log things in order of completion once we are sure the write is no longer in
+cache. This means that normal WRITE requests are not actually logged until the
+next REQ_FLUSH request. This is to make it easier for userspace to replay the
+log in a way that correlates to what is on disk and not what is in cache, to
+make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_FLUSH request we splice this list onto the request and once
+the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
+completed WRITEs, at the time the REQ_FLUSH is issued, are added in order to
+simulate the worst case scenario with regard to power failures. Consider the
+following example (W means write, C means complete):
+
+    W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following
+
+    W3,W2,flush,W1....
+
+Again this is to simulate what is actually on disk; this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete as those requests will obviously bypass the device cache.
+
+Any REQ_DISCARD requests are treated like WRITE requests. Otherwise we would
+have all the DISCARD requests, and then the WRITE requests and then the FLUSH
+request. Consider the following example:
+
+    WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this
+
+    DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
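The ordering rule in the Log Ordering section can be traced with a small simulation. The sketch below is a userspace illustration of the rule as described, not the kernel implementation: completed writes join a pending list, issuing a flush attaches that list to the flush, and completing the flush logs the attached writes followed by the flush. It does not model REQ_FUA or DISCARD, and log_order is a hypothetical name.

```shell
#!/bin/sh
# Replay a comma-separated event sequence through the ordering rule
# described above and print the resulting log order.
# Wn = write n issued, Cn = write n completed,
# Wflush / Cflush = flush issued / completed.
log_order() {
    pending=""      # completed writes not yet attached to a flush
    attached=""     # writes spliced onto the in-flight flush
    log=""
    old_ifs=$IFS
    IFS=,
    # shellcheck disable=SC2086
    set -- $1
    IFS=$old_ifs
    for ev in "$@"; do
        case $ev in
            Wflush) attached=$pending; pending="" ;;
            Cflush) log="$log$attached flush"; attached="" ;;
            C*)     pending="$pending W${ev#C}" ;;
            W*)     ;;      # issuing a write logs nothing by itself
        esac
    done
    # Writes completed after the flush was issued wait for the next flush.
    echo "logged:${log} | waiting:${pending}"
}

log_order "W1,W2,W3,C3,C2,Wflush,C1,Cflush"
```

For the sequence from the text, W3 and W2 are logged before the flush, while W1 is left waiting for a later flush, which corresponds to the "W3,W2,flush,W1...." ordering given above.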
+Target interface
+================
+
+i) Constructor
+
+   log-writes <dev_path> <log_dev_path>
+
+   dev_path     : Device that all of the IO will go to normally.
+   log_dev_path : Device where the log entries are written to.
+
+ii) Status
+
+    <#logged entries> <highest allocated sector>
+
+    #logged entries          : Number of logged entries
+    highest allocated sector : Highest allocated sector
+
+iii) Messages
+
+    mark <description>
+
+        You can use a dmsetup message to set an arbitrary mark in a log.
+        For example say you want to fsck a file system after every
+        write, but first you need to replay up to the mkfs to make sure
+        we're fsck'ing something reasonable, you would do something like
+        this:
+
+          mkfs.btrfs -f /dev/mapper/log
+          dmsetup message log 0 mark mkfs
+          <run test>
+
+        This would allow you to replay the log up to the mkfs mark and
+        then replay from that point on doing the fsck check in the
+        interval that you want.
+
+        Every log has a mark at the end labeled "dm-log-writes-end".
+
+Userspace component
+===================
+
+There is a userspace tool that will replay the log for you in various ways.
+It can be found here: https://github.com/josefbacik/log-writes
+
+Example usage
+=============
+
+Say you want to test fsync on your file system. You would do something like
+this:
+
+    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+    dmsetup create log --table "$TABLE"
+    mkfs.btrfs -f /dev/mapper/log
+    dmsetup message log 0 mark mkfs
+
+    mount /dev/mapper/log /mnt/btrfs-test
+    <some test that does fsync at the end>
+    dmsetup message log 0 mark fsync
+    md5sum /mnt/btrfs-test/foo
+    umount /mnt/btrfs-test
+
+    dmsetup remove log
+    replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
+    mount /dev/sdb /mnt/btrfs-test
+    md5sum /mnt/btrfs-test/foo
+    <verify md5sum's are correct>
+
+Another option is to do a complicated file system operation and verify the file
+system is consistent during the entire operation. You could do this with:
+
+    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+    dmsetup create log --table "$TABLE"
+    mkfs.btrfs -f /dev/mapper/log
+    dmsetup message log 0 mark mkfs
+
+    mount /dev/mapper/log /mnt/btrfs-test
+    <fsstress to dirty the fs>
+    btrfs filesystem balance /mnt/btrfs-test
+    umount /mnt/btrfs-test
+    dmsetup remove log
+
+    replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
+    btrfsck /dev/sdb
+    replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
+        --fsck "btrfsck /dev/sdb" --check fua
+
+And that will replay the log until it sees a FUA request, run the fsck command
+and if the fsck passes it will replay to the next FUA, until it is completed or
+the fsck command exits abnormally.
@ -222,3 +222,5 @@ Version History
|
|||||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||||
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
|
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
|
||||||
|
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||||
|
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||||
|
186
doc/kernel/statistics.txt
Normal file
@ -0,0 +1,186 @@
DM statistics
=============

Device Mapper supports the collection of I/O statistics on user-defined
regions of a DM device.  If no regions are defined, no statistics are
collected, so there is no performance impact.  Only bio-based DM
devices are currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see:
Documentation/iostats.txt).  In addition, two extra counters (12 and 13)
are provided: total time spent reading and writing in milliseconds.  All
these counters may be accessed by sending the @stats_print message to
the appropriate DM device via dmsetup.

Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created.  The region_id
must be supplied when querying statistics about the region, deleting the
region, etc.  Unique region_ids enable multiple userspace programs to
request and process statistics for the same DM device without stepping
on each other's data.

The creation of DM statistics will allocate memory via kmalloc or
fall back to using vmalloc space.  At most, 1/4 of the overall system
memory may be allocated by DM statistics.  The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes

Messages
========

    @stats_create <range> <step> [<program_id> [<aux_data>]]

        Create a new region and return the region_id.

        <range>
          "-" - whole device
          "<start_sector>+<length>" - a range of <length> 512-byte sectors
                                      starting with <start_sector>.

        <step>
          "<area_size>" - the range is subdivided into areas each containing
                          <area_size> sectors.
          "/<number_of_areas>" - the range is subdivided into the specified
                                 number of areas.

        <program_id>
          An optional parameter.  A name that uniquely identifies
          the userspace owner of the range.  This groups ranges together
          so that userspace programs can identify the ranges they
          created and ignore those created by others.
          The kernel returns this string back in the output of
          @stats_list message, but it doesn't use it for anything else.

        <aux_data>
          An optional parameter.  A word that provides auxiliary data
          that is useful to the client program that created the range.
          The kernel returns this string back in the output of
          @stats_list message, but it doesn't use this value for anything.
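
How <range> and <step> translate into a number of areas can be sketched
in shell.  This is only an illustration: the function name stats_areas
is hypothetical, and it assumes that "/<number_of_areas>" is converted
to an area size by rounding up and that a short final area still counts
as an area.  Check the result against your kernel before relying on it.

```shell
# stats_areas LENGTH STEP
# LENGTH is the range length in 512-byte sectors; STEP is either
# "<area_size>" or "/<number_of_areas>" as accepted by @stats_create.
# Prints "<area_size> <number_of_areas>".
# Assumption: divisions round up, so the last area may be shorter.
stats_areas() {
    len=$1
    step=$2
    case $step in
        /*)
            parts=${step#/}
            size=$(( (len + parts - 1) / parts ))
            ;;
        *)
            size=$step
            ;;
    esac
    echo "$size $(( (len + size - 1) / size ))"
}

stats_areas 1000 /100   # area size 10, 100 areas
stats_areas 1000 300    # area size 300, 4 areas (the last covers 100 sectors)
```

Each area corresponds to one line of @stats_print output, so the second
number is also the line count to expect for the region.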

    @stats_delete <region_id>

        Delete the region with the specified id.

        <region_id>
          region_id returned from @stats_create

    @stats_clear <region_id>

        Clear all the counters except the in-flight i/o counters.

        <region_id>
          region_id returned from @stats_create

    @stats_list [<program_id>]

        List all regions registered with @stats_create.

        <program_id>
          An optional parameter.
          If this parameter is specified, only matching regions
          are returned.
          If it is not specified, all regions are returned.

        Output format:
          <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
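
Because every line of @stats_list output starts with "<region_id>:", a
script can recover the ids it needs for later @stats_print or
@stats_delete calls.  A minimal sketch follows; region_ids is a
hypothetical helper and the sample lines are invented to match the
format above, not captured from a real device.

```shell
# region_ids: read @stats_list output on stdin and print one
# region_id per line (the first field with its trailing colon removed).
region_ids() {
    awk '{ sub(/:$/, "", $1); print $1 }'
}

# Invented sample input in the documented format:
printf '%s\n' '0: 0+2048 512 backup -' '3: 2048+4096 1024 backup -' | region_ids
```

In practice the input would come from something like
"dmsetup message vol 0 @stats_list <program_id> | region_ids".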

    @stats_print <region_id> [<starting_line> <number_of_lines>]

        Print counters for each step-sized area of a region.

        <region_id>
          region_id returned from @stats_create

        <starting_line>
          The index of the starting line in the output.
          If omitted, all lines are returned.

        <number_of_lines>
          The number of lines to include in the output.
          If omitted, all lines are returned.

        Output format for each step-sized area of a region:

          <start_sector>+<length> counters

          The first 11 counters have the same meaning as
          /sys/block/*/stat or /proc/diskstats.

          Please refer to Documentation/iostats.txt for details.

          1. the number of reads completed
          2. the number of reads merged
          3. the number of sectors read
          4. the number of milliseconds spent reading
          5. the number of writes completed
          6. the number of writes merged
          7. the number of sectors written
          8. the number of milliseconds spent writing
          9. the number of I/Os currently in progress
          10. the number of milliseconds spent doing I/Os
          11. the weighted number of milliseconds spent doing I/Os

          Additional counters:
          12. the total time spent reading in milliseconds
          13. the total time spent writing in milliseconds
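
The counter positions listed above can be turned into named values with
a small awk filter.  This is a sketch under the assumption that each
output line is the "<start_sector>+<length>" token followed by the 13
counters separated by whitespace; stats_summary is a hypothetical name
and the sample line is invented.

```shell
# stats_summary: read @stats_print output on stdin and print a few
# named counters per area.  Field 1 is the range; counter N is field N+1.
stats_summary() {
    awk '{
        split($1, range, "[+]")
        printf "start=%s len=%s reads=%s sectors_read=%s writes=%s sectors_written=%s in_flight=%s\n",
            range[1], range[2], $2, $4, $6, $8, $10
    }'
}

# Invented sample line: one 1024-sector area with 13 counters.
echo '0+1024 5 0 40 3 7 1 56 9 0 12 12 3 9' | stats_summary
```

Selecting by position keeps the filter independent of how many areas
the region has; each area simply produces one summary line.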

    @stats_print_clear <region_id> [<starting_line> <number_of_lines>]

        Atomically print and then clear all the counters except the
        in-flight i/o counters.  Useful when the client consuming the
        statistics does not want to lose any statistics (those updated
        between printing and clearing).

        <region_id>
          region_id returned from @stats_create

        <starting_line>
          The index of the starting line in the output.
          If omitted, all lines are printed and then cleared.

        <number_of_lines>
          The number of lines to process.
          If omitted, all lines are printed and then cleared.

    @stats_set_aux <region_id> <aux_data>

        Store auxiliary data aux_data for the specified region.

        <region_id>
          region_id returned from @stats_create

        <aux_data>
          The string that identifies data which is useful to the client
          program that created the range.  The kernel returns this
          string back in the output of @stats_list message, but it
          doesn't use this value for anything.

Examples
========

Subdivide the DM device 'vol' into 100 pieces and start collecting
statistics on them:

  dmsetup message vol 0 @stats_create - /100

Set the auxiliary data string to "foo bar baz" (the backslash that
escapes each space must itself be escaped, otherwise the shell will
consume it):

  dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
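
To see what the shell does with the doubled backslashes, the following
sketch prints each word exactly as a command would receive it.
show_args is a hypothetical helper; nothing here talks to device-mapper.

```shell
# show_args: print every argument in brackets, exactly as received.
show_args() {
    for arg in "$@"; do
        printf '[%s]' "$arg"
    done
    printf '\n'
}

show_args foo\\ bar\\ baz   # [foo\][bar\][baz]: three words, backslashes kept
show_args foo\ bar\ baz     # [foo bar baz]: one word, backslashes consumed
```

With the doubled form each backslash survives the shell and is still
present in the words handed to dmsetup; with the single form the shell
has already removed it.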

List the statistics:

  dmsetup message vol 0 @stats_list

Print the statistics:

  dmsetup message vol 0 @stats_print 0

Delete the statistics:

  dmsetup message vol 0 @stats_delete 0
138
doc/kernel/switch.txt
Normal file
@ -0,0 +1,138 @@
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-sized address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an eth port on a single member.  Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required.  This
forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However in this architecture the LUN is
spread with an address region size on the order of 10s of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
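
The "4 bits for 16 members" figure can be reproduced with a
back-of-the-envelope calculation.  The sketch below assumes each region
entry takes the smallest whole number of bits that can encode a path
number (at least one bit) and that entries are packed without padding;
the kernel's actual table layout may differ.  Both function names are
hypothetical.

```shell
# bits_per_region NUM_PATHS: smallest bit count that can hold a path number.
bits_per_region() {
    paths=$1
    bits=1
    while [ $((1 << bits)) -lt "$paths" ]; do
        bits=$((bits + 1))
    done
    echo "$bits"
}

# region_table_bytes NUM_REGIONS NUM_PATHS: estimated bitmap size in bytes,
# assuming tightly packed entries (an assumption, see above).
region_table_bytes() {
    bits=$(bits_per_region "$2")
    echo $(( ($1 * bits + 7) / 8 ))
}

bits_per_region 16               # 4
region_table_bytes 1048576 16    # 524288 bytes for about a million regions
```

Under these assumptions a million-region LUN needs roughly half a
megabyte of bitmap, which is the density argument made above.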

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region.  Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments.  Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors).  This number is added to the sector number when
    forwarding the request to the specific path.  Typically it is zero.
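
As an illustration of how these parameters line up, the sketch below
assembles a table line from a device size, a region size and a list of
paths.  switch_table is a hypothetical helper; it assumes every path
uses offset 0 and that there are no optional arguments, and it only
prints the line rather than calling dmsetup.

```shell
# switch_table SIZE_SECTORS REGION_SIZE DEV...
# Prints "0 <size> switch <num_paths> <region_size> 0 <dev> 0 ...".
# Assumptions: <num_optional_args> is 0 and each <offset> is 0.
switch_table() {
    size=$1
    region=$2
    shift 2
    line="0 $size switch $# $region 0"
    for dev in "$@"; do
        line="$line $dev 0"
    done
    echo "$line"
}

switch_table 2097152 128 /dev/vg1/switch0 /dev/vg1/switch1
```

The printed line could then be passed to "dmsetup create <name> --table".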

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly.  <n> and <m>
    are hexadecimal numbers.  The last <n> mappings are repeated in the next <m>
    slots.
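
The effect of R<n>,<m> can be previewed by expanding it into explicit
":<path_nr>" entries before sending the message.  The sketch below is
an illustration only: expand_region_mappings and nth_word are
hypothetical helpers, the input is assumed to be well formed, and only
mappings that appear earlier in the same argument list are considered
when looking <n> slots back.

```shell
# nth_word N WORD...: print the Nth word (1-based).
nth_word() {
    n=$1
    shift
    shift $((n - 1))
    printf '%s\n' "$1"
}

# expand_region_mappings ARG...: rewrite each R<n>,<m> as <m> explicit
# ":<path_nr>" entries copied from the entry <n> slots back.
expand_region_mappings() {
    out=
    count=0
    for tok in "$@"; do
        case $tok in
            R*,*)
                spec=${tok#R}
                n=$((0x${spec%,*}))      # <n> is hexadecimal
                m=$((0x${spec#*,}))      # <m> is hexadecimal
                if [ "$n" -lt 1 ] || [ "$n" -gt "$count" ]; then
                    echo "$tok: fewer than $n preceding mappings" >&2
                    return 1
                fi
                i=0
                while [ "$i" -lt "$m" ]; do
                    prev=$(nth_word $((count - n + 1)) $out)
                    out="$out :${prev#*:}"
                    count=$((count + 1))
                    i=$((i + 1))
                done
                ;;
            *)
                out="${out:+$out }$tok"
                count=$((count + 1))
                ;;
        esac
    done
    echo "$out"
}

expand_region_mappings 0:0 :1 :2 R3,3    # 0:0 :1 :2 :0 :1 :2
```

Note that <m> is hexadecimal, so "R2,10" repeats the last two mappings
into the next sixteen slots, not ten.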

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.

Create a switch device with 64kB region size:
    dmsetup create switch --table "0 $(blockdev --getsz /dev/vg1/switch0) \
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping.  This command:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
	:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
@ -99,13 +99,14 @@ Using an existing pool device
|
|||||||
$data_block_size $low_water_mark"
|
$data_block_size $low_water_mark"
|
||||||
|
|
||||||
$data_block_size gives the smallest unit of disk space that can be
|
$data_block_size gives the smallest unit of disk space that can be
|
||||||
allocated at a time expressed in units of 512-byte sectors. People
|
allocated at a time expressed in units of 512-byte sectors.
|
||||||
primarily interested in thin provisioning may want to use a value such
|
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
|
||||||
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
|
multiple of 128 (64KB). $data_block_size cannot be changed after the
|
||||||
such as 128 (64KB). If you are not zeroing newly-allocated data,
|
thin-pool is created. People primarily interested in thin provisioning
|
||||||
a larger $data_block_size in the region of 256000 (128MB) is suggested.
|
may want to use a value such as 1024 (512KB). People doing lots of
|
||||||
$data_block_size must be the same for the lifetime of the
|
snapshotting may want a smaller value such as 128 (64KB). If you are
|
||||||
metadata device.
|
not zeroing newly-allocated data, a larger $data_block_size in the
|
||||||
|
region of 256000 (128MB) is suggested.
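
The constraints on $data_block_size stated above (between 128 and
2097152 sectors, and a multiple of 128) are easy to check before
creating a pool.  A minimal sketch; valid_data_block_size is a
hypothetical helper and does not consult the kernel.

```shell
# valid_data_block_size SECTORS: succeed if SECTORS lies in
# [128, 2097152] and is a multiple of 128 (units of 512-byte sectors).
valid_data_block_size() {
    [ "$1" -ge 128 ] && [ "$1" -le 2097152 ] && [ $(( $1 % 128 )) -eq 0 ]
}

for size in 128 1024 192 4194304; do
    if valid_data_block_size "$size"; then
        echo "$size ok"
    else
        echo "$size rejected"
    fi
done
```

Since the value cannot be changed after the thin-pool is created,
checking it up front is cheaper than recreating the pool.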
|
||||||
|
|
||||||
$low_water_mark is expressed in blocks of size $data_block_size. If
|
$low_water_mark is expressed in blocks of size $data_block_size. If
|
||||||
free space on the data device drops below this level then a dm event
|
free space on the data device drops below this level then a dm event
|
||||||
@ -115,6 +116,35 @@ Resuming a device with a new table itself triggers an event so the
|
|||||||
userspace daemon can use this to detect a situation where a new table
|
userspace daemon can use this to detect a situation where a new table
|
||||||
already exceeds the threshold.
|
already exceeds the threshold.
|
||||||
|
|
||||||
|
A low water mark for the metadata device is maintained in the kernel and
|
||||||
|
will trigger a dm event if free space on the metadata device drops below
|
||||||
|
it.
|
||||||
|
|
||||||
|
Updating on-disk metadata
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
On-disk metadata is committed every time a FLUSH or FUA bio is written.
|
||||||
|
If no such requests are made then commits will occur every second. This
|
||||||
|
means the thin-provisioning target behaves like a physical disk that has
|
||||||
|
a volatile write cache. If power is lost you may lose some recent
|
||||||
|
writes. The metadata should always be consistent in spite of any crash.
|
||||||
|
|
||||||
|
If data space is exhausted the pool will either error or queue IO
|
||||||
|
according to the configuration (see: error_if_no_space). If metadata
|
||||||
|
space is exhausted or a metadata operation fails: the pool will error IO
|
||||||
|
until the pool is taken offline and repair is performed to 1) fix any
|
||||||
|
potential inconsistencies and 2) clear the flag that imposes repair.
|
||||||
|
Once the pool's metadata device is repaired it may be resized, which
|
||||||
|
will allow the pool to return to normal operation. Note that if a pool
|
||||||
|
is flagged as needing repair, the pool's data and metadata devices
|
||||||
|
cannot be resized until repair is performed. It should also be noted
|
||||||
|
that when the pool's metadata space is exhausted the current metadata
|
||||||
|
transaction is aborted. Given that the pool will cache IO whose
|
||||||
|
completion may have already been acknowledged to upper IO layers
|
||||||
|
(e.g. filesystem) it is strongly suggested that consistency checks
|
||||||
|
(e.g. fsck) be performed on those layers when repair of the pool is
|
||||||
|
required.
|
||||||
|
|
||||||
Thin provisioning
|
Thin provisioning
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
@ -234,6 +264,8 @@ i) Constructor
|
|||||||
read_only: Don't allow any changes to be made to the pool
|
read_only: Don't allow any changes to be made to the pool
|
||||||
metadata.
|
metadata.
|
||||||
|
|
||||||
|
error_if_no_space: Error IOs, instead of queueing, if no space.
|
||||||
|
|
||||||
Data block size must be between 64KB (128 sectors) and 1GB
|
Data block size must be between 64KB (128 sectors) and 1GB
|
||||||
(2097152 sectors) inclusive.
|
(2097152 sectors) inclusive.
|
||||||
|
|
||||||
@ -255,10 +287,9 @@ ii) Status
|
|||||||
should register for the event and then check the target's status.
|
should register for the event and then check the target's status.
|
||||||
|
|
||||||
held metadata root:
|
held metadata root:
|
||||||
The location, in sectors, of the metadata root that has been
|
The location, in blocks, of the metadata root that has been
|
||||||
'held' for userspace read access. '-' indicates there is no
|
'held' for userspace read access. '-' indicates there is no
|
||||||
held root. This feature is not yet implemented so '-' is
|
held root.
|
||||||
always returned.
|
|
||||||
|
|
||||||
discard_passdown|no_discard_passdown
|
discard_passdown|no_discard_passdown
|
||||||
Whether or not discards are actually being passed down to the
|
Whether or not discards are actually being passed down to the
|
||||||
@ -275,6 +306,14 @@ ii) Status
|
|||||||
contain the string 'Fail'. The userspace recovery tools
|
contain the string 'Fail'. The userspace recovery tools
|
||||||
should then be used.
|
should then be used.
|
||||||
|
|
||||||
|
error_if_no_space|queue_if_no_space
|
||||||
|
If the pool runs out of data or metadata space, the pool will
|
||||||
|
either queue or error the IO destined to the data device. The
|
||||||
|
default is to queue the IO until more space is added or the
|
||||||
|
'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
|
||||||
|
module parameter can be used to change this timeout -- it
|
||||||
|
defaults to 60 seconds but may be disabled using a value of 0.
|
||||||
|
|
||||||
iii) Messages
|
iii) Messages
|
||||||
|
|
||||||
create_thin <dev id>
|
create_thin <dev id>
|
||||||
@ -341,9 +380,6 @@ then you'll have no access to blocks mapped beyond the end. If you
|
|||||||
load a target that is bigger than before, then extra blocks will be
|
load a target that is bigger than before, then extra blocks will be
|
||||||
provisioned as and when needed.
|
provisioned as and when needed.
|
||||||
|
|
||||||
If you wish to reduce the size of your thin device and potentially
|
|
||||||
regain some space then send the 'trim' message to the pool.
|
|
||||||
|
|
||||||
ii) Status
|
ii) Status
|
||||||
|
|
||||||
<nr mapped sectors> <highest mapped sector>
|
<nr mapped sectors> <highest mapped sector>
|
||||||
|
@ -11,6 +11,7 @@ Construction Parameters
|
|||||||
<data_block_size> <hash_block_size>
|
<data_block_size> <hash_block_size>
|
||||||
<num_data_blocks> <hash_start_block>
|
<num_data_blocks> <hash_start_block>
|
||||||
<algorithm> <digest> <salt>
|
<algorithm> <digest> <salt>
|
||||||
|
[<#opt_params> <opt_params>]
|
||||||
|
|
||||||
<version>
|
<version>
|
||||||
This is the type of the on-disk hash format.
|
This is the type of the on-disk hash format.
|
||||||
@ -62,6 +63,22 @@ Construction Parameters
|
|||||||
<salt>
|
<salt>
|
||||||
The hexadecimal encoding of the salt value.
|
The hexadecimal encoding of the salt value.
|
||||||
|
|
||||||
|
<#opt_params>
|
||||||
|
Number of optional parameters. If there are no optional parameters,
|
||||||
|
the optional parameters section can be skipped or #opt_params can be zero.
|
||||||
|
Otherwise #opt_params is the number of following arguments.
|
||||||
|
|
||||||
|
Example of optional parameters section:
|
||||||
|
1 ignore_corruption
|
||||||
|
|
||||||
|
ignore_corruption
|
||||||
|
Log corrupted blocks, but allow read operations to proceed normally.
|
||||||
|
|
||||||
|
restart_on_corruption
|
||||||
|
Restart the system when a corrupted block is discovered. This option is
|
||||||
|
not compatible with ignore_corruption and requires user space support to
|
||||||
|
avoid restart loops.
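
The optional parameters section is simply a count followed by that many
words.  The sketch below builds it from a list of options and refuses
the one combination described above as incompatible.
verity_opt_params is a hypothetical helper; it prints nothing when no
options are given, which corresponds to skipping the section.

```shell
# verity_opt_params [OPTION...]: print "<#opt_params> <opt_params>".
# Fails if both ignore_corruption and restart_on_corruption are given.
verity_opt_params() {
    ignore=0
    restart=0
    for opt in "$@"; do
        case $opt in
            ignore_corruption) ignore=1 ;;
            restart_on_corruption) restart=1 ;;
        esac
    done
    if [ "$ignore" -eq 1 ] && [ "$restart" -eq 1 ]; then
        echo "ignore_corruption and restart_on_corruption cannot be combined" >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        return 0    # no optional parameters: omit the section
    fi
    echo "$# $*"
}

verity_opt_params ignore_corruption    # 1 ignore_corruption
```

The printed string would be appended after <salt> in the table line.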
|
||||||
|
|
||||||
Theory of operation
|
Theory of operation
|
||||||
===================
|
===================
|
||||||
|
|
||||||
@ -125,7 +142,7 @@ block boundary) are the hash blocks which are stored a depth at a time
|
|||||||
|
|
||||||
The full specification of kernel parameters and on-disk metadata format
|
The full specification of kernel parameters and on-disk metadata format
|
||||||
is available at the cryptsetup project's wiki page
|
is available at the cryptsetup project's wiki page
|
||||||
http://code.google.com/p/cryptsetup/wiki/DMVerity
|
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
|
||||||
|
|
||||||
Status
|
Status
|
||||||
======
|
======
|
||||||
@ -142,7 +159,7 @@ Set up a device:
|
|||||||
|
|
||||||
A command line tool veritysetup is available to compute or verify
|
A command line tool veritysetup is available to compute or verify
|
||||||
the hash tree or activate the kernel device. This is available from
|
the hash tree or activate the kernel device. This is available from
|
||||||
the cryptsetup upstream repository http://code.google.com/p/cryptsetup/
|
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
|
||||||
(as a libcryptsetup extension).
|
(as a libcryptsetup extension).
|
||||||
|
|
||||||
Create hash on the device:
|
Create hash on the device:
|
||||||
|