Include a copy of kernel DM documentation in doc/kernel
This commit is contained in: parent 5680d14ecd, commit fb7817fe7c
WHATS_NEW_DM
@@ -1,5 +1,6 @@
Version 1.02.68 -
==================================
  Include a copy of kernel DM documentation in doc/kernel.
  Improve man page style for dmsetup.
  Fix _get_proc_number to be tolerant of malformed /proc/misc entries.
  Add ExecReload to dm-event.service for systemd to reload dmeventd properly.
76 doc/kernel/crypt.txt Normal file
@@ -0,0 +1,76 @@
dm-crypt
=========

Device-Mapper's "crypt" target provides transparent encryption of block devices
using the kernel crypto API.

Parameters: <cipher> <key> <iv_offset> <device path> \
	      <offset> [<#opt_params> <opt_params>]

<cipher>
    Encryption cipher and an optional IV generation mode.
    (In format cipher[:keycount]-chainmode-ivopts:ivmode).
    Examples:
       des
       aes-cbc-essiv:sha256
       twofish-ecb

    /proc/crypto contains supported crypto modes

<key>
    Key used for encryption. It is encoded as a hexadecimal number.
    You can only use key sizes that are valid for the selected cipher.

<keycount>
    Multi-key compatibility mode. You can define <keycount> keys and
    then sectors are encrypted according to their offsets (sector 0 uses key0;
    sector 1 uses key1 etc.). <keycount> must be a power of two.

<iv_offset>
    The IV offset is a sector count that is added to the sector number
    before creating the IV.

<device path>
    This is the device that is going to be used as a backend and contains the
    encrypted data. You can specify it as a path like /dev/xxx or a device
    number <major>:<minor>.

<offset>
    Starting sector within the device where the encrypted data begins.

<#opt_params>
    Number of optional parameters. If there are no optional parameters,
    the optional parameters section can be skipped or #opt_params can be zero.
    Otherwise #opt_params is the number of following arguments.

    Example of optional parameters section:
        1 allow_discards

allow_discards
    Block discard requests (a.k.a. TRIM) are passed through the crypt device.
    The default is to ignore discard requests.

    WARNING: Assess the specific security risks carefully before enabling this
    option. For example, allowing discards on encrypted devices may lead to
    the leak of information about the ciphertext device (filesystem type,
    used space etc.) if the discarded blocks can be located easily on the
    device later.

Example scripts
===============
LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
encryption with dm-crypt using the 'cryptsetup' utility, see
http://code.google.com/p/cryptsetup/

[[
#!/bin/sh
# Create a crypt device using dmsetup
dmsetup create crypt1 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
]]

[[
#!/bin/sh
# Create a crypt device using cryptsetup and LUKS header with default cipher
cryptsetup luksFormat $1
cryptsetup luksOpen $1 crypt1
]]
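As a further illustration of the <keycount> compatibility mode described
above, here is a minimal sketch (not part of the original text; the second
key and the device name are made up, and the cipher spec simply follows the
cipher[:keycount]-chainmode-ivopts:ivmode format documented earlier):

[[
#!/bin/sh
# Sketch: 2-key compatibility mode. The cipher spec aes:2-cbc-essiv:sha256
# sets <keycount>=2, so <key> holds two concatenated 16-byte keys and
# sectors alternate between key0 and key1 according to their offset.
KEY0=babebabebabebabebabebabebabebabe
KEY1=cafecafecafecafecafecafecafecafe
dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes:2-cbc-essiv:sha256 $KEY0$KEY1 0 $1 0"
]]
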
26 doc/kernel/delay.txt Normal file
@@ -0,0 +1,26 @@
dm-delay
========

Device-Mapper's "delay" target delays reads and/or writes
and maps them to different devices.

Parameters:
    <device> <offset> <delay> [<write_device> <write_offset> <write_delay>]

With separate write parameters, the first set is only used for reads.
Delays are specified in milliseconds.

Example scripts
===============
[[
#!/bin/sh
# Create device delaying rw operation for 500ms
echo "0 `blockdev --getsize $1` delay $1 0 500" | dmsetup create delayed
]]

[[
#!/bin/sh
# Create device delaying only write operation for 500ms and
# splitting reads and writes to different devices $1 $2
echo "0 `blockdev --getsize $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
]]
53 doc/kernel/flakey.txt Normal file
@@ -0,0 +1,53 @@
dm-flakey
=========

This target is the same as the linear target except that it exhibits
unreliable behaviour periodically. It's been found useful in simulating
failing devices for testing purposes.

Starting from the time the table is loaded, the device is available for
<up interval> seconds, then exhibits unreliable behaviour for <down
interval> seconds, and then this cycle repeats.

Consider using this in combination with the dm-delay target, which can
delay reads and writes and/or send them to different underlying devices.

Table parameters
----------------
  <dev path> <offset> <up interval> <down interval> \
    [<num_features> [<feature arguments>]]

Mandatory parameters:
    <dev path>: Full pathname to the underlying block-device, or a
                "major:minor" device-number.
    <offset>: Starting sector within the device.
    <up interval>: Number of seconds device is available.
    <down interval>: Number of seconds device returns errors.

Optional feature parameters:
  If no feature parameters are present, during the periods of
  unreliability, all I/O returns errors.

  drop_writes:
	All write I/O is silently ignored.
	Read I/O is handled correctly.

  corrupt_bio_byte <Nth_byte> <direction> <value> <flags>:
	During <down interval>, replace <Nth_byte> of the data of
	each matching bio with <value>.

    <Nth_byte>: The offset of the byte to replace.
		Counting starts at 1, to replace the first byte.
    <direction>: Either 'r' to corrupt reads or 'w' to corrupt writes.
		 'w' is incompatible with drop_writes.
    <value>: The value (from 0-255) to write.
    <flags>: Perform the replacement only if bio->bi_rw has all the
	     selected flags set.

Examples:
  corrupt_bio_byte 32 r 1 0
	- replaces the 32nd byte of READ bios with the value 1

  corrupt_bio_byte 224 w 0 32
	- replaces the 224th byte of REQ_META (=32) bios with the value 0
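A minimal example script in the style of the other target documents (a
sketch; the device name and the 59s/1s intervals are arbitrary choices):

[[
#!/bin/sh
# Create a flakey device that is healthy for 59 seconds, then returns
# errors on all I/O for 1 second, repeating this cycle indefinitely.
echo "0 `blockdev --getsize $1` flakey $1 0 59 1" | dmsetup create flakey1
]]
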
75 doc/kernel/io.txt Normal file
@@ -0,0 +1,75 @@
dm-io
=====

Dm-io provides synchronous and asynchronous I/O services. There are three
types of I/O services available, and each type has a sync and an async
version.

The user must set up an io_region structure to describe the desired location
of the I/O. Each io_region indicates a block-device along with the starting
sector and size of the region.

   struct io_region {
      struct block_device *bdev;
      sector_t sector;
      sector_t count;
   };

Dm-io can read from one io_region or write to one or more io_regions. Writes
to multiple regions are specified by an array of io_region structures.

The first I/O service type takes a list of memory pages as the data buffer for
the I/O, along with an offset into the first page.

   struct page_list {
      struct page_list *next;
      struct page *page;
   };

   int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw,
                  struct page_list *pl, unsigned int offset,
                  unsigned long *error_bits);
   int dm_io_async(unsigned int num_regions, struct io_region *where, int rw,
                   struct page_list *pl, unsigned int offset,
                   io_notify_fn fn, void *context);

The second I/O service type takes an array of bio vectors as the data buffer
for the I/O. This service can be handy if the caller has a pre-assembled bio,
but wants to direct different portions of the bio to different devices.

   int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
                       int rw, struct bio_vec *bvec,
                       unsigned long *error_bits);
   int dm_io_async_bvec(unsigned int num_regions, struct io_region *where,
                        int rw, struct bio_vec *bvec,
                        io_notify_fn fn, void *context);

The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
data buffer for the I/O. This service can be handy if the caller needs to do
I/O to a large region but doesn't want to allocate a large number of individual
memory pages.

   int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
                     void *data, unsigned long *error_bits);
   int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw,
                      void *data, io_notify_fn fn, void *context);

Callers of the asynchronous I/O services must include the name of a completion
callback routine and a pointer to some context data for the I/O.

   typedef void (*io_notify_fn)(unsigned long error, void *context);

The "error" parameter in this callback, as well as the "*error" parameter in
all of the synchronous versions, is a bitset (instead of a simple error value).
In the case of a write I/O to multiple regions, this bitset allows dm-io to
indicate success or failure on each individual region.

Before using any of the dm-io services, the user should call dm_io_get()
and specify the number of pages they expect to perform I/O on concurrently.
Dm-io will attempt to resize its mempool to make sure enough pages are
always available in order to avoid unnecessary waiting while performing I/O.

When the user is finished using the dm-io services, they should call
dm_io_put() and specify the same number of pages that were given on the
dm_io_get() call.
47 doc/kernel/kcopyd.txt Normal file
@@ -0,0 +1,47 @@
kcopyd
======

Kcopyd provides the ability to copy a range of sectors from one block-device
to one or more other block-devices, with an asynchronous completion
notification. It is used by dm-snapshot and dm-mirror.

Users of kcopyd must first create a client and indicate how many memory pages
to set aside for their copy jobs. This is done with a call to
kcopyd_client_create().

   int kcopyd_client_create(unsigned int num_pages,
                            struct kcopyd_client **result);

To start a copy job, the user must set up io_region structures to describe
the source and destinations of the copy. Each io_region indicates a
block-device along with the starting sector and size of the region. The source
of the copy is given as one io_region structure, and the destinations of the
copy are given as an array of io_region structures.

   struct io_region {
      struct block_device *bdev;
      sector_t sector;
      sector_t count;
   };

To start the copy, the user calls kcopyd_copy(), passing in the client
pointer, pointers to the source and destination io_regions, the name of a
completion callback routine, and a pointer to some context data for the copy.

   int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
                   unsigned int num_dests, struct io_region *dests,
                   unsigned int flags, kcopyd_notify_fn fn, void *context);

   typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err,
                                    void *context);

When the copy completes, kcopyd will call the user's completion routine,
passing back the user's context pointer. It will also indicate if a read or
write error occurred during the copy.

When a user is done with all their copy jobs, they should call
kcopyd_client_destroy() to delete the kcopyd client, which will release the
associated memory pages.

   void kcopyd_client_destroy(struct kcopyd_client *kc);
61 doc/kernel/linear.txt Normal file
@@ -0,0 +1,61 @@
dm-linear
=========

Device-Mapper's "linear" target maps a linear range of the Device-Mapper
device onto a linear range of another device. This is the basic building
block of logical volume managers.

Parameters: <dev path> <offset>
    <dev path>: Full pathname to the underlying block-device, or a
                "major:minor" device-number.
    <offset>: Starting sector within the device.


Example scripts
===============
[[
#!/bin/sh
# Create an identity mapping for a device
echo "0 `blockdev --getsize $1` linear $1 0" | dmsetup create identity
]]


[[
#!/bin/sh
# Join 2 devices together
size1=`blockdev --getsize $1`
size2=`blockdev --getsize $2`
echo "0 $size1 linear $1 0
$size1 $size2 linear $2 0" | dmsetup create joined
]]


[[
#!/usr/bin/perl -w
# Split a device into 4M chunks and then join them together in reverse order.

my $name = "reverse";
my $extent_size = 4 * 1024 * 2;
my $dev = $ARGV[0];
my $table = "";
my $count = 0;

if (!defined($dev)) {
        die("Please specify a device.\n");
}

my $dev_size = `blockdev --getsize $dev`;
# int() already rounds down, so any partial final extent is simply ignored.
my $extents = int($dev_size / $extent_size);

while ($extents > 0) {
        my $this_start = $count * $extent_size;
        $extents--;
        $count++;
        my $this_offset = $extents * $extent_size;

        $table .= "$this_start $extent_size linear $dev $this_offset\n";
}

`echo \"$table\" | dmsetup create $name`;
]]
54 doc/kernel/log.txt Normal file
@@ -0,0 +1,54 @@
Device-Mapper Logging
=====================
The device-mapper logging code is used by some of the device-mapper
RAID targets to track regions of the disk that are not consistent.
A region (or portion of the address space) of the disk may be
inconsistent because a RAID stripe is currently being operated on or
a machine died while the region was being altered. In the case of
mirrors, a region would be considered dirty/inconsistent while you
are writing to it because the writes need to be replicated for all
the legs of the mirror and may not reach the legs at the same time.
Once all writes are complete, the region is considered clean again.

There is a generic logging interface that the device-mapper RAID
implementations use to perform logging operations (see
dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
logging implementations are available and provide different
capabilities. The list includes:

Type		Files
====		=====
disk		drivers/md/dm-log.c
core		drivers/md/dm-log.c
userspace	drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h

The "disk" log type
-------------------
This log implementation commits the log state to disk. This way, the
logging state survives reboots/crashes.

The "core" log type
-------------------
This log implementation keeps the log state in memory. The log state
will not survive a reboot or crash, but there may be a small boost in
performance. This method can also be used if no storage device is
available for storing log state.

The "userspace" log type
------------------------
This log type simply provides a way to export the log API to userspace,
so log implementations can be done there. This is done by forwarding most
logging requests to userspace, where a daemon receives and processes the
requests.

The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h. Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' is used as the interface for
communication.

There are currently two userspace log implementations that leverage this
framework - "clustered-disk" and "clustered-core". These implementations
provide a cluster-coherent log for shared-storage. Device-mapper mirroring
can be used in a shared-storage environment when the cluster log
implementations are employed.
84 doc/kernel/persistent-data.txt Normal file
@@ -0,0 +1,84 @@
Introduction
============

The more-sophisticated device-mapper targets require complex metadata
that is managed in kernel. In late 2010 we were seeing that various
different targets were rolling their own data structures, for example:

- Mikulas Patocka's multisnap implementation
- Heinz Mauelshagen's thin provisioning target
- Another btree-based caching target posted to dm-devel
- Another multi-snapshot target based on a design of Daniel Phillips

Maintaining these data structures takes a lot of work, so if possible
we'd like to reduce the number.

The persistent-data library is an attempt to provide a re-usable
framework for people who want to store metadata in device-mapper
targets. It's currently used by the thin-provisioning target and an
upcoming hierarchical storage target.

Overview
========

The main documentation is in the header files which can all be found
under drivers/md/persistent-data.

The block manager
-----------------

dm-block-manager.[hc]

This provides access to the data on disk in fixed-sized blocks. There
is a read/write locking interface to prevent concurrent accesses, and
to keep data that is being used in the cache.

Clients of persistent-data are unlikely to use this directly.

The transaction manager
-----------------------

dm-transaction-manager.[hc]

This restricts access to blocks and enforces copy-on-write semantics.
The only way you can get hold of a writable block through the
transaction manager is by shadowing an existing block (i.e. doing
copy-on-write) or allocating a fresh one. Shadowing is elided within
the same transaction so performance is reasonable. The commit method
ensures that all data is flushed before it writes the superblock.
On power failure your metadata will be as it was when last committed.

The Space Maps
--------------

dm-space-map.h
dm-space-map-metadata.[hc]
dm-space-map-disk.[hc]

On-disk data structures that keep track of reference counts of blocks.
Also acts as the allocator of new blocks. Currently two
implementations: a simpler one for managing blocks on a different
device (e.g. thinly-provisioned data blocks); and one for managing
the metadata space. The latter is complicated by the need to store
its own data within the space it's managing.

The data structures
-------------------

dm-btree.[hc]
dm-btree-remove.c
dm-btree-spine.c
dm-btree-internal.h

Currently there is only one data structure, a hierarchical btree.
There are plans to add more. For example, something with an
array-like interface would see a lot of use.

The btree is 'hierarchical' in that you can define it to be composed
of nested btrees, and take multiple keys. For example, the
thin-provisioning target uses a btree with two levels of nesting.
The first maps a device id to a mapping tree, and that in turn maps a
virtual block to a physical block.

Values stored in the btrees can have arbitrary size. Keys are always
64-bit, although nesting allows you to use multiple keys.
39 doc/kernel/queue-length.txt Normal file
@@ -0,0 +1,39 @@
dm-queue-length
===============

dm-queue-length is a path selector module for device-mapper targets,
which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'.

Table parameters for each path: [<repeat_count>]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, the internal default is used. To check
			the default value, see the activated table.

Status for each path: <status> <fail-count> <in-flight>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight>: The number of in-flight I/Os on the path.


Algorithm
=========

dm-queue-length increments/decrements 'in-flight' when an I/O is
dispatched/completed respectively.
dm-queue-length selects the path with the minimum 'in-flight'.


Examples
========
Suppose 2 paths (sda and sdb) are used with repeat_count == 128.

# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
108 doc/kernel/raid.txt Normal file
@@ -0,0 +1,108 @@
dm-raid
-------

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.

The target is named "raid" and it accepts the following parameters:

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:
  raid1		RAID1 mirroring
  raid4		RAID4 dedicated parity disk
  raid5_la	RAID5 left asymmetric
		- rotating parity 0 with data continuation
  raid5_ra	RAID5 right asymmetric
		- rotating parity N with data continuation
  raid5_ls	RAID5 left symmetric
		- rotating parity 0 with data restart
  raid5_rs	RAID5 right symmetric
		- rotating parity N with data restart
  raid6_zr	RAID6 zero restart
		- rotating parity zero (left-to-right) with data restart
  raid6_nr	RAID6 N restart
		- rotating parity N (right-to-left) with data restart
  raid6_nc	RAID6 N continue
		- rotating parity N (right-to-left) with data continuation

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of
  Mandatory parameters:
    <chunk_size>: Chunk size in sectors. This parameter is often known as
		  "stripe size". It is the only mandatory parameter and
		  is placed first.

  followed by optional parameters (in any order):
    [sync|nosync]	Force or prevent RAID initialization.

    [rebuild <idx>]	Rebuild drive number idx (first drive is 0).

    [daemon_sleep <ms>]
	Interval between runs of the bitmap daemon that
	clears bits. A longer interval means less bitmap I/O but
	resyncing after a failure is likely to take longer.

    [min_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
    [max_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
    [write_mostly <idx>]		Drive index is write-mostly
    [max_write_behind <sectors>]	See '--write-behind=' (man mdadm)
    [stripe_cache <sectors>]		Stripe cache size (higher RAIDs only)
    [region_size <sectors>]
	The region_size multiplied by the number of regions is the
	logical size of the array. The bitmap records the device
	synchronisation state for each region.

<#raid_devs>: The number of devices composing the array.
	Each device consists of two entries. The first is the device
	containing the metadata (if any); the second is the one containing the
	data.

	If a drive has failed or is missing at creation time, a '-' can be
	given for both the metadata and data drives for a given position.


Example tables
--------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)

0 1960893648 raid \
        raid4 1 2048 \
        5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk

0 1960893648 raid \
        raid4 4 2048 sync min_recovery_rate 20 \
        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82

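Either table can be loaded with dmsetup; for instance, a sketch using the
first example table above (the device name "my_raid4" is illustrative):

# Load the metadata-less RAID4 example table into a new DM device.
dmsetup create my_raid4 --table \
        "0 1960893648 raid raid4 1 2048 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"
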
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.

'dmsetup status' yields information on the state and health of the
array.
The output is as follows:
1: <s> <l> raid \
2:      <raid_type> <#devices> <1 health char for each dev> <resync_ratio>

Line 1 is the standard output produced by device-mapper.
Line 2 is produced by the raid target, and best explained by example:
        0 1960893648 raid raid4 5 AAAAA 2/490221568
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array's initial recovery is 2/490221568
complete. Faulty or missing devices are marked 'D'. Devices that are
out-of-sync are marked 'a'.
91 doc/kernel/service-time.txt Normal file
@@ -0,0 +1,91 @@
dm-service-time
===============

dm-service-time is a path selector module for device-mapper targets,
which selects a path with the shortest estimated service time for
the incoming I/O.

The service time for each path is estimated by dividing the total size
of in-flight I/Os on a path by the performance value of the path.
The performance value is a relative throughput value among all paths
in a path-group, and it can be specified as a table argument.

The path selector name is 'service-time'.

Table parameters for each path: [<repeat_count> [<relative_throughput>]]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, the internal default is used. To check
			the default value, see the activated table.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.
			The valid range is 0-100.
			If not given, the minimum value '1' is used.
			If '0' is given, the path isn't selected while
			other paths having a positive value are available.

Status for each path: <status> <fail-count> <in-flight-size> \
		      <relative_throughput>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight-size>: The size of in-flight I/Os on the path.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.


Algorithm
=========

dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and subtracts it when completed.
Basically, dm-service-time selects the path having the minimum service
time, which is calculated by:

	('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'

However, the optimizations below are used to reduce the calculation
as much as possible.

	1. If the paths have the same 'relative_throughput', skip
	   the division and just compare the 'in-flight-size'.

	2. If the paths have the same 'in-flight-size', skip the division
	   and just compare the 'relative_throughput'.

	3. If some paths have non-zero 'relative_throughput' and others
	   have zero 'relative_throughput', ignore those paths with zero
	   'relative_throughput'.

If such optimizations can't be applied, the service times are calculated
and compared.
If the calculated service times are equal, the path having the maximum
'relative_throughput' may be better, so 'relative_throughput' is then
compared as a tie-breaker.


Examples
========
Suppose 2 paths (sda and sdb) are used with repeat_count == 128, where
sda has an average throughput of 1GB/s and sdb of 4GB/s. Then a
'relative_throughput' value of '1' for sda and '4' for sdb is suitable.

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4


Values of '2' for sda and '8' for sdb would work equally well, since only
the ratio between the paths matters.

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
168 doc/kernel/snapshot.txt Normal file
@@ -0,0 +1,168 @@
Device-mapper snapshot support
==============================

Device-mapper allows you, without massive data copying:

*) To create snapshots of any block device, i.e. mountable, saved states of
the block device which are also writable without interfering with the
original content;
*) To create device "forks", i.e. multiple different versions of the
same data stream.
*) To merge a snapshot of a block device back into the snapshot's origin
device.

In the first two cases, dm copies only the chunks of data that get
changed and uses a separate copy-on-write (COW) block device for
storage.

For snapshot merge the contents of the COW storage are merged back into
the origin device.


There are three dm targets available:
snapshot, snapshot-origin, and snapshot-merge.

*) snapshot-origin <origin>

which will normally have one or more snapshots based on it.
Reads will be mapped directly to the backing device. For each write, the
original data will be saved in the <COW device> of each snapshot to keep
its visible content unchanged, at least until the <COW device> fills up.


*) snapshot <origin> <COW device> <persistent?> <chunksize>

A snapshot of the <origin> block device is created. Changed chunks of
<chunksize> sectors will be stored on the <COW device>. Writes will
only go to the <COW device>. Reads will come from the <COW device> or
from <origin> for unchanged data. <COW device> will often be
smaller than the origin and if it fills up the snapshot will become
useless and be disabled, returning errors. So it is important to monitor
the amount of free space and expand the <COW device> before it fills up.

<persistent?> is P (Persistent) or N (Not persistent - will not survive
after reboot).
The difference is that for transient snapshots less metadata must be
saved on disk - they can be kept in memory by the kernel.


*) snapshot-merge <origin> <COW device> <persistent> <chunksize>

takes the same table arguments as the snapshot target except it only
works with persistent snapshots. This target assumes the role of the
"snapshot-origin" target and must not be loaded if the "snapshot-origin"
is still present for <origin>.

Creates a merging snapshot that takes control of the changed chunks
stored in the <COW device> of an existing snapshot, through a handover
procedure, and merges these chunks back into the <origin>. Once merging
has started (in the background) the <origin> may be opened and the merge
will continue while I/O is flowing to it. Changes to the <origin> are
deferred until the merging snapshot's corresponding chunk(s) have been
merged. Once merging has started the snapshot device, associated with
the "snapshot" target, will return -EIO when accessed.

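Before looking at how LVM2 drives these targets, here is a minimal
hand-rolled sketch with dmsetup (all device names are illustrative;
/dev/sdx holds the data and /dev/sdy provides the COW space):

# Sketch: wrap the origin data, create a persistent snapshot with
# 16-sector chunks, then replace the original device with a
# snapshot-origin mapping so writes trigger copy-on-write.
SIZE=`blockdev --getsize /dev/sdx`
dmsetup create base-real --table "0 $SIZE linear /dev/sdx 0"
dmsetup create snap --table "0 $SIZE snapshot /dev/mapper/base-real /dev/sdy P 16"
dmsetup create base --table "0 $SIZE snapshot-origin /dev/mapper/base-real"
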
How snapshot is used by LVM2
============================
When you create the first LVM2 snapshot of a volume, four dm devices are used:

1) a device containing the original mapping table of the source volume;
2) a device used as the <COW device>;
3) a "snapshot" device, combining #1 and #2, which is the visible snapshot
   volume;
4) the "original" volume (which uses the device number used by the original
   source volume), whose table is replaced by a "snapshot-origin" mapping
   from device #1.

A fixed naming scheme is used, so with the following commands:

lvcreate -L 1G -n base volumeGroup
lvcreate -L 100M --snapshot -n snap volumeGroup/base

we'll have this situation (with volumes in above order):

# dmsetup table|grep volumeGroup

volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
volumeGroup-base: 0 2097152 snapshot-origin 254:11

# ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base


How snapshot-merge is used by LVM2
==================================
A merging snapshot assumes the role of the "snapshot-origin" while
merging. As such the "snapshot-origin" is replaced with
"snapshot-merge". The "-real" device is not changed and the "-cow"
device is renamed to <origin name>-cow to aid LVM2's cleanup of the
merging snapshot after it completes. The "snapshot" that hands over its
COW device to the "snapshot-merge" is deactivated (unless using lvchange
--refresh); but if it is left active it will simply return I/O errors.

A snapshot is merged into its origin with the following command:

lvconvert --merge volumeGroup/snap

after which we'll have this situation:

# dmsetup table|grep volumeGroup

volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16

# ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base


How to determine when a merging is complete
===========================================
The snapshot-merge and snapshot status lines end with:
  <sectors_allocated>/<total_sectors> <metadata_sectors>

Both <sectors_allocated> and <total_sectors> include both data and metadata.
During merging, the number of sectors allocated gets smaller and
smaller. Merging has finished when the number of sectors holding data
is zero, in other words <sectors_allocated> == <metadata_sectors>.

Here is a practical example (using a hybrid of lvm and dmsetup commands):

# lvs
  LV    VG          Attr   LSize Origin Snap%  Move Log Copy%  Convert
  base  volumeGroup owi-a- 4.00g
  snap  volumeGroup swi-a- 1.00g base   18.97

# dmsetup status volumeGroup-snap
0 8388608 snapshot 397896/2097152 1560
                                  ^^^^ metadata sectors

# lvconvert --merge -b volumeGroup/snap
  Merging of volume snap started.

# lvs volumeGroup/snap
  LV    VG          Attr   LSize Origin Snap%  Move Log Copy%  Convert
  base  volumeGroup Owi-a- 4.00g        17.23

# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 281688/2097152 1104

# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 180480/2097152 712

# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 16/2097152 16

Merging has finished.

# lvs
  LV    VG          Attr   LSize Origin Snap%  Move Log Copy%  Convert
  base  volumeGroup owi-a- 4.00g
58 doc/kernel/striped.txt Normal file
@@ -0,0 +1,58 @@
dm-stripe
=========

Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
device across one or more underlying devices. Data is written in "chunks",
with consecutive chunks rotating among the underlying devices. This can
potentially provide improved I/O throughput by utilizing several physical
devices in parallel.

Parameters: <num devs> <chunk size> [<dev path> <offset>]+
    <num devs>: Number of underlying devices.
    <chunk size>: Size of each chunk of data. Must be a power-of-2 and at
                  least as large as the system's PAGE_SIZE.
    <dev path>: Full pathname to the underlying block-device, or a
                "major:minor" device-number.
    <offset>: Starting sector within the device.

One or more underlying devices can be specified. The striped device size must
be a multiple of the chunk size and a multiple of the number of underlying
devices.


Example scripts
===============

[[
#!/usr/bin/perl -w
# Create a striped device across any number of underlying devices. The device
# will be called "stripe_dev" and have a chunk-size of 128k.

my $chunk_size = 128 * 2;
my $dev_name = "stripe_dev";
my $num_devs = @ARGV;
my @devs = @ARGV;
my ($min_dev_size, $stripe_dev_size, $i);

if (!$num_devs) {
        die("Specify at least one device\n");
}

$min_dev_size = `blockdev --getsize $devs[0]`;
for ($i = 1; $i < $num_devs; $i++) {
        my $this_size = `blockdev --getsize $devs[$i]`;
        $min_dev_size = ($min_dev_size < $this_size) ?
                        $min_dev_size : $this_size;
}

$stripe_dev_size = $min_dev_size * $num_devs;
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);

my $table = "0 $stripe_dev_size striped $num_devs $chunk_size";
for ($i = 0; $i < $num_devs; $i++) {
        $table .= " $devs[$i] 0";
}

`echo $table | dmsetup create $dev_name`;
]]

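For the common two-device case the script above reduces to a one-liner; a
sketch (assuming two equal-sized devices whose size is a multiple of the
chunk size):

[[
#!/bin/sh
# Stripe two equal-sized devices $1 and $2 with 64KB (128-sector) chunks.
SIZE=`blockdev --getsize $1`
echo "0 `expr $SIZE \* 2` striped 2 128 $1 0 $2 0" | dmsetup create stripe_dev
]]
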
285 doc/kernel/thin-provisioning.txt Normal file
@@ -0,0 +1,285 @@
Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are very much still in the EXPERIMENTAL state. Please
do not yet rely on them in production. But do experiment and offer us
feedback. Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device. If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller. The largest size supported is 16GB.

If you're creating large numbers of snapshots which are recording large
amounts of change, you may find you need to increase this.

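As a worked instance of that rule of thumb (a sketch; the device and block
sizes are made-up example values):

    # Sketch: suggested metadata size for a 20971520-sector (10GB) data
    # device using 128-sector (64KB) data blocks:
    #   48 * 20971520 / 128 = 7864320 bytes (~7.5MB),
    # which is above the 2MB floor, so ~8MB of metadata would suffice.
    data_dev_size=20971520
    data_block_size=128
    expr 48 \* $data_dev_size / $data_block_size
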
Reloading a pool table
----------------------

You may reload a pool's table, indeed this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)

Using an existing pool device
-----------------------------

    dmsetup create pool \
	--table "0 20971520 thin-pool $metadata_dev $data_dev \
		 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors. People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
such as 128 (64KB). If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch, allowing it to
extend the pool device. Only one such event will be sent.
Resuming a device with a new table itself triggers an event, so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to an
  active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number. It's up
  to the caller to allocate and manage these identifiers. If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B. If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number. '0' is the
  identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot. Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method. It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both. (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
	      <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:
    - 'skip_block_zeroing': skips the zeroing of newly-provisioned blocks.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.

ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>

    transaction id:
	A 64-bit number used by userspace to help synchronise with metadata
	from volume managers.

    used data blocks / total data blocks
	If the number of free blocks drops below the pool's low water mark a
	dm event will be sent to userspace. This event is edge-triggered and
	it will occur only once after each resume, so volume manager writers
	should register for the event and then check the target's status.

    held metadata root:
	The location, in sectors, of the metadata root that has been
	'held' for userspace read access. '-' indicates there is no
	held root. This feature is not yet implemented so '-' is
	always returned.

iii) Messages

    create_thin <dev id>
	Create a new thinly-provisioned device.
	<dev id> is an arbitrary unique 24-bit identifier chosen by
	the caller.

    create_snap <dev id> <origin id>
	Create a new snapshot of another thinly-provisioned device.
	<dev id> is an arbitrary unique 24-bit identifier chosen by
	the caller.
	<origin id> is the identifier of the thinly-provisioned device
	of which the new device will be a snapshot.

    delete <dev id>
	Deletes a thin device. Irreversible.

    trim <dev id> <new size in sectors>
	Delete mappings from the end of a thin device. Irreversible.
	You might want to use this if you're reducing the size of
	your thinly-provisioned device. In many cases, due to the
	sharing of blocks between devices, it is not possible to
	determine in advance how much space 'trim' will release. (In
	future a userspace tool might be able to perform this
	calculation.)

    set_transaction_id <current id> <new id>
	Userland volume managers, such as LVM, need a way to
	synchronise their external metadata with the internal metadata of
	the pool target. The thin-pool target offers to store an
	arbitrary 64-bit transaction id and return it on the target's
	status line. To avoid races you must provide what you think
	the current transaction id is when you change it with this
	compare-and-swap message.

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id>

    pool dev:
	the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
	the internal device identifier of the device to be
	activated.

The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.

ii) Status

    <nr mapped sectors> <highest mapped sector>
97 doc/kernel/uevent.txt Normal file
@@ -0,0 +1,97 @@
The device-mapper uevent code adds the capability for device-mapper to create
and send kobject uevents (uevents). Previously device-mapper events were only
available through the ioctl interface. The advantage of the uevent interface
is that the event contains environment attributes, providing increased
context for the event and avoiding the need to query the state of the
device-mapper device after the event is received.

There are currently two functions for device-mapper events: the first creates
the event and the second sends the event(s).

   void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
                       const char *path, unsigned nr_valid_paths)

   void dm_send_uevents(struct list_head *events, struct kobject *kobj)


The variables added to the uevent environment are:

Variable Name: DM_TARGET
Uevent Action(s): KOBJ_CHANGE
Type: string
Description:
Value: Name of device-mapper target that generated the event.

Variable Name: DM_ACTION
Uevent Action(s): KOBJ_CHANGE
Type: string
Description:
Value: Device-mapper specific action that caused the uevent action.
	PATH_FAILED - A path has failed.
	PATH_REINSTATED - A path has been reinstated.

Variable Name: DM_SEQNUM
Uevent Action(s): KOBJ_CHANGE
Type: unsigned integer
Description: A sequence number for this specific device-mapper device.
Value: Valid unsigned integer range.

Variable Name: DM_PATH
Uevent Action(s): KOBJ_CHANGE
Type: string
Description: Major and minor number of the path device pertaining to this
event.
Value: Path name in the form of "Major:Minor"

Variable Name: DM_NR_VALID_PATHS
Uevent Action(s): KOBJ_CHANGE
Type: unsigned integer
Description:
Value: Valid unsigned integer range.

Variable Name: DM_NAME
Uevent Action(s): KOBJ_CHANGE
Type: string
Description: Name of the device-mapper device.
Value: Name

Variable Name: DM_UUID
Uevent Action(s): KOBJ_CHANGE
Type: string
Description: UUID of the device-mapper device.
Value: UUID. (Empty string if there isn't one.)

An example of the uevents generated as captured by udevmonitor is shown
below.

1.) Path failure.
UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_FAILED
DM_SEQNUM=1
DM_PATH=8:32
DM_NR_VALID_PATHS=0
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1130

2.) Path reinstate.
UEVENT[1192521132.989927] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_REINSTATED
DM_SEQNUM=2
DM_PATH=8:32
DM_NR_VALID_PATHS=1
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1131

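On current systems, where udevmonitor has been folded into udevadm, an
equivalent capture can be obtained with (a sketch, not part of the original
text):

# Print each uevent together with its environment (DM_* variables included).
udevadm monitor --kernel --environment
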
37 doc/kernel/zero.txt Normal file
@@ -0,0 +1,37 @@
dm-zero
=======

Device-Mapper's "zero" target provides a block-device that always returns
zero'd data on reads and silently drops writes. Its behavior is similar to
that of /dev/zero, but as a block-device instead of a character-device.

Dm-zero has no target-specific parameters.

One very interesting use of dm-zero is for creating "sparse" devices in
conjunction with dm-snapshot. A sparse device reports a device-size larger
than the amount of actual storage space available for that device. A user can
write data anywhere within the sparse device and read it back like a normal
device. Reads to previously unwritten areas will return a zero'd buffer. When
enough data has been written to fill up the actual storage space, the sparse
device is deactivated. This can be very useful for testing device and
filesystem limitations.

To create a sparse device, start by creating a dm-zero device that's the
desired size of the sparse device. For this example, we'll assume a 10TB
sparse device.

  TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
  echo "0 $TEN_TERABYTES zero" | dmsetup create zero1

Then create a snapshot of the zero device, using any available block-device as
the COW device. The size of the COW device will determine the amount of real
space available to the sparse device. For this example, we'll assume /dev/sdb1
is an available 10GB partition.

  echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
     dmsetup create sparse1

This will create a 10TB sparse device called /dev/mapper/sparse1 that has
10GB of actual storage space available. If more than 10GB of data is written
to this device, it will start returning I/O errors.