mirror of
git://sourceware.org/git/lvm2.git
synced 2025-01-10 05:18:36 +03:00
LVM2 RAID design doc
parent
1c8512eb62
commit
b546c17105
doc/lvm2-raid.txt
=======================
= LVM RAID Design Doc =
=======================

#############################
# Chapter 1: User-Interface #
#############################

***************** CREATING A RAID DEVICE ******************

01: lvcreate --type <RAID type> \
02:          [--regionsize <size>] \
03:          [-i/--stripes <#>] [-I/--stripesize <size>] \
04:          [-m/--mirrors <#>] \
05:          [--[min|max]recoveryrate <kB/sec/disk>] \
06:          [--stripecache <size>] \
07:          [--writemostly <devices>] \
08:          [--maxwritebehind <size>] \
09:          [[no]sync] \
10:          <Other normal args, like: -L 5G -n lv vg> \
11:          [devices]

Line 01:
I don't intend for there to be shorthand options for specifying the
segment type. The available RAID types are:
    "raid0"    - Stripe [NOT IMPLEMENTED]
    "raid1"    - should replace DM Mirroring
    "raid10"   - striped mirrors [NOT IMPLEMENTED]
    "raid4"    - RAID4
    "raid5"    - Same as "raid5_ls" (Same default as MD)
    "raid5_la" - RAID5 Rotating parity 0 with data continuation
    "raid5_ra" - RAID5 Rotating parity N with data continuation
    "raid5_ls" - RAID5 Rotating parity 0 with data restart
    "raid5_rs" - RAID5 Rotating parity N with data restart
    "raid6"    - Same as "raid6_zr"
    "raid6_zr" - RAID6 Rotating parity 0 with data restart
    "raid6_nr" - RAID6 Rotating parity N with data restart
    "raid6_nc" - RAID6 Rotating parity N with data continuation
The exception to 'no shorthand options' will be where the RAID implementations
can displace traditional targets. This is the case with 'mirror' and 'raid1'.
In these cases, a switch will exist in lvm.conf allowing the user to specify
which implementation they want. When this is in place, the segment type is
inferred from the argument, '-m' for example.

Line 02:
Region size is relevant for all RAID types. It defines the granularity at
which the bitmap will track the active areas of disk. The default is currently
4MiB. I see no reason to change this unless it is a problem for MD performance.
MD does impose a restriction of 2^21 regions for a given device, however. This
means two things: 1) we should never need a metadata area larger than
8KiB+sizeof(superblock)+bitmap_offset (IOW, pretty small) and 2) the region
size will have to be upwardly revised if the device is larger than 8TiB
(2^21 regions * 4MiB, assuming defaults).
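
The region-size arithmetic above can be checked with a short sketch. This is
not LVM code: the function name and the choice to grow the region size by
doubling are assumptions made for illustration. Only the 2^21 limit and the
4MiB default come from the text.

```python
# Hypothetical helper (not part of LVM): find a region size that keeps a
# device within the 2^21-region limit quoted above.
MAX_REGIONS = 2 ** 21                 # MD limit stated in this document
DEFAULT_REGION_SIZE = 4 * 2 ** 20     # 4MiB default, in bytes
TIB = 2 ** 40


def min_region_size(device_bytes, region_size=DEFAULT_REGION_SIZE):
    """Double region_size (an assumed policy) until the device fits."""
    # -(-a // b) is ceiling division: a partial last region still counts.
    while -(-device_bytes // region_size) > MAX_REGIONS:
        region_size *= 2
    return region_size


print(min_region_size(8 * TIB) // 2 ** 20)      # 4  (exactly 2^21 regions)
print(min_region_size(8 * TIB + 1) // 2 ** 20)  # 8
print(min_region_size(20 * TIB) // 2 ** 20)     # 16
```

A device of exactly 8TiB still fits the default; anything larger forces the
region size up.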

Line 03/04:
The '-m/--mirrors' option is only relevant to RAID1 and will be used just like
it is today for DM mirroring. For all other RAID types, -i/--stripes and
-I/--stripesize are relevant. The former will specify the number of data
devices that will be used for striping. For example, if the user specifies
'--type raid0 -i 3', then 3 devices are needed. If the user specifies
'--type raid6 -i 3', then 5 devices are needed (3 data plus 2 parity). The
-I/--stripesize option may be confusing to MD users, as they use the term
"chunksize". I think they will adapt without issue and I don't wish to create
a conflict with the term "chunksize" that we use for snapshots.

Line 05/06/07/08:
I'm still not clear on how to specify these options. Some are easier than
others. '--writemostly' is particularly hard because it involves specifying
which devices shall be 'write-mostly' and thus also have 'max-write-behind'
applied to them. It has been suggested that a '--readmostly'/'--readfavored'
or similar option could be introduced as a way to specify a primary disk vs.
specifying all the non-primary disks via '--writemostly'. I like this idea,
but haven't come up with a good name yet. Thus, these will remain
unimplemented until future specification.

Line 09/10/11:
These are familiar.

Further creation-related ideas:
Today, you can specify '--type mirror' without needing an '-m/--mirrors'
argument. The number of devices defaults to two (and the log defaults to
'disk'). A similar thing should happen with the RAID types. All of them
should default to having two data devices unless otherwise specified. This
would mean a total of 2 devices for RAID 0/1, 3 devices for RAID 4/5,
and 4 devices for RAID 6/10.
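
The device counts quoted above follow from a small rule, sketched here. The
function and table names are hypothetical, and treating RAID10 as two-way
mirrors of each stripe is an assumption chosen to match the "4 devices"
figure; it is not a statement about the eventual implementation.

```python
# Parity devices per level, as implied by the examples in this chapter.
PARITY_DEVS = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def total_devices(raid_type, data_devs=2, mirrors=2):
    """Total devices needed; data_devs defaults to the proposed two."""
    level = raid_type.split("_")[0]      # "raid5_ls" -> "raid5"
    if level == "raid1":
        return data_devs                 # every image is a full copy
    if level == "raid10":
        return data_devs * mirrors       # assumption: two-way mirrors
    return data_devs + PARITY_DEVS[level]


print(total_devices("raid0", 3))   # 3, as in '--type raid0 -i 3'
print(total_devices("raid6", 3))   # 5, as in '--type raid6 -i 3'
print([total_devices(t) for t in ("raid0", "raid1", "raid4",
                                  "raid5", "raid6", "raid10")])
# [2, 2, 3, 3, 4, 4]
```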


***************** CONVERTING A RAID DEVICE ******************

01: lvconvert [--type <RAID type>] \
02:           [-R/--regionsize <size>] \
03:           [-i/--stripes <#>] [-I/--stripesize <size>] \
04:           [-m/--mirrors <#>] \
05:           [--splitmirrors <#>] \
06:           [--replace <sub_lv|device>] \
07:           [--[min|max]recoveryrate <kB/sec/disk>] \
08:           [--stripecache <size>] \
09:           [--writemostly <devices>] \
10:           [--maxwritebehind <size>] \
11:           vg/lv \
12:           [devices]

lvconvert should work exactly as it does now when dealing with mirrors -
even if (when) we switch to MD RAID1. Of course, there are no plans to
allow the presence of the metadata area to be configurable (e.g. --corelog).
It will be simple enough to detect whether the LV being up/down-converted
uses new or old-style mirroring.

If we choose to use MD RAID0 as well, it will be possible to change the
number of stripes and the stripesize. It is therefore conceivable to see
something like 'lvconvert -i +1 vg/lv'.

Line 01:
It is possible to change the RAID type of an LV - even if that LV is already
a RAID device of a different type. For example, you could change from
RAID4 to RAID5 or RAID5 to RAID6.

Line 02/03/04/05:
These are familiar options - all of which would now be available as options
for change. (However, it'd be nice if we didn't have regionsize in there.
It's simple on the kernel side, but is just an extra - often unnecessary -
parameter to many functions in the LVM codebase.)

Line 06:
This option allows the user to specify a sub_lv (e.g. a mirror image) or
a particular device for replacement. The device (or all the devices in
the sub_lv) will be removed and replaced with different devices from the
VG.

Line 07/08/09/10:
It should be possible to alter these parameters of a RAID device. As with
lvcreate, however, I'm not entirely certain how best to define some of these.
We don't need all the capabilities at once though, so it isn't a pressing
issue.

Line 11:
The LV to operate on.

Line 12:
Devices that are to be used to satisfy the conversion request. If the
operation removes devices or splits a mirror, then the devices specified
form the list of candidates for removal. If the operation adds or replaces
devices, then the devices specified form the list of candidates for allocation.



###############################################
# Chapter 2: LVM RAID internal representation #
###############################################

The internal representation is somewhat like mirroring, but with alterations
for the different metadata components. LVM mirroring has a single log LV,
but RAID will have one for each data device. Because of this, I've added a
new 'areas' list to the 'struct lv_segment' - 'meta_areas'. There is an exact
one-to-one relationship between 'areas' and 'meta_areas'. The 'areas' array
still holds the data sub-LVs (similar to mirroring), while the 'meta_areas'
array holds the metadata sub-LVs (akin to the mirroring log device).

The sub-LVs will be named '%s_rimage_%d' instead of '%s_mimage_%d' as they are
for mirroring, and '%s_rmeta_%d' instead of '%s_mlog'. Thus, you can imagine
an LV named 'foo' with the following layout:
foo
[foo's lv_segment]
    |
    |-> foo_rimage_0 (areas[0])
    |    [foo_rimage_0's lv_segment]
    |-> foo_rimage_1 (areas[1])
    |    [foo_rimage_1's lv_segment]
    |
    |-> foo_rmeta_0 (meta_areas[0])
    |    [foo_rmeta_0's lv_segment]
    |-> foo_rmeta_1 (meta_areas[1])
    |    [foo_rmeta_1's lv_segment]
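
The layout above can be modelled roughly as follows. This is a Python
stand-in for illustration only: 'LvSegment' and 'build_raid_segment' are
hypothetical names, and the real 'struct lv_segment' is a C structure with
many more fields. Only the 'areas'/'meta_areas' pairing and the naming
patterns are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class LvSegment:
    """Simplified stand-in for the top-level LV's segment."""
    areas: list = field(default_factory=list)       # data sub-LV names
    meta_areas: list = field(default_factory=list)  # metadata sub-LV names


def build_raid_segment(lv_name, device_count):
    """Name one data and one metadata sub-LV per device."""
    seg = LvSegment()
    for i in range(device_count):
        seg.areas.append("%s_rimage_%d" % (lv_name, i))
        seg.meta_areas.append("%s_rmeta_%d" % (lv_name, i))
    # The one-to-one relationship described above.
    assert len(seg.areas) == len(seg.meta_areas)
    return seg


foo = build_raid_segment("foo", 2)
print(foo.areas)       # ['foo_rimage_0', 'foo_rimage_1']
print(foo.meta_areas)  # ['foo_rmeta_0', 'foo_rmeta_1']
```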

LVM Metadata format
-------------------
The RAID format will need to be able to store parameters that are unique to
RAID and unique to specific RAID sub-devices. It will be modeled after that
of mirroring.

Here is an example of the mirroring layout:
lv {
        id = "agL1vP-1B8Z-5vnB-41cS-lhBJ-Gcvz-dh3L3H"
        status = ["READ", "WRITE", "VISIBLE"]
        flags = []
        segment_count = 1

        segment1 {
                start_extent = 0
                extent_count = 125      # 500 Megabytes

                type = "mirror"
                mirror_count = 2
                mirror_log = "lv_mlog"
                region_size = 1024

                mirrors = [
                        "lv_mimage_0", 0,
                        "lv_mimage_1", 0
                ]
        }
}

The real trick is dealing with the metadata devices. Mirroring has an entry,
'mirror_log', in the top-level segment. This won't work for RAID because there
is a one-to-one mapping between the data devices and the metadata devices. The
mirror devices are laid out in sub-device/le pairs. The 'le' parameter is
redundant since it will always be zero. So for RAID, I have simply put the
metadata and data devices in pairs without the 'le' parameter.

RAID metadata:
lv {
        id = "EnpqAM-5PEg-i9wB-5amn-P116-1T8k-nS3GfD"
        status = ["READ", "WRITE", "VISIBLE"]
        flags = []
        segment_count = 1

        segment1 {
                start_extent = 0
                extent_count = 125      # 500 Megabytes

                type = "raid1"
                device_count = 2
                region_size = 1024

                raids = [
                        "lv_rmeta_0", "lv_rimage_0",
                        "lv_rmeta_1", "lv_rimage_1",
                ]
        }
}
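
To make the difference between the two array layouts concrete, here is a
small sketch that reads both flat lists as pairs. 'split_pairs' is a
hypothetical helper, not the LVM metadata parser; the lists are copied from
the two examples above.

```python
def split_pairs(flat):
    """Group a flat metadata array into consecutive 2-tuples."""
    if len(flat) % 2:
        raise ValueError("expected an even number of entries")
    return list(zip(flat[0::2], flat[1::2]))


# Mirroring: each pair is (sub-LV, le), and le is always 0.
mirrors = ["lv_mimage_0", 0, "lv_mimage_1", 0]
# RAID: each pair is (metadata sub-LV, data sub-LV), with no le.
raids = ["lv_rmeta_0", "lv_rimage_0", "lv_rmeta_1", "lv_rimage_1"]

print(split_pairs(mirrors))  # [('lv_mimage_0', 0), ('lv_mimage_1', 0)]
print(split_pairs(raids))
# [('lv_rmeta_0', 'lv_rimage_0'), ('lv_rmeta_1', 'lv_rimage_1')]
```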

The metadata also must be capable of representing the various tunables. We
already have a good example for one from mirroring, region_size.
'max_write_behind', 'stripe_cache', and '[min|max]_recovery_rate' could also
be handled in this way. However, 'write_mostly' cannot be handled in this
way, because it is a characteristic associated with the sub-LVs, not the
array as a whole. In these cases, the status field of the sub-LVs themselves
will hold these flags - the meaning being only useful in the larger context.

New Segment Type(s)
-------------------
I've created a new file 'lib/raid/raid.c' that will handle the various
RAID types. While there will be a unique segment type for each RAID variant,
they will all share a common backend - segtype_handler functions and
segtype->flags = SEG_RAID.

I'm also adding a new field to 'struct segment_type', parity_devs. For every
segment_type except RAID4/5/6, this will be 0. This field facilitates
allocation and size calculations. For example, the lvcreate for RAID5 would
look something like:
~> lvcreate --type raid5 -L 30G -i 3 -n my_raid5 my_vg
or
~> lvcreate --type raid5 -n my_raid5 my_vg /dev/sd[bcdef]1

In the former case, the stripe count (3) and device size are computed, and
then 'segtype->parity_devs' extra devices are allocated of the same size. In
the latter case, the number of PVs is determined and 'segtype->parity_devs' is
subtracted off to determine the number of stripes.
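
The two lvcreate cases can be worked through numerically. The sketch below
is illustrative only: the function names are made up, and the 4MiB extent
size is an assumption (it matches the metadata examples, where 125 extents
are 500 Megabytes).

```python
EXTENT_MIB = 4   # assumed extent size, consistent with the examples above


def plan_from_stripes(lv_size_mib, stripes, parity_devs):
    """Case 1: size and stripe count given; derive per-device size."""
    lv_extents = -(-lv_size_mib // EXTENT_MIB)      # round up to extents
    per_device = -(-lv_extents // stripes)          # extents per device
    return per_device, stripes + parity_devs        # parity devs same size


def stripes_from_pvs(pv_count, parity_devs):
    """Case 2: PVs given; subtract parity devices to get the stripes."""
    stripes = pv_count - parity_devs
    if stripes < 1:
        raise ValueError("not enough PVs for this RAID type")
    return stripes


# lvcreate --type raid5 -L 30G -i 3 ...: 4 devices of 2560 extents (10G)
print(plan_from_stripes(30 * 1024, 3, parity_devs=1))   # (2560, 4)
# lvcreate --type raid5 ... /dev/sd[bcdef]1: 5 PVs, so 4 stripes
print(stripes_from_pvs(5, parity_devs=1))               # 4
```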

This should also work in the case of RAID10, and doing things in this manner
should not affect the way size is calculated via the area_multiple.

Allocation
----------
When a RAID device is created, metadata LVs must be created along with the
data LVs that will ultimately compose the top-level RAID array. For the
foreseeable future, the metadata LVs must reside on the same device as (or
at least one of the devices that compose) the data LV. We use this property
to simplify the allocation process. Rather than allocating for the data LVs
and then asking for a small chunk of space on the same device (or the other
way around), we simply ask for the aggregate size of the data LV plus the
metadata LV. Once we have the space allocated, we divide it between the
metadata and data LVs. This also greatly simplifies the process of finding
parallel space for all the data LVs that will compose the RAID array. When
a RAID device is resized, we will not need to take the metadata LV into
account, because it will already be present.
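
The "allocate the aggregate, then split" idea might look like the sketch
below. It is a toy model, not the LVM allocator: the function name, the
dictionary layout, and the choice to carve the metadata LV from the front of
each area are all assumptions. The text only says the allocated space is
divided between the two.

```python
def allocate_and_split(data_extents, meta_extents, device_count):
    """Request one area per device sized data+meta, then divide it."""
    aggregate = data_extents + meta_extents
    layout = []
    for dev in range(device_count):
        layout.append({
            "device": dev,
            "allocated": aggregate,
            # [start, end) extent ranges inside the allocated area;
            # putting metadata first is an assumption.
            "meta": (0, meta_extents),
            "data": (meta_extents, aggregate),
        })
    return layout


for area in allocate_and_split(data_extents=125, meta_extents=1,
                               device_count=2):
    print(area)
# {'device': 0, 'allocated': 126, 'meta': (0, 1), 'data': (1, 126)}
# {'device': 1, 'allocated': 126, 'meta': (0, 1), 'data': (1, 126)}
```

Because each request already covers both LVs, the allocator only has to find
one set of parallel areas rather than two co-located sets.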

Apart from the metadata areas, the other unique characteristic of RAID
devices is the parity device count. The number of parity devices does nothing
to the calculation of size-per-device. The 'area_multiple' means nothing
here. The parity devices will simply be the same size as all the other devices
and will also require a metadata LV (i.e. they are treated no differently from
the other devices).

Therefore, to allocate space for RAID devices, we need to know two things:
1) how many parity devices are required and 2) whether an allocated area needs
to be split out for the metadata LVs after finding the space to fill the
request. We add these two fields to the 'alloc_handle' data structure as
'parity_count' and 'alloc_and_split_meta'. Both are set in '_alloc_init'.
'segtype->parity_devs' holds the number of parity drives and can be copied
directly to 'ah->parity_count', and 'alloc_and_split_meta' is set when a RAID
segtype is detected and 'metadata_area_count' has been specified. With these
two variables set, we can calculate how many allocated areas we need. Also,
the routines that find the actual space stop not when they have found
ah->area_count areas, but when they have found
(ah->area_count + ah->parity_count).
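
As a rough model of the two new fields and the stopping condition, consider
the sketch below. It is Python rather than the C in the LVM tree, and the
signature of 'alloc_init' here is invented for illustration; only the field
names and the (area_count + parity_count) rule come from the text.

```python
from dataclasses import dataclass


@dataclass
class AllocHandle:
    """Toy stand-in for 'alloc_handle' with the two new fields."""
    area_count: int
    parity_count: int = 0
    alloc_and_split_meta: bool = False


def alloc_init(area_count, is_raid, parity_devs, metadata_area_count):
    """Hypothetical analogue of '_alloc_init' for the new fields only."""
    ah = AllocHandle(area_count)
    ah.parity_count = parity_devs                # from segtype->parity_devs
    ah.alloc_and_split_meta = is_raid and metadata_area_count > 0
    return ah


def areas_to_find(ah):
    """The space-finding routines stop once this many areas are found."""
    return ah.area_count + ah.parity_count


ah = alloc_init(area_count=3, is_raid=True, parity_devs=1,
                metadata_area_count=4)
print(areas_to_find(ah), ah.alloc_and_split_meta)   # 4 True
```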