Mirror of git://sourceware.org/git/lvm2.git (synced 2025-01-04 09:18:36 +03:00)
Update the RAID design doc to reflect some of the new options introduced (e.g.
--merge and --trackchanges) and document the coding steps of up/down-conversion, splitting RAID1 images, and merging RAID1 images.
parent efa3621a59
commit 75a59aabb3
@@ -38,9 +38,10 @@ segment type. The available RAID types are:
 "raid6_nc" - RAID6 Rotating parity N with data continuation
 The exception to 'no shorthand options' will be where the RAID implementations
 can displace traditional targets. This is the case with 'mirror' and 'raid1'.
-In these cases, a switch will exist in lvm.conf allowing the user to specify
-which implementation they want. When this is in place, the segment type is
-inferred from the argument, '-m' for example.
+In this case, "mirror_segtype_default" - found under the "global" section in
+lvm.conf - can be set to "mirror" or "raid1". The segment type inferred when
+the '-m' option is used will be taken from this setting. The default segment
+types can be overridden on the command line by using the '--type' argument.

 Line 02:
 Region size is relevant for all RAID types. It defines the granularity for
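The selection rule in the new text of this hunk can be sketched in C. This is a minimal, hypothetical sketch: 'select_segtype' and its parameters are invented for illustration and are not LVM's option-handling code; the lvm.conf value of global/mirror_segtype_default is simply passed in as a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not LVM code): choose the segment type for a
 * mirrored request.
 *   type_arg               - value given to --type, or NULL if absent
 *   mirrors_given          - non-zero when -m/--mirrors was supplied
 *   mirror_segtype_default - stand-in for lvm.conf global/mirror_segtype_default
 * Returns NULL when neither option is present; that case is decided by
 * rules outside this sketch.
 */
static const char *select_segtype(const char *type_arg, int mirrors_given,
                                  const char *mirror_segtype_default)
{
    /* An explicit --type always overrides the configured default. */
    if (type_arg)
        return type_arg;

    /* '-m' on its own takes the segment type from lvm.conf. */
    if (mirrors_given) {
        /* Only "mirror" and "raid1" are meaningful values here. */
        assert(mirror_segtype_default &&
               (strcmp(mirror_segtype_default, "mirror") == 0 ||
                strcmp(mirror_segtype_default, "raid1") == 0));
        return mirror_segtype_default;
    }

    return NULL;
}
```

With the setting at "raid1", '-m 1' alone yields "raid1", while '-m 1 --type raid5' yields "raid5" because the command line wins.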
@@ -91,14 +92,15 @@ and 4 devices for RAID 6/10.
 02: [-R/--regionsize <size>] \
 03: [-i/--stripes <#>] [-I/--stripesize <size>] \
 04: [-m/--mirrors <#>] \
-05: [--splitmirrors <#>] \
-06: [--replace <sub_lv|device>] \
-07: [--[min|max]recoveryrate <kB/sec/disk>] \
-08: [--stripecache <size>] \
-09: [--writemostly <devices>] \
-10: [--maxwritebehind <size>] \
-11: vg/lv
-12: [devices]
+05: [--merge] \
+06: [--splitmirrors <#> [--trackchanges]] \
+07: [--replace <sub_lv|device>] \
+08: [--[min|max]recoveryrate <kB/sec/disk>] \
+09: [--stripecache <size>] \
+10: [--writemostly <devices>] \
+11: [--maxwritebehind <size>] \
+12: vg/lv
+13: [devices]

 lvconvert should work exactly as it does now when dealing with mirrors -
 even if (when) we switch to MD RAID1. Of course, there are no plans to
@@ -115,28 +117,46 @@ It is possible to change the RAID type of an LV - even if that LV is already
 a RAID device of a different type. For example, you could change from
 RAID4 to RAID5 or RAID5 to RAID6.

-Line 02/03/04/05:
+Line 02/03/04:
 These are familiar options - all of which would now be available as options
 for change. (However, it'd be nice if we didn't have regionsize in there.
 It's simple on the kernel side, but is just an extra - often unnecessary -
 parameter to many functions in the LVM codebase.)

+Line 05:
+This option is used to merge an LV back into a RAID1 array - provided it was
+split for temporary read-only use by '--splitmirrors 1 --trackchanges'.
+
 Line 06:
+The '--splitmirrors <#>' argument should be familiar from the "mirror" segment
+type. It allows RAID1 images to be split from the array to form a new LV.
+Either the original LV or the split LV - or both - could become a linear LV as
+a result. If the '--trackchanges' argument is specified in addition to
+'--splitmirrors', an LV will be split from the array. It will be read-only.
+This operation does not change the original array - except that it uses an empty
+slot to hold the position of the split LV which it expects to return in the
+future (see the '--merge' argument). It tracks any changes that occur to the
+array while the slot is kept in reserve. If the LV is merged back into the
+array, only the changes are resync'ed to the returning image. Repeating the
+'lvconvert' operation without the '--trackchanges' option will complete the
+split of the LV permanently.
+
+Line 07:
 This option allows the user to specify a sub_lv (e.g. a mirror image) or
 a particular device for replacement. The device (or all the devices in
 the sub_lv) will be removed and replaced with different devices from the
 VG.

-Line 07/08/09/10:
+Line 08/09/10/11:
 It should be possible to alter these parameters of a RAID device. As with
 lvcreate, however, I'm not entirely certain how to best define some of these.
 We don't need all the capabilities at once though, so it isn't a pressing
 issue.

-Line 11:
+Line 12:
 The LV to operate on.

-Line 12:
+Line 13:
 Devices that are to be used to satisfy the conversion request. If the
 operation removes devices or splits a mirror, then the devices specified
 form the list of candidates for removal. If the operation adds or replaces
@@ -173,7 +193,7 @@ foo
 | [foo_rmeta_1's lv_segment]

 LVM Meta-data format
---------------------
+====================
 The RAID format will need to be able to store parameters that are unique to
 RAID and unique to specific RAID sub-devices. It will be modeled after that
 of mirroring.
@@ -238,8 +258,13 @@ way, because it is a characteristic associated with the sub_lvs, not the
 array as a whole. In these cases, the status field of the sub-LVs themselves
 will hold these flags - the meaning being only useful in the larger context.

+
+##############################################
+# Chapter 3: LVM RAID implementation details #
+##############################################
+
 New Segment Type(s)
--------------------
+===================
 I've created a new file 'lib/raid/raid.c' that will handle the various
 RAID types. While there will be a unique segment type for each RAID variant,
 they will all share a common backend - segtype_handler functions and
@@ -262,7 +287,7 @@ This should also work in the case of RAID10 and doing things in this manner
 should not affect the way size is calculated via the area_multiple.

 Allocation
------------
+==========
 When a RAID device is created, metadata LVs must be created along with the
 data LVs that will ultimately compose the top-level RAID array. For the
 foreseeable future, the metadata LVs must reside on the same device as (or
@@ -287,8 +312,8 @@ Therefore, to allocate space for RAID devices, we need to know two things:
 1) how many parity devices are required and 2) whether an allocated area needs to
 be split out for the metadata LVs after finding the space to fill the request.
 We simply add these two fields to the 'alloc_handle' data structure as
-'parity_count' and 'alloc_and_split_meta'. These two fields get set simply
-in '_alloc_init'. The 'segtype->parity_devs' holds the number of parity
+'parity_count' and 'alloc_and_split_meta'. These two fields get set in
+'_alloc_init'. The 'segtype->parity_devs' holds the number of parity
 drives and can be directly copied to 'ah->parity_count', while
 'alloc_and_split_meta' is set when a RAID segtype is detected and
 'metadata_area_count' has been specified. With these two variables set, we
@@ -296,3 +321,86 @@ can calculate how many allocated areas we need. Also, in the routines that
 find the actual space, they stop not when they have found ah->area_count but
 when they have found (ah->area_count + ah->parity_count).
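The two allocation fields described above can be illustrated with a simplified C model. The structs are hypothetical stand-ins that keep only the fields named in the text ('parity_count', 'alloc_and_split_meta', 'metadata_area_count', 'parity_devs'). The 'is_raid' field and the functions 'alloc_init_sketch' and 'areas_to_find' are invented for illustration and do not match the real '_alloc_init' signature.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced versions of LVM's structures. */
struct segment_type {
    const char *name;
    int is_raid;             /* stand-in for "a RAID segtype is detected" */
    uint32_t parity_devs;    /* number of parity drives for this RAID level */
};

struct alloc_handle {
    uint32_t area_count;           /* data areas requested */
    uint32_t parity_count;         /* copied from segtype->parity_devs */
    uint32_t metadata_area_count;  /* metadata areas requested, 0 if none */
    int alloc_and_split_meta;      /* split metadata LVs out of found areas */
};

/* Sketch of the initialisation described for '_alloc_init'. */
static void alloc_init_sketch(struct alloc_handle *ah,
                              const struct segment_type *segtype,
                              uint32_t area_count,
                              uint32_t metadata_area_count)
{
    ah->area_count = area_count;
    ah->metadata_area_count = metadata_area_count;
    ah->parity_count = segtype->parity_devs;
    ah->alloc_and_split_meta = segtype->is_raid && metadata_area_count > 0;
}

/* The space-finding loops stop only once data and parity areas are found. */
static uint32_t areas_to_find(const struct alloc_handle *ah)
{
    return ah->area_count + ah->parity_count;
}
```

For a "raid5" segtype with one parity device and two data areas, the search stops after three areas, i.e. (ah->area_count + ah->parity_count).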
|
|
+Conversion
+==========
+RAID -> RAID, adding images
+---------------------------
+When adding images to a RAID array, metadata and data components must be added
+as a pair. It is best to perform as many operations as possible before writing
+new LVM metadata. This allows us to error-out without having to unwind any
+changes. It also makes things easier if the machine should crash during a
+conversion operation. Thus, the actions performed when adding a new image are:
+1) Allocate the required number of metadata/data pairs using the method
+   described above in 'Allocation' (i.e. find the metadata/data space
+   as one unit and split the space between them once it is found; this
+   keeps them together on the same device).
+2) Form the metadata/data LVs from the allocated space (leave them
+   visible), setting the required RAID_[IMAGE | META] flags as appropriate.
+3) Write the LVM metadata.
+4) Activate and clear the metadata LVs. Clearing the metadata requires
+   that the LVM metadata be written (step 3) and must happen before the
+   new metadata LVs are added to the array. If the metadata is not
+   cleared, it could carry residual superblock state from a previous
+   array the device may have been part of.
+5) Deactivate the new sub-LVs and set them "hidden".
+6) Expand the 'first_seg(raid_lv)->areas' and '->meta_areas' arrays
+   for inclusion of the new sub-LVs.
+7) Add the new sub-LVs and update 'first_seg(raid_lv)->area_count'.
+8) Commit the new LVM metadata.
+Failure during any of these steps will not affect the original RAID array. In
+the worst scenario, the user may have to remove the new sub-LVs that did not
+yet make it into the array.
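The point of the ordering above is that the array itself only changes at the final commit. The C model below is a hypothetical sketch of that property, not LVM code: each numbered step is reduced to a comment, and the invented 'fail_at' parameter simulates a failure at a given step so the resulting state can be checked.

```c
#include <assert.h>

/*
 * Hypothetical model of the eight "add images" steps (not LVM code).
 * It tracks only the image count in committed metadata and how many new
 * sub-LVs were written to metadata but never attached to the array.
 */
struct raid_model {
    unsigned committed_images;   /* area_count in committed LVM metadata */
    unsigned leftover_sub_lvs;   /* written sub-LVs that are not in the array */
};

/* Walk steps 1-8 in order. 'fail_at' is the failing step (1-8), or 0 for
 * none. Returns 1 on success and 0 on failure. */
static int add_images(struct raid_model *m, unsigned count, int fail_at)
{
    unsigned staged = m->committed_images;  /* in-memory area_count */
    int sub_lvs_written = 0;

    assert(fail_at >= 0 && fail_at <= 8);
    for (int step = 1; step <= 8; step++) {
        if (step == fail_at) {
            /* The array is never modified before step 8; at worst the
             * sub-LVs written in step 3 are left for the user to remove. */
            if (sub_lvs_written)
                m->leftover_sub_lvs += count;
            return 0;
        }
        switch (step) {
        case 1:  /* allocate metadata/data pairs as one unit, then split */
        case 2:  /* form visible sub-LVs, set RAID_IMAGE / RAID_META */
            break;
        case 3:  /* write LVM metadata: the new sub-LVs now exist on disk */
            sub_lvs_written = 1;
            break;
        case 4:  /* activate and clear the new metadata LVs */
        case 5:  /* deactivate the new sub-LVs and hide them */
        case 6:  /* expand the areas and meta_areas arrays */
            break;
        case 7:  /* attach the sub-LVs and bump area_count in memory */
            staged += count;
            break;
        case 8:  /* commit: only now does the array include the new images */
            m->committed_images = staged;
            break;
        }
    }
    return 1;
}
```

A failure at step 2 leaves nothing behind, while a failure at any step from 4 to 8 leaves the committed array untouched but the newly written sub-LVs in place.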
|
|
+
+RAID -> RAID, removing images
+-----------------------------
+To remove images from a RAID array, the metadata/data LV pairs must be removed
+together. This is pretty straightforward, but one place where RAID really
+differs from the "mirror" segment type is how the resulting "holes" are filled.
+When a device is removed from a "mirror" segment type, it is identified, moved
+to the end of the 'mirrored_seg->areas' array, and then removed. This action
+causes the other images to shift down and fill the position of the device which
+was removed. While "raid1" could be handled in this way, the other RAID types
+could not be - it would corrupt the ordering of the data on the array. Thus,
+when a device is removed from a RAID array, the corresponding metadata/data
+sub-LVs are removed from the 'raid_seg->meta_areas' and 'raid_seg->areas' arrays.
+The slots in these 'lv_segment_area' arrays are set to 'AREA_UNASSIGNED'. RAID
+is perfectly happy to construct a DM table mapping with '- -' if it comes across
+an area assigned in such a way. The pair of dashes is a valid way to tell the
+RAID kernel target that the slot should be considered empty. So, we can remove
+devices from a RAID array without affecting the correct operation of the RAID.
+(It also becomes easy to replace the empty slots properly if a spare device is
+available.) In the case of RAID1 device removal, the empty slot can be safely
+eliminated. This is done by shifting the higher indexed devices down to fill
+the slot. The images are even renamed to properly reflect their index in
+the array. Unlike the "mirror" segment type, you will never have an image
+named "*_rimage_1" occupying the index position 0.
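The slot handling described above can be modelled in a few lines of C. Most RAID levels keep an unassigned slot, RAID1 closes the gap and renames the images, and empty slots are emitted as '- -'. This is a hypothetical sketch: 'struct slot', 'remove_image' and 'build_dev_list' are invented names, the 'assigned' flag stands in for 'AREA_UNASSIGNED', and names like "m0"/"d0" are placeholders for real metadata/data devices. The output follows the device-list part of a dm-raid table line: '<#raid_devs>' followed by one metadata/data pair per slot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_IMAGES 8

/* 'assigned == 0' stands in for AREA_UNASSIGNED in this sketch. */
struct slot {
    int assigned;
    char meta[16];   /* metadata device as it would appear in the table */
    char data[16];   /* data device */
    char name[48];   /* "<lv>_rimage_<N>" */
};

struct raid_model {
    const char *lv_name;
    unsigned count;
    struct slot slots[MAX_IMAGES];
};

static void init_model(struct raid_model *r, const char *lv, unsigned n)
{
    assert(n <= MAX_IMAGES);
    r->lv_name = lv;
    r->count = n;
    for (unsigned i = 0; i < n; i++) {
        r->slots[i].assigned = 1;
        snprintf(r->slots[i].meta, sizeof(r->slots[i].meta), "m%u", i);
        snprintf(r->slots[i].data, sizeof(r->slots[i].data), "d%u", i);
        snprintf(r->slots[i].name, sizeof(r->slots[i].name),
                 "%s_rimage_%u", lv, i);
    }
}

/* Remove the pair at 'idx'. Non-RAID1 levels keep the empty slot so the
 * data ordering is preserved; RAID1 shifts higher slots down instead. */
static void remove_image(struct raid_model *r, unsigned idx, int is_raid1)
{
    assert(idx < r->count);
    r->slots[idx].assigned = 0;
    if (!is_raid1)
        return;

    for (unsigned i = idx; i + 1 < r->count; i++)
        r->slots[i] = r->slots[i + 1];        /* fill the hole */
    r->count--;
    for (unsigned i = 0; i < r->count; i++)   /* names follow the index */
        snprintf(r->slots[i].name, sizeof(r->slots[i].name),
                 "%s_rimage_%u", r->lv_name, i);
}

/* "<#devs> <meta> <data> ...", with "- -" for an empty slot. */
static void build_dev_list(const struct raid_model *r, char *out, size_t len)
{
    size_t used = (size_t)snprintf(out, len, "%u", r->count);
    for (unsigned i = 0; i < r->count && used < len; i++) {
        const struct slot *s = &r->slots[i];
        used += (size_t)snprintf(out + used, len - used, " %s %s",
                                 s->assigned ? s->meta : "-",
                                 s->assigned ? s->data : "-");
    }
}
```

Removing slot 1 from a three-image RAID5 model produces "3 m0 d0 - - m2 d2". Removing slot 0 from a three-image RAID1 model produces "2 m1 d1 m2 d2", and the old second image is renamed "lv_rimage_0".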
|
|
+
+As with adding images, removing images holds off on committing LVM metadata
+until all possible changes have been made. This reduces the likelihood of bad
+intermediate stages being left behind by a failed operation or a machine crash.
+
+RAID1 '--splitmirrors', '--trackchanges', and '--merge' operations
+-----------------------------------------------------------------
+This suite of operations is only available to the "raid1" segment type.
+
+Splitting an image from a RAID1 array is almost identical to the removal of
+an image described above. However, the metadata LV associated with the split
+image is removed and the data LV is kept and promoted to a top-level device
+(i.e. it is made visible and stripped of its RAID_IMAGE status flags).
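Promoting the split image amounts to flag manipulation on the two sub-LVs. The sketch below is hypothetical. The flag names echo those used in this document (VISIBLE, LVM_WRITE, RAID_IMAGE, RAID_META), but the bit values and 'struct sub_lv' are made up, so it shows the intent rather than LVM's real data structures. This sketch leaves write access on the promoted LV unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up bit values; LVM defines its own status flags. */
#define LV_VISIBLE  UINT64_C(0x1)
#define LVM_WRITE   UINT64_C(0x2)
#define RAID_IMAGE  UINT64_C(0x4)
#define RAID_META   UINT64_C(0x8)

struct sub_lv {
    uint64_t status;
    int present;     /* cleared when the LV is removed */
};

/* Plain '--splitmirrors': remove the metadata LV and promote the data LV
 * to a top-level device. Write access is not touched here. */
static void split_image(struct sub_lv *meta, struct sub_lv *data)
{
    assert(meta->present && (meta->status & RAID_META));
    assert(data->present && (data->status & RAID_IMAGE));

    meta->present = 0;              /* the metadata LV is removed */
    data->status &= ~RAID_IMAGE;    /* no longer a RAID component */
    data->status |= LV_VISIBLE;     /* made visible as its own LV */
}
```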
|
|
+
+When the '--trackchanges' option is given along with the '--splitmirrors'
+argument, the metadata LV is left as part of the original array. The data LV
+is set as 'VISIBLE' and read-only (~LVM_WRITE). When the array's DM table is
+being created, the read-only, VISIBLE nature of the sub-LV is noticed and the
+'- -' sentinel is put in its place. Only a single image can be split from the
+mirror and the name of the sub-LV cannot be changed. Unlike '--splitmirrors' on
+its own, the '--name' argument must not be specified. Therefore, the name of the
+newly split LV will remain the same '<lv>_rimage_<N>', where 'N' is the index of
+the slot in the array with which it is associated.
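The '--trackchanges' rules above can be expressed as small predicates: exactly one image, no '--name', the data LV made visible and read-only, and '- -' emitted for such an image. Again this is a hypothetical C sketch with made-up flag values and invented function names, not the lvconvert implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Made-up bit values for illustration only. */
#define LV_VISIBLE  UINT64_C(0x1)
#define LVM_WRITE   UINT64_C(0x2)

/* '--splitmirrors N --trackchanges' is only accepted for a single image
 * and without '--name'. */
static int trackchanges_args_ok(unsigned split_count, const char *name_arg)
{
    return split_count == 1 && name_arg == NULL;
}

/* The data LV becomes visible and read-only; the metadata LV stays in
 * the array and is not touched. */
static void track_split(uint64_t *data_status)
{
    *data_status |= LV_VISIBLE;
    *data_status &= ~LVM_WRITE;
}

/* While building the table, a visible, read-only image is replaced by
 * the "- -" sentinel so its slot stays reserved. */
static int emit_empty_slot(uint64_t data_status)
{
    return (data_status & LV_VISIBLE) && !(data_status & LVM_WRITE);
}

/* The split LV keeps its original '<lv>_rimage_<N>' name. */
static void split_lv_name(char *buf, size_t len, const char *lv, unsigned n)
{
    snprintf(buf, len, "%s_rimage_%u", lv, n);
}
```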
|
|
+
+When an LV which was split from a RAID1 array with the '--trackchanges' option
+is merged back into the array, its read/write status is restored and it is
+set as "hidden" again. Recycling the array (suspend/resume) restores the sub-LV
+to its position in the array and begins the process of sync'ing the changes that
+were made since the time it was split from the array.
+
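The merge step reverses the two status changes made by '--trackchanges'. In this hypothetical sketch (made-up flag values, invented function names), 'merge_image' refuses an LV that is not a tracked split. Otherwise it restores write access and hides the LV again. The suspend/resume that reloads the table and starts the resync is left to the caller, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up bit values for illustration only. */
#define LV_VISIBLE  UINT64_C(0x1)
#define LVM_WRITE   UINT64_C(0x2)

/* A tracked split is visible and read-only. */
static int can_merge(uint64_t status)
{
    return (status & LV_VISIBLE) && !(status & LVM_WRITE);
}

/* '--merge': restore read/write access and hide the image again. After
 * this, the array must be suspended and resumed so the table is reloaded
 * without the "- -" sentinel and only the tracked changes are resynced. */
static int merge_image(uint64_t *status)
{
    if (!can_merge(*status))
        return 0;
    *status |= LVM_WRITE;
    *status &= ~LV_VISIBLE;
    return 1;
}
```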
|