mirror of
git://sourceware.org/git/lvm2.git
LVM device fault handling
=========================

Introduction
------------
This document serves as the definitive source for information
regarding the policies and procedures surrounding device failures
in LVM. It codifies LVM's responses to device failures as well as
the responsibilities of administrators.

Device failures can be permanent or transient. A permanent failure
is one where a device becomes inaccessible and will never be
revived. A transient failure is a failure that can be recovered
from (e.g. a power failure, intermittent network outage, block
relocation, etc). The policies for handling both types of failures
are described herein.

Available Operations During a Device Failure
--------------------------------------------
When a device fails, LVM can find only a subset of the devices that
belong to the affected volume group, so the number of operations
available to the administrator is diminished. It is not possible to
create new logical volumes while PVs cannot be accessed, for example.
Operations that create, convert, or resize logical volumes are
disallowed, such as:
- lvcreate
- lvresize
- lvreduce
- lvextend
- lvconvert (unless '--repair' is used)
Operations that activate, deactivate, remove, report, or repair logical
volumes are allowed, such as:
- lvremove
- vgremove (will remove all LVs, but not the VG until consistent)
- pvs
- vgs
- lvs
- lvchange -a [yn]
- vgchange -a [yn]
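As a rough illustration of the two lists above, the following Python
sketch classifies a command line as allowed or disallowed while a PV is
missing. The function name 'allowed_during_failure' and the two sets are
hypothetical helpers written for this document; the real checks happen
inside the LVM tools, and the sets contain only the commands named above.

```python
# Commands named above as disallowed while a device is missing.
DISALLOWED = {"lvcreate", "lvresize", "lvreduce", "lvextend"}

# Commands named above as still allowed. Simplification: lvchange and
# vgchange are listed above only in their '-a [yn]' form, but this
# sketch looks at the command name alone.
ALLOWED = {"lvremove", "vgremove", "pvs", "vgs", "lvs", "lvchange", "vgchange"}


def allowed_during_failure(argv):
    """Classify a command line given as a list of strings.

    Returns True or False for commands named in the lists above and
    None for anything this sketch does not know about.
    """
    if not argv:
        return None
    cmd, args = argv[0], argv[1:]
    if cmd == "lvconvert":
        # lvconvert is permitted only in its repair form.
        return "--repair" in args
    if cmd in DISALLOWED:
        return False
    if cmd in ALLOWED:
        return True
    return None
```

For example, allowed_during_failure(["lvconvert", "--repair", "vg0/lv0"])
is True, while the same call with ["lvcreate", "-L", "1G", "vg0"] is False.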
Operations specific to the handling of failed devices are allowed and
are as follows:

- 'vgreduce --removemissing <VG>': This action is designed to remove
  the reference to a failed device from the LVM metadata stored on the
  remaining devices. If there are (portions of) logical volumes on the
  failed devices, the ability of the operation to proceed will depend
  on the type of logical volumes found. If an image (i.e. leg or side)
  of a mirror is located on the device, that image/leg of the mirror
  is eliminated along with the failed device. The result of such a
  mirror reduction could be a no-longer-redundant linear device. If
  a linear, stripe, or snapshot device is located on the failed device,
  the command will not proceed without a '--force' option. The result
  of using the '--force' option is the entire removal and complete
  loss of the non-redundant logical volume. Once this operation is
  complete, the volume group will again have a complete and consistent
  view of the devices it contains. Thus, all operations will be
  permitted - including creation, conversion, and resizing operations.

- 'lvconvert --repair <VG/LV>': This action is designed specifically
  to operate on mirrored logical volumes. It is used on logical volumes
  individually and does not remove the faulty device from the volume
  group. If, for example, a failed device happened to contain the
  images of four distinct mirrors, it would be necessary to run
  'lvconvert --repair' on each of them. The ultimate result is to leave
  the faulty device in the volume group, but have no logical volumes
  referencing it. In addition to removing mirror images that reside
  on failed devices, 'lvconvert --repair' can also replace the failed
  device if there are spare devices available in the volume group.
  When run from the command line, the user is prompted whether to
  simply remove the failed portions of the mirror or to also allocate
  a replacement. Optionally, the '--use-policies' flag can be specified,
  which will cause the operation not to prompt the user, but instead
  respect the policies outlined in the LVM configuration file - usually,
  /etc/lvm/lvm.conf. Once this operation is complete, mirrored logical
  volumes will be consistent and I/O will be allowed to continue.
  However, the volume group will still be inconsistent - due to the
  referenced-but-missing device/PV - and operations will still be
  restricted to the aforementioned actions until either the device is
  restored or 'vgreduce --removemissing' is run.

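The two recovery commands can be combined into one procedure: repair
each affected mirror, then drop the missing PV from the volume group.
The Python sketch below only builds the argument lists and does not run
anything. The function 'repair_plan', its parameters, and names such as
'vg0' are assumptions made for illustration; only the command names and
the '--repair', '--use-policies', '--removemissing' and '--force' options
come from the text above.

```python
def repair_plan(vg, mirrors, use_policies=False, force=False):
    """Build the command sequence for a VG with a missing PV.

    vg           -- volume group name
    mirrors      -- names of mirrored LVs that had an image on the
                    failed device (one 'lvconvert --repair' each)
    use_policies -- add '--use-policies' so lvconvert does not prompt
    force        -- add '--force' to vgreduce; per the text above this
                    destroys any linear, stripe, or snapshot LV that is
                    located on the failed device
    """
    plan = []
    for lv in mirrors:
        cmd = ["lvconvert", "--repair"]
        if use_policies:
            cmd.append("--use-policies")
        cmd.append(f"{vg}/{lv}")
        plan.append(cmd)

    cmd = ["vgreduce", "--removemissing"]
    if force:
        cmd.append("--force")
    cmd.append(vg)
    plan.append(cmd)
    return plan
```

Calling repair_plan("vg0", ["m1", "m2"], use_policies=True) yields two
lvconvert commands followed by a single vgreduce command. Keeping the
plan as data makes it easy to review before anything touches the volume
group.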
Device Revival (transient failures):
------------------------------------
The section above describes the limitations a user can expect during
a device failure. If the device returns after a period of time, what
to expect depends on what happened while the device was failed. If no
automated actions (described below) or user actions were necessary or
performed, then no change in operations or logical volume layout will
occur. However, if an automated action was taken or one of the
aforementioned repair commands was run manually, the returning device
will be perceived as having stale LVM metadata. In this case, the user
can expect to see a warning concerning inconsistent metadata. The
metadata on the returning device will be automatically replaced with
the latest copy of the LVM metadata - restoring consistency. Note that
while most LVM commands will automatically update the metadata on a
restored device, the following possible exceptions exist:
- pvs (when it does not read/update VG metadata)

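The stale-metadata case can be pictured with a small Python sketch. It
assumes that each metadata copy carries a sequence number that increases
with every change (LVM2 text metadata has a 'seqno' field; comparing two
integers is a simplification of what the tools do). The function name
and its return values are hypothetical.

```python
def reconcile_returning_device(vg_seqno, device_seqno):
    """Decide what happens to a PV that reappears after a failure.

    vg_seqno     -- sequence number of the latest metadata in the VG
    device_seqno -- sequence number found on the returning device

    Returns a (warning, action) tuple. Simplified model: a lower
    sequence number on the returning device means its copy is stale.
    """
    if device_seqno == vg_seqno:
        # Nothing changed while the device was away.
        return (None, "none")
    if device_seqno < vg_seqno:
        # A repair or automated action updated the VG in the meantime.
        return ("inconsistent metadata", "replace with latest copy")
    raise ValueError("device metadata newer than VG; outside this sketch")
```

If a repair command ran while the device was missing, the VG sequence
number has advanced, so the second branch applies and the stale copy is
overwritten.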
Automated Target Response to Failures:
--------------------------------------
The only LVM target type (i.e. "personality") that has an automated
response to failures is a mirrored logical volume. The other target
types (linear, stripe, snapshot, etc) will simply propagate the failure.
[A snapshot becomes invalid if its underlying device fails, but the
origin will remain valid - presuming the origin device has not failed.]
There are three types of errors that a mirror can suffer - read, write,
and resynchronization errors. Each is described in depth below.

Mirror read failures:
If a mirror is 'in-sync' (i.e. all images have been initialized and
are identical), a read failure will only produce a warning. Data is
simply pulled from one of the other images and the fault is recorded.
Sometimes - like in the case of bad block relocation - read errors can
be recovered from by the storage hardware. Therefore, it is up to the
user to decide whether to reconfigure the mirror and remove the device
that caused the error. Managing the composition of a mirror is done with
'lvconvert' and removing a device from a volume group can be done with
'vgreduce'.

If a mirror is not 'in-sync', a read failure will produce an I/O error.
This error will propagate all the way up to the applications above the
logical volume (e.g. the file system). No automatic intervention will
take place in this case either. It is up to the user to decide what
can be done/salvaged in this scenario. If the user is confident that the
images of the mirror are the same (or they are willing to simply attempt
to retrieve whatever data they can), 'lvconvert' can be used to eliminate
the failed image and proceed.

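The two read-failure cases can be summarised in a few lines of Python.
This is a model of the behaviour described above, not the kernel mirror
code; the function name, the image names and the result strings are
assumptions chosen for illustration.

```python
def handle_mirror_read_failure(in_sync, images, failed):
    """Model the outcome of a read error on one mirror image.

    in_sync -- True if all images are initialized and identical
    images  -- list of image names making up the mirror
    failed  -- name of the image on which the read failed

    Returns (result, source): ("warning", image) when the read can be
    served from another image, or ("io_error", None) when the error is
    propagated to the application.
    """
    if in_sync:
        others = [name for name in images if name != failed]
        if others:
            # Any other image holds identical data; use the first one.
            return ("warning", others[0])
    # Not in-sync (or no other image): the error goes up the stack.
    return ("io_error", None)
```

In neither case does the model change the mirror layout, which matches
the text: reconfiguring the mirror after a read failure is left to the
user.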
Mirror resynchronization errors:
A resynchronization error is one that occurs when trying to initialize
all mirror images to be the same. It can happen due to a failure to
read the primary image (the image considered to have the 'good' data), or
due to a failure to write the secondary images. This type of failure
only produces a warning, and it is up to the user to take action in this
case. If the error is transient, the user can simply reactivate the
mirrored logical volume to make another attempt at resynchronization.
If attempts to finish resynchronization fail, 'lvconvert' can be used to
remove the faulty device from the mirror.

TODO...
Some sort of response to this type of error could be automated.
Since this document is the definitive source for how to handle device
failures, the process should be defined here. If the process is defined
but not implemented, it should be noted as such. One idea might be to
make a single attempt to suspend/resume the mirror in an attempt to
redo the sync operation that failed. On the other hand, if there is
a permanent failure, it may simply be best to wait for the user or the
automated response that is sure to follow from a write failure.
...TODO

Mirror write failures:
When a write error occurs on a mirror constituent device, an attempt
to handle the failure is automatically made. This is done by calling
'lvconvert --repair --use-policies'. The policies implied by this
command are set in the LVM configuration file. They are:
- mirror_log_fault_policy: This defines what action should be taken
  if the device containing the log fails. The available options are
  "remove" and "allocate". Either of these options will cause the
  faulty log device to be removed from the mirror. The "allocate"
  policy will attempt the further action of trying to replace the
  failed disk log by using space that might be available in the
  volume group. If the allocation fails (or the "remove" policy
  is specified), the mirror log will be maintained in memory. Should
  the machine be rebooted or the logical volume deactivated, a
  complete resynchronization of the mirror will be necessary upon
  the following activation - such is the nature of a mirror with a
  'core' log. The default policy for handling log failures is
  "allocate". The service disruption incurred by replacing the failed
  log is negligible, while the benefits of having a persistent log
  are pronounced.
- mirror_image_fault_policy: This defines what action should be taken
  if a device containing an image fails. Again, the available options
  are "remove" and "allocate". Both of these options will cause the
  faulty image device to be removed - adjusting the logical volume
  accordingly. For example, if one image of a 2-way mirror fails, the
  mirror will be converted to a linear device. If one image of a
  3-way mirror fails, the mirror will be converted to a 2-way mirror.
  The "allocate" policy takes the further action of trying to replace
  the failed image using space that is available in the volume group.
  Replacing a failed mirror image will incur the cost of
  resynchronizing - degrading the performance of the mirror. The
  default policy for handling an image failure is "remove". This
  allows the mirror to still function, but gives the administrator the
  choice of when to incur the extra performance costs of replacing
  the failed image.

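The effect of the two policies can be modelled as follows. This Python
sketch mirrors the rules stated above and is not how 'lvconvert' is
implemented; the function names, the 'spare_available' flag and the
returned strings are assumptions. The default values are the ones given
in the text.

```python
DEFAULT_LOG_POLICY = "allocate"
DEFAULT_IMAGE_POLICY = "remove"
POLICIES = ("remove", "allocate")


def apply_log_fault_policy(policy=DEFAULT_LOG_POLICY, spare_available=False):
    """Return the log type after the log device fails: 'disk' or 'core'."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if policy == "allocate" and spare_available:
        return "disk"   # failed log replaced from free space in the VG
    return "core"       # log kept in memory; full resync on next activation


def apply_image_fault_policy(images, policy=DEFAULT_IMAGE_POLICY,
                             spare_available=False):
    """Return (layout, resync_needed) after one of `images` images fails."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if images < 2:
        raise ValueError("a mirror has at least two images")
    remaining = images - 1          # the faulty image is always removed
    resync_needed = False
    if policy == "allocate" and spare_available:
        remaining += 1              # replacement image must be resynchronized
        resync_needed = True
    layout = "linear" if remaining == 1 else f"{remaining}-way mirror"
    return (layout, resync_needed)
```

With the default "remove" policy, apply_image_fault_policy(2) returns
("linear", False), matching the 2-way example above.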
TODO...
The appropriate time to take permanent corrective action on a mirror
should be driven by policy. There should be a directive that takes
a time or percentage argument. Something like the following:
- mirror_fault_policy_WHEN = "10sec"/"10%"
A time value would signal the amount of time to wait for transient
failures to resolve themselves. The percentage value would signal the
amount a mirror could become out-of-sync before the faulty device is
removed.

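To make the proposed value format concrete, here is a Python sketch of
how such an argument might be parsed. The directive is only an idea in
this TODO and does not exist in LVM; the function name, the accepted
syntax (an integer followed by "sec" or "%") and the 0-100 range check
are all assumptions.

```python
import re


def parse_fault_policy_when(value):
    """Parse a value such as "10sec" or "10%".

    Returns ("seconds", n) for a time to wait for a transient failure
    to resolve, or ("percent", n) for the out-of-sync amount tolerated
    before the faulty device is removed.
    """
    match = re.fullmatch(r"(\d+)(sec|%)", value.strip())
    if match is None:
        raise ValueError(f"expected '<N>sec' or '<N>%', got {value!r}")
    number = int(match.group(1))
    if match.group(2) == "%":
        if number > 100:
            raise ValueError("percentage must be between 0 and 100")
        return ("percent", number)
    return ("seconds", number)
```

Returning a tagged tuple keeps the two interpretations separate, so a
caller cannot confuse a timeout with a percentage threshold.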
A mirror cannot be used unless /some/ corrective action is taken,
however. One option is to replace the failed mirror image with an
error target, forgo the use of 'handle_errors', and simply let the
out-of-sync regions accumulate and be tracked by the log. Mirrors
that have more than 2 images would have to "stack" to perform the
tracking, as each failed image would have to be associated with a
log. If the failure is transient, the device would replace the
error target that was holding its spot and the log that was tracking
the deltas would be used to quickly restore the portions that changed.

One unresolved issue with the above scheme is how to know which
regions of the mirror are out-of-sync when a problem occurs. When
a write failure occurs in the kernel, the log will contain those
regions that are not in-sync. If the log is a disk log, that log
could continue to be used to track differences. However, if the
log was a core log - or if the log device failed at the same time
as an image device - there would be no way to determine which
regions are out-of-sync to begin with as we start to track the
deltas for the failed image. I don't have a solution for this
problem other than to only be able to handle errors in this way
if conditions are right. These issues will have to be ironed out
before proceeding. This could be another case where it is better
to handle failures in the kernel by allowing the kernel to store
updates in various metadata areas.
...TODO