IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Corrupt metadata text (with good mda header) was being handled
in the label_scan phase, but not in the vg_read phase. This
was sufficient because metadata areas would always be read and
checksummed during label_scan (metadata parsing was skipped
previously as an optimization.)
This changed with the optimization in
commit 61a6f9905e
"metadata: optimize reading metadata copies in scan"
Now, some metadata areas will not be read and checksummed
at all during the label_scan phase, only during the vg_read
phase. This means that bad metadata text may first be detected
in the vg_read phase. So, add equivalent bad metadata handling
to the vg_read path to match the label_scan path.
When creating lvm2 metadata for VG, lvm2 allocate some buffer,
and if buffer is not big enough, the buffer is 'reallocated' bigger,
and whole metadata creation is repeated until metadata fits.
We can try to use 'previous' metadata size as hint to reduce looping
here.
When parsing VG metadata we can create from a single config tree
also 'vg_committed' that is always created for writable VG.
This avoids extra uncessary step of serializing and deserilizing
just parsed VG.
Move extra md component detection into the label scan phase.
It had been in set_pv_devices which was deep within the vg_read
phase, which wasn't a good place (better to detect that earlier.)
Now that pv metadata info is available in the scan phase, the pv
details (size and device_hint) can be used for extra md checking.
Use the device_hint from the pv metadata to trigger a full md
component check if the device_hint begins with /dev/md.
Stop triggering full md component checks based on missing
udev info for a dev.
Changes to tests to reflect that the code is now detecting
md components in some test case that it wasn't before.
Usually md components are eliminated in label scan and/or
duplicate resolution, but they could sometimes get into
the vg_read stage, where set_pv_devices compares the
device to the PV.
If set_pv_devices runs an md component check and finds
one, vg_read should eliminate the components.
In set_pv_devices, run an md component check always
if the PV is smaller than the device (this is not
very common.) If the PV is larger than the device,
(more common), do the component check when the config
setting is "auto" (the default).
The fact that vg repair is implemented as a part of vg read
has led to a messy and complicated implementation of vg_read,
and limited and uncontrolled repair capability. This splits
read and repair apart.
Summary
-------
- take all kinds of various repairs out of vg_read
- vg_read no longer writes anything
- vg_read now simply reads and returns vg metadata
- vg_read ignores bad or old copies of metadata
- vg_read proceeds with a single good copy of metadata
- improve error checks and handling when reading
- keep track of bad (corrupt) copies of metadata in lvmcache
- keep track of old (seqno) copies of metadata in lvmcache
- keep track of outdated PVs in lvmcache
- vg_write will do basic repairs
- new command vgck --updatemetdata will do all repairs
Details
-------
- In scan, do not delete dev from lvmcache if reading/processing fails;
the dev is still present, and removing it makes it look like the dev
is not there. Records are now kept about the problems with each PV
so they be fixed/repaired in the appropriate places.
- In scan, record a bad mda on failure, and delete the mda from
mda in use list so it will not be used by vg_read or vg_write,
only by repair.
- In scan, succeed if any good mda on a device is found, instead of
failing if any is bad. The bad/old copies of metadata should not
interfere with normal usage while good copies can be used.
- In scan, add a record of old mdas in lvmcache for later, do not repair
them while reading, and do not let them prevent us from finding and
using a good copy of metadata from elsewhere. One result is that
"inconsistent metadata" is no longer a read error, but instead a
record in lvmcache that can be addressed separate from the read.
- Treat a dev with no good mdas like a dev with no mdas, which is an
existing case we already handle.
- Don't use a fake vg "handle" for returning an error from vg_read,
or the vg_read_error function for getting that error number;
just return null if the vg cannot be read or used, and an error_flags
arg with flags set for the specific kind of error (which can be used
later for determining the kind of repair.)
- Saving an original copy of the vg metadata, for purposes of reverting
a write, is now done explicitly in vg_read instead of being hidden in
the vg_make_handle function.
- When a vg is not accessible due to "access restrictions" but is
otherwise fine, return the vg through the new error_vg arg so that
process_each_pv can skip the PVs in the VG while processing.
(This is a temporary accomodation for the way process_each_pv
tracks which devs have been looked at, and can be dropped later
when process_each_pv implementation dev tracking is changed.)
- vg_read does not try to fix or recover a vg, but now just reads the
metadata, checks access restrictions and returns it.
(Checking access restrictions might be better done outside of vg_read,
but this is a later improvement.)
- _vg_read now simply makes one attempt to read metadata from
each mda, and uses the most recent copy to return to the caller
in the form of a 'vg' struct.
(bad mdas were excluded during the scan and are not retried)
(old mdas were not excluded during scan and are retried here)
- vg_read uses _vg_read to get the latest copy of metadata from mdas,
and then makes various checks against it to produce warnings,
and to check if VG access is allowed (access restrictions include:
writable, foreign, shared, clustered, missing pvs).
- Things that were previously silently/automatically written by vg_read
that are now done by vg_write, based on the records made in lvmcache
during the scan and read:
. clearing the missing flag
. updating old copies of metadata
. clearing outdated pvs
. updating pv header flags
- Bad/corrupt metadata are now repaired; they were not before.
Test changes
------------
- A read command no longer writes the VG to repair it, so add a write
command to do a repair.
(inconsistent-metadata, unlost-pv)
- When a missing PV is removed from a VG, and then the device is
enabled again, vgck --updatemetadata is needed to clear the
outdated PV before it can be used again, where it wasn't before.
(lvconvert-repair-policy, lvconvert-repair-raid, lvconvert-repair,
mirror-vgreduce-removemissing, pv-ext-flags, unlost-pv)
Reading bad/old metadata
------------------------
- "bad metadata": the mda_header or metadata text has invalid fields
or can't be parsed by lvm. This is a form of corruption that would
not be caused by known failure scenarios. A checksum error is
typically included among the errors reported.
- "old metadata": a valid copy of the metadata that has a smaller seqno
than other copies of the metadata. This can happen if the device
failed, or io failed, or lvm failed while commiting new metadata
to all the metadata areas. Old metadata on a PV that has been
removed from the VG is the "outdated" case below.
When a VG has some PVs with bad/old metadata, lvm can simply ignore
the bad/old copies, and use a good copy. This is why there are
multiple copies of the metadata -- so it's available even when some
of the copies cannot be used. The bad/old copies do not have to be
repaired before the VG can be used (the repair can happen later.)
A PV with no good copies of the metadata simply falls back to being
treated like a PV with no mdas; a common and harmless configuration.
When bad/old metadata exists, lvm warns the user about it, and
suggests repairing it using a new metadata repair command.
Bad metadata in particular is something that users will want to
investigate and repair themselves, since it should not happen and
may indicate some other problem that needs to be fixed.
PVs with bad/old metadata are not the same as missing devices.
Missing devices will block various kinds of VG modification or
activation, but bad/old metadata will not.
Previously, lvm would attempt to repair bad/old metadata whenever
it was read. This was unnecessary since lvm does not require every
copy of the metadata to be used. It would also hide potential
problems that should be investigated by the user. It was also
dangerous in cases where the VG was on shared storage. The user
is now allowed to investigate potential problems and decide how
and when to repair them.
Repairing bad/old metadata
--------------------------
When label scan sees bad metadata in an mda, that mda is removed
from the lvmcache info->mdas list. This means that vg_read will
skip it, and not attempt to read/process it again. If it was
the only in-use mda on a PV, that PV is treated like a PV with
no mdas. It also means that vg_write will also skip the bad mda,
and not attempt to write new metadata to it. The only way to
repair bad metadata is with the metadata repair command.
When label scan sees old metadata in an mda, that mda is kept
in the lvmcache info->mdas list. This means that vg_read will
read/process it again, and likely see the same mismatch with
the other copies of the metadata. Like the label_scan, the
vg_read will simply ignore the old copy of the metadata and
use the latest copy. If the command is modifying the vg
(e.g. lvcreate), then vg_write, which writes new metadata to
every mda on info->mdas, will write the new metadata to the
mda that had the old version. If successful, this will resolve
the old metadata problem (without needing to run a metadata
repair command.)
Outdated PVs
------------
An outdated PV is a PV that has an old copy of VG metadata
that shows it is a member of the VG, but the latest copy of
the VG metadata does not include this PV. This happens if
the PV is disconnected, vgreduce --removemissing is run to
remove the PV from the VG, then the PV is reconnected.
In this case, the outdated PV needs have its outdated metadata
removed and the PV used flag needs to be cleared. This repair
will be done by the subsequent repair command. It is also done
if vgremove is run on the VG.
MISSING PVs
-----------
When a device is missing, most commands will refuse to modify
the VG. This is the simple case. More complicated is when
a command is allowed to modify the VG while it is missing a
device.
When a VG is written while a device is missing for one of it's PVs,
the VG metadata is written to disk with the MISSING flag on the PV
with the missing device. When the VG is next used, it is treated
as if the PV with the MISSING flag still has a missing device, even
if that device has reappeared.
If all LVs that were using a PV with the MISSING flag are removed
or repaired so that the MISSING PV is no longer used, then the
next time the VG metadata is written, the MISSING flag will be
dropped.
Alternative methods of clearing the MISSING flag are:
vgreduce --removemissing will remove PVs with missing devices,
or PVs with the MISSING flag where the device has reappeared.
vgextend --restoremissing will clear the MISSING flag on PVs
where the device has reappeared, allowing the VG to be used
normally. This must be done with caution since the reappeared
device may have old data that is inconsistent with data on other PVs.
Bad mda repair
--------------
The new command:
vgck --updatemetadata VG
first uses vg_write to repair old metadata, and other basic
issues mentioned above (old metadata, outdated PVs, pv_header
flags, MISSING_PV flags). It will also go further and repair
bad metadata:
. text metadata that has a bad checksum
. text metadata that is not parsable
. corrupt mda_header checksum and version fields
(To keep a clean diff, #if 0 is added around functions that
are replaced by new code. These commented functions are
removed by the following commit.)
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
Refactor the recent metadata-reading optimisation patches.
Remove the recently-added cache fields from struct labeller
and struct format_instance.
Instead, introduce struct lvmcache_vgsummary to wrap the VG information
that lvmcache holds and add the metadata size and checksum to it.
Allow this VG summary information to be looked up by metadata size +
checksum. Adjust the debug log messages to make it clear when this
shortcut has been successful.
(This changes the optimisation slightly, and might be extendable
further.)
Add struct cached_vg_fmtdata to format-specific vg_read calls to
preserve state alongside the VG across separate calls and indicate
if the details supplied match, avoiding the need to read and
process the VG metadata again.
Use similar logic as with text_vg_import_fd() and avoid repeated
parsing of same mda and its config tree for vgname_from_mda().
Remember last parsed vgname, vgid and creation_host in labeller
structure and if the metadata have the same size and checksum,
return this stored info.
TODO: The reuse of labeller struct is not ideal, some lvmcache API for
this functionality would be nicer.
When reading VG mda from multiple PVs - do all the validation only
when mda is seen for the first time and when mda checksum and length
is same just return already existing VG pointer.
(i.e. using 300PVs for a VG would lead to create and destroy 300 config trees....)
The warnings arg was used to enable logging of warnings
when reading a PV. This arg is turned into a set of flags
with the WARN_PV_READ flag matching the existing behavior.
A new flag WARN_INCONSISTENT is added that will cause
vg_read_internal() to log the "VG is not consistent"
warning so the various callers do not need to log
this warning themselves.
A new vg_read flag READ_WARN_INCONSISTENT is used from
reporting to enable the WARN_INCONSISTENT flag in
vg_read_internal.
[Committed by agk with cosmetic changes and tweaks.]
Add CONFIG_FILE_SPECIAL config source id to make a difference between
real configuration tree (like lvm.conf and tag configs) and special purpose
configuration tree (like LVM metadata, persistent filter).
This makes it easier to attach correct customized data to the config
tree that is created out of the source then.
A helper type that helps with identification of the configuration source
which makes handling the configuration cascade a bit easier, mainly
removing and adding configuration trees to cascade dynamically.
Currently, the possible types are:
CONFIG_UNDEFINED - configuration is not defined yet (not initialized)
CONFIG_FILE - one file configuration
CONFIG_MERGED_FILES - configuration that is a result of merging more files into one
CONFIG_STRING - configuration string typed on cmd line directly
CONFIG_PROFILE - profile configuration (the new type of configuration, patches will follow...)
Also, generalize existing "remove_overridden_config_tree" to work with
configuration type identification in a cascade. Before, it was just
the CONFIG_STRING we used. Now, we need some more to add in a
cascade (like the CONFIG_PROFILE). So, we have:
struct dm_config_tree *remove_config_tree_by_source(struct cmd_context *cmd, config_source_t source);
config_source_t config_get_source_type(struct dm_config_tree *cft);
... for removing the tree by its source type from the cascade and
simply getting the source type.
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
It's useful to keep the partial flag cached - so just move the call
for vg_mark_partil_lvs() into import_vg_from_config_tree() so it gets
evaluated before it goes through the lvmcache.
This patch should not present any functional change.
Note: It is rather temporal solution - proper place is probably inside the
'read' call back - but needs some more discussion.
For now using this minor hack.
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating config tree inside the function allow config tree to
be passed as parameter - usable later for caching.
Physical segments were still allocated from global
command context mempool.
This leads to very high memory usage when
activating large VG (vgchange).
(Memory usage was about 2G when >3000LVs).
Fix it by properly using vg->vgmem private pool,
so all the memory is released early.
New memory pool parameter is needed here for pv_split_segment
function.
Also fix the same problem in some minor allocations
(vg description, lv segment split).
The physical_volume, volume_group, logical_volume and lv_segment
structures' 'status' member is now uint64_t.
The alignment of these structures was also audited to remove holes. The
movement of some members in 'volume_group' and 'lv_segment' eliminates
holes. The 'physical_volume' structure still has one 4-byte hole after
'pe_size'; the other structures no longer have any holes. Each
structures' size has not changed.