IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
uses vg_write to correct more common or less severe issues,
and also adds the ability to repair some metadata corruption
that couldn't be handled previously.
and implement it based on a device, not based
on a pv struct (which is not available when the
device is not a part of the vg.)
currently only the vgremove command wipes outdated
pvs until more advanced recovery is added in a
subsequent commit
The vg read and vg write cases need to update lvmcache
differently, so create separate functions for them.
The read case now handles checking for outdated mdas
and moves them aside into a new list to be repaired in
a subsequent commit.
The existing comment was desribing the correct behavior,
but the code didn't match. The commit is successful if
one mda was committed. Making it depend on the result of
the internal lvmcache update was wrong.
mda's that cannot be processed by lvm because of
some corruption can be kept on a separate list.
These will be used for more advanced repair in a
subsequent commit.
When reading metadata headers and text, use a new set
of flags to identify specific errors that are seen.
These will be used for more advanced repair in a
subsequent commit.
Use the recently added dump routines to produce the
old/traditional pvck output, and remove the code that
had been used for that.
The validation/checking done by the new routines means
that new lines prefixed with CHECK are printed for
incorrect values.
Recent kernel version from kernel commit:
de7180ff908b2bc0342e832dbdaa9a5f1ecaa33a
started to report in cache status line new flag:
no_discard_passdown
Whenever lvm spots unknown status it reports:
Unknown feature in status:
So add reconginzing this feature flag and also report this with
'lvs -o+kernel_discards'
When no_discard_passdown is found in status 'nopassdown' gets reported
for this field (roughly matching what we report for thin-pools).
There have been two file locks used to protect lvm
"global state": "ORPHANS" and "GLOBAL".
Commands that used the ORPHAN flock in exclusive mode:
pvcreate, pvremove, vgcreate, vgextend, vgremove,
vgcfgrestore
Commands that used the ORPHAN flock in shared mode:
vgimportclone, pvs, pvscan, pvresize, pvmove,
pvdisplay, pvchange, fullreport
Commands that used the GLOBAL flock in exclusive mode:
pvchange, pvscan, vgimportclone, vgscan
Commands that used the GLOBAL flock in shared mode:
pvscan --cache, pvs
The ORPHAN lock covers the important cases of serializing
the use of orphan PVs. It also partially covers the
reporting of orphan PVs (although not correctly as
explained below.)
The GLOBAL lock doesn't seem to have a clear purpose
(it may have eroded over time.)
Neither lock correctly protects the VG namespace, or
orphan PV properties.
To simplify and correct these issues, the two separate
flocks are combined into the one GLOBAL flock, and this flock
is used from the locking sites that are in place for the
lvmlockd global lock.
The logic behind the lvmlockd (distributed) global lock is
that any command that changes "global state" needs to take
the global lock in ex mode. Global state in lvm is: the list
of VG names, the set of orphan PVs, and any properties of
orphan PVs. Reading this global state can use the global lock
in sh mode to ensure it doesn't change while being reported.
The locking of global state now looks like:
lockd_global()
previously named lockd_gl(), acquires the distributed
global lock through lvmlockd. This is unchanged.
It serializes distributed lvm commands that are changing
global state. This is a no-op when lvmlockd is not in use.
lockf_global()
acquires an flock on a local file. It serializes local lvm
commands that are changing global state.
lock_global()
first calls lockf_global() to acquire the local flock for
global state, and if this succeeds, it calls lockd_global()
to acquire the distributed lock for global state.
Replace instances of lockd_gl() with lock_global(), so that the
existing sites for lvmlockd global state locking are now also
used for local file locking of global state. Remove the previous
file locking calls lock_vol(GLOBAL) and lock_vol(ORPHAN).
The following commands which change global state are now
serialized with the exclusive global flock:
pvchange (of orphan), pvresize (of orphan), pvcreate, pvremove,
vgcreate, vgextend, vgremove, vgreduce, vgrename,
vgcfgrestore, vgimportclone, vgmerge, vgsplit
Commands that use a shared flock to read global state (and will
be serialized against the prior list) are those that use
process_each functions that are based on processing a list of
all VG names, or all PVs. The list of all VGs or all PVs is
global state and the shared lock prevents those lists from
changing while the command is processing them.
The ORPHAN lock previously attempted to produce an accurate
listing of orphan PVs, but it was only acquired at the end of
the command during the fake vg_read of the fake orphan vg.
This is not when orphan PVs were determined; they were
determined by elimination beforehand by processing all real
VGs, and subtracting the PVs in the real VGs from the list
of all PVs that had been identified during the initial scan.
This is fixed by holding the single global lock in shared mode
while processing all VGs to determine the list of orphan PVs.
wipe_lv knows it's going to write the device, so it
can open rw from the start. It was opening readonly,
and then dev_write needed to reopen it readwrite.
This reverts 518a8e8cfb
"lvmlockd: activate mirror LVs in shared mode with cmirrord"
because while activating a mirror LV with cmirrord worked,
changes to the active cmirror did not work.
When data are growing, adapt also size of metadata.
As we get way too many reports from users doing huge growths of
data portion while keep metadata small and avoiding using monitoring.
So to enhance the user-experience in case user requests grown of
thin-pool (without passing PV list for growth) - lvm2 will automaticaly
grown also the metadata part of thin-pool (if possible).
Add function for estimation of thin-pool metadata size for given size of
data. Function is using already existing internal API so it can
be reused for resize of thin-pool data.
When lvextend extends an LV that is active with a shared
lock, use this as a signal that other hosts may also have
the LV active, with gfs2 mounted, and should have the LV
refreshed to reflect the new size. Use the libdlmcontrol
run api, which uses dlm_controld/corosync to run an
lvchange --refresh command on other cluster nodes.
Allow using caching with VDO.
User can either cache a single vdopool or
a vdo LV - difference when the caching is put-in depends on a use-case
and it's upto user to decide which kind of speed is expected.
Activation would not be allowed anyway, but we can
check for these cases early and avoid wasted time in
pvscan managing online files an attempting activation.
and "cachepool" to refer to a cache on a cache pool object.
The problem was that the --cachepool option was being used
to refer to both a cache pool object, and to a standard LV
used for caching. This could be somewhat confusing, and it
made it less clear when each kind would be used. By
separating them, it's clear when a cachepool or a cachevol
should be used.
Previously:
- lvm would use the cache pool approach when the user passed
a cache-pool LV to the --cachepool option.
- lvm would use the cache vol approach when the user passed
a standard LV in the --cachepool option.
Now:
- lvm will always use the cache pool approach when the user
uses the --cachepool option.
- lvm will always use the cache vol approach when the user
uses the --cachevol option.
Whenever thin-pool chunk size is unspecified and left for lvm calculation
try to select the size as nearest highest power-of-2 instead of
just being a multiple of 64KiB.
Fixing recent commit 022ebb0cfe
Resize already has size that needs to be counted with,
otherwise upsizing operation could turn into size reduction one.
Now with newer VDO kvdo target we can start to use standard mechanism
to enable resize of VDO volumes.
VDO pool can be grown.
Virtual volume grows on top of VDO pool when is not big enough.
Reduced VDOLV is calling discard for reduced areas - this can
take long time!
TODO: implement some pollable mechanism for out-of-lock TRIM.
When using 'lvcreate -l100%VG' and there is big disproportion between
real available space and requested setting - automatically fallback
to 100%FREE.
Difference can be seen when VG is big and already most space was
allocated, so the requestion 100%VG can end (and by spec for % modifier
it's correct) as LV with size of 1%VG. Usually this is not a big
problem - buit in some cases - like cache-pool allocation, this
can result a big difference for chunksize selection.
With this patch it's more closely match common-sense logic without
the need of reitteration of too big changes in lvm2 core ATM.
TODO: in the future there should be allocator solving all allocations
in a single call.
Drop very old original format of VDO target and focus on V2 version.
So some variables were renamed or replaced.
There is no compatibility preserved (with assumption so far this is
experimental feature and there is no real user).
Note - version currently VDO calls this version 6.2.
This is a followup patch to commit edb72cb70c
to support related lvm2 test suite tests.
A 'global/support_mirrored_mirror_log' bool configuration variable gets
introduced allowing the creation of, or conversion to mirrored 'mirror'
logs if set. The capability to create these in turn allows the rest of
the tests to perform activation of such existing LVs and their conversions
to disk/core 'mirror' logs.
Display a disclaimer warning if enabled that this is not for regular use.
Add definition of the enabled config option to respective test scripts.
Related: rhbz1643562
Scenario: Given an existed LV `lvol0`, I want to create another LV
on the PVs used by `lvol0`.
I use `build_parallel_areas_from_lv()` to obtain the `pv_list` of each segments.
However, the returned `pv_list` is not properly initialized, which causes
segfault in subsequent operations.
There's a small window during creation of a new RaidLV when
rmeta SubLVs are made visible to wipe them in order to prevent
erroneous discovery of stale RAID metadata. In case a crash
prevents the SubLVs from being committed hidden after such
wiping, the RaidLV can still be activated with the SubLVs visible.
During deactivation though, a deadlock occurs because the visible
SubLVs are deactivated before the RaidLV.
The patch adds _check_raid_sublvs to the raid validation in merge.c,
an activation check to activate.c (paranoid, because the merge.c check
will prevent activation in case of visible SubLVs) and shares the
existing wiping function _clear_lvs in raid_manip.c moved to lv_manip.c
and renamed to activate_and_wipe_lvlist to remove code duplication.
Whilst on it, introduce activate_and_wipe_lv to share with
(lvconvert|lvchange).c.
Resolves: rhbz1633167
In RHEL7 we marked mirrored mirror logs as deprecated and
added a related message. This patch prohibits creating new
'mirror' LVs with that log type or converting existing LVs
to have one.
Existing LVs with mirrored mirror log can be activated
and converted to disk/core logs.
Avoid double deprecation message when running lvconvert.
Resolves: rhbz1643562
. When using default settings, this commit should change
nothing. The first PE continues to be placed at 1 MiB
resulting in a metadata area size of 1020 KiB (for
4K page sizes; slightly smaller for larger page sizes.)
. When default_data_alignment is disabled in lvm.conf,
align pe_start at 1 MiB, based on a default metadata area
size that adapts to the page size. Previously, disabling
this option would result in mda_size that was too small
for common use, and produced a 64 KiB aligned pe_start.
. Customized pe_start and mda_size values continue to be
set as before in lvm.conf and command line.
. Remove the configure option for setting default_data_alignment
at build time.
. Improve alignment related option descriptions.
. Add section about alignment to pvcreate man page.
Previously, DEFAULT_PVMETADATASIZE was 255 sectors.
However, the fact that the config setting named
"default_data_alignment" has a default value of 1 (MiB)
meant that DEFAULT_PVMETADATASIZE was having no effect.
The metadata area size is the space between the start of
the metadata area (page size offset from the start of the
device) and the first PE (1 MiB by default due to
default_data_alignment 1.) The result is a 1020 KiB metadata
area on machines with 4KiB page size (1024 KiB - 4 KiB),
and smaller on machines with larger page size.
If default_data_alignment was set to 0 (disabled), then
DEFAULT_PVMETADATASIZE 255 would take effect, and produce a
metadata area that was 188 KiB and pe_start of 192 KiB.
This was too small for common use.
This is fixed by making the default metadata area size a
computed value that matches the value produced by
default_data_alignment.
If a single, standard LV is specified as the cache, use
it directly instead of converting it into a cache-pool
object with two separate LVs (for data and metadata).
With a single LV as the cache, lvm will use blocks at the
beginning for metadata, and the rest for data. Separate
dm linear devices are set up to point at the metadata and
data areas of the LV. These dm devs are given to the
dm-cache target to use.
The single LV cache cannot be resized without recreating it.
If the --poolmetadata option is used to specify an LV for
metadata, then a cache pool will be created (with separate
LVs for data and metadata.)
Usage:
$ lvcreate -n main -L 128M vg /dev/loop0
$ lvcreate -n fast -L 64M vg /dev/loop1
$ lvs -a vg
LV VG Attr LSize Type Devices
main vg -wi-a----- 128.00m linear /dev/loop0(0)
fast vg -wi-a----- 64.00m linear /dev/loop1(0)
$ lvconvert --type cache --cachepool fast vg/main
$ lvs -a vg
LV VG Attr LSize Origin Pool Type Devices
[fast] vg Cwi---C--- 64.00m linear /dev/loop1(0)
main vg Cwi---C--- 128.00m [main_corig] [fast] cache main_corig(0)
[main_corig] vg owi---C--- 128.00m linear /dev/loop0(0)
$ lvchange -ay vg/main
$ dmsetup ls
vg-fast_cdata (253:4)
vg-fast_cmeta (253:5)
vg-main_corig (253:6)
vg-main (253:24)
vg-fast (253:3)
$ dmsetup table
vg-fast_cdata: 0 98304 linear 253:3 32768
vg-fast_cmeta: 0 32768 linear 253:3 0
vg-main_corig: 0 262144 linear 7:0 2048
vg-main: 0 262144 cache 253:5 253:4 253:6 128 2 metadata2 writethrough mq 0
vg-fast: 0 131072 linear 7:1 2048
$ lvchange -an vg/min
$ lvconvert --splitcache vg/main
$ lvs -a vg
LV VG Attr LSize Type Devices
fast vg -wi------- 64.00m linear /dev/loop1(0)
main vg -wi------- 128.00m linear /dev/loop0(0)
lvm uses a bcache block size of 128K. A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated. How much depends on
alignments. When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.
This patch is a quick hack to prevent lvm from writing
past the end of the metadata area.
This reverts commit 16ae968d24.
We need to come up with a better fix, because we fall short
wiping all known signatures when not using the wipe_lv API.
lvm metadata writes, commits and activations are performed
for (newly) allocated RAID metadata SubLVs to wipe any preexisiting
data thus avoid false raid superblock positives on RaidLV activation.
This process can be interrupted by command or system crashs
thus leaving stale SubLVs in the lvm metadata as a problem.
Because we hold an exclusive lock in this metadata SubLV wiping
process, we can address this problem by avoiding aforementioned
commits/writes/activations altogether wiping the respective first
sector of the first physical extent allocated to any metadata SubLV
directly via the existing dev_set() API.
Succeeds all LVM RAID tests.
Related: rhbz1633167
Allow "lvconvert --type linear RaidLV" on a raid4 LV
providing convenient interim steps to convert to linear.
Add respective new test
lvconvert-raid-takeover-raid4_to_linear.sh
and
lvconvert-raid-takeover-linear_to_raid4.sh
for linear to raid4 once on it.
When converting from striped/raid0/raid0_meta
to raid6 with > 2 stripes, allow possible
direct conversion (to raid6_n_6).
In case of 2 stripes, first convert to raid5_n to restripe
to at least 3 data stripes (the raid6 minimum in lvm2) in
a second conversion before finally converting to raid6_n_6.
As before, raid6_n_6 then can be converted
to any other raid6 layout.
Enhance lvconvert-raid-takeover.sh to test the
2 stripes conversions to raid6.
Resolves: rhbz1624038
"lvconvert --type linear RaidLV" on striped and raid4/5/6/10
have to provide the convenient interim layouts. Fix involves
a cleanup to the convenience type function.
As a result of testing, add missing sync waits to
lvconvert-raid-reshape-linear_to_raid6-single-type.sh.
Resolves: rhbz1447809
Conversion to striped from raid0/raid0_meta is directly possible.
Fix a regression setting superfluous interim raid5_n conversion type
introduced by commit bd7cdd0b09.
Add new test script lvconvert-raid0-striped.sh.
Resolves: rhbz1608067
With improved mirror activation code --splitmirror issue poppedup
since there was missing proper preload code and deactivation
for splitted mirror leg.
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
Support vgchange usage with VDO segtype.
Also changing extent size need small update for vdo virtual extent.
TODO: API needs enhancements so it's not about adding ifs() everywhere.
When user create vdo-pool - use different automatic name.
So unlike with traditional LVs using lvol0, lvol1
use vpool0, vpool1...
TODO: apply similar for thin-pool & cache-pool...
When allocating thin-pool with more then 1 device - try to
allocate 'metadataLV' with reuse of log-type allocation for mirror LV.
It should be naturally place on other device then 'dataLV'.
However due to somewhat hard to follow allocation logic code,
it's been rejected allocation in cases where there was not
enough space for data or metadata on single PV, thus to successed,
usage of segments was mandatory.
While user may use:
allocation/thin_pool_metadata_require_separate_pvs=1
to enforce separe meta and data LV - on default settings, this is not
enable thus segment allocation is meant to work.
NOTE:
As already said - the original intention of this whole 'if()' is unclear,
so try to split this test into multiple more simple tests that are more readable.
TODO: more validation.
Allow creation of any virtual segment type with just --virtualsize
specified without any real extent size give.
TODO: likely --type error,zero might be later enhanced to use -V
(along with -L) - but since those targets do not allocate real
space, supporting -V makes sense with them.
It's no longer needed. Clustered VGs are now handled in
the same way as foreign VGs, and as shared VGs that
can't be accessed:
- A command processing all VGs sees a clustered VG,
prints a message ("Skipping clustered VG foo."),
skips it, and does not fail.
- A command where the clustered VG is explicitly
named on the command line, prints a message and fails.
"Cannot access clustered VG foo, see lvmlockd(8)."
The option is listed in the set of ignored options for
the commands that previously accepted it. (Removing it
entirely would cause commands/scripts to fail if they
set it.)
The previous method for forcibly changing a clustered VG
to a local VG involved using -cn and locking_type 0.
Since those options are deprecated, replace it with
the same command used for other forced lock type changes:
vgchange --locktype none --lockopt force.
vgreduce, vgremove and vgcfgrestore were acquiring
the orphan lock in the midst of command processing
instead of at the start of the command. (The orphan
lock moved to being acquired at the start of the
command back when pvcreate/vgcreate/vgextend were
reworked based on pvcreate_each_device.)
vgsplit also needed a small update to avoid reacquiring
a VG lock that it already held (for the new VG name).
A few places were calling a function to check if a
VG lock was held. The only place it was actually
needed is for pvcreate which wants to do its own
locking (and scanning) around process_each_pv.
The locking/scanning exceptions for pvcreate in
process_each_pv/vg_read can be enabled by just passing
a couple of flags instead of checking if the VG is
already locked. This also means that these special
cases won't be enabled unknowingly in other places
where they shouldn't be used.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
"lvconvert --type {linear|striped|raid*} ..." on a striped/linear
LV provides convenience interim type to convert to the requested
final layout similar to the given raid* <-> raid* conveninece types.
Whilst on it, add missing raid5_n convenince type from raid5* to raid10.
Resolves: rhbz1439925
Resolves: rhbz1447809
Resolves: rhbz1573255
In this command, lvcreate creates a new LV and then combines
it with an existing cache pool, producing a cache LV. This
command was previously not allowed in in a shared VG.
When the lvmlockd lock is shared, upgrade it to ex
when repair (writing) is needed during vg_read.
Pass the lockd state through additional read-related
functions so the instances of repair scattered through
vg_read can be handled.
(Temporary solution until the ad hoc repairs can be
pulled out of vg_read into a top level, centralized
repair function.)
This minor patch fixes grammar in a few messages which get
printed to users. It also fixes the same grammar mistake in
several comments.
Signed-off-by: Rick Elrod <relrod@redhat.com>
--
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
The clvmd saved_vg data is independent from the normal lvm
lvmcache vginfo data, so separate saved_vg from vginfo.
Normal lvm doesn't need to use save_vg at all, and in clvmd,
lvmcache changes on vginfo can be made without worrying
about unwanted effects on saved_vg.
In case "lvconvert -mN RaidLV" was used on a degraded
raid1 LV, success was returned instead of an error.
Provide message to inform about the need to repair first
before changing number of mirrors and exit with error.
Add new lvconvert-m-raid1-degraded.sh test.
Resolves: rhbz1573960
There are likely more bits of code that can be removed,
e.g. lvm1/pool-specific bits of code that were identified
using FMT flags.
The vgconvert command can likely be reduced further.
The lvm1-specific config settings should probably have
some other fields set for proper deprecation.
The mixed up vg repair code in vg_read was trying
to repair a vg when vg_read was called by clvmd.
The clvmd daemon isn't supposed to be repairing
or writing a vg.
(This is a temporary workaround; vg repair will soon
be pulled out of vg_read so it can be called in a
controlled way and consolidated instead of spread
around.)
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.
Avoid using same return code for reporting 2 different things
and stricly report error code by return value and add new
parameter for reporting monitoring status.
This makes easier to recognize which error we got from dm_event
and continue only with ENOENT.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
Even after writing some metadata encountered problems, some commands
continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
In some cases the message could be slightly misleading so use
here rather conditional.
TODO:
In future we may possibly further tune the message in case we are
certain the level of redundancy protection has not been reduced.
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
If the suspend/resume sequence would leave some device in suspend
for possible later resume, backup cannot be takes (fs holding backups
could be still frozen in critical section())
Move check for presence of raid4 into the right place
so there is no way how to hit activation of any LV
with raid4 on kernel which does not support it.
Commit 763db8aab0 rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Resolves: rhbz1511047
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets just created
'layer_lv' without segment, it needs to be hidden
from vg->lvs during processing of _align_segment_boundary_to_pe_range()
as this function calls lv_validate() and now requires
vg to be consistent. LV is then put back into vg->lvs.
Since 4fa5add6b1 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)
vgsplit shares the vg_rename code so that must only set the PV_MOVED_VG
flag introduced in commit 486ed10848
("vgmerge: Fix intermediate metadata corruption") on PVs that moved.
Since both lvcreate and lvconvert needs to check for same
type of allowed origin for snapshot - move the code into
a single function.
This way we also fix several inconsitencies where snapshot
has been allowed by mistake either through lvcreate or
lvconvert path.
Converting from one raid level to another, no changes
of stripes or stripesize can be requested because those
are subject to reshaping. I.e. the process requires to
takeover first and secondly request raid algorithm,
stripe or stripesize changes.
Ignore any related changes display warninngs
and proceed with the takeover.
Without this patch, a takeover requesting
stripesize change causes data corruption!
Do not allow to take snapshot of mirror/raid leg or log or metadata LV.
This was actually never supported, but user was able to create it,
and this put device stack in hardly fixable state (needs manual work).
This prevents such creation to pass.
Also improve validation when recreating snapshot volume type
from origin and COW volume.
Replaced the confusing device error message "not found (or ignored by
filtering)" by either "not found" or "excluded by a filter".
(Later we should be able to say which filter.)
Left the the liblvm code paths alone.
Fixes the following case with 3PVs and 3 legs "mirror" LV:
# lvcreate -l100%FREE --type mirror -m2 vg3
Insufficient free space for log allocation for logical volume .
Unable to allocate extents for mirror log.
Related: rhbz1269533
Creating striped RaidLVs with lv size not divisible by region size
caused the region size to be adjusted:
# lvcreate --type raid5 -n region_check.32.00m_3 -i 3 -L 1g --nosync -R 32.00m raid_sanity
Using default stripesize 64.00 KiB.
Rounding size 1.00 GiB (256 extents) up to stripe boundary size <1.01 GiB(258 extents).
WARNING: New raid5 won't be synchronised. Don't read what you didn't write!
Using reduced mirror region size of 8.00 MiB
Logical volume region_check.32.00m_3 created.
Fix by not imposing "mirror" constraints on "raid".
Resolves: rhbz1404007
vgmerge suffers from a similar problem to the one fixed in commit
8146548d25 ("vgsplit: Fix intermediate
metadata corruption.")
When merging, splitting or renaming VGs, use a new PV status flag
PV_MOVED_VG to mark the PVs that hold metadata with the old VG name and
use this to provide PV-level granularity instead of incorrectly assuming
all PVs in the VG are the same.
In a shared VG, only allow pvmove with a named LV,
so that only PE's used by the LV will be moved.
The LV is then activated exclusively, ensuring that
the PE's being moved are not used from another host.
Previously, pvmove was mistakenly allowed on a full PV.
This won't work when LVs using that PV are active on
other hosts.
lvm2 warned about zeroing and too big chunksize (>=512KiB), but
only during lvconvert, so lvcreate was creating thin-pools
without any warning about possible slowness of thin provisioning
because of zeroing.
Since _deactivate_and_remove_lvs() is used in more then one place,
move the needed udev synchronization into this function so other
users automatically get correct fs state before next dm manipulation.
Assumption here is that this udev synchronization 'delay' may also
prevent to 'early' table reloads which might cause kernel problems
for md-core - but we may need more generic time-limited reload
frequency for raid devices.
Note: on udev-less system there will be almost no delay.
Commit 34504855a7 introduced
flag LV_RESHAPE_DATA_OFFSET and used it to avoid incompatible
activation on older runtime.
Enhance vg_validate() raid checking functions with checks for it.
In order to reject out of place reshaping with segment data_offset
field on old runtime, add a respective segment type incompatibility
flag causing "+RESHAPE_DATA_OFFSET" to be suffixed to the segment
type name.
When reshape space is allocated anew, an update and reload is needed to
promote the new size to the cluster node with the exclusively active RaidLV
or reloading the RaidLV will fail with a size related error. Additionally,
store "data_offset <sectors>" with the RaidLV in the lvm2 metadata so that
it can be retrieved on cluster nodes.
Process allocation of reshape space on a 2-legged raid4/5 (interim layout
to convert from/to linear via raid1) properly in the cluster.
Resolves: rhbz1461562
Resolves: rhbz1448116
If the activation step in lvcreate fails (e.g. the specified
minor number is already used), then the lvcreate is reverted,
but the LV lock in lvmlockd was not being unlocked or properly
freed.
With commit 41c10034aa we actually
do require LV to be used with _vg_write_lv_suspend_commit_backup().
So write a proper separte single wrapper for write && commit && backup.
Since we discovered status reporting from 'md' goes from large set
of weird states we can't just decided based on this word.
So let it pass for rebuild and idle as well
and check for health devices afterwards.
When raid leg rimage device is marked as 'D'ead by mdcore,
lvm2 was not able to replace such device with allocate policy,
as device has not appared as missing.
Add detection of transiently failing devices.
Basically reverting commit 58a9f88b8c.
We can use origin_only in case we are snapshot's origin,
as we do support this stack.
So when we are 'uncaching' origin+snaps - we do need to reload only
origin and we do not need to play with snaps.
Handle change of 'region size' better and follow also standard rule
if the command can't success (i.e. size is already same) we return
error for all such cases.
Also log_pring more info about adjusted value (just like we
do for rounding)
Also avoid keep pointers on 'display_*' values - they are in
ringbuffer for immediate use - not to be kept across multiple calls
(as they could be already overwritten by later calls) - so dropped
seg_region_size_str
Since cache LV can be a stacked device, there is no real reason
trying to use slight optimised tree for origin_only cache reload
(it could be even wrongly implemented in this case).
We can easily go with stardard tree load here.
When user runs command like 'lvconvert --splitcache' the operation
might be actually either slow or not making any progress in kernel,
so lets give user a chance to abort such operation.
When user press 'Ctrl+C' device table is restored to pre-flushing state.
Remove explicit activation of SubLVs and let lv_update_and_reload()
perform the proper (pre-)loading sequencing of tables.
This avoids related callback functions which are removed.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
When lock-holding LV differs from actually request locked LV,
we drop origin_only flag as it has no use - it'd be applied
on completely different LV.
Example of problem:
Raid is thin-pool _tdata LV.
Raid run origin_only locking on stacked device.
As lock holder is discovered thinLV.
Whole origin_only operation is then applied only on thinLV
changing the meaning of whole operation.
NOTE: this patch does not change anything for LV that are
already top-level lock holding LVs (i.e. thinLVs, snahoshots/origins).
Disable until we have a proper fix for reshape space allocation,
switching it to begin/end of rimages and activation in the cluster.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
Enhance reporting code, so it does not need to do 'extra' ioctl to
get 'status' of normal raid and provide percentage directly.
When we have 'merging' snapshot into raid origin, we still need to get
this secondary number with extra status call - however, since 'raid'
is always a single segment LV - we may skip 'copy_percent' call as
we directly know the percent and also with better precision.
NOTE: for mirror we still base reported number on the percetage of
transferred extents which might get quite imprecisse if big size
of extent is used while volume itself is smaller as reporting jump
steps are much bigger the actual reported number provides.
2nd.NOTE: raid lvs line report already requires quite a few extra status
calls for the same device - but fix will be need slight code improval.
For the test clean-up, I was providing too many devices to the first
command - possibly allowing it to allocate in the wrong place. I was
also not providing a device for the second command - virtually ensuring
the test was not performing correctly at times.
This patch ensures that under normal conditions (i.e. not during repair
operations) that users are prevented from removing devices that would
cause data loss.
When a RAID1 is undergoing its initial sync, it is ok to remove all but
one of the images because they have all existed since creation and
contain all the data written since the array was created. OTOH, if the
RAID1 was created as a result of an up-convert from linear, it is very
important not to let the user remove the primary image (the source of
all the data). They should be allowed to remove any devices they want
and as many as they want as long as one original (primary) device is left
during a "recover" (aka up-convert).
This fixes bug 1461187 and includes the necessary regression tests.
Add the checks necessary to distiguish the state of a RAID when the primary
source for syncing fails during the "recover" process.
It has been possible to hit this condition before (like when converting from
2-way RAID1 to 3-way and having the first two devices die during the "recover"
process). However, this condition is now more likely since we treat linear ->
RAID1 conversions as "recover" now - so it is especially important we cleanly
handle this condition.
Previously, we were treating non-RAID to RAID up-converts as a "resync"
operation. (The most common example being 'linear -> RAID1'.) RAID to
RAID up-converts or rebuilds of specific RAID images are properly treated
as a "recover" operation.
Since we were treating some up-convert operations as "resync", it was
possible to have scenarios where data corruption or data loss were
possibilities if the RAID hadn't been able to sync completely before a
loss of the primary source devices. In order to ensure that the user took
the proper precautions in such scenarios, we required a '--force' option
to be present. Unfortuneately, the force option was rendered useless
because there was no way to distiguish the failure state of a potentially
destructive repair from a nominal one - making the '--force' option a
requirement for any RAID1 repair!
We now treat non-RAID to RAID up-converts properly as "recover" operations.
This eliminates the scenarios that can potentially cause data loss or
data corruption; and this eliminates the need for the '--force' requirement.
This patch removes the requirement to specify '--force' for RAID repairs.
Two of the sync actions performed by the kernel (aka MD runtime) are
"resync" and "recover". The "resync" refers to when an entirely new array
is going through the process of initializing (or resynchronizing after an
unexpected shutdown). The "recover" is the process of initializing a new
member device to the array. So, a brand new array with all new devices
will undergo "resync". An array with replaced or added sub-LVs will undergo
"recover".
These two states are treated very differently when failures happen. If any
device is lost or replaced while "resync", there are no worries. This is
because any writes created from the inception of the array have occurred to
all the devices and can be safely recovered. Even though non-initialized
portions will still be resync'ed with uninitialized data, it is ok. However,
if a pre-existing device is lost (aka, the original linear device in a
linear -> raid1 convert) during a "recover", data loss can be the result.
Thus, writes are errored by the kernel and recovery is halted. The failed
device must be restored or removed. This is the correct behavior.
Unfortunately, we were treating an up-convert from linear as a "resync"
when we should have been treating it as a "recover". This patch
removes the special case for linear upconvert. It allows each new image
sub-LV to be marked with a rebuild flag and treats the array as 'in-sync'.
This has the correct effect of causing the upconvert to be treated as a
"recover" rather than a "resync". There is no need to flag these two states
differently in LVM metadata, because they are already considered differently
by the kernel RAID metadata. (Any activation/deactivation will properly
resume the "recover" process and not a "resync" process.)
We make this behavior change based on the presense of dm-raid target
version 1.9.0+.
On conversion from raid10 to raid0 (takeover), all rmeta
devices and the rimage devices of mirrored stripes are
detached from the raid10 LV. The remaining rimage areas
are being shifted down into the slots of the detached
ones hence requiring renames to show proper _N suffix
sequences (e.g. 0,1,2,3 instead of 0,2,4,6). Only the
top-level raid10 LV has a cluster lock, not the detached
SubLVs thus their deactivation is impossible and e.g the
rename from *_rimage_6 to *_rimage_3 will fail. Fix by
activating exclusively before deactivating and removing.
Resolves: rhbz1448123
Prohibit activation of reshaping RaidLVs on incompatible
lvm2 runtime by storing e.g. 'raid5+RESHAPE' segment type
strings in the lvm2 metadata. Incompatible runtime not
supporting reshaping won't be able to activate those thus
avoiding potential data corruption.
Any new non-reshaping lvconvert command will reset the
segment type string from 'raid5+RESHAPE' to 'raid5'.
See commits
0299a7af1e and
4141409eb0
for segtype flag support.
When a combination of thin-pool chunk size and thin-pool data size
goes beyond addressable limit, such volume creation is directly
prohibited.
Maximum usable thin-pool size is calculated with use of maximal support
metadata size (even when it's created smaller) and given chunk-size.
If the value data size is found to be too big, the command reports
error and operation fails.
Previously thin-pool was created however lots of thin-pool data LV was
not usable and this space in VG has been wasted.
Only support RAID conversions on active LVs.
If we'd accept e.g. upconverting linear -> raid1 on inactive
linear LVs, any LV flags passed to the kernel aren't properly
cleared thus errouneously passing them on every activation.
Add respective check to lv_raid_change_image_count() and
move existing one in lv_raid_convert() for better messages.
Warn about a PV that has the in-use flag set, but appears in
the orphan VG (no VG was found referencing it.)
There are a number of conditions that could lead to this:
. The PV was created with no mdas and is used in a VG with
other PVs (with metadata) that have not yet appeared on
the system. So, no VG metadata is found by lvm which
references the in-use PV with no mdas.
. vgremove could have failed after clearing mdas but
before clearing the in-use flag. In this case, the
in-use flag needs to be manually cleared on the PV.
. The PV may have damanged/unrecognized VG metadata
that lvm could not read.
. The PV may have no mdas, and the PVs with the metadata
may have damaged/unrecognized metadata.
A PV holding VG metadata that lvm can't understand
(e.g. damaged, checksum error, unrecognized flag)
will appear as an in-use orphan, and will be cleared
by this repair code. Disable this repair until the
code can keep track of these problematic PVs, and
distinguish them from actual in-use orphans.
Reject any stripe adding/removing reshape on raid4/5/6/10 because
of related MD kernel deadlock on single core systems until
we get a proper fix in MD.
Related: rhbz1443999
Commit 5fe07d3574 failed to set raid5 types
properly on conversions from raid6. It always enforced raid6_ls_6
for types raid6/raid6_zr/raid6_nr/raid6_nc, thus requiring 3 conversions
instead of 2 when asking for raid5_{la,rs,ra,n}.
Related: rhbz1439403
Offer possible interim LV types and display their aliases
(e.g. raid5 and raid5_ls) for all conversions between
striped and any raid LVs in case user requests a type
not suitable to direct conversion.
E.g. running "lvconvert --type raid5 LV" on a striped
LV will replace raid5 aka raid5_ls (rotating parity)
with raid5_n (dedicated parity on last image).
User is asked to repeat the lvconvert command to get to the
requested LV type (raid5 aka raid5_ls in this example)
when such replacement occurs.
Resolves: rhbz1439403
_check_reappeared_pv() incorrectly clears the MISSING_PV flags of
PVs with unknown devices.
While one caller avoids passing such PVs into the function, the other
doesn't. Move the check inside the function so it's not forgotten.
Without this patch, if the normal VG reading code tries to repair
inconsistent metadata while there is an unknown PV, it incorrectly
considers the missing PVs no longer to be missing and produces
incorrect 'pvs' output omitting the missing PV, for example.
Easy reproducer:
Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
Reinstate the hidden PV pv2 and at the same time hide a different PV
pv3.
Run 'pvs' - incorrect output.
Run 'pvs' again - correct output.
See https://bugzilla.redhat.com/1434054
There are certain situations (not fully understood)
where is_missing_pv() is false, but pv->dev is NULL,
so this adds a check for NULL pv->dev after is_missing_pv()
to avoid a segfault.
lvconvert parameters not causing a conversion (i.e. no type,
number of stripes, stripesize or regionsize changes) will
remove any allocated reshape space in which case the command
returns success. If reshape space does not exist though,
return error.
Reshape check failed when regionsize changed and current raid type
was provided with no other change requested (stripes or stripesize).
E.g. "lvconvert --type raid6 --regionsize 256K" on a raid6 LV
with != 256K regionsize.
Enable --type in test script.
Remove any newly allocated sub LV (pair) remnants in case
allocation fails due to lag of (parallel) free PV space
and keep initial raid type.
Resolves: rhbz1438013
Avoid error message
"Logical Volume *_rimage_0 already exists in volume group,,,"
on takeover conversion from a 2-legged raid1 to raid4
(aiming to reshape it adding images).
Resolves: rhbz1439398
Requesting _raid_remove_images() to commit the
metadata missed to reload the origin causing a
kernel takeover error converting a 2-legged raid1
(with previously removed images) to raid5.
Allow the combination of both arguments keeping
the raid level but changing the regionssize
(e.g. "lvconvert --type raid1 --regionsize 1M RaidLV"
on an existing raid1 LV).
Resolves: rhbz1438396
Removing some unused new lines and changing some incorrect "can't
release until this is fixed" comments. Rename license.txt to make
it clear its merely an included file, not itself a licence.
This reverts commit 1e4462dbfb
in favour of an enhanced solution avoiding changes in liblvm
completetly by checking the target versions in libdm and emitting
the respective parameter lines.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
Starting with dm-raid target version 1.9.0 shrinking of mapped devices is supported.
Check for support being present in lvresize and lvreduce.
Related: rhbz1394048
Quering non-thin-pool segment for discard property may lead
to intenal error if the segment had set 'out-of-range' value,
so only thin-pool is allowed, for other it returns NULL.
If SubLVs to be removed still exist after an image removing
conversion (i.e. "lvconvert --yes --force --stripes N "
with N < total stripes) any request to convert to a different
striped/raid* level has to be rejected until after those freed
SubLVs got removed by running the aforementioned lvconvert again.
Add tests to check conversion to striped/raid* gets rejected.
Enhance a test comment.
Related: rhbz1191935
Related: rhbz1366296
Only cache-pool segtype may store cache_metadata_format.
Only supported values are 0,1,2
Format 2 requires LV status uses LV_METADATA_FORMAT.
Format 0 (unselected) or 1 shall not set this 'incompatible' status.
Cache pool read/writes metadata_format within its segment type..
For CachePoolLV unselected metadata format is NOT stored in metadata.
For CacheLV when metadata format is not present/selected in lvm2 metadata,
it's automatically assumed to be the version 1 (backward compatible).
To ensure older lvm2 will not 'miss-read' metadata with new version 2,
such LV is marked with METADATA_FORMAT status flag (segment is
specifying metadata format). So when cache uses metadata format 2,
it will become inaccesible on older system without such support.
(kernel dm cache < 1.10, lvm2 < 2.02.169).
Add new profilable configation setting to let user select
which metadata format of a created cache pool he wish to use.
By default the 'best' available format is autodetected at runtime,
but user may enforce format 1 or 2 ATM.
Code also detects availability for metadata2 supporting cache target.
In case of troubles user may easily Disable usage of this feature
by placing 'metadata2' into global/cache_disabled_features list.
As now we can properly recognize all paramerters for pool creation,
we may drop PASS_ARG_ defines and rely on '_UNSELECTED' or 0 entries
as being those without user given args.
When setting are not given on command line - 'update' function
fill them from profiles or configuration. For this 'profile' arg
was needed to be passed around and since 'VG' itself is not needed,
it's been all replaced with 'cmd, profile, extents_size' args.
Since cache chunk might be huge and there is no technical need
to enforce rounding and there is actually more 'real' VG space
used then necessary - keep rounding on 'chunk' bounrary only
for thin volumes - where it's the space used anyway.
NB: we support conversion of any-size 'existing' LV into cached LV.
Fix missing reset of '*settings' pointer when no args were given.
Handle cache_chunk settings like all other settings, so it is properly
updated only with non-zero settings and the existing cache-pool
chunk_size is not being reconfigured.
User can specify metadata profile which stores important cache
geometry data for easy configuration.
Fix missing support for getting chunk_size, cache_mode, cache_policy
for a cache/cache pools volumes from configuration or metadata profile.
To more easily recognize unselected state from select '0' state
add new 'THIN_ZERO_UNSELECTED' enum.
Same applies to THIN_DISCARDS_UNSELECTED.
For those we no longer need to use PASS_ARG_ZERO or PASS_ARG_DISCARDS.
Basically code moving operation to have a single place resolving
thin_pool_chunk_size_policy.
Supported are generic & performance profiles.
Function is now shared between thin manipulation code and configuration
_CFG logic to obtain defaults and handle correct reporting upward coding
stack.
In addition to the already supported conversion between 2-legged
raid1 and raid5, raid1 and raid4 can be also converted into each
other with 2 legs (raid4/5 are limited to map a 2-legged raid1).
This patch supports the missing raid4 conversion in the sequence
linear -> 2-legged raid1 -> raid4/5, then restripe to more than one
data stripes for performance and resilience reasons and optionally
convert to striped/raid0.
The other conversion sequence is also possible by converting N-way
striped/raid0 to raid4/5, then restripe to 2 legs followed by a
conversion to raid1 and optionally to linear (loosing all resilience).
On conversion from striped to raid0, data LVs are created
and all segments and their respective areas of the striped
LV are moved across to new segments allocated for the raid0
image LVs. This can cause non-canonical segments to be added
to the image LVs.
Add a call to lv_merge_segments() once all segments have been
added to an image LV to compensate for that. This avoids
unsafe table loads on activation.
Fix comments.
Splitting off an image LV of a 2-legged
raid1 LV causes loss of resilience.
Ask user to avoid uninformed loss of all resilience.
Don't ask for N > 2 legged raid1 LVs.
Adjust tests.
Splitting off an image LV of a 2-legged raid1 LV tracking changes
causes loosing partial resilience for any newly written data set.
Full resilience will be provided again after the split off image LV
got merged back in and the new data set got fully synchronized.
Reason being that the data is only stored on the remaining single
writable image during the split.
Ask user to avoid uninformed loss of such partial resilience.
Don't ask for N > 2 legged raid1 LVs.
In case N images fail (N <= parity chunks) _and_
a "vgreduce --removemissing --force VG" was applied
a following repair of the RaidLV fails:
Unable to remove N images: Only 0 devices given.
Failed to remove the specified images from tb/r.
Failed to replace faulty devices in tb/r.
Fix as of this commit results in correct repair:
Faulty devices in tb/r successfully replaced.
seg->extents_copied has to be defined properly on reducing
the size of a raid LV or conversion from raid5 with 1 stripe
to raid1 will fail.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The lv_extend/_lv_reduce API doesn't cope with resizing RaidLVs
with allocated reshape space and ongoing conversions. Prohibit
resizing during conversions and remove the reshape space before
processing resize. Add missing seg->data_copies initialisation.
Fix typo/comment.
For the time being raid10 is limited to even number of total stripes
as is and 2 data copies. The number of stripes provided on creation
of a raid10(_near) LV with -i/--stripes gets doubled to define
that even total number of stripes (i.e. images).
Apply the same on disk adding conversions (reshapes) with
"lvconvert --stripes RaidLV" (e.g. 2 stripes = 4 images
total converted to 3 stripes = 6 images total).
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Enhance the raid report functions for the recently added LV fields
reshape_len, reshape_len_le, data_offset, new_data_offset, data_copies,
data_stripes and parity_chunks to cope with "lvs --select".
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
An initialization was missing when converting striped to raid0(_meta)
causing unitialized reshape_len in the new component LVs first segment.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The imposed minimum region size can cause rejection on
disk removing reshapes. Lower it to avoid that.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
When requesting a regionsize change during conversions, check
for constraints or the command may fail in the kernel n case
the region size is too smalle or too large thus leaving any
new SubLVs behind.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Allow regionsize on upconvert from linear:
fix related commit 2574d3257a to actually work
Related: rhbz1394427
Remove setting raid5_n on conversions from raid1
as of commit 932db3db53 because any raid5 mapping
may be requested.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces the changes to call the reshaping infratructure
from lv_raid_convert().
Changes:
- add reshaping calls from lv_raid_convert()
- add command definitons for reshaping to tools/command-lines.in
- fix raid_rimage_extents()
- add 2 new test scripts lvconvert-raid-reshape-linear_to_striped.sh
and lvconvert-raid-reshape-striped_to_linear.sh to test
the linear <-> striped multi-step conversions
- add lvconvert-raid-reshape.sh reshaping tests
- enhance lvconvert-raid-takeover.sh with new raid10 tests
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- allow raid_rimage_extents() to calculate raid10
- remove an __unused__ attribute
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- add missing raid1 <-> raid5 conversions to support
linear <-> raid5 <-> raid0(_meta)/striped conversions
- rename related new takeover functions to
_takeover_from_raid1_to_raid5 and _takeover_from_raid5_to_raid1,
because a reshape to > 2 legs is only possible with
raid5 layout
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- enhance _clear_meta_lvs() to support raid0 allowing
raid0_meta -> raid10 conversions to succeed by clearing
the raid0 rmeta images or the kernel will fail because
of discovering reordered raid devices
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- enhance _raid45610_to_raid0_or_striped_wrapper() to support
raid5_n with 2 areas to raid1 conversion to allow for
striped/raid0(_meta)/raid4/5/6 -> raid1/linear conversions;
rename it to _takeover_downconvert_wrapper to discontinue the
illegible function name
- enhance _striped_or_raid0_to_raid45610_wrapper() to support
raid1 with 2 areas to raid5* conversions to allow for
linear/raid1 -> striped/raid0(_meta)/raid4/5/6 conversions;
rename it to _takeover_upconvert_wrapper for the same reason
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add missing possible reshape conversions and conversion options
to allow/prohibit changing stripesize or number fo stripes
- enhance setting convenient riad types in reshape conversions
(e.g. raid1 with 2 legs -> radi5_n)
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add _raid_reshape() using the pre/post callbacks and
the stripes add/remove reshape functions introduced before
- and _reshape_requested function checking if a reshape
was requested
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add vg metadata update functions
- add pre and post activation callback functions for
proper sequencing of sub lv activations during reshaping
- move and enhance _lv_update_reload_fns_reset_eliminate_lvs()
to support pre and post activation callbacks
- add _reset_flags_passed_to_kernel() which resets anyxi
rebuild/reshape flags after they have been passed into the kernel
and sets the SubLV remove after reshape flags on legs to be removed
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function to support disk adding reshapes
- add function to support disk removing reshapes
- add function to support layout (e.g. raid5ls -> raid5_rs)
or stripesize reshaping
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function providing state of a reshaped RaidLV
- add function to adjust the size of a RaidLV was
reshaped to add/remove stripes
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add lv_raid_data_copies returning raid type specific number;
needed for raid10 with more than 2 data copies
- remove _shift_and_rename_image_components() constraint
to support more than 10 raid legs
- add function to calculate total rimage length used by out-of-place
reshape space allocation
- add out-of-place reshape space alloc/relocate/free functions
- move _data_rimages_count() used by reshape space alloc/realocate functions
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces local infrastructure to raid_manip.c
used by followup patches.
Add functions:
- to check reshaping is supported in target attibute
- to return device health string needed to check
the raid device is ready to reshape
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces infrastructure prerequisites to be used
by raid_manip.c extensions in followup patches.
This base is needed for allocation of out-of-place
reshape space required by the MD raid personalities to
avoid writing over data in-place when reading off the
current RAID layout or number of legs and writing out
the new layout or to a different number of legs
(i.e. restripe)
Changes:
- add members reshape_len to 'struct lv_segment' to store
out-of-place reshape length per component rimage
- add member data_copies to struct lv_segment
to support more than 2 raid10 data copies
- make alloc_lv_segment() aware of both reshape_len and data_copies
- adjust all alloc_lv_segment() callers to the new API
- add functions to retrieve the current data offset (needed for
out-of-place reshaping space allocation) and the devices count
from the kernel
- make libdm deptree code aware of reshape_len
- add LV flags for disk add/remove reshaping
- support import/export of the new 'struct lv_segment' members
- enhance lv_extend/_lv_reduce to cope with reshape_len
- add seg_is_*/segtype_is_* macros related to reshaping
- add target version check for reshaping
- grow rebuilds/writemostly bitmaps to 246 bit to support kernel maximal
- enhance libdm deptree code to support data_offset (out-of-place reshaping)
and delta_disk (legs add/remove reshaping) target arguments
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The MD kernel raid1 personality does no use any writemostly leg as the primary.
In case a previous linear LV holding data gets upconverted to
raid1 it becomes the primary leg of the new raid1 LV and a full
resynchronization is started to update the new legs.
No writemostly and/or writebehind setting may be allowed during
this initial, full synchronization period of this new raid1 LV
(using the lvchange(8) command), because that would change the
primary (i.e the previous linear LV) thus causing data loss.
lvchange has a bug not preventing this scenario.
Fix rejects setting writemostly and/or writebehind on resychronizing raid1 LVs.
Once we have status in the lvm2 metadata about the linear -> raid upconversion,
we may relax this constraint for other types of resynchronization
(e.g. for user requested "lvchange --resync ").
New lvchange-raid1-writemostly.sh test is added to the test suite.
Resolves: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855895