IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In fact pvmove does support 'clustered-core' target for clustered
pvmove of LVs activated on multiple nodes.
This patch restores support for activation of pvmove on all nodes
for LVs that are also activate on all nodes.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
Add protectional internall error whenever we spot activation
of 'exclusive' only segments in 'non-exclusive' mode.
TODO: possibly the activation locking could be enhanced to handle
this fully behind the scene - as for now this works purely for
lvchange/vgchange activation.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Dedicated functions are now used to process each piece of data obtained,
so the refactoring in this file gives us one for the vgsummary and one
for the metadata header. This new type of function takes two parameters
(for now), the obtained data plus a single struct (that must not
reference any data on the stack) that wraps up the entire context needed
to process it.
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
The creation of wrapped around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
If there is sufficient space in the metadata area, align the next
metadata to a disk offset that is a multiple of 4096 bytes and
don't write it circularly. If it doesn't all fit at the end
of the metadata area, go back to the start and write it all there
contiguously.
If there is insufficient space to use the new stricter rules, revert to
the original behaviour, aligning on 512-byte boundaries wrapping around
the circular buffer as required.
Even after writing some metadata encountered problems, some commands
continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
Use new ALIGN_ABSOLUTE macro when calculating the start location
of new metadata and adjust the end of buffer detection so that
there is no longer an imposed gap between old and new metadata.
Currently both start and offset should always be divisible by alignment,
so this should have no effect, but a later patch will increase alignment
so these variables can no longer be optimised out.
Although it doesn't look like it can be a measurable problem
and costs some time to flip priorities outside of activation window.
So just like with memory locking preserve priority until call
memlock_unlock() appears.
(addition to commit c086dfadc3).
Expand out the metadata wrapping calculations to prepare
to support a larger alignment.
The current alignment is 512 bytes so
(mdac_area_start + rlocn->offset) % alignment is zero.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
In case of failed legs, raid replaces those with
e.g. "vg-lv_rimage_0-missing_0_0" mapped to an error target.
Those errouneously remain on deactivation.
Fix by removing them on deactivation/removal of the RaidLV.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
Use new 3rd. state of trace_pvmove_deps == 2.
In this state we know, we have already seen the node and can skip futher
testing. Remainging value 1 signals we want to track, and value 0
is for ignoring tracking, but node is still checking in this case.
Reduces large amount of duplicate ioctl queries.
Check also all snapshosts when resume is requested,
the origin volume is already resume, but possibly
some subLV or snapshot LV could be suspended if
we are still in critical_section.
When entering any critical section, lvm2 used to lock process memory
and raised task priority to avoid problem with page swapping and minimize
time of having non-resumed devices in table.
With this patch, memory locking which which is expensive is only used when
entering 'suspending' section as only in this section there is risk
lvm could be suspending a device which later can be needed for paging.
Raised priority is still kept for all section entrances as this is
low-cost operation and may accelerate table resumes - although the real
impact can be still considered later.
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
ATM we want to support delayed resume purely in pvmove case.
So have libdm logic internal to recognize difference beween
pvmove and other targets that do use delayed resume.
This fixes problem introduced with commit aa68b898ff
for mirror-on-mirror or snapshot-on-mirror problem.
TODO: likely added new API call and let libdm user select
delayed nodes explicitely.
Use code which detectes handlers in a way, which is more
backward-compatible friendly.
Replace read of 'sysfs' uuid entry with dm ioctl call.
Use /sys/block/dm-X/holders path instead of
new path /sys/dev/block/major:minor/holders.
TODO:
There are few more occurencies of this logic around the code
so some abstract interface should be considered.
In some cases the message could be slightly misleading so use
here rather conditional.
TODO:
In future we may possibly further tune the message in case we are
certain the level of redundancy protection has not been reduced.
When pvmove is finished and does 'suspend/resume' on PVMOVE LV,
on resume path committed metadata are already showing 'standalone'
pvmove LV prepared just for removal.
However code should be able to 'resume' preloaded LV there were
participating in pvmove operation.
Previously this was all done in the 'tools' part of lvm2 code.
So the lvconvert upon pvmove finish had to explicitely call 'resume' on every such LV.
Now 'smarted' activation code is able to deduce and combine all information from
the active dm table and committed metadata so single call resolves
it all in one go.
Internally holders are detected by reading sysfs directory to capture
all needed UUID which are then looked in lvm2 metadata and all such
LVs are automatically collected into dmtree.
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Only thin-pool with origin_only suspend is allowed to be not suspending anything.
In such case pairing resume will 'decrement' critical section counter.
Just like suspend handles preload for pvmove finish,
in similar way handle suspend of starting pvmove.
In this case the precommited metadata are checked for list of PVMOVEed
LVs and those are suspended in with committed metadata.
There is no need to differentiation between clustered VG and normal VG.
As the activation depends on locking type.
Use unconditionally locally exclusive activation for pvmove.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
When activation of LVs fails prior pvmove start, try to deactivate
already activated LVs.
TODO: possibly remember which LVs where already activate and only those
take down - devices which are already in-use will stay active.
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
If the suspend/resume sequence would leave some device in suspend
for possible later resume, backup cannot be takes (fs holding backups
could be still frozen in critical section())
Move check for presence of raid4 into the right place
so there is no way how to hit activation of any LV
with raid4 on kernel which does not support it.
Commit 763db8aab0 rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Resolves: rhbz1511047
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets just created
'layer_lv' without segment, it needs to be hidden
from vg->lvs during processing of _align_segment_boundary_to_pe_range()
as this function calls lv_validate() and now requires
vg to be consistent. LV is then put back into vg->lvs.
Since 4fa5add6b1 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)
When an ignored metadata area gets flagged for use again, make sure the
code doesn't try to parse its old metadata. Firstly by trying to detect
this situation and skipping the read (while still remembering the
position reached in the circular buffer), and secondly by clearing the
invalid live metadata location on disk as a precaution when subsequently
writing out the precommitted metadata.
Problems showed up when a metadata area in one VG got moved to
another VG in ignored state (still holding metadata for the original
VG) and then later got brought into use in the new VG - only the header
should be read in this case, not any of the metadata content.
vgsplit shares the vg_rename code so that must only set the PV_MOVED_VG
flag introduced in commit 486ed10848
("vgmerge: Fix intermediate metadata corruption") on PVs that moved.
Since both lvcreate and lvconvert needs to check for same
type of allowed origin for snapshot - move the code into
a single function.
This way we also fix several inconsitencies where snapshot
has been allowed by mistake either through lvcreate or
lvconvert path.
Converting from one raid level to another, no changes
of stripes or stripesize can be requested because those
are subject to reshaping. I.e. the process requires to
takeover first and secondly request raid algorithm,
stripe or stripesize changes.
Ignore any related changes display warninngs
and proceed with the takeover.
Without this patch, a takeover requesting
stripesize change causes data corruption!
Add explaining message, when command was aborted due to the reach
of configure line number count (LVM_LOG_FILE_MAX_LINES)
for logging (used mainly with testing).
Do not allow to take snapshot of mirror/raid leg or log or metadata LV.
This was actually never supported, but user was able to create it,
and this put device stack in hardly fixable state (needs manual work).
This prevents such creation to pass.
Also improve validation when recreating snapshot volume type
from origin and COW volume.
Replaced the confusing device error message "not found (or ignored by
filtering)" by either "not found" or "excluded by a filter".
(Later we should be able to say which filter.)
Left the the liblvm code paths alone.
Fixes the following case with 3PVs and 3 legs "mirror" LV:
# lvcreate -l100%FREE --type mirror -m2 vg3
Insufficient free space for log allocation for logical volume .
Unable to allocate extents for mirror log.
Related: rhbz1269533
Activation lock has a primary purpose to serialize locking of individual
LV in case there is no other protecting mechanism for parallel
execution.
However in the case an activated LV is composed from several other LVs,
noone should be able to manipulate with those LVs as well.
This patch add a very 'naive' global VG activation locking in this case.
In the future we may introduce smarter function detecting minimal closed
graph components if this will appear as bottleneck
Patch checks if the VG Write lock is held - in this case we do not
need any more locking - command has exclusive access to VG.
In case we have clustered VG and we are activating an LV which does not
need other LVs - we also do not need any more locks.
In all other cases take respective lock - for single LV - use lvid,
for complex LVs use vgname.
Creating striped RaidLVs with lv size not divisible by region size
caused the region size to be adjusted:
# lvcreate --type raid5 -n region_check.32.00m_3 -i 3 -L 1g --nosync -R 32.00m raid_sanity
Using default stripesize 64.00 KiB.
Rounding size 1.00 GiB (256 extents) up to stripe boundary size <1.01 GiB(258 extents).
WARNING: New raid5 won't be synchronised. Don't read what you didn't write!
Using reduced mirror region size of 8.00 MiB
Logical volume region_check.32.00m_3 created.
Fix by not imposing "mirror" constraints on "raid".
Resolves: rhbz1404007
vgmerge suffers from a similar problem to the one fixed in commit
8146548d25 ("vgsplit: Fix intermediate
metadata corruption.")
When merging, splitting or renaming VGs, use a new PV status flag
PV_MOVED_VG to mark the PVs that hold metadata with the old VG name and
use this to provide PV-level granularity instead of incorrectly assuming
all PVs in the VG are the same.
Changing the VG of a PV uses the same on-disk mechanism as vgrename.
This relies on recognising both the old and new VG names. Prior to this
patch the vgsplit code incorrectly provided the new VG name twice
instead of the old and new ones. This lead the low-level mechanism not
to recognise the device as already belonging to a VG and so paying no
attention to the location of its existing metadata, sometimes partly
overwriting it and then later trying to read the corrupt metadata and
issuing a checksum error.
In a shared VG, only allow pvmove with a named LV,
so that only PE's used by the LV will be moved.
The LV is then activated exclusively, ensuring that
the PE's being moved are not used from another host.
Previously, pvmove was mistakenly allowed on a full PV.
This won't work when LVs using that PV are active on
other hosts.
In a shared VG, lvconvert must be used to create thin pools
and cache pools, not the lvcreate variants of those commands.
Deny these cases early in lvcreate using the new command defs.
Denying these cases deeper in the code was missing some
cleanup of the partially completed command.
Revert the lvmlockd.c changes from:
commit 0bf836aa14
"tidy: prefer not using else after return"
The commit introduced at least one regression, which broke
lvcreate of a thin pool in a shared VG.
When file-locking mode failed on locking, such description was leaked
(typically not an issue since command usually exists afterwards).
So shirt close() at the end of function and use it in all error paths.
Also make sure, when interrrupt is detected, it's really not holding
lock and returns 0.
lvmcache_foreach_mda() can fail for numerous reasons
and failing error code cannot be ignored (out-of-memory...)
TODO: might need more error handling tunning.
After the internal lvmlock LV (holding sanlock leases) is
extended to hold more leases, it needs to be zeroed.
sanlock expects to see either zeroed blocks or blocks
initialized with leases.
Fix code checking that the 2nd mda which is at the end of disk really
fits the available free space and avoid any DA and MDA interleaving when
we already have DA preallocated. This mainly applies when we're restoring
a PV from VG backup using pvcreate --restorefile where we may already have
some DA preallocated - this means the PV was in a VG before with already
allocated space from it (the LVs were created). Hence we need to avoid
stepping into DA - the MDA can never ever be inside in such case!
The code responsible for this calculation was already in
_text_pv_add_metadata_area fn, but it had a bug in the calculation where
we subtracted one more sector by mistake and then the code could still
incorrectly allocate the MDA inside existing DA. The patch also renames
the variable in the code so it doesn't confuse us in future.
Also, if the 2nd mda doesn't fit, don't silently continue with just 1
MDA (at the start of the disk). If 2nd mda was requested and we can't
create that due to unavailable space, error out correctly (the patch
also adds a test to shell/pvcreate-operation.sh for this case).
Previously the cache remembered an existing bootloaderarea and
reinstated it (without even checking for overlap) when asked to
write out the PV. pvcreate could write out an incorrect layout.
Avoid adding -g more then once for debug builds.
Avoid enabling DEBUG_MEM when we build multithreaded tools.
Link executables with -fPIE -pie and --export-dynamic LDFLAGS
Introduce PROGS_FLAGS to add option to pass flags for external libs.
Link lvm2 internally library only when really used.
Link DAEMON_LIBS with daemons.
Pass VALGRIND_CFLAGS internally
Set shell failure mode on couple places.
lvm2 warned about zeroing and too big chunksize (>=512KiB), but
only during lvconvert, so lvcreate was creating thin-pools
without any warning about possible slowness of thin provisioning
because of zeroing.
Since _deactivate_and_remove_lvs() is used in more then one place,
move the needed udev synchronization into this function so other
users automatically get correct fs state before next dm manipulation.
Assumption here is that this udev synchronization 'delay' may also
prevent to 'early' table reloads which might cause kernel problems
for md-core - but we may need more generic time-limited reload
frequency for raid devices.
Note: on udev-less system there will be almost no delay.
API for strtod() or strtoul() needs reset of errno, before it's being
called. So add missing resets in missing places and some also some
errno validation for out-of-range numbers.
Switch from warn to log_error since this generated
failing return code for command so printing log_error()
is mandatory.
Happens with i.e. pvscan --cache meets crashing lvmetad.
Commit 34504855a7 introduced
flag LV_RESHAPE_DATA_OFFSET and used it to avoid incompatible
activation on older runtime.
Enhance vg_validate() raid checking functions with checks for it.
In order to reject out of place reshaping with segment data_offset
field on old runtime, add a respective segment type incompatibility
flag causing "+RESHAPE_DATA_OFFSET" to be suffixed to the segment
type name.
When reshape space is allocated anew, an update and reload is needed to
promote the new size to the cluster node with the exclusively active RaidLV
or reloading the RaidLV will fail with a size related error. Additionally,
store "data_offset <sectors>" with the RaidLV in the lvm2 metadata so that
it can be retrieved on cluster nodes.
Process allocation of reshape space on a 2-legged raid4/5 (interim layout
to convert from/to linear via raid1) properly in the cluster.
Resolves: rhbz1461562
Resolves: rhbz1448116
If the activation step in lvcreate fails (e.g. the specified
minor number is already used), then the lvcreate is reverted,
but the LV lock in lvmlockd was not being unlocked or properly
freed.
Some lvconvert commands can be used directly on the data sublv:
lvconvert ... vg/pool_tdata
The correct LV lock to use in lvmlockd is the one on the pool LV.
With commit 41c10034aa we actually
do require LV to be used with _vg_write_lv_suspend_commit_backup().
So write a proper separte single wrapper for write && commit && backup.
Since we discovered status reporting from 'md' goes from large set
of weird states we can't just decided based on this word.
So let it pass for rebuild and idle as well
and check for health devices afterwards.
When raid leg rimage device is marked as 'D'ead by mdcore,
lvm2 was not able to replace such device with allocate policy,
as device has not appared as missing.
Add detection of transiently failing devices.
Basically reverting commit 58a9f88b8c.
We can use origin_only in case we are snapshot's origin,
as we do support this stack.
So when we are 'uncaching' origin+snaps - we do need to reload only
origin and we do not need to play with snaps.
Handle change of 'region size' better and follow also standard rule
if the command can't success (i.e. size is already same) we return
error for all such cases.
Also log_pring more info about adjusted value (just like we
do for rounding)
Also avoid keep pointers on 'display_*' values - they are in
ringbuffer for immediate use - not to be kept across multiple calls
(as they could be already overwritten by later calls) - so dropped
seg_region_size_str
'lvdisplay -m' tried to go through NULL policy settings,
when such policy was not defined for CachedLV.
Patch is fixing display of cache-pool without defined settings,
as this is now a valid pool and we mostly want users to define
these settings when actually really caching a LV.
Since cache LV can be a stacked device, there is no real reason
trying to use slight optimised tree for origin_only cache reload
(it could be even wrongly implemented in this case).
We can easily go with stardard tree load here.
When user runs command like 'lvconvert --splitcache' the operation
might be actually either slow or not making any progress in kernel,
so lets give user a chance to abort such operation.
When user press 'Ctrl+C' device table is restored to pre-flushing state.
Remove explicit activation of SubLVs and let lv_update_and_reload()
perform the proper (pre-)loading sequencing of tables.
This avoids related callback functions which are removed.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
When lock-holding LV differs from actually request locked LV,
we drop origin_only flag as it has no use - it'd be applied
on completely different LV.
Example of problem:
Raid is thin-pool _tdata LV.
Raid run origin_only locking on stacked device.
As lock holder is discovered thinLV.
Whole origin_only operation is then applied only on thinLV
changing the meaning of whole operation.
NOTE: this patch does not change anything for LV that are
already top-level lock holding LVs (i.e. thinLVs, snahoshots/origins).
Disable until we have a proper fix for reshape space allocation,
switching it to begin/end of rimages and activation in the cluster.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
Enhance reporting code, so it does not need to do 'extra' ioctl to
get 'status' of normal raid and provide percentage directly.
When we have 'merging' snapshot into raid origin, we still need to get
this secondary number with extra status call - however, since 'raid'
is always a single segment LV - we may skip 'copy_percent' call as
we directly know the percent and also with better precision.
NOTE: for mirror we still base reported number on the percetage of
transferred extents which might get quite imprecisse if big size
of extent is used while volume itself is smaller as reporting jump
steps are much bigger the actual reported number provides.
2nd.NOTE: raid lvs line report already requires quite a few extra status
calls for the same device - but fix will be need slight code improval.
Relative to last comit ddf2a1d656:
adjust the dm-raid target version to 1.12.0 which shows
mandatory kernel MD deadlock fixes related to reshaping
are presant in the kernel.
Related: rhbz1443999
For the test clean-up, I was providing too many devices to the first
command - possibly allowing it to allocate in the wrong place. I was
also not providing a device for the second command - virtually ensuring
the test was not performing correctly at times.
This patch ensures that under normal conditions (i.e. not during repair
operations) that users are prevented from removing devices that would
cause data loss.
When a RAID1 is undergoing its initial sync, it is ok to remove all but
one of the images because they have all existed since creation and
contain all the data written since the array was created. OTOH, if the
RAID1 was created as a result of an up-convert from linear, it is very
important not to let the user remove the primary image (the source of
all the data). They should be allowed to remove any devices they want
and as many as they want as long as one original (primary) device is left
during a "recover" (aka up-convert).
This fixes bug 1461187 and includes the necessary regression tests.
Add the checks necessary to distiguish the state of a RAID when the primary
source for syncing fails during the "recover" process.
It has been possible to hit this condition before (like when converting from
2-way RAID1 to 3-way and having the first two devices die during the "recover"
process). However, this condition is now more likely since we treat linear ->
RAID1 conversions as "recover" now - so it is especially important we cleanly
handle this condition.
Previously, we were treating non-RAID to RAID up-converts as a "resync"
operation. (The most common example being 'linear -> RAID1'.) RAID to
RAID up-converts or rebuilds of specific RAID images are properly treated
as a "recover" operation.
Since we were treating some up-convert operations as "resync", it was
possible to have scenarios where data corruption or data loss were
possibilities if the RAID hadn't been able to sync completely before a
loss of the primary source devices. In order to ensure that the user took
the proper precautions in such scenarios, we required a '--force' option
to be present. Unfortuneately, the force option was rendered useless
because there was no way to distiguish the failure state of a potentially
destructive repair from a nominal one - making the '--force' option a
requirement for any RAID1 repair!
We now treat non-RAID to RAID up-converts properly as "recover" operations.
This eliminates the scenarios that can potentially cause data loss or
data corruption; and this eliminates the need for the '--force' requirement.
This patch removes the requirement to specify '--force' for RAID repairs.
Two of the sync actions performed by the kernel (aka MD runtime) are
"resync" and "recover". The "resync" refers to when an entirely new array
is going through the process of initializing (or resynchronizing after an
unexpected shutdown). The "recover" is the process of initializing a new
member device to the array. So, a brand new array with all new devices
will undergo "resync". An array with replaced or added sub-LVs will undergo
"recover".
These two states are treated very differently when failures happen. If any
device is lost or replaced while "resync", there are no worries. This is
because any writes created from the inception of the array have occurred to
all the devices and can be safely recovered. Even though non-initialized
portions will still be resync'ed with uninitialized data, it is ok. However,
if a pre-existing device is lost (aka, the original linear device in a
linear -> raid1 convert) during a "recover", data loss can be the result.
Thus, writes are errored by the kernel and recovery is halted. The failed
device must be restored or removed. This is the correct behavior.
Unfortunately, we were treating an up-convert from linear as a "resync"
when we should have been treating it as a "recover". This patch
removes the special case for linear upconvert. It allows each new image
sub-LV to be marked with a rebuild flag and treats the array as 'in-sync'.
This has the correct effect of causing the upconvert to be treated as a
"recover" rather than a "resync". There is no need to flag these two states
differently in LVM metadata, because they are already considered differently
by the kernel RAID metadata. (Any activation/deactivation will properly
resume the "recover" process and not a "resync" process.)
We make this behavior change based on the presense of dm-raid target
version 1.9.0+.
On conversion from raid10 to raid0 (takeover), all rmeta
devices and the rimage devices of mirrored stripes are
detached from the raid10 LV. The remaining rimage areas
are being shifted down into the slots of the detached
ones hence requiring renames to show proper _N suffix
sequences (e.g. 0,1,2,3 instead of 0,2,4,6). Only the
top-level raid10 LV has a cluster lock, not the detached
SubLVs thus their deactivation is impossible and e.g the
rename from *_rimage_6 to *_rimage_3 will fail. Fix by
activating exclusively before deactivating and removing.
Resolves: rhbz1448123
Prohibit activation of reshaping RaidLVs on incompatible
lvm2 runtime by storing e.g. 'raid5+RESHAPE' segment type
strings in the lvm2 metadata. Incompatible runtime not
supporting reshaping won't be able to activate those thus
avoiding potential data corruption.
Any new non-reshaping lvconvert command will reset the
segment type string from 'raid5+RESHAPE' to 'raid5'.
See commits
0299a7af1e and
4141409eb0
for segtype flag support.
When old snapshot is merged, lvm2 still can report some data about
merged 'snapshot' - i.e. it occupied space in VG.
This patch fixes regression from commit:
6fd20be629
and resolved RHBZ: 1460161
When a combination of thin-pool chunk size and thin-pool data size
goes beyond addressable limit, such volume creation is directly
prohibited.
Maximum usable thin-pool size is calculated with use of maximal support
metadata size (even when it's created smaller) and given chunk-size.
If the value data size is found to be too big, the command reports
error and operation fails.
Previously thin-pool was created however lots of thin-pool data LV was
not usable and this space in VG has been wasted.
Only support RAID conversions on active LVs.
If we'd accept e.g. upconverting linear -> raid1 on inactive
linear LVs, any LV flags passed to the kernel aren't properly
cleared thus errouneously passing them on every activation.
Add respective check to lv_raid_change_image_count() and
move existing one in lv_raid_convert() for better messages.
Warn about a PV that has the in-use flag set, but appears in
the orphan VG (no VG was found referencing it.)
There are a number of conditions that could lead to this:
. The PV was created with no mdas and is used in a VG with
other PVs (with metadata) that have not yet appeared on
the system. So, no VG metadata is found by lvm which
references the in-use PV with no mdas.
. vgremove could have failed after clearing mdas but
before clearing the in-use flag. In this case, the
in-use flag needs to be manually cleared on the PV.
. The PV may have damanged/unrecognized VG metadata
that lvm could not read.
. The PV may have no mdas, and the PVs with the metadata
may have damaged/unrecognized metadata.
A PV holding VG metadata that lvm can't understand
(e.g. damaged, checksum error, unrecognized flag)
will appear as an in-use orphan, and will be cleared
by this repair code. Disable this repair until the
code can keep track of these problematic PVs, and
distinguish them from actual in-use orphans.
Reject any stripe adding/removing reshape on raid4/5/6/10 because
of related MD kernel deadlock on single core systems until
we get a proper fix in MD.
Related: rhbz1443999
Since lvmetad is using 'MISSING' in status for 'another' purpose,
we need to support ATM also flag get from this place.
Until fixed better - we accept both flags - alhough lvm2 will
only print in flags.
Switch METADATA_FORMAT flag usage to be stored via segtype
instead of 'status' flag which appeared to cause major
incompatibility troubles.
For backward compatiblity segtype flags are still accepted also
via 'status' bits which were used from version 2.02.169 so metadata
saved by this newer lvm2 version should still work nicely, although
new save version will no longer work on this older lvm2 version.
Allow storing LV status bits with segment type name field.
Switching to this since this field has better support for compatibility
with older version of lvm2 - since such unknown segtype will not cause
complete invisiblity of metadata from older lvm2 code - just the
particular LV will become unusable with unknown type of segment.
Commit 5fe07d3574 failed to set raid5 types
properly on conversions from raid6. It always enforced raid6_ls_6
for types raid6/raid6_zr/raid6_nr/raid6_nc, thus requiring 3 conversions
instead of 2 when asking for raid5_{la,rs,ra,n}.
Related: rhbz1439403
Offer possible interim LV types and display their aliases
(e.g. raid5 and raid5_ls) for all conversions between
striped and any raid LVs in case user requests a type
not suitable to direct conversion.
E.g. running "lvconvert --type raid5 LV" on a striped
LV will replace raid5 aka raid5_ls (rotating parity)
with raid5_n (dedicated parity on last image).
User is asked to repeat the lvconvert command to get to the
requested LV type (raid5 aka raid5_ls in this example)
when such replacement occurs.
Resolves: rhbz1439403
_check_reappeared_pv() incorrectly clears the MISSING_PV flags of
PVs with unknown devices.
While one caller avoids passing such PVs into the function, the other
doesn't. Move the check inside the function so it's not forgotten.
Without this patch, if the normal VG reading code tries to repair
inconsistent metadata while there is an unknown PV, it incorrectly
considers the missing PVs no longer to be missing and produces
incorrect 'pvs' output omitting the missing PV, for example.
Easy reproducer:
Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
Reinstate the hidden PV pv2 and at the same time hide a different PV
pv3.
Run 'pvs' - incorrect output.
Run 'pvs' again - correct output.
See https://bugzilla.redhat.com/1434054
There are certain situations (not fully understood)
where is_missing_pv() is false, but pv->dev is NULL,
so this adds a check for NULL pv->dev after is_missing_pv()
to avoid a segfault.
lvconvert parameters not causing a conversion (i.e. no type,
number of stripes, stripesize or regionsize changes) will
remove any allocated reshape space in which case the command
returns success. If reshape space does not exist though,
return error.
When metadata LV size was over DM_THIN_MAX_METADATA_SIZE sectors,
the info() routine was incorrectly trying to match bigger size,
while we do never pass any bigger device.
Fixing a case, where lvs should be displaying status for metadata
LV with 16GB size.
Reshape check failed when regionsize changed and current raid type
was provided with no other change requested (stripes or stripesize).
E.g. "lvconvert --type raid6 --regionsize 256K" on a raid6 LV
with != 256K regionsize.
Enable --type in test script.
Remove any newly allocated sub LV (pair) remnants in case
allocation fails due to lag of (parallel) free PV space
and keep initial raid type.
Resolves: rhbz1438013
SIGINT isn't blocked properly after a sigint_allow(),
sigint_restore() cycle leading to illicit interruptable
metadata updates. These can leave corrupted metadata behind.
Issues addressed in this commit:
sigint_allow() fails to set _oldmasked[] members properly due
to an offset by one bug on indexing the members of the array.
It bails out prematurely comparing to MAX_SIGINTS causing nesting
depths to be one less than MAX_SIGINTS. Fix the comparision.
Correct the related comparison flaw in sigint_restore().
Initialize all sig_atomic_t variables consequently.
Resolves: rhbz1440766
Avoid error message
"Logical Volume *_rimage_0 already exists in volume group,,,"
on takeover conversion from a 2-legged raid1 to raid4
(aiming to reshape it adding images).
Resolves: rhbz1439398
Requesting _raid_remove_images() to commit the
metadata missed to reload the origin causing a
kernel takeover error converting a 2-legged raid1
(with previously removed images) to raid5.
Allow the combination of both arguments keeping
the raid level but changing the regionssize
(e.g. "lvconvert --type raid1 --regionsize 1M RaidLV"
on an existing raid1 LV).
Resolves: rhbz1438396
With monolithic kernels we can't actually modprobe
for cache modules as they are already compiled-in
and policy modules do not export version symbol.
Reported issue on list:
https://www.redhat.com/archives/dm-devel/2017-March/msg00061.html
Fix will try to look for explicit kernel symbols first before
calling modprobe.
Removing some unused new lines and changing some incorrect "can't
release until this is fixed" comments. Rename license.txt to make
it clear its merely an included file, not itself a licence.
This patch fixed lvm2 compilation running on x32 arch.
(Using 64bit x86 cpu features but running on 32b address space,
so consuming less mem in VM).
On x32 arch 'time_t' is 64bit while 'long' is 32bit.
This reverts commit 1e4462dbfb
in favour of an enhanced solution avoiding changes in liblvm
completetly by checking the target versions in libdm and emitting
the respective parameter lines.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
Starting with dm-raid target version 1.9.0 shrinking of mapped devices is supported.
Check for support being present in lvresize and lvreduce.
Related: rhbz1394048
Quering non-thin-pool segment for discard property may lead
to intenal error if the segment had set 'out-of-range' value,
so only thin-pool is allowed, for other it returns NULL.
If SubLVs to be removed still exist after an image removing
conversion (i.e. "lvconvert --yes --force --stripes N "
with N < total stripes) any request to convert to a different
striped/raid* level has to be rejected until after those freed
SubLVs got removed by running the aforementioned lvconvert again.
Add tests to check conversion to striped/raid* gets rejected.
Enhance a test comment.
Related: rhbz1191935
Related: rhbz1366296
Repairing missing devices does not work reliably
with lvmetad, so disable lvmetad before repair.
A standard lvmetad refresh (pvscan --cache) will
enable lvmetad again.
Sending %d as format argument in lvmetad_vg_remove_pending() will cause
segfaults in config_make_nodes_v() when va_arg() casts to int64_t. Also, it is
clearly advertised in the lvm source code that using plain %d is prohibited, so
let's switch to FMTd64.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Commit f4b30b0dae was about displaying visible LV size
when reshape space is allocated. Take parity devices
into account when displaying the user visible LV size.
Better support for lvdisplay.
By default info about running (in kernel) cache status is printed.
To get 'segtype' info, user runs: 'lvdisplay -m', example:
--- Logical volume ---
LV Path /dev/vg/lvol0
LV Name lvol0
VG Name vg
LV UUID Y4uWuN-TBGk-duer-aPWl-yBWn-iFFR-RU1gg1
LV Write Access read/write
LV Creation host, time linux, 2017-03-01 20:52:39 +0100
LV Cache pool name lvol2
LV Cache origin name lvol0_corig
LV Status available
# open 0
LV Size 12,00 MiB
Cache used blocks 10,42%
Cache metadata blocks 0,49%
Cache dirty blocks 0,00%
Cache read hits/misses 112 / 34
Cache wrt hits/misses 133 / 0
Cache demotions 0
Cache promotions 20
Current LE 3
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Segments ---
Logical extents 0 to 2:
Type cache
Chunk size 64,00 KiB
Metadata format 1
Mode writethrough
Policy smq
Setting migration_threshold=100000
Report CMFmt column with cache metadata format version.
Report KMFmt column with 'kernel cache metadata format version' for device.
(a value reported from status).
(Update 'CacheMode' to name 'Cache' as primary segtype).
Only cache-pool segtype may store cache_metadata_format.
Only supported values are 0,1,2
Format 2 requires LV status uses LV_METADATA_FORMAT.
Format 0 (unselected) or 1 shall not set this 'incompatible' status.
Cache pool read/writes metadata_format within its segment type..
For CachePoolLV unselected metadata format is NOT stored in metadata.
For CacheLV when metadata format is not present/selected in lvm2 metadata,
it's automatically assumed to be the version 1 (backward compatible).
To ensure older lvm2 will not 'miss-read' metadata with new version 2,
such LV is marked with METADATA_FORMAT status flag (segment is
specifying metadata format). So when cache uses metadata format 2,
it will become inaccesible on older system without such support.
(kernel dm cache < 1.10, lvm2 < 2.02.169).
Add new profilable configation setting to let user select
which metadata format of a created cache pool he wish to use.
By default the 'best' available format is autodetected at runtime,
but user may enforce format 1 or 2 ATM.
Code also detects availability for metadata2 supporting cache target.
In case of troubles user may easily Disable usage of this feature
by placing 'metadata2' into global/cache_disabled_features list.
As now we can properly recognize all paramerters for pool creation,
we may drop PASS_ARG_ defines and rely on '_UNSELECTED' or 0 entries
as being those without user given args.
When setting are not given on command line - 'update' function
fill them from profiles or configuration. For this 'profile' arg
was needed to be passed around and since 'VG' itself is not needed,
it's been all replaced with 'cmd, profile, extents_size' args.
Since cache chunk might be huge and there is no technical need
to enforce rounding and there is actually more 'real' VG space
used then necessary - keep rounding on 'chunk' bounrary only
for thin volumes - where it's the space used anyway.
NB: we support conversion of any-size 'existing' LV into cached LV.
Fix missing reset of '*settings' pointer when no args were given.
Handle cache_chunk settings like all other settings, so it is properly
updated only with non-zero settings and the existing cache-pool
chunk_size is not being reconfigured.
User can specify metadata profile which stores important cache
geometry data for easy configuration.
Fix missing support for getting chunk_size, cache_mode, cache_policy
for a cache/cache pools volumes from configuration or metadata profile.
To more easily recognize unselected state from select '0' state
add new 'THIN_ZERO_UNSELECTED' enum.
Same applies to THIN_DISCARDS_UNSELECTED.
For those we no longer need to use PASS_ARG_ZERO or PASS_ARG_DISCARDS.
Basically code moving operation to have a single place resolving
thin_pool_chunk_size_policy.
Supported are generic & performance profiles.
Function is now shared between thin manipulation code and configuration
_CFG logic to obtain defaults and handle correct reporting upward coding
stack.
In addition to the already supported conversion between 2-legged
raid1 and raid5, raid1 and raid4 can be also converted into each
other with 2 legs (raid4/5 are limited to map a 2-legged raid1).
This patch supports the missing raid4 conversion in the sequence
linear -> 2-legged raid1 -> raid4/5, then restripe to more than one
data stripes for performance and resilience reasons and optionally
convert to striped/raid0.
The other conversion sequence is also possible by converting N-way
striped/raid0 to raid4/5, then restripe to 2 legs followed by a
conversion to raid1 and optionally to linear (loosing all resilience).
On conversion from striped to raid0, data LVs are created
and all segments and their respective areas of the striped
LV are moved across to new segments allocated for the raid0
image LVs. This can cause non-canonical segments to be added
to the image LVs.
Add a call to lv_merge_segments() once all segments have been
added to an image LV to compensate for that. This avoids
unsafe table loads on activation.
Fix comments.
Splitting off an image LV of a 2-legged
raid1 LV causes loss of resilience.
Ask user to avoid uninformed loss of all resilience.
Don't ask for N > 2 legged raid1 LVs.
Adjust tests.
Splitting off an image LV of a 2-legged raid1 LV tracking changes
causes loosing partial resilience for any newly written data set.
Full resilience will be provided again after the split off image LV
got merged back in and the new data set got fully synchronized.
Reason being that the data is only stored on the remaining single
writable image during the split.
Ask user to avoid uninformed loss of such partial resilience.
Don't ask for N > 2 legged raid1 LVs.
In case N images fail (N <= parity chunks) _and_
a "vgreduce --removemissing --force VG" was applied
a following repair of the RaidLV fails:
Unable to remove N images: Only 0 devices given.
Failed to remove the specified images from tb/r.
Failed to replace faulty devices in tb/r.
Fix as of this commit results in correct repair:
Faulty devices in tb/r successfully replaced.
seg->extents_copied has to be defined properly on reducing
the size of a raid LV or conversion from raid5 with 1 stripe
to raid1 will fail.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Reshaping a raid5 LV to one stripe aiming to convert it to
raid1 (and optionally to linear) reports the wrong LV size
when still having reshape space allocated.
The lv_extend/_lv_reduce API doesn't cope with resizing RaidLVs
with allocated reshape space and ongoing conversions. Prohibit
resizing during conversions and remove the reshape space before
processing resize. Add missing seg->data_copies initialisation.
Fix typo/comment.
For the time being raid10 is limited to even number of total stripes
as is and 2 data copies. The number of stripes provided on creation
of a raid10(_near) LV with -i/--stripes gets doubled to define
that even total number of stripes (i.e. images).
Apply the same on disk adding conversions (reshapes) with
"lvconvert --stripes RaidLV" (e.g. 2 stripes = 4 images
total converted to 3 stripes = 6 images total).
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Enhance the raid report functions for the recently added LV fields
reshape_len, reshape_len_le, data_offset, new_data_offset, data_copies,
data_stripes and parity_chunks to cope with "lvs --select".
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978