IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
There are currently a few issues with the reporting done on RAID LVs and
sub-LVs. The most concerning is that 'lvs' does not always report the
correct failure status of individual RAID sub-LVs (devices). This can
occur when a device fails and is restored after the failure has been
detected by the kernel. In this case, 'lvs' would report all devices are
fine because it can read the labels on each device just fine.
Example:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
However, 'dmsetup status' on the device tells us a different story:
[root@bp-01 lvm2]# dmsetup status vg-lv
0 1024000 raid raid1 2 DA 1024000/1024000
In this case, we must also be sure to check the RAID LVs kernel status
in order to get the proper information. Here is an example of the correct
output that is displayed after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-p 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-p /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-p /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
The other case where 'lvs' gives incomplete or improper output is when a
device is replaced or added to a RAID LV. It should display that the RAID
LV is in the process of sync'ing and that the new device is the only one
that is not-in-sync - as indicated by a leading 'I' in the Attr column.
(Remember that 'i' indicates an (i)mage that is in-sync and 'I' indicates
an (I)mage that is not in sync.) Here's an example of the old incorrect
behaviour:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg Iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg Iwi-aor-- /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0) ** Note that all the images currently are marked as 'I' even though it is
only the last device that has been added that should be marked.
Here is an example of the correct output after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0)
** Note only the last image is marked with an 'I'. This is correct and we can
tell that it isn't the whole array that is sync'ing, but just the new
device.
It also works under snapshots...
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg owi-a-r-p 33.47 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg Iwi-aor-p /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-p /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0)
snap vg swi-a-s-- /dev/sda1(51201)
fmt1 doesn't have a separate commit function: updates take effect
immediately vg_write is called, so we must update lvmetad at this
point if we're going to go on and ask lvmetad for the VG metadata
again before calling the commit function (though that's probably an
unsupported and pointless thing to do anyway as the client must
already have that data and it cannot have changed because it's locked
and with devs suspended we shouldn't be communicating with lvmetad;
so when that's fixed properly, this fix here can be reverted).
This problem showed up as an internal error when lvremoving an LVM1
snapshot.
> Internal error: LV snap1 (00000000000000000000000000000001) missing from preload metadata
https://bugzilla.redhat.com/891855
If a RAID array is not in-sync, replacing devices should not be allowed
as a general rule. This is because the contents used to populate the
incoming device may be undefined because the devices being read where
not in-sync. The kernel enforces this rule unless overridden by not
allowing the creation of an array that is not in-sync and includes a
devices that needs to be rebuilt.
Since we cannot know the sync state of an LV if it is inactive, we must
also enforce the rule that an array must be active to replace devices.
That leaves us with the following conditions:
1) never allow replacement or repair of devices if the LV is in-active
2) never allow replacement if the LV is not in-sync
3) allow repair if the LV is not in-sync, but warn that contents may
not be recoverable.
In the case where a user is performing the repair on the command line via
'lvconvert --repair', the warning is printed before the user is prompted
if they would like to replace the device(s). If the repair is automated
(i.e. via dmeventd and policy is "allocate"), then the device is replaced
if possible and the warning is printed.
If the lvmcache_info_from_pvid() fails to find valid
info, invoke the lookup by dev, and only in this case
call lvmcache_info_from_pvid() again.
Also check for the result of info and return
error directly, so the NULL is not passed
to lvmcache_get_label().
Commit bf2741376d started to use
lv_is_active() instead of call for lv_info & info.exists so
we cover also cluster activated devices.
For snapshost the conversion was not correct and introduced
regression by blocking creation of snapshot of inactive LV.
Fix it by assigning lv_is_active() directly.
Note: we still have minor issue to fix - to make
lv_is_???? function able to return error states since
lv_info() may fail.
Target tells us its version, and we may allow different set of options
to be supported with different version of driver.
Idea is to provide individual feature flags and later be
able to query for them.
The 'copy_percent' function takes the 'extents_copied' field from each
segment in an LV to create the numerator for the ratio that is to
become the copy_percent. (Otherwise known as the 'sync' percent for
non-pvmove uses, like mirror LVs and RAID LVs.) This function safely
works on RAID - not just mirrors - so it is better to have it in
lv_manip.c rather than mirror.c.
There's a lot of different functions that do a lot of different things
in lv_manip.c, so I placed the function near a function in lv_manip.c
that it was close to in metadata-exported.h. Different placement in the
file or a different name for the function may be useful.
Use log_warn to print non-fatal warning messages.
Use of log_error would confuse checker for testing
whether proper error has been reported for some real error.
A message is printed when the region_size of a RAID LV is adjusted
to allow for large (> ~1TB) LVs. The message wasn't very clear.
Hopefully, this is better.
It would be possible to activate a RAID LV exclusively in a cluster
volume group, but for now we do not allow RAID LVs to exist in a
clustered volume group at all. This has two components:
1) Do not allow RAID LVs to be created in a clustered VG
2) Do not allow changing a VG from single-machine to clustered
if there are RAID LVs present.
MD's bitmaps can handle 2^21 regions at most. The RAID code has always
used a region_size of 1024 sectors. That means the size of a RAID LV was
limited to 1TiB. (The user can adjust the region_size when creating a
RAID LV, which can affect the maximum size.) Thus, creating, extending or
converting to a RAID LV greater than 1TiB would result in a failure to
load the new device-mapper table.
Again, the size of the RAID LV is not limited by how much space is allocated
for the metadata area, but by the limitations of the MD bitmap. Therefore,
we must adjust the 'region_size' to ensure that the number of regions does
not exceed the limit. I've added code to do this when extending a RAID LV
(which covers 'create' and 'extend' operations) and when up-converting -
specifically from linear to RAID1.
We were using daemon_send_simple until now, but it is no longer adequate, since
we need to manipulate requests in a generic way (adding a validity token to each
request), and the tree-based request interface is much more suitable for this.
Don't try to issue discards to a missing PV to avoid segfault.
Prevent lvremove from removing LVs that have any part missing.
https://bugzilla.redhat.com/857554
Failing to clear the LV_NOTSYNCED flag when converting a RAID1 LV to
linear can result in the flag being present after an upconvert - even
if the sync is performed when upconverting.
Mirrors do not allow upconverting if the LV has been created with --nosync.
We will enforce the same rule for RAID1. It isn't hugely critical, since
the portions that have been written will be copied over to the new device
identically from either of the existing images. However, the unwritten
sections may be different, causing the added image to be a hybrid of the
existing images.
Also, we are disallowing the addition of new images to a RAID1 LV that has
not completed the initial sync. This may be different from mirroring, but
that is due to the fact that the 'mirror' segment type "stacks" when adding
a new image and RAID1 does not. RAID1 will rebuild a newly added image
"inline" from the existant images, so they should be in-sync.
We cannot add images to a RAID array while it is not in-sync. The
kernel will simply reject the table, saying:
'rebuild' specified while array is not in-sync
Now we check to ensure the LV is in-sync before attempting image
additions.
It is necessary when creating a RAID LV to clear the new metadata areas.
Failure to do so could result in a prepopulated bitmap that would cause
the new array to skip syncing portions of the array. It is a requirement
that the metadata LVs be activated and cleared in the process of creating.
However in test mode, this requirement should be lifted - no new LVs should
be created or written to.
When printing a message for the user and the lv_segment pointer is available,
use segtype->ops->name() instead of segtype->name. This gives a better
user-readable name for the segment. This is especially true for the
'striped' segment type, which prints "linear" if there is an area_count of
one.
Accept -q as the short form of --quiet.
Suppress non-essential standard output if -q is given twice.
Treat log/silent in lvm.conf as equivalent to -qq.
Review all log_print messages and change some to
log_print_unless_silent.
When silent, the following commands still produce output:
dumpconfig, lvdisplay, lvmdiskscan, lvs, pvck, pvdisplay,
pvs, version, vgcfgrestore -l, vgdisplay, vgs.
[Needs checking.]
Non-essential messages are shifted from log level 4 to log level 5
for syslog and lvm2_log_fn purposes.
This patch adds support for RAID10. It is not the default at this
stage. The user needs to specify '--type raid10' if they would like
RAID10 instead of stacked mirror over stripe.
Adding couple INTERNAL_ERROR reports for unwanted parameters:
Ensure the 'top' metadata node cannot be NULL for lvmetad.
Make obvious vginfo2 cannot be NULL.
Report internal error if handler and vg is undefined.
Check for handle in poll_vg().
Ensure seg is not NULL in dev_manager_transient().
Report missing read_ahead for _lv_read_ahead_single().
Check for report handler in dm_report_object().
Check missing VG in _vgreduce_single().
Respond with "unknown" rather than a NULL pointer if there's an
internal error and the discard value is invalid.
Don't accept 'no_passdown' or 'no-passdown' variants in the LVM
metadata: this is written by the program so should only ever contain
"nopassdown" and should be validated strictly against that.
Commit 8767435ef8 allowed RAID 4/5/6
LV to be extended properly, but introduced a regression in device
replacement - a critical component of fault tolerance.
When only 1 or 2 drives are being replaced, the 'area_count' needed
can be equal to the parity_count. The 'area_multiple' for RAID 4/5/6
was computed as 'area_count - parity_devs', which could result in
'area_multiple' being 0. This would ultimately lead to a division by
zero error. Therefore, in calc_area_multiple, it is important to take
into account the number of areas that are being requested - just as
we already do in _alloc_init.
Add arg support for discard.
Add discard ignore, nopassdown, passdown (=default) support.
Flags could be set per pool.
lvcreate [--discard {ignore|no_passdown|passdown}] vg/thinlv
Reducing a RAID 4/5/6 LV or extending it with a different number of
stripes is still not implemented. This patch covers the "simple" case
where the LV is extended with the same number of stripes as the orginal.
If _alloc_parallel_area for raid devices chooses an area already used
up, it doesn't notice that it has no space left in it and leaves
later code trying to place a zero-length area into the LV.
https://bugzilla.redhat.com/832596
One can use "lvcreate --aay" to have the newly created volume
activated or not activated based on the activation/auto_activation_volume_list
this way.
Note: -Z/--zero is not compatible with -aay, zeroing is not used in this case!
When using lvcreate -aay, a default warning message is also issued that zeroing
is not done.
Define an 'activation_handler' that gets called automatically on
PV appearance/disappearance while processing the lvmetad_pv_found
and lvmetad_pv_gone functions that are supposed to update the
lvmetad state based on PV availability state. For now, the actual
support is for PV appearance only, leaving room for PV disappearance
support as well (which is a more complex problem to solve as this
needs to count with possible device stack).
Add a new activation change mode - CHANGE_AAY exposed as
'--activate ay/-aay' argument ('activate automatically').
Factor out the vgchange activation functionality for use in other
tools (like pvscan...).
We're refererring to 'activation' all over the code and we're talking
about 'LVs being activated' all the time so let's use 'activation/activate'
everywhere for clarity and consistency (still providing the old
'available' keyword as a synonym for backward compatibility with
existing environments).
Update release_lv_segment_area not to discard any PV extents,
as it also gets used when moving extents between LVs.
Instead, call a new function release_and_discard_lv_segment_area() in
the two places where data should be discarded - lv_reduce() and
remove_mirrors_from_segments().
When mirrors are up-converted, a transient mirror layer is put in so that
only the new devices are sync'ed. That transient layer must carry the tags
of the original mirror LV, otherwise it will fail to activate when activation
is regulated by lvm.conf:activation/volume_list. The conversion would then
fail.
The fix is to do exactly the same thing that is being done for linear ->
mirror converting (lib/metadata/mirror.c:_init_mirror_log()). We copy the
tags temporarily for the new LV and remove them after the activation.
Snapshots of RAID logical volumes are allowed (including "raid1"). However,
snapshots of "mirror" logical volumes has been disallowed due to unsolvable
issues inherent to the design. The fact that mirroring (dm-raid1.c) must
stop all I/O as the result of a failure and wait for userspace intervention
can lead to a circular dependency if userspace is simultaneously waiting for
snapshots (on mirrors) to make an I/O update before proceeding.
Various snapshot on mirror tests have been removed as a result.
If two devices in an array failed, it was previously impossible to replace
just one of them. This patch allows for the replacement of some, but perhaps
not all, failed devices.
The logic for resuming the original and newly split LVs was not properly
done to handle situations where anything but the last device in the array
was split. It did not take into account the possible name collisions that
might occur when the original LV undergoes the shifting and renaming of its
sub-LVs.
In this case we should allow to use local mirror, check for cmirror
should apply only for lvconvert/lvcreate.
Introduced in 2.02.86 by removing !(lv->status & ACTIVATE_EXCL).
(Partially workaround, it is minimalistic patch for now.)
Code adds better support for monitoring of thin pool devices.
update_pool_lv uses DMEVENTD_MONITOR_IGNORE to not manipulate with monitoring.
vgchange & lvchange are checking real thin pool device for existance
as we are using _tpool real device and visible LV pool device might not
be even active (_tpool is activated implicitely for any thin volume).
monitor_dev_for_events is another _lv_postorder like code it might be worth
to think about reusing it here - for now update the code to properly
monitory thin volume deps.
For unmonitoring add extra code to check the usage of thin pool - in case it's in use
unmonitoring of thin volume is skipped.
When down-converting a RAID1 device, it is the last device that is extracted
and removed when the user does not specify a particular device. However,
when a device is specified (and it is not the last), the device is removed and
the remaining sub-LVs are "shifted down" to fill the hole. This cause problems
when resuming the LV because if the shifted devices were resumed (and thus
renamed) before the sub-LV being extracted, there would be a name conflict.
The solution is to resume the extracted sub-LVs first so that they can be
properly renamed preventing a possible conflict.
This addresses bug 801967.
Add 3rd daemon return state "unknown" for lookups that are carried out
successfully but don't find the item requested.
Avoid issuing error messages when it's expected that a device that's
being looked up in lvmetad might not be there.
If the thin pool has disabled zeroing (created with -Zn), we at least
clear initial 4KiB of such thin volume (provisions 1st block).
If lvcreate is executed with '-an' command will abort (same way like we for
normal LV - however for normal LV option -Zn may skip clearing completely,
for thin volumes this option is not supported (applies only for pools).
Adding at least stack traces with some FIXMEs for cases,
where we might want to do something cleaver - maybe fail command
or give user hints something is not going well ?
For remote_backup is stack probably 'good' enough for now.
The code fail to account for the case where we just need a single device
in a RAID 4/5/6 array. There is no good way to tell the allocation functions
that we don't need parity devices when we are allocating just a single device.
So, I've used a bit of a hack. If we are allocating an area_count that is <=
the parity count, then we can assume we are simply allocating a replacement
device (i.e. no need to include parity devices in the calculations). This
should make sense in most cases. If we need to allocate replacement devices
due to failure (or moving), we will never allocate more than the parity count;
or we would cause the array to become unusable. If we are creating a new device,
we should always create more stripes than parity devices.
Read lvm.conf setting for monitoring for each command. So we should not
activate monitoring if the default compilation is set to monitor during
lvconvert commnads.
Patch also removes check for clustered VG and allows to disable monitoring
for clustered VG with the assumption, the problem with monitoring and dmeventd
flag passing for INGNORE is already fixed.
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
Before removing thin pool LV always make sure, stacked message
for previous run are cleared - but allow to remove any
device that should have been created
(i.e. creation of snapshot failed - so the message for snapshot creation
may be replaced with delete message within unfinished transaction).
Also commit messages after lv remove - so free space is released in pool.
Extend the usage of origin_only flag to allow resume of thin pool LV
(when it's active) to pass only the messages.
origin_only flag will skip detection of already resumed tree for thin_pool,
so we do not need to suspend the tree and we just send messages.
Add pool_has_message and use it in attach_pool_message.
Also update header to make more obvious which segment type is
expected as parameter.
Rename 'read_only' to 'no_update' (no auto update transaction_id)
to better fit how it's used.
Fix problem when there was only one stacked message replaced with delete
message that caused unwanted transaction_id increase.
Failure to do so results in "Performing unsafe table load while X device(s) are
known to be suspended" errors. While fixing the problem in this way works and
is consistent with the way the mirror segment type does it, it would be nice
to find a solution that uses the generic suspend/resume calls.
Also included in this check-in are additions to the test suite that perform
conversions on RAID LVs under a snapshot. These tests are disabled for the
time being due to a kernel bug that is yet to be tracked down.
Since striped name function knows when to report 'linear' instead of
'stripe' type name - drop it from this place.
This fixes problem when reporting segtype e.g. for thin-pool which
is also using area_count=1 to store thin data device reference.
It also returns properly strduped memory instead of badly casted const char*.
Since snapshot needs to suspend origin - it might lead to pool userspace
deadlock (as the pool will wait for new space in case it would be overfilled,
but dmeventd would not be able to resize it, as the lvcreate operation would
have kept the VG lock.)
To minimize the risk of such scenario - we prevent to create new snapshot
in case we are over the threshold - but beware, there is still small timewindow,
so keep threshold at some reasonable level!
New field Data% is able to display info about
thin_pool, thin, snapshot and has generic meaning here.
Simple Time/Host field are here to display host and time creation.
Basic support to keep info when the LV was created.
Host and time is stored into LV mda section.
FIXME: Current version doesn't support configurable string via lvm.conf
and used fixed version strftime "%Y-%m-%d %T %z".
Also, don't allow a splitmirror operation on a RAID LV that is already tracking
a split, unless the operation is to stop the tracking and complete the split.
Example:
~> lvconvert --splitmirrors 1 --trackchanges vg/lv /dev/sdc1
# Now tracking changes - image can be merged back or split-off for good
~> lvconvert --splitmirrors 1 -n new_name vg/lv /dev/sdc1
# ^ Completes split ^
If a split is performed on a RAID that is tracking an already split image and
PVs are provided, we must ensure that
1) the already split LV is represented in the PVs
2) we are careful to split only the tracked image
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
For write we do not need to hold memory locked.
This relaxes many conditions and avoid problems when allocating
a lot of memory for writting metadata buffers.
(In case of huge MDA size this would lead to mismatch between
locked and unlocked memory region size).
Add also internal check we are not writing in critical section.
Removal of an inactive origin removes also all related snapshots.
When we now support 'old' external snapshots with thin volumes,
removal of pool will not only drop all thin volumes, but as
a consequence also all snapshots - which might be seen a bit
unexpected for the user - so add a query to confirm such action.
lvremove -f will skip the prompt.
Update region_size only for mirror and raid targets.
This fixes warning messages when vg is using small
extent size like 1KiB and no mirror/raid is created,
but the user still got the message:
$> vgcreate -s 1K vg <pvs>
$> lvcreate -L10K vg
Using reduced mirror region size of 4 sectors
If the extent_size is smaller then the chunk_size we may try
to find better aligment (wasting less space).
i.e. using 4KB extent_size and 64KB chunk size will
lead to creation of 64KB aligned thin volume.
Support allocation of metadata from the same PV, if the VG
is build only from one PV.
As thinp is not mirror - we do not require 2 PVs
for basic thin usage as user is losing only perfomance.
Replace detach_pool_messages with update_pool_lv.
Move creation code from to 'if' condition into 1.
Ensure creation has finished all previous message operations.
All thins are created with the next activation and VG is updated
without messages. Only some basic commands works.
(i.e. lvcreate -an -V10 -T mvg/pool)
There can be some combination to confuse this system.
This functionality for snapshots is going to be interesting.
If something fails during creation of thin LV remove such LV
and deactivate in case it's been already tried to activate
(i.e. thin kernel driver fails for some reason.)
To ensure we properly handle LV cluster locking - explicitely do
not allow to change the availability of the thin pool that is in use
for some thin LV.
As soon as the thin volume is created the only way to activate pool
is via implicit dependency.
Ignore thinpool open count for lv/vgchange operations.
Before adding a new virtual segment to LV, check first whether
the last segment isn't already of the same type. In this case
extend last segment instead of creating the new one.
Thin volumes should have always only 1 virtual segment, but it
helps also to virtual snapshot or error segtype..
Git commit ID 0864378250 was meant to disallow
'mirrored' logs for cluster mirrors. However, when add_mirror_log is used
to create the log (as is now the case when using 'lvcreate' or converting only
the log) the check is bypassed.
This patch adds the check to add_mirror_log.
pvck prints 'extra' character from the label since there is no '\0'
after the struct label entry and just uint64_t follows directly.
So avoid it by limiting 8 chars to be printed.
https://www.redhat.com/archives/lvm-devel/2011-January/msg00109.html
Signed-off-by: Paul Bolle <pebolle tiscali nl>
Since thin is not able to use _lv_insert_empty_sublvs,
remove its appearence from this function.
Start to use extend_pool() function for desired functionality
and modify lv_extend() for this.
Code in _lv_insert_empty_sublvs was not able to provide proper
initialization order for thin pool LV.
New function extend_pool() first adds metadata segment to pool LV which
is still visible. Such LV is activate and cleared.
Then new meta LV is created and metadata segments are moved there.
Now the preallocated pool data segment is attached to the pool LV
and layer _tpool is created. Finaly segment is marked as thin_pool.
This function could be useful for other _manip source files.
Use dm_list manipulation function for provided functionality,
which make the code more readable and avoid touching list
internal details here.
Couple FIXMEs put into the code for parts of the code which may be
improved later, since we might be able to add 'lazy' device creation later.
For now require exclusive activation.
lvm part of messaging.
Each message is now stored it's own thin pool section:
message1 {
create = lv
}
Messages are queued to thin pool dm target when this target
is going to be resumed or used through some dependency.
Currently 'delete' message are purely queued and processed
with next thin pool resume operation (i.e. create_thin).
WARNING - thin provisioning support is developmental code.
Properly detect if the filters were refreshed properly.
(May needs few more fixes ??)
Filter refresh may fail because it may be out of free file descriptors
when clvmd gets overloaded.
Example:
~> lvconvert --type raid1 vg/mirror_lv
Steps to convert "mirror" to "raid1"
1) Allocate a RAID metadata LV for each mirror image from the same PVs
on which they are located.
2) Clear the metadata LVs. This involves writing LVM metadata, so we don't
change any aspects of the mirror LV before this so that the user can easily
remove LVs from the failed convert attempt while retaining the original
mirror.
3) Remove the mirror log, if it exists.
4) Add metadata LVs to mirror LV
5) Rename mirror sub-lvs (s/mimage/rimage/)
6) Change flags and segtype from mirror to raid1
Example:
~> lvconvert --type raid1 -m 1 vg/lv
The following steps are performed to convert linear to RAID1:
1) Allocate a metadata device from the same PV as the linear device
to provide the metadata/data LV pair required for all RAID components.
2) Allocate the required number of metadata/data LV pairs for the
remaining additional images.
3) Clear the metadata LVs. This performs a LVM metadata update.
4) Create the top-level RAID LV and add the component devices.
We want to make any failure easy to unwind. This is why we don't create the
top-level LV and add the components until the last step. Should anything
happen before that, the user could simply remove the unnecessary images. Also,
we want to ensure that the metadata LVs are cleared before forming the array to
prevent stale information from polluting the new array.
A new macro 'seg_is_linear' was added to allow us to distinguish linear LVs
from striped LVs.
This patch allows a mirror to be extended without an initial resync of the
extended portion. It compliments the existing '--nosync' option to lvcreate.
This action can be done implicitly if the mirror was created with the '--nosync'
option, or explicitly if the '--nosync' option is used when extending the device.
Here are the operational criteria:
1) A mirror created with '--nosync' should extend with 'nosync' implicitly
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv ; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 10.00g lv_mlog 100.00
2) The 'M' attribute ('M' signifies a mirror created with '--nosync', while 'm'
signifies a mirror created w/o '--nosync') must be preserved when extending a
mirror created with '--nosync'. See #1 for example of 'M' attribute.
3) A mirror created without '--nosync' should extend with 'nosync' only when
'--nosync' is explicitly used when extending.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 5.02g lv_mlog 0.39
vs.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv --nosync; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.02g lv_mlog 100.00
4) The 'm' attribute must change to 'M' when extending a mirror created without
'--nosync' is extended with the '--nosync' option. (See #3 examples above.)
5) An inactive mirror's sync percent cannot be determined definitively, so it
must not be allowed to skip resync. Instead, the extend should ask the user if
they want to extend while performing a resync.
[EXAMPLE]# lvchange -an vg/lv
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv is not active. Unable to get sync percent.
Do full resync of extended portion of vg/lv? [y/n]: y
Logical volume lv successfully resized
6) A mirror that is performing recovery (as opposed to an initial sync) - like
after a failure - is not allowed to extend with either an implicit or
explicit nosync option. [You can simulate this with a 'corelog' mirror because
when it is reactivated, it must be recovered every time.]
[EXAMPLE]# lvcreate -m1 -L 5G -n lv vg --nosync --corelog
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "lv" created
[EXAMPLE]# lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 100.00
[EXAMPLE]# lvchange -an vg/lv; lvchange -ay vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 0.08
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv cannot be extended while it is recovering.
7) If 'no' is selected in #5 or if the condition in #6 is hit, it should not
result in the mirror being resized or the 'm/M' attribute being changed.
NOTE: A mirror created with '--nosync' behaves differently than one created
without it when performing an extension. The former cannot be extended when
the mirror is recovering (unless in-active), while the latter can. This is
a reasonable thing to do since recovery of a mirror doesn't take long (at
least in the case of an on-disk log) and it would cause far more time in
degraded mode if the extension w/o '--nosync' was allowed. It might be
reasonable to add the ability to force the operation in the future. This
should /not/ force a nosync extension, but rather force a sync'ed extension.
IOW, the user would be saying, "Yes, yes... I know recovery won't take long
and that I'll be adding significantly to the time spent in degraded mode, but
I need the extra space right now!".
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The original commit comments can be located via this git commit ID:
7d8e615c0b
There were three possible solutions to the original problem proposed in the
initial check-in. The one chosen was as follows:
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it doesn't
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
Turns out, the cluster also views the extra suspend/resume operations as
pointless too and ignores them. So, this solution doesn't work in a cluster.
Further, I've noticed that in addition to the remote cluster nodes still getting
I/O errors from scanning the error target, they also have a different LVM and
DM views of the same LV. IOW, while the LVM level (gotten from the LVM metadata)
sees the correct name for the newly split LV, device-mapper still maintains the
old names.
Because the original fix failed to completely fix the problem (or work-around it)
and because a better solution must be found to address the additional cluster
issue of device renaming, I am reverting the above mentioned commit.
Make limits for thin data_block_size and device_id part of public API.
FIXME: read them possible from some kernel header file in the future ?
But we may need to support different values for different versions ?
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
seg->areas and seg->meta_areas. We also need to copy the memory from the
old arrays to the newly allocated arrays. The amount of memory to copy was
determined by seg->area_count. However, seg->area_count was being set to the
higher value after copying the 'seg->areas' information, but before copying
the 'seg->meta_areas' information. This means we were copying more memory
than necessary for 'seg->meta_areas' - something that could lead to a segfault.
Compiler says variable may be used uninitialized. It can't be, but we
initialize the variable to NULL anyway. Also, remove the double initialization
of another variable.
This bug showed up when trying to add a log to a mirror whose images are on
multiple devices. This is an intra-release regression and no WHATS_NEW
entry will be added. The error was introduce in the following commit:
2d8a2f35c7
The solution is to recognise in _alloc_init that if there are no mirrors
or stripes specified, then 'new_extents' should be zero.