IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The code fail to account for the case where we just need a single device
in a RAID 4/5/6 array. There is no good way to tell the allocation functions
that we don't need parity devices when we are allocating just a single device.
So, I've used a bit of a hack. If we are allocating an area_count that is <=
the parity count, then we can assume we are simply allocating a replacement
device (i.e. no need to include parity devices in the calculations). This
should make sense in most cases. If we need to allocate replacement devices
due to failure (or moving), we will never allocate more than the parity count;
or we would cause the array to become unusable. If we are creating a new device,
we should always create more stripes than parity devices.
Read lvm.conf setting for monitoring for each command. So we should not
activate monitoring if the default compilation is set to monitor during
lvconvert commnads.
Patch also removes check for clustered VG and allows to disable monitoring
for clustered VG with the assumption, the problem with monitoring and dmeventd
flag passing for INGNORE is already fixed.
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
Before removing thin pool LV always make sure, stacked message
for previous run are cleared - but allow to remove any
device that should have been created
(i.e. creation of snapshot failed - so the message for snapshot creation
may be replaced with delete message within unfinished transaction).
Also commit messages after lv remove - so free space is released in pool.
Extend the usage of origin_only flag to allow resume of thin pool LV
(when it's active) to pass only the messages.
origin_only flag will skip detection of already resumed tree for thin_pool,
so we do not need to suspend the tree and we just send messages.
Add pool_has_message and use it in attach_pool_message.
Also update header to make more obvious which segment type is
expected as parameter.
Rename 'read_only' to 'no_update' (no auto update transaction_id)
to better fit how it's used.
Fix problem when there was only one stacked message replaced with delete
message that caused unwanted transaction_id increase.
Failure to do so results in "Performing unsafe table load while X device(s) are
known to be suspended" errors. While fixing the problem in this way works and
is consistent with the way the mirror segment type does it, it would be nice
to find a solution that uses the generic suspend/resume calls.
Also included in this check-in are additions to the test suite that perform
conversions on RAID LVs under a snapshot. These tests are disabled for the
time being due to a kernel bug that is yet to be tracked down.
Since striped name function knows when to report 'linear' instead of
'stripe' type name - drop it from this place.
This fixes problem when reporting segtype e.g. for thin-pool which
is also using area_count=1 to store thin data device reference.
It also returns properly strduped memory instead of badly casted const char*.
Since snapshot needs to suspend origin - it might lead to pool userspace
deadlock (as the pool will wait for new space in case it would be overfilled,
but dmeventd would not be able to resize it, as the lvcreate operation would
have kept the VG lock.)
To minimize the risk of such scenario - we prevent to create new snapshot
in case we are over the threshold - but beware, there is still small timewindow,
so keep threshold at some reasonable level!
New field Data% is able to display info about
thin_pool, thin, snapshot and has generic meaning here.
Simple Time/Host field are here to display host and time creation.
Basic support to keep info when the LV was created.
Host and time is stored into LV mda section.
FIXME: Current version doesn't support configurable string via lvm.conf
and used fixed version strftime "%Y-%m-%d %T %z".
Also, don't allow a splitmirror operation on a RAID LV that is already tracking
a split, unless the operation is to stop the tracking and complete the split.
Example:
~> lvconvert --splitmirrors 1 --trackchanges vg/lv /dev/sdc1
# Now tracking changes - image can be merged back or split-off for good
~> lvconvert --splitmirrors 1 -n new_name vg/lv /dev/sdc1
# ^ Completes split ^
If a split is performed on a RAID that is tracking an already split image and
PVs are provided, we must ensure that
1) the already split LV is represented in the PVs
2) we are careful to split only the tracked image
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
For write we do not need to hold memory locked.
This relaxes many conditions and avoid problems when allocating
a lot of memory for writting metadata buffers.
(In case of huge MDA size this would lead to mismatch between
locked and unlocked memory region size).
Add also internal check we are not writing in critical section.
Removal of an inactive origin removes also all related snapshots.
When we now support 'old' external snapshots with thin volumes,
removal of pool will not only drop all thin volumes, but as
a consequence also all snapshots - which might be seen a bit
unexpected for the user - so add a query to confirm such action.
lvremove -f will skip the prompt.
Update region_size only for mirror and raid targets.
This fixes warning messages when vg is using small
extent size like 1KiB and no mirror/raid is created,
but the user still got the message:
$> vgcreate -s 1K vg <pvs>
$> lvcreate -L10K vg
Using reduced mirror region size of 4 sectors
If the extent_size is smaller then the chunk_size we may try
to find better aligment (wasting less space).
i.e. using 4KB extent_size and 64KB chunk size will
lead to creation of 64KB aligned thin volume.
Support allocation of metadata from the same PV, if the VG
is build only from one PV.
As thinp is not mirror - we do not require 2 PVs
for basic thin usage as user is losing only perfomance.
Replace detach_pool_messages with update_pool_lv.
Move creation code from to 'if' condition into 1.
Ensure creation has finished all previous message operations.
All thins are created with the next activation and VG is updated
without messages. Only some basic commands works.
(i.e. lvcreate -an -V10 -T mvg/pool)
There can be some combination to confuse this system.
This functionality for snapshots is going to be interesting.
If something fails during creation of thin LV remove such LV
and deactivate in case it's been already tried to activate
(i.e. thin kernel driver fails for some reason.)
To ensure we properly handle LV cluster locking - explicitely do
not allow to change the availability of the thin pool that is in use
for some thin LV.
As soon as the thin volume is created the only way to activate pool
is via implicit dependency.
Ignore thinpool open count for lv/vgchange operations.
Before adding a new virtual segment to LV, check first whether
the last segment isn't already of the same type. In this case
extend last segment instead of creating the new one.
Thin volumes should have always only 1 virtual segment, but it
helps also to virtual snapshot or error segtype..
Git commit ID 0864378250 was meant to disallow
'mirrored' logs for cluster mirrors. However, when add_mirror_log is used
to create the log (as is now the case when using 'lvcreate' or converting only
the log) the check is bypassed.
This patch adds the check to add_mirror_log.
pvck prints 'extra' character from the label since there is no '\0'
after the struct label entry and just uint64_t follows directly.
So avoid it by limiting 8 chars to be printed.
https://www.redhat.com/archives/lvm-devel/2011-January/msg00109.html
Signed-off-by: Paul Bolle <pebolle tiscali nl>
Since thin is not able to use _lv_insert_empty_sublvs,
remove its appearence from this function.
Start to use extend_pool() function for desired functionality
and modify lv_extend() for this.
Code in _lv_insert_empty_sublvs was not able to provide proper
initialization order for thin pool LV.
New function extend_pool() first adds metadata segment to pool LV which
is still visible. Such LV is activate and cleared.
Then new meta LV is created and metadata segments are moved there.
Now the preallocated pool data segment is attached to the pool LV
and layer _tpool is created. Finaly segment is marked as thin_pool.
This function could be useful for other _manip source files.
Use dm_list manipulation function for provided functionality,
which make the code more readable and avoid touching list
internal details here.
Couple FIXMEs put into the code for parts of the code which may be
improved later, since we might be able to add 'lazy' device creation later.
For now require exclusive activation.
lvm part of messaging.
Each message is now stored it's own thin pool section:
message1 {
create = lv
}
Messages are queued to thin pool dm target when this target
is going to be resumed or used through some dependency.
Currently 'delete' message are purely queued and processed
with next thin pool resume operation (i.e. create_thin).
WARNING - thin provisioning support is developmental code.
Properly detect if the filters were refreshed properly.
(May needs few more fixes ??)
Filter refresh may fail because it may be out of free file descriptors
when clvmd gets overloaded.
Example:
~> lvconvert --type raid1 vg/mirror_lv
Steps to convert "mirror" to "raid1"
1) Allocate a RAID metadata LV for each mirror image from the same PVs
on which they are located.
2) Clear the metadata LVs. This involves writing LVM metadata, so we don't
change any aspects of the mirror LV before this so that the user can easily
remove LVs from the failed convert attempt while retaining the original
mirror.
3) Remove the mirror log, if it exists.
4) Add metadata LVs to mirror LV
5) Rename mirror sub-lvs (s/mimage/rimage/)
6) Change flags and segtype from mirror to raid1
Example:
~> lvconvert --type raid1 -m 1 vg/lv
The following steps are performed to convert linear to RAID1:
1) Allocate a metadata device from the same PV as the linear device
to provide the metadata/data LV pair required for all RAID components.
2) Allocate the required number of metadata/data LV pairs for the
remaining additional images.
3) Clear the metadata LVs. This performs a LVM metadata update.
4) Create the top-level RAID LV and add the component devices.
We want to make any failure easy to unwind. This is why we don't create the
top-level LV and add the components until the last step. Should anything
happen before that, the user could simply remove the unnecessary images. Also,
we want to ensure that the metadata LVs are cleared before forming the array to
prevent stale information from polluting the new array.
A new macro 'seg_is_linear' was added to allow us to distinguish linear LVs
from striped LVs.
This patch allows a mirror to be extended without an initial resync of the
extended portion. It compliments the existing '--nosync' option to lvcreate.
This action can be done implicitly if the mirror was created with the '--nosync'
option, or explicitly if the '--nosync' option is used when extending the device.
Here are the operational criteria:
1) A mirror created with '--nosync' should extend with 'nosync' implicitly
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv ; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 10.00g lv_mlog 100.00
2) The 'M' attribute ('M' signifies a mirror created with '--nosync', while 'm'
signifies a mirror created w/o '--nosync') must be preserved when extending a
mirror created with '--nosync'. See #1 for example of 'M' attribute.
3) A mirror created without '--nosync' should extend with 'nosync' only when
'--nosync' is explicitly used when extending.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 5.02g lv_mlog 0.39
vs.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv --nosync; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.02g lv_mlog 100.00
4) The 'm' attribute must change to 'M' when extending a mirror created without
'--nosync' is extended with the '--nosync' option. (See #3 examples above.)
5) An inactive mirror's sync percent cannot be determined definitively, so it
must not be allowed to skip resync. Instead, the extend should ask the user if
they want to extend while performing a resync.
[EXAMPLE]# lvchange -an vg/lv
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv is not active. Unable to get sync percent.
Do full resync of extended portion of vg/lv? [y/n]: y
Logical volume lv successfully resized
6) A mirror that is performing recovery (as opposed to an initial sync) - like
after a failure - is not allowed to extend with either an implicit or
explicit nosync option. [You can simulate this with a 'corelog' mirror because
when it is reactivated, it must be recovered every time.]
[EXAMPLE]# lvcreate -m1 -L 5G -n lv vg --nosync --corelog
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "lv" created
[EXAMPLE]# lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 100.00
[EXAMPLE]# lvchange -an vg/lv; lvchange -ay vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 0.08
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv cannot be extended while it is recovering.
7) If 'no' is selected in #5 or if the condition in #6 is hit, it should not
result in the mirror being resized or the 'm/M' attribute being changed.
NOTE: A mirror created with '--nosync' behaves differently than one created
without it when performing an extension. The former cannot be extended when
the mirror is recovering (unless in-active), while the latter can. This is
a reasonable thing to do since recovery of a mirror doesn't take long (at
least in the case of an on-disk log) and it would cause far more time in
degraded mode if the extension w/o '--nosync' was allowed. It might be
reasonable to add the ability to force the operation in the future. This
should /not/ force a nosync extension, but rather force a sync'ed extension.
IOW, the user would be saying, "Yes, yes... I know recovery won't take long
and that I'll be adding significantly to the time spent in degraded mode, but
I need the extra space right now!".
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The original commit comments can be located via this git commit ID:
7d8e615c0b
There were three possible solutions to the original problem proposed in the
initial check-in. The one chosen was as follows:
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it doesn't
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
Turns out, the cluster also views the extra suspend/resume operations as
pointless too and ignores them. So, this solution doesn't work in a cluster.
Further, I've noticed that in addition to the remote cluster nodes still getting
I/O errors from scanning the error target, they also have a different LVM and
DM views of the same LV. IOW, while the LVM level (gotten from the LVM metadata)
sees the correct name for the newly split LV, device-mapper still maintains the
old names.
Because the original fix failed to completely fix the problem (or work-around it)
and because a better solution must be found to address the additional cluster
issue of device renaming, I am reverting the above mentioned commit.
Make limits for thin data_block_size and device_id part of public API.
FIXME: read them possible from some kernel header file in the future ?
But we may need to support different values for different versions ?
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
seg->areas and seg->meta_areas. We also need to copy the memory from the
old arrays to the newly allocated arrays. The amount of memory to copy was
determined by seg->area_count. However, seg->area_count was being set to the
higher value after copying the 'seg->areas' information, but before copying
the 'seg->meta_areas' information. This means we were copying more memory
than necessary for 'seg->meta_areas' - something that could lead to a segfault.
Compiler says variable may be used uninitialized. It can't be, but we
initialize the variable to NULL anyway. Also, remove the double initialization
of another variable.
This bug showed up when trying to add a log to a mirror whose images are on
multiple devices. This is an intra-release regression and no WHATS_NEW
entry will be added. The error was introduce in the following commit:
2d8a2f35c7
The solution is to recognise in _alloc_init that if there are no mirrors
or stripes specified, then 'new_extents' should be zero.
to settle udev before calling deactivate_lv.
This is an intra-release regression (no WHATS_NEW entry required). It is
part of the fix for the current WHATS_NEW entry:
Work around resume_lv causing error LV scanning during splitmirror operation.
When user wants to remove thin pool - check if there are no thin volumes using it.
If so - query before removal (or -ff for no question) and remove them first.
When LV is unlinked, we want to catch problem in vg_validate,
that LV has changed.
i.e. catch LV has been removed and is no long thin_pool while still
being referenced by some thin volume.
Revert John patch, which fixed only 1 place where ~LVM_WRITE was in use and
convert ommited LVM_READ/WRITE flags to 64bit constants as well.
(Since both 'status' flags for LV and VG are 64bit.)
Changing lv_mirror_count to only count the AREA_LVs made the function
stop working for PVMOVE mirrors. A conditional has been added to fix
that problem. Additionally, when counting the images in a mirror stack,
we don't need to subtract 1 from the count we get back from the
lv_mirror_count call on the temporary mirror layer. (This is because we
are no falsely counting the top layer of the temporary mirror.)
lv_mirror_count was not able to handle mirrors of stripes properly. When a
failed device is removed, the MIRRORED status flag is removed from the LV
conditionally based on the results of lv_mirror_count. However, lv_mirror_count
trusted the MIRRORED flag - thinking any such LV must be mirrored. It would
happily assign first_seg(lv)->area_count as the number of mirrors, but when
a mirrored striped LV was reduced to a simple striped LV area_count would be
the number of /stripes/ not the number of /mirrors/. A result higher than 1
would be returned from lv_mirror_count, the MIRRORED flag would not be cleared,
and the LV would fail to be up-converted properly in lvconvert_mirrors_aux
because of it.
The operation of deactivating the residual error target LV after removing a
mirror layer can cause a "device in-use" conflict with udev. Giving udev a
poke before calling deactivate_lv eliminates the conflict. The stick used
to poke udev is 'sync_local_dev_names'.
Kernel requires a mirror to be at least 1 region large. So,
if our mirror log is itself a mirror, it must be at least
1 region large. This restriction may not be necessary for
non-mirrored logs, but we apply the rule anyway.
(The other option is to make the region size of the log
mirror smaller than the mirror it is acting as a log for,
but that really complicates things. It's much easier to
keep the region_size the same for both.)
WHATS_NEW entry:
Fix log size calculation when only a log is being added to a mirror.
The original fix pass the mirror LV to allocate_extents (rather than
passing NULL) so that _alloc_init could correctly determine the necessary
size of the mirror log. In the previous check-in, I noted:
In order to get a decent value computed, we need to pass in the 'lv' argument
to allocate_extents. This would normally imply a desire for cling/contiguous
allocation to the given LV, but since we are not allocating any parallel
extents and only log extents, it works fine.
However, passing in the LV did have unintended consequences on the placement of
the log. The better solution is to pass in the number of extext that are in
the mirror LV instead of the LV itself. This will not cause the allocator to
reserve that number of extents, because 'stripes' and 'mirrors' are specified
as 0. Thus, 'extents' is used to calculate the size of the log, but won't
affect how much is allocated.
LVM_WRITE is a 32-bit flag. Now that RAID[_IMAGE|_META] are 64-bit,
and'ing a RAID LV's status against LVM_WRITE can reset the higher order
flags.
A similar thing will affect thinp flags if not careful.
_alloc_init calculates the number of necessary log extents via
'mirror_log_extents'. 'mirror_log_extents' takes 3 arguments: region_size,
pe_size, and size of the mirror LV. Unfortunately, _alloc_init is guessing at
the mirror size by using 'ah->new_extents / ah->area_multiple' - the number of
extents that the mirror images have. However, this is /always/ wrong when
allocating the log separately. Further, the log is always allocated separately
unless we are up-converting the mirror at the same time. It was by luck alone
that a default value of '1' reflects what we want in most cases.
In order to get a decent value computed, we need to pass in the 'lv' argument
to allocate_extents. This would normally imply a desire for cling/contiguous
allocation to the given LV, but since we are not allocating any parallel
extents and only log extents, it works fine.
When an image is split from a 2-way mirror, the original mirror is converted to
a linear device. To do this, the top "layer" must be removed. The segments
are transferred from the sub-lv to the top-level LV and the link is severed.
The former sub-lv - having its segments transferred - now contains a temporary
error target.
When the original LV is resumed, the old sub-lv that now contains an error
segment is activated and scanned. This is what causes the I/O error messages.
There are three ways to fix this problem:
1) Do not set the sub-lv which contains the error target as "visible" before
suspending the original LV. This way, when the original is resumed, the sub-lv
device node is not created and it is not scanned - avoiding the error messages.
The problem with this approach is that if the machine crashes after the
resume, it leaves the *hidden* LV in place and the user has a more difficult
time noticing that it needs to be cleaned up. Thus, this type of processing is
frowned upon.
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it does not
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
3) Flag the sub-lv (error target) with a "do not scan" flag. This seems like
the cleanest approach, but I have been unable to find the method for doing
this. LVs get tagged in such a way by _get_udev_flags, but in this case the
resume of the original LV also resumes the error target LV without running it
through _get_udev_flags (likely because they are no longer linked). Could
there be something wrong in resume_lv?
Option #2 was chosen to fix this bug, but it seems like more of a workaround
for now.
A gentle reminder that anyone relying on the output of reporting commands
like lvs in scripts must use -o to guarantee they get the fields they expect.
The default sequence of fields can change from release to release.
Equally, the 'attr' fields can have new values introduced and/or characters
appended to them.
There was a bad sequence:
*) Make changes to LV layout to split images (e.g. 4-way -> 2-way/2-way)
1) vg_write, suspend_lv(original_mirror), vg_commit
2) activate_lv(newly_split_lv)
3) resume_lv(original_mirror)
Step #2 is not allowed. However, without it, the resume of the original
mirror will also resume its former sub-LVs - making it impossible to
activate the newly split LV due to the changes in layering, pointers, and
names that had already been made. Additionally, the resume or the original
brings the sub-lv's online with names that differ from the metadata on disk -
also a no-no. Thus, the split must be done in stages such that the active LVs
always reflect what is in the committed LVM metadata.
First, alter the original mirror by releasing the images. The images are made
visible and independent as an intermediate stage. (This way, we can have
consistency between LVM metadata and active LVs.) The second stage collects
the recently split LVs, deactivates them, forms them into a mirror if necessary,
and then activates them. It is a bit of a circuitous method, but it is the only
way to split a mirror from a mirror and obey these general rules:
1) Never [de]activate sub-lvs when the top-level LV is suspended
2) Avoid having active LVs that differ from the description in the LVM metadata
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
This patch adds the ability to upconvert a raid1 array - say from 2-way to
3-way. It does not yet support upconverting linear to n-way.
The 'raid' device-mapper target allows for individual components (images) of
an array to be specified for rebuild. This mechanism is used when adding
new images to the array so that the new images can be resync'ed while the
rest of the images in the array can remain 'in-sync'. (There is no
mirror-on-mirror layering required.)
~> lvconvert --splitmirrors 1 --trackchanges vg/lv
The '--trackchanges' option allows a user the ability to use an image of
a RAID1 array for the purposes of temporary read-only access. The image
can be merged back into the array at a later time and only the blocks that
have changed in the array since the split will be resync'ed. This
operation can be thought of as a partial split. The image is never completely
extracted from the array, in that the array reserves the position the device
occupied and tracks the differences between the array and the split image via
a bitmap. The image itself is rendered read-only and the name (<LV>_rimage_*)
cannot be changed. The user can complete the split (permanently splitting the
image from the array) by re-issuing the 'lvconvert' command without the
'--trackchanges' argument and specifying the '--name' argument.
~> lvconvert --splitmirrors 1 --name my_split vg/lv
Merging the tracked image back into the array is done with the '--merge'
option (included in a follow-on patch).
~> lvconvert --merge vg/lv_rimage_<n>
The internal mechanics of this are relatively simple. The 'raid' device-
mapper target allows for the specification of an empty slot in an array
via '- -'. This is what will be used if a partial activation of an array
is ever required. (It would also be possible to use 'error' targets in
place of the '- -'.) If a RAID image is found to be both read-only and
visible, then it is considered separate from the array and '- -' is used
to hold it's position in the array. So, all that needs to be done to
temporarily split an image from the array /and/ cause the kernel target's
bitmap to track (aka "mark") changes made is to make the specified image
visible and read-only. To merge the device back into the array, the image
needs to be returned to the read/write state of the top-level LV and made
invisible.
Users already have the ability to split an image from an LV of "mirror"
segtype. This patch extends that ability to LVs of "raid1" segtype.
This patch only allows a single image to be split off, however. (The
"mirror" segtype allows an arbitrary number of images to be split off.
e.g. 4-way => 3-way/linear, 2-way/2-way, linear,3-way)
of top-level LV.
We can't activate sub-lv's that are being removed from a RAID1 LV while it
is suspended. However, this is what was being used to have them show-up
so we could remove them. 'sync_local_dev_names' is a sufficient and
proper replacement and can be done after the top-level LV is resumed.
1) add new function 'raid_remove_top_layer' which will be useful
to other conversion functions later (also cleans up code)
2) Add error messages if raid_[extract|add]_images fails
3) Add function prototypes to prevent compiler warnings when
compiling with '--with-raid=shared'
Fix a couple more issues that kabi found.
- Add some error messages in failure cases
- s/malloc/zalloc/
- use vg->vgmem for lv names instead of vg->cmd->mem
Use debug pool locking functionality. So the command could check,
whether the memory in the pool has not been modified.
For lv_postoder() instead of unlocking and locking for every changed
struct status member do it once when entering and leaving function.
(mprotect would trap each such memory access).
Currently lv_postoder() does not modify other part of vg structure
then status flags of each LV with flags that are reverted back to
its original state after function exit.
Extend vginfo cache with cached VG structure. So if the same metadata
are use, skip mda decoding in the case, the same data are in use.
This helps for operations like activation of all LVs in one VG,
where same data were decoded giving the same output result.
Patch adds 1-to-1 connection between volume_group and lvmcache_vginfo.
Move the free_vg() to vg.c and replace free_vg with release_vg
and make the _free_vg internal.
Patch is needed for sharing VG in vginfo cache so the release_vg function name
is a better fit here.
As this flag could not have been set by the current code - removing it.
Note: because of the wrong code logic this call:
lvmcache_update_vg(correct_vg, correct_vg->status & PRECOMMITTED &
(inconsistent ? INCONSISTENT_VG : 0));
had always passed '0' - now after flag removal it's passing
PRECOMMITTED flag in - this present functinal change in this patch.
To match the original functionality - 0 had to be always passed.
More testing is needed here.
Compiler complaining that meta_lv could be used uninitialized. (Not true
because it is protected by 'clear_metadata'.) I switched to using 'lv->vg',
as it makes no difference to vg_[write|commit].
Last usage was removed in Petr's commit related to VG mda repair fix
where relaxed check starts to ignore inconsistencies coming from
PVs that are marked MISSING - thus removing unused variable.
Implementation described in doc/lvm2-raid.txt.
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
teardown sequence. (Previously the snapshot was deactivated while its
origin was active and before its removal was committed to disk, so
restarting after a crash at the point would leave corruption.)
When an LVM mirror is up-converted (an additional image added), it creates
a temporary mirror stack. The lower-level mirror in the stack that is
created was not being activated exclusively - violating the exclusive nature
of the original mirror. We now check for exclusive activation of a mirror
before converting it, and if found, we ensure that the temporary mirror
is also exclusively activated.
Mirrors used to be created by first creating a linear device and then adding
the other images plus the log. Now mirrors are created by creating all the
images in one go and then adding the log separately. The new way ran into
the condition that cluster mirrors cannot change the log type (in the case
of creation, from core -> disk) while the mirror is not active. (It isn't
active because it is in the process of being created.) The reason this
condition is in place is because a remote node may have the mirror active, and
we don't want to alter the log underneath it.
What we really needed was a way of checking if the mirror was active remotely
but not locally, and in that case do not allow a change of the log. I've added
this check, and cluster mirrors can now be created again.
It's useful to keep the partial flag cached - so just move the call
for vg_mark_partil_lvs() into import_vg_from_config_tree() so it gets
evaluated before it goes through the lvmcache.
This patch should not present any functional change.
Note: It is rather temporal solution - proper place is probably inside the
'read' call back - but needs some more discussion.
For now using this minor hack.
As the ACTIVATE_EXCL could be set only in clvmd code - there is no
use for this test in lv_add_mirrors() function only called from
tools context.
FIXME: Add cluster test case for this.
To avoid modification of 'read-only' volume group structure
add a new structure to pass local data around the code for LV
activation.
As origin_only is one such flag - replace this parameter with new
struct lv_activate_opts.
More parameters might eventually become part of lv_activate_opts.
transient error), stemming from the following sequence of events:
1) devices fail IO, triggering repair
2) dmeventd starts fixing up the mirror
3) during the downconversion, a new metadata version is written
--> the devices come back online here
4) the mirror device suspend/resume is called to update DM tables
5) during the suspend/resume cycle, *pre*-commit metadata is read;
however, since the failed devices are now back online, we get back
inconsistent set of precommit metadata and the whole operation fails
The patch relaxes the check that fails in step 5 above, namely by ignoring
inconsistencies coming from PVs that are marked MISSING.
are affected by the move. (Currently it's possible for I/O to become
trapped between suspended devices amongst other problems.
The current fix was selected so as to minimise the testing surface. I
hope eventually to replace it with a cleaner one that extends the
deptree code.
Some lvconvert scenarios still suffer from related problems.
Currently some operation with striped mirrors lead
to corrupted metadata, this patch just add detection of such
situation.
Example:
# lvcreate -i2 -l10 -n lvs vg_test
# lvconvert -m1 vg_test/lvs
# lvreduce -f -l1 vg_test/lvs
Reducing logical volume lvs to 4.00 MiB
Segment extent reduction 9not divisible by #stripes 2
Logical volume lvs successfully resized
# lvremove vg_test/lvs
Segment extent reduction 1not divisible by #stripes 2
LV segment lvs:0-4294967295 is incorrectly listed as being used by LV lvs_mimage_0
Internal error: LV segments corrupted in lvs_mimage_0.
Avoid using of already released memory when duplicated MDA is found.
As get_pv_from_vg_by_id() may call lvmcache_label_scan() use the local copy
of the vgname and vgid on the stack as vginfo may dissapear and code was
then accessing garbage in memory.
i.e. pvs /dev/loop0
(when /dev/loop0 and /dev/loop1 has same MDA content)
Invalid read of size 1
at 0x523C986: dm_hash_lookup (hash.c:325)
by 0x440C8C: vginfo_from_vgname (lvmcache.c:399)
by 0x4605C0: _create_vg_text_instance (format-text.c:1882)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
by 0x47FBD8: vg_read_internal (metadata.c:3231)
by 0x477594: get_pv_from_vg_by_id (metadata.c:344)
by 0x45F07A: _get_pv_if_in_vg (format-text.c:1400)
by 0x45F0B9: _populate_pv_fields (format-text.c:1414)
by 0x45F40F: _text_pv_read (format-text.c:1493)
by 0x480431: _pv_read (metadata.c:3500)
by 0x4802B2: pv_read (metadata.c:3462)
Address 0x652ab80 is 0 bytes inside a block of size 4 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x442277: _free_vginfo (lvmcache.c:963)
by 0x44235E: _drop_vginfo (lvmcache.c:992)
by 0x442B23: _lvmcache_update_vgname (lvmcache.c:1165)
by 0x443449: lvmcache_update_vgname_and_id (lvmcache.c:1358)
by 0x443C07: lvmcache_add (lvmcache.c:1492)
by 0x46588C: _text_read (text_label.c:271)
by 0x466A65: label_read (label.c:289)
by 0x4413FC: lvmcache_label_scan (lvmcache.c:635)
by 0x4605AD: _create_vg_text_instance (format-text.c:1881)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
Add testing script
pv_manip.c to properly account for case when pe_start=0 and the first
physical extent is to be released (currently skip the first extent to
avoid discarding the PV label).
My previous patch fixed incorrect error check for dm_snprintf.
However in this particular case - dm_snprintf has been used differently -
just like strncpy + setting last char with '\0' - so the code had to return
error - because the buffer was to short for whole string.
Patch replaces it with real strncpy.
Also test for alloca() failure is removed - as the program behaviour
is rather undefined in this case - it never returns NULL.
lvseg properties for lvm2app, 'devices' and 'seg_pe_ranges'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
dm_pool_free incorrectly. This check-in fixes that incorrect usage.
I've also added a WHATS_NEW line to reflect the changes I made to allow
lv_extend to operate on 0 length intrinsically layered LVs (i.e mirrors
and RAID). I forgot that in the last commit.
allows us to allocate all images of a mirror (or RAID array) at one
time during create.
The current mirror implementation still requires a separate allocation
for the log, however.
Actually, we can call vg_set_fid(vg, NULL) instead of calling
destroy_instance for all PV structs and a VG struct - it's the same
code we already have in the vg_set_fid.
Also, allow exactly the same fid to be set again for the same PV/VG
Before, this could end up with the fid destroyed because we destroyed
existing fid first and then we used the new one and we didn't care
whether existing one == new one by chance.
Instead of searching linear list of all LVs, PVs - use created hash tables
also for quick mapping between LV.
(Note - for small number of PVs or LVs the overhead of the hash is bigger).
TODO: Use hash tables in volume_group structure directly.
If _move_lv_segments is passed a 'lv_from' that does not yet
have any segments, it will screw things up because the code
that does the segment copy assumes there is at least one
segment. See copy code here:
lv_to->segments = lv_from->segments;
lv_to->segments.n->p = &lv_to->segments;
lv_to->segments.p->n = &lv_to->segments;
If 'segments' is an empty list, the first statement copies over
the values, but the next two reset those values to point to the
other LV's list structure. 'lv_to' now appears to have one
segment, but it is really an ill-set pointer.
other cases, the code could wind up removing wrong number of mirrors. In yet
other cases, we could remove the right number of mirrors, but fail to respect
the removal preferences (i.e. keep an image that was requested to be removed
while removing an image that was requested to be kept). Under some
circumstances, remove_mirror_images could also get stuck in an infinite loop.
This patch should fix all of the above undesirable behaviours.
Signed-off-by: Petr Rockai <prockai@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Attach \0 for proper char* display - otherwise somewhat random message could
be displayed in debug more and read of unpredictable read of uninitilized
memory values could happen.
As code uses strncpy(system_id, NAME_LEN) and doesn't set '\0'
Fix it by always allocating NAME_LEN + 1 buffer size and with zalloc
we always get '\0' as the last byte.
This bug may trigger some unexpected behavior of the string operation
code - depends on the pool allocator.
FIXME: refactor this code to alloc_vg.
Missing free_vg on error_path in lvmcache_get_vg fn. Call destroy_instance
only if the fid is not part of the vg in backup_read_vg fn (otherwise it's
part of the VG we're returning and we definitely don't want to destroy it!).
This is necessary for proper format instance ref_count support. We iterate
over vg->pvs and vg->removed_pvs list and the ref_count is decremented and
then it is destroyed if not referenced anymore.
Since format instances will use own memory pool, it's necessary to properly
deallocate it. For now, only fid is deallocated. The PV structure itself
still uses cmd mempool mostly, but anytime we'd like to add a mempool
in the struct physical_volume, we can just rename this fn to free_pv and
add the code (like we have free_vg fn for VGs).
This is essential for proper format instance ref_count support. We must
use these functions to set the fid everywhere from now on, even the NULL
value!
Format instances can be created anytime on demand and it contains
metadata area information mostly (at least for now, but in the future,
we may store more things here to update/edit in a PV/VG). In case we
have lots of metadata areas, memory consumption will rise. Using cmd
context mempool is not quite optimal here because it is destroyed too
late. So let's use a separate mempool for format instances.
Reference counting is used because fids could be shared, e.g. each PV
has either a PV-based fid or VG-based fid. If it's VG-based, each PV has
a shared fid with the VG - a reference to VG's fid.
Add _lv_postorder_vg() - for calling _lv_postorder() for every LV from VG.
We use this in 2 places - vg_mark_partial_lvs() and vg_validate()
so make it as a one function.
Benefit here is - to use only one cleanup code and avoid
potentially duplicate scans of same LVs.
Accelerate validation loop by using lvname, lvid, pvid hash tables.
Also merge pvl loop into one cycle now - no need to scan the list twice.
List scan is stopped when dm_hash_insert fails.
The error message with loop_counter1 is no longer provided - however
the message has been misleading anyway.
Create new function alloc_vg() to allocate VG structure.
It takes pool_name (for easier debugging).
and also take vg_name to futher simplify code.
Move remainder of _build_vg_from_pds to _pool_vg_read
and use vg memory pool for import functions.
(it's been using smem -> fid mempool -> cmd mempool)
(FIXME: remove mempool parameter for import functions and use vg).
Move remainder of the _build_vg to _format1_vg_read
lvseg_segtype_dup used memory pool vg memory pool for strind duplication.
However this one gets released before reporting happens so the command like:
pvs -o segtype
prints data from already released memory pool. Thanks to the fact there
is not much allocation happing after the VG is released, the memory
stays unmodified and correct result is printed.
Fix adds support for mempool passed parameter (like other similar
query commands) and uses dm_report memory pool for string duplication.
We allow writing non-orphan PVs only for resize now. The "orphan PV" assert
in pv_write fn uses the "allow_non_orphan" parameter to control this assert.
However, we should find a more elaborate solution so we can remove this
restriction altogether (pv_write together with vg_write is not atomic, we
need to find a safe mechanism so there's an easy revert possible in case of
an error).
There is a lot to test.
Two new config settings added that are intended to make the code behave
closely to the way it did before - worth a try if you find problems.
If the PV is already part of the VG (so the pv->fid == vg->fid), it makes no
sense to attach the mdas information from PV to a VG. Instead, we read new
PV metadata information from cache and attach it to the VG fid.
This function also sets a reference to a new VG format instance for all PVs
that are part of the VG so the PV-VG interconnection is consistent after the
change.
Add supporting functions to work with the format instance and metadata area
structures stored within the format instance. Add support for simple indexing
of metadata areas using PV id and mda order (for on-disk PV only for now, we
can extend the indexing even for other mdas if needed - we only need to define
a proper key for the index).
Fixing some const warnings - with API change in:
int vg_extend(struct volume_group *vg, int pv_count, const char *const *pv_names,
Change is needed - as lvm2api expects const behaviour here.
So vg_extend() is doing local strdup for unescaping.
skip_dev_dir return const char* from const char* vg_name.
Rest of the patch is cleanup of related warnings.
Also using dm_report_filed_string() API change to simplify
casting in _string_disp and _lvname_disp.
New strategy for memory locking to decrease the number of call to
to un/lock memory when processing critical lvm functions.
Introducing functions for critical section.
Inside the critical section - memory is always locked.
When leaving the critical section, the memory stays locked
until memlock_unlock() is called - this happens with
sync_local_dev_names() and sync_dev_names() function call.
memlock_reset() is needed to reset locking numbers after fork
(polldaemon).
The patch itself is mostly rename:
memlock_inc -> critical_section_inc
memlock_dec -> critical_section_dec
memlock -> critical_section
Daemons (clmvd, dmevent) are using memlock_daemon_inc&dec
(mlockall()) thus they will never release or relock memory they've
already locked memory.
Macros sync_local_dev_names() and sync_dev_names() are functions.
It's better for debugging - and also we do not need to add memlock.h
to locking.h header (for memlock_unlock() prototyp).
Add configurable option to define minimal size of
of block device usable as a PV.
pv_min_size() is added to lvm-globals and it's being
initialized through _process_config.
Macro PV_MIN_SIZE is unused and removed.
New define DEFAULT_PV_MIN_SIZE_KB is added to lvm-global
and unlike PV_MIN_SIZE it uses KB units.
Should help users with various slow devices attached to the system,
which cannot be easily filtered out (like FDD on /dev/sdX):
https://bugzilla.redhat.com/show_bug.cgi?id=644578
activated.
In order to achieve this, we need to be able to query whether
the origin is active exclusively (a condition of being able to
add an exclusive snapshot).
Once we are able to query the exclusive activation of an LV, we
can safely create/activate the snapshot.
A change to 'hold_lock' was also made so that a request to aquire
a WRITE lock did not replace an EX lock, which is already a form
of write lock.
Stop calling fs_unlock() from lv_de/activate().
Start using internal lvm fs cookie for dm_tree.
Stop directly calling dm_udev_wait() and
dm_tree_set/get_cookie() from activate code -
it's now called through fs_unlock() function.
Add lvm_do_fs_unlock()
Call fs_unlock() when unlocking vg where implicit unlock solves the
problem also for cluster - thus no extra command for clustering
environment is required - only lvm_do_fs_unlock() function is added
to call lvm's fs_unlock() while holding lvm_lock mutex in clvmd.
Add fs_unlock() also to set_lv() so the command waits until devices
are ready for regular open (i.e. wiping its begining).
Move fs_unlock() prototype to activation.h to keep fs.h private
in lib/activate dir and not expose other functions from this header.
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating config tree inside the function allow config tree to
be passed as parameter - usable later for caching.
As const segment_type or const format_type are never released
use their non-const version and remove const downcast from dm_free calls.
This change fixes many gcc warnings we were getting from them.
To have better control were the config tree could be modified use more
const pointers and very carefully downcast them back to non-const
(for config tree merge).
Set cmd->independent_metadata_areas if metadata/dirs or disk_areas in use.
- Identify and record this state.
Don't skip full scan when independent mdas are present even if memlock is set.
- Clusters and OOM aren't supported, so no problem doing the proper scans.
Avoid revalidating the label cache immediately after scanning.
- A simple optimisation.
Support scanning for a single VG in independent mdas.
- Not used by the fix but I left it in anyway as later patches might use it.
This reset of vgmem pointer causes access of already released memory.
(_vg_make_handle allocates vg from vgmem pool itself - which is a bit tricky)
Interestingly this memory fault was missed by our test suite.
Set vg to NULL after releasing it as the following memlock() test may
lead to goto for the second call of vg_release() with the already
released vg pointer.
Problem:
When both legs of a mirrored log fail, neither the log nor the parent
mirror can proceed. The repair code must be careful to replace the
log with an error target before operating on the parent - otherwise,
the parent can get stuck trying to suspend because it can't push through
any writes. The steps to replace the log device with an error target
were incomplete and resulted in the replacement not happening at all!
The code originally had all the necessary logic to complete the
replacement task, but was pulled out in a effort to clean-up that
section of code, while fixing another bug:
<offending commit msg>
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
</offending commit msg>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/LVM2/lib/metadata/mirror.c?cvsroot=lvm2&f=h#rev1.130
I've reverted the "clean-up" changes associated with that fix, but not what
that commit was actually fixing.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
page.
Add ->target_name to segtype_handler to allow a more specific target
name to be returned based on the state of the segment.
Result of trying to merge a snapshot using a kernel that doesn't have
the snapshot-merge target:
Before:
# lvconvert --merge vg/snap
Can't expand LV lv: snapshot target support missing from kernel?
Failed to suspend origin lv
After:
# lvconvert --merge vg/snap
Can't process LV lv: snapshot-merge target support missing from kernel?
Failed to suspend origin lv
Unable to merge LV "snap" into it's origin.
Use _even_rand() function instead of floor() in _bitset_with_random_bits().
floor() function is missing in dietlibc (on architectures other than x86).
Moreover using floor() to clip rand results does not assure even result
distribution. _even_rand() uses integer arithmetic only and is designed to
return evenly distributed results.
> Looks OK to me. It took a while to decipher what is the exact meaning of
> the loop in _even_rand (to a non-pseudorandomness-expert) but I am
> fairly comfortable with it now. If I understand this correctly, it
> rejects numbers that come from an "incomplete" slice of the RAND_MAX
> space (considering the number space [0, RAND_MAX] is divided into some
> "max"-sized slices and at most a single smaller slice, between [n*max,
> RAND_MAX] for suitable n -- numbers from this last slice are discarded
> because they could distort the distribution in favour of smaller
> numbers).
Signed-off-by: Przemyslaw Iskra <sparky <at> pld-linux.org>
Reviewed-by: Petr Rockai <prockai <at> redhat.com>
In other LVM memory structures such as volume_group, the field
used to store flags is called "status", and on-disk fields are called
'flags', so rename the one inside metadata_area to be consistent.
Not only is it more consistent with existing code but is cleaner
to say "the status of this mda is ignored".
Background for this patch - prajnoha pinged me on IRC this morning
about a fix he was working on related to metadataignore when
metadata/dirs was set. I was reviewing my patches from this year
and realized the 'flags' field was probably not the best choice
when I originally did the metadataignore patches.
Add supporting functions for vg_name, vg_fmt, vg_system_id.
Append "_dup" to end of supporting functions to make clear the strings
are dup'd and to avoid namespace conflict with vg_name.
Add supporting functions for pv_uuid, vg_uuid, and lv_uuid.
Call new function id_format_and_copy. Use 'const' where appropriate.
Add "_dup" suffix to indicate memory is being allocated.
Call {pv|vg|lv}_uuid_dup from lvm2app uuid functions.
This patch addresses code review request to simplify creation of 'attr'
strings. The simplification is done in this separate patch to more
easily review and ensure the simplification is done without error.
Move the creating of the 'attr' strings into a common function so
they can be called from the 'disp' functions as well as the new
'get' property functions.
Add "_dup" suffix to indicate memory is allocated.
Refactor pvstatus_disp to take pv argument and call pv_attr_dup().
This patch is similar to the other patches for pv and vg
functionality, and separates lv functionality into separate
files, concentrating on reporting fields and simple functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out pv functions and data, particularly
related to the reporting fields calculations.
More code could be moved here but for now I'm stopping at reporting
functions 'get' / 'set' functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out vg functions and data, particularly
related to the reporting fields calculations.
introduced in commit b16b4d92a7
"Improve various log messages."
fixes a lot of
../include/metadata.h:148: warning: type qualifiers ignored on function return type
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
In all top vg read functions only LCK_VG_READ/WRITE can be used.
All other vg lock definitions are low-level backend machinery.
Moreover, LCK_WRITE cannot be tested through bitmask.
This patch fixes these mistakes.
For _recover_vg() we do not need lock_flags, it can be only
two of above and we always upgrading to LCK_VG_WRITE lock there.
(N.B. that code is racy)
There is no functional change in code (despite wrong masking
it produces correct bits:-)
One shiny day we should use libblkid here. But now using LUKS is
very common together with LVM and pvcreate destroys LUKS completely.
So for user's convenience, try to detect LUKS signature and allow abort.
pvcreate detects MD and swap signature.
The logic hidden there is not only documented but it is also
user unfriendly. Who invented this logic should run pvcreate
on its own critical MD device to see why;-)
This patch
- creates one function instead of duplication code
- asks if user want to overwrite signature
- allows aborting (!)
(Please note that writing LVM signatute without wiping old
is wrong, it confuses blkid, MD will not work anyway and
swap and LUKS is broken too.)
Ignore snapshots when performing mirror recovery beneath an origin.
Pass LCK_ORIGIN_ONLY flag around cluster.
Add suspend_lv_origin and resume_lv_origin using LCK_ORIGIN_ONLY.
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
An incorrect fix on July 13, 2010 for an annoyance has caused a regression.
The offending check-in was part of the 2.02.71 release of LVM. That
check-in caused any PVs specified on the command line to be ignored when
performing a mirror split.
This patch reverses the aforementioned check-in (solving the regressions)
and posits a new solution to the list reversal problem. The original
problem was that we would always take the lowest mimage LVs from a mirror
when performing a split, but what we really want is to take the highest
mimage LVs. This patch accomplishes that by working through the list in
reverse order - choosing the higher numbered mimages first. (This also
reduces the amount of processing necessary.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Takahiro Yasui <takahiro.yasui@hds.com>
all but one mirror leg.
<patch header>
To handle a double failure of a mirrored log, Jon's two patches are
commited, however, lvconvert command can't still handle an error
when mirror leg and mirrored log got failure at the same time.
[Patch]: Handle both devices of a mirrored log failing (bug 607347)
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00009.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00027.html
[Patch]: Handle both devices of a mirrored log failing (bug 607347) -
additional fix
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00093.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00101.html
In the second patch, the target type of mirrored log is replaced with
error target when remove_log is set to 1, but this procedure should be
also used in other cases such as the number of mirror leg is 1. This
patch relocates the procedure to the main path.
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
- With Jon's second patch, we think that we don't have to call
remove_mirror_log() in _lv_update_mirrored_log() because will be
handled remove_mirror_images() in _lvconvert_mirrors_repaire().
</patch header>
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
The cluster log daemon (cmirrord) is not multi-threaded and
can handle only one request at a time. When a log is stacked
on top of a mirror (which itself contains a 'core' log), it
creates a situation that cannot be solved without threading.
When the top level mirror issues a "resume", the log daemon
attempts to read from the log device to retrieve the log
state. However, the log is a mirror which, before issuing
the read, attempts to determine the 'sync' status of the
region of the mirror which is to be read. This sync status
request cannot be completed by the daemon because it is
blocked on a read I/O to the very mirror requesting the
sync status.
mirrors, we must also check that the log daemon (cmirrord) is running.
The log module can be auto-loaded, but the daemon cannot be
"auto-started". Failing to check for the daemon produces cryptic
messages that customers have a hard time deciphering. (The system
messages do report that the log daemon is not running, but people
don't seem to find this message easily.)
Here are examples of what is printed when the module is available,
but the log daemon has not been started.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg
Shared cluster mirrors are not available.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg -v
Setting logging type to disk
Finding volume group "vg"
Archiving volume group "vg" metadata (seqno 3).
Creating logical volume lv
Executing: /sbin/modprobe dm-log-userspace
Cluster mirror log daemon is not running
Shared cluster mirrors are not available.
Creating volume group backup "/etc/lvm/backup/vg" (seqno 4).
The main problem with these bugs was that the newly split
off LV was not being suspended properly. This meant that
the memlock count was not being balanced, the DM devices
were not being renamed, and some DM devices which should
have been removed were not.
I've also renamed some of the variables and added comments
to make things clearer as to what is going on. (I can break
this patch in two if it means easier review.)
even though there was no log. A simple run through the in-tree test
suite would have caught this. :(
- if (lv_is_mirrored(detached_log_lv) &&
+ if (detached_log_lv && lv_is_mirrored(detached_log_lv) &&
Also, made some cosmetic changes suggested by kabi after my last check-in
(e.g. s/return 0/return_0/ and adding an error message).
A previous check-in added logic to handle the case where both images
of a mirrored log failed. It solved the problem by simply removing
the log entirely - leaving the parent mirror with a 'core' log. This
worked for most cases. However, if there was a small delay between
the failures of the two mirrored log devices, the mirror would hang,
LVM would hang, and no additional LVM commands could be issued.
When the first leg of the log fails, it signals the need for repair.
Before 'lvconvert --repair' is run by dmeventd, the second leg fails.
'lvconvert' would see both devices as failed and try to remove the
log entirely. When it came time to suspend the parent mirror to
update the configuration, the suspend would hang because it couldn't
get any I/O through the mirrored log, which was plugged waiting for
corrective action. The solution is to replace the log with an error
target to clear any pending writes before removing it. This allows
the parent mirror to suspend and make the proper changes.
Pass metadataignore through PV creation / setup paths.
As a result of this cleanup, we can remove the unnecessary setting
of mda_ignore bits inside pvcreate_single(), after call to pv_create.
For now, just set metadataignore to '0' in some places. This is
equivalent to the prior functionality, although the 0 is given
by the caller not hardcoded in _mda_setup() call.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
- If a PV contained empty mdas, the auto-recovery code was not kicking in.
- The 'inconsistent' state was getting lost when metadata was cached so
recovery didn't kick in. But leave the behaviour alone when using
precommitted metadata because of a warning in a confusing FIXME.
In my testing, pvs and vgs didn't repair inconsistent metadata like they
used to do. (How many other tools fail similarly now?)
And there should be no need to cache inconsistent metadata because it is
supposed to get repaired under the protection of a write lock immediately it is
discovered.
This code is in need of a redesign based on first principles.
I still see bugs in this code and this commit is risky.
Allow metadataignore flag to be passed in to pvcreate.
Ideally, more refactoring of the mda allocation / initialization
is warranted, but for now, we just add another parameter to 'add_mda'
to take an existing mda ignored flag. We need to do this or pv_write
loses the state of the mda 'ignored' flag before copying and writing
to disk.
Print device name when setting or clearing metadata ignore bit.
Example:
label/label.c:160 /dev/loop2: lvm2 label detected
cache/lvmcache.c:1136 lvmcache: /dev/loop2: now in VG #orphans_lvm2 (#orphans_lvm2)
metadata/metadata.c:4142 Setting mda ignored flag for metadata_locn /dev/loop2.
format_text/text_label.c:318 Skipping mda with ignored flag on device /dev/loop2 at offset 4096
Logging isn't ideal, especially for mda_set_ignore. Ideally we'd
like to display the device name and offset in this case but this
requires a bit more work and a per-format 'mda_description' function
pointer definition (we don't have access to mda_context in
metadata.c).
In preparation to call this from both pvcreate as well as pvchange,
move the guts of metadataignore into a library function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Allowing an 'all' and 'unmanaged' value is more intuitive, and
provides a simple way for users to get back to original LVM behavior
of metadata written to all PVs in the volume group.
If the user requests "--vgmetadatacopies unmanaged", this instructs
LVM not to manage the ignore bits to achieve a specific number of
metadata copies in the volume group. The user is free to use
"pvchange --metadataignore" to control the mdas on a per-PV basis.
If the user requests "--vgmetadatacopies all", this instructs LVM
to do 2 things: 1) clear all ignore bits, and 2) set the "unmanaged"
policy going forward.
Internally, we use the special MAX_UINT32 value to indicate 'all'.
This 'just' works since it's the largest value possible for the
field and so all 'ignore' bits on all mdas in the VG will get
cleared inside _vg_metadata_balance(). However, after we've
called the _vg_metadata_balance function, we check for the special
'all' value, and if set, we write the "unmanaged" value into the
metadata. As such, the 'all' value is never written to disk.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The check in vg_split_mdas will trigger an error if the 'from' vg
list is empty. However, this might be ok in some instances now
that we have ignored mdas. Relax this check so an error is triggered
only in the case where there's truly no more mdas in the 'from'
vg.
One example of where this makes a difference is with vgreduce.
If we try to vgreduce a PV with un-ignored mdas, this should trigger
the balancing function to un-ignore mdas on another PV in the VG.
However, we don't get to vg_write() before we fail because this
list size check fails, and we see an error message indicating:
"Cannot remove final metadata area ..."
Another example is with vgsplit into a new VG, where the PVs
being moved contain all ignored mdas. We must move the mdas on
fid->metadata_areas_ignored from 'vg_from' to 'vg_to'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The vgextend path calls add_pv_to_vg(). Inside add_pv_to_vg(),
we must ensure we pass the correct mdas list into pv_setup(), as
copies of mdas are placed on the vg->fid list. If we don't place
the mdas on the correct vg->fid list, the various counts may be
incorrect and the metadata balance algorithm will not work when
called from vg_write() path.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Compare the value of the newly added vg_mda_copies field
(--vgmetadatacopies parameter) with the current count of
in-use mdas and ignoring or unignoring mdas as necessary to
get to the target count. Also, as a safety check before
returning, ensure we have at least one mda enabled.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the get and partially implemented set function.
The 'set' function should probably ignore or un-ignore metadata areas
based on new values.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a field to struct volume_group to later implement metadata
balancing:
- mda_copies: target # of non-ignored mdas in the VG; default 0 (do
not control pv 'ignore mdas' bit.
This patch just adds the parameter to the structures with the default
values but does not modify any commands. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Arrange mdas so mdas that are to be ignored come first. This is an
optimization that ensures consistency on disk for the longest period of time.
This was noted by agk in review of the v4 patchset of pvchange-based mda
balance.
Note the following example for an explanation of the background:
Assume the initial state on disk is as follows:
PV0 (v1, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
If we did not sort the list, we would have a commit sequence something like
this:
PV0 (v2, non-ignored)
PV1 (v2, ignored)
PV2 (v2, ignored)
PV3 (v2, non-ignored)
After the commit of PV0's mdas, we'd have an on-disk state like this:
PV0 (v2, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is an inconsistent state of the disk. If the machine fails, the next
time it was brought back up, the auto-correct mechanism in vg_read would
update the metadata on PV1-PV3. However, if possible we try to avoid
inconsistent on-disk states. Clearly, because we did not sort, we have
a greater chance of on-disk inconsistency - from the time the commit of
PV0 is complete until the time PV3 is complete.
We could improve the amount of time the on-disk state is consistent by simply
sorting the commit order as follows:
PV1 (v2, ignored)
PV2 (v2, ignored)
PV0 (v2, non-ignored)
PV3 (v2, non-ignored)
Thus, after the first PV is committed (in this case PV1), on-disk we would
have:
PV0 (v1, non-ignored)
PV1 (v2, ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is clearly a consistent state. PV1 will be read but the mda will be
ignored. All other PVs contain v1 metadata, and no auto-correct will be
required. In fact, if we commit all PVs with ignored mdas first, we'll
only have an inconsistent state when we start writing non-ignored PVs,
and thus the chances we'll get an inconsistent state on disk is much
less with the sorted method.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When we are constructing the vg, we may need to adjust the list of
metadata_areas if there are ignored mdas. At label read time, we
do not read the metadata of ignored mdas, and as a result, they do
not get placed on vg->fid->metadata_areas inside _text_create_text_instance
since lvmcache does not have these areas attached to vginfo->infos.
However, when we're checking the pvids inside _vg_read, after having
read another metadata area from another PV, we do have the opportunity
to update the metadata_area and metadata_areas_ignored lists based
on the read metadata_area. We need accurate mda lists for the reporting
functions that count the ignored mdas, as well as general correctness
of mda balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
With the addition of ignored mdas, we replace all checks for an empty
mda list with a new function to look for either an empty mda list or
ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a helper function to consolidate checking for an empty mdas list
or ignored mdas. Ignored mdas should behave almost identically to
an empty mda list - the metadata areas should not be read or written
to. This function will make it easier to implement metadata balancing
and easier to track pvs with an empty mda list or ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Define a new pvs field, pv_mda_used_count, and a new vgs field,
vg_mda_used_count to match the existing pv_mda_count and vg_mda_count.
These new fields count the number of mdas that have the 'ignored' bit
clear (they are in use on the PV / VG). Also define various supporting
functions to implement the counting as well as setting the ignored
flag and determining if an mda is ignored. These high level functions
call into the lower level location independent mda ignore functions
defined by earlier patches.
Note that counting ignored mdas in a vg requires traversing both lists
and checking for the ignored bit on the mda. The count of 'ignored'
mdas then is defined by having the bit set, not by which list the mda
is on. The list does determine whether LVM actually does read/write to
the mda, though we must count the bits in order to return accurate numbers
for the various counts. Also, pv_mda_set_ignored must search both vg
lists for ignored mda. If the state changes and needs to be committed
to disk, the ignored mda will be on the non-ignored list.
Note also in pv_mda_set_ignored(), we must properly manage the mda lists.
If we change the ignored state of an mda, we must change any mdas on
vg->fid->metadata_areas that correspond to this pv. Also, we may
need to allocate a copy of the mda, as is done when fid->metadata_areas
is populated from _vg_read(), if we are un-ignoring an ignored mda.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a second mda list, metadata_areas_ignored to fid, and a couple
functions, fid_add_mda() and fid_add_mdas() to help manage the lists.
These functions are needed to properly count the ignored mdas and
manage the lists attached to the 'fid' and ultimately the 'vg'.
Ensure metadata_areas_ignored is initialized in other formats, even
if the list is never used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Because of the way mdas are handled internally, where a PV in a VG
has mdas on both info->mdas and vg->fid->metadata_areas list, we
need a location independent copy constructor for struct
metadata_area. Break up the existing format-text specific copy
constructor into a format independent piece and a format dependent
piece.
This function is necessary to properly implement pv_set_mda_ignored().
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
A metadata_area is defined independent of the location. One downside
is that there is no obvious mapping from a pv to an mda. For a PV in
a VG, we need a way to start with a PV and end up with an MDA, if we
are to manage mdas starting with a device/pv. This function provides
us a way to go down the list of PVs on a VG, and identify which ones
match a particular PV.
I'm not entirely happy with this approach, but it does fit into the
existing structures in a reasonable way.
An alternative solution might be to refactor the VG - PV interface such
that mdas are a list tied to a PV. However, this seemed a bit tricky since
a PV does not come into existence until after the list of mdas is
constructed (see _vg_read() - we create a 'fid' and attach mdas to it,
then we go through them and attach pvs).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
First we add a 'flags' field to the location independent
metadata_area structure, and a MDA_IGNORE flag. The
mda_is_ignored and mda_set_ignored functions are added to
manage the flag. Adding the flag and functions gives a
library interface to ignore metadata areas independent of
the underlying location (disk, file, etc). The location
specific read/write functions must then handle the specifics
of what this flag means to the location.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
The same region size is used for both mirror volume and mirrored
log volume, but when the physical extent size is bigger than region size,
the size of mirror leg for mirrored log is smaller than the region size
and lvcreate command fails.
This patch adjusts a region size of mirrored log to a smaller value of
region size or physical extent size.
[This patch ensures that the region_size of the mirrored log does not
exceed the size of the mirrored log itself, which would violate the
kernel constraint: (region_size <= ti->len).]
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
converting from 2-way to 3-way mirror (collapse_mirrored_lv)
was calling '_remove_mirror_images' with the 'remove_log'
parameter set. When the code was put in to fix 599898 to
honor log parameters during conversion, this argument was
suddenly being honored. Thus, when someone would convert from
a 2-way to 3-way mirror, the log would get removed.
'collapse_mirrored_lv' should not be calling '_remove_mirror_images'
with 'remove_log' set.
to 3-way mirror. When conversion operations are performed on
these types of mirrors, log options can be confused/ignored.
In the case of a converting 3-way mirror, we have a top-level
2-way corelog mirror whose legs are 1) a 2-way disk-log mirror
and 2) a linear device. If we wish to convert this 3-way mirror
to a 2-way mirror, the linear device is removed and the extra
top layer is eliminated. If we also wished to convert the disk
log to a core log in the same step, ambiguity creeps in. It is
somewhat obvious what the user wants - a 2-way mirror with a
corelog. However, looking at the top level mirror before
compression, it seems that the mirror already has a core log.
This is why the operation seemed to fail.
This patch simply re-evaluates what mirrored_seg points to after
a compression and then considers the log argument.
This is a fix for bug 599898.
Introduce struct cmd_vg to store information about needed
volume group name, vgid, flags and the pointer to opened VG.
Keep VGs list in alphabetical order for locking order.
Introduce functions:
cmd_vg_add() add new cmd_vg entry.
cmd_vg_lookup() search cmd_vgs for vg_name.
cmd_vg_read() open VGs in cmd_vgs list.
cmd_vg_release() close VGs in reversed order.
Add pointer to linked list of opened VGs. List temporarily keeps
the information about needed or locked and opened VGs for replicator target.
Also add cmd_missing_vgs flag information for quick check and
also for possible continuos process_each_lv() usage where we need
to detect whether failure has been caused by missing VG or
some other reason.
Adding configure.in support for Replicators.
Adding basic lib lvm support for Replicators.
Adding flags REPLICATOR and REPLICATOR_LOG.
Adding segments SEG_REPLICATOR and SEG_REPLICATOR_DEV.
Adding basic methods for handling replicator metadata.
Some commands start with a pvname, but we'd like to force users to
start with a vg handle to obtain a pv handle. Our best option seems
to be providing a way to look up the vgname from the pvname, and then
require them to use vg_read/vg_open.
In addition to the pvname lookup function, this patch also provides a
lookup by pvid. The lookup by pvid can be used in conjunction with
lvmcache_get_pvids to process all pvs in the system.
The pvid find function first calls lvmcache_vgname_from_pvid, which may
cause the label to be read if it is not in the cache. If the vgname is
returned is an orphan, we then check to see if there are metadata areas,
and if not, we scan every PV on the system by calling scan_vgs_for_pvs().
In most cases we should not need to do this, and by using the info->mdas
count, we avoid calling pv_read() as prior code did. So this patch is a
bit cleaner and should allow us to refactor more of the pv code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
are active mirrors or snapshots.
We don't have the mechanisms in place to change the device-mapper
tables for those targets that have behavioral differences between
cluster and single machine instances. Allowing users to change
the attribute but not changing the target's behavior can lead to
data corruption.
The following bugs are fixed/avoided by this patch:
235123 - vgchange -c [ny] do not change target types when necessary
289331 - RFE: switching from cluster domain to local domain needs to deactivate volume somehow
289541 - when changing from local to cluster, volumes can not appear to be deactivated
being able to remove more images from a mirror than the
number of PVs directly specified for removal.
The effort to fix bug 581611 corrected a bug that was unnoticed
at the time. The loop in _remove_mirror_images that looks over
the specified PVs was allowing devices that were previously
counted and moved to the end of the list to be double-counted.
This resulted in the number of devices needed for removal always
being satisfied - even if the user did not specify enough PVs
for removal to satisfy the request. When 581611 was fixed, this
double-counting no longer took place and the result was to remove
only the minimum of the number of PVs specified or the number
that was asked to be removed.
By simply always setting 'new_area_count' (as used to be done
only in the else statement), we return to the previous behavior.
Indeed, this is exactly what the double-counting was allowing
to happen before the fix of 581611.
Allow lv_remove_with_dependencies() to know the top-level LV that was
requested to be removed (otherwise it recurses and we lose context).
A merging snapshot cannot be removed directly but the associated origin
can be. Disallow removal of a merging snapshot unless the associated
origin is also being removed.
We should write metadata into next position in the ring buffer while calling
vgrename and vgcfgrestore. At this code level (_vg_write_raw), we were not able
to determine if this is a rename or not. If yes, then accompanying VG structure
passed here has a new name set, not the old one.
When looking for a location where to put metadata next, we were given a NULL
value because of failed VG name comparison (in _find_vg_rlocn) between the
name in existing metadata and metadata we're just about to write.
This resets the position in the ring buffer, overwriting any existing metadata
(and also incorrectly updates the cache to "orphan" afterwards).
This patch just adds old_name item in struct volume_group that we can check and use
if necessary and detect renames at lower layers as well.
The same applies for vgcfgrestore, but here we're using a special value of
old_name, an empty string, to disable the check with existing metadata totally.
lvm2app needs a link back to the vg in order to use the vg handle for
memory allocations as well as other things. This patch adds the field
to struct physical_volume, and sets pv->vg when reading a vg from disk or
extending a vg by using the helper function previously added,
add_pvl_to_vgs(). Moves and renames are handled with separate code
inside move_pv() and vgmerge(). Add pv->vg check to vg_validate().
A NULL value in pv->vg signifies membership in the orphan VG.
Note though in the case of pv_read() on a device with metadatacopies == 0,
more devices may need to be read for an authoritative answer.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Now that we have library functions to add/delete a pv from the vg->pvs
list, call them from everywhere.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a delete function to manage the vg->pvs list.
NOTE: It may be possible to do further cleanup to these add/del functions
by passing a 'pv' as input instead of 'pv_list'. The pv_list is used for
functions which do allocations (lvcreate) while other places in the code
just manage a list of 'pv' (e.g. import functions, vgextend, etc).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
A user specifying duplicate paths on the cmdline of vgcreate will
get a message similar to the following:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Internal error: Duplicate PV id jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8 detected for /dev/loop3 in vgtest2.
This is caught by vg_validate(), but it would be good to find
this condition earlier in the vgcreate code. add_pv_to_vg()
currently checks by pvname, but does not look for duplcate pvids.
This patch adds the check for duplicate pvids and results in new
error output as follows:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Physical volume '/dev/loop5 (jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8)' listed more than once.
Unable to add physical volume '/dev/loop5' to volume group 'vgtest2'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When moving parts of striped LVs, pvmove wouldn't care about leaving you with
two stripes on the same disk. Now --alloc anywhere is needed for that.
(Tried and gave up on two alternative approaches before the one committed here.)
Small refactor of main places in the code where a pv is added to a
vg into a small function which adds the pv to the list and updates
the vg counts.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
In add_pv_to_vg(), we should only add the pv to vg->pvs after all
internal checks have passed. The check for vg->extent_count exeeding
maximum was after we added the pv to the list, so this function could
return a state of vg->pvs that did not reflect other parameters such
as vg->pv_count.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Physical segments were still allocated from global
command context mempool.
This leads to very high memory usage when
activating large VG (vgchange).
(Memory usage was about 2G when >3000LVs).
Fix it by properly using vg->vgmem private pool,
so all the memory is released early.
New memory pool parameter is needed here for pv_split_segment
function.
Also fix the same problem in some minor allocations
(vg description, lv segment split).
In addition to previous patch, we really do not need
to search for segment which was just allocated in
split request.
Make pv_split_segment function return newly allocated
(split) segment also.
(So after this patch, there is only one user
of slow find_peg_by_pe).
The function find_peg_by_pe is incredibly inefficient
for Pvs with many segments.
In shiny future there should be binary (or interval) tree
instead of sorted linked list (volunteers?).
Anyway, for now, we can use dirty trick here to optimise this case:
- Allocations are usually applied from the beginning
of PV (we have no alloocation policy which allocates areas
"backwards")
- The only user of find_peg_by_pe is pv_split_segment()
call. In *most* cases it need to split *last* PV segment.
So if we search sorted pv segment list backwards, we
hit the requested segment immediatelly.
This patch applies this tiny change.
(and saves >30% of processing time when >3000LVs segments are on one PV!)
To discourage using this inefficient function from other code,
it is moved to pv_manip.c and used static for now:-)
This check-in enables the 'mirrored' log type. It can be specified
by using the '--mirrorlog' option as follows:
#> lvcreate -m1 --mirrorlog mirrored -L 5G -n lv vg
I've also included a couple updates to the testsuite. These updates
include tests for the new log type, and some fixes to some of the
*lvconvert* tests.
This is the next preparatory step towards better --alloc anywhere
support and is not intended to break anything that currently works so
please report any problems - segfaults, bogus data in the new debug
messages, or if the code now chooses bizarre allocation layouts.
. Add "monitoring" option to "activation" section of lvm.conf
. Have clvmd consult the lvm.conf "activation/monitoring" too.
. Introduce toollib.c:get_activation_monitoring_mode().
. Error out when both --monitor and --ignoremonitoring are provided.
. Add --monitor and --ignoremonitoring support to lvcreate. Update
lvcreate man page accordingly.
. Clarify that '--monitor' controls the start and stop of monitoring in
the {vg,lv}change man pages.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When we pv_read() a device that has an orphan vgname, we might need to scan
the system to be sure this is true. However, if the PV has mdas, there's
no way possible for it to have an orphan vgname unless it is a true orphan.
Some areas of the code were optimized to take advantage of this fact, while
others were not (we would still do the expensive scan if a device had mdas
but had an orphan VG).
This patch unifies the code so that every place we are operating on such
a PV, we skip the expensive scan if there are mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Petr Rockai <prockai@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
If user try to vgcreate or vgextend non-existent VG,
these messages appears:
# vgcreate xxx /dev/xxx
Internal error: Volume Group xxx was not unlocked
Device /dev/xxx not found (or ignored by filtering).
Unable to add physical volume '/dev/xxx' to volume group 'xxx'.
Internal error: Attempt to unlock unlocked VG xxx.
(the same with existing VG and non-existing PV & vgextend)
# vgextend vg_test /dev/xxx
...
It is caused because code tries to "refresh" cache if
md filter is switched on using cache destroy.
But we can change filters and rescan even without this
machinery now, just use refresh_filters
(and reset md filter afterwards).
(Patch also discovers cache alias bug in vgsplit test,
fix it by using better filter line.)
The kernel's blk_stack_limits() function may flag a device as
'misaligned'. If it does the alignment_offset will be -1.
Update set_pe_align_offset() to accommodate this corner case.
We need to allocate memory for the tag and copy the tag value before we
add it to the list of tags. We could put this inside lvm2app since the
tools keep their memory around until vg_write/vg_commit is called, but
we put it inside the internal library to minimize code in lvm2app.
We need to copy the tag passed in by the caller to ensure the lifetime of
the memory until the {vg|lv} handle is released.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Similar refactoring to vgchange - pull out common parts and put into
library function for reuse. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Pull out common code to be called from tools as well as lvm2app.
Leave archive() at tool level so we can use from vgcreate
as well as vgchange. Should be no functional change.
- add stack macro in vgchange
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a merging snapshot to the deptree, using the "error" target, rather
than avoid adding it entirely. This allows proper cleanup of the -cow
device without having to rename the -cow to use the origin's name as a
prefix.
Move the preloading of the origin LV, after a merge, from
lv_remove_single() to vg_remove_snapshot(). Having vg_remove_snapshot()
preload the origin allows the -cow device to be released so that it can
be removed via deactivate_lv(). lv_remove_single()'s deactivate_lv()
reliably removes the -cow device because the associated snapshot LV,
that is to be removed when a snapshot-merge completes, is always added
to the deptree (and kernel -- via "error" target).
Now when the snapshot LV is removed both the -cow and -real devices
get removed using uuid rather than device name. This paves the way
for us to switch over to info-by-uuid queries.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Internally we store sizes in sectors, but lvm2app exports sizes
in bytes. We could get fancier and allow units configuration but
this fix should do for now.
Fixes rhbz561422.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>