IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Commit 9fd7ac7d03 did not handle mirrors
that contained mirrored logs. This is because the status line of the
mirror does not give an indication of the health of the mirrored log,
as you can see here:
[root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
Thus, the possibility for LVM commands to hang still persists when mirror
have mirrored logs. I discovered this while performing some testing that
does polling with 'pvs' while doing I/O and killing devices. The 'pvs'
managed to get between the mirrored log device failure and the attempt
by dmeventd to repair it. The result was a very nasty block in LVM
commands that is very difficult to remove - even for someone who knows
what is going on. Thus, it is absolutely essential that the log of a
mirror be recursively checked for mirror devices which may be failed
as well.
Despite what the code comment says in the aforementioned commit...
+ * _mirrored_transient_status(). FIXME: It is unable to handle mirrors
+ * with mirrored logs because it does not have a way to get the status of
+ * the mirror that forms the log, which could be blocked.
... it is possible to get the status of the log because the log device
major/minor is given to us by the status output of the top-level mirror.
We can use that to query the log device for any DM status and see if it
is a mirror that needs to be bypassed. This patch does just that and is
now able to avoid reading from mirrors that have failed devices in a
mirrored log.
Addresses: rhbz855398 (Allow VGs to be built on cluster mirrors),
and other issues.
The LVM code attempts to avoid reading labels from devices that are
suspended to try to avoid situations that may cause the commands to
block indefinitely. When scanning devices, 'ignore_suspended_devices'
can be set so the code (lib/activate/dev_manager.c:device_is_usable())
checks any DM devices it finds and avoids them if they are suspended.
The mirror target has an additional mechanism that can cause I/O to
be blocked. If a device in a mirror fails, all I/O will be blocked
by the kernel until a new table (a linear target or a mirror with
replacement devices) is loaded. The mirror indicates that this condition
has happened by marking a 'D' for the faulty device in its status
output. This condition must also be checked by 'device_is_usable()' to
avoid the possibility of blocking LVM commands indefinitely due to an
attempt to read the blocked mirror for labels.
Until now, mirrors were avoided if the 'ignore_suspended_devices'
condition was set. This check seemed to suggest, "if we are concerned
about suspended devices, then let's ignore mirrors altogether just
in case". This is insufficient and doesn't solve any problems. All
devices that are suspended are already avoided if
'ignore_suspended_devices' is set; and if a mirror is blocking because
of an error condition, it will block the LVM command regardless of the
setting of that variable.
Rather than avoiding mirrors whenever 'ignore_suspended_devices' is
set, this patch causes mirrors to be avoided whenever they are blocking
due to an error. (As mentioned above, the case where a DM device is
suspended is already covered.) This solves a number of issues that weren't
handled before. For example, pvcreate (or any command that does a
pv_read or vg_read, which eventually call device_is_usable()) will be
protected from blocked mirrors regardless of how
'ignore_suspended_devices' is set. Additionally, a mirror that is
neither suspended nor blocking is /allowed/ to be read regardless
of how 'ignore_suspended_devices' is set. (The latter point being the
source of the fix for rhbz855398.)
clvmd -d option parsing was not working properly.
clvmd -d 2 (with space) has been ignored because of
'::' used in getopt string, and as failsafe it's been used '1'.
Later this debug_arg has been ignored and debug_opt was used
instead which happend to have value '1'.
Submitted-by: Robert Milasan <rmilasan at suse.com>
Reported-by: Robert Milasan <rmilasan at suse.com>
Use log_warn to print non-fatal warning messages.
Use of log_error would confuse checker for testing
whether proper error has been reported for some real error.
Try to bring the lvmetad usage text and man page closer to the code.
There seem to be 3 useful ways to use -d with lvmetad at the moment:
-d all
-d wire
-d debug
(They can also be comma-separated like -d wire,debug.)
Prior to the last release, -d, -dd and -ddd were supported.
Fail if an unrecognised debug arg is supplied on the command line.
Change -V to report the same version as the lvm binary: previously it
just reported version 0.
Use configure --enable-python_bindings to generate them.
Note that the Makefiles do not yet control the owner or permissions of
the two new files on installation.
A while back, the behavior of LVM changed from allowing metadata changes
when PVs were missing to not allowing changes. Until recently, this
change was tolerated by HA-LVM by forcing a 'vgreduce --removemissing'
before trying (again) to add tags to an LV and then activate it. LVM
mirroring requires that failed devices are removed anyway, so this was
largely harmless. However, RAID LVs do not require devices to be removed
from the array in order to be activated. In fact, in an HA-LVM
environment this would be very undesirable. Device failures in such an
environment can often be transient and it would be much better to restore
the device to the array than synchronize an entirely new device.
There are two methods that can be used to setup an HA-LVM environment:
"clvm" or "tagging". For RAID LVs, "clvm" is out of the question because
RAID LVs are not supported in clustered VGs - not even in an exclusively
activated manner. That leaves "tagging". HA-LVM uses tagging - coupled
with 'volume_list' - to ensure that only one machine can have an LV active
at a time. If updates are not allowed when a PV is missing, it is
impossible to add or remove tags to allow for activation. This removes
one of the most basic functionalities of HA-LVM - site redundancy. If
mirroring or RAID is used to replicate the storage in two data centers
and one of them goes down, a server and a storage device are lost. When
the service fails-over to the alternate site, the VG will be "partial".
Unable to add a tag to the VG/LV, the RAID device will be unable to
activate.
The solution is to allow vgchange and lvchange to alter the LVM metadata
for a limited set of options - --[add|del]tag included. The set of
allowable options are ones that do not cause changes to the DM kernel
target (like --resync would) or could alter the structure of the LV
(like allocation or conversion).
The ExecStartPost with pvscan --cache in lvm2-lvmetad.service
is not needed now as this is called transparently within the
first LVM command that queries lvmetad.
For now this convertions is not supported, thus disabled.
The only supported conversion for now is to create mirrored thin pools
from mirrored devices.
It would be possible to activate a RAID LV exclusively in a cluster
volume group, but for now we do not allow RAID LVs to exist in a
clustered volume group at all. This has two components:
1) Do not allow RAID LVs to be created in a clustered VG
2) Do not allow changing a VG from single-machine to clustered
if there are RAID LVs present.
Update code for lvconvert.
Change the lvconvert user interface a bit - now we require 2 specifiers
--thinpool takes LV name for data device (and makes the name)
--poolmetadata takes LV name for metadata device.
Fix type in thin help text -z -> -Z.
Supported is also new flag --discards for thinpools.
MD's bitmaps can handle 2^21 regions at most. The RAID code has always
used a region_size of 1024 sectors. That means the size of a RAID LV was
limited to 1TiB. (The user can adjust the region_size when creating a
RAID LV, which can affect the maximum size.) Thus, creating, extending or
converting to a RAID LV greater than 1TiB would result in a failure to
load the new device-mapper table.
Again, the size of the RAID LV is not limited by how much space is allocated
for the metadata area, but by the limitations of the MD bitmap. Therefore,
we must adjust the 'region_size' to ensure that the number of regions does
not exceed the limit. I've added code to do this when extending a RAID LV
(which covers 'create' and 'extend' operations) and when up-converting -
specifically from linear to RAID1.
Don't try to issue discards to a missing PV to avoid segfault.
Prevent lvremove from removing LVs that have any part missing.
https://bugzilla.redhat.com/857554
Failing to clear the LV_NOTSYNCED flag when converting a RAID1 LV to
linear can result in the flag being present after an upconvert - even
if the sync is performed when upconverting.
Mirrors do not allow upconverting if the LV has been created with --nosync.
We will enforce the same rule for RAID1. It isn't hugely critical, since
the portions that have been written will be copied over to the new device
identically from either of the existing images. However, the unwritten
sections may be different, causing the added image to be a hybrid of the
existing images.
Also, we are disallowing the addition of new images to a RAID1 LV that has
not completed the initial sync. This may be different from mirroring, but
that is due to the fact that the 'mirror' segment type "stacks" when adding
a new image and RAID1 does not. RAID1 will rebuild a newly added image
"inline" from the existant images, so they should be in-sync.
The "fedora-wait-storage.service" that the "lvm2-activation.service"
had as a dependency (which was fedora-specific solution anyway)
is obsolete now as this unit called "modprobe scsi_wait_scan"
which is not used anymore.
The "fedora-wait-storage.service" had "systemd-udev-settle" as
its dependency, so let's depend on this one directly now,
bypassing the out-dated "fedora-wait-storage.service".
Using 'activation/auto_activation_volume_list = [ "vg/lvol1" ]'.
Before this patch:
3 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi-a--- 4.00m
lvol3 vg -wi-a--- 4.00m
(vg/lvol1 activated as it passes the list and all subsequent volumes too - wrong!)
With this patch:
1 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi----- 4.00m
lvol3 vg -wi----- 4.00m
(only vg/lvol1 activated as it passes the list and no other - correct!)
Issuing a 'lvchange --resync <VG>/<RAID_LV>' had no effect. This is
because the code to handle RAID LVs was not present. This patch adds
the code that will clear the metadata areas of RAID LVs - causing them
to resync upon activation.
By changing the conditional for resyncing mirrors with core-logs a
bit, we can short-circuit the rest of the function for that case
and reduce the amount of indenting in the rest of the function.
This cleanup will simplify future patches aimed at properly handling
the resync of RAID LVs.
It is necessary when creating a RAID LV to clear the new metadata areas.
Failure to do so could result in a prepopulated bitmap that would cause
the new array to skip syncing portions of the array. It is a requirement
that the metadata LVs be activated and cleared in the process of creating.
However in test mode, this requirement should be lifted - no new LVs should
be created or written to.
Fix setvbuf code by closing and reopening stream before changing buffer.
But we need to review what this code is doing embedded inside a library
function rather than the simpler original form being run independently
at the top of main() by tools that need it.
Accept -q as the short form of --quiet.
Suppress non-essential standard output if -q is given twice.
Treat log/silent in lvm.conf as equivalent to -qq.
Review all log_print messages and change some to
log_print_unless_silent.
When silent, the following commands still produce output:
dumpconfig, lvdisplay, lvmdiskscan, lvs, pvck, pvdisplay,
pvs, version, vgcfgrestore -l, vgdisplay, vgs.
[Needs checking.]
Non-essential messages are shifted from log level 4 to log level 5
for syslog and lvm2_log_fn purposes.
This patch adds support for RAID10. It is not the default at this
stage. The user needs to specify '--type raid10' if they would like
RAID10 instead of stacked mirror over stripe.
Remove the limit for major and minor number arguments used while specifying
persistent numbers via -My --major <major> --minor <minor> option which
was set to 255 before. Follow the kernel limit instead which is 12 bits
for major and 20 bits for minor number (kernel >= 2.6 and LVM formats
that does not have FMT_RESTRICTED_LVIDS - so still keep the old limit
of 255 for lvm1 format).
Allowing people to add devices to a VG that has PVs missing helps
people avoid the inability to repair RAID LVs in certain cases.
For example, if a user creates a RAID 4/5/6 LV using all of the
available devices in a VG, there will be no spare devices to
repair the LV with if a device should fail. Further, because the
VG is missing a device, new devices cannot be added to allow the
repair. If 'vgreduce --removemissing' were attempted, the
"MISSING" PV could not be removed without also destroying the RAID
LV.
Allowing vgextend to operate solves the circular dependency.
When the PV is added by a vgextend operation, the sequence number is
incremented and the 'MISSING' flag is put on the PVs which are missing.
A regression introduced in 2.02.89 (11e520256b)
caused the lvm dumpconfig <node> to print out
the node as well as its subsequent siblings.
The information about "only_one" mode got lost.
Before this patch (just an example node):
# lvm dumpconfig global/use_lvmetad
use_lvmetad=1
thin_check_executable="/usr/sbin/thin_check"
thin_check_options="-q"
(...all nodes to the end of the section)
With this patch applied:
# lvm dumpconfig global/use_lvmetad
use_lvmetad=1
If a daemon (like lvmetad that is using common daemon-server code)
received a kill signal that was supposed to shut the daemon down,
a spurious message was issued: "Failed to handle a client connection".
This happened if the kill signal came just in the middle of waiting
for a client request in "select" - the request that was supposed to
be handled was blank at that moment of course.
Update lvchange to allow change of 'zero' flag for thinpool.
Add support for changing discard handling.
N.B. from/to ignore could be only changed for inactive pool.
Add arg support for discard.
Add discard ignore, nopassdown, passdown (=default) support.
Flags could be set per pool.
lvcreate [--discard {ignore|no_passdown|passdown}] vg/thinlv
When --sysinit -a ay is used with vg/lvchange and lvmetad is up and running,
we should skip manual activation as that would be a useless step - all volumes
are autoactivated once all the PVs for a VG are present.
If lvmetad is not active at the time of the vgchange --sysinit -a ay
call, the activation proceeds in standard 'manual' way.
This way, we can still have vg/lvchange --sysinit -a ay called
unconditionally in system initialization scripts no matter if lvmetad
is used or not.
Reducing a RAID 4/5/6 LV or extending it with a different number of
stripes is still not implemented. This patch covers the "simple" case
where the LV is extended with the same number of stripes as the orginal.
In process_each_pv() if we haven't yet scanned and the PV appears
to be an orphan, we must scan the other PVs looking for mdas that
reference it to find out what VG it is in.
1. If the PV has no mdas, we must scan.
2. If the PV has an mda that is not ignored we do not need to scan.
3. If the PV has an mda that is ignored, we do need to scan.
This patch fixes case 3.
> pvs -o +mda_count,vg_mda_count /dev/loop[0123]
PV VG Fmt Attr PSize PFree #PMda #VMda
/dev/loop0 vg3 lvm2 a- 96.00m 96.00m 0 1
/dev/loop1 vg3 lvm2 a- 96.00m 96.00m 1 1
/dev/loop2 vg2 lvm2 a- 96.00m 96.00m 1 2
/dev/loop3 vg2 lvm2 a- 28.00m 28.00m 1 2
Before:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 lvm2 a-- 100.00m 100.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 lvm2 a-- 100.00m 100.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
After:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 vg2 lvm2 a-- 96.00m 96.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 vg3 lvm2 a-- 96.00m 96.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
If _alloc_parallel_area for raid devices chooses an area already used
up, it doesn't notice that it has no space left in it and leaves
later code trying to place a zero-length area into the LV.
https://bugzilla.redhat.com/832596
The clmvd init script called "vgchange -aly" before to activate
all VGs in cluster environment. This activated all VGs, no matter
if it was clustered or not.
Auto activation for clustered VGs is not supported yet so the behaviour
of -aay is still the same as before for clustered VGs. However, for
non-clustered VGs, we need to check with the activation/auto_activation_volume_list
whether the VG/LV should be activated on boot or not.
One can use "lvcreate --aay" to have the newly created volume
activated or not activated based on the activation/auto_activation_volume_list
this way.
Note: -Z/--zero is not compatible with -aay, zeroing is not used in this case!
When using lvcreate -aay, a default warning message is also issued that zeroing
is not done.
Define auto_activation_handler that activates VGs/LVs automatically
based on the activation/auto_activation_volume_list (activating all
volumes by default if the list is not defined).
The autoactivation is done within the pvscan call in 69-dm-lvmetad.rules
that watches for udev events (device appearance/removal).
For now, this works for non-clustered and complete VGs only.
Normally, the 'vgchange -ay' activates all volume groups (that pass
the activation/volume_list filter if set).
This call can appear in two scenarios:
- system boot (so activation within a script in general)
- manual call on command line (so activaton on user's direct request)
For the former one, we would like to select which VGs should be actually
activated. One can define the list of VGs directly to do that. But that
would require the same list to be provided in all the scripts.
The 'vgchange -aay' will check for the activation/auto_activation_volume_list
in adition and it will activate only those VGs/LVs that pass this
filter (assuming all to be activated if the list is not defined - the
same logic we already have for activation/volume_list).
Init/boot scripts should use this form of activation primarily
(which, anyway, becomes only a fallback now with autoactivation done
on PV appearance in tandem with lvmetad in place).
Define an 'activation_handler' that gets called automatically on
PV appearance/disappearance while processing the lvmetad_pv_found
and lvmetad_pv_gone functions that are supposed to update the
lvmetad state based on PV availability state. For now, the actual
support is for PV appearance only, leaving room for PV disappearance
support as well (which is a more complex problem to solve as this
needs to count with possible device stack).
Add a new activation change mode - CHANGE_AAY exposed as
'--activate ay/-aay' argument ('activate automatically').
Factor out the vgchange activation functionality for use in other
tools (like pvscan...).
We're refererring to 'activation' all over the code and we're talking
about 'LVs being activated' all the time so let's use 'activation/activate'
everywhere for clarity and consistency (still providing the old
'available' keyword as a synonym for backward compatibility with
existing environments).
Update release_lv_segment_area not to discard any PV extents,
as it also gets used when moving extents between LVs.
Instead, call a new function release_and_discard_lv_segment_area() in
the two places where data should be discarded - lv_reduce() and
remove_mirrors_from_segments().
There's no need to have the device open RW while obtaining the readahead value.
The RW open used before caused the CHANGE udev event to be generated if the
WATCH udev rule was set for the underlying device (and that is normally the
case both for non-dm and dm devices by default).
This did not cause any problems before since we were not interested in
*underlying* devices. However, with upcoming changes (autoactivation), we're
watching for events on underlying devices marked as PVs and such a spurious
event could cause the autoactivation code to be triggered. So when trying
to deactivate the volume, we could end up with immediate activation just after
that because of the CHANGE event originated in the WATCH udev rule since the
underlying device was open RW during the deactivation process.
Though maybe a better solution would be to completely filter such spurious
events out of the autoactivation process somehow, it's still useful if there
are as least spurious events generated as possible in the system itself.
If the user specifies number in the range of [4G/1024, 4G>,
the used value would wrap around (32bit math).
So keep the math 64bit.
Note, using such large lvm.conf values is pointless with lvm2.
Support has many limitations and lots of FIXMEs inside,
however it makes initial task when user creates a separate LV for
thin pool data and thin metadata already usable, so let's enable
it for testing.
Easiest API:
lvconvert --chunksize XX --thinpool data_lv metadata_lv
More functionality extensions will follow up.
TODO: Code needs some rework since a lot of same code is getting copied.
When mirrors are up-converted, a transient mirror layer is put in so that
only the new devices are sync'ed. That transient layer must carry the tags
of the original mirror LV, otherwise it will fail to activate when activation
is regulated by lvm.conf:activation/volume_list. The conversion would then
fail.
The fix is to do exactly the same thing that is being done for linear ->
mirror converting (lib/metadata/mirror.c:_init_mirror_log()). We copy the
tags temporarily for the new LV and remove them after the activation.
Snapshots of RAID logical volumes are allowed (including "raid1"). However,
snapshots of "mirror" logical volumes has been disallowed due to unsolvable
issues inherent to the design. The fact that mirroring (dm-raid1.c) must
stop all I/O as the result of a failure and wait for userspace intervention
can lead to a circular dependency if userspace is simultaneously waiting for
snapshots (on mirrors) to make an I/O update before proceeding.
Various snapshot on mirror tests have been removed as a result.
Looking at the code in cmirrord/local.c, we can see the various different
request types handled in different ways. Some information that is non-changing
does not need to go around the cluster and can be short-circuited. For
example, once the cluster mirror is in-sync, it is pointless to continue
sending that query around the cluster. We can save network bandwidth and reply
directly back to the kernel. When it comes to status information, there are
two types 'TABLE' and 'INFO'. The 'TABLE' information never changes and
belongs to the group of requests that can be safely short-circuited. The
'STATUS' information can change - and will change if a device fails. Thus it
cannot be short-circuited, but this is exactly what was found. The 'STATUS'
information request was being short-circuited and therefore never reporting the
failure condition to anyone other than the "server" that experienced it
directly.
If two devices in an array failed, it was previously impossible to replace
just one of them. This patch allows for the replacement of some, but perhaps
not all, failed devices.
Thanks to agk for providing the patch that prevents resume from attempting
(and then failing) to create error devices which already exist; having been
created by a corresponding suspend operation.
The 'mirror' segtype and 'raid1' segtype both set the 'MIRRORED' flag.
However, due to differences in the way these device-mapper targets behave
'mirror' must be suspended with the 'noflush' option and 'raid1' does not
have to be.
This patch ensures that when the 'MIRRORED' flag is checked to see if
'noflush' is needed that it does not also set it for 'raid1' by mistake.
The logic for resuming the original and newly split LVs was not properly
done to handle situations where anything but the last device in the array
was split. It did not take into account the possible name collisions that
might occur when the original LV undergoes the shifting and renaming of its
sub-LVs.
When resizing thin pool - we need to use strip info from _tdata volume.
In future more generic solution will be necessary once we start to support
lvconvert (resize of stacked devices and stay properly aligned).
For now we just allow striped or linear LV so this code will work.
When given lvresize new size - round upward for stripes - unless we use % and
we are at the border of free extents.
This patch is not a complete fix and few more cases will need special care.
Libudev does not provide transactions when querying udev database - once we
get the list of block devices (devices/obtain_device_list_from_udev=1) and
we iterate over the list to get more detailed information about device node
and symlink names used etc., the device could be removed just in between we
get the list and put a query for more info. In this case, libudev returns
NULL value as the device does not exist anymore.
Recently, we've added a warning message to reveal such situations. However,
this could be misleading if the device is not related to the LVM action
we're just processing - the non-related block device could be removed in
parallel and this is not an error but a possible and normal operation.
(N.B. This "missing info" should not happen when devices are related to
the LVM action we're just processing since all such processing should be
synchronized with udev and the udev db must always be in consistent state
after the sync point. But we can't filter this situation out from others,
non-related devices, so we have to lower the message verbosity here for a
general solution.)
various dmeventd plug-ins into a new function called 'dmeventd_lvm2_command',
but the new function did not strip off the "_mlog" extentions that the
mirror plug-in had been doing. This created bug 794904 - failure to replace
devices in a redundant log.
The test suite did catch this scenario because it performs repair tests (mainly)
through the CLI and not dmeventd. It's also not easy to test because the test
itself will hang if the bug is encountered.
When vg_read fails, it internally unlocks VG if it's been locked,
so in error path we should skip unlock_vg for this case.
(user would see ugly internal warning)
Activation on remote node should be tried only if it is masked by tags
locally (like when hosttags enabled, IOW activate_lv_excl_local()
doesn't return error.)
Introduced change caused that lvchange -aey succeeded even if volume was
activated exclusively remotely.
Calling vgscan alone should reuse information from the lvmetad (if running).
The --cache option should initiate direct device scan and update lvmetad
appropriately (if running).
This is mainly for vgscan to behave consistently compared to pvscan.
locally or on more nodes while others are activated exclusively.
Current pvmove code can either use local mirror (for exclusive
activation) or cmirror (for clustered LVs).
Because the whole intenal pvmove LV is just segmented LV containing
segments of several top-level LVs, code cannot properly handle
situation if some segment need to be activated exclusively.
Previously, it wrongly activated exclusive LV on all nodes
(locing code allowed it) but now this is no lnger possible.
If there is exclusively activated LV, pvmove is only
possible if all affected LVs are aslo activated exclusively.
(Note that in non-exclusive mode pvmove still activates LVs
on other nodes during move.)
# lvchange -aly vg_test/lv1
# lvchange -aey vg_test/lv2
# pvmove -i 1 /dev/sdc
Error locking on node bar-01: Device or resource busy
Error locking on node bar-03: Volume is busy on another node
...
Failed to activate lv2
In this case we should allow to use local mirror, check for cmirror
should apply only for lvconvert/lvcreate.
Introduced in 2.02.86 by removing !(lv->status & ACTIVATE_EXCL).
(Partially workaround, it is minimalistic patch for now.)
Code adds better support for monitoring of thin pool devices.
update_pool_lv uses DMEVENTD_MONITOR_IGNORE to not manipulate with monitoring.
vgchange & lvchange are checking real thin pool device for existance
as we are using _tpool real device and visible LV pool device might not
be even active (_tpool is activated implicitely for any thin volume).
monitor_dev_for_events is another _lv_postorder like code it might be worth
to think about reusing it here - for now update the code to properly
monitory thin volume deps.
For unmonitoring add extra code to check the usage of thin pool - in case it's in use
unmonitoring of thin volume is skipped.
There are kernel drivers (smblk) which set '-1' as their device major number.
This number is listed in /proc/devices then - but the kernel itself is using
just 12 bits - thus device is accessible via 4095 - there is posted patch
for 3.4 to fix this behavior (0 for auto allocation was mean to be used).
However to still allow using such devices with older kernels add some code
to use same behavior - so cut 12 bits from the major number from /proc/devices.
For now use log_warn() - maybe the severity of the message could be lowered
to just verbose level.
Fix propagation of -e option - pass it via internal shell variable.
Fix parsing of /proc/mounts files (don't check for substrings).
as reported by O.Mangold with suggested patch:
https://www.redhat.com/archives/linux-lvm/2012-February/msg00030.html
Properly pass arguments with spaces ("$@")
Add validation for YES and EXTOFF variable content.
When down-converting a RAID1 device, it is the last device that is extracted
and removed when the user does not specify a particular device. However,
when a device is specified (and it is not the last), the device is removed and
the remaining sub-LVs are "shifted down" to fill the hole. This cause problems
when resuming the LV because if the shifted devices were resumed (and thus
renamed) before the sub-LV being extracted, there would be a name conflict.
The solution is to resume the extracted sub-LVs first so that they can be
properly renamed preventing a possible conflict.
This addresses bug 801967.
Update a way we handle option passing - so we now support path and options
with space inside.
Fix dm name usage for thin pools with '-' in name.
Use new lvm.conf option thin_check_options to pass in options as string array.
Save some relocation entries and use directly char[].
Since we do not need yes more then 127 partitions per device, use just int8_t.
Move lvm_type_filter_destroy into local static function.
Never return unfinished toolcontext - since error path is hit on
various stages of initialization we cannot leave it partially uninitialized,
since we would need to spread many more test across the code for config_valid.
Instead return NULL and properly release udev library resources as well.
Fix regression in man page. The chunk size is in kilobyte units on command line
input though in the source code we work with sector size unit
so make it clear in the man page.
Update chunksize for thin pool in man page - it's max value is 1024M == 1G.
Fix warning range message to show proper max value.
If the lvcreate may decide some automagical values for a user,
try to keep the pool metadata size into 128MB range for optimal
perfomance (as suggested by Joe).
So if the pool metadata size and chunk_size were not specified,
try to select such values they would fit into 128MB size.
Use thin_dump --repair suggestion in log error message
and use just warning on deactivation path without repair info
(since node has been deactivated).
Also check whether there is not 16 args for thin_check configured.
Avoid using NULL pointers from udev. It seems like some older versions of udev
were improperly returning NULL in some case, so do not silently break here,
and give at least a warning to the user.
if the thin_check fail on thin pool - still return successful deactivation,
since lvremove would currently fail.
TODO: find some way to not run check with lvremove.
Use libdm callback to execute thin_check before activation
thin pool and after deactivation as well.
Supporting thin_check_executable which may pass in extra options for
the tool.
If no size was give the later added minimal size check efectively
disable this code. Also the argument for size now must be kept
in sector_size, so adding division by SECTOR_SIZE (moved into
a const expression)
If the thin pool has disabled zeroing (created with -Zn), we at least
clear initial 4KiB of such thin volume (provisions 1st block).
If lvcreate is executed with '-an' command will abort (same way like we for
normal LV - however for normal LV option -Zn may skip clearing completely,
for thin volumes this option is not supported (applies only for pools).
The OpenAIS checkpoint library is going away; therefore, cmirrord must
operate without it. The algorithms the handle the timing of when to send
a checkpoint, the determination of what to send, and which ongoing cluster
requests are relevent with respect to the checkpoints are unaffected. We
need only replace the functions that actually perform the storing/transmitting
and retrieving/receiving of the checkpoint data. Rather than store the
checkpoint data in an OpenAIS checkpoint file, we simply transmit it along
with the message that notifies the incoming node that the checkpoint is
ready.
Drop whole buffer clearing (most messages at <100 bytes).
Just make sure we have always \0 terminated string for strlen() operations.
(before for PIPE_BUF sized messages this was not set).
Addressing somewhat tricky bug here.
Since stdin,stdout,stderr were closed it's been occasionally possible to
see some unexpected messages to be flowing into a clvmd and generating some
randomly sized allocation of many megabytes. Since the message was not
being generated by standard send_message() construction, after some more
testing it apperead to be a debug log message - thus something has flown
to local socket opened on strandard out descriptor.
To fix the issue - use standard file descriptor duplication code for daemons.
For making easier debugging of polling daemon - developer might want to recompile
without modifition of standard file descriptors.
This could be seen as some sort of simple validation - it's not easy to
recognize a valid message for now - but we definitely do not want to
allocate a lot of megabytes in clvmd memory locked daemon when broken
message gets in.
Size of 8000 is just selected for now - possibly there could be much
lower value put in.
Using report_type_t for bitmask is not correct, since we have not defined types
for all bit combinations - so switching to unsigned type, since values of
report_type_t enum are unsigned.
The code fail to account for the case where we just need a single device
in a RAID 4/5/6 array. There is no good way to tell the allocation functions
that we don't need parity devices when we are allocating just a single device.
So, I've used a bit of a hack. If we are allocating an area_count that is <=
the parity count, then we can assume we are simply allocating a replacement
device (i.e. no need to include parity devices in the calculations). This
should make sense in most cases. If we need to allocate replacement devices
due to failure (or moving), we will never allocate more than the parity count;
or we would cause the array to become unusable. If we are creating a new device,
we should always create more stripes than parity devices.
/etc/tmpfiles.d directory holds configuration files for temporary/volatile
files and directories that should be automatically managed. For example,
if we have some parts of the fs hierarchy on tmpfs, we'd like to recreate
some files or directories on every boot so they're always prepared for use.
Systemd can read such configuration files. For now, the lock and run directory
are the ones that are most probably placed on tmpfs. If this is the case, we
can install the configuration by 'make install_tmpfiles_configuration'.
Read lvm.conf setting for monitoring for each command. So we should not
activate monitoring if the default compilation is set to monitor during
lvconvert commnads.
Patch also removes check for clustered VG and allows to disable monitoring
for clustered VG with the assumption, the problem with monitoring and dmeventd
flag passing for INGNORE is already fixed.
It was not possible to pass down the DM_[FORCE|NO]SYNC flags to
'dm_tree_node_add_raid_target'. This meant that converting to 'raid1' from
'mirror' would cause a full resync. (It also meant that '--nosync' was
ineffective when creating a 'raid1' LV.)
I've taken the 'reserved' parameter in 'dm_tree_node_add_raid_target' and
used it for the "flags" parameter. Now it is possible to pass the sync
flags and any other flags that may come up.
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
Count number of error and existing areas and if there is no existing area
for the LV avoid its activation.
Always disable partial activatio for thin volumes.
For mirrors currently put in hack to let it pass with a special name
since current mirror code needs to activate such LV during some operations.
For reading % of mapped size of thin volume use as origin for
old style snapshot '-real' device needs to be queried.
Fix log_error report given for lvs -a in this case.
Failure to do so results in "Performing unsafe table load while X device(s) are
known to be suspended" errors. While fixing the problem in this way works and
is consistent with the way the mirror segment type does it, it would be nice
to find a solution that uses the generic suspend/resume calls.
Also included in this check-in are additions to the test suite that perform
conversions on RAID LVs under a snapshot. These tests are disabled for the
time being due to a kernel bug that is yet to be tracked down.
Similar to the "mirror" segment type's log device, _add_dev_to_dtree should
be called and not _add_lv_to_dtree when adding metadata sub-LVs to the deptree.
Since _add_lv_to_dtree was being called, 'origin_only' could be set if a
snapshot sits on top of the RAID device. This would cause the actual device
that needed to be added to be skipped in favor of the non-existant device,
"<foo>-real".
Reformat name and path how the LV is represented with lvm1 compatible option,
to switch to the old way - which had number of problem - i.e. many links
do not exist - since for private devices we are not creating them.
Add more info about thin pools and volumes.
Since striped name function knows when to report 'linear' instead of
'stripe' type name - drop it from this place.
This fixes problem when reporting segtype e.g. for thin-pool which
is also using area_count=1 to store thin data device reference.
It also returns properly strduped memory instead of badly casted const char*.
This patch to the suspend code - like the similar change for resume -
queries the lock mode of a cluster volume and records whether it is active
exclusively. This is necessary for suspend due to the possibility of
preloading targets. Failure to check to exclusivity causes the cluster target
of an exclusively activated mirror to be used when converting - rather than
the single machine target.
Basic support to keep info when the LV was created.
Host and time is stored into LV mda section.
FIXME: Current version doesn't support configurable string via lvm.conf
and used fixed version strftime "%Y-%m-%d %T %z".
We want to keep this logic -
when LV is extend - extend the LV by at least given amount,
when LV is reduced - reduce the LV by at most given amount.
So for this the rounding needs to be used.
Current logic which seems to satisfy give rule is to round up all
extent values for LV resize upward except for values with '-' sign
that are round downward.
This patch also fixes the problem when lvextend --use-polices tried
to extend LV the by i.e. 20% - but the resulting 20% were smaller
the extent size thus before this patch no extension happened.
For snapshot, prepare whole command in front into private buffer.
Add also some missing '\n' for syslog messages.
For raid and mirror only convert creation of command line string.
This should avoid any unbound growth of mempool for dm_split_names.
Since the !(dev->flags & DEV_REGULAR) code path just called
dev_name_confirmed() which has just called 'stat()' inside,
remove duplicate second stat() call here.
When both path have identical prefix i.e. /dev/disk/by-id
skip 2 x lstat() for /dev /dev/disk /dev/disk/by-id
and directly lstat() only different part of the path.
Reduces amount of lstat calls on system with lots of devices.
The RAID plug-in for dmeventd now calls 'lvconvert --repair' to address failures
of devices in a RAID logical volume. The action taken can be either to "warn"
or "allocate" a new device from any spares that may be available in the
volume group. The action is designated by setting 'raid_fault_policy' in
lvm.conf - the default being "warn".
Also, don't allow a splitmirror operation on a RAID LV that is already tracking
a split, unless the operation is to stop the tracking and complete the split.
Example:
~> lvconvert --splitmirrors 1 --trackchanges vg/lv /dev/sdc1
# Now tracking changes - image can be merged back or split-off for good
~> lvconvert --splitmirrors 1 -n new_name vg/lv /dev/sdc1
# ^ Completes split ^
If a split is performed on a RAID that is tracking an already split image and
PVs are provided, we must ensure that
1) the already split LV is represented in the PVs
2) we are careful to split only the tracked image
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
LVM metadata knows only of striped segments - not linear ones.
The activation code detects segments with a single stripe and switches
them to use the linear target.
If the new lvm.conf setting is set to 0 (e.g. in a test script), this
'optimisation' is turned off.
Simplify /api makefile and use SUBDIRS target for test dir.
Properly cleanup Makefiles with distclean in /test.
Use symbolic links for shell scripts for non-srcdir compilation.
Remove FIXMES - there should not be any pool free call since
the memory pool is from device manager, and pool is detroyed
after the operation, so doing extra free here would not help here.
However lv_has_target_type() is using cmd mempool so here the extra
call for dm_pool_free makes sence.
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
For write we do not need to hold memory locked.
This relaxes many conditions and avoid problems when allocating
a lot of memory for writting metadata buffers.
(In case of huge MDA size this would lead to mismatch between
locked and unlocked memory region size).
Add also internal check we are not writing in critical section.
Removal of an inactive origin removes also all related snapshots.
When we now support 'old' external snapshots with thin volumes,
removal of pool will not only drop all thin volumes, but as
a consequence also all snapshots - which might be seen a bit
unexpected for the user - so add a query to confirm such action.
lvremove -f will skip the prompt.
Update region_size only for mirror and raid targets.
This fixes warning messages when vg is using small
extent size like 1KiB and no mirror/raid is created,
but the user still got the message:
$> vgcreate -s 1K vg <pvs>
$> lvcreate -L10K vg
Using reduced mirror region size of 4 sectors
When a PV label write is deferred to a vg_write call (as introduced by a patch
in 2.02.86), the PV is flagged with the internal UNLABELLED_PV flag. However,
when calling vg_archive before vg_write, we still have the PV labelled with the
UNLABELLED_PV flag which was not recognised as a proper flag while exporting
VG metadata:
# vgcreate vg /dev/sda
No physical volume label read from /dev/sda
Metadata inconsistency: Not all flags successfully exported.
Metadata inconsistency: Not all flags successfully exported.
Writing physical volume data to disk "/dev/sda"
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
udev may also need to be disabled if you didn't build it statically too.
dmeventd.static could be fixed with some more work but I don't really see the
point: without dlopen() it's useless, and if you have dlopen(), why not support
normal shared libraries too?
Add filter which tries to check if scanned device is part
of active multipath.
Firstly, only SCSI major number devices are handled in filter.
Then it checks if device has exactly one holder (in sysfs) and
if it is device-mapper device and DM-UUID is prefixed by "MPATH-".
If so, this device is filtered out.
The whole filter can be switched off by setting
mpath_component_detection in lvm.conf.
https://bugzilla.redhat.com/show_bug.cgi?id=597010
Signed-off-by: Milan Broz <mbroz@redhat.com>
Before adding a new virtual segment to LV, check first whether
the last segment isn't already of the same type. In this case
extend last segment instead of creating the new one.
Thin volumes should have always only 1 virtual segment, but it
helps also to virtual snapshot or error segtype..
Git commit ID 0864378250 was meant to disallow
'mirrored' logs for cluster mirrors. However, when add_mirror_log is used
to create the log (as is now the case when using 'lvcreate' or converting only
the log) the check is bypassed.
This patch adds the check to add_mirror_log.
pvck prints 'extra' character from the label since there is no '\0'
after the struct label entry and just uint64_t follows directly.
So avoid it by limiting 8 chars to be printed.
https://www.redhat.com/archives/lvm-devel/2011-January/msg00109.html
Signed-off-by: Paul Bolle <pebolle tiscali nl>
This function could be useful for other _manip source files.
Use dm_list manipulation function for provided functionality,
which make the code more readable and avoid touching list
internal details here.
This was the intended behaviour, as described in the lvchange man page, so you
have complete control through volume_list in lvm.conf, but the code seems to
have been treating -ae as local-only for a very long time.
Split syntax for thin-pool since it cannot be fully matched with snapshot.
So to avoid more confusion - take thin support into separate line.
Though still significant updates are needed for thin provisioning.
Version 2 of the userspace log protocol accepts return information during the
DM_ULOG_CTR exchange. The return information contains the name of the log
device that is being used (if there is one). The kernel can then register the
device via 'dm_get_device'. Amoung other things, this allows for userspace to
assemble a correct dependency tree of devices - critical for LVM handling of
suspend/resume calls.
Also, update dm-log-userspace.h to match the kernel header associated with
this protocol change. (Includes a version inc.)
When verify_udev_operations was disable, code for stacking fs operation for
lvm links was completely disable - but this code was also used for collecting
information, that a new node is being created.
Add a new flag which is set when a creation of lv symlinks is requested which
should restore old behaviour of lv_info function, that has called fs_sync()
before quere for open count on device.
Barrier is supposed to be used in situation like this
and replace tricky mutex usage, where mutex has been unlocked
by a different thread than the locking thread.
Properly detect if the filters were refreshed properly.
(May needs few more fixes ??)
Filter refresh may fail because it may be out of free file descriptors
when clvmd gets overloaded.
Replace usleep with pthread condition to increase speed testing
(for simplicity just 1 condition for all locks).
Use thread mutex also for unlock resource (so it wakes up awaiting
threads)
Better check some error states and return error in fail case with
unlocked mutex.
Using log_warn to report missing symlinks as warning, since the command
itself returns as successful, we should not produce log_error().
log_warn is better fit here.
Example:
~> lvconvert --type raid1 vg/mirror_lv
Steps to convert "mirror" to "raid1"
1) Allocate a RAID metadata LV for each mirror image from the same PVs
on which they are located.
2) Clear the metadata LVs. This involves writing LVM metadata, so we don't
change any aspects of the mirror LV before this so that the user can easily
remove LVs from the failed convert attempt while retaining the original
mirror.
3) Remove the mirror log, if it exists.
4) Add metadata LVs to mirror LV
5) Rename mirror sub-lvs (s/mimage/rimage/)
6) Change flags and segtype from mirror to raid1
Example:
~> lvconvert --type raid1 -m 1 vg/lv
The following steps are performed to convert linear to RAID1:
1) Allocate a metadata device from the same PV as the linear device
to provide the metadata/data LV pair required for all RAID components.
2) Allocate the required number of metadata/data LV pairs for the
remaining additional images.
3) Clear the metadata LVs. This performs a LVM metadata update.
4) Create the top-level RAID LV and add the component devices.
We want to make any failure easy to unwind. This is why we don't create the
top-level LV and add the components until the last step. Should anything
happen before that, the user could simply remove the unnecessary images. Also,
we want to ensure that the metadata LVs are cleared before forming the array to
prevent stale information from polluting the new array.
A new macro 'seg_is_linear' was added to allow us to distinguish linear LVs
from striped LVs.
This patch allows a mirror to be extended without an initial resync of the
extended portion. It compliments the existing '--nosync' option to lvcreate.
This action can be done implicitly if the mirror was created with the '--nosync'
option, or explicitly if the '--nosync' option is used when extending the device.
Here are the operational criteria:
1) A mirror created with '--nosync' should extend with 'nosync' implicitly
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv ; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 10.00g lv_mlog 100.00
2) The 'M' attribute ('M' signifies a mirror created with '--nosync', while 'm'
signifies a mirror created w/o '--nosync') must be preserved when extending a
mirror created with '--nosync'. See #1 for example of 'M' attribute.
3) A mirror created without '--nosync' should extend with 'nosync' only when
'--nosync' is explicitly used when extending.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 5.02g lv_mlog 0.39
vs.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv --nosync; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.02g lv_mlog 100.00
4) The 'm' attribute must change to 'M' when extending a mirror created without
'--nosync' is extended with the '--nosync' option. (See #3 examples above.)
5) An inactive mirror's sync percent cannot be determined definitively, so it
must not be allowed to skip resync. Instead, the extend should ask the user if
they want to extend while performing a resync.
[EXAMPLE]# lvchange -an vg/lv
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv is not active. Unable to get sync percent.
Do full resync of extended portion of vg/lv? [y/n]: y
Logical volume lv successfully resized
6) A mirror that is performing recovery (as opposed to an initial sync) - like
after a failure - is not allowed to extend with either an implicit or
explicit nosync option. [You can simulate this with a 'corelog' mirror because
when it is reactivated, it must be recovered every time.]
[EXAMPLE]# lvcreate -m1 -L 5G -n lv vg --nosync --corelog
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "lv" created
[EXAMPLE]# lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 100.00
[EXAMPLE]# lvchange -an vg/lv; lvchange -ay vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 0.08
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv cannot be extended while it is recovering.
7) If 'no' is selected in #5 or if the condition in #6 is hit, it should not
result in the mirror being resized or the 'm/M' attribute being changed.
NOTE: A mirror created with '--nosync' behaves differently than one created
without it when performing an extension. The former cannot be extended when
the mirror is recovering (unless in-active), while the latter can. This is
a reasonable thing to do since recovery of a mirror doesn't take long (at
least in the case of an on-disk log) and it would cause far more time in
degraded mode if the extension w/o '--nosync' was allowed. It might be
reasonable to add the ability to force the operation in the future. This
should /not/ force a nosync extension, but rather force a sync'ed extension.
IOW, the user would be saying, "Yes, yes... I know recovery won't take long
and that I'll be adding significantly to the time spent in degraded mode, but
I need the extra space right now!".
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The original commit comments can be located via this git commit ID:
7d8e615c0b
There were three possible solutions to the original problem proposed in the
initial check-in. The one chosen was as follows:
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it doesn't
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
Turns out, the cluster also views the extra suspend/resume operations as
pointless too and ignores them. So, this solution doesn't work in a cluster.
Further, I've noticed that in addition to the remote cluster nodes still getting
I/O errors from scanning the error target, they also have a different LVM and
DM views of the same LV. IOW, while the LVM level (gotten from the LVM metadata)
sees the correct name for the newly split LV, device-mapper still maintains the
old names.
Because the original fix failed to completely fix the problem (or work-around it)
and because a better solution must be found to address the additional cluster
issue of device renaming, I am reverting the above mentioned commit.
The current code does not always assign proper udev flags to sub-LVs (e.g.
mirror images and log LVs). This shows up especially during a splitmirror
operation in which an image is split off from a mirror to form a new LV.
A mirror with a disk log is actually composed of 4 different LVs: the 2
mirror images, the log, and the top-level LV that "glues" them all together.
When a 2-way mirror is split into two linear LVs, two of those LVs must be
removed. The segments of the image which is not split off to form the new
LV are transferred to the top-level LV. This is done so that the original
LV can maintain its major/minor, UUID, and name. The sub-lv from which the
segments were transferred gets an error segment as a transitory process
before it is eventually removed. (Note that if the error target was not put
in place, a resume_lv would result in two LVs pointing to the same segment!
If the machine crashes before the eventual removal of the sub-LV, the result
would be a residual LV with the same mapping as the original (now linear) LV.)
So, the two LVs that need to be removed are now the log device and the sub-LV
with the error segment. If udev_flags are not properly set, a resume will
cause the error LV to come up and be scanned by udev. This causes I/O errors.
Additionally, when udev scans sub-LVs (or former sub-LVs), it can cause races
when we are trying to remove those LVs. This is especially bad during failure
conditions.
When the mirror is suspended, the top-level along with its sub-LVs are
suspended. The changes (now 2 linear devices and the yet-to-be-removed log
and error LV) are committed. When the resume takes place on the original
LV, there are no longer links to the other sub-lvs through the LVM metadata.
The links are implicitly handled by querying the kernel for a list of
dependencies. This is done in the '_add_dev' function (which is recursively
called for each dependency found) - called through the following chain:
_add_dev
dm_tree_add_dev_with_udev_flags
<*** DM / LVM divide ***>
_add_dev_to_dtree
_add_lv_to_dtree
_create_partial_dtree
_tree_action
dev_manager_activate
_lv_activate_lv
_lv_resume
lv_resume_if_active
When udev flags are calculated by '_get_udev_flags', it is done by referencing
the 'logical_volume' structure. Those flags are then passed down into
'dm_tree_add_dev_with_udev_flags', which in turn passes them to '_add_dev'.
Unfortunately, when '_add_dev' is finding the dependencies, it has no way to
calculate their proper udev_flags. This is because it is below the DM/LVM
divide - it doesn't have access to the logical_volume structure. In fact,
'_add_dev' simply reuses the udev_flags given for the initial device! This
virtually guarentees the udev_flags are wrong for all the dependencies unless
they are reset by some other mechanism. The current code provides no such
mechanism. Even if '_add_new_lv_to_dtree' were called on the sub-devices -
which it isn't - entries already in the tree are simply passed over, failing
to reset any udev_flags. The solution must retain its implicit nature of
discovering dependencies and be able to go back over the dependencies found
to properly set the udev_flags.
My solution simply calls a new function before leaving '_add_new_lv_to_dtree'
that iterates over the dtree nodes to properly reset the udev_flags of any
children. It is important that this function occur after the '_add_dev' has
done its job of querying the kernel for a list of dependencies. It is this
list of children that we use to look up their respective LVs and properly
calculate the udev_flags.
This solution has worked for single machine, cluster, and cluster w/ exclusive
activation.
The problem as reported by "ben <benscott@nwlink.com>" on lvm-devel:
vgsplit fails with mirrored mirror log
#lvs --all -o lv_name,lv_attr,devices
LV Attr Devices
MyMirror mwi--
[MyMirror_mimage_0] Iwi--- /dev/sdq(0)
[MyMirror_mimage_1] Iwi--- /dev/sdo(0)
[MyMirror_mimage_2] Iwi--- /dev/sdi(0)
[MyMirror_mlog] mwi---
[MyMirror_mlog_mimage_0] Iwi--- /dev/sds(0)
[MyMirror_mlog_mimage_1] Iwi--- /dev/sde(0)
#vgsplit -v "TestA" "TestB" "/dev/sdq" "/dev/sdo" "/dev/sdi" "/dev/sds"
"/dev/sde"
Checking for volume group "TestA"
Checking for new volume group "TestB"
Archiving volume group "TestA" metadata (seqno 213).
Can't split mirror MyMirror between two Volume Groups
AFTER FIX:
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
Volume group "new" not found
Skipping volume group new
LV VG Devices
lv vg lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] vg /dev/sdb1(0)
[lv_mimage_1] vg /dev/sdc1(0)
[lv_mlog] vg lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] vg /dev/sdh1(0)
[lv_mlog_mimage_1] vg /dev/sdi1(0)
[root@bp-01 ~]# vgsplit vg new /dev/sd[bchi]1
New volume group "new" successfully split from "vg"
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
LV VG Devices
lv new lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] new /dev/sdb1(0)
[lv_mimage_1] new /dev/sdc1(0)
[lv_mlog] new lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] new /dev/sdh1(0)
[lv_mlog_mimage_1] new /dev/sdi1(0)
Since execve passed only NULL as environ, we had lost all environment vars on
restart - thus actually running 'different' clvmd then the one at start.
Preserving environ allows to restart clvmd with the same settings
(i.e. LD_LIBRARY_PATH)
Add test for second restart.
Add named cluster_ops to easily learn the name of the active cluster manager,
so we are able to restart singlenode manager in testing.
Add simple test for clvmd -S (restart) and -R (refresh)
(though it needs some extensions).
Read 2 environmental vars to learn about overide position for
CLVMD and LVM binaries.
We support LVM_BINARY in other script - and this way we could easily
test restart in our test-suite.
Bugfix:
Add (most probably unfinished) support for -E arg with list of exclusive
locks. (During clvmd restart all exclusive locks would have been lost and
in fact, if there would have been an exclusive lock, usage text would be
printed and clvmd exits.)
Instead of parsing list options multiple times every time some lock UUID is
checked - put them straight into the hash table - make the code easier to
understand as well.
Remove was_ex_lock() function (replaced with dm_hash_lookup()).
Swap return value for get_initial_state() (1 means success).
Update man pages and usage info for -E option.
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
If you specify the segment type (e.g. --type mirror) and the mirrors argument
as zero, it would result in a mirrored LV with only one image. While the device
may be valid in theory, it should not be allowed in practice. It also makes it
difficult on the conversion tools, since they react badly to single-image
mirrors.
Patch fixes Clang warnings about possible access via lv_name NULL pointer.
Replaces allocation of memory (strdup) with just pointer assignment
(since execve is being called anyway).
Checks for !*lv_name only when lv_name is defined.
(and as I'm not quite sure what state this really is - putting a FIXME
around - as this rather looks suspicios ??).
Add debug print of passed clvmd args.
(since data in statbuf are invalid).
Check whether sysconf managed to find _SC_PAGESIZE.
Report at least debug warning about failing unlink
(logging scheme here seems to be a different then in lvm).
Duplicate terminal FDs and use similar code as is made in clvmd
and cleanup warns about missing open/close tests.
FIXME: Looks like we already have 3 instancies of the same code in lvm repo.
When fsadm is test - it needs to execute lvm and fsadm from non-standard path
setting. So adding a support in fsadm script when user set LVM_BINARY, then
the lvm command invoced from fsadm will have the same PATH setting as before
entering fsadm command.
Needed for testing.
In case someone would use filename paths with spaces when changing
this script surround commands with '"'.
With default settings there is no change in behavior.