IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Configurable settings for thin pool create
if they are not specified on command line.
New supported lvm.conf options are:
allocation/thin_pool_chunk_size
allocation/thin_pool_discards
allocation/thin_pool_zero
Similar to the way the 'mirror', 'raid1' and 'raid10' segment types set
the number of mirrors to 2 ('-m 1') if the argument is not specified,
here we set the number of stripes to 2 if not given on the command line
when creating a RAID10 LV.
Move common functions for lvcreate and lvconvert.
get_pool_params() - read thin pool args.
update_pool_params() - updates/validates some thin args.
It is getting complicated and even few more things will be
implemented, so to avoid reimplementing things differently
in lvcreate and lvconvert code has been splitted
into 2 common functions that allow some future extension.
Target tells us its version, and we may allow different set of options
to be supported with different version of driver.
Idea is to provide individual feature flags and later be
able to query for them.
This patch is intended to fix bug 825323 - FS turns read-only during a double
fault of a mirror leg and mirrored log's leg at the same time. It only
affects a 2-way mirror with a mirrored log. 3+-way mirrors and mirrors
without a mirrored log are not affected.
The problem resulted from the fact that the top level mirror was not
using 'noflush' when suspending before its "down-convert". When a
mirror image fails, the bios are queue until a suspend is recieved. If
it is a 'noflush' suspend, the bios can be safely requeued in the DM
core. If 'noflush' is not used, the bios must be pushed through the
target and if a device is failed for a mirror, that means issuing an
error. When an error is received by a file system, it results in it
turning read-only (depending on the FS).
Part of the problem was is due to the nature of the stacking involved in
using a mirror as a mirror's log. When an image in each fail, the top
level mirror stalls because it is waiting for a log flush. The other
stalls waiting for corrective action. When the repair command is issued,
the entire stacked arrangement is collapsed to a linear LV. The log
flush then fails (somewhat uncleanly) and the top-level mirror is suspended
without 'noflush' because it is a linear device.
This patch allows the log to be repaired first, which in turn allows the
top-level mirror's log flush to complete cleanly. The top-level mirror
is then secondarily reduced to a linear device - at which time this mirror
is suspended properly with 'noflush'.
Use log_warn to print non-fatal warning messages.
Use of log_error would confuse checker for testing
whether proper error has been reported for some real error.
When valgrind usage is desired by user (--enable-valgrind-pool)
skip playing/closing/reopenning with descriptors - it makes
valgridng useless.
Make sleep delay for clvmd start longer.
A while back, the behavior of LVM changed from allowing metadata changes
when PVs were missing to not allowing changes. Until recently, this
change was tolerated by HA-LVM by forcing a 'vgreduce --removemissing'
before trying (again) to add tags to an LV and then activate it. LVM
mirroring requires that failed devices are removed anyway, so this was
largely harmless. However, RAID LVs do not require devices to be removed
from the array in order to be activated. In fact, in an HA-LVM
environment this would be very undesirable. Device failures in such an
environment can often be transient and it would be much better to restore
the device to the array than synchronize an entirely new device.
There are two methods that can be used to setup an HA-LVM environment:
"clvm" or "tagging". For RAID LVs, "clvm" is out of the question because
RAID LVs are not supported in clustered VGs - not even in an exclusively
activated manner. That leaves "tagging". HA-LVM uses tagging - coupled
with 'volume_list' - to ensure that only one machine can have an LV active
at a time. If updates are not allowed when a PV is missing, it is
impossible to add or remove tags to allow for activation. This removes
one of the most basic functionalities of HA-LVM - site redundancy. If
mirroring or RAID is used to replicate the storage in two data centers
and one of them goes down, a server and a storage device are lost. When
the service fails-over to the alternate site, the VG will be "partial".
Unable to add a tag to the VG/LV, the RAID device will be unable to
activate.
The solution is to allow vgchange and lvchange to alter the LVM metadata
for a limited set of options - --[add|del]tag included. The set of
allowable options are ones that do not cause changes to the DM kernel
target (like --resync would) or could alter the structure of the LV
(like allocation or conversion).
Compared to names, UUIDs can't be renamed once they are created
for a device. The 'mangle' command will just issue an error message
about a need for manual intervention in this case - reactivating the
device (remove + create) does the job as the defualt mangling mode
used is "auto" and that will assign a correct mangled form the UUID.
For now this convertions is not supported, thus disabled.
The only supported conversion for now is to create mirrored thin pools
from mirrored devices.
Update code for lvconvert.
Change the lvconvert user interface a bit - now we require 2 specifiers
--thinpool takes LV name for data device (and makes the name)
--poolmetadata takes LV name for metadata device.
Fix type in thin help text -z -> -Z.
Supported is also new flag --discards for thinpools.
When reformatting the 'lvchange_resync' code in commit
05131f5853, a '!' should have been removed
from the condition that checks for the LV_NOTSYNCED flag on a corelog
mirror LV. The presence of this '!' caused the LV_NOTSYNCED flag to be
cleared when it wasn't present and left when it was present.
It is not allowed to add images to a 'mirror' or 'raid1' LV if the
LV_NOTSYNCED flag is set. We add some up-convert tests to ensure this
behavior is being enforced and that the LV_NOTSYNCED flag is being
properly cleared by 'lvchange --resync'.
(Not updating WHATS_NEW because this is intrarelease.)
Don't try to issue discards to a missing PV to avoid segfault.
Prevent lvremove from removing LVs that have any part missing.
https://bugzilla.redhat.com/857554
Using 'activation/auto_activation_volume_list = [ "vg/lvol1" ]'.
Before this patch:
3 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi-a--- 4.00m
lvol3 vg -wi-a--- 4.00m
(vg/lvol1 activated as it passes the list and all subsequent volumes too - wrong!)
With this patch:
1 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi----- 4.00m
lvol3 vg -wi----- 4.00m
(only vg/lvol1 activated as it passes the list and no other - correct!)
Issuing a 'lvchange --resync <VG>/<RAID_LV>' had no effect. This is
because the code to handle RAID LVs was not present. This patch adds
the code that will clear the metadata areas of RAID LVs - causing them
to resync upon activation.
When an LV is to be resynced, the metadata areas are cleared and the
LV is reactivated. This is true for mirroring and will also be true
for RAID LVs. We restructure the code in lvchange_resync() so that we
keep all the common steps necessary (validation of ability to resync,
deactivation, activation of meta/log devices, clearing of those devices,
etc) and place the code that will be divergent in separate functions:
detach_metadata_devices()
attach_metadata_devices()
The common steps will be processed on lists of metadata devices. Before
RAID capability is added, this will simply be the mirror log device (if
found).
This patch lays the ground-work for adding resync of RAID LVs.
By changing the conditional for resyncing mirrors with core-logs a
bit, we can short-circuit the rest of the function for that case
and reduce the amount of indenting in the rest of the function.
This cleanup will simplify future patches aimed at properly handling
the resync of RAID LVs.
When printing a message for the user and the lv_segment pointer is available,
use segtype->ops->name() instead of segtype->name. This gives a better
user-readable name for the segment. This is especially true for the
'striped' segment type, which prints "linear" if there is an area_count of
one.
We should check whether the fd is opened before trying to reopen it.
For example, the stdin is closed in test/lib/harness.c causing the
test suite to fail.
Accept -q as the short form of --quiet.
Suppress non-essential standard output if -q is given twice.
Treat log/silent in lvm.conf as equivalent to -qq.
Review all log_print messages and change some to
log_print_unless_silent.
When silent, the following commands still produce output:
dumpconfig, lvdisplay, lvmdiskscan, lvs, pvck, pvdisplay,
pvs, version, vgcfgrestore -l, vgdisplay, vgs.
[Needs checking.]
Non-essential messages are shifted from log level 4 to log level 5
for syslog and lvm2_log_fn purposes.
This patch adds support for RAID10. It is not the default at this
stage. The user needs to specify '--type raid10' if they would like
RAID10 instead of stacked mirror over stripe.
Adding couple INTERNAL_ERROR reports for unwanted parameters:
Ensure the 'top' metadata node cannot be NULL for lvmetad.
Make obvious vginfo2 cannot be NULL.
Report internal error if handler and vg is undefined.
Check for handle in poll_vg().
Ensure seg is not NULL in dev_manager_transient().
Report missing read_ahead for _lv_read_ahead_single().
Check for report handler in dm_report_object().
Check missing VG in _vgreduce_single().
Remove the limit for major and minor number arguments used while specifying
persistent numbers via -My --major <major> --minor <minor> option which
was set to 255 before. Follow the kernel limit instead which is 12 bits
for major and 20 bits for minor number (kernel >= 2.6 and LVM formats
that does not have FMT_RESTRICTED_LVIDS - so still keep the old limit
of 255 for lvm1 format).
Allowing people to add devices to a VG that has PVs missing helps
people avoid the inability to repair RAID LVs in certain cases.
For example, if a user creates a RAID 4/5/6 LV using all of the
available devices in a VG, there will be no spare devices to
repair the LV with if a device should fail. Further, because the
VG is missing a device, new devices cannot be added to allow the
repair. If 'vgreduce --removemissing' were attempted, the
"MISSING" PV could not be removed without also destroying the RAID
LV.
Allowing vgextend to operate solves the circular dependency.
When the PV is added by a vgextend operation, the sequence number is
incremented and the 'MISSING' flag is put on the PVs which are missing.
Update lvchange to allow change of 'zero' flag for thinpool.
Add support for changing discard handling.
N.B. from/to ignore could be only changed for inactive pool.
Add arg support for discard.
Add discard ignore, nopassdown, passdown (=default) support.
Flags could be set per pool.
lvcreate [--discard {ignore|no_passdown|passdown}] vg/thinlv
When --sysinit -a ay is used with vg/lvchange and lvmetad is up and running,
we should skip manual activation as that would be a useless step - all volumes
are autoactivated once all the PVs for a VG are present.
If lvmetad is not active at the time of the vgchange --sysinit -a ay
call, the activation proceeds in standard 'manual' way.
This way, we can still have vg/lvchange --sysinit -a ay called
unconditionally in system initialization scripts no matter if lvmetad
is used or not.
Reducing a RAID 4/5/6 LV or extending it with a different number of
stripes is still not implemented. This patch covers the "simple" case
where the LV is extended with the same number of stripes as the orginal.
In process_each_pv() if we haven't yet scanned and the PV appears
to be an orphan, we must scan the other PVs looking for mdas that
reference it to find out what VG it is in.
1. If the PV has no mdas, we must scan.
2. If the PV has an mda that is not ignored we do not need to scan.
3. If the PV has an mda that is ignored, we do need to scan.
This patch fixes case 3.
> pvs -o +mda_count,vg_mda_count /dev/loop[0123]
PV VG Fmt Attr PSize PFree #PMda #VMda
/dev/loop0 vg3 lvm2 a- 96.00m 96.00m 0 1
/dev/loop1 vg3 lvm2 a- 96.00m 96.00m 1 1
/dev/loop2 vg2 lvm2 a- 96.00m 96.00m 1 2
/dev/loop3 vg2 lvm2 a- 28.00m 28.00m 1 2
Before:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 lvm2 a-- 100.00m 100.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 lvm2 a-- 100.00m 100.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
After:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 vg2 lvm2 a-- 96.00m 96.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 vg3 lvm2 a-- 96.00m 96.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
One can use "lvcreate --aay" to have the newly created volume
activated or not activated based on the activation/auto_activation_volume_list
this way.
Note: -Z/--zero is not compatible with -aay, zeroing is not used in this case!
When using lvcreate -aay, a default warning message is also issued that zeroing
is not done.
Define auto_activation_handler that activates VGs/LVs automatically
based on the activation/auto_activation_volume_list (activating all
volumes by default if the list is not defined).
The autoactivation is done within the pvscan call in 69-dm-lvmetad.rules
that watches for udev events (device appearance/removal).
For now, this works for non-clustered and complete VGs only.
Normally, the 'vgchange -ay' activates all volume groups (that pass
the activation/volume_list filter if set).
This call can appear in two scenarios:
- system boot (so activation within a script in general)
- manual call on command line (so activaton on user's direct request)
For the former one, we would like to select which VGs should be actually
activated. One can define the list of VGs directly to do that. But that
would require the same list to be provided in all the scripts.
The 'vgchange -aay' will check for the activation/auto_activation_volume_list
in adition and it will activate only those VGs/LVs that pass this
filter (assuming all to be activated if the list is not defined - the
same logic we already have for activation/volume_list).
Init/boot scripts should use this form of activation primarily
(which, anyway, becomes only a fallback now with autoactivation done
on PV appearance in tandem with lvmetad in place).
Define an 'activation_handler' that gets called automatically on
PV appearance/disappearance while processing the lvmetad_pv_found
and lvmetad_pv_gone functions that are supposed to update the
lvmetad state based on PV availability state. For now, the actual
support is for PV appearance only, leaving room for PV disappearance
support as well (which is a more complex problem to solve as this
needs to count with possible device stack).
Add a new activation change mode - CHANGE_AAY exposed as
'--activate ay/-aay' argument ('activate automatically').
Factor out the vgchange activation functionality for use in other
tools (like pvscan...).
We're refererring to 'activation' all over the code and we're talking
about 'LVs being activated' all the time so let's use 'activation/activate'
everywhere for clarity and consistency (still providing the old
'available' keyword as a synonym for backward compatibility with
existing environments).
With latest changes in the udev, some deprecated functions were removed
from libudev amongst which there was the "udev_get_dev_path" function
we used to compare a device directory used in udev and directore set in
libdevmapper. The "/dev" is hardcoded in udev now (udev version >= 183).
Amongst other changes and from packager's point of view, it's also
important to note that the libudev development library ("libudev-devel")
could now be a part of the systemd development library ("systemd-devel")
because of the udev + systemd merge.
Support has many limitations and lots of FIXMEs inside,
however it makes initial task when user creates a separate LV for
thin pool data and thin metadata already usable, so let's enable
it for testing.
Easiest API:
lvconvert --chunksize XX --thinpool data_lv metadata_lv
More functionality extensions will follow up.
TODO: Code needs some rework since a lot of same code is getting copied.
Just to make it clearer since there is the "dmsetup info -c -o blkdevname"
as well that shows the "block device name for this mapping", having a
"BlkDevName" header on output.
It's a bit confusing then if the "dmsetup info -c -o devs_used,blkdevs_used"
is named with a plural "DevNames"/"BlkDevNames" but at the same time having
a totally different meaning than the singular form "BlkDevName".
DevNames --> DevNamesUsed
BlkDevNames --> BlkDevNamesUsed
...makes it much more comprehensible.
When resizing thin pool - we need to use strip info from _tdata volume.
In future more generic solution will be necessary once we start to support
lvconvert (resize of stacked devices and stay properly aligned).
For now we just allow striped or linear LV so this code will work.
When given lvresize new size - round upward for stripes - unless we use % and
we are at the border of free extents.
This patch is not a complete fix and few more cases will need special care.
When vg_read fails, it internally unlocks VG if it's been locked,
so in error path we should skip unlock_vg for this case.
(user would see ugly internal warning)
Calling vgscan alone should reuse information from the lvmetad (if running).
The --cache option should initiate direct device scan and update lvmetad
appropriately (if running).
This is mainly for vgscan to behave consistently compared to pvscan.
locally or on more nodes while others are activated exclusively.
Current pvmove code can either use local mirror (for exclusive
activation) or cmirror (for clustered LVs).
Because the whole intenal pvmove LV is just segmented LV containing
segments of several top-level LVs, code cannot properly handle
situation if some segment need to be activated exclusively.
Previously, it wrongly activated exclusive LV on all nodes
(locing code allowed it) but now this is no lnger possible.
If there is exclusively activated LV, pvmove is only
possible if all affected LVs are aslo activated exclusively.
(Note that in non-exclusive mode pvmove still activates LVs
on other nodes during move.)
# lvchange -aly vg_test/lv1
# lvchange -aey vg_test/lv2
# pvmove -i 1 /dev/sdc
Error locking on node bar-01: Device or resource busy
Error locking on node bar-03: Volume is busy on another node
...
Failed to activate lv2
Code adds better support for monitoring of thin pool devices.
update_pool_lv uses DMEVENTD_MONITOR_IGNORE to not manipulate with monitoring.
vgchange & lvchange are checking real thin pool device for existance
as we are using _tpool real device and visible LV pool device might not
be even active (_tpool is activated implicitely for any thin volume).
monitor_dev_for_events is another _lv_postorder like code it might be worth
to think about reusing it here - for now update the code to properly
monitory thin volume deps.
For unmonitoring add extra code to check the usage of thin pool - in case it's in use
unmonitoring of thin volume is skipped.
Never return unfinished toolcontext - since error path is hit on
various stages of initialization we cannot leave it partially uninitialized,
since we would need to spread many more test across the code for config_valid.
Instead return NULL and properly release udev library resources as well.
Fix regression in man page. The chunk size is in kilobyte units on command line
input though in the source code we work with sector size unit
so make it clear in the man page.
Update chunksize for thin pool in man page - it's max value is 1024M == 1G.
Fix warning range message to show proper max value.
If the lvcreate may decide some automagical values for a user,
try to keep the pool metadata size into 128MB range for optimal
perfomance (as suggested by Joe).
So if the pool metadata size and chunk_size were not specified,
try to select such values they would fit into 128MB size.
Auto mode can't deal with multiple mangled names. We can do that while working
in hex mode, but in auto mode, this would lead to device name ambiguity.
Move the code for poolmetadatasize operation into one place.
Report override for minimum and maximum size.
Drop _read_thin_params function its error reporting is handled elsewhere.
If no size was give the later added minimal size check efectively
disable this code. Also the argument for size now must be kept
in sector_size, so adding division by SECTOR_SIZE (moved into
a const expression)
Hold global lock in pvscan --lvmetad. (This might need refinement.)
Add PV name to "PV gone" messages.
Adjust some log message severities. (More changes needed.)
Addressing somewhat tricky bug here.
Since stdin,stdout,stderr were closed it's been occasionally possible to
see some unexpected messages to be flowing into a clvmd and generating some
randomly sized allocation of many megabytes. Since the message was not
being generated by standard send_message() construction, after some more
testing it apperead to be a debug log message - thus something has flown
to local socket opened on strandard out descriptor.
To fix the issue - use standard file descriptor duplication code for daemons.
For making easier debugging of polling daemon - developer might want to recompile
without modifition of standard file descriptors.
There were no messages printed upon completiion of RAID device replacement.
This could cause confusion/concern during automated recovery, because the
user sees the failure messages but no other messages indicating correction.
Read lvm.conf setting for monitoring for each command. So we should not
activate monitoring if the default compilation is set to monitor during
lvconvert commnads.
Patch also removes check for clustered VG and allows to disable monitoring
for clustered VG with the assumption, the problem with monitoring and dmeventd
flag passing for INGNORE is already fixed.
s/Issue/Use/, otherwise it is easy to misread "Issue" as "Issuing" - causing
the user confusion as to whether the action was performed automatically or
whether they need to issue the command.
'_lv_update_log_type' takes a lvconvert_params argument so that it can pass
down the user's preference of 'region_size' and allocation_policy. When
'mirror_remove_missing' was introduced (commit ID
95986e42a1) it didn't make sense to pass down
user preferences - so NULL was given instead. While it may never happen in
practice, static analysis reveals that this argument could be dereferenced.
So, if the user preferences were not passed in, glean the necessary fields
from what is already set in the LV.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
(Not updating WHATSNEW for this simple clean-up.)
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
Tested condition has been already evaluated before
For strlen() code has already excluded <ID_LEN.
For repairing, already tested (!argc && !repairing) before.
Add 'blkdevname' and 'blkdevs_used' field to dmsetup info -c -o.
Add 'blkdevname' option to dmsetup ls --tree to see block device names.
Add '-o options' to dmsetup deps and ls to select device name type on output.
We want to keep this logic -
when LV is extend - extend the LV by at least given amount,
when LV is reduced - reduce the LV by at most given amount.
So for this the rounding needs to be used.
Current logic which seems to satisfy give rule is to round up all
extent values for LV resize upward except for values with '-' sign
that are round downward.
This patch also fixes the problem when lvextend --use-polices tried
to extend LV the by i.e. 20% - but the resulting 20% were smaller
the extent size thus before this patch no extension happened.
The RAID plug-in for dmeventd now calls 'lvconvert --repair' to address failures
of devices in a RAID logical volume. The action taken can be either to "warn"
or "allocate" a new device from any spares that may be available in the
volume group. The action is designated by setting 'raid_fault_policy' in
lvm.conf - the default being "warn".
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
udev may also need to be disabled if you didn't build it statically too.
dmeventd.static could be fixed with some more work but I don't really see the
point: without dlopen() it's useless, and if you have dlopen(), why not support
normal shared libraries too?
Remove DM_THIN_ERROR_DEVICE_ID from API.
Remove API warning.
Drop code that was using DM_THIN_ERROR_DEVICE_ID (already commented)
Remove debug message which slipped in through some previous commit.
Since we finaly recognize thin creation only after
_determine_snapshot_type() - move _read_activation_params()
after it - so we can support lvcreate -an thin snapshot.
Always make sure table gets reloaded.
For now activate and deactivate pool volume if it's not active.
FIXME: we could do this only if we are sure some thin volume is alive.
Since activation of pool is now independent on thin activation,
user may do whatever he needs - thought preferable thin should stay alive,
but it it will be found inactivate, update_pool will bring the pool up.
All thins are created with the next activation and VG is updated
without messages. Only some basic commands works.
(i.e. lvcreate -an -V10 -T mvg/pool)
There can be some combination to confuse this system.
This functionality for snapshots is going to be interesting.
To ensure we properly handle LV cluster locking - explicitely do
not allow to change the availability of the thin pool that is in use
for some thin LV.
As soon as the thin volume is created the only way to activate pool
is via implicit dependency.
Ignore thinpool open count for lv/vgchange operations.
monitoring state of the logical volumes they are currently acting on.
Until now, every time a logical volume has been changed by a dmeventd plugin,
this plugin would have called back to dmeventd through the external FIFO
mechanism. I am fairly sure this was superfluous, inefficient and possibly even
dangerous.
The '--merge' option to lvconvert works on snapshots and RAID1. The man
pages correctly reflect this, but the CLI help output still used the term,
'SnapshotLogicalVolume'.
Workaround for the current code with big FIXME,
since proper solution for pvmove needs to be developed.
Commiting this only for the purpose to get cluster testing covered.
Example:
~> lvconvert --type raid1 vg/mirror_lv
Steps to convert "mirror" to "raid1"
1) Allocate a RAID metadata LV for each mirror image from the same PVs
on which they are located.
2) Clear the metadata LVs. This involves writing LVM metadata, so we don't
change any aspects of the mirror LV before this so that the user can easily
remove LVs from the failed convert attempt while retaining the original
mirror.
3) Remove the mirror log, if it exists.
4) Add metadata LVs to mirror LV
5) Rename mirror sub-lvs (s/mimage/rimage/)
6) Change flags and segtype from mirror to raid1
Example:
~> lvconvert --type raid1 -m 1 vg/lv
The following steps are performed to convert linear to RAID1:
1) Allocate a metadata device from the same PV as the linear device
to provide the metadata/data LV pair required for all RAID components.
2) Allocate the required number of metadata/data LV pairs for the
remaining additional images.
3) Clear the metadata LVs. This performs a LVM metadata update.
4) Create the top-level RAID LV and add the component devices.
We want to make any failure easy to unwind. This is why we don't create the
top-level LV and add the components until the last step. Should anything
happen before that, the user could simply remove the unnecessary images. Also,
we want to ensure that the metadata LVs are cleared before forming the array to
prevent stale information from polluting the new array.
A new macro 'seg_is_linear' was added to allow us to distinguish linear LVs
from striped LVs.
This patch allows a mirror to be extended without an initial resync of the
extended portion. It compliments the existing '--nosync' option to lvcreate.
This action can be done implicitly if the mirror was created with the '--nosync'
option, or explicitly if the '--nosync' option is used when extending the device.
Here are the operational criteria:
1) A mirror created with '--nosync' should extend with 'nosync' implicitly
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv ; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 10.00g lv_mlog 100.00
2) The 'M' attribute ('M' signifies a mirror created with '--nosync', while 'm'
signifies a mirror created w/o '--nosync') must be preserved when extending a
mirror created with '--nosync'. See #1 for example of 'M' attribute.
3) A mirror created without '--nosync' should extend with 'nosync' only when
'--nosync' is explicitly used when extending.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 5.02g lv_mlog 0.39
vs.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv --nosync; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.02g lv_mlog 100.00
4) The 'm' attribute must change to 'M' when extending a mirror created without
'--nosync' is extended with the '--nosync' option. (See #3 examples above.)
5) An inactive mirror's sync percent cannot be determined definitively, so it
must not be allowed to skip resync. Instead, the extend should ask the user if
they want to extend while performing a resync.
[EXAMPLE]# lvchange -an vg/lv
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv is not active. Unable to get sync percent.
Do full resync of extended portion of vg/lv? [y/n]: y
Logical volume lv successfully resized
6) A mirror that is performing recovery (as opposed to an initial sync) - like
after a failure - is not allowed to extend with either an implicit or
explicit nosync option. [You can simulate this with a 'corelog' mirror because
when it is reactivated, it must be recovered every time.]
[EXAMPLE]# lvcreate -m1 -L 5G -n lv vg --nosync --corelog
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "lv" created
[EXAMPLE]# lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 100.00
[EXAMPLE]# lvchange -an vg/lv; lvchange -ay vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 0.08
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv cannot be extended while it is recovering.
7) If 'no' is selected in #5 or if the condition in #6 is hit, it should not
result in the mirror being resized or the 'm/M' attribute being changed.
NOTE: A mirror created with '--nosync' behaves differently than one created
without it when performing an extension. The former cannot be extended when
the mirror is recovering (unless in-active), while the latter can. This is
a reasonable thing to do since recovery of a mirror doesn't take long (at
least in the case of an on-disk log) and it would cause far more time in
degraded mode if the extension w/o '--nosync' was allowed. It might be
reasonable to add the ability to force the operation in the future. This
should /not/ force a nosync extension, but rather force a sync'ed extension.
IOW, the user would be saying, "Yes, yes... I know recovery won't take long
and that I'll be adding significantly to the time spent in degraded mode, but
I need the extra space right now!".
The problem as reported by "ben <benscott@nwlink.com>" on lvm-devel:
vgsplit fails with mirrored mirror log
#lvs --all -o lv_name,lv_attr,devices
LV Attr Devices
MyMirror mwi--
[MyMirror_mimage_0] Iwi--- /dev/sdq(0)
[MyMirror_mimage_1] Iwi--- /dev/sdo(0)
[MyMirror_mimage_2] Iwi--- /dev/sdi(0)
[MyMirror_mlog] mwi---
[MyMirror_mlog_mimage_0] Iwi--- /dev/sds(0)
[MyMirror_mlog_mimage_1] Iwi--- /dev/sde(0)
#vgsplit -v "TestA" "TestB" "/dev/sdq" "/dev/sdo" "/dev/sdi" "/dev/sds"
"/dev/sde"
Checking for volume group "TestA"
Checking for new volume group "TestB"
Archiving volume group "TestA" metadata (seqno 213).
Can't split mirror MyMirror between two Volume Groups
AFTER FIX:
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
Volume group "new" not found
Skipping volume group new
LV VG Devices
lv vg lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] vg /dev/sdb1(0)
[lv_mimage_1] vg /dev/sdc1(0)
[lv_mlog] vg lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] vg /dev/sdh1(0)
[lv_mlog_mimage_1] vg /dev/sdi1(0)
[root@bp-01 ~]# vgsplit vg new /dev/sd[bchi]1
New volume group "new" successfully split from "vg"
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
LV VG Devices
lv new lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] new /dev/sdb1(0)
[lv_mimage_1] new /dev/sdc1(0)
[lv_mlog] new lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] new /dev/sdh1(0)
[lv_mlog_mimage_1] new /dev/sdi1(0)
If you specify the segment type (e.g. --type mirror) and the mirrors argument
as zero, it would result in a mirrored LV with only one image. While the device
may be valid in theory, it should not be allowed in practice. It also makes it
difficult on the conversion tools, since they react badly to single-image
mirrors.
LVM has huge set of options now - it's approaching 60 short-arg less options
and we get interesting case of misdetection for 'merge' option which has been
put into the middle of options with 'short_arg' - thus certainly past 65. (ASCII 'A').
To avoid confusion of short_arg with long_opt number - add '128' to all such
non-short-arg options.
Makes dumpconfig whole-section output wrong in a different way from before,
but we should be able to merge cft_cmdline properly into cmd->cft now and
remove cascade.
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
~> lvconvert --splitmirrors 1 --trackchanges vg/lv
The '--trackchanges' option allows a user the ability to use an image of
a RAID1 array for the purposes of temporary read-only access. The image
can be merged back into the array at a later time and only the blocks that
have changed in the array since the split will be resync'ed. This
operation can be thought of as a partial split. The image is never completely
extracted from the array, in that the array reserves the position the device
occupied and tracks the differences between the array and the split image via
a bitmap. The image itself is rendered read-only and the name (<LV>_rimage_*)
cannot be changed. The user can complete the split (permanently splitting the
image from the array) by re-issuing the 'lvconvert' command without the
'--trackchanges' argument and specifying the '--name' argument.
~> lvconvert --splitmirrors 1 --name my_split vg/lv
Merging the tracked image back into the array is done with the '--merge'
option (included in a follow-on patch).
~> lvconvert --merge vg/lv_rimage_<n>
The internal mechanics of this are relatively simple. The 'raid' device-
mapper target allows for the specification of an empty slot in an array
via '- -'. This is what will be used if a partial activation of an array
is ever required. (It would also be possible to use 'error' targets in
place of the '- -'.) If a RAID image is found to be both read-only and
visible, then it is considered separate from the array and '- -' is used
to hold it's position in the array. So, all that needs to be done to
temporarily split an image from the array /and/ cause the kernel target's
bitmap to track (aka "mark") changes made is to make the specified image
visible and read-only. To merge the device back into the array, the image
needs to be returned to the read/write state of the top-level LV and made
invisible.
Users already have the ability to split an image from an LV of "mirror"
segtype. This patch extends that ability to LVs of "raid1" segtype.
This patch only allows a single image to be split off, however. (The
"mirror" segtype allows an arbitrary number of images to be split off.
e.g. 4-way => 3-way/linear, 2-way/2-way, linear,3-way)
Move the free_vg() to vg.c and replace free_vg with release_vg
and make the _free_vg internal.
Patch is needed for sharing VG in vginfo cache so the release_vg function name
is a better fit here.
Defer the test of the function return value after the string memory is released.
Otherwise in this error path the string would present memory leak.
(Thought in this case we are already out of memory...)
Implementation described in doc/lvm2-raid.txt.
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
The conditional is not just unnecessary, it would have been wrong. The code
is suppose to be checking if the 'splitmirrors_ARG' is negative, but it
instead is checking 'mirrors_ARG'. Rather than changing the argument being
checked, I've pulled the check entirely because 'splitmirrors_ARG' is already
guarenteed to not be negative by virtue of the fact that it is a 'int_arg'.
Negative values will be caught in _process_command_line().
We've used udev fallback code till now to check whether udev
created/removed the entries in /dev correctly and if not,
a repair was done (giving a warning messagea about that).
This patch adds a possibility to enable this additional check
and subsequent fallback only when required (debugging purposes
mostly) and trust udev completely.
So let's disable the fallback code by default and add a new
configuration option "activation/udev_fallback".
(The original code for creating the nodes will still be used
in case the device directory that is set in lvm.conf differs
from the one that udev uses and also when activation/udev_rules
is set to 0 - otherwise we would end up with no nodes/symlinks
at all)
are affected by the move. (Currently it's possible for I/O to become
trapped between suspended devices amongst other problems.
The current fix was selected so as to minimise the testing surface. I
hope eventually to replace it with a cleaner one that extends the
deptree code.
Some lvconvert scenarios still suffer from related problems.
Patch adds check for stripe not only in direct
LV segment but also in mirror image segment.
This prevents bugs like:
# lvcreate -i2 -l10 -n lv vg_test
# lvconvert -m1 -i1 vg_test/lv
# lvreduce -f -l1 vg_test/lv
WARNING: Reducing active logical volume to 4.00 MiB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Reducing logical volume lv to 4.00 MiB
Segment extent reduction 9 not divisible by #stripes 2
Logical volume lv successfully resized
# lvremove -f vg_test
Segment extent reduction 1 not divisible by #stripes 2
LV segment lv:0-4294967295 is incorrectly listed as being used by LV lv_mimage_0
Internal error: LV segments corrupted in lv_mimage_0.
We should never remove more extents than requested by user,
so round up to next stripe boundary during lvreduce.
Also this fixes round to zero sized LV bug:
# lvcreate -i2 -I 64k -l10 -n lvs vg_test
# lvreduce -f -l1 vg_test/lvs
Rounding size (1 extents) down to stripe boundary size for segment (0 extents)
WARNING: Reducing active logical volume to 0
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Reducing logical volume lvs to 0
Failed to suspend lvs
Before, we used vg_write_lock_held call to determnine the way a device is
opened. Unfortunately, this opened many devices in RW mode when it was not
really necessary. With the OPTIONS+="watch" rule used in the udev rules,
this could fire numerous events while closing such devices (and it caused
useless scans from within udev rules in return).
A common bug we hit with this was with the lvremove command which was unable
to remove the LV since it was being opened from within the udev rules. This
patch should minimize such situations (at least with respect to LVM handling
of devices).
Though there's still a possibility someone will open a device 'outside' in
parallel and fire the event based on the watch rule when closing a device
once opened for RW.
allocates these buffers in such way it adds memory page for each such buffer
and size of unlock memory check will mismatch by 1 or 2 pages.
This happens when we print or read lines without '\n' so these buffers are
used. To avoid this extra allocation, use setvbuf to set these bufffers ahead.
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Also, add a new 'obtain_device_list_from_udev' setting to lvm.conf with which
we can turn this feature on or off if needed.
If set, the cache of block device nodes with all associated symlinks
will be constructed out of the existing udev database content.
This avoids using and opening any inapplicable non-block devices or
subdirectories found in the device directory. This setting is applied
to udev-managed device directory only, other directories will be scanned
fully. LVM2 needs to be compiled with udev support for this setting to
take effect. N.B. Any device node or symlink not managed by udev in
udev directory will be ignored with this setting on.
allows us to allocate all images of a mirror (or RAID array) at one
time during create.
The current mirror implementation still requires a separate allocation
for the log, however.
As now the FID management is more complex, the code inside free_vg needs
to access some parts of memory pools which were not needed before.
For this - makes the order of unlock_and_free_vg() unconditional.
Keek using unlock_and_free_vg() API function.
For properly working VG locking mechanism only the alphabeting locking
orderer needs to be preserved.
TODO: there could be few more code parts simplified when we 'officially'
support of referencies between different memory pools.
With recent update of dm_report_field_string() API call to accept
completely const objects - we no longer need loose constness here
and keep it forwarding.
When I see 'seg_is_mirrored', I expect the argument to be an lv_segment.
In this case, it is lvcreate_params. Both structures, have a 'segtype'
entry which the macro dereferences. However, it just seems easier to
understand if we do 'segtype_is_mirrored' instead.
Since format instances will use own memory pool, it's necessary to properly
deallocate it. For now, only fid is deallocated. The PV structure itself
still uses cmd mempool mostly, but anytime we'd like to add a mempool
in the struct physical_volume, we can just rename this fn to free_pv and
add the code (like we have free_vg fn for VGs).
While STRIPE_SIZE_LIMIT * 2 is basically UINT_MAX, 32bit integer
value can already overflow durin arg size parsing.
(This really happens in test where --stripesize 4294967291 is used,
in s390x uint overflow and this test is ineffective.)