IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Fix buggy usage of "" (empty string) as a numerical string
value used for sorting.
On intel 64b platform this was typically resolve
as 0xffffff0000000000 - which is already 'close' to
UINT64_MAX which is used for _minusone64.
On other platforms it might have been giving
different numbers depends on aligment of strings.
Use proper &_minusone64 for sorting value when the reported
value is NUM.
Note: each numerical value needs to be thought about if it needs
default value &_zero64 or &_minusone64 since for cases, were
value of zero is valid, sorting should not be mixing entries
together.
Add wrapper function for dm_report_field_set_value() which returns void
and return 1, so the code could be shorter.
Add wrapper function for percent display _field_set_percent().
This patch fixes mostly cluster behavior but also updates
non-cluster reaction where calls like 'lvchange -aln'
lead to incorrect errors for some segment types.
Fix the implicit activation rules where some segment types could
be activated only in exclusive mode in cluster.
lvm2 command was not preserver 'local' property and incorrectly
converted local activations in to plain exclusive, so the local
activation could have activate volumes exclusively, but remotely.
If the volume_list filters out volume from activation,
it is still success result for this function.
Change the error message back to verbose level.
Detect if the volume is active localy before zeroing,
so we report error a bit later for cases, where volume
could not be activated because it doesn't pass through volume
list (but user still could create volume when he disables
zeroing)
Correct return code of activate_lv_excl().
Function is not supposed to return activation state of
activated volume, but return code of the operation.
Since i.e. when activation filter is allowing to activate
volume on current system, it is still success even though
no volume is activated.
There is a problem with the way mirrors have been designed to handle
failures that is resulting in stuck LVM processes and hung I/O. When
mirrors encounter a write failure, they block I/O and notify userspace
to reconfigure the mirror to remove failed devices. This process is
open to a couple races:
1) Any LVM process other than the one that is meant to deal with the
mirror failure can attempt to read the mirror, fail, and block other
LVM commands (including the repair command) from proceeding due to
holding a lock on the volume group.
2) If there are multiple mirrors that suffer a failure in the same
volume group, a repair can block while attempting to read the LVM
label from one mirror while trying to repair the other.
Mitigation of these races has been attempted by disallowing label reading
of mirrors that are either suspended or are indicated as blocking by
the kernel. While this has closed the window of opportunity for hitting
the above problems considerably, it hasn't closed it completely. This is
because it is still possible to start an LVM command, read the status of
the mirror as healthy, and then perform the read for the label at the
moment after a the failure is discovered by the kernel.
I can see two solutions to this problem:
1) Allow users to configure whether mirrors can be candidates for LVM
labels (i.e. whether PVs can be created on mirror LVs). If the user
chooses to allow label scanning of mirror LVs, it will be at the expense
of a possible hang in I/O or LVM processes.
2) Instrument a way to allow asynchronous label reading - allowing
blocked label reads to be ignored while continuing to process the LVM
command. This would action would allow LVM commands to continue even
though they would have otherwise blocked trying to read a mirror. They
can then release their lock and allow a repair command to commence. In
the event of #2 above, the repair command already in progress can continue
and repair the failed mirror.
This patch brings solution #1. If solution #2 is developed later on, the
configuration option created in #1 can be negated - allowing mirrors to
be scanned for labels by default once again.
Add LV_TEMPORARY flag for LVs with limited existence during command
execution. Such LVs are temporary in way that they need to be activated,
some action done and then removed immediately. Such LVs are just like
any normal LV - the only difference is that they are removed during
LVM command execution. This is also the case for LVs representing
future pool metadata spare LVs which we need to initialize by using
the usual LV before they are declared as pool metadata spare.
We can optimize some other parts like udev to do a better job if
it knows that the LV is temporary and any processing on it is just
useless.
This flag is orthogonal to LV_NOSCAN flag introduced recently
as LV_NOSCAN flag is primarily used to mark an LV for the scanning
to be avoided before the zeroing of the device happens. The LV_TEMPORARY
flag makes a difference between a full-fledged LV visible in the system
and the LV just used as a temporary overlay for some action that needs to
be done on underlying PVs.
For example: lvcreate --thinpool POOL --zero n -L 1G vg
- first, the usual LV is created to do a clean up for pool metadata
spare. The LV is activated, zeroed, deactivated.
- between "activated" and "zeroed" stage, the LV_NOSCAN flag is used
to avoid any scanning in udev
- betwen "zeroed" and "deactivated" stage, we need to avoid the WATCH
udev rule, but since the LV is just a usual LV, we can't make a
difference. The LV_TEMPORARY internal LV flag helps here. If we
create the LV with this flag, the DM_UDEV_DISABLE_DISK_RULES
and DM_UDEV_DISABLE_OTHER_RULES flag are set (just like as it is
with "invisible" and non-top-level LVs) - udev is directed to
skip WATCH rule use.
- if the LV_TEMPORARY flag was not used, there would normally be
a WATCH event generated once the LV is closed after "zeroed"
stage. This will make problems with immediated deactivation that
follows.
This patch reinstates the lv_info call to check for open count of
the LV we're removing/deactivating - this was changed with commit 125712b
some time ago and we relied on the ioctl retry logic deeper in the libdm
while calling the exact 'remove' ioctl.
However, there are still some situations in which it's still required to
check for open count before we do any 'remove' actions - this mainly
applies to LVs which consist of several sub LVs, like it is for
virtual snapshot devices.
The commit 1146691 fixed the issue with ordering of actions during
virtual snapshot removal while the snapshot is still open. But
the check for the open status of the snapshot is still prone to
marking the snapshot as in use with an immediate exit even though
this could be a temporary asynchronous open only, most notably
because of udev and its WATCH udev rule with accompanying scans
for the event which is asynchronous. The situation where this crops
up most often is when we're closing the LV that was open for read-write
and then calling lvremove immediately.
This patch reinstates the original lv_info call for the open status
of the LV in the lv_check_not_in_use fn that gets called before
we do any LV removal/deactivation. In addition to original logic,
this patch adds its own retry loop with a delay (25x0.2 seconds)
besides the existing ioctl retry loop.
Component LVs of a thinpool can be RAID LVs. Users who attempt a
scrubbing operation directly on a thinpool will be prompted to
specify the sub-LV they wish the operation to be performed on. If
neither of the sub-LVs are RAID, then a message telling them that
the operation can only be performed on a RAID LV will be given.
Split image should have an out-of-sync attr ('I') - always. Even if
the RAID LV has not been written to since the LV was split off, it is
still not part of the group that makes up the RAID and is therefore
"out-of-sync".
Since the virtual snapshot has no reason to stay alive once we
detach related snapshot - deactivate whole thing in front of
snapshot removal - otherwice the code would get tricky for
support in cluster.
The correct full solution would require to have transactions
for libdm operations.
Also enable to the check for snapshot being opened prior
the origin deactivation, otherwise we could easily end
with the origin being deactivate, but snapshot still kept
active, desynchronizing locking state in cluster.
Addendum to commit ce7489e which introduced a new *internal* LV_NOSCAN
flag and so it needs to be marked that way properly otherwise it
ends up unrecognized and improperly handled during metadata export.
A common scenario is during new LV creation when we need to wipe the
newly created LV and avoid any udev scanning before this stage otherwise
it could cause the device (the LV) to be claimed by some other subsystem
for which there were stale metadata within LV data.
This patch adds possibility to mark the LV we're just about to wipe with
a flag that gets passed to udev via DM_COOKIE as a subsystem specific
flag - DM_SUBSYSTEM_UDEV_FLAG0 (in this case the subsystem is "LVM")
so LVM udev rules will take care of handling that.
Accept --ignoreskippedcluster with pvs, vgs, lvs, pvdisplay, vgdisplay,
lvdisplay, vgchange and lvchange to avoid the 'Skipping clustered
VG' errors when requesting information about a clustered VG
without using clustered locking and still exit with success.
The messages can still be seen with -v.
Some code has been added recently which makes it impossible to compile
when "configure --disable-devmapper" is used. This patch just shuffles
the code around so it's under proper #ifdef DEVMAPPER_SUPPORT.
lib/metadata/lv_manip.c:_sufficient_pes_free() was calculating the
required space for RAID allocations incorrectly due to double
accounting. This resulted in failure to allocate when available
space was tight.
When RAID data and metadata areas are allocated together, the total
amount is stored in ah->new_extents and ah->alloc_and_split_meta is
set. '_sufficient_pes_free' was adding the necessary metadata extents
to ah->new_extents without ever checking ah->alloc_and_split_meta.
This often led to double accounting of the metadata extents. This
patch checks 'ah->alloc_and_split_meta' to perform proper calculations
for RAID.
This error is only present in the function that checks for the needed
space, not in the functions that do the actual allocation.
If "default" thin pool chunk size calculation method is selected,
use minimum_io_size, otherwise optimal_io_size for "performance"
device hint exposed in sysfs. If there appear to be PVs with
different hints presented, use their least common multiple.
If the hint is less than the default value defined for the
calculation method, use the default value instead.
Add allocation/thin_pool_chunk_size_calculation lvm.conf
option to select a method for calculating thin pool chunk
sizes and define two possible values - "default" and "performance".
A previous commit (b6bfddcd0a) which
was designed to prevent segfaults during lvextend when trying to
extend striped logical volumes forgot to include calculations for
RAID4/5/6 parity devices. This was causing the 'contiguous' and
'cling_by_tags' allocation policies to fail for RAID 4/5/6.
The solution is to remember that while we can compare
ah->area_count == prev_lvseg->area_count
for non-RAID, we should compare
(ah->area_count + ah->parity_count) == prev_lvseg->area_count
for a general solution.
When NULL info struct is passed in - function is usable
as a quick query for lv_is_active_locally() - with a bonus
we may query for layered device.
So it could be seen as a more efficient lv_is_active_locally().
Add internal devtypes reporting command to display built-in recognised
block device types. (The output does not include any additional
types added by a configuration file.)
> lvm devtypes -o help
Device Types Fields
-------------------
devtype_all - All fields in this section.
devtype_name - Name of Device Type exactly as it appears in /proc/devices.
devtype_max_partitions - Maximum number of partitions. (How many device minor numbers get reserved for each device.)
devtype_description - Description of Device Type.
> lvm devtypes
DevType MaxParts Description
aoe 16 ATA over Ethernet
ataraid 16 ATA Raid
bcache 1 bcache block device cache
blkext 1 Extended device partitions
...
Older gcc is giving misleading warning:
metadata/lv_manip.c:4018: warning: ‘seg’ may be used uninitialized in
this function
But warning free compilation is better.
Creation, deletion, [de]activation, repair, conversion, scrubbing
and changing operations are all now available for RAID LVs in a
cluster - provided that they are activated exclusively.
The code has been changed to ensure that no LV or sub-LV activation
is attempted cluster-wide. This includes the often overlooked
operations of activating metadata areas for the brief time it takes
to clear them. Additionally, some 'resume_lv' operations were
replaced with 'activate_lv_excl_local' when sub-LVs were promoted
to top-level LVs for removal, clearing or extraction. This was
necessary because it forces the appropriate renaming actions the
occur via resume in the single-machine case, but won't happen in
a cluster due to the necessity of acquiring a lock first.
The *raid* tests have been updated to allow testing in a cluster.
For the most part, this meant creating devices with '-aey' if they
were to be converted to RAID. (RAID requires the converting LV to
be EX because it is a condition of activation for the RAID LV in
a cluster.)
When images and their associated metadata are removed from a RAID1 LV,
the remaining sub-LVs are "shifted" down to fill the gaps. For
example, if there is a 3-way mirror:
[0][1][2]
and we remove device#0, the devices will be shifted down
[1][2]
and renamed.
[0][1]
This can create a problem for resume_lv (specifically,
dm_tree_activate_children) during the renaming process though. This
is because it will attempt to rename the higher indexed sub-LVs first
and find that it cannot because there are currently other sub-LVs with
that name. The solution is to check for a conflicting name before
attempting to rename. If a conflict is found and that conflicting
sub-LV is also in the process of renaming, we can defer the current
rename until the conflicting sub-LV has renamed and cleared the
conflict.
Now that resume_lv can handle these types of rename conflicts, we can
remove the workaround in RAID that was attempting to resume a RAID1
LV from the bottom-up in order to force a proper rename in assending
order before attempting a resume on the top-level LV. This "hack"
only worked for single machine use-cases of LVM. Clearing this up
paves the way for exclusive activation of RAID LVs in a cluster.
Properly skip unmonitoring of thin pool volume in deactivation code
path. Code makes sure if there is just any thin pool user
it stays monitored with all its resources.
When the pool is created from non-linear target the more complex rules
have to be used and stacking needs to properly decode args for _tdata
LV. Also proper allocation policies are being used according to those
set in lvm2 metadata for data and metadata LVs.
Also properly check for active pool and extra code to active it
temporarily.
With this fix it's now possible to use:
lvcreate -L20 -m2 -n pool vg --alloc anywhere
lvcreate -L10 -m2 -n poolm vg --alloc anywhere
lvconvert --thinpool vg/pool --poolmetadata vg/poolm
lvresize -L+10 vg/pool
The pool metadata LV must be accounted for when determining what PVs
are in a thin-pool. The pool LV must also be accounted for when
checking thin volumes.
This is a prerequisite for pvmove working with thin types.
The function 'get_pv_list_for_lv' will assemble all the PVs that are
used by the specified LV. It uses 'for_each_sub_lv' to traverse all
of the sub-lvs which may compose it.
This is a regression caused by commit 3bd9048854.
The error message added with that commit "mpath major %d is not dm major %d" is
superfluous.
When scanning for mpath components, we're looking for a parent device.
But this parent device is not necessarily an mpath device (so the dm device)
if it exists - it can be any other device layered on top (e.g. an MD RAID device).
- null_fd resource leak on error path in _reopen_fd_null fn
- dead code in verify_message in clvmd code
- dead code in _init_filter_components in toolcontext code
- null dereference in dm_prepare_selinux_context on error path if
setfscreatecon fails while resetting SELinux context
Split out the partitioned device filter that needs to open the device
and move the multipath filter in front of it.
When a device is multipathed, sending I/O to the underlying paths may
cause problems, the most obvious being I/O errors visible to lvm if a
path is down.
Revert the incorrect <backtrace> messages added when a device doesn't
pass a filter.
Log each filter initialisation to show sequence.
Avoid duplicate 'Using $device' debug messages.
According to bug 995193, if a volume group
1) contains a mirror
2) is clustered
3) 'locking_type' = 0 is used
then it is not possible to remove the 'c'luster flag from the VG. This
is due to the way _lv_is_active behaves.
We shouldn't allow the cluster flag to be flipped unless the mirrors in
the cluster are not active. This is because different kernel modules
are used depending on whether a mirror is cluster or not. When we
attempt to see if the mirror is active, we first check locally. If it
is not, then we attempt to check for remotely active instances if the VG
is clustered. Since the no_lock locking type is LCK_CLUSTERED, but does
not implement 'query_resource', remote_lock_held will always return an
error in this case. An error from remove_lock_held is treated as though
the lock _is_ held (i.e. the LV is active remotely). This blocks the
cluster flag from changing.
The solution is to implement 'query_resource' for the no_lock type. It
will report a message and return 1. This will allow _lv_is_active to
function properly. The LV would be considered not active remotely and
the VG can change its flag.
gcc -O2 v4.8 on 32 bit architecture is causing a bug in parameter
passing. It does not happen with -01 nor -O0.
The problematic part of the code was strlen use in config.c in
the config_def_check fn and the call for _config_def_check_tree in it:
<snip>
rplen = strlen(rp);
if (!_config_def_check_tree(handle, vp, vp + strlen(vp), rp, rp + rplen, CFG_PATH_MAX_LEN - rplen, cn, cmd->cft_def_hash)) ...
</snip>
If compiled with -O0 (correct):
Breakpoint 1, config_def_check (cmd=0x819b050, handle=0x81a04f8) at config/config.c:775
(gdb) p vp
$1 = 0x8189ee0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb)
_config_def_check_tree (handle=0x81a04f8, vp=0x8189ee0 <_cfg_path>
"config", pvp=0x8189ee6 <_cfg_path+6> "", rp=0xbfffe1e8 "config",
prp=0xbfffe1ee "", buf_size=58, root=0x81a2568, ht=0x81a65
48) at config/config.c:680
(gdb) p vp
$4 = 0x8189ee0 <_cfg_path> "config"
(gdb) p pvp
$5 = 0x8189ee6 <_cfg_path+6> ""
If compiled with -O2 (incorrect):
Breakpoint 1, config_def_check (cmd=cmd@entry=0x8183050, handle=0x81884f8) at config/config.c:775
(gdb) p vp
$1 = 0x8172fc0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb) p vp + strlen(vp)
$3 = 0x8172fc6 <_cfg_path+6> ""
(gdb)
_config_def_check_tree (handle=handle@entry=0x81884f8, pvp=0x8172fc7
<_cfg_path+7> "host_list", rp=rp@entry=0xbffff190 "config",
prp=prp@entry=0xbffff196 "", buf_size=buf_size@entry=58, ht=0x
818e548, root=0x818a568, vp=0x8172fc0 <_cfg_path> "config") at
config/config.c:674
(gdb) p pvp
$4 = 0x8172fc7 <_cfg_path+7> "host_list"
The difference is in passing the "pvp" arg for _config_def_check_tree.
While in the correct case, the value of _cfg_path+6 is passed
(the result of vp + strlen(vp) - see the snippet of the code above),
in the incorrect case, this value is increased by 1 to _cfg_path+7,
hence totally malforming the string that is being processed.
This ends up with incorrect validation check and incorrect warning
messages are issued like:
"Configuration setting "config/checks" has invalid type. Found integer, expected section."
To workaround this issue, remove the "static" qualifier from the
"static char _cfg_path[CFG_PATH_MAX_LEN]". This causes the optimalizer
to be less aggressive (also shuffling the arg list for
_config_def_check_tree call helps).
Commit b248ba0a39 attempted to
prevent mirror devices which had a failed device in their
mirrored log from being usable/readable by LVM. This was to
protect against circular dependancies where one LVM command
could be blocked trying to read one of these affected mirrors
while the LVM command to fix/unblock that mirror was stuck
behind the currently running command.
The above commit went wrong when it used 'device_is_usable()' to
recurse on the mirrored log device to check if it was suspended
or blocked. The 'device_is_usable' function also contains a check
for reserved names - like *_mlog, etc. This last check always
triggered when checking a mirror's log simply because of the name,
not because it was suspended or blocked - a false positive.
The solution is to create a new function like 'device_is_usable',
but without the check for reserved names. Using this new function
(device_is_suspended_or_blocked), we can check the status of a
mirror's log device properly.
When both the '-i' and '-m' arguments are specified on the command
line, use the "raid10" segment type. This way, the native RAID10
personality is used through dm-raid rather than layering a mirror
on striped LVs. If the old behavior is desired, the '--type'
argument to use would be "mirror" rather than "raid10".
When reading an info about MDAs from lvmetad, we need to use 64 bit
int to read the value of the offset/size, otherwise the value is
overflows and then it's used throughout!
This is dangerous if we're trying to write such metadata area then,
mostly visible if we're using 2 mdas where the 2nd one is at the end
of the underlying device and hence the value of the mda offset is
high enough to cause problems:
(the offset trimmed to value of 0 instead of 4096m, so we write
at the very start of the disk (or elsewhere if the offset has
some other value!)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 4097m
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
[1] raw/~ # pvchange -u /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" changed
1 physical volume changed / 0 physical volumes not changed
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 d43e d2a5 4c20 4d56 2032 5b78 4135 7225
0000010 4e30 3e2a 0001 0000 0000 0000 0000 0000
0000020 0000 0010 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
=======
(the offset overflows to undefined values which is far behind
the end of the disk)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 100g
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # pvchange -u /dev/vg/lvol0
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
Failed to store physical volume "/dev/vg/lvol0"
0 physical volumes changed / 1 physical volume not changed
When creating a new thin pool and there's no profile requested
via "lvcreate --profile ...", inherit any VG profile if it's attached.
Currently this applies to these settings:
allocation/thin_pool_chunk_size
allocation/thin_pool_discards
allocation/thin_pool_zero
Add new configure lvm.conf options for binaries thin_repair
and thin_dump.
Those are part of device-mapper-persistent-data package
and will be used for recovery of thin_pool.
The PREFERRED allocation mechanism requires the number of areas in the
previous LV segment to match the number in the new segment being
allocated. If they do not match, the code may crash.
E.g. https://bugzilla.redhat.com/989347
Introduce A_AREA_COUNT_MATCHES and when not set avoid referring
to the previous segment with the contiguous and cling policies.
When using a global_filter and if this filter is incorrectly
specified, we ended up with a segfault:
raw/~ $ pvs
Invalid filter pattern "r|/dev/sda".
Segmentation fault (core dumped)
In the example above a closing '|' character is missing at the end
of the regex. The segfault itself was caused by trying to destroy
the same filter twice in _init_filters fn within the error path
(the "bad" goto target):
bad:
if (f3)
f3->destroy(f3);
if (f4)
f4->destroy(f4);
Where f3 is the composite filter (sysfs + regex + type + md + mpath filter)
and f4 is the persistent filter which encompasses this composite filter
within persistent filter's 'real' field in 'struct pfilter'.
So in the end, we need to destroy the persistent filter only as
this will also destroy any 'real' filter attached to it.
commit d00d45a8b6 introduced changes
that are causing cluster mirror tests to fail. Ultimately, I think
the change was right, but a proper clean-up will have to wait.
The portion of the commit we are reverting correlates to the
following commit comment:
2) lib/metadata/mirror.c:_delete_lv() - should have been calling
_activate_lv_like_model() with 'mirror_lv'. This is because
'mirror_lv' is the LV that the overall operation is being
performed on. We need to use this LV as the basis for
determining whether to activate locally, or across the
cluster, etc.
It appears that when legs or logs are removed from a mirror, they
are being activated before they are deleted in order to make them
top-level LVs that can be acted upon. When doing this, it appears
they are not activated based on the characteristics of the mirror
from which they came. IOW, if the mirror was exclusively active,
the sub-LVs are activated globally. This is a no-no. This then
made it impossible to activate_lv_like_model if the model was
"mirror_lv" instead of "lv" in _delete_lv(). Thus, at some point
this change should probably be put back and those location where
the sub-LVs are being improperly activated "shared" instead of
EX should be corrected.
Three fixme's addressed in this commit:
1) lib/metadata/lv_manip.c:_calc_area_multiple() - this could be
safely changed to a comment explaining that currently because
RAID10 can only have a 2-way mirror, we don't need to know the
number of stripes. However, we will need to know that in the
future if RAID10 is to support more than 2-way mirroring.
2) lib/metadata/mirror.c:_delete_lv() - should have been calling
_activate_lv_like_model() with 'mirror_lv'. This is because
'mirror_lv' is the LV that the overall operation is being
performed on. We need to use this LV as the basis for
determining whether to activate locally, or across the
cluster, etc.
3) tools/lvcreate.c:_lvcreate_params() - Minor clean-up. If
'-m 0' is given, treat it as though the mirroring argument
was not given (i.e. as though the requested segment type
was 'stripe' and not mirror).
Activation is needed only for clustered VG.
For non-clustered VG skip activation, since deactivate_lv()
is called without problems (no testing for lock presence).
(updates f6ded62291)
When the merging of snapshot is finished, we need to clean dm table
intries for snapshot and -cow device. So for merging snapshot
we have to activate_lv plain 'cow' LV and let the table
resolver to its work - shortly deactivation_lv() request
will follow - in cluster this needs LV lock to be held by clvmd.
Also update a test - add small wait - if lvremove is not 'fast enough'
and merging process has not been stopped and $lv1 removed in background.
Ortherwise the following lvcreate occasionally finds name $lv1 still in use.
(in release fix)
The status printed for dm-raid targets on older kernels does not include
the syncaction field. This is handled by dev_manager_raid_status() just
fine by populating the raid status structure with NULL for that field.
However, lv_raid_sync_action() does not properly handle that field being
NULL. So, check for it and return 0 if it is NULL.
Add --poolmetadataspare option and creates and handles
pool metadata spare lv when thin pool is created.
With default setting 'y' it tries to ensure, spare has
at least the size of created LV.
Pool creation involves clearing of metadata device
which triggers udev watch rule we cannot udev synchronize with
in current code.
This metadata devices needs to be activated localy,
so in cluster mode deactivation and reactivation
is always needed.
However for non-clustered mode we may reload table
via suspend/resume path which avoids collision with
udev watch rule which was occasionaly triggering
retry deactivation loop.
Code has been also split into 2 separate code paths
for thin pools and thin volumes which improved readability
of the code as well.
Deactivation has been moved out of extend_pool() and
decision is now in _lv_create_an_lv() which knows
the change mode.
Since we vg_write&commit metadata LV inside lv_extend() call,
proper restore is needed in case something fails.
So add bad: section which deactivates activated LV
and removes it from VG.
Also check early for metadata LV name lengh fail.
When tree for thin LVs was using external_lv, there has been
far less optimal solution, that has tried to add certain
existing dependencie only when new node was added.
However this has lead to way to complex tree construction since
many repeated checks have been made during such tree build.
This patch move this detection to the proper _partial_tree generation
code and uses for it new 'activation' flag, which is set when
tree for ACTIVATION or PRELOAD is generated.
It increases performance when thins with external origins are used.
(in release update)
Created dlid for test is not needed afterward, so lower a memory
usage of this call is repeatedly used for building some large tree.
TODO: create function to use given buffer on stack as much cheaper.
Code needs to check if the layer origin device is suspended,
It's valid to create thinvolume snapshot of thinvolume which is also
used as an old-style snapshot. In this case we need to check -real
is suspended.
When adding origin_only - add only layer thin volume.
(in case it's also old-snapshot add only -real device)
Remove backup() call from update_pool_lv() as it's been there
duplicated and preperly order backup() call after lvresize,
so there is just one such call.
If the thin pool is known to be active, messages can be passed
to the pool even when the created thin volume is not going to be
activated.
So we do not need to stack large list of message and validate
and catch creation errors earlier in this case.
Replace the test for valid activation combination with simpler list of
deactivation combinations.
cfg_def_get_path uses a global static var to store the result (for efficiency).
So we need to apply the profile first and then get the path for the config item
when calling find_config_tree_* fns.
Also activation/auto_set_activation is not profilable (at least not now,
maybe later if we need that).
Add thin and thin pool lv creation support to lvm library
This is Mohan's thinp patch, re-worked to include suggestions
from Zdenek and Mohan.
Rework of commit 4d5de8322b
which uses refactored properties handling.
Based on work done by M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The common bits from lib/report/properties.[c|h] have
been moved to lib/properties/prop_common.[c|h] to allow
re-use of property handling functionality without
polluting the report handling functionality.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The activation/auto_set_activation_skip enables/disables automatic
adding of the ACTIVATION_SKIP LV flag. By default thin snapshots
are flagged to be skipped during activation.
And by default, the auto_set_activation_skip is enabled.
Also add -k/--setactivationskip y/n and -K/--ignoreactivationskip
options to lvcreate.
The --setactivationskip y sets the flag in metadata for an LV to
skip the LV during activation. Also, the newly created LV is not
activated.
Thin snapsots have this flag set automatically if not specified
directly by the --setactivationskip y/n option.
The --ignoreactivationskip overrides the activation skip flag set
in metadata for an LV (just for the run of the command - the flag
is not changed in metadata!)
A few examples for the lvcreate with the new options:
(non-thin snap LV => skip flag not set in MDA + LV activated)
raw/~ $ lvcreate -l1 vg
Logical volume "lvol0" created
raw/~ $ lvs -o lv_name,attr vg/lvol0
LV Attr
lvol0 -wi-a----
(non-thin snap LV + -ky => skip flag set in MDA + LV not activated)
raw/~ $ lvcreate -l1 -ky vg
Logical volume "lvol1" created
raw/~ $ lvs -o lv_name,attr vg/lvol1
LV Attr
lvol1 -wi------
(non-thin snap LV + -ky + -K => skip flag set in MDA + LV activated)
raw/~ $ lvcreate -l1 -ky -K vg
Logical volume "lvol2" created
raw/~ $ lvs -o lv_name,attr vg/lvol2
LV Attr
lvol2 -wi-a----
(thin snap LV => skip flag set in MDA (default behaviour) + LV not activated)
raw/~ $ lvcreate -L100M -T vg/pool -V 1T -n thin_lv
Logical volume "thin_lv" created
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg
LV Attr
pool twi-a-tz-
thin_lv Vwi-a-tz-
thin_snap Vwi---tz-
(thin snap LV + -K => skip flag set in MDA (default behaviour) + LV activated)
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap -K
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg/thin_lv
LV Attr
thin_lv Vwi-a-tz-
(thins snap LV + -kn => no skip flag in MDA (default behaviour overridden) + LV activated)
[0] raw/~ # lvcreate -s vg/thin_lv -n thin_snap -kn
Logical volume "thin_snap" created
[0] raw/~ # lvs -o name,attr vg/thin_snap
LV Attr
thin_snap Vwi-a-tz-
...when creating config trees while calling config_def_create_tree fn
that constructs a tree out of config_settings.h definition
(CFG_DEF_TREE_NEW/MISSING/DEFAULT/PROFILABLE).
Till now, we needed the config tree merge only for merging
tag configs with lvm.conf. However, this type of merging
did a few extra exceptions:
- leaving out the tags section
- merging values in activation/volume_list
- merging values in devices/filter
- merging values in devices/types
Any other config values were replaced by new values.
However, we'd like to do a 'raw merge' as well, simply
bypassing the exceptions listed above. This will help
us to create a single tree representing the cascaded
configs like CONFIG_STRING -> CONFIG_PROFILE -> ...
The reason for this patch is that when trees are cascaded,
the first value found while traversing the cascade is used,
not making any exceptions like we do for tag configs.
When CFG_DEF_TREE_MISSING is created, it needs to know the status
of the check done on the tree used (the CFG_USED flag).
This bug was introduced with f1c292cc38
"make it possible to run several instances of configuration check at
once". This patch separated the CFG_USED and CFG_VALID flags in
a separate 'status' field in struct cft_check_handle.
However, when creating some trees, like CFG_DEF_TREE_MISSING,
we need this status to do a comparison with full config definition
to determine which items are missing and for which default values
were used. Otherwise, all items would be considered missing.
So, pass this status in a new field called 'check_status' in
struct config_def_tree_spec that defines how the (dumpconfig) tree
should be constructed (and this struct is passed to
config_def_create_tree fn then).
Start separating the validation from the action in the basic lvresize
code moved to the library.
Remove incorrect use of command line error codes from lvresize library
functions. Move errors.h to tools directory to reinforce this,
exporting public versions of the error codes in lvm2cmd.h for dmeventd
plugins to use.
Fix and improve handling on sigint.
Always check for signal presence *before* calling of command,
so it will not call the command when break was hit.
If the command has been finished succesfully there is
no problem to mark the command ok and not report interrupt at all.
Fix cuple related stack; reports and assignments.
The pv resize code required that a lvm_vg_write be done
to commit the change. When the method to add the ability
to list all PVs, including ones that are not assocated with
a VG we had no way for the user to make the change persistent.
Thus additional resize code was move and now liblvm calls into
a resize function that does indeed write the changes out, thus
not requiring the user to explicitly write out he changes.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Code move and changes to support calling code from
command line and from library interface.
V2 Change lock_vol call
Signed-off-by: Tony Asleson <tasleson@redhat.com>
As locks are held, you need to call the included function
to release the memory and locks when done transversing the
list of physical volumes.
V2: Rebase fix
V3: Prevent VGs from getting cached and then write protected.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Simplified version of lv resize.
v3: Rebase changes to make work. Needed to set sizeargs = 1
to indicate to resize that we are asking for a size based
resize.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Add thin and thin pool lv creation support to lvm library
This is Mohan's thinp patch, re-worked to include suggestions
from Zdenek and Mohan.
V2: Remove const lvm_lv_params_create_thin
Add const lvm_lv_params_skip_zero_get
V3: Changed get/set to use generic functions like current
property
V4: Corrected macro in properties.c
V5: Fixed a bug in liblvm/lvm_lv.c function lvm_lv_create.
incorrectly used pool instead of lv_name when doing the
find_lv_in_vg call.
Based on work done by M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
These settins are customizable by profiles:
allocation/thin_pool_zero
allocation/thin_pool_discards
allocation/thin_pool_chunk_size
activation/thin_pool_autoextend_threshold
activation/thin_pool_autoextend_percent
Besides the classical configuration checks (type checking and
checking whether the item is recognized by lvm tools) for profiles,
do an extra check whether the configuration setting is customizable
by a profile at all. Give a warning message if not.
Before, the status of the configuration check (config_def_check fn call)
was saved directly in global configuration definitinion array (as part
of the cfg_def_item_t/flags)
This patch introduces the "struct cft_check_handle" that defines
configuration check parameters as well as separate place to store
the status (status here means CFG_USED and CFG_VALID flags, formerly
saved in cfg_def_item_t/flags). This struct can hold config check
parameters as well as the status for each config tree separately,
thus making it possible to run several instances of config_def_check
without interference.
Just to make it more clear and also not to confuse
config_valid with check against config definition
(and its 'valid' flag within the config defintion tree).
If "vgcreate/lvcreate --profile <profile_name>" is used, the profile
name is automatically stored in metadata for making it possible to
load it automatically next time the VG/LV is used.
This is per VG/LV profile loading on demand. The profile itself is saved
in struct volume_group/logical_volume as "profile" field so we can
reference it whenever needed.
When placing the profile in a configuration cascade, this sequence is
used exactly:
CONFIG_STRING -> CONFIG_PROFILE -> CONFIG_FILE/MERGED_FILES
So if the profile is used, it overloads the lvm.conf (and any
existing tag configs). However, if "--config" is used to define
a custom configuration on command line, this overloads even the
profile config!
This patch adds --profile arg to lvm cmds and adds config/profile_dir
configuration setting to select the directory where profiles are stored
By default it's /etc/lvm/profile.
The profiles are added by using new "add_profile" fn and then loaded
using the "load_profile" fn. All profiles are stored in a cmd context
within the new "struct profile_params":
struct profile_params {
const char *dir;
struct profile *global_profile;
struct dm_list profiles_to_load;
struct dm_list profiles;
};
...where "dir" is the directory with profiles, "global_profile" is
the profile that is set globally via the --profile arg (IOW, not
set per VG/LV basis based on metadata record) and the "profiles"
is the list with loaded profiles.
Configuration profiles are selected configuration items that can
be loaded dynamically on demand and overlayed over existing
configuration on demand (either on cmd line by selecting the profile
name to be used globally or retrieved from metadata and used per
VG/LV basis only).
The default directory where profiles are stored is configurable
at compile time with --with-default-profile-subdir.
A helper type that helps with identification of the configuration source
which makes handling the configuration cascade a bit easier, mainly
removing and adding configuration trees to cascade dynamically.
Currently, the possible types are:
CONFIG_UNDEFINED - configuration is not defined yet (not initialized)
CONFIG_FILE - one file configuration
CONFIG_MERGED_FILES - configuration that is a result of merging more files into one
CONFIG_STRING - configuration string typed on cmd line directly
CONFIG_PROFILE - profile configuration (the new type of configuration, patches will follow...)
Also, generalize existing "remove_overridden_config_tree" to work with
configuration type identification in a cascade. Before, it was just
the CONFIG_STRING we used. Now, we need some more to add in a
cascade (like the CONFIG_PROFILE). So, we have:
struct dm_config_tree *remove_config_tree_by_source(struct cmd_context *cmd, config_source_t source);
config_source_t config_get_source_type(struct dm_config_tree *cft);
... for removing the tree by its source type from the cascade and
simply getting the source type.
In the last update not all code paths have set the archived flag.
If we run in test mode or without archiving enabled - set the bit
as well - so test whether archiving has been called succesfully
will be ok. (in relase fix).
Do not keep multiple archives for the executed command.
Reuse the ALLOCATABLE_PV from pv status for
ARCHIVED_VG vg status. Mark VG with the bit with the
first archivation.
...not the other way round as it was before. This way it makes
more sense as BA use is exceptional and it's useless to
contaminate the log with messages about BA not being found
in metadata.
Since reduce the message has informational character and doesn't lead
to exit of the command - reduce the log level to info print as we
use for other similar types.
Reindent next print message.
When vgname has not existed in metadata, it has crashed on double free
in format_instance destroy() - since VG was created, used FID and was
released - which also released FID, so further use was accessing bad
memory.
Fix it for this code path before release_vg() so FID will exists
when _vg_read_file_name() returns NULL.
Revert commit 37ffe6a. If static variables are to be used then we
will put them elsewhere and limit the optimization to reporting
code, rather that have it be used in the general case.
Reinstate the previous sort order for origin_size, so that LVs with
an empty origin_size continue to appear at the start of the list
not the end.
Ref. 9d445f371c
We use mpath filtering (enabled by devices/multipath_component_detection=1
lvm.conf setting) to avoid a situation in which we could end up with
duplicate PVs found. We need to filter out the mpath components and
use only the top-level multipath mapping instead for PV scans.
However, if the there are partitions on multipath components, we need
to filter out these partitions. This patch fixes it so those
partitions found on multipath components are filtered as well.
For example, let's consider following configuration:
The sda and sdb are mpath components, sda1 and sdb1 the partitions
on these components, mpath-test the mpath mapping and mpath-test1
the partition mapping - created automatically by kpartx right
after mpath-test creation. The PV resides on top.
(LVM PV)
|
mpath-test1
|
mpath-test
|
sda1 ---------- sdb1
\ | |/
sda sdb
E.g. for sda1 and sdb1, the code will detect this and it skips
the partition that belongs to the multipath component:
<snippet from the log>
#filters/filter-mpath.c:156 /dev/sda1: Device is a partition, using primary device /dev/sda for mpath component detection
130 #ioctl/libdm-iface.c:1724 dm status (253:2) OF[16384](*1)
131 #filters/filter-mpath.c:196 /dev/sda1: Skipping mpath component device
</snippet from the log>
Othewise, we'd see the same PV label on sda1/sdb1 and mpath-test1
at the same time ending up with "Duplicate PV found...".
The dev_get_primary_dev fn now returns:
0 if the dev is already a primary dev
1 if the dev is a partition, primary dev is returned in "result" (output arg)
-1 on error
This way, we can better differentiate between the error state
and the state in which the dev supplied is not a partition
in the caller (this was same return value before).
Also, if we already have information about the device type,
we can check its major number against the list of known device
types (cmd->dev_types) directly, so we don't need to go through
the sysfs - we only check the major:minor pair which is a bit
more straightforward and faster. If the dev_types does not have
any info about this device type, the code just fallbacks to
the original sysfs interface to get the partition info.
Changes:
- move device type registration out of "type filter" (filter.c)
to a separate and new dev-type.[ch] for common use throughout the code
- the structure for keeping the major numbers detected for available
device types and available partitioning available is stored in
"dev_types" structure now
- move common partitioning detection code to dev-type.[ch] as well
together with other device-related functions bound to dev_types
(see dev-type.h for the interface)
The dev-type interface contains all common functions used to detect
subsystems/device types, signature/superblock recognition code,
type-specific device properties and other common device properties
(bound to dev_types), including partitioning support.
- add dev_types instance to cmd context as cmd->dev_types for common use
- use cmd->dev_types throughout as a central point for providing
information about device types
Giving volume type information about being 'metadata' type of volume
has higher priority then i.e. 'mirror' or 'thin' flag - for those
type we have 'target attr' (7th. field).
The special suspend/resume code in lv_remove for LVM1 snapshots was interpsersed
with a vg_commit call. However, while with LVM1 metadata, vg_commit is
technically a no-op, the activation code relied on the ondisk and incore
metadata being the same, since on LVM1, a "commit" happens in vg_write
already. Since the "ondisk" metadata was previously not available with format1
(and incore was silently used instead, via lvmcache), the problem was masked.
This ties the two preceding changes together, actually using the "ondisk"
version of VG metadata instead of calling into lvmcache when activating
volumes. The cache hooks are still used as a fallback, because we don't have an
uncached scanning API yet.
Previously, we have relied on UUIDs alone, and on lvmcache to make getting a
"new copy" of VG metadata fast. If the code which triggers the activation has
the correct VG metadata at hand (the version which is currently on disk), it can
now hand it to the activation code directly.
This allows us to get the current on-disk version of the metadata whenever we
have the current in-flight version, without a recourse to scanning or lvmcache.
Last commit made dump filter only partially composable.
Add remaining functionality and also support composable wipe,
which is needed, when i.e. vgscan needs to remove cache.
(in release fix)
Add a generic dump operation to filters and make the composite filter call
through to its components. Previously, when global filter was set, the code
would treat the toplevel composite filter's private area as if it belonged a
persistent filter, trying to write nonsense into a non-sensical file.
Also deal with NULL cmd->filter gracefully.
This patch adds the ability to set the minimum and maximum I/O rate for
sync operations in RAID LVs. The options are available for 'lvcreate' and
'lvchange' and are as follows:
--minrecoveryrate <Rate> [bBsSkKmMgG]
--maxrecoveryrate <Rate> [bBsSkKmMgG]
The rate is specified in size/sec/device. If a suffix is not given,
kiB/sec/device is assumed. Setting the rate to 0 removes the preference.
There is no point in creation of 2chunks snapshot,
since the snapshot is invalidated immeditelly with the first write
as there is no free chunk for COW blocks
(2 chunks are used by the snap header and the 1st. metadata chunk).
Enhance error message about the lowest usable size.
Avoid hitting memory corruption (double free) in code path,
where PV FID has been already destroyed and the released pointer
was left in PV structure and could have been tried to be released
from there 2nd. time with final context destruction.