IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This patch allows pvmove to operate on RAID, mirror and thin LVs.
The key component is the ability to avoid moving a RAID or mirror
sub-LV onto a PV that already has another RAID sub-LV on it.
(e.g. Avoid placing both images of a RAID1 LV on the same PV.)
Top-level LVs are processed to determine which PVs to avoid for
the sake of redundancy, while bottom-level LVs are processed
to determine which segments/extents to move.
This approach does have some drawbacks. By eliminating whole PVs
from the allocation list, we might miss the opportunity to perform
pvmove in some senarios. For example, if we have 3 devices and
a linear uses half of the first, a RAID1 uses half of the first and
half of the second, and a linear uses half of the third (FIGURE 1);
we should be able to pvmove the first device (FIGURE 2).
FIGURE 1:
[ linear ] [ -RAID- ] [ linear ]
[ -RAID- ] [ ] [ ]
FIGURE 2:
[ moved ] [ -RAID- ] [ linear ]
[ moved ] [ linear ] [ -RAID- ]
However, the approach we are using would eliminate the second
device from consideration and would leave us with too little space
for allocation. In these situations, the user does have the ability
to specify LVs and move them one at a time.
Do not print success status for lvm2-activation-generator:
"LVM: Activation generator successfully completed."
"LVM: Logical Volume autoactivation enabled." (if use_lvmetad=1)
Though this information is quite useful during boot, it may
be confusing for users if it happens anytime later and it
actually happens if systemd reloads. This is usually on package
update to update the systemd state and load any new units that are
newly installed in the system. The systemd reload is global and
so any existing generators are rerun at that moment too.
This is a regression caused by commit 3bd9048854.
The error message added with that commit "mpath major %d is not dm major %d" is
superfluous.
When scanning for mpath components, we're looking for a parent device.
But this parent device is not necessarily an mpath device (so the dm device)
if it exists - it can be any other device layered on top (e.g. an MD RAID device).
The bug addressed by this patch manifested itself during testing
by showing a mirror that never became 'in-sync' after creation.
The bug is isolated to distributions that do not have support
for openAIS checkpointing (i.e. > RHEL6, > F16).
When a node joins a group that is managing a mirror log, the other
machines in the group send it a checkpoint representing the current
state of the bitmap. More than one machine can send a checkpoint,
but only the initial one should be imported. Once the bitmap state
has been imported from the initial checkpoint, operations (such
as resync, mark, and clear operations) can begin. When subsequent
checkpoints are allowed to be imported, it has the effect of erasing
all the log operations between the initial checkpoint and the ones
that follow.
When cmirrord was updated to handle the absence of openAIS
checkpointing (commit 62e38da133),
the new import_checkpoint() function failed to honor the 'no_read'
parameter. This parameter was designed to avoid reading all but
the initial checkpoint. Honoring this parameter has solved the
issue of corrupting bitmap data with secondary checkpoints.
If loop device is first configured on systems where /dev/loop-control
is used to dynamically create the loop device itself, there's an
ADD+CHANGE even generated. But next time the existing /dev/loop[0-9]*
is reused, there's only a CHANGE event since the device representing
it is already present in kernel (so no ADD event in this case).
We can't ignore this CHANGE event for loop devices! This is a regression
caused by 756bcabbfe. We already had
a similar problem with MD devices which was fixed by
2ac217d408 (but that one was
only an intra-release fix).
When autoactivating a VG, there could be an existing VG with exactly
the same PV UUIDs. The PVs could be reappeared after previous
loss/disconnect (for example disconnecting and reconnecting iscsi).
Since there's no "autodeactivation" yet, the mappings for the LVs
from the VG were left in the system even if the device was disconnected.
These mappings also hold the major:minor of the underlying device.
So if the device reappears, it is assigned a different major:minor
pair (...and kernel name). We need to cope with this during
autoactivation so any existing mappings are corrected for any changes.
The VG refresh does that (the vgchange --refresh functionality) -
call this before VG autoactivation.
(If the VG does not exist yet, the VG refresh is NOP)
Split out the partitioned device filter that needs to open the device
and move the multipath filter in front of it.
When a device is multipathed, sending I/O to the underlying paths may
cause problems, the most obvious being I/O errors visible to lvm if a
path is down.
Revert the incorrect <backtrace> messages added when a device doesn't
pass a filter.
Log each filter initialisation to show sequence.
Avoid duplicate 'Using $device' debug messages.
Recent version of util-linux/umount (v2.23+) provides
umount --all-targets that can unmount all the mount targets of
the same device (the bind mounts). Use this if available when
calling the umount blkdeactivate.
Otherwise, for older versions of util-linux, use findmnt
(that is also a part of the util-linux) to iterate over all
mount targets of the same device - this is the manual way.
The blkdeactivate now suppresses error messages from external
tools that are called. Instead, only a summary message "done"
or "skipped" is issued by blkdeactivate as any error in calling
the external tool (e.g. unmounting or deactivating a device) causes
the device to be skipped and the blkdeactivate continues with the
next device in the tree.
Add new -e/--errors switch to display any error messages from
external tools.
Also, suppress any output given by the external tools and add
new -v/--verbose switch to display it including the verbose
output of the tools called (this will enable error reporting
as well).
Also add blkdeactivate -vv for even more debug (the script's debug).
84 files changed, 1540 insertions(+), 442 deletions(-)
Mostly bug fixes this time.
Also note:
md raid replaces dm mirroring as the default implementation.
Can call out to thin_repair to fix thin metadata.
Improved clvmd error detection/debugging information.
According to bug 995193, if a volume group
1) contains a mirror
2) is clustered
3) 'locking_type' = 0 is used
then it is not possible to remove the 'c'luster flag from the VG. This
is due to the way _lv_is_active behaves.
We shouldn't allow the cluster flag to be flipped unless the mirrors in
the cluster are not active. This is because different kernel modules
are used depending on whether a mirror is cluster or not. When we
attempt to see if the mirror is active, we first check locally. If it
is not, then we attempt to check for remotely active instances if the VG
is clustered. Since the no_lock locking type is LCK_CLUSTERED, but does
not implement 'query_resource', remote_lock_held will always return an
error in this case. An error from remove_lock_held is treated as though
the lock _is_ held (i.e. the LV is active remotely). This blocks the
cluster flag from changing.
The solution is to implement 'query_resource' for the no_lock type. It
will report a message and return 1. This will allow _lv_is_active to
function properly. The LV would be considered not active remotely and
the VG can change its flag.
1) Since the min|maxrecoveryrate args are size_kb_ARGs and they
are recorded (and sent to the kernel) in terms of kB/sec/disk,
we must back out the factor multiple done by size_kb_arg. This
is already performed by 'lvcreate' for these arguments.
2) Allow all RAID types, not just RAID1, to change these values.
3) Add min|maxrecoveryrate_ARG to the list of 'update_partial_unsafe'
commands so that lvchange will not complain about needing at
least one of a certain set of arguments and failing.
4) Add tests that check that these values can be set via lvchange
and lvcreate and that 'lvs' reports back the proper results.
gcc -O2 v4.8 on 32 bit architecture is causing a bug in parameter
passing. It does not happen with -01 nor -O0.
The problematic part of the code was strlen use in config.c in
the config_def_check fn and the call for _config_def_check_tree in it:
<snip>
rplen = strlen(rp);
if (!_config_def_check_tree(handle, vp, vp + strlen(vp), rp, rp + rplen, CFG_PATH_MAX_LEN - rplen, cn, cmd->cft_def_hash)) ...
</snip>
If compiled with -O0 (correct):
Breakpoint 1, config_def_check (cmd=0x819b050, handle=0x81a04f8) at config/config.c:775
(gdb) p vp
$1 = 0x8189ee0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb)
_config_def_check_tree (handle=0x81a04f8, vp=0x8189ee0 <_cfg_path>
"config", pvp=0x8189ee6 <_cfg_path+6> "", rp=0xbfffe1e8 "config",
prp=0xbfffe1ee "", buf_size=58, root=0x81a2568, ht=0x81a65
48) at config/config.c:680
(gdb) p vp
$4 = 0x8189ee0 <_cfg_path> "config"
(gdb) p pvp
$5 = 0x8189ee6 <_cfg_path+6> ""
If compiled with -O2 (incorrect):
Breakpoint 1, config_def_check (cmd=cmd@entry=0x8183050, handle=0x81884f8) at config/config.c:775
(gdb) p vp
$1 = 0x8172fc0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb) p vp + strlen(vp)
$3 = 0x8172fc6 <_cfg_path+6> ""
(gdb)
_config_def_check_tree (handle=handle@entry=0x81884f8, pvp=0x8172fc7
<_cfg_path+7> "host_list", rp=rp@entry=0xbffff190 "config",
prp=prp@entry=0xbffff196 "", buf_size=buf_size@entry=58, ht=0x
818e548, root=0x818a568, vp=0x8172fc0 <_cfg_path> "config") at
config/config.c:674
(gdb) p pvp
$4 = 0x8172fc7 <_cfg_path+7> "host_list"
The difference is in passing the "pvp" arg for _config_def_check_tree.
While in the correct case, the value of _cfg_path+6 is passed
(the result of vp + strlen(vp) - see the snippet of the code above),
in the incorrect case, this value is increased by 1 to _cfg_path+7,
hence totally malforming the string that is being processed.
This ends up with incorrect validation check and incorrect warning
messages are issued like:
"Configuration setting "config/checks" has invalid type. Found integer, expected section."
To workaround this issue, remove the "static" qualifier from the
"static char _cfg_path[CFG_PATH_MAX_LEN]". This causes the optimalizer
to be less aggressive (also shuffling the arg list for
_config_def_check_tree call helps).
If there is no RAID support in the kernel but the default mirror
segtype is "raid1", converting legacy mirrors can be problematic.
For example, changing the log type or converting a mirror to a linear
LV does not require the RAID modules to be present. However, because
lp->segtype is set to be RAID1 by the configuration file, the command
fails.
We should only be setting lp->segtype when converting mirrors if it is
going to change (e.g. to linear or between mirror types).
When creating a new thin pool and there's no profile requested
via "lvcreate --profile ...", inherit any VG profile if it's attached.
Currently this applies to these settings:
allocation/thin_pool_chunk_size
allocation/thin_pool_discards
allocation/thin_pool_zero
When creating a timeout thread for snapshots, the thread is not
tracked and thus never joined. This means that the exit status
of the timeout thread is held indefinitely. Saves a bit of
memory to set PTHREAD_CREATE_DETACHED when creating this thread.
I've also added pthread_attr_init|destroy to setup the creation
pthread_attr_t.
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Initial basic support for repair.
It currently takes pool metadata spare volume, which
is used for recovery. New spare is created if the volume
is successfuly repaired.
After the operation the previous _tmeta volume is moved
into _tmeta%d volume and if everything is ok, this volume
could be removed.
New _tmeta needs to be pvmoved to proper place and also
converted to i.e. mirror if it should be mirrored.
Later version will try to automate some steps here.
Add new configure lvm.conf options for binaries thin_repair
and thin_dump.
Those are part of device-mapper-persistent-data package
and will be used for recovery of thin_pool.
The PREFERRED allocation mechanism requires the number of areas in the
previous LV segment to match the number in the new segment being
allocated. If they do not match, the code may crash.
E.g. https://bugzilla.redhat.com/989347
Introduce A_AREA_COUNT_MATCHES and when not set avoid referring
to the previous segment with the contiguous and cling policies.
When using a global_filter and if this filter is incorrectly
specified, we ended up with a segfault:
raw/~ $ pvs
Invalid filter pattern "r|/dev/sda".
Segmentation fault (core dumped)
In the example above a closing '|' character is missing at the end
of the regex. The segfault itself was caused by trying to destroy
the same filter twice in _init_filters fn within the error path
(the "bad" goto target):
bad:
if (f3)
f3->destroy(f3);
if (f4)
f4->destroy(f4);
Where f3 is the composite filter (sysfs + regex + type + md + mpath filter)
and f4 is the persistent filter which encompasses this composite filter
within persistent filter's 'real' field in 'struct pfilter'.
So in the end, we need to destroy the persistent filter only as
this will also destroy any 'real' filter attached to it.
363 files changed, 19863 insertions(+), 9055 deletions(-)
This is a very large release - so expect bugs!
Please treat this release like a release candidate.
Changes to the external interfaces since 2.02.98 are not yet frozen.
Updated releases will follow quickly (days not weeks) as any problems
are handled.
The --type mirror requires -m/--mirrrors:
lvconvert --type mirror vg/lvol0
--type mirror requires -m/--mirrors
Run `lvconvert --help' for more information.
The --type raid* is allowed (the checks already existed):
lvconvert --type raid10 vg/lvol0
Converting the segment type for vg/lvol0 from linear to raid10 is not yet supported.
The --type snapshot is a synonym to -s/--snapshot:
lvconvert -s vg/lvol0 vg/lvol1
Logical volume lvol1 converted to snapshot.
lvconvert --type snapshot vg/lvol0 vg/lvol1
Logical volume lvol1 converted to snapshot.
All the other segment types are not supported, e.g.:
lvconvert --type zero vg/lvol0
Conversion using --type zero is not supported.
Run `lvconvert --help' for more information.
The status printed for dm-raid targets on older kernels does not include
the syncaction field. This is handled by dev_manager_raid_status() just
fine by populating the raid status structure with NULL for that field.
However, lv_raid_sync_action() does not properly handle that field being
NULL. So, check for it and return 0 if it is NULL.
Pool creation involves clearing of metadata device
which triggers udev watch rule we cannot udev synchronize with
in current code.
This metadata devices needs to be activated localy,
so in cluster mode deactivation and reactivation
is always needed.
However for non-clustered mode we may reload table
via suspend/resume path which avoids collision with
udev watch rule which was occasionaly triggering
retry deactivation loop.
Code has been also split into 2 separate code paths
for thin pools and thin volumes which improved readability
of the code as well.
Deactivation has been moved out of extend_pool() and
decision is now in _lv_create_an_lv() which knows
the change mode.
The new lvm2-activation-net.service activates LVM volumes
after network-attached devices are set up (iSCSI and FCoE)
if lvmetad is disabled and hence the autoactivation is not
used.
Created dlid for test is not needed afterward, so lower a memory
usage of this call is repeatedly used for building some large tree.
TODO: create function to use given buffer on stack as much cheaper.
Code needs to check if the layer origin device is suspended,
It's valid to create thinvolume snapshot of thinvolume which is also
used as an old-style snapshot. In this case we need to check -real
is suspended.
When adding origin_only - add only layer thin volume.
(in case it's also old-snapshot add only -real device)
Remove backup() call from update_pool_lv() as it's been there
duplicated and preperly order backup() call after lvresize,
so there is just one such call.
If the thin pool is known to be active, messages can be passed
to the pool even when the created thin volume is not going to be
activated.
So we do not need to stack large list of message and validate
and catch creation errors earlier in this case.
Replace the test for valid activation combination with simpler list of
deactivation combinations.
Normally, the lvm dumpconfig processes only the configuration tree
that is at the top of the cascade. Considering the cascade is:
CONFIG_STRING -> CONFIG_PROFILE -> CONFIG_MERGED_FILES/CONFIG_FILE
...then:
(dumpconfig of lvm.conf only)
raw/~ $ lvm dumpconfig allocation
allocation {
maximise_cling=1
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=64
}
(dumpconfig of selected profile configuration only)
raw/~ $ lvm dumpconfig --profile test allocation
allocation {
thin_pool_chunk_size=8
thin_pool_discards="passdown"
thin_pool_zero=1
}
(dumpconfig of given --config configuration only)
raw/~ $ lvm dumpconfig --config 'allocation{thin_pool_chunk_size=16}' allocation
allocation {
thin_pool_chunk_size=16
}
The --mergedconfig option causes the configuration cascade to be
merged before processing it with dumpconfig:
(dumpconfig of merged selected profile and lvm.conf)
raw/~ $ lvm dumpconfig --profile test allocation --mergedconfig
allocation {
maximise_cling=1
thin_pool_zero=1
thin_pool_discards="passdown"
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=8
}
(dumpconfig merged given --config and selected profile and lvm.conf)
raw/~ $ lvm dumpconfig --profile test --config 'allocation{thin_pool_chunk_size=16}' allocation --mergedconfig
allocation {
maximise_cling=1
thin_pool_zero=1
thin_pool_discards="passdown"
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=16
}
Hence with the --mergedconfig, we are able to see the
configuration that is actually used when processing any
LVM command while using any combination of --config/--profile
options together with lvm.conf file.
Start separating the validation from the action in the basic lvresize
code moved to the library.
Remove incorrect use of command line error codes from lvresize library
functions. Move errors.h to tools directory to reinforce this,
exporting public versions of the error codes in lvm2cmd.h for dmeventd
plugins to use.
Document following items:
configuration cascade (man lvm.conf)
--profile ProfileName (man lvm)
--detachprofile (man vgchange/lvchange)
-o vg_profile/lv_profile (man vgs/lvs)
Also document --config a bit so we can see where it fits in the
configuration cascade - will be documented more in following commit...
Fix and improve handling on sigint.
Always check for signal presence *before* calling of command,
so it will not call the command when break was hit.
If the command has been finished succesfully there is
no problem to mark the command ok and not report interrupt at all.
Fix cuple related stack; reports and assignments.
Do not keep multiple archives for the executed command.
Reuse the ALLOCATABLE_PV from pv status for
ARCHIVED_VG vg status. Mark VG with the bit with the
first archivation.
If the user would upconvert a linear LV to a mirror without specifying
the segment type ("--type mirror" vs "--type raid1"), the "mirror"
segment type would be chosen without consulting the 'default_mirror_segtype'
setting in lvm.conf. This is now used as the basis for determining
which should be used if left unspecified.
Check for enough space in preallocated buffer.
Fixes problem, when lvm code started to suddenly allocate
too big memory chunks.
TODO: lvmetad protocol should announce needed size ahead,
so if metadata have 1MB we are not reallocating memory...
When vgname has not existed in metadata, it has crashed on double free
in format_instance destroy() - since VG was created, used FID and was
released - which also released FID, so further use was accessing bad
memory.
Fix it for this code path before release_vg() so FID will exists
when _vg_read_file_name() returns NULL.
Support vgsplit for VGs with thin pools and thin volumes.
In case the thin data and thin metadata volumes are moved to a new VG,
move there also all related thin volumes and check that external origins
are also present in this new VG.
We use mpath filtering (enabled by devices/multipath_component_detection=1
lvm.conf setting) to avoid a situation in which we could end up with
duplicate PVs found. We need to filter out the mpath components and
use only the top-level multipath mapping instead for PV scans.
However, if the there are partitions on multipath components, we need
to filter out these partitions. This patch fixes it so those
partitions found on multipath components are filtered as well.
For example, let's consider following configuration:
The sda and sdb are mpath components, sda1 and sdb1 the partitions
on these components, mpath-test the mpath mapping and mpath-test1
the partition mapping - created automatically by kpartx right
after mpath-test creation. The PV resides on top.
(LVM PV)
|
mpath-test1
|
mpath-test
|
sda1 ---------- sdb1
\ | |/
sda sdb
E.g. for sda1 and sdb1, the code will detect this and it skips
the partition that belongs to the multipath component:
<snippet from the log>
#filters/filter-mpath.c:156 /dev/sda1: Device is a partition, using primary device /dev/sda for mpath component detection
130 #ioctl/libdm-iface.c:1724 dm status (253:2) OF[16384](*1)
131 #filters/filter-mpath.c:196 /dev/sda1: Skipping mpath component device
</snippet from the log>
Othewise, we'd see the same PV label on sda1/sdb1 and mpath-test1
at the same time ending up with "Duplicate PV found...".
Add support for lvresize of thin pool metadata device.
lvresize --poolmetadatasize +20 vgname/thinpool_lv
or
lvresize -L +20 vgname/thinpool_lv_tmeta
Where the second one allows all the args for resize (striping...)
and the first option resizes accoding to the last metadata lv segment.
Giving volume type information about being 'metadata' type of volume
has higher priority then i.e. 'mirror' or 'thin' flag - for those
type we have 'target attr' (7th. field).
Last commit made dump filter only partially composable.
Add remaining functionality and also support composable wipe,
which is needed, when i.e. vgscan needs to remove cache.
(in release fix)
Add a generic dump operation to filters and make the composite filter call
through to its components. Previously, when global filter was set, the code
would treat the toplevel composite filter's private area as if it belonged a
persistent filter, trying to write nonsense into a non-sensical file.
Also deal with NULL cmd->filter gracefully.
The global filter in system's lvm.conf may conflict with the custom filter we
set up in vgimportclone (they can easily fail to intersect). Since we explicitly
avoid talking to lvmetad in vgimportclone, it is safe and reasonable to do so.
There is no point in creation of 2chunks snapshot,
since the snapshot is invalidated immeditelly with the first write
as there is no free chunk for COW blocks
(2 chunks are used by the snap header and the 1st. metadata chunk).
Enhance error message about the lowest usable size.
Instead of seeing wierd overflows inside the lvm code,
giving false error messages, kill the user experiment in the begining.
Who needs to use more then 16EiB with lvm2 and 64bit anyway...
Avoid hitting memory corruption (double free) in code path,
where PV FID has been already destroyed and the released pointer
was left in PV structure and could have been tried to be released
from there 2nd. time with final context destruction.
Check for mounted fs also for vgchange command, not just lvchange.
NOTE: Code is using lv_info() just like lvs_in_vg_opened().
It should be probably converted into lv_is_active_locally().
There are places where 'lv_is_active' was being used where it was
more correct to use 'lv_is_active_locally'. For example, when checking
for the existance of a kernel instance before asking for its status.
Most of the time these would work correctly. (RAID is only allowed on
non-clustered VGs at the moment, which means that 'lv_is_active' and
'lv_is_active_locally' would give the same result.) However, it is
more correct to use the proper variant and it helps with future
scenarios where targets might be allowed exclusively (or clustered) in
a cluster VG.
Accept --yes on all commands, even ones that don't today have prompts,
so that test scripts that don't care about interactive prompts no
longer need to deal with them.
But continue to mention --yes only in the command prototypes that
actually use it.
This fixes a long standing regression since LVM2 2.02.74 (commit 4efb1d9c,
"Update heuristic used for default and detected data alignment.")
The default PE alignment could be used (via MAX()) even if it was
determined that the device's MD stripe width, or minimal_io_size or
optimal_io_size were not factors of the default PE alignment (either 64K
or the newer default of 1MB, etc). This bug would manifest if the
default PE alignment was larger than the overriding hint that the
device provided (e.g. default of 1MB vs optimal_io_size of 768K).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the dm_realloc would fail, the already allocate _maps_buffer
memory would have been lost (overwritten with NULL).
Fix this by using temporary line buffer.
Also add a minor cleanup to set end of buffer to '\0',
only when we really know the file size fits the preallocated buffer.
Setting the cmd->default_settings.udev_fallback also requires DM
driver version check. However, this caused useless mapper/control
access with ioctl if not needed actually. For example if we're not
using activation code, we don't need to know the udev_fallback as
there's no node and symlink processing.
For example, this premature mapper/control access caused problems
when using lvm2app even when no activation happens - there are
situations in which we don't need to use mapper/control, but still
need some of the lvm2app functionality. This is also the case for
lvm2-activation systemd generator which just needs to look at the
lvm2 configuration, but it shouldn't touch mapper/control.
Add new lvs segment field 'Monitor' showing 3 states:
"monitored" - LV is monitored by dmeventd.
"not monitored" - LV is currently not being monitored by dmeventd
"" (empty) - LV does not support monitoring, or dmeventd support
is not compiled in.
Support for exclusive activation of snapshots revealed some problems.
When snapshot is created, COW LV is activated first (for clearing) and
then it's transformed into snapshot's COW LV, but it has left the lock
for such LV active in cluster and this lock could not have been removed
from dlm, unless snapshot has been removed within same dlm session.
If the user tried to remove snapshot after rebooting node, the lock was
missing, and COW LV could not have been detached.
Patch modifes the approach in this way:
Always deactivate COW LV for clustered vg after clearing (so it's
activated again via imlicit snapshot activation rule when snapshot is activated).
When snapshot is removed, activate COW LV as independend LV, so the lock
will exist for such LV, but only when the snapshot is active.
Also add test case for testing snapshot removal after cluster reboot.
Patch fixes hidden problem with lvm metadata caching.
When the pretest was made, only the commited data have been cached back
since the call lv_info_by_lvid() triggers mda read operation.
However call of lv_suspend_if_active() also reads precommited metadata.
The problem is visible in this sequence of calls:
vg_write(), suspend_lv(), vg_commit(), resume_lv()
which may end with leaving outdated mda in lvm cache, since vg_write()
drops cached metadata and vg_commit() only transforms precommited
to commited metadata, but in the case of pretesting we have
no precommited mda available so the cache will continue to use
old metadata. This happens, when suspend LV is inactive.
Merging multiple config files together needs to know newest (highest)
timestamp of merged files. Persistent cache file is being used
only in case, the config file is older then .cache file.
Assign fid as the last step before returning VG.
Make the format reader for 'lvm1' and 'pool' equal to 'lvm2' format reader.
It has caused memory corruption to lvmetad as it later calls
destroy_instance() to allocated fid. This patch should fix problems
with crashing test lvmetad-lvm1.sh.
'lvchange' is used to alter a RAID 1 logical volume's write-mostly and
write-behind characteristics. The '--writemostly' parameter takes a
PV as an argument with an optional trailing character to specify whether
to set ('y'), unset ('n'), or toggle ('t') the value. If no trailing
character is given, it will set the flag.
Synopsis:
lvchange [--writemostly <PV>:{t|y|n}] [--writebehind <count>] vg/lv
Example:
lvchange --writemostly /dev/sdb1:y --writebehind 512 vg/raid1_lv
The last character in the 'lv_attr' field is used to show whether a device
has the WriteMostly flag set. It is signified with a 'w'. If the device
has failed, the 'p'artial flag has priority.
Example ("nosync" raid1 with mismatch_cnt and writemostly):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg Rwi---r-m 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-w 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-- 1 linear 4.00m
Example (raid1 with mismatch_cnt, writemostly - but failed drive):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg rwi---r-p 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-p 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-p 1 linear 4.00m
A new reportable field has been added for writebehind as well. If
write-behind has not been set or the LV is not RAID1, the field will
be blank.
Example (writebehind is set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r-- 512
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
Example (writebehind is not set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r--
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
Move common code for changing activation state from
vgchange and lvchange to one function.
Fix the order of checks - so we always implicitelly
activate snapshots and thin volumes in exclusive mode,
and we do not allow local deactivation for them.
New options to 'lvchange' allow users to scrub their RAID LVs.
Synopsis:
lvchange --syncaction {check|repair} vg/raid_lv
RAID scrubbing is the process of reading all the data and parity blocks in
an array and checking to see whether they are coherent. 'lvchange' can
now initaite the two scrubbing operations: "check" and "repair". "check"
will go over the array and recored the number of discrepancies but not
repair them. "repair" will correct the discrepancies as it finds them.
'lvchange --syncaction repair vg/raid_lv' is not to be confused with
'lvconvert --repair vg/raid_lv'. The former initiates a background
synchronization operation on the array, while the latter is designed to
repair/replace failed devices in a mirror or RAID logical volume.
Additional reporting has been added for 'lvs' to support the new
operations. Two new printable fields (which are not printed by
default) have been added: "syncaction" and "mismatches". These
can be accessed using the '-o' option to 'lvs', like:
lvs -o +syncaction,mismatches vg/lv
"syncaction" will print the current synchronization operation that the
RAID volume is performing. It can be one of the following:
- idle: All sync operations complete (doing nothing)
- resync: Initializing an array or recovering after a machine failure
- recover: Replacing a device in the array
- check: Looking for array inconsistencies
- repair: Looking for and repairing inconsistencies
The "mismatches" field with print the number of descrepancies found during
a check or repair operation.
The 'Cpy%Sync' field already available to 'lvs' will print the progress
of any of the above syncactions, including check and repair.
Finally, the lv_attr field has changed to accomadate the scrubbing operations
as well. The role of the 'p'artial character in the lv_attr report field
as expanded. "Partial" is really an indicator for the health of a
logical volume and it makes sense to extend this include other health
indicators as well, specifically:
'm'ismatches: Indicates that there are discrepancies in a RAID
LV. This character is shown after a scrubbing
operation has detected that portions of the RAID
are not coherent.
'r'efresh : Indicates that a device in a RAID array has suffered
a failure and the kernel regards it as failed -
even though LVM can read the device label and
considers the device to be ok. The LV should be
'r'efreshed to notify the kernel that the device is
now available, or the device should be 'r'eplaced
if it is suspected of failing.
...to not pollute the common and format-independent code in the
abstraction layer above.
The format1 pv_write has common code for writing metadata and
PV header by calling the "write_disks" fn and when rewriting
the header itself only (e.g. just for the purpose of changing
the PV UUID) during the pvchange operation, we had to tweak
this functionality for the format1 case and we had to assign
the PV the orphan state temporarily.
This patch removes the need for this format1 tweak and it calls
the write_disks with appropriate flag indicating whether this is
a PV write call or a VG write call, allowing for metatada update
for the latter one.
Also, a side effect of the former tweak was that it effectively
invalidated the cache (even for the non-format1 PVs) as we
assigned it the orphan state temporarily just for the format1
PV write to pass.
Also, that tweak made it difficult to directly detect whether
a PV was part of a VG or not because the state was incorrect.
Also, it's not necessary to backup and restore some PV fields
when doing a PV write:
orig_pe_size = pv_pe_size(pv);
orig_pe_start = pv_pe_start(pv);
orig_pe_count = pv_pe_count(pv);
...
pv_write(pv)
...
pv->pe_size = orig_pe_size;
pv->pe_start = orig_pe_start;
pv->pe_count = orig_pe_count;
...this is already done by the layer below itself (the _format1_pv_write fn).
So let's have this cleaned up so we don't need to be bothered
about any 'format1 special case for pv_write' anymore.
Add basic support for converting LV into an external origin volume.
Syntax:
lvconvert --thinpool vg/pool --originname renamed_origin -T origin
It will convert volume 'origin' into a thin volume, which will
use 'renamed_origin' as an external read-only origin.
All read/write into origin will go via 'pool'.
renamed_origin volume is read-only volume, that could be activated
only in read-only mode, and cannot be modified.
Reorder activation code to look similar for preload tree and
activation tree.
Its also give much better suppport for device stacking,
since now we also support activation of snapshot which might
be then used for other devices.
A new function (dm_tree_node_force_identical_table_reload) was added to
avoid the suppression of identical table reloads. This allows RAID LVs
to reload the on-disk superblock information that contains which devices
have failed and the bitmaps. If the failed device has returned, this has
the effect of restoring the device and initiating recovery. Without this
patch, the user had to completely deactivate their RAID LV and re-activate
it in order to restore the failed device. Now they simply need to
suspend and resume (which is done by 'lvchange --refresh').
The identical table suppression is only avoided if the LV is not PARTAIL
(i.e. all of it's devices can be seen and read by LVM) and the kernel
status of the array contains failed devices. In other words, the function
will only be called in the case where we may have success in restoring
a failed device in the array.
When there are missing PVs in a volume group, most operations that alter
the LVM metadata are disallowed. It turns out that 'vgimport' is one of
those disallowed operations. This is bad because it creates a circular
dependency. 'vgimport' will complain that the VG is inconsistent and that
'vgreduce --removemissing' must be run. However, 'vgreduce' cannot be run
because it has not been imported. Therefore, 'vgimport' must be one of
the operations allowed to change the metadata when PVs are missing. The
'--force' option is the way to make 'vgimport' happen in spite of the
missing PVs.
If zero metadata copies are used, there's no further recalculation of
PV alignment that happens when adding metadata areas to the PV and
which actually calculates the alignment correctly as a matter of fact.
So fix this for "PV without MDA" case as well.
Before this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 8.00m
After this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
Also, remove a superfluous condition "pv->pe_start < pv->pe_align" in:
if (pe_start == PV_PE_START_CALC && pv->pe_start < pv->pe_align)
pv->pe_start = pv->pe_align ...
This part of the condition is not reachable as with the PV_PE_START_CALC,
we always have pv->pe_start set to 0 from the PV struct initialisation
(...the pv->pe_start value is just being calculated).
When a device fails, we may wish to replace those segments with an
error segment. (Like when a 'vgreduce --removemissing' removes a
failed device that happens to be a RAID image/meta.) We are then left
with images that we will eventually want to remove or replace.
This patch allows us to pull out these virtual "error" sub-LVs. This
allows a user to 'lvconvert -m -1 vg/lv' to extract the bad sub-LVs.
Sub-LVs with error segments are considered for extraction before other
possible devices so that good devices are not accidentally removed.
This patch also adds the ability to replace RAID images that contain error
segments. The user will still be unable to run 'lvconvert --replace'
because there is no way to address the 'error' segment (i.e. no PV
that it is associated with). However, 'lvconvert --repair' can be
used to replace the image's error segment with a new PV. This is also
the most appropriate way to do it, since the LV will continue to be
reported as 'partial'.