IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Use dm-writecache to attach a cache volume (CV) to an LV.
The LV type becomes "writecache" (--type writecache can
be optionally specified when attaching.)
lvconvert --cachevol-attach CV LV
lvconvert --cachevol-detach CV LV
The cachevol (CV) is a special LV that was created on
cache devices, e.g.
lvcreate --cachevol -L <size> -n <name> VG
Cache devices are special PVs that are added to the
VG to use for cachevols, e.g.
vgextend --cachedev VG PV
Example:
Add persistent memory to a VG as a cache device:
vgextend --cachedev vg /dev/pmem0
Create a cache volume from pmem0:
lvcreate --cachevol -L1G -n cvol0 vg
Create a second cache volume from pmem0:
lvcreate --cachevol -L1G -n cvol1 vg
Attach the cache volumes to LVs:
lvconvert --cachevol-attach cvol0 vg/lv0
lvconvert --cachevol-attach cvol1 vg/lv1
A cache volume (CV) is a special LV that can be
attached to a standard LV to do caching.
A cache volume is allocated from cache devices (CDs)
which are special PVs in the VG that are used for
caching.
To create a cache volume, add the --cachevol option
to the lvcreate command. lvcreate will allocate
extents for the CV from cache devs that have been
added to the VG.
lvcreate --cachevol -n <name> -L <size> VG [CD...]
A cache device is a special PV that is added to a VG
to be used for caching, and not for allocating standard LVs.
A cache device (CD) is used to create a cache volume.
A cache volume (CV) is a special LV that can be attached
to a standard LV to do caching.
Terminology:
PV used for caching = cache device = cachedev = CD
LV used for caching = cache volume = cachevol = CV
PV metadata for a cachedev includes the CACHEDEV flag.
LV metadata for a cachevol includes the CACHEVOL flag.
A cachedev is added to a VG with:
vgextend --cachedev VG PV
The --cachedev option tells lvm that the PV should
be added to the VG as a cache device, not a standard PV.
(cache volumes are added by the following commit)
Conversions of LVs under snapshot to thinpool or cachepool
correctly fail but leave them inactive and provide cryptic
error messages like 'Internal error: #LVs (10) != #visible
LVs (2) + #snapshots (1) + #internal LVs (5) in VG VG'.
Reject and provide better error message.
Resolves: rhbz1514146
The 'lvconvert LV' command def has caused multiple problems
for command matching because it matches the required options
of any lvconvert command. Any lvconvert with incorrect options
ends up matching 'lvconvert LV', which then produces an error
about incorrect options being used for 'lvconvert LV'. This
prevents suggestions from nearest-command partial command matches.
Add a special case for 'lvconvert LV' so that it won't be used
as a partial match for a command that has options specified.
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
When lvmetad is not used, use temporary files to record
which PVs have appeared. Use these temp files to determine
when a VG is complete, to trigger autoactivation.
This change allows us to remove lvmetad while keeping the
same autoactivation behavior that lvmetad provides.
The temp files are created in /run/lvm/pvs_online/ and are
named for the PVID of the PV. The files contain the
major:minor of the device the PV was read from.
e.g. if VG foo has dev1 and dev2, then:
. pvscan --cache -aay dev1
reads vg metadata from dev1
creates /run/lvm/pvs_online/<pvid-of-dev1>
checks if all vg->pvs are online: no
. pvscan --cache -aay dev2
reads vg metadata from dev2
creates /run/lvm/pvs_online/<pvid-of-dev2>
checks if all vg->pvs are online: yes
autoactivates vg
A 'pvscan --cache dev' (without -aay) still records that
dev is online.
A 'pvscan --cache --major X --minor Y' after a device is
gone will remove the temp file for it.
A 'pvscan --cache [-aay]' (no devs) resets the state of
temp files by removing them all, then scanning all devs
and creating temp files for PVs that are found.
If no online files exist, the first pvscan --cache scans
all devs and creates temp files for any PVs found.
The scope of the temp files is only pvscan, and they are only
used for pvscan-based autoactivation. No other commands are
concerned with or aware of these temp files. When lvm creates
or removes PVs, no attempt is made to update the temp files.
Support vgchange usage with VDO segtype.
Also changing extent size need small update for vdo virtual extent.
TODO: API needs enhancements so it's not about adding ifs() everywhere.
When user create vdo-pool - use different automatic name.
So unlike with traditional LVs using lvol0, lvol1
use vpool0, vpool1...
TODO: apply similar for thin-pool & cache-pool...
Checks whether VDO support is enabled.
Detects presence of 'vdoformat' tool which is required for to format VDO pool.
ATM build of VDO is NOT automatically enabled (None is default).
To enable build of LVM with VDO support use:
configure --with-vdo=internal
TODO: Maybe future version may switch to link some small VDO library for formating
(would require linking and package dependency).
To support autoloading of VDO dm target driver loading of 'kvdo'
kernel module is needed - ATM it's not using 'dm-vdo' name.
So to support this strange name - add temporarily solution to
autoload kvdo kernel module in this case.
Introduce VDO plugin for monitoring VDO devices.
This plugin can be used also by other users, as plugin checks
for UUID prefix 'LVM-' and run lvm actions only on those
devices.
Non LVM- device are only monitored and log warnings
when usage threshold reaches 80%.
Update makefile to link with more libs since now whole liblvm-internal.a
is linked-in and this library has futher dependencies.
Avoid including deps for run-unit-test.
Drop linking separate status.c as it's already linked via internal libs.
Check allocation of thin-pool works on 2PVs, when one is so full,
that even metadata do not fit there (as they need at least 2M,
while 99% of 63MB fills >62MB)
When lvm2 command is executed in test mode, discard ioctl is skipped.
This may cause even data-loose in case, issuing discard for released
areas was enabled and user 'tested' lvreduce.
When allocating thin-pool with more then 1 device - try to
allocate 'metadataLV' with reuse of log-type allocation for mirror LV.
It should be naturally place on other device then 'dataLV'.
However due to somewhat hard to follow allocation logic code,
it's been rejected allocation in cases where there was not
enough space for data or metadata on single PV, thus to successed,
usage of segments was mandatory.
While user may use:
allocation/thin_pool_metadata_require_separate_pvs=1
to enforce separe meta and data LV - on default settings, this is not
enable thus segment allocation is meant to work.
NOTE:
As already said - the original intention of this whole 'if()' is unclear,
so try to split this test into multiple more simple tests that are more readable.
TODO: more validation.
When node loading fails, there is not much the caller can do,
since there is 'unknown' set of devices preloaded.
Only suspend during preload knows future precommitted 'metadata',
so it's non-trivial to drop 'preloaded' entries with any later call.
However dm tree tracks newly loaded entries - so in this case it
may simplify the recovery path by dropping preloaded entries so
they are not leaked in the DM table.
Allow creation of any virtual segment type with just --virtualsize
specified without any real extent size give.
TODO: likely --type error,zero might be later enhanced to use -V
(along with -L) - but since those targets do not allocate real
space, supporting -V makes sense with them.
Amound of linked libraries grows.
Most of them we don't need to lock in, since we are not using
them in locked section, so skip locking them in memory.
It's important to lock memory beforo running SUSPEND ioctl - but whole
lvm preload runs in memory unlocked environment - as in this phase
memory allocation is allowed and is meant to happen.
Once all targets are preload and ready (confirmed from all targets)
we start suspending tree - and here the memory allocation (or i.e.
opening files) is no longer allowed - as it may cause kernel deadlock.
Commit 3f35146 added a check on the value returned by the
_display_info_cols() function:
1024 if (!_switches[COLS_ARG])
1025 _display_info_long(dmt, &info);
1026 else
1027 r = _display_info_cols(dmt, &info);
1028
1029 return r;
This exposes a bug in the dmstats code in _display_info_cols:
the fact that a device has no regions is explicitly not an error
(and is documented as such in the code), but since the return
code is not changed before leaving the function it is now treated
as an error leading to:
# dmstats list
Command failed.
When no regions exist.
Set the return code to the correct value before returning.
For better code reuse split _node_send_messages into commont
messaging part and separate _thin_pool_node_send_messages.
Patch makes it possible to better reuse common code for messaging
other targets.
udev creates a train wreck of events if we open devices
with RDWR. Until we can fix/disable/scrap udev, work around
this by opening RDONLY and then closing/reopening RDWR when
a write is needed. This invalidates the bcache blocks for
the device before writing so it can trigger unnecessary
rereading.
It's no longer needed. Clustered VGs are now handled in
the same way as foreign VGs, and as shared VGs that
can't be accessed:
- A command processing all VGs sees a clustered VG,
prints a message ("Skipping clustered VG foo."),
skips it, and does not fail.
- A command where the clustered VG is explicitly
named on the command line, prints a message and fails.
"Cannot access clustered VG foo, see lvmlockd(8)."
The option is listed in the set of ignored options for
the commands that previously accepted it. (Removing it
entirely would cause commands/scripts to fail if they
set it.)
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
Since we are using "DefaultDependencies=no" we do not get automatic STOP
job on socket connection - so automatically refuse connection on
shutdown by adding this Conflict definition to socket Unit.
Shuffle code for better readability as set of conditions was
hard to follow.
Make it obvious the refresh & activate path is handling
monitoring and polling on its own.
So the only --monitor and --poll option needs explicit care.
Option --monitor without option --poll will now as a result
of this patch NOT start polling.
So command: vgchange --monitor n is no longer a polling starter.
Restoring polling for activated volumes lost with my recent commit:
75fed05d3e and move start of polling
directly into _activate_lvs_in_vg() - as there we know exactly
if there was some volume even activated.
Also make it sharing same code for pvscan -aay.
With the move to top-level makefile - there are some issues
with subdir recursive makefile.
Make the building more tolerant for now until fully resolved.
The previous method for forcibly changing a clustered VG
to a local VG involved using -cn and locking_type 0.
Since those options are deprecated, replace it with
the same command used for other forced lock type changes:
vgchange --locktype none --lockopt force.
Shared VGs will generally be started and activated by
the resource agent. Without the agent, this script doesn't
have a good way to know which LVs to activate.
vgreduce, vgremove and vgcfgrestore were acquiring
the orphan lock in the midst of command processing
instead of at the start of the command. (The orphan
lock moved to being acquired at the start of the
command back when pvcreate/vgcreate/vgextend were
reworked based on pvcreate_each_device.)
vgsplit also needed a small update to avoid reacquiring
a VG lock that it already held (for the new VG name).
A few places were calling a function to check if a
VG lock was held. The only place it was actually
needed is for pvcreate which wants to do its own
locking (and scanning) around process_each_pv.
The locking/scanning exceptions for pvcreate in
process_each_pv/vg_read can be enabled by just passing
a couple of flags instead of checking if the VG is
already locked. This also means that these special
cases won't be enabled unknowingly in other places
where they shouldn't be used.
When pvmoving LV - the target for LV is a mirror so the validation
that checked the type is matching was incorrect.
While we need a more generic enhancment of LVS output for pvmoved LVs,
for now at least stop showing internal errors and 'X' symbols in attrs.
The last commit related to this was incomplete:
"Implement lock-override options without locking type"
This is further reworking and reduction of the locking.[ch]
layer which handled all clustering, but is now only used
for file locking. The "locking types" that this layer
implemented were removed previously, leaving only the
standard file locking. (Some cluster-related artifacts
remain to be cleared out later.)
Command options to override or modify locking behavior
are reimplemented here without using the locking types.
Also, deprecated locking_type values are recognized,
and implemented as if one of the equivalent override
options was set.
Options that override file locking are:
. --nolocking disables all file locking.
. --readonly grants read lock requests without actually
taking a file lock, and refuses write lock requests.
. --ignorelockingfailure tries to set up file locks and
uses them normally if possible. When not possible, it
behaves like --readonly, but allows activation.
. --sysinit is the same as ignorelockingfailure.
. global/metadata_read_only acquires actual read file
locks, and refuses write lock requests.
(Some of these options could probably be deprecated
because they were added as workarounds to various
locking_type behaviors that are now deprecated.)
The locking_type setting now has one valid value: 1 which
refers to standard file locking. Configs that contain
deprecated values are recognized and still work in
largely the same way:
. 0 disabled all locking, now implemented like --nolocking
is set. Allow the nolocking option in all commands.
. 1 is the normal file locking setting and is unchanged.
. 2 was for external locking which was not used, and
reverts to normal file locking.
. 3 was for cluster/clvm. This reverts to normal file
locking, and prints messages about lvmlockd.
. 4 was equivalent to readonly, now implemented like
--readonly is set.
. 5 disabled all locking, now implemented like
--nolocking is set.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
Add tests for linear <-> striped|raid* conversions.
Add region_size config to reshape tests to avoid test
failures in case of it being defined unexpectedly in lvm.conf.
Related: rhbz1439925
Related: rhbz1447809
"lvconvert --type {linear|striped|raid*} ..." on a striped/linear
LV provides convenience interim type to convert to the requested
final layout similar to the given raid* <-> raid* conveninece types.
Whilst on it, add missing raid5_n convenince type from raid5* to raid10.
Resolves: rhbz1439925
Resolves: rhbz1447809
Resolves: rhbz1573255
Fixes breakage from the recent libdm split. Though these didn't
ever appear to be linked (could they have piggy backed from libdevmapper.so
being linked to them?).
In this command, lvcreate creates a new LV and then combines
it with an existing cache pool, producing a cache LV. This
command was previously not allowed in in a shared VG.
When the lvmlockd lock is shared, upgrade it to ex
when repair (writing) is needed during vg_read.
Pass the lockd state through additional read-related
functions so the instances of repair scattered through
vg_read can be handled.
(Temporary solution until the ad hoc repairs can be
pulled out of vg_read into a top level, centralized
repair function.)
It's not an error if a command requests the global lock
when it has already acquired it. It shouldn't happen,
but there could be cases we've not found.
We have been warning about duplicate devices (and disabling lvmetad)
immediately when the dup was detected (during label_scan). Move the
warnings (and the disabling) to happen later, after label_scan is
finished.
This lets us avoid an unwanted warning message about duplicates
in the special case were md components are eliminated during the
duplicate device resolution.
The report uses process_each_vg() which populates lvmcache
based on a VG list from lvmetad. If there are no VGs,
but only orphan PVs, the orphans are not shown. Add an
explicit call to populate lvmcache with PV info from lvmetad.
This minor patch fixes grammar in a few messages which get
printed to users. It also fixes the same grammar mistake in
several comments.
Signed-off-by: Rick Elrod <relrod@redhat.com>
--
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
No point to start building lvm without this header file.
Although there could be 'some point' in supporting standalone build
of 'just' libdm where the libaio might be avoided.
TODO: think about configure option for building libdm only.
Unfortunatelly on kernels <4.16 lvm2 can't user brd ramdisks
for backend device as number of test is failing with this kernel
message:
device-mapper: ioctl: can't change device type after initial table load.
caused by DAX request-based handling, and lvm2 tries to replace device
with backend 'error' bio-based device and such table reload is being
rejected.
So ATM keep ramdisk only on most recent kernel to experiment a bit,
for older machines just stay safe and keep old slower loop backend.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
While newer system can detect need for 4K mkfs, on older test machines
running test suite over 4k is reporting problems.
Some more generic solution is needed thought.
When test happens to run in tmpfs, it cannot use O_DIRECT (unsupported
with tmpfs).
CHECKME: unsure if detection of tmpfs is 'valid' but kind of works and
is very simple.
As Makefiles already do use target with name 'device-mapper'
rename this new device-mapper dir to non-conflicting name.
We also seem to already use '_' in other dir names.
Also rename device_mapper/Makefile to source for generating Makefile.in
so we can use it for build in other source dirs properly.
It's very hard to use some 'non-recurive' Makefiles with
rest of system running 'recursively'.
So ATM drop inclusion of subdir makefile and add support
for 2 new top-level targets:
unit-test (builds test/unit dir)
run-unit-test (build & run test/unit/unit-test run)
Just like 52656c89fd
when now cache is compiled in 'unditionally'.
This patch is actually enforce by changes in
commit: 2bc896f2a3
where CACHE value is not set anymore.
If the test does not need root, it can use 'SKIP_ROOT_DM_CHECK'.
For such test no actions needed root to initilize DM devices and
nodes will be take and test can check i.e. functional unit tests.
Usage of dm_delay looks to be slowing not just 'delayed' portion
of device, but due to the fact it's also slows down ANY flush
operation on such device it's overal speed impact is huge.
In some case we can however user other methods to slowdown disk writes,
in case of old dm 'mirror' target we can throttle I/O of mirror
synchronisation giving the next commands enough time to test couple
race conditions.
Usage:
throttle_dm_mirror [percentage]
Thtrottle down sync speed (lowest is '1' which is also default when
unspecified)
restore_dm_mirror
Restores the value of throttling before call of 'throttle_dm_mirror'
Usually it should '100'
Currently usage of loop device over backend file in ramdisk (tmpfs)
is actually causing unnecassary memory consution, since just
reading such loop device is causing RAM provisioning.
This patch add another possible way how to use ramdisk directly
through 'brd' device when possible (and allowed).
This however has it's limitation as well - brd does not support
TRIM, so the only way how to erase is to remove brd module ??
Alse there is 4K sector size limitation imposed by ramdisk.
Anyway - for some mirror test that were using large amount of
disk space (tens of MB) this brings noticable speed boost.
(But could be worth to solve the slowness of loop in kernel?)
To prevent using 'brd' for testing set LVM_TEST_PREFER_BRD=0
like this:
make check_local LVM_TEST_PREFER_BRD=0
This test can't use brd (ramdisk) as backend since for some
weird reason lsblk is not listing these device.
TODO: test could be probably rewritten to avoid using lsblk somehow??
When the backend device supports only 4K blocks (like ramdisk)
we cannot use for testing any smaller blocksize.
So recalc test for 4K extent size.
We may possibly introduce one list extra test that
can be executed on devices with 512b sectors to
check lvm2 support those min extent sizes...
ATM it's a bit ugly to enforce flushing of 'stdio' here, but works as quick
hot-fix.
log_print*() is using buffered I/O.
But for pooling with typical 1s interval this may take a while before
buffer about continues progress gets flushed.
So ATM fflush().
TODO: either add log_print*_with_flush() or maybe directly use just
line buffering with log_print() and only log_debug() keep using buffered
I/O mode.
md devices using an older superblock version have
superblocks at the end of the md device. For commands
that skip reading the end of devices during filtering,
the md component devs will be scanned, and will appear
as duplicate PVs to the original md device. Remove
these md components from the list of unused duplicate
devices, so they are treated as if they had been
ignored during filtering. This avoids the restrictions
that are placed on using PVs with duplicates.
All these functions are now used as utilities,
e.g. for ioctl (not for io), and need to
open/close the device each time they are called.
(Many of the opens can probably be eliminated by
just using the bcache fd for the ioctl.)
with the --labelsector option. We probably don't
need all this code to support any value for this
option; it's unclear how, when, why it would be
used.
An implementation of an adaptive radix tree. Has the following nice
properties:
- At least as fast as the hash table
- Uses less memory
- You don't need to give an expected size when you create
- It scales nicely (ie. no large reallocations like the hash table).
- You can iterate the keys in lexicographical order.
Only insert and lookup are implemented so far. Plus there's a lot
more performance to come.
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
This reverts commit 0931067dc5.
The dep files should be in the build dir, which is not necc. the src dir.
Easy to fix, but reverting for now until I have time to revisit.
The clvmd saved_vg data is independent from the normal lvm
lvmcache vginfo data, so separate saved_vg from vginfo.
Normal lvm doesn't need to use save_vg at all, and in clvmd,
lvmcache changes on vginfo can be made without worrying
about unwanted effects on saved_vg.
To avoid the chance of freeing a saved vg while another
code path is using it, defer freeing saved vgs until
all the lvmcache content is dropped for the vg.
In case "lvconvert -mN RaidLV" was used on a degraded
raid1 LV, success was returned instead of an error.
Provide message to inform about the need to repair first
before changing number of mirrors and exit with error.
Add new lvconvert-m-raid1-degraded.sh test.
Resolves: rhbz1573960
I don't like having this in a common header because it means you end
up including too much and causing unneccessary dependencies. eg,
lib/misc/lib.h includes libdevmapper.h, internationalisation, and
logging stuff.
There are likely more bits of code that can be removed,
e.g. lvm1/pool-specific bits of code that were identified
using FMT flags.
The vgconvert command can likely be reduced further.
The lvm1-specific config settings should probably have
some other fields set for proper deprecation.
The mixed up vg repair code in vg_read was trying
to repair a vg when vg_read was called by clvmd.
The clvmd daemon isn't supposed to be repairing
or writing a vg.
(This is a temporary workaround; vg repair will soon
be pulled out of vg_read so it can be called in a
controlled way and consolidated instead of spread
around.)
Shift refresh of mirror table right into monitor_dev_for_events().
Use !vg_write_lock_held() to recognize use of lvchange/vgchange.
(this shall change if this would no longer work, but requires
futher some API changes).
With this patch dm mirror table is only refreshed when necassary.
Also update WARNING message about mirror usage without monitoring
and display LV name.
When 'dmsetup' reports result with --nameprefixes it currently
incorrectly 'escapes' problematic characters.
Letting pass such string though shell 'eval' function is hard task.
So instead cut away substring.
Once dmsetup will start to properly escape backslash and apostrophe
this function may need further tuning.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
External build dir now works too.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
The filters save information about devices that should
be ignored, so if we need to repeat a scan (unusual,
but happens in clvmd), we need to update the filters.
When pvmove was run in background mode and forks
instead of using lvmpolld, the child pvmove process
was not clearing the bcache from the parent, so all
the aio ops in the child were failing.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
The copy of the VG which clvmd stashes in lvmcache should
not only be used between suspend and resume, but between
sequential LV operations in clvmd, so that clvmd does not
need to reread the VG for each one. Prepare for that by
renaming the stashed VG as "saved_vg".
Instead of using delayer device user 'zero' device and let mirror
do some real work which takes some time.
In case the test machine is too fast - mirror might need to be made bigger
to meet needed criteria.
Also move all test needed this 'zero' PV trick to the end of test
so $dev2 and $dev4 are covered with 'zero' and can take any amount of
write without consuming any real space.
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Don't allow writes in test mode. test mode should be
more sophisticated than just faking writes, and this
should be a last defense for cases where test mode is
not being checked correctly.
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
In some odd cases (e.g. tests) there are very few devices
which results in creating too few blocks in bcache, so
create bcache with a minimum number of blocks.
When a PV is stacked on an LV, the LV will be kept in
bcache, and the open fd on the LV may interfere with
processing the LV. So, drop/close a bcache fd for
an LV before processing the LV.
Commands using lvmetad will not begin with a proper
label_scan which initializes bcache, but may later
decide they need to scan a set of devs, in which case
they'll need bcache set up at that point.
The improved detection of bad metadata when scanning
(where errors were ignored before) means we now have to
override some errors when forcibly erasing damaged metadata.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
This is a temporary hacky workaround to the problem of
reads going through bcache and writes not using bcache.
The write path wants to read parts of data that it is
incrementally writing to disk, but the reads (using
bcache) don't work because the writes are not in the
bcache. For now, add a dev to bcache before each attempt
to read it in case it's being used on the write path.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
In the same way as the other process_each functions.
In the common case all the info that's needed can be
used from lvmcache after a label scan. But this means
that unchosen devs for duplicate PVs need to be handled
explicitly.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
The old code was doing unnecessary label scans when
checking to see if the new VG name exists. A single
label_scan is sufficient if it is done after the
new VG lock is held.
When lvmlockd indicates that the lvmetad cache is out of
date because of changes by another node, lvmetad_pvscan_vg()
rescans the devices in the VG to update lvmetad. Use the
new label_scan in this function to use the common code and
take advantage of the new aio and reduced reads.
This fixes the use of lvmcache_label_rescan_vg() in the previous
commit for the special case of independent metadata areas.
label scan is about discovering VG name to device associations
using information from disks, but devices in VGs with
independent metadata areas have no information on disk, so
the label scan does nothing for these VGs/devices.
With independent metadata areas, only the VG metadata found
in files is used. This metadata is found and read in
vg_read in the processing phase.
lvmcache_label_rescan_vg() drops lvmcache info for the VG devices
before repeating the label scan on them. In the case of
independent metadata areas, there is no metadata on devices, so the
label scan of the devices will find nothing, so will not recreate
the necessary vginfo/info data in lvmcache for the VG. Fix this
by setting a flag in the lvmcache vginfo struct indicating that
the VG uses independent metadata areas, and label rescanning should
be skipped.
In the case of independent metadata areas, it is the metadata
processing in the vg_read phase that sets up the lvmcache
vginfo/info information, and label scan has no role.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When user configured lvm2 to NOT user monitoring, activated mirror
actually hang upon error and it's quite unusable moment.
So instead Warn those 'brave' non-monitoring users about possible
problem and activation mirror without blocking error handling.
This also makes it a bit simpler for test suite to handle trouble
cases when test is running without dmeventd.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Whenever we make visible LV out of previously invisible one,
reload it's table - the is mandator for proper udev rule
processing as well as ensure content of dm table is correct.
TODO: this new generic rule probably make extra raid rules unnecessary.
Fixing regresion on argument acceptance where any lv can be passed
with paramaterless lvconvert which is meant to figure out needed
operation - i.e. wait for mirror synchronization.
User has no other 'effective' method to wait for mirror getting in-sync.
Since we support snapshot of mirrors, we do need to properly check
for stacked lock holder - fixes problem of pvmove in cluster
with mirrors under snapshot.
WHATS_NEW for this patch goes with 'Restore pvmove support...'
The current logic that avoids setting SYSTEMD_ALIAS and SYSTEMD_WANTS
on "change" events is flawed in the default "systemd background job"
configuration. For systemd, it's important that device properties don't
change spuriously.
If an "add" event starts lvm2-pvscan@.service for a device, and a
"change" event follows, removing SYSTEMD_ALIAS and SYSTEMD_WANTS from the
udev db, information about unit dependencies between the device and the
pvscan service can be lost in systemd, in particular if the daemon
configuration is reloaded.
Steps to reproduce problem:
- create a device with an LVM PV
- remove device
- add device (generates "add" and "change" uevents for the device)
(at this point SYSTEMD_ALIAS and SYSTEMD_WANTS are clear in udev db)
- systemctl daemon-reload
(systemd reloads udev db)
- vgchange -a n
- remove device
=> the lvm2-pvscan@.service for the device is still active although the
device is gone.
- add device again
=> the PV is not detected, because systemd sees the lvm2-pvscan@.service
as active and thus doesn't restart it.
The original purpose of this logic was to avoid volumes being scanned
over and over again. With systemd background jobs, that isn't necessary,
because systemd will not restart the job as long as it's active.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Make the distinction between the cases with and without systemd
background jobs explicit in 69-dm-lvm-metad.rules rather than
substituting the rule from the Makefile. At this stage,
this improves only readibility, at the cost of one GOTO statement.
This patch introduces no functional change to the udev rules.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Test that no (Sub)LV remnants persist if the volume group is
not listed in configuration variable activation/volume_list,
hence not activatable thus causing initialization of rmeta
SubLVs to fail.
Related: rhbz1161347
btrfs is using fake major:minor device numbers.
try to be smarter and detect used node via DM device name.
This shortens delays, where i.e. lvm2 is asked to deactivate
volume with mounted btrfs as such operation is not retryed
and user is informed about device being in use.
Correct testing with format 1 and mq policy.
Add testing of 'smq'
Fix testing with clvmd - where logged message is part of clvmd log
and we can only check command status.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
Use 4K chunks since some older kernels are not capable
to create striped volumes with smaller size.
TODO: lvm2 should detect this ahead and avoid kernel
reporting "Invalid chunk".
Since this function is called with 'fd == -1', but Coverity can't see
this path can't be visited with this argument, add explicit check for
valid descriptor.
If the tools for checking thin_pool or cache metadata are missing,
issue rather just a WARNING, but let the operation of activation
continue.
This has the advantage, the if user is missing those tools,
but he already started to use thinpool or cacheing, he can
access these volumes with a WARNING.
Also if the user is using too old tools i.e. for CacheV2 format
dmpd tool 0.7 is required - provide informative WARNING and
skip failure from older tool version which can't understand
new format V2.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
While prioritized_section() based on raised priority works
nicely for standard lvm comman - separate counter is actually needed
when it's used in daemons like clvmd/dmeventd where priority
stays raised all the time.
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Introduce prioritized_section() as a closer match to previous logic
of critical_section() that has been held over longer sequence of
ioctl commands - essentially it's matching operation on a single
cookie.
While 'critical_section()' now corresponds to locked memory - we hold
this memory only between suspend/resume thus notion of 'cookie' was
lost.
This patch restores some logic unintentionaly lost with dropping
memory locking for just activation/deactivation calls.
When function dm_stats_populate() returns 0 it's an error and needs
log_error() message - function can't have 'success' returning 0 or
error without reasons.
Prevent call of dm_stats_populate(), when there has been no
stats region detected for a DM device.
Such skip is evaluated as 'correct' visit of stats call and
not causing 'dmstats' command failure.
Resulting loop table line was streamed to 'stderr' stream - assuming this
was not a feature when user used '-v' for more verbose output
and properly show it via 'log_verbose()' on 'stdout'.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
With problematic kernels raid devices can be occasionaly left with
'frozen' status - try to 'unfreeze' them with idle message on teardown.
Also replace couple greps with 'built-in' dmsetup --select feature.
Note: dmsetup --select currently reports 'No devices found' on stdout
and return success - looks like a bug to fix.
Just like lvm2 has internal devices like _tdata which is using UUID with
suffix, there is similar private type of device for crypto device where
they are using CRYPT-TEMP uuid prefix.
Also ignore stratis.
Some kernel version suffer from bad state transition where a device
steps into 'frozen' mode. Any application that tries to read such
raid gets unfortunatelly bloked.
As some sort of protection try to skip such raid device from being
scanned to minimize chances to block lvm2 command on such scan.
When such device is found, warning gets printed.
RaidLVs on read_only_volume_list have their SubLVs
activated readonly thus disabling metadata updates
or image resynchronization/recovery. Bug also causes
automatic repairs to fail.
Fix by always activating the RAID SubLVs readwrite.
Resolves: rhbz1208269
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
When snapshot is created in read-only mode with 'lvcreate -s -pr...',
lvm2 still needs to be able to write to layered -cow volume
to store metadata and exceptions blocks.
TODO: in some case we might be able to do full tree with read-only
volume but this probably needs futher validation:
1. checking snapshot header already exist
2. origin & snapshot are both in read-only mode.
Occasionaly users may need to peek into 'component devices.
Normally lvm2 does not let users activation component.
This patch adds special mode where user can activate
component LV in a 'read-only' mode i.e.:
lvchange -ay vg/pool_tdata
All devices can be deactivated with:
lvchange -an vg | vgchange -an....
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
When dmeventd thin plugin forks a configurable script, switch to use
execvp to pass whole environment present to dmeventd - so all configured
paths present at dmeventd startup are visible to script.
This was likely not a problem for common user enviroment,
however in test suite case variable like LVM_SYSTEM_DIR were
not actually used from test itself but rather from
a system present lvm.conf and this may have cause strange
behavior of a testing script.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
This is somewhat tricky - for test suite we keep using
'set -e -o pipefail' - the effect here is - we get error report
from any 'failing' command in whole pipeline - thus when something
like this: 'lvs | head -1' is used - and 'head' finishes before
lead 'lvs' is done - it recieves SIGPIPE and exits with error,
and somewhat misleading gets occasionally reported depending
of speed of commands.
For this case we have to avoid using standard pipes and rather
switch to using streamed results with temporary output file.
This is all nicely handled with bash feature '< <()'.
For more info:
https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail
While 'file-locking' code always dropped cached VG before
lock was taken - other locking types actually missed this.
So while the cache dropping has been implement for i.e. clvmd,
actually running command in cluster keept using cache even
when the lock has been i.e. dropped and taken again.
This rather 'hard-to-hit' error was noticable in some
tests running in cluster where content of PV has been
changed (metadata-balance.sh)
Fix the code by moving cache dropping directly lock_vol() function.
TODO: it's kind of strange we should ever need drop_cached_metadata()
used in several places - this all should happen automatically
this some futher thinking here is likely needed.
Improve pvmove to accept 'locally' active LVs together with
exclusive active LVs.
In the 1st. phase it now recognizes whether exclusive pvmove is needed.
For this case only 'exclusively' or 'locally-only without remote
activative state' LVs are acceptable and all others are skipped.
During build-up of pvmove 'activation' steps are taken, so if
there is any problem we can now 'skip' LVs from pvmove operation
rather then giving-up whole pvmove operation.
Also when pvmove is restarted, recognize need of exclusive pvmove,
and use it whenever there is LV, that require exclusive activation.
So this is a bit more complex and possibly worth futher checking.
ATM clvmd drops cmd->mem mempool AFTER refresh of cmd.
So anything allocating from cmd->mem during toolcontext init
will likely die at some point in time.
As a quick fix - just use regular malloc/free for 'dso' alloction.
It's worth to note - cmd->libmem seems to be often misused
causing hidden memleaking for clvmd.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.
Avoid using same return code for reporting 2 different things
and stricly report error code by return value and add new
parameter for reporting monitoring status.
This makes easier to recognize which error we got from dm_event
and continue only with ENOENT.
Check if the generated vg name still fits the buffer.
So too long strings are rejected.
Drop -1 from size passed to snprintf - as the \0 is already included.
On occasional gcc releases it's better to specify also -ldevmapper
to linking logic for python object.
It's in fact more correct since the liblvm.c code is using
libdevmapper functions - that were linked in only via
liblvm2app library.
This partially reverts commit da37cbd24f.
As the _cmdline structure use mempool for allocated ellement
that is being release on cmd_context close.
Before the better fix is made - restore previous logic and
reinitialize cmd structures again for new cmd_context.
Problem can be hit with e.g. this test run:
make check_local T=foreign LVM_VALGRIND_DMEVENTD=1
Invalid read of size 1
at 0x4C31C83: strcmp (vg_replace_strmem.c:846)
by 0x6BA0939: _find_command (lvmcmdline.c:1555)
by 0x6BA4304: lvm_run_command (lvmcmdline.c:2810)
by 0x6BD5E02: lvm2_run (lvmcmdlib.c:91)
by 0x685607E: dmeventd_lvm2_run (dmeventd_lvm.c:118)
by 0x6652684: _use_policy (dmeventd_thin.c:117)
by 0x6652E56: process_event (dmeventd_thin.c:298)
by 0x10CC5A: _do_process_event (dmeventd.c:945)
by 0x10CF83: _monitor_thread (dmeventd.c:1033)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Address 0x6266270 is 4,352 bytes inside a block of size 8,192 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x5289142: dm_free_wrapper (dbg_malloc.c:393)
by 0x528998A: _free_chunk (pool-fast.c:318)
by 0x52892A6: dm_pool_destroy (pool-fast.c:78)
by 0x6A8E52C: destroy_toolcontext (toolcontext.c:2254)
by 0x6BA5BD6: lvm_fin (lvmcmdline.c:3327)
by 0x6BD5EA7: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x5288F46: dm_malloc_aux (dbg_malloc.c:287)
by 0x52890AC: dm_malloc_wrapper (dbg_malloc.c:371)
by 0x52898E6: _new_chunk (pool-fast.c:286)
by 0x52893BA: dm_pool_alloc_aligned (pool-fast.c:106)
by 0x5289310: dm_pool_alloc (pool-fast.c:90)
by 0x6A8A21A: _load_config_file (toolcontext.c:808)
by 0x6A8A3D9: _init_lvm_conf (toolcontext.c:842)
by 0x6A8D3BD: create_toolcontext (toolcontext.c:1941)
by 0x6BA5B24: init_lvm (lvmcmdline.c:3308)
by 0x6BD5B7C: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5EB8: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
With pthreaded daemons like 'dmeventd' using liblvm via plugin,
lvm2 actually should not 'play' with streams at all - as there
could be parallel outputs running.
As a current quick workaround just disable change for pthreaded
program (gettid() != getpid()).
TODO: it's possible the change of buffering actually doesn't serve us
any measurable benefit and could be dropped as whole later...
Meanwhile this patch is fixing this occasional valgrind race report:
Invalid read of size 4
at 0x571892C: vfprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x57216B3: fprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x5042886: dm_event_log (libdevmapper-event.c:925)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Address 0x6264d30 is 192 bytes inside a block of size 552 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x573907D: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5C00: reopen_standard_stream (log.c:189)
by 0x6A8E62C: destroy_toolcontext (toolcontext.c:2271)
by 0x6BA5C22: lvm_fin (lvmcmdline.c:3339)
by 0x6BD5EF3: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x573932B: fdopen@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5DC2: reopen_standard_stream (log.c:200)
by 0x6A8D11D: create_toolcontext (toolcontext.c:1898)
by 0x6BA5B6B: init_lvm (lvmcmdline.c:3319)
by 0x6BD5BC8: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5F04: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
....
Process terminating with default action of signal 6 (SIGABRT): dumping core
at 0x570016B: raise (in /usr/lib64/libc-2.26.9000.so)
by 0x5701520: abort (in /usr/lib64/libc-2.26.9000.so)
by 0x57437D8: __libc_message (in /usr/lib64/libc-2.26.9000.so)
by 0x5743831: __libc_fatal (in /usr/lib64/libc-2.26.9000.so)
by 0x5744056: _IO_vtable_check (in /usr/lib64/libc-2.26.9000.so)
by 0x574751C: __overflow (in /usr/lib64/libc-2.26.9000.so)
by 0x574191A: fputc (in /usr/lib64/libc-2.26.9000.so)
by 0x50428E3: dm_event_log (libdevmapper-event.c:934)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Some tools are typically installed into /usr/sbin (or /sbin) dir.
And some systems do not add this path to user's $PATH var.
Ensure sbin paths are looked through...
In fact pvmove does support 'clustered-core' target for clustered
pvmove of LVs activated on multiple nodes.
This patch restores support for activation of pvmove on all nodes
for LVs that are also activate on all nodes.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
Add protectional internall error whenever we spot activation
of 'exclusive' only segments in 'non-exclusive' mode.
TODO: possibly the activation locking could be enhanced to handle
this fully behind the scene - as for now this works purely for
lvchange/vgchange activation.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Dedicated functions are now used to process each piece of data obtained,
so the refactoring in this file gives us one for the vgsummary and one
for the metadata header. This new type of function takes two parameters
(for now), the obtained data plus a single struct (that must not
reference any data on the stack) that wraps up the entire context needed
to process it.
Sleep a bit before checking /sys/block dir so the kernel has a moment to
actually put scsi debug device in it...
Some quite old kernels are in troubles with this plain searching grep
without sleep (namely 2.6.32)
modprobe scsi_debug
<sleep .1>
grep -H scsi_debug /sys/block/*/device/model
modprobe -r scsi_debug
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
The creation of wrapped around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
Add three new raid tests with io load and table
reloads during reshape for target 1.13.2.
Add a raid0 to raid10 conversion test.
Also add more signals to trap in lvconvert-raid-reshape-load.sh.
If there is sufficient space in the metadata area, align the next
metadata to a disk offset that is a multiple of 4096 bytes and
don't write it circularly. If it doesn't all fit at the end
of the metadata area, go back to the start and write it all there
contiguously.
If there is insufficient space to use the new stricter rules, revert to
the original behaviour, aligning on 512-byte boundaries wrapping around
the circular buffer as required.
Even after writing some metadata encountered problems, some commands
continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
lvmdbusd was started, but the process was not recognized by pgrep.
- configure does not make the script executable - set the flag
explicitly when running make check,
- process name changed to lvmdbusd. The previous python3 value
originated from the use of /usr/bin/env.
Use new ALIGN_ABSOLUTE macro when calculating the start location
of new metadata and adjust the end of buffer detection so that
there is no longer an imposed gap between old and new metadata.
Currently both start and offset should always be divisible by alignment,
so this should have no effect, but a later patch will increase alignment
so these variables can no longer be optimised out.
lvmdbusd executable script must use python3 interpreter detected by
configure script, as site-packages directory used for library is only
used by that interpreter.
Although it doesn't look like it can be a measurable problem
and costs some time to flip priorities outside of activation window.
So just like with memory locking preserve priority until call
memlock_unlock() appears.
(addition to commit c086dfadc3).
Expand out the metadata wrapping calculations to prepare
to support a larger alignment.
The current alignment is 512 bytes so
(mdac_area_start + rlocn->offset) % alignment is zero.
When doing resume, directly pass location where new updated info
needs to be stored.
_resume_node() ensures the info is ONLY updated when the function
is successful and never changes it on error path.
Update the logic towards more explicit logic.
Preload tree normally does not want to resume, only
in certain cases of extension or new loaded nodes can be
resumed. So introduce new internal variable delay_resume_if_extended
controlable by target.
Patch itself is not changing current existing behaviour,
and rather documents existing problem in more readable way.
lvm2 needs to introduce explicit mechanism how to support more
fain-grained (and safe) logic to i.e. resize thin-pool which
can be sitting on cached raid volume.
Variable props.send_messages has 3 states and was not used properly
here. Activation in this moment does not need to verify thin-pool status
as that has been already checked on preload.
So only if there are some real messages (value 2) call function
for sending them.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
In case of failed legs, raid replaces those with
e.g. "vg-lv_rimage_0-missing_0_0" mapped to an error target.
Those errouneously remain on deactivation.
Fix by removing them on deactivation/removal of the RaidLV.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
Code already has dereferenced UUID before this point,
and its already given we require name & uuid when ading new node
(although uuid could be empty string).
Just like everywhere else - use single if() for major:minor setup
(it basically can't fail as of today anyway)
Always leave funtion with correctly set pointers even on error path.
Replicator never really existed in upstream kernel and its support
got deprecated.
Also its support never got finished so no code is supposed to be
using it anyway.
Libdm symbols are remaining, just the implementation will always
return failure - so any user of:
dm_tree_node_add_replicator_dev_target()
dm_tree_node_add_replicator_target().
will now always recieve error message.
Separate handling of error code from _info_by_dev.
This error can only happeng when we are running out of memory.
In such case there is urgent need to stop any futher proceeding
of command and run to error ASAP.
Avoiding "$(get first_extent_sector "$d")" in the loop
allows the test to succeed in the cluster. Further cluster
analysis needed to get to the core reason.
The lvm2 test suite aims at small test resource footprints
(few PVs, small PV sizes) to run on tmpfs backed loop device.
OTOH, lvconvert-reshape-raid.sh aims to test the maxima of
supported total stripes of 64. This patch adds a prerequisite
conditional to skip tests using more than 14 stripes.
It requires the target version 1.13.1 to avoid deadlocks.
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
Use new 3rd. state of trace_pvmove_deps == 2.
In this state we know, we have already seen the node and can skip futher
testing. Remainging value 1 signals we want to track, and value 0
is for ignoring tracking, but node is still checking in this case.
Reduces large amount of duplicate ioctl queries.
Check also all snapshosts when resume is requested,
the origin volume is already resume, but possibly
some subLV or snapshot LV could be suspended if
we are still in critical_section.
When entering any critical section, lvm2 used to lock process memory
and raised task priority to avoid problem with page swapping and minimize
time of having non-resumed devices in table.
With this patch, memory locking which which is expensive is only used when
entering 'suspending' section as only in this section there is risk
lvm could be suspending a device which later can be needed for paging.
Raised priority is still kept for all section entrances as this is
low-cost operation and may accelerate table resumes - although the real
impact can be still considered later.
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
When large size number (>2^31) is given on command line it could be
misdetected and in certain cases lead to wrongly casted number.
So make sure all cases always do set _MAX number in case the value would
not fit within the supported range instead of getting some random value
within the range.
In most cases this was not a problem to detect, but i.e. stripesize
parameter might have been fooled by certain large numbers.
Rewrite validation of stripes and stripe_size args into more readable
sequential code.
Extend reading of stripes & stripes_size args so it better knows
defaults for types like striped raid.
TODO: this should really be a value obtained for segtype structure and
all the weird conditions and modification of stripes and stripe_size
around lvm2 code should be dropped.
ATM we want to support delayed resume purely in pvmove case.
So have libdm logic internal to recognize difference beween
pvmove and other targets that do use delayed resume.
This fixes problem introduced with commit aa68b898ff
for mirror-on-mirror or snapshot-on-mirror problem.
TODO: likely added new API call and let libdm user select
delayed nodes explicitely.
Use code which detectes handlers in a way, which is more
backward-compatible friendly.
Replace read of 'sysfs' uuid entry with dm ioctl call.
Use /sys/block/dm-X/holders path instead of
new path /sys/dev/block/major:minor/holders.
TODO:
There are few more occurencies of this logic around the code
so some abstract interface should be considered.
In some cases the message could be slightly misleading so use
here rather conditional.
TODO:
In future we may possibly further tune the message in case we are
certain the level of redundancy protection has not been reduced.
When pvmove is finished and does 'suspend/resume' on PVMOVE LV,
on resume path committed metadata are already showing 'standalone'
pvmove LV prepared just for removal.
However code should be able to 'resume' preloaded LV there were
participating in pvmove operation.
Previously this was all done in the 'tools' part of lvm2 code.
So the lvconvert upon pvmove finish had to explicitely call 'resume' on every such LV.
Now 'smarted' activation code is able to deduce and combine all information from
the active dm table and committed metadata so single call resolves
it all in one go.
Internally holders are detected by reading sysfs directory to capture
all needed UUID which are then looked in lvm2 metadata and all such
LVs are automatically collected into dmtree.
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Propagate delayed resume at least for preload case in a simple way.
Currently PVMOVE depends on internal logic where 'mirror' with
corelog is 'possible' PVMOVE. In such case resume of 'created'
node is 'delayed'.
This is mostly an ugly internal hack - but for the moment being when we
add propagation for preload - it does work reasonable.
TODO: provide standard API and avoid this internal 'guessing'.
Only thin-pool with origin_only suspend is allowed to be not suspending anything.
In such case pairing resume will 'decrement' critical section counter.
Just like suspend handles preload for pvmove finish,
in similar way handle suspend of starting pvmove.
In this case the precommited metadata are checked for list of PVMOVEed
LVs and those are suspended in with committed metadata.
When sanlock or dlm lock managers return an error number
that we don't recognize, replace it with a generic -ELMERR
which is defined in the set of special lvmlockd error
numbers. Otherwise, an unknown lock manager error number
could be misinterpreted for something else if it happened
to overlap another set of error numbers (which they have
not thus far.)
These less common errors returned from sanlock should
also cause sanlock to retry the lock acquire:
- i/o timeout occurs during sanlock_acquire().
other i/o on the same disk as the leases can cause
sanlock i/o timeouts.
- low level disk paxos contention between hosts naturally
causes one host to not acquire the lease. There are a
couple special error numbers associated with these cases
that should just be recognized as a normal failure to
acquire the lease.
There is no need to differentiation between clustered VG and normal VG.
As the activation depends on locking type.
Use unconditionally locally exclusive activation for pvmove.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
When activation of LVs fails prior pvmove start, try to deactivate
already activated LVs.
TODO: possibly remember which LVs where already activate and only those
take down - devices which are already in-use will stay active.
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
The persistent filter should not be imported by any command that doesn't
use it so take addtional note of REQUIRES_FULL_LABEL_SCAN (for vgrename)
and introduce IGNORE_PERSISTENT_FILTER for vgscan and pvscan.
If the suspend/resume sequence would leave some device in suspend
for possible later resume, backup cannot be takes (fs holding backups
could be still frozen in critical section())
Move check for presence of raid4 into the right place
so there is no way how to hit activation of any LV
with raid4 on kernel which does not support it.
Commit 763db8aab0 rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Resolves: rhbz1511047
In HA cluster, we have "clvm" resource agent to manage clvmd daemon.
The agent invokes clvmd like: "clvmd -T90 -d0", which always prints
a scaring error message:
"""
local socket: connect failed: No such file or directory
"""
When specifed with "-d" option, clvmd tries to check if an instance
of the clvmd daemon is already running through a testing connection.
The connect() will fail with this ENOENT error in such case, so supress
the error message in such case.
TODO: add missing error reaction code - since ofter log_error, program
is not supposed to continue running (log_error() is for reporting
stopping problems).
Signed-off-by: Eric Ren <zren@suse.com>
Check and prevent starting another snapshot merge before
exiting merging is finished.
TODO: we can possibly implement smarter logic to drop existing
merging and start a new one.
The lvchange-raid[456].sh test checks that mismatches can be detected
properly. It does this by writing garbage to the back half of one of
the legs directly. When performing a "check" or "repair" of mismatches,
MD does a good job going directly to disk and bypassing any buffers that
may prevent it from seeing mismatches. However, in the case of RAID4/5/6
we have the stripe cache to contend with and this is not bypassed. Thus,
mismatches which have /just/ happened to an area that now populates the
stripe cache may be overlooked. This isn't a serious issue, however,
because the stripe cache is short-lived and reasonably small. So, while
there may be a small window of time between the disk changing underneath
the RAID array and when you run a "check"/"repair" - causing a mismatch
to be missed - that would be no worse than if a user had simply run a
"check" a few seconds before the disk changed. IOW, it simply isn't worth
making a fuss over dropping the stripe cache before beginning a "check" or
"repair" (which we actually did attempt to do a while back).
So, to get the test running smoothly, we simply deactivate and reactivate
the LV to force the stripe cache to be dropped and then proceed. We could
just as easily wait a few seconds for the stripe cache to empty also.
When a "recover" is just starting for a RAID LV, it is possible to get
"idle" for the sync action if the status is issued quickly enough. This
is fine, the MD thread just hasn't gotten things going yet. However,
the /need/ for a "recover" should be marked in md->recovery and it would
be simple enough to fix the kernel so this doesn't happen. May eventually
want a separate bug for this, but for now it fits with RHBZ 1507719.
We always preferred and recommended socket activation for our services
so remove the Install section in related .service units which are unused
in this case and keep only the Install section in associated .socket
units.
Signed-off-by: Bastian Blank <waldi@debian.org>
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets just created
'layer_lv' without segment, it needs to be hidden
from vg->lvs during processing of _align_segment_boundary_to_pe_range()
as this function calls lv_validate() and now requires
vg to be consistent. LV is then put back into vg->lvs.
There are two known bugs in the lvconvert-raid-status-validation.sh
test. The first one I consider to be more of an annoyance (1507719).
The second one I consider to be more serious (1507729).
RHBZ 1507719 simply documents the fact that the three RAID status
fields may not always be coherent due to the way they are set and
unset when the MD thread is shutting down and starting up. For
example, the sync ratio may be 100% but the sync action may not
yet have switched to "idle" and the health characters may not yet
all be 'A's (i.e. the devices set to InSync).
RHBZ 1507729 is more serious. The sync ratio can be 100% for a
short period of time after upconverting linear -> RAID1. It is
reset to 0 once the MD sync thread gets to work on it. It does
this because, technically, the array /is/ in-sync if the new
devices are excluded - i.e. the data is 100% available and
consistent. I'm not sure what to do about this problem, but we'd
much rather not have this state that looks exactly like the
end of the process when the sync ratio is 100% because the
"recover" process finished, but the sync action and health
characters haven't been updated yet. Put simply, the problem
is that we can't tell if a sync is starting or finished based
on the status output.
Since 4fa5add6b1 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)
Preload reiserfs module for the case, fs is present/compiled for a
kernel but it's not present in memory.
Size reducition needs --yes confirmation to preceed for reiserfs.
Correct reported message when thin snapshot has been already merged.
So lvm2 is no longer reporting "Mergins of snapshot X will occur..."
(even with swapped names).
When an ignored metadata area gets flagged for use again, make sure the
code doesn't try to parse its old metadata. Firstly by trying to detect
this situation and skipping the read (while still remembering the
position reached in the circular buffer), and secondly by clearing the
invalid live metadata location on disk as a precaution when subsequently
writing out the precommitted metadata.
Problems showed up when a metadata area in one VG got moved to
another VG in ignored state (still holding metadata for the original
VG) and then later got brought into use in the new VG - only the header
should be read in this case, not any of the metadata content.
vgsplit shares the vg_rename code so that must only set the PV_MOVED_VG
flag introduced in commit 486ed10848
("vgmerge: Fix intermediate metadata corruption") on PVs that moved.
Since both lvcreate and lvconvert needs to check for same
type of allowed origin for snapshot - move the code into
a single function.
This way we also fix several inconsitencies where snapshot
has been allowed by mistake either through lvcreate or
lvconvert path.
Generation of man pages is generating lot of barely readable output.
For normal build quietize this a bit.
For original verbose build start to use 'make V=1'
(just like i.e. linux kernel does)
TODO: apply at more places...
Converting from one raid level to another, no changes
of stripes or stripesize can be requested because those
are subject to reshaping. I.e. the process requires to
takeover first and secondly request raid algorithm,
stripe or stripesize changes.
Ignore any related changes display warninngs
and proceed with the takeover.
Without this patch, a takeover requesting
stripesize change causes data corruption!
Just like with other vars support this:
make check_local T=xyz LVM_LOG_FILE_MAX_LINES=10000000
Allows easily to override existing line limit.
Also increase limiting size of logs per command since some of
our commands are becoming very verbose....
Add explaining message, when command was aborted due to the reach
of configure line number count (LVM_LOG_FILE_MAX_LINES)
for logging (used mainly with testing).
Last patch missed to mention, we've improved/fixed generated paths
in units and init.d shell scripts when lvm2 was plainly configured
with just i.e. --prefix.
Note: some distros might have fully specified --sbindir and
--usrsbindir - thus those very not seeing problems in generated paths.
Replace lowercase @sbindir@ with @SBINDIR@ which contains
fully decoded path.
Same with @usrsbindir@ which is also used with clvmd and cmirrord.
Also handle SYSCONFDIR for EnvironmentFile.
Patch fixes generated unit files with strings like:
ExecStart=${exec_prefix}/sbin/lvm
Introduce few more AC_SUBST vars for usage in *.in generation.
In some case we want to replace i.e. $sbindir with full path
instead of current ${exec_prefix}/sbin.
This patch provides:
USRSBINDIR
SBINDIR
DEFAULT_SYS_LOCK_DIR
SYSCONFDIR
At the same time properly use sbindir & usrsbindir with
lvm, fsadm, clvmd from one primary definition.
Do not allow to take snapshot of mirror/raid leg or log or metadata LV.
This was actually never supported, but user was able to create it,
and this put device stack in hardly fixable state (needs manual work).
This prevents such creation to pass.
Also improve validation when recreating snapshot volume type
from origin and COW volume.
Correction to function for extracting vgname out of lvconvert
parameters.
Avoid repeating some checks.
Add code to handle generic options which may provide vgname in its argument
and compare them all so they match to a single vgname (otherwise it's a
error).
Extract default (envvar) vgname only when no position nor optional vgname is
found.
Fixing regression instroduce with patchset started with commit:
1e2420bca8 (2.02.169)
split resize_crypt function in two.
a) Detect proper dm-crypt device type and count new --size
value for cryptsetup resize command.
b) Perform the resize
the bug in LUKS grow/shrink decision in fsadm was
masked due to fact that default LVM2 extent size
was larger than LUKS1 default data offset for dm-crypt
mapping. The new test address this bug.
lvcreate supports a 'conversion' when caching LV.
This normally worked fine, however in case passed LV was
thin-pool's data LV with suffix _tdata we have failed to early.
As the easiest fix looks dropping validation of name when
caching type is select - such name check will happen later
once the VG is opened again and properly detect if the LV
with protected name already exists and can be converted,
or will be rejected as ambigiuous operation requiring user
to specify --type cache | --type cache-pool.
Replaced the confusing device error message "not found (or ignored by
filtering)" by either "not found" or "excluded by a filter".
(Later we should be able to say which filter.)
Left the the liblvm code paths alone.
Fixes the following case with 3PVs and 3 legs "mirror" LV:
# lvcreate -l100%FREE --type mirror -m2 vg3
Insufficient free space for log allocation for logical volume .
Unable to allocate extents for mirror log.
Related: rhbz1269533
Activation lock has a primary purpose to serialize locking of individual
LV in case there is no other protecting mechanism for parallel
execution.
However in the case an activated LV is composed from several other LVs,
noone should be able to manipulate with those LVs as well.
This patch add a very 'naive' global VG activation locking in this case.
In the future we may introduce smarter function detecting minimal closed
graph components if this will appear as bottleneck
Patch checks if the VG Write lock is held - in this case we do not
need any more locking - command has exclusive access to VG.
In case we have clustered VG and we are activating an LV which does not
need other LVs - we also do not need any more locks.
In all other cases take respective lock - for single LV - use lvid,
for complex LVs use vgname.
Creating striped RaidLVs with lv size not divisible by region size
caused the region size to be adjusted:
# lvcreate --type raid5 -n region_check.32.00m_3 -i 3 -L 1g --nosync -R 32.00m raid_sanity
Using default stripesize 64.00 KiB.
Rounding size 1.00 GiB (256 extents) up to stripe boundary size <1.01 GiB(258 extents).
WARNING: New raid5 won't be synchronised. Don't read what you didn't write!
Using reduced mirror region size of 8.00 MiB
Logical volume region_check.32.00m_3 created.
Fix by not imposing "mirror" constraints on "raid".
Resolves: rhbz1404007
vgmerge suffers from a similar problem to the one fixed in commit
8146548d25 ("vgsplit: Fix intermediate
metadata corruption.")
When merging, splitting or renaming VGs, use a new PV status flag
PV_MOVED_VG to mark the PVs that hold metadata with the old VG name and
use this to provide PV-level granularity instead of incorrectly assuming
all PVs in the VG are the same.
Add these for dmeventd systemd unit (dm-event.service):
Before: shutdown.target
Conflicts: shutdown.target
This will cause the dmeventd to be properly stopped at shutdown (after
all the dmeventd clients unregistered their devices from monitoring)
with dm-event.service's stop action (there's no direct action defined
for the "stop" so systemd sends SIGTERM instead).
Before, we let dmeventd to get killed only as part of the very last
SIGTERM/SIGKILL for all the remaining processes late in the shutdown
sequence so we may have missed some logs if dmeventd encountered an
error during its shutdown (logging facilities are already off at this
late time in shutdown sequence).
Ref: https://www.redhat.com/archives/lvm-devel/2017-October/msg00000.html
When dmeventd receives SIGTERM/INT/HUP/QUIT it validates if exit is possible.
If there was any device still monitored, such exit request used to
be ignored/refused. This 'usually' worked reasonably well, however if there
is very short time period between last device is unmonitored and signal
reception - there was possibility such EXIT was ignored, as dmeventd has
not yet got into idle state even commands like 'vgchange -an' has already
finished.
This patch changes logic towards scheduling EXIT to the nearest
point when there is no monitored device.
EXIT is never forgotten.
NOTE: if there is only a single monitored device and someone sends
SIGTERM and later someone uses i.e. 'lvchange --refresh' after
unmonitoring dmeventd will exit and new instance needs to be
started.
If you send a SIGUSR1 (10) to the daemon it will dump all the
threads current stacks to stdout. This will be useful when the
daemon is apparently hung and not processing requests.
eg.
$ sudo kill -10 <daemon pid>
Make sure that any and all code that executes in the main thread is
wrapped with a try/except block to ensure that at the very least
we log when things are going wrong.
Changing the VG of a PV uses the same on-disk mechanism as vgrename.
This relies on recognising both the old and new VG names. Prior to this
patch the vgsplit code incorrectly provided the new VG name twice
instead of the old and new ones. This lead the low-level mechanism not
to recognise the device as already belonging to a VG and so paying no
attention to the location of its existing metadata, sometimes partly
overwriting it and then later trying to read the corrupt metadata and
issuing a checksum error.
In some cases we are seeing where there are no VGs, but the data returned from
lvm shows that the PVs have the following for the VG:
"vg_name":"[unknown]", "vg_uuid":""
The code was only checking for the exitence of the VG name and we called into
the function get_object_path_by_uuid_lvm_id which requires both the VG name and
the LV name to exist (asserts this) which results in the following stack trace:
Traceback (most recent call last):
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 563, in runner
obj._run()
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 584, in _run
self.rc = self.f(*self.args)
File "/home/tasleson/lvm2/daemons/lvmdbusd/fetch.py", line 26, in
_main_thread_load
cache_refresh=False)[1]
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/home/tasleson/lvm2/daemons/lvmdbusd/loader.py", line 37, in common
objects = retrieve(search_keys, cache_refresh=False)
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 40, in
pvs_state_retrieve
p["pv_attr"], p["pv_tags"], p["vg_name"], p["vg_uuid"]))
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 84, in __init__
vg_uuid, vg_name, vg_obj_path_generate)
File "/home/tasleson/lvm2/daemons/lvmdbusd/objectmanager.py", line 318,
in get_object_path_by_uuid_lvm_id
assert uuid
AssertionError
When executing in the main thread, if we encounter an exception we
will bypass the notify_all call on the condition and the calling thread
never wakes up.
@staticmethod
def runner(obj):
# noinspection PyProtectedMember
Exception thrown here
----> obj._run()
So the following code doesn't run, which causes calling thread to hang
with obj.cond:
obj.function_complete = True
obj.cond.notify_all()
Additionally for some unknown reason the stderr is lost.
Best guess is it's something to do with scheduling a python function
into the GLib.idle_add. That made finding issue quite difficult.
There's nothing special about /boot other than it's used during boot.
But when blkdeactivate is called either on all devices or including a
device where the /boot is on top, we should also include this mount
point when doing unmount before deactivation of supported devices.
The new blkdeactivate -r|mdraidoption wait causes blkdeactivate to wait
for any resync/recovery/reshape that is currently in progress before
deactivating the device.
If this option is used, blkdeactivate calls mdadm -W|--wait before
mdadm -S|--stop.
Revert dc50f2f4a0.
We're canonicalizing/escaping the names here and we're reusing the
variable name so the code doesn't need to use extra variables and
further assignments that may confuse us. Let's keep the code simple.
The
local name=(...$name)
is not the same as
local name
name=(...$name)
(I know various code-checking tools fuss about this and recommend
the 2nd way, but let's ignore those tools' nitpicking here please.)
There was a typo in blkdeactivate --dmoption/--lvmoption/mpathoption,
it had missing "s" at the end and it was not recognized properly, only
short names for the options (-d/-l/-m).
When certain cmd def RULE's fail, the error messages can
sometimes be confusing. This expands the error messages
to help clarify why the rule failed, especially in cases
where options are used incorrectly.
In a shared VG, only allow pvmove with a named LV,
so that only PE's used by the LV will be moved.
The LV is then activated exclusively, ensuring that
the PE's being moved are not used from another host.
Previously, pvmove was mistakenly allowed on a full PV.
This won't work when LVs using that PV are active on
other hosts.
Avoid starting test, when test dir has less then 50M of free space.
Better to crash early before letting die machine on weird crash
in OOM cases...
Also show free disk space when test starts
Enable handling of --poolmetadataspare so if user can prevent
creation of _pmspare volume during --repair operation (just
like during actual lvcreate or lvconvert) for pool volumes.
The following commands now pass the device list through a
--select|-S filter before processing:
suspend resume clear wipe_table remove deps status table
In a shared VG, lvconvert must be used to create thin pools
and cache pools, not the lvcreate variants of those commands.
Deny these cases early in lvcreate using the new command defs.
Denying these cases deeper in the code was missing some
cleanup of the partially completed command.
Revert the lvmlockd.c changes from:
commit 0bf836aa14
"tidy: prefer not using else after return"
The commit introduced at least one regression, which broke
lvcreate of a thin pool in a shared VG.
ATM we have several instances of daemonizing code.
Each has its 'special' logic so not completely easy
to unify them all into a single routine.
Start to unify them and use one strategy for rediricting
all input/outpus to /dev/null - use 'dup2' function for this
and open /dev/null before fork to make sure it's available.
When file-locking mode failed on locking, such description was leaked
(typically not an issue since command usually exists afterwards).
So shirt close() at the end of function and use it in all error paths.
Also make sure, when interrrupt is detected, it's really not holding
lock and returns 0.
lvmcache_foreach_mda() can fail for numerous reasons
and failing error code cannot be ignored (out-of-memory...)
TODO: might need more error handling tunning.
gcc warns here about storring 69 bytes in 64 byte array (losing
potentially 4 bytes from 'ls->name').
lvmlockd-core.c:2657:36: warning: ‘%s’ directive output may be truncated writing up to 64 bytes into a region of size 60 [-Wformat-truncation=]
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~
lvmlockd-core.c:2657:2: note: ‘snprintf’ output between 5 and 69 bytes into a destination of size 64
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Replaced with slightly better code - but it still misses error path what
to do if the name would be truncated... - so added FIXME.
Also using all bytes for snprintf() buffer size
(as the size is with \0 included)
After the internal lvmlock LV (holding sanlock leases) is
extended to hold more leases, it needs to be zeroed.
sanlock expects to see either zeroed blocks or blocks
initialized with leases.
currently, lvcreate for sanlock find the free lock offset
from the beginning of the lvmlock every time.
after created thousands of lvs, it will issue thousands of read
ios for lvcreate to find free lock offset.
remeber the last free lock offset will greatly reduce the impact
Signed-off-by: Zhang Huan <zhanghuan@huayun.com>
Fix code checking that the 2nd mda which is at the end of disk really
fits the available free space and avoid any DA and MDA interleaving when
we already have DA preallocated. This mainly applies when we're restoring
a PV from VG backup using pvcreate --restorefile where we may already have
some DA preallocated - this means the PV was in a VG before with already
allocated space from it (the LVs were created). Hence we need to avoid
stepping into DA - the MDA can never ever be inside in such case!
The code responsible for this calculation was already in
_text_pv_add_metadata_area fn, but it had a bug in the calculation where
we subtracted one more sector by mistake and then the code could still
incorrectly allocate the MDA inside existing DA. The patch also renames
the variable in the code so it doesn't confuse us in future.
Also, if the 2nd mda doesn't fit, don't silently continue with just 1
MDA (at the start of the disk). If 2nd mda was requested and we can't
create that due to unavailable space, error out correctly (the patch
also adds a test to shell/pvcreate-operation.sh for this case).
If the PV was originally created with a larger-than-default
metadata area the restored one wasn't and might not even be
large enough to hold the metadata!
Previously the cache remembered an existing bootloaderarea and
reinstated it (without even checking for overlap) when asked to
write out the PV. pvcreate could write out an incorrect layout.
Add the new concise format to dmsetup create, either as a single
command-line parameter or from stdin.
Based on patches submitted by
Enric Balletbo i Serra <enric.balletbo@collabora.com>.
Drop 'DMEVENT' make variable and use BUILD_DMEVENTD like with other daemons.
NOTE: by default we do not build dmeventd - maybe time to change
as lot of build targets basically do need dmeventd...
Switch to use SYSTEMD_LIBS and avoid linking systemd library with
every linked tool when dbus notification are enabled.
(TODO add missing testing for lib presence)
Check first if we need to even link -lrt - since clock functions
are normally emebeded with recent glibc (>=2.17)
Use standard RT_LIBS name.
Avoid duplicate test for realtime clock with lvmlockd
Show better error message when realtime clock support is missing or
disabled.
Link RT_LIBS explicitely with lvmlockd and lvmetad.
Avoid adding -g more then once for debug builds.
Avoid enabling DEBUG_MEM when we build multithreaded tools.
Link executables with -fPIE -pie and --export-dynamic LDFLAGS
Introduce PROGS_FLAGS to add option to pass flags for external libs.
Link lvm2 internally library only when really used.
Link DAEMON_LIBS with daemons.
Pass VALGRIND_CFLAGS internally
Set shell failure mode on couple places.
lvm2 warned about zeroing and too big chunksize (>=512KiB), but
only during lvconvert, so lvcreate was creating thin-pools
without any warning about possible slowness of thin provisioning
because of zeroing.
Create a new table output format that concisely shows multiple devices
on one line.
dmsetup table --concise [device...]
<dev_name>,<uuid>,<flags>[,<table>]*[;<dev_name>,<uuid>,<flags>[,<table>]*]*
Table lines are separated by commas.
Devices are separated by semi-colons.
Flags is currently 'ro' or 'rw' (and might be extended in a
yet-to-be-defined way in future).
Any comma, semi-colon or backslash within a field is quoted by a
preceding backslash.
The format can later be supplied as input to dmsetup or even to the
booting kernel as an alternative way to set up devices.
Based on patches submitted by
Enric Balletbo i Serra <enric.balletbo@collabora.com>.
Add an independent command definition for "vgchange --locktype",
and split the implementation out of the set of common metadata
changes. It is unlike normal metadata changes, and can only
be run by itself. (Changing the lock type is similar in
principle to changing the VG name or the VG system ID; it
effects the ability of any host to see or access the VG.)
At some point this command lost the ability to forcibly change
the lock type of a shared VG to "none" (making it a local VG).
This can be necessary to repair shared VGs (e.g. recovery steps
that occur in vg_read are disabled for shared VGs because
they are not locked properly, or recovering sanlock locks
when the PV holding them is lost.)
"vgchange --locktype none --lockopt force VG" is used as the
method of forcing the shared VG to become local so that it
can be repaired.
Since _deactivate_and_remove_lvs() is used in more then one place,
move the needed udev synchronization into this function so other
users automatically get correct fs state before next dm manipulation.
Assumption here is that this udev synchronization 'delay' may also
prevent to 'early' table reloads which might cause kernel problems
for md-core - but we may need more generic time-limited reload
frequency for raid devices.
Note: on udev-less system there will be almost no delay.
Commit 8a912d6dbc missed the wrong logic,
we use 2 vars 'dev' & 'mddev' and their usage can't be mixed.
So correctly separate them so mddev keeps name of MD device.
During test do a more close selection of visible devices.
If some test leaks a device with LVMTEST prefix, next
test should not be influnced (or parallel running one).
If the test is running in non-/dev dir - it's already protected
by full path with $PREFIX in it - however it test is running
in real /dev dir - there was no such protection and test
were confused when they have seen such leaked device.
Commit 9b4b5d449e started to accept
rather way to wide set of strings we do not want to take for size.
i.e. lvresize -t vg0/lvol0 -LNaNM
Restore check for 'digit || locales defined 'dot' | '.'
(as we tend to take '.' even if locales uses ',')
Explictely detect duplicate sing symbols and leave the rest of
double number validation on 'strtod()' function. This way
we can also accept size like:
lvcreate -L.1M
We already accept -L0.1M - but it's common to accept numbers
starting with leading '.' - just as 'strtod()' accepts it).
API for strtod() or strtoul() needs reset of errno, before it's being
called. So add missing resets in missing places and some also some
errno validation for out-of-range numbers.
Since we are reading size as (double) we can get way bigger
number then just plain int64. So to make this check actually
more valid and usable do a maxsize compare in 'double'.
Avoid reading already released memory and do a continue directly.
Invalid read of size 1
at 0x1201B0: main_loop (clvmd.c:931)
by 0x11F640: main (clvmd.c:666)
Address 0x72ddef0 is 32 bytes inside a block of size 224 free'd
at 0x4C30D18: free (vg_replace_malloc.c:530)
by 0x54D6FD1: dm_free_wrapper (dbg_malloc.c:357)
by 0x122E6E: process_work_item (clvmd.c:2034)
by 0x123003: lvm_thread_fn (clvmd.c:2085)
by 0x590A3A8: start_thread (pthread_create.c:465)
by 0x5C3C7FE: clone (in /usr/lib64/libc-2.25.90.so)
Block was alloc'd at
at 0x4C2FB6B: malloc (vg_replace_malloc.c:299)
by 0x54D6EF1: dm_malloc_aux (dbg_malloc.c:286)
by 0x54D6F1C: dm_zalloc_aux (dbg_malloc.c:291)
by 0x54D6F96: dm_zalloc_wrapper (dbg_malloc.c:345)
by 0x11F89C: local_rendezvous_callback (clvmd.c:731)
by 0x1203D2: main_loop (clvmd.c:964)
by 0x11F640: main (clvmd.c:666)
Initialize mutex upfront any debugging and fix this report:
Mutex reinitialization: mutex 0x485d20, recursion count 0, owner 1.
at 0x4C38480: pthread_mutex_init_intercept (drd_pthread_intercepts.c:821)
by 0x4C38480: pthread_mutex_init (drd_pthread_intercepts.c:830)
by 0x11F359: main (clvmd.c:562)
mutex 0x485d20 was first observed at:
at 0x4C38F63: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:885)
by 0x4C38F63: pthread_mutex_lock (drd_pthread_intercepts.c:898)
by 0x11E920: debuglog (clvmd.c:254)
by 0x11F1D8: main (clvmd.c:527)
Switch from warn to log_error since this generated
failing return code for command so printing log_error()
is mandatory.
Happens with i.e. pvscan --cache meets crashing lvmetad.
Commit 34504855a7 introduced
flag LV_RESHAPE_DATA_OFFSET and used it to avoid incompatible
activation on older runtime.
Enhance vg_validate() raid checking functions with checks for it.
Patch 72a58ce4b0 was wronly placing
double quotes around this variable which we want to pass expanded,
as it's just set of 'space' device args ATM.
TODO consider using array[@] to make this cleaner.
Add shellcheck directive to skip warning here
Changes:
- BASH_SOURCE index was one off.
- The first line of stacktrace was pure confusion displaying executed
script together with innermost line number (which was either 125 when
STACKTRACE or 229 when skip was called.)
- We can safely ignore innermost call, as stack trace is always produced
by stacktrace function.
- It is safer to test for array length, instead of testing FUNCNAME is
main - if main function were introduced.
- Bashishm is safe to use as this function as a whole is relying on bash.
In order to reject out of place reshaping with segment data_offset
field on old runtime, add a respective segment type incompatibility
flag causing "+RESHAPE_DATA_OFFSET" to be suffixed to the segment
type name.
When reshape space is allocated anew, an update and reload is needed to
promote the new size to the cluster node with the exclusively active RaidLV
or reloading the RaidLV will fail with a size related error. Additionally,
store "data_offset <sectors>" with the RaidLV in the lvm2 metadata so that
it can be retrieved on cluster nodes.
Process allocation of reshape space on a 2-legged raid4/5 (interim layout
to convert from/to linear via raid1) properly in the cluster.
Resolves: rhbz1461562
Resolves: rhbz1448116
Use 1 logic for 2 loops tearing down left device.
First loops tries to remove all closed devices with 'normal' remove.
Second loop tries to replace those left devices with 'error' target.
We can't really sleep that much in teardown as it slows test too much.
So do a nested loop (similar to 'dmsetup remove_all') and keep
removing devices with open count == 0 as long as it works.
When we want to squash as much device as possible,
it's better to give it some delay, so devices have
some time to release it's resouces for next removal.
Also drop surrounding cookie processing and let each
dmsetup call run on its own.
Add missing get_devs.
When $7 is not given use empty string.
See if we can live with less RAM disk for PVs.
Drop limitation on single core as presence 1.12 should address this.
The checking order here has happend after TESTDIR was removed
resulting in weird further error on trap path.
Properly check for unexpected dmeventd before removing TESTDIR
since 'trap' codepath is still using it.
Also try to kill this unexpected dmeventd so testing is
not skipping all next dmeventd tests.
(Downside would be - if user would be accidentally starting
dmeventd by some regular system admin work - such dmeventd
might be killd if it's unused. It can't kill dmeventd in-use.
Fix mirror_images_on() to actually report something useful (thought
it might be tuned later).
So for now the function got through all '_mimages_' and compares
where the order of them is matching given list of devices.
Possible misspelling: FAILED_MIXED_STR may not be assigned, but FAIL_MIXED_STR is.
Possible misspelling: FAILED_MULTI_STR may not be assigned, but FAIL_MULTI_STR is.
Possible misspelling: FAILED_BLACK_STR may not be assigned, but FAIL_BLACK_STR is.
The blkid we call in 13-dm-disk.rules also returns identifiers for
partitions based on which the /dev/disk/by-part{uuid,label} and
gpt-auto-root symlinks should be created in the same manner as we
already create symlinks for filesystem labels and uuids.
This is because we handle blkid calls and symlink creation under
/dev/disk ourselves in our 13-dm-disk.rules for device-mapper devices
for us to have more control over this process.
See also https://lists.freedesktop.org/archives/systemd-devel/2017-July/039220.html
and original report http://tracker.ceph.com/issues/19489 for
the exact case where these symlinks were missing.
Fix the error messages when an unrecognized command is
run from a script. We shouldn't attempt to parse options
for an unrecognized command name, which causes misleading
errors about bad options, but rather exit right when we
know the command name is not valid. Also don't complain
about exiting without an error message when running a
script if no command didn't exist.
1. dm_uuid is 68 byte length, but buf is 64 which
will cause miss match uuid from lv lock manager
2. no lv lock_type path in dm config, use lock_args instead
Signed-off-by: Zhang Huan <zhanghuan@chinac.com>
If the activation step in lvcreate fails (e.g. the specified
minor number is already used), then the lvcreate is reverted,
but the LV lock in lvmlockd was not being unlocked or properly
freed.
Some lvconvert commands can be used directly on the data sublv:
lvconvert ... vg/pool_tdata
The correct LV lock to use in lvmlockd is the one on the pool LV.
Centralise editing of the client list into _add_client() and
_del_client(). Introduce _local_client_count to track the size of the
list for debugging purposes. Simplify and standardise the various ways
the list gets walked.
While processing one element of the list in main_loop(),
cleanup_zombie() may be called and remove a different element, so make
sure main_loop() refreshes its list state on return. Prior to this
patch, the list edits for clients disappearing could race against the
list edits for new clients connecting and corrupt the list and cause a
variety of segfaults.
An easy way to trigger such failures was by repeatedly running shell
commands such as:
lvs &; lvs &; lvs &;...;killall -9 lvs; lvs &; lvs &;...
Situations that occasionally lead to the failures can be spotted by
looking for 'EOF' with 'inprogress=1' in the clvmd debug logs.
Correctly skip the test when lvmdbusd is found already running.
For pgrep usage we need to add '-f -l' options to get python3 name
printed.
Remove no longer used 'pids' local var.
With commit 41c10034aa we actually
do require LV to be used with _vg_write_lv_suspend_commit_backup().
So write a proper separte single wrapper for write && commit && backup.
lvm_run needs to place NULL as the last element into argv[].
Otherwise we get:
Conditional jump or move depends on uninitialised value(s)
_command_required_pos_matches (lvmcmdline.c:1443)
_find_command (lvmcmdline.c:1610)
lvm_run_command (lvmcmdline.c:2770)
lvm2_run (lvmcmdlib.c:91)
Dmeventd reuses 'dm_task' struct for some STATUS operation, but due to
missing reinitization of dm_task target list, it has caused misprocesing
of recieved events as the parsed target has been simply added to the
list of existing status and cause multiple actions being called for
single event.
Since we discovered status reporting from 'md' goes from large set
of weird states we can't just decided based on this word.
So let it pass for rebuild and idle as well
and check for health devices afterwards.
Add function to adjust printing of percent values in better way.
Rounding here is going along following rules:
0% & 100% are always clearly reported with .0 decimal points.
Values slightly above 0% we make sure a nearest bigger
non zero value with given precission is printed
(i.e. 0.01 for %.2f will be shown)
For values closely approaching 100% we again detect and adjust value
that is less then 100 when printed.
(i.e. 99.99 for %.2f will be shown).
For other values we are leaving them with standard rounding mechanism
since we care mainly about corner values 0 & 100 which need to be
printed precisely.
When we want to report primary leg failure, check for intial 'a',
since otherwice 'Aa idle' is normally visible.
Also reset array of bit flags marking dead devices, once
plugin detects raid is in sync.
Functionality of ignore suspend devices is already granted by:
lvm2_disable_dmeventd_monitoring() -> init_run_by_dmeventd() ->
init_ignore_suspended_devices().
In fact plugins should never use --config because it has
some unpleasant technical issues.
When raid leg rimage device is marked as 'D'ead by mdcore,
lvm2 was not able to replace such device with allocate policy,
as device has not appared as missing.
Add detection of transiently failing devices.
Basically reverting commit 58a9f88b8c.
We can use origin_only in case we are snapshot's origin,
as we do support this stack.
So when we are 'uncaching' origin+snaps - we do need to reload only
origin and we do not need to play with snaps.
Handle change of 'region size' better and follow also standard rule
if the command can't success (i.e. size is already same) we return
error for all such cases.
Also log_pring more info about adjusted value (just like we
do for rounding)
Also avoid keep pointers on 'display_*' values - they are in
ringbuffer for immediate use - not to be kept across multiple calls
(as they could be already overwritten by later calls) - so dropped
seg_region_size_str
'lvdisplay -m' tried to go through NULL policy settings,
when such policy was not defined for CachedLV.
Patch is fixing display of cache-pool without defined settings,
as this is now a valid pool and we mostly want users to define
these settings when actually really caching a LV.
Since cache LV can be a stacked device, there is no real reason
trying to use slight optimised tree for origin_only cache reload
(it could be even wrongly implemented in this case).
We can easily go with stardard tree load here.
When user runs command like 'lvconvert --splitcache' the operation
might be actually either slow or not making any progress in kernel,
so lets give user a chance to abort such operation.
When user press 'Ctrl+C' device table is restored to pre-flushing state.
Remove explicit activation of SubLVs and let lv_update_and_reload()
perform the proper (pre-)loading sequencing of tables.
This avoids related callback functions which are removed.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
When lock-holding LV differs from actually request locked LV,
we drop origin_only flag as it has no use - it'd be applied
on completely different LV.
Example of problem:
Raid is thin-pool _tdata LV.
Raid run origin_only locking on stacked device.
As lock holder is discovered thinLV.
Whole origin_only operation is then applied only on thinLV
changing the meaning of whole operation.
NOTE: this patch does not change anything for LV that are
already top-level lock holding LVs (i.e. thinLVs, snahoshots/origins).
Disable until we have a proper fix for reshape space allocation,
switching it to begin/end of rimages and activation in the cluster.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
Commit 1fe4f80e45 in current version
introduced regression for a terminal user, as he could not enter 'n'
as answer. Add missing break for this case (No whats_new).
New tests to add checking for '100%' in-sync at start of "recover"
process (it shouldn't happen, but I've seen it before). Also,
check status over the whole cycle of various sync processes ("resync"
and "recover").
Enhance reporting code, so it does not need to do 'extra' ioctl to
get 'status' of normal raid and provide percentage directly.
When we have 'merging' snapshot into raid origin, we still need to get
this secondary number with extra status call - however, since 'raid'
is always a single segment LV - we may skip 'copy_percent' call as
we directly know the percent and also with better precision.
NOTE: for mirror we still base reported number on the percetage of
transferred extents which might get quite imprecisse if big size
of extent is used while volume itself is smaller as reporting jump
steps are much bigger the actual reported number provides.
2nd.NOTE: raid lvs line report already requires quite a few extra status
calls for the same device - but fix will be need slight code improval.
Current existing kernels reports status sometimes in weird form.
Instead of showing what is the exact progress, we need to estimate
this in-sync state from several surrounding states.
Main reason here is to never report 100% sync state for a raid device
which will be undergoing i.e. recovery.
Relative to last comit ddf2a1d656:
adjust the dm-raid target version to 1.12.0 which shows
mandatory kernel MD deadlock fixes related to reshaping
are presant in the kernel.
Related: rhbz1443999
First test in this file checks whether 'aa' is ever spotted during
a "recover" operation (it should not be). More tests should follow
in this file to look for oddities in status output - especially as
it relates to the sync_ratio, dev_health, and sync_action fields.
For the test clean-up, I was providing too many devices to the first
command - possibly allowing it to allocate in the wrong place. I was
also not providing a device for the second command - virtually ensuring
the test was not performing correctly at times.
This patch ensures that under normal conditions (i.e. not during repair
operations) that users are prevented from removing devices that would
cause data loss.
When a RAID1 is undergoing its initial sync, it is ok to remove all but
one of the images because they have all existed since creation and
contain all the data written since the array was created. OTOH, if the
RAID1 was created as a result of an up-convert from linear, it is very
important not to let the user remove the primary image (the source of
all the data). They should be allowed to remove any devices they want
and as many as they want as long as one original (primary) device is left
during a "recover" (aka up-convert).
This fixes bug 1461187 and includes the necessary regression tests.
Add the checks necessary to distiguish the state of a RAID when the primary
source for syncing fails during the "recover" process.
It has been possible to hit this condition before (like when converting from
2-way RAID1 to 3-way and having the first two devices die during the "recover"
process). However, this condition is now more likely since we treat linear ->
RAID1 conversions as "recover" now - so it is especially important we cleanly
handle this condition.
Previously, we were treating non-RAID to RAID up-converts as a "resync"
operation. (The most common example being 'linear -> RAID1'.) RAID to
RAID up-converts or rebuilds of specific RAID images are properly treated
as a "recover" operation.
Since we were treating some up-convert operations as "resync", it was
possible to have scenarios where data corruption or data loss were
possibilities if the RAID hadn't been able to sync completely before a
loss of the primary source devices. In order to ensure that the user took
the proper precautions in such scenarios, we required a '--force' option
to be present. Unfortuneately, the force option was rendered useless
because there was no way to distiguish the failure state of a potentially
destructive repair from a nominal one - making the '--force' option a
requirement for any RAID1 repair!
We now treat non-RAID to RAID up-converts properly as "recover" operations.
This eliminates the scenarios that can potentially cause data loss or
data corruption; and this eliminates the need for the '--force' requirement.
This patch removes the requirement to specify '--force' for RAID repairs.
Two of the sync actions performed by the kernel (aka MD runtime) are
"resync" and "recover". The "resync" refers to when an entirely new array
is going through the process of initializing (or resynchronizing after an
unexpected shutdown). The "recover" is the process of initializing a new
member device to the array. So, a brand new array with all new devices
will undergo "resync". An array with replaced or added sub-LVs will undergo
"recover".
These two states are treated very differently when failures happen. If any
device is lost or replaced while "resync", there are no worries. This is
because any writes created from the inception of the array have occurred to
all the devices and can be safely recovered. Even though non-initialized
portions will still be resync'ed with uninitialized data, it is ok. However,
if a pre-existing device is lost (aka, the original linear device in a
linear -> raid1 convert) during a "recover", data loss can be the result.
Thus, writes are errored by the kernel and recovery is halted. The failed
device must be restored or removed. This is the correct behavior.
Unfortunately, we were treating an up-convert from linear as a "resync"
when we should have been treating it as a "recover". This patch
removes the special case for linear upconvert. It allows each new image
sub-LV to be marked with a rebuild flag and treats the array as 'in-sync'.
This has the correct effect of causing the upconvert to be treated as a
"recover" rather than a "resync". There is no need to flag these two states
differently in LVM metadata, because they are already considered differently
by the kernel RAID metadata. (Any activation/deactivation will properly
resume the "recover" process and not a "resync" process.)
We make this behavior change based on the presense of dm-raid target
version 1.9.0+.
On conversion from raid10 to raid0 (takeover), all rmeta
devices and the rimage devices of mirrored stripes are
detached from the raid10 LV. The remaining rimage areas
are being shifted down into the slots of the detached
ones hence requiring renames to show proper _N suffix
sequences (e.g. 0,1,2,3 instead of 0,2,4,6). Only the
top-level raid10 LV has a cluster lock, not the detached
SubLVs thus their deactivation is impossible and e.g the
rename from *_rimage_6 to *_rimage_3 will fail. Fix by
activating exclusively before deactivating and removing.
Resolves: rhbz1448123
The file block count stored in the filemap_monitor was lazily
initialised at the time of the first check. This causes problems
in the case that the file has been truncated between this time and
the time the daemon started: the initial block count and current
block count match and the daemon fails to detect a change.
Separate the setting of the block count from the check and make a
call to update the value at the start of _dmfilemapd().
It's not an error to attempt to update regions from an fd that has
been truncated (or otherwise no longer has any allocated extents):
in this case, the call should remove all regions corresponding to
the group, and return an empty region table.
For proper usage of Cache kernel metadata format V2,
new cache_check tool is basically mandatory.
Print warning during configure time about this problem.
Prohibit activation of reshaping RaidLVs on incompatible
lvm2 runtime by storing e.g. 'raid5+RESHAPE' segment type
strings in the lvm2 metadata. Incompatible runtime not
supporting reshaping won't be able to activate those thus
avoiding potential data corruption.
Any new non-reshaping lvconvert command will reset the
segment type string from 'raid5+RESHAPE' to 'raid5'.
See commits
0299a7af1e and
4141409eb0
for segtype flag support.
When old snapshot is merged, lvm2 still can report some data about
merged 'snapshot' - i.e. it occupied space in VG.
This patch fixes regression from commit:
6fd20be629
and resolved RHBZ: 1460161
Avoid reporting 'checking result' as maybe - it should
clearly tell 'yes' or 'no'.
Just shuffle printed message to the place, where we
already know the 'maybe' answer.
So instead of printing 'unclear':
checking whether to enable libblkid detection of signatures when wiping... maybe
checking for BLKID... yes
checking whether to use udev-systemd protocol for jobs in background... maybe
checking for SYSTEMD... yes
show this:
checking for BLKID... yes
checking whether to enable libblkid detection of signatures when wiping... yes
checking for SYSTEMD... yes
checking whether to use udev-systemd protocol for jobs in background... yes
Code path missed validation of lvcreate --cachepool argument.
If the non cache-pool LV was passed in, code has still continued
further work and failed later on internal error. Validate this
condition at right place now.
When a combination of thin-pool chunk size and thin-pool data size
goes beyond addressable limit, such volume creation is directly
prohibited.
Maximum usable thin-pool size is calculated with use of maximal support
metadata size (even when it's created smaller) and given chunk-size.
If the value data size is found to be too big, the command reports
error and operation fails.
Previously thin-pool was created however lots of thin-pool data LV was
not usable and this space in VG has been wasted.
Only support RAID conversions on active LVs.
If we'd accept e.g. upconverting linear -> raid1 on inactive
linear LVs, any LV flags passed to the kernel aren't properly
cleared thus errouneously passing them on every activation.
Add respective check to lv_raid_change_image_count() and
move existing one in lv_raid_convert() for better messages.
If during the process of fetching current lvm state we experience an
exception we fail to call set_result on the queued_requests we were
processing. When this happens those threads block forever which causes
the service to stall infinitely. Only clear the queued_requests after
we have called set_result.
We were not adding background tasks to flight recorder. Add the meta
data to the flight recorder when we start the command and update the meta
data when the command is finished. Locking was added to meta data to
prevent concurrent update and returning string representation as these can
happen in two different threads.
vgreduce previously allowed --all and --removemissing together even though
it only actual did the remove missing. The lvm dbus daemon was passing
--all anytime there was no entries in pv_object_paths. This change supplies
--all if and only if we are not removing missing and the pv_object_paths
is empty.
Vgreduce has and continues to enforce the invalid combination of supplying a
device list when you specify --all or --removemissing so we do not need
to check for that invalid combination explicitly in the lvm dbus service as
it's already covered.
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1455471
Warn about a PV that has the in-use flag set, but appears in
the orphan VG (no VG was found referencing it.)
There are a number of conditions that could lead to this:
. The PV was created with no mdas and is used in a VG with
other PVs (with metadata) that have not yet appeared on
the system. So, no VG metadata is found by lvm which
references the in-use PV with no mdas.
. vgremove could have failed after clearing mdas but
before clearing the in-use flag. In this case, the
in-use flag needs to be manually cleared on the PV.
. The PV may have damanged/unrecognized VG metadata
that lvm could not read.
. The PV may have no mdas, and the PVs with the metadata
may have damaged/unrecognized metadata.
A PV holding VG metadata that lvm can't understand
(e.g. damaged, checksum error, unrecognized flag)
will appear as an in-use orphan, and will be cleared
by this repair code. Disable this repair until the
code can keep track of these problematic PVs, and
distinguish them from actual in-use orphans.
Reject any stripe adding/removing reshape on raid4/5/6/10 because
of related MD kernel deadlock on single core systems until
we get a proper fix in MD.
Related: rhbz1443999
Since lvmetad is using 'MISSING' in status for 'another' purpose,
we need to support ATM also flag get from this place.
Until fixed better - we accept both flags - alhough lvm2 will
only print in flags.
Switch METADATA_FORMAT flag usage to be stored via segtype
instead of 'status' flag which appeared to cause major
incompatibility troubles.
For backward compatiblity segtype flags are still accepted also
via 'status' bits which were used from version 2.02.169 so metadata
saved by this newer lvm2 version should still work nicely, although
new save version will no longer work on this older lvm2 version.
Allow storing LV status bits with segment type name field.
Switching to this since this field has better support for compatibility
with older version of lvm2 - since such unknown segtype will not cause
complete invisiblity of metadata from older lvm2 code - just the
particular LV will become unusable with unknown type of segment.
- Must reread all objects as PVs might be removed.
- Never consider testsuite provided PVs nested, or tearDown fails to
remove any outstanding VGs on them.
When 'fsadm' was running without terminal (i.e. pipe), it's been
automatically working like in '--yes'.
Detect terminal and only accept empty "" input in this mode.
Add more validation to catch mainly renamed devices, where
filesystem utils are not able to handle devices properly,
as they are not addressed by major:minor by rather via some
symbolic path names which can change over time via rename operation.
Since we add more validation to 'detect_mounted' function make sure
we always use it even with 'resize' action, so numerous validations
are not skipped.
Commit 5fe07d3574 failed to set raid5 types
properly on conversions from raid6. It always enforced raid6_ls_6
for types raid6/raid6_zr/raid6_nr/raid6_nc, thus requiring 3 conversions
instead of 2 when asking for raid5_{la,rs,ra,n}.
Related: rhbz1439403
Offer possible interim LV types and display their aliases
(e.g. raid5 and raid5_ls) for all conversions between
striped and any raid LVs in case user requests a type
not suitable to direct conversion.
E.g. running "lvconvert --type raid5 LV" on a striped
LV will replace raid5 aka raid5_ls (rotating parity)
with raid5_n (dedicated parity on last image).
User is asked to repeat the lvconvert command to get to the
requested LV type (raid5 aka raid5_ls in this example)
when such replacement occurs.
Resolves: rhbz1439403
Add an exception to not allowing lvchange to change properties
on hidden LVs. When a thin pool data LV is a cache LV, we
need to allow changing cache properties on the tdata sublv of
the thin pool.
Trap cases where the percentage calculation currently leads to an empty
LV and the message:
Internal error: Unable to create new logical volume with no extents
Additionally convert the calculated number of extents from physical to
logical when creating a mirror using a percentage that is based on
Physical Extents. Otherwise a command like 'lvcreate -m3 -l80%FREE'
can never leave any free space.
This brings the behaviour closer to that of lvresize.
(A further patch is needed to cover all the raid types.)
_check_reappeared_pv() incorrectly clears the MISSING_PV flags of
PVs with unknown devices.
While one caller avoids passing such PVs into the function, the other
doesn't. Move the check inside the function so it's not forgotten.
Without this patch, if the normal VG reading code tries to repair
inconsistent metadata while there is an unknown PV, it incorrectly
considers the missing PVs no longer to be missing and produces
incorrect 'pvs' output omitting the missing PV, for example.
Easy reproducer:
Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
Reinstate the hidden PV pv2 and at the same time hide a different PV
pv3.
Run 'pvs' - incorrect output.
Run 'pvs' again - correct output.
See https://bugzilla.redhat.com/1434054
There are certain situations (not fully understood)
where is_missing_pv() is false, but pv->dev is NULL,
so this adds a check for NULL pv->dev after is_missing_pv()
to avoid a segfault.
With recent updates for thin pool monitoring in version 169
we lost multiple WARNINGs to be printed in syslog, when
pool crossed 80%, 85%, 90%, 95%, 100%.
Restore this logic as we want to keep user informed more
then just once when 80% boundary is passed.
Fix a regression introduced in 70bb726 that allows a local variable
in the monitored file checking routine to be accessed before its
assignment when the file has already been unlinked.
When a user does a Manager.PvCreate they can specify the block device using a
device path that may be different than what lvm reports is the device path. For
example a user could use:
/dev/disk/by-id/wwn-0x5002538500000000 instead of /dev/sdc
In this case the pvcreate will succeed, but when we query lvm we don't find the
newly created PV. We fail because it's device path is returned as /dev/sdc. This
change re-uses an internal lookup which can accommodate this and correctly find
the newly created PV.
Corrects https://bugzilla.redhat.com/show_bug.cgi?id=1445654
There were a handful of references to other man
pages using the standard command(N) form which were
not in bold, so they were not turned into links
in html formats.
Many will look in 'man lvm.conf' expecting to find a
description of the lvm.conf fields, which are not there.
State at the beginning how to get this (by running
lvmconfig.)
The lvm(8) man page never said what LVM is,
it never defined what the acronym LVM stands for,
it never spelled out other common acronyms VG, PV, LV,
and never described what they are.
This adds a very minimal definition which at least defines
the acronyms and entities, but it could obviously be
expanded on, either here or elsewhere.
Clean up the handling of memory used for cmd defs
so it doesn't trip up memory debugging.
Allocate memory for commands[] from libmem.
Free temporary memory used by define_commands()
at the end of the function.
Clear all the command def state in in lvm_fin().
Using any arg with a command name in a script file
would cause the command to fail.
The name of the script file being executed was being passed
to lvm_register_commands() and define_commands(), which
prevented command defs from being defined (simple commands
were still being defined only by name which was enough for those
to still work when run trivially with no args).
The logic for suggesting the nearest valid command syntax
was missing the simplest case. If a command has only one
valid syntax, that is the one we should suggest. (We were
suggesting nothing in this case.)
If the device size does not match the size requested
by --setphysicalvolumesize, then prompt the user.
Make the pvcreate checking/prompting code handle
multiple prompts for the same device, since the
new prompt can be in addition to the existing
prompt when the PV is in a VG.
Adding qualifier makes the only unqualified log_debug occurence
consistent with other uses in the same file.
Other possible ways to fix this:
- using `from .utils import log_debug`
- moving the line below `from . import utils` line
Unless a change of the regionsize is requested via "lvconvert -R N ...",
keep the region size when the number of images changes in a raid1 LV.
Related: rhbz1443705
Unless a change of the regionsize is requested via "lvconvert -R N ...",
keep the region size when the number of images changes in a raid1 LV.
Resolves: rhbz1443705
lvconvert parameters not causing a conversion (i.e. no type,
number of stripes, stripesize or regionsize changes) will
remove any allocated reshape space in which case the command
returns success. If reshape space does not exist though,
return error.
When metadata LV size was over DM_THIN_MAX_METADATA_SIZE sectors,
the info() routine was incorrectly trying to match bigger size,
while we do never pass any bigger device.
Fixing a case, where lvs should be displaying status for metadata
LV with 16GB size.
When a cmd def RULE fails because of a disallowed
combination of options, improve the error message
to show the option combination, not just the options
that broke the rule.
Reshape check failed when regionsize changed and current raid type
was provided with no other change requested (stripes or stripesize).
E.g. "lvconvert --type raid6 --regionsize 256K" on a raid6 LV
with != 256K regionsize.
Enable --type in test script.
Remove any newly allocated sub LV (pair) remnants in case
allocation fails due to lag of (parallel) free PV space
and keep initial raid type.
Resolves: rhbz1438013
A command def can include a specific constant option value,
but the value was not being checked for optional opts.
e.g. this is an incorrect command and does not match any
command defs:
lvconvert --type cache --cachepool vg/lv
However, it was mistakely being matched to this cmd def,
where the required options match, but the optional options
do not:
lonvert --cachepool LV_linear_striped_raid_cachepool
OO: --type cache-pool, ...
The optional options were mistakely considered matching
because 'cache' and 'cache-pool' were not being compared.
SIGINT isn't blocked properly after a sigint_allow(),
sigint_restore() cycle leading to illicit interruptable
metadata updates. These can leave corrupted metadata behind.
Issues addressed in this commit:
sigint_allow() fails to set _oldmasked[] members properly due
to an offset by one bug on indexing the members of the array.
It bails out prematurely comparing to MAX_SIGINTS causing nesting
depths to be one less than MAX_SIGINTS. Fix the comparision.
Correct the related comparison flaw in sigint_restore().
Initialize all sig_atomic_t variables consequently.
Resolves: rhbz1440766
Avoid error message
"Logical Volume *_rimage_0 already exists in volume group,,,"
on takeover conversion from a 2-legged raid1 to raid4
(aiming to reshape it adding images).
Resolves: rhbz1439398
Requesting _raid_remove_images() to commit the
metadata missed to reload the origin causing a
kernel takeover error converting a 2-legged raid1
(with previously removed images) to raid5.
Allow the combination of both arguments keeping
the raid level but changing the regionssize
(e.g. "lvconvert --type raid1 --regionsize 1M RaidLV"
on an existing raid1 LV).
Resolves: rhbz1438396
lvchange/lvconvert operations that do not require the LV
to already be active need to acquire a transient LV lock
before changing/converting the LV to ensure the LV is not
active on another host. If the LV is already active,
a persistent LV lock already exists on the host and the
transient LV lock request is a no-op.
Some lvmlockd locks in lvchange were lost in the cmd def
changes. The lvmlockd locks in lvconvert seem to have
been missed from the start.
We have to unset the LoadState variable from previous use when we check
for systemd unit state. We use this variable to check if systemd services
are loaded or not and if it is loaded, we issue systemctl commands to
enable/disable and start/stop the service. We don't issue these commands
if the unit is not loaded to avoid error messages which may confuse users.
The actual value specified by the --poll y|n option was not
being used. The way the --poll value is used is hidden
through an indirection where the value is stored in a
global variable at the start of the command, and then the
value is read from there later. Setting the global
variable early in the command had been lost with the
cmd def changes.
Fill in some gaps where old versions of lvm allowed
--poll and --monitor in combination with other operations,
but those combinations had been lost since the cmd def work.
(The new cmd def code also added some combinations that
had been missed by the old code.)
Changes:
lvchange --activate: add poll and monitor options, and
add calls to them in implementation.
lvchange --refresh: add monitor option (poll already there),
and call to monitor in implementation.
lvchange <metadata ops>: add poll and monitor options, and
add calls to them in implementation.
vgchange <metadata ops>: add poll option (call to poll
already in implementation).
vgchange --refresh: remove monitor option (not used by code)
lvchange --persistent y: add poll and monitor options, and
add calls to them, and to activate
in the implementation. (Making it
match the main lvchange metadata
command.)
Summary of current usage:
lvchange --activate: monitor, poll
vgchange --activate: monitor, poll
lvchange --refresh: monitor, poll
vgchange --refresh: poll
lvchange --monitor: ok
lvchange --poll: ok
lvchange --monitor --poll: ok
vgchange --monitor: ok
vgchange --poll: ok
vgchange --monitor --poll: ok
lvchange <metadata ops>: monitor, poll
vgchange <metadata ops>: poll
Enhance commit 25b5915c9b
to process options requiring immediate metadata commits
and reloads after those we can group together doing just
one commit and an optional reload for the whole group.
Backup metadata after processing options successfully.
Related: rhbz1437611
Don't abbreviate the --help output quite as much
when there are many command defs. Print all the
options in the cmd defs that are shown. --longhelp
output is unchanged and includes everything.
These days --partial is only used with activation in
lvchange/vgchange. It probably had another meaning
at some point in history which is no longer used,
so ignore it in those cases.
When included in the OO list, the option is advertised in
help/man output, implying it is meaningful to the command,
when in fact the command never uses it.
The IO list means the option won't cause an error if it's
used, but is not displayed as an valid option for the command.
If the option is not included in either OO or IO lists,
using the option would cause a command error, which would cause
problems for anyone is using the option for historical reasons.
_lvchange_properties_single() processes multiple command
line arguments in a loop causing metadata updates and/or
backups per argument.
Optimize to only perform one update and/or backup
(but necessary interim ones; e.g. for --resync)
per command run.
Related: rhbz1437611
Use only selected paths for finding .so in builddir.
So if builddir constains some embeded subdirs with some more
occurences of project (i.e. 'make rpm' build subdir)
those library paths will not get into list.
With monolithic kernels we can't actually modprobe
for cache modules as they are already compiled-in
and policy modules do not export version symbol.
Reported issue on list:
https://www.redhat.com/archives/dm-devel/2017-March/msg00061.html
Fix will try to look for explicit kernel symbols first before
calling modprobe.
Fix a regression introduced by new command line processing
(see starting commit 1e2420bca8)
causing activation to be rejected in combination with setting
persistent minor device numbers.
Resolves: rhbz1437611
Since _filemap_monitor_set_notify() is only called after daemon
start up it should not use the _early_log() macros. Instead, use
log_sys_error to log errors from the two syscalls in the function.
The fallback branch in _stats_update_file() is redundant (since the
branch taken when the daemon starts successfully must jump to the
'out' label anyway): remove it and re-order the conditions to
improve readability.
Use log_sys_error rather than log_error if execvp() fails:
/mnt/redhat/xdoio.13752.XIORQ: Created new group with 2 region(s) as group ID 0.
# execvp() failed.
vs:
/var/lib/libvirt/images/rhel7-vm1.qcow2: Created new group with 884 region(s) as group ID 0.
dmfilemapd: execvp failed: No such file or directory
The function _filemap_monitor_check_file_unlinked() attempts to
test whether a fd value should be closed by comparison to zero:
if ((fd > 0) && close(fd))
log_error("Error closing fd %d", fd);
The test should be '>=' since 0 is a valid file descriptor.
Similar to 40fb91a, but for the file descriptor opened using the
link path reported by /proc/<self>/fd/<fd>.
The daemon opens a new file descriptor from /proc/<self>/fd when
checking for an unlinked file with mode=inode. Ensure that it is
always closed even if the same file test fails.
The daemon opens a new file descriptor from fm->path when checking
for an unlinked file with mode=inode. Ensure that it is always
closed even if the same file test fails.
The path argument check in dmfilemapd incorrectly tests for:
if (argv[0] == '/')
Rather than testing the 1st character in the string pointed to by
argv[0].
Removing some unused new lines and changing some incorrect "can't
release until this is fixed" comments. Rename license.txt to make
it clear its merely an included file, not itself a licence.
This patch fixed lvm2 compilation running on x32 arch.
(Using 64bit x86 cpu features but running on 32b address space,
so consuming less mem in VM).
On x32 arch 'time_t' is 64bit while 'long' is 32bit.
Commits a29bb6a14b
... 5c199d99f4
narrowed down on addressing the escaping of hyphens
in the dynamic creation of manuals whilst avoiding
them in creating help texts. This lead to a sequence
of slipping through hyphens adrressed by additional
patches in aforementioned commit series.
On the other hand, postprocessing dynamically man-generator
created and statically provided manuals catches all hyphens
in need of escaping.
Changes:
- revert the above commits whilst keeping man-generator
streamlining and the detection of any '\' when generating
help texts in order to avoid escapes to slip in
- Dynamically escape hyphens in manaual pages using sed(1)
in the respective Makefile targets
- remove any manually added escaping on hyphens from any
static manual sources or headers
raid1 doesn't allow to set all images to writemostly because at
least one image is required to receive any written data immediately.
The dm-raid target will detect such invalid request and
fail it iwith a kernel error message.
Reject such request in uspace displaying a respective error message.
The recent command definitions commit took the command
name from argv[0] without applying basename to the value,
so a pathname, e.g. /usr/sbin, would cause lvm to not
recognize the command name.
This commit supersedes reverted 1e4462dbfb
to avoid changes to liblvm and the libdm API completely.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range in libdm and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
This reverts commit 1e4462dbfb
in favour of an enhanced solution avoiding changes in liblvm
completetly by checking the target versions in libdm and emitting
the respective parameter lines.
Fixes command defs related to creating a new thin pool and
then a new thin lv in the new pool.
1. lvcreate --size --virtualsize --thinpool
Needs a cmd def, it was missing.
The def is unique by the three required
options: size, virtualsize and thinpool.
2. lvcreate --size --virtualsize --thinpool VG
Needs a cmd def, it was missing.
The def is unique by the three required
options: size, virtualsize and thinpool,
and one required positional arg: VG.
3. lvcreate --thin --virtualsize --size LV_new|VG
This existing def should not accept an optional
--type thin, which if used makes it indistinct
from another def.
4. lvcreate --size --virtualsize VG
This existing def should not accept an optional
--type thin or --thin, which if used makes it
indistinct from other defs (e.g. 3)
This reverts commit 642d682d8d.
Using the thinpool option with this cmd def makes it
indistinct from other cmd defs where thinpool is a
required option.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
lvcreate --thinpool POOL1 -L 100M --virtualsize 100M snapper_thinp
https://bugzilla.redhat.com/1434027
(The general rule is that a command is accepted if it is unambiguous.
The combination -L -V --thinpool uniquely identifies the operation.)
Udev events can come in like a flood when something changes. It really
doesn't do us any good to refresh the state of the service numerous times
when 1 would suffice. We had something like this before, but it was
removed when we added the refresh thread. However, we have since learned
that we need to sequence events in the correct order and block dbus
operations if we believe the state has been affected, thus udev events are
being processed on the main work queue. This change limits spurious
work items from getting to the queue.
If we always disable the sending of notify dbus events then in the case
where all the users are lvm dbus users we will be in udev handling mode
until at least 1 external lvm command occurs. Instead we will not disable
notify dbus until after we get at least 1 external event. This makes the
service get into the correct mode of operation faster.
lvconvert should not leak 'error' device.
(This patch is not fix the problem, just makes it more easily visible
instead of more confusing 'clvmd' trace).
Starting with dm-raid target version 1.9.0 shrinking of mapped devices is supported.
Check for support being present in lvresize and lvreduce.
Related: rhbz1394048
Since we want to test different cache policies with profiles mq&smq
raise version to 1.8.
TODO: maybe split in more tests so older targets can test here as well.
N.B.: passthrough is also not supported with version 1.3
Quering non-thin-pool segment for discard property may lead
to intenal error if the segment had set 'out-of-range' value,
so only thin-pool is allowed, for other it returns NULL.
If SubLVs to be removed still exist after an image removing
conversion (i.e. "lvconvert --yes --force --stripes N "
with N < total stripes) any request to convert to a different
striped/raid* level has to be rejected until after those freed
SubLVs got removed by running the aforementioned lvconvert again.
Add tests to check conversion to striped/raid* gets rejected.
Enhance a test comment.
Related: rhbz1191935
Related: rhbz1366296
Adjust to final target version 1.10.1 supporting reshape
properly and to recently changed report field specifications
(e.g. rehape_len_le) to allow these tests to run.
Lower mirror region size to suite the tiny test VG.
Previous commit 506d88a2ec introduced disabling lvmetad on repairs.
Avoid calling lvscan and use of any --config options altogether
in the mirror and raid DSOs.
Related: rhbz1380521
Repairing missing devices does not work reliably
with lvmetad, so disable lvmetad before repair.
A standard lvmetad refresh (pvscan --cache) will
enable lvmetad again.
Commit 07ded8059c assumed that the mirror is blocked which is not the case.
It is accessible, degraded and in need of repair because some of its legs
(partially) failed. Any auto-repair via dmeventd fails though because
of lvmetad not providing proper data about the failed PV(s). That's why
this workaround got introduced in commit 76f6951c3e until we get to
the lvmetad interaction core issue.
Mind any mirror auto-repair failure is caused by such lvmetad interaction
problems not yet solved so disabling lvmetad works as a resort as elaborated
on in the related bz.
Reintroducing the interim solution.
Resolves: rhbz1380521
Effectively revert whole 76f6951c3e.
We need to figure out some other solution.
At this moment usage of --config with 'repair' of blocked mirror
is 'freezing' combination.
For each section 8 man page, a .8_gen file is created from one of:
.8_main - Old-style man page - content used directly
.8_des and .8_end - Description and end section of a generated page
.8_pregen - Pre-generated page used if the generator fails
Other man sections are not generated and use the suffix .5_main or .7_main.
Developers should use 'make generate' to regenerate the .8_pregen files.
When a given command:
- matches command definition A based on required options,
but uses optional options that are not accepted by A.
- matches command definition B based on required options,
(but fewer required items than A), and uses no
unaccepted optional options.
then B is the correct choice. Command A would fail because
of the unaccepted optional options. The logic was mistakenly
letting A win because it had a greater number of required option
matches, without accounting for the optional option mismatches.
Systematically outline every combination of:
--type striped, --type mirror, --type raid, --mirrors, --stripes
and make sure each is assigned to one specific cmd def.
This revealed that a new command def is needed for
lvcreate that uses both --mirrors and --stripes
but no --type option.
The use of LVCREATE_RAID shortcut for an option set
resulted in mirrors/stripes being included in optional
opts set when they were already in the required list.
Sending %d as format argument in lvmetad_vg_remove_pending() will cause
segfaults in config_make_nodes_v() when va_arg() casts to int64_t. Also, it is
clearly advertised in the lvm source code that using plain %d is prohibited, so
let's switch to FMTd64.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Require that the path argument to dmfilemapd be an absolute path
and document this in tool output, libdevmapper.h and dmfilemapd.8.
The check is also enforced by dm_stats_start_filemapd() to avoid
forking a new process with an invalid path argument.
The initial check on argc incorrectly returns 1 when the wrong
number of arguments are present, causing a segfault in main()
when no arguments are given:
# dmfilemapd
Wrong number of arguments.
usage: dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Segmentation fault (core dumped)
Commit f4b30b0dae was about displaying visible LV size
when reshape space is allocated. Take parity devices
into account when displaying the user visible LV size.
As was recently done with relative signes for sizes/extents,
limit the signs used with the mirrors option, e.g.
lvcreate --mirrors now does not accept or advertise an
optional minus sign with the value. lvconvert --mirrors
accepts +|-.
Add a warning about maximum supported numbers of stripes
with striped LVs realtive to RAID conversions.
Add examples for a more elaborate, multi-step conversion
from linear to striped (and vice versa).
Shrink lvs examples output.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The LVCONVERT_RAID shortcut for including options ended
up including mirrors/stripes as optional opts in defs where
they were already required, or in defs where they would
not be used. Remove the option set and specify in each
case only the options wanted.
Better support for lvdisplay.
By default info about running (in kernel) cache status is printed.
To get 'segtype' info, user runs: 'lvdisplay -m', example:
--- Logical volume ---
LV Path /dev/vg/lvol0
LV Name lvol0
VG Name vg
LV UUID Y4uWuN-TBGk-duer-aPWl-yBWn-iFFR-RU1gg1
LV Write Access read/write
LV Creation host, time linux, 2017-03-01 20:52:39 +0100
LV Cache pool name lvol2
LV Cache origin name lvol0_corig
LV Status available
# open 0
LV Size 12,00 MiB
Cache used blocks 10,42%
Cache metadata blocks 0,49%
Cache dirty blocks 0,00%
Cache read hits/misses 112 / 34
Cache wrt hits/misses 133 / 0
Cache demotions 0
Cache promotions 20
Current LE 3
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Segments ---
Logical extents 0 to 2:
Type cache
Chunk size 64,00 KiB
Metadata format 1
Mode writethrough
Policy smq
Setting migration_threshold=100000
OO_LVCREATE_CACHE accepts --cachemetadataformat.
Support new option --cachemetadataformat auto|1|2 for caching.
Word 'auto' can be also be given as '0'.
Report CMFmt column with cache metadata format version.
Report KMFmt column with 'kernel cache metadata format version' for device.
(a value reported from status).
(Update 'CacheMode' to name 'Cache' as primary segtype).
Only cache-pool segtype may store cache_metadata_format.
Only supported values are 0,1,2
Format 2 requires LV status uses LV_METADATA_FORMAT.
Format 0 (unselected) or 1 shall not set this 'incompatible' status.
Cache pool read/writes metadata_format within its segment type..
For CachePoolLV unselected metadata format is NOT stored in metadata.
For CacheLV when metadata format is not present/selected in lvm2 metadata,
it's automatically assumed to be the version 1 (backward compatible).
To ensure older lvm2 will not 'miss-read' metadata with new version 2,
such LV is marked with METADATA_FORMAT status flag (segment is
specifying metadata format). So when cache uses metadata format 2,
it will become inaccesible on older system without such support.
(kernel dm cache < 1.10, lvm2 < 2.02.169).
Add new profilable configation setting to let user select
which metadata format of a created cache pool he wish to use.
By default the 'best' available format is autodetected at runtime,
but user may enforce format 1 or 2 ATM.
Code also detects availability for metadata2 supporting cache target.
In case of troubles user may easily Disable usage of this feature
by placing 'metadata2' into global/cache_disabled_features list.
Older library version was not detecting unknown 'feature' bits
and could let start target without needed option.
New versioned symbol now checks for supported feature bits.
_Base version keeps accepting only previously known features and
mask/ignores unknown bits.
NB: if the older binary passed in 'random' bits, it will not get
metadata2 by chance. New linked binary get new validation function.
Library user is required to not pass 'trash' for unsupported bits,
as such calls will be rejected.
Dm cache target version 1.10 introduces new cache metadata format
(upstream kernel >=4.11).
New format is enable by passing new target feature flag metadata2.
Interace side on libdm uses DM_CACHE_FEATURE_METADATA2.
This feature bit is now also recognized on status
and set in 'feature_flags' field of dm_status_cache structure.
Code also adds check for 'highest' supported feature flag bit.
So it rejects properly any 'unknown' feature bit set by application.
Better code to enforce writethrough caching for cleaner policy.
Only check for cleaner when DM_CACHE_FEATURE_PASSTHROUGH or
DM_CACHE_FEATURE_WRITEBACK is set.
As now we can properly recognize all paramerters for pool creation,
we may drop PASS_ARG_ defines and rely on '_UNSELECTED' or 0 entries
as being those without user given args.
When setting are not given on command line - 'update' function
fill them from profiles or configuration. For this 'profile' arg
was needed to be passed around and since 'VG' itself is not needed,
it's been all replaced with 'cmd, profile, extents_size' args.
Reuse same code for getting/setting cache parameters with lvcreate.
So there is single one place how to get vars from profiles and configs.
Variables declarations are moved to start of function and
there is no need to initialize them as they are always
defined by functions get_cache_params() and get_pool_params().
Since cache chunk might be huge and there is no technical need
to enforce rounding and there is actually more 'real' VG space
used then necessary - keep rounding on 'chunk' bounrary only
for thin volumes - where it's the space used anyway.
NB: we support conversion of any-size 'existing' LV into cached LV.
Fix missing reset of '*settings' pointer when no args were given.
Handle cache_chunk settings like all other settings, so it is properly
updated only with non-zero settings and the existing cache-pool
chunk_size is not being reconfigured.
User can specify metadata profile which stores important cache
geometry data for easy configuration.
Fix missing support for getting chunk_size, cache_mode, cache_policy
for a cache/cache pools volumes from configuration or metadata profile.
Split-off patch that just replaces returns with 'goto bad'
so there is single place to release policy_settings.
In the next patch we will start to use some shared
function call between lvconvert and lvcreate
(effectively restoring previous logic which has got lost).
To more easily recognize unselected state from select '0' state
add new 'THIN_ZERO_UNSELECTED' enum.
Same applies to THIN_DISCARDS_UNSELECTED.
For those we no longer need to use PASS_ARG_ZERO or PASS_ARG_DISCARDS.
Basically code moving operation to have a single place resolving
thin_pool_chunk_size_policy.
Supported are generic & performance profiles.
Function is now shared between thin manipulation code and configuration
_CFG logic to obtain defaults and handle correct reporting upward coding
stack.
Test for NULL in dm_stats_destroy() and return immediately if
the struct dm_stats pointer is NULL (similar to free(NULL)).
This simplifies cleanup code which otherwise needs to:
out:
if (dms)
dm_stats_destroy(dms);
return;
Older compilers cannot tell that the 'mode' variable is only
used in branches in which it is assigned:
dmsetup.c:5651: warning: "mode" may be used uninitialized in this function
dmsetup.c:5023: warning: "mode" may be used uninitialized in this function
Avoid this by always assigning the variable a value.
Older compilers are not able to determine that although group_id
is only assigned in one branch of a conditional, it is never used
used when the other branch is taken:
libdm-stats.c:3319: warning: "group_id" may be used uninitialized in this function
Avoid this by always initialising the variable when it is
declared.
Utilizing the --config option we will utilize global/notify_dbus=0 so
that the service itself doesn't generate change events which it then needs to
process.
We need to place query operations in the queue to prevent the case where
a client knows of something before the service does. For example if a
client creates a PV/VG/LV outside of the dbus API and then immediately
tries to lookup and use that resource in the lvm dbus service it should
be present. By placing the queries in the work queue any previous
refresh operation will complete before we process the query.
In addition to the already supported conversion between 2-legged
raid1 and raid5, raid1 and raid4 can be also converted into each
other with 2 legs (raid4/5 are limited to map a 2-legged raid1).
This patch supports the missing raid4 conversion in the sequence
linear -> 2-legged raid1 -> raid4/5, then restripe to more than one
data stripes for performance and resilience reasons and optionally
convert to striped/raid0.
The other conversion sequence is also possible by converting N-way
striped/raid0 to raid4/5, then restripe to 2 legs followed by a
conversion to raid1 and optionally to linear (loosing all resilience).
Add examples for reshaping number of stripes
and converting from raid6 to striped to raid10.
Remove trailing spaces.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The filemap daemon takes its program_id from the regions it is
managing: use DM_STATS_ALL_PROGRAMS when retrieving an initial
listing and then obtain the correct program_id from the group
leader.
On conversion from striped to raid0, data LVs are created
and all segments and their respective areas of the striped
LV are moved across to new segments allocated for the raid0
image LVs. This can cause non-canonical segments to be added
to the image LVs.
Add a call to lv_merge_segments() once all segments have been
added to an image LV to compensate for that. This avoids
unsafe table loads on activation.
Fix comments.
Automatic dmeventd repair of mirrors with active lvmetad configured
(mirror_image_fault_policy = "allocate") fails because the lvscan
run before the repair in the mirror DSO does not update the
lvmetad cache properly thus "lvconvert --repair ..." fails.
Need to scan the mirror LV before and after the repair
to have proper cache content after the repair finished.
The cache can't be relied on or the repair will fail.
Resolves: rhbz1380521
Launch an instance of the filemap monitoring daemon when creating,
or updating, a file mapped group, unless the --nomonitor switch is
given.
Unless --foreground is given the daemon will detach from the
terminal and run in the background until it is signaled or the
daemon termination conditions are met.
The --follow={inode|path} switch is added to control the daemon
behaviour when files are moved, unlinked, or renamed while they
are being monitored.
The daemon runs with the same verbosity as the dmstats command
that starts it.
Add a daemon that can be launched to monitor a group of regions
corresponding to the extents of a file, and to update the regions as the
file's allocation changes.
The daemon is intended to be started from a library interface, but can
also be run from the command line:
dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Where fd is a file descriptor open on the mapped file, group_id is the
group identifier of the mapped group and mode is either "inode" or
"path". E.g.:
# dmfilemapd 3 0 vm.img inode 1 3 3<vm.img
...
If foreground is non-zero, the daemon will not fork to run in the
background. If verbose is non-zero, libdm and daemon log messages will
be printed.
It is possible for the group identifier to change when regions are
re-mapped: this occurs when the group leader is deleted (regroup=1 in
dm_stats_update_regions_from_fd()), and another region is created before
the daemon has a chance to recreate the leader region.
The operation is inherently racey since there is currently no way to
atomically move or resize a dm_stats region while retaining its
region_id.
Detect this condition and update the group_id value stored in the
filemap monitor.
A function is also provided in the the stats API to launch the filemap
monitoring daemon:
int dm_stats_start_filemapd(int fd, uint64_t group_id, const char *path,
dm_filemapd_mode_t mode, unsigned foreground,
unsigned verbose);
This carries out the first fork and execs dmfilemapd with the arguments
specified.
A dm_filemapd_mode_t value is specified by the mode argument: either
DM_FILEMAPD_FOLLOW_INODE, or DM_FILEMAPD_FOLLOW_PATH. A helper function,
dm_filemapd_mode_from_string(), is provided to parse a string containing
a valid mode name into the appropriate dm_filemapd_mode_t value.
It's not an error to call dm_stats_group_present() on a handle
that contains no regions.
This causes dmfilemap to log a false backtrace during shutdown
if all regions are removed from the corresponding device:
exiting _filemap_monitor_get_events() with deleted=0, check=0
waiting for FILEMAPD_WAIT
dm message (253:1) [ opencount flush ] @stats_list dmstats [32768] (*1)
<backtrace>
Filemap group removed: exiting.
Change this to only emit a backtrace if the handle is NULL.
Splitting off an image LV of a 2-legged
raid1 LV causes loss of resilience.
Ask user to avoid uninformed loss of all resilience.
Don't ask for N > 2 legged raid1 LVs.
Adjust tests.
Splitting off an image LV of a 2-legged raid1 LV tracking changes
causes loosing partial resilience for any newly written data set.
Full resilience will be provided again after the split off image LV
got merged back in and the new data set got fully synchronized.
Reason being that the data is only stored on the remaining single
writable image during the split.
Ask user to avoid uninformed loss of such partial resilience.
Don't ask for N > 2 legged raid1 LVs.
In case N images fail (N <= parity chunks) _and_
a "vgreduce --removemissing --force VG" was applied
a following repair of the RaidLV fails:
Unable to remove N images: Only 0 devices given.
Failed to remove the specified images from tb/r.
Failed to replace faulty devices in tb/r.
Fix as of this commit results in correct repair:
Faulty devices in tb/r successfully replaced.
Add new values for different sign variations, resulting in:
size_VAL no sign accepted
ssize_VAL accepts + or -
psize_VAL accepts +
nsize_VAL accpets -
extents_VAL no sign accepted
sextents_VAL accepts + or -
pextents_VAL accepts +
nextents_VAL accepts -
Depending on the command being run, change the option
values for --size, --extents, --poolmetadatasize to
use the appropriate value from above.
lvcreate uses no sign (but accepts + and ignores it).
lvresize accepts +|- for a relative change.
lvextend accepts + for a relative change.
lvreduce accepts - for a relative change.
Like opt and val arrays in previous commit, combine duplicate
arrays for lv types and props in command.c and lvmcmdline.c.
Also move the command_names array to be defined in command.c
so it's consistent with the others.
command.c and lvmcmdline.c each had a full array defining
all options and values. This duplication was not removed
when the command.c code was merged into the run time.
seg->extents_copied has to be defined properly on reducing
the size of a raid LV or conversion from raid5 with 1 stripe
to raid1 will fail.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Reshaping a raid5 LV to one stripe aiming to convert it to
raid1 (and optionally to linear) reports the wrong LV size
when still having reshape space allocated.
The lv_extend/_lv_reduce API doesn't cope with resizing RaidLVs
with allocated reshape space and ongoing conversions. Prohibit
resizing during conversions and remove the reshape space before
processing resize. Add missing seg->data_copies initialisation.
Fix typo/comment.
For the time being raid10 is limited to even number of total stripes
as is and 2 data copies. The number of stripes provided on creation
of a raid10(_near) LV with -i/--stripes gets doubled to define
that even total number of stripes (i.e. images).
Apply the same on disk adding conversions (reshapes) with
"lvconvert --stripes RaidLV" (e.g. 2 stripes = 4 images
total converted to 3 stripes = 6 images total).
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Part of the UNIT description was still living in the
--size description. Move it to the Size[UNIT] description
since it is used by other options, not just --size.
In lvcreate/lvconvert, --poolmetdatasize can only be an
absolute value, but in lvresize/lvextend, --poolmetadatasize
can be given a + relative value.
The val types only currently support relative values that
go in both directions +|-. Further work is needed to add
val types that can be relative in only one direction, and
switching various option values to one those depending on
the command name. Then poolmetdatasize will not appear
with a +|- option in lvcreate/lvconvert, and will
appear with only the + option in lvresize/lvextend.
Enhance the raid report functions for the recently added LV fields
reshape_len, reshape_len_le, data_offset, new_data_offset, data_copies,
data_stripes and parity_chunks to cope with "lvs --select".
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Clean up and correct the details around --extents and --size.
lvcreate/lvresize/lvreduce/lvextend all now display the
extents option in usages.
The Size and Number value variables for --size and --extents
are now displayed without the [+|-] prefix for lvcreate.
A special case is needed to display
--extents for lvcreate since the cmd defs
treat --extents as an automatic alternative
to --size (to avoid duplicating every cmd def).
Now that we got the "data_stripes" field key, adjust the "stripes" field description.
Enhance the "regionsize" field description to cover raids as well.
When a cmd def includes multiple sets of options (OO_FOO),
allow multiple OO_FOO sets to contain the same option and
avoid repeating it in the cmd def.
There was confusion in the code about whether or not the
--size option accepted a sign. Make it consistent and clear
that it does.
This exposes a new problem in that an option can only
accept one value type, e.g. --size can only accept a
signed number, it cannot accept a positive or negative
number for some commands and reject negative numbers for
others.
In practice, lvcreate accepts only positive --size
values and lvresize accepts positive or negative --size
values. There is currently no way to encode this
difference. Until that is fixed, the man page output
is hacked to avoid printing the [+|-] prefix for sizes
in lvcreate.
Settings specified in other command line args take precedence over
profiles and --config, which takes precedence over settings in actual
config files.
Since commit 1e2420bca8 ('commands: new
method for defining commands') commands like this:
lvchange --config 'global/test=1' -ay vg
have been printing the 'TEST MODE' message, but nevertheless making
real changes.
Commit 48778bc503 introduced new RAID reshaping related report fields.
The inclusioon of segtype.h in properties.c isn't mandatory, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 80a6de616a versioned the dm_tree_node_add_raid_target_with_params()
and dm_tree_node_add_raid_target() APIs for compatibility reasons.
There's no user of the latter function, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
During an ongoing reshape, the MD kernel runtime reads stripes relative
to data_offset and starts storing the reshaped stripes (with new raid
layout and/or new stripesize and/or new number of stripes) relative
to new_data_offset. This is to avoid writing over any data in place
which is non-atomic by nature and thus be recoverable without data loss
in the transition. MD uses the term out-of-place reshaping for it.
There's 2 other areas we don't have report capability for:
- number of data stripes vs. total stripes
(e.g. raid6 with 7 stripes toal has 5 data stripes)
- number of (rotating) parity/syndrome chunks
(e.g. raid6 with 7 stripes toal has 2 parity chunks; one
per stripe for P-Syndrome and another one for Q-Syndrome)
Thus, add the following reportable keys:
- reshape_len (in current units)
- reshape_len_le (in logical extents)
- data_offset (in sectors)
- new_data_offset ( " )
- data_stripes
- parity_chunks
Enhance lvchange-raid.sh, lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh, lvconvert-raid-reshape.sh
and lvconvert-raid-takeover.sh to make use of new keys.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add/remove the SECONDARY_SYNTAX flag to cmd defs.
cmd defs with this flag will be listed under the
ADVANCED USAGE man page section, so that the main
USAGE section contains the most common commands
without distraction.
- When multiple cmd defs do the same thing, one variant
can be displayed in the first list.
- Very advanced, unusual or uncommon commands should be
in the second list.
Recently added check for reshaping in this function called for
a cleanup to avoid proliferating it with more explicit conditionals.
Base the reshaping check on the given _features array.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
An initialization was missing when converting striped to raid0(_meta)
causing unitialized reshape_len in the new component LVs first segment.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The imposed minimum region size can cause rejection on
disk removing reshapes. Lower it to avoid that.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 27384c52cf lowered the maximum number of devices
back to 64 for compatibility.
Because more members have been added to the API in
'struct dm_tree_node_raid_params *', we have to version
the public libdm RAID API to not break any existing users.
Changes:
- keep the previous 'struct dm_tree_node_raid_params' and
dm_tree_node_add_raid_target_with_params()/dm_tree_node_add_raid_target()
in order to expose the already released public RAID API
- introduce 'struct dm_tree_node_raid_params_v2' and additional functions
dm_tree_node_add_raid_target_with_params_v2()/dm_tree_node_add_raid_target_v2()
to be used by the new lvm2 lib reshape extentions
With this new API, the bitfields for rebuild/writemostly legs in
'struct dm_tree_node_raid_params_v2' can be raised to 256 bits
again (253 legs maximum supported in MD kernel).
Mind that we can limit the maximum usable number via the
DEFAULT_RAID{1}_MAX_IMAGES definition in defaults.h.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
- Combine the equivalent lvconvert --type raid defs.
(Two cmd defs must be different without relying
on LV type, which are not known at the time the
cmd def is matched.)
- Remove unused optional options from lvconvert --stripes,
and lvconvert --stripesize.
- Use Number for --stripes_long val type.
- Combine the cmd def for raid reshape cleanup into the
existing start_poll cmd def (they were duplicate defs).
Calls into the raid code from a poll opertion will be
added.
Commit 64a2fad5d6 raised the maximum number of RAID devices to 64.
Commit e2354ea344 introduced RAID_BITMAP_SIZE as 4 to have
256 bits (4 * 64 bit array members), thus changing the libdm API
unnecessarilly for the time being.
To not change the API, reduce RAID_BITMAP_SIZE to 1.
Remove an unneeded definition of it from libdm-common.h.
If we ever decide to raise past 64, we'll version the API.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The fedorahosted git repository shuts down tomorrow:
https://communityblog.fedoraproject.org/fedorahosted-sunset-2017-02-28/
Our upstream git repository has moved back to sourceware.org.
Mailing list hosting is not changing.
Gitweb:
https://www.sourceware.org/git/?p=lvm2
Git:
git://sourceware.org/git/lvm2.git
ssh://sourceware.org/git/lvm2.git
http://sourceware.org/git/lvm2.git
Example command to change the origin of a repository clone:
Public:
git remote set-url origin git://sourceware.org/git/lvm2.git
Committers:
git remote set-url origin git+ssh://sourceware.org/git/lvm2.git
The options list was sorted as:
- options with both long and short forms, alphabetically
- options with only long form, alphabetically
This was done only for the visual effect. Change to
sort alphabetically by long opt, without regard to
short forms.
Fixes commit 286d39ee3c, which was correct except
for a reversed strstr. Now uses strchr, and modifies
a copy of the name so the original argv is preserved.
When requesting a regionsize change during conversions, check
for constraints or the command may fail in the kernel n case
the region size is too smalle or too large thus leaving any
new SubLVs behind.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Allow regionsize on upconvert from linear:
fix related commit 2574d3257a to actually work
Related: rhbz1394427
Remove setting raid5_n on conversions from raid1
as of commit 932db3db53 because any raid5 mapping
may be requested.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add aforementioned but forgotten new test scripts
lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh and
lvconvert-raid-reshape.sh
Those presume dm-raid target version 1.10.2
provided by a following kernel patch.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Because of contraints in renaming shifted rimage/rmeta LV names
the current RaidLV limit is a maximum of 10 SubLV pairs.
With the previous introduction of reshaping infratructure that
constriant got removed.
Kernel supports 253 since dm-raid target 1.9.0, older kernels 64.
Raise the maximum number of RaidLV rimage/rmeta pairs to 64.
If we want to raise past 64, we have to introdce a check for
the kernel supporting it in lvcreate/lvconvert.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces the changes to call the reshaping infratructure
from lv_raid_convert().
Changes:
- add reshaping calls from lv_raid_convert()
- add command definitons for reshaping to tools/command-lines.in
- fix raid_rimage_extents()
- add 2 new test scripts lvconvert-raid-reshape-linear_to_striped.sh
and lvconvert-raid-reshape-striped_to_linear.sh to test
the linear <-> striped multi-step conversions
- add lvconvert-raid-reshape.sh reshaping tests
- enhance lvconvert-raid-takeover.sh with new raid10 tests
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- allow raid_rimage_extents() to calculate raid10
- remove an __unused__ attribute
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- add missing raid1 <-> raid5 conversions to support
linear <-> raid5 <-> raid0(_meta)/striped conversions
- rename related new takeover functions to
_takeover_from_raid1_to_raid5 and _takeover_from_raid5_to_raid1,
because a reshape to > 2 legs is only possible with
raid5 layout
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- enhance _clear_meta_lvs() to support raid0 allowing
raid0_meta -> raid10 conversions to succeed by clearing
the raid0 rmeta images or the kernel will fail because
of discovering reordered raid devices
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- enhance _raid45610_to_raid0_or_striped_wrapper() to support
raid5_n with 2 areas to raid1 conversion to allow for
striped/raid0(_meta)/raid4/5/6 -> raid1/linear conversions;
rename it to _takeover_downconvert_wrapper to discontinue the
illegible function name
- enhance _striped_or_raid0_to_raid45610_wrapper() to support
raid1 with 2 areas to raid5* conversions to allow for
linear/raid1 -> striped/raid0(_meta)/raid4/5/6 conversions;
rename it to _takeover_upconvert_wrapper for the same reason
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add missing possible reshape conversions and conversion options
to allow/prohibit changing stripesize or number fo stripes
- enhance setting convenient riad types in reshape conversions
(e.g. raid1 with 2 legs -> radi5_n)
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add _raid_reshape() using the pre/post callbacks and
the stripes add/remove reshape functions introduced before
- and _reshape_requested function checking if a reshape
was requested
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add vg metadata update functions
- add pre and post activation callback functions for
proper sequencing of sub lv activations during reshaping
- move and enhance _lv_update_reload_fns_reset_eliminate_lvs()
to support pre and post activation callbacks
- add _reset_flags_passed_to_kernel() which resets anyxi
rebuild/reshape flags after they have been passed into the kernel
and sets the SubLV remove after reshape flags on legs to be removed
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function to support disk adding reshapes
- add function to support disk removing reshapes
- add function to support layout (e.g. raid5ls -> raid5_rs)
or stripesize reshaping
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function providing state of a reshaped RaidLV
- add function to adjust the size of a RaidLV was
reshaped to add/remove stripes
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add lv_raid_data_copies returning raid type specific number;
needed for raid10 with more than 2 data copies
- remove _shift_and_rename_image_components() constraint
to support more than 10 raid legs
- add function to calculate total rimage length used by out-of-place
reshape space allocation
- add out-of-place reshape space alloc/relocate/free functions
- move _data_rimages_count() used by reshape space alloc/realocate functions
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces local infrastructure to raid_manip.c
used by followup patches.
Add functions:
- to check reshaping is supported in target attibute
- to return device health string needed to check
the raid device is ready to reshape
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces infrastructure prerequisites to be used
by raid_manip.c extensions in followup patches.
This base is needed for allocation of out-of-place
reshape space required by the MD raid personalities to
avoid writing over data in-place when reading off the
current RAID layout or number of legs and writing out
the new layout or to a different number of legs
(i.e. restripe)
Changes:
- add members reshape_len to 'struct lv_segment' to store
out-of-place reshape length per component rimage
- add member data_copies to struct lv_segment
to support more than 2 raid10 data copies
- make alloc_lv_segment() aware of both reshape_len and data_copies
- adjust all alloc_lv_segment() callers to the new API
- add functions to retrieve the current data offset (needed for
out-of-place reshaping space allocation) and the devices count
from the kernel
- make libdm deptree code aware of reshape_len
- add LV flags for disk add/remove reshaping
- support import/export of the new 'struct lv_segment' members
- enhance lv_extend/_lv_reduce to cope with reshape_len
- add seg_is_*/segtype_is_* macros related to reshaping
- add target version check for reshaping
- grow rebuilds/writemostly bitmaps to 246 bit to support kernel maximal
- enhance libdm deptree code to support data_offset (out-of-place reshaping)
and delta_disk (legs add/remove reshaping) target arguments
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
(Change to recent commit 3f4ecaf8c2.)
Use --foo Number[k|UNIT] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Previously, underlined/italic Unit was used,
like other of variables, but this UNIT is more
like a shortcut than an actual variable.
The MD kernel raid1 personality does no use any writemostly leg as the primary.
In case a previous linear LV holding data gets upconverted to
raid1 it becomes the primary leg of the new raid1 LV and a full
resynchronization is started to update the new legs.
No writemostly and/or writebehind setting may be allowed during
this initial, full synchronization period of this new raid1 LV
(using the lvchange(8) command), because that would change the
primary (i.e the previous linear LV) thus causing data loss.
lvchange has a bug not preventing this scenario.
Fix rejects setting writemostly and/or writebehind on resychronizing raid1 LVs.
Once we have status in the lvm2 metadata about the linear -> raid upconversion,
we may relax this constraint for other types of resynchronization
(e.g. for user requested "lvchange --resync ").
New lvchange-raid1-writemostly.sh test is added to the test suite.
Resolves: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855895
We use --foo Number[k|Units] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Capitalize and underline Units so it is consistent
with other variables, and reference it at the end.
Technically, the k should be bold, but this
tends to make the text visually hard to read
because of the excessive highlights scattered
everywhere. So it's left normal text for now
(it's unlikely to confuse anyone.)
Print ... after a repeatable option in the OPTIONS section.
An alternative would be to just mention in the text description
that the option is repeatable.
When the test would need to try to write some large amount of data
we can give it 'zero' segments - for obvious reason such written data
can't be verified but in some test cases it doesn't really matter.
Usage follows 'error_dev' style.
For now test suite is not supporting any combination of
error/delay/zero segments so only 1 type could be used per PV.
Previously when lvremove tried to remove 'active' origin,
it had been asking for every 'snapshot' LV separately
and doing individual single snapshot removals first.
To be faster it now deactivates origin before removal
all connected snapshots.
This avoids multiple reloads of dm table for origin volume
which were unnecessary as origin was meant to be removed as well.
There are two kinds of common options:
1. options common to all variants of a given command name
2. options common to all lvm commands
Previously, both kinds of common options were listed together
under "Common options". Now the first are printed under
"Common options for command" (when needed), and the second
are printed under "Common options for lvm" (always).
Remove the "usage notes" which should just
live in the man pages.
When there are 3 or more variants of a command,
print all the options produces a lot of output,
so require --longhelp to print all the options
in these cases.
Check 'dmsetup version' is called before starting any more
advanced logic in $DM_DEV_DIR.
Call also replaces mkdir as it creates needed path with control node.
The --type option has previously been accepted for
lvresize/lvextend. Using it did not affect the operation
of the command. The value was simply verified as matching
the current seg type of the LV.
Commit f45b689406 caused regression
of lvresize -m and --type parameter
After fix this sequence may work when we also fix syntax description:
lvcreate -l1 -m1 -n lv1 vg
lvextend --type mirror -m1 -l+1 vg/lv1
For this syntax:
lvconvert --thinpool LV1 --poolmetadata LV2
lvconvert --cachepool LV1 --poolmetadata LV2
Restore the metadata swapping behavior in addition to
the pool creation behavior. When LV1 is already a pool,
the metadata LV will be swapped with LV2.
When LV1 is not a pool, it will be converted to a
pool using the specified LV for metadata.
This syntax is no longer advertised because of the
ambiguous behavior. The primary syntaxes for pool
creation and metadata swapping will be the advertised
methods.
As we now user binary search - it's nondeterministict
which of the same 'args' will be give - so duplicates
need 'extra' care.
So provide same hack for output for --uuidstr_ARG as
for input.
Solves 'pvscan -u'.
Since there is a lot of options and lot of searches,
use binary search to keep strcmp at minimum.
The interesting part is - alphabetically sorted array contains
duplicates and some of them are not the 'right anwer', so
after we find matching string but not matching long_ARG,
we may need to check if the surrouding strings are the right matching
one.
The single loops is used also for strictly define --foo_long
(i.e. --stripes) and just differs at final part.
TODO1: replace strstr call with some flag (just like short_opt).
TODO2: drop '--' from being stored and tests by strcmp.
When parsing command defs, track and report all
errors that are found. Add an error return case
from define_commands so the standard error exit
path is used.
When using liblvm2cmd, a process can do multiple
init/exit calls, i.e.
lvm2_init(); lvm2_run(); lvm2_exit();
lvm2_init(); lvm2_run(); lvm2_exit();
...
define_commands() needs to set up the global commands[]
definitions only the first time.
The old ad hoc arg parsing when combining a split snapshot
allowed the first lv to have a vgname, but not the second.
Since lvconvert now uses the standard arg parsing in
process_each_lv(), the old one-off behavior requires a
work around.
This reverts commit 717363bb94.
These alternate forms for swapping metadata cannot be
distinguished from the command for creating a pool.
If we were to add these alternate forms for swapping
metadata, we would need to overload the pool creation
command defs, making those definitions ambiguous.
Change run time access to the command_name struct
cmd->cname instead of indirectly through
cmd->command->cname. This removes the two run time
fields from struct command.
All lvconvert functionality has been moved out of the
previous monolithic lvconvert code, except conversions
related to raid/mirror/striped/linear. This switches
that remaining code to be based on command defs, and
standard process_each_lv arg processing. This final
switch results in quite a bit of dead code that is
also removed.
This is a new explicit version of 'lvconvert LV'
which has been an obscure way of triggering polling
to be restarted on an LV that was previously converted.
Lift all the snapshot utilities (merge, split, combine)
out of the monolithic lvconvert implementation, using
the command definitions. The old code associated with
these commands is now unused and will be removed separately.
This lifts the lvconvert --repair and --replace commands
out of the monolithic lvconvert implementation. The
previous calls into repair/replace can no longer be
reached and will be removed in a separate commit.
The new check_single_lv() function is called prior to the
existing process_single_lv(). If the check function returns 0,
the LV will not be processed.
The check_single_lv function is meant to be a standard method
to validate the combination of specific command + specific LV,
and decide if the combination is allowed. The check_single
function can be used by anything that calls process_each_lv.
As commands are migrated to take advantage of command
definitions, each command definition gets its own entry
point which calls process_each for itself, passing a
pair of check_single/process_single functions which can
be specific to the narrowly defined command def.
. Define a prototype for every lvm command.
. Match every user command with one definition.
. Generate help text and man pages from them.
The new file command-lines.in defines a prototype for every
unique lvm command. A unique lvm command is a unique
combination of: command name + required option args +
required positional args. Each of these prototypes also
includes the optional option args and optional positional
args that the command will accept, a description, and a
unique string ID for the definition. Any valid command
will match one of the prototypes.
Here's an example of the lvresize command definitions from
command-lines.in, there are three unique lvresize commands:
lvresize --size SizeMB LV
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync, --reportformat String, --resizefs,
--stripes Number, --stripesize SizeKB, --poolmetadatasize SizeMB
OP: PV ...
ID: lvresize_by_size
DESC: Resize an LV by a specified size.
lvresize LV PV ...
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --resizefs, --stripes Number, --stripesize SizeKB
ID: lvresize_by_pv
DESC: Resize an LV by specified PV extents.
FLAGS: SECONDARY_SYNTAX
lvresize --poolmetadatasize SizeMB LV_thinpool
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --stripes Number, --stripesize SizeKB
OP: PV ...
ID: lvresize_pool_metadata_by_size
DESC: Resize a pool metadata SubLV by a specified size.
The three commands have separate definitions because they have
different required parameters. Required parameters are specified
on the first line of the definition. Optional options are
listed after OO, and optional positional args are listed after OP.
This data is used to generate corresponding command definition
structures for lvm in command-lines.h. usage/help output is also
auto generated, so it is always in sync with the definitions.
Every user-entered command is compared against the set of
command structures, and matched with one. An error is
reported if an entered command does not have the required
parameters for any definition. The closest match is printed
as a suggestion, and running lvresize --help will display
the usage for each possible lvresize command.
The prototype syntax used for help/man output includes
required --option and positional args on the first line,
and optional --option and positional args enclosed in [ ]
on subsequent lines.
command_name <required_opt_args> <required_pos_args>
[ <optional_opt_args> ]
[ <optional_pos_args> ]
Command definitions that are not to be advertised/suggested
have the flag SECONDARY_SYNTAX. These commands will not be
printed in the normal help output.
Man page prototypes are also generated from the same original
command definitions, and are always in sync with the code
and help text.
Very early in command execution, a matching command definition
is found. lvm then knows the operation being done, and that
the provided args conform to the definition. This will allow
lots of ad hoc checking/validation to be removed throughout
the code.
Each command definition can also be routed to a specific
function to implement it. The function is associated with
an enum value for the command definition (generated from
the ID string.) These per-command-definition implementation
functions have not yet been created, so all commands
currently fall back to the existing per-command-name
implementation functions.
Using per-command-definition functions will allow lots of
code to be removed which tries to figure out what the
command is meant to do. This is currently based on ad hoc
and complicated option analysis. When using the new
functions, what the command is doing is already known
from the associated command definition.
Some archs can use even 64K pages and then lvm2 runs into trouble if
the stack is 'too small' to fit extra page capturing stack overwrite.
So when lvm2 limits stack - add extra mem page - be it 4K or 64K.
Relates to ppc64le bug: https://bugzilla.redhat.com/1387279
Remove allocate_pvs from raid_manip.c:_region_size_change_request() API
and lv_extend() using it added for temporary test purpose.
Related: rhbz1366296
Add:
- region size checks to raid_manip.c types array and supporting functions
- tests to lvconvert-raid-takeover.sh to check bogus
"lvconvert --type --regionsize N " requests
Related: rhbz1366296
When we preload device with smaller size, we avoid its resume,
so later suspend/resume of full device tree my process all
existing in flight bios.
Also update comment and avoid using confusing opposite meaning.
Kernel 4.10 (dm-crypt v1.15.0) and later supports loading device
tables with crypt segment having key in kernel keyring retention
service.
dmsetup hid key section of tables output. With this patch dmsetup
no longer hides key section if it detects kernel key description
instead of hex byte representation of key itself.
Add:
- conversion support from striped/raid0/raid0_meta to/from raid10;
raid10 goes by the near format (same as used in creation of
raid10 LVs), which groups data copies together with their original
blocks (e.g. 3-way striped, 2 data copies resulting in 112233 in the
first stripe followed by 445566 in the second etc.) and is limited
to even numbers of legs for now
- related tests to lvconvert-raid-takeover.sh
- typo
Related: rhbz1366296
- support shrinking of raid0/1/4/5/6/10 LVs
- enhance lvresize-raid.sh tests: add raid0* and raid10
- fix have_raid4 in aux.sh to allow lv_resize-raid.sh
and other scripts to test raid4
Resolves: rhbz1394048
Commit cfb6ef654d introduced
support to change RAID region size.
Fix:
- don't change region_size until after prompting the user
- use log_print_unless_silent() instead of log_warn()
- avoid superfluous sigint() calls which are already
covered in yes_no_prompt()
- typo
Related: rhbz1392947
Commit cfb6ef654d introduced
support to change RAID region size.
Add:
- missing conditions to support any types to function with
it in lv_raid_convert(); temporary workaround used until
cli validation patches get merged
- tests requesting "-R " to lvconvert-raid-takeover.sh
involving a cleanup of the script
Related: rhbz1392947
Cleanups as of Jons review:
- enhance comment about mandatory raid4 <-> raid5_n activation w/o metadata SubLVs
- remove bogus segment flag setting
- fix to sync related comments on conversions to raid0/striped and amongst raid4/5
- add missing error message for non-synced conversion to raid0/striped
Related: rhbz1366296
Add:
- support to change region size of existing RaidLVs
(all RAID LV types but raid0/raid0_meta)
- lvconvert-raid-regionsize.sh with test variations
for different RAID types and region sizes
Resolves: rhbz1392947
Well waiting for zeroing may take enough time to finish 'raid' sync.
So make the test running faster without zeroing and better avoid race
to have chance to happen (i.e. lvcreate is finished after array
gets already in sync).
Add:
- support for segment types raid6_{ls,rs,la,ra}_6
(striped raid with dedicated last Q-Syndrome SubLVs)
- conversion support from raid5_{ls,rs,la,ra} to/from raid6_{ls,rs,la,ra}_6
- setting convenient segtypes on conversions from/to raid4/5/6
- related tests to lvconvert-raid-takeover.sh factoring
out _lvcreate,_lvconvert funxtions
Related: rhbz1366296
Add:
- support for segment type raid6_n_6 (striped raid with dedicated last parity/Q-Syndrome SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid6_n_6
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add:
- support for segment type raid5_n (striped raid with dedicated last parity SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid5_n
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add a new update_filemap command to dmstats that allows a filemap
group to be updated:
# dmstats update_filemap --groupid 0 vm.img
/var/lib/libvirt/images/vm.img: Updated group ID 0 with 137 region(s).
This will update the set of regions mapped to the file to reflect
the current file system allocation.
Currently this needs to be run manually - a future update will add
support for monitoring file maps via a daemon, allowing them to be
automatically updated when the underlying file is modified.
Add a call to update the regions corresponding to a file mapped
group of regions. The regions to be updated must be grouped, to
allow us to correctly identify extents that have been deallocated
since the map was created.
Tables are built of the file extents, and the extents currently
mapped to dmstats regions: if a region no longer has a matching
file extent, it is deleted, and new regions are created for any
file extents without a matching region.
The FIEMAP call returns extents that are currently in-memory (or
journaled) and awaiting allocation in the file system. These have
the FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC flag bits set
in the fe_flags field - these extents are skipped until they
have a known disk location.
Since it is possile for the 0th extent of the file to have been
deallocated this must also handle the possible deletion and
re-creation of the group leader: if no other region allocation
is taking place the group identifier will not change.
If the group_id passed to _stats_group_id_present is equal to the
special value DM_STATS_GROUP_NOT_PRESENT there is no need to perform
any further tests: return false immediately.
For more advanced support we need to ensure better logic for calling
external much more advanced script for maintanance of thin-pool.
So this new code ensures:
When thin-pool data or metadata is bigger then 50%,
then with each 5% increment, action is called.
This is independent from autoextend_threshold.
This action always happens when thin-pool is over threshold,
(so no action when it's exactly i.e. 60%).
The only exception is 100% full thin-pool - which invokes 'last'
action.
Since thin-pool occupancy may change also downward, code needs
to also handle possibly reduction of occupancy of thin-pool.
So when usage drop from 90% to 50%, thin-pool will start to call
again action when it will pass 55% threshold.
This give external commands lot of option i.e. to call 'fstrim'
before actual resize is needed.
Default internal logic will stop trying to do any 'rescue' action
when executed command fails.
This will be now fully in hands of external script if such
behaviour is needed.
Instead of stopping monitoring after couple failing retries,
keep monitoring forever, just make larger delays between command
retries (ATM upto ~42 minutes).
So syslog is not spammed too often, yet commands have a chance to
be retried and succeed eventually...
When dmeventd configured command does not start with 'lvm ' prefix,
it's going to be an 'external' command.
In this case we split command by spaces to argv strings.
When showing sizes with 'H|human' units we do use standard rounding.
This however is confusing users from time to time,
when the printed number uses some biger units i.e. GiB and there is just
tiny fraction of space missing.
So here is some real-life example with new 'r' unit.
$lvs
LV VG Attr LSize Pool Origin
lvol0 vg -wi-a----- 1.99g
lvol1 vg -wi-a----- <2.00g
lvol2 vg -wi-a----- <2.01g
Meaning is - lvol1 has 'slightly' less then 2.00g - from sign '<' user
can be aware the LV doesn't have full 2.00GiB in size so he
will be less surpriced allocation of 2G volume will not succeed.
$ vgs
VG #PV #LV #SN Attr VSize VFree
vg 2 2 0 wz--n- <6,00g <2,01g
For uses needing 'old' undecorated human unit simply will continue
to use 'H|h' units.
The new R|r may further change when we would recongnize some
other way how to improve readability.
With recent commit d6a74025df using
INTERNAL_ERROR while cheking layer LV - it's been noticed mirror
logic currently doesn't do a correct thing during upconversion and
does a full-try instead of checking only allocator capabilities.
This leads to invalid usage of layer.
To keep existing code running before providing a fix, relax
INTERNAL_ERROR just an error and keep the 'code' running.
Once mirror code is fixed, these all check should be switched
to internal errors.
Since we still experience occasiaonal test failure - slow
things down even more to avoid race.
Add support for 'quick' table changes between normal & delayed tables.
This was missing piece in 77997c7673.
When merging origin is inactive (while driver is loaded) we
could already report merge in progress values as there is
no way to activate 'old state' now.
Show proper internal error for failing command when there are some
inconsitencies in sizes of LV and its layer instead of rather
meaningless error code 5.
(Could be hit i.e. if user tried to 'resize' cached LV and then
uncache such LV.)
During rework of resize code this validation check
has been lost (in my resize branch). Upstream
is still not supporting resize of any cache type LV
so needs to be prevented.
When we need to clear dirty cache content of cached LV, there
is table reload which usually is shortly followed by next metadata
change. However udev can't (as of now) process udev event
while device is 'suspended'.
So whenever sequence of 'suspend/resume/suspend' is needed,
we need to wait first for finishing of 'resume' processing before
starting next 'suspend'. Otherwise there is 'race' danger of triggering
unwantend umount by systemd as such event will trigger
SYSTEMD_READY=0 state for a moment for such changed device.
Such race is pretty ugly to trace so we may need to review more
sequencies for missing 'sync'.
(Other option is to enhnace 'udev' rules processing to avoid
such dramatic actions to be happening for suspended devices).
Solves: https://bugzilla.redhat.com/1280496
The only reasonable behaviour here is to error on
any number out of accepted range (i.e. now numbers
wrapping around with some hidden logic).
As this is plain bug there is no support for
backward compatibility since noone should
set numbers >UINT32_MAX and expect 0 or error
depending on how big number was used....
TODO: more fields might need to be converted.
Add simple function to wrap usage for only uint32 numbers.
Unlike 'int_arg' which accepts full range of 64bit number
this function will error on numbers out of this range:
<0, UINT32_MAX>
Add to commits 87117c2b25 and 0b8bf73a63 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Resolves: rhbz1025322
Related: rhbz1265191
Related: rhbz1399844
Related: rhbz1404425
When thin-pool processes event and 'lvextend --use-policies' fails
rather capture up-to-date new info as the fullness percentage may
have jumped noticable. This way we could use 'more' correct numbers
when checking for thresholds.
When there is 'merging' of an origin in progress, but metadata stil
do provide both origin and snapshot, we should show data from merged
snapshot. This is important mainly for thin case, where there was
a window, where i.e. 'lvs -o+device_id' would report information
about 'already gone' origin thin LV.
This race window is usually hard to trigger but can be ocasionally hit.
Usually shortly after activation, but before polling process manages
to update metadata after merge.
Before starting polling process, validate the merge has actually started
so there is not pointless invoke of lvmpolld.
This also fixes reported message from command, so user has
correct info whether merging has already started or
if it's delayed for next activation.
Move individual segment validation to a separate function
executed for 'complete_vg'.
Move some 'extra' validation bits from 'raid' validation to global
segtype validation (so extending existing normal validation)
TODO: still some test are left to be moved.
Reduce some duplication in validation process - there are still
some left thought so still room for improving overal speed.
The function timeout_add_seconds has quite a bit of variability. Using
timeout_add which specifies the timeout in ms instead of seconds. Testing
shows that this is much more consistent which should improve clients that
are using shorter timeouts for the API and the connection.
We can't keep 'display_lvname' for too long - it's using
ringbuffer and keeps limited number of names. So it's
safe only per few simple tests, but can't be used anymore
after some function calls..
(Fixes 00e641ef37)
Call _stats_regions_destroy() from dm_stats_list() if dms->regions
is non-NULL. This avoids leaking any pool allocations and ensures
the handle is in a known state: if an error occurs during the list,
dms->regions will be NULL and the handle will appear empty.
It could be actually better to use even cache origin in
read-only mode so there could no be some 'acidental'
change being done on this volume.
This however need further tools enhancment - where we would need
to handle whole subtree on 'lvchange -pr/-prw'.
When command calls backup() more then once (which is actually not
wanted) this warning message is shown repeatedly:
"WARNING: This metadata update is NOT backed up."
Instead now print message just once and less confuse user.
Add this functionality to lvconvert:
'lvconvert --thin cachedLV --thinpool vg/poll'
Converts cachedLV to external origin (which will be read-only).
New thin volume is created in thinpool LV and it's using external
origin as source for unprovisioned chunks.
This conversion happens online (while volume is in use).
Thin LV remains fully writable.
Cached external origin no longer could be written so cache will be used
ONLY for read operations. For this limitation we require cache mode
to be writethrough (as writeback cannot write to read-only volumes).
When thinLV is later removed cached external origin is again
fully usable, just note, LV remain in 'read-only' mode.
When read-write is needed, 'lvchange -prw' has to be used.
Single external origin could be user by multiple thinLV in
multiple differen thin pool.
When cache volume may be converted from normal to -real layer LV
we need to improve logic for call cache_check.
With this patch, we register call for cache_check only when metadata LV
is not yet present in active table slot (should match initial table
load).
This avoids unwanted checking when cache would become layer device
online.
The system is likely in some very inconsisten state.
Do not try to make it even more problematic with trying
to invoke tools like thin_check via callback.
External origin could be reloaded via more locks.
It's actually even more complex then thin-pool,
as it may be active on more nodes for linear LVs
(and maybe even more types).
External origin is always read-only thus unmodifiable
device so there should not be a problem accesing it
through multiple nodes.
Also for thin-pool check first presence of active thin-pool.
FIXME:
It's not easy to detect on which nodes this device is active
Thus manipulation with such device may require checking every
node and it active state and refresh.
But since such setup is quite complex to prepare and use,
hopefully there are not user trying to 'explore' this usage yet.
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
Avoid printing misleading message about single dirty block.
Instead properly detect condition where the 'cleaner' policy
needs to be installed without 'overloading' dirty variable.
Also print warning if we would be clearing read-only volume.
(it really shouldn't happen).
External origin could be activated as stand-alone device.
When the last thin LV is removed, external origin is no longer
the external origin and it's layer property was dropped.
Ensure dm table is correct by reloading external origin
(when it's active).
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We do the same for i.e. old snaphost origin.
Activation of raid has brough up also splitted image with tracing
(without taking lock for this).
So when raid is now activate - such image is not put into
table (with _rmeta). When user needs such device, just active it.
When --count=0 interval numbers are miscalculated:
Interval #18446744069414584325 time delta: 999920887ns
Interval #18446744069414584325 current err: -79113ns
End interval #18446744069414584325 duration: 999920887ns
This is because the command line argument is cast through the
uint32_t type, and fixed to UINT32_MAX:
_count = ((uint32_t)_int_args[COUNT_ARG]) ? : UINT32_MAX;
We also need to handle --count=0 specially when calculating the
interval number: since intervals count from #1, this must account
for the implicit "minus one" when converting from zero to the
UINT64_MAX value used (which is too large to store in _int_args).
The time management code mixes tests of the _timer_fd value with
code that should be timer agnostic: this causes problems for users
of the usleep() timer, since it cannot properly detect the start
of a new interval:
Beginning first interval
Interval #18446744069414584348 time delta: 1000000000ns
Interval #18446744069414584348 current err: 0ns
End interval #18446744069414584348 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
[...]
Beginning first interval
Interval #18446744069414584349 time delta: 1000000000ns
Interval #18446744069414584349 current err: 0ns
End interval #18446744069414584349 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
Separate these out, by defining a _timer_running() call that each
timer implements, and only define _timer_fd if we are compiling
with TIMERFD enabled.
Although the usleep() interval timer is not used if the Linux
TIMERFD interface is available it should still provide reasonably
good timing.
Instead of trying to estimate the error from the duration of the
last sleep, peg it to the start time of the program, and use the
value of ((start_time - now) % interval) to correct the current
interval duration.
This always pulls us back into sync at the end of each interval,
rather than relying on trying to incrementally adjust the time
duration at each interval start.
This greatly reduces drift when the usleep() clock is used.
The dm_bit_copy() macro uses the source (bs1) bitset size as the
limit for memcpy:
memcpy((bs1) + 1, (bs2) + 1, ((*(bs1) / DM_BITS_PER_INT) + 1)..)
This is safe if the destination bitset is smaller than the source,
or if the two bitsets are of the same size.
With a destination that is larger (e.g. when resizing a bitmap to
add more capacity), the memcpy will overrun the source bitset and
set garbage bits in the destination.
There are nine uses of the macro currently (8 in libdm/regex, and
1 in daemons/cmirrord): in each case the two bitsets are always of
equal size so the behaviour is unchanged.
Fix the macro to use bs2's size to simplify resizing bitsets and
avoid the need for another copy macro.
Commit 0690392040 revealed a problem
in raid metadata manipulation.
We do two operations in one table reload:
- raid leg/image extraction
- rename remaining raid legs
This should be made in separate steps. Otherwise we do an
uncorrectable table change on error path (leaving tables
for admin and dmsetup).
As a hotfix - restore the previous logic and use a single
new function _lv_update_and_reload_list which activates exclusively
extracted LVs on the list before resuming suspended raid LV.
This restore 'rename' functionality upon resume.
Also still preserve the 'origin_only' logic - although we know
it's not working properly for cluster and LV stacking.
Further fixes are needed.
If FIEMAP returns a single extent after the first call, no extent
boundary is detected and the first extent is not counted by the
normal mechanism.
In this case, increment nr_extents at the same time the extent is
added to the region table, before returning.
backup is not 'tested' for success and also it should
actually happen just when command is finished.
We do not target to make backups with each inter-step
metadata change.
RAID is LV property
TODO: only 2 flags are seg->status: PVMOVE & MERGING
At least the second one should be soon elimanted as again
we merge LV not a segment.
This is another place for 'common' use pattern or
reload and activation of deleted devices.
(Moving the exclusive activation to _deactivate_and_remove_lvs()).
TODO: looks like halve of raid function is reloading
just 'origin' - and the other full LV.
It's useful to be able to specify a minimum number of bits for a
new bitmap parsed from a list, for e.g. to allow for expansing a
group without needing to copy/reallocate the bitmap.
Add a backwards compatible symbol for programs linked against old
versions of the library.
It is sometimes convenient to iterate over the set bits in a dm
bitset in reverse order (from the highest set bit toward zero), or
to quickly find the last set bit.
Add dm_bit_get_last() and dm_bit_get_prev(), mirroring the existing
dm_bit_get_first() and dm_bit_get_next().
dm_bit_get_prev() uses __builtin_clz when available to efficiently
test the bitset in reverse.
Add a macro for the clz (count leading zeros) operation.
Use the GCC __builtin_clz() for clz() if it is available and fall
back to a shift based implementation on systems that do not set
HAVE___BUILTIN_CLZ.
Split out the loop that iterates over each batch of FIEMAP
extent data from the function that sets up and calls the ioctl
to reduce nesting and simplify local variable use:
_stats_get_extents_for_file()
-> _stats_map_extents()
The _stats_map_extents() function is responsible for detecting
eof and extent boundaries and adding whole, allocated extents
to the file extent table for region creation.
Check that all region_id values specified in a group bitmap are
actually present: although this should not normally happen when
using the dmstats tool, it is possible as a result of manual
changes (or bugs) for a group descriptor to contain one or more
group_id values that do not exist.
Check for this situation when reading group descriptors, warn
the user the user, and clear these bits in the bitmap when
formatting it for output.
If a region has a a DMS_GROUP tag in aux_data where the first
region_id in the bitmap is not the same as the containing region,
dmstats will segfault:
# '2' is never a valid group bitset list for region_id == 0
# dmsetup message vg_hex/root 0 "@stats_set_aux 0 DMS_GROUP=img:2#"
# dmsetup message vg_hex/root 0 "@stats_list"
0: 45383680+16384 16384 dmstats DMS_GROUP=img:2#
1: 46071808+32768 32768 dmstats -
2: 47382528+16384 16384 dmstats -
# dmstats list
Segmentation fault (core dumped)
The crash will occur in some arbitrary dm_stats_get_* property
method - this happens while processing the 1st region_id in the
bitset, because the region is marked as grouped, but there is
no group bitmap present at dms->groups[2]->regions.
Fix this by detecting a mismatch between the expected region_id
and dm_bit_get_first() for the parsed bitset during
_parse_aux_data_group().
Handle files that contain multiple logical extents in a single
physical extent properly:
- In FIEMAP terms a logical extent is a contiguous range of
sectors in the file's address space.
- One or more physically adjacent logical extents comprise a
physical extent: these are the disk areas that will be mapped
to regions.
- An extent boundary occurs when the start sector of extent
n+1 is not equal to (n.start + n.length).
This requires that we accumulate the length values of extents
returned by FIEMAP until a discontinuity is found (since each
struct fiemap_extent returned by FIEMAP only represents a single
logical extent, which may be contiguous with other logical
extents on-disk).
This avoids creating large numbers of regions for physically
adjacent (logical) extents and fixes the earlier behaviour which
would only map the first logical extent of the physical extent,
leaving gaps in the region table for these files.
To be able to detect lvm2 command is not leaking some
'unexpected' device - remove all devices before
test exits by its own command so test teardown
now can check what was 'left' unexpectedly.
Fix order of operation when converting raid1 into old mirror.
Before any later metadata modification are initiated prepare
mirror_log device with all clearing.
Then directly convert raid1 into mirror with mirror_log.
This convertion now properly see as precommitted metadata
new 'mirror' and committed old 'raid' and is able to
preload all LVs.
When mapping regions to a file descriptor, a temporary table of
extent descriptors is built using the dm_pool object building
interface.
Previously this use borrowed the dms->mem region and counter
table pool (since nothing can interleave with the allocation
while the caller is still in dm_stats_create_regions_from_fd()).
This turns out to be problematic for error recovery. When a
region creation operation fails partway through file mapping,
we need to roll back the set of already created regions and
this requires a listed handle: the dm_stats_list() will then
allocate from the same pool as the extents; we either have
to throw away valid list data, or leak the extent table, to
return the handle in a valid state.
Avoid this problem by creating a new, temporary mem pool in
_stats_create_file_regions() to hold the extent data, and
discarding it on exit from the function.
While cleaning up the table of already created regions during a
failed dm_stats_create_regions_from_fd(), list the handle once,
and call _stats_delete_region() directly. This avoids sending a
@stats_list message for each region deleted, reducing runtime
from 6s to 0.7s when cleaning up ~250 out of ~10000 regions:
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
<< pauses here >>
Command failed
real 0m6.267s
user 0m3.770s
sys 0m2.487s
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
Command failed
real 0m0.716s
user 0m0.034s
sys 0m0.581s
Testing the error path requires region creation to start to
fail part way through the operation (in order to have regions
to clean up): the simplest way is to ensure the system is
close to the kernel limit of 1/4 RAM or 1/2 vmalloc space
consumed by dmstats data.
Split dm_stats_delete_region() so that internal callers can manage
the handle state themselves.
dm_stats_delete_region() now just handles checking the state of the
handle, reporting validation errors, and calling dm_stats_list() if
necessary, before calling _stats_delete_region().
The new _stats_delete_region() function performs the actual group
member removal and region deletion, and requires a fully listed
handle to operate.
Callers that repeatedly delete regions can use a single listed
handle for many operations on the same device, avoiding one
message ioctl per region deleted: since @stats_list with many
regions is expensive, this yields large runtime improvements.
If we fail to create a region during dm_stats_create_regions_from_fd(),
we must remove all regions that were created to do this to date. This
needs to loop over the table of region_id values that were populated
by _stats_create_file_regions() before the error.
The code for this failure case in the out_remove branch incorrectly
uses the table index as the region_id:
for (--i; i != DM_STATS_REGION_NOT_PRESENT; i--) {
if (!dm_stats_delete_region(dms, i))
log_error("Could not delete region " FMTu64 ".", i);
}
This causes the cleanup code to delete a completely unrelated set
of regions (since the index here will always be nr_regions..0).
Fix it to pass the actual region_id stored in regions[i] instead.
Fix a silly bug in dm_stats_delete_region() that hugely inflates
runtimes when deleting a large number of regions.
For ~50,000 regions this change reduces the runtime from 98s to
6s on my test systems (a ~93% reduction).
The bug exists because dm_stats_delete_region() applies a truth
test to the return value of dm_stats_get_nr_areas(); this is
never correct usage - it will walk the entire region table and
calculate area counts for each region (which is roughly O(n^2)
in the number of regions, as dm_stats_delete_region() is being
called inside a region walk).
Although the individual area calculation is not that costly,
uselessly running anything 2,500,000,000 times over gets a bit
slow.
A much cheaper test (which is always true if the areas check is
true) is to just test dm_stats_get_nr_regions() or dms->regions;
if either is true it implies at least one area exists.
Old:
Performance counter stats for 'dmstats delete --allregions --alldevices':
98117.791458 task-clock (msec) # 1.000 CPUs utilized
127 context-switches # 0.001 K/sec
3 cpu-migrations # 0.000 K/sec
6,631 page-faults # 0.068 K/sec
307,711,724,562 cycles # 3.136 GHz
544,762,959,577 instructions # 1.77 insn per cycle
84,287,824,115 branches # 859.047 M/sec
2,538,875 branch-misses # 0.00% of all branches
98.119578733 seconds time elapsed
New:
Performance counter stats for 'dmstats delete --allregions --alldevices':
6427.251074 task-clock (msec) # 1.000 CPUs utilized
6 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
6,634 page-faults # 0.001 M/sec
21,613,018,724 cycles # 3.363 GHz
3,794,755,445 instructions # 0.18 insn per cycle
852,974,026 branches # 132.712 M/sec
808,625 branch-misses # 0.09% of all branches
6.428953647 seconds time elapsed
Simplify info run for use only for INFO & STATUS.
Drop handling MKNODES within _info_run() call
and use more advanced _setup_task_run() directly.
This allows to further simplify _info_run().
Integrate also query for inactive table and
handle dm_task_run() and dm_task_get_info()
(thus switching to setup_task_run)
Add one exception case for DM_DEVICE_TARGET_MSG.
This allows further shortening and simplification of all
other users of this function.
It's actually not needed to call extra lv_has_target_type() to detect
snapshot merge is in progress - decode this right during status
capturing and save even few extra ioctl calls.
Drop LV from passed API arg - it's always segment being checked.
Also use_layer is now in full control of lv_info_with_seg_status().
It decides which device needs to be checked to get 'the most info'.
TODO: future version should be able to expose status from
Start moving selection of status taken for a LV into a single place.
The logic for showing info & status has been spread over multiple
places and were doing too complex decision going agains each other.
Unify selection of status of origin & cow scanned device.
TODO: in future we want to grab status for LV and layered LV and have
both statuses present for display - i.e. when 'old snapshot'
of thinLV is takes and there is ongoing merge - at some moment
we are not capable to show all needed info.
When lvm2 wants to see a status, it needs to validate,
segment for status reading is matching whan lvm2 expects in
metadata.
Also ensure status failure will not cause '0' from info reading
when actual info was collected properly.
Failure in 'status' reading is considered to be
a 'log_warn()' event only.
When we can't parse status, switch to warning as this is not
considered an errornous case. LVS is not supposed to return
error status code when device is not what it's been expected to
be - but it should be WARNING a user there is something unexpected.
Convert lvs -o lv_merge_failed,lv_snapshot_invalid to use
lv_info_and_status function.
This makes it equal to attr value showing this info
(as they were different since they were derived from
different data set and different logic as well).
Also saves couple extra ioctl that were needed to obtain this info.
When displaying <reporting_command> -o help, we'd like to have fields
grouped nicely, not starting having groups interleaved as it was before.
The code that displays the help output for fields takes the order as
written in columns.h file - this caused output like:
$ lvs -o help
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Instead, let's have it without groups interleaved which may be
a bit confusing, so:
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
..and so on.
Use LVM_DBUSD_TEST_MODE env variable to customize what we test.
Default is the same where we try to test all combinations of all
modes. Renamed to make it consistent with the other env variables
that are used in the unit test.
- We check that all properties match the introspection data. We
don't verify values for every property as only lvm knows what they
should be.
- We are testing vg.Move
Added a properties changed signal on the job dbus object so that client
can wait for a signal that the job is complete instead of polling or
blocking on the wait method.
Allows the user to override the number of commands that get dumped
to the log when we encounter a lvm error. Also useful during
development when you don't want to see the blackbox output.
In case any SubLV of a RaidLV transiently fails, it needs
two "lvchange --refresh RaidLV" runs to get it to fully
operational mode again. Reason being, that lvm reloads all
targets for the RaidLV tree but doesn't resume the SubLVs
until after the whole tree has been reloaded in the first
refresh run. Thus the live mapping table of the SubLVs
still point to an "error" mapping and the dm-raid target
can't retrieve any superblock from the MetaLV(s) in processing
the constructor during this preload thus not discovering the
again accessible SubLVs. In the second run, the SubLV targets
map proper (meta)data, hence the constructor discovers those
fine now.
Solve by resuming the SubLVs of the RaidLV before
preloading the respective top-level RaidLV target.
Resolves: rhbz1399844
When reading data from stdout & stderr we were reading until the
reading until we got None back which really isn't needed as the
read will return everything that is available.
Avoid code duplication and use exiting commonly used
lv_update_and_reload() function.
There is still one place left where mirror is doing strange
double suspend call - needs there more thinking what's wrong with
that code.
When lvconvert adds a new leg - it's doing it free 'temporary' image
layer - however this temporary 'internal' mirror is also MIRRORED LV.
But the status bit was not properly transfered through layer.
We need to acquire a lock which can block us which in turn causes
the dbus request handling to block as well. Place the request on
the work queue instead.
Our expectation was that when using the lvm shell that when the lvm prompt
was read from stdout, that all other ouput had been written and flushed.
However, this doesn't appear to be the case. Add extra read passes to
retrieve delayed report data.
The default dbus python library mode of operation is to leverage
introspection. However, this introspection data isn't accessible
for users of the library and they have to specifically retrieve
the introspection data too. This resulted in many introspection
calls being made. This change eliminates introspection calls if
we are testing multiple concurrent test clients. If it's a single
client we will leverage a reduced amount of introspection data to
verify the introspection data is correct. Typically clients don't
leverage introspection data nearly as much as this test client.
The env variable LVM_DBUSD_PV_DEVICE_LIST when present and filled in
with at least 4 physical devices will run concurrently with other
instances running as long as they specify different devices in their
env variable.
When the env variable is not present the test runs as it did before.
In preparation to have more than one thread issuing commands to lvm
at the same time we need to serialize updates to the dbus state and
retrieving the global lvm state. To achieve this we have one thread
handling this with a thread safe queue taking and coalescing requests.
Looks like this isn't support across versions. Need to add functionality
to service to return the supported segment types, so we only use the
supported ones.
There are two possible errors in _dm_stats_populate_region():
* No region struct in dms->regions[region_id]
* Failure to parse data from @stats_print
These have very different causes: the first occurs where a client
program is populating one region at a time (region_id is a single
region identifier), and has not previously called dm_stats_list()
to dimension the region tables; this is an API usage error.
The second occurs when either we read unparseable data from the
kernel (kernel bug), or where various resource allocations fail.
Separate these two cases out and log separate messages for each
(allocation failures in the path already have their own distinct
message), since the "failed to parse.." message in the un-listed
handle case is confusing and misleading.
Do not emit warning message but only log debug message if
lvm2-lvmdbusd.service unit is missing and at the same time
we have global/notify_dbus=1 (which is used by default if we
configured sources with "--enable-notify-dbus"). We don't want
hard dependency between LVM2 and lvmdbusd so it's enough to log
only debug message in this case.
0 interval leads as of now to a busy loop with lvmetad and command.
Avoid testing this patological case.
TODO: Code should possibly translate zero interval into some small
sleep. With lvmpolld it's already 1/10s
Add new targets:
make check_lvmpolld
make check_cluster_lvmpolld
make check_lvmetad_lvmpolld
make check_all_lvmpolld
So check_lvmetad runs only base lvmetad test - to much
logic of remaining targets.
Previous behavior is available via check_all_lvmpolld.
Make it easier to replace missing segments with 'zero' returning
target - otherwise user would have to create some extra target
to provide zeros as /dev/zero can't be used (not a block device).
Also break code loop when segment is found and make it an INTERNAL_ERROR
where it's missing.
Instead of clearing multiple rmeta device with sequential activation
process and waiting for udev for every _rmeta device separately,
activate all _rmeta devices first and then clear them and deactivate
afterwards.
Also update some tracing messages.
When anyhing goes wrong during clearing process, always try to
deactivate as much _rmeta devices as possible before fail.
This code is no longer needed because the back ground task has been
removed. Will add back if we change the design and end up utilizing
multiple worker threads.
There is no reason to create another background task when the task that
created it is going to block waiting for it to finish. Instead we will
just execute the logic in the worker thread that is servicing the worker
queue.
Translate log_info() into log_very_verbose() which is macro
supposed to be used by our code.
log_info() is internal macro with eventually some 'symbolic' meaning
in syslogging daemons.
Instead of compiling 2 log call for 2 different logging functions,
and runtime decide which version to use - use only 'newer' function
and when user sets his own OLD dm_log logging translate it runtime
for old arg list set.
The positive part is - we get shorter generated library,
on the negative part this translation means, we always have evaluate
all args and print the message into local on stack buffer, before
we can pass this buffer to the users' logging function with proper
expected parameters (and such function may later decide to discard
logging based on message level so whole printing was unnecessary).
Ensure different logging function for dmeventd.c logging
and dm and lvm library.
We can recognize we want to show every log_info() and
log_notice() message from dmeventd.c code while not
exposing those from libdm/libdevmapper-event
Also switch to use log with errno - it's not changing
anything and doesn't bring any more features yet to dmeventd
logging but we just properly pass dm_errno_or_class properly
through the whole code stack for possible future use
(i.e. support of class logging for dmeventd).
Reword the logging logic and try to restore previous logging
behavior for 'standalone' running daemon while preserving
debuggable feautures it has gained.
So actual rules:
dmeventd without any '-d' option will syslog all messages
from dmeventd.c it dmeventd plugins.
log_notice()==log_verbose()
log_info()==log_very_verbose()
But to show also log_debug() used has to give '-ddd'.
When user specified '-d, -dd, -ddd, -dddd' it
will also enable tracing of messages from libdm & lib
executed code - which is mainly useful for testing
i.e.: 'dmeventd -fldddd'
Introduce macros:
log_level(), log_stderr(), log_once(), log_bypass_report()
For easier and more consisten way how to 'decoder' bits
of info from passed 'level'.
This patch fixes potential problem when 'level' of message
might not have always masked right bits.
Instead of creating a thread to handle the case where a client
is calling job.Wait, we will utilize a timer. This significantly
reduces the number of threads that get created and destroyed while
the service is running.
We will fetch the lvm state in non-main thread and only process the new
data with the main thread to prevent hanging the main thread event loop.
ref. https://bugs.freedesktop.org/show_bug.cgi?id=98521
pvscan --cache -aay was activating LVs in exported VGs
when it should not.
It appears that this was a regression from commit 9b640c3684
"pvscan: use process_each_vg for autoactivate".
If blkdeactivate finds out that the device on top of device stack
is already unmounted, it still proceeds with device stack deactivation
underneath now.
This situation can happen if blkdeactivate is started and the mount
point is unmounted in parallel by chance (so when blkdeactivate
gets the the actual umount call, the device is not mounted anymore).
Before, the blkdeactivate added such device to skip list which caused
all the stack underneath to be skipped too on deactivation. Now, we
proceed just as if blkdeactivate did the umount itself.
For example, in the example below, the vg-lvol0 is mounted on /mnt/test
when blkdeactivate is called, but it gets unmounted in parallel later
on when blkdeactivate gets to the actual umount call.
Before this patch (vg-lvol0 underneath not deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... skipping
With this patch applied (vg-lvol0 underneath still deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... already unmounted
[LVM]: deactivating Logical Volume vg/lvol0... done
(Automatic) repair may not be allowed during the initial sync of an upconverted
linear LV, because the data on the failing, primary leg hasn't been completely
synchronized to the N-1 other legs of the raid1 LV (replacing failed legs during
repair involves discontinuing access to any replaced legs data, thus preventing
data recovery on the primary leg e.g. via dd_rescue).
Even though repair would not cause data loss when adding legs to a fully synced
raid1 LV, we don't have information yet defining this state yet (e.g. a raid1
LV flag telling the fully synchronized status before any legs were added),
hence can't automatically decide to allow to repair.
If nonetheless a repair on a non-synced raid1 LVs is intended, the "--force"
option has to be provided.
Resolves: rhbz1311765
Validate kernel support for raid0/raid4 on given and
requested segtype before requesting conversions on them.
Because raid10 wasn't present in old RAID targets, add
the same validation to be prepared once we support them.
Check for dm-raid target version with non-standard raid4 mapping expecting the dedicated
parity device in the last rather than the first slot and prohibit to create, activate or
convert to such LVs from striped/raid0* or vice-versa in order to avoid data corruption.
Add related tests to lvconvert-raid-takeover.sh
Resolves: rhbz1388962
On conversions between striped/raid0* and raid4, the kernel expects
the dedicated raid4 parity SubLVs in the first segment area rather than
in the last it's been allocated to, thus the data mapping ain't proper.
Enhance lvconvert (lib/metadata/raid_manip.c) to shift the dedicated
parity SubLVs on conversions from striped/raid0* to raid4 and vice-versa.
In case of raid0_meta -> raid4 where the MD raid0 personality already has
stored RAID array device positions in the superblocks, the MetaLVs have to
be cleared so that the kernel doesn't fail validating the array positions
after lvm has shifted them up by one.
Add more tests to lvconvert-raid-takeover.sh including one to check for
mapping flaws by converting a created raid4 with filesystem -> striped
and fsck it.
Whilst on it:
- add missing direct striped -> raid4 conversion to the takeover array
to avoid an intermim conversion from striped -> raid0*
- clean up the takeover array
- allow lvconvert to actually call lv_raid_convert() on all takeover requests
in order to check parameters and display messages provided by takeover
functions rather than just "...not supported" from within lvconvert
- fix a typo
Resolves: rhbz1386148
If a device disappears after obtaining the list of devices but before
processing it as a member of that list, dmsetup exits with a failure code.
Most commands still produce what output they can in these circumstances,
but 'ls --tree' and 'info -c' with fields depending on device dependencies
didn't. Change this.
Keep for now function logic making its decision on string content.
We need bigger patch converting all things to bit-checks later.
This needs however bigger refactoring.
So this commit reverts some changes from:
c8b6c13015
Commit 088b3d036a allowed repair on cache origin RAID LVs
and restricted lvconvert actions on RAID SubLVs to change number of mirrors, repair,
replace and type changes in order to avoid unsuitable coversions on them.
This introduced a regression prohibiting --splitmirrors on any RAID SubLVs
(e.g. of cache or thin LVs; lvconvert-{cache,thin}-raid.sh tests failing).
Fix allows split mirrors again.
Fix some indenting whilst on it.
When we have already decoded arg_is_set into a local var
or already set segment type - already use these
values instead of repeating calls and string checks.
Seems some error path where not converted to 'new' ECMD return value.
Fix them to always 'goto out'.
Also drop unneeded 'ret = 0' when ret already is 0.
This test never passes on loop back, so we will skip unless the
pv devices are real devices which contain `/dev/sd`.
We always fail because we need lvm to run slow to get a timer to
pop, and loopback are too fast.
If you run multiple runs of unittest.main, unless you don't pass exit=true
the test case always ends with a 0 exit code. Add ability to store the
result of each invocation of the test and exit with a non-zero exit code
if anyone of them fail.
In case a RAID orig LV is being cached and fails, repair is impossible because
"lvconvert --repair" gets rejected.
Fix by allowing repair on cache orig RAID LVs and
"lvconvert --replace/--mirrors/--type {raid*|mirror|striped|linear}" as well.
Allow the same lvconvert actions on any cache pool and metadata RAID SubLVs.
Resolves: rhbz1380532
The following LvCommon properties were added so that the API
would have the same functionality as lvm2app has.
LvCommon.MetaDataSizeBytes
LvCommon.Attr
LvCommon.MetaDataPercent
LvCommon.CopyPercent
LvCommon.SnapPercent
LvCommon.SyncPercent
Fix missing wait so we have paired waiting.
Also 'wait' for precise PID to get 'exit' code.
Test for 'error' replacing only with newer snapshot targets.
The old one will wait for resume.
Note: 'wait -n' is not always available so can't be used..
Integrate back _unblock_sigalrm() and check for error code of
pthread_sigmask() function so we do not use uninitialized
sigmask_t on error path (Coverity).
When a PV device is missing lvm will return '[unknown]' for the device
path. The object manager keeps a hash table lookup for uuid and for PV's
device name. When we had multiple PVs with the same device path we
we only had 1 key in the table for the lvm id (device path). This caused
a problem when the PV device transitioned from '[unknown]' to known as any
subsequent transitions would cause an exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 66, in run_cmd
result = self.method(*self.arguments)
File "/usr/lib/python3.5/site-packages/lvmdbusd/manager.py", line 205, in _pv_scan
cfg.load()
File "/usr/lib/python3.5/site-packages/lvmdbusd/fetch.py", line 24, in load
cache_refresh=False)[1]
File "/usr/lib/python3.5/site-packages/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/usr/lib/python3.5/site-packages/lvmdbusd/loader.py", line 80, in common
cfg.om.remove_object(cfg.om.get_object_by_path(k), True)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 153, in remove_object
self._lookup_remove(path)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 97, in _lookup_remove
del self._id_to_object_path[lvm_id]
KeyError: '[unknown]'
when trying to delete a key that wasn't present. In this case we don't add a
lookup key for the device path and the PV can only be located by UUID.
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1379357
Works if the pool is inactive.
Activation code doesn't notice a new raid dependency in on-disk metadata
when a thin LV is already active.
https://bugzilla.redhat.com/1365286
The dm_stats_delete_region() call removes a region from the bound
device, and, if the region is grouped, from the group leader
group descriptor stored in aux_data.
To do this requires a listed handle: previous versions of the
library do not since no dependencies exist between regions without
grouping.
This leads to strange behaviour when a command built against an old
version of the library is used with one supporting groups. Deleting
a region with dmstats succeeds, but logs errors:
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
vg_hex-root 2 2.00g 1.00g 1 1.00g dmstats
# dmstats delete --regionid 2 vg_hex/root
Region ID 2 does not exist
Could not delete statistics region.
Command failed
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
This happens because the call to dm_stats_delete_region() is inside
a dm_stats_walk_*() iterator: upon entry to the call, the iterator
is at its end conditions and about to terminate. Due to the call to
dm_stats_list() inside the function, it returns with an iterator at
the beginning of a walk and performs a further iteration before
exiting. This final loop makes a further attempt to delete the
(already deleted) region, leading to the confusing error messages.
The current dmsetup.c handles DR_STATS and DR_STATS_META reports
separately in _display_info_cols(), meaning that the stats walk
functions are never called for these report types.
Versions before v2.02.159 have a loop using dm_stats_walk_do() and
dm_stats_walk_while(), that executes once for non-stats reports,
and once per region, or area, for DR_STATS/DR_STATS_META reports.
This older behaviour relies on the documented behaviour that the
walk functions will accept a NULL pointer as the struct dm_stats*
argument.
This was broken by commit f1f2df7b: the NULL test on dms and
dms->regions were incorrectly moved from the dm_stats_walk_end()
wrapper to the internal '_stats_walk_end()' helper.
Since the pointer is dereferenced in between these points, using
an older dmsetup with current libdm results in a segfault when
running a non-stats report:
# dmsetup info -c vg00/lvol0
Segmentation fault (core dumped)
Restore the NULL checks to the wrapper function as intended.
We shouldn't be losing pvscans just because of the fact that the
underlying device (PV) appears and disappears quickly in the system,
otherwise lvmetad may not see the device if it appears again (or it may
still keep the device in cache even it's already gone).
We added lightweight toolcontext handle to avoid useless initialization
of some parts of the context and also to avoid problems when using the
handle very soon at system boot, like in lvm2-activation-generator
through lvm2app interface. However, we missed reading all the other
config sources like lvmlocal.conf as well as any tag config - we need to
read these too to get the final config value which may be overriden in
any of these additional config sources.
Currently, we use this lightweight toolcontext handle to read
global/use_lvmetad and global/use_lvmpolld config values in
lvm2-activation-generator using lvm2app interface (lvm_config_find_bool
lvm2app function).
Pre 1.9 dm-raid targets status output was racy, which caused
the device status chars to be unreliable _during_ synchronization.
This shows paritcularly with tiny test devices used.
Enhance lvchange-rebuild-raid.sh to not check status
chars _during_ synchronization. Just check afterwards.
Though I'm not quite sure why we push this limit on user,
current --repair logic requires user to wait outside of command.
TODO: I'm not quite sure this repair logic is 'the most wanted'.
Make sure that the temporary dm_histogram used for the bounds
argument is freed in the case that the user provided a --bounds
argument on the command line.
The dm-raid target now rejects device rebuild requests during ongoing
resynchronization thus causing 'lvconvert --repair ...' to fail with
a kernel error message. This regresses with respect to failing automatic
repair via the dmeventd RAID plugin in case raid_fault_policy="allocate"
is configured in lvm.conf as well.
Previously allowing such repair request required cancelling the
resynchronization of any still accessible DataLVs, hence reasoning
potential data loss.
Patch allows the resynchronization of still accessible DataLVs to
finish up by rejecting any 'lvconvert --repair ...'.
It enhances the dmeventd RAID plugin to be able to automatically repair
by postponing the repair after synchronization ended.
More tests are added to lvconvert-rebuild-raid.sh to cover single
and multiple DataLV failure cases for the different RAID levels.
- resolves: rhbz1371717
Commit 199697accf rerouted funtion
for priting cache volume origin to lvm2app app function - which
however had a bug. So restore the original functionality
and print correct LV as cache origin LV.
Gris debugged that when we don't have a method the introspection
data is missing the interface itself eg.
<interface name="<your_obj_iface_name>" />
When adding the properties to the dbus object introspection we will
add the interface too if it's missing. This now allows us the
ability to have a dbus object with only properties.
Because of the different code paths we need to test job handling with
all operations. Test now runs virtually everything with timeout == 0
and timeout == 15 so that we test both use cases.
Note: If a client passes -1 for the timeout value they need to make
sure that their connection time out is also infinite. Otherwise, if the client
times out the service side hangs any new dbus calls until the job
that is in progress completes. Not sure why this behavior is occuring
at this time, but it appears a limitation/bug of the dbus-python library.
When we register a failure we need to use a valid value which will be
returned with the object manager. Otherwise we will raise an Exception
because we are trying to construct an object path from None.
The methods were returning an instance of the object instead of the
object path which was causing an exception when the result was returned
with the job object as we are explicity trying to return an object path.
Unit test added which re-creates the issue and verifies the fix.
Add simple helper/wrapper check function to check result
of dmsetup call i.e.:
check grep_dmsetup table vg-lv "grep_expected"
check grep_dmsetup status vg-lv -v "grep_unexpected"
Reload of thin-pool origin_only is designed to only post messages
to a thin-pool. It's not intended to be used for reload of thin-pool
table. Fix it by using standard call 'lv_update_and_reload()'.
Unconditionally guard there is at least 1/4 of metadata volume
free (<16Mib) or 4MiB - whichever value is smaller.
In case there is not enough free space do not let operation proceed and
recommend thin-pool metadata resize (in case user has not
enabled autoresize, manual 'lvextend --poolmetadatasize' is needed).
In the case there is no active thin volume, report thin pool
as lock holder. This fixed function like lvextend
which either expecte lock holder LV is some active thin
or 'possibly' inactive thin pool.
The existing code doesn't understand that mirror logs should cling to
parallel LVs (like extending them) instead of avoiding them.
As a quick workaround to avoid lvcreate failures, hard-code
--alloc normal for mirror logs even if the rest of the allocation
used a stricter policy.
https://bugzilla.redhat.com/show_bug.cgi?id=1376532
When rescanning a VG from disk, the metadata read from
each PV was compared as a sanity check. The comparison
is done by exporting the vg metadata from each dev to
a config tree, and then comparing the config trees.
The function to create the config tree inserts
extraneous information along with the actual VG metadata.
This extra info includes creation_time. The config
trees for two devs can easily be created one second
apart in which case the different creation_times would
cause the metadata comparison to fail. The fix is to
exclude the extraneous info from the metadata comparison.
Correction for aux test result ([] -> if;then;fi)
Use issue_discard to lower memory demands on discardable test devices
Use large devices directly through prepare_pvs
I'm still observing more then 0.5G of data usage through.
Particullary:
'lvcreate' followed by 'lvconvert' (which doesn't yet support --nosync
option) is quite demanging, and resume returns quite 'late' when
a lot of data has been already written on PV.
Reinstantiate reporting of metadata percent usage for cache volumes.
Also show the same percentage with hidden cache-pool LV.
This regression was caused by optimization for a single-ioctl in
2.02.155.
Allow RAID scrubbing on cache origin sub-LV
This patch adds the ability to perform RAID scrubbing on the cache
origin sub-LV (https://bugzilla.redhat.com/1169495). Cache origin
operations are restricted to non-clustered RAID LVs until there can
be further testing in a cluster (even for exclusive activation).
User can either specify directly _corig LV
or he can specify cache LV and operation --syncation is
passed ONLY to _corig LV.
If users wants to manipulation with cache-pool devices - he
needs to specify this object name.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Older udev versions (udev < v165), don't have the official
udev_device_get_is_initialized function available to query for
device initialization state in udev database. Also, devices don't
have USEC_INITIALIZED udev db variable set - this is bound to the
udev_device_get_is_initialized fn functionality.
In this case, check for "DEVLINKS" variable instead - all block devices
have at least one symlink set for the node (the "/dev/block/<major:minor>".
This symlink is set by default basic udev rules provided by udev directly.
We'll use this as an alternative for the check that initial udev
processing for a device has already finished.
It's possible (mainly during boot) that udev has not finished
processing the device and hence the udev database record for that
device is still marked as uninitialized when we're trying to look
at it as part of multipath component check in pvscan --cache code.
So check several times with a short delay to wait for the udev db
record to be initialized before giving up completely.
When scanning devs to populate lvmetad during system startup,
filter-mpath with native sysfs multipath component detection
may not detect that a dev is multipath component. This is
because the multipath devices may not be set up yet.
Because of this, pvscan will scan multipath components during
startup, will see them as duplicate PVs, and will disable
lvmetad. This will leave lvmetad disabled on systems using
multipath, unless something or someone runs pvscan --cache
to rescan.
To avoid this problem, the code that is scanning devices to
populate lvmetad will now check the udev db to see if a
dev is a multipath component that should be skipped.
(This may not be perfect due to inherent udev races, but will
cover most cases and will be at least as good as it's ever
been.)
The lsblk is just a nice helper here - it's not crucial for lvmdump so
do best effort here and use the most we can from current version of
lsblk that is installed on system. The lsblk -s option was added a bit
later after lsblk introduction and lsblk -O support even more later -
so if these are not available, use only pure lsblk output without any
extras.
- Prevent --lvmshell with --nojson, not a valid combination
- If user is preventing json, then no lvmshell usage
- Return boolean on Manager.UseLvmShell
The normal mode of operation will be to monitor for udev events until an
ExternalEvent occurs. In that case the service will disable monitoring
for udev events and use ExternalEvent exclusively.
Note: User specifies --udev the service will always monitor udev regardless
if ExternalEvent is being called too.
With the addition of JSON and the ability to get output which is known to
not contain any extraneous text we can now leverage lvm shell, so that we
don't fork and exec lvm command line repeatedly.
When we are running in a terminal it's useful to have a date & ts on log
output like you get when output goes to the journal. Check if we are
running on a tty and if we are, add it in.
Avoid monitoring of activated cache-pool - where the only purpose ATM
is to clear metadata volume which is actually activate in place
of cache-pool name (using public LV name).
Since VG lock is held across whole clear operation, dmeventd cannot
be used anyway - however in case of appliction crash we may
leave unmonitored device.
In future we may provide better mechanism as the current name
replacemnet is creating 'uncommon' table setups in case the metadata
LV is more complex type like raid (needs some futher thinking about
error path results).
Another point to think about is the fact we should not clear device
while holding lock (i.e. dmeventd mirror repair cannot work in cases
like this).
Introduce 'hard limit' for max number of cache chunks.
When cache target operates with too many chunks (>10e6).
When user is aware of related possible troubles he
may increase the limit in lvm.conf.
Also verbosely inform user about possible solution.
Code works for both lvcreate and lvconvert.
Lvconvert fully supports change of chunk_size when caching LV
(and validates for compatible settings).
Commit e947c362dd introduced
config_settings.h file for central place to store all definitions for
config options. By mistake, it used report/colums_as_rows instead
of report/columns_as_rows (missing "n" in "columns").
If the number of stripes requested is incompatible with the requested
type of raid, give an error instead of adjusting it.
If no stripes argument is supplied, continue to use an appropriate
default.
a579ba2ac2 fixed a regression causing a segfault if no external
origin existed but broke the logic leading to erroneous error
messages and creations of split off exported VGs in case the
external origin and the pool LVs were allocated on different PVs.
- resolves rhbz1367459
Creating a RaidLV in VGs with very small extent sizes caused
late failure in the kernel giving a not very informative error
message. Catch the attempt early and display failure message
'Unable to create RAID LV: requires minimum VG extent size 4.00 KiB'.
- resoves rhbz1179970
'pvmove -n name pv1 pv2' called with the name of a top-level LV
failed with mentioned commit.
Enhance pvmove-raid-segtypes.sh to test for prohibited RAID SubLV moves.
'pvmove -n name pv1 pv2' allows to collocate multiple RAID SubLVs
on pv2 (e.g. results in collocated raidlv_rimage_0 and raidlv_rimage_1),
thus causing loss of resilence and/or performance of the RaidLV.
Fix this pvmove flaw leading to potential data loss in case of PV failure
by preventing any SubLVs from collocation on any PVs of the RaidLV.
Still allow to collocate any DataLVs of a RaidLV with their sibling MetaLVs
and vice-versa though (e.g. raidlv_rmeta_0 on pv1 may still be moved to pv2
already holding raidlv_rimage_0).
Because access to the top-level RaidLV name is needed,
promote local _top_level_lv_name() from raid_manip.c
to global top_level_lv_name().
- resolves rhbz1202497
Adding MetaLVs to given DataLVs (e.g. raid0 -> raid0_meta takeover),
_avoid_pvs_with_other_images_of_lv() was missing code to prohibit
allocation when called with a just allocated MetaLV to prohibit
collaocation of the next allocated MetaLV on the same PV.
- resolves rhbz1366738
When lvm is compiled with --enable-notify-dbus and a user uses lvm
shell, after they issue 200+ commands the lvm shell will hang for
~30 seconds trying to notify the lvm dbus service that a change
has occurred. This appears to be caused by resource exhaustion,
because the sockets used for dbus communication are not be closed.
Enforce mirror/raid0/1/10/4/5/6 type specific maximum images when
creating LVs or converting them from mirror <-> raid1.
Document those maxima in the lvcreate/lvconvert man pages.
- resolves rhbz1366060
Some settings are not suitable for override in interactive/shell
mode because such settings may confuse the code and it may end
up with unexpected behaviour. This is because of the fact that
once we're in the interactive/shell mode, we have already applied
some settings for the shell itself and we can't override them
further because we're already using those settings to drive the
interactive/shell mode. Such settings would get ignored silently
or, in worse case, they would mess up the existing configuration.
When lvm commands are executed in lvm shell, we cover the whole lvm
command execution within this shell now. That means, all messages logged
and status caught during each command execution is now recorded in the
log report, including overall command's return code.
The dm_report_group_output_and_pop_all calls dm_report_output and
dm_report_group_pop for all the items that are currently in report
group. This is just a shortcut that makes it easier to output and
pop group's content so the group handle can be reused again without
a need to initialize and configure it again.
The functionality of dm_report_group_output_and_pop_all is the
same as dm_report_destroy but without destroying the report group
handle.
This patch moves printing of starting '{' character for JSON output up
untili it's known there's any further output following - either the
content or ending '}' character.
Also, remove unnecessary switch for different report group types and
calling individual functions to handle dm_report_group_create as that
code is shared for all existing types at the moment.
Calling dm_report_destroy_rows makes it possible to destroy any report
content we have but at the same time it doesn't destroy the report
handle itself, thus it's possible to reuse that handle again for new
report content.
Functionally, this is the same as calling dm_report_output with the
report handle but omitting the output iself. This functionality may
be useful if we, for whatever reason, need to discard the report
content and start a fresh new one but with the same report configuration
and initialization and thus we can just reuse the existing handle.
We may call arg_count/grouped_arg_count/arg_value soon enough that
cmd->arg_values is not set yet.
Normally, when running a command, we execute lvm_run_command which in
turn calls _process_command_line to allocate and parse the command line
values and stores them in cmd->arg_values.
However, if we run lvm shell, this one doesn't accept any command line
options and we parse the command line for each command that is executed
within the lvm shell then. If we used any code that tries to access
cmd->arg_values through any of the the arg handling functions too
early, we could end up with a segfault due to uninitialized (NULL)
cmd->arg_values.
This patch just saves extra checks in all the code where arg handling
may be called too early so that the cmd->arg_values is not set up yet.
This does not apply to any of existing code, but subsequent patches
will need that.
With patches that will follow, this will make it possible to widen log
report coverage when commands are executed from lvm shell so the amount
of messages that may end up in stderr/stdout instead of log report are
minimized.
Add new log_context=shell and with log_object_type=cmd and
log_object_name=<command_name> for command log report to collect
overall return code from last command (this is reported under
log_type=status).
Currently, the output is separated in 3 parts and each part can go into
a separate and user-defined file descriptor:
- common output (stdout by default, customizable by LVM_OUT_FD environment variable)
- error output (stderr by default, customizable by LVM_ERR_FD environment variable)
- report output (stdout by default, customizable by LVM_REPORT_FD environment variable)
For example, each type of output goes to different output file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvs fedora vg/abc 1>out 2>err 3>report
[0] fedora/~ # cat out
[0] fedora/~ # cat err
Volume group "vg" not found
Cannot process volume group vg
[0] fedora/~ # cat report
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
Another example in LVM shell where the report goes to "report" file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvm 3>report
(in lvm shell)
lvm> vgs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
(in lvm shell)
lvm> lvs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
RAID6 LVs may not be created with --nosync or data corruption
may occur in case of device failures. The underlying MD raid6
personality used to drive the RaidLV performs read-modify-write
updates on stripes and thus relies on properly written parity
(P and Q Syndromes) during initial synchronization.
Once on it, enhance test to create/extend more and
larger RaidLVs and check sync/nosync status.
Commit 76ef2d15d8 introduced
raid0 <-> raid4 takeover and full mirror <-> raid1 support.
Add tests for these conversions.
Tests exposed a kernel semantics change freezing resynchronization
on conversions from raid0[_meta] -> raid4 or adding raid1 legs
because kernel kept the RAID mapped device in 'frozen' state unless
an 'idle' message was sent or the table was reloaded (kernel patch pending).
Although the use of the first region_id in a group to store the
DMS_GROUP=... aux_data tag is an internal implementation detail,
it has a user visible consequence in that deleting this region will
cause the group to disappear: add an explanation of this to the
'group' command and 'Regions, areas, and groups' section.
The MD raid6 personality being used to drive lvm raid6 LVs does
read-modify-write updates to any stripes and thus relies on correct
P and Q Syndromes being written during initial synchronization or
it may fail reconstructing proper user data in case of SubLVs failing.
We may not allow the '--nosync' option on
creation of raid6 LVs for that reason.
Update/fix 'man lvcreate' in that regard.
add lvcreate-raid-nosync.sh test script.
- Resolves rhbz1358532
We don't need to refresh whole cmd context if we drop profile after
processing LVM command - just like we don't refresh cmd context when
we're applying the profile. It's because profiles contain only safe
subset of settings which do not require complete cmd context refresh.
This patch calls process_profilable_config instead of
refresh_toolcontext if there was profile applied for the LVM
command only, not --config which requires toolcontext refresh.
The process_profilable_config just sets proper values based on
values of profilable settings, but it does not do complete
reinitialization of various parts (e.g. filters, logging etc.).
'lvchange --resync LV' or 'lvchange --syncaction repair LV' request the
RAID layout specific parity blocks in raid4/5/6 to be recreated or the
mirrored blocks to be copied again from the master leg/copy for raid1/10,
thus not allowing a rebuild of a particular PV.
Introduce repeatable option '--[raid]rebuild PV' to allow to request
rebuilds of specific PVs in a RaidLV which are known to contain corrupt
data (e.g. rebuild a raid1 master leg).
Add test lvchange-rebuild-raid.sh to test/shell doing rebuild
variations on raid1/10 and 5; add aux function check_status_chars
to support the new test.
- Resolves rhbz1064592
Prepare for new segment type conversion functionality in cases that
currently fail. In the short-term, we need to do this while limiting
the changes to the code paths for the conversions that are already
supported.
introduced with commit 8f62b7bfe5 rely on complete
defintions of the relations between the LVs of a VG.
Hence only run these checks when the complete_vg flag
is set on calls to check_lv_segments().
lvconvert failed in test lvconvert-thin-raid.sh when
calling check_lv_segments() from _read_segments() without
providing a complete definition.
on any thin snap external origin LV which caused a segfault
when none existed as exposed by the vgsplit-thin.sh test.
Only call lv_is_on_pvs() if an external origin LV actually
exists and correct the related splitting logic.
When converting to a cache lv, tests were hanging with a prompt for
"Do you want wipe existing metadata of cache pool volume
To preserve cache metadata add option "--zero n".
WARNING: Reusing mismatched cache pool metadata MAY DESTROY YOUR DATA!"
This is new.
When a client is doing a wait on a job, any other clients will hang
when trying to do anything with the service. This is caused by
the wait code which was placing the thread that handles
incoming dbus requests to sleep until either the timeout expired or
the job operation completed.
This change creates a thread for the wait request, so that the thread
processing incoming requests can continue to run.
with respect to the changed, configurable default behaviour
introduced with commit 7eb7909193.
E.g. raid default of 2 stripes rather than number of PVs in the VG
or on the command line minus one.
The syntax for converting an LV to a thin LV
included an unnecessary --thin option. I was
probably still confused about these options
when writing this.
General RAID and RAID segment type specific checks are added
to merge.c. New static _check_raid_seg() is called on each segment
of a RaidLV (which have just one) from check_lv_segments().
New checks caught some unititialized segment members
which are addressed here as well:
- initialize seg->region_size to 0 in lvcreate.c for raid0/raid0_meta
- initialize list seg->origin_list in lv_manip.c
Add matching support for -Z option also we doing full conversion
to cache-pool.
Extending coversion message to show which pool type is created
and whether the metadata will be wiped or remain unmodified.
Follow-up to 27a767d5e8.
Tunning behavior in a way we always prompt when option --zero is NOT specified.
Without -Z lvm expects user wants to 'reset' cache-pool metadata
(they could have been splitted from some cached LV)
If user doesn't want to zero metadata he needs to specify -Zn.
User may also avoid prompting for zeroing by using -Zy for
cache-pool (basically equals using --yes without -Z being given)
(unlike full convert case, there is no cache-pool being converted,
so there is not 'uncoditional' prompt in this case).
When volume was lvconvert-ed to a thin-volume with external origin,
then in case thin-pool was in non-zeroing mode
it's been printing WARNING about not zeroing thin volume - but
this is wanted and expected - so nothing to warn about.
So in this particular use case WARNING needs to be suppressed.
Adding parameter support for lvcreate_params.
So now lvconvert creates 'normal thin LV' in read-only mode
(so any read will 'return 0' for a moment)
then deactivate regular thin LV and reacreate in 'final R/RW' mode
thin LV with external origin and activate again.
Before, the automatic update from older to newer version of PV extension
header happened within vg_write call. This may have caused problems under
some circumnstances where there's a code in between vg_write and vg_commit
which may have failed. In such situation, we reverted precommitted metadata
and put back the state to working version of VG metadata.
However, we don't have revert for PV write operation at the moment. So
if we updated PV headers already and we reverted vg_write due to failure
in subsequent code (before vg_commit), we ended up with lost VG metadata
(because old metadata pointers got reset by the PV write operation).
To minimize problematic situations here, we should put vg_write and
vg_commit that is done after PV header rewrites as close to each
other as possible.
This patch moves the automatic PV header rewrite for new extension
header part from vg_write to _vg_read where it's done the same way
as we do any other VG repairs if detected during VG read operation
(under VG write lock).
If the VG holding the global lock is removed, we can indicate
that as the reason for not being able to acquire the global
lock in subsequent error messages, and can suggest enabling
the global lock in another VG. (This helpful error message
will go away if the global lock is enabled in another VG,
or if lvmlockd is restarted.)
When cache pool is reused for a new cached volume, there is
normally no need to 'keep' old cache-pool metadata as this
could cause major data lose.
Unlike with 'lvcreate -H -LX --cachepool' conversion, this lvconvert
path left the metadata unzeroed - partly for making easier some
debugging, but this was rather a bug.
So to keep possible reattach of 'unzeroed' metadata, user
now has to use 'lvconvert -Zn' for such conversion. In this case
the prompt will appear about possibe data loss and to proceed,
user has to confirm such operation. Without -Zn metadata are wiped.
In some cases, the command will update VG metadata
in lvmetad without writing it. In these cases there
is no vg->vg_committed and it should use 'vg' directly.
This happens when the command finds that the lvmetad
VG has been invalidated, rereads the metadata from disk,
then updates lvmetad with that metadata. This happens
often with lvmlockd or foreign VGs, and can happen without
lvmlockd if a previous command fails after invalidating
the VG in lvmetad.
Commit 3928c96a37 introduced
new defaults for raid number of stripes, which may cause
backwards compatibility issues with customer scripts.
Adding configurable option 'raid_stripe_all_devices' defaulting
to '0' (i.e. off = new behaviour) to select the old behaviour
of using all PVs in the VG or those provided on the command line.
In case any scripts rely on the old behaviour, just set
'raid_strip_all_devices = 1'.
- resolves rhbz1354650
This fixes a regression from commit a7c45ddc5, which moved
the lvmetad VG update from vg_commit() to unlock_vg().
The lvmetad VG update needs to send the version of metadata
that was committed rather than sending the state of struct 'vg'.
The 'vg' may have been partially modified since vg_commit(),
and contain non-committed metadata that shouldn't be sent
to lvmetad.
Any failing stripes in raid0/raid0_meta type LVs cause data loss,
thus replacement via 'lvconvert --replace...' does not make sense.
Patch prohibits replacement on raid0/raid0_meta LVs.
- resolves rhbz1356734
The --uuid, --major and --alldevices arguments were incorrectly tested
after confirming argc is > 0, in a branch that only executes if argc
== 0 (i.e. they were unreachable).
Move all device checks before the test for argc and log an appropriate
error before returning.
Including major and minor numbers in pvs and lvs output when calling
lvmdump -a makes it a bit easier to match these items with possible
system log/journal.
Commit ca878a3426 changed behavior
or resize operation. Later the code has been futher changed
to skip fs resize completely when size of LV is already matching
and finaly at the most recent resize changeset for resize the
check for matching size has been eliminated as well so we ended
with a request call to resize fs to 0 size in some cases.
This commit reoders some test so the prompt happens just once before
resize of possibly 2 related volumes.
Also extra test for having LV already given size is added, and
whole metadata update is skipped for this case as the only
result would be an increment of seqno.
However the filesystem is still resized when requested,
so if the LV has some size and the resize is resolved to
the same size, the filesystem resize is called so in case FS
would not match, the resize will happen.
raid0/raid0_meta type LVs don't have a default number of stripes when
created without '-i/--stripes Stripes' whereas other raid types have one.
Patch sets the default for raid0/raid0_meta to 2 stripes.
The default amount of stripes for raid4/5/10 is changed to 2 and for raid6 to 3
rather than using all PVs in the VG or those provided on the command line.
This is to avoid unintended high number of stripes in case of many PVs.
To select a different amount of stripes from the default,
use 'lvcreate -i/--stripes Stripes'.
- resolves rhbz1354650
A livelock occurs on extension in lv_manip when adjusting the region size,
which doesn't apply to any raid0/raid0_meta LVs (these don't have a bitmap).
Fix by prohibiting the region size adjustment on any such LVs.
- resolves rhbz1354604
An unconditional access to the non-existing MetaLV of a raid0 LV in
lv_raid_remove_missing() was causing the segfault.
Only call log_debug() on replacements of existing MetaLVs.
- resolves rhbz1354646
Resync attempts on raid0/raid0_meta via 'lvchange --resync ...'
cause segfaults.
'lvchange --syncaction ...' doesn't get rejected either.
Prohibit both on raid0/raid0_meta LVs.
- resolves rhbz1354656
4420d41fea introduced recursive split of lvs which
splits a top-level LV together with it's sub LVs.
This lead to invalid temporary list pointers
causing hangs/OOM situations.
Patch updates the temporary list pointer
referencing a moved sub LV.
- resolves rhbz1354686
Synopsis are very useful for quick orientation and also
we provide then for all remaining command.
Also list ALL supported options in a single ordered list,
user should not seek for them.
blkdeactivate -m disablequeueing causes "multipathd disablequeueing maps"
call inside blkdeactivate script before deactivating devices. This
avoids a situation where blkdeactivate may wait for paths to appear if
multipath is set to queueing and there's a stack of other devices and/or
mount points on top of such multipath device.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1344381.
Reduce number of evualted pvcreate commands.
Since 0 is default value used to fill missing params,
and 0 is also 1st. value in array, it's being tested.
Drop unused data_alignment_offset.
When logging to epoch files we would like to prevent creating too large
log files otherwise a spining command could fulfill available space
very easily and quickly.
Limit for to 100000 per command.
Make the --filemap switch take no arguments and instead accept one
or more files on the command line to be mapped and placed into
groups.
This allows --filemap to be used with a glob:
# dmstats create --filemap *
rhel5.10-1.qcow2: Created new group with 87 region(s) as group ID 1564.
rhel5.10.qcow2: Created new group with 8 region(s) as group ID 1651.
rhel7.0-1.qcow2: Created new group with 11 region(s) as group ID 1659.
rhel7.0.qcow2: Created new group with 1454 region(s) as group ID 1670.
vm.img: Created new group with 2 region(s) as group ID 3124.
lvconvert --splitcache VG/CachePool_corig
Allow the split via the hidden/used cache pool for the time being,
since the new lvconvert code did intend to allow it, but was just
missing the exception in the list of hidden LVs that were allowed.
The preferred method for splitcache is to run it on the visible
cache LV, not the hidden cache pool. That may eventually become
the only method since we try to avoid running commands on
hidden LVs.
When a 'dmstats create --filemap' operation fails (e.g. during
open(2), close(2), or dm_stats_create_regions_from_fd()), use the
canonical version of the path. This avoids cryptic/confusing error
messages when symbolic links exist in the path argument given:
# findmnt /var/lib/libvirt/images -otarget,source
TARGET SOURCE
/var/lib/libvirt/images /dev/mapper/vg_hex-lv_images
# readlink /var/lib/libvirt/images/my.img
/boot/my.img
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /var/lib/libvirt/images/my.img
Command failed
Using the canonical path the error is immediately obvious:
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /boot/my.img
Command failed
Grouping is also useful in combination with --segments: creating a
group allows both individual segment data and data for the device
as a whole to be presented in the same report.
Support grouping for 'create --segments' in the same manner as for
'create --filemap'; group regions by default, applying an optional
alias specified with --alias, unless the user specifies --nogroup.
Support aggregate group and region histograms by allocating a new
histogram from the pool and populating it with a sum of the histogram
data for the areas contained in the region or group.
To avoid repeatedly summing the same histogram data, cache the pointer
in the group and regions structs for subsequent access. The aggregate
histograms are allocated from the same pool as the area histograms in
the corresponding handle and will be discarded at each list or populate
operation.
Add a new option to the create command to create regions that map the
extents of a file:
# dmstats create --filemap /path/to/file
/path/to/file: Created new group with 10 region(s) as group ID 0.
When performing a --filemap no device argument is required (and
supplying one results in error) since the device to bind to is implied
by the file path and is obtained directly from an fstat().
Grouping may be optionally disabled by the --nogroup switch: in this
case the command will report each region individually:
# dmstats create --nogroup --filemap /path/to/file
/path/to/file: Created new region with 1 area as region ID 0.
/path/to/file: Created new region with 1 area as region ID 1.
/path/to/file: Created new region with 1 area as region ID 2.
When grouping regions the group alias is automatically set to the
basename (as returned by dm_basename()) of the provided file.
This can be overridden to a user-defined value at the command line by
use of the --alias option.
If grouping is disabled no alias can be set.
Use of offset and subdivision options (--start, --length, --segments,
--areas, --areasize).
Setting aux_data and histograms for groups is possible but is not
currently implemented.
Add a call to create dmstats regions that correspond to the extents
present in a file descriptor open on a file in a local file system.
The file must reside on a file system type that correctly supports
physical extent location data in the FIEMAP ioctl.
Regions are optionally placed into a group with a user-defined alias.
File systems that do not support physical offsets in FIEMAP (btrfs
currently) are detected via fstatfs() - although attempting to map
a --filemap group on btrfs will fail anyway with the generic error
"Not on a device-mapper device" this is confusing; the file system
mount is on a device-mapper device, but btrfs' volume layer masks
this in the returned st_dev field since the returned logical file
extents may span multiple physical devices.
The function _stats_remove_region_id_from_group() incorecctly set
the group_id to DM_STATS_GROUP_NOT_PRESENT _before_ the call to
_stats_group_destroy(). This will cause the destroy function to
return immediately without doing anything:
339 static void _stats_group_destroy(struct dm_stats_group *group)
340 {
341 if (!_stats_group_present(group))
342 return;
Invalidating the ID in _stats_region_region_id_from_group() is
redundant anyway; it is rightly done as the last operation in
_stats_group_destroy (and it is not possible for anything to see
the old value between the two calls).
Remove the change to group_id to ensure that the alias and bitset
resources are correctly freed.
The call to dm_stats_walk_start() before the do statement makes
dm_stats_walk_do() behave inconsistently depending on context;
wrap them in an additional do { } while (0) so that the macro
always expands to a valid statement.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/RaidLV" would fail, but
"lvconvert --mirrors 0 VG/RaidLV" would succeed.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/MirrorLV" would fail, but
"lvconvert --mirrors 0 VG/MirrorLV" would succeed.
If after extracting stats arguments and group tags nothing remains
of aux_data but '-' set the region->aux_data field to the empty
string to match behaviour for non-grouped regions.
Although not harmful do not allow a group containing regions with
histograms since it is not currently possible to present histogram
data aggregated for the group.
Although a non-zero value for the number of ticks spent doing IO
should imply a non-zero number of IOs in the interval test for
this explicitly to avoid a divide-by-zero in the event of bad
counter data.
It's possible for interval_ns to be zero if the interval is not
set or the clock is misconfigured. Test for this before using the
value as the divisor in the utilisation calculation.
Walk flags are ULL constants; cast the result to a uint64_t before
logging with a FMTx64 format specifier to avoid a compiler warning:
warning: format ‘%lx’ expects argument of type ‘long unsigned int’,
but argument 5 has type ‘long long unsigned int’
Make it clear that the "aux data" presented in reports is the user
data stored in the field (and does not include any library-internal
state such as group descriptors) by renaming the field to user_data
and changing the heading to "UserData".
Make it clear in libdevmapper.h, and in function argument names, that
libdm-stats uses the aux_data field internally and that any values set
for user_data are appended to the library values before being stored
with a region, and similarly, that internal data fields will be stripped
prior to returning any previously stored user_data.
Replace --statstype=area,region,group with a separate switch for
each object type: --area, --region, --group. Omitting any object
type switch will use the defaults for the current command (regions
and groups for list, and regions, groups and areas for verbose list).
Replace the 'name' field with 'statsname' in order to report alias
names for groups, and include the 'group_id' field between statsname
and the 'region_id' field to make it clear to the user when groups
are in use.
Walk avaiable groups and regions (in addition to areas) and report
aggregate statistics and properties.
A new switch is added to filter the type of obects inclued in the
report:
--statstype={all,area,region,group}
The type of the current row is also available in a new
DR_STATS_META field 'type'.
To allow the names used to describe statistics report objects to
change (for e.g to support groups and region and group aliases)
introduce a new "stats_name" field that evaluates to the correct
name for the object being reported.
Add a pair of commands to create and delete stats groups:
dmstats group --regions REGIONS
dmstats ungroup --groupid ID
REGIONS specifies a list of regions to be included in the group.
Regions are specified as a comma separated list in order of
increasing region ID. Ranges may be specified as a hypen separated
pair of values giving the first and last member of the range.
Add support do dm_stats_walk*() to walk over the set of
available groups using the cursor embedded in the dm_stats
handle, and to obtain the type of the object at the current
stats cursor location. A set of flags is introduced to
control which objects are visited:
DM_STATS_WALK_AREA
DM_STATS_WALK_REGION
DM_STATS_WALK_GROUP
DM_STATS_WALK_ALL
A final flag suppresses visits to regions that contain only a
single area - since the aggregate of such a region is idential
to the area it contains this allows these duplicates to be
filtered out:
DM_STATS_WALK_SKIP_SINGLE_AREA
If flags are not initialised before beginning a walk the default
set matches the behaviour of previous versions of the library.
Also accept group identifiers as immediate arguments to the
counter, metric, and property functions by adding control
flags to the region and area identifiers passed in.
Region and area properties are mapped to their equivalents for
the group (for example: group size is reported as the sum of
all regions contained in the group). Counter and metric values
are aggregated for the region or group.
Introduce constants for the buffer sizes that libdm-stats uses:
one for messages sent to the kernel, one for rows of response data
returned, and a pair for the "start+len" range and histogram bounds
strings.
Add a grouping facility to the libdm-stats library that allows the
user to bind several regions together as a group. Groups may be
used to aggregate data from several regions for reporting, or to
select and sort among large sets of regions.
A textual descriptor ("group tag") is associated with each group
and is stored in the first group member's aux_data field. The
tag contains the group member list and an optional alias for the
group, allowing the user to assign meaningful names to groups of
regions.
These descriptors are parsed in @stats_list message responses and
populate the resulting region and area tables with the group
structure.
Groups with overlapping regions are permitted but since this will
result in some events being counted more than once a warning is
printed in this case.
Nested and overlapping groups are not currently supported and
attempting to create these configurations results in error.
Add a new enum based interface for accessing counter and metric
values that uses a single function for each:
uint64_t dm_stats_get_counter(const struct dm_stats *dms,
dm_stats_counter_t counter
uint64_t region_id, uint64_t area_id);
int dm_stats_get_metric(const struct dm_stats *dms, int metric,
uint64_t region_id, uint64_t area_id,
double *value);
This simplifies the implementation of value aggregation for
groups of regions. The named function interface now calls the
enum interface internally so that all new functionality is
available regardless of the method used to retrieve values.
Cache the device-mapper name of a bound device in the dm_stats
handle.
This will be used by stats groups to report a device name or
user defined alias for groups.
The device-mapper name, device numbers and uuid stored in the
dm_stats handle are used only to bind the handle to a specific
device in order to issue ioctls.
Rename them to "bind_*" to reflect this usage in preparation
for caching the device-mapper name of the bound device in the
dm_stats handle.
This will be used to allow optional aliases to be set for
dmstats groups.
Add a function to parse a list of integer values and ranges into
a dm_bitset representation. Individual values signify that that bit
is set in the resulting mask and ranges are given as a pair of
start and end values, M-N, such that M and N are the first and
last members of the range (inclusive).
The implementation is based on the kernel's __bitmap_parselist()
that is used for cpumasks and other set configuration passed in
string form from user space.
TODO: it might be better to log dmeventd messages with test output
just like we do with clvmd - maybe we will switch to this one
instead of extra DMEVENTD log file in future....
Add config for mkfs to get more predicatable results
when using mkfs across variety of distributions.
In future maybe use this per all tests as default.
For now user has to specify in a test MKE2FS_CONFIG envvar to use it.
When force removing thin-pool we loose 'real' access to hidden device,
and if such pool is in suspended state, any thin volume cannot be
dropped. It likely should be also checked by dmsetup, but meanwhile
apply simple logic - try to force remove first all higher minors first
with assumption we first create thin-pool and then thin volume
and there are usually not being released lower dm numbers to
get the order wrong.
With a single report (--count=1) no timerfd is set up and the cycle
and current timestamps should be freed during the single call to
_update_interval_times().
This patch fixes link validation for used thin-pool.
Udev rules correctly creates symlinks only for unused new thin-pool.
Such thin-pool can be used by foreing apps (like Docker) thus
has /dev/vg/lv link.
However when thin-pool becomes used by thinLV - this link is no
longer exposed to user - but internal verfication missed this
and caused messages like this to be printed upon 'vgchange -ay':
The link /dev/vg/pool should have been created by udev but it was not
found. Falling back to direct link creation.
And same with 'vgchange -an':
The link /dev/vg/pool should have been removed by udev but it is still
present. Falling back to direct link removal.
This patch ensures only unused thin-pool has this link.
Run umount code only when either thin data or metadata are
above 95% - so if there are resize failures with 60%.
system fill keep running.
Also umount will only be tried with lvm2 LVs.
Foreign users are ATM unsuppored.
Add new logic to identify each unique operation and route
it to the correct function to perform it. The functions
that perform the conversions remain unchanged.
This new code checks every allowed combination of LV type
and requested operation, and for each valid combination
calls the function that performs that conversion.
The first stage of option validation which checks for
incompatible combinations of command line options, is done
done before process_each is called. This is unchanged.
(This new code will allow that first stage validation to
be simplified in a future commit.)
The second stage of checking options against the specific
LV type is done by this new code. For each valid combination
of operation + LV type, the new code calls an existing
function that implements it.
With this in place, the ad hoc checks for valid combinations
of LV types and operations can be removed from the existing
code in a future commit.
(The #if 0 is used to keep the patch clean, and the
disabled code will be removed by a following patch.)
There are detailed messages inside _create_dir_recursive that
dm_create_dir calls (except EROFS which where the message is not
generated, like anywhere else in the code).
When we test Vg.LvCreateRaid some of the hidden LVs volume type go from
'I' to 'i' between the time it takes us to create the LV and
the time it takes to call into refresh to verify the service is up to date.
This is a fairly rare occurance.
We call 'lvm help' to find out if fullreport is supported. Lvm
dumps help to stderr. Common code prints a warning if we exit
with 0, but have something in stderr so we are skipping the warning
message.
The following operations would hang if lvm was compiled with
'enable-notify-dbus' and the client specified -1 for the timeout:
* LV snapshot merge
* VG move
* LV move
This was caused because the implementation of these three dbus methods is
different. Most of the dbus method calls are executed by gathering information
needed to fulfill it, placing that information on a thread safe queue and
returning. The results later to be returned to the client with callbacks.
With this approach we can process an arbitrary number of commands without any
of them blocking other dbus commands. However, the 3 dbus methods listed
above did not utilize this functionality because they were implemented with a
separate thread that handles the fork & exec of lvm. This is done because these
operations can be very slow to complete. However, because of this the lvm
command that we were waiting on is trying to call back into the dbus service to
notify it that something changed. Because the code was blocking the process
that handles the incoming dbus activity the lvm command blocked. We were stuck
until the client timed-out the connection, which then causes the service to
unblock and continue. If the client did not have a timeout, we would have been
hung indefinitely.
The fix is to always utilize the worker queue on all dbus methods. We need to
ensure that lvm is tested with 'enable-notify-dbus' enabled and disabled.
Apply the same idea as vg_update.
Before doing the VG remove on disk, invalidate
the VG in lvmetad. After the VG is removed,
remove the VG in lvmetad. If the command fails
after removing the VG on disk, but before removing
the VG metadata from lvmetad, then a subsequent
command will see the INVALID flag and not use the
stale metadata from lvmetad.
Previously, a command sent lvmetad new VG metadata in vg_commit().
In vg_commit(), devices are suspended, so any memory allocation
done by the command while sending to lvmetad, or by lvmetad while
updating its cache could deadlock if memory reclaim was triggered.
Now lvmetad is updated in unlock_vg(), after devices are resumed.
The new method for updating VG metadata in lvmetad is in two phases:
1. In vg_write(), before devices are suspended, the command sends
lvmetad a short message ("set_vg_info") telling it what the new
VG seqno will be. lvmetad sees that the seqno is newer than
the seqno of its cached VG, so it sets the INVALID flag for the
cached VG. If sending the message to lvmetad fails, the command
fails before the metadata is committed and the change is not made.
If sending the message succeeds, vg_commit() is called.
2. In unlock_vg(), after devices are resumed, the command sends
lvmetad the standard vg_update message with the new metadata.
lvmetad sees that the seqno in the new metadata matches the
seqno it saved from set_vg_info, and knows it has the latest
copy, so it clears the INVALID flag for the cached VG.
If a command fails between 1 and 2 (after committing the VG on disk,
but before sending lvmetad the new metadata), the cached VG retains
the INVALID flag in lvmetad. A subsequent command will read the
cached VG from lvmetad, see the INVALID flag, ignore the cached
copy, read the VG from disk instead, update the lvmetad copy
with the latest copy from disk, (this clears the INVALID flag
in lvmetad), and use the correct VG metadata for the command.
(This INVALID mechanism already existed for use by lvmlockd.)
This fixes commit 0ba5f4b8e9 which moved
field recalculation (field width and sort position) from
dm_report_object to dm_report_output but it didn't handle the case when
dm_report_column_headings was used separately to report headings (before
dm_report_outpout call) and hence we ended up with intial widths for
fields in the headings.
If we're using dm_report_column_headings, we need to recalculate
fields if we haven't done so yet, the same way as we do in
dm_report_output.
Simplify code around _do_get_report_selection - remove "expected_idxs[]"
argument which is superfluous and add "allow_single" switch instead to
allow for recognition of "--configreport <report_name> -S" as well as
single "-S" if needed.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 961 in _do_report_get_selection()
Null pointer dereferences (FORWARD_NULL) Dereferencing null pointer "single_args".
Uninitialized variables (UNINIT) /safe/guest2/covscan/LVM2.2.02.158/tools/toollib.c: 3520 in _process_pvs_in_vgs()
Uninitialized variables (UNINIT) Using uninitialized value "do_report_ret_code".
Null pointer dereferences (REVERSE_INULL) /safe/guest2/covscan/LVM2.2.02.158/libdm/libdm-report.c: 4745 in dm_report_output()
Null pointer dereferences (REVERSE_INULL) Null-checking "rh" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
Incorrect expression (MISSING_COMMA) /safe/guest2/covscan/LVM2.2.02.158/lib/log/log.c: 280 in _get_log_level_name()
Incorrect expression (MISSING_COMMA) In the initialization of "log_level_names", a suspicious concatenated string ""noticeinfo"" is produced.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 816 in_get_report_options()
Null pointer dereferences (FORWARD_NULL) Comparing "mem" to null implies that "mem" might be null.
This reverts commit fa69ed0bc8.
This code sometimes expects to be presented with a read-only filesystem
(during some boot sequences for example) and copes appropriately with
this and it should not lead to expected error messages that might cause
unnecessary alarm.
lv_name arg is only used without known LV for resolving '*lv'.
Once we know *lv, never use lv_name ever again.
So setting it when passing *lv has not needed.
Add code to support more LVs to be resized through a same code path
using a single lvresize_params struct.
(Now it's used for thin-pool metadata resize,
next user will be snapshot virtual resize).
Update code to adjust percent amount resize for use_policies.
Properly activate inactive thin-pool in case of any pool resize
as the command should not 'deffer' this operation to next activation.
Use common API design and pass just LV pointer to lv_manip.c functions.
Read cmd struct via lv->vg->cmd when needed.
Also do not try to return EINVALID_CMD_LINE error when we
have already openned VG - this error code can only be returned before
locking VG.
We have only 2 users of _lv_active() - one was already checking for ==1
while the other use (_lv_is_active()) could have take '-1' as a sign of having
an LV active. So return 0 and log_debug also the reason while detection
has failed (i.e. in case --driverload n - it's kind of expectable,
but might have confused user seeing just <backtrace>).
This fixes commit f50d4011cd which
introduced a problem when using older lvm2 code with newer libdm.
In this case, the old LVM didn't recognize new _LOG_BYPASS_REPORT flag
that libdm-report code used. This ended up with no output at all
from libdm where log_print_bypass_report was called because the
_LOG_BYPASS_REPORT was not masked properly in lvm2's print_log fn
which was called as callback function for logging.
With this patch, the lvm2 registers separate print_log_libdm logging
function for libdm instead. The print_log_libdm is exactly the same
as print_log (used throughout lvm2 code) but it checks whether we're
printing common line on output where "common" means not going to stderr,
not a warning and not an error and if we are, it adds the
_LOG_BYPASS_REPORT flag so the log_print goes directly to output, not
to any log report.
So this achieves the same goal as in f50d4011cd,
just doing it in a way that newer libdm is still compatible with older
lvm2 code (libdm-report is the only code using log_print).
Looking at the opposite mixture - older libdm with newer lvm2 code,
that won't be compilable because the new log report functionality
that is in lvm2 also requires new dm_report_group_* libdm functions
so we don't need to care here.
Move code from original print_log fn to a separate _vprint_log function
that accepts va_list and make print_log a wrapper over _vprint_log.
The print_log just initializes the va_list and uses it for _vprint_log
call now. This way, we can reuse _vprint_log if needed.
In commit 6ae22125, vgcfgrestore began disabling lvmetad
while running, and rescanned to enable it again at the end,
but missed the rescanning/enabling in the error case.
Reconnect to lvmetad if either the send fails (e.g. lvmetad
was restarted since lvmlockd last connected), or if no
lvmetad connection exists (e.g. lvmetad was started after
lvmlockd so no previous connection existed.)
Currently, a shutdown signal will cause lvmetad to quit
responding to new connections, but not actually exit until
all connections are gone. If a program is maintaining a
long running connection (e.g. lvmlockd, or even an lvm
command) when lvmetad gets a shutdown signal, then all
further commands will hang indefinately waiting for a
response that won't be sent.
With this patch, make lvmetad continue handling new
connections even after a shutdown signal. It will exit
once all connections are gone.
Previously, vgcfgrestore would attempt to vg_remove the
existing VG from lvmetad and then vg_update to add the
restored VG. But, if there was a failure in the command
or with vg_update, the lvmetad cache would be left incorrect.
Now, disable lvmetad before the restore begins, and then
rescan to populate lvmetad from disk after restore has
written the new VG to disk.
Reporting commands can be of different types (even if the command name
is the same):
- pvs command can be either of PVS, PVSEGS or LABEL report type,
- vgs command is of VGS report type,
- lvs command is of LVS or SEGS report type.
Use basic report type when looking for report prefix used for
--configreport option.
This means that:
- 'pvs --configreport pv' applies to PVS, PVSEGS or LABEL report type
- 'vgs --configreport vg' applies to VGS report type
- 'lvs --configreport lv' applies to LVS and SEGS report type
The DM_REPORT_OUTPUT_MULTIPLE_TIMES instructs reporting code to
keep rows even after dm_report_output call - the rows are not
destroyed in this case which makes it possible to call dm_report_output
multiple times.
This allows for moving parts of the code from dm_report_object to
dm_report_output which is important for subsequent patches that allow
for repeated dm_report_output, not destroying rows on each
dm_report_output call.
log_print is used during cmd line processing to log the result of the
operation (e.g. "Volume group vg successfully changed" and similar).
We don't want output from log_print to be interleaved with current
reports from group where log is reported as well. Also, the information
printed by log_print belongs to the log report too, so it should be
rerouted to log report if it's set.
Since the code in libdm-report which is responsible for doing the report
output uses log_print too, we need to use a different kind of log_print
which bypasses any log report currently used for logging (...simply,
we can't call log_print to output the log report itself which in turn
would again reroute to report - the report would never get on output
this way).
This patch adds structures and functions to reroute error and warning
logs to log report, if it's set.
There are 5 new functions:
- log_set_report
Set log report where logging will be rerouted.
- log_set_report_context
Set context globally so any report_cmdlog call will use it.
- log_set_report_object_type
Set object type globally so any report_cmdlog call will use it.
- log_set_report_object_name_and_id
Set object ID and name globally so any report_cmdlog call will use it.
- log_set_report_object_group_and_group_id
Set object group ID and name globally so any report_cmdlog call will use it.
These functions will be called during LVM command processing so any logs
which are rerouted to log report contain proper information about current
processing state.
The lvm fullreport works per VG and as such, the vg, lv, pv, seg and
pvseg subreport is done for each VG. However, if the PV is not part of
any VG yet, we still want to display pv and pvseg subreports for these
"orphan" PVs - so enable this for lvm fullreport's process_each_vg call.
If we have fullreport, make sure that the options/sort keys used for
each report doesn't change its type - we want to preserve the original
type so it's always 5 different subreports within fullreport (vg, lv, pv,
seg, pvseg). Since we have all report types within fullreport, users
should add fields under proper subreport type - this minimizes
duplication of info displayed on output.
Groupable args (the ones marked with ARG_GROUPABLE flag) start a new
group of args if:
- this is the first time we hit such a groupable arg,
- or if non-countable arg is repeated.
However, there may be cases where we want to give priorities when
forming groups and hence force new group creation if we hit an arg
with higher grouping priority.
For example, let's assume (for now) hypothetical sequence of args used:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
Without giving any priorites, we end up with:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\__________GROUP1___________/ \________GROUP2___________/ \_GROUP3_/
This is because we hit "-o" as the first groupable arg. The --configreport,
even though it's groupable too, it falls into the previous "-o" group.
While we may need to give priority to the --configreport arg that should
always start a new group in this scenario instead:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\_GROUP1_/ \_________GROUP2___________/ \_________GROUP3__________/
So here "-o" started a new group but since "--configreport" has higher
priority than "-o", it starts fresh new group now and hence the rest of
the command line's args are grouped by --configreport now.
lvm fullreport executes 5 subreports (vg, pv, lv, pvseg, seg) per each VG
(and so taking one VG lock each time) within one command which makes it
easier to produce full report about LVM entities.
Since all 5 subreports for a VG are done under a VG lock, the output is
more consistent mainly in cases where LVM entities may be changed in
parallel.
Add any report (pvs/vgs/lvs) currently processed to current report group
which is part of processing handle and which already contains log
report. This way both log report and pvs/vgs/lvs report will be
reported as a whole within a group, thus having same output format as
selected by --reportformat option.
If there's parent processing handle, we don't need to create completely
new report group and status report - we'll just reuse the one already
initialized for the parent.
Currently, the situation where this matter is when doing internal report
to do the selection for processing commands where we have parent processing
handle for the command itself and processing handle for the selection
part (that is selection for non-reporting tools).
Wire up report group creation with log report in struct
processing_handle and call report_format_init during processing handle
initialization (init_processing_handle fn) and destroy it while
destroing processing handle (destroy_processing_handle fn).
This way, all the LVM command processing using processing handle
has access to log report via which the current command log
can be reported as items are processed.
Separating common report and per-report arguments prepares the code for
handling several reports per one command (for example, the command log
report and LVM command report itself).
Each report can have sort keys, options (fields), list of fields to
compact and selection criteria set individually. Hooks for setting these
per report within one command will be a part of subsequent patches, this
patch only separates new struct single_report_args out of existing
struct report_args.
New report/output_format configuration sets the output format used
for all LVM commands globally. Currently, there are 2 formats
recognized:
- basic (the classical basic output with columns and rows, used by default)
- json (output is in json format)
Add new --reportformat option and new report_format_init function that
checks this option and creates new report group accordingly, also
preparing log report handle and adding it to the report group just
created.
This is a preparation for new CMDLOG report type which is going to be
used for reporting LVM command log.
The new report type introduces several new fields (log_seq_num, log_type,
log_context, log_object_type, log_object_group, log_object_id, object_name,
log_message, log_errno, log_ret_code) as well as new configuration settings
to set this report type (report/command_log_sort and report/command_log_cols
lvm.conf settings).
This patch also introduces internal report_cmdlog helper function
which is a wrapper over dm_report_object to report command log via
CMDLOG report type and which is going to be used throughout the code
to report the log items.
This patch introduces DM_REPORT_GROUP_JSON report group type. When using
this group type and when pushing a report to such a group, these flags
are automatically unset:
DM_REPORT_OUTPUT_ALIGNED
DM_REPORT_OUTPUT_HEADINGS
DM_REPORT_OUTPUT_COLUMNS_AS_ROWS
...and this flag is set:
DM_REPORT_OUTPUT_BUFFERED
The whole group is encapsulated in { } for the outermost JSON object
and then each report is reported on output as array of objects where
each object is the row from report:
{
"report_name1": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
],
"report_name2": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
]
...
}
This patch introduces DM_REPORT_GROUP_BASIC report group type. This
type has exactly the classical output format as we know from before
introduction of report groups. However, in addition to that, it allows
to put several reports into a group - this is the very basic grouping
scheme that doesn't change the output format itself:
Report: report1_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
Report: report2_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
There's no change in output for this report group type - with this type,
we only make sure there's always only one report in a group at a time,
not more.
This patch introduces DM report group (represented by dm_report_group
structure) that is used to group several reports to make a whole. As a
whole, all the reports in the group follow the same settings and/or
formatting used on output and it controls that the output is properly
ordered (e.g. the output from different reports is not interleaved
which would break readability and/or syntax of target output format
used for the whole group).
To support this feature, there are 4 new functions:
- dm_report_group_create
- dm_report_group_push
- dm_report_group_pop
- dm_report_group_destroy
From the naming used (dm_report_group_push/pop), it's clear the reports
are pushed onto a stack. The rule then is that only the report on top
of the stack can be reported (that means calling dm_report_output).
This way we make sure that the output is not interleaved and provides
determinism and control over the output.
Different formats may allow or disallow some of the existing report
flags controlling output itself (DM_REPORT_OUTPUT_*) to be set or not so
once the report is pushed to a group, the grouping code makes sure that
all the reports have compatible flags set and then these flags are
restored once each report is popped from the report group stack.
We also allow to push/pop non-report item in which case such an item
creates a structure (e.g. to put several reports together with any
opening and/or closing lines needed on output which pose as extra
formatting structure besides formatting the reports).
The dm_report_group_push function accepts an argument to pass any
format-specific data needed (e.g. handle, name, structures passed
along while working with reports...).
We can call dm_report_output directly anytime we need (with the only
restriction that we can call dm_report_output only for the report that
is currently on top of the group's stack). Or we don't need to call
dm_report_output explicitly in which case all the reports in a stack are
reported on output automatically once we call dm_report_group_destroy.
This fixes a problem in commit ae0a8740c. The problem
in that commit was that all existing PVs are initially
dropped from lvmetad. This works if the VG is updated
at the end, which replaces the dropped PVs, but if the
rescan finds that the VG seqno is unchanged, it leaves
the cached VG in place. So, we should only drop the
existing PVs in lvmetad when the VG is going to be updated.
commit 15da467b was meant to address the case where
use_lvmetad=1 in lvm.conf, and lvmetad is not available,
in which case, pvscan --cache -aay should activate LVs.
But the commit unintentionally also changed the case
where use_lvmetad=0 in lvm.conf, in which case
pvscan --cache -aay should not activate LVs, so fix
that here.
When pvscan --cache -aay fails to connect to lvmetad it will
simply exit and do nothing. Change this so that it will
skip the lvmetad cache step and do the activation step from
disk.
We were initially looking to see if an LV was hidden and if it was we were
creating an instance of a LvCommon object to represent it. Thus if we
had a hidden cache pool for example we were missing the methods and
properties for the cache pool. However, when we create the object path,
any hidden LVs, regardless of type/functionality will be placed in the
hidden path.
The object manager method get_object_by_lvm_id was used in many cases for
the sole reason of getting the object path for the object. Instead of
retrieving the object and then calling 'dbus_object_path' on the object, we
are adding a method which returns the object path.
When we are processing the LVs we need to build up dbus objects from least
dependent to most dependent, so that we have information available when
constructing.
Original code missed to catch all apperances of SIGINT.
Also enhance logging when running in shell without tty.
Accept this regex as valid input:
'^[ ^t]*([Yy]([Ee]([Ss]|)|)|[Nn]([Oo]|))[ ^t]*$'
Some commands scan labels to populate lvmcache multiple
times, i.e. lvmcache_init, scan labels to fill lvmcache,
lvmcache_destroy, then later repeat
Each time labels are scanned, duplicates are detected,
and preferred devices are chosen. Each time this is done
within a single command, we want to choose the same
preferred devices. So, check for existing preferences
when choosing preferred devices.
This also fixes a problem with the list of unused duplicate
devs when run in an lvm shell. The devs had been allocated
from cmd memory, resulting in invalid list entries between
commands.
A number of places are working on a specific dev when they
call lvmcache_info_from_pvid() to look up an info struct
based on a pvid. In those cases, pass the dev being used
to lvmcache_info_from_pvid(). When a dev is specified,
lvmcache_info_from_pvid() will verify that the cached
info it's using matches the dev being processed before
returning the info. Calling code will not mistakenly
get info for the wrong dev when duplicate devs exist.
This confusion was happening when scanning labels when
duplicate devs existed. label_read for the first dev
would add an info struct to lvmcache for that dev/pvid.
label_read for the second dev would see the pvid in
lvmcache from first dev, and mistakenly conclude that
the label_read from the second dev can be skipped
because it's already been done. By verifying that the
dev for the cached pvid matches the dev being read,
this mismatch is avoided and the label is actually read
from the second duplicate.
If a command gets stuck during an lvmetad update, lvmetad
will cancel that update after the timeout. The next command
to check the lvmetad will see that lvmetad needs to be
populated because lvmetad will return token of "none" after
a timed out update (same as when lvmetad is not populated
at all after starting.)
If a command gets an error during an lvmetad update, it
will now just quit and leave its updating token in place.
That update will be cancelled after the timeout.
Commit #5b3a4a9 caused the "name" variable to be cleared if
declaration and assignment is on two lines so put it back
so it's on one line for it to work again.
pvmove began processing tags unintentionally from commit,
6d7dc87cb pvmove: use toollib
pvmove works on a single PV, but tags can match multiple PVs.
If we allowed tags, but processed only the first matching PV,
then the resulting PV would be unpredictable.
Also, the current processing code does not allow us to simply
report an error and do nothing if more than one PV matches the tag,
because the command starts processing PVs as they are found,
so it's too late to do nothing if a second PV matches.
If configuration consists of several sources in config cascade
("config cascade" defined in man lvmconfig(8)), lvmconfig displayed
only difference from defaults of the topmost config in the cascade.
Fix lvmconfig to display complete difference, considering all
the configuration in the cascade.
For example, before this patch:
(use_lvmetad=0 set in lvm.conf which differs from defaults)
$ lvmconfig --type diff
global {
use_lvmetad=0
}
(compact_output=1 set on cmd line)
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
(headings=0 set in profile)
$ lvmconfig --type diff --commandprofile test
report {
headings=0
}
(difference in topmost configuration source is displayed)
$ lvmconfig --type diff --commandprofile test --config report/compact_output=1
report {
compact_output=1
}
With this patch applied (the config cascade is merged before looking for
difference from defaults in configuration):
$ lvmconfig --type diff
global {
use_lvmetad=0
}
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test
report {
headings=0
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test --config report/compact_output=1
report {
headings=0
compact_output=1
}
global {
use_lvmetad=0
}
Treat loop device created with 'losetup -P' as regular
partitioned device - so if it has partition table,
prevent its usage in commands like 'pvcreate'.
Before 'pvcreate /dev/loop0' could have erased and formated as PV,
after this patch, device is filtered out and cannot be used.
All the variables for sscanf in lvmlockctl.c and lvmlockd-sanlock.c are
zeroed before sscanf call so the failure is detected by seeing the zero
value instead of proper one in subsequent code - so use (void) for
sscanf calls to ignore return value here.
Before this fix, when reporting 'lvm devtypes', the report was
initialized with incorrect reserved values - the ones used for
pvs/vgs/lvs report were used instead of NULL value (because devtypes
doesn't have any reserved values).
For example, trying to (incorrectly) use lv_name for the -S|--select
with lvm devtypes which doesn't have this field at all:
Before this patch (internal error issued):
$ lvm devtypes -S 'lv_name=lvol0'
Internal error: _check_reserved_values_supported: field-specific reserved value of type 0x0 for field not supported
Internal error: dm_report_init_with_selection: trying to register unsupported reserved value type, skipping report selection
DevType MaxParts Description
aoe 16 ATA over Ethernet
ataraid 16 ATA Raid
bcache 1 bcache block device cache
...
With this patch applied (correct error displayed about
unrecognized selection field):
$ lvm devtypes -S 'lv_name=lvol0'
Device Types Fields
-------------------
devtype_name - Name of Device Type exactly as it appears in /proc/devices. [string]
devtype_max_partitions - Maximum number of partitions. (How many device minor numbers get reserved for each device.) [number]
devtype_description - Description of Device Type. [string]
Special Fields
--------------
selected - Set if item passes selection criteria. [number]
help - Show help. [unselectable number]
? - Show help. [unselectable number]
Unrecognised selection field: lv_name
Selection syntax error at 'lv_name=lvol0'.
Use 'help' for selection to get more help.
Convert fields into using a single status ioctl call per LV.
This is a bit tricky since when there are more complicated
stacks, at this moment its undefined which values should be shown.
It's clear we need to cache more then single ioctl per LV,
but also we need to define more explicitely relation between
reported values for snapshots.
This patch is not a final state, rather a transitional step.
It should not be giving more 'worst' values then previous
many-ioctl-calls-per-lv solution.
Add function to obtain percentage value for cache lv_seg_status.
This API is rather evolving 'middle' step as the ultimate goal
is segment API fuctionality.
But first we need to be clear at reporting level which values
are needed to be reported for which LVs and segments.
Add more code to properly store status for snapshot segment
maintaining lvm2 fiction of COW and snapshot internal volumes.
The key issue here is however not though-through reporting
logic - as there is no single answer for whole line state.
It not counting with layer and we may need few more ioctl to
cover all reporting needs depending upon what is actually
needed.
In reality we need to 'cache' more ioctl status queries for
individual LVs and their segments (so they checked at most once).
The other 'hard' topic for conversion is mirror segment handling.
Also we definitelly need to relocate some logic into segment's methods,
yet it might be complex as we have not clear border between targets.
TODO: define more clearly how are reporting fields defined in case
we 'stack' volumes like - cache of stacked thin LV snapshot origin.
lv_refresh_suspend_resume() has escaped with fail ret code
after failing suspend and could have left many volumes in suspend state.
So always unconditionally call resume also when suspend has failed.
To get better control when flushing is used add extra arg when
setting up dm task.
By default now check dm device status without flush.
(At this moment this should effect only thin and cache volumes).
Also switch dev_manager_thin_pool_status() to use more
readable 'flush' parameter instead of 'no_flush'.
Check first the LV is cow before even checking it's a merging COW.
Note: previosly merging_cow was also merging origin, so without
this explicit check it used to return '1' also when passed
LV has been merging origin.
When mirror/raid called copy_percent function to return,
when 100% was supposed to be returned, wrong float 100.0 value
could have been reported back instead of dm_percent_t DM_PERCENT_100.
There is broken API somewhere, since the function here rely on
actively being modifid VG content even when doing 'lvs' operation.
(extents_copies)
This reverts commit 8fd886f735.
This was a deliberate omission because logging token-by-token metadata
parsing greatly increases the amount of logging for hardly any benefit.
In general, only LVM config file settings need to be logged, and in
places where it's considered important to log particular elements of
metadata that should be done using specific log_* lines.
This area can be revisited.
In the same way that process_each_vg() can be passed
a single VG name to process, also allow process_each_lv()
to be passed a single VG name and LV name to process.
This refactors the code for autoactivation. Previously,
as each PV was found, it would be sent to lvmetad, and
the VG would be autoactivated using a non-standard VG
processing function (the "activation_handler") called via
a function pointer from within the lvmetad notification path.
Now, any scanning that the command needs to do (scanning
only the named device args, or scanning all devices when
there are no args), is done first, before any activation
is attempted. During the scans, the VG names are saved.
After scanning is complete, process_each_vg is used to do
autoactivation of the saved VG names. This makes pvscan
activation much more similar to activation done with
vgchange or lvchange.
The separate autoactivate phase also means that if lvmetad
is disabled (either before or during the scan), the command
can continue with the activation step by simply not using
lvmetad and reverting to disk scanning to do the
activation.
Add support for active cache LV.
Handle --cachemode args validation during command line processing.
Rework some lvm2 internal to use lvm2 defined CACHE_MODE enums
indepently on libdm defines and use enum around the code instead
of passing and comparing strings.
Don't use lvm_init() to create a full command context, which
does a lot of command setup (like connecting to daemons), which
is unnecessary for simply reading a value from lvm.conf.
Passing a NULL context arg to the lvm_config_ function is now
allowed, in which case lvm.conf is read without doing lvm
command setup.
A program may be using liblvm2app for simply checking a config
setting in lvm.conf. In this case, a full lvm context is not
needed, only cmd->cft (which are the config settings read from
lvm.conf).
lvm_config_find_bool() can now be passed a NULL lvm context
in which case it will only create cmd->cft, check the config
setting asked for, and destroy the cmd.
Only call lvm_init() when it's needed so that simply
loading the lvm python code in another program doesn't
make that program do lvm initialization.
The version call doesn't need a handle.
The garbage collection can just do lvm_quit to destroy
the command. The next call that needs lvm_init will
do it first.
When setting up a toolcontext, the lib init function
was detecting an error when there was none, and then
it was returning an incompletely initialized cmd struct
instead of NULL. The effect was that the lib would try
to use the uninitialized cmd struct and segfault.
This would happen if a non-fatal error occurred during
cmd setup, e.g. user permission failed on lvmetad socket,
causing cmd to fall back to scanning and not use lvmetad.
The only real error condition is when create_toolcontext
returns NULL. If cmd is returned, the lib can use it.
The _report fn is getting big - separate it in two:
- _report fn to get all the options and arguments
- _do_report fn for reporting itself
Also, place all the variables/arguments in one structure for easier
handling of the variables around.
If a command begins repopulating the lvmetad cache,
and fails part way through, it should set the disabled
state in lvmetad so other commands don't use bad data.
If a subsequent scan succeeds, the disabled state is
cleared.
If duplicate devices exist for a PV, and one device's
size matches the PV size, but the other doesn't, then
prefer the matching device.
If one device is used by an active LV, prefer that device.
When there are duplicate devices for a PV, one device
is preferred and chosen to exist in the VG. The other
devices are not used by lvm, but are displayed by pvs
with a new PV attr "d", indicating that they are
unchosen duplicate PVs.
The "duplicate" reporting field is set to "duplicate"
when the PV is an unchosen duplicate, and that field
is blank for the chosen PV.
Previously, duplicate PVs were processed as a side effect
of processing the "chosen" PV in lvmcache. The duplicate
PV would be hacked into lvmcache temporarily in place of
the chosen PV.
In the old way, we had to always process the "chosen" PV
device, even if a duplicate of it was named on the command
line. This meant we were processing a different device than
was asked for. This could be worked around by naming
multiple duplicate devs on the command line in which case
they were swapped in and out of lvmcache for processing.
Now, the duplicate devs are processed directly in their
own processing loop. This means we can remove the old
hacks related to processing dups as a side effect of
processing the chosen device. We can now simply process
the device that was named on the command line.
When the same PVID exists on two or more devices, one device
is preferred and used in the VG, and the others are duplicates
and are not used in the VG. The preferred device exists in
lvmcache as usual. The duplicates exist in a specical list
of unused duplicate devices.
The duplicate devs have the "d" attribute and the "duplicate"
reporting field displays "duplicate" for them.
'pvs' warns about duplicates, but the formal output only
includes the single preferred PV.
'pvs -a' has the same warnings, and the duplicate devs are
included in the output.
'pvs <path>' has the same warnings, and displays the named
device, whether it is preferred or a duplicate.
Wait to compare and choose alternate duplicate devices until
after all devices are scanned. During scanning, the first
duplicate dev is kept in lvmcache, and others are kept in a
new list (_found_duplicate_devs).
After all devices are scanned, compare all the duplicates
available for a given PVID and decide which is best.
If the dev used in lvmcache is changed, drop the old dev
from lvmcache entirely and rescan the replacement dev.
Previously the VG metadata from the old dev was kept in
lvmcache and only the dev was replaced.
A new config setting devices/allow_changes_with_duplicate_pvs
can be set to 0 which disallows modifying a VG or activating
LVs in it when the VG contains PVs with duplicate devices.
Set to 1 is the old behavior which allowed the VG to be
changed.
The logic for which of two devs is preferred has changed.
The primary goal is to choose a device that is currently
in use if the other isn't, e.g. by an active LV.
. prefer dev with fs mounted if the other doesn't, else
. prefer dev that is dm if the other isn't, else
. prefer dev in subsystem if the other isn't
If neither device is preferred by these rules, then don't
change devices in lvmcache, leaving the one that was found
first.
The previous logic for preferring a device was:
. prefer dev in subsystem if the other isn't, else
. prefer dev without holders if the other has holders, else
. prefer dev that is dm if the other isn't
When duplicate PVs are detected, set the disabled
flag so that commands will disable use of lvmetad.
This duplicate detection is done by lvmetad itself
when it's told about a single new PV with a PVID
that matches an existing PV on another device.
(This is different from the case where the command
is scanning all devices and detects the duplicate.)
Remove the "altdev" logic that attempted to keep
track of multiple devices for a single PV. It
is no longer used since lvmetad is disabled in
this case.
For raid1 use chunksize as bitmap-chunk specification.
Always enforce usage of bitmap - getting comparable outcome
as lvm2 raid support uses.
Add udev_wait after stopping md array - as in fact leg-device
are still in use by target even command has finished.
(mdadm --stop causes WATCH rule wakeup, and
ioctl(STOP_ARRAY) returns IMHO to early - it should finish
and fsync work on leg devices first).
Support parsing --chunksize option also when converting.
Now user can use cache pool created with i.e. 32K chunksize,
while in caching user can select 512K blocks.
Tool is supposed to validate cache metadata size is big enough
to support such chunk size. Otherwise error is shown.
When creating LV - in some case we change created segment type
(ATM for cache and snapshot) and we then manipulate with
lv segment according to 'lp' segtype.
Fix this by checking for proper type before accessing segment members.
This makes command like:
lvcreate --type cache-pool -L10 vg/cpool
lvcreate -H -L10 --cachesettings migtation_threshold=10000 vg/cpool
to pass since now tool correctly selects default cache policy.
Rather than doing repeated translations from name to
device when comparing args to existing PVs, do one
translation of the arg names and saving the device,
before checking existing PVs.
If there's an activation volume_filter, it might not be possible
to activate the rmeta LVs to wipe them. At least inherit any
LV tags from the parent LV while attempting this.
pvscan autoactivation has its own VG processing implementation,
so it can't properly handle things like foreign or shared VGs,
so make it ignore those VG types (or errors from them) as best
as possible.
Add a FIXME stating that pvscan autoactivation must really be
moved to the standard VG processing by calling process_each_vg
to do activation once the scanning / cache update is finished.
Checking for devices uses is_missing_pv() to check
if there is a device for the PV. is_missing_pv()
is based on the MISSING_PV flag, which does not
always correspond to !pv->dev. When using lvmetad,
a command like:
pvs --config 'devices/filter=["a|/dev/sdb|", "r|.*|"]'
will cause a number of PVs to have NULL pv->dev, but
not the MISSING_PV flag. So, NULL pv->dev needs to
also be checked.
Before executing modprobe for given module name, just check
if the module is not already present in /sys/module.
Useful when checking dm-cache-policy modules as we do not
having matching interface like for targets.
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Probably not worth mentioning "segments" here, just state that devices
for an LV can't be all found during the check - it's less mysterious for
user then:
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find all devices for LV fedora/root while checking used and assumed devices.
WARNING: Couldn't find all devices for LV fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
When checking assumed PVs against real devices used for LVs and if
there's no device assigned for an assumed PV (e.g. due to filters),
do log_warn instead of log_error and continue checking LV segments
and associated assumed PVs further, just like we do log_warn elsewhere
in this situation.
This way user will see the warning for each LV which couldn't be
checked completely against real PVs used. Before, we logged only
the very first occurence of missing device for an LV in a VG and we
returned from the function doing this check for all the LVs in VG
immediately which may be a bit misleading because it didn't tell
user about all the other LVs and whether they could be checked
or not.
For example, we have this setup:
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
/dev/vda2 fedora lvm2 a-- 19.49g 0
[0] fedora/~ # lvs -o+devices
LV VG Attr LSize Devices
root fedora -wi-ao---- 19.00g /dev/vda2(0)
swap fedora -wi-ao---- 500.00m /dev/vda2(4864)
Before this patch (only the very first LV in a VG is logged to have a
problem while checking used and assumed devices):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
With this patch applied (all LVs where we hit problem while checking
used and assumed devices are logged and it's warning, not error):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Check that @stats_list and @stats_print returned data in the
_stats_parse_list() and _stats_parse_region() functions before
attempting to operate on region and area values.
This avoids a coverity warning since fgets() could potentially
return no data from the memory buffer returned by the ioctl.
In both cases the ioctl would return an error, preventing these
functions from running, however it is cleaner to test for the
condition explicitly and fail in those cases.
If lvmetad is running, and a command opts to not use it
(--config global/use_lvmetad=0), and the command changes
metadata, then the metadata change is not visible to
lvmetad. Subsequent commands using lvmetad to change
metadata may cause corruption based on the invalid
lvmetad state.
Eventually we can set the disabled state in lvmetad
to prevent this problem, but for now print a warning
about the possibility.
When command is not using lvmetad because
use_lvmetad=0 in the config, but the lvmetad
pidfile exists, print a warning (previously
this checked for the socket existing instead
of the pidfile existing.)
vg/snapshotN should not appear anywhere.
No code should be showing this, but it was noticed in some logs last
week and we can deal with it in display_lvname().
When user requested on cmdline disabling of lvmetad/lvmpoll,
respect it and when lvmlockd requires these daemon,
Error configure with clear message about misconfiguration.
The lvmetad connection is created within the
init_connections() path during command startup,
rather than via the old lvmetad_active() check.
The old lvmetad_active() checks are replaced
with lvmetad_used() which is a simple check that
tests if the command is using/connected to lvmetad.
The old lvmetad_set_active(cmd, 0) calls, which
stopped the command from using lvmetad (to revert to
disk scanning), are replaced with lvmetad_make_unused(cmd).
The test was a weak attempt at verifying the special
combination of lvchange/vgchange -aay --sysinit, but
was only looking for lvmetad connection warnings.
Update the warning checks, and check the LV activation
state directly which is the main point.
Rename the test to reflect its purpose of checking
the -aay --sysinit combination.
Update the check about lvmetad running but not used.
Also add tests related to the new lvmetad disabled state.
lvm1 metadata is used here to test the disabled state
because lvm1 metadata is the first condition using the
disabled state.
After a device rescan that repopulates lvmetad,
if no reason for disabling lvmetad was seen
(lvm1 metadata or duplicate PVs), then clear
the disabled flag in lvmetad. This allows
commands to resume using the lvmetad cache
after the cause for disabling it has been removed.
Commands already check if the lvmetad token is valid,
and if not, they rescan devices to repopulate lvmetad
before running. Now, in addition to checking the
lvmetad token, they also check if the lvmetad disabled
flag is set. If so, they do not use the lvmetad cache
and revert to disk scanning.
A global flag in lvmetad indicates it has been disabled.
Other flags indicate the reason it was disabled.
These flags can be queried using get_global_info.
The lvmetactl debugging utility can set and clear the
disabled flag in lvmetad. Nothing else sets the
disabled flag yet.
Commands will check these flags after connecting to
lvmetad. If the disabled flag is set, the command
will not use the lvmetad cache, but revert to disk
scanning.
To test this feature:
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 0
disable_reason = "none"
token = "filter:3041577944"
$ vgs
(should report VGs from lvmetad)
$ lvmetactl set_global_disable 1
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 1
disable_reason = "DIRECT"
token = "filter:3041577944"
$ vgs
WARNING: Not using lvmetad because the disable flag was set directly.
(should report VGs without contacting lvmetad)
$ lvmetactl set_global_disable 0
$ vgs
(should report VGs from lvmetad)
process_each_pv was doing:
1. lvmcache_seed_infos_from_lvmetad()
sends pv_list request to lvmetad.
2. get_vgnameids()
sends vg_list request to lvmetad.
3. _get_all_devices()
first calls lvmcache_seed_infos_from_lvmetad(),
which is a no-op if it's already been called.
Because get_vgnameids() does not use the information
from lvmcache_seed_infos_from_lvmetad(), it does not
need to be called prior to get_all_devices where
it is actually needed.
Improve code for snapshot merge for readabilty
and also reduce number of tests needed to decide
if merging can or cannot be started.
(Further improving 9cccf5245a)
When dm_tree_find_node_by_uuid() fails to find passed uuid,
report in lof_debug the complete original uuid,
not the one stripped of LVM- prefix.
TODO: inspect manipulation with LVM- prefix here.
To recognize in runtime if we are merging or not
to make consistent decision between suspend and resume
add function to parse thin table line when add
merging thin device to the table.
A snapshot merge into its origin cannot be initiated while the devices
are in use. If there is outside interference (such as from udev),
the suspend (preload) and resume stages can reach conflicting decisions
about whether or not to proceed.
Try to make the logic more robust by checking the inactive or live
table during resume. (This is still not perfect.)
Commit 971ab733b7 ("thin: activation of
merging thin snapshot") also added an incorrect deactivation attempt
for non-thin LVs: find_snapshot(lv)->lv is not designed to be
activated and any attempt to deactivate it is incorrect.
Move checking the lvmetad state, and the possible rescan,
out of lvmetad_send() to the start of the command.
Previously, the token mismatch and rescan would occur
within lvmetad_send() for some other request. Now,
the token mismatch is detected earlier, so the
rescan can be done before the main command is in
progress. Rescanning deep within the processing of
another command will disturb the lvmcache state of
that other command.
A rescan already exists at the start of the command
for the case where foreign VGs are going to be read.
This same rescan is now also performed when there is
an lvmetad token mismatch (from a changed global_filter).
The commands pvscan/vgscan/lvscan/vgimport are excluded
from this preemptive checking/rescanning for lvmetad
because they want to do rescanning themselves explicitly.
If rescanning devices fails, then lvmetad has not been
correctly repopulated and should not be used, so make
the command revert to not using lvmetad.
When not obtaining device from udev, we are doing deep devdir scan,
and at the same time we try to insert everything what /sys/dev/block
knows about. However in case lvm2 is configured to use nonstardard
devdir this way it will see (and scan) devices from a real system.
lvm2 test suite is using its own test devdir with its
own device nodes. To avoid touching real /dev devices, validate
the device node exist in give dir and do not insert such device
into a cache.
With obtain list from udev this patch has no effect
(the normal user path).
We have _insert_dirs() for udev and non-udev compilation.
Compiling without udev missed to call dev_cache_index_devs().
Move the call after _insert_dirs() call so both compilation
gets it.
When scanning if device is being usable as PV,
we call STATUS - but this status should not cause
any flushing.
Skip also open_count information as it's not needed.
We don't have any report field of this type yet. Return this patch into
the play if we really need that. Currenly we always report status
(result of "status" dm ioctl) for an LV as a whole where we choose
segment which represents the LV, not calling status for each possible
segment it contains - we don't need this now so I'm removing it to
not make the code more complex uselessly.
Devices without "LVM-" uuid prefix have been generated by very old
version of lvm2 2.00 and 2.01.
Since version 2.02 all lvm2 devices are using prefix "LVM-".
However checking for present of ancient non prefixed devices does
take extra IOCTL per every call and for majority of todays user
it will not find anything new.
So use the assumption that users with kernel 3.X and newer are not
really using such old versions of lvm2 (year <2005) and with their
new kernel they are also using new version of lvm2 and skip
checking for them.
This change also makes trace logs more readable.
This is hotfix for RHBZ: https://bugzilla.redhat.com/1324537
However already the %FREE is not a good fit and we need something
better. Meanwhile make -l%PVS work at least as good as %FREE
for thin-pool.
TODO: this needs rework - it should be allocator to do all the size
decisions at one place.
Improved test script to verify lost mirror image does not
cause mirror corruption while mirror is in use.
TODO: add more cases (lost mlog...), lost image from 3leg mirror...
When leaving preload routine it should not altet state of flush required
when it's been already set to 1 and drop it to 0.
The API here is unclean, but current usage expects the same
variable pointer is for all preload calls and combines 'flush_required'
across all of them.
Commit 844b009584 tried to move
limit for usage of noflush to 'preload' however this was not
correctly processed.
Intead explicitly check for which types we do not want noflush
and also add debug message in this case.
Fix regression caused by commit ba41ee1dc9.
The idea was to use no_flush for changed device only for thin volumes
and thin pools but also to merge this with change made in commit
844b009584.
However the resulting condition has caused misbehavior for the mirror
suspend - as that has been before the ONLY allowed target type
that could have been suspended with noflush.
Result was badly working repair for --type mirror that has been
passing 'flush' to the repaired mirror target whenever preload
wrongly set flush_required.
The origin code has required the flush_required to be set whenever
deivce size is changed.
Now it first detects if device size got smaller
'dm_tree_node_size_changed(root) < 0' - this requires flush.
Otherwise it checks device is not thin volume nor thin pool and its
size has changed (got bigger) and requires flush.
This mean upsize of thin-pool or thin volume will not require flush.
/sys/dev/block is available since kernel version 2.2.26 (~ 2008):
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-dev
The VGID/LVID indexing code relies on this feature so skip indexing
if it's not available to avoid error messages about inability to open
/sys/dev/block directory.
We're not going to provide fallback code to read the /sys/block/
instead in this case as that's not that efficient - it needs extra
reads for getting major:minor and reading partitions would also
pose further reads and that's not worth it.
If compiling without udev_sync support, udev_get_library_context simply
returns NULL so we don't need to remember putting ifdef UDEV_SYNC_SUPPORT
in the code all the time we just need to check whether there's any udev
context initialized or not.
If obtain_device_list_from_udev=0, LVM can make use of persistent .cache
file. This cache file contains only devices which underwent filters in
previous LVM command run. But we need to iterate over all block devices
to create the VGID/LVID index completely for the device mismatch check
to be complete as well.
This patch iterates over block devices found in sysfs to generate the
VGID/LVID index in dev cache if obtain_device_list_from_udev=0
(if obtain_device_list_from_udev=1, we always read complete list of
block devices from udev and we ignore .cache file so we don't need
to look in sysfs for the complete list).
For the case when we print device name associated with struct device
that was not found in /dev, but in sysfs, for example when printing
devices where LV device mismatch is found.
Use meta% to expose highest mapped sector in thinLV.
so showing there 100.00% means thinLV maps latest sector.
Currently using a 'trick' with total_numerator to pass-in
device size when 'seg==NULL'
TODO: Improve device status API per target - current 'percentage'
is not really extensible.
Previous fix missed the fact the we do query for 'percent' with
seg value either set or unset (API overload...)
When 'seg' was unset, we still issue flush with status.
Fix it by cheking segtype by target_type.
As we check for segtype - we could also skip whole percentage
if the 'segtype' is unknown by code directly.
Reported-by: Ming-Hung Tsai <mingnus gmail com
It's correct to have a DM device that has no DM UUID assigned
so no need to issue error message in this case. Also, if the
device doesn't have DM UUID, it's also clear it's not an LVM LV
(...when looking for VGID/LVID while creating VGID/LVID indices
in dev cache).
For example:
$ dmsetup create test --table "0 1 linear /dev/sda 0"
And there's no PV in the system.
Before this patch (spurious error message issued):
$ pvs
_get_sysfs_value: /sys/dev/block/253:2/dm/uuid: no value
With this patch applied (no spurious error message):
$ pvs
If we're using persistent .cache file, we're reading this file instead
of traversing the /dev content. Fix missing indexing by VGID and LVID
here - hook this into persistent_filter_load where we populate device
cache from persistent .cache file instead of scanning /dev.
For example, inducing situation in which we warn about different device
actually used than what LVM thinks should be used based on metadata:
$ lsblk -s /dev/vg/lvol0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol0 253:4 0 124M 0 lvm
`-loop1 7:1 0 128M 0 loop
$ lvmconfig --type diff
global {
use_lvmetad=0
}
devices {
obtain_device_list_from_udev=0
}
(obtain_device_list_from_udev=0 also means the persistent .cache file is used)
Before this patch - pvs is fine as it does the dev scan, but lvs relies
on persistent .cache file and it misses the VGID/LVID indices to check
and warn about incorrect devices used:
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
With this patch applied - both pvs and lvs is fine - the indices are
always created correctly (lvs just an example here, other LVM commands
that rely on persistent .cache file are fixed with this patch too):
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
It's possible that while a device is already referenced in sysfs, the node
is not yet in /dev directory.
This may happen in some rare cases right after LVs get created - we sync
with udev (or alternatively we create /dev content ourselves) while VG
lock is held. However, dev scan is done without VG lock so devices may
already be in sysfs, but /dev may not be updated yet if we call LVM command
right after LV creation (so the fact that fs_unlock is done within VG
lock is not usable here much). This is not a problem with devtmpfs as
there's at least kernel name for device in /dev as soon as the sysfs
item exists, but we still support environments without devtmpfs or
where different directory for dev nodes is used (e.g. our test suite).
This patch covers these situations by tracking such devices in
_cache.sysfs_only_names helper hash for the vgid/lvid check to work still.
This also resolves commit 6129d2e64d
which was then reverted by commit 109b7e2095
due to performance issues it may have brought (...and it didn't resolve
the problem fully anyway).
dmeventd daemon may call further code itself that looks at /dev, e.g.
via dmeventd_lvm2_command call. We need to have a consistent view of
the /dev content at that time. Therefore, sync /dev content before
calling monitoring hook which contacts dmeventd.
This problem was quite hidden before, but now it has manifested itself
because of recent additions to dev-cache code where we started looking
at device holders as seen in sysfs. What happened here was that the
device was already in sysfs, but not yet under /dev and this triggered
the new error message sometimes:
log_error("%s: failed to find associated device structure for holder %s.", devname, devpath);
This problem has manifested recently in our api/pytest.sh test from
testsuite where we create thin pool LVs and thin LVs and hence it also
causes dmeventd to be used as well and these error messages were
visible there.
UUID for LV is either "LVM-<vg_uuid><lv_uuid>" or "LVM-<vg_uuid><lv_uuid>-<suffix>".
The code before just checked the length of the UUID based on the first
template, not the variant with suffix - so LVs with this suffix were not
processed properly.
For example a thin pool LV (as an example of an LV that contains
sub LVs where UUIDs have suffixes):
[0] fedora/~ # lsblk -s /dev/vg/lvol1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol1 253:8 0 4M 0 lvm
`-vg-pool-tpool 253:6 0 116M 0 lvm
|-vg-pool_tmeta 253:2 0 4M 0 lvm
| `-sda 8:0 0 128M 0 disk
`-vg-pool_tdata 253:3 0 116M 0 lvm
`-sda 8:0 0 128M 0 disk
Before this patch (spurious warning message about device mismatch):
[0] fedora/~ # pvs
WARNING: Device mismatch detected for vg/lvol1 which is accessing /dev/mapper/vg-pool-tpool instead of (null).
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
With this patch applied (no spurious warning message about device mismatch):
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
To help out with debug, when an exception is thrown in the dbus service we
will dump all the information we have on the last 16 commands that were
executed along with the stack strace.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
With commit 2d5dc6512e the dbus server
no longer needs to utilize udev to know when to update its internal
state.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Check if the value we read from sysfs is not blank and replace the '\n'
at the end only when needed ('\n' should usually be there for sysfs values,
but better check this).
It's possible for an LVM LV to use a device during activation which
then differs from device which LVM assumes based on metadata later on.
For example, such device mismatch can occur if LVM doesn't have
complete view of devices during activation or if filters are
misbehaving or they're incorrectly set during activation.
This patch adds code that can detect this mismatch by creating
VG UUID and LV UUID index while scanning devices for device cache.
The VG UUID index maps VG UUID to a device list. Each device in the
list has a device layered above as a holder which is an LVM LV device
and for which we know the VG UUID (and similarly for LV UUID index).
We can acquire VG and LV UUID by reading /sys/block/<dm_dev_name>/dm/uuid.
So these indices represent the actual state of PV device use in
the system by LVs and then we compare that to what LVM assumes
based on metadata.
For example:
[0] fedora/~ # lsblk /dev/sdq /dev/sdr /dev/sds /dev/sdt
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdq 65:0 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev1 253:3 0 104M 0 mpath
sdr 65:16 0 104M 0 disk
`-mpath_dev1 253:3 0 104M 0 mpath
sds 65:32 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev2 253:4 0 104M 0 mpath
sdt 65:48 0 104M 0 disk
`-mpath_dev2 253:4 0 104M 0 mpath
In this case the vg-lvol0 is mapped onto sdq and sds becauset this is
what was available and seen during activation. Then later on, sdr and
sdt appeared and mpath devices were created out of sdq+sdr (mpath_dev1)
and sds+sdt (mpath_dev2). Now, LVM assumes (correctly) that mpath_dev1
and mpath_dev2 are the PVs that should be used, not the mpath
components (sdq/sdr, sds/sdt).
[0] fedora/~ # pvs
Found duplicate PV xSUix1GJ2SK82ACFuKzFLAQi8xMfFxnO: using /dev/mapper/mpath_dev1 not /dev/sdq
Using duplicate PV /dev/mapper/mpath_dev1 from subsystem DM, replacing /dev/sdq
Found duplicate PV MvHyMVabtSqr33AbkUrobq1LjP8oiTRm: using /dev/mapper/mpath_dev2 not /dev/sds
Using duplicate PV /dev/mapper/mpath_dev2 from subsystem DM, ignoring /dev/sds
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/sdq, /dev/sds instead of /dev/mapper/mpath_dev1, /dev/mapper/mpath_dev2.
PV VG Fmt Attr PSize PFree
/dev/mapper/mpath_dev1 vg lvm2 a-- 100.00m 0
/dev/mapper/mpath_dev2 vg lvm2 a-- 100.00m 0
If we're using non-standard /dev layout so we can't get the dm-X name
easily, we can't also look at the /sys/blocl/dm-X/dev to get the major:minor
pair. Use "stat" in this case even though it triggers automounts
(but there's no better way for now).
The /proc/self/mountinfo is not bound to device names like /proc/mounts
and it uses major:minor pairs instead.
This fixes a situation in which a volume is mounted and then renamed
later on - that makes /proc/mounts unreliable when detecting mounted
volumes.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1196910.
Commit b64703401d cause regression
when handling stacked resize of pool metadata volume that would
be a raid LV.
Fix it by properly setting up size also for layer extension.
Allow pvchange and pvresize to process exported VGs,
and have them check for the exported state in their
single function.
Previously, the exported VG state would trigger a
failure in vg_read()/ignore_vg() because the VGs are
being read with READ_FOR_UPDATE. Because these commands
read all VGs to search for the intended PVs, any
exported VG would trigger a failure, even if it was
not related to the intended PV.
Use <> around user entered option parameters to make it visually
different (just like we use Italic style in man page).
TODO:
In the future we should consistently provide such notation and
possibly generate it automagically from some internal data structures.
Preferably for man pages as well so we report actual set of supported
options.
Recent kernel (4.4) start to report values smaller then sector size
(but in reporting size for SSD which support data zeroing on discard).
For now log warning and assume it really means 1 sector.
Addressing RHBZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1313377
Continuing with lvconvert style updates.
Minimize usage of '\:' as zero-width break space, since html renderers
do not handle this groff sign well (some of them seems to even render
':' there).
But since we do not want see badly rendered man pages itself, prefer
the 'man visual' appearence over html page generation (man2html should
be actually fixed).
Start to use 'user input option' with Capitals more consistently.
Fixing vpath usage as it has been checking for existance of
generated file also in the $(scrdir) e.g.:
No need to remake target '10-dm.rules.in'; using VPATH name '...'
If the $(srcdir) had been also $(builddir) and contained already
generated rules file, it's been used instead generating new
one.
(See: http://make.mad-scientist.net/papers/how-not-to-use-vpath/)
Commit abd9618dd8 tried to improve
parsing of vg name from logical path - but still missed couple
corner cases.
This patch further improves the logic and reuses
validate_lvname_param() for parsing of lv_name.
Also explicitly checks for LVM_VG_NAME in the right case.
So now also properly parses cases like:
'lvconvert --repairt vg/'
and will provide correct error message.
There's a window between doing VG read and checking PV device size
against real device size. If the device is removed in this window,
the dev cache still holds struct device and pv->dev still references
that and that PV is not marked as missing. However, if we're trying
to get size for such device, the open fails because that device
doesn't exists anymore.
We called existing pv_dev_size in _check_pv_dev_sizes fn. But
pv_dev_size assigned a size of 0 if the dev_get_size it called failed
(because the device is gone).
So call the dev_get_size directly and check for the return code
in _check_pv_dev_sizes and go further only if we really know the
device size. This is to avoid confusing warning messages like:
Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
One or more devices used as PVs in VG helter_skelter have changed sizes.
While running on F24 a number of warnings were being emitted from using the
deprecated GObject instead of GLib. Tested on python 3.4 and 3.5.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Python 3.5 in F24 was throwing the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/main.py", line 73, in process_request
req.run_cmd()
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 73, in run_cmd
self.register_error(-1, st)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 123, in register_error
self._reg_ending(None, error_rc, error)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 115, in _reg_ending
self.cb_error(self._rc_error)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 669, in <lambda>
keywords[error_callback] = lambda exception: _method_reply_error(connection, message, exception)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 293, in _method_reply_error
exception))
File "/usr/lib64/python3.5/traceback.py", line 136, in format_exception_only
return list(TracebackException(etype, value, None).format_exception_only())
File "/usr/lib64/python3.5/traceback.py", line 442, in __init__
if (exc_value and exc_value.__cause__ is not None
AttributeError: 'str' object has no attribute '__cause__'
This was caused because we were calling the dbus error callback with a
string instead of an actual exception. On python 3.4 this was apparently
OK, but not with 3.5. Corrected to pass the exception to error callback.
Change tested on both python 3.4 and 3.5.
Reported-by: Vratislav Podzimek <vpodzime@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Specifying an output width of 0 now leads to a default minimum width
taken from the width of the column heading. (Most fields should use
this.)
Components of field names are generally separated by underscores (which
are optional at run-time). (Dropped 3 duplicate fields now covered by
this rule.)
Each field heading must be unique and generally doesn't have spaces
between words (which get capitalised) unless they are already short and
the fields are normally longer or clarity demands it.
With the recent conversion of pvcreate/pvremove to the
common toollib processing function, skipping in-use PVs
in _process_pvs_in_vg prevented them from being protected
as intended by the in-use flag.
The processing code for pvcreate/pvremove checks for the
in-use state itself and prevents using an in-use PV.
If a PV is skipped, it looks like an unused device and
is not protected from being used in pvcreate/pvremove.
This command option can be used to trigger a D-Bus
notification independent of the usual notifications
that are sent from other commands as an effect of
changes to PV/VG/LV state. If lvm is not built with
dbus notification support or if notify_dbus is disabled
in the config, this command will exit with an error.
When a command modifies a PV or VG, or changes the
activation state of an LV, it will send a dbus
notification when the command is finished. This
can be enabled/disabled with a config setting.
The code in _print_historical_lv function works with temporary
"descendants_buffer" that is allocated and freed within this
function.
When printing text out, we used "outf" macro which called
"out_text" fn and it checked return value and if failed,
the macro called "return_0" automatically. But since we
use the temporary buffer, if any of the out_text calls
fails, we need to deallocate this buffer properly - that's
the "goto_out", otherwise we'll be leaking memory.
So add new "outfgo" helper macro which does the same as "outf",
but it calls "goto_out" instead of "return_0" so we can jump
to a cleanup hook at the end.
- stdout/stderr for test output
- debug.log* for daemon output
- use install instead of cp for DBus config
- use system bus
- DBus reloads on new file. Better having it created with correct
permissions.
Notes:
- Squashed commits from mcsontos development branch
- Still disabled at this time
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Using lvm1 metadata with lvmetad is not generally allowed,
but nothing has prevented creating new lvm1 metadata with
lvmetad (missing error checking in pvcreate/vgcreate.)
Various tests are using lvm1 with lvmetad and happen to
work because of the missing error checks.
This commit fixes the tests so they won't fail when the
lvm1/lvmetad error checking is fixed. A new variable
LVM_TEST_LVM1 is defined and is used in the scripts to
decide if lvm1 metadata should be tested. LVM_TEST_LVM1
is not defined when lvmetad is being tested, and the
combination of LVM_TEST_LVM1 and LVM_TEST_LVMETAD can
be used to verify the desired lvmetad+lvm1 behavior.
Historical LV is valid as long as there is at least one live LV among
its ancestors. If we find any invalid (dangling) historical LVs, remove
them automatically.
The vg_strip_outdated_historical_lvs iterates over the list of historical LVs
we have and it shoots down the ones which are outdated.
Configuration hook to set the timeout will be in subsequent patch.
The metadata/record_lvs_history is global switch which enables or
disables recording historical LVs in metadata.
If both metadata/record_lvs_history=1 lvm.conf option and
--nohistory command switch is used at the same time, the
--nohistory prevails.
Report proper values for historical LVs in lv_layout and lv_role fields.
Any historical LV doesn't have any layout anymore and the role is "history".
For example:
$ lvs -H -o name,lv_attr,lv_layout,lv_role vg/-lvol1
LV Attr Layout Role
-lvol1 ----h----- none public,history
The lv_full_ancestors reporting field is just like the existing
lv_ancestors field but it also takes into account any history and
indirect relations recorded.
All names for historical LVs are prefixed with '-' character to make clear
difference between live and historical LVs. The '-' can't be set by users
for live LV names during lvcreate hence we never get into a conflict with
the names that user defines for live LVs.
The lv_historical reporting field is a simple binary field that reports
whether an LV is historical one ("historical" value or value of "1" displayed)
or not (blank string "" or value of "0" displayed).
When processing LVs in a VG and when the -H|--history switch is used,
make process_each_lv_in_vg to iterate over historical volumes too.
For each historical LV, we use dummy struct logical_volume instance with
the "this_glv" reference set to a wrapper over proper struct
historical_logical_volume representation. This makes it possible to process
historical LVs just like normal live LVs (though a dummy one without any
segments and all the other fields zeroed and blank) and it also allows
for using all historical LV related information via lv->this_glv->historical
reference.
One can use a simple call to lv_is_historical to make a difference between
live and historical LV in all the code that is called by process_each_* fns.
This patch adds "include_historical_lvs" field to struct cmd_context to
make it possible for the command to switch between original funcionality
where no historical LVs are processed and functionality where historical
LVs are taken into account (and reported or processed further). The switch
between these modes is done using the '-H|--history' switch on command
line.
The include_historical_lvs state is then passed to process_each_* fns
using the "include_historical_lvs" field within struct processing_handle.
Add support for making an interconnection between thin LV segment and
its indirect origin (which may be historical or live LV) - add a new
"indirect_origin" argument to attach_pool_lv function.
Also export historical LVs when exporting LVM2 metadata.
This is list of all historical LVs listed in
"historical_logical_volumes" metadata section with all
the properties exported for each historical LV.
For example, we have this thin snapshot sequence:
lvol1 --> lvol2 --> lvol3
\
--> lvol4
We end up with these metadata:
logical_volume {
...
(lvol1, lvol3 and lvol4 listed here as usual - no change here)
...
}
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "lvol1"
descendants = ["lvol3", "lvol4"]
}
}
By removing lvol1 further, we end up with:
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "-lvol1"
descendants = ["lvol3", "lvol4"]
}
lvol1 {
id = "me0mes-aYnK-nRfT-vNlV-UiR1-GP7r-ojbROr"
creation_time = 1456919608 # 2016-03-02 12:53:28 +0100
removal_time = 1456919767 # 2016-03-02 12:56:07 +0100
}
}
When an LV is being removed, we create an instance of
"struct historical_logical_volume" wrapped up in
"struct generic_logical_volume".
All instances of "struct historical_logical_volume" are then recorded in
"historical_lvs" list which is part of "struct volume_group".
The "historical LV" is then interconnected with "live LVs" to
connect a history chain for the live LV.
The add_glv_to_indirect_glvs is a helper function that registers a
volume represented by struct generic_logical_volume instance ("glv")
as an indirect user of another volume ("origin_glv") and vice versa -
it also registers the other volume ("origin_glv") as indirect_origin
of user volume ("glv").
The remove_glv_from_indirect_glvs does the opposite.
The get_or_create_glv is helper function that retrieves any existing
generic_logical_volume wrapper for the LV. If the wrapper does not exist
yet, it's created.
The get_org_create_glvl is the same as get_or_create_glv but it creates
the glv_list wrapper in addition so it can be added to a list.
Add new structures and new fields in existing structures to support
tracking history of LVs (the LVs which don't exist - the have been
removed already):
- new "struct historical_logical_volume"
This structure keeps information specific to historical LVs
(historical LV is very reduced form of struct logical_volume +
it contains a few specific fields to track historical LV
properties like removal time and connections among other LVs).
- new "struct generic_logical_volume"
Wrapper for "struct historical_logical_volume" and
"struct logical_volume" to make it possible to handle volumes
in uniform way, no matter if it's live or historical one.
- new "struct glv_list"
Wrapper for "struct generic_logical_volume" so it can be
added to a list.
- new "indirect_glvs" field in "struct logical_volume"
List that stores references to all indirect users of this LV - this
interconnects live LV with historical descendant LVs or even live
descendant LVs.
- new "indirect_origin" field in "struct lv_segment"
Reference to indirect origin of this segment - this interconnects
live LV (segment) with historical ancestor.
- new "this_glv" field in "struct logical_volume"
This references an existing generic_logical_volume wrapper for this
LV, if used. It can be NULL if not needed - which means we're not
handling historical LVs at all.
- new "historical_lvs" field in "struct volume group
List of all historical LVs read from VG metadata.
Pair kernel_cache_settings with kernel_cache_policy.
There is very small chance in error case that the value in table
might be differnet from the value stored in metadata
so make it 'checkable'.
Showing 'u' in the pv_attr reporting field is mostly unnecessary because
most PVs are allocatable, and being allocatable implies it is (u)sed,
and this is already obvious from other fields in the default 'pvs'
output like the VG name.
So move the new (u)sed pv_attr from character position 4 to 1, and only
show it in those rare cases when the PV is not (a)llocatable or the
relevant metadata is missing.
(Scripts should not be using pv_attr, but rather pv_allocatable,
pv_exported, pv_missing, pv_in_use etc.)
Helping with understanding we will not try to deref NULL pointer,
as if the sizes are initialized to NULL it also means 'mem' would
be NULL, but thats too hard to model so make it obvious.
Correct name is lvm2-lvmdbusd.service not lvmdbusd.service.
This makes the bus-activation (auto-activation) work.
Signed-off-by: Vratislav Podzimek <vpodzime@redhat.com>
Make the data_alignment variable 64 bits so it
can hold the invalid command line arg used in
pvreate-usage.sh pvcreate --dataalignment 1e.
On 32 bit arches, the smaller variable wouldn't
hold the invalid value so the error would not
trigger as expected by the test.
This simple tool calls the Manager.Refresh method on the dbus service
to check and see if the dbus service has the most up to date state.
This is to be used for testing to ensure that event driven updates are
working as planned.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When we use udev or have lvm call back into the dbus service when a
change occurs, even if that change originated from the dbus service
we end up refreshing the state of the system twice which is not
needed or wanted. This change handles this case by removing any
pending refreshes in the worker queue if the state of the system
was just updated.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Since we want to read env LVM_VG_NAME vg names,
we cannot just check LV names which do contain '/'.
So before the patch commands like:
> lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Path required for Logical Volume "vg".
Please provide a valid volume group name
> LVM_VG_NAME=vg lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Can't find LV vg in VG vg
Commit ca878a3426 introduced an issue
that zero sized extesion suddenly started to be accepted and
missed to return error.
Properly check invalid input values for sizes.
Add tests for the "dmstats report" command:
* report
* report --count
* report --histogram
So far the tests just check the command runs as expected when a
correctly configured stats region exists: validation of output
can be added later.
Add tests for the "dmstats create" command:
* simple whole-device region
* region using --start/--len options
* region using --segments option
* region with precise timestamps (--precise)
* region with histogram bounds (--bounds)
The install target already depends on .tests-stamp - since this
in turn depends on lib/version-expected there is no need to have
this as a dependency of install.
Add initial dmstats tests to 000-basic.sh. These tests ensure that
the dmsetup binary is built and linked correctly when called as
'dmstats' and that the version of the binary matches the expected
library version used for the build.
"pvcreate_each_params" was a temporary name used
to transition from the old "pvcreate_params".
Remove the old pvcreate_params struct and rename the
new pvcreate_each_params struct to pvcreate_params.
Rename various pvcreate_each_params terms to simply
pvcreate_params.
Use the new pvcreate_each_device() function from
toollib, previously added for pvcreate, in place
of the old pvcreate_vol().
This also requires shifting the location where the
lock is acquired for the new VG name. The lock for
the new VG is supposed to be acquired before pvcreate.
This means splitting the vg_lock_newname() out of
vg_create(), and calling vg_lock_newname() directly
before pvcreate, and then calling the remainder of
vg_create() after pvcreate.
The new function vg_lock_and_create() now does
vg_lock_newname() + vg_create(), like the previous
version of vg_create().
The lock on the new VG name is released before the
pvcreate and reacquired after the pvcreate because
pvcreate needs to reset lvmcache, which doesn't work
when locks are held. An exception could likely be
made for the new VG name lock, which would allow
vgcreate to hold the new VG name lock across the
pvcreate step.
This is common code for handling PV create/remove
that can be shared by pvcreate/vgcreate/vgextend/pvremove.
This does not change any commands to use the new code.
- Pull out the hidden equivalent of process_each_pv
into an actual top level process_each_pv.
- Pull the prompts to the top level, and do not
run any prompts while locks are held.
The orphan lock is reacquired after any prompts are
done, and the devices being created are checked for
any change made while the lock was not held.
Previously, pvcreate_vol() was the shared function for
creating a PV for pvcreate, vgcreate, vgextend.
Now, it will be toollib function pvcreate_each_device().
pvcreate_vol() was called effectively as a helper, from
within vgcreate and vgextend code paths.
pvcreate_each_device() will be called at the same level
as other process_each functions.
One of the main problems with pvcreate_vol() is that
it included a hidden equivalent of process_each_pv for
each device being created:
pvcreate_vol() -> _pvcreate_check() ->
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
pvcreate_each_device() reorganizes the code so that
each-VG-each-PV loop is done once, and uses the standard
process_each_pv function at the top level of the function.
This uses the vg->pv_write_list in place of the
vg->pvs_to_write list, and eliminates the use of
pvcreate_params. The label remove and zeroing
steps are shifted out of vg_write() to the higher
level like pvcreate will do.
The vg->pv_write_list contains pv_list structs for which
vg_write() should call pv_write().
The new list will replace vg->pvs_to_write that contains
vg_to_create structs which are used to perform higher-level
pvcreate-related operations. The higher level pvcreate
operations will be moved out of vg_write() to higher levels.
Since we already check in few other places 'info' is not NULL,
do the same for others - however when info would be NULL
it more or less looks like internal error.
Use #define instead, since we do not require actually buffer needs
to exists to eliminated new gcc6 warning:
clvm.h:53:19: warning: ‘CLVMD_SOCKNAME’ defined but not used
[-Wunused-const-variable]
Reshuffle messages during pvremove.
Always print WARNING: when PV is in use so using options
--force --force doesn't make this important user
notification go away.
Simplify variable 'used' usage (so older gcc doesn't warn
about the use of unitilizied variable).
Add some '.' into messages.
Currently it's been checked for 'zero' header for thin-pool,
but lets use it always for cache as well - since it's relatively 'cheap'
detection of read 'error' problems as thin/cache tools
currently do not work fast enough in this case.
export LVMDBUSD_SESSION=True to run on the session bus instead
of the system bus so that we can run the unit test without
installing the dbus conf file.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Reduced the size of LVs created and use actual PE numbers instead of hard
coding them to allow us to work with the loop back devices.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
It appears that the output of lvconvert --merge can vary some. The code
was blowing up as it was trying to parse a line of stdout to retrieve the
% complete, but the line did not have the needed format and an execption
was thrown. The uncaught exception caused the background thread to exit
without updating the job object, which caused the client to hang forever
waiting. Added a default exception handler to prevent unhandled execptions
causing hangs and removed the parameter skip_first_line as it's no longer
needed. The code checks to see if the line can be parsed before doing so.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
After the lockspace has been successfully removed,
invalidate the name field in the lockspace struct.
The struct remains on the list of lockspaces until
the struct can be freed later. Until the struct is
freed, its name will prevent another new lockspace
from being created with the same name.
When update fails in suspend() (sending of messages
fails because metadata space is full) call resume(),
so the locking sequence works properly for clustering.
Also failing deactivation should unlock memory.
Fix reporting of Fail thin-pool target status
as attr[8] letter 'F'.
Report 'needs_check' status from thin-pool target via
attr field [4] (letter 'c'/'C'), and also via CheckNeeded field.
TODO: think about better name here?
TODO: lots of prop_not_implemented_set
Fix parsing of 'Fail' status (using capital letter) for thin-pool.
Add also parsing of 'Error' state for thin-pool.
Add needs_check test for thin-pool.
Detect Fail state for thin.
Ask for confirmation when using pvcreate/pvremove on a PV which is
marked as belonging to a VG, just like we do in case of a PV which
belongs to known VG:
$ pvcreate -ff /dev/sda
Really INITIALIZE physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume not initialized
$ pvremove -ff /dev/sda
Really WIPE LABELS from physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume label not removed
The host that owns foreign VGs is responsible for fixing up PV_EXT_USED
flag - the same already applies to repairing any inconsistent VG.
This patch also moves the iteration over vg->pvs inside
_check_or_repair_pv_ext fn - it's cleaner this way.
pv->vg is not set yet during pvcreate processing. Use pv->fmt instead to
check for these fake PVs (all normal PVs have format defined, devices
which are not PVs don't have this set).
This fixes commit 0000db7f98.
If we know that a PV belongs to some VG and we're missing metadata
(because we have only those PV(s) from VG present in the system that
don't have metadata areas), we should skip such PV when processing
under system ID.
This is because we know that the PV belongs to some VG, but we
really can't decide whether it matches system ID unless the VG
metadata is present again.
Some of the PVs are not even orphan PVs - they're fake PVs - this can
happen if we're listing all devices with "pvs -a". Such PV must not
be marked as used.
The backup_restore_vg is used directly for restoring the VG from backup.
It's also used to do the VG conversions from one metadata format to
another which means vgconvert calls backup_restore_vg too.
When restoring VG from backup, we need to rewrite/write PV headers as
PVs may have been orphans before and now they're becoming part of some
VG - we need to write the PV_EXT_USED flag at least.
When using the backup_restore_vg for vgconvert, we need to write
completely new PV header in different format.
Avoid the special "pv_write" call and handling that was used before
this patch in vgconvert (vgconvert_single function to be more precise)
and reuse existing internal interface to register PV header for writing
(or rewriting) via vg->pvs_to_write list instead like we do it elsewhere
in the code.
This patch also resolves a problem in which PV headers with target
format were written in the vgconvert_single fn as orphans and VG
metadata were added later on - this was a tiny hack actually.
We can't do this now - we need to write the PV as belonging
to a VG because otherwise the PV_EXT_USED flag won't be written
properly (if the PV header is written as orphan, the PV_EXT_USED
is set to 0, of course, even though metadata are attached later).
So this patch removes this tiny inconsistency which was passing
just fine before because we didn't have any relation to the VG
in PV header before. Now we have the PV_EXT_USED flag which says
the "PV is used in some VG".
The same check as we already do for orphan PVs, just the other way
round now: if the PV is surely part of some VG and any PV the VG
contains does not have the PV_EXT_USED flag set, repair it.
For example - /dev/sda here is in VG vg and it's incorrectly not
marked as used by PV_EXT_USED flag:
pvs --binary -o pv_ext_vsn,pv_in_use
WARNING: Volume Group vg is not consistent.
WARNING: Repairing Physical Volume /dev/sda that is in Volume Group vg but not marked as used.
PV VG Fmt Attr PSize PFree ExtVsn PInUse
/dev/sda vg lvm2 a-- 124.00m 124.00m 2 1
PV header extension versions:
0 - the original PV without any extensions
1 - bootloader area support added
2 - PV_EXT_USED flag support added
So do the associated checks related to PV_EXT_USED flag only if
PV header extension found is of version 2 and higher.
If we know that the PV is orphan, meaning there's at least one MDA on
that PV which does not reference any VG and at the same time there's
PV_EXT_USED flag set, we're certainly in an inconsistent state and we
need to fix this.
For example, such situation can happen during vgremove/vgreduce if we
removed/reduced the VG, but we haven't written PV headers yet because
vgremove stopped abruptly for whatever reason just before writing new
PV headers with updated state, including PV extension flags (and so the
PV_EXT_USED flag).
However, in case the PV has no MDAs at all, we can't double-check
whether the PV_EXT_USED is correct or not - if that PV is marked
as used, it's either:
- really used (but other disks with MDAs are missing)
- or the error state as described above is hit
User needs to overwrite the PV header directly if it's really clear
the PV having no MDAs does not belong to any VG and at the same time
it's still marked as being in use (pvcreate -ff <dev_name> will fix this).
For example - /dev/sda here has 1 MDA, orphan and is incorrectly marked
with PV_EXT_USED flag:
$ pvs --binary -o+pv_in_use
WARNING: Found inconsistent standalone Physical Volumes.
WARNING: Repairing flag incorrectly marking Physical Volume /dev/sda as used.
PV VG Fmt Attr PSize PFree InUse
/dev/sda lvm2 --- 128.00m 128.00m 0
For example:
$ pvs -o pv_name,vg_name,pv_in_use
PV VG InUse
/dev/sda vg used
/dev/sdb
/dev/sdc used
(sda is part of vg - it's used
sdb is not part of vg - it's not used
sdc is part of vg, but MDAs missing - it's used)
Scenario:
$ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
We're adding the PV to a VG.
Before this patch:
$ vgcreate vg /dev/sda
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
With this path applied:
$ vgcreate vg /dev/sda
Volume group "vg" successfully created
...and verbose log containing: "Physical volume "/dev/sda" successfully written"
Make sure we won't use a PV that is already marked as used. Normally,
VG metadata would stop us from doing that, but we can run into a
situation where such metadata is missing because PVs with MDAs
are missing and the PVs left are the ones with 0 MDAs.
(/dev/sda in this example has 0 MDAs and it belongs to a VG,
but other PVs with MDA are missing)
$ pvs -o pv_name,pv_mda_count /dev/sda
PV #PMda
/dev/sda 0
$ pvcreate /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't initialize PV '/dev/sda' without -ff.
$ pvchange -u /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't change PV '/dev/sda' without -ff.
Physical volume /dev/sda not changed
0 physical volumes changed / 1 physical volume not changed
$ pvremove /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
(If you are certain you need pvremove, then confirm by using --force twice.)
$ vgcreate vg /dev/sda
Physical volume '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Unable to add physical volume '/dev/sda' to volume group 'vg'.
We'll use this struct in subsequent patches for PVs which should
be rewritten, not just created. So rename struct pv_to_create to
struct pv_to_write for clarity.
This is a hotfix for a bug introduced in
6d7dc87cb3.
The bug description: First we allocate memory for
processing handle (at an address 1) then we
allocate some memory on the same pool for later use
in pvmove_poll function inside the process_each_pv
function (at an address 2). After we jump out of
process_each_pv we called destroy_processing_handle.
As a result of destroying the handle memory pool could
deallocate all memory at address 1 or higher. The
pvmove_poll function tried to copy a memory allocated
at address 2 that could be returned to the system.
If it was so it led to segfault.
We need to rethink proper fix but in the same time
cmd->mem pool is recreated per each lvm command so
this should not cause problems even when we run
multiple commands in lvm shell.
A valgrind snapshot of the corruption:
Invalid read of size 1
at 0x4C29F92: strlen (mc_replace_strmem.c:403)
by 0x5495F2E: dm_pool_strdup (pool.c:51)
by 0x1592A7: _create_id (pvmove.c:774)
by 0x159409: pvmove_poll (pvmove.c:796)
by 0x1599E3: pvmove (pvmove.c:931)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address 0xf15df8a is 138 bytes inside a block of size 8,192 free'd
at 0x4C28430: free (vg_replace_malloc.c:446)
by 0x5494E73: dm_free_wrapper (dbg_malloc.c:357)
by 0x5495DE2: _free_chunk (pool-fast.c:318)
by 0x549561C: dm_pool_free (pool-fast.c:151)
by 0x164451: destroy_processing_handle (toollib.c:1837)
by 0x1598C1: pvmove (pvmove.c:903)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address this gcc warning:
metadata/lv.c:243: warning: initialized field overwritten
metadata/lv.c:243: warning: (near initialization for 'status.seg_status')
Present with e.g.: gcc version 4.3.2 (Debian 4.3.2-1.1)
Simplify calculation of extents rounding needed for
segment size.
Segment size has to divisible by 'extent count' needed to contain
whole stripe. LVM currently does not support stripes across segment.
In case the stripe size is bigger then extent size,
require bigger rounding.
Fixing regression caused by 197b5e6dc7.
So the 'TODO' part now finally know the answer - there is 'sparc64'
architecture which imposes limitation to read 64b words only through
64b aligned address.
Since we never could know how is the user going to use the returned
pointer and the userusually expects it's aligned on the highest CPU
required alignement, preserve it also for char*.
Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=809685
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Wrong thin-pool feature flag ordering in dm table: It will lead to
unnecessary table reload.
Fix it by placeing feature flags in order they are returned from the
kernel so current 'table line diff' code will not see a difference.
'verbose' was marked as a boolean option while it
takes integer args - so it has limited usage to 0 or 1,
but we supported 0-4 at least.
Fix it by switching to corrent int type.
(Hopefully noone was trying to use this variable as true/yes/false/no
way - as the would be unsupported/undocumented).
Reporter noticed lvm2 incorrectly translated
lvm2 threshold value to water mark in commit:
99237f0908
Fix it by properly translating size to number of
blocks in thin-pool and then calc for free blocks
matching configured lvm2 threshold value.
Reported-by: Ming-Hung Tsai <mingnus@gmail.com>
Normally, we generate and provide lvm.conf file where use_blkid_wiping
is set based on whether support for this is compiled in or not. This was
generated properly based on configure.
However, if lvm.conf is not used at all (someone deletes it) or the value
in lvm.conf is commented out (user edited it), we still need to use
proper default value that is based on DEFAULT_USE_BLKID_WIPING taken
from configure script - we used hardcoded value of "1" in this case
by mistake.
We already do check for suspended devs within udev rules where
the pvscan is to update lvmetad. So the check for suspended devs
in "pre-lvmetad" chain is not useful here - remove it - it may
be a source of hardly to detect races anyway (if udev rule detects
the device is not suspended and then the pvscan instance sees the
dev as suspended, we may end up not reacting to the event properly).
lvm1 and pool format do not support bootloader areas and we need to
remove any existing associated bootloader areas when we read lvm1 and
pool labels.
This has its importance if we're converting from one format to another
and we're reusing lvmcache in long-running commands (e.g. clvmd or lvm
shell) and we need to make lvmcache consistent and valid for current format.
Non-dm devices have ID_PART_TABLE_TYPE variable exported in
udev db from blkid scan for *both* whole devices and partitions.
We used ID_PART_ENTRY_DISK in addition to decide whether this
is the whole device or partition and then we filtered out only
whole devices where the partition table really is.
However, ID_PART_ENTRY_DISK was added in blkid 2.20 so we need
to use a different set of variables to decide on whole devices
and partitions on systems where older blkid is still used.
Now, we use ID_PART_TABLE_TYPE to detect that there's something
related to partitioning with this device and we use DEVTYPE variable
instead to decide between whole device (DEVTYPE="disk") and partition
(DEVTYPE="partition").
For dm devices it's simpler, we have ID_PART_TABLE_TYPE variable\
set in udev db for whole devices. It's not set for partitions,
hence we don't need more variable in addition to make the decision
on whole device vs. partition (dm devices do not have regular
partitions, hence DEVTYPE can't be used anyway, it's always set
to "disk" for whole disks and partitions).
Add "size" and "size_seqno" to struct device to cache device's size
and also to control its lifetime - the cached value is valid as long
as the global _dev_size_seqno is equal to the device's size_seqno,
otherwise we need to get the size again and cache the new value.
This patch also adds new dev_size_seqno_inc() fn for the appropriate
parts of the code to increment current global value of _dev_size_seqno
and hence to cause all currently cached values for device sizes to
be invalidated.
The device size is now cached because we're planning to reuse this
information for further checks and we want to avoid checking it more
than necessary to save resources.
Fix regression caused by c9f021de0b.
This commit actually transfered real-action (e.g. device removal)
into the next loop which has however missed to check for break.
So add check for break also there.
When creating a list in 'context of command' - use proper mempool.
vg->vgmem is mempool related to VG metadata - and can be eventually
locked read-only when VG struct is shared.
W: manual-page-warning /usr/share/man/man8/lvm.8.gz 491: warning: macro `_cdata',' not defined
rpmlint actually notices we had few hidden word in man page.
the line cannot start with apostrophe as it has then a different
meaning.
If not using explicit --enable-blkid-wiping/--disable-blkid-wiping
configure option, the configure script tries to enable/disable blkid
wiping feature automatically based on blkid library version found.
The script incorrectly set default value for lvm.conf's
allocation/use_blkid_wiping" setting to "1" (enabled) if proper
blkid library version was not found or the version found was less
than the minimum required. It should be set to "0" in this case.
The extent size must fits all blocks in 4294967295 sectors
(in 512b units) this is 1/2 KiB less then 2TiB.
So while previous statement 'suggested' 2TiB is still acceptable value,
make it clear it's not.
As now we support any multiples of 128KB as extent size -
values like 2047G will still 'flow-in' otherwise the largest power-of-2
supported value is 1TiB.
With 1TiB user needs 8388608 extents for 8EiB device.
(FYI such device is already unusable with todays glibc-2.22.90-27)
4GiB extent size is currently the smallest extent size which allows
a user to create 8EiB devices (with 2GiB it's less then 8EiB).
TODO: lvm2 may possibly print amount of 'lost/unused space' on a PV,
since using such ridiculously sized extent size may result in huge
space being left unaccessible.
Since commit 2fc126b00d, the library
code requires udev to be initialised for device scanning and
clvmd can fail to find VGs if devices/external_device_info_source
is set to "udev".
There are two basic groups of fields for LV segment device reporting:
- related to LV segment's devices: devices and seg_pe_ranges
- related to LV segment's metadata devices: metadata_devices and seg_metadata_le_ranges
The devices and metadata_devices report devices in this format:
"device_name(extent_start)"
The seg_pe_ranges and seg_metadata_le_ranges report devices in
this format:
"device_name:extent_start-extent_end"
This patch reverts partly what commit 7f74a99502
(v 2.02.140) introduced in this area - it added [] for
hidden devices to mark them for all four fields mentioned above.
We won't be marking hidden devices in devices and metadata_devices
fields.
The seg_metadata_le_ranges field will have hidden devices marked -
it's new enough that we don't need to care about compatibility much
yet.
The seg_pe_ranges is old enough that we shouldn't be changing this
one - so we're reverting to not marking hidden devices here.
Instead, there's going to be a new field "seg_le_ranges" which
is going to replace the seg_pe_ranges and it will mark hidden devices -
this is going to be introduced in a patch later.
So in the end we'll end up with:
(LV segment's devices)
devices field with "device_name(extent_start)" format, not marking hidden devices
seg_pe_ranges field with "device_name:extent_start-extent_end" format, not marking hidden devices (deprecated, new seg_le_ranges should be used instead for standardized format)
seg_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
(LV segment's metadata devices)
metadata_devices field with "device_name:extent_start-extent_end" format, not marking hidden devices
seg_metadata_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
Also, both seg_le_ranges and seg_metadata_le_ranges will honour the
report/list_item_separator setting which can be used to configure
the delimiter used for list items.
So, to sum it up, we will recommend using the new seg_le_ranges and
seg_metadata_le_ranges fields because they display devices with
standard extent range format, they can mark hidden devices and they
honour the report/list_item_separator setting.
We'll be keeping devices,seg_pe_ranges and metadata_devices fields
for compatibility.
The associated devices,metadata_devices,seg_pe_ranges and
seg_metadata_le_ranges are reported as genuine string lists now.
This allows for using the items separately in -S|--select
(so searching for subsets etc.) and also it allows for
configuring the separator using report/list_item_separator
which may be useful in scripts (however, we'll enable this
only for seg_le_metadata_ranges and not for devices,seg_pe_ranges
and seg_metadata_devices for compatibility reasons - see following
patch).
Add a comment in _process_pvs_in_vg() to document the
place where there have been problems with processing
PVs twice.
For a while we had a hacky workaround here where we'd
skip processing a PV if its device wasn't found in
all_devices (and !is_missing_pv since we want to
process PVs with missing devices.). That workaround
was removed in commit 5cd4d46f because it was no
longer needed.
The workaround had originally been needed to prevent
a device from being processed twice when the PV had
no MDAs -- it would be processed once in its real VG
and then the workaround would prevent it from being
processed a second time in the orphan VG.
Wrongly appearing as an orphan likely happened because
lvmcache would consider the no-MDA PV an orphan unless
the real VG holding that PV was also in lvmcache.
This issue is also mentioned in pvchange where holding
the global lock allows VGs to remain in lvmcache so
PVs with 0 mdas are not considered orphans.
The workaround in _process_pvs_in_vg() was originally
intended for reporting commands, not for pvchange.
But, it was accidentally helping pvchange also because
the method described by the pvchange global lock
comment had been subverted by commit 80f4b4b8.
Commit 80f4b4b8 was found to be unnecessary, and was
reverted in commit e710bac0. This restored the
intended global lock lvmcache effect to pvchange, and
it no longer relied on the workaround in toollib.
When reporting on LVs, take the end of the range from the size of the
underlying (hidden) LV rather than the logical size of the current
segment (that PVs use).
Previously, pvmove used the function find_pv_in_vg() which did the
equivalent of process_each_pv() by doing:
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
With the found 'pv', it would do vg_read() on pv_vg_name(pv),
and then do the actual pvmove processing.
This commit simplifies by using process_each_pv() and putting
the actual pvmove processing into the "single" function.
This eliminates both find_pv_by_name() and the vg_read().
The processing code that followed vg_read remains the same.
The return code for the pvmove command is not based on the
process_each_pv return code, but is based on the success/fail
conditions in the existing code.
Make the lvb validation rules for convert match
those for unlock (even though it would be very
unlikely or impossible for convert to deal with
zero lvb.)
When an orphan PV is changed/resized, the
lvmlockd global lock is converted from sh
to ex. If the command is changing two
orphan PVs, the conversion to ex should
be done only once.
Existing cache_settings field displays the settings which are
saved in metadata. Add new kernel_cache_settings fields to display
the settings which are currently used by kernel, including fields
for which default values are used.
This way users have complete view of the set of cache settings
supported (and which they can set) and their values which are used
at the moment by kernel.
For example:
$ lvs -o name,cache_policy,cache_settings,kernel_cache_settings vg
LV Cache Policy Cache Settings KCache Settings
cached1 mq migration_threshold=1024,write_promote_adjustment=2 migration_threshold=1024,random_threshold=4,sequential_threshold=512,discard_promote_adjustment=1,read_promote_adjustment=4,write_promote_adjustment=2
cached2 smq migration_threshold=1024 migration_threshold=1024
cached3 smq migration_threshold=2048
Fix lvm2app to return either 0 or 1 for lvm_vg_is_{clustered,exported},
including internal functions pvseg_is_allocated and vg_is_resizeable
which are not yet exposed in lvm2app but make them consistent with the
rest.
This reverts e28e22b9e1
The problem that that commit was fixing (pytest failure)
no longer appears with the current code, so the commit is
not needed.
That commit is a problem for pvchange, because it prevents
lvmcache from retaining VG metadata even while the global
lock is held. pvchange holds the global lock to ensure
that VG metadata is kept in lvmcache throughout processing.
If the cache is not kept, a PV with zero MDAs will appear
first in its actual VG and then appear again in the orphan VG.
It wrongly appears a second time in the orphan VG only if
the actual VG is dropped from lvmcache.
Thin pool discard mode set in metadata can be different from the one
actually used if any device underneath does not support that mode. Add
kernel_discard report field to make it possible to see this difference.
Internal _alloc_init() is only called from allocate_extents(),
which already does prevent usage of virtual segments.
So mark as internal error early and do not process it any further.
Add new test for lv_is_snapshot().
Also move few other bitchecks into same place as remaining bit tests.
TODO: drop lv_is_merging_origin() and keep using lv_is_merging().
The problem addressed by this workaround no longer
seems to exist, so remove it. PVs with no mdas
no longer appear in both their actual VG and in
the orphan VG.
Include brackets for the name if the dev is invisible.
This change applies to all callers of _format_pvsegs fn:
- lvseg_devices (the "lvs -o devices")
- lvseg_metadata_devices (the "lvs -o metadata_devices)
- lvseg_seg_pe_ranges (the "lvs -o seg_pe_ranges")
- lvseg_seg_metadata_le_ranges (the "lvs -o seg_metadata_le_ranges")
The common lv_pool_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_metadata_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_data_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_mirror_log_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_origin_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_convert_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
Use common _lvname_disp to report lv_parent. The _lvname_disp
takes care of properly marking LVs which are not visible - such
LVs are always enclosed in brackets when reported within any
other field.
For example, thin pool over RAID.
Before:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] cache_pool_tdata
[cache_pool_tdata_rimage_1] cache_pool_tdata
[cache_pool_tdata_rmeta_0] cache_pool_tdata
[cache_pool_tdata_rmeta_1] cache_pool_tdata
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] cache_pool_tmeta
[cache_pool_tmeta_rimage_1] cache_pool_tmeta
[cache_pool_tmeta_rmeta_0] cache_pool_tmeta
[cache_pool_tmeta_rmeta_1] cache_pool_tmeta
[lvol0_pmspare]
With this patch applied:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] [cache_pool_tdata]
[cache_pool_tdata_rimage_1] [cache_pool_tdata]
[cache_pool_tdata_rmeta_0] [cache_pool_tdata]
[cache_pool_tdata_rmeta_1] [cache_pool_tdata]
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] [cache_pool_tmeta]
[cache_pool_tmeta_rimage_1] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_0] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_1] [cache_pool_tmeta]
[lvol0_pmspare]
Do not mix dm_report_field_set_value and _field_set_value and
use single function call throughout for clarity. The same applies
for dm_report_field_string and _string_disp.
Fix regression caused by commit c2d4330f27
which removed the dm_pool_strdup for the cache policy name in
_cache_policy_disp report function.
This regression was hit with buffered reporting only (which is
used by default). The reason is that for buffered reporting, we're
iterating over LVs in VG (process_each_lv) while gathering
all the information that is needed for the report. In this case,
the LV's cache policy name has not been duped, but only the pointer
to the original VG buffer was stored. When the LV iteration finished,
the VG buffer was freed and any report to output called later
(dm_report_output call) accessed already freed VG data.
This didn't appear if unbuffered reporting was used (--unbuffered)
because in this case, the data were reported to output as
soon as they were processed, hence it was reported to output
before the VG data was freed.
The lvm2-activation{-early,-net}.service systemd unit statuses were missing
in dump gathered by lvmdump -s. These are quite important when debugging
scenarios with systemd environment and where lvmetad is not used.
Have commands send lvmlockd the update message
in vg_write instead of vg_commit, so that it's
not done while LVs are suspended. If the vg_write
is not committed, and the seqno sent to lvmlockd
is not used, then lvmlockd can detect this when
the next update uses the same seqno.
Use process_each_vg() to lock and read the old VG,
and then call the main vgrename code.
When real VG names are used (not a UUID in place of the
old name), the command still pre-locks the new name
(when strcmp wants it locked first), before calling
process_each_vg on the old name.
In the case where the old name is replaced with a UUID,
process_each_vg now translates that UUID into the real
VG name, which it locks and reads. In this case, we
cannot do pre-locking to maintain lock ordering because
the old name is unknown. So, in this case the strcmp
based lock ordering is suppressed and the old name is
always locked first. This opens a remote chance for
lock ordering conflict between racing vgrenames between
two names where one or both commands use the UUID.
Also always clear the internal lvmcache after rescanning, and
reinstate a test for --trustcache so that 'pvs --trustcache'
(for example) avoids rescanning.
Before commit c1f246fedf,
_get_all_devices() did a full device scan before
get_vgnameids() was called. The full scan in
_get_all_devices() is from calling dev_iter_create(f, 1).
The '1' arg forces a full scan.
By doing a full scan in _get_all_devices(), new devices
were added to dev-cache before get_vgnameids() began
scanning labels. So, labels would be read from new devices.
(e.g. by the first 'pvs' command after the new device appeared.)
After that commit, _get_all_devices() was called
after get_vgnameids() was finished scanning labels.
So, new devices would be missed while scanning labels.
When _get_all_devices() saw the new devices (after
labels were scanned), those devices were added to
the .cache file. This meant that the second 'pvs'
command would see the devices because they would be
in .cache.
Now, the full device scan is factored out of
_get_all_devices() and called by itself at the
start of the command so that new devices will
be known before get_vgnameids() scans labels.
Since we mark cache-pool as 'hidden/private' while it is in-use,
we may still allow user to change it's name.
It should not cause any harm and user may prefer better naming
for a cache-pool in use.
If an existing fifo has the wrong attributes it cannot be trusted
so we must unlink it and recreate it correctly.
(Replaces 2c8d6f5c90: if the other end of
the fifo already got opened while its mode was insecure, delaying the
chmod isn't going to make any difference!)
Reinstate and extend checks removed by e1b111b02a.
The code has always assumed that only root has access to the directory
containing the fifos and that they are under the complete control of
dmeventd code. If anything is found not to be as expected, then open()
should certainly not be attempted!
# lvmdbusd relies on command log report to inspect LVM command's execution status
report_command_log=1
# display only outermost LVM shell-related log that lvmdbusd inspects first after LVM command execution (it calls 'lastlog' for more detailed log afterwards if needed)
Scan all supported LVM block devices in the system for physical volumes
NOTE: major_minors & device_paths only usable when cache == True
:param activate: If True, activate any newly found LVs
:param cache: If True, update lvmetad
:param device_paths: Array of device paths or empty
:param major_minors: Array of structures (major,minor)
:param tmo: Timeout for operation
:param scan_options: Additional options to pvscan
:param cb: Not visible in API (used for async. callback)
:param cbe: Not visible in API (used for async. error callback)
:return: '/' if operation done, else job path
"""
fordindevice_paths:
utils.validate_device_path(MANAGER_INTERFACE,d)
r=RequestEntry(
tmo,Manager._pv_scan,
(activate,cache,device_paths,major_minors,
scan_options),cb,cbe,False)
cfg.worker_q.put(r)
@property
deflvm_id(self):
"""
Intended to be overridden by classes that inherit
"""
returnstr(id(self))
@property
defUuid(self):
"""
Intended to be overridden by classes that inherit
"""
importuuid
returnuuid.uuid1()
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.