IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum.
Create a linear LV with a dm-integrity layer above it:
lvcreate --type integrity --integrity String [options]
Create a raid1 LV with a dm-integrity layer over each
raid image:
lvcreate --type raid1 --integrity y [options]
Add a dm-integrity layer to an existing linear LV,
or to each image of an existing raid1 LV:
lvconvert --integrity y LV
Remove the dm-integrity layer from a linear LV, or
from the images of a raid1 LV:
lvconvert --integrity n LV
Integrity metadata:
The --integrity String specifies if the dm-integrity
metadata (checksums) should be interleaved with data
blocks, or written to a separate external LV.
--integrity external (default)
Use integrity with metadata on a separate LV.
Allows removing integrity from the LV later.
--integrity y
Same as integrity external.
--integrity n
Remove integrity (external metadata only.)
--integrity internal
Use integrity with metadata interleaved with
data on the same LV.
Only allowed with new linear LVs.
Internal integrity cannot be removed from an LV.
Around 1% of the LV size is used for integrity metadata.
Command variations:
lvcreate --type integrity -n Name -L Size VG
[Uses integrity external, the default.]
lvcreate --integrity y|external -n Name -L Size VG
[Uses type integrity, which is implied.]
lvcreate --integrity internal -n Name -L Size VG
[Uses type integrity, which is implied.]
lvcreate --type raid1 --integrity y -m Num -n Name -L Size VG
[Uses type integrity for each raid image.]
lvconvert --type integrity LV
[Converts linear LV to type integrity.]
lvconvert --integrity y|external LV
[Converts linear LV to type integrity, or each
image to type integrity in a raid1 LV.]
Options:
--integritymetadata LV
Use the specified LV for external metadata.
Allows specific device placement of metadata.
Without this option, the command creates a
hidden LV (with an _imeta suffix) to hold the
metadata. (Not usable with raid1+integrity.)
--integritysettings String
set dm-integrity parameters, e.g. to use a journal
instead of bitmap, --integritysettings "mode=J".
Example:
$ lvcreate --integrity external -n lvex -L1G vg
$ lvs -a vg
LV VG Attr LSize Origin
lvex vg gwi-a----- 1.00g [lvex_iorig]
[lvex_imeta] vg ewi-ao---- 12.00m
[lvex_iorig] vg -wi-ao---- 1.00g
$ lvcreate --integrity internal -n lvin -L1G vg
$ lvs -a vg
LV VG Attr LSize Origin
lvin vg gwi-a----- 1.00g [lvin_iorig]
[lvin_iorig] vg -wi-ao---- 1.00g
$ lvcreate --type raid1 --integrity y -m 1 -n lver -L1G vg
$ lvs -a vg
LV VG Attr LSize Origin
lver vg rwi-a-r--- 1.00g
[lver_rimage_0] vg gwi-aor--- 1.00g [lver_rimage_0_iorig]
[lver_rimage_0_imeta] vg ewi-ao---- 12.00m
[lver_rimage_0_iorig] vg -wi-ao---- 1.00g
[lver_rimage_1] vg gwi-aor--- 1.00g [lver_rimage_1_iorig]
[lver_rimage_1_imeta] vg ewi-ao---- 12.00m
[lver_rimage_1_iorig] vg -wi-ao---- 1.00g
[lver_rmeta_0] vg ewi-aor--- 4.00m
[lver_rmeta_1] vg ewi-aor--- 4.00m
To write a new/repaired pv_header and label_header:
pvck --repairtype pv_header --file <file> <device>
This uses the metadata input file to find the PV UUID,
device size, and data offset.
To write new/repaired metadata text and mda_header:
pvck --repairtype metadata --file <file> <device>
This requires a good pv_header which points to one or two
metadata areas. Any metadata areas referenced by the
pv_header are updated with the specified metadata and
a new mda_header. "--settings mda_num=1|2" can be used
to select one mda to repair.
To combine all header and metadata repairs:
pvck --repair --file <file> <device>
It's best to use a raw metadata file as input, that was
extracted from another PV in the same VG (or from another
metadata area on the same PV.) pvck will also accept a
metadata backup file, but that will produce metadata that
is not identical to other metadata copies on other PVs
and other areas. So, when using a backup file, consider
using it to update metadata on all PVs/areas.
To get a raw metadata file to use for the repair, see
pvck --dump metadata|metadata_search.
List all instances of metadata from the metadata area:
pvck --dump metadata_search <device>
Save one instance of metadata at the given offset to
the specified file (this file can be used for repair):
pvck --dump metadata_search --file <file>
--settings "metadata_offset=<off>" <device>
using --settings:
mda_offset=<offset> mda_size=<size> can be used
in place of the offset/size that normally come
from headers.
metadata_offset=<offset> prints/saves one instance
of metadata text at the given offset, in
metadata_all or metadata_search.
This reverts commit 7474440d3b.
lvs can use the scanning optimization again since it has
been changed in:
"scanning: optimize by checking text offset and checksum"
After the VG lock is taken for vg_read, reread the mda_header
and compare the metadata text offset and checksum to what was
seen during label scan. If it is unchanged, then the metadata
has not changed since the label scan, and the metadata does not
need to be reread under the lock for command processing.
For commands that do not make changes (e.g. reporting), the
mda_header is reread and checked on one mda to decide if the
full metadata rereading can be skipped. For other commands
(e.g. modifying the vg) the mda_header is reread and checked
from all PVs. (These could probably just check one mda also.)
The kernel MD runtime requires region size to be larger than stripe size
on striped raid layouts, thus the dm-raid target's constructor rejects
such request.
This causes e.g. an 'lvcreate --type raid10 -i3 -I4096 -R2048 -n lv vg' to fail.
Avoid failing late in the kernel by enforcing region size to be
larger or equal to stripe size.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1698225
When pvcreate/pvremove prompt the user, they first release
the global lock, then acquire it again after the prompt,
to avoid blocking other commands while waiting for a user
response. This release/reacquire changes the locking
order with respect to the hints flock (and potentially other
locks). So, to avoid deadlock, use a nonblocking request
when reacquiring the global lock.
The scanning optimization can produce warnings from
'lvs' when run concurrently with commands modifying LVs,
so disable the optimization until it can be improved.
Without the scanning optimization, lvs will always
read all PVs twice:
1. read metadata from all PVs, saving it in memory
2. for each VG
3. lock VG
4. reread metadata from all PVs in VG, replacing metadata
saved from step 1
5. run command on VG
6. unlock VG
The optimization would usually cause step 4 to be skipped,
and PVs would be read only once.
Running the command in step 5 using metadata that was not
read under the VG lock is usually fine, except for the
fact that lvs attempts to validate the metadata by comparing
it to current dm state. If other commands are modifying dm
state while lvs is running, lvs may see differences between
metadata from step 1 and dm state checked during step 5,
and print warnings.
(A better fix may be to detect the concurrent change and
fall back to rereading metadata in step 4 only when needed.)
Since we fixed linking of proper version of 'libdevmapper' with
linking lvm2 plugin correctly - we already have correct function
available linked with internal lvm library.
So drop unneeded include of parsing function.
Avoid mem leaking hint on every loop continue and
allocate hint only when it's going to be added into list.
Switch to use 'dm_strncpy()' and validate sizes.
For dev_in_device_list() != 0 allocated 'devl' was
actually leaking - so instead allocate 'devl' only
when !dev_in_device_list() and indent code around.
Since we check for NULL pointers earlier we need
to be consistent across function - since the NULL
would applies across whole function.
When dropping 'mda' check - we are actually
already dereferencing it before - so it can't
be NULL at that places (and it's validated
before entering _read_mda_header_and_metadata).
dev_unset_last_byte() must be called while the fd is still valid.
After a write error, dev_unset_last_byte() must be called before
closing the dev and resetting the fd.
In the write error path, dev_unset_last_byte() was being called
after label_scan_invalidate() which meant that it would not unset
the last_byte values.
After a write error, dev_unset_last_byte() is now called in
dev_write_bytes() before label_scan_invalidate(), instead of by
the caller of dev_write_bytes().
In the common case of a successful write, the sequence is still:
dev_set_last_byte(); dev_write_bytes(); dev_unset_last_byte();
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
When resizing 2 volumes like thin-pool and it's metadata and they
would be of a different type - command would be actually expecting
both LVs being of a same segtype - and would throw an error in
case they are different.
This patch fixes is by setting a new segtype from last segment of
2nd. extented device.
Also it fixes the possible 'percentage' extension setup that
might have been used for 'primary' volume - while the 'secondary'
LV always goes with direct size - as we do not support 'percentage'
setup for them
This affects maily usage of thin-pool where the extension of
thin-pool data size may also lead to extension of metadata size.
When running cluster test with clvmd, the actual 'monitoring'
happens in cluster - so the 'already monitored' message
is also logged within clvmd code and the command cannot
see such effect.
clvmd was incapable to report this information back to command
so it cannot be displayed this way.
Add 'lvs -o+seg_monitor' validation which also works in clustered mode.
Instead of checking all LVs in a VG - do just a direct copy of LVs
from the existing list ->segs_using_thin_lv.
TODO: maybe it could be better to expose seg_list to /tools...
Enhance lv_info with lv_info_with_name_check.
This 'variant' not only check existance if UUID in DM table
but also compares its DM name whether it's matching expected LV name.
Otherwise activation may 'skip' activation with rename in case the
DM UUID already exists, just device is different name.
This change make fairly easier manipulation with i.e. detached mirror
leg which ATM is using same UUID - just the LV name have been changed.
Used code was not able to run 'activation' (and do a rename) and just
skipped the call. So the code used to do a workaround and 'tried'
to deactivate such LV firts - this however work only in non-clvmd case,
as cluster was not having the lock for deactivated LV.
With this extended lv_info code will run 'activation' and will
synchronize the name to match expected LV name.
Patch extends _lv_info() with new paramter 'with_name_check',
which is later translated into 'name_check' argument for
_info_run() which in case of name mismatch evaluates the
check as if device does not exists.
Such call is only used in one place _lv_activate() which then
let activation run. All other invocation of _info() calls
are left intact.
TODO: fix mirror table manipulation (and raid)....
Avoid making more dbus calls to get information we already have. This
also avoids us getting an error where a dbus object representation is
being deleted by another process while we are trying to gather information
about it across the wire.
When a LV loses an interface it ends up getting removed and recreated.
This happens after the VGs have been processed and updated. Thus when
this happens we need to re-check the VGs.
VDO pool LVs are represented by a new dbus interface VgVdo. Currently
the interface only has additional VDO properties, but when the
ability to support additional LV creation is added we can add a method
to the interface.
The return value from bcache_invalidate_fd() was not being checked.
So I've introduced a little function, _invalidate_fd() that always
calls bcache_abort_fd() if the write fails.
The resume of 'released' 'COW' should preceed the resume of origin.
The fact we need to do the sequence differently for merge was
cause by bugs fixed in 2 previous commits - so we no longer need
to recognize 'merging' and we should always go with single
sequence.
The importance of this order is - to properly remove '-real' device
from origin LV. When COW is activated as 2nd. '-real' device is
kept in table as it cannot be removed during 1st. resume of origin,
and later activation of COW LV no longer builds tree associated
with origin LV.
When checking device id of a thin device that is just being
merged - the snapshot actually could have been already finished
which means '-real' suffix for the LV is already gone and just LV
is there - so check explicitely for this condition and use
correct UUID for this case.
When a cachevol LV is attached, have the LV keep it's lock
allocated. The lock on the cachevol won't be used while
it's attached. When the cachevol is split a new lock does
not need to be allocated. (Applies to cachevol usage by
both dm-cache and dm-writecache.)
When a user includes "--type foo" in a command, only
look at command definitions with matching type, as
opposed to using matching/mismatching --type as a
vote for/against a given command def. This means a
command with --type foo will prioritize a command def
with --type foo over other command defs that have
more matching options but an unmatching type. This
makes it more likely that a closely matching command
def will be recommended.
When LV gets cached and uses cache-pool - such cache-pool
will now get _cpool suffix automatically.
Thus 'Pool' column for cached LV will now show either _cvol
or _cpool LV.
Improve the implementation of extracting all text metadata
copies from the metadata area. Use this for the existing
metadata_all dump option.
Add a new metadata_search dump option which does not use
lvm headers to find metadata, but looks in standard
locations. This is useful if headers are damaged and
can't be used to locate metadata.
Adding '-v' to metadata_all or metadata_search will add
the description and creation_time to the printed list of
metadata instances that are found.
Before 'archive()' is called, lvm2 must not touch/modify metadata.
So move setting CACHE_VOL related flags past this point.
Also make sure reading of cache segtype always restores this
flag properly (even if compatible flag would be lost).
Since code is using -cdata and -cmeta UUID suffixes, it does not need
any new 'extra' ID to be generated and stored in metadata.
Since introduce of new 'segtype' cache+CACHE_USES_CACHEVOL we can
safely assume 'new' cache with cachevol will now be created
without extra metadata_id and data_id in metadata.
For backward compatibility, code still reads them in case older
version of metadata have them - so it still should be able
to activate such volumes.
Bonus is lowered size of lv structure used to store info about LV
(noticable with big volume groups).
In the hex dump output, grep for the vgname
followed by one space. This allows for test pids
with up to seven digits, which are used to contruct
the variable vgname used by the test. Otherwise
the long vgname wraps to the next line and fails to
match in grep.
When an LV is used as a writecache cachevol, give
it the LV name a _cvol suffix. Remove the suffix
when the cachevol is detached, restoring the
original LV name.
The first part of a cachevol LV is used for metadata,
and the rest of the space is used for data. The
division of space between metadata and data depends
on the total size of the cachevol.
The previous division gave more space than needed to
metadata, it was:
cachevol size 8M to 128M -> metadata size 16M *
cachevol size 128M to 1G -> metadata size 32M
cachevol size 1G and up -> metadata size 64M
(* if this resulted in over half the LV used as
metadata, then half the cachevol would be used
for metadata, and the other half for data.)
The division of space now gives less space to
metadata, it is:
cachevol size 8M to 16M -> metadata size 4M
cachevol size 16M to 4G -> metadata size 8M
cachevol size 4G to 16G -> metadata size 16M
cachevol size 16G to 32G -> metadata size 32M
cachevol size 32G and up -> metadata size 64M
When a VG contains some LVs with unknown segtypes, the user
should still be allowed to activate other LVs in the VG that
are understood.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi------- 4.00m
other foo vwi---u--- 48.00m
$ lvcreate -l1 foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Cannot change VG foo with unknown segments in it!
Cannot process volume group foo
$ lvchange -ay foo/lvol0
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
$ lvchange -ay foo/other
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Refusing activation of LV foo/other containing an unrecognised segment.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi-a----- 4.00m
other foo vwi---u--- 48.00m
A cachevol LV had the CACHE_VOL status flag in metadata,
and the cache LV using it had no new flag. This caused
problems if the new metadata was used by an old version
of lvm. An old version of lvm would have two problems
processing the new metadata:
. The old lvm would return an error when reading the VG
metadata when it saw the unknown CACHE_VOL status flag.
. The old lvm would return an error when reading the VG
metadata because it would not find an expected cache pool
attached to the cache LV (since the cache LV had a
cachevol attached instead.)
Change the use of flags:
. Change the CACHE_VOL flag to be a COMPATIBLE flag (instead
of a STATUS flag) so that old versions will not fail when
they see it.
. When a cache LV is using a cachevol, the cache LV gets
a new SEGTYPE flag CACHE_USES_CACHEVOL. This flag is
appended to the segtype name, so that old lvm versions
will fail to use the LV because of an unknown segtype,
as opposed to failing to read the VG.
Enhance activation of cached devices using cachevol.
Correctly instatiace cachevol -cdata & -cmeta devices with
'-' in name (as they are only layered devices).
Code is also a bit more compacted (although still not ideal,
as the usage of extra UUIDs stored in metadata is troublesome
and will be repaired later).
NOTE: this patch my brink potentially minor incompatiblity for 'runtime' upgrade
Instead of using 'noflush' option, switch cache_mode into WRITETHROUGH
which does not require flushing, when user confirmed he does not
want flushing for WRITEBACK (because of (partially) missing caching PV)
For wiping we activate and clear 'regular' devices,
since in case of whole process interuption (i.e. kill -9)
we leave metadata & DM table and workable state all the time.
Let vgck --updatemetadata repair cases where different mdas
hold indepedently valid but unmatching copies of the metadata,
i.e. different text metadata checksums or text metadata sizes.
vgck --updatemetadata would write the same correct
metadata to good mdas, and then to bad mdas, but the
sequence of vg_write/vg_commit calls betwen good and
bad mdas could cause a different description field to
be generated for good/bad mdas. (The description field
describing the command was recently included in the
ondisk copy of the metadata text.)
If a VG is forcibly changed from lock_type sanlock to
lock_type none, the internal lvmlock LV is left behind.
If that LV is not removed before vgremove is run on the
VG, then an internal check will be triggered by the
hidden lvmlock LV. So, check for and remove a left over
lvmlock LV during vgremove.
Add lots of vdo fields:
vdo_operating_mode - For vdo pools, its current operating mode.
vdo_compression_state - For vdo pools, whether compression is running.
vdo_index_state - For vdo pools, state of index for deduplication.
vdo_used_size - For vdo pools, currently used space.
vdo_saving_percent - For vdo pools, percentage of saved space.
vdo_compression - Set for compressed LV (vdopool).
vdo_deduplication - Set for deduplicated LV (vdopool).
vdo_use_metadata_hints - Use REQ_SYNC for writes (vdopool).
vdo_minimum_io_size - Minimum acceptable IO size (vdopool).
vdo_block_map_cache_size - Allocated caching size (vdopool).
vdo_block_map_era_length - Speed of cache writes (vdopool).
vdo_use_sparse_index - Sparse indexing (vdopool).
vdo_index_memory_size - Allocated indexing memory (vdopool).
vdo_slab_size - Increment size for growing (vdopool).
vdo_ack_threads - Acknowledging threads (vdopool).
vdo_bio_threads - IO submitting threads (vdopool).
vdo_bio_rotation - IO enqueue (vdopool).
vdo_cpu_threads - CPU threads for compression and hashing (vdopool).
vdo_hash_zone_threads - Threads for subdivide parts (vdopool).
vdo_logical_threads - Logical threads for subdivide parts (vdopool).
vdo_physical_threads - Physical threads for subdivide parts (vdopool).
vdo_max_discard - Maximum discard size volume can recieve (vdopool).
vdo_write_policy - Specified write policy (vdopool).
vdo_header_size - Header size at front of vdopool.
Previously only 'lvdisplay -m' was exposing them.
Since we now support activation of 'vdo' volume
without explicit activation of 'vdopool' it's now possible
to have active layer vdopool (-vpool) volume and
having vdopool itself inactive - yet still in this
case we can show available stats for this volume.
But we need to show correct activation status and other
standard info.
Adds support for the DM_GET_TARGET_VERSION to dmsetup.
It introduces a new comman "target-version" that will accept list
of targets and print their version.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Use /dev/md33 instead of /dev/md0 to reduce chances of
conflicting with an existing name.
Only call 'mdadm --stop /dev/md33' for cleanup and don't
use 'mdadm --stop --scan' to avoid stopping other md devs.
Due to a dm-raid target flaw fixed in target version 1.15.0,
extents of raid sets don't get resynchronized when new MD bitmp
pages have to be allocated due to the extension.
Introduce lvextend-raid.sh to test this flaw.
Related: rhbz1671964
When the PV device names in the VG metadata do not match the
current PV device names seen on the system, do not use the
optimized activation function (that avoids extra device scanning.)
When the device names do not match, it's a clue that there could
be duplicate PVs, in which case we want to scan all devicess to
find any duplicates and stop the activation if found.
This does not prevent autoactivating a VG from the incorrect
duplicate PV, because the incorrect duplicate may appear by itself
first. At that point its duplicate PV does not exist to be seen.
(A future enhancement could use the WWID to strengthen this
detection.)
Analogous to the case of a device with no regions, it is not an
error to attempt to list the stats groups on a device that has no
configured groups: just return success and continue.
Avoid checking 'lv_is_active()' since special LV types does this
validation anyway what calling _percent() function and call it
ONLY when none of special types is queried.
This restores support for VDO resize (as with support for
separate VDO pool activation, plain query for lv_is_active()
is not working in this case).
If the linear mapping is lost (for whatever reason, i.e.
test suite forcible 'dmsetup remove' linear LV,
lvm2 had hard times figuring out how to deactivate such DM table.
So add function which is in case inactive VDO pool LV checks if
the pool is actually still active (-vpool device present) and
it has open count == 0. In this case deactivation is allowed
to continue and cleanup DM table.
- use internal CACHE_VOL flag on cachevol LV
- add suffixes to dm uuids for internal LVs
- display appropriate letters in the LV attr field
- display writecache's cachevol in lvs output
This reverts commit ad560a286a.
The reverted patch also removed the warning which we realized we need
to keep as valuable process information (see related bugzilla below).
In a followup patch, we'll keep the message and avoid bailing out thus
always allowing lvconvert to try repairing if 'allocate' fault policy set.
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1751887
. For dm-cache in writethrough, always allow splitcache,
whether the cache is missing PVs or not.
. For dm-cache in writeback, if the cache is missing PVs,
allow splitcache with force and yes.
. For dm-writecache, if the cache is missing PVs,
allow splitcache with force and yes.
Enhance 'activation' experience for VDO pool to more closely match
what happens for thin-pools where we do use a 'fake' LV to keep pool
running even when no thinLVs are active. This gives user a choice
whether he want to keep thin-pool running (wihout possibly lenghty
activation/deactivation process)
As we do plan to support multple VDO LVs to be mapped into a single VDO,
we want to give user same experience and 'use-patter' as with thin-pools.
This patch gives option to activate VDO pool only without activating
VDO LV.
Also due to 'fake' layering LV we can protect usage of VDO pool from
command like 'mkfs' which do require exlusive access to the volume,
which is no longer possible.
Note: VDO pool contains 1024 initial sectors as 'empty' header - such
header is also exposed in layered LV (as read-only LV).
For blkid we are indentified as LV with UUID suffix - thus private DM
device of lvm2 - so we do not need to store any extra info in this
header space (aka zero is good enough).
When lvm2 is activating layered pool LV (to basically keep pool opened,
the other function used to be 'locking' be in sync with DM table)
use this LV in read-only mode - this prevents 'write' access into
data volume content of thin-pool.
Note: since EMPTY/unused thin-pool is created as 'public LV' for generic
use by any user who i.e. wish to maintain thin-pool and thins himself.
At this moment, thin-pool appears as writable LV. As soon as the 1st.
thinLV is created, layer volume will appear is 'read-only' LV from this moment.
When pvscan is used to activate a VG via an
asynchronous service (i.e. lvm2-pvscan), there
is no requirement that the command wait for
udev to create device nodes before returning.
It's possible that waiting for udev is slow
enough to cause the service running the command
to time out. So, allow the --noudevsync option
to be given to pvscan to skip waiting for udev.
(This commit is not changing the lvm2-pvscan
service itself to use --noudevsync.)
Still unknown is whether there are any complex
LV activation cases in which lvm itself requires
access to a device node, in which case the udev
wait could be needed by lvm itself.
(When running an activation command directly
from the command line, it's generally expected
that the activated LVs are ready to use when
the command is finished, so lvm waits for
udev to finish creating the dev nodes.)
When an online PV completed a VG, the standard
activation functions were used to activate the VG.
These functions use a full scan of all devs.
When many pvscans are run during startup and need
to activate many VGs, scanning all devs from all
the pvscans can take a long time.
Optimize VG activation in pvscan to scan only the
devs in the VG being activated. This makes use of
the online file info that was used to determine
the VG was complete.
The downside of this approach is that pvscan activation
will not detect duplicate PVs and block activation,
where a normal activation command (which scans all
devices) would.
Fixes a regression from commit ba7ff96faf
"improve reading and repairing vg metadata"
where the error path for a vg name with invalid
charaters was missing an error flag, which led
to the caller not recognizing an error occured.
Previously, an error flag was hidden in the old
_vg_make_handle function.
Only the first entry of the filter array was being
included in the copy of the filter, rather than the
entire thing. The result is that hints would not be
refreshed if the filter was changed but the first
entry was unchanged.
Fixes a segfault in the recent commit e01fddc57:
"improve duplicate pv handling for md components"
While choosing between duplicates, the info struct is
not always valid; it may have been dropped already.
Remove the code that was still using the info struct for
size comparisons. The size comparisons were a bogus check
anyway because it was just preferring the dev that had
already been chosen, it wasn't actually comparing the
dev size to the PV size. It would be good to use a
dev/PV size comparison in the duplicate handling code, but
the PV size is not available until after vg_read, not
from the scan.
With Python 3.8 converting these directly to string using str()
no longer works, we need to convert these to integer first.
On Python 3.8:
>>> str(dbus.Int64(1))
'dbus.Int64(1)'
On Python 3.7 (and older):
>>> str(dbus.UInt64(1))
'1'
This is probably related to removing __str__ function from method
from int (dbus.UInt is subclass of int) which happened in 3.8, see
https://docs.python.org/3.8/whatsnew/3.8.html
Signed-off-by: Vojtech Trefny <vtrefny@redhat.com>
Since we need to preserve allocated strings across 2 separate
activation calls of '_tree_action()' we need to use other mem
pool them dm->mem - but since cmd->mem is released between
individual lvm2 locking calls, we rather introduce a new separate
mem pool just for pending deletes with easy to see life-span.
(not using 'libmem' as it would basicaly keep allocations over
the whole lifetime of clvmd)
This patch is fixing previous commmit where the memory was
improperly used after pool release.
Update configure and make code compilable if prlimit() is not present.
Since the code is suspicious do not cope yet with it's replacement
with set/getrlimit().
New udev in rawhide seems to be 'dropping' udev rule operations for devices
that are no longer existing - while this is 'probably' a bug - it's
revealing moments in lvm2 that likely should not run in a single
transaction and we should wait for a cookie before submitting more work.
TODO: it seem more 'error' paths should always include synchronization
before starting deactivating 'just activated' devices.
We should probably figure out some 'automatic' solution for this instead
of placing sync_local_dev_name() all over the place...
Support internal removal of 'cache origin' volume - which we
do not normally expose to a user - however internal processing
loops may hit this condition (depending on order of list LVs).
So when this operation is internally requested - we automatically
try to remove it's 'holding' LV (cache LV) - which will also
remove the origin.
Drop the 'cluster-only' optimization so we do resume ALL device
before we try to wait on cookie before 'removal' operation.
It's more correct order of operation - alhtough possibly slightly
less efficient - but until we have correct list of operations
'in-progress' we can't do anything better.
With previous patch 30a98e4d67 we
started to put devices one pending_delete list instead
of directly scheduling their removal.
However we have operations like 'snapshot merge' where we are
resuming device tree in 2 subsequent activation calls - so
1st such call will still have suspened devices and no chance
to push 'remove' ioctl.
Since we curently cannot easily solve this by doing just single
activation call (which would be preferred solution) - we introduce
a preservation of pending_delete via command structure and
then restore it on next activation call.
This way we keep to remove devices later - although it might be
not the best moment - this may need futher tunning.
Also we don't keep the list of operation in 1 trasaction
(unless we do verify udev symlinks) - this could probably
also make it more correct in terms of which 'remove' can
be combined we already running 'resume'.
Udev debugging is a bit tricky, so to more easily pair cookie ID,
which is the lowest 16 bit - print cookie as hexa number.
This simplify pairing of processed cookies while the 'higher bit flags'
are changed for the same cookie.
Resuming of 'error' table entry followed with it's dirrect removal
is now troublesame with latest udev as it may skip processing of
udev rules for already 'dropped' device nodes.
As we cannot 'synchronize' with udev while we know we have devices
in suspended state - rework 'cleanup' so it collects nodes
for removal into pending_delete list and process the list with
synchronization once we are without any suspended nodes.
Between 'resume' and 'remove' we need to wait for udev to synchronize,
otherwise udev may 'skip' resume event processing if the udev node
is already gone.
When pvmove is finished, we do a tricky operation since we try to
resume multiple different device that were all joined into 1 big tree.
Currently we use the infromation from existing live DM table,
where we can get list of all holders of pvmove device.
We look for these nodes (by uuid) in new metadata, and we do now a full
regular device add into dm tree structure. All devices should be
already PRELOAD with correct table before entering suspend state,
however for correctly working readahead we need to put correct info
also into RESUME tree. Since table are preloaded, the same table
is skip and resume, but correct read ahead is now set.
Eliminate md components at the start so they don't
interfere with actual duplicates, and don't need
to be removed later. This also allows for choosing
no copy of a PVID if they all happen to be md
components.
Usually md components are eliminated in label scan and/or
duplicate resolution, but they could sometimes get into
the vg_read stage, where set_pv_devices compares the
device to the PV.
If set_pv_devices runs an md component check and finds
one, vg_read should eliminate the components.
In set_pv_devices, run an md component check always
if the PV is smaller than the device (this is not
very common.) If the PV is larger than the device,
(more common), do the component check when the config
setting is "auto" (the default).
The OPTIONS+="event_timeout" is Unsupported since systemd/udev version 216,
that is ~5 years ago.
Since systemd/udev version 243, there's a new message printed if unsupported
OPTIONS value is used:
Invalid value for OPTIONS key, ignoring: 'event_timeout=180'
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1740666
Some older BB with older cryptsetup tool do not 'retry' on remove
and when remove is issued right after 'fsck' - it might be
rejected with:
Device @PREFIX@-tcrypt2 is busy.
Try to use udevadm settle.
Since we use 'set -euE -o pipefail' for shell execution,
any failure of any command in the 'piped' shell can result
in failure of whole executed chain - resulting in typically
unsually test skip, that was left unnoticed.
Since checked command have usually short output, the simplest
fix seems to be to let grep parse whole output instead
of quiting after first match.
Kernels <2.6.27 don't have /sys/dev dir - add code for looking
out device name via longre seach in /sys/block
This makes commands like 'dmsetup dep -o blkdevname' working.
Fix versioning for updated symbols dm_stats_create_region
and dm_stats_create_region.
Only the latest symbol should have global entry.
Since I'm not sure what is currenlty the best option for
old symbols - we added support for easy commenting of them
(so we do not lose information when the symbol appeared
for the first time.)
Note: some old already deleted symbols should have been
restored as comments.
When there are more devices than the current soft
open file limit (default 1024), raise the soft limit
to the hard/max limit (default 4096).
Do this prior to scanning in case enough of the
devices are PVs that need to be kept open.
Avoid having PVs with different logical block sizes in the same VG.
This prevents LVs from having mixed block sizes, which can produce
file system errors.
The new config setting devices/allow_mixed_block_sizes (default 0)
can be changed to 1 to return to the unrestricted mode.
Do this at two levels, although one would be enough to
fix the problem seen recently:
- Ignore any reported sector size other than 512 of 4096.
If either sector size (physical or logical) is reported
as 512, then use 512. If neither are reported as 512,
and one or the other is reported as 4096, then use 4096.
If neither is reported as either 512 or 4096, then use 512.
- When rounding up a limited write in bcache to be a multiple
of the sector size, check that the resulting write size is
not larger than the bcache block itself. (This shouldn't
happen if the sector size is 512 or 4096.)
Previously, consecutive copies of metadata would have garbage
data in the space between them. After metadata wrapping,
the garbage would be portions of old metadata. This made
analysis of the metadata area more difficult.
This would happen because the start of new copy of metadata
is advanced from the end of the last copy to start at the
next 512 byte boundary.
Zero the space between consecutive copies of metadata by
extending each metadata write to end at the next 512 byte
boundary. The size of the metadata itself is not extended,
only the write. The buffer being written contains the
metadata text followed by the necessary number of zeros.
An active md device with an end superblock causes lvm to
enable full md component detection. This was being done
within the filter loop instead of before, so the full
filtering of some devs could be missed.
Also incorporate the recently added config setting that
controls the md component detection.
Fix commit 7836e7aa1c
"pvscan: ignore device with incorrect size"
which caused pvscan to not consider a PV online (for purposes
of event based activation) if the PV and device sizes differed.
This helped to avoid mistaking MD components for PVs, and is
replaced by triggering an md component check when PV and device
sizes differ (which happens in set_pv_device).
This check was mistakenly removed when shifting code in commit
"separate code for setting devices from metadata parsing".
Put it back with some new conditions.
This allows the creation of a striped mirror leg(s) during upconvert
by adding lvconvert command line options --stripes/--stripesize
for 'mirror' to tools/command-lines.in.
In case multiple mirror legs are being added, all will have the
same requested striped layout.
Resolves: rhbz1720705
We've been assigning this in 69-dm-lvm-metad.rules:
ENV{ID_MODEL}="LVM PV $env{ID_FS_UUID_ENC} on /dev/$name"
This was for the description to appear for each systemd device
unit representing this device, for example:
$systemctl -a | grep "LVM PV"
dev-block-252:2.device loaded active plugged LVM PV JhxC7B-YTgk-3jIU-5GVo-c4gV-W8t3-UUz06p on /dev/vda2 2
dev-disk-by\x2did-lvm\x2dpv\x2duuid\x2dJhxC7B\x2dYTgk\x2d3jIU\x2d5GVo\x2dc4gV\x2dW8t3\x2dUUz06p.device loaded active plugged LVM PV JhxC7B-YTgk-3jIU-5GVo-c4gV-W8t3-UUz06p on /dev/vda2 2
...
However, there could be an actual ID_MODEL that people are interested in
more than the fact that this is an LVM PV and so we shouldn't overwrite
the value.
Also, we already have a symlink /dev/disk/by-id/lvm-pv-uuid-<PV_UUID>
created which is then reflected as device unit (all device's symlinks
have systemd device unit representation) so we can still reach this
information in systemd unit listings even without setting the ID_MODEL.
Reported here: https://github.com/lvmteam/lvm2/issues/21
The cache repair utility does not yet work with a cachevol
(where metadata and data exist on the same LV.) So, warn
and prompt if writeback is specified with a cachevol.
The exported VG checking/enforcement was scattered and
inconsistent. This centralizes it and makes it consistent,
following the existing approach for foreign and shared
VGs/PVs, which are very similar to exported VGs/PVs.
The access policy that now applies to foreign/shared/exported
VGs/PVs, is that if a foreign/shared/exported VG/PV is named
on the command line (i.e. explicitly requested by the user),
and the command is not permitted to operate on it because it
is foreign/shared/exported, then an access error is reported
and the command exits with an error. But, if the command is
processing all VGs/PVs, and happens to come across a
foreign/shared/exported VG/PV (that is not explicitly named on
the command line), then the command silently skips it and does
not produce an error.
A command using tags or --select handles inaccessible VGs/PVs
the same way as a command processing all VGs/PVs, and will
not report/return errors if these inaccessible VGs/PVs exist.
The new policy fixes the exit codes on a somewhat random set of
commands that previously exited with an error if they were
looking at all VGs/PVs and an exported VG existed on the system.
There should be no change to which commands are allowed/disallowed
on exported VGs/PVs.
Certain LV commands (lvs/lvdisplay/lvscan) would previously not
display LVs from an exported VG (for unknown reasons). This has
not changed. The lvm fullreport command would previously report
info about an exported VG but not about the LVs in it. This
has changed to include all info from the exported VG.
When vg_read rescans devices with the intention of
writing the VG, the label rescan can open the devs
RW so they do not need to be closed and reopened
RW in dev_write_bytes.
Previously the VG metadata description field (which contains
the command line) was only included in backup/archive copies
of the metadata. Now also include it in the metadata written
to the metadata areas.
When monitoring, skip exported VGs without causing a command
failure.
The lvm2-monitor service runs 'vgchange --monitor y', so
any exported VG on the system would cause the service to
fail.
The man page generation for pvchange/lvchange/vgchange was
incorrect (leaving out some option listings) as a result of
commit e225bf5 "fix command definition for pvchange -a"
So when the target name happened to be a suffix of another one,
the grep was filtering incorrect line
(i.e. dm-cache && dm-writecache) - so do a line head matching.
The -a was being included in the set of "one or more"
options instead of an actual required option. Even
though the cmd def was not implementing the restrictions
correctly, the command internally was.
Adjust the cmd def code which did not support a command
with some real required options and a set of "one or more"
options.
The way that this command now uses the global lock
followed by a label scan, it can simply check if the
new VG name exists, and if not lock it and create it.
These two flags may be not reset at the end of
the command when the unlock is implicit, which
is a problem if the cmd struct is reused.
Clear the flags in the general fin_locking.
* origin/master: (22 commits)
tests: add metadata-bad-mdaheader.sh
tests: add metadata-bad-text.sh
tests: add outdated-pv.sh
tests: add metadata-old.sh
tests: add missing-pv missing-pv-unused
metadata.c: removed unused code
improve reading and repairing vg metadata
add a warning message when updating old metadata
vgcfgbackup add error messages
vgck --updatemetadata is a new command
move pv header repairs to vg_write
process_each_pv handle outdated pvs
move wipe_outdated_pvs to vg_write
create separate lvmcache update functions for read and write
fix vg_commit return value
change args for text label read function
add mda arg to add_mda
keep track of which mdas have old metadata in lvmcache
ability to keep track of outdated pvs in lvmcache
ability to keep track of bad mdas in lvmcache
...
The fact that vg repair is implemented as a part of vg read
has led to a messy and complicated implementation of vg_read,
and limited and uncontrolled repair capability. This splits
read and repair apart.
Summary
-------
- take all kinds of various repairs out of vg_read
- vg_read no longer writes anything
- vg_read now simply reads and returns vg metadata
- vg_read ignores bad or old copies of metadata
- vg_read proceeds with a single good copy of metadata
- improve error checks and handling when reading
- keep track of bad (corrupt) copies of metadata in lvmcache
- keep track of old (seqno) copies of metadata in lvmcache
- keep track of outdated PVs in lvmcache
- vg_write will do basic repairs
- new command vgck --updatemetdata will do all repairs
Details
-------
- In scan, do not delete dev from lvmcache if reading/processing fails;
the dev is still present, and removing it makes it look like the dev
is not there. Records are now kept about the problems with each PV
so they be fixed/repaired in the appropriate places.
- In scan, record a bad mda on failure, and delete the mda from
mda in use list so it will not be used by vg_read or vg_write,
only by repair.
- In scan, succeed if any good mda on a device is found, instead of
failing if any is bad. The bad/old copies of metadata should not
interfere with normal usage while good copies can be used.
- In scan, add a record of old mdas in lvmcache for later, do not repair
them while reading, and do not let them prevent us from finding and
using a good copy of metadata from elsewhere. One result is that
"inconsistent metadata" is no longer a read error, but instead a
record in lvmcache that can be addressed separate from the read.
- Treat a dev with no good mdas like a dev with no mdas, which is an
existing case we already handle.
- Don't use a fake vg "handle" for returning an error from vg_read,
or the vg_read_error function for getting that error number;
just return null if the vg cannot be read or used, and an error_flags
arg with flags set for the specific kind of error (which can be used
later for determining the kind of repair.)
- Saving an original copy of the vg metadata, for purposes of reverting
a write, is now done explicitly in vg_read instead of being hidden in
the vg_make_handle function.
- When a vg is not accessible due to "access restrictions" but is
otherwise fine, return the vg through the new error_vg arg so that
process_each_pv can skip the PVs in the VG while processing.
(This is a temporary accomodation for the way process_each_pv
tracks which devs have been looked at, and can be dropped later
when process_each_pv implementation dev tracking is changed.)
- vg_read does not try to fix or recover a vg, but now just reads the
metadata, checks access restrictions and returns it.
(Checking access restrictions might be better done outside of vg_read,
but this is a later improvement.)
- _vg_read now simply makes one attempt to read metadata from
each mda, and uses the most recent copy to return to the caller
in the form of a 'vg' struct.
(bad mdas were excluded during the scan and are not retried)
(old mdas were not excluded during scan and are retried here)
- vg_read uses _vg_read to get the latest copy of metadata from mdas,
and then makes various checks against it to produce warnings,
and to check if VG access is allowed (access restrictions include:
writable, foreign, shared, clustered, missing pvs).
- Things that were previously silently/automatically written by vg_read
that are now done by vg_write, based on the records made in lvmcache
during the scan and read:
. clearing the missing flag
. updating old copies of metadata
. clearing outdated pvs
. updating pv header flags
- Bad/corrupt metadata are now repaired; they were not before.
Test changes
------------
- A read command no longer writes the VG to repair it, so add a write
command to do a repair.
(inconsistent-metadata, unlost-pv)
- When a missing PV is removed from a VG, and then the device is
enabled again, vgck --updatemetadata is needed to clear the
outdated PV before it can be used again, where it wasn't before.
(lvconvert-repair-policy, lvconvert-repair-raid, lvconvert-repair,
mirror-vgreduce-removemissing, pv-ext-flags, unlost-pv)
Reading bad/old metadata
------------------------
- "bad metadata": the mda_header or metadata text has invalid fields
or can't be parsed by lvm. This is a form of corruption that would
not be caused by known failure scenarios. A checksum error is
typically included among the errors reported.
- "old metadata": a valid copy of the metadata that has a smaller seqno
than other copies of the metadata. This can happen if the device
failed, or io failed, or lvm failed while commiting new metadata
to all the metadata areas. Old metadata on a PV that has been
removed from the VG is the "outdated" case below.
When a VG has some PVs with bad/old metadata, lvm can simply ignore
the bad/old copies, and use a good copy. This is why there are
multiple copies of the metadata -- so it's available even when some
of the copies cannot be used. The bad/old copies do not have to be
repaired before the VG can be used (the repair can happen later.)
A PV with no good copies of the metadata simply falls back to being
treated like a PV with no mdas; a common and harmless configuration.
When bad/old metadata exists, lvm warns the user about it, and
suggests repairing it using a new metadata repair command.
Bad metadata in particular is something that users will want to
investigate and repair themselves, since it should not happen and
may indicate some other problem that needs to be fixed.
PVs with bad/old metadata are not the same as missing devices.
Missing devices will block various kinds of VG modification or
activation, but bad/old metadata will not.
Previously, lvm would attempt to repair bad/old metadata whenever
it was read. This was unnecessary since lvm does not require every
copy of the metadata to be used. It would also hide potential
problems that should be investigated by the user. It was also
dangerous in cases where the VG was on shared storage. The user
is now allowed to investigate potential problems and decide how
and when to repair them.
Repairing bad/old metadata
--------------------------
When label scan sees bad metadata in an mda, that mda is removed
from the lvmcache info->mdas list. This means that vg_read will
skip it, and not attempt to read/process it again. If it was
the only in-use mda on a PV, that PV is treated like a PV with
no mdas. It also means that vg_write will also skip the bad mda,
and not attempt to write new metadata to it. The only way to
repair bad metadata is with the metadata repair command.
When label scan sees old metadata in an mda, that mda is kept
in the lvmcache info->mdas list. This means that vg_read will
read/process it again, and likely see the same mismatch with
the other copies of the metadata. Like the label_scan, the
vg_read will simply ignore the old copy of the metadata and
use the latest copy. If the command is modifying the vg
(e.g. lvcreate), then vg_write, which writes new metadata to
every mda on info->mdas, will write the new metadata to the
mda that had the old version. If successful, this will resolve
the old metadata problem (without needing to run a metadata
repair command.)
Outdated PVs
------------
An outdated PV is a PV that has an old copy of VG metadata
that shows it is a member of the VG, but the latest copy of
the VG metadata does not include this PV. This happens if
the PV is disconnected, vgreduce --removemissing is run to
remove the PV from the VG, then the PV is reconnected.
In this case, the outdated PV needs have its outdated metadata
removed and the PV used flag needs to be cleared. This repair
will be done by the subsequent repair command. It is also done
if vgremove is run on the VG.
MISSING PVs
-----------
When a device is missing, most commands will refuse to modify
the VG. This is the simple case. More complicated is when
a command is allowed to modify the VG while it is missing a
device.
When a VG is written while a device is missing for one of it's PVs,
the VG metadata is written to disk with the MISSING flag on the PV
with the missing device. When the VG is next used, it is treated
as if the PV with the MISSING flag still has a missing device, even
if that device has reappeared.
If all LVs that were using a PV with the MISSING flag are removed
or repaired so that the MISSING PV is no longer used, then the
next time the VG metadata is written, the MISSING flag will be
dropped.
Alternative methods of clearing the MISSING flag are:
vgreduce --removemissing will remove PVs with missing devices,
or PVs with the MISSING flag where the device has reappeared.
vgextend --restoremissing will clear the MISSING flag on PVs
where the device has reappeared, allowing the VG to be used
normally. This must be done with caution since the reappeared
device may have old data that is inconsistent with data on other PVs.
Bad mda repair
--------------
The new command:
vgck --updatemetadata VG
first uses vg_write to repair old metadata, and other basic
issues mentioned above (old metadata, outdated PVs, pv_header
flags, MISSING_PV flags). It will also go further and repair
bad metadata:
. text metadata that has a bad checksum
. text metadata that is not parsable
. corrupt mda_header checksum and version fields
(To keep a clean diff, #if 0 is added around functions that
are replaced by new code. These commented functions are
removed by the following commit.)
uses vg_write to correct more common or less severe issues,
and also adds the ability to repair some metadata corruption
that couldn't be handled previously.
and implement it based on a device, not based
on a pv struct (which is not available when the
device is not a part of the vg.)
currently only the vgremove command wipes outdated
pvs until more advanced recovery is added in a
subsequent commit
The vg read and vg write cases need to update lvmcache
differently, so create separate functions for them.
The read case now handles checking for outdated mdas
and moves them aside into a new list to be repaired in
a subsequent commit.
The existing comment was desribing the correct behavior,
but the code didn't match. The commit is successful if
one mda was committed. Making it depend on the result of
the internal lvmcache update was wrong.
Have the caller pass the label_sector to the read
function so the read function can set the sector
field in the label struct, instead of having the
read function return a pointer to the label for
the caller to set the sector field.
Also have the read function return a flag indicating
to the caller that the scanned device was identified
as a duplicate pv.
Outdated PVs hold metadata for VG from which they
have been removed. Add the ability to keep track
of these in lvmcache.
This will be used for more advanced repair in a
subsequent commit.
mda's that cannot be processed by lvm because of
some corruption can be kept on a separate list.
These will be used for more advanced repair in a
subsequent commit.
When reading metadata headers and text, use a new set
of flags to identify specific errors that are seen.
These will be used for more advanced repair in a
subsequent commit.
If udev info is missing for a device, (which would indicate
if it's an MD component), then do an end-of-device read to
check if a PV is an MD component. (This is skipped when
using hints since we already know devs in hints are good.)
A new config setting md_component_checks can be used to
disable the additional end-of-device MD checks, or to
always enable end-of-device MD checks.
When both hints and udev info are disabled/unavailable,
the end of PVs will now be scanned by default. If md
devices with end-of-device superblocks are not being
used, the extra I/O overhead can be avoided by setting
md_component_checks="start".
Save the previous duplicate PVs in a global list instead
of a list on the cmd struct. dmeventd reuses the cmd struct
for multiple commands, and the list entries between commands
were being freed (apparently), causing a segfault in dmeventd
when it tried to use items in cmd->unused_duplicate_devs
that had been saved there by the previous command.
Use the recently added dump routines to produce the
old/traditional pvck output, and remove the code that
had been used for that.
The validation/checking done by the new routines means
that new lines prefixed with CHECK are printed for
incorrect values.
Recent kernel version from kernel commit:
de7180ff908b2bc0342e832dbdaa9a5f1ecaa33a
started to report in cache status line new flag:
no_discard_passdown
Whenever lvm spots unknown status it reports:
Unknown feature in status:
So add reconginzing this feature flag and also report this with
'lvs -o+kernel_discards'
When no_discard_passdown is found in status 'nopassdown' gets reported
for this field (roughly matching what we report for thin-pools).
Add 'pvck --dump headers' to print all the
lvm ondisk structs. Also checks the values
and prints any problems.
The previous dump metadata is also converted to
use these same routines, which do not depend on lvm
fully scanning/reading/processing the headers and
metadata on disk. This makes it useful to get data in
cases where there is corruption that would otherwise
prevent the normal functions from working.
The test was failing consistently on some VMs (F25), and inconsistently
on Rawhide.
With increased latency these failures are no longer reproducible.
Reproducer:
make check_lvmpolld T=pvmove-resume-multiseg.sh
The new command 'pvck --dump metadata PV' will extract
the current version of VG metadata from a PV for testing
and debugging. --dump metadata_area extracts the entire
text metadata area.
teardown after the test was failing, probably because
of uncoordinated udev actions running on the test
system. Try to avoid this by doing some work before
teardown.
lvm2 till version 2.02.169 (commit 78d004efa8)
was printing invalid creation_time argument into metadata on 32bit arch.
However with commit ba9820b142 we started
to properly validate all input numbers and thus we refused to accept
invalid metadata with 'garbage' string - but this results in the
situation where metadata produced on older lvm2 on 32 bit architecture
will become unreadable after upgrade.
To fix this case - extend libdm parser in a way, that whenever we
find error integer value, we also check if the parsed value is not for
creation_time node and in this case we let the metadata pass through
with made-up date 2018-05-24 (release date of 2.02.169).
commit aa75b31db5
"pvscan: handle case of scanning PV without metadata last"
failed to recognize that an arg may be null in the case of
'pvscan --cache' (without -aay) which does not keep track
of complete VGs because it does not need to activate them.
The scanning rework missed removing this instance of label scan.
It's no longer needed because of the way that label scan is always
run once from the start of the command. This unnecessary scan
would be triggered by running 'pvs @tag'.
If an md component is not excluded by other means and
vg_read is used to read metadata from it, then this new
check compares the device size with the PV size, and runs
a full md check on the device if the sizes don't match.
and don't call it from inside pvcreate_each_device.
This avoids having to repeat it for users of
pvcreate_each_device (pvcreate/pvremove/vgcreate/vgextend.)
There have been two file locks used to protect lvm
"global state": "ORPHANS" and "GLOBAL".
Commands that used the ORPHAN flock in exclusive mode:
pvcreate, pvremove, vgcreate, vgextend, vgremove,
vgcfgrestore
Commands that used the ORPHAN flock in shared mode:
vgimportclone, pvs, pvscan, pvresize, pvmove,
pvdisplay, pvchange, fullreport
Commands that used the GLOBAL flock in exclusive mode:
pvchange, pvscan, vgimportclone, vgscan
Commands that used the GLOBAL flock in shared mode:
pvscan --cache, pvs
The ORPHAN lock covers the important cases of serializing
the use of orphan PVs. It also partially covers the
reporting of orphan PVs (although not correctly as
explained below.)
The GLOBAL lock doesn't seem to have a clear purpose
(it may have eroded over time.)
Neither lock correctly protects the VG namespace, or
orphan PV properties.
To simplify and correct these issues, the two separate
flocks are combined into the one GLOBAL flock, and this flock
is used from the locking sites that are in place for the
lvmlockd global lock.
The logic behind the lvmlockd (distributed) global lock is
that any command that changes "global state" needs to take
the global lock in ex mode. Global state in lvm is: the list
of VG names, the set of orphan PVs, and any properties of
orphan PVs. Reading this global state can use the global lock
in sh mode to ensure it doesn't change while being reported.
The locking of global state now looks like:
lockd_global()
previously named lockd_gl(), acquires the distributed
global lock through lvmlockd. This is unchanged.
It serializes distributed lvm commands that are changing
global state. This is a no-op when lvmlockd is not in use.
lockf_global()
acquires an flock on a local file. It serializes local lvm
commands that are changing global state.
lock_global()
first calls lockf_global() to acquire the local flock for
global state, and if this succeeds, it calls lockd_global()
to acquire the distributed lock for global state.
Replace instances of lockd_gl() with lock_global(), so that the
existing sites for lvmlockd global state locking are now also
used for local file locking of global state. Remove the previous
file locking calls lock_vol(GLOBAL) and lock_vol(ORPHAN).
The following commands which change global state are now
serialized with the exclusive global flock:
pvchange (of orphan), pvresize (of orphan), pvcreate, pvremove,
vgcreate, vgextend, vgremove, vgreduce, vgrename,
vgcfgrestore, vgimportclone, vgmerge, vgsplit
Commands that use a shared flock to read global state (and will
be serialized against the prior list) are those that use
process_each functions that are based on processing a list of
all VG names, or all PVs. The list of all VGs or all PVs is
global state and the shared lock prevents those lists from
changing while the command is processing them.
The ORPHAN lock previously attempted to produce an accurate
listing of orphan PVs, but it was only acquired at the end of
the command during the fake vg_read of the fake orphan vg.
This is not when orphan PVs were determined; they were
determined by elimination beforehand by processing all real
VGs, and subtracting the PVs in the real VGs from the list
of all PVs that had been identified during the initial scan.
This is fixed by holding the single global lock in shared mode
while processing all VGs to determine the list of orphan PVs.
wipe_lv knows it's going to write the device, so it
can open rw from the start. It was opening readonly,
and then dev_write needed to reopen it readwrite.
To avoid tiny race on checking arrival of signal and entering select
(that can latter remain stuck as signal was already delivered) switch
to use pselect().
If it would needed, we can eventually add extra code for older systems
without pselect(), but there are probably no such ancient systems in
use.
Handle the case where pvscan --cache -aay (with no dev args)
gets to the final PV, completing the VG, but that final PV does not
have VG metadata. In this case, we need to use VG metadata from a
previously scanned PV in the same VG, which we saved for this
possibility. Using this saved metadata, we can find which VG
this PVID belongs to, and then check if that VG is now complete,
and if so add the VG name to the list of complete VGs to be
autoactivated.
When hints are invalid and ignored, the list of hints
could be non-empty (from additions before an invalid
hint was found). This confused the calling code which
was checking for an empty list to see if hints were used.
Ensure the list is empty when hints are not used.
We already used Conflicts=shutdown target to stop LVM2 services on shutdown.
But we still missed the ordering - the shutdown.target should be reached
only after all the services are really stopped.
Reported here: https://github.com/lvmteam/lvm2/issues/17
If a device looks like a PV, but its size does not
match the PV size in the metadata, then skip it for
purposes of autoactivation. It's probably not wrong
device for the PV.
In the past, the first 'pvscan --cache -aay dev' command
to run on the system would initialize the pvs_online dir
by scanning all devs and creating online files for all pvs
it found, and then autoactivating the VG (if complete) for
the named dev. The idea was that the system may not have
been able to run pvscan commands for early devices, so the
first pvscan to run would need to "make up" for any devices
that had appeared previously, which the system was unable to
scan. The problem or idea of making up for missed scans is
historical and should no longer be needed, so remove this
special init case.
When pvscan is run for the initialization case (the first
pvscan run on the system), it scans all devs and creates
online files for all PVs it finds. Previously it would
then autoactivate every complete VG, but change this to
only autoactive the (complete) VG corresponding to the
named device arg(s).
- remove reference to locking_type which is no longer used
- remove references to adopting locks which has been disabled
- move some sanlock-specific info out of a general section
- remove info about doing automatic lockstart by the system
since this was never used (the resource agent does it)
- replace info about lvextend and manual refresh under gfs2
with a description about the automatic remote refresh
This reverts 518a8e8cfb
"lvmlockd: activate mirror LVs in shared mode with cmirrord"
because while activating a mirror LV with cmirrord worked,
changes to the active cmirror did not work.
When data are growing, adapt also size of metadata.
As we get way too many reports from users doing huge growths of
data portion while keep metadata small and avoiding using monitoring.
So to enhance the user-experience in case user requests grown of
thin-pool (without passing PV list for growth) - lvm2 will automaticaly
grown also the metadata part of thin-pool (if possible).
Add function for estimation of thin-pool metadata size for given size of
data. Function is using already existing internal API so it can
be reused for resize of thin-pool data.
Update the previous commit to leave the vgname as
an arg instead of moving it into the select option,
(the compound select option rule is confusing the
dlm arg processing.)
Using --select 'lvname=LV && vgname=VG' avoids the problem
of the lvchange exit code not distinguishing an actual error
result vs the VG or LV not existing. (This is in case there
is an odd dlm/gfs2 setup where some nodes are running the dlm
but do not have access to the VG.)
When lvextend extends an LV that is active with a shared
lock, use this as a signal that other hosts may also have
the LV active, with gfs2 mounted, and should have the LV
refreshed to reflect the new size. Use the libdlmcontrol
run api, which uses dlm_controld/corosync to run an
lvchange --refresh command on other cluster nodes.
When an LV is active with a shared lock, a command can be
run to change the LV with --lockopt skiplv (to override the
exclusive lock the command ordinarily requires which is not
compatible with the outstanding shared lock.)
In this case, other commands may have the LV active and may
need to refresh the LV, so print warning stating this.
Udev is running udev-rule action upon 'resume'.
However lvm2 in special case is doing replacement of
'soon-to-be-removed' device with 'error' target for resuming
and then follows actual removal - the sequence is usually quick,
so when udev start action - it can result in 'strange' error
message in kernel log like:
Process '/usr/sbin/dmsetup info -j 253 -m 17 -c --nameprefixes --noheadings --rows -o name,uuid,suspended' failed with exit code 1.
To avoid this - we need to ensure there is synchronization wait for udev
between 'resume' and 'remove' part of this process.
However existing code put strict requirement to avoid synchronizing with
udev inside critical section - but this originally came from requirement
to not do anything special while there could be devices in
suspend-state. Now we are able to see differnce between critical section
with or without suspended devices. For udev synchronization only
suspended devices are prohibited to be there - so slightly relax
condition and allow calling and using 'fs_sync()' even inside critical
section - but there must not be any suspended device.
Allow using caching with VDO.
User can either cache a single vdopool or
a vdo LV - difference when the caching is put-in depends on a use-case
and it's upto user to decide which kind of speed is expected.
Internal detection of SCSI device being in-use by DM mpath has been
performed several times for each component device - this could be
eventually racy - so instead when we do remember 1st. checked result
for device being mpath and use it consistenly over the filter runtime.
Move DM usage into dev_manager.c source file.
Also convert STATUS to INFO ioctl - as that's enough
to obtain UUID - this also avoid issuing unwanted flush on checked DM
device for being mpath.
Fix to previous commit
"pvscan: ignore online for shared and foreign PVs"
which was incorrectly considering a PV foreign if its
VG had no system ID when the host did have a system ID.
Activation would not be allowed anyway, but we can
check for these cases early and avoid wasted time in
pvscan managing online files an attempting activation.
This is the default bcache size that is created at the
start of the command. It needs to be large enough to
hold a single copy of metadata for a given VG, or the
VG cannot be read or written (since the entire VG would
not fit into available memory.)
Increasing the default reduces the chances of anyone
needing to increase the default to use their VG.
The size can be set in lvm.conf global/io_memory_size;
the lower limit is 4 MiB and the upper limit is 128 MiB.
When a single copy of metadata gets within 1MB of the
current io_memory_size value, begin printing a warning
that the io_memory_size should be increased.
which defines the amount of memory that lvm will allocate
for bcache. Increasing this setting is required if it is
smaller than a single copy of VG metadata.
and "cachepool" to refer to a cache on a cache pool object.
The problem was that the --cachepool option was being used
to refer to both a cache pool object, and to a standard LV
used for caching. This could be somewhat confusing, and it
made it less clear when each kind would be used. By
separating them, it's clear when a cachepool or a cachevol
should be used.
Previously:
- lvm would use the cache pool approach when the user passed
a cache-pool LV to the --cachepool option.
- lvm would use the cache vol approach when the user passed
a standard LV in the --cachepool option.
Now:
- lvm will always use the cache pool approach when the user
uses the --cachepool option.
- lvm will always use the cache vol approach when the user
uses the --cachevol option.
For users who do not want all of the fields included
in debug lines, let them specify in lvm.conf which
fields to include. timestamp, command[pid], and
file:line fields can all be disabled.
Without this, the output from different commands in a single
log file could not be separated.
Change the default "indent" setting to 0 so that the default
debug output does not include variable spaces in the middle
of debug lines.
Use the correct loop variable within the loop, instead of reusing the
initial value. Table lines after the first don't get terminated in
the right place.
Signed-off-by: Kurt Garloff <kurt@garloff.de>
When a VG has multiple PVs, and all those PVs come online
at the same time, concurrent pvscans for each PV will all
create the individual pvid files, and all will often see
the VG is now complete. This causes each of the pvscan
commands to think it should activate the VG, so there
are multiple activations of the same VG. The vg lock
serializes them, and only the first pvscan actually does
the activation, but there is still a lot of extra overhead
and time used by the other pvscans that attempt to
activate the already active VG. This can lead to a backlog
of pvscans and timeouts.
To fix this, this adds a new /run/lvm/vgs_online/ dir that
works like the existing /run/lvm/pvs_online/ dir. Each pvscan
that wants to activate a VG will first try to exlusively create
the file vgs_online/<vgname>. Only the first pvscan will
succeed, and that one will do the VG activation. The other
pvscans will find the vgname file exists and will not do the
activation step.
When a PV goes offline, the vgs_online file for the corresponding
VG is removed. This allows the VG to be autoactivated again
when the PV comes online again. This requires that the vgname be
stored in the pvid files.
Use a file lock to ensure that only one pvscan will do
initialization of pvs_online, otherwise multiple concurrent
pvscans may all see an empty pvs_online directory and
do initialization.
The pvscan that is doing initialization should also only
attempt to activate complete VGs.
When aay was included in the pvscan --cache command,
the activation part was complaining about the unusual
state of the hint file since it had been recreated
just prior.
udev_dev_is_md_component and udev_dev_is_mpath_component
are not used for obtaining the device list, but they still
use libudev for device info. When there are problems with
udev, these functions can get stuck. So, use the existing
obtain_device_list_from_udev config setting to also control
whether these "is component" functions are used, which gives
us a way to avoid using libudev entirely when it's causing
problems.
Just like we support for thin-pool syntax:
lvcreate --thinpool new_tpoolname -L Size vg
add same support logic with for vdo-poo:
lvcreate --vdopool new_vpoolname -L Size vg
Also move description of syntax bellow thin-pool, so it's
correctly ordered in generated man page.
Whenever thin-pool chunk size is unspecified and left for lvm calculation
try to select the size as nearest highest power-of-2 instead of
just being a multiple of 64KiB.
Fixing recent commit 022ebb0cfe
Resize already has size that needs to be counted with,
otherwise upsizing operation could turn into size reduction one.
Since the parse_vdo_pool_status() become vdo_manip API part,
and there will be no 'dm' matching status parser,
the API can be simplified and closely match thin API here.
Now with newer VDO kvdo target we can start to use standard mechanism
to enable resize of VDO volumes.
VDO pool can be grown.
Virtual volume grows on top of VDO pool when is not big enough.
Reduced VDOLV is calling discard for reduced areas - this can
take long time!
TODO: implement some pollable mechanism for out-of-lock TRIM.
When using 'lvcreate -l100%VG' and there is big disproportion between
real available space and requested setting - automatically fallback
to 100%FREE.
Difference can be seen when VG is big and already most space was
allocated, so the requestion 100%VG can end (and by spec for % modifier
it's correct) as LV with size of 1%VG. Usually this is not a big
problem - buit in some cases - like cache-pool allocation, this
can result a big difference for chunksize selection.
With this patch it's more closely match common-sense logic without
the need of reitteration of too big changes in lvm2 core ATM.
TODO: in the future there should be allocator solving all allocations
in a single call.
When using caches with BIG pool size (>TB) there is required
to use relatively huge chunk size. Once the chunksize has
got over 1MiB - kernel cache target stopped writing such chunks
back on this if the migration_threshold remained on default 1MiB
(2048 sectors) size.
This patch ensure, DM layer will not let pass table line which
has not big enough migration threshold that can let pass
at least 8 chunks (independently of lvm2 metadata).
lvm uses 'minimum_io_size' name to exactly match VDO naming here,
however in all common cases _size is using 'sector/512b' unit.
But in this case the value is in bytes and can have only 2 values:
either 512 or 4096.
It's probably not worth to rename it internaly, so we can just
drop comment - instead of using 1 or 8.
Thought let's think about it....
Lvm can at times have duplicate names. When this happens the daemon will
internally use vg_name:vg_uuid as the name for lookups, but display just
the vg_name externally. If an API user uses the Manager.LookUpByLvmId and
queries the vg name they will only get returned one result as the API
can only accommodate returning 1. The one returned is the first instance
found when sorting the volume groups by UUID.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1583510
When we have two logical volumes which switch their names at the
same time we are left with incorrect lookups. Anytime we find
an entry by doing a lookup by UUID or by name we will ensure
that the lookups are indeed correct.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1642176
An idea from Zdenek for better ensuring valid hints by invalidating
them when pvscan --cache <device> sees a new PV, which is a case
where we know that hints should be invalidated. This is triggered
from systemd/udev logic, and there may be some cases where it would
invalidate hints that the existing methods wouldn't detect.
If there are two independent scripts doing:
vgchange --lockstart vg
lvchange -ay vg/lv
The first vgchange to do the lockstart will wait for
the lockstart to complete before returning.
The second vgchange to do the lockstart will see that
the start is already in progress (from the first) and
will do nothing. This means the second does not wait
for any lockstart to complete, and moves on to the
lvchange which may find the lockspace still starting
and fail.
To fix this, make the vgchange lockstart command
wait for any lockstart's in progress to complete.
Save the list of PVs in /run/lvm/hints. These hints
are used to reduce scanning in a number of commands
to only the PVs on the system, or only the PVs in a
requested VG (rather than all devices on the system.)
The systemd generators are executed very early during the switch
from initramfs to system partition and the syslog is not yet fully
operational - it may cause blocking, if some debug logging is enabled
at the same time in /etc/lvm/lvm.conf log{} section.
To avoid timeouting and killing this generator - rather enhance lvm
code to suppress any syslog communication when LVM_SUPPRESS_SYSLOG
envvar is set.
Use of this envvar is needed since the parsing of i.e. cmdline options
that could eventually override lvm.conf setting happens in this case
way too late and number of lines could have been already streamed to
syslog.
Fix a scenario where global/event_activation setting is not found. In
this case we need to take default value just like lvm tools do when
executed. So use "lvmconfig --type full".
Also, if we fail to execute lvmconfig for whatever reason, fallback to
generating the activation units as failsafe action.
Reported by: Bastian Blank <waldi debian org>
When initializing an LV to hold the writecache, use wipe_lv()
which looks for specific signatures on the LV.
Wiping signatures is not necessary, but printing a warning
that names a specific signature (in addition to the existing
generic warning/confirmation) may help if a user accidentally
specifies the wrong LV which contains something important.
Detach function return 0 for error and 1 for success.
Add missing log errors from failing deactivation.
Add missing log error from failing synchronization.
In few cases error paths from initialization were returned as
'success == 1'.
Also assing num_mb with single compare checking valid sector_size.
For dumb compiler make num_mb always defined.
Since configure.h is a generated header and it's missing traditional
ifdefs preambule - it can be included & parsed multiple times.
Normally compiler is fine when defines have same value and there is
no warning - yet we don't need to parse this several times
and by adding -include directive we can ensure every file
in the package is rightly compile with configure.h as the
first header file.
If we are running the test where the device is /dev/* we will will
run the unit tests 'test_nesting' and 'test_pv_symlinks'. Otherwise
we will skip them.
When a VG is exported, the 'fullreport' returns an exit code of 5, but
otherwise returns the data we are wanting.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
In some cases we get stuck where we are unable to retrieve the current
state of lvm as we are encountering an error. When the error is
persistent we will log and exit the daemon instead of consuming vast
amounts of resources.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Drop very old original format of VDO target and focus on V2 version.
So some variables were renamed or replaced.
There is no compatibility preserved (with assumption so far this is
experimental feature and there is no real user).
Note - version currently VDO calls this version 6.2.
A bit of chicken & egg problem - dmeventd needs to use old libdm library.
VDO is only part of new device_mapper internal library.
So include directly source file for parsing status - this fixes usability
problem of VDO plugin introduced with previous Makefile reshaping
patchset.
NOTE: source file needs to be keep then compilable in both environments.
Also add missing copyright header.
This is a followup patch to commit edb72cb70c
to support related lvm2 test suite tests.
A 'global/support_mirrored_mirror_log' bool configuration variable gets
introduced allowing the creation of, or conversion to mirrored 'mirror'
logs if set. The capability to create these in turn allows the rest of
the tests to perform activation of such existing LVs and their conversions
to disk/core 'mirror' logs.
Display a disclaimer warning if enabled that this is not for regular use.
Add definition of the enabled config option to respective test scripts.
Related: rhbz1643562
With older gcc - we need to resolve symbols linked with devmapper-event
that is now using -ldevmapper.
Also add forgotten systemd library needed for dbus notification.
Avoid doing hard set of LIBS var,
so if the LIBS is set before 'include make.tmpl' it's not lost.
This gives better control over order of linked libraries.
Since there is very little change there will be any new devel going
to happing with cmirror - avoid eating extra disk space and link
with already installed libdm which implements all use basic
function of dm list
Since dmeventd is 'libdm' based project, it needs to link
libdm library instead of its internal version
An external users may provide plugins loadeable by dmeventd.
So external user of libdevmapper-event library has no other option
then to link with released libdevmapper library.
The complexity comes with lvm2 plugins.
The lvm2 plugin itself uses internal version of device_mapper,
but libdevmapper-event usage is libdm based - so there needs to be avoided
any breakage on compatibility of internal i.e. dm_task_run structures.
TODO: most likely dmeventd itself should be moved into libdm/dm-tools dir,
and only lvm2 plugins should be created as part of lvm project,
but those still need to link with libdevmapper.
If we don't know the meaning we will return the key with default text
instead of raising an exception and taking the daemon out in the
process.
Resolves: rhbz1657950
When we get bug reports we may not get the entire log, so lets
dump the fight recorder from newest to oldest as the one we
are interested in was likely to be the last command run.
Scenario: Given an existed LV `lvol0`, I want to create another LV
on the PVs used by `lvol0`.
I use `build_parallel_areas_from_lv()` to obtain the `pv_list` of each segments.
However, the returned `pv_list` is not properly initialized, which causes
segfault in subsequent operations.
Ensure configure.h is always 1st. included header.
Maybe we could eventually introduce gcc -include option, but for now
this better uses dependency tracking.
Also move _REENTRANT and _GNU_SOURCE into configure.h so it
doesn't need to be present in various source files.
This ensures consistent compilation of headers like stdio.h since
it may produce different declaration.
There's a small window during creation of a new RaidLV when
rmeta SubLVs are made visible to wipe them in order to prevent
erroneous discovery of stale RAID metadata. In case a crash
prevents the SubLVs from being committed hidden after such
wiping, the RaidLV can still be activated with the SubLVs visible.
During deactivation though, a deadlock occurs because the visible
SubLVs are deactivated before the RaidLV.
The patch adds _check_raid_sublvs to the raid validation in merge.c,
an activation check to activate.c (paranoid, because the merge.c check
will prevent activation in case of visible SubLVs) and shares the
existing wiping function _clear_lvs in raid_manip.c moved to lv_manip.c
and renamed to activate_and_wipe_lvlist to remove code duplication.
Whilst on it, introduce activate_and_wipe_lv to share with
(lvconvert|lvchange).c.
Resolves: rhbz1633167
In RHEL7 we marked mirrored mirror logs as deprecated and
added a related message. This patch prohibits creating new
'mirror' LVs with that log type or converting existing LVs
to have one.
Existing LVs with mirrored mirror log can be activated
and converted to disk/core logs.
Avoid double deprecation message when running lvconvert.
Resolves: rhbz1643562
Patch 668c9d0762 introduced regression,
since the code here would actually always return failing result.
Replace it with more simple call to strndup().
For some targets we do not want to generate dependencies.
Also add note about usage of such Makefile - it might be
possibly better to rename it to different filename to avoid
any confusion.
commit de28637
scan: use full md filter when md 1.0 devices are present
missed the fact that md superblock version 0.90 also puts
metadata at the end of the device, so the full md filter
needs to be used when either 0.90 or 1.0 is present.
. When using default settings, this commit should change
nothing. The first PE continues to be placed at 1 MiB
resulting in a metadata area size of 1020 KiB (for
4K page sizes; slightly smaller for larger page sizes.)
. When default_data_alignment is disabled in lvm.conf,
align pe_start at 1 MiB, based on a default metadata area
size that adapts to the page size. Previously, disabling
this option would result in mda_size that was too small
for common use, and produced a 64 KiB aligned pe_start.
. Customized pe_start and mda_size values continue to be
set as before in lvm.conf and command line.
. Remove the configure option for setting default_data_alignment
at build time.
. Improve alignment related option descriptions.
. Add section about alignment to pvcreate man page.
Previously, DEFAULT_PVMETADATASIZE was 255 sectors.
However, the fact that the config setting named
"default_data_alignment" has a default value of 1 (MiB)
meant that DEFAULT_PVMETADATASIZE was having no effect.
The metadata area size is the space between the start of
the metadata area (page size offset from the start of the
device) and the first PE (1 MiB by default due to
default_data_alignment 1.) The result is a 1020 KiB metadata
area on machines with 4KiB page size (1024 KiB - 4 KiB),
and smaller on machines with larger page size.
If default_data_alignment was set to 0 (disabled), then
DEFAULT_PVMETADATASIZE 255 would take effect, and produce a
metadata area that was 188 KiB and pe_start of 192 KiB.
This was too small for common use.
This is fixed by making the default metadata area size a
computed value that matches the value produced by
default_data_alignment.
The pvscan systemd service for autoactivation was
mistakenly dropped along with the lvmetad related
services.
The activation generator program now looks at the new
lvm.conf setting "event_activation" (default 1) to
switch between event activation and direct activation.
Previously, the old use_lvmetad setting was used to
switch between event and direct activation.
instead of a separate --writecacheblocksize option.
writecache block_size is not technically a setting,
but it can borrow the option as a special case.
fix lseek error check
fix read/write error checks
handle zero return from read and write
don't return an error for short io
fix partial read/write loop
io_setup() for aio may fail if a system has reached the
aio request limit. In this case, fall back to using
sync io. Also, lvm use of aio can be disabled entirely
with config setting global/use_aio=0.
The system limit for aio requests can be seen from
/proc/sys/fs/aio-max-nr
The current usage of aio requests can be seen from
/proc/sys/fs/aio-nr
The system limit for aio requests can be increased by
setting fs.aio-max-nr using sysctl.
Also add last-byte limit to the sync io code.
For embeded reshaping operation require higher driver version.
(otherwise we get:
Converting vg/LV1 from raid6 (same as raid6_zr) is directly possible to the following layouts:
raid6_nc
raid6_nr
raid6_la_6
raid6_ls_6
raid6_ra_6
raid6_rs_6
raid6_n_6
lvmpolld ATM is not desingned to preserve interval checking
in the same way the 'lvconvert' tool is doing - so the passed
'-i 40' is not respected and lvmpolld autonomously checks
state of conversion and updates lvm2 metadata and dm tables
when needed.
So skip portion of test that relayed on this and preserve this logic
only for command line invocation and forking of polling process
where the interval of checking is under full control.
Although we primarily want to check externally used libdevmapper
library, out internal all.h is still keeping all symbols
as the original library has - so for simpler compilation keep
using this private copy for defining needed symbols.
Just for case ensure compiler is not able to optimize
memset() away for resources that are released.
This idea of using memory barrier is taken from openssl.
Other options would be to check for 'explicit_bzero' function.
When preparing ioctl buffer and flatting all parameters,
add table parameters only to ioctl that do process them.
Note: list of ioctl should be kept in sync with kernel code.
DM_DEVICE_CREATE with table is doing several ioctl operations,
however only some of then takes parameters.
Since _create_and_load_v4() reused already existing dm task from
DM_DEVICE_RELOAD it has also kept passing its table parameters
to DM_DEVICE_RESUME ioctl - but this ioctl is supposed to not take
any argument and thus there is no wiping of passed data - and
since kernel returns buffer and shortens dmi->data_size accordingly,
anything past returned data size remained uncleared in zfree()
function.
This has problem if the user used dm_task_secure_data (i.e. cryptsetup),
as in this case binary expact secured data are erased from main memory
after use, but they may have been left in place.
This patch is also closing the possible hole for error path,
which also reuse same dm task structure for DM_DEVICE_REMOVE.
Add a little wait loop - since lvconvert started background process
and we need to wait till this bg task initiate its work -
adding ~1s loop should give reasonable enough time to start mirroring.
This could be seen as continuation of
6cee8f1b06.
Some test maching with old udev system shows problem,
where udev 'jumps on' leg device after mirror target
releases its legs - since udev does not (in this old case) skips
such device from scanning - it opens device - and this prevent
leg device to be deactivated - effectively such device stays
'leaked' in DM table invisibly to lvm2 command.
So to 'combat' this issue - if the device has '_mimage' in its name,
the retry of deactivation is automatically assumed.
NOTE: wider impact is unexpected - as it's touching only old mirror
target which is nowadays replaced with 'raid'.
In case there will be some problem identified - probably both patches
should be reverted.
With older systems and udevs we don't have control over scanning of lvm2
internal devices - so far we set retry-removal only for top-level LVs,
but in occasional cases udev can be 'fast enough' to open device for
scanning and prevent removal of such device from DM table.
So to combat this case - try to pass 'retry' flag also for removal of
internal device so see how many races can go away with this simple
patch.
Note: patch is applied only to internal version of libdm so the external
API remains working in the old way for now.
Commit 813347cf84 added extra validation,
however in this particular we do want to trim suffix out so rather ignore
resulting error code here intentionaly.
If a single, standard LV is specified as the cache, use
it directly instead of converting it into a cache-pool
object with two separate LVs (for data and metadata).
With a single LV as the cache, lvm will use blocks at the
beginning for metadata, and the rest for data. Separate
dm linear devices are set up to point at the metadata and
data areas of the LV. These dm devs are given to the
dm-cache target to use.
The single LV cache cannot be resized without recreating it.
If the --poolmetadata option is used to specify an LV for
metadata, then a cache pool will be created (with separate
LVs for data and metadata.)
Usage:
$ lvcreate -n main -L 128M vg /dev/loop0
$ lvcreate -n fast -L 64M vg /dev/loop1
$ lvs -a vg
LV VG Attr LSize Type Devices
main vg -wi-a----- 128.00m linear /dev/loop0(0)
fast vg -wi-a----- 64.00m linear /dev/loop1(0)
$ lvconvert --type cache --cachepool fast vg/main
$ lvs -a vg
LV VG Attr LSize Origin Pool Type Devices
[fast] vg Cwi---C--- 64.00m linear /dev/loop1(0)
main vg Cwi---C--- 128.00m [main_corig] [fast] cache main_corig(0)
[main_corig] vg owi---C--- 128.00m linear /dev/loop0(0)
$ lvchange -ay vg/main
$ dmsetup ls
vg-fast_cdata (253:4)
vg-fast_cmeta (253:5)
vg-main_corig (253:6)
vg-main (253:24)
vg-fast (253:3)
$ dmsetup table
vg-fast_cdata: 0 98304 linear 253:3 32768
vg-fast_cmeta: 0 32768 linear 253:3 0
vg-main_corig: 0 262144 linear 7:0 2048
vg-main: 0 262144 cache 253:5 253:4 253:6 128 2 metadata2 writethrough mq 0
vg-fast: 0 131072 linear 7:1 2048
$ lvchange -an vg/min
$ lvconvert --splitcache vg/main
$ lvs -a vg
LV VG Attr LSize Type Devices
fast vg -wi------- 64.00m linear /dev/loop1(0)
main vg -wi------- 128.00m linear /dev/loop0(0)
Since the test is currently directly working with live directory,
which can be getting updates from system's udev - add wait
for settling so removal of all known PVs happens after that.
But still this has major influce on behavior of running system,
so the test should never be executed on a user used box.
Check that type is always defined, if not make it explicit internal
error (although logged as debug - so catched only with proper lvm.conf
setting).
This ensures later type being NULL can't be dereferenced with coredump.
When sanlock_release returns an error because of an i/o
timeout releasing the lease on disk, lvmlockd should just
consider the lock released. sanlock will continue trying
to release the lease on disk after the original request
times out.
The lvmlock LV size was not adjusted correctly for 512 vs 4K
sector sizes which influence the lease size used by sanlock.
When lvmlock was automatically extended, the zeroing through
bcache wasn't working.
The choice about sector size and lease align size is
now made by the sanlock user, in this case lvmlockd.
This will allow lvmlockd to use other lease sizes in
the future. This also prevents breakage if hosts
report different sector sizes, or the sector size
reported by a device changes.
Since the stats handle is neither bound nor listed before the
attempt to call dm_stats_get_nr_regions(), it will always return
zero: this prevents reporting of any dmstats regions on any
device.
Remove the dm_stats_get_nr_regions() check and instead rely on
the correct return status from dm_stats_populate() which only
returns 0 in the case that there are regions to inspect (and
which logs a specific error for all other cases).
Reported-by: Bryan Gurney <bgurney@redhat.com>
It doesn't make sense to test or warn about the region count until
the stats handle has been listed: at this point it may or may not
contain valid information (but is guaranteed to be correct after
the list).
lvm uses a bcache block size of 128K. A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated. How much depends on
alignments. When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.
This patch is a quick hack to prevent lvm from writing
past the end of the metadata area.
This reverts commit 16ae968d24.
We need to come up with a better fix, because we fall short
wiping all known signatures when not using the wipe_lv API.
lvm metadata writes, commits and activations are performed
for (newly) allocated RAID metadata SubLVs to wipe any preexisiting
data thus avoid false raid superblock positives on RaidLV activation.
This process can be interrupted by command or system crashs
thus leaving stale SubLVs in the lvm metadata as a problem.
Because we hold an exclusive lock in this metadata SubLV wiping
process, we can address this problem by avoiding aforementioned
commits/writes/activations altogether wiping the respective first
sector of the first physical extent allocated to any metadata SubLV
directly via the existing dev_set() API.
Succeeds all LVM RAID tests.
Related: rhbz1633167
When persistent_filter_create() fails, the existing passed filter
should be preserved, so it could be properly deleted on
error path - so new pfilter is assigned instead.
Thin plugin started to use configuble setting to allow to configure
usage of external scripts - however to read this value it needed to
execute internal command as dmeventd itself has no access to lvm.conf
and the API for dmeventd plugin has been kept stable.
The call of command itself was not normally 'a big issue' until users
started to use higher number of monitored LVs and execution of command
got stuck because other monitored resource already started to execute
some other lvm2 command and become blocked waiting on VG lock.
This scenario revealed necesity to somehow avoid calling lvm2 command
during resource registration - but this requires bigger changes - so
meanwhile this patch tries to minimize the possibility to hit this race
by obtaining any configurable setting just once - such patch is small
and covers majority of problem - yet better solution needs to be
introduced likely with bigger rework of dmeventd.
TODO: avoid blocking registration of resource with execution of lvm2
commands since those can get stuck waiting on mutexes.
%ghost should not be used for directories created by systemd-tmpfiles.
This may prevent package from working right after installation without
invoking systemd-tmpfiles.
See: https://pagure.io/packaging-committee/issue/439
Previously the size was limited by checking if the
old and new copies of the metadata overlapped.
This generally limited the size to about half of
the total space, but it could be larger given the
size differences between old and new. Now add a
direct check to limit the size to half the space.
Remove another instance of an invalid check for metadata
overflow during read. The previous instance was removed
in commit 5fb15b193.
This was checking for metadata that that overflowed the
circular disk metadata buffer during read, but such metadata
cannot be written, so it shouldn't be possible to find see.
Also, the check was incorrect and could trigger when there
was no overflow.
This fixes a problem in commit e6bb780d24, in which the
back compat handling for the old locking_type=4 was
incorrectly translated to mean the same thing as --readonly,
which prevented activation because activation uses an
exclusive vg lock. Previously, locking_type=4 allowed
activation.
If we see locking_type 4 in an old config, translate it to
the new combination of --readonly and --sysinit, which we
now define to mean the --readonly behavior with an exception
to allow activation.
The vg_write/vg_commit code was imprecise, uncommented, and
hard to understand. Rewrite it with clearer, cleaner code,
extensive comments, descriptions of how it works, and add
more info in debugging output.
The minor changes in behavior are to things that were
either incorrect or probably unintended:
- vg_write/vg_commit no longer check that the current vgname at
the start of the text metadata matches the vgname being written.
This has already been done at least twice by the time they are
called, and repeating it again against the same cached data has
no use.
- A fragment of old removed code had been left behind that checked
if the old unused alignment policy would wrap. It was still
being checked to decide if the metadata area was full, which
could possibly cause an incorrect full metadata failure.
- vg_remove now clears both the raw_locns in the mda_header that
point to committed metadata (raw_locn slot 0) and precommitted
metadata (raw_locn slot 1). Previously it fully cleared the
committed slot, and would only clear the offset field in the
precommitted slot if it saw a problem with the metadata in the
vg being removed.
- read_metadata_location_summary was wrongly comparing the number
of wrapped bytes with an offset to report an error about the
metadata being too large. This wrong check is removed, it
could have resulted in erroneous errors.
Commit 989626926c
introduced 2 new tests
lvconvert-raid-takeover-linear_to_raid4.sh and
lvconvert-raid-takeover-raid4_to_linear.sh
which involve raid reshaping.
Bump the checked dm-raid target version to 1.14.0
which has reshape kernel fixes to avoid test suite
runs to hang.
Bump target version to 1.14.0 which contains fixes
for reshape deadlock/corruption to allow tests to
run once the respective fixes show up in kernels.
Remove now superfluous multi-core checks.
Resolves: rhbz1501145
Related: rhbz1514539
Related: rhbz1586123
Related: rhbz1613039
Allow "lvconvert --type linear RaidLV" on a raid4 LV
providing convenient interim steps to convert to linear.
Add respective new test
lvconvert-raid-takeover-raid4_to_linear.sh
and
lvconvert-raid-takeover-linear_to_raid4.sh
for linear to raid4 once on it.
When converting from striped/raid0/raid0_meta
to raid6 with > 2 stripes, allow possible
direct conversion (to raid6_n_6).
In case of 2 stripes, first convert to raid5_n to restripe
to at least 3 data stripes (the raid6 minimum in lvm2) in
a second conversion before finally converting to raid6_n_6.
As before, raid6_n_6 then can be converted
to any other raid6 layout.
Enhance lvconvert-raid-takeover.sh to test the
2 stripes conversions to raid6.
Resolves: rhbz1624038
devices/scan_lvs (default 1) determines whether lvm
will scan LVs for layered PVs. The lvm behavior has
always been to scan LVs, but it's rare for LVs to have
layered PVs, and much more common for there to be many
LVs that substantially slow down scanning with no benefit.
This is implemented in the usable filter, and has the
same effect as listing all LVs in the global_filter.
Add "lvm2-activation-generator: " prefix for all kmsg messages written by
lvm2-activation-generator so we can identify the message in global system log.
We need to have Ceph RBD devices mapped first before use in a stack
where LVM is on top so make sure rbdmap.service is called before
generated lvm2-activation-net.service.
On shutdown, we need to stop blk-availability first before we stop the
rbdmap.service.
Resolves: rhbz1623479
This is the number of concurrent async io requests that
the scan layer will submit to the bcache layer. There
will be an open fd for each of these, so it is best to
keep this well below the default limit for max open files
(1024), otherwise lvm may get EMFILE from open(2) when
there are around 1024 devices to scan on the system.
"lvconvert --type linear RaidLV" on striped and raid4/5/6/10
have to provide the convenient interim layouts. Fix involves
a cleanup to the convenience type function.
As a result of testing, add missing sync waits to
lvconvert-raid-reshape-linear_to_raid6-single-type.sh.
Resolves: rhbz1447809
Conversion to striped from raid0/raid0_meta is directly possible.
Fix a regression setting superfluous interim raid5_n conversion type
introduced by commit bd7cdd0b09.
Add new test script lvconvert-raid0-striped.sh.
Resolves: rhbz1608067
With improved mirror activation code --splitmirror issue poppedup
since there was missing proper preload code and deactivation
for splitted mirror leg.
If a mirror LV is listed in read_only_volume_list, it would
still be activated rw. The activation would initially be
readonly, but the monitoring function would immediately
change it to rw. This was a regression from commit
fade45b1d1 mirror: improve table update
The monitoring function needs to copy the read_only setting
into the new set of mirror activation options it uses.
When vgcreate does an automatic pvcreate, it opens the
dev with O_EXCL to ensure no other subsystem is using
the device. This exclusive fd remained in bcache and
prevented activation parts of lvm from using the dev.
This appeared with vgcreate of a sanlock VG because of
the unique combination where the dev is not yet a PV,
so pvcreate is needed, and the vgcreate also creates
and activates an internal LV for sanlock.
Fix this by closing the exclusive fd after it's used
by pvcreate so that it won't interfere with other
bits of lvm that may try to use the device.
Conversions of LVs under snapshot to thinpool or cachepool
correctly fail but leave them inactive and provide cryptic
error messages like 'Internal error: #LVs (10) != #visible
LVs (2) + #snapshots (1) + #internal LVs (5) in VG VG'.
Reject and provide better error message.
Resolves: rhbz1514146
The 'lvconvert LV' command def has caused multiple problems
for command matching because it matches the required options
of any lvconvert command. Any lvconvert with incorrect options
ends up matching 'lvconvert LV', which then produces an error
about incorrect options being used for 'lvconvert LV'. This
prevents suggestions from nearest-command partial command matches.
Add a special case for 'lvconvert LV' so that it won't be used
as a partial match for a command that has options specified.
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
When lvmetad is not used, use temporary files to record
which PVs have appeared. Use these temp files to determine
when a VG is complete, to trigger autoactivation.
This change allows us to remove lvmetad while keeping the
same autoactivation behavior that lvmetad provides.
The temp files are created in /run/lvm/pvs_online/ and are
named for the PVID of the PV. The files contain the
major:minor of the device the PV was read from.
e.g. if VG foo has dev1 and dev2, then:
. pvscan --cache -aay dev1
reads vg metadata from dev1
creates /run/lvm/pvs_online/<pvid-of-dev1>
checks if all vg->pvs are online: no
. pvscan --cache -aay dev2
reads vg metadata from dev2
creates /run/lvm/pvs_online/<pvid-of-dev2>
checks if all vg->pvs are online: yes
autoactivates vg
A 'pvscan --cache dev' (without -aay) still records that
dev is online.
A 'pvscan --cache --major X --minor Y' after a device is
gone will remove the temp file for it.
A 'pvscan --cache [-aay]' (no devs) resets the state of
temp files by removing them all, then scanning all devs
and creating temp files for PVs that are found.
If no online files exist, the first pvscan --cache scans
all devs and creates temp files for any PVs found.
The scope of the temp files is only pvscan, and they are only
used for pvscan-based autoactivation. No other commands are
concerned with or aware of these temp files. When lvm creates
or removes PVs, no attempt is made to update the temp files.
Support vgchange usage with VDO segtype.
Also changing extent size need small update for vdo virtual extent.
TODO: API needs enhancements so it's not about adding ifs() everywhere.
When user create vdo-pool - use different automatic name.
So unlike with traditional LVs using lvol0, lvol1
use vpool0, vpool1...
TODO: apply similar for thin-pool & cache-pool...
Checks whether VDO support is enabled.
Detects presence of 'vdoformat' tool which is required for to format VDO pool.
ATM build of VDO is NOT automatically enabled (None is default).
To enable build of LVM with VDO support use:
configure --with-vdo=internal
TODO: Maybe future version may switch to link some small VDO library for formating
(would require linking and package dependency).
To support autoloading of VDO dm target driver loading of 'kvdo'
kernel module is needed - ATM it's not using 'dm-vdo' name.
So to support this strange name - add temporarily solution to
autoload kvdo kernel module in this case.
Introduce VDO plugin for monitoring VDO devices.
This plugin can be used also by other users, as plugin checks
for UUID prefix 'LVM-' and run lvm actions only on those
devices.
Non LVM- device are only monitored and log warnings
when usage threshold reaches 80%.
Update makefile to link with more libs since now whole liblvm-internal.a
is linked-in and this library has futher dependencies.
Avoid including deps for run-unit-test.
Drop linking separate status.c as it's already linked via internal libs.
Check allocation of thin-pool works on 2PVs, when one is so full,
that even metadata do not fit there (as they need at least 2M,
while 99% of 63MB fills >62MB)
When lvm2 command is executed in test mode, discard ioctl is skipped.
This may cause even data-loose in case, issuing discard for released
areas was enabled and user 'tested' lvreduce.
When allocating thin-pool with more then 1 device - try to
allocate 'metadataLV' with reuse of log-type allocation for mirror LV.
It should be naturally place on other device then 'dataLV'.
However due to somewhat hard to follow allocation logic code,
it's been rejected allocation in cases where there was not
enough space for data or metadata on single PV, thus to successed,
usage of segments was mandatory.
While user may use:
allocation/thin_pool_metadata_require_separate_pvs=1
to enforce separe meta and data LV - on default settings, this is not
enable thus segment allocation is meant to work.
NOTE:
As already said - the original intention of this whole 'if()' is unclear,
so try to split this test into multiple more simple tests that are more readable.
TODO: more validation.
When node loading fails, there is not much the caller can do,
since there is 'unknown' set of devices preloaded.
Only suspend during preload knows future precommitted 'metadata',
so it's non-trivial to drop 'preloaded' entries with any later call.
However dm tree tracks newly loaded entries - so in this case it
may simplify the recovery path by dropping preloaded entries so
they are not leaked in the DM table.
Allow creation of any virtual segment type with just --virtualsize
specified without any real extent size give.
TODO: likely --type error,zero might be later enhanced to use -V
(along with -L) - but since those targets do not allocate real
space, supporting -V makes sense with them.
Amound of linked libraries grows.
Most of them we don't need to lock in, since we are not using
them in locked section, so skip locking them in memory.
It's important to lock memory beforo running SUSPEND ioctl - but whole
lvm preload runs in memory unlocked environment - as in this phase
memory allocation is allowed and is meant to happen.
Once all targets are preload and ready (confirmed from all targets)
we start suspending tree - and here the memory allocation (or i.e.
opening files) is no longer allowed - as it may cause kernel deadlock.
Commit 3f35146 added a check on the value returned by the
_display_info_cols() function:
1024 if (!_switches[COLS_ARG])
1025 _display_info_long(dmt, &info);
1026 else
1027 r = _display_info_cols(dmt, &info);
1028
1029 return r;
This exposes a bug in the dmstats code in _display_info_cols:
the fact that a device has no regions is explicitly not an error
(and is documented as such in the code), but since the return
code is not changed before leaving the function it is now treated
as an error leading to:
# dmstats list
Command failed.
When no regions exist.
Set the return code to the correct value before returning.
For better code reuse split _node_send_messages into commont
messaging part and separate _thin_pool_node_send_messages.
Patch makes it possible to better reuse common code for messaging
other targets.
udev creates a train wreck of events if we open devices
with RDWR. Until we can fix/disable/scrap udev, work around
this by opening RDONLY and then closing/reopening RDWR when
a write is needed. This invalidates the bcache blocks for
the device before writing so it can trigger unnecessary
rereading.
It's no longer needed. Clustered VGs are now handled in
the same way as foreign VGs, and as shared VGs that
can't be accessed:
- A command processing all VGs sees a clustered VG,
prints a message ("Skipping clustered VG foo."),
skips it, and does not fail.
- A command where the clustered VG is explicitly
named on the command line, prints a message and fails.
"Cannot access clustered VG foo, see lvmlockd(8)."
The option is listed in the set of ignored options for
the commands that previously accepted it. (Removing it
entirely would cause commands/scripts to fail if they
set it.)
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
Since we are using "DefaultDependencies=no" we do not get automatic STOP
job on socket connection - so automatically refuse connection on
shutdown by adding this Conflict definition to socket Unit.
Shuffle code for better readability as set of conditions was
hard to follow.
Make it obvious the refresh & activate path is handling
monitoring and polling on its own.
So the only --monitor and --poll option needs explicit care.
Option --monitor without option --poll will now as a result
of this patch NOT start polling.
So command: vgchange --monitor n is no longer a polling starter.
Restoring polling for activated volumes lost with my recent commit:
75fed05d3e and move start of polling
directly into _activate_lvs_in_vg() - as there we know exactly
if there was some volume even activated.
Also make it sharing same code for pvscan -aay.
With the move to top-level makefile - there are some issues
with subdir recursive makefile.
Make the building more tolerant for now until fully resolved.
The previous method for forcibly changing a clustered VG
to a local VG involved using -cn and locking_type 0.
Since those options are deprecated, replace it with
the same command used for other forced lock type changes:
vgchange --locktype none --lockopt force.
Shared VGs will generally be started and activated by
the resource agent. Without the agent, this script doesn't
have a good way to know which LVs to activate.
vgreduce, vgremove and vgcfgrestore were acquiring
the orphan lock in the midst of command processing
instead of at the start of the command. (The orphan
lock moved to being acquired at the start of the
command back when pvcreate/vgcreate/vgextend were
reworked based on pvcreate_each_device.)
vgsplit also needed a small update to avoid reacquiring
a VG lock that it already held (for the new VG name).
A few places were calling a function to check if a
VG lock was held. The only place it was actually
needed is for pvcreate which wants to do its own
locking (and scanning) around process_each_pv.
The locking/scanning exceptions for pvcreate in
process_each_pv/vg_read can be enabled by just passing
a couple of flags instead of checking if the VG is
already locked. This also means that these special
cases won't be enabled unknowingly in other places
where they shouldn't be used.
When pvmoving LV - the target for LV is a mirror so the validation
that checked the type is matching was incorrect.
While we need a more generic enhancment of LVS output for pvmoved LVs,
for now at least stop showing internal errors and 'X' symbols in attrs.
The last commit related to this was incomplete:
"Implement lock-override options without locking type"
This is further reworking and reduction of the locking.[ch]
layer which handled all clustering, but is now only used
for file locking. The "locking types" that this layer
implemented were removed previously, leaving only the
standard file locking. (Some cluster-related artifacts
remain to be cleared out later.)
Command options to override or modify locking behavior
are reimplemented here without using the locking types.
Also, deprecated locking_type values are recognized,
and implemented as if one of the equivalent override
options was set.
Options that override file locking are:
. --nolocking disables all file locking.
. --readonly grants read lock requests without actually
taking a file lock, and refuses write lock requests.
. --ignorelockingfailure tries to set up file locks and
uses them normally if possible. When not possible, it
behaves like --readonly, but allows activation.
. --sysinit is the same as ignorelockingfailure.
. global/metadata_read_only acquires actual read file
locks, and refuses write lock requests.
(Some of these options could probably be deprecated
because they were added as workarounds to various
locking_type behaviors that are now deprecated.)
The locking_type setting now has one valid value: 1 which
refers to standard file locking. Configs that contain
deprecated values are recognized and still work in
largely the same way:
. 0 disabled all locking, now implemented like --nolocking
is set. Allow the nolocking option in all commands.
. 1 is the normal file locking setting and is unchanged.
. 2 was for external locking which was not used, and
reverts to normal file locking.
. 3 was for cluster/clvm. This reverts to normal file
locking, and prints messages about lvmlockd.
. 4 was equivalent to readonly, now implemented like
--readonly is set.
. 5 disabled all locking, now implemented like
--nolocking is set.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
Add tests for linear <-> striped|raid* conversions.
Add region_size config to reshape tests to avoid test
failures in case of it being defined unexpectedly in lvm.conf.
Related: rhbz1439925
Related: rhbz1447809
"lvconvert --type {linear|striped|raid*} ..." on a striped/linear
LV provides convenience interim type to convert to the requested
final layout similar to the given raid* <-> raid* conveninece types.
Whilst on it, add missing raid5_n convenince type from raid5* to raid10.
Resolves: rhbz1439925
Resolves: rhbz1447809
Resolves: rhbz1573255
Fixes breakage from the recent libdm split. Though these didn't
ever appear to be linked (could they have piggy backed from libdevmapper.so
being linked to them?).
In this command, lvcreate creates a new LV and then combines
it with an existing cache pool, producing a cache LV. This
command was previously not allowed in in a shared VG.
When the lvmlockd lock is shared, upgrade it to ex
when repair (writing) is needed during vg_read.
Pass the lockd state through additional read-related
functions so the instances of repair scattered through
vg_read can be handled.
(Temporary solution until the ad hoc repairs can be
pulled out of vg_read into a top level, centralized
repair function.)
It's not an error if a command requests the global lock
when it has already acquired it. It shouldn't happen,
but there could be cases we've not found.
We have been warning about duplicate devices (and disabling lvmetad)
immediately when the dup was detected (during label_scan). Move the
warnings (and the disabling) to happen later, after label_scan is
finished.
This lets us avoid an unwanted warning message about duplicates
in the special case were md components are eliminated during the
duplicate device resolution.
The report uses process_each_vg() which populates lvmcache
based on a VG list from lvmetad. If there are no VGs,
but only orphan PVs, the orphans are not shown. Add an
explicit call to populate lvmcache with PV info from lvmetad.
This minor patch fixes grammar in a few messages which get
printed to users. It also fixes the same grammar mistake in
several comments.
Signed-off-by: Rick Elrod <relrod@redhat.com>
--
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
No point to start building lvm without this header file.
Although there could be 'some point' in supporting standalone build
of 'just' libdm where the libaio might be avoided.
TODO: think about configure option for building libdm only.
Unfortunatelly on kernels <4.16 lvm2 can't user brd ramdisks
for backend device as number of test is failing with this kernel
message:
device-mapper: ioctl: can't change device type after initial table load.
caused by DAX request-based handling, and lvm2 tries to replace device
with backend 'error' bio-based device and such table reload is being
rejected.
So ATM keep ramdisk only on most recent kernel to experiment a bit,
for older machines just stay safe and keep old slower loop backend.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
While newer system can detect need for 4K mkfs, on older test machines
running test suite over 4k is reporting problems.
Some more generic solution is needed thought.
When test happens to run in tmpfs, it cannot use O_DIRECT (unsupported
with tmpfs).
CHECKME: unsure if detection of tmpfs is 'valid' but kind of works and
is very simple.
As Makefiles already do use target with name 'device-mapper'
rename this new device-mapper dir to non-conflicting name.
We also seem to already use '_' in other dir names.
Also rename device_mapper/Makefile to source for generating Makefile.in
so we can use it for build in other source dirs properly.
It's very hard to use some 'non-recurive' Makefiles with
rest of system running 'recursively'.
So ATM drop inclusion of subdir makefile and add support
for 2 new top-level targets:
unit-test (builds test/unit dir)
run-unit-test (build & run test/unit/unit-test run)
Just like 52656c89fd
when now cache is compiled in 'unditionally'.
This patch is actually enforce by changes in
commit: 2bc896f2a3
where CACHE value is not set anymore.
If the test does not need root, it can use 'SKIP_ROOT_DM_CHECK'.
For such test no actions needed root to initilize DM devices and
nodes will be take and test can check i.e. functional unit tests.
Usage of dm_delay looks to be slowing not just 'delayed' portion
of device, but due to the fact it's also slows down ANY flush
operation on such device it's overal speed impact is huge.
In some case we can however user other methods to slowdown disk writes,
in case of old dm 'mirror' target we can throttle I/O of mirror
synchronisation giving the next commands enough time to test couple
race conditions.
Usage:
throttle_dm_mirror [percentage]
Thtrottle down sync speed (lowest is '1' which is also default when
unspecified)
restore_dm_mirror
Restores the value of throttling before call of 'throttle_dm_mirror'
Usually it should '100'
Currently usage of loop device over backend file in ramdisk (tmpfs)
is actually causing unnecassary memory consution, since just
reading such loop device is causing RAM provisioning.
This patch add another possible way how to use ramdisk directly
through 'brd' device when possible (and allowed).
This however has it's limitation as well - brd does not support
TRIM, so the only way how to erase is to remove brd module ??
Alse there is 4K sector size limitation imposed by ramdisk.
Anyway - for some mirror test that were using large amount of
disk space (tens of MB) this brings noticable speed boost.
(But could be worth to solve the slowness of loop in kernel?)
To prevent using 'brd' for testing set LVM_TEST_PREFER_BRD=0
like this:
make check_local LVM_TEST_PREFER_BRD=0
This test can't use brd (ramdisk) as backend since for some
weird reason lsblk is not listing these device.
TODO: test could be probably rewritten to avoid using lsblk somehow??
When the backend device supports only 4K blocks (like ramdisk)
we cannot use for testing any smaller blocksize.
So recalc test for 4K extent size.
We may possibly introduce one list extra test that
can be executed on devices with 512b sectors to
check lvm2 support those min extent sizes...
ATM it's a bit ugly to enforce flushing of 'stdio' here, but works as quick
hot-fix.
log_print*() is using buffered I/O.
But for pooling with typical 1s interval this may take a while before
buffer about continues progress gets flushed.
So ATM fflush().
TODO: either add log_print*_with_flush() or maybe directly use just
line buffering with log_print() and only log_debug() keep using buffered
I/O mode.
md devices using an older superblock version have
superblocks at the end of the md device. For commands
that skip reading the end of devices during filtering,
the md component devs will be scanned, and will appear
as duplicate PVs to the original md device. Remove
these md components from the list of unused duplicate
devices, so they are treated as if they had been
ignored during filtering. This avoids the restrictions
that are placed on using PVs with duplicates.
All these functions are now used as utilities,
e.g. for ioctl (not for io), and need to
open/close the device each time they are called.
(Many of the opens can probably be eliminated by
just using the bcache fd for the ioctl.)
with the --labelsector option. We probably don't
need all this code to support any value for this
option; it's unclear how, when, why it would be
used.
An implementation of an adaptive radix tree. Has the following nice
properties:
- At least as fast as the hash table
- Uses less memory
- You don't need to give an expected size when you create
- It scales nicely (ie. no large reallocations like the hash table).
- You can iterate the keys in lexicographical order.
Only insert and lookup are implemented so far. Plus there's a lot
more performance to come.
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
This reverts commit 0931067dc5.
The dep files should be in the build dir, which is not necc. the src dir.
Easy to fix, but reverting for now until I have time to revisit.
The clvmd saved_vg data is independent from the normal lvm
lvmcache vginfo data, so separate saved_vg from vginfo.
Normal lvm doesn't need to use save_vg at all, and in clvmd,
lvmcache changes on vginfo can be made without worrying
about unwanted effects on saved_vg.
To avoid the chance of freeing a saved vg while another
code path is using it, defer freeing saved vgs until
all the lvmcache content is dropped for the vg.
In case "lvconvert -mN RaidLV" was used on a degraded
raid1 LV, success was returned instead of an error.
Provide message to inform about the need to repair first
before changing number of mirrors and exit with error.
Add new lvconvert-m-raid1-degraded.sh test.
Resolves: rhbz1573960
I don't like having this in a common header because it means you end
up including too much and causing unneccessary dependencies. eg,
lib/misc/lib.h includes libdevmapper.h, internationalisation, and
logging stuff.
There are likely more bits of code that can be removed,
e.g. lvm1/pool-specific bits of code that were identified
using FMT flags.
The vgconvert command can likely be reduced further.
The lvm1-specific config settings should probably have
some other fields set for proper deprecation.
The mixed up vg repair code in vg_read was trying
to repair a vg when vg_read was called by clvmd.
The clvmd daemon isn't supposed to be repairing
or writing a vg.
(This is a temporary workaround; vg repair will soon
be pulled out of vg_read so it can be called in a
controlled way and consolidated instead of spread
around.)
Shift refresh of mirror table right into monitor_dev_for_events().
Use !vg_write_lock_held() to recognize use of lvchange/vgchange.
(this shall change if this would no longer work, but requires
futher some API changes).
With this patch dm mirror table is only refreshed when necassary.
Also update WARNING message about mirror usage without monitoring
and display LV name.
When 'dmsetup' reports result with --nameprefixes it currently
incorrectly 'escapes' problematic characters.
Letting pass such string though shell 'eval' function is hard task.
So instead cut away substring.
Once dmsetup will start to properly escape backslash and apostrophe
this function may need further tuning.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
External build dir now works too.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
The filters save information about devices that should
be ignored, so if we need to repeat a scan (unusual,
but happens in clvmd), we need to update the filters.
When pvmove was run in background mode and forks
instead of using lvmpolld, the child pvmove process
was not clearing the bcache from the parent, so all
the aio ops in the child were failing.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
The copy of the VG which clvmd stashes in lvmcache should
not only be used between suspend and resume, but between
sequential LV operations in clvmd, so that clvmd does not
need to reread the VG for each one. Prepare for that by
renaming the stashed VG as "saved_vg".
Instead of using delayer device user 'zero' device and let mirror
do some real work which takes some time.
In case the test machine is too fast - mirror might need to be made bigger
to meet needed criteria.
Also move all test needed this 'zero' PV trick to the end of test
so $dev2 and $dev4 are covered with 'zero' and can take any amount of
write without consuming any real space.
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Don't allow writes in test mode. test mode should be
more sophisticated than just faking writes, and this
should be a last defense for cases where test mode is
not being checked correctly.
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
In some odd cases (e.g. tests) there are very few devices
which results in creating too few blocks in bcache, so
create bcache with a minimum number of blocks.
When a PV is stacked on an LV, the LV will be kept in
bcache, and the open fd on the LV may interfere with
processing the LV. So, drop/close a bcache fd for
an LV before processing the LV.
Commands using lvmetad will not begin with a proper
label_scan which initializes bcache, but may later
decide they need to scan a set of devs, in which case
they'll need bcache set up at that point.
The improved detection of bad metadata when scanning
(where errors were ignored before) means we now have to
override some errors when forcibly erasing damaged metadata.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
This is a temporary hacky workaround to the problem of
reads going through bcache and writes not using bcache.
The write path wants to read parts of data that it is
incrementally writing to disk, but the reads (using
bcache) don't work because the writes are not in the
bcache. For now, add a dev to bcache before each attempt
to read it in case it's being used on the write path.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
In the same way as the other process_each functions.
In the common case all the info that's needed can be
used from lvmcache after a label scan. But this means
that unchosen devs for duplicate PVs need to be handled
explicitly.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
The old code was doing unnecessary label scans when
checking to see if the new VG name exists. A single
label_scan is sufficient if it is done after the
new VG lock is held.
When lvmlockd indicates that the lvmetad cache is out of
date because of changes by another node, lvmetad_pvscan_vg()
rescans the devices in the VG to update lvmetad. Use the
new label_scan in this function to use the common code and
take advantage of the new aio and reduced reads.
This fixes the use of lvmcache_label_rescan_vg() in the previous
commit for the special case of independent metadata areas.
label scan is about discovering VG name to device associations
using information from disks, but devices in VGs with
independent metadata areas have no information on disk, so
the label scan does nothing for these VGs/devices.
With independent metadata areas, only the VG metadata found
in files is used. This metadata is found and read in
vg_read in the processing phase.
lvmcache_label_rescan_vg() drops lvmcache info for the VG devices
before repeating the label scan on them. In the case of
independent metadata areas, there is no metadata on devices, so the
label scan of the devices will find nothing, so will not recreate
the necessary vginfo/info data in lvmcache for the VG. Fix this
by setting a flag in the lvmcache vginfo struct indicating that
the VG uses independent metadata areas, and label rescanning should
be skipped.
In the case of independent metadata areas, it is the metadata
processing in the vg_read phase that sets up the lvmcache
vginfo/info information, and label scan has no role.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When user configured lvm2 to NOT user monitoring, activated mirror
actually hang upon error and it's quite unusable moment.
So instead Warn those 'brave' non-monitoring users about possible
problem and activation mirror without blocking error handling.
This also makes it a bit simpler for test suite to handle trouble
cases when test is running without dmeventd.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Whenever we make visible LV out of previously invisible one,
reload it's table - the is mandator for proper udev rule
processing as well as ensure content of dm table is correct.
TODO: this new generic rule probably make extra raid rules unnecessary.
Fixing regresion on argument acceptance where any lv can be passed
with paramaterless lvconvert which is meant to figure out needed
operation - i.e. wait for mirror synchronization.
User has no other 'effective' method to wait for mirror getting in-sync.
Since we support snapshot of mirrors, we do need to properly check
for stacked lock holder - fixes problem of pvmove in cluster
with mirrors under snapshot.
WHATS_NEW for this patch goes with 'Restore pvmove support...'
The current logic that avoids setting SYSTEMD_ALIAS and SYSTEMD_WANTS
on "change" events is flawed in the default "systemd background job"
configuration. For systemd, it's important that device properties don't
change spuriously.
If an "add" event starts lvm2-pvscan@.service for a device, and a
"change" event follows, removing SYSTEMD_ALIAS and SYSTEMD_WANTS from the
udev db, information about unit dependencies between the device and the
pvscan service can be lost in systemd, in particular if the daemon
configuration is reloaded.
Steps to reproduce problem:
- create a device with an LVM PV
- remove device
- add device (generates "add" and "change" uevents for the device)
(at this point SYSTEMD_ALIAS and SYSTEMD_WANTS are clear in udev db)
- systemctl daemon-reload
(systemd reloads udev db)
- vgchange -a n
- remove device
=> the lvm2-pvscan@.service for the device is still active although the
device is gone.
- add device again
=> the PV is not detected, because systemd sees the lvm2-pvscan@.service
as active and thus doesn't restart it.
The original purpose of this logic was to avoid volumes being scanned
over and over again. With systemd background jobs, that isn't necessary,
because systemd will not restart the job as long as it's active.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Make the distinction between the cases with and without systemd
background jobs explicit in 69-dm-lvm-metad.rules rather than
substituting the rule from the Makefile. At this stage,
this improves only readibility, at the cost of one GOTO statement.
This patch introduces no functional change to the udev rules.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Test that no (Sub)LV remnants persist if the volume group is
not listed in configuration variable activation/volume_list,
hence not activatable thus causing initialization of rmeta
SubLVs to fail.
Related: rhbz1161347
btrfs is using fake major:minor device numbers.
try to be smarter and detect used node via DM device name.
This shortens delays, where i.e. lvm2 is asked to deactivate
volume with mounted btrfs as such operation is not retryed
and user is informed about device being in use.
Correct testing with format 1 and mq policy.
Add testing of 'smq'
Fix testing with clvmd - where logged message is part of clvmd log
and we can only check command status.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
Use 4K chunks since some older kernels are not capable
to create striped volumes with smaller size.
TODO: lvm2 should detect this ahead and avoid kernel
reporting "Invalid chunk".
Since this function is called with 'fd == -1', but Coverity can't see
this path can't be visited with this argument, add explicit check for
valid descriptor.
If the tools for checking thin_pool or cache metadata are missing,
issue rather just a WARNING, but let the operation of activation
continue.
This has the advantage, the if user is missing those tools,
but he already started to use thinpool or cacheing, he can
access these volumes with a WARNING.
Also if the user is using too old tools i.e. for CacheV2 format
dmpd tool 0.7 is required - provide informative WARNING and
skip failure from older tool version which can't understand
new format V2.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
While prioritized_section() based on raised priority works
nicely for standard lvm comman - separate counter is actually needed
when it's used in daemons like clvmd/dmeventd where priority
stays raised all the time.
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Introduce prioritized_section() as a closer match to previous logic
of critical_section() that has been held over longer sequence of
ioctl commands - essentially it's matching operation on a single
cookie.
While 'critical_section()' now corresponds to locked memory - we hold
this memory only between suspend/resume thus notion of 'cookie' was
lost.
This patch restores some logic unintentionaly lost with dropping
memory locking for just activation/deactivation calls.
When function dm_stats_populate() returns 0 it's an error and needs
log_error() message - function can't have 'success' returning 0 or
error without reasons.
Prevent call of dm_stats_populate(), when there has been no
stats region detected for a DM device.
Such skip is evaluated as 'correct' visit of stats call and
not causing 'dmstats' command failure.
Resulting loop table line was streamed to 'stderr' stream - assuming this
was not a feature when user used '-v' for more verbose output
and properly show it via 'log_verbose()' on 'stdout'.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
With problematic kernels raid devices can be occasionaly left with
'frozen' status - try to 'unfreeze' them with idle message on teardown.
Also replace couple greps with 'built-in' dmsetup --select feature.
Note: dmsetup --select currently reports 'No devices found' on stdout
and return success - looks like a bug to fix.
Just like lvm2 has internal devices like _tdata which is using UUID with
suffix, there is similar private type of device for crypto device where
they are using CRYPT-TEMP uuid prefix.
Also ignore stratis.
Some kernel version suffer from bad state transition where a device
steps into 'frozen' mode. Any application that tries to read such
raid gets unfortunatelly bloked.
As some sort of protection try to skip such raid device from being
scanned to minimize chances to block lvm2 command on such scan.
When such device is found, warning gets printed.
RaidLVs on read_only_volume_list have their SubLVs
activated readonly thus disabling metadata updates
or image resynchronization/recovery. Bug also causes
automatic repairs to fail.
Fix by always activating the RAID SubLVs readwrite.
Resolves: rhbz1208269
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
When snapshot is created in read-only mode with 'lvcreate -s -pr...',
lvm2 still needs to be able to write to layered -cow volume
to store metadata and exceptions blocks.
TODO: in some case we might be able to do full tree with read-only
volume but this probably needs futher validation:
1. checking snapshot header already exist
2. origin & snapshot are both in read-only mode.
Occasionaly users may need to peek into 'component devices.
Normally lvm2 does not let users activation component.
This patch adds special mode where user can activate
component LV in a 'read-only' mode i.e.:
lvchange -ay vg/pool_tdata
All devices can be deactivated with:
lvchange -an vg | vgchange -an....
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
When dmeventd thin plugin forks a configurable script, switch to use
execvp to pass whole environment present to dmeventd - so all configured
paths present at dmeventd startup are visible to script.
This was likely not a problem for common user enviroment,
however in test suite case variable like LVM_SYSTEM_DIR were
not actually used from test itself but rather from
a system present lvm.conf and this may have cause strange
behavior of a testing script.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
This is somewhat tricky - for test suite we keep using
'set -e -o pipefail' - the effect here is - we get error report
from any 'failing' command in whole pipeline - thus when something
like this: 'lvs | head -1' is used - and 'head' finishes before
lead 'lvs' is done - it recieves SIGPIPE and exits with error,
and somewhat misleading gets occasionally reported depending
of speed of commands.
For this case we have to avoid using standard pipes and rather
switch to using streamed results with temporary output file.
This is all nicely handled with bash feature '< <()'.
For more info:
https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail
While 'file-locking' code always dropped cached VG before
lock was taken - other locking types actually missed this.
So while the cache dropping has been implement for i.e. clvmd,
actually running command in cluster keept using cache even
when the lock has been i.e. dropped and taken again.
This rather 'hard-to-hit' error was noticable in some
tests running in cluster where content of PV has been
changed (metadata-balance.sh)
Fix the code by moving cache dropping directly lock_vol() function.
TODO: it's kind of strange we should ever need drop_cached_metadata()
used in several places - this all should happen automatically
this some futher thinking here is likely needed.
Improve pvmove to accept 'locally' active LVs together with
exclusive active LVs.
In the 1st. phase it now recognizes whether exclusive pvmove is needed.
For this case only 'exclusively' or 'locally-only without remote
activative state' LVs are acceptable and all others are skipped.
During build-up of pvmove 'activation' steps are taken, so if
there is any problem we can now 'skip' LVs from pvmove operation
rather then giving-up whole pvmove operation.
Also when pvmove is restarted, recognize need of exclusive pvmove,
and use it whenever there is LV, that require exclusive activation.
So this is a bit more complex and possibly worth futher checking.
ATM clvmd drops cmd->mem mempool AFTER refresh of cmd.
So anything allocating from cmd->mem during toolcontext init
will likely die at some point in time.
As a quick fix - just use regular malloc/free for 'dso' alloction.
It's worth to note - cmd->libmem seems to be often misused
causing hidden memleaking for clvmd.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.
Avoid using same return code for reporting 2 different things
and stricly report error code by return value and add new
parameter for reporting monitoring status.
This makes easier to recognize which error we got from dm_event
and continue only with ENOENT.
Check if the generated vg name still fits the buffer.
So too long strings are rejected.
Drop -1 from size passed to snprintf - as the \0 is already included.
On occasional gcc releases it's better to specify also -ldevmapper
to linking logic for python object.
It's in fact more correct since the liblvm.c code is using
libdevmapper functions - that were linked in only via
liblvm2app library.
This partially reverts commit da37cbd24f.
As the _cmdline structure use mempool for allocated ellement
that is being release on cmd_context close.
Before the better fix is made - restore previous logic and
reinitialize cmd structures again for new cmd_context.
Problem can be hit with e.g. this test run:
make check_local T=foreign LVM_VALGRIND_DMEVENTD=1
Invalid read of size 1
at 0x4C31C83: strcmp (vg_replace_strmem.c:846)
by 0x6BA0939: _find_command (lvmcmdline.c:1555)
by 0x6BA4304: lvm_run_command (lvmcmdline.c:2810)
by 0x6BD5E02: lvm2_run (lvmcmdlib.c:91)
by 0x685607E: dmeventd_lvm2_run (dmeventd_lvm.c:118)
by 0x6652684: _use_policy (dmeventd_thin.c:117)
by 0x6652E56: process_event (dmeventd_thin.c:298)
by 0x10CC5A: _do_process_event (dmeventd.c:945)
by 0x10CF83: _monitor_thread (dmeventd.c:1033)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Address 0x6266270 is 4,352 bytes inside a block of size 8,192 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x5289142: dm_free_wrapper (dbg_malloc.c:393)
by 0x528998A: _free_chunk (pool-fast.c:318)
by 0x52892A6: dm_pool_destroy (pool-fast.c:78)
by 0x6A8E52C: destroy_toolcontext (toolcontext.c:2254)
by 0x6BA5BD6: lvm_fin (lvmcmdline.c:3327)
by 0x6BD5EA7: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x5288F46: dm_malloc_aux (dbg_malloc.c:287)
by 0x52890AC: dm_malloc_wrapper (dbg_malloc.c:371)
by 0x52898E6: _new_chunk (pool-fast.c:286)
by 0x52893BA: dm_pool_alloc_aligned (pool-fast.c:106)
by 0x5289310: dm_pool_alloc (pool-fast.c:90)
by 0x6A8A21A: _load_config_file (toolcontext.c:808)
by 0x6A8A3D9: _init_lvm_conf (toolcontext.c:842)
by 0x6A8D3BD: create_toolcontext (toolcontext.c:1941)
by 0x6BA5B24: init_lvm (lvmcmdline.c:3308)
by 0x6BD5B7C: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5EB8: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
With pthreaded daemons like 'dmeventd' using liblvm via plugin,
lvm2 actually should not 'play' with streams at all - as there
could be parallel outputs running.
As a current quick workaround just disable change for pthreaded
program (gettid() != getpid()).
TODO: it's possible the change of buffering actually doesn't serve us
any measurable benefit and could be dropped as whole later...
Meanwhile this patch is fixing this occasional valgrind race report:
Invalid read of size 4
at 0x571892C: vfprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x57216B3: fprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x5042886: dm_event_log (libdevmapper-event.c:925)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Address 0x6264d30 is 192 bytes inside a block of size 552 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x573907D: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5C00: reopen_standard_stream (log.c:189)
by 0x6A8E62C: destroy_toolcontext (toolcontext.c:2271)
by 0x6BA5C22: lvm_fin (lvmcmdline.c:3339)
by 0x6BD5EF3: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x573932B: fdopen@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5DC2: reopen_standard_stream (log.c:200)
by 0x6A8D11D: create_toolcontext (toolcontext.c:1898)
by 0x6BA5B6B: init_lvm (lvmcmdline.c:3319)
by 0x6BD5BC8: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5F04: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
....
Process terminating with default action of signal 6 (SIGABRT): dumping core
at 0x570016B: raise (in /usr/lib64/libc-2.26.9000.so)
by 0x5701520: abort (in /usr/lib64/libc-2.26.9000.so)
by 0x57437D8: __libc_message (in /usr/lib64/libc-2.26.9000.so)
by 0x5743831: __libc_fatal (in /usr/lib64/libc-2.26.9000.so)
by 0x5744056: _IO_vtable_check (in /usr/lib64/libc-2.26.9000.so)
by 0x574751C: __overflow (in /usr/lib64/libc-2.26.9000.so)
by 0x574191A: fputc (in /usr/lib64/libc-2.26.9000.so)
by 0x50428E3: dm_event_log (libdevmapper-event.c:934)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Some tools are typically installed into /usr/sbin (or /sbin) dir.
And some systems do not add this path to user's $PATH var.
Ensure sbin paths are looked through...
In fact pvmove does support 'clustered-core' target for clustered
pvmove of LVs activated on multiple nodes.
This patch restores support for activation of pvmove on all nodes
for LVs that are also activate on all nodes.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
Add protectional internall error whenever we spot activation
of 'exclusive' only segments in 'non-exclusive' mode.
TODO: possibly the activation locking could be enhanced to handle
this fully behind the scene - as for now this works purely for
lvchange/vgchange activation.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Dedicated functions are now used to process each piece of data obtained,
so the refactoring in this file gives us one for the vgsummary and one
for the metadata header. This new type of function takes two parameters
(for now), the obtained data plus a single struct (that must not
reference any data on the stack) that wraps up the entire context needed
to process it.
Sleep a bit before checking /sys/block dir so the kernel has a moment to
actually put scsi debug device in it...
Some quite old kernels are in troubles with this plain searching grep
without sleep (namely 2.6.32)
modprobe scsi_debug
<sleep .1>
grep -H scsi_debug /sys/block/*/device/model
modprobe -r scsi_debug
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
The creation of wrapped around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
Add three new raid tests with io load and table
reloads during reshape for target 1.13.2.
Add a raid0 to raid10 conversion test.
Also add more signals to trap in lvconvert-raid-reshape-load.sh.
If there is sufficient space in the metadata area, align the next
metadata to a disk offset that is a multiple of 4096 bytes and
don't write it circularly. If it doesn't all fit at the end
of the metadata area, go back to the start and write it all there
contiguously.
If there is insufficient space to use the new stricter rules, revert to
the original behaviour, aligning on 512-byte boundaries wrapping around
the circular buffer as required.
Even after writing some metadata encountered problems, some commands
continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
lvmdbusd was started, but the process was not recognized by pgrep.
- configure does not make the script executable - set the flag
explicitly when running make check,
- process name changed to lvmdbusd. The previous python3 value
originated from the use of /usr/bin/env.
Use new ALIGN_ABSOLUTE macro when calculating the start location
of new metadata and adjust the end of buffer detection so that
there is no longer an imposed gap between old and new metadata.
Currently both start and offset should always be divisible by alignment,
so this should have no effect, but a later patch will increase alignment
so these variables can no longer be optimised out.
lvmdbusd executable script must use python3 interpreter detected by
configure script, as site-packages directory used for library is only
used by that interpreter.
Although it doesn't look like it can be a measurable problem
and costs some time to flip priorities outside of activation window.
So just like with memory locking preserve priority until call
memlock_unlock() appears.
(addition to commit c086dfadc3).
Expand out the metadata wrapping calculations to prepare
to support a larger alignment.
The current alignment is 512 bytes so
(mdac_area_start + rlocn->offset) % alignment is zero.
When doing resume, directly pass location where new updated info
needs to be stored.
_resume_node() ensures the info is ONLY updated when the function
is successful and never changes it on error path.
Update the logic towards more explicit logic.
Preload tree normally does not want to resume, only
in certain cases of extension or new loaded nodes can be
resumed. So introduce new internal variable delay_resume_if_extended
controlable by target.
Patch itself is not changing current existing behaviour,
and rather documents existing problem in more readable way.
lvm2 needs to introduce explicit mechanism how to support more
fain-grained (and safe) logic to i.e. resize thin-pool which
can be sitting on cached raid volume.
Variable props.send_messages has 3 states and was not used properly
here. Activation in this moment does not need to verify thin-pool status
as that has been already checked on preload.
So only if there are some real messages (value 2) call function
for sending them.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
In case of failed legs, raid replaces those with
e.g. "vg-lv_rimage_0-missing_0_0" mapped to an error target.
Those errouneously remain on deactivation.
Fix by removing them on deactivation/removal of the RaidLV.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
Code already has dereferenced UUID before this point,
and its already given we require name & uuid when ading new node
(although uuid could be empty string).
Just like everywhere else - use single if() for major:minor setup
(it basically can't fail as of today anyway)
Always leave funtion with correctly set pointers even on error path.
Replicator never really existed in upstream kernel and its support
got deprecated.
Also its support never got finished so no code is supposed to be
using it anyway.
Libdm symbols are remaining, just the implementation will always
return failure - so any user of:
dm_tree_node_add_replicator_dev_target()
dm_tree_node_add_replicator_target().
will now always recieve error message.
Separate handling of error code from _info_by_dev.
This error can only happeng when we are running out of memory.
In such case there is urgent need to stop any futher proceeding
of command and run to error ASAP.
Avoiding "$(get first_extent_sector "$d")" in the loop
allows the test to succeed in the cluster. Further cluster
analysis needed to get to the core reason.
The lvm2 test suite aims at small test resource footprints
(few PVs, small PV sizes) to run on tmpfs backed loop device.
OTOH, lvconvert-reshape-raid.sh aims to test the maxima of
supported total stripes of 64. This patch adds a prerequisite
conditional to skip tests using more than 14 stripes.
It requires the target version 1.13.1 to avoid deadlocks.
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
Use new 3rd. state of trace_pvmove_deps == 2.
In this state we know, we have already seen the node and can skip futher
testing. Remainging value 1 signals we want to track, and value 0
is for ignoring tracking, but node is still checking in this case.
Reduces large amount of duplicate ioctl queries.
Check also all snapshosts when resume is requested,
the origin volume is already resume, but possibly
some subLV or snapshot LV could be suspended if
we are still in critical_section.
When entering any critical section, lvm2 used to lock process memory
and raised task priority to avoid problem with page swapping and minimize
time of having non-resumed devices in table.
With this patch, memory locking which which is expensive is only used when
entering 'suspending' section as only in this section there is risk
lvm could be suspending a device which later can be needed for paging.
Raised priority is still kept for all section entrances as this is
low-cost operation and may accelerate table resumes - although the real
impact can be still considered later.
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
When large size number (>2^31) is given on command line it could be
misdetected and in certain cases lead to wrongly casted number.
So make sure all cases always do set _MAX number in case the value would
not fit within the supported range instead of getting some random value
within the range.
In most cases this was not a problem to detect, but i.e. stripesize
parameter might have been fooled by certain large numbers.
Rewrite validation of stripes and stripe_size args into more readable
sequential code.
Extend reading of stripes & stripes_size args so it better knows
defaults for types like striped raid.
TODO: this should really be a value obtained for segtype structure and
all the weird conditions and modification of stripes and stripe_size
around lvm2 code should be dropped.
ATM we want to support delayed resume purely in pvmove case.
So have libdm logic internal to recognize difference beween
pvmove and other targets that do use delayed resume.
This fixes problem introduced with commit aa68b898ff
for mirror-on-mirror or snapshot-on-mirror problem.
TODO: likely added new API call and let libdm user select
delayed nodes explicitely.
Use code which detectes handlers in a way, which is more
backward-compatible friendly.
Replace read of 'sysfs' uuid entry with dm ioctl call.
Use /sys/block/dm-X/holders path instead of
new path /sys/dev/block/major:minor/holders.
TODO:
There are few more occurencies of this logic around the code
so some abstract interface should be considered.
In some cases the message could be slightly misleading so use
here rather conditional.
TODO:
In future we may possibly further tune the message in case we are
certain the level of redundancy protection has not been reduced.
When pvmove is finished and does 'suspend/resume' on PVMOVE LV,
on resume path committed metadata are already showing 'standalone'
pvmove LV prepared just for removal.
However code should be able to 'resume' preloaded LV there were
participating in pvmove operation.
Previously this was all done in the 'tools' part of lvm2 code.
So the lvconvert upon pvmove finish had to explicitely call 'resume' on every such LV.
Now 'smarted' activation code is able to deduce and combine all information from
the active dm table and committed metadata so single call resolves
it all in one go.
Internally holders are detected by reading sysfs directory to capture
all needed UUID which are then looked in lvm2 metadata and all such
LVs are automatically collected into dmtree.
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Propagate delayed resume at least for preload case in a simple way.
Currently PVMOVE depends on internal logic where 'mirror' with
corelog is 'possible' PVMOVE. In such case resume of 'created'
node is 'delayed'.
This is mostly an ugly internal hack - but for the moment being when we
add propagation for preload - it does work reasonable.
TODO: provide standard API and avoid this internal 'guessing'.
Only thin-pool with origin_only suspend is allowed to be not suspending anything.
In such case pairing resume will 'decrement' critical section counter.
Just like suspend handles preload for pvmove finish,
in similar way handle suspend of starting pvmove.
In this case the precommited metadata are checked for list of PVMOVEed
LVs and those are suspended in with committed metadata.
When sanlock or dlm lock managers return an error number
that we don't recognize, replace it with a generic -ELMERR
which is defined in the set of special lvmlockd error
numbers. Otherwise, an unknown lock manager error number
could be misinterpreted for something else if it happened
to overlap another set of error numbers (which they have
not thus far.)
These less common errors returned from sanlock should
also cause sanlock to retry the lock acquire:
- i/o timeout occurs during sanlock_acquire().
other i/o on the same disk as the leases can cause
sanlock i/o timeouts.
- low level disk paxos contention between hosts naturally
causes one host to not acquire the lease. There are a
couple special error numbers associated with these cases
that should just be recognized as a normal failure to
acquire the lease.
There is no need to differentiation between clustered VG and normal VG.
As the activation depends on locking type.
Use unconditionally locally exclusive activation for pvmove.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
When activation of LVs fails prior pvmove start, try to deactivate
already activated LVs.
TODO: possibly remember which LVs where already activate and only those
take down - devices which are already in-use will stay active.
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
The persistent filter should not be imported by any command that doesn't
use it so take addtional note of REQUIRES_FULL_LABEL_SCAN (for vgrename)
and introduce IGNORE_PERSISTENT_FILTER for vgscan and pvscan.
If the suspend/resume sequence would leave some device in suspend
for possible later resume, backup cannot be takes (fs holding backups
could be still frozen in critical section())
Move check for presence of raid4 into the right place
so there is no way how to hit activation of any LV
with raid4 on kernel which does not support it.
Commit 763db8aab0 rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Resolves: rhbz1511047
In HA cluster, we have "clvm" resource agent to manage clvmd daemon.
The agent invokes clvmd like: "clvmd -T90 -d0", which always prints
a scaring error message:
"""
local socket: connect failed: No such file or directory
"""
When specifed with "-d" option, clvmd tries to check if an instance
of the clvmd daemon is already running through a testing connection.
The connect() will fail with this ENOENT error in such case, so supress
the error message in such case.
TODO: add missing error reaction code - since ofter log_error, program
is not supposed to continue running (log_error() is for reporting
stopping problems).
Signed-off-by: Eric Ren <zren@suse.com>
Check and prevent starting another snapshot merge before
exiting merging is finished.
TODO: we can possibly implement smarter logic to drop existing
merging and start a new one.
The lvchange-raid[456].sh test checks that mismatches can be detected
properly. It does this by writing garbage to the back half of one of
the legs directly. When performing a "check" or "repair" of mismatches,
MD does a good job going directly to disk and bypassing any buffers that
may prevent it from seeing mismatches. However, in the case of RAID4/5/6
we have the stripe cache to contend with and this is not bypassed. Thus,
mismatches which have /just/ happened to an area that now populates the
stripe cache may be overlooked. This isn't a serious issue, however,
because the stripe cache is short-lived and reasonably small. So, while
there may be a small window of time between the disk changing underneath
the RAID array and when you run a "check"/"repair" - causing a mismatch
to be missed - that would be no worse than if a user had simply run a
"check" a few seconds before the disk changed. IOW, it simply isn't worth
making a fuss over dropping the stripe cache before beginning a "check" or
"repair" (which we actually did attempt to do a while back).
So, to get the test running smoothly, we simply deactivate and reactivate
the LV to force the stripe cache to be dropped and then proceed. We could
just as easily wait a few seconds for the stripe cache to empty also.
When a "recover" is just starting for a RAID LV, it is possible to get
"idle" for the sync action if the status is issued quickly enough. This
is fine, the MD thread just hasn't gotten things going yet. However,
the /need/ for a "recover" should be marked in md->recovery and it would
be simple enough to fix the kernel so this doesn't happen. May eventually
want a separate bug for this, but for now it fits with RHBZ 1507719.
We always preferred and recommended socket activation for our services
so remove the Install section in related .service units which are unused
in this case and keep only the Install section in associated .socket
units.
Signed-off-by: Bastian Blank <waldi@debian.org>
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets just created
'layer_lv' without segment, it needs to be hidden
from vg->lvs during processing of _align_segment_boundary_to_pe_range()
as this function calls lv_validate() and now requires
vg to be consistent. LV is then put back into vg->lvs.
There are two known bugs in the lvconvert-raid-status-validation.sh
test. The first one I consider to be more of an annoyance (1507719).
The second one I consider to be more serious (1507729).
RHBZ 1507719 simply documents the fact that the three RAID status
fields may not always be coherent due to the way they are set and
unset when the MD thread is shutting down and starting up. For
example, the sync ratio may be 100% but the sync action may not
yet have switched to "idle" and the health characters may not yet
all be 'A's (i.e. the devices set to InSync).
RHBZ 1507729 is more serious. The sync ratio can be 100% for a
short period of time after upconverting linear -> RAID1. It is
reset to 0 once the MD sync thread gets to work on it. It does
this because, technically, the array /is/ in-sync if the new
devices are excluded - i.e. the data is 100% available and
consistent. I'm not sure what to do about this problem, but we'd
much rather not have this state that looks exactly like the
end of the process when the sync ratio is 100% because the
"recover" process finished, but the sync action and health
characters haven't been updated yet. Put simply, the problem
is that we can't tell if a sync is starting or finished based
on the status output.
Since 4fa5add6b1 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)
Preload reiserfs module for the case, fs is present/compiled for a
kernel but it's not present in memory.
Size reducition needs --yes confirmation to preceed for reiserfs.
Correct reported message when thin snapshot has been already merged.
So lvm2 is no longer reporting "Mergins of snapshot X will occur..."
(even with swapped names).
When an ignored metadata area gets flagged for use again, make sure the
code doesn't try to parse its old metadata. Firstly by trying to detect
this situation and skipping the read (while still remembering the
position reached in the circular buffer), and secondly by clearing the
invalid live metadata location on disk as a precaution when subsequently
writing out the precommitted metadata.
Problems showed up when a metadata area in one VG got moved to
another VG in ignored state (still holding metadata for the original
VG) and then later got brought into use in the new VG - only the header
should be read in this case, not any of the metadata content.
vgsplit shares the vg_rename code so that must only set the PV_MOVED_VG
flag introduced in commit 486ed10848
("vgmerge: Fix intermediate metadata corruption") on PVs that moved.
Since both lvcreate and lvconvert needs to check for same
type of allowed origin for snapshot - move the code into
a single function.
This way we also fix several inconsitencies where snapshot
has been allowed by mistake either through lvcreate or
lvconvert path.
Generation of man pages is generating lot of barely readable output.
For normal build quietize this a bit.
For original verbose build start to use 'make V=1'
(just like i.e. linux kernel does)
TODO: apply at more places...
Converting from one raid level to another, no changes
of stripes or stripesize can be requested because those
are subject to reshaping. I.e. the process requires to
takeover first and secondly request raid algorithm,
stripe or stripesize changes.
Ignore any related changes display warninngs
and proceed with the takeover.
Without this patch, a takeover requesting
stripesize change causes data corruption!
Just like with other vars support this:
make check_local T=xyz LVM_LOG_FILE_MAX_LINES=10000000
Allows easily to override existing line limit.
Also increase limiting size of logs per command since some of
our commands are becoming very verbose....
Add explaining message, when command was aborted due to the reach
of configure line number count (LVM_LOG_FILE_MAX_LINES)
for logging (used mainly with testing).
Last patch missed to mention, we've improved/fixed generated paths
in units and init.d shell scripts when lvm2 was plainly configured
with just i.e. --prefix.
Note: some distros might have fully specified --sbindir and
--usrsbindir - thus those very not seeing problems in generated paths.
Replace lowercase @sbindir@ with @SBINDIR@ which contains
fully decoded path.
Same with @usrsbindir@ which is also used with clvmd and cmirrord.
Also handle SYSCONFDIR for EnvironmentFile.
Patch fixes generated unit files with strings like:
ExecStart=${exec_prefix}/sbin/lvm
Introduce few more AC_SUBST vars for usage in *.in generation.
In some case we want to replace i.e. $sbindir with full path
instead of current ${exec_prefix}/sbin.
This patch provides:
USRSBINDIR
SBINDIR
DEFAULT_SYS_LOCK_DIR
SYSCONFDIR
At the same time properly use sbindir & usrsbindir with
lvm, fsadm, clvmd from one primary definition.
Do not allow to take snapshot of mirror/raid leg or log or metadata LV.
This was actually never supported, but user was able to create it,
and this put device stack in hardly fixable state (needs manual work).
This prevents such creation to pass.
Also improve validation when recreating snapshot volume type
from origin and COW volume.
Correction to function for extracting vgname out of lvconvert
parameters.
Avoid repeating some checks.
Add code to handle generic options which may provide vgname in its argument
and compare them all so they match to a single vgname (otherwise it's a
error).
Extract default (envvar) vgname only when no position nor optional vgname is
found.
Fixing regression instroduce with patchset started with commit:
1e2420bca8 (2.02.169)
split resize_crypt function in two.
a) Detect proper dm-crypt device type and count new --size
value for cryptsetup resize command.
b) Perform the resize
the bug in LUKS grow/shrink decision in fsadm was
masked due to fact that default LVM2 extent size
was larger than LUKS1 default data offset for dm-crypt
mapping. The new test address this bug.
lvcreate supports a 'conversion' when caching LV.
This normally worked fine, however in case passed LV was
thin-pool's data LV with suffix _tdata we have failed to early.
As the easiest fix looks dropping validation of name when
caching type is select - such name check will happen later
once the VG is opened again and properly detect if the LV
with protected name already exists and can be converted,
or will be rejected as ambigiuous operation requiring user
to specify --type cache | --type cache-pool.
Replaced the confusing device error message "not found (or ignored by
filtering)" by either "not found" or "excluded by a filter".
(Later we should be able to say which filter.)
Left the the liblvm code paths alone.
Fixes the following case with 3PVs and 3 legs "mirror" LV:
# lvcreate -l100%FREE --type mirror -m2 vg3
Insufficient free space for log allocation for logical volume .
Unable to allocate extents for mirror log.
Related: rhbz1269533
Activation lock has a primary purpose to serialize locking of individual
LV in case there is no other protecting mechanism for parallel
execution.
However in the case an activated LV is composed from several other LVs,
noone should be able to manipulate with those LVs as well.
This patch add a very 'naive' global VG activation locking in this case.
In the future we may introduce smarter function detecting minimal closed
graph components if this will appear as bottleneck
Patch checks if the VG Write lock is held - in this case we do not
need any more locking - command has exclusive access to VG.
In case we have clustered VG and we are activating an LV which does not
need other LVs - we also do not need any more locks.
In all other cases take respective lock - for single LV - use lvid,
for complex LVs use vgname.
Creating striped RaidLVs with lv size not divisible by region size
caused the region size to be adjusted:
# lvcreate --type raid5 -n region_check.32.00m_3 -i 3 -L 1g --nosync -R 32.00m raid_sanity
Using default stripesize 64.00 KiB.
Rounding size 1.00 GiB (256 extents) up to stripe boundary size <1.01 GiB(258 extents).
WARNING: New raid5 won't be synchronised. Don't read what you didn't write!
Using reduced mirror region size of 8.00 MiB
Logical volume region_check.32.00m_3 created.
Fix by not imposing "mirror" constraints on "raid".
Resolves: rhbz1404007
vgmerge suffers from a similar problem to the one fixed in commit
8146548d25 ("vgsplit: Fix intermediate
metadata corruption.")
When merging, splitting or renaming VGs, use a new PV status flag
PV_MOVED_VG to mark the PVs that hold metadata with the old VG name and
use this to provide PV-level granularity instead of incorrectly assuming
all PVs in the VG are the same.
Add these for dmeventd systemd unit (dm-event.service):
Before: shutdown.target
Conflicts: shutdown.target
This will cause the dmeventd to be properly stopped at shutdown (after
all the dmeventd clients unregistered their devices from monitoring)
with dm-event.service's stop action (there's no direct action defined
for the "stop" so systemd sends SIGTERM instead).
Before, we let dmeventd to get killed only as part of the very last
SIGTERM/SIGKILL for all the remaining processes late in the shutdown
sequence so we may have missed some logs if dmeventd encountered an
error during its shutdown (logging facilities are already off at this
late time in shutdown sequence).
Ref: https://www.redhat.com/archives/lvm-devel/2017-October/msg00000.html
When dmeventd receives SIGTERM/INT/HUP/QUIT it validates if exit is possible.
If there was any device still monitored, such exit request used to
be ignored/refused. This 'usually' worked reasonably well, however if there
is very short time period between last device is unmonitored and signal
reception - there was possibility such EXIT was ignored, as dmeventd has
not yet got into idle state even commands like 'vgchange -an' has already
finished.
This patch changes logic towards scheduling EXIT to the nearest
point when there is no monitored device.
EXIT is never forgotten.
NOTE: if there is only a single monitored device and someone sends
SIGTERM and later someone uses i.e. 'lvchange --refresh' after
unmonitoring dmeventd will exit and new instance needs to be
started.
If you send a SIGUSR1 (10) to the daemon it will dump all the
threads current stacks to stdout. This will be useful when the
daemon is apparently hung and not processing requests.
eg.
$ sudo kill -10 <daemon pid>
Make sure that any and all code that executes in the main thread is
wrapped with a try/except block to ensure that at the very least
we log when things are going wrong.
Changing the VG of a PV uses the same on-disk mechanism as vgrename.
This relies on recognising both the old and new VG names. Prior to this
patch the vgsplit code incorrectly provided the new VG name twice
instead of the old and new ones. This lead the low-level mechanism not
to recognise the device as already belonging to a VG and so paying no
attention to the location of its existing metadata, sometimes partly
overwriting it and then later trying to read the corrupt metadata and
issuing a checksum error.
In some cases we are seeing where there are no VGs, but the data returned from
lvm shows that the PVs have the following for the VG:
"vg_name":"[unknown]", "vg_uuid":""
The code was only checking for the exitence of the VG name and we called into
the function get_object_path_by_uuid_lvm_id which requires both the VG name and
the LV name to exist (asserts this) which results in the following stack trace:
Traceback (most recent call last):
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 563, in runner
obj._run()
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 584, in _run
self.rc = self.f(*self.args)
File "/home/tasleson/lvm2/daemons/lvmdbusd/fetch.py", line 26, in
_main_thread_load
cache_refresh=False)[1]
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/home/tasleson/lvm2/daemons/lvmdbusd/loader.py", line 37, in common
objects = retrieve(search_keys, cache_refresh=False)
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 40, in
pvs_state_retrieve
p["pv_attr"], p["pv_tags"], p["vg_name"], p["vg_uuid"]))
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 84, in __init__
vg_uuid, vg_name, vg_obj_path_generate)
File "/home/tasleson/lvm2/daemons/lvmdbusd/objectmanager.py", line 318,
in get_object_path_by_uuid_lvm_id
assert uuid
AssertionError
When executing in the main thread, if we encounter an exception we
will bypass the notify_all call on the condition and the calling thread
never wakes up.
@staticmethod
def runner(obj):
# noinspection PyProtectedMember
Exception thrown here
----> obj._run()
So the following code doesn't run, which causes calling thread to hang
with obj.cond:
obj.function_complete = True
obj.cond.notify_all()
Additionally for some unknown reason the stderr is lost.
Best guess is it's something to do with scheduling a python function
into the GLib.idle_add. That made finding issue quite difficult.
There's nothing special about /boot other than it's used during boot.
But when blkdeactivate is called either on all devices or including a
device where the /boot is on top, we should also include this mount
point when doing unmount before deactivation of supported devices.
The new blkdeactivate -r|mdraidoption wait causes blkdeactivate to wait
for any resync/recovery/reshape that is currently in progress before
deactivating the device.
If this option is used, blkdeactivate calls mdadm -W|--wait before
mdadm -S|--stop.
Revert dc50f2f4a0.
We're canonicalizing/escaping the names here and we're reusing the
variable name so the code doesn't need to use extra variables and
further assignments that may confuse us. Let's keep the code simple.
The
local name=(...$name)
is not the same as
local name
name=(...$name)
(I know various code-checking tools fuss about this and recommend
the 2nd way, but let's ignore those tools' nitpicking here please.)
There was a typo in blkdeactivate --dmoption/--lvmoption/mpathoption,
it had missing "s" at the end and it was not recognized properly, only
short names for the options (-d/-l/-m).
When certain cmd def RULE's fail, the error messages can
sometimes be confusing. This expands the error messages
to help clarify why the rule failed, especially in cases
where options are used incorrectly.
In a shared VG, only allow pvmove with a named LV,
so that only PE's used by the LV will be moved.
The LV is then activated exclusively, ensuring that
the PE's being moved are not used from another host.
Previously, pvmove was mistakenly allowed on a full PV.
This won't work when LVs using that PV are active on
other hosts.
Avoid starting test, when test dir has less then 50M of free space.
Better to crash early before letting die machine on weird crash
in OOM cases...
Also show free disk space when test starts
Enable handling of --poolmetadataspare so if user can prevent
creation of _pmspare volume during --repair operation (just
like during actual lvcreate or lvconvert) for pool volumes.
The following commands now pass the device list through a
--select|-S filter before processing:
suspend resume clear wipe_table remove deps status table
In a shared VG, lvconvert must be used to create thin pools
and cache pools, not the lvcreate variants of those commands.
Deny these cases early in lvcreate using the new command defs.
Denying these cases deeper in the code was missing some
cleanup of the partially completed command.
Revert the lvmlockd.c changes from:
commit 0bf836aa14
"tidy: prefer not using else after return"
The commit introduced at least one regression, which broke
lvcreate of a thin pool in a shared VG.
ATM we have several instances of daemonizing code.
Each has its 'special' logic so not completely easy
to unify them all into a single routine.
Start to unify them and use one strategy for rediricting
all input/outpus to /dev/null - use 'dup2' function for this
and open /dev/null before fork to make sure it's available.
When file-locking mode failed on locking, such description was leaked
(typically not an issue since command usually exists afterwards).
So shirt close() at the end of function and use it in all error paths.
Also make sure, when interrrupt is detected, it's really not holding
lock and returns 0.
lvmcache_foreach_mda() can fail for numerous reasons
and failing error code cannot be ignored (out-of-memory...)
TODO: might need more error handling tunning.
gcc warns here about storring 69 bytes in 64 byte array (losing
potentially 4 bytes from 'ls->name').
lvmlockd-core.c:2657:36: warning: ‘%s’ directive output may be truncated writing up to 64 bytes into a region of size 60 [-Wformat-truncation=]
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~
lvmlockd-core.c:2657:2: note: ‘snprintf’ output between 5 and 69 bytes into a destination of size 64
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Replaced with slightly better code - but it still misses error path what
to do if the name would be truncated... - so added FIXME.
Also using all bytes for snprintf() buffer size
(as the size is with \0 included)
After the internal lvmlock LV (holding sanlock leases) is
extended to hold more leases, it needs to be zeroed.
sanlock expects to see either zeroed blocks or blocks
initialized with leases.
currently, lvcreate for sanlock find the free lock offset
from the beginning of the lvmlock every time.
after created thousands of lvs, it will issue thousands of read
ios for lvcreate to find free lock offset.
remeber the last free lock offset will greatly reduce the impact
Signed-off-by: Zhang Huan <zhanghuan@huayun.com>
Fix code checking that the 2nd mda which is at the end of disk really
fits the available free space and avoid any DA and MDA interleaving when
we already have DA preallocated. This mainly applies when we're restoring
a PV from VG backup using pvcreate --restorefile where we may already have
some DA preallocated - this means the PV was in a VG before with already
allocated space from it (the LVs were created). Hence we need to avoid
stepping into DA - the MDA can never ever be inside in such case!
The code responsible for this calculation was already in
_text_pv_add_metadata_area fn, but it had a bug in the calculation where
we subtracted one more sector by mistake and then the code could still
incorrectly allocate the MDA inside existing DA. The patch also renames
the variable in the code so it doesn't confuse us in future.
Also, if the 2nd mda doesn't fit, don't silently continue with just 1
MDA (at the start of the disk). If 2nd mda was requested and we can't
create that due to unavailable space, error out correctly (the patch
also adds a test to shell/pvcreate-operation.sh for this case).
If the PV was originally created with a larger-than-default
metadata area the restored one wasn't and might not even be
large enough to hold the metadata!
Previously the cache remembered an existing bootloaderarea and
reinstated it (without even checking for overlap) when asked to
write out the PV. pvcreate could write out an incorrect layout.
Add the new concise format to dmsetup create, either as a single
command-line parameter or from stdin.
Based on patches submitted by
Enric Balletbo i Serra <enric.balletbo@collabora.com>.
Drop 'DMEVENT' make variable and use BUILD_DMEVENTD like with other daemons.
NOTE: by default we do not build dmeventd - maybe time to change
as lot of build targets basically do need dmeventd...
Switch to use SYSTEMD_LIBS and avoid linking systemd library with
every linked tool when dbus notification are enabled.
(TODO add missing testing for lib presence)
Check first if we need to even link -lrt - since clock functions
are normally emebeded with recent glibc (>=2.17)
Use standard RT_LIBS name.
Avoid duplicate test for realtime clock with lvmlockd
Show better error message when realtime clock support is missing or
disabled.
Link RT_LIBS explicitely with lvmlockd and lvmetad.
Avoid adding -g more then once for debug builds.
Avoid enabling DEBUG_MEM when we build multithreaded tools.
Link executables with -fPIE -pie and --export-dynamic LDFLAGS
Introduce PROGS_FLAGS to add option to pass flags for external libs.
Link lvm2 internally library only when really used.
Link DAEMON_LIBS with daemons.
Pass VALGRIND_CFLAGS internally
Set shell failure mode on couple places.
lvm2 warned about zeroing and too big chunksize (>=512KiB), but
only during lvconvert, so lvcreate was creating thin-pools
without any warning about possible slowness of thin provisioning
because of zeroing.
Create a new table output format that concisely shows multiple devices
on one line.
dmsetup table --concise [device...]
<dev_name>,<uuid>,<flags>[,<table>]*[;<dev_name>,<uuid>,<flags>[,<table>]*]*
Table lines are separated by commas.
Devices are separated by semi-colons.
Flags is currently 'ro' or 'rw' (and might be extended in a
yet-to-be-defined way in future).
Any comma, semi-colon or backslash within a field is quoted by a
preceding backslash.
The format can later be supplied as input to dmsetup or even to the
booting kernel as an alternative way to set up devices.
Based on patches submitted by
Enric Balletbo i Serra <enric.balletbo@collabora.com>.
Add an independent command definition for "vgchange --locktype",
and split the implementation out of the set of common metadata
changes. It is unlike normal metadata changes, and can only
be run by itself. (Changing the lock type is similar in
principle to changing the VG name or the VG system ID; it
effects the ability of any host to see or access the VG.)
At some point this command lost the ability to forcibly change
the lock type of a shared VG to "none" (making it a local VG).
This can be necessary to repair shared VGs (e.g. recovery steps
that occur in vg_read are disabled for shared VGs because
they are not locked properly, or recovering sanlock locks
when the PV holding them is lost.)
"vgchange --locktype none --lockopt force VG" is used as the
method of forcing the shared VG to become local so that it
can be repaired.
Since _deactivate_and_remove_lvs() is used in more then one place,
move the needed udev synchronization into this function so other
users automatically get correct fs state before next dm manipulation.
Assumption here is that this udev synchronization 'delay' may also
prevent to 'early' table reloads which might cause kernel problems
for md-core - but we may need more generic time-limited reload
frequency for raid devices.
Note: on udev-less system there will be almost no delay.
Commit 8a912d6dbc missed the wrong logic,
we use 2 vars 'dev' & 'mddev' and their usage can't be mixed.
So correctly separate them so mddev keeps name of MD device.
During test do a more close selection of visible devices.
If some test leaks a device with LVMTEST prefix, next
test should not be influnced (or parallel running one).
If the test is running in non-/dev dir - it's already protected
by full path with $PREFIX in it - however it test is running
in real /dev dir - there was no such protection and test
were confused when they have seen such leaked device.
Commit 9b4b5d449e started to accept
rather way to wide set of strings we do not want to take for size.
i.e. lvresize -t vg0/lvol0 -LNaNM
Restore check for 'digit || locales defined 'dot' | '.'
(as we tend to take '.' even if locales uses ',')
Explictely detect duplicate sing symbols and leave the rest of
double number validation on 'strtod()' function. This way
we can also accept size like:
lvcreate -L.1M
We already accept -L0.1M - but it's common to accept numbers
starting with leading '.' - just as 'strtod()' accepts it).
API for strtod() or strtoul() needs reset of errno, before it's being
called. So add missing resets in missing places and some also some
errno validation for out-of-range numbers.
Since we are reading size as (double) we can get way bigger
number then just plain int64. So to make this check actually
more valid and usable do a maxsize compare in 'double'.
Avoid reading already released memory and do a continue directly.
Invalid read of size 1
at 0x1201B0: main_loop (clvmd.c:931)
by 0x11F640: main (clvmd.c:666)
Address 0x72ddef0 is 32 bytes inside a block of size 224 free'd
at 0x4C30D18: free (vg_replace_malloc.c:530)
by 0x54D6FD1: dm_free_wrapper (dbg_malloc.c:357)
by 0x122E6E: process_work_item (clvmd.c:2034)
by 0x123003: lvm_thread_fn (clvmd.c:2085)
by 0x590A3A8: start_thread (pthread_create.c:465)
by 0x5C3C7FE: clone (in /usr/lib64/libc-2.25.90.so)
Block was alloc'd at
at 0x4C2FB6B: malloc (vg_replace_malloc.c:299)
by 0x54D6EF1: dm_malloc_aux (dbg_malloc.c:286)
by 0x54D6F1C: dm_zalloc_aux (dbg_malloc.c:291)
by 0x54D6F96: dm_zalloc_wrapper (dbg_malloc.c:345)
by 0x11F89C: local_rendezvous_callback (clvmd.c:731)
by 0x1203D2: main_loop (clvmd.c:964)
by 0x11F640: main (clvmd.c:666)
Initialize mutex upfront any debugging and fix this report:
Mutex reinitialization: mutex 0x485d20, recursion count 0, owner 1.
at 0x4C38480: pthread_mutex_init_intercept (drd_pthread_intercepts.c:821)
by 0x4C38480: pthread_mutex_init (drd_pthread_intercepts.c:830)
by 0x11F359: main (clvmd.c:562)
mutex 0x485d20 was first observed at:
at 0x4C38F63: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:885)
by 0x4C38F63: pthread_mutex_lock (drd_pthread_intercepts.c:898)
by 0x11E920: debuglog (clvmd.c:254)
by 0x11F1D8: main (clvmd.c:527)
Switch from warn to log_error since this generated
failing return code for command so printing log_error()
is mandatory.
Happens with i.e. pvscan --cache meets crashing lvmetad.
Commit 34504855a7 introduced
flag LV_RESHAPE_DATA_OFFSET and used it to avoid incompatible
activation on older runtime.
Enhance vg_validate() raid checking functions with checks for it.
Patch 72a58ce4b0 was wronly placing
double quotes around this variable which we want to pass expanded,
as it's just set of 'space' device args ATM.
TODO consider using array[@] to make this cleaner.
Add shellcheck directive to skip warning here
Changes:
- BASH_SOURCE index was one off.
- The first line of stacktrace was pure confusion displaying executed
script together with innermost line number (which was either 125 when
STACKTRACE or 229 when skip was called.)
- We can safely ignore innermost call, as stack trace is always produced
by stacktrace function.
- It is safer to test for array length, instead of testing FUNCNAME is
main - if main function were introduced.
- Bashishm is safe to use as this function as a whole is relying on bash.
In order to reject out of place reshaping with segment data_offset
field on old runtime, add a respective segment type incompatibility
flag causing "+RESHAPE_DATA_OFFSET" to be suffixed to the segment
type name.
When reshape space is allocated anew, an update and reload is needed to
promote the new size to the cluster node with the exclusively active RaidLV
or reloading the RaidLV will fail with a size related error. Additionally,
store "data_offset <sectors>" with the RaidLV in the lvm2 metadata so that
it can be retrieved on cluster nodes.
Process allocation of reshape space on a 2-legged raid4/5 (interim layout
to convert from/to linear via raid1) properly in the cluster.
Resolves: rhbz1461562
Resolves: rhbz1448116
Use 1 logic for 2 loops tearing down left device.
First loops tries to remove all closed devices with 'normal' remove.
Second loop tries to replace those left devices with 'error' target.
We can't really sleep that much in teardown as it slows test too much.
So do a nested loop (similar to 'dmsetup remove_all') and keep
removing devices with open count == 0 as long as it works.
When we want to squash as much device as possible,
it's better to give it some delay, so devices have
some time to release it's resouces for next removal.
Also drop surrounding cookie processing and let each
dmsetup call run on its own.
Add missing get_devs.
When $7 is not given use empty string.
See if we can live with less RAM disk for PVs.
Drop limitation on single core as presence 1.12 should address this.
The checking order here has happend after TESTDIR was removed
resulting in weird further error on trap path.
Properly check for unexpected dmeventd before removing TESTDIR
since 'trap' codepath is still using it.
Also try to kill this unexpected dmeventd so testing is
not skipping all next dmeventd tests.
(Downside would be - if user would be accidentally starting
dmeventd by some regular system admin work - such dmeventd
might be killd if it's unused. It can't kill dmeventd in-use.
Fix mirror_images_on() to actually report something useful (thought
it might be tuned later).
So for now the function got through all '_mimages_' and compares
where the order of them is matching given list of devices.
Possible misspelling: FAILED_MIXED_STR may not be assigned, but FAIL_MIXED_STR is.
Possible misspelling: FAILED_MULTI_STR may not be assigned, but FAIL_MULTI_STR is.
Possible misspelling: FAILED_BLACK_STR may not be assigned, but FAIL_BLACK_STR is.
The blkid we call in 13-dm-disk.rules also returns identifiers for
partitions based on which the /dev/disk/by-part{uuid,label} and
gpt-auto-root symlinks should be created in the same manner as we
already create symlinks for filesystem labels and uuids.
This is because we handle blkid calls and symlink creation under
/dev/disk ourselves in our 13-dm-disk.rules for device-mapper devices
for us to have more control over this process.
See also https://lists.freedesktop.org/archives/systemd-devel/2017-July/039220.html
and original report http://tracker.ceph.com/issues/19489 for
the exact case where these symlinks were missing.
Fix the error messages when an unrecognized command is
run from a script. We shouldn't attempt to parse options
for an unrecognized command name, which causes misleading
errors about bad options, but rather exit right when we
know the command name is not valid. Also don't complain
about exiting without an error message when running a
script if no command didn't exist.
1. dm_uuid is 68 byte length, but buf is 64 which
will cause miss match uuid from lv lock manager
2. no lv lock_type path in dm config, use lock_args instead
Signed-off-by: Zhang Huan <zhanghuan@chinac.com>
If the activation step in lvcreate fails (e.g. the specified
minor number is already used), then the lvcreate is reverted,
but the LV lock in lvmlockd was not being unlocked or properly
freed.
Some lvconvert commands can be used directly on the data sublv:
lvconvert ... vg/pool_tdata
The correct LV lock to use in lvmlockd is the one on the pool LV.
Centralise editing of the client list into _add_client() and
_del_client(). Introduce _local_client_count to track the size of the
list for debugging purposes. Simplify and standardise the various ways
the list gets walked.
While processing one element of the list in main_loop(),
cleanup_zombie() may be called and remove a different element, so make
sure main_loop() refreshes its list state on return. Prior to this
patch, the list edits for clients disappearing could race against the
list edits for new clients connecting and corrupt the list and cause a
variety of segfaults.
An easy way to trigger such failures was by repeatedly running shell
commands such as:
lvs &; lvs &; lvs &;...;killall -9 lvs; lvs &; lvs &;...
Situations that occasionally lead to the failures can be spotted by
looking for 'EOF' with 'inprogress=1' in the clvmd debug logs.
Correctly skip the test when lvmdbusd is found already running.
For pgrep usage we need to add '-f -l' options to get python3 name
printed.
Remove no longer used 'pids' local var.
With commit 41c10034aa we actually
do require LV to be used with _vg_write_lv_suspend_commit_backup().
So write a proper separte single wrapper for write && commit && backup.
lvm_run needs to place NULL as the last element into argv[].
Otherwise we get:
Conditional jump or move depends on uninitialised value(s)
_command_required_pos_matches (lvmcmdline.c:1443)
_find_command (lvmcmdline.c:1610)
lvm_run_command (lvmcmdline.c:2770)
lvm2_run (lvmcmdlib.c:91)
Dmeventd reuses 'dm_task' struct for some STATUS operation, but due to
missing reinitization of dm_task target list, it has caused misprocesing
of recieved events as the parsed target has been simply added to the
list of existing status and cause multiple actions being called for
single event.
Since we discovered status reporting from 'md' goes from large set
of weird states we can't just decided based on this word.
So let it pass for rebuild and idle as well
and check for health devices afterwards.
Add function to adjust printing of percent values in better way.
Rounding here is going along following rules:
0% & 100% are always clearly reported with .0 decimal points.
Values slightly above 0% we make sure a nearest bigger
non zero value with given precission is printed
(i.e. 0.01 for %.2f will be shown)
For values closely approaching 100% we again detect and adjust value
that is less then 100 when printed.
(i.e. 99.99 for %.2f will be shown).
For other values we are leaving them with standard rounding mechanism
since we care mainly about corner values 0 & 100 which need to be
printed precisely.
When we want to report primary leg failure, check for intial 'a',
since otherwice 'Aa idle' is normally visible.
Also reset array of bit flags marking dead devices, once
plugin detects raid is in sync.
Functionality of ignore suspend devices is already granted by:
lvm2_disable_dmeventd_monitoring() -> init_run_by_dmeventd() ->
init_ignore_suspended_devices().
In fact plugins should never use --config because it has
some unpleasant technical issues.
When raid leg rimage device is marked as 'D'ead by mdcore,
lvm2 was not able to replace such device with allocate policy,
as device has not appared as missing.
Add detection of transiently failing devices.
Basically reverting commit 58a9f88b8c.
We can use origin_only in case we are snapshot's origin,
as we do support this stack.
So when we are 'uncaching' origin+snaps - we do need to reload only
origin and we do not need to play with snaps.
Handle change of 'region size' better and follow also standard rule
if the command can't success (i.e. size is already same) we return
error for all such cases.
Also log_pring more info about adjusted value (just like we
do for rounding)
Also avoid keep pointers on 'display_*' values - they are in
ringbuffer for immediate use - not to be kept across multiple calls
(as they could be already overwritten by later calls) - so dropped
seg_region_size_str
'lvdisplay -m' tried to go through NULL policy settings,
when such policy was not defined for CachedLV.
Patch is fixing display of cache-pool without defined settings,
as this is now a valid pool and we mostly want users to define
these settings when actually really caching a LV.
Since cache LV can be a stacked device, there is no real reason
trying to use slight optimised tree for origin_only cache reload
(it could be even wrongly implemented in this case).
We can easily go with stardard tree load here.
When user runs command like 'lvconvert --splitcache' the operation
might be actually either slow or not making any progress in kernel,
so lets give user a chance to abort such operation.
When user press 'Ctrl+C' device table is restored to pre-flushing state.
Remove explicit activation of SubLVs and let lv_update_and_reload()
perform the proper (pre-)loading sequencing of tables.
This avoids related callback functions which are removed.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
When lock-holding LV differs from actually request locked LV,
we drop origin_only flag as it has no use - it'd be applied
on completely different LV.
Example of problem:
Raid is thin-pool _tdata LV.
Raid run origin_only locking on stacked device.
As lock holder is discovered thinLV.
Whole origin_only operation is then applied only on thinLV
changing the meaning of whole operation.
NOTE: this patch does not change anything for LV that are
already top-level lock holding LVs (i.e. thinLVs, snahoshots/origins).
Disable until we have a proper fix for reshape space allocation,
switching it to begin/end of rimages and activation in the cluster.
Related: rhbz1448116
Related: rhbz1461526
Related: rhbz1448123
Commit 1fe4f80e45 in current version
introduced regression for a terminal user, as he could not enter 'n'
as answer. Add missing break for this case (No whats_new).
New tests to add checking for '100%' in-sync at start of "recover"
process (it shouldn't happen, but I've seen it before). Also,
check status over the whole cycle of various sync processes ("resync"
and "recover").
Enhance reporting code, so it does not need to do 'extra' ioctl to
get 'status' of normal raid and provide percentage directly.
When we have 'merging' snapshot into raid origin, we still need to get
this secondary number with extra status call - however, since 'raid'
is always a single segment LV - we may skip 'copy_percent' call as
we directly know the percent and also with better precision.
NOTE: for mirror we still base reported number on the percetage of
transferred extents which might get quite imprecisse if big size
of extent is used while volume itself is smaller as reporting jump
steps are much bigger the actual reported number provides.
2nd.NOTE: raid lvs line report already requires quite a few extra status
calls for the same device - but fix will be need slight code improval.
Current existing kernels reports status sometimes in weird form.
Instead of showing what is the exact progress, we need to estimate
this in-sync state from several surrounding states.
Main reason here is to never report 100% sync state for a raid device
which will be undergoing i.e. recovery.
Relative to last comit ddf2a1d656:
adjust the dm-raid target version to 1.12.0 which shows
mandatory kernel MD deadlock fixes related to reshaping
are presant in the kernel.
Related: rhbz1443999
First test in this file checks whether 'aa' is ever spotted during
a "recover" operation (it should not be). More tests should follow
in this file to look for oddities in status output - especially as
it relates to the sync_ratio, dev_health, and sync_action fields.
For the test clean-up, I was providing too many devices to the first
command - possibly allowing it to allocate in the wrong place. I was
also not providing a device for the second command - virtually ensuring
the test was not performing correctly at times.
This patch ensures that under normal conditions (i.e. not during repair
operations) that users are prevented from removing devices that would
cause data loss.
When a RAID1 is undergoing its initial sync, it is ok to remove all but
one of the images because they have all existed since creation and
contain all the data written since the array was created. OTOH, if the
RAID1 was created as a result of an up-convert from linear, it is very
important not to let the user remove the primary image (the source of
all the data). They should be allowed to remove any devices they want
and as many as they want as long as one original (primary) device is left
during a "recover" (aka up-convert).
This fixes bug 1461187 and includes the necessary regression tests.
Add the checks necessary to distiguish the state of a RAID when the primary
source for syncing fails during the "recover" process.
It has been possible to hit this condition before (like when converting from
2-way RAID1 to 3-way and having the first two devices die during the "recover"
process). However, this condition is now more likely since we treat linear ->
RAID1 conversions as "recover" now - so it is especially important we cleanly
handle this condition.
Previously, we were treating non-RAID to RAID up-converts as a "resync"
operation. (The most common example being 'linear -> RAID1'.) RAID to
RAID up-converts or rebuilds of specific RAID images are properly treated
as a "recover" operation.
Since we were treating some up-convert operations as "resync", it was
possible to have scenarios where data corruption or data loss were
possibilities if the RAID hadn't been able to sync completely before a
loss of the primary source devices. In order to ensure that the user took
the proper precautions in such scenarios, we required a '--force' option
to be present. Unfortuneately, the force option was rendered useless
because there was no way to distiguish the failure state of a potentially
destructive repair from a nominal one - making the '--force' option a
requirement for any RAID1 repair!
We now treat non-RAID to RAID up-converts properly as "recover" operations.
This eliminates the scenarios that can potentially cause data loss or
data corruption; and this eliminates the need for the '--force' requirement.
This patch removes the requirement to specify '--force' for RAID repairs.
Two of the sync actions performed by the kernel (aka MD runtime) are
"resync" and "recover". The "resync" refers to when an entirely new array
is going through the process of initializing (or resynchronizing after an
unexpected shutdown). The "recover" is the process of initializing a new
member device to the array. So, a brand new array with all new devices
will undergo "resync". An array with replaced or added sub-LVs will undergo
"recover".
These two states are treated very differently when failures happen. If any
device is lost or replaced while "resync", there are no worries. This is
because any writes created from the inception of the array have occurred to
all the devices and can be safely recovered. Even though non-initialized
portions will still be resync'ed with uninitialized data, it is ok. However,
if a pre-existing device is lost (aka, the original linear device in a
linear -> raid1 convert) during a "recover", data loss can be the result.
Thus, writes are errored by the kernel and recovery is halted. The failed
device must be restored or removed. This is the correct behavior.
Unfortunately, we were treating an up-convert from linear as a "resync"
when we should have been treating it as a "recover". This patch
removes the special case for linear upconvert. It allows each new image
sub-LV to be marked with a rebuild flag and treats the array as 'in-sync'.
This has the correct effect of causing the upconvert to be treated as a
"recover" rather than a "resync". There is no need to flag these two states
differently in LVM metadata, because they are already considered differently
by the kernel RAID metadata. (Any activation/deactivation will properly
resume the "recover" process and not a "resync" process.)
We make this behavior change based on the presense of dm-raid target
version 1.9.0+.
On conversion from raid10 to raid0 (takeover), all rmeta
devices and the rimage devices of mirrored stripes are
detached from the raid10 LV. The remaining rimage areas
are being shifted down into the slots of the detached
ones hence requiring renames to show proper _N suffix
sequences (e.g. 0,1,2,3 instead of 0,2,4,6). Only the
top-level raid10 LV has a cluster lock, not the detached
SubLVs thus their deactivation is impossible and e.g the
rename from *_rimage_6 to *_rimage_3 will fail. Fix by
activating exclusively before deactivating and removing.
Resolves: rhbz1448123
The file block count stored in the filemap_monitor was lazily
initialised at the time of the first check. This causes problems
in the case that the file has been truncated between this time and
the time the daemon started: the initial block count and current
block count match and the daemon fails to detect a change.
Separate the setting of the block count from the check and make a
call to update the value at the start of _dmfilemapd().
It's not an error to attempt to update regions from an fd that has
been truncated (or otherwise no longer has any allocated extents):
in this case, the call should remove all regions corresponding to
the group, and return an empty region table.
For proper usage of Cache kernel metadata format V2,
new cache_check tool is basically mandatory.
Print warning during configure time about this problem.
Prohibit activation of reshaping RaidLVs on incompatible
lvm2 runtime by storing e.g. 'raid5+RESHAPE' segment type
strings in the lvm2 metadata. Incompatible runtime not
supporting reshaping won't be able to activate those thus
avoiding potential data corruption.
Any new non-reshaping lvconvert command will reset the
segment type string from 'raid5+RESHAPE' to 'raid5'.
See commits
0299a7af1e and
4141409eb0
for segtype flag support.
When old snapshot is merged, lvm2 still can report some data about
merged 'snapshot' - i.e. it occupied space in VG.
This patch fixes regression from commit:
6fd20be629
and resolved RHBZ: 1460161
Avoid reporting 'checking result' as maybe - it should
clearly tell 'yes' or 'no'.
Just shuffle printed message to the place, where we
already know the 'maybe' answer.
So instead of printing 'unclear':
checking whether to enable libblkid detection of signatures when wiping... maybe
checking for BLKID... yes
checking whether to use udev-systemd protocol for jobs in background... maybe
checking for SYSTEMD... yes
show this:
checking for BLKID... yes
checking whether to enable libblkid detection of signatures when wiping... yes
checking for SYSTEMD... yes
checking whether to use udev-systemd protocol for jobs in background... yes
Code path missed validation of lvcreate --cachepool argument.
If the non cache-pool LV was passed in, code has still continued
further work and failed later on internal error. Validate this
condition at right place now.
When a combination of thin-pool chunk size and thin-pool data size
goes beyond addressable limit, such volume creation is directly
prohibited.
Maximum usable thin-pool size is calculated with use of maximal support
metadata size (even when it's created smaller) and given chunk-size.
If the value data size is found to be too big, the command reports
error and operation fails.
Previously thin-pool was created however lots of thin-pool data LV was
not usable and this space in VG has been wasted.
Only support RAID conversions on active LVs.
If we'd accept e.g. upconverting linear -> raid1 on inactive
linear LVs, any LV flags passed to the kernel aren't properly
cleared thus errouneously passing them on every activation.
Add respective check to lv_raid_change_image_count() and
move existing one in lv_raid_convert() for better messages.
If during the process of fetching current lvm state we experience an
exception we fail to call set_result on the queued_requests we were
processing. When this happens those threads block forever which causes
the service to stall infinitely. Only clear the queued_requests after
we have called set_result.
We were not adding background tasks to flight recorder. Add the meta
data to the flight recorder when we start the command and update the meta
data when the command is finished. Locking was added to meta data to
prevent concurrent update and returning string representation as these can
happen in two different threads.
vgreduce previously allowed --all and --removemissing together even though
it only actual did the remove missing. The lvm dbus daemon was passing
--all anytime there was no entries in pv_object_paths. This change supplies
--all if and only if we are not removing missing and the pv_object_paths
is empty.
Vgreduce has and continues to enforce the invalid combination of supplying a
device list when you specify --all or --removemissing so we do not need
to check for that invalid combination explicitly in the lvm dbus service as
it's already covered.
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1455471
Warn about a PV that has the in-use flag set, but appears in
the orphan VG (no VG was found referencing it.)
There are a number of conditions that could lead to this:
. The PV was created with no mdas and is used in a VG with
other PVs (with metadata) that have not yet appeared on
the system. So, no VG metadata is found by lvm which
references the in-use PV with no mdas.
. vgremove could have failed after clearing mdas but
before clearing the in-use flag. In this case, the
in-use flag needs to be manually cleared on the PV.
. The PV may have damanged/unrecognized VG metadata
that lvm could not read.
. The PV may have no mdas, and the PVs with the metadata
may have damaged/unrecognized metadata.
A PV holding VG metadata that lvm can't understand
(e.g. damaged, checksum error, unrecognized flag)
will appear as an in-use orphan, and will be cleared
by this repair code. Disable this repair until the
code can keep track of these problematic PVs, and
distinguish them from actual in-use orphans.
Reject any stripe adding/removing reshape on raid4/5/6/10 because
of related MD kernel deadlock on single core systems until
we get a proper fix in MD.
Related: rhbz1443999
Since lvmetad is using 'MISSING' in status for 'another' purpose,
we need to support ATM also flag get from this place.
Until fixed better - we accept both flags - alhough lvm2 will
only print in flags.
Switch METADATA_FORMAT flag usage to be stored via segtype
instead of 'status' flag which appeared to cause major
incompatibility troubles.
For backward compatiblity segtype flags are still accepted also
via 'status' bits which were used from version 2.02.169 so metadata
saved by this newer lvm2 version should still work nicely, although
new save version will no longer work on this older lvm2 version.
Allow storing LV status bits with segment type name field.
Switching to this since this field has better support for compatibility
with older version of lvm2 - since such unknown segtype will not cause
complete invisiblity of metadata from older lvm2 code - just the
particular LV will become unusable with unknown type of segment.
- Must reread all objects as PVs might be removed.
- Never consider testsuite provided PVs nested, or tearDown fails to
remove any outstanding VGs on them.
When 'fsadm' was running without terminal (i.e. pipe), it's been
automatically working like in '--yes'.
Detect terminal and only accept empty "" input in this mode.
Add more validation to catch mainly renamed devices, where
filesystem utils are not able to handle devices properly,
as they are not addressed by major:minor by rather via some
symbolic path names which can change over time via rename operation.
Since we add more validation to 'detect_mounted' function make sure
we always use it even with 'resize' action, so numerous validations
are not skipped.
Commit 5fe07d3574 failed to set raid5 types
properly on conversions from raid6. It always enforced raid6_ls_6
for types raid6/raid6_zr/raid6_nr/raid6_nc, thus requiring 3 conversions
instead of 2 when asking for raid5_{la,rs,ra,n}.
Related: rhbz1439403
Offer possible interim LV types and display their aliases
(e.g. raid5 and raid5_ls) for all conversions between
striped and any raid LVs in case user requests a type
not suitable to direct conversion.
E.g. running "lvconvert --type raid5 LV" on a striped
LV will replace raid5 aka raid5_ls (rotating parity)
with raid5_n (dedicated parity on last image).
User is asked to repeat the lvconvert command to get to the
requested LV type (raid5 aka raid5_ls in this example)
when such replacement occurs.
Resolves: rhbz1439403
Add an exception to not allowing lvchange to change properties
on hidden LVs. When a thin pool data LV is a cache LV, we
need to allow changing cache properties on the tdata sublv of
the thin pool.
Trap cases where the percentage calculation currently leads to an empty
LV and the message:
Internal error: Unable to create new logical volume with no extents
Additionally convert the calculated number of extents from physical to
logical when creating a mirror using a percentage that is based on
Physical Extents. Otherwise a command like 'lvcreate -m3 -l80%FREE'
can never leave any free space.
This brings the behaviour closer to that of lvresize.
(A further patch is needed to cover all the raid types.)
_check_reappeared_pv() incorrectly clears the MISSING_PV flags of
PVs with unknown devices.
While one caller avoids passing such PVs into the function, the other
doesn't. Move the check inside the function so it's not forgotten.
Without this patch, if the normal VG reading code tries to repair
inconsistent metadata while there is an unknown PV, it incorrectly
considers the missing PVs no longer to be missing and produces
incorrect 'pvs' output omitting the missing PV, for example.
Easy reproducer:
Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
Reinstate the hidden PV pv2 and at the same time hide a different PV
pv3.
Run 'pvs' - incorrect output.
Run 'pvs' again - correct output.
See https://bugzilla.redhat.com/1434054
There are certain situations (not fully understood)
where is_missing_pv() is false, but pv->dev is NULL,
so this adds a check for NULL pv->dev after is_missing_pv()
to avoid a segfault.
With recent updates for thin pool monitoring in version 169
we lost multiple WARNINGs to be printed in syslog, when
pool crossed 80%, 85%, 90%, 95%, 100%.
Restore this logic as we want to keep user informed more
then just once when 80% boundary is passed.
Fix a regression introduced in 70bb726 that allows a local variable
in the monitored file checking routine to be accessed before its
assignment when the file has already been unlinked.
When a user does a Manager.PvCreate they can specify the block device using a
device path that may be different than what lvm reports is the device path. For
example a user could use:
/dev/disk/by-id/wwn-0x5002538500000000 instead of /dev/sdc
In this case the pvcreate will succeed, but when we query lvm we don't find the
newly created PV. We fail because it's device path is returned as /dev/sdc. This
change re-uses an internal lookup which can accommodate this and correctly find
the newly created PV.
Corrects https://bugzilla.redhat.com/show_bug.cgi?id=1445654
There were a handful of references to other man
pages using the standard command(N) form which were
not in bold, so they were not turned into links
in html formats.
Many will look in 'man lvm.conf' expecting to find a
description of the lvm.conf fields, which are not there.
State at the beginning how to get this (by running
lvmconfig.)
The lvm(8) man page never said what LVM is,
it never defined what the acronym LVM stands for,
it never spelled out other common acronyms VG, PV, LV,
and never described what they are.
This adds a very minimal definition which at least defines
the acronyms and entities, but it could obviously be
expanded on, either here or elsewhere.
Clean up the handling of memory used for cmd defs
so it doesn't trip up memory debugging.
Allocate memory for commands[] from libmem.
Free temporary memory used by define_commands()
at the end of the function.
Clear all the command def state in in lvm_fin().
Using any arg with a command name in a script file
would cause the command to fail.
The name of the script file being executed was being passed
to lvm_register_commands() and define_commands(), which
prevented command defs from being defined (simple commands
were still being defined only by name which was enough for those
to still work when run trivially with no args).
The logic for suggesting the nearest valid command syntax
was missing the simplest case. If a command has only one
valid syntax, that is the one we should suggest. (We were
suggesting nothing in this case.)
If the device size does not match the size requested
by --setphysicalvolumesize, then prompt the user.
Make the pvcreate checking/prompting code handle
multiple prompts for the same device, since the
new prompt can be in addition to the existing
prompt when the PV is in a VG.
Adding qualifier makes the only unqualified log_debug occurence
consistent with other uses in the same file.
Other possible ways to fix this:
- using `from .utils import log_debug`
- moving the line below `from . import utils` line
Unless a change of the regionsize is requested via "lvconvert -R N ...",
keep the region size when the number of images changes in a raid1 LV.
Related: rhbz1443705
Unless a change of the regionsize is requested via "lvconvert -R N ...",
keep the region size when the number of images changes in a raid1 LV.
Resolves: rhbz1443705
lvconvert parameters not causing a conversion (i.e. no type,
number of stripes, stripesize or regionsize changes) will
remove any allocated reshape space in which case the command
returns success. If reshape space does not exist though,
return error.
When metadata LV size was over DM_THIN_MAX_METADATA_SIZE sectors,
the info() routine was incorrectly trying to match bigger size,
while we do never pass any bigger device.
Fixing a case, where lvs should be displaying status for metadata
LV with 16GB size.
When a cmd def RULE fails because of a disallowed
combination of options, improve the error message
to show the option combination, not just the options
that broke the rule.
Reshape check failed when regionsize changed and current raid type
was provided with no other change requested (stripes or stripesize).
E.g. "lvconvert --type raid6 --regionsize 256K" on a raid6 LV
with != 256K regionsize.
Enable --type in test script.
Remove any newly allocated sub LV (pair) remnants in case
allocation fails due to lag of (parallel) free PV space
and keep initial raid type.
Resolves: rhbz1438013
A command def can include a specific constant option value,
but the value was not being checked for optional opts.
e.g. this is an incorrect command and does not match any
command defs:
lvconvert --type cache --cachepool vg/lv
However, it was mistakely being matched to this cmd def,
where the required options match, but the optional options
do not:
lonvert --cachepool LV_linear_striped_raid_cachepool
OO: --type cache-pool, ...
The optional options were mistakely considered matching
because 'cache' and 'cache-pool' were not being compared.
SIGINT isn't blocked properly after a sigint_allow(),
sigint_restore() cycle leading to illicit interruptable
metadata updates. These can leave corrupted metadata behind.
Issues addressed in this commit:
sigint_allow() fails to set _oldmasked[] members properly due
to an offset by one bug on indexing the members of the array.
It bails out prematurely comparing to MAX_SIGINTS causing nesting
depths to be one less than MAX_SIGINTS. Fix the comparision.
Correct the related comparison flaw in sigint_restore().
Initialize all sig_atomic_t variables consequently.
Resolves: rhbz1440766
Avoid error message
"Logical Volume *_rimage_0 already exists in volume group,,,"
on takeover conversion from a 2-legged raid1 to raid4
(aiming to reshape it adding images).
Resolves: rhbz1439398
Requesting _raid_remove_images() to commit the
metadata missed to reload the origin causing a
kernel takeover error converting a 2-legged raid1
(with previously removed images) to raid5.
Allow the combination of both arguments keeping
the raid level but changing the regionssize
(e.g. "lvconvert --type raid1 --regionsize 1M RaidLV"
on an existing raid1 LV).
Resolves: rhbz1438396
lvchange/lvconvert operations that do not require the LV
to already be active need to acquire a transient LV lock
before changing/converting the LV to ensure the LV is not
active on another host. If the LV is already active,
a persistent LV lock already exists on the host and the
transient LV lock request is a no-op.
Some lvmlockd locks in lvchange were lost in the cmd def
changes. The lvmlockd locks in lvconvert seem to have
been missed from the start.
We have to unset the LoadState variable from previous use when we check
for systemd unit state. We use this variable to check if systemd services
are loaded or not and if it is loaded, we issue systemctl commands to
enable/disable and start/stop the service. We don't issue these commands
if the unit is not loaded to avoid error messages which may confuse users.
The actual value specified by the --poll y|n option was not
being used. The way the --poll value is used is hidden
through an indirection where the value is stored in a
global variable at the start of the command, and then the
value is read from there later. Setting the global
variable early in the command had been lost with the
cmd def changes.
Fill in some gaps where old versions of lvm allowed
--poll and --monitor in combination with other operations,
but those combinations had been lost since the cmd def work.
(The new cmd def code also added some combinations that
had been missed by the old code.)
Changes:
lvchange --activate: add poll and monitor options, and
add calls to them in implementation.
lvchange --refresh: add monitor option (poll already there),
and call to monitor in implementation.
lvchange <metadata ops>: add poll and monitor options, and
add calls to them in implementation.
vgchange <metadata ops>: add poll option (call to poll
already in implementation).
vgchange --refresh: remove monitor option (not used by code)
lvchange --persistent y: add poll and monitor options, and
add calls to them, and to activate
in the implementation. (Making it
match the main lvchange metadata
command.)
Summary of current usage:
lvchange --activate: monitor, poll
vgchange --activate: monitor, poll
lvchange --refresh: monitor, poll
vgchange --refresh: poll
lvchange --monitor: ok
lvchange --poll: ok
lvchange --monitor --poll: ok
vgchange --monitor: ok
vgchange --poll: ok
vgchange --monitor --poll: ok
lvchange <metadata ops>: monitor, poll
vgchange <metadata ops>: poll
Enhance commit 25b5915c9b
to process options requiring immediate metadata commits
and reloads after those we can group together doing just
one commit and an optional reload for the whole group.
Backup metadata after processing options successfully.
Related: rhbz1437611
Don't abbreviate the --help output quite as much
when there are many command defs. Print all the
options in the cmd defs that are shown. --longhelp
output is unchanged and includes everything.
These days --partial is only used with activation in
lvchange/vgchange. It probably had another meaning
at some point in history which is no longer used,
so ignore it in those cases.
When included in the OO list, the option is advertised in
help/man output, implying it is meaningful to the command,
when in fact the command never uses it.
The IO list means the option won't cause an error if it's
used, but is not displayed as an valid option for the command.
If the option is not included in either OO or IO lists,
using the option would cause a command error, which would cause
problems for anyone is using the option for historical reasons.
_lvchange_properties_single() processes multiple command
line arguments in a loop causing metadata updates and/or
backups per argument.
Optimize to only perform one update and/or backup
(but necessary interim ones; e.g. for --resync)
per command run.
Related: rhbz1437611
Use only selected paths for finding .so in builddir.
So if builddir constains some embeded subdirs with some more
occurences of project (i.e. 'make rpm' build subdir)
those library paths will not get into list.
With monolithic kernels we can't actually modprobe
for cache modules as they are already compiled-in
and policy modules do not export version symbol.
Reported issue on list:
https://www.redhat.com/archives/dm-devel/2017-March/msg00061.html
Fix will try to look for explicit kernel symbols first before
calling modprobe.
Fix a regression introduced by new command line processing
(see starting commit 1e2420bca8)
causing activation to be rejected in combination with setting
persistent minor device numbers.
Resolves: rhbz1437611
Since _filemap_monitor_set_notify() is only called after daemon
start up it should not use the _early_log() macros. Instead, use
log_sys_error to log errors from the two syscalls in the function.
The fallback branch in _stats_update_file() is redundant (since the
branch taken when the daemon starts successfully must jump to the
'out' label anyway): remove it and re-order the conditions to
improve readability.
Use log_sys_error rather than log_error if execvp() fails:
/mnt/redhat/xdoio.13752.XIORQ: Created new group with 2 region(s) as group ID 0.
# execvp() failed.
vs:
/var/lib/libvirt/images/rhel7-vm1.qcow2: Created new group with 884 region(s) as group ID 0.
dmfilemapd: execvp failed: No such file or directory
The function _filemap_monitor_check_file_unlinked() attempts to
test whether a fd value should be closed by comparison to zero:
if ((fd > 0) && close(fd))
log_error("Error closing fd %d", fd);
The test should be '>=' since 0 is a valid file descriptor.
Similar to 40fb91a, but for the file descriptor opened using the
link path reported by /proc/<self>/fd/<fd>.
The daemon opens a new file descriptor from /proc/<self>/fd when
checking for an unlinked file with mode=inode. Ensure that it is
always closed even if the same file test fails.
The daemon opens a new file descriptor from fm->path when checking
for an unlinked file with mode=inode. Ensure that it is always
closed even if the same file test fails.
The path argument check in dmfilemapd incorrectly tests for:
if (argv[0] == '/')
Rather than testing the 1st character in the string pointed to by
argv[0].
Removing some unused new lines and changing some incorrect "can't
release until this is fixed" comments. Rename license.txt to make
it clear its merely an included file, not itself a licence.
This patch fixed lvm2 compilation running on x32 arch.
(Using 64bit x86 cpu features but running on 32b address space,
so consuming less mem in VM).
On x32 arch 'time_t' is 64bit while 'long' is 32bit.
Commits a29bb6a14b
... 5c199d99f4
narrowed down on addressing the escaping of hyphens
in the dynamic creation of manuals whilst avoiding
them in creating help texts. This lead to a sequence
of slipping through hyphens adrressed by additional
patches in aforementioned commit series.
On the other hand, postprocessing dynamically man-generator
created and statically provided manuals catches all hyphens
in need of escaping.
Changes:
- revert the above commits whilst keeping man-generator
streamlining and the detection of any '\' when generating
help texts in order to avoid escapes to slip in
- Dynamically escape hyphens in manaual pages using sed(1)
in the respective Makefile targets
- remove any manually added escaping on hyphens from any
static manual sources or headers
raid1 doesn't allow to set all images to writemostly because at
least one image is required to receive any written data immediately.
The dm-raid target will detect such invalid request and
fail it iwith a kernel error message.
Reject such request in uspace displaying a respective error message.
The recent command definitions commit took the command
name from argv[0] without applying basename to the value,
so a pathname, e.g. /usr/sbin, would cause lvm to not
recognize the command name.
This commit supersedes reverted 1e4462dbfb
to avoid changes to liblvm and the libdm API completely.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range in libdm and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
This reverts commit 1e4462dbfb
in favour of an enhanced solution avoiding changes in liblvm
completetly by checking the target versions in libdm and emitting
the respective parameter lines.
Fixes command defs related to creating a new thin pool and
then a new thin lv in the new pool.
1. lvcreate --size --virtualsize --thinpool
Needs a cmd def, it was missing.
The def is unique by the three required
options: size, virtualsize and thinpool.
2. lvcreate --size --virtualsize --thinpool VG
Needs a cmd def, it was missing.
The def is unique by the three required
options: size, virtualsize and thinpool,
and one required positional arg: VG.
3. lvcreate --thin --virtualsize --size LV_new|VG
This existing def should not accept an optional
--type thin, which if used makes it indistinct
from another def.
4. lvcreate --size --virtualsize VG
This existing def should not accept an optional
--type thin or --thin, which if used makes it
indistinct from other defs (e.g. 3)
This reverts commit 642d682d8d.
Using the thinpool option with this cmd def makes it
indistinct from other cmd defs where thinpool is a
required option.
The libdevmapper interface compares existing table line retrieved from
the kernel to new table line created to decide if it can suppress a reload.
Any difference between input and output of the table line is taken to be a
change thus causing a table reload.
The dm-raid target started to misorder the raid parameters (e.g. 'raid10_copies')
starting with dm-raid target version 1.9.0 up to (excluding) 1.11.0. This causes
runtime failures (limited to raid10 as of tests) and needs to be reversed to allow
e.g. old lvm2 uspace to run properly.
Check for the aforementioned version range and adjust creation of the table line
to the respective (mis)ordered sequence inside and correct order outside the range
(as described for the raid target in the kernels Documentation/device-mapper/dm-raid.txt).
lvcreate --thinpool POOL1 -L 100M --virtualsize 100M snapper_thinp
https://bugzilla.redhat.com/1434027
(The general rule is that a command is accepted if it is unambiguous.
The combination -L -V --thinpool uniquely identifies the operation.)
Udev events can come in like a flood when something changes. It really
doesn't do us any good to refresh the state of the service numerous times
when 1 would suffice. We had something like this before, but it was
removed when we added the refresh thread. However, we have since learned
that we need to sequence events in the correct order and block dbus
operations if we believe the state has been affected, thus udev events are
being processed on the main work queue. This change limits spurious
work items from getting to the queue.
If we always disable the sending of notify dbus events then in the case
where all the users are lvm dbus users we will be in udev handling mode
until at least 1 external lvm command occurs. Instead we will not disable
notify dbus until after we get at least 1 external event. This makes the
service get into the correct mode of operation faster.
lvconvert should not leak 'error' device.
(This patch is not fix the problem, just makes it more easily visible
instead of more confusing 'clvmd' trace).
Starting with dm-raid target version 1.9.0 shrinking of mapped devices is supported.
Check for support being present in lvresize and lvreduce.
Related: rhbz1394048
Since we want to test different cache policies with profiles mq&smq
raise version to 1.8.
TODO: maybe split in more tests so older targets can test here as well.
N.B.: passthrough is also not supported with version 1.3
Quering non-thin-pool segment for discard property may lead
to intenal error if the segment had set 'out-of-range' value,
so only thin-pool is allowed, for other it returns NULL.
If SubLVs to be removed still exist after an image removing
conversion (i.e. "lvconvert --yes --force --stripes N "
with N < total stripes) any request to convert to a different
striped/raid* level has to be rejected until after those freed
SubLVs got removed by running the aforementioned lvconvert again.
Add tests to check conversion to striped/raid* gets rejected.
Enhance a test comment.
Related: rhbz1191935
Related: rhbz1366296
Adjust to final target version 1.10.1 supporting reshape
properly and to recently changed report field specifications
(e.g. rehape_len_le) to allow these tests to run.
Lower mirror region size to suite the tiny test VG.
Previous commit 506d88a2ec introduced disabling lvmetad on repairs.
Avoid calling lvscan and use of any --config options altogether
in the mirror and raid DSOs.
Related: rhbz1380521
Repairing missing devices does not work reliably
with lvmetad, so disable lvmetad before repair.
A standard lvmetad refresh (pvscan --cache) will
enable lvmetad again.
Commit 07ded8059c assumed that the mirror is blocked which is not the case.
It is accessible, degraded and in need of repair because some of its legs
(partially) failed. Any auto-repair via dmeventd fails though because
of lvmetad not providing proper data about the failed PV(s). That's why
this workaround got introduced in commit 76f6951c3e until we get to
the lvmetad interaction core issue.
Mind any mirror auto-repair failure is caused by such lvmetad interaction
problems not yet solved so disabling lvmetad works as a resort as elaborated
on in the related bz.
Reintroducing the interim solution.
Resolves: rhbz1380521
Effectively revert whole 76f6951c3e.
We need to figure out some other solution.
At this moment usage of --config with 'repair' of blocked mirror
is 'freezing' combination.
For each section 8 man page, a .8_gen file is created from one of:
.8_main - Old-style man page - content used directly
.8_des and .8_end - Description and end section of a generated page
.8_pregen - Pre-generated page used if the generator fails
Other man sections are not generated and use the suffix .5_main or .7_main.
Developers should use 'make generate' to regenerate the .8_pregen files.
When a given command:
- matches command definition A based on required options,
but uses optional options that are not accepted by A.
- matches command definition B based on required options,
(but fewer required items than A), and uses no
unaccepted optional options.
then B is the correct choice. Command A would fail because
of the unaccepted optional options. The logic was mistakenly
letting A win because it had a greater number of required option
matches, without accounting for the optional option mismatches.
Systematically outline every combination of:
--type striped, --type mirror, --type raid, --mirrors, --stripes
and make sure each is assigned to one specific cmd def.
This revealed that a new command def is needed for
lvcreate that uses both --mirrors and --stripes
but no --type option.
The use of LVCREATE_RAID shortcut for an option set
resulted in mirrors/stripes being included in optional
opts set when they were already in the required list.
Sending %d as format argument in lvmetad_vg_remove_pending() will cause
segfaults in config_make_nodes_v() when va_arg() casts to int64_t. Also, it is
clearly advertised in the lvm source code that using plain %d is prohibited, so
let's switch to FMTd64.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Require that the path argument to dmfilemapd be an absolute path
and document this in tool output, libdevmapper.h and dmfilemapd.8.
The check is also enforced by dm_stats_start_filemapd() to avoid
forking a new process with an invalid path argument.
The initial check on argc incorrectly returns 1 when the wrong
number of arguments are present, causing a segfault in main()
when no arguments are given:
# dmfilemapd
Wrong number of arguments.
usage: dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Segmentation fault (core dumped)
Commit f4b30b0dae was about displaying visible LV size
when reshape space is allocated. Take parity devices
into account when displaying the user visible LV size.
As was recently done with relative signes for sizes/extents,
limit the signs used with the mirrors option, e.g.
lvcreate --mirrors now does not accept or advertise an
optional minus sign with the value. lvconvert --mirrors
accepts +|-.
Add a warning about maximum supported numbers of stripes
with striped LVs realtive to RAID conversions.
Add examples for a more elaborate, multi-step conversion
from linear to striped (and vice versa).
Shrink lvs examples output.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The LVCONVERT_RAID shortcut for including options ended
up including mirrors/stripes as optional opts in defs where
they were already required, or in defs where they would
not be used. Remove the option set and specify in each
case only the options wanted.
Better support for lvdisplay.
By default info about running (in kernel) cache status is printed.
To get 'segtype' info, user runs: 'lvdisplay -m', example:
--- Logical volume ---
LV Path /dev/vg/lvol0
LV Name lvol0
VG Name vg
LV UUID Y4uWuN-TBGk-duer-aPWl-yBWn-iFFR-RU1gg1
LV Write Access read/write
LV Creation host, time linux, 2017-03-01 20:52:39 +0100
LV Cache pool name lvol2
LV Cache origin name lvol0_corig
LV Status available
# open 0
LV Size 12,00 MiB
Cache used blocks 10,42%
Cache metadata blocks 0,49%
Cache dirty blocks 0,00%
Cache read hits/misses 112 / 34
Cache wrt hits/misses 133 / 0
Cache demotions 0
Cache promotions 20
Current LE 3
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Segments ---
Logical extents 0 to 2:
Type cache
Chunk size 64,00 KiB
Metadata format 1
Mode writethrough
Policy smq
Setting migration_threshold=100000
OO_LVCREATE_CACHE accepts --cachemetadataformat.
Support new option --cachemetadataformat auto|1|2 for caching.
Word 'auto' can be also be given as '0'.
Report CMFmt column with cache metadata format version.
Report KMFmt column with 'kernel cache metadata format version' for device.
(a value reported from status).
(Update 'CacheMode' to name 'Cache' as primary segtype).
Only cache-pool segtype may store cache_metadata_format.
Only supported values are 0,1,2
Format 2 requires LV status uses LV_METADATA_FORMAT.
Format 0 (unselected) or 1 shall not set this 'incompatible' status.
Cache pool read/writes metadata_format within its segment type..
For CachePoolLV unselected metadata format is NOT stored in metadata.
For CacheLV when metadata format is not present/selected in lvm2 metadata,
it's automatically assumed to be the version 1 (backward compatible).
To ensure older lvm2 will not 'miss-read' metadata with new version 2,
such LV is marked with METADATA_FORMAT status flag (segment is
specifying metadata format). So when cache uses metadata format 2,
it will become inaccesible on older system without such support.
(kernel dm cache < 1.10, lvm2 < 2.02.169).
Add new profilable configation setting to let user select
which metadata format of a created cache pool he wish to use.
By default the 'best' available format is autodetected at runtime,
but user may enforce format 1 or 2 ATM.
Code also detects availability for metadata2 supporting cache target.
In case of troubles user may easily Disable usage of this feature
by placing 'metadata2' into global/cache_disabled_features list.
Older library version was not detecting unknown 'feature' bits
and could let start target without needed option.
New versioned symbol now checks for supported feature bits.
_Base version keeps accepting only previously known features and
mask/ignores unknown bits.
NB: if the older binary passed in 'random' bits, it will not get
metadata2 by chance. New linked binary get new validation function.
Library user is required to not pass 'trash' for unsupported bits,
as such calls will be rejected.
Dm cache target version 1.10 introduces new cache metadata format
(upstream kernel >=4.11).
New format is enable by passing new target feature flag metadata2.
Interace side on libdm uses DM_CACHE_FEATURE_METADATA2.
This feature bit is now also recognized on status
and set in 'feature_flags' field of dm_status_cache structure.
Code also adds check for 'highest' supported feature flag bit.
So it rejects properly any 'unknown' feature bit set by application.
Better code to enforce writethrough caching for cleaner policy.
Only check for cleaner when DM_CACHE_FEATURE_PASSTHROUGH or
DM_CACHE_FEATURE_WRITEBACK is set.
As now we can properly recognize all paramerters for pool creation,
we may drop PASS_ARG_ defines and rely on '_UNSELECTED' or 0 entries
as being those without user given args.
When setting are not given on command line - 'update' function
fill them from profiles or configuration. For this 'profile' arg
was needed to be passed around and since 'VG' itself is not needed,
it's been all replaced with 'cmd, profile, extents_size' args.
Reuse same code for getting/setting cache parameters with lvcreate.
So there is single one place how to get vars from profiles and configs.
Variables declarations are moved to start of function and
there is no need to initialize them as they are always
defined by functions get_cache_params() and get_pool_params().
Since cache chunk might be huge and there is no technical need
to enforce rounding and there is actually more 'real' VG space
used then necessary - keep rounding on 'chunk' bounrary only
for thin volumes - where it's the space used anyway.
NB: we support conversion of any-size 'existing' LV into cached LV.
Fix missing reset of '*settings' pointer when no args were given.
Handle cache_chunk settings like all other settings, so it is properly
updated only with non-zero settings and the existing cache-pool
chunk_size is not being reconfigured.
User can specify metadata profile which stores important cache
geometry data for easy configuration.
Fix missing support for getting chunk_size, cache_mode, cache_policy
for a cache/cache pools volumes from configuration or metadata profile.
Split-off patch that just replaces returns with 'goto bad'
so there is single place to release policy_settings.
In the next patch we will start to use some shared
function call between lvconvert and lvcreate
(effectively restoring previous logic which has got lost).
To more easily recognize unselected state from select '0' state
add new 'THIN_ZERO_UNSELECTED' enum.
Same applies to THIN_DISCARDS_UNSELECTED.
For those we no longer need to use PASS_ARG_ZERO or PASS_ARG_DISCARDS.
Basically code moving operation to have a single place resolving
thin_pool_chunk_size_policy.
Supported are generic & performance profiles.
Function is now shared between thin manipulation code and configuration
_CFG logic to obtain defaults and handle correct reporting upward coding
stack.
Test for NULL in dm_stats_destroy() and return immediately if
the struct dm_stats pointer is NULL (similar to free(NULL)).
This simplifies cleanup code which otherwise needs to:
out:
if (dms)
dm_stats_destroy(dms);
return;
Older compilers cannot tell that the 'mode' variable is only
used in branches in which it is assigned:
dmsetup.c:5651: warning: "mode" may be used uninitialized in this function
dmsetup.c:5023: warning: "mode" may be used uninitialized in this function
Avoid this by always assigning the variable a value.
Older compilers are not able to determine that although group_id
is only assigned in one branch of a conditional, it is never used
used when the other branch is taken:
libdm-stats.c:3319: warning: "group_id" may be used uninitialized in this function
Avoid this by always initialising the variable when it is
declared.
Utilizing the --config option we will utilize global/notify_dbus=0 so
that the service itself doesn't generate change events which it then needs to
process.
We need to place query operations in the queue to prevent the case where
a client knows of something before the service does. For example if a
client creates a PV/VG/LV outside of the dbus API and then immediately
tries to lookup and use that resource in the lvm dbus service it should
be present. By placing the queries in the work queue any previous
refresh operation will complete before we process the query.
In addition to the already supported conversion between 2-legged
raid1 and raid5, raid1 and raid4 can be also converted into each
other with 2 legs (raid4/5 are limited to map a 2-legged raid1).
This patch supports the missing raid4 conversion in the sequence
linear -> 2-legged raid1 -> raid4/5, then restripe to more than one
data stripes for performance and resilience reasons and optionally
convert to striped/raid0.
The other conversion sequence is also possible by converting N-way
striped/raid0 to raid4/5, then restripe to 2 legs followed by a
conversion to raid1 and optionally to linear (loosing all resilience).
Add examples for reshaping number of stripes
and converting from raid6 to striped to raid10.
Remove trailing spaces.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The filemap daemon takes its program_id from the regions it is
managing: use DM_STATS_ALL_PROGRAMS when retrieving an initial
listing and then obtain the correct program_id from the group
leader.
On conversion from striped to raid0, data LVs are created
and all segments and their respective areas of the striped
LV are moved across to new segments allocated for the raid0
image LVs. This can cause non-canonical segments to be added
to the image LVs.
Add a call to lv_merge_segments() once all segments have been
added to an image LV to compensate for that. This avoids
unsafe table loads on activation.
Fix comments.
Automatic dmeventd repair of mirrors with active lvmetad configured
(mirror_image_fault_policy = "allocate") fails because the lvscan
run before the repair in the mirror DSO does not update the
lvmetad cache properly thus "lvconvert --repair ..." fails.
Need to scan the mirror LV before and after the repair
to have proper cache content after the repair finished.
The cache can't be relied on or the repair will fail.
Resolves: rhbz1380521
Launch an instance of the filemap monitoring daemon when creating,
or updating, a file mapped group, unless the --nomonitor switch is
given.
Unless --foreground is given the daemon will detach from the
terminal and run in the background until it is signaled or the
daemon termination conditions are met.
The --follow={inode|path} switch is added to control the daemon
behaviour when files are moved, unlinked, or renamed while they
are being monitored.
The daemon runs with the same verbosity as the dmstats command
that starts it.
Add a daemon that can be launched to monitor a group of regions
corresponding to the extents of a file, and to update the regions as the
file's allocation changes.
The daemon is intended to be started from a library interface, but can
also be run from the command line:
dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Where fd is a file descriptor open on the mapped file, group_id is the
group identifier of the mapped group and mode is either "inode" or
"path". E.g.:
# dmfilemapd 3 0 vm.img inode 1 3 3<vm.img
...
If foreground is non-zero, the daemon will not fork to run in the
background. If verbose is non-zero, libdm and daemon log messages will
be printed.
It is possible for the group identifier to change when regions are
re-mapped: this occurs when the group leader is deleted (regroup=1 in
dm_stats_update_regions_from_fd()), and another region is created before
the daemon has a chance to recreate the leader region.
The operation is inherently racey since there is currently no way to
atomically move or resize a dm_stats region while retaining its
region_id.
Detect this condition and update the group_id value stored in the
filemap monitor.
A function is also provided in the the stats API to launch the filemap
monitoring daemon:
int dm_stats_start_filemapd(int fd, uint64_t group_id, const char *path,
dm_filemapd_mode_t mode, unsigned foreground,
unsigned verbose);
This carries out the first fork and execs dmfilemapd with the arguments
specified.
A dm_filemapd_mode_t value is specified by the mode argument: either
DM_FILEMAPD_FOLLOW_INODE, or DM_FILEMAPD_FOLLOW_PATH. A helper function,
dm_filemapd_mode_from_string(), is provided to parse a string containing
a valid mode name into the appropriate dm_filemapd_mode_t value.
It's not an error to call dm_stats_group_present() on a handle
that contains no regions.
This causes dmfilemap to log a false backtrace during shutdown
if all regions are removed from the corresponding device:
exiting _filemap_monitor_get_events() with deleted=0, check=0
waiting for FILEMAPD_WAIT
dm message (253:1) [ opencount flush ] @stats_list dmstats [32768] (*1)
<backtrace>
Filemap group removed: exiting.
Change this to only emit a backtrace if the handle is NULL.
Splitting off an image LV of a 2-legged
raid1 LV causes loss of resilience.
Ask user to avoid uninformed loss of all resilience.
Don't ask for N > 2 legged raid1 LVs.
Adjust tests.
Splitting off an image LV of a 2-legged raid1 LV tracking changes
causes loosing partial resilience for any newly written data set.
Full resilience will be provided again after the split off image LV
got merged back in and the new data set got fully synchronized.
Reason being that the data is only stored on the remaining single
writable image during the split.
Ask user to avoid uninformed loss of such partial resilience.
Don't ask for N > 2 legged raid1 LVs.
In case N images fail (N <= parity chunks) _and_
a "vgreduce --removemissing --force VG" was applied
a following repair of the RaidLV fails:
Unable to remove N images: Only 0 devices given.
Failed to remove the specified images from tb/r.
Failed to replace faulty devices in tb/r.
Fix as of this commit results in correct repair:
Faulty devices in tb/r successfully replaced.
Add new values for different sign variations, resulting in:
size_VAL no sign accepted
ssize_VAL accepts + or -
psize_VAL accepts +
nsize_VAL accpets -
extents_VAL no sign accepted
sextents_VAL accepts + or -
pextents_VAL accepts +
nextents_VAL accepts -
Depending on the command being run, change the option
values for --size, --extents, --poolmetadatasize to
use the appropriate value from above.
lvcreate uses no sign (but accepts + and ignores it).
lvresize accepts +|- for a relative change.
lvextend accepts + for a relative change.
lvreduce accepts - for a relative change.
Like opt and val arrays in previous commit, combine duplicate
arrays for lv types and props in command.c and lvmcmdline.c.
Also move the command_names array to be defined in command.c
so it's consistent with the others.
command.c and lvmcmdline.c each had a full array defining
all options and values. This duplication was not removed
when the command.c code was merged into the run time.
seg->extents_copied has to be defined properly on reducing
the size of a raid LV or conversion from raid5 with 1 stripe
to raid1 will fail.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Reshaping a raid5 LV to one stripe aiming to convert it to
raid1 (and optionally to linear) reports the wrong LV size
when still having reshape space allocated.
The lv_extend/_lv_reduce API doesn't cope with resizing RaidLVs
with allocated reshape space and ongoing conversions. Prohibit
resizing during conversions and remove the reshape space before
processing resize. Add missing seg->data_copies initialisation.
Fix typo/comment.
For the time being raid10 is limited to even number of total stripes
as is and 2 data copies. The number of stripes provided on creation
of a raid10(_near) LV with -i/--stripes gets doubled to define
that even total number of stripes (i.e. images).
Apply the same on disk adding conversions (reshapes) with
"lvconvert --stripes RaidLV" (e.g. 2 stripes = 4 images
total converted to 3 stripes = 6 images total).
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Part of the UNIT description was still living in the
--size description. Move it to the Size[UNIT] description
since it is used by other options, not just --size.
In lvcreate/lvconvert, --poolmetdatasize can only be an
absolute value, but in lvresize/lvextend, --poolmetadatasize
can be given a + relative value.
The val types only currently support relative values that
go in both directions +|-. Further work is needed to add
val types that can be relative in only one direction, and
switching various option values to one those depending on
the command name. Then poolmetdatasize will not appear
with a +|- option in lvcreate/lvconvert, and will
appear with only the + option in lvresize/lvextend.
Enhance the raid report functions for the recently added LV fields
reshape_len, reshape_len_le, data_offset, new_data_offset, data_copies,
data_stripes and parity_chunks to cope with "lvs --select".
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Clean up and correct the details around --extents and --size.
lvcreate/lvresize/lvreduce/lvextend all now display the
extents option in usages.
The Size and Number value variables for --size and --extents
are now displayed without the [+|-] prefix for lvcreate.
A special case is needed to display
--extents for lvcreate since the cmd defs
treat --extents as an automatic alternative
to --size (to avoid duplicating every cmd def).
Now that we got the "data_stripes" field key, adjust the "stripes" field description.
Enhance the "regionsize" field description to cover raids as well.
When a cmd def includes multiple sets of options (OO_FOO),
allow multiple OO_FOO sets to contain the same option and
avoid repeating it in the cmd def.
There was confusion in the code about whether or not the
--size option accepted a sign. Make it consistent and clear
that it does.
This exposes a new problem in that an option can only
accept one value type, e.g. --size can only accept a
signed number, it cannot accept a positive or negative
number for some commands and reject negative numbers for
others.
In practice, lvcreate accepts only positive --size
values and lvresize accepts positive or negative --size
values. There is currently no way to encode this
difference. Until that is fixed, the man page output
is hacked to avoid printing the [+|-] prefix for sizes
in lvcreate.
Settings specified in other command line args take precedence over
profiles and --config, which takes precedence over settings in actual
config files.
Since commit 1e2420bca8 ('commands: new
method for defining commands') commands like this:
lvchange --config 'global/test=1' -ay vg
have been printing the 'TEST MODE' message, but nevertheless making
real changes.
Commit 48778bc503 introduced new RAID reshaping related report fields.
The inclusioon of segtype.h in properties.c isn't mandatory, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 80a6de616a versioned the dm_tree_node_add_raid_target_with_params()
and dm_tree_node_add_raid_target() APIs for compatibility reasons.
There's no user of the latter function, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
During an ongoing reshape, the MD kernel runtime reads stripes relative
to data_offset and starts storing the reshaped stripes (with new raid
layout and/or new stripesize and/or new number of stripes) relative
to new_data_offset. This is to avoid writing over any data in place
which is non-atomic by nature and thus be recoverable without data loss
in the transition. MD uses the term out-of-place reshaping for it.
There's 2 other areas we don't have report capability for:
- number of data stripes vs. total stripes
(e.g. raid6 with 7 stripes toal has 5 data stripes)
- number of (rotating) parity/syndrome chunks
(e.g. raid6 with 7 stripes toal has 2 parity chunks; one
per stripe for P-Syndrome and another one for Q-Syndrome)
Thus, add the following reportable keys:
- reshape_len (in current units)
- reshape_len_le (in logical extents)
- data_offset (in sectors)
- new_data_offset ( " )
- data_stripes
- parity_chunks
Enhance lvchange-raid.sh, lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh, lvconvert-raid-reshape.sh
and lvconvert-raid-takeover.sh to make use of new keys.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add/remove the SECONDARY_SYNTAX flag to cmd defs.
cmd defs with this flag will be listed under the
ADVANCED USAGE man page section, so that the main
USAGE section contains the most common commands
without distraction.
- When multiple cmd defs do the same thing, one variant
can be displayed in the first list.
- Very advanced, unusual or uncommon commands should be
in the second list.
Recently added check for reshaping in this function called for
a cleanup to avoid proliferating it with more explicit conditionals.
Base the reshaping check on the given _features array.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
An initialization was missing when converting striped to raid0(_meta)
causing unitialized reshape_len in the new component LVs first segment.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The imposed minimum region size can cause rejection on
disk removing reshapes. Lower it to avoid that.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 27384c52cf lowered the maximum number of devices
back to 64 for compatibility.
Because more members have been added to the API in
'struct dm_tree_node_raid_params *', we have to version
the public libdm RAID API to not break any existing users.
Changes:
- keep the previous 'struct dm_tree_node_raid_params' and
dm_tree_node_add_raid_target_with_params()/dm_tree_node_add_raid_target()
in order to expose the already released public RAID API
- introduce 'struct dm_tree_node_raid_params_v2' and additional functions
dm_tree_node_add_raid_target_with_params_v2()/dm_tree_node_add_raid_target_v2()
to be used by the new lvm2 lib reshape extentions
With this new API, the bitfields for rebuild/writemostly legs in
'struct dm_tree_node_raid_params_v2' can be raised to 256 bits
again (253 legs maximum supported in MD kernel).
Mind that we can limit the maximum usable number via the
DEFAULT_RAID{1}_MAX_IMAGES definition in defaults.h.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
- Combine the equivalent lvconvert --type raid defs.
(Two cmd defs must be different without relying
on LV type, which are not known at the time the
cmd def is matched.)
- Remove unused optional options from lvconvert --stripes,
and lvconvert --stripesize.
- Use Number for --stripes_long val type.
- Combine the cmd def for raid reshape cleanup into the
existing start_poll cmd def (they were duplicate defs).
Calls into the raid code from a poll opertion will be
added.
Commit 64a2fad5d6 raised the maximum number of RAID devices to 64.
Commit e2354ea344 introduced RAID_BITMAP_SIZE as 4 to have
256 bits (4 * 64 bit array members), thus changing the libdm API
unnecessarilly for the time being.
To not change the API, reduce RAID_BITMAP_SIZE to 1.
Remove an unneeded definition of it from libdm-common.h.
If we ever decide to raise past 64, we'll version the API.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The fedorahosted git repository shuts down tomorrow:
https://communityblog.fedoraproject.org/fedorahosted-sunset-2017-02-28/
Our upstream git repository has moved back to sourceware.org.
Mailing list hosting is not changing.
Gitweb:
https://www.sourceware.org/git/?p=lvm2
Git:
git://sourceware.org/git/lvm2.git
ssh://sourceware.org/git/lvm2.git
http://sourceware.org/git/lvm2.git
Example command to change the origin of a repository clone:
Public:
git remote set-url origin git://sourceware.org/git/lvm2.git
Committers:
git remote set-url origin git+ssh://sourceware.org/git/lvm2.git
The options list was sorted as:
- options with both long and short forms, alphabetically
- options with only long form, alphabetically
This was done only for the visual effect. Change to
sort alphabetically by long opt, without regard to
short forms.
Fixes commit 286d39ee3c, which was correct except
for a reversed strstr. Now uses strchr, and modifies
a copy of the name so the original argv is preserved.
When requesting a regionsize change during conversions, check
for constraints or the command may fail in the kernel n case
the region size is too smalle or too large thus leaving any
new SubLVs behind.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Allow regionsize on upconvert from linear:
fix related commit 2574d3257a to actually work
Related: rhbz1394427
Remove setting raid5_n on conversions from raid1
as of commit 932db3db53 because any raid5 mapping
may be requested.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add aforementioned but forgotten new test scripts
lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh and
lvconvert-raid-reshape.sh
Those presume dm-raid target version 1.10.2
provided by a following kernel patch.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Because of contraints in renaming shifted rimage/rmeta LV names
the current RaidLV limit is a maximum of 10 SubLV pairs.
With the previous introduction of reshaping infratructure that
constriant got removed.
Kernel supports 253 since dm-raid target 1.9.0, older kernels 64.
Raise the maximum number of RaidLV rimage/rmeta pairs to 64.
If we want to raise past 64, we have to introdce a check for
the kernel supporting it in lvcreate/lvconvert.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces the changes to call the reshaping infratructure
from lv_raid_convert().
Changes:
- add reshaping calls from lv_raid_convert()
- add command definitons for reshaping to tools/command-lines.in
- fix raid_rimage_extents()
- add 2 new test scripts lvconvert-raid-reshape-linear_to_striped.sh
and lvconvert-raid-reshape-striped_to_linear.sh to test
the linear <-> striped multi-step conversions
- add lvconvert-raid-reshape.sh reshaping tests
- enhance lvconvert-raid-takeover.sh with new raid10 tests
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- allow raid_rimage_extents() to calculate raid10
- remove an __unused__ attribute
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- add missing raid1 <-> raid5 conversions to support
linear <-> raid5 <-> raid0(_meta)/striped conversions
- rename related new takeover functions to
_takeover_from_raid1_to_raid5 and _takeover_from_raid5_to_raid1,
because a reshape to > 2 legs is only possible with
raid5 layout
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- enhance _clear_meta_lvs() to support raid0 allowing
raid0_meta -> raid10 conversions to succeed by clearing
the raid0 rmeta images or the kernel will fail because
of discovering reordered raid devices
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- enhance _raid45610_to_raid0_or_striped_wrapper() to support
raid5_n with 2 areas to raid1 conversion to allow for
striped/raid0(_meta)/raid4/5/6 -> raid1/linear conversions;
rename it to _takeover_downconvert_wrapper to discontinue the
illegible function name
- enhance _striped_or_raid0_to_raid45610_wrapper() to support
raid1 with 2 areas to raid5* conversions to allow for
linear/raid1 -> striped/raid0(_meta)/raid4/5/6 conversions;
rename it to _takeover_upconvert_wrapper for the same reason
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add missing possible reshape conversions and conversion options
to allow/prohibit changing stripesize or number fo stripes
- enhance setting convenient riad types in reshape conversions
(e.g. raid1 with 2 legs -> radi5_n)
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add _raid_reshape() using the pre/post callbacks and
the stripes add/remove reshape functions introduced before
- and _reshape_requested function checking if a reshape
was requested
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add vg metadata update functions
- add pre and post activation callback functions for
proper sequencing of sub lv activations during reshaping
- move and enhance _lv_update_reload_fns_reset_eliminate_lvs()
to support pre and post activation callbacks
- add _reset_flags_passed_to_kernel() which resets anyxi
rebuild/reshape flags after they have been passed into the kernel
and sets the SubLV remove after reshape flags on legs to be removed
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function to support disk adding reshapes
- add function to support disk removing reshapes
- add function to support layout (e.g. raid5ls -> raid5_rs)
or stripesize reshaping
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function providing state of a reshaped RaidLV
- add function to adjust the size of a RaidLV was
reshaped to add/remove stripes
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add lv_raid_data_copies returning raid type specific number;
needed for raid10 with more than 2 data copies
- remove _shift_and_rename_image_components() constraint
to support more than 10 raid legs
- add function to calculate total rimage length used by out-of-place
reshape space allocation
- add out-of-place reshape space alloc/relocate/free functions
- move _data_rimages_count() used by reshape space alloc/realocate functions
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces local infrastructure to raid_manip.c
used by followup patches.
Add functions:
- to check reshaping is supported in target attibute
- to return device health string needed to check
the raid device is ready to reshape
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces infrastructure prerequisites to be used
by raid_manip.c extensions in followup patches.
This base is needed for allocation of out-of-place
reshape space required by the MD raid personalities to
avoid writing over data in-place when reading off the
current RAID layout or number of legs and writing out
the new layout or to a different number of legs
(i.e. restripe)
Changes:
- add members reshape_len to 'struct lv_segment' to store
out-of-place reshape length per component rimage
- add member data_copies to struct lv_segment
to support more than 2 raid10 data copies
- make alloc_lv_segment() aware of both reshape_len and data_copies
- adjust all alloc_lv_segment() callers to the new API
- add functions to retrieve the current data offset (needed for
out-of-place reshaping space allocation) and the devices count
from the kernel
- make libdm deptree code aware of reshape_len
- add LV flags for disk add/remove reshaping
- support import/export of the new 'struct lv_segment' members
- enhance lv_extend/_lv_reduce to cope with reshape_len
- add seg_is_*/segtype_is_* macros related to reshaping
- add target version check for reshaping
- grow rebuilds/writemostly bitmaps to 246 bit to support kernel maximal
- enhance libdm deptree code to support data_offset (out-of-place reshaping)
and delta_disk (legs add/remove reshaping) target arguments
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
(Change to recent commit 3f4ecaf8c2.)
Use --foo Number[k|UNIT] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Previously, underlined/italic Unit was used,
like other of variables, but this UNIT is more
like a shortcut than an actual variable.
The MD kernel raid1 personality does no use any writemostly leg as the primary.
In case a previous linear LV holding data gets upconverted to
raid1 it becomes the primary leg of the new raid1 LV and a full
resynchronization is started to update the new legs.
No writemostly and/or writebehind setting may be allowed during
this initial, full synchronization period of this new raid1 LV
(using the lvchange(8) command), because that would change the
primary (i.e the previous linear LV) thus causing data loss.
lvchange has a bug not preventing this scenario.
Fix rejects setting writemostly and/or writebehind on resychronizing raid1 LVs.
Once we have status in the lvm2 metadata about the linear -> raid upconversion,
we may relax this constraint for other types of resynchronization
(e.g. for user requested "lvchange --resync ").
New lvchange-raid1-writemostly.sh test is added to the test suite.
Resolves: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855895
We use --foo Number[k|Units] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Capitalize and underline Units so it is consistent
with other variables, and reference it at the end.
Technically, the k should be bold, but this
tends to make the text visually hard to read
because of the excessive highlights scattered
everywhere. So it's left normal text for now
(it's unlikely to confuse anyone.)
Print ... after a repeatable option in the OPTIONS section.
An alternative would be to just mention in the text description
that the option is repeatable.
When the test would need to try to write some large amount of data
we can give it 'zero' segments - for obvious reason such written data
can't be verified but in some test cases it doesn't really matter.
Usage follows 'error_dev' style.
For now test suite is not supporting any combination of
error/delay/zero segments so only 1 type could be used per PV.
Previously when lvremove tried to remove 'active' origin,
it had been asking for every 'snapshot' LV separately
and doing individual single snapshot removals first.
To be faster it now deactivates origin before removal
all connected snapshots.
This avoids multiple reloads of dm table for origin volume
which were unnecessary as origin was meant to be removed as well.
There are two kinds of common options:
1. options common to all variants of a given command name
2. options common to all lvm commands
Previously, both kinds of common options were listed together
under "Common options". Now the first are printed under
"Common options for command" (when needed), and the second
are printed under "Common options for lvm" (always).
Remove the "usage notes" which should just
live in the man pages.
When there are 3 or more variants of a command,
print all the options produces a lot of output,
so require --longhelp to print all the options
in these cases.
Check 'dmsetup version' is called before starting any more
advanced logic in $DM_DEV_DIR.
Call also replaces mkdir as it creates needed path with control node.
The --type option has previously been accepted for
lvresize/lvextend. Using it did not affect the operation
of the command. The value was simply verified as matching
the current seg type of the LV.
Commit f45b689406 caused regression
of lvresize -m and --type parameter
After fix this sequence may work when we also fix syntax description:
lvcreate -l1 -m1 -n lv1 vg
lvextend --type mirror -m1 -l+1 vg/lv1
For this syntax:
lvconvert --thinpool LV1 --poolmetadata LV2
lvconvert --cachepool LV1 --poolmetadata LV2
Restore the metadata swapping behavior in addition to
the pool creation behavior. When LV1 is already a pool,
the metadata LV will be swapped with LV2.
When LV1 is not a pool, it will be converted to a
pool using the specified LV for metadata.
This syntax is no longer advertised because of the
ambiguous behavior. The primary syntaxes for pool
creation and metadata swapping will be the advertised
methods.
As we now user binary search - it's nondeterministict
which of the same 'args' will be give - so duplicates
need 'extra' care.
So provide same hack for output for --uuidstr_ARG as
for input.
Solves 'pvscan -u'.
Since there is a lot of options and lot of searches,
use binary search to keep strcmp at minimum.
The interesting part is - alphabetically sorted array contains
duplicates and some of them are not the 'right anwer', so
after we find matching string but not matching long_ARG,
we may need to check if the surrouding strings are the right matching
one.
The single loops is used also for strictly define --foo_long
(i.e. --stripes) and just differs at final part.
TODO1: replace strstr call with some flag (just like short_opt).
TODO2: drop '--' from being stored and tests by strcmp.
When parsing command defs, track and report all
errors that are found. Add an error return case
from define_commands so the standard error exit
path is used.
When using liblvm2cmd, a process can do multiple
init/exit calls, i.e.
lvm2_init(); lvm2_run(); lvm2_exit();
lvm2_init(); lvm2_run(); lvm2_exit();
...
define_commands() needs to set up the global commands[]
definitions only the first time.
The old ad hoc arg parsing when combining a split snapshot
allowed the first lv to have a vgname, but not the second.
Since lvconvert now uses the standard arg parsing in
process_each_lv(), the old one-off behavior requires a
work around.
This reverts commit 717363bb94.
These alternate forms for swapping metadata cannot be
distinguished from the command for creating a pool.
If we were to add these alternate forms for swapping
metadata, we would need to overload the pool creation
command defs, making those definitions ambiguous.
Change run time access to the command_name struct
cmd->cname instead of indirectly through
cmd->command->cname. This removes the two run time
fields from struct command.
All lvconvert functionality has been moved out of the
previous monolithic lvconvert code, except conversions
related to raid/mirror/striped/linear. This switches
that remaining code to be based on command defs, and
standard process_each_lv arg processing. This final
switch results in quite a bit of dead code that is
also removed.
This is a new explicit version of 'lvconvert LV'
which has been an obscure way of triggering polling
to be restarted on an LV that was previously converted.
Lift all the snapshot utilities (merge, split, combine)
out of the monolithic lvconvert implementation, using
the command definitions. The old code associated with
these commands is now unused and will be removed separately.
This lifts the lvconvert --repair and --replace commands
out of the monolithic lvconvert implementation. The
previous calls into repair/replace can no longer be
reached and will be removed in a separate commit.
The new check_single_lv() function is called prior to the
existing process_single_lv(). If the check function returns 0,
the LV will not be processed.
The check_single_lv function is meant to be a standard method
to validate the combination of specific command + specific LV,
and decide if the combination is allowed. The check_single
function can be used by anything that calls process_each_lv.
As commands are migrated to take advantage of command
definitions, each command definition gets its own entry
point which calls process_each for itself, passing a
pair of check_single/process_single functions which can
be specific to the narrowly defined command def.
. Define a prototype for every lvm command.
. Match every user command with one definition.
. Generate help text and man pages from them.
The new file command-lines.in defines a prototype for every
unique lvm command. A unique lvm command is a unique
combination of: command name + required option args +
required positional args. Each of these prototypes also
includes the optional option args and optional positional
args that the command will accept, a description, and a
unique string ID for the definition. Any valid command
will match one of the prototypes.
Here's an example of the lvresize command definitions from
command-lines.in, there are three unique lvresize commands:
lvresize --size SizeMB LV
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync, --reportformat String, --resizefs,
--stripes Number, --stripesize SizeKB, --poolmetadatasize SizeMB
OP: PV ...
ID: lvresize_by_size
DESC: Resize an LV by a specified size.
lvresize LV PV ...
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --resizefs, --stripes Number, --stripesize SizeKB
ID: lvresize_by_pv
DESC: Resize an LV by specified PV extents.
FLAGS: SECONDARY_SYNTAX
lvresize --poolmetadatasize SizeMB LV_thinpool
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --stripes Number, --stripesize SizeKB
OP: PV ...
ID: lvresize_pool_metadata_by_size
DESC: Resize a pool metadata SubLV by a specified size.
The three commands have separate definitions because they have
different required parameters. Required parameters are specified
on the first line of the definition. Optional options are
listed after OO, and optional positional args are listed after OP.
This data is used to generate corresponding command definition
structures for lvm in command-lines.h. usage/help output is also
auto generated, so it is always in sync with the definitions.
Every user-entered command is compared against the set of
command structures, and matched with one. An error is
reported if an entered command does not have the required
parameters for any definition. The closest match is printed
as a suggestion, and running lvresize --help will display
the usage for each possible lvresize command.
The prototype syntax used for help/man output includes
required --option and positional args on the first line,
and optional --option and positional args enclosed in [ ]
on subsequent lines.
command_name <required_opt_args> <required_pos_args>
[ <optional_opt_args> ]
[ <optional_pos_args> ]
Command definitions that are not to be advertised/suggested
have the flag SECONDARY_SYNTAX. These commands will not be
printed in the normal help output.
Man page prototypes are also generated from the same original
command definitions, and are always in sync with the code
and help text.
Very early in command execution, a matching command definition
is found. lvm then knows the operation being done, and that
the provided args conform to the definition. This will allow
lots of ad hoc checking/validation to be removed throughout
the code.
Each command definition can also be routed to a specific
function to implement it. The function is associated with
an enum value for the command definition (generated from
the ID string.) These per-command-definition implementation
functions have not yet been created, so all commands
currently fall back to the existing per-command-name
implementation functions.
Using per-command-definition functions will allow lots of
code to be removed which tries to figure out what the
command is meant to do. This is currently based on ad hoc
and complicated option analysis. When using the new
functions, what the command is doing is already known
from the associated command definition.
Some archs can use even 64K pages and then lvm2 runs into trouble if
the stack is 'too small' to fit extra page capturing stack overwrite.
So when lvm2 limits stack - add extra mem page - be it 4K or 64K.
Relates to ppc64le bug: https://bugzilla.redhat.com/1387279
Remove allocate_pvs from raid_manip.c:_region_size_change_request() API
and lv_extend() using it added for temporary test purpose.
Related: rhbz1366296
Add:
- region size checks to raid_manip.c types array and supporting functions
- tests to lvconvert-raid-takeover.sh to check bogus
"lvconvert --type --regionsize N " requests
Related: rhbz1366296
When we preload device with smaller size, we avoid its resume,
so later suspend/resume of full device tree my process all
existing in flight bios.
Also update comment and avoid using confusing opposite meaning.
Kernel 4.10 (dm-crypt v1.15.0) and later supports loading device
tables with crypt segment having key in kernel keyring retention
service.
dmsetup hid key section of tables output. With this patch dmsetup
no longer hides key section if it detects kernel key description
instead of hex byte representation of key itself.
Add:
- conversion support from striped/raid0/raid0_meta to/from raid10;
raid10 goes by the near format (same as used in creation of
raid10 LVs), which groups data copies together with their original
blocks (e.g. 3-way striped, 2 data copies resulting in 112233 in the
first stripe followed by 445566 in the second etc.) and is limited
to even numbers of legs for now
- related tests to lvconvert-raid-takeover.sh
- typo
Related: rhbz1366296
- support shrinking of raid0/1/4/5/6/10 LVs
- enhance lvresize-raid.sh tests: add raid0* and raid10
- fix have_raid4 in aux.sh to allow lv_resize-raid.sh
and other scripts to test raid4
Resolves: rhbz1394048
Commit cfb6ef654d introduced
support to change RAID region size.
Fix:
- don't change region_size until after prompting the user
- use log_print_unless_silent() instead of log_warn()
- avoid superfluous sigint() calls which are already
covered in yes_no_prompt()
- typo
Related: rhbz1392947
Commit cfb6ef654d introduced
support to change RAID region size.
Add:
- missing conditions to support any types to function with
it in lv_raid_convert(); temporary workaround used until
cli validation patches get merged
- tests requesting "-R " to lvconvert-raid-takeover.sh
involving a cleanup of the script
Related: rhbz1392947
Cleanups as of Jons review:
- enhance comment about mandatory raid4 <-> raid5_n activation w/o metadata SubLVs
- remove bogus segment flag setting
- fix to sync related comments on conversions to raid0/striped and amongst raid4/5
- add missing error message for non-synced conversion to raid0/striped
Related: rhbz1366296
Add:
- support to change region size of existing RaidLVs
(all RAID LV types but raid0/raid0_meta)
- lvconvert-raid-regionsize.sh with test variations
for different RAID types and region sizes
Resolves: rhbz1392947
Well waiting for zeroing may take enough time to finish 'raid' sync.
So make the test running faster without zeroing and better avoid race
to have chance to happen (i.e. lvcreate is finished after array
gets already in sync).
Add:
- support for segment types raid6_{ls,rs,la,ra}_6
(striped raid with dedicated last Q-Syndrome SubLVs)
- conversion support from raid5_{ls,rs,la,ra} to/from raid6_{ls,rs,la,ra}_6
- setting convenient segtypes on conversions from/to raid4/5/6
- related tests to lvconvert-raid-takeover.sh factoring
out _lvcreate,_lvconvert funxtions
Related: rhbz1366296
Add:
- support for segment type raid6_n_6 (striped raid with dedicated last parity/Q-Syndrome SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid6_n_6
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add:
- support for segment type raid5_n (striped raid with dedicated last parity SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid5_n
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add a new update_filemap command to dmstats that allows a filemap
group to be updated:
# dmstats update_filemap --groupid 0 vm.img
/var/lib/libvirt/images/vm.img: Updated group ID 0 with 137 region(s).
This will update the set of regions mapped to the file to reflect
the current file system allocation.
Currently this needs to be run manually - a future update will add
support for monitoring file maps via a daemon, allowing them to be
automatically updated when the underlying file is modified.
Add a call to update the regions corresponding to a file mapped
group of regions. The regions to be updated must be grouped, to
allow us to correctly identify extents that have been deallocated
since the map was created.
Tables are built of the file extents, and the extents currently
mapped to dmstats regions: if a region no longer has a matching
file extent, it is deleted, and new regions are created for any
file extents without a matching region.
The FIEMAP call returns extents that are currently in-memory (or
journaled) and awaiting allocation in the file system. These have
the FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC flag bits set
in the fe_flags field - these extents are skipped until they
have a known disk location.
Since it is possile for the 0th extent of the file to have been
deallocated this must also handle the possible deletion and
re-creation of the group leader: if no other region allocation
is taking place the group identifier will not change.
If the group_id passed to _stats_group_id_present is equal to the
special value DM_STATS_GROUP_NOT_PRESENT there is no need to perform
any further tests: return false immediately.
For more advanced support we need to ensure better logic for calling
external much more advanced script for maintanance of thin-pool.
So this new code ensures:
When thin-pool data or metadata is bigger then 50%,
then with each 5% increment, action is called.
This is independent from autoextend_threshold.
This action always happens when thin-pool is over threshold,
(so no action when it's exactly i.e. 60%).
The only exception is 100% full thin-pool - which invokes 'last'
action.
Since thin-pool occupancy may change also downward, code needs
to also handle possibly reduction of occupancy of thin-pool.
So when usage drop from 90% to 50%, thin-pool will start to call
again action when it will pass 55% threshold.
This give external commands lot of option i.e. to call 'fstrim'
before actual resize is needed.
Default internal logic will stop trying to do any 'rescue' action
when executed command fails.
This will be now fully in hands of external script if such
behaviour is needed.
Instead of stopping monitoring after couple failing retries,
keep monitoring forever, just make larger delays between command
retries (ATM upto ~42 minutes).
So syslog is not spammed too often, yet commands have a chance to
be retried and succeed eventually...
When dmeventd configured command does not start with 'lvm ' prefix,
it's going to be an 'external' command.
In this case we split command by spaces to argv strings.
When showing sizes with 'H|human' units we do use standard rounding.
This however is confusing users from time to time,
when the printed number uses some biger units i.e. GiB and there is just
tiny fraction of space missing.
So here is some real-life example with new 'r' unit.
$lvs
LV VG Attr LSize Pool Origin
lvol0 vg -wi-a----- 1.99g
lvol1 vg -wi-a----- <2.00g
lvol2 vg -wi-a----- <2.01g
Meaning is - lvol1 has 'slightly' less then 2.00g - from sign '<' user
can be aware the LV doesn't have full 2.00GiB in size so he
will be less surpriced allocation of 2G volume will not succeed.
$ vgs
VG #PV #LV #SN Attr VSize VFree
vg 2 2 0 wz--n- <6,00g <2,01g
For uses needing 'old' undecorated human unit simply will continue
to use 'H|h' units.
The new R|r may further change when we would recongnize some
other way how to improve readability.
With recent commit d6a74025df using
INTERNAL_ERROR while cheking layer LV - it's been noticed mirror
logic currently doesn't do a correct thing during upconversion and
does a full-try instead of checking only allocator capabilities.
This leads to invalid usage of layer.
To keep existing code running before providing a fix, relax
INTERNAL_ERROR just an error and keep the 'code' running.
Once mirror code is fixed, these all check should be switched
to internal errors.
Since we still experience occasiaonal test failure - slow
things down even more to avoid race.
Add support for 'quick' table changes between normal & delayed tables.
This was missing piece in 77997c7673.
When merging origin is inactive (while driver is loaded) we
could already report merge in progress values as there is
no way to activate 'old state' now.
Show proper internal error for failing command when there are some
inconsitencies in sizes of LV and its layer instead of rather
meaningless error code 5.
(Could be hit i.e. if user tried to 'resize' cached LV and then
uncache such LV.)
During rework of resize code this validation check
has been lost (in my resize branch). Upstream
is still not supporting resize of any cache type LV
so needs to be prevented.
When we need to clear dirty cache content of cached LV, there
is table reload which usually is shortly followed by next metadata
change. However udev can't (as of now) process udev event
while device is 'suspended'.
So whenever sequence of 'suspend/resume/suspend' is needed,
we need to wait first for finishing of 'resume' processing before
starting next 'suspend'. Otherwise there is 'race' danger of triggering
unwantend umount by systemd as such event will trigger
SYSTEMD_READY=0 state for a moment for such changed device.
Such race is pretty ugly to trace so we may need to review more
sequencies for missing 'sync'.
(Other option is to enhnace 'udev' rules processing to avoid
such dramatic actions to be happening for suspended devices).
Solves: https://bugzilla.redhat.com/1280496
The only reasonable behaviour here is to error on
any number out of accepted range (i.e. now numbers
wrapping around with some hidden logic).
As this is plain bug there is no support for
backward compatibility since noone should
set numbers >UINT32_MAX and expect 0 or error
depending on how big number was used....
TODO: more fields might need to be converted.
Add simple function to wrap usage for only uint32 numbers.
Unlike 'int_arg' which accepts full range of 64bit number
this function will error on numbers out of this range:
<0, UINT32_MAX>
Add to commits 87117c2b25 and 0b8bf73a63 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Resolves: rhbz1025322
Related: rhbz1265191
Related: rhbz1399844
Related: rhbz1404425
When thin-pool processes event and 'lvextend --use-policies' fails
rather capture up-to-date new info as the fullness percentage may
have jumped noticable. This way we could use 'more' correct numbers
when checking for thresholds.
When there is 'merging' of an origin in progress, but metadata stil
do provide both origin and snapshot, we should show data from merged
snapshot. This is important mainly for thin case, where there was
a window, where i.e. 'lvs -o+device_id' would report information
about 'already gone' origin thin LV.
This race window is usually hard to trigger but can be ocasionally hit.
Usually shortly after activation, but before polling process manages
to update metadata after merge.
Before starting polling process, validate the merge has actually started
so there is not pointless invoke of lvmpolld.
This also fixes reported message from command, so user has
correct info whether merging has already started or
if it's delayed for next activation.
Move individual segment validation to a separate function
executed for 'complete_vg'.
Move some 'extra' validation bits from 'raid' validation to global
segtype validation (so extending existing normal validation)
TODO: still some test are left to be moved.
Reduce some duplication in validation process - there are still
some left thought so still room for improving overal speed.
The function timeout_add_seconds has quite a bit of variability. Using
timeout_add which specifies the timeout in ms instead of seconds. Testing
shows that this is much more consistent which should improve clients that
are using shorter timeouts for the API and the connection.
We can't keep 'display_lvname' for too long - it's using
ringbuffer and keeps limited number of names. So it's
safe only per few simple tests, but can't be used anymore
after some function calls..
(Fixes 00e641ef37)
Call _stats_regions_destroy() from dm_stats_list() if dms->regions
is non-NULL. This avoids leaking any pool allocations and ensures
the handle is in a known state: if an error occurs during the list,
dms->regions will be NULL and the handle will appear empty.
It could be actually better to use even cache origin in
read-only mode so there could no be some 'acidental'
change being done on this volume.
This however need further tools enhancment - where we would need
to handle whole subtree on 'lvchange -pr/-prw'.
When command calls backup() more then once (which is actually not
wanted) this warning message is shown repeatedly:
"WARNING: This metadata update is NOT backed up."
Instead now print message just once and less confuse user.
Add this functionality to lvconvert:
'lvconvert --thin cachedLV --thinpool vg/poll'
Converts cachedLV to external origin (which will be read-only).
New thin volume is created in thinpool LV and it's using external
origin as source for unprovisioned chunks.
This conversion happens online (while volume is in use).
Thin LV remains fully writable.
Cached external origin no longer could be written so cache will be used
ONLY for read operations. For this limitation we require cache mode
to be writethrough (as writeback cannot write to read-only volumes).
When thinLV is later removed cached external origin is again
fully usable, just note, LV remain in 'read-only' mode.
When read-write is needed, 'lvchange -prw' has to be used.
Single external origin could be user by multiple thinLV in
multiple differen thin pool.
When cache volume may be converted from normal to -real layer LV
we need to improve logic for call cache_check.
With this patch, we register call for cache_check only when metadata LV
is not yet present in active table slot (should match initial table
load).
This avoids unwanted checking when cache would become layer device
online.
The system is likely in some very inconsisten state.
Do not try to make it even more problematic with trying
to invoke tools like thin_check via callback.
External origin could be reloaded via more locks.
It's actually even more complex then thin-pool,
as it may be active on more nodes for linear LVs
(and maybe even more types).
External origin is always read-only thus unmodifiable
device so there should not be a problem accesing it
through multiple nodes.
Also for thin-pool check first presence of active thin-pool.
FIXME:
It's not easy to detect on which nodes this device is active
Thus manipulation with such device may require checking every
node and it active state and refresh.
But since such setup is quite complex to prepare and use,
hopefully there are not user trying to 'explore' this usage yet.
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
Avoid printing misleading message about single dirty block.
Instead properly detect condition where the 'cleaner' policy
needs to be installed without 'overloading' dirty variable.
Also print warning if we would be clearing read-only volume.
(it really shouldn't happen).
External origin could be activated as stand-alone device.
When the last thin LV is removed, external origin is no longer
the external origin and it's layer property was dropped.
Ensure dm table is correct by reloading external origin
(when it's active).
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We do the same for i.e. old snaphost origin.
Activation of raid has brough up also splitted image with tracing
(without taking lock for this).
So when raid is now activate - such image is not put into
table (with _rmeta). When user needs such device, just active it.
When --count=0 interval numbers are miscalculated:
Interval #18446744069414584325 time delta: 999920887ns
Interval #18446744069414584325 current err: -79113ns
End interval #18446744069414584325 duration: 999920887ns
This is because the command line argument is cast through the
uint32_t type, and fixed to UINT32_MAX:
_count = ((uint32_t)_int_args[COUNT_ARG]) ? : UINT32_MAX;
We also need to handle --count=0 specially when calculating the
interval number: since intervals count from #1, this must account
for the implicit "minus one" when converting from zero to the
UINT64_MAX value used (which is too large to store in _int_args).
The time management code mixes tests of the _timer_fd value with
code that should be timer agnostic: this causes problems for users
of the usleep() timer, since it cannot properly detect the start
of a new interval:
Beginning first interval
Interval #18446744069414584348 time delta: 1000000000ns
Interval #18446744069414584348 current err: 0ns
End interval #18446744069414584348 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
[...]
Beginning first interval
Interval #18446744069414584349 time delta: 1000000000ns
Interval #18446744069414584349 current err: 0ns
End interval #18446744069414584349 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
Separate these out, by defining a _timer_running() call that each
timer implements, and only define _timer_fd if we are compiling
with TIMERFD enabled.
Although the usleep() interval timer is not used if the Linux
TIMERFD interface is available it should still provide reasonably
good timing.
Instead of trying to estimate the error from the duration of the
last sleep, peg it to the start time of the program, and use the
value of ((start_time - now) % interval) to correct the current
interval duration.
This always pulls us back into sync at the end of each interval,
rather than relying on trying to incrementally adjust the time
duration at each interval start.
This greatly reduces drift when the usleep() clock is used.
The dm_bit_copy() macro uses the source (bs1) bitset size as the
limit for memcpy:
memcpy((bs1) + 1, (bs2) + 1, ((*(bs1) / DM_BITS_PER_INT) + 1)..)
This is safe if the destination bitset is smaller than the source,
or if the two bitsets are of the same size.
With a destination that is larger (e.g. when resizing a bitmap to
add more capacity), the memcpy will overrun the source bitset and
set garbage bits in the destination.
There are nine uses of the macro currently (8 in libdm/regex, and
1 in daemons/cmirrord): in each case the two bitsets are always of
equal size so the behaviour is unchanged.
Fix the macro to use bs2's size to simplify resizing bitsets and
avoid the need for another copy macro.
Commit 0690392040 revealed a problem
in raid metadata manipulation.
We do two operations in one table reload:
- raid leg/image extraction
- rename remaining raid legs
This should be made in separate steps. Otherwise we do an
uncorrectable table change on error path (leaving tables
for admin and dmsetup).
As a hotfix - restore the previous logic and use a single
new function _lv_update_and_reload_list which activates exclusively
extracted LVs on the list before resuming suspended raid LV.
This restore 'rename' functionality upon resume.
Also still preserve the 'origin_only' logic - although we know
it's not working properly for cluster and LV stacking.
Further fixes are needed.
If FIEMAP returns a single extent after the first call, no extent
boundary is detected and the first extent is not counted by the
normal mechanism.
In this case, increment nr_extents at the same time the extent is
added to the region table, before returning.
backup is not 'tested' for success and also it should
actually happen just when command is finished.
We do not target to make backups with each inter-step
metadata change.
RAID is LV property
TODO: only 2 flags are seg->status: PVMOVE & MERGING
At least the second one should be soon elimanted as again
we merge LV not a segment.
This is another place for 'common' use pattern or
reload and activation of deleted devices.
(Moving the exclusive activation to _deactivate_and_remove_lvs()).
TODO: looks like halve of raid function is reloading
just 'origin' - and the other full LV.
It's useful to be able to specify a minimum number of bits for a
new bitmap parsed from a list, for e.g. to allow for expansing a
group without needing to copy/reallocate the bitmap.
Add a backwards compatible symbol for programs linked against old
versions of the library.
It is sometimes convenient to iterate over the set bits in a dm
bitset in reverse order (from the highest set bit toward zero), or
to quickly find the last set bit.
Add dm_bit_get_last() and dm_bit_get_prev(), mirroring the existing
dm_bit_get_first() and dm_bit_get_next().
dm_bit_get_prev() uses __builtin_clz when available to efficiently
test the bitset in reverse.
Add a macro for the clz (count leading zeros) operation.
Use the GCC __builtin_clz() for clz() if it is available and fall
back to a shift based implementation on systems that do not set
HAVE___BUILTIN_CLZ.
Split out the loop that iterates over each batch of FIEMAP
extent data from the function that sets up and calls the ioctl
to reduce nesting and simplify local variable use:
_stats_get_extents_for_file()
-> _stats_map_extents()
The _stats_map_extents() function is responsible for detecting
eof and extent boundaries and adding whole, allocated extents
to the file extent table for region creation.
Check that all region_id values specified in a group bitmap are
actually present: although this should not normally happen when
using the dmstats tool, it is possible as a result of manual
changes (or bugs) for a group descriptor to contain one or more
group_id values that do not exist.
Check for this situation when reading group descriptors, warn
the user the user, and clear these bits in the bitmap when
formatting it for output.
If a region has a a DMS_GROUP tag in aux_data where the first
region_id in the bitmap is not the same as the containing region,
dmstats will segfault:
# '2' is never a valid group bitset list for region_id == 0
# dmsetup message vg_hex/root 0 "@stats_set_aux 0 DMS_GROUP=img:2#"
# dmsetup message vg_hex/root 0 "@stats_list"
0: 45383680+16384 16384 dmstats DMS_GROUP=img:2#
1: 46071808+32768 32768 dmstats -
2: 47382528+16384 16384 dmstats -
# dmstats list
Segmentation fault (core dumped)
The crash will occur in some arbitrary dm_stats_get_* property
method - this happens while processing the 1st region_id in the
bitset, because the region is marked as grouped, but there is
no group bitmap present at dms->groups[2]->regions.
Fix this by detecting a mismatch between the expected region_id
and dm_bit_get_first() for the parsed bitset during
_parse_aux_data_group().
Handle files that contain multiple logical extents in a single
physical extent properly:
- In FIEMAP terms a logical extent is a contiguous range of
sectors in the file's address space.
- One or more physically adjacent logical extents comprise a
physical extent: these are the disk areas that will be mapped
to regions.
- An extent boundary occurs when the start sector of extent
n+1 is not equal to (n.start + n.length).
This requires that we accumulate the length values of extents
returned by FIEMAP until a discontinuity is found (since each
struct fiemap_extent returned by FIEMAP only represents a single
logical extent, which may be contiguous with other logical
extents on-disk).
This avoids creating large numbers of regions for physically
adjacent (logical) extents and fixes the earlier behaviour which
would only map the first logical extent of the physical extent,
leaving gaps in the region table for these files.
To be able to detect lvm2 command is not leaking some
'unexpected' device - remove all devices before
test exits by its own command so test teardown
now can check what was 'left' unexpectedly.
Fix order of operation when converting raid1 into old mirror.
Before any later metadata modification are initiated prepare
mirror_log device with all clearing.
Then directly convert raid1 into mirror with mirror_log.
This convertion now properly see as precommitted metadata
new 'mirror' and committed old 'raid' and is able to
preload all LVs.
When mapping regions to a file descriptor, a temporary table of
extent descriptors is built using the dm_pool object building
interface.
Previously this use borrowed the dms->mem region and counter
table pool (since nothing can interleave with the allocation
while the caller is still in dm_stats_create_regions_from_fd()).
This turns out to be problematic for error recovery. When a
region creation operation fails partway through file mapping,
we need to roll back the set of already created regions and
this requires a listed handle: the dm_stats_list() will then
allocate from the same pool as the extents; we either have
to throw away valid list data, or leak the extent table, to
return the handle in a valid state.
Avoid this problem by creating a new, temporary mem pool in
_stats_create_file_regions() to hold the extent data, and
discarding it on exit from the function.
While cleaning up the table of already created regions during a
failed dm_stats_create_regions_from_fd(), list the handle once,
and call _stats_delete_region() directly. This avoids sending a
@stats_list message for each region deleted, reducing runtime
from 6s to 0.7s when cleaning up ~250 out of ~10000 regions:
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
<< pauses here >>
Command failed
real 0m6.267s
user 0m3.770s
sys 0m2.487s
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
Command failed
real 0m0.716s
user 0m0.034s
sys 0m0.581s
Testing the error path requires region creation to start to
fail part way through the operation (in order to have regions
to clean up): the simplest way is to ensure the system is
close to the kernel limit of 1/4 RAM or 1/2 vmalloc space
consumed by dmstats data.
Split dm_stats_delete_region() so that internal callers can manage
the handle state themselves.
dm_stats_delete_region() now just handles checking the state of the
handle, reporting validation errors, and calling dm_stats_list() if
necessary, before calling _stats_delete_region().
The new _stats_delete_region() function performs the actual group
member removal and region deletion, and requires a fully listed
handle to operate.
Callers that repeatedly delete regions can use a single listed
handle for many operations on the same device, avoiding one
message ioctl per region deleted: since @stats_list with many
regions is expensive, this yields large runtime improvements.
If we fail to create a region during dm_stats_create_regions_from_fd(),
we must remove all regions that were created to do this to date. This
needs to loop over the table of region_id values that were populated
by _stats_create_file_regions() before the error.
The code for this failure case in the out_remove branch incorrectly
uses the table index as the region_id:
for (--i; i != DM_STATS_REGION_NOT_PRESENT; i--) {
if (!dm_stats_delete_region(dms, i))
log_error("Could not delete region " FMTu64 ".", i);
}
This causes the cleanup code to delete a completely unrelated set
of regions (since the index here will always be nr_regions..0).
Fix it to pass the actual region_id stored in regions[i] instead.
Fix a silly bug in dm_stats_delete_region() that hugely inflates
runtimes when deleting a large number of regions.
For ~50,000 regions this change reduces the runtime from 98s to
6s on my test systems (a ~93% reduction).
The bug exists because dm_stats_delete_region() applies a truth
test to the return value of dm_stats_get_nr_areas(); this is
never correct usage - it will walk the entire region table and
calculate area counts for each region (which is roughly O(n^2)
in the number of regions, as dm_stats_delete_region() is being
called inside a region walk).
Although the individual area calculation is not that costly,
uselessly running anything 2,500,000,000 times over gets a bit
slow.
A much cheaper test (which is always true if the areas check is
true) is to just test dm_stats_get_nr_regions() or dms->regions;
if either is true it implies at least one area exists.
Old:
Performance counter stats for 'dmstats delete --allregions --alldevices':
98117.791458 task-clock (msec) # 1.000 CPUs utilized
127 context-switches # 0.001 K/sec
3 cpu-migrations # 0.000 K/sec
6,631 page-faults # 0.068 K/sec
307,711,724,562 cycles # 3.136 GHz
544,762,959,577 instructions # 1.77 insn per cycle
84,287,824,115 branches # 859.047 M/sec
2,538,875 branch-misses # 0.00% of all branches
98.119578733 seconds time elapsed
New:
Performance counter stats for 'dmstats delete --allregions --alldevices':
6427.251074 task-clock (msec) # 1.000 CPUs utilized
6 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
6,634 page-faults # 0.001 M/sec
21,613,018,724 cycles # 3.363 GHz
3,794,755,445 instructions # 0.18 insn per cycle
852,974,026 branches # 132.712 M/sec
808,625 branch-misses # 0.09% of all branches
6.428953647 seconds time elapsed
Simplify info run for use only for INFO & STATUS.
Drop handling MKNODES within _info_run() call
and use more advanced _setup_task_run() directly.
This allows to further simplify _info_run().
Integrate also query for inactive table and
handle dm_task_run() and dm_task_get_info()
(thus switching to setup_task_run)
Add one exception case for DM_DEVICE_TARGET_MSG.
This allows further shortening and simplification of all
other users of this function.
It's actually not needed to call extra lv_has_target_type() to detect
snapshot merge is in progress - decode this right during status
capturing and save even few extra ioctl calls.
Drop LV from passed API arg - it's always segment being checked.
Also use_layer is now in full control of lv_info_with_seg_status().
It decides which device needs to be checked to get 'the most info'.
TODO: future version should be able to expose status from
Start moving selection of status taken for a LV into a single place.
The logic for showing info & status has been spread over multiple
places and were doing too complex decision going agains each other.
Unify selection of status of origin & cow scanned device.
TODO: in future we want to grab status for LV and layered LV and have
both statuses present for display - i.e. when 'old snapshot'
of thinLV is takes and there is ongoing merge - at some moment
we are not capable to show all needed info.
When lvm2 wants to see a status, it needs to validate,
segment for status reading is matching whan lvm2 expects in
metadata.
Also ensure status failure will not cause '0' from info reading
when actual info was collected properly.
Failure in 'status' reading is considered to be
a 'log_warn()' event only.
When we can't parse status, switch to warning as this is not
considered an errornous case. LVS is not supposed to return
error status code when device is not what it's been expected to
be - but it should be WARNING a user there is something unexpected.
Convert lvs -o lv_merge_failed,lv_snapshot_invalid to use
lv_info_and_status function.
This makes it equal to attr value showing this info
(as they were different since they were derived from
different data set and different logic as well).
Also saves couple extra ioctl that were needed to obtain this info.
When displaying <reporting_command> -o help, we'd like to have fields
grouped nicely, not starting having groups interleaved as it was before.
The code that displays the help output for fields takes the order as
written in columns.h file - this caused output like:
$ lvs -o help
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Instead, let's have it without groups interleaved which may be
a bit confusing, so:
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
..and so on.
Use LVM_DBUSD_TEST_MODE env variable to customize what we test.
Default is the same where we try to test all combinations of all
modes. Renamed to make it consistent with the other env variables
that are used in the unit test.
- We check that all properties match the introspection data. We
don't verify values for every property as only lvm knows what they
should be.
- We are testing vg.Move
Added a properties changed signal on the job dbus object so that client
can wait for a signal that the job is complete instead of polling or
blocking on the wait method.
Allows the user to override the number of commands that get dumped
to the log when we encounter a lvm error. Also useful during
development when you don't want to see the blackbox output.
In case any SubLV of a RaidLV transiently fails, it needs
two "lvchange --refresh RaidLV" runs to get it to fully
operational mode again. Reason being, that lvm reloads all
targets for the RaidLV tree but doesn't resume the SubLVs
until after the whole tree has been reloaded in the first
refresh run. Thus the live mapping table of the SubLVs
still point to an "error" mapping and the dm-raid target
can't retrieve any superblock from the MetaLV(s) in processing
the constructor during this preload thus not discovering the
again accessible SubLVs. In the second run, the SubLV targets
map proper (meta)data, hence the constructor discovers those
fine now.
Solve by resuming the SubLVs of the RaidLV before
preloading the respective top-level RaidLV target.
Resolves: rhbz1399844
When reading data from stdout & stderr we were reading until the
reading until we got None back which really isn't needed as the
read will return everything that is available.
Avoid code duplication and use exiting commonly used
lv_update_and_reload() function.
There is still one place left where mirror is doing strange
double suspend call - needs there more thinking what's wrong with
that code.
When lvconvert adds a new leg - it's doing it free 'temporary' image
layer - however this temporary 'internal' mirror is also MIRRORED LV.
But the status bit was not properly transfered through layer.
We need to acquire a lock which can block us which in turn causes
the dbus request handling to block as well. Place the request on
the work queue instead.
Our expectation was that when using the lvm shell that when the lvm prompt
was read from stdout, that all other ouput had been written and flushed.
However, this doesn't appear to be the case. Add extra read passes to
retrieve delayed report data.
The default dbus python library mode of operation is to leverage
introspection. However, this introspection data isn't accessible
for users of the library and they have to specifically retrieve
the introspection data too. This resulted in many introspection
calls being made. This change eliminates introspection calls if
we are testing multiple concurrent test clients. If it's a single
client we will leverage a reduced amount of introspection data to
verify the introspection data is correct. Typically clients don't
leverage introspection data nearly as much as this test client.
The env variable LVM_DBUSD_PV_DEVICE_LIST when present and filled in
with at least 4 physical devices will run concurrently with other
instances running as long as they specify different devices in their
env variable.
When the env variable is not present the test runs as it did before.
In preparation to have more than one thread issuing commands to lvm
at the same time we need to serialize updates to the dbus state and
retrieving the global lvm state. To achieve this we have one thread
handling this with a thread safe queue taking and coalescing requests.
Looks like this isn't support across versions. Need to add functionality
to service to return the supported segment types, so we only use the
supported ones.
There are two possible errors in _dm_stats_populate_region():
* No region struct in dms->regions[region_id]
* Failure to parse data from @stats_print
These have very different causes: the first occurs where a client
program is populating one region at a time (region_id is a single
region identifier), and has not previously called dm_stats_list()
to dimension the region tables; this is an API usage error.
The second occurs when either we read unparseable data from the
kernel (kernel bug), or where various resource allocations fail.
Separate these two cases out and log separate messages for each
(allocation failures in the path already have their own distinct
message), since the "failed to parse.." message in the un-listed
handle case is confusing and misleading.
Do not emit warning message but only log debug message if
lvm2-lvmdbusd.service unit is missing and at the same time
we have global/notify_dbus=1 (which is used by default if we
configured sources with "--enable-notify-dbus"). We don't want
hard dependency between LVM2 and lvmdbusd so it's enough to log
only debug message in this case.
0 interval leads as of now to a busy loop with lvmetad and command.
Avoid testing this patological case.
TODO: Code should possibly translate zero interval into some small
sleep. With lvmpolld it's already 1/10s
Add new targets:
make check_lvmpolld
make check_cluster_lvmpolld
make check_lvmetad_lvmpolld
make check_all_lvmpolld
So check_lvmetad runs only base lvmetad test - to much
logic of remaining targets.
Previous behavior is available via check_all_lvmpolld.
Make it easier to replace missing segments with 'zero' returning
target - otherwise user would have to create some extra target
to provide zeros as /dev/zero can't be used (not a block device).
Also break code loop when segment is found and make it an INTERNAL_ERROR
where it's missing.
Instead of clearing multiple rmeta device with sequential activation
process and waiting for udev for every _rmeta device separately,
activate all _rmeta devices first and then clear them and deactivate
afterwards.
Also update some tracing messages.
When anyhing goes wrong during clearing process, always try to
deactivate as much _rmeta devices as possible before fail.
This code is no longer needed because the back ground task has been
removed. Will add back if we change the design and end up utilizing
multiple worker threads.
There is no reason to create another background task when the task that
created it is going to block waiting for it to finish. Instead we will
just execute the logic in the worker thread that is servicing the worker
queue.
Translate log_info() into log_very_verbose() which is macro
supposed to be used by our code.
log_info() is internal macro with eventually some 'symbolic' meaning
in syslogging daemons.
Instead of compiling 2 log call for 2 different logging functions,
and runtime decide which version to use - use only 'newer' function
and when user sets his own OLD dm_log logging translate it runtime
for old arg list set.
The positive part is - we get shorter generated library,
on the negative part this translation means, we always have evaluate
all args and print the message into local on stack buffer, before
we can pass this buffer to the users' logging function with proper
expected parameters (and such function may later decide to discard
logging based on message level so whole printing was unnecessary).
Ensure different logging function for dmeventd.c logging
and dm and lvm library.
We can recognize we want to show every log_info() and
log_notice() message from dmeventd.c code while not
exposing those from libdm/libdevmapper-event
Also switch to use log with errno - it's not changing
anything and doesn't bring any more features yet to dmeventd
logging but we just properly pass dm_errno_or_class properly
through the whole code stack for possible future use
(i.e. support of class logging for dmeventd).
Reword the logging logic and try to restore previous logging
behavior for 'standalone' running daemon while preserving
debuggable feautures it has gained.
So actual rules:
dmeventd without any '-d' option will syslog all messages
from dmeventd.c it dmeventd plugins.
log_notice()==log_verbose()
log_info()==log_very_verbose()
But to show also log_debug() used has to give '-ddd'.
When user specified '-d, -dd, -ddd, -dddd' it
will also enable tracing of messages from libdm & lib
executed code - which is mainly useful for testing
i.e.: 'dmeventd -fldddd'
Introduce macros:
log_level(), log_stderr(), log_once(), log_bypass_report()
For easier and more consisten way how to 'decoder' bits
of info from passed 'level'.
This patch fixes potential problem when 'level' of message
might not have always masked right bits.
Instead of creating a thread to handle the case where a client
is calling job.Wait, we will utilize a timer. This significantly
reduces the number of threads that get created and destroyed while
the service is running.
We will fetch the lvm state in non-main thread and only process the new
data with the main thread to prevent hanging the main thread event loop.
ref. https://bugs.freedesktop.org/show_bug.cgi?id=98521
pvscan --cache -aay was activating LVs in exported VGs
when it should not.
It appears that this was a regression from commit 9b640c3684
"pvscan: use process_each_vg for autoactivate".
If blkdeactivate finds out that the device on top of device stack
is already unmounted, it still proceeds with device stack deactivation
underneath now.
This situation can happen if blkdeactivate is started and the mount
point is unmounted in parallel by chance (so when blkdeactivate
gets the the actual umount call, the device is not mounted anymore).
Before, the blkdeactivate added such device to skip list which caused
all the stack underneath to be skipped too on deactivation. Now, we
proceed just as if blkdeactivate did the umount itself.
For example, in the example below, the vg-lvol0 is mounted on /mnt/test
when blkdeactivate is called, but it gets unmounted in parallel later
on when blkdeactivate gets to the actual umount call.
Before this patch (vg-lvol0 underneath not deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... skipping
With this patch applied (vg-lvol0 underneath still deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... already unmounted
[LVM]: deactivating Logical Volume vg/lvol0... done
(Automatic) repair may not be allowed during the initial sync of an upconverted
linear LV, because the data on the failing, primary leg hasn't been completely
synchronized to the N-1 other legs of the raid1 LV (replacing failed legs during
repair involves discontinuing access to any replaced legs data, thus preventing
data recovery on the primary leg e.g. via dd_rescue).
Even though repair would not cause data loss when adding legs to a fully synced
raid1 LV, we don't have information yet defining this state yet (e.g. a raid1
LV flag telling the fully synchronized status before any legs were added),
hence can't automatically decide to allow to repair.
If nonetheless a repair on a non-synced raid1 LVs is intended, the "--force"
option has to be provided.
Resolves: rhbz1311765
Validate kernel support for raid0/raid4 on given and
requested segtype before requesting conversions on them.
Because raid10 wasn't present in old RAID targets, add
the same validation to be prepared once we support them.
Check for dm-raid target version with non-standard raid4 mapping expecting the dedicated
parity device in the last rather than the first slot and prohibit to create, activate or
convert to such LVs from striped/raid0* or vice-versa in order to avoid data corruption.
Add related tests to lvconvert-raid-takeover.sh
Resolves: rhbz1388962
On conversions between striped/raid0* and raid4, the kernel expects
the dedicated raid4 parity SubLVs in the first segment area rather than
in the last it's been allocated to, thus the data mapping ain't proper.
Enhance lvconvert (lib/metadata/raid_manip.c) to shift the dedicated
parity SubLVs on conversions from striped/raid0* to raid4 and vice-versa.
In case of raid0_meta -> raid4 where the MD raid0 personality already has
stored RAID array device positions in the superblocks, the MetaLVs have to
be cleared so that the kernel doesn't fail validating the array positions
after lvm has shifted them up by one.
Add more tests to lvconvert-raid-takeover.sh including one to check for
mapping flaws by converting a created raid4 with filesystem -> striped
and fsck it.
Whilst on it:
- add missing direct striped -> raid4 conversion to the takeover array
to avoid an intermim conversion from striped -> raid0*
- clean up the takeover array
- allow lvconvert to actually call lv_raid_convert() on all takeover requests
in order to check parameters and display messages provided by takeover
functions rather than just "...not supported" from within lvconvert
- fix a typo
Resolves: rhbz1386148
If a device disappears after obtaining the list of devices but before
processing it as a member of that list, dmsetup exits with a failure code.
Most commands still produce what output they can in these circumstances,
but 'ls --tree' and 'info -c' with fields depending on device dependencies
didn't. Change this.
Keep for now function logic making its decision on string content.
We need bigger patch converting all things to bit-checks later.
This needs however bigger refactoring.
So this commit reverts some changes from:
c8b6c13015
Commit 088b3d036a allowed repair on cache origin RAID LVs
and restricted lvconvert actions on RAID SubLVs to change number of mirrors, repair,
replace and type changes in order to avoid unsuitable coversions on them.
This introduced a regression prohibiting --splitmirrors on any RAID SubLVs
(e.g. of cache or thin LVs; lvconvert-{cache,thin}-raid.sh tests failing).
Fix allows split mirrors again.
Fix some indenting whilst on it.
When we have already decoded arg_is_set into a local var
or already set segment type - already use these
values instead of repeating calls and string checks.
Seems some error path where not converted to 'new' ECMD return value.
Fix them to always 'goto out'.
Also drop unneeded 'ret = 0' when ret already is 0.
This test never passes on loop back, so we will skip unless the
pv devices are real devices which contain `/dev/sd`.
We always fail because we need lvm to run slow to get a timer to
pop, and loopback are too fast.
If you run multiple runs of unittest.main, unless you don't pass exit=true
the test case always ends with a 0 exit code. Add ability to store the
result of each invocation of the test and exit with a non-zero exit code
if anyone of them fail.
In case a RAID orig LV is being cached and fails, repair is impossible because
"lvconvert --repair" gets rejected.
Fix by allowing repair on cache orig RAID LVs and
"lvconvert --replace/--mirrors/--type {raid*|mirror|striped|linear}" as well.
Allow the same lvconvert actions on any cache pool and metadata RAID SubLVs.
Resolves: rhbz1380532
The following LvCommon properties were added so that the API
would have the same functionality as lvm2app has.
LvCommon.MetaDataSizeBytes
LvCommon.Attr
LvCommon.MetaDataPercent
LvCommon.CopyPercent
LvCommon.SnapPercent
LvCommon.SyncPercent
Fix missing wait so we have paired waiting.
Also 'wait' for precise PID to get 'exit' code.
Test for 'error' replacing only with newer snapshot targets.
The old one will wait for resume.
Note: 'wait -n' is not always available so can't be used..
Integrate back _unblock_sigalrm() and check for error code of
pthread_sigmask() function so we do not use uninitialized
sigmask_t on error path (Coverity).
When a PV device is missing lvm will return '[unknown]' for the device
path. The object manager keeps a hash table lookup for uuid and for PV's
device name. When we had multiple PVs with the same device path we
we only had 1 key in the table for the lvm id (device path). This caused
a problem when the PV device transitioned from '[unknown]' to known as any
subsequent transitions would cause an exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 66, in run_cmd
result = self.method(*self.arguments)
File "/usr/lib/python3.5/site-packages/lvmdbusd/manager.py", line 205, in _pv_scan
cfg.load()
File "/usr/lib/python3.5/site-packages/lvmdbusd/fetch.py", line 24, in load
cache_refresh=False)[1]
File "/usr/lib/python3.5/site-packages/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/usr/lib/python3.5/site-packages/lvmdbusd/loader.py", line 80, in common
cfg.om.remove_object(cfg.om.get_object_by_path(k), True)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 153, in remove_object
self._lookup_remove(path)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 97, in _lookup_remove
del self._id_to_object_path[lvm_id]
KeyError: '[unknown]'
when trying to delete a key that wasn't present. In this case we don't add a
lookup key for the device path and the PV can only be located by UUID.
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1379357
Works if the pool is inactive.
Activation code doesn't notice a new raid dependency in on-disk metadata
when a thin LV is already active.
https://bugzilla.redhat.com/1365286
The dm_stats_delete_region() call removes a region from the bound
device, and, if the region is grouped, from the group leader
group descriptor stored in aux_data.
To do this requires a listed handle: previous versions of the
library do not since no dependencies exist between regions without
grouping.
This leads to strange behaviour when a command built against an old
version of the library is used with one supporting groups. Deleting
a region with dmstats succeeds, but logs errors:
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
vg_hex-root 2 2.00g 1.00g 1 1.00g dmstats
# dmstats delete --regionid 2 vg_hex/root
Region ID 2 does not exist
Could not delete statistics region.
Command failed
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
This happens because the call to dm_stats_delete_region() is inside
a dm_stats_walk_*() iterator: upon entry to the call, the iterator
is at its end conditions and about to terminate. Due to the call to
dm_stats_list() inside the function, it returns with an iterator at
the beginning of a walk and performs a further iteration before
exiting. This final loop makes a further attempt to delete the
(already deleted) region, leading to the confusing error messages.
The current dmsetup.c handles DR_STATS and DR_STATS_META reports
separately in _display_info_cols(), meaning that the stats walk
functions are never called for these report types.
Versions before v2.02.159 have a loop using dm_stats_walk_do() and
dm_stats_walk_while(), that executes once for non-stats reports,
and once per region, or area, for DR_STATS/DR_STATS_META reports.
This older behaviour relies on the documented behaviour that the
walk functions will accept a NULL pointer as the struct dm_stats*
argument.
This was broken by commit f1f2df7b: the NULL test on dms and
dms->regions were incorrectly moved from the dm_stats_walk_end()
wrapper to the internal '_stats_walk_end()' helper.
Since the pointer is dereferenced in between these points, using
an older dmsetup with current libdm results in a segfault when
running a non-stats report:
# dmsetup info -c vg00/lvol0
Segmentation fault (core dumped)
Restore the NULL checks to the wrapper function as intended.
We shouldn't be losing pvscans just because of the fact that the
underlying device (PV) appears and disappears quickly in the system,
otherwise lvmetad may not see the device if it appears again (or it may
still keep the device in cache even it's already gone).
We added lightweight toolcontext handle to avoid useless initialization
of some parts of the context and also to avoid problems when using the
handle very soon at system boot, like in lvm2-activation-generator
through lvm2app interface. However, we missed reading all the other
config sources like lvmlocal.conf as well as any tag config - we need to
read these too to get the final config value which may be overriden in
any of these additional config sources.
Currently, we use this lightweight toolcontext handle to read
global/use_lvmetad and global/use_lvmpolld config values in
lvm2-activation-generator using lvm2app interface (lvm_config_find_bool
lvm2app function).
Pre 1.9 dm-raid targets status output was racy, which caused
the device status chars to be unreliable _during_ synchronization.
This shows paritcularly with tiny test devices used.
Enhance lvchange-rebuild-raid.sh to not check status
chars _during_ synchronization. Just check afterwards.
Though I'm not quite sure why we push this limit on user,
current --repair logic requires user to wait outside of command.
TODO: I'm not quite sure this repair logic is 'the most wanted'.
Make sure that the temporary dm_histogram used for the bounds
argument is freed in the case that the user provided a --bounds
argument on the command line.
The dm-raid target now rejects device rebuild requests during ongoing
resynchronization thus causing 'lvconvert --repair ...' to fail with
a kernel error message. This regresses with respect to failing automatic
repair via the dmeventd RAID plugin in case raid_fault_policy="allocate"
is configured in lvm.conf as well.
Previously allowing such repair request required cancelling the
resynchronization of any still accessible DataLVs, hence reasoning
potential data loss.
Patch allows the resynchronization of still accessible DataLVs to
finish up by rejecting any 'lvconvert --repair ...'.
It enhances the dmeventd RAID plugin to be able to automatically repair
by postponing the repair after synchronization ended.
More tests are added to lvconvert-rebuild-raid.sh to cover single
and multiple DataLV failure cases for the different RAID levels.
- resolves: rhbz1371717
Commit 199697accf rerouted funtion
for priting cache volume origin to lvm2app app function - which
however had a bug. So restore the original functionality
and print correct LV as cache origin LV.
Gris debugged that when we don't have a method the introspection
data is missing the interface itself eg.
<interface name="<your_obj_iface_name>" />
When adding the properties to the dbus object introspection we will
add the interface too if it's missing. This now allows us the
ability to have a dbus object with only properties.
Because of the different code paths we need to test job handling with
all operations. Test now runs virtually everything with timeout == 0
and timeout == 15 so that we test both use cases.
Note: If a client passes -1 for the timeout value they need to make
sure that their connection time out is also infinite. Otherwise, if the client
times out the service side hangs any new dbus calls until the job
that is in progress completes. Not sure why this behavior is occuring
at this time, but it appears a limitation/bug of the dbus-python library.
When we register a failure we need to use a valid value which will be
returned with the object manager. Otherwise we will raise an Exception
because we are trying to construct an object path from None.
The methods were returning an instance of the object instead of the
object path which was causing an exception when the result was returned
with the job object as we are explicity trying to return an object path.
Unit test added which re-creates the issue and verifies the fix.
Add simple helper/wrapper check function to check result
of dmsetup call i.e.:
check grep_dmsetup table vg-lv "grep_expected"
check grep_dmsetup status vg-lv -v "grep_unexpected"
Reload of thin-pool origin_only is designed to only post messages
to a thin-pool. It's not intended to be used for reload of thin-pool
table. Fix it by using standard call 'lv_update_and_reload()'.
Unconditionally guard there is at least 1/4 of metadata volume
free (<16Mib) or 4MiB - whichever value is smaller.
In case there is not enough free space do not let operation proceed and
recommend thin-pool metadata resize (in case user has not
enabled autoresize, manual 'lvextend --poolmetadatasize' is needed).
In the case there is no active thin volume, report thin pool
as lock holder. This fixed function like lvextend
which either expecte lock holder LV is some active thin
or 'possibly' inactive thin pool.
The existing code doesn't understand that mirror logs should cling to
parallel LVs (like extending them) instead of avoiding them.
As a quick workaround to avoid lvcreate failures, hard-code
--alloc normal for mirror logs even if the rest of the allocation
used a stricter policy.
https://bugzilla.redhat.com/show_bug.cgi?id=1376532
When rescanning a VG from disk, the metadata read from
each PV was compared as a sanity check. The comparison
is done by exporting the vg metadata from each dev to
a config tree, and then comparing the config trees.
The function to create the config tree inserts
extraneous information along with the actual VG metadata.
This extra info includes creation_time. The config
trees for two devs can easily be created one second
apart in which case the different creation_times would
cause the metadata comparison to fail. The fix is to
exclude the extraneous info from the metadata comparison.
Correction for aux test result ([] -> if;then;fi)
Use issue_discard to lower memory demands on discardable test devices
Use large devices directly through prepare_pvs
I'm still observing more then 0.5G of data usage through.
Particullary:
'lvcreate' followed by 'lvconvert' (which doesn't yet support --nosync
option) is quite demanging, and resume returns quite 'late' when
a lot of data has been already written on PV.
Reinstantiate reporting of metadata percent usage for cache volumes.
Also show the same percentage with hidden cache-pool LV.
This regression was caused by optimization for a single-ioctl in
2.02.155.
Allow RAID scrubbing on cache origin sub-LV
This patch adds the ability to perform RAID scrubbing on the cache
origin sub-LV (https://bugzilla.redhat.com/1169495). Cache origin
operations are restricted to non-clustered RAID LVs until there can
be further testing in a cluster (even for exclusive activation).
User can either specify directly _corig LV
or he can specify cache LV and operation --syncation is
passed ONLY to _corig LV.
If users wants to manipulation with cache-pool devices - he
needs to specify this object name.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Older udev versions (udev < v165), don't have the official
udev_device_get_is_initialized function available to query for
device initialization state in udev database. Also, devices don't
have USEC_INITIALIZED udev db variable set - this is bound to the
udev_device_get_is_initialized fn functionality.
In this case, check for "DEVLINKS" variable instead - all block devices
have at least one symlink set for the node (the "/dev/block/<major:minor>".
This symlink is set by default basic udev rules provided by udev directly.
We'll use this as an alternative for the check that initial udev
processing for a device has already finished.
It's possible (mainly during boot) that udev has not finished
processing the device and hence the udev database record for that
device is still marked as uninitialized when we're trying to look
at it as part of multipath component check in pvscan --cache code.
So check several times with a short delay to wait for the udev db
record to be initialized before giving up completely.
When scanning devs to populate lvmetad during system startup,
filter-mpath with native sysfs multipath component detection
may not detect that a dev is multipath component. This is
because the multipath devices may not be set up yet.
Because of this, pvscan will scan multipath components during
startup, will see them as duplicate PVs, and will disable
lvmetad. This will leave lvmetad disabled on systems using
multipath, unless something or someone runs pvscan --cache
to rescan.
To avoid this problem, the code that is scanning devices to
populate lvmetad will now check the udev db to see if a
dev is a multipath component that should be skipped.
(This may not be perfect due to inherent udev races, but will
cover most cases and will be at least as good as it's ever
been.)
The lsblk is just a nice helper here - it's not crucial for lvmdump so
do best effort here and use the most we can from current version of
lsblk that is installed on system. The lsblk -s option was added a bit
later after lsblk introduction and lsblk -O support even more later -
so if these are not available, use only pure lsblk output without any
extras.
- Prevent --lvmshell with --nojson, not a valid combination
- If user is preventing json, then no lvmshell usage
- Return boolean on Manager.UseLvmShell
The normal mode of operation will be to monitor for udev events until an
ExternalEvent occurs. In that case the service will disable monitoring
for udev events and use ExternalEvent exclusively.
Note: User specifies --udev the service will always monitor udev regardless
if ExternalEvent is being called too.
With the addition of JSON and the ability to get output which is known to
not contain any extraneous text we can now leverage lvm shell, so that we
don't fork and exec lvm command line repeatedly.
When we are running in a terminal it's useful to have a date & ts on log
output like you get when output goes to the journal. Check if we are
running on a tty and if we are, add it in.
Avoid monitoring of activated cache-pool - where the only purpose ATM
is to clear metadata volume which is actually activate in place
of cache-pool name (using public LV name).
Since VG lock is held across whole clear operation, dmeventd cannot
be used anyway - however in case of appliction crash we may
leave unmonitored device.
In future we may provide better mechanism as the current name
replacemnet is creating 'uncommon' table setups in case the metadata
LV is more complex type like raid (needs some futher thinking about
error path results).
Another point to think about is the fact we should not clear device
while holding lock (i.e. dmeventd mirror repair cannot work in cases
like this).
Introduce 'hard limit' for max number of cache chunks.
When cache target operates with too many chunks (>10e6).
When user is aware of related possible troubles he
may increase the limit in lvm.conf.
Also verbosely inform user about possible solution.
Code works for both lvcreate and lvconvert.
Lvconvert fully supports change of chunk_size when caching LV
(and validates for compatible settings).
Commit e947c362dd introduced
config_settings.h file for central place to store all definitions for
config options. By mistake, it used report/colums_as_rows instead
of report/columns_as_rows (missing "n" in "columns").
If the number of stripes requested is incompatible with the requested
type of raid, give an error instead of adjusting it.
If no stripes argument is supplied, continue to use an appropriate
default.
a579ba2ac2 fixed a regression causing a segfault if no external
origin existed but broke the logic leading to erroneous error
messages and creations of split off exported VGs in case the
external origin and the pool LVs were allocated on different PVs.
- resolves rhbz1367459
Creating a RaidLV in VGs with very small extent sizes caused
late failure in the kernel giving a not very informative error
message. Catch the attempt early and display failure message
'Unable to create RAID LV: requires minimum VG extent size 4.00 KiB'.
- resoves rhbz1179970
'pvmove -n name pv1 pv2' called with the name of a top-level LV
failed with mentioned commit.
Enhance pvmove-raid-segtypes.sh to test for prohibited RAID SubLV moves.
'pvmove -n name pv1 pv2' allows to collocate multiple RAID SubLVs
on pv2 (e.g. results in collocated raidlv_rimage_0 and raidlv_rimage_1),
thus causing loss of resilence and/or performance of the RaidLV.
Fix this pvmove flaw leading to potential data loss in case of PV failure
by preventing any SubLVs from collocation on any PVs of the RaidLV.
Still allow to collocate any DataLVs of a RaidLV with their sibling MetaLVs
and vice-versa though (e.g. raidlv_rmeta_0 on pv1 may still be moved to pv2
already holding raidlv_rimage_0).
Because access to the top-level RaidLV name is needed,
promote local _top_level_lv_name() from raid_manip.c
to global top_level_lv_name().
- resolves rhbz1202497
Adding MetaLVs to given DataLVs (e.g. raid0 -> raid0_meta takeover),
_avoid_pvs_with_other_images_of_lv() was missing code to prohibit
allocation when called with a just allocated MetaLV to prohibit
collaocation of the next allocated MetaLV on the same PV.
- resolves rhbz1366738
When lvm is compiled with --enable-notify-dbus and a user uses lvm
shell, after they issue 200+ commands the lvm shell will hang for
~30 seconds trying to notify the lvm dbus service that a change
has occurred. This appears to be caused by resource exhaustion,
because the sockets used for dbus communication are not be closed.
Enforce mirror/raid0/1/10/4/5/6 type specific maximum images when
creating LVs or converting them from mirror <-> raid1.
Document those maxima in the lvcreate/lvconvert man pages.
- resolves rhbz1366060
Some settings are not suitable for override in interactive/shell
mode because such settings may confuse the code and it may end
up with unexpected behaviour. This is because of the fact that
once we're in the interactive/shell mode, we have already applied
some settings for the shell itself and we can't override them
further because we're already using those settings to drive the
interactive/shell mode. Such settings would get ignored silently
or, in worse case, they would mess up the existing configuration.
When lvm commands are executed in lvm shell, we cover the whole lvm
command execution within this shell now. That means, all messages logged
and status caught during each command execution is now recorded in the
log report, including overall command's return code.
The dm_report_group_output_and_pop_all calls dm_report_output and
dm_report_group_pop for all the items that are currently in report
group. This is just a shortcut that makes it easier to output and
pop group's content so the group handle can be reused again without
a need to initialize and configure it again.
The functionality of dm_report_group_output_and_pop_all is the
same as dm_report_destroy but without destroying the report group
handle.
This patch moves printing of starting '{' character for JSON output up
untili it's known there's any further output following - either the
content or ending '}' character.
Also, remove unnecessary switch for different report group types and
calling individual functions to handle dm_report_group_create as that
code is shared for all existing types at the moment.
Calling dm_report_destroy_rows makes it possible to destroy any report
content we have but at the same time it doesn't destroy the report
handle itself, thus it's possible to reuse that handle again for new
report content.
Functionally, this is the same as calling dm_report_output with the
report handle but omitting the output iself. This functionality may
be useful if we, for whatever reason, need to discard the report
content and start a fresh new one but with the same report configuration
and initialization and thus we can just reuse the existing handle.
We may call arg_count/grouped_arg_count/arg_value soon enough that
cmd->arg_values is not set yet.
Normally, when running a command, we execute lvm_run_command which in
turn calls _process_command_line to allocate and parse the command line
values and stores them in cmd->arg_values.
However, if we run lvm shell, this one doesn't accept any command line
options and we parse the command line for each command that is executed
within the lvm shell then. If we used any code that tries to access
cmd->arg_values through any of the the arg handling functions too
early, we could end up with a segfault due to uninitialized (NULL)
cmd->arg_values.
This patch just saves extra checks in all the code where arg handling
may be called too early so that the cmd->arg_values is not set up yet.
This does not apply to any of existing code, but subsequent patches
will need that.
With patches that will follow, this will make it possible to widen log
report coverage when commands are executed from lvm shell so the amount
of messages that may end up in stderr/stdout instead of log report are
minimized.
Add new log_context=shell and with log_object_type=cmd and
log_object_name=<command_name> for command log report to collect
overall return code from last command (this is reported under
log_type=status).
Currently, the output is separated in 3 parts and each part can go into
a separate and user-defined file descriptor:
- common output (stdout by default, customizable by LVM_OUT_FD environment variable)
- error output (stderr by default, customizable by LVM_ERR_FD environment variable)
- report output (stdout by default, customizable by LVM_REPORT_FD environment variable)
For example, each type of output goes to different output file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvs fedora vg/abc 1>out 2>err 3>report
[0] fedora/~ # cat out
[0] fedora/~ # cat err
Volume group "vg" not found
Cannot process volume group vg
[0] fedora/~ # cat report
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
Another example in LVM shell where the report goes to "report" file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvm 3>report
(in lvm shell)
lvm> vgs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
(in lvm shell)
lvm> lvs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
RAID6 LVs may not be created with --nosync or data corruption
may occur in case of device failures. The underlying MD raid6
personality used to drive the RaidLV performs read-modify-write
updates on stripes and thus relies on properly written parity
(P and Q Syndromes) during initial synchronization.
Once on it, enhance test to create/extend more and
larger RaidLVs and check sync/nosync status.
Commit 76ef2d15d8 introduced
raid0 <-> raid4 takeover and full mirror <-> raid1 support.
Add tests for these conversions.
Tests exposed a kernel semantics change freezing resynchronization
on conversions from raid0[_meta] -> raid4 or adding raid1 legs
because kernel kept the RAID mapped device in 'frozen' state unless
an 'idle' message was sent or the table was reloaded (kernel patch pending).
Although the use of the first region_id in a group to store the
DMS_GROUP=... aux_data tag is an internal implementation detail,
it has a user visible consequence in that deleting this region will
cause the group to disappear: add an explanation of this to the
'group' command and 'Regions, areas, and groups' section.
The MD raid6 personality being used to drive lvm raid6 LVs does
read-modify-write updates to any stripes and thus relies on correct
P and Q Syndromes being written during initial synchronization or
it may fail reconstructing proper user data in case of SubLVs failing.
We may not allow the '--nosync' option on
creation of raid6 LVs for that reason.
Update/fix 'man lvcreate' in that regard.
add lvcreate-raid-nosync.sh test script.
- Resolves rhbz1358532
We don't need to refresh whole cmd context if we drop profile after
processing LVM command - just like we don't refresh cmd context when
we're applying the profile. It's because profiles contain only safe
subset of settings which do not require complete cmd context refresh.
This patch calls process_profilable_config instead of
refresh_toolcontext if there was profile applied for the LVM
command only, not --config which requires toolcontext refresh.
The process_profilable_config just sets proper values based on
values of profilable settings, but it does not do complete
reinitialization of various parts (e.g. filters, logging etc.).
'lvchange --resync LV' or 'lvchange --syncaction repair LV' request the
RAID layout specific parity blocks in raid4/5/6 to be recreated or the
mirrored blocks to be copied again from the master leg/copy for raid1/10,
thus not allowing a rebuild of a particular PV.
Introduce repeatable option '--[raid]rebuild PV' to allow to request
rebuilds of specific PVs in a RaidLV which are known to contain corrupt
data (e.g. rebuild a raid1 master leg).
Add test lvchange-rebuild-raid.sh to test/shell doing rebuild
variations on raid1/10 and 5; add aux function check_status_chars
to support the new test.
- Resolves rhbz1064592
Prepare for new segment type conversion functionality in cases that
currently fail. In the short-term, we need to do this while limiting
the changes to the code paths for the conversions that are already
supported.
introduced with commit 8f62b7bfe5 rely on complete
defintions of the relations between the LVs of a VG.
Hence only run these checks when the complete_vg flag
is set on calls to check_lv_segments().
lvconvert failed in test lvconvert-thin-raid.sh when
calling check_lv_segments() from _read_segments() without
providing a complete definition.
on any thin snap external origin LV which caused a segfault
when none existed as exposed by the vgsplit-thin.sh test.
Only call lv_is_on_pvs() if an external origin LV actually
exists and correct the related splitting logic.
When converting to a cache lv, tests were hanging with a prompt for
"Do you want wipe existing metadata of cache pool volume
To preserve cache metadata add option "--zero n".
WARNING: Reusing mismatched cache pool metadata MAY DESTROY YOUR DATA!"
This is new.
When a client is doing a wait on a job, any other clients will hang
when trying to do anything with the service. This is caused by
the wait code which was placing the thread that handles
incoming dbus requests to sleep until either the timeout expired or
the job operation completed.
This change creates a thread for the wait request, so that the thread
processing incoming requests can continue to run.
with respect to the changed, configurable default behaviour
introduced with commit 7eb7909193.
E.g. raid default of 2 stripes rather than number of PVs in the VG
or on the command line minus one.
The syntax for converting an LV to a thin LV
included an unnecessary --thin option. I was
probably still confused about these options
when writing this.
General RAID and RAID segment type specific checks are added
to merge.c. New static _check_raid_seg() is called on each segment
of a RaidLV (which have just one) from check_lv_segments().
New checks caught some unititialized segment members
which are addressed here as well:
- initialize seg->region_size to 0 in lvcreate.c for raid0/raid0_meta
- initialize list seg->origin_list in lv_manip.c
Add matching support for -Z option also we doing full conversion
to cache-pool.
Extending coversion message to show which pool type is created
and whether the metadata will be wiped or remain unmodified.
Follow-up to 27a767d5e8.
Tunning behavior in a way we always prompt when option --zero is NOT specified.
Without -Z lvm expects user wants to 'reset' cache-pool metadata
(they could have been splitted from some cached LV)
If user doesn't want to zero metadata he needs to specify -Zn.
User may also avoid prompting for zeroing by using -Zy for
cache-pool (basically equals using --yes without -Z being given)
(unlike full convert case, there is no cache-pool being converted,
so there is not 'uncoditional' prompt in this case).
When volume was lvconvert-ed to a thin-volume with external origin,
then in case thin-pool was in non-zeroing mode
it's been printing WARNING about not zeroing thin volume - but
this is wanted and expected - so nothing to warn about.
So in this particular use case WARNING needs to be suppressed.
Adding parameter support for lvcreate_params.
So now lvconvert creates 'normal thin LV' in read-only mode
(so any read will 'return 0' for a moment)
then deactivate regular thin LV and reacreate in 'final R/RW' mode
thin LV with external origin and activate again.
Before, the automatic update from older to newer version of PV extension
header happened within vg_write call. This may have caused problems under
some circumnstances where there's a code in between vg_write and vg_commit
which may have failed. In such situation, we reverted precommitted metadata
and put back the state to working version of VG metadata.
However, we don't have revert for PV write operation at the moment. So
if we updated PV headers already and we reverted vg_write due to failure
in subsequent code (before vg_commit), we ended up with lost VG metadata
(because old metadata pointers got reset by the PV write operation).
To minimize problematic situations here, we should put vg_write and
vg_commit that is done after PV header rewrites as close to each
other as possible.
This patch moves the automatic PV header rewrite for new extension
header part from vg_write to _vg_read where it's done the same way
as we do any other VG repairs if detected during VG read operation
(under VG write lock).
If the VG holding the global lock is removed, we can indicate
that as the reason for not being able to acquire the global
lock in subsequent error messages, and can suggest enabling
the global lock in another VG. (This helpful error message
will go away if the global lock is enabled in another VG,
or if lvmlockd is restarted.)
When cache pool is reused for a new cached volume, there is
normally no need to 'keep' old cache-pool metadata as this
could cause major data lose.
Unlike with 'lvcreate -H -LX --cachepool' conversion, this lvconvert
path left the metadata unzeroed - partly for making easier some
debugging, but this was rather a bug.
So to keep possible reattach of 'unzeroed' metadata, user
now has to use 'lvconvert -Zn' for such conversion. In this case
the prompt will appear about possibe data loss and to proceed,
user has to confirm such operation. Without -Zn metadata are wiped.
In some cases, the command will update VG metadata
in lvmetad without writing it. In these cases there
is no vg->vg_committed and it should use 'vg' directly.
This happens when the command finds that the lvmetad
VG has been invalidated, rereads the metadata from disk,
then updates lvmetad with that metadata. This happens
often with lvmlockd or foreign VGs, and can happen without
lvmlockd if a previous command fails after invalidating
the VG in lvmetad.
Commit 3928c96a37 introduced
new defaults for raid number of stripes, which may cause
backwards compatibility issues with customer scripts.
Adding configurable option 'raid_stripe_all_devices' defaulting
to '0' (i.e. off = new behaviour) to select the old behaviour
of using all PVs in the VG or those provided on the command line.
In case any scripts rely on the old behaviour, just set
'raid_strip_all_devices = 1'.
- resolves rhbz1354650
This fixes a regression from commit a7c45ddc5, which moved
the lvmetad VG update from vg_commit() to unlock_vg().
The lvmetad VG update needs to send the version of metadata
that was committed rather than sending the state of struct 'vg'.
The 'vg' may have been partially modified since vg_commit(),
and contain non-committed metadata that shouldn't be sent
to lvmetad.
Any failing stripes in raid0/raid0_meta type LVs cause data loss,
thus replacement via 'lvconvert --replace...' does not make sense.
Patch prohibits replacement on raid0/raid0_meta LVs.
- resolves rhbz1356734
The --uuid, --major and --alldevices arguments were incorrectly tested
after confirming argc is > 0, in a branch that only executes if argc
== 0 (i.e. they were unreachable).
Move all device checks before the test for argc and log an appropriate
error before returning.
Including major and minor numbers in pvs and lvs output when calling
lvmdump -a makes it a bit easier to match these items with possible
system log/journal.
Commit ca878a3426 changed behavior
or resize operation. Later the code has been futher changed
to skip fs resize completely when size of LV is already matching
and finaly at the most recent resize changeset for resize the
check for matching size has been eliminated as well so we ended
with a request call to resize fs to 0 size in some cases.
This commit reoders some test so the prompt happens just once before
resize of possibly 2 related volumes.
Also extra test for having LV already given size is added, and
whole metadata update is skipped for this case as the only
result would be an increment of seqno.
However the filesystem is still resized when requested,
so if the LV has some size and the resize is resolved to
the same size, the filesystem resize is called so in case FS
would not match, the resize will happen.
raid0/raid0_meta type LVs don't have a default number of stripes when
created without '-i/--stripes Stripes' whereas other raid types have one.
Patch sets the default for raid0/raid0_meta to 2 stripes.
The default amount of stripes for raid4/5/10 is changed to 2 and for raid6 to 3
rather than using all PVs in the VG or those provided on the command line.
This is to avoid unintended high number of stripes in case of many PVs.
To select a different amount of stripes from the default,
use 'lvcreate -i/--stripes Stripes'.
- resolves rhbz1354650
A livelock occurs on extension in lv_manip when adjusting the region size,
which doesn't apply to any raid0/raid0_meta LVs (these don't have a bitmap).
Fix by prohibiting the region size adjustment on any such LVs.
- resolves rhbz1354604
An unconditional access to the non-existing MetaLV of a raid0 LV in
lv_raid_remove_missing() was causing the segfault.
Only call log_debug() on replacements of existing MetaLVs.
- resolves rhbz1354646
Resync attempts on raid0/raid0_meta via 'lvchange --resync ...'
cause segfaults.
'lvchange --syncaction ...' doesn't get rejected either.
Prohibit both on raid0/raid0_meta LVs.
- resolves rhbz1354656
4420d41fea introduced recursive split of lvs which
splits a top-level LV together with it's sub LVs.
This lead to invalid temporary list pointers
causing hangs/OOM situations.
Patch updates the temporary list pointer
referencing a moved sub LV.
- resolves rhbz1354686
Synopsis are very useful for quick orientation and also
we provide then for all remaining command.
Also list ALL supported options in a single ordered list,
user should not seek for them.
blkdeactivate -m disablequeueing causes "multipathd disablequeueing maps"
call inside blkdeactivate script before deactivating devices. This
avoids a situation where blkdeactivate may wait for paths to appear if
multipath is set to queueing and there's a stack of other devices and/or
mount points on top of such multipath device.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1344381.
Reduce number of evualted pvcreate commands.
Since 0 is default value used to fill missing params,
and 0 is also 1st. value in array, it's being tested.
Drop unused data_alignment_offset.
When logging to epoch files we would like to prevent creating too large
log files otherwise a spining command could fulfill available space
very easily and quickly.
Limit for to 100000 per command.
Make the --filemap switch take no arguments and instead accept one
or more files on the command line to be mapped and placed into
groups.
This allows --filemap to be used with a glob:
# dmstats create --filemap *
rhel5.10-1.qcow2: Created new group with 87 region(s) as group ID 1564.
rhel5.10.qcow2: Created new group with 8 region(s) as group ID 1651.
rhel7.0-1.qcow2: Created new group with 11 region(s) as group ID 1659.
rhel7.0.qcow2: Created new group with 1454 region(s) as group ID 1670.
vm.img: Created new group with 2 region(s) as group ID 3124.
lvconvert --splitcache VG/CachePool_corig
Allow the split via the hidden/used cache pool for the time being,
since the new lvconvert code did intend to allow it, but was just
missing the exception in the list of hidden LVs that were allowed.
The preferred method for splitcache is to run it on the visible
cache LV, not the hidden cache pool. That may eventually become
the only method since we try to avoid running commands on
hidden LVs.
When a 'dmstats create --filemap' operation fails (e.g. during
open(2), close(2), or dm_stats_create_regions_from_fd()), use the
canonical version of the path. This avoids cryptic/confusing error
messages when symbolic links exist in the path argument given:
# findmnt /var/lib/libvirt/images -otarget,source
TARGET SOURCE
/var/lib/libvirt/images /dev/mapper/vg_hex-lv_images
# readlink /var/lib/libvirt/images/my.img
/boot/my.img
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /var/lib/libvirt/images/my.img
Command failed
Using the canonical path the error is immediately obvious:
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /boot/my.img
Command failed
Grouping is also useful in combination with --segments: creating a
group allows both individual segment data and data for the device
as a whole to be presented in the same report.
Support grouping for 'create --segments' in the same manner as for
'create --filemap'; group regions by default, applying an optional
alias specified with --alias, unless the user specifies --nogroup.
Support aggregate group and region histograms by allocating a new
histogram from the pool and populating it with a sum of the histogram
data for the areas contained in the region or group.
To avoid repeatedly summing the same histogram data, cache the pointer
in the group and regions structs for subsequent access. The aggregate
histograms are allocated from the same pool as the area histograms in
the corresponding handle and will be discarded at each list or populate
operation.
Add a new option to the create command to create regions that map the
extents of a file:
# dmstats create --filemap /path/to/file
/path/to/file: Created new group with 10 region(s) as group ID 0.
When performing a --filemap no device argument is required (and
supplying one results in error) since the device to bind to is implied
by the file path and is obtained directly from an fstat().
Grouping may be optionally disabled by the --nogroup switch: in this
case the command will report each region individually:
# dmstats create --nogroup --filemap /path/to/file
/path/to/file: Created new region with 1 area as region ID 0.
/path/to/file: Created new region with 1 area as region ID 1.
/path/to/file: Created new region with 1 area as region ID 2.
When grouping regions the group alias is automatically set to the
basename (as returned by dm_basename()) of the provided file.
This can be overridden to a user-defined value at the command line by
use of the --alias option.
If grouping is disabled no alias can be set.
Use of offset and subdivision options (--start, --length, --segments,
--areas, --areasize).
Setting aux_data and histograms for groups is possible but is not
currently implemented.
Add a call to create dmstats regions that correspond to the extents
present in a file descriptor open on a file in a local file system.
The file must reside on a file system type that correctly supports
physical extent location data in the FIEMAP ioctl.
Regions are optionally placed into a group with a user-defined alias.
File systems that do not support physical offsets in FIEMAP (btrfs
currently) are detected via fstatfs() - although attempting to map
a --filemap group on btrfs will fail anyway with the generic error
"Not on a device-mapper device" this is confusing; the file system
mount is on a device-mapper device, but btrfs' volume layer masks
this in the returned st_dev field since the returned logical file
extents may span multiple physical devices.
The function _stats_remove_region_id_from_group() incorecctly set
the group_id to DM_STATS_GROUP_NOT_PRESENT _before_ the call to
_stats_group_destroy(). This will cause the destroy function to
return immediately without doing anything:
339 static void _stats_group_destroy(struct dm_stats_group *group)
340 {
341 if (!_stats_group_present(group))
342 return;
Invalidating the ID in _stats_region_region_id_from_group() is
redundant anyway; it is rightly done as the last operation in
_stats_group_destroy (and it is not possible for anything to see
the old value between the two calls).
Remove the change to group_id to ensure that the alias and bitset
resources are correctly freed.
The call to dm_stats_walk_start() before the do statement makes
dm_stats_walk_do() behave inconsistently depending on context;
wrap them in an additional do { } while (0) so that the macro
always expands to a valid statement.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/RaidLV" would fail, but
"lvconvert --mirrors 0 VG/RaidLV" would succeed.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/MirrorLV" would fail, but
"lvconvert --mirrors 0 VG/MirrorLV" would succeed.
If after extracting stats arguments and group tags nothing remains
of aux_data but '-' set the region->aux_data field to the empty
string to match behaviour for non-grouped regions.
Although not harmful do not allow a group containing regions with
histograms since it is not currently possible to present histogram
data aggregated for the group.
Although a non-zero value for the number of ticks spent doing IO
should imply a non-zero number of IOs in the interval test for
this explicitly to avoid a divide-by-zero in the event of bad
counter data.
It's possible for interval_ns to be zero if the interval is not
set or the clock is misconfigured. Test for this before using the
value as the divisor in the utilisation calculation.
Walk flags are ULL constants; cast the result to a uint64_t before
logging with a FMTx64 format specifier to avoid a compiler warning:
warning: format ‘%lx’ expects argument of type ‘long unsigned int’,
but argument 5 has type ‘long long unsigned int’
Make it clear that the "aux data" presented in reports is the user
data stored in the field (and does not include any library-internal
state such as group descriptors) by renaming the field to user_data
and changing the heading to "UserData".
Make it clear in libdevmapper.h, and in function argument names, that
libdm-stats uses the aux_data field internally and that any values set
for user_data are appended to the library values before being stored
with a region, and similarly, that internal data fields will be stripped
prior to returning any previously stored user_data.
Replace --statstype=area,region,group with a separate switch for
each object type: --area, --region, --group. Omitting any object
type switch will use the defaults for the current command (regions
and groups for list, and regions, groups and areas for verbose list).
Replace the 'name' field with 'statsname' in order to report alias
names for groups, and include the 'group_id' field between statsname
and the 'region_id' field to make it clear to the user when groups
are in use.
Walk avaiable groups and regions (in addition to areas) and report
aggregate statistics and properties.
A new switch is added to filter the type of obects inclued in the
report:
--statstype={all,area,region,group}
The type of the current row is also available in a new
DR_STATS_META field 'type'.
To allow the names used to describe statistics report objects to
change (for e.g to support groups and region and group aliases)
introduce a new "stats_name" field that evaluates to the correct
name for the object being reported.
Add a pair of commands to create and delete stats groups:
dmstats group --regions REGIONS
dmstats ungroup --groupid ID
REGIONS specifies a list of regions to be included in the group.
Regions are specified as a comma separated list in order of
increasing region ID. Ranges may be specified as a hypen separated
pair of values giving the first and last member of the range.
Add support do dm_stats_walk*() to walk over the set of
available groups using the cursor embedded in the dm_stats
handle, and to obtain the type of the object at the current
stats cursor location. A set of flags is introduced to
control which objects are visited:
DM_STATS_WALK_AREA
DM_STATS_WALK_REGION
DM_STATS_WALK_GROUP
DM_STATS_WALK_ALL
A final flag suppresses visits to regions that contain only a
single area - since the aggregate of such a region is idential
to the area it contains this allows these duplicates to be
filtered out:
DM_STATS_WALK_SKIP_SINGLE_AREA
If flags are not initialised before beginning a walk the default
set matches the behaviour of previous versions of the library.
Also accept group identifiers as immediate arguments to the
counter, metric, and property functions by adding control
flags to the region and area identifiers passed in.
Region and area properties are mapped to their equivalents for
the group (for example: group size is reported as the sum of
all regions contained in the group). Counter and metric values
are aggregated for the region or group.
Introduce constants for the buffer sizes that libdm-stats uses:
one for messages sent to the kernel, one for rows of response data
returned, and a pair for the "start+len" range and histogram bounds
strings.
Add a grouping facility to the libdm-stats library that allows the
user to bind several regions together as a group. Groups may be
used to aggregate data from several regions for reporting, or to
select and sort among large sets of regions.
A textual descriptor ("group tag") is associated with each group
and is stored in the first group member's aux_data field. The
tag contains the group member list and an optional alias for the
group, allowing the user to assign meaningful names to groups of
regions.
These descriptors are parsed in @stats_list message responses and
populate the resulting region and area tables with the group
structure.
Groups with overlapping regions are permitted but since this will
result in some events being counted more than once a warning is
printed in this case.
Nested and overlapping groups are not currently supported and
attempting to create these configurations results in error.
Add a new enum based interface for accessing counter and metric
values that uses a single function for each:
uint64_t dm_stats_get_counter(const struct dm_stats *dms,
dm_stats_counter_t counter
uint64_t region_id, uint64_t area_id);
int dm_stats_get_metric(const struct dm_stats *dms, int metric,
uint64_t region_id, uint64_t area_id,
double *value);
This simplifies the implementation of value aggregation for
groups of regions. The named function interface now calls the
enum interface internally so that all new functionality is
available regardless of the method used to retrieve values.
Cache the device-mapper name of a bound device in the dm_stats
handle.
This will be used by stats groups to report a device name or
user defined alias for groups.
The device-mapper name, device numbers and uuid stored in the
dm_stats handle are used only to bind the handle to a specific
device in order to issue ioctls.
Rename them to "bind_*" to reflect this usage in preparation
for caching the device-mapper name of the bound device in the
dm_stats handle.
This will be used to allow optional aliases to be set for
dmstats groups.
Add a function to parse a list of integer values and ranges into
a dm_bitset representation. Individual values signify that that bit
is set in the resulting mask and ranges are given as a pair of
start and end values, M-N, such that M and N are the first and
last members of the range (inclusive).
The implementation is based on the kernel's __bitmap_parselist()
that is used for cpumasks and other set configuration passed in
string form from user space.
TODO: it might be better to log dmeventd messages with test output
just like we do with clvmd - maybe we will switch to this one
instead of extra DMEVENTD log file in future....
Add config for mkfs to get more predicatable results
when using mkfs across variety of distributions.
In future maybe use this per all tests as default.
For now user has to specify in a test MKE2FS_CONFIG envvar to use it.
When force removing thin-pool we loose 'real' access to hidden device,
and if such pool is in suspended state, any thin volume cannot be
dropped. It likely should be also checked by dmsetup, but meanwhile
apply simple logic - try to force remove first all higher minors first
with assumption we first create thin-pool and then thin volume
and there are usually not being released lower dm numbers to
get the order wrong.
With a single report (--count=1) no timerfd is set up and the cycle
and current timestamps should be freed during the single call to
_update_interval_times().
This patch fixes link validation for used thin-pool.
Udev rules correctly creates symlinks only for unused new thin-pool.
Such thin-pool can be used by foreing apps (like Docker) thus
has /dev/vg/lv link.
However when thin-pool becomes used by thinLV - this link is no
longer exposed to user - but internal verfication missed this
and caused messages like this to be printed upon 'vgchange -ay':
The link /dev/vg/pool should have been created by udev but it was not
found. Falling back to direct link creation.
And same with 'vgchange -an':
The link /dev/vg/pool should have been removed by udev but it is still
present. Falling back to direct link removal.
This patch ensures only unused thin-pool has this link.
Run umount code only when either thin data or metadata are
above 95% - so if there are resize failures with 60%.
system fill keep running.
Also umount will only be tried with lvm2 LVs.
Foreign users are ATM unsuppored.
Add new logic to identify each unique operation and route
it to the correct function to perform it. The functions
that perform the conversions remain unchanged.
This new code checks every allowed combination of LV type
and requested operation, and for each valid combination
calls the function that performs that conversion.
The first stage of option validation which checks for
incompatible combinations of command line options, is done
done before process_each is called. This is unchanged.
(This new code will allow that first stage validation to
be simplified in a future commit.)
The second stage of checking options against the specific
LV type is done by this new code. For each valid combination
of operation + LV type, the new code calls an existing
function that implements it.
With this in place, the ad hoc checks for valid combinations
of LV types and operations can be removed from the existing
code in a future commit.
(The #if 0 is used to keep the patch clean, and the
disabled code will be removed by a following patch.)
There are detailed messages inside _create_dir_recursive that
dm_create_dir calls (except EROFS which where the message is not
generated, like anywhere else in the code).
When we test Vg.LvCreateRaid some of the hidden LVs volume type go from
'I' to 'i' between the time it takes us to create the LV and
the time it takes to call into refresh to verify the service is up to date.
This is a fairly rare occurance.
We call 'lvm help' to find out if fullreport is supported. Lvm
dumps help to stderr. Common code prints a warning if we exit
with 0, but have something in stderr so we are skipping the warning
message.
The following operations would hang if lvm was compiled with
'enable-notify-dbus' and the client specified -1 for the timeout:
* LV snapshot merge
* VG move
* LV move
This was caused because the implementation of these three dbus methods is
different. Most of the dbus method calls are executed by gathering information
needed to fulfill it, placing that information on a thread safe queue and
returning. The results later to be returned to the client with callbacks.
With this approach we can process an arbitrary number of commands without any
of them blocking other dbus commands. However, the 3 dbus methods listed
above did not utilize this functionality because they were implemented with a
separate thread that handles the fork & exec of lvm. This is done because these
operations can be very slow to complete. However, because of this the lvm
command that we were waiting on is trying to call back into the dbus service to
notify it that something changed. Because the code was blocking the process
that handles the incoming dbus activity the lvm command blocked. We were stuck
until the client timed-out the connection, which then causes the service to
unblock and continue. If the client did not have a timeout, we would have been
hung indefinitely.
The fix is to always utilize the worker queue on all dbus methods. We need to
ensure that lvm is tested with 'enable-notify-dbus' enabled and disabled.
Apply the same idea as vg_update.
Before doing the VG remove on disk, invalidate
the VG in lvmetad. After the VG is removed,
remove the VG in lvmetad. If the command fails
after removing the VG on disk, but before removing
the VG metadata from lvmetad, then a subsequent
command will see the INVALID flag and not use the
stale metadata from lvmetad.
Previously, a command sent lvmetad new VG metadata in vg_commit().
In vg_commit(), devices are suspended, so any memory allocation
done by the command while sending to lvmetad, or by lvmetad while
updating its cache could deadlock if memory reclaim was triggered.
Now lvmetad is updated in unlock_vg(), after devices are resumed.
The new method for updating VG metadata in lvmetad is in two phases:
1. In vg_write(), before devices are suspended, the command sends
lvmetad a short message ("set_vg_info") telling it what the new
VG seqno will be. lvmetad sees that the seqno is newer than
the seqno of its cached VG, so it sets the INVALID flag for the
cached VG. If sending the message to lvmetad fails, the command
fails before the metadata is committed and the change is not made.
If sending the message succeeds, vg_commit() is called.
2. In unlock_vg(), after devices are resumed, the command sends
lvmetad the standard vg_update message with the new metadata.
lvmetad sees that the seqno in the new metadata matches the
seqno it saved from set_vg_info, and knows it has the latest
copy, so it clears the INVALID flag for the cached VG.
If a command fails between 1 and 2 (after committing the VG on disk,
but before sending lvmetad the new metadata), the cached VG retains
the INVALID flag in lvmetad. A subsequent command will read the
cached VG from lvmetad, see the INVALID flag, ignore the cached
copy, read the VG from disk instead, update the lvmetad copy
with the latest copy from disk, (this clears the INVALID flag
in lvmetad), and use the correct VG metadata for the command.
(This INVALID mechanism already existed for use by lvmlockd.)
This fixes commit 0ba5f4b8e9 which moved
field recalculation (field width and sort position) from
dm_report_object to dm_report_output but it didn't handle the case when
dm_report_column_headings was used separately to report headings (before
dm_report_outpout call) and hence we ended up with intial widths for
fields in the headings.
If we're using dm_report_column_headings, we need to recalculate
fields if we haven't done so yet, the same way as we do in
dm_report_output.
Simplify code around _do_get_report_selection - remove "expected_idxs[]"
argument which is superfluous and add "allow_single" switch instead to
allow for recognition of "--configreport <report_name> -S" as well as
single "-S" if needed.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 961 in _do_report_get_selection()
Null pointer dereferences (FORWARD_NULL) Dereferencing null pointer "single_args".
Uninitialized variables (UNINIT) /safe/guest2/covscan/LVM2.2.02.158/tools/toollib.c: 3520 in _process_pvs_in_vgs()
Uninitialized variables (UNINIT) Using uninitialized value "do_report_ret_code".
Null pointer dereferences (REVERSE_INULL) /safe/guest2/covscan/LVM2.2.02.158/libdm/libdm-report.c: 4745 in dm_report_output()
Null pointer dereferences (REVERSE_INULL) Null-checking "rh" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
Incorrect expression (MISSING_COMMA) /safe/guest2/covscan/LVM2.2.02.158/lib/log/log.c: 280 in _get_log_level_name()
Incorrect expression (MISSING_COMMA) In the initialization of "log_level_names", a suspicious concatenated string ""noticeinfo"" is produced.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 816 in_get_report_options()
Null pointer dereferences (FORWARD_NULL) Comparing "mem" to null implies that "mem" might be null.
This reverts commit fa69ed0bc8.
This code sometimes expects to be presented with a read-only filesystem
(during some boot sequences for example) and copes appropriately with
this and it should not lead to expected error messages that might cause
unnecessary alarm.
lv_name arg is only used without known LV for resolving '*lv'.
Once we know *lv, never use lv_name ever again.
So setting it when passing *lv has not needed.
Add code to support more LVs to be resized through a same code path
using a single lvresize_params struct.
(Now it's used for thin-pool metadata resize,
next user will be snapshot virtual resize).
Update code to adjust percent amount resize for use_policies.
Properly activate inactive thin-pool in case of any pool resize
as the command should not 'deffer' this operation to next activation.
Use common API design and pass just LV pointer to lv_manip.c functions.
Read cmd struct via lv->vg->cmd when needed.
Also do not try to return EINVALID_CMD_LINE error when we
have already openned VG - this error code can only be returned before
locking VG.
We have only 2 users of _lv_active() - one was already checking for ==1
while the other use (_lv_is_active()) could have take '-1' as a sign of having
an LV active. So return 0 and log_debug also the reason while detection
has failed (i.e. in case --driverload n - it's kind of expectable,
but might have confused user seeing just <backtrace>).
This fixes commit f50d4011cd which
introduced a problem when using older lvm2 code with newer libdm.
In this case, the old LVM didn't recognize new _LOG_BYPASS_REPORT flag
that libdm-report code used. This ended up with no output at all
from libdm where log_print_bypass_report was called because the
_LOG_BYPASS_REPORT was not masked properly in lvm2's print_log fn
which was called as callback function for logging.
With this patch, the lvm2 registers separate print_log_libdm logging
function for libdm instead. The print_log_libdm is exactly the same
as print_log (used throughout lvm2 code) but it checks whether we're
printing common line on output where "common" means not going to stderr,
not a warning and not an error and if we are, it adds the
_LOG_BYPASS_REPORT flag so the log_print goes directly to output, not
to any log report.
So this achieves the same goal as in f50d4011cd,
just doing it in a way that newer libdm is still compatible with older
lvm2 code (libdm-report is the only code using log_print).
Looking at the opposite mixture - older libdm with newer lvm2 code,
that won't be compilable because the new log report functionality
that is in lvm2 also requires new dm_report_group_* libdm functions
so we don't need to care here.
Move code from original print_log fn to a separate _vprint_log function
that accepts va_list and make print_log a wrapper over _vprint_log.
The print_log just initializes the va_list and uses it for _vprint_log
call now. This way, we can reuse _vprint_log if needed.
In commit 6ae22125, vgcfgrestore began disabling lvmetad
while running, and rescanned to enable it again at the end,
but missed the rescanning/enabling in the error case.
Reconnect to lvmetad if either the send fails (e.g. lvmetad
was restarted since lvmlockd last connected), or if no
lvmetad connection exists (e.g. lvmetad was started after
lvmlockd so no previous connection existed.)
Currently, a shutdown signal will cause lvmetad to quit
responding to new connections, but not actually exit until
all connections are gone. If a program is maintaining a
long running connection (e.g. lvmlockd, or even an lvm
command) when lvmetad gets a shutdown signal, then all
further commands will hang indefinately waiting for a
response that won't be sent.
With this patch, make lvmetad continue handling new
connections even after a shutdown signal. It will exit
once all connections are gone.
Previously, vgcfgrestore would attempt to vg_remove the
existing VG from lvmetad and then vg_update to add the
restored VG. But, if there was a failure in the command
or with vg_update, the lvmetad cache would be left incorrect.
Now, disable lvmetad before the restore begins, and then
rescan to populate lvmetad from disk after restore has
written the new VG to disk.
Reporting commands can be of different types (even if the command name
is the same):
- pvs command can be either of PVS, PVSEGS or LABEL report type,
- vgs command is of VGS report type,
- lvs command is of LVS or SEGS report type.
Use basic report type when looking for report prefix used for
--configreport option.
This means that:
- 'pvs --configreport pv' applies to PVS, PVSEGS or LABEL report type
- 'vgs --configreport vg' applies to VGS report type
- 'lvs --configreport lv' applies to LVS and SEGS report type
The DM_REPORT_OUTPUT_MULTIPLE_TIMES instructs reporting code to
keep rows even after dm_report_output call - the rows are not
destroyed in this case which makes it possible to call dm_report_output
multiple times.
This allows for moving parts of the code from dm_report_object to
dm_report_output which is important for subsequent patches that allow
for repeated dm_report_output, not destroying rows on each
dm_report_output call.
log_print is used during cmd line processing to log the result of the
operation (e.g. "Volume group vg successfully changed" and similar).
We don't want output from log_print to be interleaved with current
reports from group where log is reported as well. Also, the information
printed by log_print belongs to the log report too, so it should be
rerouted to log report if it's set.
Since the code in libdm-report which is responsible for doing the report
output uses log_print too, we need to use a different kind of log_print
which bypasses any log report currently used for logging (...simply,
we can't call log_print to output the log report itself which in turn
would again reroute to report - the report would never get on output
this way).
This patch adds structures and functions to reroute error and warning
logs to log report, if it's set.
There are 5 new functions:
- log_set_report
Set log report where logging will be rerouted.
- log_set_report_context
Set context globally so any report_cmdlog call will use it.
- log_set_report_object_type
Set object type globally so any report_cmdlog call will use it.
- log_set_report_object_name_and_id
Set object ID and name globally so any report_cmdlog call will use it.
- log_set_report_object_group_and_group_id
Set object group ID and name globally so any report_cmdlog call will use it.
These functions will be called during LVM command processing so any logs
which are rerouted to log report contain proper information about current
processing state.
The lvm fullreport works per VG and as such, the vg, lv, pv, seg and
pvseg subreport is done for each VG. However, if the PV is not part of
any VG yet, we still want to display pv and pvseg subreports for these
"orphan" PVs - so enable this for lvm fullreport's process_each_vg call.
If we have fullreport, make sure that the options/sort keys used for
each report doesn't change its type - we want to preserve the original
type so it's always 5 different subreports within fullreport (vg, lv, pv,
seg, pvseg). Since we have all report types within fullreport, users
should add fields under proper subreport type - this minimizes
duplication of info displayed on output.
Groupable args (the ones marked with ARG_GROUPABLE flag) start a new
group of args if:
- this is the first time we hit such a groupable arg,
- or if non-countable arg is repeated.
However, there may be cases where we want to give priorities when
forming groups and hence force new group creation if we hit an arg
with higher grouping priority.
For example, let's assume (for now) hypothetical sequence of args used:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
Without giving any priorites, we end up with:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\__________GROUP1___________/ \________GROUP2___________/ \_GROUP3_/
This is because we hit "-o" as the first groupable arg. The --configreport,
even though it's groupable too, it falls into the previous "-o" group.
While we may need to give priority to the --configreport arg that should
always start a new group in this scenario instead:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\_GROUP1_/ \_________GROUP2___________/ \_________GROUP3__________/
So here "-o" started a new group but since "--configreport" has higher
priority than "-o", it starts fresh new group now and hence the rest of
the command line's args are grouped by --configreport now.
lvm fullreport executes 5 subreports (vg, pv, lv, pvseg, seg) per each VG
(and so taking one VG lock each time) within one command which makes it
easier to produce full report about LVM entities.
Since all 5 subreports for a VG are done under a VG lock, the output is
more consistent mainly in cases where LVM entities may be changed in
parallel.
Add any report (pvs/vgs/lvs) currently processed to current report group
which is part of processing handle and which already contains log
report. This way both log report and pvs/vgs/lvs report will be
reported as a whole within a group, thus having same output format as
selected by --reportformat option.
If there's parent processing handle, we don't need to create completely
new report group and status report - we'll just reuse the one already
initialized for the parent.
Currently, the situation where this matter is when doing internal report
to do the selection for processing commands where we have parent processing
handle for the command itself and processing handle for the selection
part (that is selection for non-reporting tools).
Wire up report group creation with log report in struct
processing_handle and call report_format_init during processing handle
initialization (init_processing_handle fn) and destroy it while
destroing processing handle (destroy_processing_handle fn).
This way, all the LVM command processing using processing handle
has access to log report via which the current command log
can be reported as items are processed.
Separating common report and per-report arguments prepares the code for
handling several reports per one command (for example, the command log
report and LVM command report itself).
Each report can have sort keys, options (fields), list of fields to
compact and selection criteria set individually. Hooks for setting these
per report within one command will be a part of subsequent patches, this
patch only separates new struct single_report_args out of existing
struct report_args.
New report/output_format configuration sets the output format used
for all LVM commands globally. Currently, there are 2 formats
recognized:
- basic (the classical basic output with columns and rows, used by default)
- json (output is in json format)
Add new --reportformat option and new report_format_init function that
checks this option and creates new report group accordingly, also
preparing log report handle and adding it to the report group just
created.
This is a preparation for new CMDLOG report type which is going to be
used for reporting LVM command log.
The new report type introduces several new fields (log_seq_num, log_type,
log_context, log_object_type, log_object_group, log_object_id, object_name,
log_message, log_errno, log_ret_code) as well as new configuration settings
to set this report type (report/command_log_sort and report/command_log_cols
lvm.conf settings).
This patch also introduces internal report_cmdlog helper function
which is a wrapper over dm_report_object to report command log via
CMDLOG report type and which is going to be used throughout the code
to report the log items.
This patch introduces DM_REPORT_GROUP_JSON report group type. When using
this group type and when pushing a report to such a group, these flags
are automatically unset:
DM_REPORT_OUTPUT_ALIGNED
DM_REPORT_OUTPUT_HEADINGS
DM_REPORT_OUTPUT_COLUMNS_AS_ROWS
...and this flag is set:
DM_REPORT_OUTPUT_BUFFERED
The whole group is encapsulated in { } for the outermost JSON object
and then each report is reported on output as array of objects where
each object is the row from report:
{
"report_name1": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
],
"report_name2": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
]
...
}
This patch introduces DM_REPORT_GROUP_BASIC report group type. This
type has exactly the classical output format as we know from before
introduction of report groups. However, in addition to that, it allows
to put several reports into a group - this is the very basic grouping
scheme that doesn't change the output format itself:
Report: report1_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
Report: report2_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
There's no change in output for this report group type - with this type,
we only make sure there's always only one report in a group at a time,
not more.
This patch introduces DM report group (represented by dm_report_group
structure) that is used to group several reports to make a whole. As a
whole, all the reports in the group follow the same settings and/or
formatting used on output and it controls that the output is properly
ordered (e.g. the output from different reports is not interleaved
which would break readability and/or syntax of target output format
used for the whole group).
To support this feature, there are 4 new functions:
- dm_report_group_create
- dm_report_group_push
- dm_report_group_pop
- dm_report_group_destroy
From the naming used (dm_report_group_push/pop), it's clear the reports
are pushed onto a stack. The rule then is that only the report on top
of the stack can be reported (that means calling dm_report_output).
This way we make sure that the output is not interleaved and provides
determinism and control over the output.
Different formats may allow or disallow some of the existing report
flags controlling output itself (DM_REPORT_OUTPUT_*) to be set or not so
once the report is pushed to a group, the grouping code makes sure that
all the reports have compatible flags set and then these flags are
restored once each report is popped from the report group stack.
We also allow to push/pop non-report item in which case such an item
creates a structure (e.g. to put several reports together with any
opening and/or closing lines needed on output which pose as extra
formatting structure besides formatting the reports).
The dm_report_group_push function accepts an argument to pass any
format-specific data needed (e.g. handle, name, structures passed
along while working with reports...).
We can call dm_report_output directly anytime we need (with the only
restriction that we can call dm_report_output only for the report that
is currently on top of the group's stack). Or we don't need to call
dm_report_output explicitly in which case all the reports in a stack are
reported on output automatically once we call dm_report_group_destroy.
This fixes a problem in commit ae0a8740c. The problem
in that commit was that all existing PVs are initially
dropped from lvmetad. This works if the VG is updated
at the end, which replaces the dropped PVs, but if the
rescan finds that the VG seqno is unchanged, it leaves
the cached VG in place. So, we should only drop the
existing PVs in lvmetad when the VG is going to be updated.
commit 15da467b was meant to address the case where
use_lvmetad=1 in lvm.conf, and lvmetad is not available,
in which case, pvscan --cache -aay should activate LVs.
But the commit unintentionally also changed the case
where use_lvmetad=0 in lvm.conf, in which case
pvscan --cache -aay should not activate LVs, so fix
that here.
When pvscan --cache -aay fails to connect to lvmetad it will
simply exit and do nothing. Change this so that it will
skip the lvmetad cache step and do the activation step from
disk.
We were initially looking to see if an LV was hidden and if it was we were
creating an instance of a LvCommon object to represent it. Thus if we
had a hidden cache pool for example we were missing the methods and
properties for the cache pool. However, when we create the object path,
any hidden LVs, regardless of type/functionality will be placed in the
hidden path.
The object manager method get_object_by_lvm_id was used in many cases for
the sole reason of getting the object path for the object. Instead of
retrieving the object and then calling 'dbus_object_path' on the object, we
are adding a method which returns the object path.
When we are processing the LVs we need to build up dbus objects from least
dependent to most dependent, so that we have information available when
constructing.
Original code missed to catch all apperances of SIGINT.
Also enhance logging when running in shell without tty.
Accept this regex as valid input:
'^[ ^t]*([Yy]([Ee]([Ss]|)|)|[Nn]([Oo]|))[ ^t]*$'
Some commands scan labels to populate lvmcache multiple
times, i.e. lvmcache_init, scan labels to fill lvmcache,
lvmcache_destroy, then later repeat
Each time labels are scanned, duplicates are detected,
and preferred devices are chosen. Each time this is done
within a single command, we want to choose the same
preferred devices. So, check for existing preferences
when choosing preferred devices.
This also fixes a problem with the list of unused duplicate
devs when run in an lvm shell. The devs had been allocated
from cmd memory, resulting in invalid list entries between
commands.
A number of places are working on a specific dev when they
call lvmcache_info_from_pvid() to look up an info struct
based on a pvid. In those cases, pass the dev being used
to lvmcache_info_from_pvid(). When a dev is specified,
lvmcache_info_from_pvid() will verify that the cached
info it's using matches the dev being processed before
returning the info. Calling code will not mistakenly
get info for the wrong dev when duplicate devs exist.
This confusion was happening when scanning labels when
duplicate devs existed. label_read for the first dev
would add an info struct to lvmcache for that dev/pvid.
label_read for the second dev would see the pvid in
lvmcache from first dev, and mistakenly conclude that
the label_read from the second dev can be skipped
because it's already been done. By verifying that the
dev for the cached pvid matches the dev being read,
this mismatch is avoided and the label is actually read
from the second duplicate.
If a command gets stuck during an lvmetad update, lvmetad
will cancel that update after the timeout. The next command
to check the lvmetad will see that lvmetad needs to be
populated because lvmetad will return token of "none" after
a timed out update (same as when lvmetad is not populated
at all after starting.)
If a command gets an error during an lvmetad update, it
will now just quit and leave its updating token in place.
That update will be cancelled after the timeout.
Commit #5b3a4a9 caused the "name" variable to be cleared if
declaration and assignment is on two lines so put it back
so it's on one line for it to work again.
pvmove began processing tags unintentionally from commit,
6d7dc87cb pvmove: use toollib
pvmove works on a single PV, but tags can match multiple PVs.
If we allowed tags, but processed only the first matching PV,
then the resulting PV would be unpredictable.
Also, the current processing code does not allow us to simply
report an error and do nothing if more than one PV matches the tag,
because the command starts processing PVs as they are found,
so it's too late to do nothing if a second PV matches.
If configuration consists of several sources in config cascade
("config cascade" defined in man lvmconfig(8)), lvmconfig displayed
only difference from defaults of the topmost config in the cascade.
Fix lvmconfig to display complete difference, considering all
the configuration in the cascade.
For example, before this patch:
(use_lvmetad=0 set in lvm.conf which differs from defaults)
$ lvmconfig --type diff
global {
use_lvmetad=0
}
(compact_output=1 set on cmd line)
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
(headings=0 set in profile)
$ lvmconfig --type diff --commandprofile test
report {
headings=0
}
(difference in topmost configuration source is displayed)
$ lvmconfig --type diff --commandprofile test --config report/compact_output=1
report {
compact_output=1
}
With this patch applied (the config cascade is merged before looking for
difference from defaults in configuration):
$ lvmconfig --type diff
global {
use_lvmetad=0
}
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test
report {
headings=0
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test --config report/compact_output=1
report {
headings=0
compact_output=1
}
global {
use_lvmetad=0
}
Treat loop device created with 'losetup -P' as regular
partitioned device - so if it has partition table,
prevent its usage in commands like 'pvcreate'.
Before 'pvcreate /dev/loop0' could have erased and formated as PV,
after this patch, device is filtered out and cannot be used.
All the variables for sscanf in lvmlockctl.c and lvmlockd-sanlock.c are
zeroed before sscanf call so the failure is detected by seeing the zero
value instead of proper one in subsequent code - so use (void) for
sscanf calls to ignore return value here.
Before this fix, when reporting 'lvm devtypes', the report was
initialized with incorrect reserved values - the ones used for
pvs/vgs/lvs report were used instead of NULL value (because devtypes
doesn't have any reserved values).
For example, trying to (incorrectly) use lv_name for the -S|--select
with lvm devtypes which doesn't have this field at all:
Before this patch (internal error issued):
$ lvm devtypes -S 'lv_name=lvol0'
Internal error: _check_reserved_values_supported: field-specific reserved value of type 0x0 for field not supported
Internal error: dm_report_init_with_selection: trying to register unsupported reserved value type, skipping report selection
DevType MaxParts Description
aoe 16 ATA over Ethernet
ataraid 16 ATA Raid
bcache 1 bcache block device cache
...
With this patch applied (correct error displayed about
unrecognized selection field):
$ lvm devtypes -S 'lv_name=lvol0'
Device Types Fields
-------------------
devtype_name - Name of Device Type exactly as it appears in /proc/devices. [string]
devtype_max_partitions - Maximum number of partitions. (How many device minor numbers get reserved for each device.) [number]
devtype_description - Description of Device Type. [string]
Special Fields
--------------
selected - Set if item passes selection criteria. [number]
help - Show help. [unselectable number]
? - Show help. [unselectable number]
Unrecognised selection field: lv_name
Selection syntax error at 'lv_name=lvol0'.
Use 'help' for selection to get more help.
Convert fields into using a single status ioctl call per LV.
This is a bit tricky since when there are more complicated
stacks, at this moment its undefined which values should be shown.
It's clear we need to cache more then single ioctl per LV,
but also we need to define more explicitely relation between
reported values for snapshots.
This patch is not a final state, rather a transitional step.
It should not be giving more 'worst' values then previous
many-ioctl-calls-per-lv solution.
Add function to obtain percentage value for cache lv_seg_status.
This API is rather evolving 'middle' step as the ultimate goal
is segment API fuctionality.
But first we need to be clear at reporting level which values
are needed to be reported for which LVs and segments.
Add more code to properly store status for snapshot segment
maintaining lvm2 fiction of COW and snapshot internal volumes.
The key issue here is however not though-through reporting
logic - as there is no single answer for whole line state.
It not counting with layer and we may need few more ioctl to
cover all reporting needs depending upon what is actually
needed.
In reality we need to 'cache' more ioctl status queries for
individual LVs and their segments (so they checked at most once).
The other 'hard' topic for conversion is mirror segment handling.
Also we definitelly need to relocate some logic into segment's methods,
yet it might be complex as we have not clear border between targets.
TODO: define more clearly how are reporting fields defined in case
we 'stack' volumes like - cache of stacked thin LV snapshot origin.
lv_refresh_suspend_resume() has escaped with fail ret code
after failing suspend and could have left many volumes in suspend state.
So always unconditionally call resume also when suspend has failed.
To get better control when flushing is used add extra arg when
setting up dm task.
By default now check dm device status without flush.
(At this moment this should effect only thin and cache volumes).
Also switch dev_manager_thin_pool_status() to use more
readable 'flush' parameter instead of 'no_flush'.
Check first the LV is cow before even checking it's a merging COW.
Note: previosly merging_cow was also merging origin, so without
this explicit check it used to return '1' also when passed
LV has been merging origin.
When mirror/raid called copy_percent function to return,
when 100% was supposed to be returned, wrong float 100.0 value
could have been reported back instead of dm_percent_t DM_PERCENT_100.
There is broken API somewhere, since the function here rely on
actively being modifid VG content even when doing 'lvs' operation.
(extents_copies)
This reverts commit 8fd886f735.
This was a deliberate omission because logging token-by-token metadata
parsing greatly increases the amount of logging for hardly any benefit.
In general, only LVM config file settings need to be logged, and in
places where it's considered important to log particular elements of
metadata that should be done using specific log_* lines.
This area can be revisited.
In the same way that process_each_vg() can be passed
a single VG name to process, also allow process_each_lv()
to be passed a single VG name and LV name to process.
This refactors the code for autoactivation. Previously,
as each PV was found, it would be sent to lvmetad, and
the VG would be autoactivated using a non-standard VG
processing function (the "activation_handler") called via
a function pointer from within the lvmetad notification path.
Now, any scanning that the command needs to do (scanning
only the named device args, or scanning all devices when
there are no args), is done first, before any activation
is attempted. During the scans, the VG names are saved.
After scanning is complete, process_each_vg is used to do
autoactivation of the saved VG names. This makes pvscan
activation much more similar to activation done with
vgchange or lvchange.
The separate autoactivate phase also means that if lvmetad
is disabled (either before or during the scan), the command
can continue with the activation step by simply not using
lvmetad and reverting to disk scanning to do the
activation.
Add support for active cache LV.
Handle --cachemode args validation during command line processing.
Rework some lvm2 internal to use lvm2 defined CACHE_MODE enums
indepently on libdm defines and use enum around the code instead
of passing and comparing strings.
Don't use lvm_init() to create a full command context, which
does a lot of command setup (like connecting to daemons), which
is unnecessary for simply reading a value from lvm.conf.
Passing a NULL context arg to the lvm_config_ function is now
allowed, in which case lvm.conf is read without doing lvm
command setup.
A program may be using liblvm2app for simply checking a config
setting in lvm.conf. In this case, a full lvm context is not
needed, only cmd->cft (which are the config settings read from
lvm.conf).
lvm_config_find_bool() can now be passed a NULL lvm context
in which case it will only create cmd->cft, check the config
setting asked for, and destroy the cmd.
Only call lvm_init() when it's needed so that simply
loading the lvm python code in another program doesn't
make that program do lvm initialization.
The version call doesn't need a handle.
The garbage collection can just do lvm_quit to destroy
the command. The next call that needs lvm_init will
do it first.
When setting up a toolcontext, the lib init function
was detecting an error when there was none, and then
it was returning an incompletely initialized cmd struct
instead of NULL. The effect was that the lib would try
to use the uninitialized cmd struct and segfault.
This would happen if a non-fatal error occurred during
cmd setup, e.g. user permission failed on lvmetad socket,
causing cmd to fall back to scanning and not use lvmetad.
The only real error condition is when create_toolcontext
returns NULL. If cmd is returned, the lib can use it.
The _report fn is getting big - separate it in two:
- _report fn to get all the options and arguments
- _do_report fn for reporting itself
Also, place all the variables/arguments in one structure for easier
handling of the variables around.
If a command begins repopulating the lvmetad cache,
and fails part way through, it should set the disabled
state in lvmetad so other commands don't use bad data.
If a subsequent scan succeeds, the disabled state is
cleared.
If duplicate devices exist for a PV, and one device's
size matches the PV size, but the other doesn't, then
prefer the matching device.
If one device is used by an active LV, prefer that device.
When there are duplicate devices for a PV, one device
is preferred and chosen to exist in the VG. The other
devices are not used by lvm, but are displayed by pvs
with a new PV attr "d", indicating that they are
unchosen duplicate PVs.
The "duplicate" reporting field is set to "duplicate"
when the PV is an unchosen duplicate, and that field
is blank for the chosen PV.
Previously, duplicate PVs were processed as a side effect
of processing the "chosen" PV in lvmcache. The duplicate
PV would be hacked into lvmcache temporarily in place of
the chosen PV.
In the old way, we had to always process the "chosen" PV
device, even if a duplicate of it was named on the command
line. This meant we were processing a different device than
was asked for. This could be worked around by naming
multiple duplicate devs on the command line in which case
they were swapped in and out of lvmcache for processing.
Now, the duplicate devs are processed directly in their
own processing loop. This means we can remove the old
hacks related to processing dups as a side effect of
processing the chosen device. We can now simply process
the device that was named on the command line.
When the same PVID exists on two or more devices, one device
is preferred and used in the VG, and the others are duplicates
and are not used in the VG. The preferred device exists in
lvmcache as usual. The duplicates exist in a specical list
of unused duplicate devices.
The duplicate devs have the "d" attribute and the "duplicate"
reporting field displays "duplicate" for them.
'pvs' warns about duplicates, but the formal output only
includes the single preferred PV.
'pvs -a' has the same warnings, and the duplicate devs are
included in the output.
'pvs <path>' has the same warnings, and displays the named
device, whether it is preferred or a duplicate.
Wait to compare and choose alternate duplicate devices until
after all devices are scanned. During scanning, the first
duplicate dev is kept in lvmcache, and others are kept in a
new list (_found_duplicate_devs).
After all devices are scanned, compare all the duplicates
available for a given PVID and decide which is best.
If the dev used in lvmcache is changed, drop the old dev
from lvmcache entirely and rescan the replacement dev.
Previously the VG metadata from the old dev was kept in
lvmcache and only the dev was replaced.
A new config setting devices/allow_changes_with_duplicate_pvs
can be set to 0 which disallows modifying a VG or activating
LVs in it when the VG contains PVs with duplicate devices.
Set to 1 is the old behavior which allowed the VG to be
changed.
The logic for which of two devs is preferred has changed.
The primary goal is to choose a device that is currently
in use if the other isn't, e.g. by an active LV.
. prefer dev with fs mounted if the other doesn't, else
. prefer dev that is dm if the other isn't, else
. prefer dev in subsystem if the other isn't
If neither device is preferred by these rules, then don't
change devices in lvmcache, leaving the one that was found
first.
The previous logic for preferring a device was:
. prefer dev in subsystem if the other isn't, else
. prefer dev without holders if the other has holders, else
. prefer dev that is dm if the other isn't
When duplicate PVs are detected, set the disabled
flag so that commands will disable use of lvmetad.
This duplicate detection is done by lvmetad itself
when it's told about a single new PV with a PVID
that matches an existing PV on another device.
(This is different from the case where the command
is scanning all devices and detects the duplicate.)
Remove the "altdev" logic that attempted to keep
track of multiple devices for a single PV. It
is no longer used since lvmetad is disabled in
this case.
For raid1 use chunksize as bitmap-chunk specification.
Always enforce usage of bitmap - getting comparable outcome
as lvm2 raid support uses.
Add udev_wait after stopping md array - as in fact leg-device
are still in use by target even command has finished.
(mdadm --stop causes WATCH rule wakeup, and
ioctl(STOP_ARRAY) returns IMHO to early - it should finish
and fsync work on leg devices first).
Support parsing --chunksize option also when converting.
Now user can use cache pool created with i.e. 32K chunksize,
while in caching user can select 512K blocks.
Tool is supposed to validate cache metadata size is big enough
to support such chunk size. Otherwise error is shown.
When creating LV - in some case we change created segment type
(ATM for cache and snapshot) and we then manipulate with
lv segment according to 'lp' segtype.
Fix this by checking for proper type before accessing segment members.
This makes command like:
lvcreate --type cache-pool -L10 vg/cpool
lvcreate -H -L10 --cachesettings migtation_threshold=10000 vg/cpool
to pass since now tool correctly selects default cache policy.
Rather than doing repeated translations from name to
device when comparing args to existing PVs, do one
translation of the arg names and saving the device,
before checking existing PVs.
If there's an activation volume_filter, it might not be possible
to activate the rmeta LVs to wipe them. At least inherit any
LV tags from the parent LV while attempting this.
pvscan autoactivation has its own VG processing implementation,
so it can't properly handle things like foreign or shared VGs,
so make it ignore those VG types (or errors from them) as best
as possible.
Add a FIXME stating that pvscan autoactivation must really be
moved to the standard VG processing by calling process_each_vg
to do activation once the scanning / cache update is finished.
Checking for devices uses is_missing_pv() to check
if there is a device for the PV. is_missing_pv()
is based on the MISSING_PV flag, which does not
always correspond to !pv->dev. When using lvmetad,
a command like:
pvs --config 'devices/filter=["a|/dev/sdb|", "r|.*|"]'
will cause a number of PVs to have NULL pv->dev, but
not the MISSING_PV flag. So, NULL pv->dev needs to
also be checked.
Before executing modprobe for given module name, just check
if the module is not already present in /sys/module.
Useful when checking dm-cache-policy modules as we do not
having matching interface like for targets.
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Probably not worth mentioning "segments" here, just state that devices
for an LV can't be all found during the check - it's less mysterious for
user then:
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find all devices for LV fedora/root while checking used and assumed devices.
WARNING: Couldn't find all devices for LV fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
When checking assumed PVs against real devices used for LVs and if
there's no device assigned for an assumed PV (e.g. due to filters),
do log_warn instead of log_error and continue checking LV segments
and associated assumed PVs further, just like we do log_warn elsewhere
in this situation.
This way user will see the warning for each LV which couldn't be
checked completely against real PVs used. Before, we logged only
the very first occurence of missing device for an LV in a VG and we
returned from the function doing this check for all the LVs in VG
immediately which may be a bit misleading because it didn't tell
user about all the other LVs and whether they could be checked
or not.
For example, we have this setup:
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
/dev/vda2 fedora lvm2 a-- 19.49g 0
[0] fedora/~ # lvs -o+devices
LV VG Attr LSize Devices
root fedora -wi-ao---- 19.00g /dev/vda2(0)
swap fedora -wi-ao---- 500.00m /dev/vda2(4864)
Before this patch (only the very first LV in a VG is logged to have a
problem while checking used and assumed devices):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
With this patch applied (all LVs where we hit problem while checking
used and assumed devices are logged and it's warning, not error):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Check that @stats_list and @stats_print returned data in the
_stats_parse_list() and _stats_parse_region() functions before
attempting to operate on region and area values.
This avoids a coverity warning since fgets() could potentially
return no data from the memory buffer returned by the ioctl.
In both cases the ioctl would return an error, preventing these
functions from running, however it is cleaner to test for the
condition explicitly and fail in those cases.
If lvmetad is running, and a command opts to not use it
(--config global/use_lvmetad=0), and the command changes
metadata, then the metadata change is not visible to
lvmetad. Subsequent commands using lvmetad to change
metadata may cause corruption based on the invalid
lvmetad state.
Eventually we can set the disabled state in lvmetad
to prevent this problem, but for now print a warning
about the possibility.
When command is not using lvmetad because
use_lvmetad=0 in the config, but the lvmetad
pidfile exists, print a warning (previously
this checked for the socket existing instead
of the pidfile existing.)
vg/snapshotN should not appear anywhere.
No code should be showing this, but it was noticed in some logs last
week and we can deal with it in display_lvname().
When user requested on cmdline disabling of lvmetad/lvmpoll,
respect it and when lvmlockd requires these daemon,
Error configure with clear message about misconfiguration.
The lvmetad connection is created within the
init_connections() path during command startup,
rather than via the old lvmetad_active() check.
The old lvmetad_active() checks are replaced
with lvmetad_used() which is a simple check that
tests if the command is using/connected to lvmetad.
The old lvmetad_set_active(cmd, 0) calls, which
stopped the command from using lvmetad (to revert to
disk scanning), are replaced with lvmetad_make_unused(cmd).
The test was a weak attempt at verifying the special
combination of lvchange/vgchange -aay --sysinit, but
was only looking for lvmetad connection warnings.
Update the warning checks, and check the LV activation
state directly which is the main point.
Rename the test to reflect its purpose of checking
the -aay --sysinit combination.
Update the check about lvmetad running but not used.
Also add tests related to the new lvmetad disabled state.
lvm1 metadata is used here to test the disabled state
because lvm1 metadata is the first condition using the
disabled state.
After a device rescan that repopulates lvmetad,
if no reason for disabling lvmetad was seen
(lvm1 metadata or duplicate PVs), then clear
the disabled flag in lvmetad. This allows
commands to resume using the lvmetad cache
after the cause for disabling it has been removed.
Commands already check if the lvmetad token is valid,
and if not, they rescan devices to repopulate lvmetad
before running. Now, in addition to checking the
lvmetad token, they also check if the lvmetad disabled
flag is set. If so, they do not use the lvmetad cache
and revert to disk scanning.
A global flag in lvmetad indicates it has been disabled.
Other flags indicate the reason it was disabled.
These flags can be queried using get_global_info.
The lvmetactl debugging utility can set and clear the
disabled flag in lvmetad. Nothing else sets the
disabled flag yet.
Commands will check these flags after connecting to
lvmetad. If the disabled flag is set, the command
will not use the lvmetad cache, but revert to disk
scanning.
To test this feature:
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 0
disable_reason = "none"
token = "filter:3041577944"
$ vgs
(should report VGs from lvmetad)
$ lvmetactl set_global_disable 1
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 1
disable_reason = "DIRECT"
token = "filter:3041577944"
$ vgs
WARNING: Not using lvmetad because the disable flag was set directly.
(should report VGs without contacting lvmetad)
$ lvmetactl set_global_disable 0
$ vgs
(should report VGs from lvmetad)
process_each_pv was doing:
1. lvmcache_seed_infos_from_lvmetad()
sends pv_list request to lvmetad.
2. get_vgnameids()
sends vg_list request to lvmetad.
3. _get_all_devices()
first calls lvmcache_seed_infos_from_lvmetad(),
which is a no-op if it's already been called.
Because get_vgnameids() does not use the information
from lvmcache_seed_infos_from_lvmetad(), it does not
need to be called prior to get_all_devices where
it is actually needed.
Improve code for snapshot merge for readabilty
and also reduce number of tests needed to decide
if merging can or cannot be started.
(Further improving 9cccf5245a)
When dm_tree_find_node_by_uuid() fails to find passed uuid,
report in lof_debug the complete original uuid,
not the one stripped of LVM- prefix.
TODO: inspect manipulation with LVM- prefix here.
To recognize in runtime if we are merging or not
to make consistent decision between suspend and resume
add function to parse thin table line when add
merging thin device to the table.
A snapshot merge into its origin cannot be initiated while the devices
are in use. If there is outside interference (such as from udev),
the suspend (preload) and resume stages can reach conflicting decisions
about whether or not to proceed.
Try to make the logic more robust by checking the inactive or live
table during resume. (This is still not perfect.)
Commit 971ab733b7 ("thin: activation of
merging thin snapshot") also added an incorrect deactivation attempt
for non-thin LVs: find_snapshot(lv)->lv is not designed to be
activated and any attempt to deactivate it is incorrect.
Move checking the lvmetad state, and the possible rescan,
out of lvmetad_send() to the start of the command.
Previously, the token mismatch and rescan would occur
within lvmetad_send() for some other request. Now,
the token mismatch is detected earlier, so the
rescan can be done before the main command is in
progress. Rescanning deep within the processing of
another command will disturb the lvmcache state of
that other command.
A rescan already exists at the start of the command
for the case where foreign VGs are going to be read.
This same rescan is now also performed when there is
an lvmetad token mismatch (from a changed global_filter).
The commands pvscan/vgscan/lvscan/vgimport are excluded
from this preemptive checking/rescanning for lvmetad
because they want to do rescanning themselves explicitly.
If rescanning devices fails, then lvmetad has not been
correctly repopulated and should not be used, so make
the command revert to not using lvmetad.
When not obtaining device from udev, we are doing deep devdir scan,
and at the same time we try to insert everything what /sys/dev/block
knows about. However in case lvm2 is configured to use nonstardard
devdir this way it will see (and scan) devices from a real system.
lvm2 test suite is using its own test devdir with its
own device nodes. To avoid touching real /dev devices, validate
the device node exist in give dir and do not insert such device
into a cache.
With obtain list from udev this patch has no effect
(the normal user path).
We have _insert_dirs() for udev and non-udev compilation.
Compiling without udev missed to call dev_cache_index_devs().
Move the call after _insert_dirs() call so both compilation
gets it.
When scanning if device is being usable as PV,
we call STATUS - but this status should not cause
any flushing.
Skip also open_count information as it's not needed.
We don't have any report field of this type yet. Return this patch into
the play if we really need that. Currenly we always report status
(result of "status" dm ioctl) for an LV as a whole where we choose
segment which represents the LV, not calling status for each possible
segment it contains - we don't need this now so I'm removing it to
not make the code more complex uselessly.
Devices without "LVM-" uuid prefix have been generated by very old
version of lvm2 2.00 and 2.01.
Since version 2.02 all lvm2 devices are using prefix "LVM-".
However checking for present of ancient non prefixed devices does
take extra IOCTL per every call and for majority of todays user
it will not find anything new.
So use the assumption that users with kernel 3.X and newer are not
really using such old versions of lvm2 (year <2005) and with their
new kernel they are also using new version of lvm2 and skip
checking for them.
This change also makes trace logs more readable.
This is hotfix for RHBZ: https://bugzilla.redhat.com/1324537
However already the %FREE is not a good fit and we need something
better. Meanwhile make -l%PVS work at least as good as %FREE
for thin-pool.
TODO: this needs rework - it should be allocator to do all the size
decisions at one place.
Improved test script to verify lost mirror image does not
cause mirror corruption while mirror is in use.
TODO: add more cases (lost mlog...), lost image from 3leg mirror...
When leaving preload routine it should not altet state of flush required
when it's been already set to 1 and drop it to 0.
The API here is unclean, but current usage expects the same
variable pointer is for all preload calls and combines 'flush_required'
across all of them.
Commit 844b009584 tried to move
limit for usage of noflush to 'preload' however this was not
correctly processed.
Intead explicitly check for which types we do not want noflush
and also add debug message in this case.
Fix regression caused by commit ba41ee1dc9.
The idea was to use no_flush for changed device only for thin volumes
and thin pools but also to merge this with change made in commit
844b009584.
However the resulting condition has caused misbehavior for the mirror
suspend - as that has been before the ONLY allowed target type
that could have been suspended with noflush.
Result was badly working repair for --type mirror that has been
passing 'flush' to the repaired mirror target whenever preload
wrongly set flush_required.
The origin code has required the flush_required to be set whenever
deivce size is changed.
Now it first detects if device size got smaller
'dm_tree_node_size_changed(root) < 0' - this requires flush.
Otherwise it checks device is not thin volume nor thin pool and its
size has changed (got bigger) and requires flush.
This mean upsize of thin-pool or thin volume will not require flush.
/sys/dev/block is available since kernel version 2.2.26 (~ 2008):
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-dev
The VGID/LVID indexing code relies on this feature so skip indexing
if it's not available to avoid error messages about inability to open
/sys/dev/block directory.
We're not going to provide fallback code to read the /sys/block/
instead in this case as that's not that efficient - it needs extra
reads for getting major:minor and reading partitions would also
pose further reads and that's not worth it.
If compiling without udev_sync support, udev_get_library_context simply
returns NULL so we don't need to remember putting ifdef UDEV_SYNC_SUPPORT
in the code all the time we just need to check whether there's any udev
context initialized or not.
If obtain_device_list_from_udev=0, LVM can make use of persistent .cache
file. This cache file contains only devices which underwent filters in
previous LVM command run. But we need to iterate over all block devices
to create the VGID/LVID index completely for the device mismatch check
to be complete as well.
This patch iterates over block devices found in sysfs to generate the
VGID/LVID index in dev cache if obtain_device_list_from_udev=0
(if obtain_device_list_from_udev=1, we always read complete list of
block devices from udev and we ignore .cache file so we don't need
to look in sysfs for the complete list).
For the case when we print device name associated with struct device
that was not found in /dev, but in sysfs, for example when printing
devices where LV device mismatch is found.
Use meta% to expose highest mapped sector in thinLV.
so showing there 100.00% means thinLV maps latest sector.
Currently using a 'trick' with total_numerator to pass-in
device size when 'seg==NULL'
TODO: Improve device status API per target - current 'percentage'
is not really extensible.
Previous fix missed the fact the we do query for 'percent' with
seg value either set or unset (API overload...)
When 'seg' was unset, we still issue flush with status.
Fix it by cheking segtype by target_type.
As we check for segtype - we could also skip whole percentage
if the 'segtype' is unknown by code directly.
Reported-by: Ming-Hung Tsai <mingnus gmail com
It's correct to have a DM device that has no DM UUID assigned
so no need to issue error message in this case. Also, if the
device doesn't have DM UUID, it's also clear it's not an LVM LV
(...when looking for VGID/LVID while creating VGID/LVID indices
in dev cache).
For example:
$ dmsetup create test --table "0 1 linear /dev/sda 0"
And there's no PV in the system.
Before this patch (spurious error message issued):
$ pvs
_get_sysfs_value: /sys/dev/block/253:2/dm/uuid: no value
With this patch applied (no spurious error message):
$ pvs
If we're using persistent .cache file, we're reading this file instead
of traversing the /dev content. Fix missing indexing by VGID and LVID
here - hook this into persistent_filter_load where we populate device
cache from persistent .cache file instead of scanning /dev.
For example, inducing situation in which we warn about different device
actually used than what LVM thinks should be used based on metadata:
$ lsblk -s /dev/vg/lvol0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol0 253:4 0 124M 0 lvm
`-loop1 7:1 0 128M 0 loop
$ lvmconfig --type diff
global {
use_lvmetad=0
}
devices {
obtain_device_list_from_udev=0
}
(obtain_device_list_from_udev=0 also means the persistent .cache file is used)
Before this patch - pvs is fine as it does the dev scan, but lvs relies
on persistent .cache file and it misses the VGID/LVID indices to check
and warn about incorrect devices used:
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
With this patch applied - both pvs and lvs is fine - the indices are
always created correctly (lvs just an example here, other LVM commands
that rely on persistent .cache file are fixed with this patch too):
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
It's possible that while a device is already referenced in sysfs, the node
is not yet in /dev directory.
This may happen in some rare cases right after LVs get created - we sync
with udev (or alternatively we create /dev content ourselves) while VG
lock is held. However, dev scan is done without VG lock so devices may
already be in sysfs, but /dev may not be updated yet if we call LVM command
right after LV creation (so the fact that fs_unlock is done within VG
lock is not usable here much). This is not a problem with devtmpfs as
there's at least kernel name for device in /dev as soon as the sysfs
item exists, but we still support environments without devtmpfs or
where different directory for dev nodes is used (e.g. our test suite).
This patch covers these situations by tracking such devices in
_cache.sysfs_only_names helper hash for the vgid/lvid check to work still.
This also resolves commit 6129d2e64d
which was then reverted by commit 109b7e2095
due to performance issues it may have brought (...and it didn't resolve
the problem fully anyway).
dmeventd daemon may call further code itself that looks at /dev, e.g.
via dmeventd_lvm2_command call. We need to have a consistent view of
the /dev content at that time. Therefore, sync /dev content before
calling monitoring hook which contacts dmeventd.
This problem was quite hidden before, but now it has manifested itself
because of recent additions to dev-cache code where we started looking
at device holders as seen in sysfs. What happened here was that the
device was already in sysfs, but not yet under /dev and this triggered
the new error message sometimes:
log_error("%s: failed to find associated device structure for holder %s.", devname, devpath);
This problem has manifested recently in our api/pytest.sh test from
testsuite where we create thin pool LVs and thin LVs and hence it also
causes dmeventd to be used as well and these error messages were
visible there.
UUID for LV is either "LVM-<vg_uuid><lv_uuid>" or "LVM-<vg_uuid><lv_uuid>-<suffix>".
The code before just checked the length of the UUID based on the first
template, not the variant with suffix - so LVs with this suffix were not
processed properly.
For example a thin pool LV (as an example of an LV that contains
sub LVs where UUIDs have suffixes):
[0] fedora/~ # lsblk -s /dev/vg/lvol1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol1 253:8 0 4M 0 lvm
`-vg-pool-tpool 253:6 0 116M 0 lvm
|-vg-pool_tmeta 253:2 0 4M 0 lvm
| `-sda 8:0 0 128M 0 disk
`-vg-pool_tdata 253:3 0 116M 0 lvm
`-sda 8:0 0 128M 0 disk
Before this patch (spurious warning message about device mismatch):
[0] fedora/~ # pvs
WARNING: Device mismatch detected for vg/lvol1 which is accessing /dev/mapper/vg-pool-tpool instead of (null).
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
With this patch applied (no spurious warning message about device mismatch):
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
To help out with debug, when an exception is thrown in the dbus service we
will dump all the information we have on the last 16 commands that were
executed along with the stack strace.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
With commit 2d5dc6512e the dbus server
no longer needs to utilize udev to know when to update its internal
state.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Check if the value we read from sysfs is not blank and replace the '\n'
at the end only when needed ('\n' should usually be there for sysfs values,
but better check this).
It's possible for an LVM LV to use a device during activation which
then differs from device which LVM assumes based on metadata later on.
For example, such device mismatch can occur if LVM doesn't have
complete view of devices during activation or if filters are
misbehaving or they're incorrectly set during activation.
This patch adds code that can detect this mismatch by creating
VG UUID and LV UUID index while scanning devices for device cache.
The VG UUID index maps VG UUID to a device list. Each device in the
list has a device layered above as a holder which is an LVM LV device
and for which we know the VG UUID (and similarly for LV UUID index).
We can acquire VG and LV UUID by reading /sys/block/<dm_dev_name>/dm/uuid.
So these indices represent the actual state of PV device use in
the system by LVs and then we compare that to what LVM assumes
based on metadata.
For example:
[0] fedora/~ # lsblk /dev/sdq /dev/sdr /dev/sds /dev/sdt
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdq 65:0 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev1 253:3 0 104M 0 mpath
sdr 65:16 0 104M 0 disk
`-mpath_dev1 253:3 0 104M 0 mpath
sds 65:32 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev2 253:4 0 104M 0 mpath
sdt 65:48 0 104M 0 disk
`-mpath_dev2 253:4 0 104M 0 mpath
In this case the vg-lvol0 is mapped onto sdq and sds becauset this is
what was available and seen during activation. Then later on, sdr and
sdt appeared and mpath devices were created out of sdq+sdr (mpath_dev1)
and sds+sdt (mpath_dev2). Now, LVM assumes (correctly) that mpath_dev1
and mpath_dev2 are the PVs that should be used, not the mpath
components (sdq/sdr, sds/sdt).
[0] fedora/~ # pvs
Found duplicate PV xSUix1GJ2SK82ACFuKzFLAQi8xMfFxnO: using /dev/mapper/mpath_dev1 not /dev/sdq
Using duplicate PV /dev/mapper/mpath_dev1 from subsystem DM, replacing /dev/sdq
Found duplicate PV MvHyMVabtSqr33AbkUrobq1LjP8oiTRm: using /dev/mapper/mpath_dev2 not /dev/sds
Using duplicate PV /dev/mapper/mpath_dev2 from subsystem DM, ignoring /dev/sds
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/sdq, /dev/sds instead of /dev/mapper/mpath_dev1, /dev/mapper/mpath_dev2.
PV VG Fmt Attr PSize PFree
/dev/mapper/mpath_dev1 vg lvm2 a-- 100.00m 0
/dev/mapper/mpath_dev2 vg lvm2 a-- 100.00m 0
If we're using non-standard /dev layout so we can't get the dm-X name
easily, we can't also look at the /sys/blocl/dm-X/dev to get the major:minor
pair. Use "stat" in this case even though it triggers automounts
(but there's no better way for now).
The /proc/self/mountinfo is not bound to device names like /proc/mounts
and it uses major:minor pairs instead.
This fixes a situation in which a volume is mounted and then renamed
later on - that makes /proc/mounts unreliable when detecting mounted
volumes.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1196910.
Commit b64703401d cause regression
when handling stacked resize of pool metadata volume that would
be a raid LV.
Fix it by properly setting up size also for layer extension.
Allow pvchange and pvresize to process exported VGs,
and have them check for the exported state in their
single function.
Previously, the exported VG state would trigger a
failure in vg_read()/ignore_vg() because the VGs are
being read with READ_FOR_UPDATE. Because these commands
read all VGs to search for the intended PVs, any
exported VG would trigger a failure, even if it was
not related to the intended PV.
Use <> around user entered option parameters to make it visually
different (just like we use Italic style in man page).
TODO:
In the future we should consistently provide such notation and
possibly generate it automagically from some internal data structures.
Preferably for man pages as well so we report actual set of supported
options.
Recent kernel (4.4) start to report values smaller then sector size
(but in reporting size for SSD which support data zeroing on discard).
For now log warning and assume it really means 1 sector.
Addressing RHBZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1313377
Continuing with lvconvert style updates.
Minimize usage of '\:' as zero-width break space, since html renderers
do not handle this groff sign well (some of them seems to even render
':' there).
But since we do not want see badly rendered man pages itself, prefer
the 'man visual' appearence over html page generation (man2html should
be actually fixed).
Start to use 'user input option' with Capitals more consistently.
Fixing vpath usage as it has been checking for existance of
generated file also in the $(scrdir) e.g.:
No need to remake target '10-dm.rules.in'; using VPATH name '...'
If the $(srcdir) had been also $(builddir) and contained already
generated rules file, it's been used instead generating new
one.
(See: http://make.mad-scientist.net/papers/how-not-to-use-vpath/)
Commit abd9618dd8 tried to improve
parsing of vg name from logical path - but still missed couple
corner cases.
This patch further improves the logic and reuses
validate_lvname_param() for parsing of lv_name.
Also explicitly checks for LVM_VG_NAME in the right case.
So now also properly parses cases like:
'lvconvert --repairt vg/'
and will provide correct error message.
There's a window between doing VG read and checking PV device size
against real device size. If the device is removed in this window,
the dev cache still holds struct device and pv->dev still references
that and that PV is not marked as missing. However, if we're trying
to get size for such device, the open fails because that device
doesn't exists anymore.
We called existing pv_dev_size in _check_pv_dev_sizes fn. But
pv_dev_size assigned a size of 0 if the dev_get_size it called failed
(because the device is gone).
So call the dev_get_size directly and check for the return code
in _check_pv_dev_sizes and go further only if we really know the
device size. This is to avoid confusing warning messages like:
Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
One or more devices used as PVs in VG helter_skelter have changed sizes.
While running on F24 a number of warnings were being emitted from using the
deprecated GObject instead of GLib. Tested on python 3.4 and 3.5.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Python 3.5 in F24 was throwing the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/main.py", line 73, in process_request
req.run_cmd()
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 73, in run_cmd
self.register_error(-1, st)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 123, in register_error
self._reg_ending(None, error_rc, error)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 115, in _reg_ending
self.cb_error(self._rc_error)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 669, in <lambda>
keywords[error_callback] = lambda exception: _method_reply_error(connection, message, exception)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 293, in _method_reply_error
exception))
File "/usr/lib64/python3.5/traceback.py", line 136, in format_exception_only
return list(TracebackException(etype, value, None).format_exception_only())
File "/usr/lib64/python3.5/traceback.py", line 442, in __init__
if (exc_value and exc_value.__cause__ is not None
AttributeError: 'str' object has no attribute '__cause__'
This was caused because we were calling the dbus error callback with a
string instead of an actual exception. On python 3.4 this was apparently
OK, but not with 3.5. Corrected to pass the exception to error callback.
Change tested on both python 3.4 and 3.5.
Reported-by: Vratislav Podzimek <vpodzime@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Specifying an output width of 0 now leads to a default minimum width
taken from the width of the column heading. (Most fields should use
this.)
Components of field names are generally separated by underscores (which
are optional at run-time). (Dropped 3 duplicate fields now covered by
this rule.)
Each field heading must be unique and generally doesn't have spaces
between words (which get capitalised) unless they are already short and
the fields are normally longer or clarity demands it.
With the recent conversion of pvcreate/pvremove to the
common toollib processing function, skipping in-use PVs
in _process_pvs_in_vg prevented them from being protected
as intended by the in-use flag.
The processing code for pvcreate/pvremove checks for the
in-use state itself and prevents using an in-use PV.
If a PV is skipped, it looks like an unused device and
is not protected from being used in pvcreate/pvremove.
This command option can be used to trigger a D-Bus
notification independent of the usual notifications
that are sent from other commands as an effect of
changes to PV/VG/LV state. If lvm is not built with
dbus notification support or if notify_dbus is disabled
in the config, this command will exit with an error.
When a command modifies a PV or VG, or changes the
activation state of an LV, it will send a dbus
notification when the command is finished. This
can be enabled/disabled with a config setting.
The code in _print_historical_lv function works with temporary
"descendants_buffer" that is allocated and freed within this
function.
When printing text out, we used "outf" macro which called
"out_text" fn and it checked return value and if failed,
the macro called "return_0" automatically. But since we
use the temporary buffer, if any of the out_text calls
fails, we need to deallocate this buffer properly - that's
the "goto_out", otherwise we'll be leaking memory.
So add new "outfgo" helper macro which does the same as "outf",
but it calls "goto_out" instead of "return_0" so we can jump
to a cleanup hook at the end.
- stdout/stderr for test output
- debug.log* for daemon output
- use install instead of cp for DBus config
- use system bus
- DBus reloads on new file. Better having it created with correct
permissions.
Notes:
- Squashed commits from mcsontos development branch
- Still disabled at this time
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Using lvm1 metadata with lvmetad is not generally allowed,
but nothing has prevented creating new lvm1 metadata with
lvmetad (missing error checking in pvcreate/vgcreate.)
Various tests are using lvm1 with lvmetad and happen to
work because of the missing error checks.
This commit fixes the tests so they won't fail when the
lvm1/lvmetad error checking is fixed. A new variable
LVM_TEST_LVM1 is defined and is used in the scripts to
decide if lvm1 metadata should be tested. LVM_TEST_LVM1
is not defined when lvmetad is being tested, and the
combination of LVM_TEST_LVM1 and LVM_TEST_LVMETAD can
be used to verify the desired lvmetad+lvm1 behavior.
Historical LV is valid as long as there is at least one live LV among
its ancestors. If we find any invalid (dangling) historical LVs, remove
them automatically.
The vg_strip_outdated_historical_lvs iterates over the list of historical LVs
we have and it shoots down the ones which are outdated.
Configuration hook to set the timeout will be in subsequent patch.
The metadata/record_lvs_history is global switch which enables or
disables recording historical LVs in metadata.
If both metadata/record_lvs_history=1 lvm.conf option and
--nohistory command switch is used at the same time, the
--nohistory prevails.
Report proper values for historical LVs in lv_layout and lv_role fields.
Any historical LV doesn't have any layout anymore and the role is "history".
For example:
$ lvs -H -o name,lv_attr,lv_layout,lv_role vg/-lvol1
LV Attr Layout Role
-lvol1 ----h----- none public,history
The lv_full_ancestors reporting field is just like the existing
lv_ancestors field but it also takes into account any history and
indirect relations recorded.
All names for historical LVs are prefixed with '-' character to make clear
difference between live and historical LVs. The '-' can't be set by users
for live LV names during lvcreate hence we never get into a conflict with
the names that user defines for live LVs.
The lv_historical reporting field is a simple binary field that reports
whether an LV is historical one ("historical" value or value of "1" displayed)
or not (blank string "" or value of "0" displayed).
When processing LVs in a VG and when the -H|--history switch is used,
make process_each_lv_in_vg to iterate over historical volumes too.
For each historical LV, we use dummy struct logical_volume instance with
the "this_glv" reference set to a wrapper over proper struct
historical_logical_volume representation. This makes it possible to process
historical LVs just like normal live LVs (though a dummy one without any
segments and all the other fields zeroed and blank) and it also allows
for using all historical LV related information via lv->this_glv->historical
reference.
One can use a simple call to lv_is_historical to make a difference between
live and historical LV in all the code that is called by process_each_* fns.
This patch adds "include_historical_lvs" field to struct cmd_context to
make it possible for the command to switch between original funcionality
where no historical LVs are processed and functionality where historical
LVs are taken into account (and reported or processed further). The switch
between these modes is done using the '-H|--history' switch on command
line.
The include_historical_lvs state is then passed to process_each_* fns
using the "include_historical_lvs" field within struct processing_handle.
Add support for making an interconnection between thin LV segment and
its indirect origin (which may be historical or live LV) - add a new
"indirect_origin" argument to attach_pool_lv function.
Also export historical LVs when exporting LVM2 metadata.
This is list of all historical LVs listed in
"historical_logical_volumes" metadata section with all
the properties exported for each historical LV.
For example, we have this thin snapshot sequence:
lvol1 --> lvol2 --> lvol3
\
--> lvol4
We end up with these metadata:
logical_volume {
...
(lvol1, lvol3 and lvol4 listed here as usual - no change here)
...
}
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "lvol1"
descendants = ["lvol3", "lvol4"]
}
}
By removing lvol1 further, we end up with:
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "-lvol1"
descendants = ["lvol3", "lvol4"]
}
lvol1 {
id = "me0mes-aYnK-nRfT-vNlV-UiR1-GP7r-ojbROr"
creation_time = 1456919608 # 2016-03-02 12:53:28 +0100
removal_time = 1456919767 # 2016-03-02 12:56:07 +0100
}
}
When an LV is being removed, we create an instance of
"struct historical_logical_volume" wrapped up in
"struct generic_logical_volume".
All instances of "struct historical_logical_volume" are then recorded in
"historical_lvs" list which is part of "struct volume_group".
The "historical LV" is then interconnected with "live LVs" to
connect a history chain for the live LV.
The add_glv_to_indirect_glvs is a helper function that registers a
volume represented by struct generic_logical_volume instance ("glv")
as an indirect user of another volume ("origin_glv") and vice versa -
it also registers the other volume ("origin_glv") as indirect_origin
of user volume ("glv").
The remove_glv_from_indirect_glvs does the opposite.
The get_or_create_glv is helper function that retrieves any existing
generic_logical_volume wrapper for the LV. If the wrapper does not exist
yet, it's created.
The get_org_create_glvl is the same as get_or_create_glv but it creates
the glv_list wrapper in addition so it can be added to a list.
Add new structures and new fields in existing structures to support
tracking history of LVs (the LVs which don't exist - the have been
removed already):
- new "struct historical_logical_volume"
This structure keeps information specific to historical LVs
(historical LV is very reduced form of struct logical_volume +
it contains a few specific fields to track historical LV
properties like removal time and connections among other LVs).
- new "struct generic_logical_volume"
Wrapper for "struct historical_logical_volume" and
"struct logical_volume" to make it possible to handle volumes
in uniform way, no matter if it's live or historical one.
- new "struct glv_list"
Wrapper for "struct generic_logical_volume" so it can be
added to a list.
- new "indirect_glvs" field in "struct logical_volume"
List that stores references to all indirect users of this LV - this
interconnects live LV with historical descendant LVs or even live
descendant LVs.
- new "indirect_origin" field in "struct lv_segment"
Reference to indirect origin of this segment - this interconnects
live LV (segment) with historical ancestor.
- new "this_glv" field in "struct logical_volume"
This references an existing generic_logical_volume wrapper for this
LV, if used. It can be NULL if not needed - which means we're not
handling historical LVs at all.
- new "historical_lvs" field in "struct volume group
List of all historical LVs read from VG metadata.
Pair kernel_cache_settings with kernel_cache_policy.
There is very small chance in error case that the value in table
might be differnet from the value stored in metadata
so make it 'checkable'.
Showing 'u' in the pv_attr reporting field is mostly unnecessary because
most PVs are allocatable, and being allocatable implies it is (u)sed,
and this is already obvious from other fields in the default 'pvs'
output like the VG name.
So move the new (u)sed pv_attr from character position 4 to 1, and only
show it in those rare cases when the PV is not (a)llocatable or the
relevant metadata is missing.
(Scripts should not be using pv_attr, but rather pv_allocatable,
pv_exported, pv_missing, pv_in_use etc.)
Helping with understanding we will not try to deref NULL pointer,
as if the sizes are initialized to NULL it also means 'mem' would
be NULL, but thats too hard to model so make it obvious.
Correct name is lvm2-lvmdbusd.service not lvmdbusd.service.
This makes the bus-activation (auto-activation) work.
Signed-off-by: Vratislav Podzimek <vpodzime@redhat.com>
Make the data_alignment variable 64 bits so it
can hold the invalid command line arg used in
pvreate-usage.sh pvcreate --dataalignment 1e.
On 32 bit arches, the smaller variable wouldn't
hold the invalid value so the error would not
trigger as expected by the test.
This simple tool calls the Manager.Refresh method on the dbus service
to check and see if the dbus service has the most up to date state.
This is to be used for testing to ensure that event driven updates are
working as planned.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When we use udev or have lvm call back into the dbus service when a
change occurs, even if that change originated from the dbus service
we end up refreshing the state of the system twice which is not
needed or wanted. This change handles this case by removing any
pending refreshes in the worker queue if the state of the system
was just updated.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Since we want to read env LVM_VG_NAME vg names,
we cannot just check LV names which do contain '/'.
So before the patch commands like:
> lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Path required for Logical Volume "vg".
Please provide a valid volume group name
> LVM_VG_NAME=vg lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Can't find LV vg in VG vg
Commit ca878a3426 introduced an issue
that zero sized extesion suddenly started to be accepted and
missed to return error.
Properly check invalid input values for sizes.
Add tests for the "dmstats report" command:
* report
* report --count
* report --histogram
So far the tests just check the command runs as expected when a
correctly configured stats region exists: validation of output
can be added later.
Add tests for the "dmstats create" command:
* simple whole-device region
* region using --start/--len options
* region using --segments option
* region with precise timestamps (--precise)
* region with histogram bounds (--bounds)
The install target already depends on .tests-stamp - since this
in turn depends on lib/version-expected there is no need to have
this as a dependency of install.
Add initial dmstats tests to 000-basic.sh. These tests ensure that
the dmsetup binary is built and linked correctly when called as
'dmstats' and that the version of the binary matches the expected
library version used for the build.
"pvcreate_each_params" was a temporary name used
to transition from the old "pvcreate_params".
Remove the old pvcreate_params struct and rename the
new pvcreate_each_params struct to pvcreate_params.
Rename various pvcreate_each_params terms to simply
pvcreate_params.
Use the new pvcreate_each_device() function from
toollib, previously added for pvcreate, in place
of the old pvcreate_vol().
This also requires shifting the location where the
lock is acquired for the new VG name. The lock for
the new VG is supposed to be acquired before pvcreate.
This means splitting the vg_lock_newname() out of
vg_create(), and calling vg_lock_newname() directly
before pvcreate, and then calling the remainder of
vg_create() after pvcreate.
The new function vg_lock_and_create() now does
vg_lock_newname() + vg_create(), like the previous
version of vg_create().
The lock on the new VG name is released before the
pvcreate and reacquired after the pvcreate because
pvcreate needs to reset lvmcache, which doesn't work
when locks are held. An exception could likely be
made for the new VG name lock, which would allow
vgcreate to hold the new VG name lock across the
pvcreate step.
This is common code for handling PV create/remove
that can be shared by pvcreate/vgcreate/vgextend/pvremove.
This does not change any commands to use the new code.
- Pull out the hidden equivalent of process_each_pv
into an actual top level process_each_pv.
- Pull the prompts to the top level, and do not
run any prompts while locks are held.
The orphan lock is reacquired after any prompts are
done, and the devices being created are checked for
any change made while the lock was not held.
Previously, pvcreate_vol() was the shared function for
creating a PV for pvcreate, vgcreate, vgextend.
Now, it will be toollib function pvcreate_each_device().
pvcreate_vol() was called effectively as a helper, from
within vgcreate and vgextend code paths.
pvcreate_each_device() will be called at the same level
as other process_each functions.
One of the main problems with pvcreate_vol() is that
it included a hidden equivalent of process_each_pv for
each device being created:
pvcreate_vol() -> _pvcreate_check() ->
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
pvcreate_each_device() reorganizes the code so that
each-VG-each-PV loop is done once, and uses the standard
process_each_pv function at the top level of the function.
This uses the vg->pv_write_list in place of the
vg->pvs_to_write list, and eliminates the use of
pvcreate_params. The label remove and zeroing
steps are shifted out of vg_write() to the higher
level like pvcreate will do.
The vg->pv_write_list contains pv_list structs for which
vg_write() should call pv_write().
The new list will replace vg->pvs_to_write that contains
vg_to_create structs which are used to perform higher-level
pvcreate-related operations. The higher level pvcreate
operations will be moved out of vg_write() to higher levels.
Since we already check in few other places 'info' is not NULL,
do the same for others - however when info would be NULL
it more or less looks like internal error.
Use #define instead, since we do not require actually buffer needs
to exists to eliminated new gcc6 warning:
clvm.h:53:19: warning: ‘CLVMD_SOCKNAME’ defined but not used
[-Wunused-const-variable]
Reshuffle messages during pvremove.
Always print WARNING: when PV is in use so using options
--force --force doesn't make this important user
notification go away.
Simplify variable 'used' usage (so older gcc doesn't warn
about the use of unitilizied variable).
Add some '.' into messages.
Currently it's been checked for 'zero' header for thin-pool,
but lets use it always for cache as well - since it's relatively 'cheap'
detection of read 'error' problems as thin/cache tools
currently do not work fast enough in this case.
export LVMDBUSD_SESSION=True to run on the session bus instead
of the system bus so that we can run the unit test without
installing the dbus conf file.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Reduced the size of LVs created and use actual PE numbers instead of hard
coding them to allow us to work with the loop back devices.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
It appears that the output of lvconvert --merge can vary some. The code
was blowing up as it was trying to parse a line of stdout to retrieve the
% complete, but the line did not have the needed format and an execption
was thrown. The uncaught exception caused the background thread to exit
without updating the job object, which caused the client to hang forever
waiting. Added a default exception handler to prevent unhandled execptions
causing hangs and removed the parameter skip_first_line as it's no longer
needed. The code checks to see if the line can be parsed before doing so.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
After the lockspace has been successfully removed,
invalidate the name field in the lockspace struct.
The struct remains on the list of lockspaces until
the struct can be freed later. Until the struct is
freed, its name will prevent another new lockspace
from being created with the same name.
When update fails in suspend() (sending of messages
fails because metadata space is full) call resume(),
so the locking sequence works properly for clustering.
Also failing deactivation should unlock memory.
Fix reporting of Fail thin-pool target status
as attr[8] letter 'F'.
Report 'needs_check' status from thin-pool target via
attr field [4] (letter 'c'/'C'), and also via CheckNeeded field.
TODO: think about better name here?
TODO: lots of prop_not_implemented_set
Fix parsing of 'Fail' status (using capital letter) for thin-pool.
Add also parsing of 'Error' state for thin-pool.
Add needs_check test for thin-pool.
Detect Fail state for thin.
Ask for confirmation when using pvcreate/pvremove on a PV which is
marked as belonging to a VG, just like we do in case of a PV which
belongs to known VG:
$ pvcreate -ff /dev/sda
Really INITIALIZE physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume not initialized
$ pvremove -ff /dev/sda
Really WIPE LABELS from physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume label not removed
The host that owns foreign VGs is responsible for fixing up PV_EXT_USED
flag - the same already applies to repairing any inconsistent VG.
This patch also moves the iteration over vg->pvs inside
_check_or_repair_pv_ext fn - it's cleaner this way.
pv->vg is not set yet during pvcreate processing. Use pv->fmt instead to
check for these fake PVs (all normal PVs have format defined, devices
which are not PVs don't have this set).
This fixes commit 0000db7f98.
If we know that a PV belongs to some VG and we're missing metadata
(because we have only those PV(s) from VG present in the system that
don't have metadata areas), we should skip such PV when processing
under system ID.
This is because we know that the PV belongs to some VG, but we
really can't decide whether it matches system ID unless the VG
metadata is present again.
Some of the PVs are not even orphan PVs - they're fake PVs - this can
happen if we're listing all devices with "pvs -a". Such PV must not
be marked as used.
The backup_restore_vg is used directly for restoring the VG from backup.
It's also used to do the VG conversions from one metadata format to
another which means vgconvert calls backup_restore_vg too.
When restoring VG from backup, we need to rewrite/write PV headers as
PVs may have been orphans before and now they're becoming part of some
VG - we need to write the PV_EXT_USED flag at least.
When using the backup_restore_vg for vgconvert, we need to write
completely new PV header in different format.
Avoid the special "pv_write" call and handling that was used before
this patch in vgconvert (vgconvert_single function to be more precise)
and reuse existing internal interface to register PV header for writing
(or rewriting) via vg->pvs_to_write list instead like we do it elsewhere
in the code.
This patch also resolves a problem in which PV headers with target
format were written in the vgconvert_single fn as orphans and VG
metadata were added later on - this was a tiny hack actually.
We can't do this now - we need to write the PV as belonging
to a VG because otherwise the PV_EXT_USED flag won't be written
properly (if the PV header is written as orphan, the PV_EXT_USED
is set to 0, of course, even though metadata are attached later).
So this patch removes this tiny inconsistency which was passing
just fine before because we didn't have any relation to the VG
in PV header before. Now we have the PV_EXT_USED flag which says
the "PV is used in some VG".
The same check as we already do for orphan PVs, just the other way
round now: if the PV is surely part of some VG and any PV the VG
contains does not have the PV_EXT_USED flag set, repair it.
For example - /dev/sda here is in VG vg and it's incorrectly not
marked as used by PV_EXT_USED flag:
pvs --binary -o pv_ext_vsn,pv_in_use
WARNING: Volume Group vg is not consistent.
WARNING: Repairing Physical Volume /dev/sda that is in Volume Group vg but not marked as used.
PV VG Fmt Attr PSize PFree ExtVsn PInUse
/dev/sda vg lvm2 a-- 124.00m 124.00m 2 1
PV header extension versions:
0 - the original PV without any extensions
1 - bootloader area support added
2 - PV_EXT_USED flag support added
So do the associated checks related to PV_EXT_USED flag only if
PV header extension found is of version 2 and higher.
If we know that the PV is orphan, meaning there's at least one MDA on
that PV which does not reference any VG and at the same time there's
PV_EXT_USED flag set, we're certainly in an inconsistent state and we
need to fix this.
For example, such situation can happen during vgremove/vgreduce if we
removed/reduced the VG, but we haven't written PV headers yet because
vgremove stopped abruptly for whatever reason just before writing new
PV headers with updated state, including PV extension flags (and so the
PV_EXT_USED flag).
However, in case the PV has no MDAs at all, we can't double-check
whether the PV_EXT_USED is correct or not - if that PV is marked
as used, it's either:
- really used (but other disks with MDAs are missing)
- or the error state as described above is hit
User needs to overwrite the PV header directly if it's really clear
the PV having no MDAs does not belong to any VG and at the same time
it's still marked as being in use (pvcreate -ff <dev_name> will fix this).
For example - /dev/sda here has 1 MDA, orphan and is incorrectly marked
with PV_EXT_USED flag:
$ pvs --binary -o+pv_in_use
WARNING: Found inconsistent standalone Physical Volumes.
WARNING: Repairing flag incorrectly marking Physical Volume /dev/sda as used.
PV VG Fmt Attr PSize PFree InUse
/dev/sda lvm2 --- 128.00m 128.00m 0
For example:
$ pvs -o pv_name,vg_name,pv_in_use
PV VG InUse
/dev/sda vg used
/dev/sdb
/dev/sdc used
(sda is part of vg - it's used
sdb is not part of vg - it's not used
sdc is part of vg, but MDAs missing - it's used)
Scenario:
$ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
We're adding the PV to a VG.
Before this patch:
$ vgcreate vg /dev/sda
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
With this path applied:
$ vgcreate vg /dev/sda
Volume group "vg" successfully created
...and verbose log containing: "Physical volume "/dev/sda" successfully written"
Make sure we won't use a PV that is already marked as used. Normally,
VG metadata would stop us from doing that, but we can run into a
situation where such metadata is missing because PVs with MDAs
are missing and the PVs left are the ones with 0 MDAs.
(/dev/sda in this example has 0 MDAs and it belongs to a VG,
but other PVs with MDA are missing)
$ pvs -o pv_name,pv_mda_count /dev/sda
PV #PMda
/dev/sda 0
$ pvcreate /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't initialize PV '/dev/sda' without -ff.
$ pvchange -u /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't change PV '/dev/sda' without -ff.
Physical volume /dev/sda not changed
0 physical volumes changed / 1 physical volume not changed
$ pvremove /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
(If you are certain you need pvremove, then confirm by using --force twice.)
$ vgcreate vg /dev/sda
Physical volume '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Unable to add physical volume '/dev/sda' to volume group 'vg'.
We'll use this struct in subsequent patches for PVs which should
be rewritten, not just created. So rename struct pv_to_create to
struct pv_to_write for clarity.
This is a hotfix for a bug introduced in
6d7dc87cb3.
The bug description: First we allocate memory for
processing handle (at an address 1) then we
allocate some memory on the same pool for later use
in pvmove_poll function inside the process_each_pv
function (at an address 2). After we jump out of
process_each_pv we called destroy_processing_handle.
As a result of destroying the handle memory pool could
deallocate all memory at address 1 or higher. The
pvmove_poll function tried to copy a memory allocated
at address 2 that could be returned to the system.
If it was so it led to segfault.
We need to rethink proper fix but in the same time
cmd->mem pool is recreated per each lvm command so
this should not cause problems even when we run
multiple commands in lvm shell.
A valgrind snapshot of the corruption:
Invalid read of size 1
at 0x4C29F92: strlen (mc_replace_strmem.c:403)
by 0x5495F2E: dm_pool_strdup (pool.c:51)
by 0x1592A7: _create_id (pvmove.c:774)
by 0x159409: pvmove_poll (pvmove.c:796)
by 0x1599E3: pvmove (pvmove.c:931)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address 0xf15df8a is 138 bytes inside a block of size 8,192 free'd
at 0x4C28430: free (vg_replace_malloc.c:446)
by 0x5494E73: dm_free_wrapper (dbg_malloc.c:357)
by 0x5495DE2: _free_chunk (pool-fast.c:318)
by 0x549561C: dm_pool_free (pool-fast.c:151)
by 0x164451: destroy_processing_handle (toollib.c:1837)
by 0x1598C1: pvmove (pvmove.c:903)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address this gcc warning:
metadata/lv.c:243: warning: initialized field overwritten
metadata/lv.c:243: warning: (near initialization for 'status.seg_status')
Present with e.g.: gcc version 4.3.2 (Debian 4.3.2-1.1)
Simplify calculation of extents rounding needed for
segment size.
Segment size has to divisible by 'extent count' needed to contain
whole stripe. LVM currently does not support stripes across segment.
In case the stripe size is bigger then extent size,
require bigger rounding.
Fixing regression caused by 197b5e6dc7.
So the 'TODO' part now finally know the answer - there is 'sparc64'
architecture which imposes limitation to read 64b words only through
64b aligned address.
Since we never could know how is the user going to use the returned
pointer and the userusually expects it's aligned on the highest CPU
required alignement, preserve it also for char*.
Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=809685
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Wrong thin-pool feature flag ordering in dm table: It will lead to
unnecessary table reload.
Fix it by placeing feature flags in order they are returned from the
kernel so current 'table line diff' code will not see a difference.
'verbose' was marked as a boolean option while it
takes integer args - so it has limited usage to 0 or 1,
but we supported 0-4 at least.
Fix it by switching to corrent int type.
(Hopefully noone was trying to use this variable as true/yes/false/no
way - as the would be unsupported/undocumented).
Reporter noticed lvm2 incorrectly translated
lvm2 threshold value to water mark in commit:
99237f0908
Fix it by properly translating size to number of
blocks in thin-pool and then calc for free blocks
matching configured lvm2 threshold value.
Reported-by: Ming-Hung Tsai <mingnus@gmail.com>
Normally, we generate and provide lvm.conf file where use_blkid_wiping
is set based on whether support for this is compiled in or not. This was
generated properly based on configure.
However, if lvm.conf is not used at all (someone deletes it) or the value
in lvm.conf is commented out (user edited it), we still need to use
proper default value that is based on DEFAULT_USE_BLKID_WIPING taken
from configure script - we used hardcoded value of "1" in this case
by mistake.
We already do check for suspended devs within udev rules where
the pvscan is to update lvmetad. So the check for suspended devs
in "pre-lvmetad" chain is not useful here - remove it - it may
be a source of hardly to detect races anyway (if udev rule detects
the device is not suspended and then the pvscan instance sees the
dev as suspended, we may end up not reacting to the event properly).
lvm1 and pool format do not support bootloader areas and we need to
remove any existing associated bootloader areas when we read lvm1 and
pool labels.
This has its importance if we're converting from one format to another
and we're reusing lvmcache in long-running commands (e.g. clvmd or lvm
shell) and we need to make lvmcache consistent and valid for current format.
Non-dm devices have ID_PART_TABLE_TYPE variable exported in
udev db from blkid scan for *both* whole devices and partitions.
We used ID_PART_ENTRY_DISK in addition to decide whether this
is the whole device or partition and then we filtered out only
whole devices where the partition table really is.
However, ID_PART_ENTRY_DISK was added in blkid 2.20 so we need
to use a different set of variables to decide on whole devices
and partitions on systems where older blkid is still used.
Now, we use ID_PART_TABLE_TYPE to detect that there's something
related to partitioning with this device and we use DEVTYPE variable
instead to decide between whole device (DEVTYPE="disk") and partition
(DEVTYPE="partition").
For dm devices it's simpler, we have ID_PART_TABLE_TYPE variable\
set in udev db for whole devices. It's not set for partitions,
hence we don't need more variable in addition to make the decision
on whole device vs. partition (dm devices do not have regular
partitions, hence DEVTYPE can't be used anyway, it's always set
to "disk" for whole disks and partitions).
Add "size" and "size_seqno" to struct device to cache device's size
and also to control its lifetime - the cached value is valid as long
as the global _dev_size_seqno is equal to the device's size_seqno,
otherwise we need to get the size again and cache the new value.
This patch also adds new dev_size_seqno_inc() fn for the appropriate
parts of the code to increment current global value of _dev_size_seqno
and hence to cause all currently cached values for device sizes to
be invalidated.
The device size is now cached because we're planning to reuse this
information for further checks and we want to avoid checking it more
than necessary to save resources.
Fix regression caused by c9f021de0b.
This commit actually transfered real-action (e.g. device removal)
into the next loop which has however missed to check for break.
So add check for break also there.
When creating a list in 'context of command' - use proper mempool.
vg->vgmem is mempool related to VG metadata - and can be eventually
locked read-only when VG struct is shared.
W: manual-page-warning /usr/share/man/man8/lvm.8.gz 491: warning: macro `_cdata',' not defined
rpmlint actually notices we had few hidden word in man page.
the line cannot start with apostrophe as it has then a different
meaning.
If not using explicit --enable-blkid-wiping/--disable-blkid-wiping
configure option, the configure script tries to enable/disable blkid
wiping feature automatically based on blkid library version found.
The script incorrectly set default value for lvm.conf's
allocation/use_blkid_wiping" setting to "1" (enabled) if proper
blkid library version was not found or the version found was less
than the minimum required. It should be set to "0" in this case.
The extent size must fits all blocks in 4294967295 sectors
(in 512b units) this is 1/2 KiB less then 2TiB.
So while previous statement 'suggested' 2TiB is still acceptable value,
make it clear it's not.
As now we support any multiples of 128KB as extent size -
values like 2047G will still 'flow-in' otherwise the largest power-of-2
supported value is 1TiB.
With 1TiB user needs 8388608 extents for 8EiB device.
(FYI such device is already unusable with todays glibc-2.22.90-27)
4GiB extent size is currently the smallest extent size which allows
a user to create 8EiB devices (with 2GiB it's less then 8EiB).
TODO: lvm2 may possibly print amount of 'lost/unused space' on a PV,
since using such ridiculously sized extent size may result in huge
space being left unaccessible.
Since commit 2fc126b00d, the library
code requires udev to be initialised for device scanning and
clvmd can fail to find VGs if devices/external_device_info_source
is set to "udev".
There are two basic groups of fields for LV segment device reporting:
- related to LV segment's devices: devices and seg_pe_ranges
- related to LV segment's metadata devices: metadata_devices and seg_metadata_le_ranges
The devices and metadata_devices report devices in this format:
"device_name(extent_start)"
The seg_pe_ranges and seg_metadata_le_ranges report devices in
this format:
"device_name:extent_start-extent_end"
This patch reverts partly what commit 7f74a99502
(v 2.02.140) introduced in this area - it added [] for
hidden devices to mark them for all four fields mentioned above.
We won't be marking hidden devices in devices and metadata_devices
fields.
The seg_metadata_le_ranges field will have hidden devices marked -
it's new enough that we don't need to care about compatibility much
yet.
The seg_pe_ranges is old enough that we shouldn't be changing this
one - so we're reverting to not marking hidden devices here.
Instead, there's going to be a new field "seg_le_ranges" which
is going to replace the seg_pe_ranges and it will mark hidden devices -
this is going to be introduced in a patch later.
So in the end we'll end up with:
(LV segment's devices)
devices field with "device_name(extent_start)" format, not marking hidden devices
seg_pe_ranges field with "device_name:extent_start-extent_end" format, not marking hidden devices (deprecated, new seg_le_ranges should be used instead for standardized format)
seg_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
(LV segment's metadata devices)
metadata_devices field with "device_name:extent_start-extent_end" format, not marking hidden devices
seg_metadata_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
Also, both seg_le_ranges and seg_metadata_le_ranges will honour the
report/list_item_separator setting which can be used to configure
the delimiter used for list items.
So, to sum it up, we will recommend using the new seg_le_ranges and
seg_metadata_le_ranges fields because they display devices with
standard extent range format, they can mark hidden devices and they
honour the report/list_item_separator setting.
We'll be keeping devices,seg_pe_ranges and metadata_devices fields
for compatibility.
The associated devices,metadata_devices,seg_pe_ranges and
seg_metadata_le_ranges are reported as genuine string lists now.
This allows for using the items separately in -S|--select
(so searching for subsets etc.) and also it allows for
configuring the separator using report/list_item_separator
which may be useful in scripts (however, we'll enable this
only for seg_le_metadata_ranges and not for devices,seg_pe_ranges
and seg_metadata_devices for compatibility reasons - see following
patch).
Add a comment in _process_pvs_in_vg() to document the
place where there have been problems with processing
PVs twice.
For a while we had a hacky workaround here where we'd
skip processing a PV if its device wasn't found in
all_devices (and !is_missing_pv since we want to
process PVs with missing devices.). That workaround
was removed in commit 5cd4d46f because it was no
longer needed.
The workaround had originally been needed to prevent
a device from being processed twice when the PV had
no MDAs -- it would be processed once in its real VG
and then the workaround would prevent it from being
processed a second time in the orphan VG.
Wrongly appearing as an orphan likely happened because
lvmcache would consider the no-MDA PV an orphan unless
the real VG holding that PV was also in lvmcache.
This issue is also mentioned in pvchange where holding
the global lock allows VGs to remain in lvmcache so
PVs with 0 mdas are not considered orphans.
The workaround in _process_pvs_in_vg() was originally
intended for reporting commands, not for pvchange.
But, it was accidentally helping pvchange also because
the method described by the pvchange global lock
comment had been subverted by commit 80f4b4b8.
Commit 80f4b4b8 was found to be unnecessary, and was
reverted in commit e710bac0. This restored the
intended global lock lvmcache effect to pvchange, and
it no longer relied on the workaround in toollib.
When reporting on LVs, take the end of the range from the size of the
underlying (hidden) LV rather than the logical size of the current
segment (that PVs use).
Previously, pvmove used the function find_pv_in_vg() which did the
equivalent of process_each_pv() by doing:
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
With the found 'pv', it would do vg_read() on pv_vg_name(pv),
and then do the actual pvmove processing.
This commit simplifies by using process_each_pv() and putting
the actual pvmove processing into the "single" function.
This eliminates both find_pv_by_name() and the vg_read().
The processing code that followed vg_read remains the same.
The return code for the pvmove command is not based on the
process_each_pv return code, but is based on the success/fail
conditions in the existing code.
Make the lvb validation rules for convert match
those for unlock (even though it would be very
unlikely or impossible for convert to deal with
zero lvb.)
When an orphan PV is changed/resized, the
lvmlockd global lock is converted from sh
to ex. If the command is changing two
orphan PVs, the conversion to ex should
be done only once.
Existing cache_settings field displays the settings which are
saved in metadata. Add new kernel_cache_settings fields to display
the settings which are currently used by kernel, including fields
for which default values are used.
This way users have complete view of the set of cache settings
supported (and which they can set) and their values which are used
at the moment by kernel.
For example:
$ lvs -o name,cache_policy,cache_settings,kernel_cache_settings vg
LV Cache Policy Cache Settings KCache Settings
cached1 mq migration_threshold=1024,write_promote_adjustment=2 migration_threshold=1024,random_threshold=4,sequential_threshold=512,discard_promote_adjustment=1,read_promote_adjustment=4,write_promote_adjustment=2
cached2 smq migration_threshold=1024 migration_threshold=1024
cached3 smq migration_threshold=2048
Fix lvm2app to return either 0 or 1 for lvm_vg_is_{clustered,exported},
including internal functions pvseg_is_allocated and vg_is_resizeable
which are not yet exposed in lvm2app but make them consistent with the
rest.
This reverts e28e22b9e1
The problem that that commit was fixing (pytest failure)
no longer appears with the current code, so the commit is
not needed.
That commit is a problem for pvchange, because it prevents
lvmcache from retaining VG metadata even while the global
lock is held. pvchange holds the global lock to ensure
that VG metadata is kept in lvmcache throughout processing.
If the cache is not kept, a PV with zero MDAs will appear
first in its actual VG and then appear again in the orphan VG.
It wrongly appears a second time in the orphan VG only if
the actual VG is dropped from lvmcache.
Thin pool discard mode set in metadata can be different from the one
actually used if any device underneath does not support that mode. Add
kernel_discard report field to make it possible to see this difference.
Internal _alloc_init() is only called from allocate_extents(),
which already does prevent usage of virtual segments.
So mark as internal error early and do not process it any further.
Add new test for lv_is_snapshot().
Also move few other bitchecks into same place as remaining bit tests.
TODO: drop lv_is_merging_origin() and keep using lv_is_merging().
The problem addressed by this workaround no longer
seems to exist, so remove it. PVs with no mdas
no longer appear in both their actual VG and in
the orphan VG.
Include brackets for the name if the dev is invisible.
This change applies to all callers of _format_pvsegs fn:
- lvseg_devices (the "lvs -o devices")
- lvseg_metadata_devices (the "lvs -o metadata_devices)
- lvseg_seg_pe_ranges (the "lvs -o seg_pe_ranges")
- lvseg_seg_metadata_le_ranges (the "lvs -o seg_metadata_le_ranges")
The common lv_pool_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_metadata_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_data_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_mirror_log_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_origin_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_convert_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
Use common _lvname_disp to report lv_parent. The _lvname_disp
takes care of properly marking LVs which are not visible - such
LVs are always enclosed in brackets when reported within any
other field.
For example, thin pool over RAID.
Before:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] cache_pool_tdata
[cache_pool_tdata_rimage_1] cache_pool_tdata
[cache_pool_tdata_rmeta_0] cache_pool_tdata
[cache_pool_tdata_rmeta_1] cache_pool_tdata
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] cache_pool_tmeta
[cache_pool_tmeta_rimage_1] cache_pool_tmeta
[cache_pool_tmeta_rmeta_0] cache_pool_tmeta
[cache_pool_tmeta_rmeta_1] cache_pool_tmeta
[lvol0_pmspare]
With this patch applied:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] [cache_pool_tdata]
[cache_pool_tdata_rimage_1] [cache_pool_tdata]
[cache_pool_tdata_rmeta_0] [cache_pool_tdata]
[cache_pool_tdata_rmeta_1] [cache_pool_tdata]
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] [cache_pool_tmeta]
[cache_pool_tmeta_rimage_1] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_0] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_1] [cache_pool_tmeta]
[lvol0_pmspare]
Do not mix dm_report_field_set_value and _field_set_value and
use single function call throughout for clarity. The same applies
for dm_report_field_string and _string_disp.
Fix regression caused by commit c2d4330f27
which removed the dm_pool_strdup for the cache policy name in
_cache_policy_disp report function.
This regression was hit with buffered reporting only (which is
used by default). The reason is that for buffered reporting, we're
iterating over LVs in VG (process_each_lv) while gathering
all the information that is needed for the report. In this case,
the LV's cache policy name has not been duped, but only the pointer
to the original VG buffer was stored. When the LV iteration finished,
the VG buffer was freed and any report to output called later
(dm_report_output call) accessed already freed VG data.
This didn't appear if unbuffered reporting was used (--unbuffered)
because in this case, the data were reported to output as
soon as they were processed, hence it was reported to output
before the VG data was freed.
The lvm2-activation{-early,-net}.service systemd unit statuses were missing
in dump gathered by lvmdump -s. These are quite important when debugging
scenarios with systemd environment and where lvmetad is not used.
Have commands send lvmlockd the update message
in vg_write instead of vg_commit, so that it's
not done while LVs are suspended. If the vg_write
is not committed, and the seqno sent to lvmlockd
is not used, then lvmlockd can detect this when
the next update uses the same seqno.
Use process_each_vg() to lock and read the old VG,
and then call the main vgrename code.
When real VG names are used (not a UUID in place of the
old name), the command still pre-locks the new name
(when strcmp wants it locked first), before calling
process_each_vg on the old name.
In the case where the old name is replaced with a UUID,
process_each_vg now translates that UUID into the real
VG name, which it locks and reads. In this case, we
cannot do pre-locking to maintain lock ordering because
the old name is unknown. So, in this case the strcmp
based lock ordering is suppressed and the old name is
always locked first. This opens a remote chance for
lock ordering conflict between racing vgrenames between
two names where one or both commands use the UUID.
Also always clear the internal lvmcache after rescanning, and
reinstate a test for --trustcache so that 'pvs --trustcache'
(for example) avoids rescanning.
Before commit c1f246fedf,
_get_all_devices() did a full device scan before
get_vgnameids() was called. The full scan in
_get_all_devices() is from calling dev_iter_create(f, 1).
The '1' arg forces a full scan.
By doing a full scan in _get_all_devices(), new devices
were added to dev-cache before get_vgnameids() began
scanning labels. So, labels would be read from new devices.
(e.g. by the first 'pvs' command after the new device appeared.)
After that commit, _get_all_devices() was called
after get_vgnameids() was finished scanning labels.
So, new devices would be missed while scanning labels.
When _get_all_devices() saw the new devices (after
labels were scanned), those devices were added to
the .cache file. This meant that the second 'pvs'
command would see the devices because they would be
in .cache.
Now, the full device scan is factored out of
_get_all_devices() and called by itself at the
start of the command so that new devices will
be known before get_vgnameids() scans labels.
Since we mark cache-pool as 'hidden/private' while it is in-use,
we may still allow user to change it's name.
It should not cause any harm and user may prefer better naming
for a cache-pool in use.
If an existing fifo has the wrong attributes it cannot be trusted
so we must unlink it and recreate it correctly.
(Replaces 2c8d6f5c90: if the other end of
the fifo already got opened while its mode was insecure, delaying the
chmod isn't going to make any difference!)
Reinstate and extend checks removed by e1b111b02a.
The code has always assumed that only root has access to the directory
containing the fifos and that they are under the complete control of
dmeventd code. If anything is found not to be as expected, then open()
should certainly not be attempted!
It's getting a bit more complex here.
Basic idea behind is - check_current_backup() should not
log error when a user is using a read-only filesystem,
so e.g. vgscan will not report any error when it tries
to take missing backup.
We still have cases when error could be reported though,
e.g. the backup this would be a symbolic link, but these
are rather misconfiguration and unexpected case.
We have to modes of 'archive()' usage -
1. compulsory - fail stops command and user may try '-An' option
to do a command.
2. non-compulsory - some fails in archiving are ignorable (i.e.
read-only filesystem where archive dir is located).
Those 2 cases needs to be properly handle - i.e. the non-compulsory
logging should not be tampering error logging message production.
So more work here is needed
Pass full buffer size to printf() function - no reason to make
buffer 1 char smaller.
Also rename locn buffer to message buffer directly since it's
not used for anything else.
TODO: we may use same buffer also for 'buf[]' since there is
no collision - so may safe 1K on stack usage.
In general, --select should be used to specify a VG by UUID,
but vgrename already allows a uuid to be substituted for
the name, so continue to allow it in that case.
If the VG arg from the command line does not match the
name of any known VGs, then check if the arg looks like
a UUID. If it's a valid UUID, then compare it to the
UUID of known VGs. If it matches the UUID of a known VG,
then process that VG.
Pass the single vgname as a new process_each_vg arg
instead of setting a cmd flag to tell process_each_vg
to take only the first vgname arg from argv.
Other commands with different argv formats will be
able to use it this way.
When two different VGs with the same name exist,
they are both stored in lvmcache using the vginfo->next
list. Previously, the code would print warnings (sometimes)
when adding VGs to this list. Now the duplicate VG names
are handled by higher level code, so this list no longer
needs to print warnings about duplicate VG names being found.
After recent changes to process_each, vg_read() is usually
given both the vgname and vgid for the intended VG.
However, in some cases vg_read() is given a vgid with
no vgname, or is given a vgname with no vgid.
When given a vgid with no vgname, vg_read() uses lvmcache
to look up the vgname using the vgid. If the vgname is
not found, vg_read() fails.
When given a vgname with no vgid, vg_read() should also
use lvmcache to look up the vgid using the vgname.
If the vgid is not found, vg_read() fails.
If the lvmcache lookup finds multiple vgids for the
vgname, then the lookup fails, causing vg_read() to fail
because the intended VG is uncertain.
Usually, both vgname and vgid for the intended VG are passed
to vg_read(), which means the lvmcache translations
between vgname and vgid are not done.
If two different VGs with the same name exist on the system,
a command that just specifies that ambiguous name will fail
with a new error:
$ vgs -o name,uuid
...
foo qyUS65-vn32-TuKs-a8yF-wfeQ-7DkF-Fds0uf
foo vfhKCP-mpc7-KLLL-Uh08-4xPG-zLNR-4cnxJX
$ lvs foo
Multiple VGs found with the same name: foo
Use the --select option with VG UUID (vg_uuid).
$ vgremove foo
Multiple VGs found with the same name: foo
Use the --select option with VG UUID (vg_uuid).
$ lvs -S vg_uuid=qyUS65-vn32-TuKs-a8yF-wfeQ-7DkF-Fds0uf
lv1 foo ...
This is implemented for process_each_vg/lv, and works
with or without lvmetad. It does not work for commands
that do not use process_each.
This change includes one exception to the behavior shown
above. If one of the VGs is foreign, and the other is not,
then the command assumes that the intended VG is the local
one and uses it.
This makes process_each_vg/lv always use the list of
vgnames on the system. When specific VGs are named on
the command line, the corresponding entries from
vgnameids_on_system are moved to vgnameids_to_process.
Previously, when specific VGs were named on the command
line, the vgnameids_on_system list was not created, and
vgnameids_to_process was created from the arg_vgnames
list (which is only names, without vgids).
Now, vgnameids_on_system is always created, and entries
are moved from that list to vgnameids_to_process -- either
some (when arg_vgnames specifies only some), or all (when
the command is processing all VGs, or needs to look at
all VGs for checking tags/selection).
This change adds one new lvmetad lookup (vg_list) to a
command that specifies VG names. It adds no new work
for other commands, e.g. non-lvmetad commands, or
commands that look at all VGs.
When using lvmetad, 'lvs foo' previously sent one
request to lvmetad: 'vg_lookup foo'.
Now, 'lvs foo' sends two requests to lvmetad:
'vg_list' and 'vg_lookup foo <uuid>'.
(The lookup can now always include the uuid in the request
because the initial vg_list contains name/vgid pairs.)
When not using lvmetad, this uses the system_id field in
the cached vginfo structs that are populated during a scan.
When using lvmetad, this requests the VG from lvmetad, and
checks the system_id field in the returned metadata.
When the command already knows both the vgid and vgname,
it should send both to lvmetad for a more exact request,
and it can save lvmetad the work of a name lookup.
Remove long outstand unused code lines, which were already
been obsoleted by other code.
Statuses and snapshot tree creation is already handled differently.
Also drop some 'extra' log_error() and use only stack;
since error has already been reported.
Since we do not use dev_manager in a way we would have destroyed VG
content while in-use - we could safely keep just pointer.
So dropping strdup.
Also it seems we actually no longer use vg_name for anything
so it may possibly go away completely unless it would be useful
for debugging...
Just for convenience to display all new configuration settings
introduced since given version (before, there was only --atversion
to display settings introduced in concrete version).
For example:
$ lvmconfig --type new --sinceversion 2.2.120
allocation {
# cache_mode="writethrough"
# cache_settings {
# }
}
global {
use_lvmlockd=0
# lvmlockd_lock_retries=3
# sanlock_lv_extend=256
use_lvmpolld=1
}
activation {
}
# report {
# compact_output_cols=""
# time_format="%Y-%m-%d %T %z"
# }
local {
# host_id=0
}
Unifying terminology.
Since all the metadata in-use are ALWAYS on disk - switch
to terminology committed and precommitted.
Patch has no functional change inside.
lv preload for detached LVs started to be used also
for various other types which just happens to pass through
weak if() condition.
TODO: find here better solution to rather explicitly check
for types we really need to preload.
We do not won't to 'expose' internals of VG struct.
ATM we use lists to keep all LVs - we may want to switch
to better struct for quicker 'search'.
Since we do not need 'lists' but always actual LV,
switch find_lv_in_vg_by_lvid() to return LV,
and replaces some use case of find_lv_in_vg()
with 'better' working find_lv() which already
returns LV.
When 'lvextend -L+XX vg/thinpool' do not leave inactive table
loaded for 'wrapping' LV on top of resized thin-pool
(ATM we use linear LV for this with same size as thin-pool).
Udev recently start to 'link-in' major amount of useless libs.
(Seem to be faulty 'systemd' link-in all issue)
Anyway - avoid locking those libs in RAM.
When preloading thin-pool device node for already
existing/running thin-pool do not resume such thin-pool.
This allows to properly schedule commit point for metadata,
when thin-pool data or metadata volume is resized.
Extra space between 'cache' target and metadata device caused
string comparation being not equal and thus always causing
table reload even when uneeded.
Avoid internal error message where thin pool repair code tries to
fix cache pool - was catched later in code stack, so rather
catch this early and make the repair function exlusive
to thin pools.
So far we have no code for repairing cache pools
(other then the automatic during activation/deactivation).
To handle multiple VGs with the same name.
Simply using the VG name is ambiguous, and
lvmetad requires the VG uuid be used to
specify which one is meant.
Coverity here is a bit 'blind' here and cannot resolve which
code paths are actually able to hit this code path.
(It's using 'statistic' to resolve all possible paths,
and it's not scanning 'individual' code paths.)
This just cleans warns and add 'cheap' tests.
Use 'mda' instead of NULL to quite Coverity warn.
However this code seems to be actually not even possible to hit.
With proper analysis it may possibly be dropped from code to
simplify logic.
Skip testing target_pvs for NULL, we already
dereference it in many other places.
If check would ever be needed - it needs to be
in front of _raid_extract_images().
In lookup, return a count of entries with the
same key rather than the value from a second
entry with the same key.
Using some slightly different names.
Simply use lookup_withval right away rather than doing a
standard lookup, checking for the wrong mapping, then
repeating with lookup_withval to get the right mapping.
Coverity is not able to understand assembly language in
system's header file, so provide model for such macro.
Note: to really see model in-use: #nodef FD_ZERO model_FD_ZERO
need to go to coverity/config/user_nodefs.h
Add missing display_lvname in _lvconvert_merge_thin_snapshot().
Also when we detect missing origin, report Internal error,
which would likely be the primary fault here
(and avoid dereft of NULL origin as noticed by Coverity).
When reading older lvm2 metadata for cache-pool - we now handle more
extended syntax - basically we want to enter most setting when
actually creating cached LV.
For this new validation code has been added. However older metadata
without new settings set is now found as invalid.
Fix it by adding default settings for cache policy mq
and cache mode writethrough.
If the data len is passed into the hash table
and saved there, then the hash table internals
do not need to assume that the data value is
a string at any point.
New hash table functions are added that allow for
multiple entries with the same key. Use of the
vgname_to_vgid hash table is converted to these
new functions since there are multiple entries
in vgname_to_vgid that have the same key (vgname).
When multiple VGs with the same name exist, commands
that reference only a VG name will fail saying the
VG could not be found (that error message could be
improved.) Any command that works with the select
option can access one of the VGs with -S vg_uuid=X.
vgrename is a special case that allows the first VG
name arg to be replaced by a uuid, which also works.
(The existing hash table implementation is not well
suited for handling this case, but it works ok with
the new extensions. Changing lvmetad to use its own
custom hash tables may be preferable at some point.)
Here Coverity cannot see the pointer cannot be NULL in this
code path - opened coverity case #00531860.
We could make a model to avoid seeing related reports,
but then we loose coverage for modeled function.
So decided to add minor hint for this case.
Recent change 2c8d6f5c90
actually droped restart when the reason of failing open is missing
device completely - check for ENOENT now as another reason
to start new dmeventd server (when there is no systemd to maintain it).
The udev_device_get_is_initialized is available since libudev version
165. Older versions are still used somewhere (e.g. RHEL6). So better
check for this fn and use it only if it's available.
Udev db records are marked as not initialized (incomplete) on timeout.
Issue an error message whenever LVM finds such records so users are
aware that something's going wrong with udev db.
This is important in case we use devices/external_device_info_source="udev"
where udev database records are used to do various filtering decisions.
For example:
udev log of timed out worker:
Nov 11 13:02:25 raw.virt systemd-udevd[607]: seq 1997 '/devices/virtual/block/dm-2' is taking a long time
Nov 11 13:04:25 raw.virt systemd-udevd[607]: seq 1997 '/devices/virtual/block/dm-2' killed
Nov 11 13:04:25 raw.virt systemd-udevd[607]: worker [11221] terminated by signal 9 (Killed)
Nov 11 13:04:25 raw.virt systemd-udevd[607]: worker [11221] failed while handling '/devices/virtual/block/dm-2'
...
LVM also issues error message visibly if incomplete udev db record is found,
devices/external_device_info_source="udev" is set:
$ pvs
Udev database has incomplete information about device /dev/dm-2.
Failed to get external handle for device /dev/dm-2 [udev].
...
Coverity noticed this condition is always false and the error
path could never be visited.
So check for all mismatches of supported messages
and actually mark log_error as internal error.
When the first arg is a UUID and vgrename translates
that UUID to a current VG name, the old and new VG
names are not being checked for equality. If they
are equal, it produces an internal error rather than
a proper error.
Use delay_dev to slow down mirror sync so we could more
easily check for race and proper reject of parallel mirror
leg addition/reduction.
Also expose fail in mirror allocation of parallel leg.
Rather then skipping whole test - just do not use it.
Failing tests that have required delay need to deal with reality
and shell either check for HAVE_DM_DELAY and skip portion
of test or using should when needed.
While through all codepaths we never 'read' lock_id unless LCKF_CONVERT,
coverity cannot decrypt this.
As since it's usually better to pass in 'well-defined' data structures
preset lock_id to 0.
Use fputs() when printing plain string,
easier then fprintf which needs to parse it.
Also check fd before close is >= 0 -
it is - but coverity fail to see it, so eliminate
this false-positive warning.
Doing 'stat' checking first and later opening is racy.
And since we do not really care about any 'status' info
here and we read 'sysfs' here - just drop whole 'stat()'
call and directly handle error from failing 'fopen()'.
Coverity here is not fully-in-picture - but please it
with validation of pointer which currently cannot be null,
since we always return at least empty string.
Check for arg_vgid_lookup and arg_name_lookup not being NULL.
Drop checking arg_vgid and arg_name for NULL since they
are already dereference earlier - thus mostly must be NOT NULL.
(If that would be possible larger rework of this function would be
required).
Put calls related to fifo opening into a single function.
Fix Time-Of-Check-Time-Of-Use and use fstat()
and fchmod() on already opened fd instead of
checking first path and then risking to open something
different.
Currently the code creates the log separately after allocating space for
the data and as no data allocation is needed this second time,
total_extents ends up holding zero so use new_extents directly instead.
When reading a foreign VG we cannot write it, since
it belongs to another host. When reading a shared VG
we cannot write it because we may not have an ex lock.
(Or we may be reading the shared VG while not using
lvmlockd in which case it's like reading a foreign VG.)
Add the same checks for wiping outdated PVs. We may
read a foreign or shared VG, or see the PVs, while
another host is part way through writing a new version
of the VG to the PVs. This might cause us to think
some of the PVs are outdated. We do not want to
write another host's PVs, especially when we may
wrongly conclude they are outdated.
When the command gets a list of alternate devices
from lvmetad, log each one directly. This is not
the same as the warnings when adding lvmcache,
which are related to which duplicate is preferred.
If two PVs have different VGs with the same name
(different uuids), one of the VGs is ignored by
lvmetad. A FIXME exists in lvmetad to find a
better response.
update_metadata and pv_found update the cached metadata;
these are both reworked to improve the code, organize it
by each possible state and transition, make it much more
clear what's changing, add more error checking and
handling, and add comments.
The state and content of the cache (hash tables) does not
change (apart from some things that didn't work before),
and the communication to/from commands does not change.
The implementation and organization of the code making
the state changes does change significantly.
One detail related to the content of the cache does change:
different hash tables do not reference the same memory any more;
the target values in each hash table are allocated and freed
individually.
The str_list_destroy function may be called to cleanup memory when
the list is not used anymore and the list itself was not allocated
from the memory pool.
When checking minimum mda size, make sure the mda_size after alignment
and calculation is more than 0 - if there's no place for an MDA at the
end of the disk, the _text_pv_add_metadata_area does not try to add it
there and it returns (because we already have the MDA at the start of
the disk at least).
Actually, we don't need extra condition as introduced in commit
00348c0a63. We should fix the last
condition:
(mdac->rlocn.size >= mdah->size)
...which should be:
(MDA_HEADER_SIZE + (rlocn ? rlocn->size : 0) + mdac->rlocn.size >= mdah->size))
Where the "mdac" is new metadata, the "rlocn" is old metadata.
So the main problem with the previous condition was that it
didn't count in MDA_HEADER_SIZE properly (and possible existing
metadata - the "rlocn"). This could have caused the error state
where metadata in ring buffer overlap to not be hit.
Replace the new condition introduced in 00348c0a63
with the improved one for the condition that existed there
already but it was just incomplete.
We're already checking whether old and new meta do not overlap in
ring buffer (as we need to keep both old and new meta during vg_write
up until vg_commit).
We also need to check whether the new metadata do not overlap
themselves in case we don't have old metadata yet (...because
we're in vgcreate). This could happen if we're creating a VG so
that the very first metadata written are long enough that it wraps
themselves in metadata ring buffer.
Although we limited the minimum metadata area size better with the
previous commit ccb8da404d which
makes the initial VG metadata overlap in ring buffer to be less
probable, the risk of hitting this overlap condition is still there
if we still manage to generate big enough metadata somehow.
For example, users can provide many and/or long VG tags during vgcreate
so that the VG metadata is long enough to start to wrap in the ring
buffer again...
Drop already tested 'threshold & create' which is in
lvextend-thin-full.sh
Count with now match faster 'dmeventd' wakeup on watermark
as it's now nearly instant after crossing threshold value.
If the underlaying device has actually bigger read-ahead settings,
let it pass.
But anyway switch to 512 strip-size to get really high R-A sector count.
This option could never have been printed in lvm2 metadata, so it could
be safely removed as it could have been set only as 0.
These configurable setting is supported via metadata profile.
Now with correctly functioning dmeventd enable usage of
low_water_mark for faster reaction on pool's threshold.
When user select e.g. 80% as a threshold value,
dmeventd doesn't need to wait 10 seconds till monitoring
timer expires, but nearly instantly resizes thin-pool
to fit bellow threshold.
If plugin's lvm command execution fails too often (>10 times),
there is no point to torture system more then necessary, just log
and drop monitoring in this case.
The recent addition to check for PVs that were
missed during the first iteration of processing
was unintentionally catching duplicate PVs because
duplicates were not removed from the all_devices
list when the primary dev was processed.
Also change a message from warn back to verbose.
If a VG is removed between the time that 'vgs'
or 'lvs' (with no args) creates the list of VGs
and the time that it reads the VG to process it,
then ignore the removed VG; don't report an error
that it could not be found, since it wasn't named
by the command.
The former patch(dab3ebce4c) is a little bit strict. For example, it is
OK to create PV on unpartitioned DASD devices with LDL formatted. So
after lvm version containing the patch, LVs created on those devices
could not be found.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Improve event string parser to avoid unneeded alloc+free.
Daemon talk function uses '-' to mark NULL/missing field.
So restore the NULL pointer back on parser.
This should have made old tools like 'dmevent_tool' work again.
As now 'uuid' or 'dso' could become NULL and then be
properly used in _want_registered_device() function.
Since lvm2 always fill these parameters, this change should
have no effect on lvm2.
PVs could be missing from the 'pvs' output if
their VG was removed at the same time that the
'pvs' command was run. To fix this:
1. If a VG is not found when processed, don't
silently skip the PVs in it, as is done when
the "skip" variable is set.
2. Repeat the VG search if some PVs are not
found on the first search through all VGs.
The second search uses a specific list of
PVs that were missed the first time.
testing:
/dev/sdb is a PV
/dev/sdd is a PV
/dev/sdg is not a PV
each test begins with:
vgcreate test /dev/sdb /dev/sdd
variations to test:
vgremove -f test & pvs
vgremove -f test & pvs -a
vgremove -f test & pvs /dev/sdb /dev/sdd
vgremove -f test & pvs /dev/sdg
vgremove -f test & pvs /dev/sdb /dev/sdg
The pvs command should always display /dev/sdb
and /dev/sdd, either as a part of VG test or not.
The pvs command should always print an error
indicating that /dev/sdg could not be found.
Recognize the target only 'extends' and do not enforce
'flush' in this case. Only the size reduction
still requires flush (so disables usage of no_flush flag).
If some other targets do require flush before suspend,
they have to explicitly ask for it.
While the activation code tries to evaluate which target
really needs flush with suspend and which may go without flush,
it has stayed effectively disabled by original commit:
33f732c5e9 since here
it only allows to pass non-pvmoving 'mirrors'.
So remove check for mirror LV type and only disable
no_flush for 'pvmove'..
TODO: Looking into history - it also seemed like raid target
would have always required flushing but it's been later
removed without clean explanation.
If some more targets really do need 'no_flush' it should
been handle at their 'level' - since we now stack multiple
targets over itself.
Add more functionality to size_changed function.
While 'existing' API only detected 0 for
unchanged, and !0 for changed,
new improved API will also detected if the
size has only went bigger - or there was
size reduction.
Function work for the whole dm-tree - so
no change is size is always 0.
only size extension 1.
and if some size reduction is there - returns -1.
This result can be used for better evaluation
whether we need to flush before suspend.
Use single code to evaluate if the percentage value has
crossed threshold.
Recalculate amount value to always fit bellow
threshold so there are not need any extra reiterations
to reach this state in case policy amount is too small.
Since plugin's percentage compare has been fixed,
it's now revealed wrong compare here.
The logic for threshold is - to allow to go as high
as given value e.g. 80% - so if pool is exactlu 80%
full it's still allowed to use it (dmeventd will not
resize it).
Commit 1a74171ca5 added
a check to ignore a VG that was FAILED_INCONSISTENT
if the command doesn't care if the VG is not found.
Remove that check because that case is never reached
by the current code.
The ONE_VGNAME_ARG was being passed and tested as
vg_read() flag but it's a cmd struct flag.
(It affects command arg processing in toollib,
not vg_read behavior. Flags related to command
processing are generally cmd struct flags, while
vg_read arg flags are generally related to vg_read
behavior.)
Running "vgremove -f VG & pvs" results in the pvs
command reporting that the VG is not found or is
inconsistent. If the VG is gone or being removed,
the pvs command should just skip it and not print
errors about it.
"Not found" is because the pvs command created the
list of VGs to process, including VG, then vgremove
removed the VG, then the pvs command came to to read
the VG to process it and did not find it.
An "inconsistent" error could be reported if vgremove
had only partially completed removing VG when pvs did
vg_read on the VG to process it, causing pvs to find
the VG in a partially-removed state.
This fix adds a flag that pvs uses to ignore a VG
that can't be read or is inconsistent.
When lvmetad is used and lvmcache update function (lvmcache_update_vgname_and_id)
was called to update existing lvmcache records, a condition was met
which made to retun from the update function immediately, effectively
making it NOOP.
It seems there's no reason for such condition and lvmcache should be
update appropriately even when lvmetad used as lvmcache may be reused,
most notably in lvm shell.
It's possible this is a remnant of the lvmetad development code which
didn't get removed for some reason and the bug didn't get spotted
because lvm shell is not used often (the condition dates back to 2012
or so).
Example, lvmetad and lvm shell used:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 124.00m
Before this patch:
==================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
With this patch applied:
========================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
The lvmcache info might be resued, most notably in lvm shell.
We need to be sure that even lvmcache_info marked as invalid
is removed from the lvmcache so it does not confuse any subsequent
code/commands executed later on.
Problematic example with the lvm shell:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Before this patch (/dev/sda still displayed in a way):
======================================================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
No physical volume label read from /dev/sda
(with lvmetad)
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
With this patch applied:
========================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
(with lvmetad)
lvm> pvs
Older pthread library was missing 'trick'
in pthread_cleanup_pop() which lead to
compilation error:
error: label at end of compound statement
Use explicit ';' to fix it.
Make lvm2_disable_dmeventd_monitoring() more explicit.
As memlock_inc_daemon() is also used by clvmd, which
does changes dmeventd and suspend ignore state at
some stages - make updates of these 2 variable
tied to the call of lvm2_disable_dmeventd_monitoring().
Once this call is made dmeventd monitoring
and suspended devices are ignored.
TODO: all lvm-global settings should really be moved
to command context.
Implementing exit when 'dmeventd' is idle.
Default idle timeout set to 1 hour - after this time period
dmeventd will cleanly exit.
On systems with 'systemd' - service is automatically started with
next contact on dmeventd communication socket/fifo.
On other systems - new dmeventd starts again when lvm2 command detects
its missing and monitoring is needed.
Add support to unmonitor device when monitor recognizes there is
nothing to monitor anymore.
TODO: possibly API change with return value could be also used.
Redesign threading code:
- plugin registration runs within its new created thread for
improved parallel usage.
- wait task is created just once and used during whole plugin lifetime.
- event thread is based over 'events' filter being set - when
filter is 0, such thread is 'unused'.
- event loop is simplified.
- timeout thread is never signaling 'processing' thread.
- pending of events filter cnange is properly reported and
running event thread is signalled when possible.
- helgrind is not reporting problems.
Need here to keep control device opened while there is 'any' dso
plugin loaded - otherwise there would a race closing controlfd
inside lvm2 plugin while some other monitoring thread would
tried to execute another WAITEVENT task.
Move all DSO related function in front, so they could be easily
referenced from rest of code.
Add proper error paths with logging and error reporting.
Drop mutex locking when releasing DSO - since DSO is always
allocated and released in main 'event' processing thread.
CONVERTING status flag is a tricky one. It's not set when converting
a non-mirror LV type to the mirror type, i.e.: linear -> two leg mirror.
Also the conversion itself is instant and doesn't require to be polled.
When mirror reaches sync state there's no final update on VG metadata
for lvmpolld to be made thereby report_progress in fact doesn't report
percentage of mirror being converted but percentage of mirror
being in sync. Perhaps we should reword the lvconvert output here.
On the other hand CONVERTING is set while we upconvert the mirror
from i.e. two leg mirror to four leg mirror. In such case the operation
is required to be polled so that lvmpolld can cleanup temporary
conversion log when the conversion is over.
Ignore CONVERTING lv_type for the moment and match LVs only by uuids
during 'mirror conversion'/'waiting for a sync to finish'.
The old code made two loops through the PVs: in the first
loop it found the max PV and VG name lengths, and in the
second loop it printed each PV using the name lengths as
field widths for aligning columns.
The new code uses process_each_pv() which makes one loop
through the PVs. In the *first* call to pvscan_single(),
the max name lengths are found by looping through the
lvmcache entries which have been populated by the generic
process_each code prior to calling any _single functions.
Subsequent calls to pvscan_single() reuse the max lengths
that were found by the first call.
The new report/compact_output_cols setting has exactly the same effect
as report/compact_output setting. The difference is that with the new
setting it's possible to define which cols should be compacted exactly
in contrast to all cols in case of report/compact_output.
In case both compact_output and compact_output_cols is enabled/set,
the compact_output prevails.
For example:
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols=""
$ lvs vg
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=1
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize
lvol0 vg -wi-a----- 4.00m
dm_report_compact_given_fields is the same as dm_report_compact_fields,
but it processes only given fields, not all the fields in the report
like dm_report_compact_field does.
If a host failed while holding a sanlock lease,
sanlock_acquire will by default block and wait
for the lease to expire before returning. We
want it to return with an error so we can retry
instead of blocking, which allows us to process
other lock operations.
(Enclose this in an ifdef until the new flag
appears in a sanlock release.)
Respect lvm2_log_fn prototype. The idea of 'reusing' print_log with
plain cast is causing very strange crashes with some older 'gcc' compilers.
So just do it cleanly...
This reverts commit 1b1c01a27b.
This caused messages to get dropped instead of logged into the log file.
(The log file and log function are independent at the moment.)
Rework thread creation code to better use resources.
New code will not leak 'timeout' registered thread on error path.
Also if the thread already exist - avoid creation of thread
object and it's later destruction.
If the race is noticed during adding new monitoring thread,
such thread is put on cleanup list and -EEXIST is reported.
As we now use 'unified' logging macro system - we no longer need
to protect from change of logging function pointer - it's set
once at the start of dmeventd and not change anymore
(as lvm2 library no longer interferers here).
Some signatures are spread around the disk in several copies, mainly for
backup. Make libblkid to detect these extra copies - there was missing
"blkid_probe_step_back" fn call after successful wipe of previous signature
copy.
An example with FAT table which has copies:
$ mkfs.vfat /dev/sda1
Before this patch:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
With this patch applied:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 0. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 510. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
Make sure log/prefix is set to "" when getting the list of VG names.
We need this for the format to be correct so it's properly searched
through later on.
$ vgcreate vgA /dev/sda
Volume group "vgA" successfully created
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ dd if=/dev/sda of=/dev/sdc bs=1M
(the new VG name is prefix of existing VG name)
$ vgimportclone -n vg /dev/sdb
(the new VG name is suffix of existing VG name)
$ vgimportclone -n gA /dev/sdc
Before this patch:
------------------
(we end up with "vg1" and "gA1" names with the "1" suffix which is not needed)
$ vgs -o vg_name
VG
gA1
vg1
vgA
With this patch applied:
------------------------
(we end up with "vg" and "gA" names as they're unique already and no extra suffix is added)
$ # vgs -o vg_name
VG
gA
vg
vgA
Of course, if the name supplied is not unique, the number is added correctly:
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ vgimportclone -n vgA /dev/sdb
$ vgs -o vg_name
VG
vgA
vgA1
If lvmlockd is running, lvmetad is configured (use_lvmetad=1),
but lvmetad is not running, then commands will seg fault
when trying to send a message to lvmetad.
The difference is lvmetad being "active", not just "used".
We can replace the expressions with awk/grep/cut/tr with --select now and
more suitable reporting options and modes. Also, we don't need to check
the temporary lvm.conf generated within vgimportclone script since we're
generating it ourselves now using lvmconfig, not using sed anymore like
it was before (so we can be pretty sure it's correct - we use lvmconfig
now even for generating the lvm.conf itself).
We already have pv_count to report number of PVs that a VG has based
on metadata.
This patch exposes the information about how many of these PVs are
missing which is also useful information for a VG. Wwe could count
the sum of pv_missing reporting fields for each PV in the VG before,
but the new field is practical when reporting VG as a whole and there's
no need to process each PV from VG alone.
If 'vgcreate --shared' finds both sanlock and dlm are running,
print a more accurate error message:
"Found multiple lock managers, select one with --lock-type."
When neither is running, we still print:
"Failed to detect a running lock manager to select lock type."
Using --lock-type sanlock|dlm implies --shared.
Using --shared selects lock type sanlock|dlm
(by choosing the one that's running.)
Using both --shared and --lock-type sanlock|dlm should
also be allowed (--shared is just redundant information.)
Also, leave out the note about "circular buffer" which is
an internal imeplementation detail anyway and not quite
informational for users:
Before this patch:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large for circular buffer
Failed to write VG vg1.
With this patch applied:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large: size of metadata to write is 691 bytes while PV metadata area size on /dev/sda is 512 bytes.
Failed to write VG vg1.
Before this patch:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear public
With this patch applied:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear private,lockd,sanlock
Avoid running tests, when prefix already exist in the system.
As prefix just uses PID number, we may hit a case for long
running tests, where devices from some previous runs were not
properly cleared away - detect this and fail early.
(Such machine should be inspected and fixed).
Add metadata_devices and seg_metadata_le_ranges report fields.
Currently only defined for raid, but should probably be extended
to all other segment types that don't report all their device
usage in the 'devices' field.
Correct some things, e.g. set mode and policy on
the cache lv, not the pool, lvm.conf field for
mode changed.
Add smq which was missing.
Make the sections on cache mode and cache policy
consistent in structure and style.
When lvmetad_pvscan_vg() reads VG metadata from each PV,
it compares it to the last one to verify it matches.
If the VG metadata does not match on the PVs, an error
is printed and it fails to read the VG. In this error
case, use log_debug to show the differences between
the two unmatching copies of the metadata.
One host changes a VG, making the cached VG on another
host invalid. The other host then rereads the VG from
disk to get the latest copy. If the first host removed
a PV from the VG, the second host attempts to reread the
VG from old PV when rescanning. Reading the VG from the
removed PV fails, causing vg_read to return "VG not found".
The fix is to simply not fail when a VG is not found while
rereading a PV and continue without it.
(This doesn't happen if the second host happens to first
run a command like 'vgs' that triggers a global revalidation
of metadata.)
vgchange --lock-type iterates through LVs to ensure
no LVs are active before changing the lock type of
the VG, but the loop was not checking that an LV
actually has a lock before trying it, so it would
fail if the VG had any LVs that don't use locks,
e.g it would fail on a tmeta LV from a pool.
We want most of our units to be started before any local/remote mount
points are mounted - we used {local,remote}-fs.target for this purpose
before, but it was not 100% correct as there's even {local,remote}-fs-pre.target
special systemd unit reserved for this exact purpose.
See also man 7 systemd.special and "local-fs-pre.target"/"remote-fs-pre.target"
description.
When user specifies '--force' with remove/remove_all/wipe_table
use '--noflush --nolockfs' resume flags, so the operation
will not block when device underneath is blocked.
Also make error messages more consistent:
Before this patch:
(/run/lock exists and is not a directory)
$ pvs
/run/lock/lvm: mkdir failed: Not a directory
File-based locking initialisation failed.
(/run/lock/lvm exists and is not a directory)
$ pvs
Directory "/run/lock/lvm" not found
File-based locking initialisation failed.
With this patch applied:
(/run/lock exists and is not a directory)
$ pvs
Existing path /run/lock is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed
(/run/lock/lvm exists and is not a directory)
$ pvs
Existing path /run/lock/lvm is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed.
When using udev, the /dev/mapper entries are symlinks - fix the code
to count with this.
This patch also fixes the dmsetup mknodes and vgmknodes to properly
repair /dev/mapper content if it sees dangling symlink in /dev/mapper.
$ lvs -o name,tags vg
LV LV Tags
lvol0
lvol1 mytag
Before this patch:
$ lvs -o name,tags vg -S 'tags=""'
Failed to parse string list value for selection field lv_tags.
Selection syntax error at 'tags=""'.
Use 'help' for selection to get more help.
(and the same for -S 'tags={}' and -S 'tags=[]')
With this patch applied:
$ lvs -o name,tags vg -S 'tags=""'
LV LV Tags
lvol0
(and the same for -S 'tags={}' and -S 'tags=[]')
If lvmlockd acquires an lv lock for a command, but the
command exits before the reply, then the command has
not activated the lv and lvmlockd should unlock it.
This only applies when the lv was not already locked.
(There will always be a chance that the lv lock is held
while the lv is not active, i.e. if the command fails in
the small window between getting the lv lock and before
doing the activation. In that case, rerunning the
activation command corrects the inconsistency.)
This commit helps by automatically clearing the
inconsistency (lv locked by not activated) in the most
common case when the lv lock operation is slow to
complete and the command is canceled by the user.
This commit also adds and cleans up references to the
client id in a bunch of log messages, which is useful
to follow processing on each independent lock request.
When using lvm shell, some structures which are cached in memory may be
reused. This happens for the struct label (a part of lvmcache_info
structure) when lvmetad is used in which case the PV scan is not
done that would normally overwrite these label structures in memory
and making them up-to-date.
This is all consequence of the fact that struct lvmcache_info and
struct label are not always assigned in the same part of the code.
For example, if lvmetad *is not* used, parts of the struct label are
reassigned in label_read fn while struct lvmcache_info is created
elsewhere. No part of the code reused struct label (and its "dev"
field) before calling label_read fn. That's why the real bug is
hidden when using lvm shell without lvmetad.
However, with lvmetad and lvm shell, the situation is a bit different.
The label_read fn is not called if lvmetad *is* used, hence the
struct label may have ended up not initialized properly.
There was missing assignment for the dev field in struct label
in _text_pv_write fn which caused this problem to appear in
lvm shell with lvmetad, for example:
Before this patch:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
unknown device lvm2 --- 128.00m 128.00m
With this patch applied:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Also, this problem had not appeared before changes introduced
by commits e1a63905d1 through
3a6f91d713 which, among other
things, added proper label field type reporting. Before, label
reporting was the same as using struct physical_volume which
has its own dev field assigned and so this problem was not exposed.
When a command does a sequence of
vg_write + vg_commit + vg_write + vg_commit,
initialization of non-PV devices happens during the
first vg_write, and does not need to be repeated by
the second vg_write.
When creating a lockd VG, this sequence occurs because
the VG is first created, then the lockd data is created,
then the lockd data is then written to the VG metadata.
Avoid validation of free space in pool, when no messages are passed.
Patch a3c7e326c3 add new check for
pool overload - but this check should not be made if there are
no messages and transaction_id is still within 'bounds' (bigger by 1).
Commit 00b36ef06a had a typo
and missed '{' for shell variable, thus command used slightly
different 'tmp' dir name for cache dir (with extra '}').
Such change was unnoticed until a recent fix in persistent
filter, lvm2 missed to update cache file when --config
was specified.
The result was, /tmp dir was accumulating snap.XXXXX} dirs when
running vgimportclose script.
Since we may want to swap names when LVs are complex types, we cannot
avoid doing full renames on both LV stacks.
Temporarily use 'pvmove_tmeta' as unused name to prevent validation troubles.
Commit 9403edbb93 move location of
configure.h and lvm-version.h.
Let's try even better place then /conf dir which should be left
for user configurable files.
Put these files right into include dir.
This applies the same rule/logic to dlm VGs that has always
existed for sanlock VGs. Allowing a dlm VG to be removed
while its lockspace was still running on other hosts largely
worked, but there were difficult problems if another VG with
the same name was recreated. Forcing the VG lockspace to
be stopped, gives both sanlock and dlm VGs the same behavior.
This shortcut was added for an odd case that I do not
believe is relevant any more. Having an alternate
path for lockspace thread cleanup is a complication
that could lead to problems.
The code was expecting the wrong return value from
compare_config, which returns 0 when equal.
This is a problem for a lockd VG using multiple PVs
when the VG needs to be rescanned.
Somehow raid tests landed in plain cache - separte them out
so they properly check for have_raid.
Check we do not support strip option with cache-pool creation.
ATM allocation can't handle stripping and cache pool allocation.
It's not yet even clear what should be actually result.
Until resolved, disable this option (it's been coredumping
inside allocation anyway).
Certain stacks of cached LVs may have unexpected consequences.
So add a warning function called when LV is cached to detect
such caces and WARN user about them - the best we could do ATM.
When we insert layer we also move status flag-bits for certain LV types,
so internal volume_group structure remains consistent.
(Perhaps it's misuse of 'insert_layer' function and we should have
another similar function for this.)
Basically we aim to maintain the same state as after reading fresh
metadata out of volume group.
Currently we when i.e. cache 'raid' LV - this should transfer 'raidLV' flag
to _corigin LV and cache is no longer a raid.
TODO: bits for stacked devices needs more exact rules.
The dlm will often lose the lvb content, so we need to
check quite a few possibilities for lvb values that
were not being checked before.
Refactoring was required to pass the entire lvb value
back to the core code instead of the single value.
The only functional change should be detecting new
lvb states where metadata is now invalidated where
it wasn't before.
When an action is created by lvmlockd for itself,
there is no client to send the result to. Add
the NO_CLIENT flag to the action to skip sending
the result to a client.
Add a new arg to lockd_start_vg() that indicates
it is being called for a new lockd VG, so that
lvmlockd knows the lockspace being started is new.
(Will be used by a following commit.)
Commit f6473baffc introduced a new
cmd->initialized variable to keep info about which parts of the
cmd_context have been initialized.
A part of this patch was also a change in refresh_filters fn
which checks for cmd->initialized.filters variable and it does
the filter refresh *only* if the filter has already been initialized
before otherwise it's a NOOP (before, the refresh_filters also
initialized filters as a side effect in case it had not been
initialized before which was not quite correct).
However, the commit f6473baffc
did not handle the case in which configuration changes
either via --config argument or when configuration file changed
and its timestamp was higher than the timestamp of the persistent
cache file - the /etc/lvm/cache/.cache.
This patch fixes this issue and it causes the init_filters fn
in lvm_run_command fn to be called with proper value of
"load_persistent_cache" switch even if the configuration changes,
hence causing the persistent cache file to be ignored in this
case.
Ensure make clean cleans any left-over file from their previous
location so they are not in conflict with new ones.
Also hide error message when .commands file is not present.
The regex filter (controlled by devices/filter lvm.conf setting) was
evaluated as the very last filter. However, this is not optimal when
it comes to restricting disk access - users define devices/filter
as well as devices/global_filter to avoid this.
The devices/global_filter is already positioned at the beginning of the
filter chain. We need to do the same for devices/filter.
Filter chains before this patch:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
type_filter -> usable->filter -> mpath_component_filter ->
partition_filter -> md_component_filter -> fw_raid_filter ->
regex_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> usable_filter -> regex_filter
From the chain list above we can see that particularly in case when
lvmetad is not used, the regex filter is the very last one that is
processed. If lvmetad is used, it doesn't matter much as there's
the global_regex_filter which is used instead when updating lvmetad
and when retrieving info from lvmetad, putting regex_filter in front
of usable_filter wouldn't change much since usabled_filter is not
reading disks directly.
This patch puts the regex filter to the front even in case lvmetad
is not used, hence reinstating the state as it was before commit
a7be3b12df (which moved the regex_filter
position in the chain). Still, the arguments for the commit
a7be3b12df still apply and they're
still satisfied since component filters (MD, mpath...) are evaluated
first just before updating lvmetad.
So with this patch, we end up with:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
regex_filter -> type_filter -> usable->filter ->
mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> regex_filter -> usable_filter
This way, specifying the regex_filter in non-lvmetad case causes
the devices to be filtered based on regex first before processing
any other filters which can access disks (like md_component_filter).
This patch also streamlines the code for better readability.
Hopefull 4.3 will be fixed and test will be updated to let
raid test running again.
Meanwhile using md-raid may effectively kill kernel,
so leave at least other tests running.
Split up _build_histogram_arg() into separate functions to allocate
and fill the histogram arg string and remove nested local variable
declarations from the parent function.
Coverity flags a user-after-free in _stats_histograms_destroy():
>>> Calling "dm_pool_free" frees pointer "mem->chunk" which has
>>> already been freed.
This should not be possible since the histograms are destroyed in
reverse order of allocation:
203 for (n = _nr_areas_region(region) - 1; n; n--)
204 if (region->counters[n].histogram)
205 dm_pool_free(mem, region->counters[n].histogram);
It appears that Coverity is unaware that pool->chunk is updated
during the call to dm_pool_free() and valgrind flags no errors in
this function when called with multiple allocated histograms.
Since there is no actual need to free the histograms individually
in this way simplify the code and just free the first allocated
object (which will also free all later allocated histograms in a
single call).
Put include/.symlinks_created as a prerequisite for dep calc.
Otherwise if these are not generated and user enters tests subdir and
runs 'make' he just gets endless loop of dep calculation.
Relocate generated configure.h and lvm-version.h outside
of compilable .c source tree.
The reason is behind - when compiling in builddir != srcdir
the generated file in lib/misc/configure.h was used for all compiled
source file except ones located in lib/misc dir - those would have used
configure.h file located in this dir - if there have existed one (i.e.
from some other build)
This problem was only visible, when srcdir == buildir was used before
trying to use srcdri != builddir (as configure.h appeared then in
srcdir).
The histogram changes adds a new error path to dm_stats_create().
Make sure that the dm_stats handle is properly destroyed if we fail
to create the histogram pool and check for failures setting the
program_id.
Since we are growing an object in the histogram pool the return
value of dm_pool_grow_object() must be checked and error paths need
to abandon the object before returning.
Undo the part of the recent EREMOVED change which
automatically stopped the lockspace for a remotely
removed VG. It didn't always work (would not work
when lvb content was rebuilt in the dlm). This will
be handled better when the lvb content is controlled
more strictly.
As part of fix that came with cf700151eb,
I forgot to add the check whether the result of stat was successful or
not. This bug caused uninitialized buffer to be used for entries
from .cache file which are no longer valid.
This bug may have caused these uninitialized values to be used further,
for example (see the unreal (2567,590944) representing major:minor
pair):
$ pvs
/dev/abc: stat failed: No such file or directory
Path /dev/abc no longer valid for device(2567,590944)
PV VG Fmt Attr PSize PFree
/dev/mapper/test lvm2 --- 104.00m 104.00m
/dev/vda2 rhel lvm2 a-- 9.51g 0
Older versions of gcc aren't able to track the assignments of
local variables as well as the latest versions leading to spurious
warnings like:
libdm-stats.c:2183: warning: "len" may be used uninitialized in this
function
libdm-stats.c:2177: warning: "minwidth" may be used uninitialized in
this function
Both of these variables are in fact assigned in all possible paths
through the function and later compilers do not produce these
warnings.
There's no reason to not initialize these variables though and
it makes the function slightly easier to follow.
Also fix one use of 'unsigned' for a nr_bins value.
Replace the histogram stats subcommand with a --histogram switch
to enable histogram related fields for both list and report output.
To avoid overloading the existing --histogram rename it to --bounds:
this is also a better description of the option.
Remove the optimization/shortcut for starting the dlm global
lockspace when it was already running.
Reenable automatically starting the dlm global lockspace
when a command attempts to use it and it's not yet started.
This had become disabled at some point.
Since we may easily get blocked when checking for percentage
of thin-pool - do not flush and just show current values.
This avoids holding VG locked when pool is overfilled.
Try to detect thin-pool which my block lvm2 command from furher
processing (i.e. lvextend).
Check if pool is read-only or out-of-space and in this case thins
will skipped from being scanned (so user may miss some PVs located
on thin volumes).
Fix regression introduced with commit:
2fc126b00d
This commit has moved pv_min_size() test in front
of device_is_usable(). However pv_min_size needs to open device,
so it may have actually get blocked.
So restore the original order and first validate
dm device to be usable for open.
It's worth to note that such check is not 'race-free',
but it usually eliminates 99.99% of problems ;).
Print [source:handler] in filters' debug messages only if external
device info source other than "none" is used.
$ lvmconfig --type full devices/external_device_info_source
external_device_info_source="none
Before this patch (from the -vvvv log):
filters/filter-usable.c:47 /dev/mapper/test: Skipping: Too small to hold a PV [none:(nil)]
filters/filter-md.c:33 /dev/sdb: Skipping md component device [none:(nil)]
filters/filter-partitioned.c:25 /dev/vda: Skipping: Partition table signature found [none:(nil)]
With this patch applied:
filters/filter-usable.c:44 /dev/mapper/test: Skipping: Too small to hold a PV
filters/filter-md.c:35 /dev/sdb: Skipping md component device
filters/filter-partitioned.c:27 /dev/vda: Skipping: Partition table signature found
librt doesn't have a pkgconfig file so use Libs.private: -lrt instead
to declare the dependency directly.
The same applies for -lm which is also used and which hasn't been
defined in the devmapper.pc file yet.
# lvmdbusd relies on command log report to inspect LVM command's execution status
report_command_log=1
# display only outermost LVM shell-related log that lvmdbusd inspects first after LVM command execution (it calls 'lastlog' for more detailed log afterwards if needed)
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.