IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This reverts commit 99b6173f10.
These tests are disabled with lvmlockd because they use
snapshots without an origin which is not permitted in a
shared vg.
user creates a file listing real devices they want
lvm tests to use, and sets LVM_TEST_DEVICE_LIST.
lvm tests can use these with prepare_real_devs
and get_real_devs.
Other aux functions do not work with these devs.
To better test actually fsadm in test suite - avoid setting
LVM_BINARY locally - since test setup already modifies
PATH to find test's lvm binary as the 1st. in path.
Move extra md component detection into the label scan phase.
It had been in set_pv_devices which was deep within the vg_read
phase, which wasn't a good place (better to detect that earlier.)
Now that pv metadata info is available in the scan phase, the pv
details (size and device_hint) can be used for extra md checking.
Use the device_hint from the pv metadata to trigger a full md
component check if the device_hint begins with /dev/md.
Stop triggering full md component checks based on missing
udev info for a dev.
Changes to tests to reflect that the code is now detecting
md components in some test case that it wasn't before.
A cachevol can be forcibly detached when it's missing devices.
Also allow this if it's damaged/invalid and unrepairable.
This would be needed to recover data from the origin LV after
a cachevol is lost or damaged beyond repair.
In cases where lvconvert does not detect a fs block size on the
device, it falls back to choosing a writecache block size based
on the device's LBS and PBS (tries to match those.)
If the user specifies a writecache block size on the command
line (--cachesettings block_size=4096|512), lvconvert currently
fails and reports an error if the user-specified value does not
match the value lvconvert would have chosen based on LBS and PBS.
The purpose of allowing a user-specified value on the command line
is to override what lvconvert would otherwise do, so change this
to just print a warning that the user value does not match the
value that would be chosen based on the LBS/PBS, and then take
the user-specified value as the writecache block size.
In case legs of a raid0 LV are removed, the lvdisplay command still
reports 'available' though raid0 is not providing any resilience
compared to the other raid levels.
Also lvdisplay does not display '(partial)' in case of missing raid0
legs as oposed to the lvs command.
Enhance lvdisplay to report "NOT available" for any RaidLV type in case
too many legs are inaccessible hence causing data loss. I.e. any leg
for raid0, all for raid1, more than 1 for raid4/5, more than 2 for raid6
and in case of completely lost mirror groups for raid10.
Add test/shell/lvdisplay-raid.sh.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1872678
When a writecache sublv or an integrity metadata sublv
are partial (missing a dev), set the partial flag on
the upper level LV also, as is done for other sublvs.
Fix the two-step writecache detach in commit c32d7fed4f.
In the case of uncache, the cachevol is removed after
detaching the writecache. When the detach is finished
in the second step, the remove must wait until then.
Verify that corruption is corrected for raid levels other
than raid1. For other raid levels, attempt to corrupt the
given file pattern on each underlying device, since we don't
know which device contains the file being corrupted.
This ensures that corruption is actually be introduced
when testing the other raid levels.
Verify that corruption is being corrected by checking
the integritymismatches count is non-zero for the raid LV,
which includes the total from all images (since we don't
know which image will have the corruption.)
Test case where filesystem has been corrected via fsck.
In such case fsck returns '1' as success and should be
handled in a same way as '0' since fs is correct.
Restructure the pvscan code, and add new temporary files
that list pvids in a VG, used for processing PVs that
have no metadata.
The new temp files, in /run/lvm/pvs_lookup/<vgname>, allow a
proper pvscan --cache to be done on PVs that have no metadata.
pvscan --cache <dev> is only supposed to read <dev>, but when
<dev> has no metadata, this had not been possible. The
command had to fall back to scanning all devices to read all
VG metadata to get the list of all PVIDs needed to check for
a complete VG. Now, the temp file can be used in place of
reading metadata from all PVs on the system.
Since 'BLKZEROOUT' streams out more block at once, at can easily
zero-out larger set of blocks after 1st. failing one.
So the test is adapted to fully 'hide' swap header under error target.
When ERR_DEV and ZERO_DEV are used, they are automatically
taken down when the last user no longer needs them,
so hide them from 'forgotten' device check.
As there could be few invokes of stacktrace, avoid
repeatedly display logs from commands.
So after first display rename debug.log* -> debug_log
so the file still can remain for reading in test dir.
Cover the case where two copies of metadata have the
same seqno but different checksums. Also elaborate
on an existing fixme in the code for this case, since
we should be doing something better for this case.
This had been uncovering an issue with reopening
fds in readwrite mode.
This test seems to be hitting some corner case in handling
out-of-metadata space condintion in thin-pool.
Add few more aid notes and functionality.
Also add missing '|| true' with now direct-IO dd command.
Shorten running time of the test.
Fix some issues in invoked resizing script to it returns
correct return code and dmeventd can be a little bit quicker
in this test.
When loop can't handle sector-size option - failure caused double fail
for access of unbound variable
Also fix expression for 'rm' and remove loops after loop release.
If the test runs of loop device backend with 512 sectors,
xfs selects this smaller sector size and then data do not fit
(we would need -l9 with most of 'raids').
With 4K sectors data always fits.
Shorten and make the test easily readable by moving same code into
function and removed one duplicated test for 512,4096 combination.
Always use scsi_debug - since default ramdisk or loop device backend
is unpredictible.
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets and fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
Use new SKIP_WITH_LOW_SPACE and set higher requirement for free space.
But still this test can't run on system's tmpfs directories -
as they typically provide less then 2G of space and when the test
runs there it also provisioning for all READ pages!)
BRD (ramdisk) device should work.
Extend a _wait_recalc() loop for slower hw.
When creating large raid which do not need to be fully synchronized use
them on delay devices - so even less data needs read/write.
Remove unneeded lvchange as lvcreate is already leaving LV inactive.
Replace printf with awk as generator.
mm
Test can set individually a higher value for required free space on
storage.
Note: it is not fully reliable since when 'brd' (ramdisk) device is used
this free space value is rather meanigul, but it might help
in case where a real filesystem is doing back-end for test devices.
When the test exhausts all the available free space on storage device,
then during the fail we cannot write anything as well - yet
the teardown needs to finish it's work - otherwise we leave
basicaly overfilled filesystem for all remaining tests.
In cases where internal functions like zero_dev, delay_dev pass-in
invalid parameter so resulting table can't work, resume at least
previous table line before failing out - so the cleaning process
later on is not stuck waiting on a suspended device.
While the previous commit c9b40083fc
decresed version to 1.19 for using bigger datasets, it's not
been quite right - so from our bb machine it looks like
bigger metadata consumption started with 1.19 and kernel 4.18
(fc27)
Use bigger volume and slowdown writing to cache device.
This allows more simple to reach 'dirty' state.
Also document exactly 1 SIGINT has to fire aborting of flushing.
For proper checking of extension progress require version 1.15
It looks with older versoin extension happens during very slow
resume within lvm command - although speed is still somewhat slow
with latest version.
Speed-up a bit the first synchronization with just 50ms write delay,
but later set also delay on read to slowdown lvextend.
FIXME: there are still things to look at:
0 229376 raid raid1 2 AA 229376/229376 idle 0 0
0 229376 raid raid1 2 AA 0/229376 frozen 0 0 -
0 262144 raid raid1 2 AA 229376/262144 repair 0 0 -
0 262144 raid raid1 2 AA 229376/262144 repair 0 0 -
0 262144 raid raid1 2 AA 245888/262144 repair 0 0 -
Just like we have 'writeerror_dev' supporting creation of device
which 'readable' segment and segments where write will fail we
have now support for delay zero mappings.
This is useful if we want to 'fake' large writing areas where we do
not really care about the actual 'disk' content - since we test
operation logic and it doesn't matter we read and write zeroes.
With combination with 'delay' target we can create specific mappings
and avoid using large memory areas of ramdisk.
On test system with 'default' filter (aka accept all) test
after enabling device can suffer from automatic system
activation - so for created LVs setup skipping this automatic
activation. This should prevent getting LVs into table
with pvscan service.
The test was using a raid+integrity LV without
first waiting for the integrity sync, which could
cause the test to fail (depending on init speed)
where it depends on integrity to work in uninitialized
areas.
Also use cmp instead of diff.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum. When used on
raid images, dm-raid will correct the error by reading
the block from another image, and the device user sees
no error. The integrity metadata (checksums) are stored
on an internal LV allocated by lvm for each linear image.
The internal LV is allocated on the same PV as the image.
Create a raid LV with an integrity layer over each
raid image (for raid levels 1,4,5,6,10):
lvcreate --type raidN --raidintegrity y [options]
Add an integrity layer to images of an existing raid LV:
lvconvert --raidintegrity y LV
Remove the integrity layer from images of a raid LV:
lvconvert --raidintegrity n LV
Settings
Use --raidintegritymode journal|bitmap (journal is default)
to configure the method used by dm-integrity to ensure
crash consistency.
Initialization
When integrity is added to an LV, the kernel needs to
initialize the integrity metadata/checksums for all blocks
in the LV. The data corruption checking performed by
dm-integrity will only operate on areas of the LV that
are already initialized. The progress of integrity
initialization is reported by the "syncpercent" LV
reporting field (and under the Cpy%Sync lvs column.)
Example: create a raid1 LV with integrity:
$ lvcreate --type raid1 -m1 --raidintegrity y -n rr -L1G foo
Creating integrity metadata LV rr_rimage_0_imeta with size 12.00 MiB.
Logical volume "rr_rimage_0_imeta" created.
Creating integrity metadata LV rr_rimage_1_imeta with size 12.00 MiB.
Logical volume "rr_rimage_1_imeta" created.
Logical volume "rr" created.
$ lvs -a foo
LV VG Attr LSize Origin Cpy%Sync
rr foo rwi-a-r--- 1.00g 4.93
[rr_rimage_0] foo gwi-aor--- 1.00g [rr_rimage_0_iorig] 41.02
[rr_rimage_0_imeta] foo ewi-ao---- 12.00m
[rr_rimage_0_iorig] foo -wi-ao---- 1.00g
[rr_rimage_1] foo gwi-aor--- 1.00g [rr_rimage_1_iorig] 39.45
[rr_rimage_1_imeta] foo ewi-ao---- 12.00m
[rr_rimage_1_iorig] foo -wi-ao---- 1.00g
[rr_rmeta_0] foo ewi-aor--- 4.00m
[rr_rmeta_1] foo ewi-aor--- 4.00m
systemctl status corosync (version: 2.4.5) report error:
parse error in config: No interfaces defined
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
To write a new/repaired pv_header and label_header:
pvck --repairtype pv_header --file <file> <device>
This uses the metadata input file to find the PV UUID,
device size, and data offset.
To write new/repaired metadata text and mda_header:
pvck --repairtype metadata --file <file> <device>
This requires a good pv_header which points to one or two
metadata areas. Any metadata areas referenced by the
pv_header are updated with the specified metadata and
a new mda_header. "--settings mda_num=1|2" can be used
to select one mda to repair.
To combine all header and metadata repairs:
pvck --repair --file <file> <device>
It's best to use a raw metadata file as input, that was
extracted from another PV in the same VG (or from another
metadata area on the same PV.) pvck will also accept a
metadata backup file, but that will produce metadata that
is not identical to other metadata copies on other PVs
and other areas. So, when using a backup file, consider
using it to update metadata on all PVs/areas.
To get a raw metadata file to use for the repair, see
pvck --dump metadata|metadata_search.
List all instances of metadata from the metadata area:
pvck --dump metadata_search <device>
Save one instance of metadata at the given offset to
the specified file (this file can be used for repair):
pvck --dump metadata_search --file <file>
--settings "metadata_offset=<off>" <device>
When running cluster test with clvmd, the actual 'monitoring'
happens in cluster - so the 'already monitored' message
is also logged within clvmd code and the command cannot
see such effect.
clvmd was incapable to report this information back to command
so it cannot be displayed this way.
Add 'lvs -o+seg_monitor' validation which also works in clustered mode.
Avoid making more dbus calls to get information we already have. This
also avoids us getting an error where a dbus object representation is
being deleted by another process while we are trying to gather information
about it across the wire.
VDO pool LVs are represented by a new dbus interface VgVdo. Currently
the interface only has additional VDO properties, but when the
ability to support additional LV creation is added we can add a method
to the interface.
Improve the implementation of extracting all text metadata
copies from the metadata area. Use this for the existing
metadata_all dump option.
Add a new metadata_search dump option which does not use
lvm headers to find metadata, but looks in standard
locations. This is useful if headers are damaged and
can't be used to locate metadata.
Adding '-v' to metadata_all or metadata_search will add
the description and creation_time to the printed list of
metadata instances that are found.
In the hex dump output, grep for the vgname
followed by one space. This allows for test pids
with up to seven digits, which are used to contruct
the variable vgname used by the test. Otherwise
the long vgname wraps to the next line and fails to
match in grep.
When an LV is used as a writecache cachevol, give
it the LV name a _cvol suffix. Remove the suffix
when the cachevol is detached, restoring the
original LV name.
Use /dev/md33 instead of /dev/md0 to reduce chances of
conflicting with an existing name.
Only call 'mdadm --stop /dev/md33' for cleanup and don't
use 'mdadm --stop --scan' to avoid stopping other md devs.
Due to a dm-raid target flaw fixed in target version 1.15.0,
extents of raid sets don't get resynchronized when new MD bitmp
pages have to be allocated due to the extension.
Introduce lvextend-raid.sh to test this flaw.
Related: rhbz1671964
When an online PV completed a VG, the standard
activation functions were used to activate the VG.
These functions use a full scan of all devs.
When many pvscans are run during startup and need
to activate many VGs, scanning all devs from all
the pvscans can take a long time.
Optimize VG activation in pvscan to scan only the
devs in the VG being activated. This makes use of
the online file info that was used to determine
the VG was complete.
The downside of this approach is that pvscan activation
will not detect duplicate PVs and block activation,
where a normal activation command (which scans all
devices) would.
Some older BB with older cryptsetup tool do not 'retry' on remove
and when remove is issued right after 'fsck' - it might be
rejected with:
Device @PREFIX@-tcrypt2 is busy.
Try to use udevadm settle.
Since we use 'set -euE -o pipefail' for shell execution,
any failure of any command in the 'piped' shell can result
in failure of whole executed chain - resulting in typically
unsually test skip, that was left unnoticed.
Since checked command have usually short output, the simplest
fix seems to be to let grep parse whole output instead
of quiting after first match.
The cache repair utility does not yet work with a cachevol
(where metadata and data exist on the same LV.) So, warn
and prompt if writeback is specified with a cachevol.
Previously the VG metadata description field (which contains
the command line) was only included in backup/archive copies
of the metadata. Now also include it in the metadata written
to the metadata areas.
So when the target name happened to be a suffix of another one,
the grep was filtering incorrect line
(i.e. dm-cache && dm-writecache) - so do a line head matching.
The fact that vg repair is implemented as a part of vg read
has led to a messy and complicated implementation of vg_read,
and limited and uncontrolled repair capability. This splits
read and repair apart.
Summary
-------
- take all kinds of various repairs out of vg_read
- vg_read no longer writes anything
- vg_read now simply reads and returns vg metadata
- vg_read ignores bad or old copies of metadata
- vg_read proceeds with a single good copy of metadata
- improve error checks and handling when reading
- keep track of bad (corrupt) copies of metadata in lvmcache
- keep track of old (seqno) copies of metadata in lvmcache
- keep track of outdated PVs in lvmcache
- vg_write will do basic repairs
- new command vgck --updatemetdata will do all repairs
Details
-------
- In scan, do not delete dev from lvmcache if reading/processing fails;
the dev is still present, and removing it makes it look like the dev
is not there. Records are now kept about the problems with each PV
so they be fixed/repaired in the appropriate places.
- In scan, record a bad mda on failure, and delete the mda from
mda in use list so it will not be used by vg_read or vg_write,
only by repair.
- In scan, succeed if any good mda on a device is found, instead of
failing if any is bad. The bad/old copies of metadata should not
interfere with normal usage while good copies can be used.
- In scan, add a record of old mdas in lvmcache for later, do not repair
them while reading, and do not let them prevent us from finding and
using a good copy of metadata from elsewhere. One result is that
"inconsistent metadata" is no longer a read error, but instead a
record in lvmcache that can be addressed separate from the read.
- Treat a dev with no good mdas like a dev with no mdas, which is an
existing case we already handle.
- Don't use a fake vg "handle" for returning an error from vg_read,
or the vg_read_error function for getting that error number;
just return null if the vg cannot be read or used, and an error_flags
arg with flags set for the specific kind of error (which can be used
later for determining the kind of repair.)
- Saving an original copy of the vg metadata, for purposes of reverting
a write, is now done explicitly in vg_read instead of being hidden in
the vg_make_handle function.
- When a vg is not accessible due to "access restrictions" but is
otherwise fine, return the vg through the new error_vg arg so that
process_each_pv can skip the PVs in the VG while processing.
(This is a temporary accomodation for the way process_each_pv
tracks which devs have been looked at, and can be dropped later
when process_each_pv implementation dev tracking is changed.)
- vg_read does not try to fix or recover a vg, but now just reads the
metadata, checks access restrictions and returns it.
(Checking access restrictions might be better done outside of vg_read,
but this is a later improvement.)
- _vg_read now simply makes one attempt to read metadata from
each mda, and uses the most recent copy to return to the caller
in the form of a 'vg' struct.
(bad mdas were excluded during the scan and are not retried)
(old mdas were not excluded during scan and are retried here)
- vg_read uses _vg_read to get the latest copy of metadata from mdas,
and then makes various checks against it to produce warnings,
and to check if VG access is allowed (access restrictions include:
writable, foreign, shared, clustered, missing pvs).
- Things that were previously silently/automatically written by vg_read
that are now done by vg_write, based on the records made in lvmcache
during the scan and read:
. clearing the missing flag
. updating old copies of metadata
. clearing outdated pvs
. updating pv header flags
- Bad/corrupt metadata are now repaired; they were not before.
Test changes
------------
- A read command no longer writes the VG to repair it, so add a write
command to do a repair.
(inconsistent-metadata, unlost-pv)
- When a missing PV is removed from a VG, and then the device is
enabled again, vgck --updatemetadata is needed to clear the
outdated PV before it can be used again, where it wasn't before.
(lvconvert-repair-policy, lvconvert-repair-raid, lvconvert-repair,
mirror-vgreduce-removemissing, pv-ext-flags, unlost-pv)
Reading bad/old metadata
------------------------
- "bad metadata": the mda_header or metadata text has invalid fields
or can't be parsed by lvm. This is a form of corruption that would
not be caused by known failure scenarios. A checksum error is
typically included among the errors reported.
- "old metadata": a valid copy of the metadata that has a smaller seqno
than other copies of the metadata. This can happen if the device
failed, or io failed, or lvm failed while commiting new metadata
to all the metadata areas. Old metadata on a PV that has been
removed from the VG is the "outdated" case below.
When a VG has some PVs with bad/old metadata, lvm can simply ignore
the bad/old copies, and use a good copy. This is why there are
multiple copies of the metadata -- so it's available even when some
of the copies cannot be used. The bad/old copies do not have to be
repaired before the VG can be used (the repair can happen later.)
A PV with no good copies of the metadata simply falls back to being
treated like a PV with no mdas; a common and harmless configuration.
When bad/old metadata exists, lvm warns the user about it, and
suggests repairing it using a new metadata repair command.
Bad metadata in particular is something that users will want to
investigate and repair themselves, since it should not happen and
may indicate some other problem that needs to be fixed.
PVs with bad/old metadata are not the same as missing devices.
Missing devices will block various kinds of VG modification or
activation, but bad/old metadata will not.
Previously, lvm would attempt to repair bad/old metadata whenever
it was read. This was unnecessary since lvm does not require every
copy of the metadata to be used. It would also hide potential
problems that should be investigated by the user. It was also
dangerous in cases where the VG was on shared storage. The user
is now allowed to investigate potential problems and decide how
and when to repair them.
Repairing bad/old metadata
--------------------------
When label scan sees bad metadata in an mda, that mda is removed
from the lvmcache info->mdas list. This means that vg_read will
skip it, and not attempt to read/process it again. If it was
the only in-use mda on a PV, that PV is treated like a PV with
no mdas. It also means that vg_write will also skip the bad mda,
and not attempt to write new metadata to it. The only way to
repair bad metadata is with the metadata repair command.
When label scan sees old metadata in an mda, that mda is kept
in the lvmcache info->mdas list. This means that vg_read will
read/process it again, and likely see the same mismatch with
the other copies of the metadata. Like the label_scan, the
vg_read will simply ignore the old copy of the metadata and
use the latest copy. If the command is modifying the vg
(e.g. lvcreate), then vg_write, which writes new metadata to
every mda on info->mdas, will write the new metadata to the
mda that had the old version. If successful, this will resolve
the old metadata problem (without needing to run a metadata
repair command.)
Outdated PVs
------------
An outdated PV is a PV that has an old copy of VG metadata
that shows it is a member of the VG, but the latest copy of
the VG metadata does not include this PV. This happens if
the PV is disconnected, vgreduce --removemissing is run to
remove the PV from the VG, then the PV is reconnected.
In this case, the outdated PV needs have its outdated metadata
removed and the PV used flag needs to be cleared. This repair
will be done by the subsequent repair command. It is also done
if vgremove is run on the VG.
MISSING PVs
-----------
When a device is missing, most commands will refuse to modify
the VG. This is the simple case. More complicated is when
a command is allowed to modify the VG while it is missing a
device.
When a VG is written while a device is missing for one of it's PVs,
the VG metadata is written to disk with the MISSING flag on the PV
with the missing device. When the VG is next used, it is treated
as if the PV with the MISSING flag still has a missing device, even
if that device has reappeared.
If all LVs that were using a PV with the MISSING flag are removed
or repaired so that the MISSING PV is no longer used, then the
next time the VG metadata is written, the MISSING flag will be
dropped.
Alternative methods of clearing the MISSING flag are:
vgreduce --removemissing will remove PVs with missing devices,
or PVs with the MISSING flag where the device has reappeared.
vgextend --restoremissing will clear the MISSING flag on PVs
where the device has reappeared, allowing the VG to be used
normally. This must be done with caution since the reappeared
device may have old data that is inconsistent with data on other PVs.
Bad mda repair
--------------
The new command:
vgck --updatemetadata VG
first uses vg_write to repair old metadata, and other basic
issues mentioned above (old metadata, outdated PVs, pv_header
flags, MISSING_PV flags). It will also go further and repair
bad metadata:
. text metadata that has a bad checksum
. text metadata that is not parsable
. corrupt mda_header checksum and version fields
(To keep a clean diff, #if 0 is added around functions that
are replaced by new code. These commented functions are
removed by the following commit.)
If udev info is missing for a device, (which would indicate
if it's an MD component), then do an end-of-device read to
check if a PV is an MD component. (This is skipped when
using hints since we already know devs in hints are good.)
A new config setting md_component_checks can be used to
disable the additional end-of-device MD checks, or to
always enable end-of-device MD checks.
When both hints and udev info are disabled/unavailable,
the end of PVs will now be scanned by default. If md
devices with end-of-device superblocks are not being
used, the extra I/O overhead can be avoided by setting
md_component_checks="start".
The test was failing consistently on some VMs (F25), and inconsistently
on Rawhide.
With increased latency these failures are no longer reproducible.
Reproducer:
make check_lvmpolld T=pvmove-resume-multiseg.sh
teardown after the test was failing, probably because
of uncoordinated udev actions running on the test
system. Try to avoid this by doing some work before
teardown.
There have been two file locks used to protect lvm
"global state": "ORPHANS" and "GLOBAL".
Commands that used the ORPHAN flock in exclusive mode:
pvcreate, pvremove, vgcreate, vgextend, vgremove,
vgcfgrestore
Commands that used the ORPHAN flock in shared mode:
vgimportclone, pvs, pvscan, pvresize, pvmove,
pvdisplay, pvchange, fullreport
Commands that used the GLOBAL flock in exclusive mode:
pvchange, pvscan, vgimportclone, vgscan
Commands that used the GLOBAL flock in shared mode:
pvscan --cache, pvs
The ORPHAN lock covers the important cases of serializing
the use of orphan PVs. It also partially covers the
reporting of orphan PVs (although not correctly as
explained below.)
The GLOBAL lock doesn't seem to have a clear purpose
(it may have eroded over time.)
Neither lock correctly protects the VG namespace, or
orphan PV properties.
To simplify and correct these issues, the two separate
flocks are combined into the one GLOBAL flock, and this flock
is used from the locking sites that are in place for the
lvmlockd global lock.
The logic behind the lvmlockd (distributed) global lock is
that any command that changes "global state" needs to take
the global lock in ex mode. Global state in lvm is: the list
of VG names, the set of orphan PVs, and any properties of
orphan PVs. Reading this global state can use the global lock
in sh mode to ensure it doesn't change while being reported.
The locking of global state now looks like:
lockd_global()
previously named lockd_gl(), acquires the distributed
global lock through lvmlockd. This is unchanged.
It serializes distributed lvm commands that are changing
global state. This is a no-op when lvmlockd is not in use.
lockf_global()
acquires an flock on a local file. It serializes local lvm
commands that are changing global state.
lock_global()
first calls lockf_global() to acquire the local flock for
global state, and if this succeeds, it calls lockd_global()
to acquire the distributed lock for global state.
Replace instances of lockd_gl() with lock_global(), so that the
existing sites for lvmlockd global state locking are now also
used for local file locking of global state. Remove the previous
file locking calls lock_vol(GLOBAL) and lock_vol(ORPHAN).
The following commands which change global state are now
serialized with the exclusive global flock:
pvchange (of orphan), pvresize (of orphan), pvcreate, pvremove,
vgcreate, vgextend, vgremove, vgreduce, vgrename,
vgcfgrestore, vgimportclone, vgmerge, vgsplit
Commands that use a shared flock to read global state (and will
be serialized against the prior list) are those that use
process_each functions that are based on processing a list of
all VG names, or all PVs. The list of all VGs or all PVs is
global state and the shared lock prevents those lists from
changing while the command is processing them.
The ORPHAN lock previously attempted to produce an accurate
listing of orphan PVs, but it was only acquired at the end of
the command during the fake vg_read of the fake orphan vg.
This is not when orphan PVs were determined; they were
determined by elimination beforehand by processing all real
VGs, and subtracting the PVs in the real VGs from the list
of all PVs that had been identified during the initial scan.
This is fixed by holding the single global lock in shared mode
while processing all VGs to determine the list of orphan PVs.
Handle the case where pvscan --cache -aay (with no dev args)
gets to the final PV, completing the VG, but that final PV does not
have VG metadata. In this case, we need to use VG metadata from a
previously scanned PV in the same VG, which we saved for this
possibility. Using this saved metadata, we can find which VG
this PVID belongs to, and then check if that VG is now complete,
and if so add the VG name to the list of complete VGs to be
autoactivated.
Fix to previous commit
"pvscan: ignore online for shared and foreign PVs"
which was incorrectly considering a PV foreign if its
VG had no system ID when the host did have a system ID.
and "cachepool" to refer to a cache on a cache pool object.
The problem was that the --cachepool option was being used
to refer to both a cache pool object, and to a standard LV
used for caching. This could be somewhat confusing, and it
made it less clear when each kind would be used. By
separating them, it's clear when a cachepool or a cachevol
should be used.
Previously:
- lvm would use the cache pool approach when the user passed
a cache-pool LV to the --cachepool option.
- lvm would use the cache vol approach when the user passed
a standard LV in the --cachepool option.
Now:
- lvm will always use the cache pool approach when the user
uses the --cachepool option.
- lvm will always use the cache vol approach when the user
uses the --cachevol option.
When a VG has multiple PVs, and all those PVs come online
at the same time, concurrent pvscans for each PV will all
create the individual pvid files, and all will often see
the VG is now complete. This causes each of the pvscan
commands to think it should activate the VG, so there
are multiple activations of the same VG. The vg lock
serializes them, and only the first pvscan actually does
the activation, but there is still a lot of extra overhead
and time used by the other pvscans that attempt to
activate the already active VG. This can lead to a backlog
of pvscans and timeouts.
To fix this, this adds a new /run/lvm/vgs_online/ dir that
works like the existing /run/lvm/pvs_online/ dir. Each pvscan
that wants to activate a VG will first try to exlusively create
the file vgs_online/<vgname>. Only the first pvscan will
succeed, and that one will do the VG activation. The other
pvscans will find the vgname file exists and will not do the
activation step.
When a PV goes offline, the vgs_online file for the corresponding
VG is removed. This allows the VG to be autoactivated again
when the PV comes online again. This requires that the vgname be
stored in the pvid files.
An idea from Zdenek for better ensuring valid hints by invalidating
them when pvscan --cache <device> sees a new PV, which is a case
where we know that hints should be invalidated. This is triggered
from systemd/udev logic, and there may be some cases where it would
invalidate hints that the existing methods wouldn't detect.
Save the list of PVs in /run/lvm/hints. These hints
are used to reduce scanning in a number of commands
to only the PVs on the system, or only the PVs in a
requested VG (rather than all devices on the system.)
Since configure.h is a generated header and it's missing traditional
ifdefs preambule - it can be included & parsed multiple times.
Normally compiler is fine when defines have same value and there is
no warning - yet we don't need to parse this several times
and by adding -include directive we can ensure every file
in the package is rightly compile with configure.h as the
first header file.
If we are running the test where the device is /dev/* we will will
run the unit tests 'test_nesting' and 'test_pv_symlinks'. Otherwise
we will skip them.
This is a followup patch to commit edb72cb70c
to support related lvm2 test suite tests.
A 'global/support_mirrored_mirror_log' bool configuration variable gets
introduced allowing the creation of, or conversion to mirrored 'mirror'
logs if set. The capability to create these in turn allows the rest of
the tests to perform activation of such existing LVs and their conversions
to disk/core 'mirror' logs.
Display a disclaimer warning if enabled that this is not for regular use.
Add definition of the enabled config option to respective test scripts.
Related: rhbz1643562
Ensure configure.h is always 1st. included header.
Maybe we could eventually introduce gcc -include option, but for now
this better uses dependency tracking.
Also move _REENTRANT and _GNU_SOURCE into configure.h so it
doesn't need to be present in various source files.
This ensures consistent compilation of headers like stdio.h since
it may produce different declaration.
For some targets we do not want to generate dependencies.
Also add note about usage of such Makefile - it might be
possibly better to rename it to different filename to avoid
any confusion.
. When using default settings, this commit should change
nothing. The first PE continues to be placed at 1 MiB
resulting in a metadata area size of 1020 KiB (for
4K page sizes; slightly smaller for larger page sizes.)
. When default_data_alignment is disabled in lvm.conf,
align pe_start at 1 MiB, based on a default metadata area
size that adapts to the page size. Previously, disabling
this option would result in mda_size that was too small
for common use, and produced a 64 KiB aligned pe_start.
. Customized pe_start and mda_size values continue to be
set as before in lvm.conf and command line.
. Remove the configure option for setting default_data_alignment
at build time.
. Improve alignment related option descriptions.
. Add section about alignment to pvcreate man page.
Previously, DEFAULT_PVMETADATASIZE was 255 sectors.
However, the fact that the config setting named
"default_data_alignment" has a default value of 1 (MiB)
meant that DEFAULT_PVMETADATASIZE was having no effect.
The metadata area size is the space between the start of
the metadata area (page size offset from the start of the
device) and the first PE (1 MiB by default due to
default_data_alignment 1.) The result is a 1020 KiB metadata
area on machines with 4KiB page size (1024 KiB - 4 KiB),
and smaller on machines with larger page size.
If default_data_alignment was set to 0 (disabled), then
DEFAULT_PVMETADATASIZE 255 would take effect, and produce a
metadata area that was 188 KiB and pe_start of 192 KiB.
This was too small for common use.
This is fixed by making the default metadata area size a
computed value that matches the value produced by
default_data_alignment.
For embeded reshaping operation require higher driver version.
(otherwise we get:
Converting vg/LV1 from raid6 (same as raid6_zr) is directly possible to the following layouts:
raid6_nc
raid6_nr
raid6_la_6
raid6_ls_6
raid6_ra_6
raid6_rs_6
raid6_n_6
lvmpolld ATM is not desingned to preserve interval checking
in the same way the 'lvconvert' tool is doing - so the passed
'-i 40' is not respected and lvmpolld autonomously checks
state of conversion and updates lvm2 metadata and dm tables
when needed.
So skip portion of test that relayed on this and preserve this logic
only for command line invocation and forking of polling process
where the interval of checking is under full control.
Although we primarily want to check externally used libdevmapper
library, out internal all.h is still keeping all symbols
as the original library has - so for simpler compilation keep
using this private copy for defining needed symbols.
Add a little wait loop - since lvconvert started background process
and we need to wait till this bg task initiate its work -
adding ~1s loop should give reasonable enough time to start mirroring.
If a single, standard LV is specified as the cache, use
it directly instead of converting it into a cache-pool
object with two separate LVs (for data and metadata).
With a single LV as the cache, lvm will use blocks at the
beginning for metadata, and the rest for data. Separate
dm linear devices are set up to point at the metadata and
data areas of the LV. These dm devs are given to the
dm-cache target to use.
The single LV cache cannot be resized without recreating it.
If the --poolmetadata option is used to specify an LV for
metadata, then a cache pool will be created (with separate
LVs for data and metadata.)
Usage:
$ lvcreate -n main -L 128M vg /dev/loop0
$ lvcreate -n fast -L 64M vg /dev/loop1
$ lvs -a vg
LV VG Attr LSize Type Devices
main vg -wi-a----- 128.00m linear /dev/loop0(0)
fast vg -wi-a----- 64.00m linear /dev/loop1(0)
$ lvconvert --type cache --cachepool fast vg/main
$ lvs -a vg
LV VG Attr LSize Origin Pool Type Devices
[fast] vg Cwi---C--- 64.00m linear /dev/loop1(0)
main vg Cwi---C--- 128.00m [main_corig] [fast] cache main_corig(0)
[main_corig] vg owi---C--- 128.00m linear /dev/loop0(0)
$ lvchange -ay vg/main
$ dmsetup ls
vg-fast_cdata (253:4)
vg-fast_cmeta (253:5)
vg-main_corig (253:6)
vg-main (253:24)
vg-fast (253:3)
$ dmsetup table
vg-fast_cdata: 0 98304 linear 253:3 32768
vg-fast_cmeta: 0 32768 linear 253:3 0
vg-main_corig: 0 262144 linear 7:0 2048
vg-main: 0 262144 cache 253:5 253:4 253:6 128 2 metadata2 writethrough mq 0
vg-fast: 0 131072 linear 7:1 2048
$ lvchange -an vg/min
$ lvconvert --splitcache vg/main
$ lvs -a vg
LV VG Attr LSize Type Devices
fast vg -wi------- 64.00m linear /dev/loop1(0)
main vg -wi------- 128.00m linear /dev/loop0(0)
Since the test is currently directly working with live directory,
which can be getting updates from system's udev - add wait
for settling so removal of all known PVs happens after that.
But still this has major influce on behavior of running system,
so the test should never be executed on a user used box.
Commit 989626926c
introduced 2 new tests
lvconvert-raid-takeover-linear_to_raid4.sh and
lvconvert-raid-takeover-raid4_to_linear.sh
which involve raid reshaping.
Bump the checked dm-raid target version to 1.14.0
which has reshape kernel fixes to avoid test suite
runs to hang.
Bump target version to 1.14.0 which contains fixes
for reshape deadlock/corruption to allow tests to
run once the respective fixes show up in kernels.
Remove now superfluous multi-core checks.
Resolves: rhbz1501145
Related: rhbz1514539
Related: rhbz1586123
Related: rhbz1613039
Allow "lvconvert --type linear RaidLV" on a raid4 LV
providing convenient interim steps to convert to linear.
Add respective new test
lvconvert-raid-takeover-raid4_to_linear.sh
and
lvconvert-raid-takeover-linear_to_raid4.sh
for linear to raid4 once on it.
When converting from striped/raid0/raid0_meta
to raid6 with > 2 stripes, allow possible
direct conversion (to raid6_n_6).
In case of 2 stripes, first convert to raid5_n to restripe
to at least 3 data stripes (the raid6 minimum in lvm2) in
a second conversion before finally converting to raid6_n_6.
As before, raid6_n_6 then can be converted
to any other raid6 layout.
Enhance lvconvert-raid-takeover.sh to test the
2 stripes conversions to raid6.
Resolves: rhbz1624038
"lvconvert --type linear RaidLV" on striped and raid4/5/6/10
have to provide the convenient interim layouts. Fix involves
a cleanup to the convenience type function.
As a result of testing, add missing sync waits to
lvconvert-raid-reshape-linear_to_raid6-single-type.sh.
Resolves: rhbz1447809
Conversion to striped from raid0/raid0_meta is directly possible.
Fix a regression setting superfluous interim raid5_n conversion type
introduced by commit bd7cdd0b09.
Add new test script lvconvert-raid0-striped.sh.
Resolves: rhbz1608067
Update makefile to link with more libs since now whole liblvm-internal.a
is linked-in and this library has futher dependencies.
Avoid including deps for run-unit-test.
Drop linking separate status.c as it's already linked via internal libs.
Check allocation of thin-pool works on 2PVs, when one is so full,
that even metadata do not fit there (as they need at least 2M,
while 99% of 63MB fills >62MB)
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
With the move to top-level makefile - there are some issues
with subdir recursive makefile.
Make the building more tolerant for now until fully resolved.