IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
. it's of little use since the thin pool cannot be extended
while it's using writecache.
. lvconvert --splitcache on tdata isn't working, the writecache
table doesn't get updated with the cleaner setting.
. adding writecache to existing tdata not yet implemented
filters needing io weren't being run because bcache
wasn't set up. Read the first 4k of the device
before doing filtering or reading ondisk structs to
reduce reads.
It's possible for a machine with a non-4k page size
to create a PV with an mda_header at an offset other
than 4k. Fix pvck --dump to work with these other
mda offsets. pvck --repair will write a new first
mda at 4096 but lvm with other page sizes will work
with this.
The args for pvcreate/pvremove (and vgcreate/vgextend
when applicable) were not efficiently opened, scanned,
and filtered. This change reorganizes the opening
and filtering in the following steps:
- label scan and filter all devs
. open ro
. standard label scan at the start of command
- label scan and filter dev args
. open ro
. uses full md component check
. typically the first scan and filter of pvcreate devs
- close and reopen dev args
. open rw and excl
- repeat label scan and filter dev args
. using reopened rw excl fd
- wipe and write new headers
. using reopened rw excl fd
Older blockdev tool return failure error code with --help,
and since now the tool abort on command failure, lets
detect missing --getsize64 support directly by running
command and check if it returns something usable.
It's likely very hard to have the system with
such old blockdev tool and newer lvm2 compiled.
Since 'kilobytes' could be seen in 2 way - SI as '1000',
while all programmers sees it as '1024' - switch to
commonly acceptted KiB, MiB....
Resolves RHBZ 1496255.
Test case where filesystem has been corrected via fsck.
In such case fsck returns '1' as success and should be
handled in a same way as '0' since fs is correct.
Set more secure bash failure mode for pipilines.
Avoid using unset variables.
Enhnace error reporting for failing command.
Avoid using error via 'case..esac || error'.
In some cases the dev size may not have been read yet
in set_pv_devices(). In this case get the dev size
before comparing the dev size with the pv size.
Restructure the pvscan code, and add new temporary files
that list pvids in a VG, used for processing PVs that
have no metadata.
The new temp files, in /run/lvm/pvs_lookup/<vgname>, allow a
proper pvscan --cache to be done on PVs that have no metadata.
pvscan --cache <dev> is only supposed to read <dev>, but when
<dev> has no metadata, this had not been possible. The
command had to fall back to scanning all devices to read all
VG metadata to get the list of all PVIDs needed to check for
a complete VG. Now, the temp file can be used in place of
reading metadata from all PVs on the system.
To read the lvm headers and set dev->pvid if the
device is a PV. Difference from label_scan_ functions
is this does not read any vg metadata or add any info
to lvmcache.
Filtering in label_scan was controlled indirectly by
the fact that bcache was not yet set up when label_scan
first ran. The result is that filters that needed data
would not run and would return -EAGAIN, which would
result in the dev flag FILTER_AFTER_SCAN being set.
After the dev header was read for checking the label,
filters would be rechecked because of FILTER_AFTER_SCAN.
All filters would be checked this time because bcache
was now set up, and the filters needing data would
largely use data already scanned for reading the label.
This design worked but is hard to adjust for future
cases where bcache is already set up.
Replace this method (based on setting up bcache, or not)
with a new cmd flag filter_nodata_only. When this flag
is set filters that need data will not run. This allows
the same label_scan behavior when bcache has been set up.
There are no expected changes in behavior.
Touch of stack allocation validated given size with rlimit
and if the reserved_stack was above rlimit, its been completely
ignored - now we will always touch stack upto rlimit/2 size.
cmd context has 'threaded' value that used be set
by clvmd - and allowed proper memory locking management.
Reuse same bit for dmeventd.
Since dmeventd is using 300KiB stack per thread,
we will ignore any user settings for allocation/reserved_stack
until some better solution is find.
This avoids crashing of dmevend when user changes this value
and because in most cases lvm2 should work ok with 64K stack
size, this change should not cause any problems.
Check that type is always defined, if not make it explicit internal
error (although logged as debug - so catched only with proper lvm.conf
setting).
This ensures later type being NULL can't be dereferenced with coredump.
DM tree keeps track of created device while preloading a device tree.
When fail occures during such preload, it will now try to remove
all created and preloaded device. This makes it easier to maintain
stacking of device, since we do not need to check in-depth for
existance of all possible created devices during the failure.
Since 'BLKZEROOUT' streams out more block at once, at can easily
zero-out larger set of blocks after 1st. failing one.
So the test is adapted to fully 'hide' swap header under error target.
When ERR_DEV and ZERO_DEV are used, they are automatically
taken down when the last user no longer needs them,
so hide them from 'forgotten' device check.
As there could be few invokes of stacktrace, avoid
repeatedly display logs from commands.
So after first display rename debug.log* -> debug_log
so the file still can remain for reading in test dir.
Since BLKZEROOUT ioctl should be supposedly fastest
way how to clear block device start using this ioctl
for zeroing a device. Commonly we do zero typically
small portion of a device (8KiB) - however since we now
also started to zero metadata devices, in the case
of i.e. thin-pool metadata this can go upto ~16GiB
and here the performance starts to be noticable.
Since dev_set_bytes() now closes dev on error path itself,
remove this unneeded call now (introduced few commits back
in history thus removing comment from WHATS_NEW)
Since lvm2 normally block signals during protected
phase where it does not want to be interrupted.
Support interruptible processing when allowed
in section between sigint_allow() ... sigint_restore())
and let the 'io_getenvents()' finish with EINTR.
When bcache tries to write data to a faulty device,
it may get out of caching blocks and then just busy-loops
on a CPU - so this check protects this by checking
if there is already max_io (~64) errored blocks.
Call _wait_all() which does check whether there is still
some pending IO before sleep. Otherwise it may happen
our submitted IO operations have been already dispatched
and this call then endlessly waits for IO which are all done.
This can be reproduced when device returns quickly errors
on write requests.
When detaching a writecache, use the cleaner setting
by default to writeback data prior to suspending the
lv to detach the writecache. This avoids potentially
blocking for a long period with the device suspended.
Detaching a writecache first sets the cleaner option, waits
for a short period of time (less than a second), and checks
if the writecache has quickly become clean. If so, the
writecache is detached immediately. This optimizes the case
where little writeback is needed.
If the writecache does not quickly become clean, then the
detach command leaves the writecache attached with the
cleaner option set. This leaves the LV in the same state
as if the user had set the cleaner option directly with
lvchange --cachesettings cleaner=1 LV.
After leaving the LV with the cleaner option set, the
detach command will wait and watch the writeback progress,
and will finally detach the writecache when the writeback
is finished. The detach command does not need to wait
during the writeback phase, and can be canceled, in which
case the LV will remain with the writecache attached and
the cleaner option set. When the user runs the detach
command again it will complete the detach.
To detach a writecache directly, without using the cleaner
step (which has been the approach previously), add the
option --cachesettings cleaner=0 to the detach command.
Reorganize checking the device args for pvcreate/pvremove
to prepare for future changes. There should be no change
in behavior. Stop the inverted use of process_each_pv,
which pulled in a lot of unnecessary processing, and call
the check functions on each device directly.
Since we detect already transaction if before starting
to build dm tree - this extra check is a duplicate
that would only capture very tiny 'race' and we later
validate transaction_id with suspended snapshot origin.
Introduce structures lv_status_thin_pool and
lv_status_thin (pair to lv_status_cache, lv_status_vdo)
Convert lv_thin_percent() -> lv_thin_status()
and lv_thin_pool_percent() + lv_thin_pool_transaction_id() ->
lv_thin_pool_status().
This way a function user can see not only percentages, but also
other important status info about thin-pool.
TODO:
This patch tries to not change too many other things,
but pool_below_threshold() now uses new thin-pool info to return
failure if thin-pool cannot be actually modified.
This should be handle separately in a better way.
LVM2 is distributed under GPLv2 only. The readline library changed its
license long ago to GPLv3. Given that those licenses are incompatible
and you follow the FSF in their interpretation that dynamically linking
creates a derivative work, distributing LVM2 linked against a current
readline version might be legally problematic.
Add support for the BSD licensed editline library as an alternative for
readline.
Link: https://thrysoee.dk/editline
Cover the case where two copies of metadata have the
same seqno but different checksums. Also elaborate
on an existing fixme in the code for this case, since
we should be doing something better for this case.
This had been uncovering an issue with reopening
fds in readwrite mode.
There's a bug when lvpoll attempts to write new hints,
related to the fact that lvpoll does not follow the same
scanning process as standard commands.
Fix by disabling the use of hints in lvpoll. We may want
to renable hints in lvpoll in a way that they can be used,
if valid, but not updated if they don't exist or are invalid.
Improve error response and reporting, when creating thin snapshots.
If the thin pool kernel metadata already have device with ID lvm2
tries to create, give more meanigful error message and also properly
restore transaction id to the value known to thin-pool in this case.
Before it's been possible to divert by one from kernel TID value,
and lvm2 stacked delete message for such thin device.
Since ATM kernel does not support this operation,
disable 'lvrename' of an active vdopool.
As a workaround, user may simply deactivate, rename and activate.
This test seems to be hitting some corner case in handling
out-of-metadata space condintion in thin-pool.
Add few more aid notes and functionality.
Also add missing '|| true' with now direct-IO dd command.
Shorten running time of the test.
Fix some issues in invoked resizing script to it returns
correct return code and dmeventd can be a little bit quicker
in this test.
When user tries to extend vdo pool - he needs to go always
at least by 1 full VDO slab (defined as vdo_slab_size_mb).
To avoid all trouble around find 'workable' size - lvm2 automatically
increases the passed (or by --use-policies calculated) extension size
(and informs a user about sometimes possibly large increase as slab
size can go upto 32GiB)
With VDO users need to always 'think-big' anyway and expect such
operation to be in GiB domain range.
When thetable reload fails during suspend() - we were only calling
plain resume() - and this will reload only those devices,
which were left suspend, but will not try to restore
metadata state according to lvm2 reverted metadata.
So if we were reloading device tree - we have restored
only top-level LV and rest of reverted device manipulation
were left alone and possibly mismatched what is in committed
metadata.
FIXME: There are several cases were such revert will likely not work
properly anyway as some operation are currenly handled in single commit,
while they need multiple commits, but it's step towards better correctness.
At least we catch there errors now earlier.
When loop can't handle sector-size option - failure caused double fail
for access of unbound variable
Also fix expression for 'rm' and remove loops after loop release.
If the test runs of loop device backend with 512 sectors,
xfs selects this smaller sector size and then data do not fit
(we would need -l9 with most of 'raids').
With 4K sectors data always fits.
Shorten and make the test easily readable by moving same code into
function and removed one duplicated test for 512,4096 combination.
Always use scsi_debug - since default ramdisk or loop device backend
is unpredictible.
lvm opens devices readonly to scan them, but
needs to open then readwrite to update the metadata.
Previously, the ro fd was closed before the rw fd
was opened, leaving a small gap where the dev was
not held open, and during which the dev could
possibly change which storage it referred to.
With the bcache_change_fd() interface, lvm opens a
rw fd on a device to be written, tells bcache to
change to the new rw fd, and closes the ro fd.
. open dev ro
. read dev with the ro fd (label_scan)
. lock vg (ex for writing)
. open dev rw
. close ro fd
. rescan dev to check if the metadata changed
between the scan and the lock
. if the metadata did change, reread in full
. write the metadata
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets and fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
Use new SKIP_WITH_LOW_SPACE and set higher requirement for free space.
But still this test can't run on system's tmpfs directories -
as they typically provide less then 2G of space and when the test
runs there it also provisioning for all READ pages!)
BRD (ramdisk) device should work.
Extend a _wait_recalc() loop for slower hw.
When creating large raid which do not need to be fully synchronized use
them on delay devices - so even less data needs read/write.
Remove unneeded lvchange as lvcreate is already leaving LV inactive.
Replace printf with awk as generator.
mm
Test can set individually a higher value for required free space on
storage.
Note: it is not fully reliable since when 'brd' (ramdisk) device is used
this free space value is rather meanigul, but it might help
in case where a real filesystem is doing back-end for test devices.
When the test exhausts all the available free space on storage device,
then during the fail we cannot write anything as well - yet
the teardown needs to finish it's work - otherwise we leave
basicaly overfilled filesystem for all remaining tests.
In cases where internal functions like zero_dev, delay_dev pass-in
invalid parameter so resulting table can't work, resume at least
previous table line before failing out - so the cleaning process
later on is not stuck waiting on a suspended device.
While the previous commit c9b40083fc
decresed version to 1.19 for using bigger datasets, it's not
been quite right - so from our bb machine it looks like
bigger metadata consumption started with 1.19 and kernel 4.18
(fc27)
Use bigger volume and slowdown writing to cache device.
This allows more simple to reach 'dirty' state.
Also document exactly 1 SIGINT has to fire aborting of flushing.
During removal of a lot of locking code the signal blocking got lost
and signal processing got broken leading to unpredictable
behavior of i.e. activation code the can get interrupted in the
middle of DM table processing.
lvm2 code always expects signals are blocked while lock is held
unless it is explictelly placed into section of:
sigint_allow();....;sigint_restore();
For checking catched interrupt there is sigint_catched();
Metadata size was calculated correctly only for raids.
Fixes problem for crash during lvcreate when thin-pool was created
on a VG where remaining free space had the size to only fit a single
metadata LV and not also its _pmspare.
Lvcreate crashed with this assert message:
lvcreate: metadata/pv_map.c:198: consume_pv_area: Assertion `to_go <= pva->count' failed.
Aborted (core dumped)
TODO: there is probably to large overload of several alloc_handle
variables.
Reported-by: Wu Guanghao<wuguanghao3@huawei.com>
Reported-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cow may not be a snapshot type, the return value of origin_from_cow(cow) may be NULL
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
LV may not be a snapshot type, the return value of find_snapshot(lv) may be NULL.
Here, we will call stack if LV is not a snapshot type.
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
The return value of top_level_lv_name() may be NULL, so we should
check return value of top_level_lv_name before calling
strcmp(lv->name, top_level_lv_name(vg, lv_name)).
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
When using cscope to read code, it will generate below 3 files for speedup
cross-refer: cscope.files, cscope.in.out, cscope.po.out
The .gitignore only contains "/cscope.out". It a little bit messy when
executing 'git status', and other git commands.
This patch add all cscope generated files in .gitignore.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
When using --use-policy for automatic extension of thin-pool,
the extension of thin-pool's metadata itself can actually take
some extra space.
Since I'm not aware of exact compensation formula, add just
1% extra to calculated amount and hope it fits.
Wanted target is to always have usable thin-pool that fits
bellow pool_metadata_min_threshold().
Since we query on regular code these:
lv_raid_has_integrity()
lv_has_integrity_recalculate_metadata()
without prior checking for lv_is_raid() - these 'return 0' should
not use <stacktrace> as they are expected.
Correcting rounding rules for percentage evaluation.
Validate supported range of percentage.
(although ranges are already validated earlier on code path)
Enable compilation of vdo and writecache support as internaly
supported segment types by default.
For disabling use:
--with-vdo=none
--with-writcache=none
For proper checking of extension progress require version 1.15
It looks with older versoin extension happens during very slow
resume within lvm command - although speed is still somewhat slow
with latest version.
This is probably somewhat experimantal patch - but when i.e. raid device
is just extend, there should not be a technical need for flush,
unless the target would stricly need it. It should allow faster
processing of lvm command not being blocked by possibly longer flush.
Since we do not support rimage & rmeta for snapshots - we can
avoid quering for -cow devices and add them as origin_only -
since their snapshots (-cow) could have never existed.
This redumes several ioctl operation during table preloading.
Use '0' for error and '1' as success.
Also drop INTERNAL_ERROR from path - as this error
is ATM used for invalid devices.
(i.e. test lvconvert-raid1-split-trackchanges.sh)
Speed-up a bit the first synchronization with just 50ms write delay,
but later set also delay on read to slowdown lvextend.
FIXME: there are still things to look at:
0 229376 raid raid1 2 AA 229376/229376 idle 0 0
0 229376 raid raid1 2 AA 0/229376 frozen 0 0 -
0 262144 raid raid1 2 AA 229376/262144 repair 0 0 -
0 262144 raid raid1 2 AA 229376/262144 repair 0 0 -
0 262144 raid raid1 2 AA 245888/262144 repair 0 0 -
Just like we have 'writeerror_dev' supporting creation of device
which 'readable' segment and segments where write will fail we
have now support for delay zero mappings.
This is useful if we want to 'fake' large writing areas where we do
not really care about the actual 'disk' content - since we test
operation logic and it doesn't matter we read and write zeroes.
With combination with 'delay' target we can create specific mappings
and avoid using large memory areas of ramdisk.
On test system with 'default' filter (aka accept all) test
after enabling device can suffer from automatic system
activation - so for created LVs setup skipping this automatic
activation. This should prevent getting LVs into table
with pvscan service.
Since we declare dev_name in lib/device/device.h
and pvs in commands.h
rename local dev_name to device_name
and pvs to pvs_list to prevent shadowing warning.
m
Switch remaining zero sized struct to flexible arrays to be C99
complient.
These simple rules should apply:
- The incomplete array type must be the last element within the structure.
- There cannot be an array of structures that contain a flexible array member.
- Structures that contain a flexible array member cannot be used as a member of another structure.
- The structure must contain at least one named member in addition to the flexible array member.
Although some of the code pieces should be still improved.
While normally the 'mmap' file reading is better utilizing resources,
it has also its odd side with handling errors - so while we normally
use the mmap only for reading regular files from root filesystem
(i.e. lvm.conf) we can't prevent error to happen during the read
of these file - and such error unfortunately ends with SIGBUS error.
Maintaing signal handler would be compilated - so switch to slightly
less effiecient but more error resistant read() functinality.
reproducible steps:
1. vgcreate vg1 /dev/sda /dev/sdb
2. lvcreate --type raid0 -l 100%FREE -n raid0lv vg1
3. do remove the /dev/sdb action
4. lvdisplay show wrong 'LV Status'
After removing raid0 type LV underlying dev, lvdisplay still display
'available'. This is wrong status for raid0.
This patch add a new function raid_is_available(), which will handle
all raid case.
With this patch, lvdisplay will show
from:
LV Status available
to:
LV Status NOT available (partial)
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
It's better to set most of option as 'commented' with some
documented defaults instead of providing strict values.
This has the advantage we can eventually 'change' defualts
and get them working in future. Otherwise once the setting
is stored in lvm.conf in /etc, such setting has strictly
defined value and that can be only change with file update.
Allow the optional '--type raid1' to be included in the lvconvert
command when adding or removing raid images with integrity.
It does not change the meaning of the command (specifying a type
that matches the current type is redundant but generally allowed.)
merge.c:_check_lv_segment() was checking regionsize vs. mirrored LV size on
any 'mirror/raid1/raid10' segment type including type 'mirrored' mirror logs.
Avoid the check only for 'mirrored' mirror logs to allow conversion from log
type 'disk' with regionsize > mirror log SubLV size.
As we disabled support for 'mirrored' mirror logs with
commit e82303fd6a which still conditionally
allows to enable it via global/support_mirrored_mirror_logs=1,
patch is mandatory for all distributions.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1712983
Currently lvm2 is not wiping signatures when creating 'metadata' volumes
and raid _rmeta was the only exception - so make the behavior consistent
with other metadata devices and drop wiping ATM.
Drop also some extra debug since they are now more explanatory in
wipe_lv() function.
Also note - although lvm2 now does not wipe signatures - the error
from such wipping used to be actually 'ignored' before wipe_lv()
started to return error (with recent commit) and raid creation
continued with 'unzeroed' metadata device.
TODO: Several issues to resolve:
1. We may want to flip to wipping with all LVs (in that case we need to
support passing --yet & --force).
2. Also we may want to clear whole metadata device - however current
function is also used for wipping i.e. snapshot COW device which
is likely not a good candidate for full device zeroing.
We may also need to think about better logic when extent size is
enforcing very large LVs, when only a small portion of LV is ever
being used.
3. Using TRIM instead of zeroing metadata device might be worth to
implement.
mm
When converting volume to pool LV use also wiping of other signatures.
For writecache & pool conversion support --yet and --force
to bypass prompting for signature wiping.
For writecache drop unneded zero_sectors.
Note: currently we have lvconvert doing convertion and prompting
for confirmation of conversion - and then again wipe_lv() prompts
for removing i.e. filesystem signature - we should unify this
prompting into 1 message - althought the 'filesystem' discovery
needs active volume - while the 1st. conversion prompt can
work without active converted volume.
To avoid polution of metadata with some 'garbage' content or eventualy
some leak of stale data in case user want to upload metadata somewhere,
ensure upon allocation the metadata device is fully zeroed.
Behaviour may slow down allocation of thin-pool or cache-pool a bit
so the old behaviour can be restored with lvm.conf setting:
allocation/zero_metadata=0
TODO: add zeroing for extension of metadata volume.
Failure in wiping/zeroing stop the command.
If user wants to avoid command abortion he should use -Zn or -Wn
to avoid wiping.
Note: there is no easy way to distinguish which kind of failure has
happend - so it's safe to not proceed any futher.
When initiated larger write request, it may have happened, bcache
got out of free chunks - fix the loop, that is supposed to wait
until next free chunk becomes avain available.
To create a new cache or writecache LV with a single command:
lvcreate --type cache|writecache
-n Name -L Size --cachedevice PVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, a new cachevol LV is created internally, using PVfast
specified by the cachedevice option.
- Then, the cachevol is attached to the main LV, converting the
main LV to type cache|writecache.
Include --cachesize Size to specify the size of cache|writecache
to create from the specified --cachedevice PVs, otherwise the
entire cachedevice PV is used. The --cachedevice option can be
repeated to create the cache from multiple devices, or the
cachedevice option can contain a tag name specifying a set of PVs
to allocate the cache from.
To create a new cache or writecache LV with a single command
using an existing cachevol LV:
lvcreate --type cache|writecache
-n Name -L Size --cachevol LVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, the cachevol LVfast is attached to the main LV, converting
the main LV to type cache|writecache.
In cases where more advanced types (for the main LV or cachevol LV)
are needed, they should be created independently and then combined
with lvconvert.
Example
-------
user creates a new VG with one slow device and one fast device:
$ vgcreate vg /dev/slow1 /dev/fast1
user creates a new 8G main LV on /dev/slow1 that uses all of
/dev/fast1 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1
-n main -L 8G vg /dev/slow1
Example
-------
user creates a new VG with two slow devs and two fast devs:
$ vgcreate vg /dev/slow1 /dev/slow2 /dev/fast1 /dev/fast2
user creates a new 8G main LV on /dev/slow1 and /dev/slow2
that uses all of /dev/fast1 and /dev/fast2 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1 --cachedevice /dev/fast2
-n main -L 8G vg /dev/slow1 /dev/slow2
Example
-------
A user has several slow devices and several fast devices in their VG,
the slow devs have tag @slow, the fast devs have tag @fast.
user creates a new 8G main LV on the slow devs with a
2G writecache on the fast devs:
$ lvcreate --type writecache -n main -L 8G
--cachedevice @fast --cachesize 2G vg @slow
To add a cache or writecache to a main LV with a single command:
lvconvert --type cache|writecache --cachedevice /dev/ssd vg/main
A cachevol LV will be allocated from the specified cache device,
then attached to the main LV. Include --cachesize to specify the
size of cachevol to create, otherwise the entire cachedevice is
used. The cachedevice option can be repeated to create a cachevol
from multiple devices.
Example
-------
A user has an existing main LV that they want to speed up
using a new ssd.
user adds the new ssd to the VG:
$ vgextend vg /dev/ssd
user attaches the new ssd their main LV:
$ lvconvert --type writecache --cachedevice /dev/ssd vg/main
Example
-------
A user has two existing main LVs that they want to speed up
with a new ssd.
user adds the new 16G ssd to the VG:
$ vgextend vg /dev/ssd
user attaches some of the new ssd to the first main LV,
using half of the space:
$ lvconvert --type writecache --cachedevice /dev/ssd
--cachesize 8G vg/main1
user attaches some of the new ssd to the second main LV,
using the other half of the space:
$ lvconvert --type writecache --cachedevice /dev/ssd
--cachesize 8G vg/main2
Example
-------
A user has an existing main LV that they want to speed up using
two new ssds.
user adds the new two ssds the VG:
$ vgextend vg /dev/ssd1
$ vgextend vg /dev/ssd2
user attaches both ssds their main LV:
$ lvconvert --type writecache
--cachedevice /dev/ssd1 --cachedevice /dev/ssd2 vg/main
Use libblkid to detect sector/block size of the fs on the LV.
Use this to choose a compatible writecache block size.
Enable attaching writecache to an active LV.
When lvconvert is used to remove raid images, we can
skip calling lv_add_integrity_to_raid(), which finds
nothing to do, but the the blocksize validation would
be called unnecessarily and trigger spurious errors.
The test was using a raid+integrity LV without
first waiting for the integrity sync, which could
cause the test to fail (depending on init speed)
where it depends on integrity to work in uninitialized
areas.
Also use cmp instead of diff.
pvck --dump headers reads the metadata text area
to compute the text metadata checksum to compare
with the mda_header checksum.
The new header_only will skip reading the metadata
text and not validate the mda_header checksum.
It's possible for a dev-cache entry to remain after all
paths for it have been removed, and other parts of the
code expect that a dev always has a name. A better fix
may be to remove a device from dev-cache after all paths
to it have been removed.
When either logical block size or physical block size is 4K,
then lvmlockd creates sanlock leases based on 4K sectors,
but the lvm client side would create the internal lvmlock LV
based on the first logical block size it saw in the VG,
which could be 512. This could cause the lvmlock LV to be
too small to hold all the sanlock leases. Make the lvm client
side use the same sizing logic as lvmlockd.
The lock adopt feature was disabled since it had used
lvmetad as a source of info. This replaces the lvmetad
info with a local file and enables the adopt feature again
(enabled with lvmlockd --adopt 1).
The --lock-opt autowait was dropped back in 9ab6bdce01,
and attempting to specify it has quite an opposite effect:
no waiting is done, which makes the unit almost useless.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum. When used on
raid images, dm-raid will correct the error by reading
the block from another image, and the device user sees
no error. The integrity metadata (checksums) are stored
on an internal LV allocated by lvm for each linear image.
The internal LV is allocated on the same PV as the image.
Create a raid LV with an integrity layer over each
raid image (for raid levels 1,4,5,6,10):
lvcreate --type raidN --raidintegrity y [options]
Add an integrity layer to images of an existing raid LV:
lvconvert --raidintegrity y LV
Remove the integrity layer from images of a raid LV:
lvconvert --raidintegrity n LV
Settings
Use --raidintegritymode journal|bitmap (journal is default)
to configure the method used by dm-integrity to ensure
crash consistency.
Initialization
When integrity is added to an LV, the kernel needs to
initialize the integrity metadata/checksums for all blocks
in the LV. The data corruption checking performed by
dm-integrity will only operate on areas of the LV that
are already initialized. The progress of integrity
initialization is reported by the "syncpercent" LV
reporting field (and under the Cpy%Sync lvs column.)
Example: create a raid1 LV with integrity:
$ lvcreate --type raid1 -m1 --raidintegrity y -n rr -L1G foo
Creating integrity metadata LV rr_rimage_0_imeta with size 12.00 MiB.
Logical volume "rr_rimage_0_imeta" created.
Creating integrity metadata LV rr_rimage_1_imeta with size 12.00 MiB.
Logical volume "rr_rimage_1_imeta" created.
Logical volume "rr" created.
$ lvs -a foo
LV VG Attr LSize Origin Cpy%Sync
rr foo rwi-a-r--- 1.00g 4.93
[rr_rimage_0] foo gwi-aor--- 1.00g [rr_rimage_0_iorig] 41.02
[rr_rimage_0_imeta] foo ewi-ao---- 12.00m
[rr_rimage_0_iorig] foo -wi-ao---- 1.00g
[rr_rimage_1] foo gwi-aor--- 1.00g [rr_rimage_1_iorig] 39.45
[rr_rimage_1_imeta] foo ewi-ao---- 12.00m
[rr_rimage_1_iorig] foo -wi-ao---- 1.00g
[rr_rmeta_0] foo ewi-aor--- 4.00m
[rr_rmeta_1] foo ewi-aor--- 4.00m
Make it possible to tear down VDO volumes with blkdeactivate if VDO is
part of a device stack (and if VDO binary is installed). Also, support
optional -o|--vdooptions configfile=file.
lvm2 supports thin-pool to be later used by other tools doing
virtual volumes themself (i.e. docker) - in this case we
shall not validate transaction Id - is this is used by
other tools and lvm2 keeps value 0 - so the transationId
validation need to be skipped in this case.
When vdopool is activated standalone - we use a wrapping linear device
to hold actual vdo device active - for this we can set-up read-only
device to ensure there cannot be made write through this device to
actual pool device.
Creating a snapshot was using a persistent LV lock
on the origin, so if the origin LV was inactive at
the time of the snapshot the LV lock would remain.
(Running lvchange -an on the inactive LV would
clear the LV lock.) Use a transient LV lock so it
will be dropped if it was not locked previously.
Prevent attaching writecache to an active LV until
we can determine the block size of the fs on the LV,
and use that to enforce an appropriate writecache
block size. Changing the block size under a mounted
fs can cause panic/corruption.
dmeventd is 'scanning' statuses in loop (most usually in 10sec
intervals) - and meanwhile it sleeps within:
pthread_cond_timedwait()
However this function call tends to wakeup sometimes a short amount of
time sooner - and our code still believe the 'right time' has not yet
arrived and basically for a moment 'busy-looped' on calling this
function - so for systems with 'clock_gettime()' present we obtain
time and we go 10ms to the future second - this avoids unneeded
repeated invocation of our time scheduling loop.
TODO: monitoring during 1 hour 'time-change'...
When formating VDO volume, the calculated amound of bits
for 'vdoformat --slab-bits' parameter was shifted by 2 bits
(calculated size was making 2MiB vdo_slab_size_mb value appear like if
user would be specifying only 512KiB)
Fixed by properly converting internal size_mb value to KiB.
Fix the anoying kernel message reported:
device-mapper: cache: 253:2: metadata operation 'dm_cache_commit' failed: error = -5
which has been reported while cachevol has been removed.
Happened via confusing variable - so switch the variable to commonly user '_size'
which presents a value in sector units and avoid 'scaling' this as extent length
by vg extent size when placing 'error' target on removal path.
Patch shouldn't have impact on actual users data, since at this moment
of removal all date should have been already flushed to origin device.
m
The previous patch improved read of pipe when lvm2 was looking
for default logical size, but we clearly must read pipe also
for -V case, when the logical size is already defined.
Still the place can be better to block only particular reshape
operations which ATM cause kernel problems.
We check if the new number of images is higher - and prevent to take
conversion if the volume is in use (i.e. thin-pool's data LV).
clang: it's supposedly impossible path to hit, as we should always
have origin_lv defined when running this path, but adding protection
isn't a big issue to make this obvious to analyzer.
Since _reserve_area() may fail due to error allocation failure,
add support to report this already reported failure upward.
FIXME: it's log_error() without causing direct command failure.
When _daemon_read()/_client_read() fails during the read,
ensure memory allocated withing function is also release here
(so caller does not need to care). Also improve code readbility a bit
a for same functionality use more similar code.
Although we expect min_chunk_size to be 32bit value, for
large size of caches it might be useful to do calcs 64bit.
So to avoid doing shift as signed 32bit - use unsigned 64bit
from the start.
reporting fields (-o) directly from kernel:
writecache_total_blocks
writecache_free_blocks
writecache_writeback_blocks
writecache_error
The data_percent field shows used cache blocks / total cache blocks.
Currently the error messages are not clear. This very easy to
guide user to execute "--removemissing --force", it is dangerous
and will make the LVs to be destroied.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
systemctl status corosync (version: 2.4.5) report error:
parse error in config: No interfaces defined
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Since VDO is also pool, the old if() case missed to know about this,
and executed unnecesserily initialization of cache pool variables.
This was usually harmless when using 'smaller' sizes of VDO pools,
but for big VDO pool size, we were reporting senseless messages
about big cache chunk sizes.
Until we resolve reshape for 'stacked' devices, we need to disable it.
So users can no longer reshape i.e. thin-pool data volumes, causing
ATM bad thin-pool problems.
When devices are created - we were not giving meaning error messages
when the failure happened on 'reload' part of creation.
With this patch we are now able to report both name and major:minor.
Enhancment is most visible with 'crypto' devices,
which are using 'secure' memory erase bit.
To write a new/repaired pv_header and label_header:
pvck --repairtype pv_header --file <file> <device>
This uses the metadata input file to find the PV UUID,
device size, and data offset.
To write new/repaired metadata text and mda_header:
pvck --repairtype metadata --file <file> <device>
This requires a good pv_header which points to one or two
metadata areas. Any metadata areas referenced by the
pv_header are updated with the specified metadata and
a new mda_header. "--settings mda_num=1|2" can be used
to select one mda to repair.
To combine all header and metadata repairs:
pvck --repair --file <file> <device>
It's best to use a raw metadata file as input, that was
extracted from another PV in the same VG (or from another
metadata area on the same PV.) pvck will also accept a
metadata backup file, but that will produce metadata that
is not identical to other metadata copies on other PVs
and other areas. So, when using a backup file, consider
using it to update metadata on all PVs/areas.
To get a raw metadata file to use for the repair, see
pvck --dump metadata|metadata_search.
List all instances of metadata from the metadata area:
pvck --dump metadata_search <device>
Save one instance of metadata at the given offset to
the specified file (this file can be used for repair):
pvck --dump metadata_search --file <file>
--settings "metadata_offset=<off>" <device>
using --settings:
mda_offset=<offset> mda_size=<size> can be used
in place of the offset/size that normally come
from headers.
metadata_offset=<offset> prints/saves one instance
of metadata text at the given offset, in
metadata_all or metadata_search.
This reverts commit 7474440d3b.
lvs can use the scanning optimization again since it has
been changed in:
"scanning: optimize by checking text offset and checksum"
After the VG lock is taken for vg_read, reread the mda_header
and compare the metadata text offset and checksum to what was
seen during label scan. If it is unchanged, then the metadata
has not changed since the label scan, and the metadata does not
need to be reread under the lock for command processing.
For commands that do not make changes (e.g. reporting), the
mda_header is reread and checked on one mda to decide if the
full metadata rereading can be skipped. For other commands
(e.g. modifying the vg) the mda_header is reread and checked
from all PVs. (These could probably just check one mda also.)
The kernel MD runtime requires region size to be larger than stripe size
on striped raid layouts, thus the dm-raid target's constructor rejects
such request.
This causes e.g. an 'lvcreate --type raid10 -i3 -I4096 -R2048 -n lv vg' to fail.
Avoid failing late in the kernel by enforcing region size to be
larger or equal to stripe size.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1698225
When pvcreate/pvremove prompt the user, they first release
the global lock, then acquire it again after the prompt,
to avoid blocking other commands while waiting for a user
response. This release/reacquire changes the locking
order with respect to the hints flock (and potentially other
locks). So, to avoid deadlock, use a nonblocking request
when reacquiring the global lock.
The scanning optimization can produce warnings from
'lvs' when run concurrently with commands modifying LVs,
so disable the optimization until it can be improved.
Without the scanning optimization, lvs will always
read all PVs twice:
1. read metadata from all PVs, saving it in memory
2. for each VG
3. lock VG
4. reread metadata from all PVs in VG, replacing metadata
saved from step 1
5. run command on VG
6. unlock VG
The optimization would usually cause step 4 to be skipped,
and PVs would be read only once.
Running the command in step 5 using metadata that was not
read under the VG lock is usually fine, except for the
fact that lvs attempts to validate the metadata by comparing
it to current dm state. If other commands are modifying dm
state while lvs is running, lvs may see differences between
metadata from step 1 and dm state checked during step 5,
and print warnings.
(A better fix may be to detect the concurrent change and
fall back to rereading metadata in step 4 only when needed.)
Since we fixed linking of proper version of 'libdevmapper' with
linking lvm2 plugin correctly - we already have correct function
available linked with internal lvm library.
So drop unneeded include of parsing function.
Avoid mem leaking hint on every loop continue and
allocate hint only when it's going to be added into list.
Switch to use 'dm_strncpy()' and validate sizes.
For dev_in_device_list() != 0 allocated 'devl' was
actually leaking - so instead allocate 'devl' only
when !dev_in_device_list() and indent code around.
Since we check for NULL pointers earlier we need
to be consistent across function - since the NULL
would applies across whole function.
When dropping 'mda' check - we are actually
already dereferencing it before - so it can't
be NULL at that places (and it's validated
before entering _read_mda_header_and_metadata).
dev_unset_last_byte() must be called while the fd is still valid.
After a write error, dev_unset_last_byte() must be called before
closing the dev and resetting the fd.
In the write error path, dev_unset_last_byte() was being called
after label_scan_invalidate() which meant that it would not unset
the last_byte values.
After a write error, dev_unset_last_byte() is now called in
dev_write_bytes() before label_scan_invalidate(), instead of by
the caller of dev_write_bytes().
In the common case of a successful write, the sequence is still:
dev_set_last_byte(); dev_write_bytes(); dev_unset_last_byte();
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
When resizing 2 volumes like thin-pool and it's metadata and they
would be of a different type - command would be actually expecting
both LVs being of a same segtype - and would throw an error in
case they are different.
This patch fixes is by setting a new segtype from last segment of
2nd. extented device.
Also it fixes the possible 'percentage' extension setup that
might have been used for 'primary' volume - while the 'secondary'
LV always goes with direct size - as we do not support 'percentage'
setup for them
This affects maily usage of thin-pool where the extension of
thin-pool data size may also lead to extension of metadata size.
When running cluster test with clvmd, the actual 'monitoring'
happens in cluster - so the 'already monitored' message
is also logged within clvmd code and the command cannot
see such effect.
clvmd was incapable to report this information back to command
so it cannot be displayed this way.
Add 'lvs -o+seg_monitor' validation which also works in clustered mode.
Instead of checking all LVs in a VG - do just a direct copy of LVs
from the existing list ->segs_using_thin_lv.
TODO: maybe it could be better to expose seg_list to /tools...
Enhance lv_info with lv_info_with_name_check.
This 'variant' not only check existance if UUID in DM table
but also compares its DM name whether it's matching expected LV name.
Otherwise activation may 'skip' activation with rename in case the
DM UUID already exists, just device is different name.
This change make fairly easier manipulation with i.e. detached mirror
leg which ATM is using same UUID - just the LV name have been changed.
Used code was not able to run 'activation' (and do a rename) and just
skipped the call. So the code used to do a workaround and 'tried'
to deactivate such LV firts - this however work only in non-clvmd case,
as cluster was not having the lock for deactivated LV.
With this extended lv_info code will run 'activation' and will
synchronize the name to match expected LV name.
Patch extends _lv_info() with new paramter 'with_name_check',
which is later translated into 'name_check' argument for
_info_run() which in case of name mismatch evaluates the
check as if device does not exists.
Such call is only used in one place _lv_activate() which then
let activation run. All other invocation of _info() calls
are left intact.
TODO: fix mirror table manipulation (and raid)....
Avoid making more dbus calls to get information we already have. This
also avoids us getting an error where a dbus object representation is
being deleted by another process while we are trying to gather information
about it across the wire.
When a LV loses an interface it ends up getting removed and recreated.
This happens after the VGs have been processed and updated. Thus when
this happens we need to re-check the VGs.
VDO pool LVs are represented by a new dbus interface VgVdo. Currently
the interface only has additional VDO properties, but when the
ability to support additional LV creation is added we can add a method
to the interface.
The return value from bcache_invalidate_fd() was not being checked.
So I've introduced a little function, _invalidate_fd() that always
calls bcache_abort_fd() if the write fails.
The resume of 'released' 'COW' should preceed the resume of origin.
The fact we need to do the sequence differently for merge was
cause by bugs fixed in 2 previous commits - so we no longer need
to recognize 'merging' and we should always go with single
sequence.
The importance of this order is - to properly remove '-real' device
from origin LV. When COW is activated as 2nd. '-real' device is
kept in table as it cannot be removed during 1st. resume of origin,
and later activation of COW LV no longer builds tree associated
with origin LV.
When checking device id of a thin device that is just being
merged - the snapshot actually could have been already finished
which means '-real' suffix for the LV is already gone and just LV
is there - so check explicitely for this condition and use
correct UUID for this case.
When a cachevol LV is attached, have the LV keep it's lock
allocated. The lock on the cachevol won't be used while
it's attached. When the cachevol is split a new lock does
not need to be allocated. (Applies to cachevol usage by
both dm-cache and dm-writecache.)
When a user includes "--type foo" in a command, only
look at command definitions with matching type, as
opposed to using matching/mismatching --type as a
vote for/against a given command def. This means a
command with --type foo will prioritize a command def
with --type foo over other command defs that have
more matching options but an unmatching type. This
makes it more likely that a closely matching command
def will be recommended.
When LV gets cached and uses cache-pool - such cache-pool
will now get _cpool suffix automatically.
Thus 'Pool' column for cached LV will now show either _cvol
or _cpool LV.
Improve the implementation of extracting all text metadata
copies from the metadata area. Use this for the existing
metadata_all dump option.
Add a new metadata_search dump option which does not use
lvm headers to find metadata, but looks in standard
locations. This is useful if headers are damaged and
can't be used to locate metadata.
Adding '-v' to metadata_all or metadata_search will add
the description and creation_time to the printed list of
metadata instances that are found.
Before 'archive()' is called, lvm2 must not touch/modify metadata.
So move setting CACHE_VOL related flags past this point.
Also make sure reading of cache segtype always restores this
flag properly (even if compatible flag would be lost).
Since code is using -cdata and -cmeta UUID suffixes, it does not need
any new 'extra' ID to be generated and stored in metadata.
Since introduce of new 'segtype' cache+CACHE_USES_CACHEVOL we can
safely assume 'new' cache with cachevol will now be created
without extra metadata_id and data_id in metadata.
For backward compatibility, code still reads them in case older
version of metadata have them - so it still should be able
to activate such volumes.
Bonus is lowered size of lv structure used to store info about LV
(noticable with big volume groups).
In the hex dump output, grep for the vgname
followed by one space. This allows for test pids
with up to seven digits, which are used to contruct
the variable vgname used by the test. Otherwise
the long vgname wraps to the next line and fails to
match in grep.
When an LV is used as a writecache cachevol, give
it the LV name a _cvol suffix. Remove the suffix
when the cachevol is detached, restoring the
original LV name.
The first part of a cachevol LV is used for metadata,
and the rest of the space is used for data. The
division of space between metadata and data depends
on the total size of the cachevol.
The previous division gave more space than needed to
metadata, it was:
cachevol size 8M to 128M -> metadata size 16M *
cachevol size 128M to 1G -> metadata size 32M
cachevol size 1G and up -> metadata size 64M
(* if this resulted in over half the LV used as
metadata, then half the cachevol would be used
for metadata, and the other half for data.)
The division of space now gives less space to
metadata, it is:
cachevol size 8M to 16M -> metadata size 4M
cachevol size 16M to 4G -> metadata size 8M
cachevol size 4G to 16G -> metadata size 16M
cachevol size 16G to 32G -> metadata size 32M
cachevol size 32G and up -> metadata size 64M
When a VG contains some LVs with unknown segtypes, the user
should still be allowed to activate other LVs in the VG that
are understood.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi------- 4.00m
other foo vwi---u--- 48.00m
$ lvcreate -l1 foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Cannot change VG foo with unknown segments in it!
Cannot process volume group foo
$ lvchange -ay foo/lvol0
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
$ lvchange -ay foo/other
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Refusing activation of LV foo/other containing an unrecognised segment.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi-a----- 4.00m
other foo vwi---u--- 48.00m
A cachevol LV had the CACHE_VOL status flag in metadata,
and the cache LV using it had no new flag. This caused
problems if the new metadata was used by an old version
of lvm. An old version of lvm would have two problems
processing the new metadata:
. The old lvm would return an error when reading the VG
metadata when it saw the unknown CACHE_VOL status flag.
. The old lvm would return an error when reading the VG
metadata because it would not find an expected cache pool
attached to the cache LV (since the cache LV had a
cachevol attached instead.)
Change the use of flags:
. Change the CACHE_VOL flag to be a COMPATIBLE flag (instead
of a STATUS flag) so that old versions will not fail when
they see it.
. When a cache LV is using a cachevol, the cache LV gets
a new SEGTYPE flag CACHE_USES_CACHEVOL. This flag is
appended to the segtype name, so that old lvm versions
will fail to use the LV because of an unknown segtype,
as opposed to failing to read the VG.
Enhance activation of cached devices using cachevol.
Correctly instatiace cachevol -cdata & -cmeta devices with
'-' in name (as they are only layered devices).
Code is also a bit more compacted (although still not ideal,
as the usage of extra UUIDs stored in metadata is troublesome
and will be repaired later).
NOTE: this patch my brink potentially minor incompatiblity for 'runtime' upgrade
Instead of using 'noflush' option, switch cache_mode into WRITETHROUGH
which does not require flushing, when user confirmed he does not
want flushing for WRITEBACK (because of (partially) missing caching PV)
For wiping we activate and clear 'regular' devices,
since in case of whole process interuption (i.e. kill -9)
we leave metadata & DM table and workable state all the time.
Let vgck --updatemetadata repair cases where different mdas
hold indepedently valid but unmatching copies of the metadata,
i.e. different text metadata checksums or text metadata sizes.
vgck --updatemetadata would write the same correct
metadata to good mdas, and then to bad mdas, but the
sequence of vg_write/vg_commit calls betwen good and
bad mdas could cause a different description field to
be generated for good/bad mdas. (The description field
describing the command was recently included in the
ondisk copy of the metadata text.)
If a VG is forcibly changed from lock_type sanlock to
lock_type none, the internal lvmlock LV is left behind.
If that LV is not removed before vgremove is run on the
VG, then an internal check will be triggered by the
hidden lvmlock LV. So, check for and remove a left over
lvmlock LV during vgremove.
Add lots of vdo fields:
vdo_operating_mode - For vdo pools, its current operating mode.
vdo_compression_state - For vdo pools, whether compression is running.
vdo_index_state - For vdo pools, state of index for deduplication.
vdo_used_size - For vdo pools, currently used space.
vdo_saving_percent - For vdo pools, percentage of saved space.
vdo_compression - Set for compressed LV (vdopool).
vdo_deduplication - Set for deduplicated LV (vdopool).
vdo_use_metadata_hints - Use REQ_SYNC for writes (vdopool).
vdo_minimum_io_size - Minimum acceptable IO size (vdopool).
vdo_block_map_cache_size - Allocated caching size (vdopool).
vdo_block_map_era_length - Speed of cache writes (vdopool).
vdo_use_sparse_index - Sparse indexing (vdopool).
vdo_index_memory_size - Allocated indexing memory (vdopool).
vdo_slab_size - Increment size for growing (vdopool).
vdo_ack_threads - Acknowledging threads (vdopool).
vdo_bio_threads - IO submitting threads (vdopool).
vdo_bio_rotation - IO enqueue (vdopool).
vdo_cpu_threads - CPU threads for compression and hashing (vdopool).
vdo_hash_zone_threads - Threads for subdivide parts (vdopool).
vdo_logical_threads - Logical threads for subdivide parts (vdopool).
vdo_physical_threads - Physical threads for subdivide parts (vdopool).
vdo_max_discard - Maximum discard size volume can recieve (vdopool).
vdo_write_policy - Specified write policy (vdopool).
vdo_header_size - Header size at front of vdopool.
Previously only 'lvdisplay -m' was exposing them.
Since we now support activation of 'vdo' volume
without explicit activation of 'vdopool' it's now possible
to have active layer vdopool (-vpool) volume and
having vdopool itself inactive - yet still in this
case we can show available stats for this volume.
But we need to show correct activation status and other
standard info.
Adds support for the DM_GET_TARGET_VERSION to dmsetup.
It introduces a new comman "target-version" that will accept list
of targets and print their version.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Use /dev/md33 instead of /dev/md0 to reduce chances of
conflicting with an existing name.
Only call 'mdadm --stop /dev/md33' for cleanup and don't
use 'mdadm --stop --scan' to avoid stopping other md devs.
Due to a dm-raid target flaw fixed in target version 1.15.0,
extents of raid sets don't get resynchronized when new MD bitmp
pages have to be allocated due to the extension.
Introduce lvextend-raid.sh to test this flaw.
Related: rhbz1671964
When the PV device names in the VG metadata do not match the
current PV device names seen on the system, do not use the
optimized activation function (that avoids extra device scanning.)
When the device names do not match, it's a clue that there could
be duplicate PVs, in which case we want to scan all devicess to
find any duplicates and stop the activation if found.
This does not prevent autoactivating a VG from the incorrect
duplicate PV, because the incorrect duplicate may appear by itself
first. At that point its duplicate PV does not exist to be seen.
(A future enhancement could use the WWID to strengthen this
detection.)
Analogous to the case of a device with no regions, it is not an
error to attempt to list the stats groups on a device that has no
configured groups: just return success and continue.
Avoid checking 'lv_is_active()' since special LV types does this
validation anyway what calling _percent() function and call it
ONLY when none of special types is queried.
This restores support for VDO resize (as with support for
separate VDO pool activation, plain query for lv_is_active()
is not working in this case).
If the linear mapping is lost (for whatever reason, i.e.
test suite forcible 'dmsetup remove' linear LV,
lvm2 had hard times figuring out how to deactivate such DM table.
So add function which is in case inactive VDO pool LV checks if
the pool is actually still active (-vpool device present) and
it has open count == 0. In this case deactivation is allowed
to continue and cleanup DM table.
- use internal CACHE_VOL flag on cachevol LV
- add suffixes to dm uuids for internal LVs
- display appropriate letters in the LV attr field
- display writecache's cachevol in lvs output
This reverts commit ad560a286a.
The reverted patch also removed the warning which we realized we need
to keep as valuable process information (see related bugzilla below).
In a followup patch, we'll keep the message and avoid bailing out thus
always allowing lvconvert to try repairing if 'allocate' fault policy set.
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1751887
. For dm-cache in writethrough, always allow splitcache,
whether the cache is missing PVs or not.
. For dm-cache in writeback, if the cache is missing PVs,
allow splitcache with force and yes.
. For dm-writecache, if the cache is missing PVs,
allow splitcache with force and yes.
Enhance 'activation' experience for VDO pool to more closely match
what happens for thin-pools where we do use a 'fake' LV to keep pool
running even when no thinLVs are active. This gives user a choice
whether he want to keep thin-pool running (wihout possibly lenghty
activation/deactivation process)
As we do plan to support multple VDO LVs to be mapped into a single VDO,
we want to give user same experience and 'use-patter' as with thin-pools.
This patch gives option to activate VDO pool only without activating
VDO LV.
Also due to 'fake' layering LV we can protect usage of VDO pool from
command like 'mkfs' which do require exlusive access to the volume,
which is no longer possible.
Note: VDO pool contains 1024 initial sectors as 'empty' header - such
header is also exposed in layered LV (as read-only LV).
For blkid we are indentified as LV with UUID suffix - thus private DM
device of lvm2 - so we do not need to store any extra info in this
header space (aka zero is good enough).
When lvm2 is activating layered pool LV (to basically keep pool opened,
the other function used to be 'locking' be in sync with DM table)
use this LV in read-only mode - this prevents 'write' access into
data volume content of thin-pool.
Note: since EMPTY/unused thin-pool is created as 'public LV' for generic
use by any user who i.e. wish to maintain thin-pool and thins himself.
At this moment, thin-pool appears as writable LV. As soon as the 1st.
thinLV is created, layer volume will appear is 'read-only' LV from this moment.
When pvscan is used to activate a VG via an
asynchronous service (i.e. lvm2-pvscan), there
is no requirement that the command wait for
udev to create device nodes before returning.
It's possible that waiting for udev is slow
enough to cause the service running the command
to time out. So, allow the --noudevsync option
to be given to pvscan to skip waiting for udev.
(This commit is not changing the lvm2-pvscan
service itself to use --noudevsync.)
Still unknown is whether there are any complex
LV activation cases in which lvm itself requires
access to a device node, in which case the udev
wait could be needed by lvm itself.
(When running an activation command directly
from the command line, it's generally expected
that the activated LVs are ready to use when
the command is finished, so lvm waits for
udev to finish creating the dev nodes.)
When an online PV completed a VG, the standard
activation functions were used to activate the VG.
These functions use a full scan of all devs.
When many pvscans are run during startup and need
to activate many VGs, scanning all devs from all
the pvscans can take a long time.
Optimize VG activation in pvscan to scan only the
devs in the VG being activated. This makes use of
the online file info that was used to determine
the VG was complete.
The downside of this approach is that pvscan activation
will not detect duplicate PVs and block activation,
where a normal activation command (which scans all
devices) would.
Fixes a regression from commit ba7ff96faf
"improve reading and repairing vg metadata"
where the error path for a vg name with invalid
charaters was missing an error flag, which led
to the caller not recognizing an error occured.
Previously, an error flag was hidden in the old
_vg_make_handle function.
Only the first entry of the filter array was being
included in the copy of the filter, rather than the
entire thing. The result is that hints would not be
refreshed if the filter was changed but the first
entry was unchanged.
Fixes a segfault in the recent commit e01fddc57:
"improve duplicate pv handling for md components"
While choosing between duplicates, the info struct is
not always valid; it may have been dropped already.
Remove the code that was still using the info struct for
size comparisons. The size comparisons were a bogus check
anyway because it was just preferring the dev that had
already been chosen, it wasn't actually comparing the
dev size to the PV size. It would be good to use a
dev/PV size comparison in the duplicate handling code, but
the PV size is not available until after vg_read, not
from the scan.
With Python 3.8 converting these directly to string using str()
no longer works, we need to convert these to integer first.
On Python 3.8:
>>> str(dbus.Int64(1))
'dbus.Int64(1)'
On Python 3.7 (and older):
>>> str(dbus.UInt64(1))
'1'
This is probably related to removing __str__ function from method
from int (dbus.UInt is subclass of int) which happened in 3.8, see
https://docs.python.org/3.8/whatsnew/3.8.html
Signed-off-by: Vojtech Trefny <vtrefny@redhat.com>
Since we need to preserve allocated strings across 2 separate
activation calls of '_tree_action()' we need to use other mem
pool them dm->mem - but since cmd->mem is released between
individual lvm2 locking calls, we rather introduce a new separate
mem pool just for pending deletes with easy to see life-span.
(not using 'libmem' as it would basicaly keep allocations over
the whole lifetime of clvmd)
This patch is fixing previous commmit where the memory was
improperly used after pool release.
Update configure and make code compilable if prlimit() is not present.
Since the code is suspicious do not cope yet with it's replacement
with set/getrlimit().
New udev in rawhide seems to be 'dropping' udev rule operations for devices
that are no longer existing - while this is 'probably' a bug - it's
revealing moments in lvm2 that likely should not run in a single
transaction and we should wait for a cookie before submitting more work.
TODO: it seem more 'error' paths should always include synchronization
before starting deactivating 'just activated' devices.
We should probably figure out some 'automatic' solution for this instead
of placing sync_local_dev_name() all over the place...
Support internal removal of 'cache origin' volume - which we
do not normally expose to a user - however internal processing
loops may hit this condition (depending on order of list LVs).
So when this operation is internally requested - we automatically
try to remove it's 'holding' LV (cache LV) - which will also
remove the origin.
Drop the 'cluster-only' optimization so we do resume ALL device
before we try to wait on cookie before 'removal' operation.
It's more correct order of operation - alhtough possibly slightly
less efficient - but until we have correct list of operations
'in-progress' we can't do anything better.
With previous patch 30a98e4d67 we
started to put devices one pending_delete list instead
of directly scheduling their removal.
However we have operations like 'snapshot merge' where we are
resuming device tree in 2 subsequent activation calls - so
1st such call will still have suspened devices and no chance
to push 'remove' ioctl.
Since we curently cannot easily solve this by doing just single
activation call (which would be preferred solution) - we introduce
a preservation of pending_delete via command structure and
then restore it on next activation call.
This way we keep to remove devices later - although it might be
not the best moment - this may need futher tunning.
Also we don't keep the list of operation in 1 trasaction
(unless we do verify udev symlinks) - this could probably
also make it more correct in terms of which 'remove' can
be combined we already running 'resume'.
Udev debugging is a bit tricky, so to more easily pair cookie ID,
which is the lowest 16 bit - print cookie as hexa number.
This simplify pairing of processed cookies while the 'higher bit flags'
are changed for the same cookie.
Resuming of 'error' table entry followed with it's dirrect removal
is now troublesame with latest udev as it may skip processing of
udev rules for already 'dropped' device nodes.
As we cannot 'synchronize' with udev while we know we have devices
in suspended state - rework 'cleanup' so it collects nodes
for removal into pending_delete list and process the list with
synchronization once we are without any suspended nodes.
Between 'resume' and 'remove' we need to wait for udev to synchronize,
otherwise udev may 'skip' resume event processing if the udev node
is already gone.
When pvmove is finished, we do a tricky operation since we try to
resume multiple different device that were all joined into 1 big tree.
Currently we use the infromation from existing live DM table,
where we can get list of all holders of pvmove device.
We look for these nodes (by uuid) in new metadata, and we do now a full
regular device add into dm tree structure. All devices should be
already PRELOAD with correct table before entering suspend state,
however for correctly working readahead we need to put correct info
also into RESUME tree. Since table are preloaded, the same table
is skip and resume, but correct read ahead is now set.
Eliminate md components at the start so they don't
interfere with actual duplicates, and don't need
to be removed later. This also allows for choosing
no copy of a PVID if they all happen to be md
components.
Usually md components are eliminated in label scan and/or
duplicate resolution, but they could sometimes get into
the vg_read stage, where set_pv_devices compares the
device to the PV.
If set_pv_devices runs an md component check and finds
one, vg_read should eliminate the components.
In set_pv_devices, run an md component check always
if the PV is smaller than the device (this is not
very common.) If the PV is larger than the device,
(more common), do the component check when the config
setting is "auto" (the default).
The OPTIONS+="event_timeout" is Unsupported since systemd/udev version 216,
that is ~5 years ago.
Since systemd/udev version 243, there's a new message printed if unsupported
OPTIONS value is used:
Invalid value for OPTIONS key, ignoring: 'event_timeout=180'
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1740666
Some older BB with older cryptsetup tool do not 'retry' on remove
and when remove is issued right after 'fsck' - it might be
rejected with:
Device @PREFIX@-tcrypt2 is busy.
Try to use udevadm settle.
Since we use 'set -euE -o pipefail' for shell execution,
any failure of any command in the 'piped' shell can result
in failure of whole executed chain - resulting in typically
unsually test skip, that was left unnoticed.
Since checked command have usually short output, the simplest
fix seems to be to let grep parse whole output instead
of quiting after first match.
Kernels <2.6.27 don't have /sys/dev dir - add code for looking
out device name via longre seach in /sys/block
This makes commands like 'dmsetup dep -o blkdevname' working.
Fix versioning for updated symbols dm_stats_create_region
and dm_stats_create_region.
Only the latest symbol should have global entry.
Since I'm not sure what is currenlty the best option for
old symbols - we added support for easy commenting of them
(so we do not lose information when the symbol appeared
for the first time.)
Note: some old already deleted symbols should have been
restored as comments.
When there are more devices than the current soft
open file limit (default 1024), raise the soft limit
to the hard/max limit (default 4096).
Do this prior to scanning in case enough of the
devices are PVs that need to be kept open.
Avoid having PVs with different logical block sizes in the same VG.
This prevents LVs from having mixed block sizes, which can produce
file system errors.
The new config setting devices/allow_mixed_block_sizes (default 0)
can be changed to 1 to return to the unrestricted mode.
Do this at two levels, although one would be enough to
fix the problem seen recently:
- Ignore any reported sector size other than 512 of 4096.
If either sector size (physical or logical) is reported
as 512, then use 512. If neither are reported as 512,
and one or the other is reported as 4096, then use 4096.
If neither is reported as either 512 or 4096, then use 512.
- When rounding up a limited write in bcache to be a multiple
of the sector size, check that the resulting write size is
not larger than the bcache block itself. (This shouldn't
happen if the sector size is 512 or 4096.)
Previously, consecutive copies of metadata would have garbage
data in the space between them. After metadata wrapping,
the garbage would be portions of old metadata. This made
analysis of the metadata area more difficult.
This would happen because the start of new copy of metadata
is advanced from the end of the last copy to start at the
next 512 byte boundary.
Zero the space between consecutive copies of metadata by
extending each metadata write to end at the next 512 byte
boundary. The size of the metadata itself is not extended,
only the write. The buffer being written contains the
metadata text followed by the necessary number of zeros.
An active md device with an end superblock causes lvm to
enable full md component detection. This was being done
within the filter loop instead of before, so the full
filtering of some devs could be missed.
Also incorporate the recently added config setting that
controls the md component detection.
Fix commit 7836e7aa1c
"pvscan: ignore device with incorrect size"
which caused pvscan to not consider a PV online (for purposes
of event based activation) if the PV and device sizes differed.
This helped to avoid mistaking MD components for PVs, and is
replaced by triggering an md component check when PV and device
sizes differ (which happens in set_pv_device).
This check was mistakenly removed when shifting code in commit
"separate code for setting devices from metadata parsing".
Put it back with some new conditions.
This allows the creation of a striped mirror leg(s) during upconvert
by adding lvconvert command line options --stripes/--stripesize
for 'mirror' to tools/command-lines.in.
In case multiple mirror legs are being added, all will have the
same requested striped layout.
Resolves: rhbz1720705
We've been assigning this in 69-dm-lvm-metad.rules:
ENV{ID_MODEL}="LVM PV $env{ID_FS_UUID_ENC} on /dev/$name"
This was for the description to appear for each systemd device
unit representing this device, for example:
$systemctl -a | grep "LVM PV"
dev-block-252:2.device loaded active plugged LVM PV JhxC7B-YTgk-3jIU-5GVo-c4gV-W8t3-UUz06p on /dev/vda2 2
dev-disk-by\x2did-lvm\x2dpv\x2duuid\x2dJhxC7B\x2dYTgk\x2d3jIU\x2d5GVo\x2dc4gV\x2dW8t3\x2dUUz06p.device loaded active plugged LVM PV JhxC7B-YTgk-3jIU-5GVo-c4gV-W8t3-UUz06p on /dev/vda2 2
...
However, there could be an actual ID_MODEL that people are interested in
more than the fact that this is an LVM PV and so we shouldn't overwrite
the value.
Also, we already have a symlink /dev/disk/by-id/lvm-pv-uuid-<PV_UUID>
created which is then reflected as device unit (all device's symlinks
have systemd device unit representation) so we can still reach this
information in systemd unit listings even without setting the ID_MODEL.
Reported here: https://github.com/lvmteam/lvm2/issues/21
The cache repair utility does not yet work with a cachevol
(where metadata and data exist on the same LV.) So, warn
and prompt if writeback is specified with a cachevol.
The exported VG checking/enforcement was scattered and
inconsistent. This centralizes it and makes it consistent,
following the existing approach for foreign and shared
VGs/PVs, which are very similar to exported VGs/PVs.
The access policy that now applies to foreign/shared/exported
VGs/PVs, is that if a foreign/shared/exported VG/PV is named
on the command line (i.e. explicitly requested by the user),
and the command is not permitted to operate on it because it
is foreign/shared/exported, then an access error is reported
and the command exits with an error. But, if the command is
processing all VGs/PVs, and happens to come across a
foreign/shared/exported VG/PV (that is not explicitly named on
the command line), then the command silently skips it and does
not produce an error.
A command using tags or --select handles inaccessible VGs/PVs
the same way as a command processing all VGs/PVs, and will
not report/return errors if these inaccessible VGs/PVs exist.
The new policy fixes the exit codes on a somewhat random set of
commands that previously exited with an error if they were
looking at all VGs/PVs and an exported VG existed on the system.
There should be no change to which commands are allowed/disallowed
on exported VGs/PVs.
Certain LV commands (lvs/lvdisplay/lvscan) would previously not
display LVs from an exported VG (for unknown reasons). This has
not changed. The lvm fullreport command would previously report
info about an exported VG but not about the LVs in it. This
has changed to include all info from the exported VG.
When vg_read rescans devices with the intention of
writing the VG, the label rescan can open the devs
RW so they do not need to be closed and reopened
RW in dev_write_bytes.
Previously the VG metadata description field (which contains
the command line) was only included in backup/archive copies
of the metadata. Now also include it in the metadata written
to the metadata areas.
When monitoring, skip exported VGs without causing a command
failure.
The lvm2-monitor service runs 'vgchange --monitor y', so
any exported VG on the system would cause the service to
fail.
The man page generation for pvchange/lvchange/vgchange was
incorrect (leaving out some option listings) as a result of
commit e225bf5 "fix command definition for pvchange -a"
So when the target name happened to be a suffix of another one,
the grep was filtering incorrect line
(i.e. dm-cache && dm-writecache) - so do a line head matching.
The -a was being included in the set of "one or more"
options instead of an actual required option. Even
though the cmd def was not implementing the restrictions
correctly, the command internally was.
Adjust the cmd def code which did not support a command
with some real required options and a set of "one or more"
options.
The way that this command now uses the global lock
followed by a label scan, it can simply check if the
new VG name exists, and if not lock it and create it.
These two flags may be not reset at the end of
the command when the unlock is implicit, which
is a problem if the cmd struct is reused.
Clear the flags in the general fin_locking.
* origin/master: (22 commits)
tests: add metadata-bad-mdaheader.sh
tests: add metadata-bad-text.sh
tests: add outdated-pv.sh
tests: add metadata-old.sh
tests: add missing-pv missing-pv-unused
metadata.c: removed unused code
improve reading and repairing vg metadata
add a warning message when updating old metadata
vgcfgbackup add error messages
vgck --updatemetadata is a new command
move pv header repairs to vg_write
process_each_pv handle outdated pvs
move wipe_outdated_pvs to vg_write
create separate lvmcache update functions for read and write
fix vg_commit return value
change args for text label read function
add mda arg to add_mda
keep track of which mdas have old metadata in lvmcache
ability to keep track of outdated pvs in lvmcache
ability to keep track of bad mdas in lvmcache
...
The fact that vg repair is implemented as a part of vg read
has led to a messy and complicated implementation of vg_read,
and limited and uncontrolled repair capability. This splits
read and repair apart.
Summary
-------
- take all kinds of various repairs out of vg_read
- vg_read no longer writes anything
- vg_read now simply reads and returns vg metadata
- vg_read ignores bad or old copies of metadata
- vg_read proceeds with a single good copy of metadata
- improve error checks and handling when reading
- keep track of bad (corrupt) copies of metadata in lvmcache
- keep track of old (seqno) copies of metadata in lvmcache
- keep track of outdated PVs in lvmcache
- vg_write will do basic repairs
- new command vgck --updatemetdata will do all repairs
Details
-------
- In scan, do not delete dev from lvmcache if reading/processing fails;
the dev is still present, and removing it makes it look like the dev
is not there. Records are now kept about the problems with each PV
so they be fixed/repaired in the appropriate places.
- In scan, record a bad mda on failure, and delete the mda from
mda in use list so it will not be used by vg_read or vg_write,
only by repair.
- In scan, succeed if any good mda on a device is found, instead of
failing if any is bad. The bad/old copies of metadata should not
interfere with normal usage while good copies can be used.
- In scan, add a record of old mdas in lvmcache for later, do not repair
them while reading, and do not let them prevent us from finding and
using a good copy of metadata from elsewhere. One result is that
"inconsistent metadata" is no longer a read error, but instead a
record in lvmcache that can be addressed separate from the read.
- Treat a dev with no good mdas like a dev with no mdas, which is an
existing case we already handle.
- Don't use a fake vg "handle" for returning an error from vg_read,
or the vg_read_error function for getting that error number;
just return null if the vg cannot be read or used, and an error_flags
arg with flags set for the specific kind of error (which can be used
later for determining the kind of repair.)
- Saving an original copy of the vg metadata, for purposes of reverting
a write, is now done explicitly in vg_read instead of being hidden in
the vg_make_handle function.
- When a vg is not accessible due to "access restrictions" but is
otherwise fine, return the vg through the new error_vg arg so that
process_each_pv can skip the PVs in the VG while processing.
(This is a temporary accomodation for the way process_each_pv
tracks which devs have been looked at, and can be dropped later
when process_each_pv implementation dev tracking is changed.)
- vg_read does not try to fix or recover a vg, but now just reads the
metadata, checks access restrictions and returns it.
(Checking access restrictions might be better done outside of vg_read,
but this is a later improvement.)
- _vg_read now simply makes one attempt to read metadata from
each mda, and uses the most recent copy to return to the caller
in the form of a 'vg' struct.
(bad mdas were excluded during the scan and are not retried)
(old mdas were not excluded during scan and are retried here)
- vg_read uses _vg_read to get the latest copy of metadata from mdas,
and then makes various checks against it to produce warnings,
and to check if VG access is allowed (access restrictions include:
writable, foreign, shared, clustered, missing pvs).
- Things that were previously silently/automatically written by vg_read
that are now done by vg_write, based on the records made in lvmcache
during the scan and read:
. clearing the missing flag
. updating old copies of metadata
. clearing outdated pvs
. updating pv header flags
- Bad/corrupt metadata are now repaired; they were not before.
Test changes
------------
- A read command no longer writes the VG to repair it, so add a write
command to do a repair.
(inconsistent-metadata, unlost-pv)
- When a missing PV is removed from a VG, and then the device is
enabled again, vgck --updatemetadata is needed to clear the
outdated PV before it can be used again, where it wasn't before.
(lvconvert-repair-policy, lvconvert-repair-raid, lvconvert-repair,
mirror-vgreduce-removemissing, pv-ext-flags, unlost-pv)
Reading bad/old metadata
------------------------
- "bad metadata": the mda_header or metadata text has invalid fields
or can't be parsed by lvm. This is a form of corruption that would
not be caused by known failure scenarios. A checksum error is
typically included among the errors reported.
- "old metadata": a valid copy of the metadata that has a smaller seqno
than other copies of the metadata. This can happen if the device
failed, or io failed, or lvm failed while commiting new metadata
to all the metadata areas. Old metadata on a PV that has been
removed from the VG is the "outdated" case below.
When a VG has some PVs with bad/old metadata, lvm can simply ignore
the bad/old copies, and use a good copy. This is why there are
multiple copies of the metadata -- so it's available even when some
of the copies cannot be used. The bad/old copies do not have to be
repaired before the VG can be used (the repair can happen later.)
A PV with no good copies of the metadata simply falls back to being
treated like a PV with no mdas; a common and harmless configuration.
When bad/old metadata exists, lvm warns the user about it, and
suggests repairing it using a new metadata repair command.
Bad metadata in particular is something that users will want to
investigate and repair themselves, since it should not happen and
may indicate some other problem that needs to be fixed.
PVs with bad/old metadata are not the same as missing devices.
Missing devices will block various kinds of VG modification or
activation, but bad/old metadata will not.
Previously, lvm would attempt to repair bad/old metadata whenever
it was read. This was unnecessary since lvm does not require every
copy of the metadata to be used. It would also hide potential
problems that should be investigated by the user. It was also
dangerous in cases where the VG was on shared storage. The user
is now allowed to investigate potential problems and decide how
and when to repair them.
Repairing bad/old metadata
--------------------------
When label scan sees bad metadata in an mda, that mda is removed
from the lvmcache info->mdas list. This means that vg_read will
skip it, and not attempt to read/process it again. If it was
the only in-use mda on a PV, that PV is treated like a PV with
no mdas. It also means that vg_write will also skip the bad mda,
and not attempt to write new metadata to it. The only way to
repair bad metadata is with the metadata repair command.
When label scan sees old metadata in an mda, that mda is kept
in the lvmcache info->mdas list. This means that vg_read will
read/process it again, and likely see the same mismatch with
the other copies of the metadata. Like the label_scan, the
vg_read will simply ignore the old copy of the metadata and
use the latest copy. If the command is modifying the vg
(e.g. lvcreate), then vg_write, which writes new metadata to
every mda on info->mdas, will write the new metadata to the
mda that had the old version. If successful, this will resolve
the old metadata problem (without needing to run a metadata
repair command.)
Outdated PVs
------------
An outdated PV is a PV that has an old copy of VG metadata
that shows it is a member of the VG, but the latest copy of
the VG metadata does not include this PV. This happens if
the PV is disconnected, vgreduce --removemissing is run to
remove the PV from the VG, then the PV is reconnected.
In this case, the outdated PV needs have its outdated metadata
removed and the PV used flag needs to be cleared. This repair
will be done by the subsequent repair command. It is also done
if vgremove is run on the VG.
MISSING PVs
-----------
When a device is missing, most commands will refuse to modify
the VG. This is the simple case. More complicated is when
a command is allowed to modify the VG while it is missing a
device.
When a VG is written while a device is missing for one of it's PVs,
the VG metadata is written to disk with the MISSING flag on the PV
with the missing device. When the VG is next used, it is treated
as if the PV with the MISSING flag still has a missing device, even
if that device has reappeared.
If all LVs that were using a PV with the MISSING flag are removed
or repaired so that the MISSING PV is no longer used, then the
next time the VG metadata is written, the MISSING flag will be
dropped.
Alternative methods of clearing the MISSING flag are:
vgreduce --removemissing will remove PVs with missing devices,
or PVs with the MISSING flag where the device has reappeared.
vgextend --restoremissing will clear the MISSING flag on PVs
where the device has reappeared, allowing the VG to be used
normally. This must be done with caution since the reappeared
device may have old data that is inconsistent with data on other PVs.
Bad mda repair
--------------
The new command:
vgck --updatemetadata VG
first uses vg_write to repair old metadata, and other basic
issues mentioned above (old metadata, outdated PVs, pv_header
flags, MISSING_PV flags). It will also go further and repair
bad metadata:
. text metadata that has a bad checksum
. text metadata that is not parsable
. corrupt mda_header checksum and version fields
(To keep a clean diff, #if 0 is added around functions that
are replaced by new code. These commented functions are
removed by the following commit.)
uses vg_write to correct more common or less severe issues,
and also adds the ability to repair some metadata corruption
that couldn't be handled previously.
and implement it based on a device, not based
on a pv struct (which is not available when the
device is not a part of the vg.)
currently only the vgremove command wipes outdated
pvs until more advanced recovery is added in a
subsequent commit
The vg read and vg write cases need to update lvmcache
differently, so create separate functions for them.
The read case now handles checking for outdated mdas
and moves them aside into a new list to be repaired in
a subsequent commit.
The existing comment was desribing the correct behavior,
but the code didn't match. The commit is successful if
one mda was committed. Making it depend on the result of
the internal lvmcache update was wrong.
Have the caller pass the label_sector to the read
function so the read function can set the sector
field in the label struct, instead of having the
read function return a pointer to the label for
the caller to set the sector field.
Also have the read function return a flag indicating
to the caller that the scanned device was identified
as a duplicate pv.
Outdated PVs hold metadata for VG from which they
have been removed. Add the ability to keep track
of these in lvmcache.
This will be used for more advanced repair in a
subsequent commit.
mda's that cannot be processed by lvm because of
some corruption can be kept on a separate list.
These will be used for more advanced repair in a
subsequent commit.
When reading metadata headers and text, use a new set
of flags to identify specific errors that are seen.
These will be used for more advanced repair in a
subsequent commit.
If udev info is missing for a device, (which would indicate
if it's an MD component), then do an end-of-device read to
check if a PV is an MD component. (This is skipped when
using hints since we already know devs in hints are good.)
A new config setting md_component_checks can be used to
disable the additional end-of-device MD checks, or to
always enable end-of-device MD checks.
When both hints and udev info are disabled/unavailable,
the end of PVs will now be scanned by default. If md
devices with end-of-device superblocks are not being
used, the extra I/O overhead can be avoided by setting
md_component_checks="start".
Save the previous duplicate PVs in a global list instead
of a list on the cmd struct. dmeventd reuses the cmd struct
for multiple commands, and the list entries between commands
were being freed (apparently), causing a segfault in dmeventd
when it tried to use items in cmd->unused_duplicate_devs
that had been saved there by the previous command.
Use the recently added dump routines to produce the
old/traditional pvck output, and remove the code that
had been used for that.
The validation/checking done by the new routines means
that new lines prefixed with CHECK are printed for
incorrect values.
Recent kernel version from kernel commit:
de7180ff908b2bc0342e832dbdaa9a5f1ecaa33a
started to report in cache status line new flag:
no_discard_passdown
Whenever lvm spots unknown status it reports:
Unknown feature in status:
So add reconginzing this feature flag and also report this with
'lvs -o+kernel_discards'
When no_discard_passdown is found in status 'nopassdown' gets reported
for this field (roughly matching what we report for thin-pools).
Add 'pvck --dump headers' to print all the
lvm ondisk structs. Also checks the values
and prints any problems.
The previous dump metadata is also converted to
use these same routines, which do not depend on lvm
fully scanning/reading/processing the headers and
metadata on disk. This makes it useful to get data in
cases where there is corruption that would otherwise
prevent the normal functions from working.
The test was failing consistently on some VMs (F25), and inconsistently
on Rawhide.
With increased latency these failures are no longer reproducible.
Reproducer:
make check_lvmpolld T=pvmove-resume-multiseg.sh
The new command 'pvck --dump metadata PV' will extract
the current version of VG metadata from a PV for testing
and debugging. --dump metadata_area extracts the entire
text metadata area.
teardown after the test was failing, probably because
of uncoordinated udev actions running on the test
system. Try to avoid this by doing some work before
teardown.
lvm2 till version 2.02.169 (commit 78d004efa8)
was printing invalid creation_time argument into metadata on 32bit arch.
However with commit ba9820b142 we started
to properly validate all input numbers and thus we refused to accept
invalid metadata with 'garbage' string - but this results in the
situation where metadata produced on older lvm2 on 32 bit architecture
will become unreadable after upgrade.
To fix this case - extend libdm parser in a way, that whenever we
find error integer value, we also check if the parsed value is not for
creation_time node and in this case we let the metadata pass through
with made-up date 2018-05-24 (release date of 2.02.169).
commit aa75b31db5
"pvscan: handle case of scanning PV without metadata last"
failed to recognize that an arg may be null in the case of
'pvscan --cache' (without -aay) which does not keep track
of complete VGs because it does not need to activate them.
The scanning rework missed removing this instance of label scan.
It's no longer needed because of the way that label scan is always
run once from the start of the command. This unnecessary scan
would be triggered by running 'pvs @tag'.
If an md component is not excluded by other means and
vg_read is used to read metadata from it, then this new
check compares the device size with the PV size, and runs
a full md check on the device if the sizes don't match.
and don't call it from inside pvcreate_each_device.
This avoids having to repeat it for users of
pvcreate_each_device (pvcreate/pvremove/vgcreate/vgextend.)
There have been two file locks used to protect lvm
"global state": "ORPHANS" and "GLOBAL".
Commands that used the ORPHAN flock in exclusive mode:
pvcreate, pvremove, vgcreate, vgextend, vgremove,
vgcfgrestore
Commands that used the ORPHAN flock in shared mode:
vgimportclone, pvs, pvscan, pvresize, pvmove,
pvdisplay, pvchange, fullreport
Commands that used the GLOBAL flock in exclusive mode:
pvchange, pvscan, vgimportclone, vgscan
Commands that used the GLOBAL flock in shared mode:
pvscan --cache, pvs
The ORPHAN lock covers the important cases of serializing
the use of orphan PVs. It also partially covers the
reporting of orphan PVs (although not correctly as
explained below.)
The GLOBAL lock doesn't seem to have a clear purpose
(it may have eroded over time.)
Neither lock correctly protects the VG namespace, or
orphan PV properties.
To simplify and correct these issues, the two separate
flocks are combined into the one GLOBAL flock, and this flock
is used from the locking sites that are in place for the
lvmlockd global lock.
The logic behind the lvmlockd (distributed) global lock is
that any command that changes "global state" needs to take
the global lock in ex mode. Global state in lvm is: the list
of VG names, the set of orphan PVs, and any properties of
orphan PVs. Reading this global state can use the global lock
in sh mode to ensure it doesn't change while being reported.
The locking of global state now looks like:
lockd_global()
previously named lockd_gl(), acquires the distributed
global lock through lvmlockd. This is unchanged.
It serializes distributed lvm commands that are changing
global state. This is a no-op when lvmlockd is not in use.
lockf_global()
acquires an flock on a local file. It serializes local lvm
commands that are changing global state.
lock_global()
first calls lockf_global() to acquire the local flock for
global state, and if this succeeds, it calls lockd_global()
to acquire the distributed lock for global state.
Replace instances of lockd_gl() with lock_global(), so that the
existing sites for lvmlockd global state locking are now also
used for local file locking of global state. Remove the previous
file locking calls lock_vol(GLOBAL) and lock_vol(ORPHAN).
The following commands which change global state are now
serialized with the exclusive global flock:
pvchange (of orphan), pvresize (of orphan), pvcreate, pvremove,
vgcreate, vgextend, vgremove, vgreduce, vgrename,
vgcfgrestore, vgimportclone, vgmerge, vgsplit
Commands that use a shared flock to read global state (and will
be serialized against the prior list) are those that use
process_each functions that are based on processing a list of
all VG names, or all PVs. The list of all VGs or all PVs is
global state and the shared lock prevents those lists from
changing while the command is processing them.
The ORPHAN lock previously attempted to produce an accurate
listing of orphan PVs, but it was only acquired at the end of
the command during the fake vg_read of the fake orphan vg.
This is not when orphan PVs were determined; they were
determined by elimination beforehand by processing all real
VGs, and subtracting the PVs in the real VGs from the list
of all PVs that had been identified during the initial scan.
This is fixed by holding the single global lock in shared mode
while processing all VGs to determine the list of orphan PVs.
wipe_lv knows it's going to write the device, so it
can open rw from the start. It was opening readonly,
and then dev_write needed to reopen it readwrite.
To avoid tiny race on checking arrival of signal and entering select
(that can latter remain stuck as signal was already delivered) switch
to use pselect().
If it would needed, we can eventually add extra code for older systems
without pselect(), but there are probably no such ancient systems in
use.
Handle the case where pvscan --cache -aay (with no dev args)
gets to the final PV, completing the VG, but that final PV does not
have VG metadata. In this case, we need to use VG metadata from a
previously scanned PV in the same VG, which we saved for this
possibility. Using this saved metadata, we can find which VG
this PVID belongs to, and then check if that VG is now complete,
and if so add the VG name to the list of complete VGs to be
autoactivated.
When hints are invalid and ignored, the list of hints
could be non-empty (from additions before an invalid
hint was found). This confused the calling code which
was checking for an empty list to see if hints were used.
Ensure the list is empty when hints are not used.
We already used Conflicts=shutdown target to stop LVM2 services on shutdown.
But we still missed the ordering - the shutdown.target should be reached
only after all the services are really stopped.
Reported here: https://github.com/lvmteam/lvm2/issues/17
If a device looks like a PV, but its size does not
match the PV size in the metadata, then skip it for
purposes of autoactivation. It's probably not wrong
device for the PV.
In the past, the first 'pvscan --cache -aay dev' command
to run on the system would initialize the pvs_online dir
by scanning all devs and creating online files for all pvs
it found, and then autoactivating the VG (if complete) for
the named dev. The idea was that the system may not have
been able to run pvscan commands for early devices, so the
first pvscan to run would need to "make up" for any devices
that had appeared previously, which the system was unable to
scan. The problem or idea of making up for missed scans is
historical and should no longer be needed, so remove this
special init case.
When pvscan is run for the initialization case (the first
pvscan run on the system), it scans all devs and creates
online files for all PVs it finds. Previously it would
then autoactivate every complete VG, but change this to
only autoactive the (complete) VG corresponding to the
named device arg(s).
- remove reference to locking_type which is no longer used
- remove references to adopting locks which has been disabled
- move some sanlock-specific info out of a general section
- remove info about doing automatic lockstart by the system
since this was never used (the resource agent does it)
- replace info about lvextend and manual refresh under gfs2
with a description about the automatic remote refresh
This reverts 518a8e8cfb
"lvmlockd: activate mirror LVs in shared mode with cmirrord"
because while activating a mirror LV with cmirrord worked,
changes to the active cmirror did not work.
When data are growing, adapt also size of metadata.
As we get way too many reports from users doing huge growths of
data portion while keep metadata small and avoiding using monitoring.
So to enhance the user-experience in case user requests grown of
thin-pool (without passing PV list for growth) - lvm2 will automaticaly
grown also the metadata part of thin-pool (if possible).
Add function for estimation of thin-pool metadata size for given size of
data. Function is using already existing internal API so it can
be reused for resize of thin-pool data.
Update the previous commit to leave the vgname as
an arg instead of moving it into the select option,
(the compound select option rule is confusing the
dlm arg processing.)
Using --select 'lvname=LV && vgname=VG' avoids the problem
of the lvchange exit code not distinguishing an actual error
result vs the VG or LV not existing. (This is in case there
is an odd dlm/gfs2 setup where some nodes are running the dlm
but do not have access to the VG.)
When lvextend extends an LV that is active with a shared
lock, use this as a signal that other hosts may also have
the LV active, with gfs2 mounted, and should have the LV
refreshed to reflect the new size. Use the libdlmcontrol
run api, which uses dlm_controld/corosync to run an
lvchange --refresh command on other cluster nodes.
When an LV is active with a shared lock, a command can be
run to change the LV with --lockopt skiplv (to override the
exclusive lock the command ordinarily requires which is not
compatible with the outstanding shared lock.)
In this case, other commands may have the LV active and may
need to refresh the LV, so print warning stating this.
Udev is running udev-rule action upon 'resume'.
However lvm2 in special case is doing replacement of
'soon-to-be-removed' device with 'error' target for resuming
and then follows actual removal - the sequence is usually quick,
so when udev start action - it can result in 'strange' error
message in kernel log like:
Process '/usr/sbin/dmsetup info -j 253 -m 17 -c --nameprefixes --noheadings --rows -o name,uuid,suspended' failed with exit code 1.
To avoid this - we need to ensure there is synchronization wait for udev
between 'resume' and 'remove' part of this process.
However existing code put strict requirement to avoid synchronizing with
udev inside critical section - but this originally came from requirement
to not do anything special while there could be devices in
suspend-state. Now we are able to see differnce between critical section
with or without suspended devices. For udev synchronization only
suspended devices are prohibited to be there - so slightly relax
condition and allow calling and using 'fs_sync()' even inside critical
section - but there must not be any suspended device.
Allow using caching with VDO.
User can either cache a single vdopool or
a vdo LV - difference when the caching is put-in depends on a use-case
and it's upto user to decide which kind of speed is expected.
Internal detection of SCSI device being in-use by DM mpath has been
performed several times for each component device - this could be
eventually racy - so instead when we do remember 1st. checked result
for device being mpath and use it consistenly over the filter runtime.
Move DM usage into dev_manager.c source file.
Also convert STATUS to INFO ioctl - as that's enough
to obtain UUID - this also avoid issuing unwanted flush on checked DM
device for being mpath.
Fix to previous commit
"pvscan: ignore online for shared and foreign PVs"
which was incorrectly considering a PV foreign if its
VG had no system ID when the host did have a system ID.
Activation would not be allowed anyway, but we can
check for these cases early and avoid wasted time in
pvscan managing online files an attempting activation.
This is the default bcache size that is created at the
start of the command. It needs to be large enough to
hold a single copy of metadata for a given VG, or the
VG cannot be read or written (since the entire VG would
not fit into available memory.)
Increasing the default reduces the chances of anyone
needing to increase the default to use their VG.
The size can be set in lvm.conf global/io_memory_size;
the lower limit is 4 MiB and the upper limit is 128 MiB.
When a single copy of metadata gets within 1MB of the
current io_memory_size value, begin printing a warning
that the io_memory_size should be increased.
which defines the amount of memory that lvm will allocate
for bcache. Increasing this setting is required if it is
smaller than a single copy of VG metadata.
and "cachepool" to refer to a cache on a cache pool object.
The problem was that the --cachepool option was being used
to refer to both a cache pool object, and to a standard LV
used for caching. This could be somewhat confusing, and it
made it less clear when each kind would be used. By
separating them, it's clear when a cachepool or a cachevol
should be used.
Previously:
- lvm would use the cache pool approach when the user passed
a cache-pool LV to the --cachepool option.
- lvm would use the cache vol approach when the user passed
a standard LV in the --cachepool option.
Now:
- lvm will always use the cache pool approach when the user
uses the --cachepool option.
- lvm will always use the cache vol approach when the user
uses the --cachevol option.
For users who do not want all of the fields included
in debug lines, let them specify in lvm.conf which
fields to include. timestamp, command[pid], and
file:line fields can all be disabled.
Without this, the output from different commands in a single
log file could not be separated.
Change the default "indent" setting to 0 so that the default
debug output does not include variable spaces in the middle
of debug lines.
Use the correct loop variable within the loop, instead of reusing the
initial value. Table lines after the first don't get terminated in
the right place.
Signed-off-by: Kurt Garloff <kurt@garloff.de>
When a VG has multiple PVs, and all those PVs come online
at the same time, concurrent pvscans for each PV will all
create the individual pvid files, and all will often see
the VG is now complete. This causes each of the pvscan
commands to think it should activate the VG, so there
are multiple activations of the same VG. The vg lock
serializes them, and only the first pvscan actually does
the activation, but there is still a lot of extra overhead
and time used by the other pvscans that attempt to
activate the already active VG. This can lead to a backlog
of pvscans and timeouts.
To fix this, this adds a new /run/lvm/vgs_online/ dir that
works like the existing /run/lvm/pvs_online/ dir. Each pvscan
that wants to activate a VG will first try to exlusively create
the file vgs_online/<vgname>. Only the first pvscan will
succeed, and that one will do the VG activation. The other
pvscans will find the vgname file exists and will not do the
activation step.
When a PV goes offline, the vgs_online file for the corresponding
VG is removed. This allows the VG to be autoactivated again
when the PV comes online again. This requires that the vgname be
stored in the pvid files.
Use a file lock to ensure that only one pvscan will do
initialization of pvs_online, otherwise multiple concurrent
pvscans may all see an empty pvs_online directory and
do initialization.
The pvscan that is doing initialization should also only
attempt to activate complete VGs.
When aay was included in the pvscan --cache command,
the activation part was complaining about the unusual
state of the hint file since it had been recreated
just prior.
udev_dev_is_md_component and udev_dev_is_mpath_component
are not used for obtaining the device list, but they still
use libudev for device info. When there are problems with
udev, these functions can get stuck. So, use the existing
obtain_device_list_from_udev config setting to also control
whether these "is component" functions are used, which gives
us a way to avoid using libudev entirely when it's causing
problems.
Just like we support for thin-pool syntax:
lvcreate --thinpool new_tpoolname -L Size vg
add same support logic with for vdo-poo:
lvcreate --vdopool new_vpoolname -L Size vg
Also move description of syntax bellow thin-pool, so it's
correctly ordered in generated man page.
Whenever thin-pool chunk size is unspecified and left for lvm calculation
try to select the size as nearest highest power-of-2 instead of
just being a multiple of 64KiB.
Fixing recent commit 022ebb0cfe
Resize already has size that needs to be counted with,
otherwise upsizing operation could turn into size reduction one.
Since the parse_vdo_pool_status() become vdo_manip API part,
and there will be no 'dm' matching status parser,
the API can be simplified and closely match thin API here.
Now with newer VDO kvdo target we can start to use standard mechanism
to enable resize of VDO volumes.
VDO pool can be grown.
Virtual volume grows on top of VDO pool when is not big enough.
Reduced VDOLV is calling discard for reduced areas - this can
take long time!
TODO: implement some pollable mechanism for out-of-lock TRIM.
When using 'lvcreate -l100%VG' and there is big disproportion between
real available space and requested setting - automatically fallback
to 100%FREE.
Difference can be seen when VG is big and already most space was
allocated, so the requestion 100%VG can end (and by spec for % modifier
it's correct) as LV with size of 1%VG. Usually this is not a big
problem - buit in some cases - like cache-pool allocation, this
can result a big difference for chunksize selection.
With this patch it's more closely match common-sense logic without
the need of reitteration of too big changes in lvm2 core ATM.
TODO: in the future there should be allocator solving all allocations
in a single call.
When using caches with BIG pool size (>TB) there is required
to use relatively huge chunk size. Once the chunksize has
got over 1MiB - kernel cache target stopped writing such chunks
back on this if the migration_threshold remained on default 1MiB
(2048 sectors) size.
This patch ensure, DM layer will not let pass table line which
has not big enough migration threshold that can let pass
at least 8 chunks (independently of lvm2 metadata).
lvm uses 'minimum_io_size' name to exactly match VDO naming here,
however in all common cases _size is using 'sector/512b' unit.
But in this case the value is in bytes and can have only 2 values:
either 512 or 4096.
It's probably not worth to rename it internaly, so we can just
drop comment - instead of using 1 or 8.
Thought let's think about it....
Lvm can at times have duplicate names. When this happens the daemon will
internally use vg_name:vg_uuid as the name for lookups, but display just
the vg_name externally. If an API user uses the Manager.LookUpByLvmId and
queries the vg name they will only get returned one result as the API
can only accommodate returning 1. The one returned is the first instance
found when sorting the volume groups by UUID.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1583510
When we have two logical volumes which switch their names at the
same time we are left with incorrect lookups. Anytime we find
an entry by doing a lookup by UUID or by name we will ensure
that the lookups are indeed correct.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1642176
An idea from Zdenek for better ensuring valid hints by invalidating
them when pvscan --cache <device> sees a new PV, which is a case
where we know that hints should be invalidated. This is triggered
from systemd/udev logic, and there may be some cases where it would
invalidate hints that the existing methods wouldn't detect.
If there are two independent scripts doing:
vgchange --lockstart vg
lvchange -ay vg/lv
The first vgchange to do the lockstart will wait for
the lockstart to complete before returning.
The second vgchange to do the lockstart will see that
the start is already in progress (from the first) and
will do nothing. This means the second does not wait
for any lockstart to complete, and moves on to the
lvchange which may find the lockspace still starting
and fail.
To fix this, make the vgchange lockstart command
wait for any lockstart's in progress to complete.
Save the list of PVs in /run/lvm/hints. These hints
are used to reduce scanning in a number of commands
to only the PVs on the system, or only the PVs in a
requested VG (rather than all devices on the system.)
The systemd generators are executed very early during the switch
from initramfs to system partition and the syslog is not yet fully
operational - it may cause blocking, if some debug logging is enabled
at the same time in /etc/lvm/lvm.conf log{} section.
To avoid timeouting and killing this generator - rather enhance lvm
code to suppress any syslog communication when LVM_SUPPRESS_SYSLOG
envvar is set.
Use of this envvar is needed since the parsing of i.e. cmdline options
that could eventually override lvm.conf setting happens in this case
way too late and number of lines could have been already streamed to
syslog.
Fix a scenario where global/event_activation setting is not found. In
this case we need to take default value just like lvm tools do when
executed. So use "lvmconfig --type full".
Also, if we fail to execute lvmconfig for whatever reason, fallback to
generating the activation units as failsafe action.
Reported by: Bastian Blank <waldi debian org>
When initializing an LV to hold the writecache, use wipe_lv()
which looks for specific signatures on the LV.
Wiping signatures is not necessary, but printing a warning
that names a specific signature (in addition to the existing
generic warning/confirmation) may help if a user accidentally
specifies the wrong LV which contains something important.
Detach function return 0 for error and 1 for success.
Add missing log errors from failing deactivation.
Add missing log error from failing synchronization.
In few cases error paths from initialization were returned as
'success == 1'.
Also assing num_mb with single compare checking valid sector_size.
For dumb compiler make num_mb always defined.
Since configure.h is a generated header and it's missing traditional
ifdefs preambule - it can be included & parsed multiple times.
Normally compiler is fine when defines have same value and there is
no warning - yet we don't need to parse this several times
and by adding -include directive we can ensure every file
in the package is rightly compile with configure.h as the
first header file.
If we are running the test where the device is /dev/* we will will
run the unit tests 'test_nesting' and 'test_pv_symlinks'. Otherwise
we will skip them.
When a VG is exported, the 'fullreport' returns an exit code of 5, but
otherwise returns the data we are wanting.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
In some cases we get stuck where we are unable to retrieve the current
state of lvm as we are encountering an error. When the error is
persistent we will log and exit the daemon instead of consuming vast
amounts of resources.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Drop very old original format of VDO target and focus on V2 version.
So some variables were renamed or replaced.
There is no compatibility preserved (with assumption so far this is
experimental feature and there is no real user).
Note - version currently VDO calls this version 6.2.
A bit of chicken & egg problem - dmeventd needs to use old libdm library.
VDO is only part of new device_mapper internal library.
So include directly source file for parsing status - this fixes usability
problem of VDO plugin introduced with previous Makefile reshaping
patchset.
NOTE: source file needs to be keep then compilable in both environments.
Also add missing copyright header.
This is a followup patch to commit edb72cb70c
to support related lvm2 test suite tests.
A 'global/support_mirrored_mirror_log' bool configuration variable gets
introduced allowing the creation of, or conversion to mirrored 'mirror'
logs if set. The capability to create these in turn allows the rest of
the tests to perform activation of such existing LVs and their conversions
to disk/core 'mirror' logs.
Display a disclaimer warning if enabled that this is not for regular use.
Add definition of the enabled config option to respective test scripts.
Related: rhbz1643562
With older gcc - we need to resolve symbols linked with devmapper-event
that is now using -ldevmapper.
Also add forgotten systemd library needed for dbus notification.
Avoid doing hard set of LIBS var,
so if the LIBS is set before 'include make.tmpl' it's not lost.
This gives better control over order of linked libraries.
Since there is very little change there will be any new devel going
to happing with cmirror - avoid eating extra disk space and link
with already installed libdm which implements all use basic
function of dm list
Since dmeventd is 'libdm' based project, it needs to link
libdm library instead of its internal version
An external users may provide plugins loadeable by dmeventd.
So external user of libdevmapper-event library has no other option
then to link with released libdevmapper library.
The complexity comes with lvm2 plugins.
The lvm2 plugin itself uses internal version of device_mapper,
but libdevmapper-event usage is libdm based - so there needs to be avoided
any breakage on compatibility of internal i.e. dm_task_run structures.
TODO: most likely dmeventd itself should be moved into libdm/dm-tools dir,
and only lvm2 plugins should be created as part of lvm project,
but those still need to link with libdevmapper.
If we don't know the meaning we will return the key with default text
instead of raising an exception and taking the daemon out in the
process.
Resolves: rhbz1657950
When we get bug reports we may not get the entire log, so lets
dump the fight recorder from newest to oldest as the one we
are interested in was likely to be the last command run.
Scenario: Given an existed LV `lvol0`, I want to create another LV
on the PVs used by `lvol0`.
I use `build_parallel_areas_from_lv()` to obtain the `pv_list` of each segments.
However, the returned `pv_list` is not properly initialized, which causes
segfault in subsequent operations.
Ensure configure.h is always 1st. included header.
Maybe we could eventually introduce gcc -include option, but for now
this better uses dependency tracking.
Also move _REENTRANT and _GNU_SOURCE into configure.h so it
doesn't need to be present in various source files.
This ensures consistent compilation of headers like stdio.h since
it may produce different declaration.
There's a small window during creation of a new RaidLV when
rmeta SubLVs are made visible to wipe them in order to prevent
erroneous discovery of stale RAID metadata. In case a crash
prevents the SubLVs from being committed hidden after such
wiping, the RaidLV can still be activated with the SubLVs visible.
During deactivation though, a deadlock occurs because the visible
SubLVs are deactivated before the RaidLV.
The patch adds _check_raid_sublvs to the raid validation in merge.c,
an activation check to activate.c (paranoid, because the merge.c check
will prevent activation in case of visible SubLVs) and shares the
existing wiping function _clear_lvs in raid_manip.c moved to lv_manip.c
and renamed to activate_and_wipe_lvlist to remove code duplication.
Whilst on it, introduce activate_and_wipe_lv to share with
(lvconvert|lvchange).c.
Resolves: rhbz1633167
In RHEL7 we marked mirrored mirror logs as deprecated and
added a related message. This patch prohibits creating new
'mirror' LVs with that log type or converting existing LVs
to have one.
Existing LVs with mirrored mirror log can be activated
and converted to disk/core logs.
Avoid double deprecation message when running lvconvert.
Resolves: rhbz1643562
Patch 668c9d0762 introduced regression,
since the code here would actually always return failing result.
Replace it with more simple call to strndup().
For some targets we do not want to generate dependencies.
Also add note about usage of such Makefile - it might be
possibly better to rename it to different filename to avoid
any confusion.
commit de28637
scan: use full md filter when md 1.0 devices are present
missed the fact that md superblock version 0.90 also puts
metadata at the end of the device, so the full md filter
needs to be used when either 0.90 or 1.0 is present.
. When using default settings, this commit should change
nothing. The first PE continues to be placed at 1 MiB
resulting in a metadata area size of 1020 KiB (for
4K page sizes; slightly smaller for larger page sizes.)
. When default_data_alignment is disabled in lvm.conf,
align pe_start at 1 MiB, based on a default metadata area
size that adapts to the page size. Previously, disabling
this option would result in mda_size that was too small
for common use, and produced a 64 KiB aligned pe_start.
. Customized pe_start and mda_size values continue to be
set as before in lvm.conf and command line.
. Remove the configure option for setting default_data_alignment
at build time.
. Improve alignment related option descriptions.
. Add section about alignment to pvcreate man page.
Previously, DEFAULT_PVMETADATASIZE was 255 sectors.
However, the fact that the config setting named
"default_data_alignment" has a default value of 1 (MiB)
meant that DEFAULT_PVMETADATASIZE was having no effect.
The metadata area size is the space between the start of
the metadata area (page size offset from the start of the
device) and the first PE (1 MiB by default due to
default_data_alignment 1.) The result is a 1020 KiB metadata
area on machines with 4KiB page size (1024 KiB - 4 KiB),
and smaller on machines with larger page size.
If default_data_alignment was set to 0 (disabled), then
DEFAULT_PVMETADATASIZE 255 would take effect, and produce a
metadata area that was 188 KiB and pe_start of 192 KiB.
This was too small for common use.
This is fixed by making the default metadata area size a
computed value that matches the value produced by
default_data_alignment.
The pvscan systemd service for autoactivation was
mistakenly dropped along with the lvmetad related
services.
The activation generator program now looks at the new
lvm.conf setting "event_activation" (default 1) to
switch between event activation and direct activation.
Previously, the old use_lvmetad setting was used to
switch between event and direct activation.
instead of a separate --writecacheblocksize option.
writecache block_size is not technically a setting,
but it can borrow the option as a special case.
fix lseek error check
fix read/write error checks
handle zero return from read and write
don't return an error for short io
fix partial read/write loop
io_setup() for aio may fail if a system has reached the
aio request limit. In this case, fall back to using
sync io. Also, lvm use of aio can be disabled entirely
with config setting global/use_aio=0.
The system limit for aio requests can be seen from
/proc/sys/fs/aio-max-nr
The current usage of aio requests can be seen from
/proc/sys/fs/aio-nr
The system limit for aio requests can be increased by
setting fs.aio-max-nr using sysctl.
Also add last-byte limit to the sync io code.
For embeded reshaping operation require higher driver version.
(otherwise we get:
Converting vg/LV1 from raid6 (same as raid6_zr) is directly possible to the following layouts:
raid6_nc
raid6_nr
raid6_la_6
raid6_ls_6
raid6_ra_6
raid6_rs_6
raid6_n_6
lvmpolld ATM is not desingned to preserve interval checking
in the same way the 'lvconvert' tool is doing - so the passed
'-i 40' is not respected and lvmpolld autonomously checks
state of conversion and updates lvm2 metadata and dm tables
when needed.
So skip portion of test that relayed on this and preserve this logic
only for command line invocation and forking of polling process
where the interval of checking is under full control.
Although we primarily want to check externally used libdevmapper
library, out internal all.h is still keeping all symbols
as the original library has - so for simpler compilation keep
using this private copy for defining needed symbols.
Just for case ensure compiler is not able to optimize
memset() away for resources that are released.
This idea of using memory barrier is taken from openssl.
Other options would be to check for 'explicit_bzero' function.
When preparing ioctl buffer and flatting all parameters,
add table parameters only to ioctl that do process them.
Note: list of ioctl should be kept in sync with kernel code.
DM_DEVICE_CREATE with table is doing several ioctl operations,
however only some of then takes parameters.
Since _create_and_load_v4() reused already existing dm task from
DM_DEVICE_RELOAD it has also kept passing its table parameters
to DM_DEVICE_RESUME ioctl - but this ioctl is supposed to not take
any argument and thus there is no wiping of passed data - and
since kernel returns buffer and shortens dmi->data_size accordingly,
anything past returned data size remained uncleared in zfree()
function.
This has problem if the user used dm_task_secure_data (i.e. cryptsetup),
as in this case binary expact secured data are erased from main memory
after use, but they may have been left in place.
This patch is also closing the possible hole for error path,
which also reuse same dm task structure for DM_DEVICE_REMOVE.
Add a little wait loop - since lvconvert started background process
and we need to wait till this bg task initiate its work -
adding ~1s loop should give reasonable enough time to start mirroring.
This could be seen as continuation of
6cee8f1b06.
Some test maching with old udev system shows problem,
where udev 'jumps on' leg device after mirror target
releases its legs - since udev does not (in this old case) skips
such device from scanning - it opens device - and this prevent
leg device to be deactivated - effectively such device stays
'leaked' in DM table invisibly to lvm2 command.
So to 'combat' this issue - if the device has '_mimage' in its name,
the retry of deactivation is automatically assumed.
NOTE: wider impact is unexpected - as it's touching only old mirror
target which is nowadays replaced with 'raid'.
In case there will be some problem identified - probably both patches
should be reverted.
With older systems and udevs we don't have control over scanning of lvm2
internal devices - so far we set retry-removal only for top-level LVs,
but in occasional cases udev can be 'fast enough' to open device for
scanning and prevent removal of such device from DM table.
So to combat this case - try to pass 'retry' flag also for removal of
internal device so see how many races can go away with this simple
patch.
Note: patch is applied only to internal version of libdm so the external
API remains working in the old way for now.
Commit 813347cf84 added extra validation,
however in this particular we do want to trim suffix out so rather ignore
resulting error code here intentionaly.
If a single, standard LV is specified as the cache, use
it directly instead of converting it into a cache-pool
object with two separate LVs (for data and metadata).
With a single LV as the cache, lvm will use blocks at the
beginning for metadata, and the rest for data. Separate
dm linear devices are set up to point at the metadata and
data areas of the LV. These dm devs are given to the
dm-cache target to use.
The single LV cache cannot be resized without recreating it.
If the --poolmetadata option is used to specify an LV for
metadata, then a cache pool will be created (with separate
LVs for data and metadata.)
Usage:
$ lvcreate -n main -L 128M vg /dev/loop0
$ lvcreate -n fast -L 64M vg /dev/loop1
$ lvs -a vg
LV VG Attr LSize Type Devices
main vg -wi-a----- 128.00m linear /dev/loop0(0)
fast vg -wi-a----- 64.00m linear /dev/loop1(0)
$ lvconvert --type cache --cachepool fast vg/main
$ lvs -a vg
LV VG Attr LSize Origin Pool Type Devices
[fast] vg Cwi---C--- 64.00m linear /dev/loop1(0)
main vg Cwi---C--- 128.00m [main_corig] [fast] cache main_corig(0)
[main_corig] vg owi---C--- 128.00m linear /dev/loop0(0)
$ lvchange -ay vg/main
$ dmsetup ls
vg-fast_cdata (253:4)
vg-fast_cmeta (253:5)
vg-main_corig (253:6)
vg-main (253:24)
vg-fast (253:3)
$ dmsetup table
vg-fast_cdata: 0 98304 linear 253:3 32768
vg-fast_cmeta: 0 32768 linear 253:3 0
vg-main_corig: 0 262144 linear 7:0 2048
vg-main: 0 262144 cache 253:5 253:4 253:6 128 2 metadata2 writethrough mq 0
vg-fast: 0 131072 linear 7:1 2048
$ lvchange -an vg/min
$ lvconvert --splitcache vg/main
$ lvs -a vg
LV VG Attr LSize Type Devices
fast vg -wi------- 64.00m linear /dev/loop1(0)
main vg -wi------- 128.00m linear /dev/loop0(0)
Since the test is currently directly working with live directory,
which can be getting updates from system's udev - add wait
for settling so removal of all known PVs happens after that.
But still this has major influce on behavior of running system,
so the test should never be executed on a user used box.
Check that type is always defined, if not make it explicit internal
error (although logged as debug - so catched only with proper lvm.conf
setting).
This ensures later type being NULL can't be dereferenced with coredump.
When sanlock_release returns an error because of an i/o
timeout releasing the lease on disk, lvmlockd should just
consider the lock released. sanlock will continue trying
to release the lease on disk after the original request
times out.
The lvmlock LV size was not adjusted correctly for 512 vs 4K
sector sizes which influence the lease size used by sanlock.
When lvmlock was automatically extended, the zeroing through
bcache wasn't working.
The choice about sector size and lease align size is
now made by the sanlock user, in this case lvmlockd.
This will allow lvmlockd to use other lease sizes in
the future. This also prevents breakage if hosts
report different sector sizes, or the sector size
reported by a device changes.
Since the stats handle is neither bound nor listed before the
attempt to call dm_stats_get_nr_regions(), it will always return
zero: this prevents reporting of any dmstats regions on any
device.
Remove the dm_stats_get_nr_regions() check and instead rely on
the correct return status from dm_stats_populate() which only
returns 0 in the case that there are regions to inspect (and
which logs a specific error for all other cases).
Reported-by: Bryan Gurney <bgurney@redhat.com>
It doesn't make sense to test or warn about the region count until
the stats handle has been listed: at this point it may or may not
contain valid information (but is guaranteed to be correct after
the list).
lvm uses a bcache block size of 128K. A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated. How much depends on
alignments. When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.
This patch is a quick hack to prevent lvm from writing
past the end of the metadata area.
This reverts commit 16ae968d24.
We need to come up with a better fix, because we fall short
wiping all known signatures when not using the wipe_lv API.
lvm metadata writes, commits and activations are performed
for (newly) allocated RAID metadata SubLVs to wipe any preexisiting
data thus avoid false raid superblock positives on RaidLV activation.
This process can be interrupted by command or system crashs
thus leaving stale SubLVs in the lvm metadata as a problem.
Because we hold an exclusive lock in this metadata SubLV wiping
process, we can address this problem by avoiding aforementioned
commits/writes/activations altogether wiping the respective first
sector of the first physical extent allocated to any metadata SubLV
directly via the existing dev_set() API.
Succeeds all LVM RAID tests.
Related: rhbz1633167
When persistent_filter_create() fails, the existing passed filter
should be preserved, so it could be properly deleted on
error path - so new pfilter is assigned instead.
Thin plugin started to use configuble setting to allow to configure
usage of external scripts - however to read this value it needed to
execute internal command as dmeventd itself has no access to lvm.conf
and the API for dmeventd plugin has been kept stable.
The call of command itself was not normally 'a big issue' until users
started to use higher number of monitored LVs and execution of command
got stuck because other monitored resource already started to execute
some other lvm2 command and become blocked waiting on VG lock.
This scenario revealed necesity to somehow avoid calling lvm2 command
during resource registration - but this requires bigger changes - so
meanwhile this patch tries to minimize the possibility to hit this race
by obtaining any configurable setting just once - such patch is small
and covers majority of problem - yet better solution needs to be
introduced likely with bigger rework of dmeventd.
TODO: avoid blocking registration of resource with execution of lvm2
commands since those can get stuck waiting on mutexes.
%ghost should not be used for directories created by systemd-tmpfiles.
This may prevent package from working right after installation without
invoking systemd-tmpfiles.
See: https://pagure.io/packaging-committee/issue/439
Previously the size was limited by checking if the
old and new copies of the metadata overlapped.
This generally limited the size to about half of
the total space, but it could be larger given the
size differences between old and new. Now add a
direct check to limit the size to half the space.
Remove another instance of an invalid check for metadata
overflow during read. The previous instance was removed
in commit 5fb15b193.
This was checking for metadata that that overflowed the
circular disk metadata buffer during read, but such metadata
cannot be written, so it shouldn't be possible to find see.
Also, the check was incorrect and could trigger when there
was no overflow.
This fixes a problem in commit e6bb780d24, in which the
back compat handling for the old locking_type=4 was
incorrectly translated to mean the same thing as --readonly,
which prevented activation because activation uses an
exclusive vg lock. Previously, locking_type=4 allowed
activation.
If we see locking_type 4 in an old config, translate it to
the new combination of --readonly and --sysinit, which we
now define to mean the --readonly behavior with an exception
to allow activation.
The vg_write/vg_commit code was imprecise, uncommented, and
hard to understand. Rewrite it with clearer, cleaner code,
extensive comments, descriptions of how it works, and add
more info in debugging output.
The minor changes in behavior are to things that were
either incorrect or probably unintended:
- vg_write/vg_commit no longer check that the current vgname at
the start of the text metadata matches the vgname being written.
This has already been done at least twice by the time they are
called, and repeating it again against the same cached data has
no use.
- A fragment of old removed code had been left behind that checked
if the old unused alignment policy would wrap. It was still
being checked to decide if the metadata area was full, which
could possibly cause an incorrect full metadata failure.
- vg_remove now clears both the raw_locns in the mda_header that
point to committed metadata (raw_locn slot 0) and precommitted
metadata (raw_locn slot 1). Previously it fully cleared the
committed slot, and would only clear the offset field in the
precommitted slot if it saw a problem with the metadata in the
vg being removed.
- read_metadata_location_summary was wrongly comparing the number
of wrapped bytes with an offset to report an error about the
metadata being too large. This wrong check is removed, it
could have resulted in erroneous errors.
Commit 989626926c
introduced 2 new tests
lvconvert-raid-takeover-linear_to_raid4.sh and
lvconvert-raid-takeover-raid4_to_linear.sh
which involve raid reshaping.
Bump the checked dm-raid target version to 1.14.0
which has reshape kernel fixes to avoid test suite
runs to hang.
Bump target version to 1.14.0 which contains fixes
for reshape deadlock/corruption to allow tests to
run once the respective fixes show up in kernels.
Remove now superfluous multi-core checks.
Resolves: rhbz1501145
Related: rhbz1514539
Related: rhbz1586123
Related: rhbz1613039
Allow "lvconvert --type linear RaidLV" on a raid4 LV
providing convenient interim steps to convert to linear.
Add respective new test
lvconvert-raid-takeover-raid4_to_linear.sh
and
lvconvert-raid-takeover-linear_to_raid4.sh
for linear to raid4 once on it.
When converting from striped/raid0/raid0_meta
to raid6 with > 2 stripes, allow possible
direct conversion (to raid6_n_6).
In case of 2 stripes, first convert to raid5_n to restripe
to at least 3 data stripes (the raid6 minimum in lvm2) in
a second conversion before finally converting to raid6_n_6.
As before, raid6_n_6 then can be converted
to any other raid6 layout.
Enhance lvconvert-raid-takeover.sh to test the
2 stripes conversions to raid6.
Resolves: rhbz1624038
devices/scan_lvs (default 1) determines whether lvm
will scan LVs for layered PVs. The lvm behavior has
always been to scan LVs, but it's rare for LVs to have
layered PVs, and much more common for there to be many
LVs that substantially slow down scanning with no benefit.
This is implemented in the usable filter, and has the
same effect as listing all LVs in the global_filter.
Add "lvm2-activation-generator: " prefix for all kmsg messages written by
lvm2-activation-generator so we can identify the message in global system log.
We need to have Ceph RBD devices mapped first before use in a stack
where LVM is on top so make sure rbdmap.service is called before
generated lvm2-activation-net.service.
On shutdown, we need to stop blk-availability first before we stop the
rbdmap.service.
Resolves: rhbz1623479
This is the number of concurrent async io requests that
the scan layer will submit to the bcache layer. There
will be an open fd for each of these, so it is best to
keep this well below the default limit for max open files
(1024), otherwise lvm may get EMFILE from open(2) when
there are around 1024 devices to scan on the system.
"lvconvert --type linear RaidLV" on striped and raid4/5/6/10
have to provide the convenient interim layouts. Fix involves
a cleanup to the convenience type function.
As a result of testing, add missing sync waits to
lvconvert-raid-reshape-linear_to_raid6-single-type.sh.
Resolves: rhbz1447809
Conversion to striped from raid0/raid0_meta is directly possible.
Fix a regression setting superfluous interim raid5_n conversion type
introduced by commit bd7cdd0b09.
Add new test script lvconvert-raid0-striped.sh.
Resolves: rhbz1608067
With improved mirror activation code --splitmirror issue poppedup
since there was missing proper preload code and deactivation
for splitted mirror leg.
If a mirror LV is listed in read_only_volume_list, it would
still be activated rw. The activation would initially be
readonly, but the monitoring function would immediately
change it to rw. This was a regression from commit
fade45b1d1 mirror: improve table update
The monitoring function needs to copy the read_only setting
into the new set of mirror activation options it uses.
When vgcreate does an automatic pvcreate, it opens the
dev with O_EXCL to ensure no other subsystem is using
the device. This exclusive fd remained in bcache and
prevented activation parts of lvm from using the dev.
This appeared with vgcreate of a sanlock VG because of
the unique combination where the dev is not yet a PV,
so pvcreate is needed, and the vgcreate also creates
and activates an internal LV for sanlock.
Fix this by closing the exclusive fd after it's used
by pvcreate so that it won't interfere with other
bits of lvm that may try to use the device.
Conversions of LVs under snapshot to thinpool or cachepool
correctly fail but leave them inactive and provide cryptic
error messages like 'Internal error: #LVs (10) != #visible
LVs (2) + #snapshots (1) + #internal LVs (5) in VG VG'.
Reject and provide better error message.
Resolves: rhbz1514146
The 'lvconvert LV' command def has caused multiple problems
for command matching because it matches the required options
of any lvconvert command. Any lvconvert with incorrect options
ends up matching 'lvconvert LV', which then produces an error
about incorrect options being used for 'lvconvert LV'. This
prevents suggestions from nearest-command partial command matches.
Add a special case for 'lvconvert LV' so that it won't be used
as a partial match for a command that has options specified.
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
When lvmetad is not used, use temporary files to record
which PVs have appeared. Use these temp files to determine
when a VG is complete, to trigger autoactivation.
This change allows us to remove lvmetad while keeping the
same autoactivation behavior that lvmetad provides.
The temp files are created in /run/lvm/pvs_online/ and are
named for the PVID of the PV. The files contain the
major:minor of the device the PV was read from.
e.g. if VG foo has dev1 and dev2, then:
. pvscan --cache -aay dev1
reads vg metadata from dev1
creates /run/lvm/pvs_online/<pvid-of-dev1>
checks if all vg->pvs are online: no
. pvscan --cache -aay dev2
reads vg metadata from dev2
creates /run/lvm/pvs_online/<pvid-of-dev2>
checks if all vg->pvs are online: yes
autoactivates vg
A 'pvscan --cache dev' (without -aay) still records that
dev is online.
A 'pvscan --cache --major X --minor Y' after a device is
gone will remove the temp file for it.
A 'pvscan --cache [-aay]' (no devs) resets the state of
temp files by removing them all, then scanning all devs
and creating temp files for PVs that are found.
If no online files exist, the first pvscan --cache scans
all devs and creates temp files for any PVs found.
The scope of the temp files is only pvscan, and they are only
used for pvscan-based autoactivation. No other commands are
concerned with or aware of these temp files. When lvm creates
or removes PVs, no attempt is made to update the temp files.
Support vgchange usage with VDO segtype.
Also changing extent size need small update for vdo virtual extent.
TODO: API needs enhancements so it's not about adding ifs() everywhere.
When user create vdo-pool - use different automatic name.
So unlike with traditional LVs using lvol0, lvol1
use vpool0, vpool1...
TODO: apply similar for thin-pool & cache-pool...
Checks whether VDO support is enabled.
Detects presence of 'vdoformat' tool which is required for to format VDO pool.
ATM build of VDO is NOT automatically enabled (None is default).
To enable build of LVM with VDO support use:
configure --with-vdo=internal
TODO: Maybe future version may switch to link some small VDO library for formating
(would require linking and package dependency).
To support autoloading of VDO dm target driver loading of 'kvdo'
kernel module is needed - ATM it's not using 'dm-vdo' name.
So to support this strange name - add temporarily solution to
autoload kvdo kernel module in this case.
Introduce VDO plugin for monitoring VDO devices.
This plugin can be used also by other users, as plugin checks
for UUID prefix 'LVM-' and run lvm actions only on those
devices.
Non LVM- device are only monitored and log warnings
when usage threshold reaches 80%.
Update makefile to link with more libs since now whole liblvm-internal.a
is linked-in and this library has futher dependencies.
Avoid including deps for run-unit-test.
Drop linking separate status.c as it's already linked via internal libs.
Check allocation of thin-pool works on 2PVs, when one is so full,
that even metadata do not fit there (as they need at least 2M,
while 99% of 63MB fills >62MB)
When lvm2 command is executed in test mode, discard ioctl is skipped.
This may cause even data-loose in case, issuing discard for released
areas was enabled and user 'tested' lvreduce.
When allocating thin-pool with more then 1 device - try to
allocate 'metadataLV' with reuse of log-type allocation for mirror LV.
It should be naturally place on other device then 'dataLV'.
However due to somewhat hard to follow allocation logic code,
it's been rejected allocation in cases where there was not
enough space for data or metadata on single PV, thus to successed,
usage of segments was mandatory.
While user may use:
allocation/thin_pool_metadata_require_separate_pvs=1
to enforce separe meta and data LV - on default settings, this is not
enable thus segment allocation is meant to work.
NOTE:
As already said - the original intention of this whole 'if()' is unclear,
so try to split this test into multiple more simple tests that are more readable.
TODO: more validation.
When node loading fails, there is not much the caller can do,
since there is 'unknown' set of devices preloaded.
Only suspend during preload knows future precommitted 'metadata',
so it's non-trivial to drop 'preloaded' entries with any later call.
However dm tree tracks newly loaded entries - so in this case it
may simplify the recovery path by dropping preloaded entries so
they are not leaked in the DM table.
Allow creation of any virtual segment type with just --virtualsize
specified without any real extent size give.
TODO: likely --type error,zero might be later enhanced to use -V
(along with -L) - but since those targets do not allocate real
space, supporting -V makes sense with them.
Amound of linked libraries grows.
Most of them we don't need to lock in, since we are not using
them in locked section, so skip locking them in memory.
It's important to lock memory beforo running SUSPEND ioctl - but whole
lvm preload runs in memory unlocked environment - as in this phase
memory allocation is allowed and is meant to happen.
Once all targets are preload and ready (confirmed from all targets)
we start suspending tree - and here the memory allocation (or i.e.
opening files) is no longer allowed - as it may cause kernel deadlock.
Commit 3f35146 added a check on the value returned by the
_display_info_cols() function:
1024 if (!_switches[COLS_ARG])
1025 _display_info_long(dmt, &info);
1026 else
1027 r = _display_info_cols(dmt, &info);
1028
1029 return r;
This exposes a bug in the dmstats code in _display_info_cols:
the fact that a device has no regions is explicitly not an error
(and is documented as such in the code), but since the return
code is not changed before leaving the function it is now treated
as an error leading to:
# dmstats list
Command failed.
When no regions exist.
Set the return code to the correct value before returning.
For better code reuse split _node_send_messages into commont
messaging part and separate _thin_pool_node_send_messages.
Patch makes it possible to better reuse common code for messaging
other targets.
udev creates a train wreck of events if we open devices
with RDWR. Until we can fix/disable/scrap udev, work around
this by opening RDONLY and then closing/reopening RDWR when
a write is needed. This invalidates the bcache blocks for
the device before writing so it can trigger unnecessary
rereading.
It's no longer needed. Clustered VGs are now handled in
the same way as foreign VGs, and as shared VGs that
can't be accessed:
- A command processing all VGs sees a clustered VG,
prints a message ("Skipping clustered VG foo."),
skips it, and does not fail.
- A command where the clustered VG is explicitly
named on the command line, prints a message and fails.
"Cannot access clustered VG foo, see lvmlockd(8)."
The option is listed in the set of ignored options for
the commands that previously accepted it. (Removing it
entirely would cause commands/scripts to fail if they
set it.)
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
Since we are using "DefaultDependencies=no" we do not get automatic STOP
job on socket connection - so automatically refuse connection on
shutdown by adding this Conflict definition to socket Unit.
Shuffle code for better readability as set of conditions was
hard to follow.
Make it obvious the refresh & activate path is handling
monitoring and polling on its own.
So the only --monitor and --poll option needs explicit care.
Option --monitor without option --poll will now as a result
of this patch NOT start polling.
So command: vgchange --monitor n is no longer a polling starter.
Restoring polling for activated volumes lost with my recent commit:
75fed05d3e and move start of polling
directly into _activate_lvs_in_vg() - as there we know exactly
if there was some volume even activated.
Also make it sharing same code for pvscan -aay.
With the move to top-level makefile - there are some issues
with subdir recursive makefile.
Make the building more tolerant for now until fully resolved.
The previous method for forcibly changing a clustered VG
to a local VG involved using -cn and locking_type 0.
Since those options are deprecated, replace it with
the same command used for other forced lock type changes:
vgchange --locktype none --lockopt force.
Shared VGs will generally be started and activated by
the resource agent. Without the agent, this script doesn't
have a good way to know which LVs to activate.
vgreduce, vgremove and vgcfgrestore were acquiring
the orphan lock in the midst of command processing
instead of at the start of the command. (The orphan
lock moved to being acquired at the start of the
command back when pvcreate/vgcreate/vgextend were
reworked based on pvcreate_each_device.)
vgsplit also needed a small update to avoid reacquiring
a VG lock that it already held (for the new VG name).
A few places were calling a function to check if a
VG lock was held. The only place it was actually
needed is for pvcreate which wants to do its own
locking (and scanning) around process_each_pv.
The locking/scanning exceptions for pvcreate in
process_each_pv/vg_read can be enabled by just passing
a couple of flags instead of checking if the VG is
already locked. This also means that these special
cases won't be enabled unknowingly in other places
where they shouldn't be used.
When pvmoving LV - the target for LV is a mirror so the validation
that checked the type is matching was incorrect.
While we need a more generic enhancment of LVS output for pvmoved LVs,
for now at least stop showing internal errors and 'X' symbols in attrs.
The last commit related to this was incomplete:
"Implement lock-override options without locking type"
This is further reworking and reduction of the locking.[ch]
layer which handled all clustering, but is now only used
for file locking. The "locking types" that this layer
implemented were removed previously, leaving only the
standard file locking. (Some cluster-related artifacts
remain to be cleared out later.)
Command options to override or modify locking behavior
are reimplemented here without using the locking types.
Also, deprecated locking_type values are recognized,
and implemented as if one of the equivalent override
options was set.
Options that override file locking are:
. --nolocking disables all file locking.
. --readonly grants read lock requests without actually
taking a file lock, and refuses write lock requests.
. --ignorelockingfailure tries to set up file locks and
uses them normally if possible. When not possible, it
behaves like --readonly, but allows activation.
. --sysinit is the same as ignorelockingfailure.
. global/metadata_read_only acquires actual read file
locks, and refuses write lock requests.
(Some of these options could probably be deprecated
because they were added as workarounds to various
locking_type behaviors that are now deprecated.)
The locking_type setting now has one valid value: 1 which
refers to standard file locking. Configs that contain
deprecated values are recognized and still work in
largely the same way:
. 0 disabled all locking, now implemented like --nolocking
is set. Allow the nolocking option in all commands.
. 1 is the normal file locking setting and is unchanged.
. 2 was for external locking which was not used, and
reverts to normal file locking.
. 3 was for cluster/clvm. This reverts to normal file
locking, and prints messages about lvmlockd.
. 4 was equivalent to readonly, now implemented like
--readonly is set.
. 5 disabled all locking, now implemented like
--nolocking is set.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
Add tests for linear <-> striped|raid* conversions.
Add region_size config to reshape tests to avoid test
failures in case of it being defined unexpectedly in lvm.conf.
Related: rhbz1439925
Related: rhbz1447809
"lvconvert --type {linear|striped|raid*} ..." on a striped/linear
LV provides convenience interim type to convert to the requested
final layout similar to the given raid* <-> raid* conveninece types.
Whilst on it, add missing raid5_n convenince type from raid5* to raid10.
Resolves: rhbz1439925
Resolves: rhbz1447809
Resolves: rhbz1573255
Fixes breakage from the recent libdm split. Though these didn't
ever appear to be linked (could they have piggy backed from libdevmapper.so
being linked to them?).
In this command, lvcreate creates a new LV and then combines
it with an existing cache pool, producing a cache LV. This
command was previously not allowed in in a shared VG.
When the lvmlockd lock is shared, upgrade it to ex
when repair (writing) is needed during vg_read.
Pass the lockd state through additional read-related
functions so the instances of repair scattered through
vg_read can be handled.
(Temporary solution until the ad hoc repairs can be
pulled out of vg_read into a top level, centralized
repair function.)
It's not an error if a command requests the global lock
when it has already acquired it. It shouldn't happen,
but there could be cases we've not found.
We have been warning about duplicate devices (and disabling lvmetad)
immediately when the dup was detected (during label_scan). Move the
warnings (and the disabling) to happen later, after label_scan is
finished.
This lets us avoid an unwanted warning message about duplicates
in the special case were md components are eliminated during the
duplicate device resolution.
The report uses process_each_vg() which populates lvmcache
based on a VG list from lvmetad. If there are no VGs,
but only orphan PVs, the orphans are not shown. Add an
explicit call to populate lvmcache with PV info from lvmetad.
This minor patch fixes grammar in a few messages which get
printed to users. It also fixes the same grammar mistake in
several comments.
Signed-off-by: Rick Elrod <relrod@redhat.com>
--
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
No point to start building lvm without this header file.
Although there could be 'some point' in supporting standalone build
of 'just' libdm where the libaio might be avoided.
TODO: think about configure option for building libdm only.
Unfortunatelly on kernels <4.16 lvm2 can't user brd ramdisks
for backend device as number of test is failing with this kernel
message:
device-mapper: ioctl: can't change device type after initial table load.
caused by DAX request-based handling, and lvm2 tries to replace device
with backend 'error' bio-based device and such table reload is being
rejected.
So ATM keep ramdisk only on most recent kernel to experiment a bit,
for older machines just stay safe and keep old slower loop backend.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
While newer system can detect need for 4K mkfs, on older test machines
running test suite over 4k is reporting problems.
Some more generic solution is needed thought.
When test happens to run in tmpfs, it cannot use O_DIRECT (unsupported
with tmpfs).
CHECKME: unsure if detection of tmpfs is 'valid' but kind of works and
is very simple.
As Makefiles already do use target with name 'device-mapper'
rename this new device-mapper dir to non-conflicting name.
We also seem to already use '_' in other dir names.
Also rename device_mapper/Makefile to source for generating Makefile.in
so we can use it for build in other source dirs properly.
It's very hard to use some 'non-recurive' Makefiles with
rest of system running 'recursively'.
So ATM drop inclusion of subdir makefile and add support
for 2 new top-level targets:
unit-test (builds test/unit dir)
run-unit-test (build & run test/unit/unit-test run)
Just like 52656c89fd
when now cache is compiled in 'unditionally'.
This patch is actually enforce by changes in
commit: 2bc896f2a3
where CACHE value is not set anymore.
If the test does not need root, it can use 'SKIP_ROOT_DM_CHECK'.
For such test no actions needed root to initilize DM devices and
nodes will be take and test can check i.e. functional unit tests.
Usage of dm_delay looks to be slowing not just 'delayed' portion
of device, but due to the fact it's also slows down ANY flush
operation on such device it's overal speed impact is huge.
In some case we can however user other methods to slowdown disk writes,
in case of old dm 'mirror' target we can throttle I/O of mirror
synchronisation giving the next commands enough time to test couple
race conditions.
Usage:
throttle_dm_mirror [percentage]
Thtrottle down sync speed (lowest is '1' which is also default when
unspecified)
restore_dm_mirror
Restores the value of throttling before call of 'throttle_dm_mirror'
Usually it should '100'
Currently usage of loop device over backend file in ramdisk (tmpfs)
is actually causing unnecassary memory consution, since just
reading such loop device is causing RAM provisioning.
This patch add another possible way how to use ramdisk directly
through 'brd' device when possible (and allowed).
This however has it's limitation as well - brd does not support
TRIM, so the only way how to erase is to remove brd module ??
Alse there is 4K sector size limitation imposed by ramdisk.
Anyway - for some mirror test that were using large amount of
disk space (tens of MB) this brings noticable speed boost.
(But could be worth to solve the slowness of loop in kernel?)
To prevent using 'brd' for testing set LVM_TEST_PREFER_BRD=0
like this:
make check_local LVM_TEST_PREFER_BRD=0
This test can't use brd (ramdisk) as backend since for some
weird reason lsblk is not listing these device.
TODO: test could be probably rewritten to avoid using lsblk somehow??
When the backend device supports only 4K blocks (like ramdisk)
we cannot use for testing any smaller blocksize.
So recalc test for 4K extent size.
We may possibly introduce one list extra test that
can be executed on devices with 512b sectors to
check lvm2 support those min extent sizes...
ATM it's a bit ugly to enforce flushing of 'stdio' here, but works as quick
hot-fix.
log_print*() is using buffered I/O.
But for pooling with typical 1s interval this may take a while before
buffer about continues progress gets flushed.
So ATM fflush().
TODO: either add log_print*_with_flush() or maybe directly use just
line buffering with log_print() and only log_debug() keep using buffered
I/O mode.
md devices using an older superblock version have
superblocks at the end of the md device. For commands
that skip reading the end of devices during filtering,
the md component devs will be scanned, and will appear
as duplicate PVs to the original md device. Remove
these md components from the list of unused duplicate
devices, so they are treated as if they had been
ignored during filtering. This avoids the restrictions
that are placed on using PVs with duplicates.
All these functions are now used as utilities,
e.g. for ioctl (not for io), and need to
open/close the device each time they are called.
(Many of the opens can probably be eliminated by
just using the bcache fd for the ioctl.)
with the --labelsector option. We probably don't
need all this code to support any value for this
option; it's unclear how, when, why it would be
used.
An implementation of an adaptive radix tree. Has the following nice
properties:
- At least as fast as the hash table
- Uses less memory
- You don't need to give an expected size when you create
- It scales nicely (ie. no large reallocations like the hash table).
- You can iterate the keys in lexicographical order.
Only insert and lookup are implemented so far. Plus there's a lot
more performance to come.
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
This reverts commit 0931067dc5.
The dep files should be in the build dir, which is not necc. the src dir.
Easy to fix, but reverting for now until I have time to revisit.
The clvmd saved_vg data is independent from the normal lvm
lvmcache vginfo data, so separate saved_vg from vginfo.
Normal lvm doesn't need to use save_vg at all, and in clvmd,
lvmcache changes on vginfo can be made without worrying
about unwanted effects on saved_vg.
To avoid the chance of freeing a saved vg while another
code path is using it, defer freeing saved vgs until
all the lvmcache content is dropped for the vg.
In case "lvconvert -mN RaidLV" was used on a degraded
raid1 LV, success was returned instead of an error.
Provide message to inform about the need to repair first
before changing number of mirrors and exit with error.
Add new lvconvert-m-raid1-degraded.sh test.
Resolves: rhbz1573960
I don't like having this in a common header because it means you end
up including too much and causing unneccessary dependencies. eg,
lib/misc/lib.h includes libdevmapper.h, internationalisation, and
logging stuff.
There are likely more bits of code that can be removed,
e.g. lvm1/pool-specific bits of code that were identified
using FMT flags.
The vgconvert command can likely be reduced further.
The lvm1-specific config settings should probably have
some other fields set for proper deprecation.
The mixed up vg repair code in vg_read was trying
to repair a vg when vg_read was called by clvmd.
The clvmd daemon isn't supposed to be repairing
or writing a vg.
(This is a temporary workaround; vg repair will soon
be pulled out of vg_read so it can be called in a
controlled way and consolidated instead of spread
around.)
Shift refresh of mirror table right into monitor_dev_for_events().
Use !vg_write_lock_held() to recognize use of lvchange/vgchange.
(this shall change if this would no longer work, but requires
futher some API changes).
With this patch dm mirror table is only refreshed when necassary.
Also update WARNING message about mirror usage without monitoring
and display LV name.
When 'dmsetup' reports result with --nameprefixes it currently
incorrectly 'escapes' problematic characters.
Letting pass such string though shell 'eval' function is hard task.
So instead cut away substring.
Once dmsetup will start to properly escape backslash and apostrophe
this function may need further tuning.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
External build dir now works too.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
The filters save information about devices that should
be ignored, so if we need to repeat a scan (unusual,
but happens in clvmd), we need to update the filters.
When pvmove was run in background mode and forks
instead of using lvmpolld, the child pvmove process
was not clearing the bcache from the parent, so all
the aio ops in the child were failing.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
The copy of the VG which clvmd stashes in lvmcache should
not only be used between suspend and resume, but between
sequential LV operations in clvmd, so that clvmd does not
need to reread the VG for each one. Prepare for that by
renaming the stashed VG as "saved_vg".
Instead of using delayer device user 'zero' device and let mirror
do some real work which takes some time.
In case the test machine is too fast - mirror might need to be made bigger
to meet needed criteria.
Also move all test needed this 'zero' PV trick to the end of test
so $dev2 and $dev4 are covered with 'zero' and can take any amount of
write without consuming any real space.
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Don't allow writes in test mode. test mode should be
more sophisticated than just faking writes, and this
should be a last defense for cases where test mode is
not being checked correctly.
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
In some odd cases (e.g. tests) there are very few devices
which results in creating too few blocks in bcache, so
create bcache with a minimum number of blocks.
When a PV is stacked on an LV, the LV will be kept in
bcache, and the open fd on the LV may interfere with
processing the LV. So, drop/close a bcache fd for
an LV before processing the LV.
Commands using lvmetad will not begin with a proper
label_scan which initializes bcache, but may later
decide they need to scan a set of devs, in which case
they'll need bcache set up at that point.
The improved detection of bad metadata when scanning
(where errors were ignored before) means we now have to
override some errors when forcibly erasing damaged metadata.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
This is a temporary hacky workaround to the problem of
reads going through bcache and writes not using bcache.
The write path wants to read parts of data that it is
incrementally writing to disk, but the reads (using
bcache) don't work because the writes are not in the
bcache. For now, add a dev to bcache before each attempt
to read it in case it's being used on the write path.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
In the same way as the other process_each functions.
In the common case all the info that's needed can be
used from lvmcache after a label scan. But this means
that unchosen devs for duplicate PVs need to be handled
explicitly.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
The old code was doing unnecessary label scans when
checking to see if the new VG name exists. A single
label_scan is sufficient if it is done after the
new VG lock is held.
When lvmlockd indicates that the lvmetad cache is out of
date because of changes by another node, lvmetad_pvscan_vg()
rescans the devices in the VG to update lvmetad. Use the
new label_scan in this function to use the common code and
take advantage of the new aio and reduced reads.
This fixes the use of lvmcache_label_rescan_vg() in the previous
commit for the special case of independent metadata areas.
label scan is about discovering VG name to device associations
using information from disks, but devices in VGs with
independent metadata areas have no information on disk, so
the label scan does nothing for these VGs/devices.
With independent metadata areas, only the VG metadata found
in files is used. This metadata is found and read in
vg_read in the processing phase.
lvmcache_label_rescan_vg() drops lvmcache info for the VG devices
before repeating the label scan on them. In the case of
independent metadata areas, there is no metadata on devices, so the
label scan of the devices will find nothing, so will not recreate
the necessary vginfo/info data in lvmcache for the VG. Fix this
by setting a flag in the lvmcache vginfo struct indicating that
the VG uses independent metadata areas, and label rescanning should
be skipped.
In the case of independent metadata areas, it is the metadata
processing in the vg_read phase that sets up the lvmcache
vginfo/info information, and label scan has no role.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When user configured lvm2 to NOT user monitoring, activated mirror
actually hang upon error and it's quite unusable moment.
So instead Warn those 'brave' non-monitoring users about possible
problem and activation mirror without blocking error handling.
This also makes it a bit simpler for test suite to handle trouble
cases when test is running without dmeventd.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Whenever we make visible LV out of previously invisible one,
reload it's table - the is mandator for proper udev rule
processing as well as ensure content of dm table is correct.
TODO: this new generic rule probably make extra raid rules unnecessary.
Fixing regresion on argument acceptance where any lv can be passed
with paramaterless lvconvert which is meant to figure out needed
operation - i.e. wait for mirror synchronization.
User has no other 'effective' method to wait for mirror getting in-sync.
Since we support snapshot of mirrors, we do need to properly check
for stacked lock holder - fixes problem of pvmove in cluster
with mirrors under snapshot.
WHATS_NEW for this patch goes with 'Restore pvmove support...'
The current logic that avoids setting SYSTEMD_ALIAS and SYSTEMD_WANTS
on "change" events is flawed in the default "systemd background job"
configuration. For systemd, it's important that device properties don't
change spuriously.
If an "add" event starts lvm2-pvscan@.service for a device, and a
"change" event follows, removing SYSTEMD_ALIAS and SYSTEMD_WANTS from the
udev db, information about unit dependencies between the device and the
pvscan service can be lost in systemd, in particular if the daemon
configuration is reloaded.
Steps to reproduce problem:
- create a device with an LVM PV
- remove device
- add device (generates "add" and "change" uevents for the device)
(at this point SYSTEMD_ALIAS and SYSTEMD_WANTS are clear in udev db)
- systemctl daemon-reload
(systemd reloads udev db)
- vgchange -a n
- remove device
=> the lvm2-pvscan@.service for the device is still active although the
device is gone.
- add device again
=> the PV is not detected, because systemd sees the lvm2-pvscan@.service
as active and thus doesn't restart it.
The original purpose of this logic was to avoid volumes being scanned
over and over again. With systemd background jobs, that isn't necessary,
because systemd will not restart the job as long as it's active.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Make the distinction between the cases with and without systemd
background jobs explicit in 69-dm-lvm-metad.rules rather than
substituting the rule from the Makefile. At this stage,
this improves only readibility, at the cost of one GOTO statement.
This patch introduces no functional change to the udev rules.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Test that no (Sub)LV remnants persist if the volume group is
not listed in configuration variable activation/volume_list,
hence not activatable thus causing initialization of rmeta
SubLVs to fail.
Related: rhbz1161347
btrfs is using fake major:minor device numbers.
try to be smarter and detect used node via DM device name.
This shortens delays, where i.e. lvm2 is asked to deactivate
volume with mounted btrfs as such operation is not retryed
and user is informed about device being in use.
Correct testing with format 1 and mq policy.
Add testing of 'smq'
Fix testing with clvmd - where logged message is part of clvmd log
and we can only check command status.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
Use 4K chunks since some older kernels are not capable
to create striped volumes with smaller size.
TODO: lvm2 should detect this ahead and avoid kernel
reporting "Invalid chunk".
Since this function is called with 'fd == -1', but Coverity can't see
this path can't be visited with this argument, add explicit check for
valid descriptor.
If the tools for checking thin_pool or cache metadata are missing,
issue rather just a WARNING, but let the operation of activation
continue.
This has the advantage, the if user is missing those tools,
but he already started to use thinpool or cacheing, he can
access these volumes with a WARNING.
Also if the user is using too old tools i.e. for CacheV2 format
dmpd tool 0.7 is required - provide informative WARNING and
skip failure from older tool version which can't understand
new format V2.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
While prioritized_section() based on raised priority works
nicely for standard lvm comman - separate counter is actually needed
when it's used in daemons like clvmd/dmeventd where priority
stays raised all the time.
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Introduce prioritized_section() as a closer match to previous logic
of critical_section() that has been held over longer sequence of
ioctl commands - essentially it's matching operation on a single
cookie.
While 'critical_section()' now corresponds to locked memory - we hold
this memory only between suspend/resume thus notion of 'cookie' was
lost.
This patch restores some logic unintentionaly lost with dropping
memory locking for just activation/deactivation calls.
When function dm_stats_populate() returns 0 it's an error and needs
log_error() message - function can't have 'success' returning 0 or
error without reasons.
Prevent call of dm_stats_populate(), when there has been no
stats region detected for a DM device.
Such skip is evaluated as 'correct' visit of stats call and
not causing 'dmstats' command failure.
Resulting loop table line was streamed to 'stderr' stream - assuming this
was not a feature when user used '-v' for more verbose output
and properly show it via 'log_verbose()' on 'stdout'.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
With problematic kernels raid devices can be occasionaly left with
'frozen' status - try to 'unfreeze' them with idle message on teardown.
Also replace couple greps with 'built-in' dmsetup --select feature.
Note: dmsetup --select currently reports 'No devices found' on stdout
and return success - looks like a bug to fix.
Just like lvm2 has internal devices like _tdata which is using UUID with
suffix, there is similar private type of device for crypto device where
they are using CRYPT-TEMP uuid prefix.
Also ignore stratis.
Some kernel version suffer from bad state transition where a device
steps into 'frozen' mode. Any application that tries to read such
raid gets unfortunatelly bloked.
As some sort of protection try to skip such raid device from being
scanned to minimize chances to block lvm2 command on such scan.
When such device is found, warning gets printed.
RaidLVs on read_only_volume_list have their SubLVs
activated readonly thus disabling metadata updates
or image resynchronization/recovery. Bug also causes
automatic repairs to fail.
Fix by always activating the RAID SubLVs readwrite.
Resolves: rhbz1208269
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
When snapshot is created in read-only mode with 'lvcreate -s -pr...',
lvm2 still needs to be able to write to layered -cow volume
to store metadata and exceptions blocks.
TODO: in some case we might be able to do full tree with read-only
volume but this probably needs futher validation:
1. checking snapshot header already exist
2. origin & snapshot are both in read-only mode.
Occasionaly users may need to peek into 'component devices.
Normally lvm2 does not let users activation component.
This patch adds special mode where user can activate
component LV in a 'read-only' mode i.e.:
lvchange -ay vg/pool_tdata
All devices can be deactivated with:
lvchange -an vg | vgchange -an....
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
When dmeventd thin plugin forks a configurable script, switch to use
execvp to pass whole environment present to dmeventd - so all configured
paths present at dmeventd startup are visible to script.
This was likely not a problem for common user enviroment,
however in test suite case variable like LVM_SYSTEM_DIR were
not actually used from test itself but rather from
a system present lvm.conf and this may have cause strange
behavior of a testing script.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
This is somewhat tricky - for test suite we keep using
'set -e -o pipefail' - the effect here is - we get error report
from any 'failing' command in whole pipeline - thus when something
like this: 'lvs | head -1' is used - and 'head' finishes before
lead 'lvs' is done - it recieves SIGPIPE and exits with error,
and somewhat misleading gets occasionally reported depending
of speed of commands.
For this case we have to avoid using standard pipes and rather
switch to using streamed results with temporary output file.
This is all nicely handled with bash feature '< <()'.
For more info:
https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail
While 'file-locking' code always dropped cached VG before
lock was taken - other locking types actually missed this.
So while the cache dropping has been implement for i.e. clvmd,
actually running command in cluster keept using cache even
when the lock has been i.e. dropped and taken again.
This rather 'hard-to-hit' error was noticable in some
tests running in cluster where content of PV has been
changed (metadata-balance.sh)
Fix the code by moving cache dropping directly lock_vol() function.
TODO: it's kind of strange we should ever need drop_cached_metadata()
used in several places - this all should happen automatically
this some futher thinking here is likely needed.
Improve pvmove to accept 'locally' active LVs together with
exclusive active LVs.
In the 1st. phase it now recognizes whether exclusive pvmove is needed.
For this case only 'exclusively' or 'locally-only without remote
activative state' LVs are acceptable and all others are skipped.
During build-up of pvmove 'activation' steps are taken, so if
there is any problem we can now 'skip' LVs from pvmove operation
rather then giving-up whole pvmove operation.
Also when pvmove is restarted, recognize need of exclusive pvmove,
and use it whenever there is LV, that require exclusive activation.
So this is a bit more complex and possibly worth futher checking.
ATM clvmd drops cmd->mem mempool AFTER refresh of cmd.
So anything allocating from cmd->mem during toolcontext init
will likely die at some point in time.
As a quick fix - just use regular malloc/free for 'dso' alloction.
It's worth to note - cmd->libmem seems to be often misused
causing hidden memleaking for clvmd.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.
Avoid using same return code for reporting 2 different things
and stricly report error code by return value and add new
parameter for reporting monitoring status.
This makes easier to recognize which error we got from dm_event
and continue only with ENOENT.
Check if the generated vg name still fits the buffer.
So too long strings are rejected.
Drop -1 from size passed to snprintf - as the \0 is already included.
On occasional gcc releases it's better to specify also -ldevmapper
to linking logic for python object.
It's in fact more correct since the liblvm.c code is using
libdevmapper functions - that were linked in only via
liblvm2app library.
This partially reverts commit da37cbd24f.
As the _cmdline structure use mempool for allocated ellement
that is being release on cmd_context close.
Before the better fix is made - restore previous logic and
reinitialize cmd structures again for new cmd_context.
Problem can be hit with e.g. this test run:
make check_local T=foreign LVM_VALGRIND_DMEVENTD=1
Invalid read of size 1
at 0x4C31C83: strcmp (vg_replace_strmem.c:846)
by 0x6BA0939: _find_command (lvmcmdline.c:1555)
by 0x6BA4304: lvm_run_command (lvmcmdline.c:2810)
by 0x6BD5E02: lvm2_run (lvmcmdlib.c:91)
by 0x685607E: dmeventd_lvm2_run (dmeventd_lvm.c:118)
by 0x6652684: _use_policy (dmeventd_thin.c:117)
by 0x6652E56: process_event (dmeventd_thin.c:298)
by 0x10CC5A: _do_process_event (dmeventd.c:945)
by 0x10CF83: _monitor_thread (dmeventd.c:1033)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Address 0x6266270 is 4,352 bytes inside a block of size 8,192 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x5289142: dm_free_wrapper (dbg_malloc.c:393)
by 0x528998A: _free_chunk (pool-fast.c:318)
by 0x52892A6: dm_pool_destroy (pool-fast.c:78)
by 0x6A8E52C: destroy_toolcontext (toolcontext.c:2254)
by 0x6BA5BD6: lvm_fin (lvmcmdline.c:3327)
by 0x6BD5EA7: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x5288F46: dm_malloc_aux (dbg_malloc.c:287)
by 0x52890AC: dm_malloc_wrapper (dbg_malloc.c:371)
by 0x52898E6: _new_chunk (pool-fast.c:286)
by 0x52893BA: dm_pool_alloc_aligned (pool-fast.c:106)
by 0x5289310: dm_pool_alloc (pool-fast.c:90)
by 0x6A8A21A: _load_config_file (toolcontext.c:808)
by 0x6A8A3D9: _init_lvm_conf (toolcontext.c:842)
by 0x6A8D3BD: create_toolcontext (toolcontext.c:1941)
by 0x6BA5B24: init_lvm (lvmcmdline.c:3308)
by 0x6BD5B7C: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5EB8: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
With pthreaded daemons like 'dmeventd' using liblvm via plugin,
lvm2 actually should not 'play' with streams at all - as there
could be parallel outputs running.
As a current quick workaround just disable change for pthreaded
program (gettid() != getpid()).
TODO: it's possible the change of buffering actually doesn't serve us
any measurable benefit and could be dropped as whole later...
Meanwhile this patch is fixing this occasional valgrind race report:
Invalid read of size 4
at 0x571892C: vfprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x57216B3: fprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x5042886: dm_event_log (libdevmapper-event.c:925)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Address 0x6264d30 is 192 bytes inside a block of size 552 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x573907D: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5C00: reopen_standard_stream (log.c:189)
by 0x6A8E62C: destroy_toolcontext (toolcontext.c:2271)
by 0x6BA5C22: lvm_fin (lvmcmdline.c:3339)
by 0x6BD5EF3: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x573932B: fdopen@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5DC2: reopen_standard_stream (log.c:200)
by 0x6A8D11D: create_toolcontext (toolcontext.c:1898)
by 0x6BA5B6B: init_lvm (lvmcmdline.c:3319)
by 0x6BD5BC8: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5F04: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
....
Process terminating with default action of signal 6 (SIGABRT): dumping core
at 0x570016B: raise (in /usr/lib64/libc-2.26.9000.so)
by 0x5701520: abort (in /usr/lib64/libc-2.26.9000.so)
by 0x57437D8: __libc_message (in /usr/lib64/libc-2.26.9000.so)
by 0x5743831: __libc_fatal (in /usr/lib64/libc-2.26.9000.so)
by 0x5744056: _IO_vtable_check (in /usr/lib64/libc-2.26.9000.so)
by 0x574751C: __overflow (in /usr/lib64/libc-2.26.9000.so)
by 0x574191A: fputc (in /usr/lib64/libc-2.26.9000.so)
by 0x50428E3: dm_event_log (libdevmapper-event.c:934)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Some tools are typically installed into /usr/sbin (or /sbin) dir.
And some systems do not add this path to user's $PATH var.
Ensure sbin paths are looked through...
In fact pvmove does support 'clustered-core' target for clustered
pvmove of LVs activated on multiple nodes.
This patch restores support for activation of pvmove on all nodes
for LVs that are also activate on all nodes.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
Add protectional internall error whenever we spot activation
of 'exclusive' only segments in 'non-exclusive' mode.
TODO: possibly the activation locking could be enhanced to handle
this fully behind the scene - as for now this works purely for
lvchange/vgchange activation.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Dedicated functions are now used to process each piece of data obtained,
so the refactoring in this file gives us one for the vgsummary and one
for the metadata header. This new type of function takes two parameters
(for now), the obtained data plus a single struct (that must not
reference any data on the stack) that wraps up the entire context needed
to process it.
Sleep a bit before checking /sys/block dir so the kernel has a moment to
actually put scsi debug device in it...
Some quite old kernels are in troubles with this plain searching grep
without sleep (namely 2.6.32)
modprobe scsi_debug
<sleep .1>
grep -H scsi_debug /sys/block/*/device/model
modprobe -r scsi_debug
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.