IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Use the recently added dump routines to produce the
old/traditional pvck output, and remove the code that
had been used for that.
The validation/checking done by the new routines means
that new lines prefixed with CHECK are printed for
incorrect values.
Recent kernel version from kernel commit:
de7180ff908b2bc0342e832dbdaa9a5f1ecaa33a
started to report in cache status line new flag:
no_discard_passdown
Whenever lvm spots unknown status it reports:
Unknown feature in status:
So add reconginzing this feature flag and also report this with
'lvs -o+kernel_discards'
When no_discard_passdown is found in status 'nopassdown' gets reported
for this field (roughly matching what we report for thin-pools).
Add 'pvck --dump headers' to print all the
lvm ondisk structs. Also checks the values
and prints any problems.
The previous dump metadata is also converted to
use these same routines, which do not depend on lvm
fully scanning/reading/processing the headers and
metadata on disk. This makes it useful to get data in
cases where there is corruption that would otherwise
prevent the normal functions from working.
The test was failing consistently on some VMs (F25), and inconsistently
on Rawhide.
With increased latency these failures are no longer reproducible.
Reproducer:
make check_lvmpolld T=pvmove-resume-multiseg.sh
The new command 'pvck --dump metadata PV' will extract
the current version of VG metadata from a PV for testing
and debugging. --dump metadata_area extracts the entire
text metadata area.
teardown after the test was failing, probably because
of uncoordinated udev actions running on the test
system. Try to avoid this by doing some work before
teardown.
lvm2 till version 2.02.169 (commit 78d004efa8)
was printing invalid creation_time argument into metadata on 32bit arch.
However with commit ba9820b142 we started
to properly validate all input numbers and thus we refused to accept
invalid metadata with 'garbage' string - but this results in the
situation where metadata produced on older lvm2 on 32 bit architecture
will become unreadable after upgrade.
To fix this case - extend libdm parser in a way, that whenever we
find error integer value, we also check if the parsed value is not for
creation_time node and in this case we let the metadata pass through
with made-up date 2018-05-24 (release date of 2.02.169).
commit aa75b31db5
"pvscan: handle case of scanning PV without metadata last"
failed to recognize that an arg may be null in the case of
'pvscan --cache' (without -aay) which does not keep track
of complete VGs because it does not need to activate them.
The scanning rework missed removing this instance of label scan.
It's no longer needed because of the way that label scan is always
run once from the start of the command. This unnecessary scan
would be triggered by running 'pvs @tag'.
If an md component is not excluded by other means and
vg_read is used to read metadata from it, then this new
check compares the device size with the PV size, and runs
a full md check on the device if the sizes don't match.
and don't call it from inside pvcreate_each_device.
This avoids having to repeat it for users of
pvcreate_each_device (pvcreate/pvremove/vgcreate/vgextend.)
There have been two file locks used to protect lvm
"global state": "ORPHANS" and "GLOBAL".
Commands that used the ORPHAN flock in exclusive mode:
pvcreate, pvremove, vgcreate, vgextend, vgremove,
vgcfgrestore
Commands that used the ORPHAN flock in shared mode:
vgimportclone, pvs, pvscan, pvresize, pvmove,
pvdisplay, pvchange, fullreport
Commands that used the GLOBAL flock in exclusive mode:
pvchange, pvscan, vgimportclone, vgscan
Commands that used the GLOBAL flock in shared mode:
pvscan --cache, pvs
The ORPHAN lock covers the important cases of serializing
the use of orphan PVs. It also partially covers the
reporting of orphan PVs (although not correctly as
explained below.)
The GLOBAL lock doesn't seem to have a clear purpose
(it may have eroded over time.)
Neither lock correctly protects the VG namespace, or
orphan PV properties.
To simplify and correct these issues, the two separate
flocks are combined into the one GLOBAL flock, and this flock
is used from the locking sites that are in place for the
lvmlockd global lock.
The logic behind the lvmlockd (distributed) global lock is
that any command that changes "global state" needs to take
the global lock in ex mode. Global state in lvm is: the list
of VG names, the set of orphan PVs, and any properties of
orphan PVs. Reading this global state can use the global lock
in sh mode to ensure it doesn't change while being reported.
The locking of global state now looks like:
lockd_global()
previously named lockd_gl(), acquires the distributed
global lock through lvmlockd. This is unchanged.
It serializes distributed lvm commands that are changing
global state. This is a no-op when lvmlockd is not in use.
lockf_global()
acquires an flock on a local file. It serializes local lvm
commands that are changing global state.
lock_global()
first calls lockf_global() to acquire the local flock for
global state, and if this succeeds, it calls lockd_global()
to acquire the distributed lock for global state.
Replace instances of lockd_gl() with lock_global(), so that the
existing sites for lvmlockd global state locking are now also
used for local file locking of global state. Remove the previous
file locking calls lock_vol(GLOBAL) and lock_vol(ORPHAN).
The following commands which change global state are now
serialized with the exclusive global flock:
pvchange (of orphan), pvresize (of orphan), pvcreate, pvremove,
vgcreate, vgextend, vgremove, vgreduce, vgrename,
vgcfgrestore, vgimportclone, vgmerge, vgsplit
Commands that use a shared flock to read global state (and will
be serialized against the prior list) are those that use
process_each functions that are based on processing a list of
all VG names, or all PVs. The list of all VGs or all PVs is
global state and the shared lock prevents those lists from
changing while the command is processing them.
The ORPHAN lock previously attempted to produce an accurate
listing of orphan PVs, but it was only acquired at the end of
the command during the fake vg_read of the fake orphan vg.
This is not when orphan PVs were determined; they were
determined by elimination beforehand by processing all real
VGs, and subtracting the PVs in the real VGs from the list
of all PVs that had been identified during the initial scan.
This is fixed by holding the single global lock in shared mode
while processing all VGs to determine the list of orphan PVs.
wipe_lv knows it's going to write the device, so it
can open rw from the start. It was opening readonly,
and then dev_write needed to reopen it readwrite.
To avoid tiny race on checking arrival of signal and entering select
(that can latter remain stuck as signal was already delivered) switch
to use pselect().
If it would needed, we can eventually add extra code for older systems
without pselect(), but there are probably no such ancient systems in
use.
Handle the case where pvscan --cache -aay (with no dev args)
gets to the final PV, completing the VG, but that final PV does not
have VG metadata. In this case, we need to use VG metadata from a
previously scanned PV in the same VG, which we saved for this
possibility. Using this saved metadata, we can find which VG
this PVID belongs to, and then check if that VG is now complete,
and if so add the VG name to the list of complete VGs to be
autoactivated.
When hints are invalid and ignored, the list of hints
could be non-empty (from additions before an invalid
hint was found). This confused the calling code which
was checking for an empty list to see if hints were used.
Ensure the list is empty when hints are not used.
We already used Conflicts=shutdown target to stop LVM2 services on shutdown.
But we still missed the ordering - the shutdown.target should be reached
only after all the services are really stopped.
Reported here: https://github.com/lvmteam/lvm2/issues/17
If a device looks like a PV, but its size does not
match the PV size in the metadata, then skip it for
purposes of autoactivation. It's probably not wrong
device for the PV.
In the past, the first 'pvscan --cache -aay dev' command
to run on the system would initialize the pvs_online dir
by scanning all devs and creating online files for all pvs
it found, and then autoactivating the VG (if complete) for
the named dev. The idea was that the system may not have
been able to run pvscan commands for early devices, so the
first pvscan to run would need to "make up" for any devices
that had appeared previously, which the system was unable to
scan. The problem or idea of making up for missed scans is
historical and should no longer be needed, so remove this
special init case.
When pvscan is run for the initialization case (the first
pvscan run on the system), it scans all devs and creates
online files for all PVs it finds. Previously it would
then autoactivate every complete VG, but change this to
only autoactive the (complete) VG corresponding to the
named device arg(s).
- remove reference to locking_type which is no longer used
- remove references to adopting locks which has been disabled
- move some sanlock-specific info out of a general section
- remove info about doing automatic lockstart by the system
since this was never used (the resource agent does it)
- replace info about lvextend and manual refresh under gfs2
with a description about the automatic remote refresh
This reverts 518a8e8cfb
"lvmlockd: activate mirror LVs in shared mode with cmirrord"
because while activating a mirror LV with cmirrord worked,
changes to the active cmirror did not work.
When data are growing, adapt also size of metadata.
As we get way too many reports from users doing huge growths of
data portion while keep metadata small and avoiding using monitoring.
So to enhance the user-experience in case user requests grown of
thin-pool (without passing PV list for growth) - lvm2 will automaticaly
grown also the metadata part of thin-pool (if possible).
Add function for estimation of thin-pool metadata size for given size of
data. Function is using already existing internal API so it can
be reused for resize of thin-pool data.
Update the previous commit to leave the vgname as
an arg instead of moving it into the select option,
(the compound select option rule is confusing the
dlm arg processing.)
Using --select 'lvname=LV && vgname=VG' avoids the problem
of the lvchange exit code not distinguishing an actual error
result vs the VG or LV not existing. (This is in case there
is an odd dlm/gfs2 setup where some nodes are running the dlm
but do not have access to the VG.)
When lvextend extends an LV that is active with a shared
lock, use this as a signal that other hosts may also have
the LV active, with gfs2 mounted, and should have the LV
refreshed to reflect the new size. Use the libdlmcontrol
run api, which uses dlm_controld/corosync to run an
lvchange --refresh command on other cluster nodes.
When an LV is active with a shared lock, a command can be
run to change the LV with --lockopt skiplv (to override the
exclusive lock the command ordinarily requires which is not
compatible with the outstanding shared lock.)
In this case, other commands may have the LV active and may
need to refresh the LV, so print warning stating this.
Udev is running udev-rule action upon 'resume'.
However lvm2 in special case is doing replacement of
'soon-to-be-removed' device with 'error' target for resuming
and then follows actual removal - the sequence is usually quick,
so when udev start action - it can result in 'strange' error
message in kernel log like:
Process '/usr/sbin/dmsetup info -j 253 -m 17 -c --nameprefixes --noheadings --rows -o name,uuid,suspended' failed with exit code 1.
To avoid this - we need to ensure there is synchronization wait for udev
between 'resume' and 'remove' part of this process.
However existing code put strict requirement to avoid synchronizing with
udev inside critical section - but this originally came from requirement
to not do anything special while there could be devices in
suspend-state. Now we are able to see differnce between critical section
with or without suspended devices. For udev synchronization only
suspended devices are prohibited to be there - so slightly relax
condition and allow calling and using 'fs_sync()' even inside critical
section - but there must not be any suspended device.
Allow using caching with VDO.
User can either cache a single vdopool or
a vdo LV - difference when the caching is put-in depends on a use-case
and it's upto user to decide which kind of speed is expected.
Internal detection of SCSI device being in-use by DM mpath has been
performed several times for each component device - this could be
eventually racy - so instead when we do remember 1st. checked result
for device being mpath and use it consistenly over the filter runtime.
Move DM usage into dev_manager.c source file.
Also convert STATUS to INFO ioctl - as that's enough
to obtain UUID - this also avoid issuing unwanted flush on checked DM
device for being mpath.
Fix to previous commit
"pvscan: ignore online for shared and foreign PVs"
which was incorrectly considering a PV foreign if its
VG had no system ID when the host did have a system ID.
Activation would not be allowed anyway, but we can
check for these cases early and avoid wasted time in
pvscan managing online files an attempting activation.
This is the default bcache size that is created at the
start of the command. It needs to be large enough to
hold a single copy of metadata for a given VG, or the
VG cannot be read or written (since the entire VG would
not fit into available memory.)
Increasing the default reduces the chances of anyone
needing to increase the default to use their VG.
The size can be set in lvm.conf global/io_memory_size;
the lower limit is 4 MiB and the upper limit is 128 MiB.
When a single copy of metadata gets within 1MB of the
current io_memory_size value, begin printing a warning
that the io_memory_size should be increased.
which defines the amount of memory that lvm will allocate
for bcache. Increasing this setting is required if it is
smaller than a single copy of VG metadata.
and "cachepool" to refer to a cache on a cache pool object.
The problem was that the --cachepool option was being used
to refer to both a cache pool object, and to a standard LV
used for caching. This could be somewhat confusing, and it
made it less clear when each kind would be used. By
separating them, it's clear when a cachepool or a cachevol
should be used.
Previously:
- lvm would use the cache pool approach when the user passed
a cache-pool LV to the --cachepool option.
- lvm would use the cache vol approach when the user passed
a standard LV in the --cachepool option.
Now:
- lvm will always use the cache pool approach when the user
uses the --cachepool option.
- lvm will always use the cache vol approach when the user
uses the --cachevol option.
For users who do not want all of the fields included
in debug lines, let them specify in lvm.conf which
fields to include. timestamp, command[pid], and
file:line fields can all be disabled.
Without this, the output from different commands in a single
log file could not be separated.
Change the default "indent" setting to 0 so that the default
debug output does not include variable spaces in the middle
of debug lines.
Use the correct loop variable within the loop, instead of reusing the
initial value. Table lines after the first don't get terminated in
the right place.
Signed-off-by: Kurt Garloff <kurt@garloff.de>
When a VG has multiple PVs, and all those PVs come online
at the same time, concurrent pvscans for each PV will all
create the individual pvid files, and all will often see
the VG is now complete. This causes each of the pvscan
commands to think it should activate the VG, so there
are multiple activations of the same VG. The vg lock
serializes them, and only the first pvscan actually does
the activation, but there is still a lot of extra overhead
and time used by the other pvscans that attempt to
activate the already active VG. This can lead to a backlog
of pvscans and timeouts.
To fix this, this adds a new /run/lvm/vgs_online/ dir that
works like the existing /run/lvm/pvs_online/ dir. Each pvscan
that wants to activate a VG will first try to exlusively create
the file vgs_online/<vgname>. Only the first pvscan will
succeed, and that one will do the VG activation. The other
pvscans will find the vgname file exists and will not do the
activation step.
When a PV goes offline, the vgs_online file for the corresponding
VG is removed. This allows the VG to be autoactivated again
when the PV comes online again. This requires that the vgname be
stored in the pvid files.
Use a file lock to ensure that only one pvscan will do
initialization of pvs_online, otherwise multiple concurrent
pvscans may all see an empty pvs_online directory and
do initialization.
The pvscan that is doing initialization should also only
attempt to activate complete VGs.
When aay was included in the pvscan --cache command,
the activation part was complaining about the unusual
state of the hint file since it had been recreated
just prior.
udev_dev_is_md_component and udev_dev_is_mpath_component
are not used for obtaining the device list, but they still
use libudev for device info. When there are problems with
udev, these functions can get stuck. So, use the existing
obtain_device_list_from_udev config setting to also control
whether these "is component" functions are used, which gives
us a way to avoid using libudev entirely when it's causing
problems.
Just like we support for thin-pool syntax:
lvcreate --thinpool new_tpoolname -L Size vg
add same support logic with for vdo-poo:
lvcreate --vdopool new_vpoolname -L Size vg
Also move description of syntax bellow thin-pool, so it's
correctly ordered in generated man page.
Whenever thin-pool chunk size is unspecified and left for lvm calculation
try to select the size as nearest highest power-of-2 instead of
just being a multiple of 64KiB.
Fixing recent commit 022ebb0cfe
Resize already has size that needs to be counted with,
otherwise upsizing operation could turn into size reduction one.
Since the parse_vdo_pool_status() become vdo_manip API part,
and there will be no 'dm' matching status parser,
the API can be simplified and closely match thin API here.
Now with newer VDO kvdo target we can start to use standard mechanism
to enable resize of VDO volumes.
VDO pool can be grown.
Virtual volume grows on top of VDO pool when is not big enough.
Reduced VDOLV is calling discard for reduced areas - this can
take long time!
TODO: implement some pollable mechanism for out-of-lock TRIM.
When using 'lvcreate -l100%VG' and there is big disproportion between
real available space and requested setting - automatically fallback
to 100%FREE.
Difference can be seen when VG is big and already most space was
allocated, so the requestion 100%VG can end (and by spec for % modifier
it's correct) as LV with size of 1%VG. Usually this is not a big
problem - buit in some cases - like cache-pool allocation, this
can result a big difference for chunksize selection.
With this patch it's more closely match common-sense logic without
the need of reitteration of too big changes in lvm2 core ATM.
TODO: in the future there should be allocator solving all allocations
in a single call.
When using caches with BIG pool size (>TB) there is required
to use relatively huge chunk size. Once the chunksize has
got over 1MiB - kernel cache target stopped writing such chunks
back on this if the migration_threshold remained on default 1MiB
(2048 sectors) size.
This patch ensure, DM layer will not let pass table line which
has not big enough migration threshold that can let pass
at least 8 chunks (independently of lvm2 metadata).
lvm uses 'minimum_io_size' name to exactly match VDO naming here,
however in all common cases _size is using 'sector/512b' unit.
But in this case the value is in bytes and can have only 2 values:
either 512 or 4096.
It's probably not worth to rename it internaly, so we can just
drop comment - instead of using 1 or 8.
Thought let's think about it....
Lvm can at times have duplicate names. When this happens the daemon will
internally use vg_name:vg_uuid as the name for lookups, but display just
the vg_name externally. If an API user uses the Manager.LookUpByLvmId and
queries the vg name they will only get returned one result as the API
can only accommodate returning 1. The one returned is the first instance
found when sorting the volume groups by UUID.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1583510
When we have two logical volumes which switch their names at the
same time we are left with incorrect lookups. Anytime we find
an entry by doing a lookup by UUID or by name we will ensure
that the lookups are indeed correct.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1642176
An idea from Zdenek for better ensuring valid hints by invalidating
them when pvscan --cache <device> sees a new PV, which is a case
where we know that hints should be invalidated. This is triggered
from systemd/udev logic, and there may be some cases where it would
invalidate hints that the existing methods wouldn't detect.
If there are two independent scripts doing:
vgchange --lockstart vg
lvchange -ay vg/lv
The first vgchange to do the lockstart will wait for
the lockstart to complete before returning.
The second vgchange to do the lockstart will see that
the start is already in progress (from the first) and
will do nothing. This means the second does not wait
for any lockstart to complete, and moves on to the
lvchange which may find the lockspace still starting
and fail.
To fix this, make the vgchange lockstart command
wait for any lockstart's in progress to complete.
Save the list of PVs in /run/lvm/hints. These hints
are used to reduce scanning in a number of commands
to only the PVs on the system, or only the PVs in a
requested VG (rather than all devices on the system.)
The systemd generators are executed very early during the switch
from initramfs to system partition and the syslog is not yet fully
operational - it may cause blocking, if some debug logging is enabled
at the same time in /etc/lvm/lvm.conf log{} section.
To avoid timeouting and killing this generator - rather enhance lvm
code to suppress any syslog communication when LVM_SUPPRESS_SYSLOG
envvar is set.
Use of this envvar is needed since the parsing of i.e. cmdline options
that could eventually override lvm.conf setting happens in this case
way too late and number of lines could have been already streamed to
syslog.
Fix a scenario where global/event_activation setting is not found. In
this case we need to take default value just like lvm tools do when
executed. So use "lvmconfig --type full".
Also, if we fail to execute lvmconfig for whatever reason, fallback to
generating the activation units as failsafe action.
Reported by: Bastian Blank <waldi debian org>
When initializing an LV to hold the writecache, use wipe_lv()
which looks for specific signatures on the LV.
Wiping signatures is not necessary, but printing a warning
that names a specific signature (in addition to the existing
generic warning/confirmation) may help if a user accidentally
specifies the wrong LV which contains something important.
Detach function return 0 for error and 1 for success.
Add missing log errors from failing deactivation.
Add missing log error from failing synchronization.
In few cases error paths from initialization were returned as
'success == 1'.
Also assing num_mb with single compare checking valid sector_size.
For dumb compiler make num_mb always defined.
Since configure.h is a generated header and it's missing traditional
ifdefs preambule - it can be included & parsed multiple times.
Normally compiler is fine when defines have same value and there is
no warning - yet we don't need to parse this several times
and by adding -include directive we can ensure every file
in the package is rightly compile with configure.h as the
first header file.
If we are running the test where the device is /dev/* we will will
run the unit tests 'test_nesting' and 'test_pv_symlinks'. Otherwise
we will skip them.
When a VG is exported, the 'fullreport' returns an exit code of 5, but
otherwise returns the data we are wanting.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
In some cases we get stuck where we are unable to retrieve the current
state of lvm as we are encountering an error. When the error is
persistent we will log and exit the daemon instead of consuming vast
amounts of resources.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Drop very old original format of VDO target and focus on V2 version.
So some variables were renamed or replaced.
There is no compatibility preserved (with assumption so far this is
experimental feature and there is no real user).
Note - version currently VDO calls this version 6.2.
A bit of chicken & egg problem - dmeventd needs to use old libdm library.
VDO is only part of new device_mapper internal library.
So include directly source file for parsing status - this fixes usability
problem of VDO plugin introduced with previous Makefile reshaping
patchset.
NOTE: source file needs to be keep then compilable in both environments.
Also add missing copyright header.
This is a followup patch to commit edb72cb70c
to support related lvm2 test suite tests.
A 'global/support_mirrored_mirror_log' bool configuration variable gets
introduced allowing the creation of, or conversion to mirrored 'mirror'
logs if set. The capability to create these in turn allows the rest of
the tests to perform activation of such existing LVs and their conversions
to disk/core 'mirror' logs.
Display a disclaimer warning if enabled that this is not for regular use.
Add definition of the enabled config option to respective test scripts.
Related: rhbz1643562
With older gcc - we need to resolve symbols linked with devmapper-event
that is now using -ldevmapper.
Also add forgotten systemd library needed for dbus notification.
Avoid doing hard set of LIBS var,
so if the LIBS is set before 'include make.tmpl' it's not lost.
This gives better control over order of linked libraries.
Since there is very little change there will be any new devel going
to happing with cmirror - avoid eating extra disk space and link
with already installed libdm which implements all use basic
function of dm list
Since dmeventd is 'libdm' based project, it needs to link
libdm library instead of its internal version
An external users may provide plugins loadeable by dmeventd.
So external user of libdevmapper-event library has no other option
then to link with released libdevmapper library.
The complexity comes with lvm2 plugins.
The lvm2 plugin itself uses internal version of device_mapper,
but libdevmapper-event usage is libdm based - so there needs to be avoided
any breakage on compatibility of internal i.e. dm_task_run structures.
TODO: most likely dmeventd itself should be moved into libdm/dm-tools dir,
and only lvm2 plugins should be created as part of lvm project,
but those still need to link with libdevmapper.
If we don't know the meaning we will return the key with default text
instead of raising an exception and taking the daemon out in the
process.
Resolves: rhbz1657950
When we get bug reports we may not get the entire log, so lets
dump the fight recorder from newest to oldest as the one we
are interested in was likely to be the last command run.
Scenario: Given an existed LV `lvol0`, I want to create another LV
on the PVs used by `lvol0`.
I use `build_parallel_areas_from_lv()` to obtain the `pv_list` of each segments.
However, the returned `pv_list` is not properly initialized, which causes
segfault in subsequent operations.
Ensure configure.h is always 1st. included header.
Maybe we could eventually introduce gcc -include option, but for now
this better uses dependency tracking.
Also move _REENTRANT and _GNU_SOURCE into configure.h so it
doesn't need to be present in various source files.
This ensures consistent compilation of headers like stdio.h since
it may produce different declaration.
There's a small window during creation of a new RaidLV when
rmeta SubLVs are made visible to wipe them in order to prevent
erroneous discovery of stale RAID metadata. In case a crash
prevents the SubLVs from being committed hidden after such
wiping, the RaidLV can still be activated with the SubLVs visible.
During deactivation though, a deadlock occurs because the visible
SubLVs are deactivated before the RaidLV.
The patch adds _check_raid_sublvs to the raid validation in merge.c,
an activation check to activate.c (paranoid, because the merge.c check
will prevent activation in case of visible SubLVs) and shares the
existing wiping function _clear_lvs in raid_manip.c moved to lv_manip.c
and renamed to activate_and_wipe_lvlist to remove code duplication.
Whilst on it, introduce activate_and_wipe_lv to share with
(lvconvert|lvchange).c.
Resolves: rhbz1633167
In RHEL7 we marked mirrored mirror logs as deprecated and
added a related message. This patch prohibits creating new
'mirror' LVs with that log type or converting existing LVs
to have one.
Existing LVs with mirrored mirror log can be activated
and converted to disk/core logs.
Avoid double deprecation message when running lvconvert.
Resolves: rhbz1643562
Patch 668c9d0762 introduced regression,
since the code here would actually always return failing result.
Replace it with more simple call to strndup().
For some targets we do not want to generate dependencies.
Also add note about usage of such Makefile - it might be
possibly better to rename it to different filename to avoid
any confusion.
commit de28637
scan: use full md filter when md 1.0 devices are present
missed the fact that md superblock version 0.90 also puts
metadata at the end of the device, so the full md filter
needs to be used when either 0.90 or 1.0 is present.
. When using default settings, this commit should change
nothing. The first PE continues to be placed at 1 MiB
resulting in a metadata area size of 1020 KiB (for
4K page sizes; slightly smaller for larger page sizes.)
. When default_data_alignment is disabled in lvm.conf,
align pe_start at 1 MiB, based on a default metadata area
size that adapts to the page size. Previously, disabling
this option would result in mda_size that was too small
for common use, and produced a 64 KiB aligned pe_start.
. Customized pe_start and mda_size values continue to be
set as before in lvm.conf and command line.
. Remove the configure option for setting default_data_alignment
at build time.
. Improve alignment related option descriptions.
. Add section about alignment to pvcreate man page.
Previously, DEFAULT_PVMETADATASIZE was 255 sectors.
However, the fact that the config setting named
"default_data_alignment" has a default value of 1 (MiB)
meant that DEFAULT_PVMETADATASIZE was having no effect.
The metadata area size is the space between the start of
the metadata area (page size offset from the start of the
device) and the first PE (1 MiB by default due to
default_data_alignment 1.) The result is a 1020 KiB metadata
area on machines with 4KiB page size (1024 KiB - 4 KiB),
and smaller on machines with larger page size.
If default_data_alignment was set to 0 (disabled), then
DEFAULT_PVMETADATASIZE 255 would take effect, and produce a
metadata area that was 188 KiB and pe_start of 192 KiB.
This was too small for common use.
This is fixed by making the default metadata area size a
computed value that matches the value produced by
default_data_alignment.
The pvscan systemd service for autoactivation was
mistakenly dropped along with the lvmetad related
services.
The activation generator program now looks at the new
lvm.conf setting "event_activation" (default 1) to
switch between event activation and direct activation.
Previously, the old use_lvmetad setting was used to
switch between event and direct activation.
instead of a separate --writecacheblocksize option.
writecache block_size is not technically a setting,
but it can borrow the option as a special case.
fix lseek error check
fix read/write error checks
handle zero return from read and write
don't return an error for short io
fix partial read/write loop
io_setup() for aio may fail if a system has reached the
aio request limit. In this case, fall back to using
sync io. Also, lvm use of aio can be disabled entirely
with config setting global/use_aio=0.
The system limit for aio requests can be seen from
/proc/sys/fs/aio-max-nr
The current usage of aio requests can be seen from
/proc/sys/fs/aio-nr
The system limit for aio requests can be increased by
setting fs.aio-max-nr using sysctl.
Also add last-byte limit to the sync io code.
For embeded reshaping operation require higher driver version.
(otherwise we get:
Converting vg/LV1 from raid6 (same as raid6_zr) is directly possible to the following layouts:
raid6_nc
raid6_nr
raid6_la_6
raid6_ls_6
raid6_ra_6
raid6_rs_6
raid6_n_6
lvmpolld ATM is not desingned to preserve interval checking
in the same way the 'lvconvert' tool is doing - so the passed
'-i 40' is not respected and lvmpolld autonomously checks
state of conversion and updates lvm2 metadata and dm tables
when needed.
So skip portion of test that relayed on this and preserve this logic
only for command line invocation and forking of polling process
where the interval of checking is under full control.
Although we primarily want to check externally used libdevmapper
library, out internal all.h is still keeping all symbols
as the original library has - so for simpler compilation keep
using this private copy for defining needed symbols.
Just for case ensure compiler is not able to optimize
memset() away for resources that are released.
This idea of using memory barrier is taken from openssl.
Other options would be to check for 'explicit_bzero' function.
When preparing ioctl buffer and flatting all parameters,
add table parameters only to ioctl that do process them.
Note: list of ioctl should be kept in sync with kernel code.
DM_DEVICE_CREATE with table is doing several ioctl operations,
however only some of then takes parameters.
Since _create_and_load_v4() reused already existing dm task from
DM_DEVICE_RELOAD it has also kept passing its table parameters
to DM_DEVICE_RESUME ioctl - but this ioctl is supposed to not take
any argument and thus there is no wiping of passed data - and
since kernel returns buffer and shortens dmi->data_size accordingly,
anything past returned data size remained uncleared in zfree()
function.
This has problem if the user used dm_task_secure_data (i.e. cryptsetup),
as in this case binary expact secured data are erased from main memory
after use, but they may have been left in place.
This patch is also closing the possible hole for error path,
which also reuse same dm task structure for DM_DEVICE_REMOVE.
Add a little wait loop - since lvconvert started background process
and we need to wait till this bg task initiate its work -
adding ~1s loop should give reasonable enough time to start mirroring.
This could be seen as continuation of
6cee8f1b06.
Some test maching with old udev system shows problem,
where udev 'jumps on' leg device after mirror target
releases its legs - since udev does not (in this old case) skips
such device from scanning - it opens device - and this prevent
leg device to be deactivated - effectively such device stays
'leaked' in DM table invisibly to lvm2 command.
So to 'combat' this issue - if the device has '_mimage' in its name,
the retry of deactivation is automatically assumed.
NOTE: wider impact is unexpected - as it's touching only old mirror
target which is nowadays replaced with 'raid'.
In case there will be some problem identified - probably both patches
should be reverted.
With older systems and udevs we don't have control over scanning of lvm2
internal devices - so far we set retry-removal only for top-level LVs,
but in occasional cases udev can be 'fast enough' to open device for
scanning and prevent removal of such device from DM table.
So to combat this case - try to pass 'retry' flag also for removal of
internal device so see how many races can go away with this simple
patch.
Note: patch is applied only to internal version of libdm so the external
API remains working in the old way for now.
Commit 813347cf84 added extra validation,
however in this particular we do want to trim suffix out so rather ignore
resulting error code here intentionaly.
If a single, standard LV is specified as the cache, use
it directly instead of converting it into a cache-pool
object with two separate LVs (for data and metadata).
With a single LV as the cache, lvm will use blocks at the
beginning for metadata, and the rest for data. Separate
dm linear devices are set up to point at the metadata and
data areas of the LV. These dm devs are given to the
dm-cache target to use.
The single LV cache cannot be resized without recreating it.
If the --poolmetadata option is used to specify an LV for
metadata, then a cache pool will be created (with separate
LVs for data and metadata.)
Usage:
$ lvcreate -n main -L 128M vg /dev/loop0
$ lvcreate -n fast -L 64M vg /dev/loop1
$ lvs -a vg
LV VG Attr LSize Type Devices
main vg -wi-a----- 128.00m linear /dev/loop0(0)
fast vg -wi-a----- 64.00m linear /dev/loop1(0)
$ lvconvert --type cache --cachepool fast vg/main
$ lvs -a vg
LV VG Attr LSize Origin Pool Type Devices
[fast] vg Cwi---C--- 64.00m linear /dev/loop1(0)
main vg Cwi---C--- 128.00m [main_corig] [fast] cache main_corig(0)
[main_corig] vg owi---C--- 128.00m linear /dev/loop0(0)
$ lvchange -ay vg/main
$ dmsetup ls
vg-fast_cdata (253:4)
vg-fast_cmeta (253:5)
vg-main_corig (253:6)
vg-main (253:24)
vg-fast (253:3)
$ dmsetup table
vg-fast_cdata: 0 98304 linear 253:3 32768
vg-fast_cmeta: 0 32768 linear 253:3 0
vg-main_corig: 0 262144 linear 7:0 2048
vg-main: 0 262144 cache 253:5 253:4 253:6 128 2 metadata2 writethrough mq 0
vg-fast: 0 131072 linear 7:1 2048
$ lvchange -an vg/min
$ lvconvert --splitcache vg/main
$ lvs -a vg
LV VG Attr LSize Type Devices
fast vg -wi------- 64.00m linear /dev/loop1(0)
main vg -wi------- 128.00m linear /dev/loop0(0)
Since the test is currently directly working with live directory,
which can be getting updates from system's udev - add wait
for settling so removal of all known PVs happens after that.
But still this has major influce on behavior of running system,
so the test should never be executed on a user used box.
Check that type is always defined, if not make it explicit internal
error (although logged as debug - so catched only with proper lvm.conf
setting).
This ensures later type being NULL can't be dereferenced with coredump.
When sanlock_release returns an error because of an i/o
timeout releasing the lease on disk, lvmlockd should just
consider the lock released. sanlock will continue trying
to release the lease on disk after the original request
times out.
The lvmlock LV size was not adjusted correctly for 512 vs 4K
sector sizes which influence the lease size used by sanlock.
When lvmlock was automatically extended, the zeroing through
bcache wasn't working.
The choice about sector size and lease align size is
now made by the sanlock user, in this case lvmlockd.
This will allow lvmlockd to use other lease sizes in
the future. This also prevents breakage if hosts
report different sector sizes, or the sector size
reported by a device changes.
Since the stats handle is neither bound nor listed before the
attempt to call dm_stats_get_nr_regions(), it will always return
zero: this prevents reporting of any dmstats regions on any
device.
Remove the dm_stats_get_nr_regions() check and instead rely on
the correct return status from dm_stats_populate() which only
returns 0 in the case that there are regions to inspect (and
which logs a specific error for all other cases).
Reported-by: Bryan Gurney <bgurney@redhat.com>
It doesn't make sense to test or warn about the region count until
the stats handle has been listed: at this point it may or may not
contain valid information (but is guaranteed to be correct after
the list).
lvm uses a bcache block size of 128K. A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated. How much depends on
alignments. When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.
This patch is a quick hack to prevent lvm from writing
past the end of the metadata area.
This reverts commit 16ae968d24.
We need to come up with a better fix, because we fall short
wiping all known signatures when not using the wipe_lv API.
lvm metadata writes, commits and activations are performed
for (newly) allocated RAID metadata SubLVs to wipe any preexisiting
data thus avoid false raid superblock positives on RaidLV activation.
This process can be interrupted by command or system crashs
thus leaving stale SubLVs in the lvm metadata as a problem.
Because we hold an exclusive lock in this metadata SubLV wiping
process, we can address this problem by avoiding aforementioned
commits/writes/activations altogether wiping the respective first
sector of the first physical extent allocated to any metadata SubLV
directly via the existing dev_set() API.
Succeeds all LVM RAID tests.
Related: rhbz1633167
When persistent_filter_create() fails, the existing passed filter
should be preserved, so it could be properly deleted on
error path - so new pfilter is assigned instead.
Thin plugin started to use configuble setting to allow to configure
usage of external scripts - however to read this value it needed to
execute internal command as dmeventd itself has no access to lvm.conf
and the API for dmeventd plugin has been kept stable.
The call of command itself was not normally 'a big issue' until users
started to use higher number of monitored LVs and execution of command
got stuck because other monitored resource already started to execute
some other lvm2 command and become blocked waiting on VG lock.
This scenario revealed necesity to somehow avoid calling lvm2 command
during resource registration - but this requires bigger changes - so
meanwhile this patch tries to minimize the possibility to hit this race
by obtaining any configurable setting just once - such patch is small
and covers majority of problem - yet better solution needs to be
introduced likely with bigger rework of dmeventd.
TODO: avoid blocking registration of resource with execution of lvm2
commands since those can get stuck waiting on mutexes.
%ghost should not be used for directories created by systemd-tmpfiles.
This may prevent package from working right after installation without
invoking systemd-tmpfiles.
See: https://pagure.io/packaging-committee/issue/439
Previously the size was limited by checking if the
old and new copies of the metadata overlapped.
This generally limited the size to about half of
the total space, but it could be larger given the
size differences between old and new. Now add a
direct check to limit the size to half the space.
Remove another instance of an invalid check for metadata
overflow during read. The previous instance was removed
in commit 5fb15b193.
This was checking for metadata that that overflowed the
circular disk metadata buffer during read, but such metadata
cannot be written, so it shouldn't be possible to find see.
Also, the check was incorrect and could trigger when there
was no overflow.
This fixes a problem in commit e6bb780d24, in which the
back compat handling for the old locking_type=4 was
incorrectly translated to mean the same thing as --readonly,
which prevented activation because activation uses an
exclusive vg lock. Previously, locking_type=4 allowed
activation.
If we see locking_type 4 in an old config, translate it to
the new combination of --readonly and --sysinit, which we
now define to mean the --readonly behavior with an exception
to allow activation.
The vg_write/vg_commit code was imprecise, uncommented, and
hard to understand. Rewrite it with clearer, cleaner code,
extensive comments, descriptions of how it works, and add
more info in debugging output.
The minor changes in behavior are to things that were
either incorrect or probably unintended:
- vg_write/vg_commit no longer check that the current vgname at
the start of the text metadata matches the vgname being written.
This has already been done at least twice by the time they are
called, and repeating it again against the same cached data has
no use.
- A fragment of old removed code had been left behind that checked
if the old unused alignment policy would wrap. It was still
being checked to decide if the metadata area was full, which
could possibly cause an incorrect full metadata failure.
- vg_remove now clears both the raw_locns in the mda_header that
point to committed metadata (raw_locn slot 0) and precommitted
metadata (raw_locn slot 1). Previously it fully cleared the
committed slot, and would only clear the offset field in the
precommitted slot if it saw a problem with the metadata in the
vg being removed.
- read_metadata_location_summary was wrongly comparing the number
of wrapped bytes with an offset to report an error about the
metadata being too large. This wrong check is removed, it
could have resulted in erroneous errors.
Commit 989626926c
introduced 2 new tests
lvconvert-raid-takeover-linear_to_raid4.sh and
lvconvert-raid-takeover-raid4_to_linear.sh
which involve raid reshaping.
Bump the checked dm-raid target version to 1.14.0
which has reshape kernel fixes to avoid test suite
runs to hang.
Bump target version to 1.14.0 which contains fixes
for reshape deadlock/corruption to allow tests to
run once the respective fixes show up in kernels.
Remove now superfluous multi-core checks.
Resolves: rhbz1501145
Related: rhbz1514539
Related: rhbz1586123
Related: rhbz1613039
Allow "lvconvert --type linear RaidLV" on a raid4 LV
providing convenient interim steps to convert to linear.
Add respective new test
lvconvert-raid-takeover-raid4_to_linear.sh
and
lvconvert-raid-takeover-linear_to_raid4.sh
for linear to raid4 once on it.
When converting from striped/raid0/raid0_meta
to raid6 with > 2 stripes, allow possible
direct conversion (to raid6_n_6).
In case of 2 stripes, first convert to raid5_n to restripe
to at least 3 data stripes (the raid6 minimum in lvm2) in
a second conversion before finally converting to raid6_n_6.
As before, raid6_n_6 then can be converted
to any other raid6 layout.
Enhance lvconvert-raid-takeover.sh to test the
2 stripes conversions to raid6.
Resolves: rhbz1624038
devices/scan_lvs (default 1) determines whether lvm
will scan LVs for layered PVs. The lvm behavior has
always been to scan LVs, but it's rare for LVs to have
layered PVs, and much more common for there to be many
LVs that substantially slow down scanning with no benefit.
This is implemented in the usable filter, and has the
same effect as listing all LVs in the global_filter.
Add "lvm2-activation-generator: " prefix for all kmsg messages written by
lvm2-activation-generator so we can identify the message in global system log.
We need to have Ceph RBD devices mapped first before use in a stack
where LVM is on top so make sure rbdmap.service is called before
generated lvm2-activation-net.service.
On shutdown, we need to stop blk-availability first before we stop the
rbdmap.service.
Resolves: rhbz1623479
This is the number of concurrent async io requests that
the scan layer will submit to the bcache layer. There
will be an open fd for each of these, so it is best to
keep this well below the default limit for max open files
(1024), otherwise lvm may get EMFILE from open(2) when
there are around 1024 devices to scan on the system.
"lvconvert --type linear RaidLV" on striped and raid4/5/6/10
have to provide the convenient interim layouts. Fix involves
a cleanup to the convenience type function.
As a result of testing, add missing sync waits to
lvconvert-raid-reshape-linear_to_raid6-single-type.sh.
Resolves: rhbz1447809
Conversion to striped from raid0/raid0_meta is directly possible.
Fix a regression setting superfluous interim raid5_n conversion type
introduced by commit bd7cdd0b09.
Add new test script lvconvert-raid0-striped.sh.
Resolves: rhbz1608067
With improved mirror activation code --splitmirror issue poppedup
since there was missing proper preload code and deactivation
for splitted mirror leg.
If a mirror LV is listed in read_only_volume_list, it would
still be activated rw. The activation would initially be
readonly, but the monitoring function would immediately
change it to rw. This was a regression from commit
fade45b1d1 mirror: improve table update
The monitoring function needs to copy the read_only setting
into the new set of mirror activation options it uses.
When vgcreate does an automatic pvcreate, it opens the
dev with O_EXCL to ensure no other subsystem is using
the device. This exclusive fd remained in bcache and
prevented activation parts of lvm from using the dev.
This appeared with vgcreate of a sanlock VG because of
the unique combination where the dev is not yet a PV,
so pvcreate is needed, and the vgcreate also creates
and activates an internal LV for sanlock.
Fix this by closing the exclusive fd after it's used
by pvcreate so that it won't interfere with other
bits of lvm that may try to use the device.
Conversions of LVs under snapshot to thinpool or cachepool
correctly fail but leave them inactive and provide cryptic
error messages like 'Internal error: #LVs (10) != #visible
LVs (2) + #snapshots (1) + #internal LVs (5) in VG VG'.
Reject and provide better error message.
Resolves: rhbz1514146
The 'lvconvert LV' command def has caused multiple problems
for command matching because it matches the required options
of any lvconvert command. Any lvconvert with incorrect options
ends up matching 'lvconvert LV', which then produces an error
about incorrect options being used for 'lvconvert LV'. This
prevents suggestions from nearest-command partial command matches.
Add a special case for 'lvconvert LV' so that it won't be used
as a partial match for a command that has options specified.
Native disk scanning is now both reduced and
async/parallel, which makes it comparable in
performance (and often faster) when compared
to lvm using lvmetad.
Autoactivation now uses local temp files to record
online PVs, and no longer requires lvmetad.
There should be no apparent command-level change
in behavior.
When lvmetad is not used, use temporary files to record
which PVs have appeared. Use these temp files to determine
when a VG is complete, to trigger autoactivation.
This change allows us to remove lvmetad while keeping the
same autoactivation behavior that lvmetad provides.
The temp files are created in /run/lvm/pvs_online/ and are
named for the PVID of the PV. The files contain the
major:minor of the device the PV was read from.
e.g. if VG foo has dev1 and dev2, then:
. pvscan --cache -aay dev1
reads vg metadata from dev1
creates /run/lvm/pvs_online/<pvid-of-dev1>
checks if all vg->pvs are online: no
. pvscan --cache -aay dev2
reads vg metadata from dev2
creates /run/lvm/pvs_online/<pvid-of-dev2>
checks if all vg->pvs are online: yes
autoactivates vg
A 'pvscan --cache dev' (without -aay) still records that
dev is online.
A 'pvscan --cache --major X --minor Y' after a device is
gone will remove the temp file for it.
A 'pvscan --cache [-aay]' (no devs) resets the state of
temp files by removing them all, then scanning all devs
and creating temp files for PVs that are found.
If no online files exist, the first pvscan --cache scans
all devs and creates temp files for any PVs found.
The scope of the temp files is only pvscan, and they are only
used for pvscan-based autoactivation. No other commands are
concerned with or aware of these temp files. When lvm creates
or removes PVs, no attempt is made to update the temp files.
Support vgchange usage with VDO segtype.
Also changing extent size need small update for vdo virtual extent.
TODO: API needs enhancements so it's not about adding ifs() everywhere.
When user create vdo-pool - use different automatic name.
So unlike with traditional LVs using lvol0, lvol1
use vpool0, vpool1...
TODO: apply similar for thin-pool & cache-pool...
Checks whether VDO support is enabled.
Detects presence of 'vdoformat' tool which is required for to format VDO pool.
ATM build of VDO is NOT automatically enabled (None is default).
To enable build of LVM with VDO support use:
configure --with-vdo=internal
TODO: Maybe future version may switch to link some small VDO library for formating
(would require linking and package dependency).
To support autoloading of VDO dm target driver loading of 'kvdo'
kernel module is needed - ATM it's not using 'dm-vdo' name.
So to support this strange name - add temporarily solution to
autoload kvdo kernel module in this case.
Introduce VDO plugin for monitoring VDO devices.
This plugin can be used also by other users, as plugin checks
for UUID prefix 'LVM-' and run lvm actions only on those
devices.
Non LVM- device are only monitored and log warnings
when usage threshold reaches 80%.
Update makefile to link with more libs since now whole liblvm-internal.a
is linked-in and this library has futher dependencies.
Avoid including deps for run-unit-test.
Drop linking separate status.c as it's already linked via internal libs.
Check allocation of thin-pool works on 2PVs, when one is so full,
that even metadata do not fit there (as they need at least 2M,
while 99% of 63MB fills >62MB)
When lvm2 command is executed in test mode, discard ioctl is skipped.
This may cause even data-loose in case, issuing discard for released
areas was enabled and user 'tested' lvreduce.
When allocating thin-pool with more then 1 device - try to
allocate 'metadataLV' with reuse of log-type allocation for mirror LV.
It should be naturally place on other device then 'dataLV'.
However due to somewhat hard to follow allocation logic code,
it's been rejected allocation in cases where there was not
enough space for data or metadata on single PV, thus to successed,
usage of segments was mandatory.
While user may use:
allocation/thin_pool_metadata_require_separate_pvs=1
to enforce separe meta and data LV - on default settings, this is not
enable thus segment allocation is meant to work.
NOTE:
As already said - the original intention of this whole 'if()' is unclear,
so try to split this test into multiple more simple tests that are more readable.
TODO: more validation.
When node loading fails, there is not much the caller can do,
since there is 'unknown' set of devices preloaded.
Only suspend during preload knows future precommitted 'metadata',
so it's non-trivial to drop 'preloaded' entries with any later call.
However dm tree tracks newly loaded entries - so in this case it
may simplify the recovery path by dropping preloaded entries so
they are not leaked in the DM table.
Allow creation of any virtual segment type with just --virtualsize
specified without any real extent size give.
TODO: likely --type error,zero might be later enhanced to use -V
(along with -L) - but since those targets do not allocate real
space, supporting -V makes sense with them.
Amound of linked libraries grows.
Most of them we don't need to lock in, since we are not using
them in locked section, so skip locking them in memory.
It's important to lock memory beforo running SUSPEND ioctl - but whole
lvm preload runs in memory unlocked environment - as in this phase
memory allocation is allowed and is meant to happen.
Once all targets are preload and ready (confirmed from all targets)
we start suspending tree - and here the memory allocation (or i.e.
opening files) is no longer allowed - as it may cause kernel deadlock.
Commit 3f35146 added a check on the value returned by the
_display_info_cols() function:
1024 if (!_switches[COLS_ARG])
1025 _display_info_long(dmt, &info);
1026 else
1027 r = _display_info_cols(dmt, &info);
1028
1029 return r;
This exposes a bug in the dmstats code in _display_info_cols:
the fact that a device has no regions is explicitly not an error
(and is documented as such in the code), but since the return
code is not changed before leaving the function it is now treated
as an error leading to:
# dmstats list
Command failed.
When no regions exist.
Set the return code to the correct value before returning.
For better code reuse split _node_send_messages into commont
messaging part and separate _thin_pool_node_send_messages.
Patch makes it possible to better reuse common code for messaging
other targets.
udev creates a train wreck of events if we open devices
with RDWR. Until we can fix/disable/scrap udev, work around
this by opening RDONLY and then closing/reopening RDWR when
a write is needed. This invalidates the bcache blocks for
the device before writing so it can trigger unnecessary
rereading.
It's no longer needed. Clustered VGs are now handled in
the same way as foreign VGs, and as shared VGs that
can't be accessed:
- A command processing all VGs sees a clustered VG,
prints a message ("Skipping clustered VG foo."),
skips it, and does not fail.
- A command where the clustered VG is explicitly
named on the command line, prints a message and fails.
"Cannot access clustered VG foo, see lvmlockd(8)."
The option is listed in the set of ignored options for
the commands that previously accepted it. (Removing it
entirely would cause commands/scripts to fail if they
set it.)
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
Since we are using "DefaultDependencies=no" we do not get automatic STOP
job on socket connection - so automatically refuse connection on
shutdown by adding this Conflict definition to socket Unit.
Shuffle code for better readability as set of conditions was
hard to follow.
Make it obvious the refresh & activate path is handling
monitoring and polling on its own.
So the only --monitor and --poll option needs explicit care.
Option --monitor without option --poll will now as a result
of this patch NOT start polling.
So command: vgchange --monitor n is no longer a polling starter.
Restoring polling for activated volumes lost with my recent commit:
75fed05d3e and move start of polling
directly into _activate_lvs_in_vg() - as there we know exactly
if there was some volume even activated.
Also make it sharing same code for pvscan -aay.
With the move to top-level makefile - there are some issues
with subdir recursive makefile.
Make the building more tolerant for now until fully resolved.
The previous method for forcibly changing a clustered VG
to a local VG involved using -cn and locking_type 0.
Since those options are deprecated, replace it with
the same command used for other forced lock type changes:
vgchange --locktype none --lockopt force.
Shared VGs will generally be started and activated by
the resource agent. Without the agent, this script doesn't
have a good way to know which LVs to activate.
vgreduce, vgremove and vgcfgrestore were acquiring
the orphan lock in the midst of command processing
instead of at the start of the command. (The orphan
lock moved to being acquired at the start of the
command back when pvcreate/vgcreate/vgextend were
reworked based on pvcreate_each_device.)
vgsplit also needed a small update to avoid reacquiring
a VG lock that it already held (for the new VG name).
A few places were calling a function to check if a
VG lock was held. The only place it was actually
needed is for pvcreate which wants to do its own
locking (and scanning) around process_each_pv.
The locking/scanning exceptions for pvcreate in
process_each_pv/vg_read can be enabled by just passing
a couple of flags instead of checking if the VG is
already locked. This also means that these special
cases won't be enabled unknowingly in other places
where they shouldn't be used.
When pvmoving LV - the target for LV is a mirror so the validation
that checked the type is matching was incorrect.
While we need a more generic enhancment of LVS output for pvmoved LVs,
for now at least stop showing internal errors and 'X' symbols in attrs.
The last commit related to this was incomplete:
"Implement lock-override options without locking type"
This is further reworking and reduction of the locking.[ch]
layer which handled all clustering, but is now only used
for file locking. The "locking types" that this layer
implemented were removed previously, leaving only the
standard file locking. (Some cluster-related artifacts
remain to be cleared out later.)
Command options to override or modify locking behavior
are reimplemented here without using the locking types.
Also, deprecated locking_type values are recognized,
and implemented as if one of the equivalent override
options was set.
Options that override file locking are:
. --nolocking disables all file locking.
. --readonly grants read lock requests without actually
taking a file lock, and refuses write lock requests.
. --ignorelockingfailure tries to set up file locks and
uses them normally if possible. When not possible, it
behaves like --readonly, but allows activation.
. --sysinit is the same as ignorelockingfailure.
. global/metadata_read_only acquires actual read file
locks, and refuses write lock requests.
(Some of these options could probably be deprecated
because they were added as workarounds to various
locking_type behaviors that are now deprecated.)
The locking_type setting now has one valid value: 1 which
refers to standard file locking. Configs that contain
deprecated values are recognized and still work in
largely the same way:
. 0 disabled all locking, now implemented like --nolocking
is set. Allow the nolocking option in all commands.
. 1 is the normal file locking setting and is unchanged.
. 2 was for external locking which was not used, and
reverts to normal file locking.
. 3 was for cluster/clvm. This reverts to normal file
locking, and prints messages about lvmlockd.
. 4 was equivalent to readonly, now implemented like
--readonly is set.
. 5 disabled all locking, now implemented like
--nolocking is set.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
The options: --nolocking, --readonly, --sysinit
override, or make exceptions to, the normal file locking
behavior. Implement these by just checking for the
options in the file locking path instead of using
special locking types.
Basic LV functions:
activate_lv(), deactivate_lv(),
suspend_lv(), resume_lv()
were routed through the locking infrastruture on the way to:
lv_activate_with_filter(), lv_deactivate(),
lv_suspend_if_active(), lv_resume_if_active()
This commit removes the locking infrastructure from the
middle and calls the later functions directly from the former.
There were a couple of ancillary steps that the locking
infrastructure added along the way which are still included:
- critical section inc/dec during suspend/resume
- checking for active component LVs during activate
The "activation" file lock (serializing activation) has not
been kept because activation commands have been changed to
take the VG file lock exclusively which makes the activation
lock unused and unnecessary.
Make activation commands:
vgchange -ay, lvchange -ay, pvscan -aay
take an exclusive file lock on the VG to serialize
multiple concurrent activation commands which could
otherwise interfere with each other.
Four commands lock two VGs at a time:
- vgsplit and vgmerge already have their own logic to
acquire the locks in the correct order.
- vgimportclone and vgrename disable this ordering check.
Different flavors of activate_lv() and lv_is_active()
which are meaningful in a clustered VG can be eliminated
and replaced with whatever that flavor already falls back
to in a local VG.
e.g. lv_is_active_exclusive_locally() is distinct from
lv_is_active() in a clustered VG, but in a local VG they
are equivalent. So, all instances of the variant are
replaced with the basic local equivalent.
For local VGs, the same behavior remains as before.
For shared VGs, lvmlockd was written with the explicit
requirement of local behavior from these functions
(lvmlockd requires locking_type 1), so the behavior
in shared VGs also remains the same.
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
Add tests for linear <-> striped|raid* conversions.
Add region_size config to reshape tests to avoid test
failures in case of it being defined unexpectedly in lvm.conf.
Related: rhbz1439925
Related: rhbz1447809
"lvconvert --type {linear|striped|raid*} ..." on a striped/linear
LV provides convenience interim type to convert to the requested
final layout similar to the given raid* <-> raid* conveninece types.
Whilst on it, add missing raid5_n convenince type from raid5* to raid10.
Resolves: rhbz1439925
Resolves: rhbz1447809
Resolves: rhbz1573255
Fixes breakage from the recent libdm split. Though these didn't
ever appear to be linked (could they have piggy backed from libdevmapper.so
being linked to them?).
In this command, lvcreate creates a new LV and then combines
it with an existing cache pool, producing a cache LV. This
command was previously not allowed in in a shared VG.
When the lvmlockd lock is shared, upgrade it to ex
when repair (writing) is needed during vg_read.
Pass the lockd state through additional read-related
functions so the instances of repair scattered through
vg_read can be handled.
(Temporary solution until the ad hoc repairs can be
pulled out of vg_read into a top level, centralized
repair function.)
It's not an error if a command requests the global lock
when it has already acquired it. It shouldn't happen,
but there could be cases we've not found.
We have been warning about duplicate devices (and disabling lvmetad)
immediately when the dup was detected (during label_scan). Move the
warnings (and the disabling) to happen later, after label_scan is
finished.
This lets us avoid an unwanted warning message about duplicates
in the special case were md components are eliminated during the
duplicate device resolution.
The report uses process_each_vg() which populates lvmcache
based on a VG list from lvmetad. If there are no VGs,
but only orphan PVs, the orphans are not shown. Add an
explicit call to populate lvmcache with PV info from lvmetad.
This minor patch fixes grammar in a few messages which get
printed to users. It also fixes the same grammar mistake in
several comments.
Signed-off-by: Rick Elrod <relrod@redhat.com>
--
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
No point to start building lvm without this header file.
Although there could be 'some point' in supporting standalone build
of 'just' libdm where the libaio might be avoided.
TODO: think about configure option for building libdm only.
Unfortunatelly on kernels <4.16 lvm2 can't user brd ramdisks
for backend device as number of test is failing with this kernel
message:
device-mapper: ioctl: can't change device type after initial table load.
caused by DAX request-based handling, and lvm2 tries to replace device
with backend 'error' bio-based device and such table reload is being
rejected.
So ATM keep ramdisk only on most recent kernel to experiment a bit,
for older machines just stay safe and keep old slower loop backend.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
While newer system can detect need for 4K mkfs, on older test machines
running test suite over 4k is reporting problems.
Some more generic solution is needed thought.
When test happens to run in tmpfs, it cannot use O_DIRECT (unsupported
with tmpfs).
CHECKME: unsure if detection of tmpfs is 'valid' but kind of works and
is very simple.
As Makefiles already do use target with name 'device-mapper'
rename this new device-mapper dir to non-conflicting name.
We also seem to already use '_' in other dir names.
Also rename device_mapper/Makefile to source for generating Makefile.in
so we can use it for build in other source dirs properly.
It's very hard to use some 'non-recurive' Makefiles with
rest of system running 'recursively'.
So ATM drop inclusion of subdir makefile and add support
for 2 new top-level targets:
unit-test (builds test/unit dir)
run-unit-test (build & run test/unit/unit-test run)
Just like 52656c89fd
when now cache is compiled in 'unditionally'.
This patch is actually enforce by changes in
commit: 2bc896f2a3
where CACHE value is not set anymore.
If the test does not need root, it can use 'SKIP_ROOT_DM_CHECK'.
For such test no actions needed root to initilize DM devices and
nodes will be take and test can check i.e. functional unit tests.
Usage of dm_delay looks to be slowing not just 'delayed' portion
of device, but due to the fact it's also slows down ANY flush
operation on such device it's overal speed impact is huge.
In some case we can however user other methods to slowdown disk writes,
in case of old dm 'mirror' target we can throttle I/O of mirror
synchronisation giving the next commands enough time to test couple
race conditions.
Usage:
throttle_dm_mirror [percentage]
Thtrottle down sync speed (lowest is '1' which is also default when
unspecified)
restore_dm_mirror
Restores the value of throttling before call of 'throttle_dm_mirror'
Usually it should '100'
Currently usage of loop device over backend file in ramdisk (tmpfs)
is actually causing unnecassary memory consution, since just
reading such loop device is causing RAM provisioning.
This patch add another possible way how to use ramdisk directly
through 'brd' device when possible (and allowed).
This however has it's limitation as well - brd does not support
TRIM, so the only way how to erase is to remove brd module ??
Alse there is 4K sector size limitation imposed by ramdisk.
Anyway - for some mirror test that were using large amount of
disk space (tens of MB) this brings noticable speed boost.
(But could be worth to solve the slowness of loop in kernel?)
To prevent using 'brd' for testing set LVM_TEST_PREFER_BRD=0
like this:
make check_local LVM_TEST_PREFER_BRD=0
This test can't use brd (ramdisk) as backend since for some
weird reason lsblk is not listing these device.
TODO: test could be probably rewritten to avoid using lsblk somehow??
When the backend device supports only 4K blocks (like ramdisk)
we cannot use for testing any smaller blocksize.
So recalc test for 4K extent size.
We may possibly introduce one list extra test that
can be executed on devices with 512b sectors to
check lvm2 support those min extent sizes...
ATM it's a bit ugly to enforce flushing of 'stdio' here, but works as quick
hot-fix.
log_print*() is using buffered I/O.
But for pooling with typical 1s interval this may take a while before
buffer about continues progress gets flushed.
So ATM fflush().
TODO: either add log_print*_with_flush() or maybe directly use just
line buffering with log_print() and only log_debug() keep using buffered
I/O mode.
md devices using an older superblock version have
superblocks at the end of the md device. For commands
that skip reading the end of devices during filtering,
the md component devs will be scanned, and will appear
as duplicate PVs to the original md device. Remove
these md components from the list of unused duplicate
devices, so they are treated as if they had been
ignored during filtering. This avoids the restrictions
that are placed on using PVs with duplicates.
All these functions are now used as utilities,
e.g. for ioctl (not for io), and need to
open/close the device each time they are called.
(Many of the opens can probably be eliminated by
just using the bcache fd for the ioctl.)
with the --labelsector option. We probably don't
need all this code to support any value for this
option; it's unclear how, when, why it would be
used.
An implementation of an adaptive radix tree. Has the following nice
properties:
- At least as fast as the hash table
- Uses less memory
- You don't need to give an expected size when you create
- It scales nicely (ie. no large reallocations like the hash table).
- You can iterate the keys in lexicographical order.
Only insert and lookup are implemented so far. Plus there's a lot
more performance to come.
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
This reverts commit 0931067dc5.
The dep files should be in the build dir, which is not necc. the src dir.
Easy to fix, but reverting for now until I have time to revisit.
The clvmd saved_vg data is independent from the normal lvm
lvmcache vginfo data, so separate saved_vg from vginfo.
Normal lvm doesn't need to use save_vg at all, and in clvmd,
lvmcache changes on vginfo can be made without worrying
about unwanted effects on saved_vg.
To avoid the chance of freeing a saved vg while another
code path is using it, defer freeing saved vgs until
all the lvmcache content is dropped for the vg.
In case "lvconvert -mN RaidLV" was used on a degraded
raid1 LV, success was returned instead of an error.
Provide message to inform about the need to repair first
before changing number of mirrors and exit with error.
Add new lvconvert-m-raid1-degraded.sh test.
Resolves: rhbz1573960
I don't like having this in a common header because it means you end
up including too much and causing unneccessary dependencies. eg,
lib/misc/lib.h includes libdevmapper.h, internationalisation, and
logging stuff.
There are likely more bits of code that can be removed,
e.g. lvm1/pool-specific bits of code that were identified
using FMT flags.
The vgconvert command can likely be reduced further.
The lvm1-specific config settings should probably have
some other fields set for proper deprecation.
The mixed up vg repair code in vg_read was trying
to repair a vg when vg_read was called by clvmd.
The clvmd daemon isn't supposed to be repairing
or writing a vg.
(This is a temporary workaround; vg repair will soon
be pulled out of vg_read so it can be called in a
controlled way and consolidated instead of spread
around.)
Shift refresh of mirror table right into monitor_dev_for_events().
Use !vg_write_lock_held() to recognize use of lvchange/vgchange.
(this shall change if this would no longer work, but requires
futher some API changes).
With this patch dm mirror table is only refreshed when necassary.
Also update WARNING message about mirror usage without monitoring
and display LV name.
When 'dmsetup' reports result with --nameprefixes it currently
incorrectly 'escapes' problematic characters.
Letting pass such string though shell 'eval' function is hard task.
So instead cut away substring.
Once dmsetup will start to properly escape backslash and apostrophe
this function may need further tuning.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
External build dir now works too.
Git handles symlinks, tar handles symlinks. So I've just put the
links themselves into git.
This simplifies dependencies a little, and stop some build loops I was
hitting.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
The filters save information about devices that should
be ignored, so if we need to repeat a scan (unusual,
but happens in clvmd), we need to update the filters.
When pvmove was run in background mode and forks
instead of using lvmpolld, the child pvmove process
was not clearing the bcache from the parent, so all
the aio ops in the child were failing.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
The copy of the VG which clvmd stashes in lvmcache should
not only be used between suspend and resume, but between
sequential LV operations in clvmd, so that clvmd does not
need to reread the VG for each one. Prepare for that by
renaming the stashed VG as "saved_vg".
Instead of using delayer device user 'zero' device and let mirror
do some real work which takes some time.
In case the test machine is too fast - mirror might need to be made bigger
to meet needed criteria.
Also move all test needed this 'zero' PV trick to the end of test
so $dev2 and $dev4 are covered with 'zero' and can take any amount of
write without consuming any real space.
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Don't allow writes in test mode. test mode should be
more sophisticated than just faking writes, and this
should be a last defense for cases where test mode is
not being checked correctly.
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
In some odd cases (e.g. tests) there are very few devices
which results in creating too few blocks in bcache, so
create bcache with a minimum number of blocks.
When a PV is stacked on an LV, the LV will be kept in
bcache, and the open fd on the LV may interfere with
processing the LV. So, drop/close a bcache fd for
an LV before processing the LV.
Commands using lvmetad will not begin with a proper
label_scan which initializes bcache, but may later
decide they need to scan a set of devs, in which case
they'll need bcache set up at that point.
The improved detection of bad metadata when scanning
(where errors were ignored before) means we now have to
override some errors when forcibly erasing damaged metadata.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
This is a temporary hacky workaround to the problem of
reads going through bcache and writes not using bcache.
The write path wants to read parts of data that it is
incrementally writing to disk, but the reads (using
bcache) don't work because the writes are not in the
bcache. For now, add a dev to bcache before each attempt
to read it in case it's being used on the write path.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
In the same way as the other process_each functions.
In the common case all the info that's needed can be
used from lvmcache after a label scan. But this means
that unchosen devs for duplicate PVs need to be handled
explicitly.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
The old code was doing unnecessary label scans when
checking to see if the new VG name exists. A single
label_scan is sufficient if it is done after the
new VG lock is held.
When lvmlockd indicates that the lvmetad cache is out of
date because of changes by another node, lvmetad_pvscan_vg()
rescans the devices in the VG to update lvmetad. Use the
new label_scan in this function to use the common code and
take advantage of the new aio and reduced reads.
This fixes the use of lvmcache_label_rescan_vg() in the previous
commit for the special case of independent metadata areas.
label scan is about discovering VG name to device associations
using information from disks, but devices in VGs with
independent metadata areas have no information on disk, so
the label scan does nothing for these VGs/devices.
With independent metadata areas, only the VG metadata found
in files is used. This metadata is found and read in
vg_read in the processing phase.
lvmcache_label_rescan_vg() drops lvmcache info for the VG devices
before repeating the label scan on them. In the case of
independent metadata areas, there is no metadata on devices, so the
label scan of the devices will find nothing, so will not recreate
the necessary vginfo/info data in lvmcache for the VG. Fix this
by setting a flag in the lvmcache vginfo struct indicating that
the VG uses independent metadata areas, and label rescanning should
be skipped.
In the case of independent metadata areas, it is the metadata
processing in the vg_read phase that sets up the lvmcache
vginfo/info information, and label scan has no role.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When user configured lvm2 to NOT user monitoring, activated mirror
actually hang upon error and it's quite unusable moment.
So instead Warn those 'brave' non-monitoring users about possible
problem and activation mirror without blocking error handling.
This also makes it a bit simpler for test suite to handle trouble
cases when test is running without dmeventd.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Whenever we make visible LV out of previously invisible one,
reload it's table - the is mandator for proper udev rule
processing as well as ensure content of dm table is correct.
TODO: this new generic rule probably make extra raid rules unnecessary.
Fixing regresion on argument acceptance where any lv can be passed
with paramaterless lvconvert which is meant to figure out needed
operation - i.e. wait for mirror synchronization.
User has no other 'effective' method to wait for mirror getting in-sync.
Since we support snapshot of mirrors, we do need to properly check
for stacked lock holder - fixes problem of pvmove in cluster
with mirrors under snapshot.
WHATS_NEW for this patch goes with 'Restore pvmove support...'
The current logic that avoids setting SYSTEMD_ALIAS and SYSTEMD_WANTS
on "change" events is flawed in the default "systemd background job"
configuration. For systemd, it's important that device properties don't
change spuriously.
If an "add" event starts lvm2-pvscan@.service for a device, and a
"change" event follows, removing SYSTEMD_ALIAS and SYSTEMD_WANTS from the
udev db, information about unit dependencies between the device and the
pvscan service can be lost in systemd, in particular if the daemon
configuration is reloaded.
Steps to reproduce problem:
- create a device with an LVM PV
- remove device
- add device (generates "add" and "change" uevents for the device)
(at this point SYSTEMD_ALIAS and SYSTEMD_WANTS are clear in udev db)
- systemctl daemon-reload
(systemd reloads udev db)
- vgchange -a n
- remove device
=> the lvm2-pvscan@.service for the device is still active although the
device is gone.
- add device again
=> the PV is not detected, because systemd sees the lvm2-pvscan@.service
as active and thus doesn't restart it.
The original purpose of this logic was to avoid volumes being scanned
over and over again. With systemd background jobs, that isn't necessary,
because systemd will not restart the job as long as it's active.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Make the distinction between the cases with and without systemd
background jobs explicit in 69-dm-lvm-metad.rules rather than
substituting the rule from the Makefile. At this stage,
this improves only readibility, at the cost of one GOTO statement.
This patch introduces no functional change to the udev rules.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Test that no (Sub)LV remnants persist if the volume group is
not listed in configuration variable activation/volume_list,
hence not activatable thus causing initialization of rmeta
SubLVs to fail.
Related: rhbz1161347
btrfs is using fake major:minor device numbers.
try to be smarter and detect used node via DM device name.
This shortens delays, where i.e. lvm2 is asked to deactivate
volume with mounted btrfs as such operation is not retryed
and user is informed about device being in use.
Correct testing with format 1 and mq policy.
Add testing of 'smq'
Fix testing with clvmd - where logged message is part of clvmd log
and we can only check command status.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
Use 4K chunks since some older kernels are not capable
to create striped volumes with smaller size.
TODO: lvm2 should detect this ahead and avoid kernel
reporting "Invalid chunk".
Since this function is called with 'fd == -1', but Coverity can't see
this path can't be visited with this argument, add explicit check for
valid descriptor.
If the tools for checking thin_pool or cache metadata are missing,
issue rather just a WARNING, but let the operation of activation
continue.
This has the advantage, the if user is missing those tools,
but he already started to use thinpool or cacheing, he can
access these volumes with a WARNING.
Also if the user is using too old tools i.e. for CacheV2 format
dmpd tool 0.7 is required - provide informative WARNING and
skip failure from older tool version which can't understand
new format V2.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
While prioritized_section() based on raised priority works
nicely for standard lvm comman - separate counter is actually needed
when it's used in daemons like clvmd/dmeventd where priority
stays raised all the time.
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Introduce prioritized_section() as a closer match to previous logic
of critical_section() that has been held over longer sequence of
ioctl commands - essentially it's matching operation on a single
cookie.
While 'critical_section()' now corresponds to locked memory - we hold
this memory only between suspend/resume thus notion of 'cookie' was
lost.
This patch restores some logic unintentionaly lost with dropping
memory locking for just activation/deactivation calls.
When function dm_stats_populate() returns 0 it's an error and needs
log_error() message - function can't have 'success' returning 0 or
error without reasons.
Prevent call of dm_stats_populate(), when there has been no
stats region detected for a DM device.
Such skip is evaluated as 'correct' visit of stats call and
not causing 'dmstats' command failure.
Resulting loop table line was streamed to 'stderr' stream - assuming this
was not a feature when user used '-v' for more verbose output
and properly show it via 'log_verbose()' on 'stdout'.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
With problematic kernels raid devices can be occasionaly left with
'frozen' status - try to 'unfreeze' them with idle message on teardown.
Also replace couple greps with 'built-in' dmsetup --select feature.
Note: dmsetup --select currently reports 'No devices found' on stdout
and return success - looks like a bug to fix.
Just like lvm2 has internal devices like _tdata which is using UUID with
suffix, there is similar private type of device for crypto device where
they are using CRYPT-TEMP uuid prefix.
Also ignore stratis.
Some kernel version suffer from bad state transition where a device
steps into 'frozen' mode. Any application that tries to read such
raid gets unfortunatelly bloked.
As some sort of protection try to skip such raid device from being
scanned to minimize chances to block lvm2 command on such scan.
When such device is found, warning gets printed.
RaidLVs on read_only_volume_list have their SubLVs
activated readonly thus disabling metadata updates
or image resynchronization/recovery. Bug also causes
automatic repairs to fail.
Fix by always activating the RAID SubLVs readwrite.
Resolves: rhbz1208269
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
When snapshot is created in read-only mode with 'lvcreate -s -pr...',
lvm2 still needs to be able to write to layered -cow volume
to store metadata and exceptions blocks.
TODO: in some case we might be able to do full tree with read-only
volume but this probably needs futher validation:
1. checking snapshot header already exist
2. origin & snapshot are both in read-only mode.
Occasionaly users may need to peek into 'component devices.
Normally lvm2 does not let users activation component.
This patch adds special mode where user can activate
component LV in a 'read-only' mode i.e.:
lvchange -ay vg/pool_tdata
All devices can be deactivated with:
lvchange -an vg | vgchange -an....
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
When dmeventd thin plugin forks a configurable script, switch to use
execvp to pass whole environment present to dmeventd - so all configured
paths present at dmeventd startup are visible to script.
This was likely not a problem for common user enviroment,
however in test suite case variable like LVM_SYSTEM_DIR were
not actually used from test itself but rather from
a system present lvm.conf and this may have cause strange
behavior of a testing script.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
This is somewhat tricky - for test suite we keep using
'set -e -o pipefail' - the effect here is - we get error report
from any 'failing' command in whole pipeline - thus when something
like this: 'lvs | head -1' is used - and 'head' finishes before
lead 'lvs' is done - it recieves SIGPIPE and exits with error,
and somewhat misleading gets occasionally reported depending
of speed of commands.
For this case we have to avoid using standard pipes and rather
switch to using streamed results with temporary output file.
This is all nicely handled with bash feature '< <()'.
For more info:
https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail
While 'file-locking' code always dropped cached VG before
lock was taken - other locking types actually missed this.
So while the cache dropping has been implement for i.e. clvmd,
actually running command in cluster keept using cache even
when the lock has been i.e. dropped and taken again.
This rather 'hard-to-hit' error was noticable in some
tests running in cluster where content of PV has been
changed (metadata-balance.sh)
Fix the code by moving cache dropping directly lock_vol() function.
TODO: it's kind of strange we should ever need drop_cached_metadata()
used in several places - this all should happen automatically
this some futher thinking here is likely needed.
Improve pvmove to accept 'locally' active LVs together with
exclusive active LVs.
In the 1st. phase it now recognizes whether exclusive pvmove is needed.
For this case only 'exclusively' or 'locally-only without remote
activative state' LVs are acceptable and all others are skipped.
During build-up of pvmove 'activation' steps are taken, so if
there is any problem we can now 'skip' LVs from pvmove operation
rather then giving-up whole pvmove operation.
Also when pvmove is restarted, recognize need of exclusive pvmove,
and use it whenever there is LV, that require exclusive activation.
So this is a bit more complex and possibly worth futher checking.
ATM clvmd drops cmd->mem mempool AFTER refresh of cmd.
So anything allocating from cmd->mem during toolcontext init
will likely die at some point in time.
As a quick fix - just use regular malloc/free for 'dso' alloction.
It's worth to note - cmd->libmem seems to be often misused
causing hidden memleaking for clvmd.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.
Avoid using same return code for reporting 2 different things
and stricly report error code by return value and add new
parameter for reporting monitoring status.
This makes easier to recognize which error we got from dm_event
and continue only with ENOENT.
Check if the generated vg name still fits the buffer.
So too long strings are rejected.
Drop -1 from size passed to snprintf - as the \0 is already included.
On occasional gcc releases it's better to specify also -ldevmapper
to linking logic for python object.
It's in fact more correct since the liblvm.c code is using
libdevmapper functions - that were linked in only via
liblvm2app library.
This partially reverts commit da37cbd24f.
As the _cmdline structure use mempool for allocated ellement
that is being release on cmd_context close.
Before the better fix is made - restore previous logic and
reinitialize cmd structures again for new cmd_context.
Problem can be hit with e.g. this test run:
make check_local T=foreign LVM_VALGRIND_DMEVENTD=1
Invalid read of size 1
at 0x4C31C83: strcmp (vg_replace_strmem.c:846)
by 0x6BA0939: _find_command (lvmcmdline.c:1555)
by 0x6BA4304: lvm_run_command (lvmcmdline.c:2810)
by 0x6BD5E02: lvm2_run (lvmcmdlib.c:91)
by 0x685607E: dmeventd_lvm2_run (dmeventd_lvm.c:118)
by 0x6652684: _use_policy (dmeventd_thin.c:117)
by 0x6652E56: process_event (dmeventd_thin.c:298)
by 0x10CC5A: _do_process_event (dmeventd.c:945)
by 0x10CF83: _monitor_thread (dmeventd.c:1033)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Address 0x6266270 is 4,352 bytes inside a block of size 8,192 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x5289142: dm_free_wrapper (dbg_malloc.c:393)
by 0x528998A: _free_chunk (pool-fast.c:318)
by 0x52892A6: dm_pool_destroy (pool-fast.c:78)
by 0x6A8E52C: destroy_toolcontext (toolcontext.c:2254)
by 0x6BA5BD6: lvm_fin (lvmcmdline.c:3327)
by 0x6BD5EA7: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x5288F46: dm_malloc_aux (dbg_malloc.c:287)
by 0x52890AC: dm_malloc_wrapper (dbg_malloc.c:371)
by 0x52898E6: _new_chunk (pool-fast.c:286)
by 0x52893BA: dm_pool_alloc_aligned (pool-fast.c:106)
by 0x5289310: dm_pool_alloc (pool-fast.c:90)
by 0x6A8A21A: _load_config_file (toolcontext.c:808)
by 0x6A8A3D9: _init_lvm_conf (toolcontext.c:842)
by 0x6A8D3BD: create_toolcontext (toolcontext.c:1941)
by 0x6BA5B24: init_lvm (lvmcmdline.c:3308)
by 0x6BD5B7C: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5EB8: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
With pthreaded daemons like 'dmeventd' using liblvm via plugin,
lvm2 actually should not 'play' with streams at all - as there
could be parallel outputs running.
As a current quick workaround just disable change for pthreaded
program (gettid() != getpid()).
TODO: it's possible the change of buffering actually doesn't serve us
any measurable benefit and could be dropped as whole later...
Meanwhile this patch is fixing this occasional valgrind race report:
Invalid read of size 4
at 0x571892C: vfprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x57216B3: fprintf (in /usr/lib64/libc-2.26.9000.so)
by 0x5042886: dm_event_log (libdevmapper-event.c:925)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Address 0x6264d30 is 192 bytes inside a block of size 552 free'd
at 0x4C2ED68: free (vg_replace_malloc.c:530)
by 0x573907D: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5C00: reopen_standard_stream (log.c:189)
by 0x6A8E62C: destroy_toolcontext (toolcontext.c:2271)
by 0x6BA5C22: lvm_fin (lvmcmdline.c:3339)
by 0x6BD5EF3: lvm2_exit (lvmcmdlib.c:123)
by 0x6856013: dmeventd_lvm2_exit (dmeventd_lvm.c:103)
by 0x66535B8: unregister_device (dmeventd_thin.c:432)
by 0x10CBBC: _do_unregister_device (dmeventd.c:926)
by 0x10CD74: _monitor_unregister (dmeventd.c:979)
by 0x10D094: _monitor_thread (dmeventd.c:1066)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
Block was alloc'd at
at 0x4C2DBBB: malloc (vg_replace_malloc.c:299)
by 0x573932B: fdopen@@GLIBC_2.2.5 (in /usr/lib64/libc-2.26.9000.so)
by 0x6AC5DC2: reopen_standard_stream (log.c:200)
by 0x6A8D11D: create_toolcontext (toolcontext.c:1898)
by 0x6BA5B6B: init_lvm (lvmcmdline.c:3319)
by 0x6BD5BC8: cmdlib_lvm2_init (lvmcmdlib.c:34)
by 0x6BD5F04: lvm2_init (lvm2cmd.c:20)
by 0x6855EA7: dmeventd_lvm2_init (dmeventd_lvm.c:67)
by 0x665305F: register_device (dmeventd_thin.c:352)
by 0x10CB7A: _do_register_device (dmeventd.c:916)
by 0x10CEE4: _monitor_thread (dmeventd.c:1006)
by 0x54B35E0: start_thread (in /usr/lib64/libpthread-2.26.9000.so)
by 0x57C30EE: clone (in /usr/lib64/libc-2.26.9000.so)
....
Process terminating with default action of signal 6 (SIGABRT): dumping core
at 0x570016B: raise (in /usr/lib64/libc-2.26.9000.so)
by 0x5701520: abort (in /usr/lib64/libc-2.26.9000.so)
by 0x57437D8: __libc_message (in /usr/lib64/libc-2.26.9000.so)
by 0x5743831: __libc_fatal (in /usr/lib64/libc-2.26.9000.so)
by 0x5744056: _IO_vtable_check (in /usr/lib64/libc-2.26.9000.so)
by 0x574751C: __overflow (in /usr/lib64/libc-2.26.9000.so)
by 0x574191A: fputc (in /usr/lib64/libc-2.26.9000.so)
by 0x50428E3: dm_event_log (libdevmapper-event.c:934)
by 0x10B015: _dmeventd_log (dmeventd.c:125)
by 0x10D289: _unregister_for_event (dmeventd.c:1146)
by 0x10E52E: _handle_request (dmeventd.c:1583)
by 0x10E6D7: _do_process_request (dmeventd.c:1631)
by 0x10E7C6: _process_request (dmeventd.c:1660)
by 0x1101A4: main (dmeventd.c:2285)
Some tools are typically installed into /usr/sbin (or /sbin) dir.
And some systems do not add this path to user's $PATH var.
Ensure sbin paths are looked through...
In fact pvmove does support 'clustered-core' target for clustered
pvmove of LVs activated on multiple nodes.
This patch restores support for activation of pvmove on all nodes
for LVs that are also activate on all nodes.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
Add protectional internall error whenever we spot activation
of 'exclusive' only segments in 'non-exclusive' mode.
TODO: possibly the activation locking could be enhanced to handle
this fully behind the scene - as for now this works purely for
lvchange/vgchange activation.
Use properly exclusive activation when reactivating origin after
snapshot merge (since origin must have been previously also exlusively
activated).
Same applies when converting volumes to thin-pool or cache.
Previously used 'only' local activation incorrectly allowed local
activation of some targets (i.e. raid) - thus 'leaking' chance to
activate same device on another node - which can be a problem
for device types like raid.
No longer use the external 'result' pointer internally to set up the
cached label. The callback _set_label_read_result() is now given the
internal label pointer directly
Callers that don't need the result are no longer required to pass a
label pointer into label_read().
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Dedicated functions are now used to process each piece of data obtained,
so the refactoring in this file gives us one for the vgsummary and one
for the metadata header. This new type of function takes two parameters
(for now), the obtained data plus a single struct (that must not
reference any data on the stack) that wraps up the entire context needed
to process it.
Sleep a bit before checking /sys/block dir so the kernel has a moment to
actually put scsi debug device in it...
Some quite old kernels are in troubles with this plain searching grep
without sleep (namely 2.6.32)
modprobe scsi_debug
<sleep .1>
grep -H scsi_debug /sys/block/*/device/model
modprobe -r scsi_debug
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
The creation of wrapped around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
Add three new raid tests with io load and table
reloads during reshape for target 1.13.2.
Add a raid0 to raid10 conversion test.
Also add more signals to trap in lvconvert-raid-reshape-load.sh.
If there is sufficient space in the metadata area, align the next
metadata to a disk offset that is a multiple of 4096 bytes and
don't write it circularly. If it doesn't all fit at the end
of the metadata area, go back to the start and write it all there
contiguously.
If there is insufficient space to use the new stricter rules, revert to
the original behaviour, aligning on 512-byte boundaries wrapping around
the circular buffer as required.
Even after writing some metadata encountered problems, some commands
continue (rightly or wrongly) and attempt to make further changes.
Once an mda is marked MDA_FAILED, don't try to use it again.
This also applies when reverting, where one loop already skips
failed mdas but the other doesn't.
This fixes some device open_count warnings on relevant failure paths.
lvmdbusd was started, but the process was not recognized by pgrep.
- configure does not make the script executable - set the flag
explicitly when running make check,
- process name changed to lvmdbusd. The previous python3 value
originated from the use of /usr/bin/env.
Use new ALIGN_ABSOLUTE macro when calculating the start location
of new metadata and adjust the end of buffer detection so that
there is no longer an imposed gap between old and new metadata.
Currently both start and offset should always be divisible by alignment,
so this should have no effect, but a later patch will increase alignment
so these variables can no longer be optimised out.
lvmdbusd executable script must use python3 interpreter detected by
configure script, as site-packages directory used for library is only
used by that interpreter.
Although it doesn't look like it can be a measurable problem
and costs some time to flip priorities outside of activation window.
So just like with memory locking preserve priority until call
memlock_unlock() appears.
(addition to commit c086dfadc3).
Expand out the metadata wrapping calculations to prepare
to support a larger alignment.
The current alignment is 512 bytes so
(mdac_area_start + rlocn->offset) % alignment is zero.
When doing resume, directly pass location where new updated info
needs to be stored.
_resume_node() ensures the info is ONLY updated when the function
is successful and never changes it on error path.
Update the logic towards more explicit logic.
Preload tree normally does not want to resume, only
in certain cases of extension or new loaded nodes can be
resumed. So introduce new internal variable delay_resume_if_extended
controlable by target.
Patch itself is not changing current existing behaviour,
and rather documents existing problem in more readable way.
lvm2 needs to introduce explicit mechanism how to support more
fain-grained (and safe) logic to i.e. resize thin-pool which
can be sitting on cached raid volume.
Variable props.send_messages has 3 states and was not used properly
here. Activation in this moment does not need to verify thin-pool status
as that has been already checked on preload.
So only if there are some real messages (value 2) call function
for sending them.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
In case of failed legs, raid replaces those with
e.g. "vg-lv_rimage_0-missing_0_0" mapped to an error target.
Those errouneously remain on deactivation.
Fix by removing them on deactivation/removal of the RaidLV.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
Code already has dereferenced UUID before this point,
and its already given we require name & uuid when ading new node
(although uuid could be empty string).
Just like everywhere else - use single if() for major:minor setup
(it basically can't fail as of today anyway)
Always leave funtion with correctly set pointers even on error path.
Replicator never really existed in upstream kernel and its support
got deprecated.
Also its support never got finished so no code is supposed to be
using it anyway.
Libdm symbols are remaining, just the implementation will always
return failure - so any user of:
dm_tree_node_add_replicator_dev_target()
dm_tree_node_add_replicator_target().
will now always recieve error message.
Separate handling of error code from _info_by_dev.
This error can only happeng when we are running out of memory.
In such case there is urgent need to stop any futher proceeding
of command and run to error ASAP.
Avoiding "$(get first_extent_sector "$d")" in the loop
allows the test to succeed in the cluster. Further cluster
analysis needed to get to the core reason.
The lvm2 test suite aims at small test resource footprints
(few PVs, small PV sizes) to run on tmpfs backed loop device.
OTOH, lvconvert-reshape-raid.sh aims to test the maxima of
supported total stripes of 64. This patch adds a prerequisite
conditional to skip tests using more than 14 stripes.
It requires the target version 1.13.1 to avoid deadlocks.
If the recovery of the repleced leg(s) of a RaidLV created without
initial resynchronization (i.e. "lvcreate --nosync ...") got
interrupted, it can't be extended because of the < 100% sync rate.
In case caller passes in changed stripe size when reshaping raid4/5
to 1 stripe aiming to convert to raid1 and optionally to linear,
ignore it to prevent data corruption.
Use new 3rd. state of trace_pvmove_deps == 2.
In this state we know, we have already seen the node and can skip futher
testing. Remainging value 1 signals we want to track, and value 0
is for ignoring tracking, but node is still checking in this case.
Reduces large amount of duplicate ioctl queries.
Check also all snapshosts when resume is requested,
the origin volume is already resume, but possibly
some subLV or snapshot LV could be suspended if
we are still in critical_section.
When entering any critical section, lvm2 used to lock process memory
and raised task priority to avoid problem with page swapping and minimize
time of having non-resumed devices in table.
With this patch, memory locking which which is expensive is only used when
entering 'suspending' section as only in this section there is risk
lvm could be suspending a device which later can be needed for paging.
Raised priority is still kept for all section entrances as this is
low-cost operation and may accelerate table resumes - although the real
impact can be still considered later.
When pvmove is finished and metadata are updated, the code missed
to merge possible mergable segments - so add explicit merging
call after pvmoved volumes are unlocked.
This avoids weird results where i.e. lvs could have been reporting
non-matching segments as lvs upon metadata read is doing silent segment
merging while dm table left after pvmove was still preserving
non-merged segments.
When large size number (>2^31) is given on command line it could be
misdetected and in certain cases lead to wrongly casted number.
So make sure all cases always do set _MAX number in case the value would
not fit within the supported range instead of getting some random value
within the range.
In most cases this was not a problem to detect, but i.e. stripesize
parameter might have been fooled by certain large numbers.
Rewrite validation of stripes and stripe_size args into more readable
sequential code.
Extend reading of stripes & stripes_size args so it better knows
defaults for types like striped raid.
TODO: this should really be a value obtained for segtype structure and
all the weird conditions and modification of stripes and stripe_size
around lvm2 code should be dropped.
ATM we want to support delayed resume purely in pvmove case.
So have libdm logic internal to recognize difference beween
pvmove and other targets that do use delayed resume.
This fixes problem introduced with commit aa68b898ff
for mirror-on-mirror or snapshot-on-mirror problem.
TODO: likely added new API call and let libdm user select
delayed nodes explicitely.
Use code which detectes handlers in a way, which is more
backward-compatible friendly.
Replace read of 'sysfs' uuid entry with dm ioctl call.
Use /sys/block/dm-X/holders path instead of
new path /sys/dev/block/major:minor/holders.
TODO:
There are few more occurencies of this logic around the code
so some abstract interface should be considered.
In some cases the message could be slightly misleading so use
here rather conditional.
TODO:
In future we may possibly further tune the message in case we are
certain the level of redundancy protection has not been reduced.
When pvmove is finished and does 'suspend/resume' on PVMOVE LV,
on resume path committed metadata are already showing 'standalone'
pvmove LV prepared just for removal.
However code should be able to 'resume' preloaded LV there were
participating in pvmove operation.
Previously this was all done in the 'tools' part of lvm2 code.
So the lvconvert upon pvmove finish had to explicitely call 'resume' on every such LV.
Now 'smarted' activation code is able to deduce and combine all information from
the active dm table and committed metadata so single call resolves
it all in one go.
Internally holders are detected by reading sysfs directory to capture
all needed UUID which are then looked in lvm2 metadata and all such
LVs are automatically collected into dmtree.
Replace complex code with standard lv_update_and_reload_origin().
Extra suspend should not be necessary.
(If they would be - dependency tree would have bug for fixing).
Propagate delayed resume at least for preload case in a simple way.
Currently PVMOVE depends on internal logic where 'mirror' with
corelog is 'possible' PVMOVE. In such case resume of 'created'
node is 'delayed'.
This is mostly an ugly internal hack - but for the moment being when we
add propagation for preload - it does work reasonable.
TODO: provide standard API and avoid this internal 'guessing'.
Only thin-pool with origin_only suspend is allowed to be not suspending anything.
In such case pairing resume will 'decrement' critical section counter.
Just like suspend handles preload for pvmove finish,
in similar way handle suspend of starting pvmove.
In this case the precommited metadata are checked for list of PVMOVEed
LVs and those are suspended in with committed metadata.
When sanlock or dlm lock managers return an error number
that we don't recognize, replace it with a generic -ELMERR
which is defined in the set of special lvmlockd error
numbers. Otherwise, an unknown lock manager error number
could be misinterpreted for something else if it happened
to overlap another set of error numbers (which they have
not thus far.)
These less common errors returned from sanlock should
also cause sanlock to retry the lock acquire:
- i/o timeout occurs during sanlock_acquire().
other i/o on the same disk as the leases can cause
sanlock i/o timeouts.
- low level disk paxos contention between hosts naturally
causes one host to not acquire the lease. There are a
couple special error numbers associated with these cases
that should just be recognized as a normal failure to
acquire the lease.
There is no need to differentiation between clustered VG and normal VG.
As the activation depends on locking type.
Use unconditionally locally exclusive activation for pvmove.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
When activation of LVs fails prior pvmove start, try to deactivate
already activated LVs.
TODO: possibly remember which LVs where already activate and only those
take down - devices which are already in-use will stay active.
Only lv_committed() now uses vg->vg_committed and it appears redundant
if its contents match the enclosing VG so don't waste cycles creating it
when that's known to be true when no write lock is held so the struct
won't get modified.
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.
The persistent filter should not be imported by any command that doesn't
use it so take addtional note of REQUIRES_FULL_LABEL_SCAN (for vgrename)
and introduce IGNORE_PERSISTENT_FILTER for vgscan and pvscan.
If the suspend/resume sequence would leave some device in suspend
for possible later resume, backup cannot be takes (fs holding backups
could be still frozen in critical section())
Move check for presence of raid4 into the right place
so there is no way how to hit activation of any LV
with raid4 on kernel which does not support it.
Commit 763db8aab0 rejects 2-legged
conversions to striped/raid0 but different messages are displayed
for raid0 or striped. This commit provides the same rejection messages.
raid4/5 LVs may only be converted to striped or raid0/raid0_meta
in case they have at least 3 legs. 2-legged raid4/5 are a result
of either converting a raid1 to raid4/5 (takeover) or converting
a raid4/5 with more than 2 legs to raid1 with 2 legs (reshape).
The raid4/5 personalities map those as raid1,
thus reject conversion to striped/raid0.
Resolves: rhbz1511047
In HA cluster, we have "clvm" resource agent to manage clvmd daemon.
The agent invokes clvmd like: "clvmd -T90 -d0", which always prints
a scaring error message:
"""
local socket: connect failed: No such file or directory
"""
When specifed with "-d" option, clvmd tries to check if an instance
of the clvmd daemon is already running through a testing connection.
The connect() will fail with this ENOENT error in such case, so supress
the error message in such case.
TODO: add missing error reaction code - since ofter log_error, program
is not supposed to continue running (log_error() is for reporting
stopping problems).
Signed-off-by: Eric Ren <zren@suse.com>
Check and prevent starting another snapshot merge before
exiting merging is finished.
TODO: we can possibly implement smarter logic to drop existing
merging and start a new one.
The lvchange-raid[456].sh test checks that mismatches can be detected
properly. It does this by writing garbage to the back half of one of
the legs directly. When performing a "check" or "repair" of mismatches,
MD does a good job going directly to disk and bypassing any buffers that
may prevent it from seeing mismatches. However, in the case of RAID4/5/6
we have the stripe cache to contend with and this is not bypassed. Thus,
mismatches which have /just/ happened to an area that now populates the
stripe cache may be overlooked. This isn't a serious issue, however,
because the stripe cache is short-lived and reasonably small. So, while
there may be a small window of time between the disk changing underneath
the RAID array and when you run a "check"/"repair" - causing a mismatch
to be missed - that would be no worse than if a user had simply run a
"check" a few seconds before the disk changed. IOW, it simply isn't worth
making a fuss over dropping the stripe cache before beginning a "check" or
"repair" (which we actually did attempt to do a while back).
So, to get the test running smoothly, we simply deactivate and reactivate
the LV to force the stripe cache to be dropped and then proceed. We could
just as easily wait a few seconds for the stripe cache to empty also.
When a "recover" is just starting for a RAID LV, it is possible to get
"idle" for the sync action if the status is issued quickly enough. This
is fine, the MD thread just hasn't gotten things going yet. However,
the /need/ for a "recover" should be marked in md->recovery and it would
be simple enough to fix the kernel so this doesn't happen. May eventually
want a separate bug for this, but for now it fits with RHBZ 1507719.
We always preferred and recommended socket activation for our services
so remove the Install section in related .service units which are unused
in this case and keep only the Install section in associated .socket
units.
Signed-off-by: Bastian Blank <waldi@debian.org>
Since vg_validate() now rejects LVs without segments and
insert_layer_for_segments_on_pv() gets just created
'layer_lv' without segment, it needs to be hidden
from vg->lvs during processing of _align_segment_boundary_to_pe_range()
as this function calls lv_validate() and now requires
vg to be consistent. LV is then put back into vg->lvs.
There are two known bugs in the lvconvert-raid-status-validation.sh
test. The first one I consider to be more of an annoyance (1507719).
The second one I consider to be more serious (1507729).
RHBZ 1507719 simply documents the fact that the three RAID status
fields may not always be coherent due to the way they are set and
unset when the MD thread is shutting down and starting up. For
example, the sync ratio may be 100% but the sync action may not
yet have switched to "idle" and the health characters may not yet
all be 'A's (i.e. the devices set to InSync).
RHBZ 1507729 is more serious. The sync ratio can be 100% for a
short period of time after upconverting linear -> RAID1. It is
reset to 0 once the MD sync thread gets to work on it. It does
this because, technically, the array /is/ in-sync if the new
devices are excluded - i.e. the data is 100% available and
consistent. I'm not sure what to do about this problem, but we'd
much rather not have this state that looks exactly like the
end of the process when the sync ratio is 100% because the
"recover" process finished, but the sync action and health
characters haven't been updated yet. Put simply, the problem
is that we can't tell if a sync is starting or finished based
on the status output.
Since 4fa5add6b1 ("pvcreate: Wipe cached
bootloaderarea when wiping label.") label_remove is responsible
for the lvmcache_del. (toollib and liblvm need fixing to share
the code.)
Preload reiserfs module for the case, fs is present/compiled for a
kernel but it's not present in memory.
Size reducition needs --yes confirmation to preceed for reiserfs.
Correct reported message when thin snapshot has been already merged.
So lvm2 is no longer reporting "Mergins of snapshot X will occur..."
(even with swapped names).
When an ignored metadata area gets flagged for use again, make sure the
code doesn't try to parse its old metadata. Firstly by trying to detect
this situation and skipping the read (while still remembering the
position reached in the circular buffer), and secondly by clearing the
invalid live metadata location on disk as a precaution when subsequently
writing out the precommitted metadata.
Problems showed up when a metadata area in one VG got moved to
another VG in ignored state (still holding metadata for the original
VG) and then later got brought into use in the new VG - only the header
should be read in this case, not any of the metadata content.
vgsplit shares the vg_rename code so that must only set the PV_MOVED_VG
flag introduced in commit 486ed10848
("vgmerge: Fix intermediate metadata corruption") on PVs that moved.
Since both lvcreate and lvconvert needs to check for same
type of allowed origin for snapshot - move the code into
a single function.
This way we also fix several inconsitencies where snapshot
has been allowed by mistake either through lvcreate or
lvconvert path.
Generation of man pages is generating lot of barely readable output.
For normal build quietize this a bit.
For original verbose build start to use 'make V=1'
(just like i.e. linux kernel does)
TODO: apply at more places...
Converting from one raid level to another, no changes
of stripes or stripesize can be requested because those
are subject to reshaping. I.e. the process requires to
takeover first and secondly request raid algorithm,
stripe or stripesize changes.
Ignore any related changes display warninngs
and proceed with the takeover.
Without this patch, a takeover requesting
stripesize change causes data corruption!
Just like with other vars support this:
make check_local T=xyz LVM_LOG_FILE_MAX_LINES=10000000
Allows easily to override existing line limit.
Also increase limiting size of logs per command since some of
our commands are becoming very verbose....
Add explaining message, when command was aborted due to the reach
of configure line number count (LVM_LOG_FILE_MAX_LINES)
for logging (used mainly with testing).
Last patch missed to mention, we've improved/fixed generated paths
in units and init.d shell scripts when lvm2 was plainly configured
with just i.e. --prefix.
Note: some distros might have fully specified --sbindir and
--usrsbindir - thus those very not seeing problems in generated paths.
Replace lowercase @sbindir@ with @SBINDIR@ which contains
fully decoded path.
Same with @usrsbindir@ which is also used with clvmd and cmirrord.
Also handle SYSCONFDIR for EnvironmentFile.
Patch fixes generated unit files with strings like:
ExecStart=${exec_prefix}/sbin/lvm
Introduce few more AC_SUBST vars for usage in *.in generation.
In some case we want to replace i.e. $sbindir with full path
instead of current ${exec_prefix}/sbin.
This patch provides:
USRSBINDIR
SBINDIR
DEFAULT_SYS_LOCK_DIR
SYSCONFDIR
At the same time properly use sbindir & usrsbindir with
lvm, fsadm, clvmd from one primary definition.
Do not allow to take snapshot of mirror/raid leg or log or metadata LV.
This was actually never supported, but user was able to create it,
and this put device stack in hardly fixable state (needs manual work).
This prevents such creation to pass.
Also improve validation when recreating snapshot volume type
from origin and COW volume.
Correction to function for extracting vgname out of lvconvert
parameters.
Avoid repeating some checks.
Add code to handle generic options which may provide vgname in its argument
and compare them all so they match to a single vgname (otherwise it's a
error).
Extract default (envvar) vgname only when no position nor optional vgname is
found.
Fixing regression instroduce with patchset started with commit:
1e2420bca8 (2.02.169)
split resize_crypt function in two.
a) Detect proper dm-crypt device type and count new --size
value for cryptsetup resize command.
b) Perform the resize
the bug in LUKS grow/shrink decision in fsadm was
masked due to fact that default LVM2 extent size
was larger than LUKS1 default data offset for dm-crypt
mapping. The new test address this bug.
lvcreate supports a 'conversion' when caching LV.
This normally worked fine, however in case passed LV was
thin-pool's data LV with suffix _tdata we have failed to early.
As the easiest fix looks dropping validation of name when
caching type is select - such name check will happen later
once the VG is opened again and properly detect if the LV
with protected name already exists and can be converted,
or will be rejected as ambigiuous operation requiring user
to specify --type cache | --type cache-pool.
Replaced the confusing device error message "not found (or ignored by
filtering)" by either "not found" or "excluded by a filter".
(Later we should be able to say which filter.)
Left the the liblvm code paths alone.
Fixes the following case with 3PVs and 3 legs "mirror" LV:
# lvcreate -l100%FREE --type mirror -m2 vg3
Insufficient free space for log allocation for logical volume .
Unable to allocate extents for mirror log.
Related: rhbz1269533
Activation lock has a primary purpose to serialize locking of individual
LV in case there is no other protecting mechanism for parallel
execution.
However in the case an activated LV is composed from several other LVs,
noone should be able to manipulate with those LVs as well.
This patch add a very 'naive' global VG activation locking in this case.
In the future we may introduce smarter function detecting minimal closed
graph components if this will appear as bottleneck
Patch checks if the VG Write lock is held - in this case we do not
need any more locking - command has exclusive access to VG.
In case we have clustered VG and we are activating an LV which does not
need other LVs - we also do not need any more locks.
In all other cases take respective lock - for single LV - use lvid,
for complex LVs use vgname.
Creating striped RaidLVs with lv size not divisible by region size
caused the region size to be adjusted:
# lvcreate --type raid5 -n region_check.32.00m_3 -i 3 -L 1g --nosync -R 32.00m raid_sanity
Using default stripesize 64.00 KiB.
Rounding size 1.00 GiB (256 extents) up to stripe boundary size <1.01 GiB(258 extents).
WARNING: New raid5 won't be synchronised. Don't read what you didn't write!
Using reduced mirror region size of 8.00 MiB
Logical volume region_check.32.00m_3 created.
Fix by not imposing "mirror" constraints on "raid".
Resolves: rhbz1404007
vgmerge suffers from a similar problem to the one fixed in commit
8146548d25 ("vgsplit: Fix intermediate
metadata corruption.")
When merging, splitting or renaming VGs, use a new PV status flag
PV_MOVED_VG to mark the PVs that hold metadata with the old VG name and
use this to provide PV-level granularity instead of incorrectly assuming
all PVs in the VG are the same.
Add these for dmeventd systemd unit (dm-event.service):
Before: shutdown.target
Conflicts: shutdown.target
This will cause the dmeventd to be properly stopped at shutdown (after
all the dmeventd clients unregistered their devices from monitoring)
with dm-event.service's stop action (there's no direct action defined
for the "stop" so systemd sends SIGTERM instead).
Before, we let dmeventd to get killed only as part of the very last
SIGTERM/SIGKILL for all the remaining processes late in the shutdown
sequence so we may have missed some logs if dmeventd encountered an
error during its shutdown (logging facilities are already off at this
late time in shutdown sequence).
Ref: https://www.redhat.com/archives/lvm-devel/2017-October/msg00000.html
When dmeventd receives SIGTERM/INT/HUP/QUIT it validates if exit is possible.
If there was any device still monitored, such exit request used to
be ignored/refused. This 'usually' worked reasonably well, however if there
is very short time period between last device is unmonitored and signal
reception - there was possibility such EXIT was ignored, as dmeventd has
not yet got into idle state even commands like 'vgchange -an' has already
finished.
This patch changes logic towards scheduling EXIT to the nearest
point when there is no monitored device.
EXIT is never forgotten.
NOTE: if there is only a single monitored device and someone sends
SIGTERM and later someone uses i.e. 'lvchange --refresh' after
unmonitoring dmeventd will exit and new instance needs to be
started.
If you send a SIGUSR1 (10) to the daemon it will dump all the
threads current stacks to stdout. This will be useful when the
daemon is apparently hung and not processing requests.
eg.
$ sudo kill -10 <daemon pid>
Make sure that any and all code that executes in the main thread is
wrapped with a try/except block to ensure that at the very least
we log when things are going wrong.
Changing the VG of a PV uses the same on-disk mechanism as vgrename.
This relies on recognising both the old and new VG names. Prior to this
patch the vgsplit code incorrectly provided the new VG name twice
instead of the old and new ones. This lead the low-level mechanism not
to recognise the device as already belonging to a VG and so paying no
attention to the location of its existing metadata, sometimes partly
overwriting it and then later trying to read the corrupt metadata and
issuing a checksum error.
In some cases we are seeing where there are no VGs, but the data returned from
lvm shows that the PVs have the following for the VG:
"vg_name":"[unknown]", "vg_uuid":""
The code was only checking for the exitence of the VG name and we called into
the function get_object_path_by_uuid_lvm_id which requires both the VG name and
the LV name to exist (asserts this) which results in the following stack trace:
Traceback (most recent call last):
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 563, in runner
obj._run()
File "/home/tasleson/lvm2/daemons/lvmdbusd/utils.py", line 584, in _run
self.rc = self.f(*self.args)
File "/home/tasleson/lvm2/daemons/lvmdbusd/fetch.py", line 26, in
_main_thread_load
cache_refresh=False)[1]
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/home/tasleson/lvm2/daemons/lvmdbusd/loader.py", line 37, in common
objects = retrieve(search_keys, cache_refresh=False)
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 40, in
pvs_state_retrieve
p["pv_attr"], p["pv_tags"], p["vg_name"], p["vg_uuid"]))
File "/home/tasleson/lvm2/daemons/lvmdbusd/pv.py", line 84, in __init__
vg_uuid, vg_name, vg_obj_path_generate)
File "/home/tasleson/lvm2/daemons/lvmdbusd/objectmanager.py", line 318,
in get_object_path_by_uuid_lvm_id
assert uuid
AssertionError
When executing in the main thread, if we encounter an exception we
will bypass the notify_all call on the condition and the calling thread
never wakes up.
@staticmethod
def runner(obj):
# noinspection PyProtectedMember
Exception thrown here
----> obj._run()
So the following code doesn't run, which causes calling thread to hang
with obj.cond:
obj.function_complete = True
obj.cond.notify_all()
Additionally for some unknown reason the stderr is lost.
Best guess is it's something to do with scheduling a python function
into the GLib.idle_add. That made finding issue quite difficult.
There's nothing special about /boot other than it's used during boot.
But when blkdeactivate is called either on all devices or including a
device where the /boot is on top, we should also include this mount
point when doing unmount before deactivation of supported devices.
The new blkdeactivate -r|mdraidoption wait causes blkdeactivate to wait
for any resync/recovery/reshape that is currently in progress before
deactivating the device.
If this option is used, blkdeactivate calls mdadm -W|--wait before
mdadm -S|--stop.
Revert dc50f2f4a0.
We're canonicalizing/escaping the names here and we're reusing the
variable name so the code doesn't need to use extra variables and
further assignments that may confuse us. Let's keep the code simple.
The
local name=(...$name)
is not the same as
local name
name=(...$name)
(I know various code-checking tools fuss about this and recommend
the 2nd way, but let's ignore those tools' nitpicking here please.)
There was a typo in blkdeactivate --dmoption/--lvmoption/mpathoption,
it had missing "s" at the end and it was not recognized properly, only
short names for the options (-d/-l/-m).
When certain cmd def RULE's fail, the error messages can
sometimes be confusing. This expands the error messages
to help clarify why the rule failed, especially in cases
where options are used incorrectly.
In a shared VG, only allow pvmove with a named LV,
so that only PE's used by the LV will be moved.
The LV is then activated exclusively, ensuring that
the PE's being moved are not used from another host.
Previously, pvmove was mistakenly allowed on a full PV.
This won't work when LVs using that PV are active on
other hosts.
Avoid starting test, when test dir has less then 50M of free space.
Better to crash early before letting die machine on weird crash
in OOM cases...
Also show free disk space when test starts
Enable handling of --poolmetadataspare so if user can prevent
creation of _pmspare volume during --repair operation (just
like during actual lvcreate or lvconvert) for pool volumes.
The following commands now pass the device list through a
--select|-S filter before processing:
suspend resume clear wipe_table remove deps status table
In a shared VG, lvconvert must be used to create thin pools
and cache pools, not the lvcreate variants of those commands.
Deny these cases early in lvcreate using the new command defs.
Denying these cases deeper in the code was missing some
cleanup of the partially completed command.
Revert the lvmlockd.c changes from:
commit 0bf836aa14
"tidy: prefer not using else after return"
The commit introduced at least one regression, which broke
lvcreate of a thin pool in a shared VG.
ATM we have several instances of daemonizing code.
Each has its 'special' logic so not completely easy
to unify them all into a single routine.
Start to unify them and use one strategy for rediricting
all input/outpus to /dev/null - use 'dup2' function for this
and open /dev/null before fork to make sure it's available.
When file-locking mode failed on locking, such description was leaked
(typically not an issue since command usually exists afterwards).
So shirt close() at the end of function and use it in all error paths.
Also make sure, when interrrupt is detected, it's really not holding
lock and returns 0.
lvmcache_foreach_mda() can fail for numerous reasons
and failing error code cannot be ignored (out-of-memory...)
TODO: might need more error handling tunning.
gcc warns here about storring 69 bytes in 64 byte array (losing
potentially 4 bytes from 'ls->name').
lvmlockd-core.c:2657:36: warning: ‘%s’ directive output may be truncated writing up to 64 bytes into a region of size 60 [-Wformat-truncation=]
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~
lvmlockd-core.c:2657:2: note: ‘snprintf’ output between 5 and 69 bytes into a destination of size 64
snprintf(tmp_name, MAX_NAME, "REM:%s", ls->name);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Replaced with slightly better code - but it still misses error path what
to do if the name would be truncated... - so added FIXME.
Also using all bytes for snprintf() buffer size
(as the size is with \0 included)
* Attempt to rejoin lockspaces and adopt locks from a previous
@@ -5948,7 +6007,6 @@ static int main_loop(daemon_state *ds_arg)
close_worker_thread();
close_client_thread();
closelog();
daemon_close(lvmetad_handle);
return1;/* libdaemon uses 1 for success */
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.