IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Shift refresh of mirror table right into monitor_dev_for_events().
Use !vg_write_lock_held() to recognize use of lvchange/vgchange.
(this shall change if this would no longer work, but requires
futher some API changes).
With this patch dm mirror table is only refreshed when necassary.
Also update WARNING message about mirror usage without monitoring
and display LV name.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
In some pvmove tests, clvmd uses the new (precommitted)
saved_vg, but then requests the old saved_vg, and
expects that the new saved_vg be returned instead of
the old. So, when returning the new saved_vg, forget
the old one so we don't return it again.
The filters save information about devices that should
be ignored, so if we need to repeat a scan (unusual,
but happens in clvmd), we need to update the filters.
When clvmd does a full label scan just prior to
calling _vg_read(), pass a new flag into _vg_read
to indicate that the normal rescan of VG devs is
not needed.
After reading a VG, stash it in lvmcache as "saved_vg".
Before reading the VG again, try to use the saved_vg.
The saved_vg is dropped on VG lock operations.
The copy of the VG which clvmd stashes in lvmcache should
not only be used between suspend and resume, but between
sequential LV operations in clvmd, so that clvmd does not
need to reread the VG for each one. Prepare for that by
renaming the stashed VG as "saved_vg".
For reporting commands (pvs,vgs,lvs,pvdisplay,vgdisplay,lvdisplay)
we do not need to repeat the label scan of devices in vg_read if
they all had matching metadata in the initial label scan. The
data read by label scan can just be reused for the vg_read.
This cuts the amount of device i/o in half, from two reads of
each device to one. We have to be careful to avoid repairing
the VG if we've skipped rescanning. (The VG repair code is very
poor, and will be redone soon.)
Don't allow writes in test mode. test mode should be
more sophisticated than just faking writes, and this
should be a last defense for cases where test mode is
not being checked correctly.
Recent changes allow some major simplification of the way
lvmcache works and is used. lvmcache_label_scan is now
called in a controlled fashion at the start of commands,
and not via various unpredictable side effects. Remove
various calls to it from other places. lvmcache_label_scan
should not be called from anywhere during a command, because
it produces an incorrect representation of PVs with no MDAs,
and misclassifies them as orphans. This has been a long
standing problem. The invalid flag and rescanning based on
that is no longer used and removed. The 'force' variation is
no longer needed and removed.
We can't let clvmd keep all scanned devs open,
which prevents them from being removed. So
drop the bcache data (and close fds) affter
doing a label scan.
Also set up bcache before the clvm-specific
vg_read (which needs to rescan the vg's devs
using bcache) and destroy the bcache after.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
In some odd cases (e.g. tests) there are very few devices
which results in creating too few blocks in bcache, so
create bcache with a minimum number of blocks.
Commands using lvmetad will not begin with a proper
label_scan which initializes bcache, but may later
decide they need to scan a set of devs, in which case
they'll need bcache set up at that point.
The improved detection of bad metadata when scanning
(where errors were ignored before) means we now have to
override some errors when forcibly erasing damaged metadata.
Drop an extra label scan in the recovery part
of vg_read. This is a temporary improvement
until the pending replacement for the broken
recovery code burried in vg_read.
This is a temporary hacky workaround to the problem of
reads going through bcache and writes not using bcache.
The write path wants to read parts of data that it is
incrementally writing to disk, but the reads (using
bcache) don't work because the writes are not in the
bcache. For now, add a dev to bcache before each attempt
to read it in case it's being used on the write path.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
The copy of VG metadata stored in lvmcache was not being used
in general. It pretended to be a generic VG metadata cache,
but was not being used except for clvmd activation. There
it was used to avoid reading from disk while devices were
suspended, i.e. in resume.
This removes the code that attempted to make this look
like a generic metadata cache, and replaces with with
something narrowly targetted to what it's actually used for.
This is a way of passing the VG from suspend to resume in
clvmd. Since in the case of clvmd one caller can't simply
pass the same VG to both suspend and resume, suspend needs
to stash the VG somewhere that resume can grab it from.
(resume doesn't want to read it from disk since devices
are suspended.) The lvmcache vginfo struct is used as a
convenient place to stash the VG to pass it from suspend
to resume, even though it isn't related to the lvmcache
or vginfo. These suspended_vg* vginfo fields should
not be used or touched anywhere else, they are only to
be used for passing the VG data from suspend to resume
in clvmd. The VG data being passed between suspend and
resume is never modified, and will only exist in the
brief period between suspend and resume in clvmd.
suspend has both old (current) and new (precommitted)
copies of the VG metadata. It stashes both of these in
the vginfo prior to suspending devices. When vg_commit
is successful, it sets a flag in vginfo as before,
signaling the transition from old to new metadata.
resume grabs the VG stashed by suspend. If the vg_commit
happened, it grabs the new VG, and if the vg_commit didn't
happen it grabs the old VG. The VG is then used to resume
LVs.
This isolates clvmd-specific code and usage from the
normal lvm vg_read code, making the code simpler and
the behavior easier to verify.
Sequence of operations:
- lv_suspend() has both vg_old and vg_new
and stashes a copy of each onto the vginfo:
lvmcache_save_suspended_vg(vg_old);
lvmcache_save_suspended_vg(vg_new);
- vg_commit() happens, which causes all clvmd
instances to call lvmcache_commit_metadata(vg).
A flag is set in the vginfo indicating the
transition from the old to new VG:
vginfo->suspended_vg_committed = 1;
- lv_resume() needs either vg_old or vg_new
to use in resuming LVs. It doesn't want to
read the VG from disk since devices are
suspended, so it gets the VG stashed by
lv_suspend:
vg = lvmcache_get_suspended_vg(vgid);
If the vg_commit did not happen, suspended_vg_committed
will not be set, and in this case, lvmcache_get_suspended_vg()
will return the old VG instead of the new VG, and it will
resume LVs based on the old metadata.
When process_each_pv() calls vg_read() on the orphan VG, the
internal implementation was doing an unnecessary
lvmcache_label_scan() and two unnecessary label_read() calls
on each orphan. Some of those unnecessary label scans/reads
would sometimes be skipped due to caching, but the code was
always doing at least one unnecessary read on each orphan.
The common format_text case was also unecessarily calling into
the format-specific pv_read() function which actually did nothing.
By analyzing each case in which vg_read() was being called on
the orphan VG, we can say that all of the label scans/reads
in vg_read_orphans are unnecessary:
1. reporting commands: the information saved in lvmcache by
the original label scan can be reported. There is no advantage
to repeating the label scan on the orphans a second time before
reporting it.
2. pvcreate/vgcreate/vgextend: these all share a common
implementation in pvcreate_each_device(). That function
already rescans labels after acquiring the orphan VG lock,
which ensures that the command is using valid lvmcache
information.
When lvmlockd indicates that the lvmetad cache is out of
date because of changes by another node, lvmetad_pvscan_vg()
rescans the devices in the VG to update lvmetad. Use the
new label_scan in this function to use the common code and
take advantage of the new aio and reduced reads.
This fixes the use of lvmcache_label_rescan_vg() in the previous
commit for the special case of independent metadata areas.
label scan is about discovering VG name to device associations
using information from disks, but devices in VGs with
independent metadata areas have no information on disk, so
the label scan does nothing for these VGs/devices.
With independent metadata areas, only the VG metadata found
in files is used. This metadata is found and read in
vg_read in the processing phase.
lvmcache_label_rescan_vg() drops lvmcache info for the VG devices
before repeating the label scan on them. In the case of
independent metadata areas, there is no metadata on devices, so the
label scan of the devices will find nothing, so will not recreate
the necessary vginfo/info data in lvmcache for the VG. Fix this
by setting a flag in the lvmcache vginfo struct indicating that
the VG uses independent metadata areas, and label rescanning should
be skipped.
In the case of independent metadata areas, it is the metadata
processing in the vg_read phase that sets up the lvmcache
vginfo/info information, and label scan has no role.
Move the location of scans to make it clearer and avoid
unnecessary repeated scanning. There should be one scan
at the start of a command which is then used through the
rest of command processing.
Previously, the initial label scan was called as a side effect
from various utility functions. This would lead to it being called
unnecessarily. It is an expensive operation, and should only be
called when necessary. Also, this is a primary step in the
function of the command, and as such it should be called prominently
at the top level of command processing, not as a hidden side effect
of a utility function. lvm knows exactly where and when the
label scan needs to be done. Because of this, move the label scan
calls from the internal functions to the top level of processing.
Other specific instances of lvmcache_label_scan() are still called
unnecessarily or unclearly by specific commands that do not use
the common process_each functions. These will be improved in
future commits.
During the processing phase, rescanning labels for devices in a VG
needs to be done after the VG lock is acquired in case things have
changed since the initial label scan. This was being done by way
of rescanning devices that had the INVALID flag set in lvmcache.
This usually approximated the right set of devices, but it was not
exact, and obfuscated the real requirement. Correct this by using
a new function that rescans the devices in the VG:
lvmcache_label_rescan_vg().
Apart from being inexact, the rescanning was extremely well hidden.
_vg_read() would call ->create_instance(), _text_create_text_instance(),
_create_vg_text_instance() which would call lvmcache_label_scan()
which would call _scan_invalid() which repeats the label scan on
devices flagged INVALID. lvmcache_label_rescan_vg() is now called
prominently by _vg_read() directly.
To do label scanning, lvm code calls lvmcache_label_scan().
Change lvmcache_label_scan() to use the new label_scan()
based on bcache.
Also add lvmcache_label_rescan_vg() which calls the new
label_scan_devs() which does label scanning on only the
specified devices. This is for a subsequent commit and
is not yet used.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
When user configured lvm2 to NOT user monitoring, activated mirror
actually hang upon error and it's quite unusable moment.
So instead Warn those 'brave' non-monitoring users about possible
problem and activation mirror without blocking error handling.
This also makes it a bit simpler for test suite to handle trouble
cases when test is running without dmeventd.
When adjusting region size for clustered VG it always needs to fit
2 full bitset into 1MB due to old limits of CPG.
This is relatively big amount of bits, but we have still limitation
for region size to fit into 32bits (0x8000000).
So for too big mirrors this operation needs to fail - so whenever
function returns now 0, it means we can't find matching region_size.
Since return 0 is now 'error' we need to also pass proper region_size
when creating pvmove mirror.
Since extent_size is no longer power_of_2 this max region size
evalution was rather producing random bitsize as a combination
of lowest bit from number of extents and extent size itself.
Correct calculation to use whole LV size and pick biggest
possible power of 2 value smaller then UINT32_MAX.
Drop mirrored mirror log limitation that applies only in very limited
use-case and actually mirrored mirror log is deprecated anyway.
So 'disk' mirror log is selecting the correct minimal size, and
bigger size is only enforced with real mirrored mirror log.
Also for mirrored mirror log we let use 'smalled' region size if needed
so if user uses 1G region size, we still keep small mirror log
with much smaller region size in this case when needed.
Also mirror log extent calculation is now properly detecting error
with too big mirrors where previosly trimmed uint32_t was applies
unintentionally.
Whenever we make visible LV out of previously invisible one,
reload it's table - the is mandator for proper udev rule
processing as well as ensure content of dm table is correct.
TODO: this new generic rule probably make extra raid rules unnecessary.
Only policy 'smq' is meant to be used with format version 2.
Code used to let pass 'mq' policy also with format 2. But 'mq'
is obsoloted wth smq and kernel currently matches it. But this
is incompatible with older original mq logic - so disallow creation
of this rather useless combination.
If the tools for checking thin_pool or cache metadata are missing,
issue rather just a WARNING, but let the operation of activation
continue.
This has the advantage, the if user is missing those tools,
but he already started to use thinpool or cacheing, he can
access these volumes with a WARNING.
Also if the user is using too old tools i.e. for CacheV2 format
dmpd tool 0.7 is required - provide informative WARNING and
skip failure from older tool version which can't understand
new format V2.
In case a newly created RaidLV is blacklisted using config
\"activation { volume list = [ ... ] }\" (i.e. its SubLVs stay inactive),
the metadata SubLVs can't get wiped thus failing the creation.
As a result, the RaidLV together with its SubLVs
is left behind in an inconsistent state.
Fix by removing the RaidLV and provide a hint about volume_list reasoning.
Resolves: rhbz1161347
While prioritized_section() based on raised priority works
nicely for standard lvm comman - separate counter is actually needed
when it's used in daemons like clvmd/dmeventd where priority
stays raised all the time.
Detect we are in prioritezed section instead of critical one,
since these operation were supposed to NOT be happining during
whole set of operation.
This patch fixes verification of udev operations.
Introduce prioritized_section() as a closer match to previous logic
of critical_section() that has been held over longer sequence of
ioctl commands - essentially it's matching operation on a single
cookie.
While 'critical_section()' now corresponds to locked memory - we hold
this memory only between suspend/resume thus notion of 'cookie' was
lost.
This patch restores some logic unintentionaly lost with dropping
memory locking for just activation/deactivation calls.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
Just like lvm2 has internal devices like _tdata which is using UUID with
suffix, there is similar private type of device for crypto device where
they are using CRYPT-TEMP uuid prefix.
Also ignore stratis.
Some kernel version suffer from bad state transition where a device
steps into 'frozen' mode. Any application that tries to read such
raid gets unfortunatelly bloked.
As some sort of protection try to skip such raid device from being
scanned to minimize chances to block lvm2 command on such scan.
When such device is found, warning gets printed.
RaidLVs on read_only_volume_list have their SubLVs
activated readonly thus disabling metadata updates
or image resynchronization/recovery. Bug also causes
automatic repairs to fail.
Fix by always activating the RAID SubLVs readwrite.
Resolves: rhbz1208269
Just like with lvcreate, this lvconvert case also need to properly
check which LV actually holds lock for cached origin - as it might
be i.e. thin-pool tdata subLV.
When snapshot is created in read-only mode with 'lvcreate -s -pr...',
lvm2 still needs to be able to write to layered -cow volume
to store metadata and exceptions blocks.
TODO: in some case we might be able to do full tree with read-only
volume but this probably needs futher validation:
1. checking snapshot header already exist
2. origin & snapshot are both in read-only mode.
Occasionaly users may need to peek into 'component devices.
Normally lvm2 does not let users activation component.
This patch adds special mode where user can activate
component LV in a 'read-only' mode i.e.:
lvchange -ay vg/pool_tdata
All devices can be deactivated with:
lvchange -an vg | vgchange -an....
If componet devices could be activated alone, ensure they are not breaking
common commands.
TODO: mostly likely this is not a definite list of all needed checks
and more will come later.
This is the 'last' place where a LV is present in metadata.
Any removed device should not be left active in dm table.
So this check is an extra validation protection to capture any
forgotten deactivation (adding 1 extra ioctl into lvremove path)
Introduce:
lv_is_component() check is LV is actually a component device.
lv_component_is_active() checking if any component device is active.
lv_holder_is_active() is any component holding device is active.
Instead of checking with existing size of external origin LV,
use correctly the new 'wanted' size of this LV whether it fits
the limitiation requirements for older thin-pool target.
Otherwise code started to the the resize, updates metadata and
just fails during 'resize' in case the LV was active. For
inactive LV operation could have actually passed.
Checking here for cache_pool is not necessary and in effect
the check is not even right - since there are internal
states that do allow to active such LV.
Fix missing 'externalLV' traversing for thins with external origins.
Replace extra for_each_sub_lv_except_pools() with better
internal logic allowing selectively to cut of processed subLV tree.
Extend error code for function 'fn()' when it returns -1 it will
stop futher tree scan for given LV.
Also a bit simplify code to have only one place that
is calling 'fn()' and use level counter to know
depth of traversing.
Update renaming travering to skip trees for pools
and external origins.
While 'file-locking' code always dropped cached VG before
lock was taken - other locking types actually missed this.
So while the cache dropping has been implement for i.e. clvmd,
actually running command in cluster keept using cache even
when the lock has been i.e. dropped and taken again.
This rather 'hard-to-hit' error was noticable in some
tests running in cluster where content of PV has been
changed (metadata-balance.sh)
Fix the code by moving cache dropping directly lock_vol() function.
TODO: it's kind of strange we should ever need drop_cached_metadata()
used in several places - this all should happen automatically
this some futher thinking here is likely needed.
So this is a bit more complex and possibly worth futher checking.
ATM clvmd drops cmd->mem mempool AFTER refresh of cmd.
So anything allocating from cmd->mem during toolcontext init
will likely die at some point in time.
As a quick fix - just use regular malloc/free for 'dso' alloction.
It's worth to note - cmd->libmem seems to be often misused
causing hidden memleaking for clvmd.
Build dso plugin name during segtype initialisation and just
use the string during command life-time.
Also slightlt update message verbosity and make it very_verbose
when operation is going to be made and 'verbose' when it's done.