IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Just like suspend handles preload for pvmove finish,
in similar way handle suspend of starting pvmove.
In this case the precommited metadata are checked for list of PVMOVEed
LVs and those are suspended in with committed metadata.
Whenever pvmove tree is going to be generated for suspend
and such LV has a user - use this 'using LV' to generate
correct dm tree holding all components.
LV is asked for resume, and its already resume and tool
is inside 'critical_section()' check if there is any suspended sub LV.
In that case 'resume' operation will not be skipped.
When old snapshot is merged, lvm2 still can report some data about
merged 'snapshot' - i.e. it occupied space in VG.
This patch fixes regression from commit:
6fd20be629
and resolved RHBZ: 1460161
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces infrastructure prerequisites to be used
by raid_manip.c extensions in followup patches.
This base is needed for allocation of out-of-place
reshape space required by the MD raid personalities to
avoid writing over data in-place when reading off the
current RAID layout or number of legs and writing out
the new layout or to a different number of legs
(i.e. restripe)
Changes:
- add members reshape_len to 'struct lv_segment' to store
out-of-place reshape length per component rimage
- add member data_copies to struct lv_segment
to support more than 2 raid10 data copies
- make alloc_lv_segment() aware of both reshape_len and data_copies
- adjust all alloc_lv_segment() callers to the new API
- add functions to retrieve the current data offset (needed for
out-of-place reshaping space allocation) and the devices count
from the kernel
- make libdm deptree code aware of reshape_len
- add LV flags for disk add/remove reshaping
- support import/export of the new 'struct lv_segment' members
- enhance lv_extend/_lv_reduce to cope with reshape_len
- add seg_is_*/segtype_is_* macros related to reshaping
- add target version check for reshaping
- grow rebuilds/writemostly bitmaps to 246 bit to support kernel maximal
- enhance libdm deptree code to support data_offset (out-of-place reshaping)
and delta_disk (legs add/remove reshaping) target arguments
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add to commits 87117c2b25 and 0b8bf73a63 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Resolves: rhbz1025322
Related: rhbz1265191
Related: rhbz1399844
Related: rhbz1404425
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We do the same for i.e. old snaphost origin.
It's actually not needed to call extra lv_has_target_type() to detect
snapshot merge is in progress - decode this right during status
capturing and save even few extra ioctl calls.
Drop LV from passed API arg - it's always segment being checked.
Also use_layer is now in full control of lv_info_with_seg_status().
It decides which device needs to be checked to get 'the most info'.
TODO: future version should be able to expose status from
Start moving selection of status taken for a LV into a single place.
The logic for showing info & status has been spread over multiple
places and were doing too complex decision going agains each other.
Unify selection of status of origin & cow scanned device.
TODO: in future we want to grab status for LV and layered LV and have
both statuses present for display - i.e. when 'old snapshot'
of thinLV is takes and there is ongoing merge - at some moment
we are not capable to show all needed info.
Translate log_info() into log_very_verbose() which is macro
supposed to be used by our code.
log_info() is internal macro with eventually some 'symbolic' meaning
in syslogging daemons.
Check for dm-raid target version with non-standard raid4 mapping expecting the dedicated
parity device in the last rather than the first slot and prohibit to create, activate or
convert to such LVs from striped/raid0* or vice-versa in order to avoid data corruption.
Add related tests to lvconvert-raid-takeover.sh
Resolves: rhbz1388962
Avoid monitoring of activated cache-pool - where the only purpose ATM
is to clear metadata volume which is actually activate in place
of cache-pool name (using public LV name).
Since VG lock is held across whole clear operation, dmeventd cannot
be used anyway - however in case of appliction crash we may
leave unmonitored device.
In future we may provide better mechanism as the current name
replacemnet is creating 'uncommon' table setups in case the metadata
LV is more complex type like raid (needs some futher thinking about
error path results).
Another point to think about is the fact we should not clear device
while holding lock (i.e. dmeventd mirror repair cannot work in cases
like this).
We have only 2 users of _lv_active() - one was already checking for ==1
while the other use (_lv_is_active()) could have take '-1' as a sign of having
an LV active. So return 0 and log_debug also the reason while detection
has failed (i.e. in case --driverload n - it's kind of expectable,
but might have confused user seeing just <backtrace>).
Add more code to properly store status for snapshot segment
maintaining lvm2 fiction of COW and snapshot internal volumes.
The key issue here is however not though-through reporting
logic - as there is no single answer for whole line state.
It not counting with layer and we may need few more ioctl to
cover all reporting needs depending upon what is actually
needed.
In reality we need to 'cache' more ioctl status queries for
individual LVs and their segments (so they checked at most once).
The other 'hard' topic for conversion is mirror segment handling.
Also we definitelly need to relocate some logic into segment's methods,
yet it might be complex as we have not clear border between targets.
TODO: define more clearly how are reporting fields defined in case
we 'stack' volumes like - cache of stacked thin LV snapshot origin.
To get better control when flushing is used add extra arg when
setting up dm task.
By default now check dm device status without flush.
(At this moment this should effect only thin and cache volumes).
Also switch dev_manager_thin_pool_status() to use more
readable 'flush' parameter instead of 'no_flush'.
Before executing modprobe for given module name, just check
if the module is not already present in /sys/module.
Useful when checking dm-cache-policy modules as we do not
having matching interface like for targets.
A snapshot merge into its origin cannot be initiated while the devices
are in use. If there is outside interference (such as from udev),
the suspend (preload) and resume stages can reach conflicting decisions
about whether or not to proceed.
Try to make the logic more robust by checking the inactive or live
table during resume. (This is still not perfect.)
Commit 844b009584 tried to move
limit for usage of noflush to 'preload' however this was not
correctly processed.
Intead explicitly check for which types we do not want noflush
and also add debug message in this case.
dmeventd daemon may call further code itself that looks at /dev, e.g.
via dmeventd_lvm2_command call. We need to have a consistent view of
the /dev content at that time. Therefore, sync /dev content before
calling monitoring hook which contacts dmeventd.
This problem was quite hidden before, but now it has manifested itself
because of recent additions to dev-cache code where we started looking
at device holders as seen in sysfs. What happened here was that the
device was already in sysfs, but not yet under /dev and this triggered
the new error message sometimes:
log_error("%s: failed to find associated device structure for holder %s.", devname, devpath);
This problem has manifested recently in our api/pytest.sh test from
testsuite where we create thin pool LVs and thin LVs and hence it also
causes dmeventd to be used as well and these error messages were
visible there.
Remove long outstand unused code lines, which were already
been obsoleted by other code.
Statuses and snapshot tree creation is already handled differently.
Also drop some 'extra' log_error() and use only stack;
since error has already been reported.
lv preload for detached LVs started to be used also
for various other types which just happens to pass through
weak if() condition.
TODO: find here better solution to rather explicitly check
for types we really need to preload.
We do not won't to 'expose' internals of VG struct.
ATM we use lists to keep all LVs - we may want to switch
to better struct for quicker 'search'.
Since we do not need 'lists' but always actual LV,
switch find_lv_in_vg_by_lvid() to return LV,
and replaces some use case of find_lv_in_vg()
with 'better' working find_lv() which already
returns LV.
Existing messaging intarface for thin-pool has a few 'weak' points:
* Message were posted with each 'resume' operation, thus not allowing
activation of thin-pool with the existing state.
* Acceleration skipped suspend step has not worked in cluster,
since clvmd resumes only nodes which are suspended (have proper lock
state).
* Resume may fail and code is not really designed to 'fail' in this
phase (generic rule here is resume DOES NOT fail unless something serious
is wrong and lvm2 tool usually doesn't handle recovery path in this case.)
* Full thin-pool suspend happened, when taken a thin-volume snapshot.
With this patch the new method relocates message passing into suspend
state.
This has a few drawbacks with current API, but overal it performs
better and gives are more posibilities to deal with errors.
Patch introduces a new logic for 'origin-only' suspend of thin-pool and
this also relates to thin-volume when taking snapshot.
When suspend_origin_only operation is invoked on a pool with
queued messages then only those messages are posted to thin-pool and
actual suspend of thin pool and data and metadata volume is skipped.
This makes taking a snapshot of thin-volume lighter operation and
avoids blocking of other unrelated active thin volumes.
Also fail now happens in 'suspend' state where the 'Fail' is more expected
and it is better handled through error paths.
Activation of thin-pool is now not sending any message and leaves upto a tool
to decided later how to finish unfinished double-commit transaction.
Problem which needs some API improvements relates to the lvm2 tree
construction. For the suspend tree we do not add target table line
into the tree, but only a device is inserted into a tree.
Current mechanism to attach messages for thin-pool requires the libdm
to know about thin-pool target, so lvm2 currently takes assumption, node
is really a thin-pool and fills in the table line for this node (which
should be ensured by the PRELOAD phase, but it's a misuse of internal API)
we would possibly need to be able to attach message to 'any' node.
Other thing to notice - current messaging interface in thin-pool
target requires to suspend thin volume origin first and then send
a create message, but this could not have any 'nice' solution on lvm2
side and IMHO we should introduce something like 'create_after_resume'
message.
Patch also changes the moment, where lvm2 transaction id is increased.
Now it happens only after successful finish of kernel transaction id
change. This change was needed to handle properly activation of pool,
which is in the middle of unfinished transaction, and also this corrects
usage of thin-pool by external apps like Docker.
Older lvm2 tools where always providing linear mapping for thin pool.
Recent lvm2 version however support external usage of thin pool and
empty/unused pools are loaded without such external linear mapping.
So this patch covers 'upgrade' problem, where older tool has activated
thin-pool with 'linear' layer mapping, and newer tools didn't expected
such mapping to exist and were not able to deactivate such table.
So before checking for new layout in dm-table, check if there is not
an old one already there.
Check splitted leg is active before preload.
(Since splitmirrors currently only does work active raid volumes
it's not a change for current code flow).
Minor optimization included - when already positively checked
for raid image don't check again for raid metadata.
for_each_sub_lv() normally does not put pool_lv into deps.
So for now go around it in 'lv_preload()' and add explicit
call with pool.
TODO: think about a better way, we want pool_lv deps only in certain
moments, so maybe for_each_sub_lv() needs new arg for this.
When raid is being splitted, extracted leg & metadata
is still floating in the table - and thus we need to
detect this case and properly preload their matching
table so consequent activation of extracted LVs properly
renames (and FREES) existing raid images, so ongoing
image name shifting will work.
API for seg reporting is breaking internal lvm coding - it cannot
use vgmem mem pool for allocation of reported value.
So use separate pool instead of 'vgmem' for non vg related allocations
Add consts for many function params - but still many other are left
for now as non-const - needs deeper level of change even on libdm side.
When creating/activating clustered mirrors, we should have cmirrord
available and running. If it's not, we ended up with rather cryptic
errors like:
$ lvcreate -l1 -m1 --type mirror vg
Error locking on node 1: device-mapper: reload ioctl on failed: Invalid argument
Failed to activate new LV.
$ vgchange -ay vg
Error locking on node node 1: device-mapper: reload ioctl on failed: Invalid argument
This patch adds check for cmirror availability and it errors out
properly, also giving a more precise error messge so users are able
to identify the source of the problem easily:
$ lvcreate -l1 -m1 --type mirror vg
Shared cluster mirrors are not available.
$ vgchange -ay vg
Error locking on node 1: Shared cluster mirrors are not available.
Exclusively activated cluster mirror LVs are OK even without cmirrord:
$ vgchange -aey vg
1 logical volume(s) in volume group "vg" now active
Just call return 0 directly on error path, without using
"goto" - the code is short, no need to use it this way
(the dead code appeared as part of further changes in this
function).
- Add separate lv_status fn (if we're interested only in seg status,
but not lv info at the same time as it is with existing
lv_info_with_seg_status fn). So we 3 fns:
- lv_info (existing one, runs only info ioctl, fills in struct lvinfo only)
- lv_status (new one, runs status ioctl, fills in struct lv_seg_status only)
- lv_info_with_seg_status (existing one, runs status ioctl, fills
in struct lvinfo as well as lv_seg_status)
- Add more comments in the code explaining the difference between lv_info,
lv_status and lv_info_with_seg_status and their return values.
- Move decision whether lv_info_with_seg_status needs to call only
status ioctl (in case the segment for which we require status is from
the LV for which we require info) or separate status and info ioctl
(in case the segment for which we require status is from different
LV that the one for which we require info) into
lv_info_with_seg_status fn so caller doesn't need to bother about
this at all.
- Cleanup internal interface for this seg status so it's more readable.
When deactivating origin, we may have possibly left table in broken state,
where origin is not active, but snapshot volume is still present.
Let's ensure deactivation of origin detects also all associated
snapshots are inactive - otherwise do not skip deactivation.
(so i.e. 'vgchange -an' would detect errors)
Activate of new/unused/empty thin pool volume skips
the 'overlay' part and directly provides 'visible' thin-pool LV to the user.
Such thin pool still gets 'private' -tpool UUID suffix for easier
udev detection of protected lvm2 devices, and also gets udev flags to
avoid any scan.
Such pool device is 'public' LV with regular /dev/vgname/poolname link,
but it's still 'udev' hidden device for any other use.
To display proper active state we need to do few explicit tests
for this condition.
Before it's used for any lvm2 thin volume, deactivation is
now needed to avoid any 'race' with external usage.
Replace lv_cache_block_info() and lv_cache_policy_info()
with lv_cache_status() which directly returns
dm_status_cache structure together with some calculated
values.
After use mem pool stored inside lv_status_cache structure
needs to be destroyed.
Currently, there are 5 things that device_is_usable function checks
(for DM devices only, of course):
- is device empty?
- is device blocked? (mirror)
- is device suspended?
- is device composed of an error target?
- is device name/uuid reserved?
If answer to any of these questions is "yes", then the device is not usable.
This patch just adds possibility to choose what to check for exactly - the
device_is_usable function now accepts struct dev_usable_check_params make
this selection possible. This is going to be used by subsequent patches.
Use of lv_info() internally in lv_check_not_in_use(),
so it always could use with_open_count properly.
Skip sysfs() testing in open_count == 0 case.
Accept just 'lv' pointer like other functions.
The function has 'built-in' lv_is_active_locally check,
which however is not what we need to check in many place.
For now at least remotely active snapshot merge is
detected and for this case merge on next activation is scheduled.
Use lv_is_* macros throughout the code base, introducing
lv_is_pvmove, lv_is_locked, lv_is_converting and lv_is_merging.
lv_is_mirror_type no longer includes pvmove.
Currently, we have two modes of activation, an unnamed nominal mode
(which I will refer to as "complete") and "partial" mode. The
"complete" mode requires that a volume group be 'complete' - that
is, no missing PVs. If there are any missing PVs, no affected LVs
are allowed to activate - even RAID LVs which might be able to
tolerate a failure. The "partial" mode allows anything to be
activated (or at least attempted). If a non-redundant LV is
missing a portion of its addressable space due to a device failure,
it will be replaced with an error target. RAID LVs will either
activate or fail to activate depending on how badly their
redundancy is compromised.
This patch adds a third option, "degraded" mode. This mode can
be selected via the '--activationmode {complete|degraded|partial}'
option to lvchange/vgchange. It can also be set in lvm.conf.
The "degraded" activation mode allows RAID LVs with a sufficient
level of redundancy to activate (e.g. a RAID5 LV with one device
failure, a RAID6 with two device failures, or RAID1 with n-1
failures). RAID LVs with too many device failures are not allowed
to activate - nor are any non-redundant LVs that may have been
affected. This patch also makes the "degraded" mode the default
activation mode.
The degraded activation mode does not yet work in a cluster. A
new cluster lock flag (LCK_DEGRADED_MODE) will need to be created
to make that work. Currently, there is limited space for this
extra flag and I am looking for possible solutions. One possible
solution is to usurp LCK_CONVERT, as it is not used. When the
locking_type is 3, the degraded mode flag simply gets dropped and
the old ("complete") behavior is exhibited.
Accidently it's been commited - but it has also shown,
that on heavy loaded systems (like our test machine could be)
slightly bigger timeouts which waits longer for udev rules
processing does help and avoids occasional refuse of deactivation
because device is still being open.
(i.e. lvcreate...; lvchange -an...)
Unsure how we could now synchronize for this. On very slow(/loaded)
system 5 second timeout is simply not enough.
TODO: introduce at least lvm.conf configurable setting to
allow longer 'retry' loops.
Reindent lv_check_not_in_use to simplify internal loop code.
Also return always '0/1' (drop -1) - since we only
check for failure (0) - and we don't really know
why lv_info() has failed.
Drop unused passed cmd pointer from function.
TODO:
We have two similar functions (though not identical)
lv_manip.c: for_each_sub_lv()
metadata.c: _lv_each_dependency()
They seem to not always match - we should probably convert
to use only a single function.
This function is typically called for cmd context refresh or destroy.
On the non-clustered case we already unlocked all messages,
however when i.e. 'clvmd' gets break signal it may have
still couple messages queued.
For now just report an error.
Building on the new DM function that parses DM cache status, we
introduce the following LVM level functions to aquire information
about cache devices:
- lv_cache_block_info: retrieves information on the cache's block/chunk usage
- lv_cache_policy_info: retrieves information on the cache's policy
If the volume_list filters out volume from activation,
it is still success result for this function.
Change the error message back to verbose level.
Detect if the volume is active localy before zeroing,
so we report error a bit later for cases, where volume
could not be activated because it doesn't pass through volume
list (but user still could create volume when he disables
zeroing)
Add LV_TEMPORARY flag for LVs with limited existence during command
execution. Such LVs are temporary in way that they need to be activated,
some action done and then removed immediately. Such LVs are just like
any normal LV - the only difference is that they are removed during
LVM command execution. This is also the case for LVs representing
future pool metadata spare LVs which we need to initialize by using
the usual LV before they are declared as pool metadata spare.
We can optimize some other parts like udev to do a better job if
it knows that the LV is temporary and any processing on it is just
useless.
This flag is orthogonal to LV_NOSCAN flag introduced recently
as LV_NOSCAN flag is primarily used to mark an LV for the scanning
to be avoided before the zeroing of the device happens. The LV_TEMPORARY
flag makes a difference between a full-fledged LV visible in the system
and the LV just used as a temporary overlay for some action that needs to
be done on underlying PVs.
For example: lvcreate --thinpool POOL --zero n -L 1G vg
- first, the usual LV is created to do a clean up for pool metadata
spare. The LV is activated, zeroed, deactivated.
- between "activated" and "zeroed" stage, the LV_NOSCAN flag is used
to avoid any scanning in udev
- betwen "zeroed" and "deactivated" stage, we need to avoid the WATCH
udev rule, but since the LV is just a usual LV, we can't make a
difference. The LV_TEMPORARY internal LV flag helps here. If we
create the LV with this flag, the DM_UDEV_DISABLE_DISK_RULES
and DM_UDEV_DISABLE_OTHER_RULES flag are set (just like as it is
with "invisible" and non-top-level LVs) - udev is directed to
skip WATCH rule use.
- if the LV_TEMPORARY flag was not used, there would normally be
a WATCH event generated once the LV is closed after "zeroed"
stage. This will make problems with immediated deactivation that
follows.
This patch reinstates the lv_info call to check for open count of
the LV we're removing/deactivating - this was changed with commit 125712b
some time ago and we relied on the ioctl retry logic deeper in the libdm
while calling the exact 'remove' ioctl.
However, there are still some situations in which it's still required to
check for open count before we do any 'remove' actions - this mainly
applies to LVs which consist of several sub LVs, like it is for
virtual snapshot devices.
The commit 1146691 fixed the issue with ordering of actions during
virtual snapshot removal while the snapshot is still open. But
the check for the open status of the snapshot is still prone to
marking the snapshot as in use with an immediate exit even though
this could be a temporary asynchronous open only, most notably
because of udev and its WATCH udev rule with accompanying scans
for the event which is asynchronous. The situation where this crops
up most often is when we're closing the LV that was open for read-write
and then calling lvremove immediately.
This patch reinstates the original lv_info call for the open status
of the LV in the lv_check_not_in_use fn that gets called before
we do any LV removal/deactivation. In addition to original logic,
this patch adds its own retry loop with a delay (25x0.2 seconds)
besides the existing ioctl retry loop.
Component LVs of a thinpool can be RAID LVs. Users who attempt a
scrubbing operation directly on a thinpool will be prompted to
specify the sub-LV they wish the operation to be performed on. If
neither of the sub-LVs are RAID, then a message telling them that
the operation can only be performed on a RAID LV will be given.
Since the virtual snapshot has no reason to stay alive once we
detach related snapshot - deactivate whole thing in front of
snapshot removal - otherwice the code would get tricky for
support in cluster.
The correct full solution would require to have transactions
for libdm operations.
Also enable to the check for snapshot being opened prior
the origin deactivation, otherwise we could easily end
with the origin being deactivate, but snapshot still kept
active, desynchronizing locking state in cluster.
A common scenario is during new LV creation when we need to wipe the
newly created LV and avoid any udev scanning before this stage otherwise
it could cause the device (the LV) to be claimed by some other subsystem
for which there were stale metadata within LV data.
This patch adds possibility to mark the LV we're just about to wipe with
a flag that gets passed to udev via DM_COOKIE as a subsystem specific
flag - DM_SUBSYSTEM_UDEV_FLAG0 (in this case the subsystem is "LVM")
so LVM udev rules will take care of handling that.
Some code has been added recently which makes it impossible to compile
when "configure --disable-devmapper" is used. This patch just shuffles
the code around so it's under proper #ifdef DEVMAPPER_SUPPORT.
When NULL info struct is passed in - function is usable
as a quick query for lv_is_active_locally() - with a bonus
we may query for layered device.
So it could be seen as a more efficient lv_is_active_locally().
Properly skip unmonitoring of thin pool volume in deactivation code
path. Code makes sure if there is just any thin pool user
it stays monitored with all its resources.
The status printed for dm-raid targets on older kernels does not include
the syncaction field. This is handled by dev_manager_raid_status() just
fine by populating the raid status structure with NULL for that field.
However, lv_raid_sync_action() does not properly handle that field being
NULL. So, check for it and return 0 if it is NULL.
Revert commit 37ffe6a. If static variables are to be used then we
will put them elsewhere and limit the optimization to reporting
code, rather that have it be used in the general case.
Previously, we have relied on UUIDs alone, and on lvmcache to make getting a
"new copy" of VG metadata fast. If the code which triggers the activation has
the correct VG metadata at hand (the version which is currently on disk), it can
now hand it to the activation code directly.
There are places where 'lv_is_active' was being used where it was
more correct to use 'lv_is_active_locally'. For example, when checking
for the existance of a kernel instance before asking for its status.
Most of the time these would work correctly. (RAID is only allowed on
non-clustered VGs at the moment, which means that 'lv_is_active' and
'lv_is_active_locally' would give the same result.) However, it is
more correct to use the proper variant and it helps with future
scenarios where targets might be allowed exclusively (or clustered) in
a cluster VG.