IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Disable code which has postprocessed whole tree and reset udev flags.
We need to find out which case was troublesome - since this loop
was just hidding bug in other code parts (most probably preload tree)
In general for non-toplevel LVs we shouldn't allow any _tree_action.
For now error on request for cache_pool activation which
doesn't even exist in dm-table.
Drop unused passed cmd pointer from function.
TODO:
We have two similar functions (though not identical)
lv_manip.c: for_each_sub_lv()
metadata.c: _lv_each_dependency()
They seem to not always match - we should probably convert
to use only a single function.
This function is typically called for cmd context refresh or destroy.
On the non-clustered case we already unlocked all messages,
however when i.e. 'clvmd' gets break signal it may have
still couple messages queued.
For now just report an error.
Reorder detection for internal device - since this test
is much simpler then target analysis, check it sooner.
Replace test for '68' with sizeof & ID_LEN
Add FIXME about device alias problem with is_reserved_lvname,
since this test fails on devices like /dev/dm-X
so we need to convert tests to UUID.
Even though we make pool volume as a public visible LV,
we still do not want tools to look at this volume.
While we do not create /dev/vg/lv link, device is still
accessible via /dev/mapper/vg-lv and there is no easy
way to recognize it's private without lvm2 metadata.
Enhance UUID with -pool suffix and directly skip
any LV with a suffix in device_is_usable() call.
TODO: enhance other targets with this logic.
blkid may probably use same simple logic.
The empty pool is also the pool which has yet queued list of messages
and transaction_id == 1.
Problem is exposed when pool is created inactive.
lvcreate -L10 -T vg/pool -an
lvcreate -V10 -T vg/pool
Move flags for segments to segtype header where it seems more closely
related as the features are related to segtype and not activation.
Use unsigned #define - since it's more common in lvm2 source code
for bit flags.
Code uses target driver version for better estimation of
max size of COW device for snapshot.
The bug can be tested with this script:
VG=vg1
lvremove -f $VG/origin
set -e
lvcreate -L 2143289344b -n origin $VG
lvcreate -n snap -c 8k -L 2304M -s $VG/origin
dd if=/dev/zero of=/dev/$VG/snap bs=1M count=2044 oflag=direct
The bug happens when these two conditions are met
* origin size is divisible by (chunk_size/16) - so that the last
metadata area is filled completely
* the miscalculated snapshot metadata size is divisible by extent size -
so that there is no padding to extent boundary which would otherwise
save us
Signed-off-by:Mikulas Patocka <mpatocka@redhat.com>
Test raid10 availability as a target feature (instead of doing
it in all the places where raid10 should be checked).
TODO: activation needs runtime validation - so metadata with raid10
are skipped from activation in user-friendly way in lvm2.
This patch allows users to create cache LVs with 'lvcreate'. An origin
or a cache pool LV must be created first. Then, while supplying the
origin or cache pool to the lvcreate command, the cache can be created.
Ex1:
Here the cache pool is created first, followed by the origin which will
be cached.
~> lvcreate --type cache_pool -L 500M -n cachepool vg /dev/small_n_fast
~> lvcreate --type cache -L 1G -n lv vg/cachepool /dev/large_n_slow
Ex2:
Here the origin is created first, followed by the cache pool - allowing
a cache LV to be created covering the origin.
~> lvcreate -L 1G -n lv vg /dev/large_n_slow
~> lvcreate --type cache -L 500M -n cachepool vg/lv /dev/small_n_fast
The code determines which type of LV was supplied (cache pool or origin)
by checking its type. It ensures the right argument was given by ensuring
that the origin is larger than the cache pool.
If the user wants to remove just the cache for an LV. They specify
the LV's associated cache pool when removing:
~> lvremove vg/cachepool
If the user wishes to remove the origin, but leave the cachepool to be
used for another LV, they specify the cache LV.
~> lvremove vg/lv
In order to remove it all, specify both LVs.
This patch also includes tests to create and remove cache pools and
cache LVs.
Building on the new DM function that parses DM cache status, we
introduce the following LVM level functions to aquire information
about cache devices:
- lv_cache_block_info: retrieves information on the cache's block/chunk usage
- lv_cache_policy_info: retrieves information on the cache's policy
When thin volume is using external origin, current thin target
is not able to supply 'extended' size with empty pages.
lvm2 detects version and disables extension of LV past the external
origin size in this case.
Thin LV could be however still reduced and extended freely bellow
this size.
This reverts commit 24639be558.
Ok - seems we could be here a bit too active - and we
may remove devices which are unsuable for reasons we are not
aware of - thus taking down whole device could be way to big hammer.
So we still need some solution to recover from failing preload
and activation - but it needs more tunning.
When activation fails - we may leak large tree of partially loaded
devices in the dm table (i.e. failure in snapshot activation)
The best we can do here is try to deactivate whole device and
remove as much inactive table entries as we can.
Collapse 2 ifs and replace log_error() with log_warn(), since\
the reported message is not causing tools error.
(and cannot be probably triggered anyway).
Drop find_merging_snapshot() function. Use find_snapshot()
called after check for lv_is_merging_origin() which
is the commonly used code path - so we avoid duplicated
tests and potential risk of derefering NULL point
in unhandled error path.
If the volume_list filters out volume from activation,
it is still success result for this function.
Change the error message back to verbose level.
Detect if the volume is active localy before zeroing,
so we report error a bit later for cases, where volume
could not be activated because it doesn't pass through volume
list (but user still could create volume when he disables
zeroing)
There is a problem with the way mirrors have been designed to handle
failures that is resulting in stuck LVM processes and hung I/O. When
mirrors encounter a write failure, they block I/O and notify userspace
to reconfigure the mirror to remove failed devices. This process is
open to a couple races:
1) Any LVM process other than the one that is meant to deal with the
mirror failure can attempt to read the mirror, fail, and block other
LVM commands (including the repair command) from proceeding due to
holding a lock on the volume group.
2) If there are multiple mirrors that suffer a failure in the same
volume group, a repair can block while attempting to read the LVM
label from one mirror while trying to repair the other.
Mitigation of these races has been attempted by disallowing label reading
of mirrors that are either suspended or are indicated as blocking by
the kernel. While this has closed the window of opportunity for hitting
the above problems considerably, it hasn't closed it completely. This is
because it is still possible to start an LVM command, read the status of
the mirror as healthy, and then perform the read for the label at the
moment after a the failure is discovered by the kernel.
I can see two solutions to this problem:
1) Allow users to configure whether mirrors can be candidates for LVM
labels (i.e. whether PVs can be created on mirror LVs). If the user
chooses to allow label scanning of mirror LVs, it will be at the expense
of a possible hang in I/O or LVM processes.
2) Instrument a way to allow asynchronous label reading - allowing
blocked label reads to be ignored while continuing to process the LVM
command. This would action would allow LVM commands to continue even
though they would have otherwise blocked trying to read a mirror. They
can then release their lock and allow a repair command to commence. In
the event of #2 above, the repair command already in progress can continue
and repair the failed mirror.
This patch brings solution #1. If solution #2 is developed later on, the
configuration option created in #1 can be negated - allowing mirrors to
be scanned for labels by default once again.
Add LV_TEMPORARY flag for LVs with limited existence during command
execution. Such LVs are temporary in way that they need to be activated,
some action done and then removed immediately. Such LVs are just like
any normal LV - the only difference is that they are removed during
LVM command execution. This is also the case for LVs representing
future pool metadata spare LVs which we need to initialize by using
the usual LV before they are declared as pool metadata spare.
We can optimize some other parts like udev to do a better job if
it knows that the LV is temporary and any processing on it is just
useless.
This flag is orthogonal to LV_NOSCAN flag introduced recently
as LV_NOSCAN flag is primarily used to mark an LV for the scanning
to be avoided before the zeroing of the device happens. The LV_TEMPORARY
flag makes a difference between a full-fledged LV visible in the system
and the LV just used as a temporary overlay for some action that needs to
be done on underlying PVs.
For example: lvcreate --thinpool POOL --zero n -L 1G vg
- first, the usual LV is created to do a clean up for pool metadata
spare. The LV is activated, zeroed, deactivated.
- between "activated" and "zeroed" stage, the LV_NOSCAN flag is used
to avoid any scanning in udev
- betwen "zeroed" and "deactivated" stage, we need to avoid the WATCH
udev rule, but since the LV is just a usual LV, we can't make a
difference. The LV_TEMPORARY internal LV flag helps here. If we
create the LV with this flag, the DM_UDEV_DISABLE_DISK_RULES
and DM_UDEV_DISABLE_OTHER_RULES flag are set (just like as it is
with "invisible" and non-top-level LVs) - udev is directed to
skip WATCH rule use.
- if the LV_TEMPORARY flag was not used, there would normally be
a WATCH event generated once the LV is closed after "zeroed"
stage. This will make problems with immediated deactivation that
follows.
This patch reinstates the lv_info call to check for open count of
the LV we're removing/deactivating - this was changed with commit 125712b
some time ago and we relied on the ioctl retry logic deeper in the libdm
while calling the exact 'remove' ioctl.
However, there are still some situations in which it's still required to
check for open count before we do any 'remove' actions - this mainly
applies to LVs which consist of several sub LVs, like it is for
virtual snapshot devices.
The commit 1146691 fixed the issue with ordering of actions during
virtual snapshot removal while the snapshot is still open. But
the check for the open status of the snapshot is still prone to
marking the snapshot as in use with an immediate exit even though
this could be a temporary asynchronous open only, most notably
because of udev and its WATCH udev rule with accompanying scans
for the event which is asynchronous. The situation where this crops
up most often is when we're closing the LV that was open for read-write
and then calling lvremove immediately.
This patch reinstates the original lv_info call for the open status
of the LV in the lv_check_not_in_use fn that gets called before
we do any LV removal/deactivation. In addition to original logic,
this patch adds its own retry loop with a delay (25x0.2 seconds)
besides the existing ioctl retry loop.
Component LVs of a thinpool can be RAID LVs. Users who attempt a
scrubbing operation directly on a thinpool will be prompted to
specify the sub-LV they wish the operation to be performed on. If
neither of the sub-LVs are RAID, then a message telling them that
the operation can only be performed on a RAID LV will be given.
Since the virtual snapshot has no reason to stay alive once we
detach related snapshot - deactivate whole thing in front of
snapshot removal - otherwice the code would get tricky for
support in cluster.
The correct full solution would require to have transactions
for libdm operations.
Also enable to the check for snapshot being opened prior
the origin deactivation, otherwise we could easily end
with the origin being deactivate, but snapshot still kept
active, desynchronizing locking state in cluster.
A common scenario is during new LV creation when we need to wipe the
newly created LV and avoid any udev scanning before this stage otherwise
it could cause the device (the LV) to be claimed by some other subsystem
for which there were stale metadata within LV data.
This patch adds possibility to mark the LV we're just about to wipe with
a flag that gets passed to udev via DM_COOKIE as a subsystem specific
flag - DM_SUBSYSTEM_UDEV_FLAG0 (in this case the subsystem is "LVM")
so LVM udev rules will take care of handling that.
Some code has been added recently which makes it impossible to compile
when "configure --disable-devmapper" is used. This patch just shuffles
the code around so it's under proper #ifdef DEVMAPPER_SUPPORT.
When NULL info struct is passed in - function is usable
as a quick query for lv_is_active_locally() - with a bonus
we may query for layered device.
So it could be seen as a more efficient lv_is_active_locally().
Properly skip unmonitoring of thin pool volume in deactivation code
path. Code makes sure if there is just any thin pool user
it stays monitored with all its resources.
Commit b248ba0a39 attempted to
prevent mirror devices which had a failed device in their
mirrored log from being usable/readable by LVM. This was to
protect against circular dependancies where one LVM command
could be blocked trying to read one of these affected mirrors
while the LVM command to fix/unblock that mirror was stuck
behind the currently running command.
The above commit went wrong when it used 'device_is_usable()' to
recurse on the mirrored log device to check if it was suspended
or blocked. The 'device_is_usable' function also contains a check
for reserved names - like *_mlog, etc. This last check always
triggered when checking a mirror's log simply because of the name,
not because it was suspended or blocked - a false positive.
The solution is to create a new function like 'device_is_usable',
but without the check for reserved names. Using this new function
(device_is_suspended_or_blocked), we can check the status of a
mirror's log device properly.
The status printed for dm-raid targets on older kernels does not include
the syncaction field. This is handled by dev_manager_raid_status() just
fine by populating the raid status structure with NULL for that field.
However, lv_raid_sync_action() does not properly handle that field being
NULL. So, check for it and return 0 if it is NULL.
When tree for thin LVs was using external_lv, there has been
far less optimal solution, that has tried to add certain
existing dependencie only when new node was added.
However this has lead to way to complex tree construction since
many repeated checks have been made during such tree build.
This patch move this detection to the proper _partial_tree generation
code and uses for it new 'activation' flag, which is set when
tree for ACTIVATION or PRELOAD is generated.
It increases performance when thins with external origins are used.
(in release update)
Created dlid for test is not needed afterward, so lower a memory
usage of this call is repeatedly used for building some large tree.
TODO: create function to use given buffer on stack as much cheaper.
Code needs to check if the layer origin device is suspended,
It's valid to create thinvolume snapshot of thinvolume which is also
used as an old-style snapshot. In this case we need to check -real
is suspended.
When adding origin_only - add only layer thin volume.
(in case it's also old-snapshot add only -real device)
Revert commit 37ffe6a. If static variables are to be used then we
will put them elsewhere and limit the optimization to reporting
code, rather that have it be used in the general case.
Previously, we have relied on UUIDs alone, and on lvmcache to make getting a
"new copy" of VG metadata fast. If the code which triggers the activation has
the correct VG metadata at hand (the version which is currently on disk), it can
now hand it to the activation code directly.
There are places where 'lv_is_active' was being used where it was
more correct to use 'lv_is_active_locally'. For example, when checking
for the existance of a kernel instance before asking for its status.
Most of the time these would work correctly. (RAID is only allowed on
non-clustered VGs at the moment, which means that 'lv_is_active' and
'lv_is_active_locally' would give the same result.) However, it is
more correct to use the proper variant and it helps with future
scenarios where targets might be allowed exclusively (or clustered) in
a cluster VG.
Setting the cmd->default_settings.udev_fallback also requires DM
driver version check. However, this caused useless mapper/control
access with ioctl if not needed actually. For example if we're not
using activation code, we don't need to know the udev_fallback as
there's no node and symlink processing.
For example, this premature mapper/control access caused problems
when using lvm2app even when no activation happens - there are
situations in which we don't need to use mapper/control, but still
need some of the lvm2app functionality. This is also the case for
lvm2-activation systemd generator which just needs to look at the
lvm2 configuration, but it shouldn't touch mapper/control.
Commit 9fd7ac7d03 introduced a way a
method of avoiding reading from mirrors with a device failure. If
a device was found to be dead, the mapping table was checked for
'handle_errors' or 'block_on_error'. These strings were checked for
in the table string via 'strstr', which could also match on strings
like, 'no_handle_errors' or 'no_block_on_error'. No such strings
exist, but we don't want to have problems in the future if they do.
So, we check for ' <string>{'\0'|' '}'.
New options to 'lvchange' allow users to scrub their RAID LVs.
Synopsis:
lvchange --syncaction {check|repair} vg/raid_lv
RAID scrubbing is the process of reading all the data and parity blocks in
an array and checking to see whether they are coherent. 'lvchange' can
now initaite the two scrubbing operations: "check" and "repair". "check"
will go over the array and recored the number of discrepancies but not
repair them. "repair" will correct the discrepancies as it finds them.
'lvchange --syncaction repair vg/raid_lv' is not to be confused with
'lvconvert --repair vg/raid_lv'. The former initiates a background
synchronization operation on the array, while the latter is designed to
repair/replace failed devices in a mirror or RAID logical volume.
Additional reporting has been added for 'lvs' to support the new
operations. Two new printable fields (which are not printed by
default) have been added: "syncaction" and "mismatches". These
can be accessed using the '-o' option to 'lvs', like:
lvs -o +syncaction,mismatches vg/lv
"syncaction" will print the current synchronization operation that the
RAID volume is performing. It can be one of the following:
- idle: All sync operations complete (doing nothing)
- resync: Initializing an array or recovering after a machine failure
- recover: Replacing a device in the array
- check: Looking for array inconsistencies
- repair: Looking for and repairing inconsistencies
The "mismatches" field with print the number of descrepancies found during
a check or repair operation.
The 'Cpy%Sync' field already available to 'lvs' will print the progress
of any of the above syncactions, including check and repair.
Finally, the lv_attr field has changed to accomadate the scrubbing operations
as well. The role of the 'p'artial character in the lv_attr report field
as expanded. "Partial" is really an indicator for the health of a
logical volume and it makes sense to extend this include other health
indicators as well, specifically:
'm'ismatches: Indicates that there are discrepancies in a RAID
LV. This character is shown after a scrubbing
operation has detected that portions of the RAID
are not coherent.
'r'efresh : Indicates that a device in a RAID array has suffered
a failure and the kernel regards it as failed -
even though LVM can read the device label and
considers the device to be ok. The LV should be
'r'efreshed to notify the kernel that the device is
now available, or the device should be 'r'eplaced
if it is suspected of failing.
I've updated the dm_status_raid structure and dm_get_status_raid()
function to make it handle the new kernel status fields that will
be coming in dm-raid v1.5.0. It is backwards compatible with the
old status line - initializing the new fields to '0'. The new
structure is also more amenable to future changes. It includes a
'reserved' field that is currently initialized to zero but could
be used to hold flags describing new features. It also now uses
pointers for the character strings instead of attempting to allocate
their space along with the structure (causing the size of the
structure to be variable). This allows future fields to be appended.
The new fields that are available are:
- sync_action : shows what the sync thread in the kernel is doing
(idle, frozen, resync, recover, check, repair, or
reshape)
- mismatch_count: shows the number of discrepancies which were
found or repaired by a "check" or "repair"
process, respectively.
For example, the old call and reference:
find_config_tree_str(cmd, "devices/dir", DEFAULT_DEV_DIR)
...now becomes:
find_config_tree_str(cmd, devices_dir_CFG)
So we're referring to the named configuration ID instead
of passing the configuration path and the default value
is taken from central config definition in config_settings.h
automatically.
Add basic support for converting LV into an external origin volume.
Syntax:
lvconvert --thinpool vg/pool --originname renamed_origin -T origin
It will convert volume 'origin' into a thin volume, which will
use 'renamed_origin' as an external read-only origin.
All read/write into origin will go via 'pool'.
renamed_origin volume is read-only volume, that could be activated
only in read-only mode, and cannot be modified.
Reorder activation code to look similar for preload tree and
activation tree.
Its also give much better suppport for device stacking,
since now we also support activation of snapshot which might
be then used for other devices.
We can avoid many dev_manager (ioctl) calls by caching the results of
previous calls to lv_raid_dev_health. Just considering the case where
'lvs -a' is called to get the attributes of a RAID LV and its sub-lvs,
this function would be called many times. (It would be called at least
7 times for a 3-way RAID1 - once for the health of each sub-LV and once
for the health of the top-level LV.) This is a good idea because the
sub-LVs are processed in groups along with their parent RAID LV and in
each case, it is the parent LV whose status will be queried. Therefore,
there only needs to be one trip through dev_manager for each time the
group is processed.
Similar to the way thin* accesses its kernel status, we add a method
for RAID to grab the various values in its status output without the
higher levels (LVM) having to understand how to parse the output.
Added functions include:
- lib/activate/dev_manager.c:dev_manager_raid_status()
Pulls the status line from the kernel
- libdm/libdm-deptree.c:dm_get_status_raid()
Parses status line and puts components into dm_status_raid struct
- lib/activate/activate.c:lv_raid_dev_health()
Accesses dm_status_raid to deliver raid dev_health string
The new structure and functions can provide a more unified way to access
status information. ('lv_raid_percent' could switch to using these
functions, for example.)
Function _ignore_blocked_mirror_devices was not release
allocated strings images_health and log_health.
In error paths it was also not releasing dm_task structure.
Swaped return code of _ignore_blocked_mirror_devices and
use 1 as success.
In _parse_mirror_status use log_error if memory allocation
fails and few more errors so they are no going unnoticed
as debug messages.
On error path always clear return values and free strings.
For dev_create_file use cache mem pool to avoid memleak.
In case we don't want to activate, autoactivate or have the
VG/LV read-only. Primarily targeted for the auto_activation_volume_list,
but it makes no harm for other settings (the part of the code
that reads these three settings is shared, but there's no
reason to separate it only for this change).
Check if target supports discards for chunk sizes,
that are not power of 2 (just multiple of 64K),
and enable it in case it's supported by thin kernel target.
Commit 9fd7ac7d03 did not handle mirrors
that contained mirrored logs. This is because the status line of the
mirror does not give an indication of the health of the mirrored log,
as you can see here:
[root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
Thus, the possibility for LVM commands to hang still persists when mirror
have mirrored logs. I discovered this while performing some testing that
does polling with 'pvs' while doing I/O and killing devices. The 'pvs'
managed to get between the mirrored log device failure and the attempt
by dmeventd to repair it. The result was a very nasty block in LVM
commands that is very difficult to remove - even for someone who knows
what is going on. Thus, it is absolutely essential that the log of a
mirror be recursively checked for mirror devices which may be failed
as well.
Despite what the code comment says in the aforementioned commit...
+ * _mirrored_transient_status(). FIXME: It is unable to handle mirrors
+ * with mirrored logs because it does not have a way to get the status of
+ * the mirror that forms the log, which could be blocked.
... it is possible to get the status of the log because the log device
major/minor is given to us by the status output of the top-level mirror.
We can use that to query the log device for any DM status and see if it
is a mirror that needs to be bypassed. This patch does just that and is
now able to avoid reading from mirrors that have failed devices in a
mirrored log.
Addresses: rhbz855398 (Allow VGs to be built on cluster mirrors),
and other issues.
The LVM code attempts to avoid reading labels from devices that are
suspended to try to avoid situations that may cause the commands to
block indefinitely. When scanning devices, 'ignore_suspended_devices'
can be set so the code (lib/activate/dev_manager.c:device_is_usable())
checks any DM devices it finds and avoids them if they are suspended.
The mirror target has an additional mechanism that can cause I/O to
be blocked. If a device in a mirror fails, all I/O will be blocked
by the kernel until a new table (a linear target or a mirror with
replacement devices) is loaded. The mirror indicates that this condition
has happened by marking a 'D' for the faulty device in its status
output. This condition must also be checked by 'device_is_usable()' to
avoid the possibility of blocking LVM commands indefinitely due to an
attempt to read the blocked mirror for labels.
Until now, mirrors were avoided if the 'ignore_suspended_devices'
condition was set. This check seemed to suggest, "if we are concerned
about suspended devices, then let's ignore mirrors altogether just
in case". This is insufficient and doesn't solve any problems. All
devices that are suspended are already avoided if
'ignore_suspended_devices' is set; and if a mirror is blocking because
of an error condition, it will block the LVM command regardless of the
setting of that variable.
Rather than avoiding mirrors whenever 'ignore_suspended_devices' is
set, this patch causes mirrors to be avoided whenever they are blocking
due to an error. (As mentioned above, the case where a DM device is
suspended is already covered.) This solves a number of issues that weren't
handled before. For example, pvcreate (or any command that does a
pv_read or vg_read, which eventually call device_is_usable()) will be
protected from blocked mirrors regardless of how
'ignore_suspended_devices' is set. Additionally, a mirror that is
neither suspended nor blocking is /allowed/ to be read regardless
of how 'ignore_suspended_devices' is set. (The latter point being the
source of the fix for rhbz855398.)
Adding couple INTERNAL_ERROR reports for unwanted parameters:
Ensure the 'top' metadata node cannot be NULL for lvmetad.
Make obvious vginfo2 cannot be NULL.
Report internal error if handler and vg is undefined.
Check for handle in poll_vg().
Ensure seg is not NULL in dev_manager_transient().
Report missing read_ahead for _lv_read_ahead_single().
Check for report handler in dm_report_object().
Check missing VG in _vgreduce_single().
Change 'lv_passes_volumes_filter' fn back to static as it's not
actually needed in the other code (a remnant from devel version).
Fix lvm.conf comment referencing '--autoactivate' which was finally
decided to be '--activate ay'.
Define an 'activation_handler' that gets called automatically on
PV appearance/disappearance while processing the lvmetad_pv_found
and lvmetad_pv_gone functions that are supposed to update the
lvmetad state based on PV availability state. For now, the actual
support is for PV appearance only, leaving room for PV disappearance
support as well (which is a more complex problem to solve as this
needs to count with possible device stack).
Add a new activation change mode - CHANGE_AAY exposed as
'--activate ay/-aay' argument ('activate automatically').
Factor out the vgchange activation functionality for use in other
tools (like pvscan...).
The 'mirror' segtype and 'raid1' segtype both set the 'MIRRORED' flag.
However, due to differences in the way these device-mapper targets behave
'mirror' must be suspended with the 'noflush' option and 'raid1' does not
have to be.
This patch ensures that when the 'MIRRORED' flag is checked to see if
'noflush' is needed that it does not also set it for 'raid1' by mistake.
Code adds better support for monitoring of thin pool devices.
update_pool_lv uses DMEVENTD_MONITOR_IGNORE to not manipulate with monitoring.
vgchange & lvchange are checking real thin pool device for existance
as we are using _tpool real device and visible LV pool device might not
be even active (_tpool is activated implicitely for any thin volume).
monitor_dev_for_events is another _lv_postorder like code it might be worth
to think about reusing it here - for now update the code to properly
monitory thin volume deps.
For unmonitoring add extra code to check the usage of thin pool - in case it's in use
unmonitoring of thin volume is skipped.
Update a way we handle option passing - so we now support path and options
with space inside.
Fix dm name usage for thin pools with '-' in name.
Use new lvm.conf option thin_check_options to pass in options as string array.
Use thin_dump --repair suggestion in log error message
and use just warning on deactivation path without repair info
(since node has been deactivated).
Also check whether there is not 16 args for thin_check configured.
Use libdm callback to execute thin_check before activation
thin pool and after deactivation as well.
Supporting thin_check_executable which may pass in extra options for
the tool.
Commit 02f6f4902f introduced a bug that caused
RAID devices to fail to activate if the device for a single sub-LV failed.
The special case of LVM mirror was handled, but not LVM RAID.
EXAMPLE:
[root@bp-01 ~]# devices vg
LV Copy% Devices
lv 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] /dev/sde1(1)
[lv_rimage_1] /dev/sdh1(1)
[lv_rmeta_0] /dev/sde1(0)
[lv_rmeta_1] /dev/sdh1(0)
[root@bp-01 ~]# vgchange -an vg
0 logical volume(s) in volume group "vg" now active
[root@bp-01 ~]# off.sh sdh
Turning off sdh
[root@bp-01 ~]# vgchange -ay vg --partial
Partial mode. Incomplete logical volumes will be processed.
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
Cannot activate vg/lv_rimage_1: all segments missing.
0 logical volume(s) in volume group "vg" now active
AFTER this patch:
[root@bp-01 ~]# vgchange -ay vg --partial
Partial mode. Incomplete logical volumes will be processed.
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
1 logical volume(s) in volume group "vg" now active
[root@bp-01 ~]# devices vg
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
LV Copy% Devices
lv 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] /dev/sde1(1)
[lv_rimage_1] unknown device(1)
[lv_rmeta_0] /dev/sde1(0)
[lv_rmeta_1] unknown device(0)
[root@bp-01 ~]# dmsetup table vg-lv; dmsetup status vg-lv
0 1024000 raid raid1 3 0 region_size 1024 2 253:2 253:3 - -
0 1024000 raid raid1 2 AD 1024000/1024000
No WHATSNEW update necessary because this is an intrarelease fix.
brassow
Count number of error and existing areas and if there is no existing area
for the LV avoid its activation.
Always disable partial activatio for thin volumes.
For mirrors currently put in hack to let it pass with a special name
since current mirror code needs to activate such LV during some operations.
For reading % of mapped size of thin volume use as origin for
old style snapshot '-real' device needs to be queried.
Fix log_error report given for lvs -a in this case.
Extend the usage of origin_only flag to allow resume of thin pool LV
(when it's active) to pass only the messages.
origin_only flag will skip detection of already resumed tree for thin_pool,
so we do not need to suspend the tree and we just send messages.
Pass in the origin_only flag also for thin volumes - but curently the flag
is not used to its best.
FIXME: achieve the state where only thin volume snapshot origin is
suspended without its childrens - let's explore whether this may
happen automatically inside libdm (might be generic for other targets).
So the code would not need to annotate the node for this.
Extend lv_activate_opts with bool flag to know for which purpose
dtree is created - and add message only for activation tree
(since that's the only place that may send them).
Extend validation check for thin snapshot creation and test whether
active snapshot origin is suspended before its snapshot is created
(useful in recover scenarios) - in this case also detect, whether
transaction has been already completed and avoid such suspend check
failure in that case.
Similar to the "mirror" segment type's log device, _add_dev_to_dtree should
be called and not _add_lv_to_dtree when adding metadata sub-LVs to the deptree.
Since _add_lv_to_dtree was being called, 'origin_only' could be set if a
snapshot sits on top of the RAID device. This would cause the actual device
that needed to be added to be skipped in favor of the non-existant device,
"<foo>-real".
This patch to the suspend code - like the similar change for resume -
queries the lock mode of a cluster volume and records whether it is active
exclusively. This is necessary for suspend due to the possibility of
preloading targets. Failure to check to exclusivity causes the cluster target
of an exclusively activated mirror to be used when converting - rather than
the single machine target.
This value returns percentage of 'mapped' size compared with total LV size.
(Without passed seg pointer it return highest mapped size - but it's
not used yet.)
LVM metadata knows only of striped segments - not linear ones.
The activation code detects segments with a single stripe and switches
them to use the linear target.
If the new lvm.conf setting is set to 0 (e.g. in a test script), this
'optimisation' is turned off.
Remove FIXMES - there should not be any pool free call since
the memory pool is from device manager, and pool is detroyed
after the operation, so doing extra free here would not help here.
However lv_has_target_type() is using cmd mempool so here the extra
call for dm_pool_free makes sence.
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
Since we support snapshots of thin volumes, we could have more layers,
so we have to check whether tpool layer is going to be inserted.
As the _add_segment_to_dtree() is the only place that adds tpool
segment, we may just check pointer (no strcmp for layer).
Switch to use seg_is_ function instead of lv_is_.
Add filter which tries to check if scanned device is part
of active multipath.
Firstly, only SCSI major number devices are handled in filter.
Then it checks if device has exactly one holder (in sysfs) and
if it is device-mapper device and DM-UUID is prefixed by "MPATH-".
If so, this device is filtered out.
The whole filter can be switched off by setting
mpath_component_detection in lvm.conf.
https://bugzilla.redhat.com/show_bug.cgi?id=597010
Signed-off-by: Milan Broz <mbroz@redhat.com>
Let's put the overlay device over real thin pool device.
So we can get the proper locking on cluster.
Overwise the pool LV would be activate once implicitely
and in other case explicitely, confusing locking mechanism.
This patch make the activation of pool LV independent on
activation of thin LV since they will both implicitely use
real -thin pool device.
To ensure we properly handle LV cluster locking - explicitely do
not allow to change the availability of the thin pool that is in use
for some thin LV.
As soon as the thin volume is created the only way to activate pool
is via implicit dependency.
Ignore thinpool open count for lv/vgchange operations.
When verify_udev_operations was disable, code for stacking fs operation for
lvm links was completely disable - but this code was also used for collecting
information, that a new node is being created.
Add a new flag which is set when a creation of lv symlinks is requested which
should restore old behaviour of lv_info function, that has called fs_sync()
before quere for open count on device.
Using log_warn to report missing symlinks as warning, since the command
itself returns as successful, we should not produce log_error().
log_warn is better fit here.
Cosmetic, since r is already 0 for the error path, no need to assign it there,
and r is assigned to 1 after switch command.
Also makes the code more readable.
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The current code does not always assign proper udev flags to sub-LVs (e.g.
mirror images and log LVs). This shows up especially during a splitmirror
operation in which an image is split off from a mirror to form a new LV.
A mirror with a disk log is actually composed of 4 different LVs: the 2
mirror images, the log, and the top-level LV that "glues" them all together.
When a 2-way mirror is split into two linear LVs, two of those LVs must be
removed. The segments of the image which is not split off to form the new
LV are transferred to the top-level LV. This is done so that the original
LV can maintain its major/minor, UUID, and name. The sub-lv from which the
segments were transferred gets an error segment as a transitory process
before it is eventually removed. (Note that if the error target was not put
in place, a resume_lv would result in two LVs pointing to the same segment!
If the machine crashes before the eventual removal of the sub-LV, the result
would be a residual LV with the same mapping as the original (now linear) LV.)
So, the two LVs that need to be removed are now the log device and the sub-LV
with the error segment. If udev_flags are not properly set, a resume will
cause the error LV to come up and be scanned by udev. This causes I/O errors.
Additionally, when udev scans sub-LVs (or former sub-LVs), it can cause races
when we are trying to remove those LVs. This is especially bad during failure
conditions.
When the mirror is suspended, the top-level along with its sub-LVs are
suspended. The changes (now 2 linear devices and the yet-to-be-removed log
and error LV) are committed. When the resume takes place on the original
LV, there are no longer links to the other sub-lvs through the LVM metadata.
The links are implicitly handled by querying the kernel for a list of
dependencies. This is done in the '_add_dev' function (which is recursively
called for each dependency found) - called through the following chain:
_add_dev
dm_tree_add_dev_with_udev_flags
<*** DM / LVM divide ***>
_add_dev_to_dtree
_add_lv_to_dtree
_create_partial_dtree
_tree_action
dev_manager_activate
_lv_activate_lv
_lv_resume
lv_resume_if_active
When udev flags are calculated by '_get_udev_flags', it is done by referencing
the 'logical_volume' structure. Those flags are then passed down into
'dm_tree_add_dev_with_udev_flags', which in turn passes them to '_add_dev'.
Unfortunately, when '_add_dev' is finding the dependencies, it has no way to
calculate their proper udev_flags. This is because it is below the DM/LVM
divide - it doesn't have access to the logical_volume structure. In fact,
'_add_dev' simply reuses the udev_flags given for the initial device! This
virtually guarentees the udev_flags are wrong for all the dependencies unless
they are reset by some other mechanism. The current code provides no such
mechanism. Even if '_add_new_lv_to_dtree' were called on the sub-devices -
which it isn't - entries already in the tree are simply passed over, failing
to reset any udev_flags. The solution must retain its implicit nature of
discovering dependencies and be able to go back over the dependencies found
to properly set the udev_flags.
My solution simply calls a new function before leaving '_add_new_lv_to_dtree'
that iterates over the dtree nodes to properly reset the udev_flags of any
children. It is important that this function occur after the '_add_dev' has
done its job of querying the kernel for a list of dependencies. It is this
list of children that we use to look up their respective LVs and properly
calculate the udev_flags.
This solution has worked for single machine, cluster, and cluster w/ exclusive
activation.
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
~> lvconvert --splitmirrors 1 --trackchanges vg/lv
The '--trackchanges' option allows a user the ability to use an image of
a RAID1 array for the purposes of temporary read-only access. The image
can be merged back into the array at a later time and only the blocks that
have changed in the array since the split will be resync'ed. This
operation can be thought of as a partial split. The image is never completely
extracted from the array, in that the array reserves the position the device
occupied and tracks the differences between the array and the split image via
a bitmap. The image itself is rendered read-only and the name (<LV>_rimage_*)
cannot be changed. The user can complete the split (permanently splitting the
image from the array) by re-issuing the 'lvconvert' command without the
'--trackchanges' argument and specifying the '--name' argument.
~> lvconvert --splitmirrors 1 --name my_split vg/lv
Merging the tracked image back into the array is done with the '--merge'
option (included in a follow-on patch).
~> lvconvert --merge vg/lv_rimage_<n>
The internal mechanics of this are relatively simple. The 'raid' device-
mapper target allows for the specification of an empty slot in an array
via '- -'. This is what will be used if a partial activation of an array
is ever required. (It would also be possible to use 'error' targets in
place of the '- -'.) If a RAID image is found to be both read-only and
visible, then it is considered separate from the array and '- -' is used
to hold it's position in the array. So, all that needs to be done to
temporarily split an image from the array /and/ cause the kernel target's
bitmap to track (aka "mark") changes made is to make the specified image
visible and read-only. To merge the device back into the array, the image
needs to be returned to the read/write state of the top-level LV and made
invisible.
Move the free_vg() to vg.c and replace free_vg with release_vg
and make the _free_vg internal.
Patch is needed for sharing VG in vginfo cache so the release_vg function name
is a better fit here.
Implementation described in doc/lvm2-raid.txt.
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
teardown sequence. (Previously the snapshot was deactivated while its
origin was active and before its removal was committed to disk, so
restarting after a crash at the point would leave corruption.)