IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The original commit comments can be located via this git commit ID:
7d8e615c0b
There were three possible solutions to the original problem proposed in the
initial check-in. The one chosen was as follows:
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it doesn't
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
Turns out, the cluster also views the extra suspend/resume operations as
pointless too and ignores them. So, this solution doesn't work in a cluster.
Further, I've noticed that in addition to the remote cluster nodes still getting
I/O errors from scanning the error target, they also have a different LVM and
DM views of the same LV. IOW, while the LVM level (gotten from the LVM metadata)
sees the correct name for the newly split LV, device-mapper still maintains the
old names.
Because the original fix failed to completely fix the problem (or work-around it)
and because a better solution must be found to address the additional cluster
issue of device renaming, I am reverting the above mentioned commit.
The current code does not always assign proper udev flags to sub-LVs (e.g.
mirror images and log LVs). This shows up especially during a splitmirror
operation in which an image is split off from a mirror to form a new LV.
A mirror with a disk log is actually composed of 4 different LVs: the 2
mirror images, the log, and the top-level LV that "glues" them all together.
When a 2-way mirror is split into two linear LVs, two of those LVs must be
removed. The segments of the image which is not split off to form the new
LV are transferred to the top-level LV. This is done so that the original
LV can maintain its major/minor, UUID, and name. The sub-lv from which the
segments were transferred gets an error segment as a transitory process
before it is eventually removed. (Note that if the error target was not put
in place, a resume_lv would result in two LVs pointing to the same segment!
If the machine crashes before the eventual removal of the sub-LV, the result
would be a residual LV with the same mapping as the original (now linear) LV.)
So, the two LVs that need to be removed are now the log device and the sub-LV
with the error segment. If udev_flags are not properly set, a resume will
cause the error LV to come up and be scanned by udev. This causes I/O errors.
Additionally, when udev scans sub-LVs (or former sub-LVs), it can cause races
when we are trying to remove those LVs. This is especially bad during failure
conditions.
When the mirror is suspended, the top-level along with its sub-LVs are
suspended. The changes (now 2 linear devices and the yet-to-be-removed log
and error LV) are committed. When the resume takes place on the original
LV, there are no longer links to the other sub-lvs through the LVM metadata.
The links are implicitly handled by querying the kernel for a list of
dependencies. This is done in the '_add_dev' function (which is recursively
called for each dependency found) - called through the following chain:
_add_dev
dm_tree_add_dev_with_udev_flags
<*** DM / LVM divide ***>
_add_dev_to_dtree
_add_lv_to_dtree
_create_partial_dtree
_tree_action
dev_manager_activate
_lv_activate_lv
_lv_resume
lv_resume_if_active
When udev flags are calculated by '_get_udev_flags', it is done by referencing
the 'logical_volume' structure. Those flags are then passed down into
'dm_tree_add_dev_with_udev_flags', which in turn passes them to '_add_dev'.
Unfortunately, when '_add_dev' is finding the dependencies, it has no way to
calculate their proper udev_flags. This is because it is below the DM/LVM
divide - it doesn't have access to the logical_volume structure. In fact,
'_add_dev' simply reuses the udev_flags given for the initial device! This
virtually guarentees the udev_flags are wrong for all the dependencies unless
they are reset by some other mechanism. The current code provides no such
mechanism. Even if '_add_new_lv_to_dtree' were called on the sub-devices -
which it isn't - entries already in the tree are simply passed over, failing
to reset any udev_flags. The solution must retain its implicit nature of
discovering dependencies and be able to go back over the dependencies found
to properly set the udev_flags.
My solution simply calls a new function before leaving '_add_new_lv_to_dtree'
that iterates over the dtree nodes to properly reset the udev_flags of any
children. It is important that this function occur after the '_add_dev' has
done its job of querying the kernel for a list of dependencies. It is this
list of children that we use to look up their respective LVs and properly
calculate the udev_flags.
This solution has worked for single machine, cluster, and cluster w/ exclusive
activation.
The problem as reported by "ben <benscott@nwlink.com>" on lvm-devel:
vgsplit fails with mirrored mirror log
#lvs --all -o lv_name,lv_attr,devices
LV Attr Devices
MyMirror mwi--
[MyMirror_mimage_0] Iwi--- /dev/sdq(0)
[MyMirror_mimage_1] Iwi--- /dev/sdo(0)
[MyMirror_mimage_2] Iwi--- /dev/sdi(0)
[MyMirror_mlog] mwi---
[MyMirror_mlog_mimage_0] Iwi--- /dev/sds(0)
[MyMirror_mlog_mimage_1] Iwi--- /dev/sde(0)
#vgsplit -v "TestA" "TestB" "/dev/sdq" "/dev/sdo" "/dev/sdi" "/dev/sds"
"/dev/sde"
Checking for volume group "TestA"
Checking for new volume group "TestB"
Archiving volume group "TestA" metadata (seqno 213).
Can't split mirror MyMirror between two Volume Groups
AFTER FIX:
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
Volume group "new" not found
Skipping volume group new
LV VG Devices
lv vg lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] vg /dev/sdb1(0)
[lv_mimage_1] vg /dev/sdc1(0)
[lv_mlog] vg lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] vg /dev/sdh1(0)
[lv_mlog_mimage_1] vg /dev/sdi1(0)
[root@bp-01 ~]# vgsplit vg new /dev/sd[bchi]1
New volume group "new" successfully split from "vg"
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
LV VG Devices
lv new lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] new /dev/sdb1(0)
[lv_mimage_1] new /dev/sdc1(0)
[lv_mlog] new lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] new /dev/sdh1(0)
[lv_mlog_mimage_1] new /dev/sdi1(0)
Since execve passed only NULL as environ, we had lost all environment vars on
restart - thus actually running 'different' clvmd then the one at start.
Preserving environ allows to restart clvmd with the same settings
(i.e. LD_LIBRARY_PATH)
Add test for second restart.
Add named cluster_ops to easily learn the name of the active cluster manager,
so we are able to restart singlenode manager in testing.
Add simple test for clvmd -S (restart) and -R (refresh)
(though it needs some extensions).
Read 2 environmental vars to learn about overide position for
CLVMD and LVM binaries.
We support LVM_BINARY in other script - and this way we could easily
test restart in our test-suite.
Bugfix:
Add (most probably unfinished) support for -E arg with list of exclusive
locks. (During clvmd restart all exclusive locks would have been lost and
in fact, if there would have been an exclusive lock, usage text would be
printed and clvmd exits.)
Instead of parsing list options multiple times every time some lock UUID is
checked - put them straight into the hash table - make the code easier to
understand as well.
Remove was_ex_lock() function (replaced with dm_hash_lookup()).
Swap return value for get_initial_state() (1 means success).
Update man pages and usage info for -E option.
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
If you specify the segment type (e.g. --type mirror) and the mirrors argument
as zero, it would result in a mirrored LV with only one image. While the device
may be valid in theory, it should not be allowed in practice. It also makes it
difficult on the conversion tools, since they react badly to single-image
mirrors.
Patch fixes Clang warnings about possible access via lv_name NULL pointer.
Replaces allocation of memory (strdup) with just pointer assignment
(since execve is being called anyway).
Checks for !*lv_name only when lv_name is defined.
(and as I'm not quite sure what state this really is - putting a FIXME
around - as this rather looks suspicios ??).
Add debug print of passed clvmd args.
(since data in statbuf are invalid).
Check whether sysconf managed to find _SC_PAGESIZE.
Report at least debug warning about failing unlink
(logging scheme here seems to be a different then in lvm).
Duplicate terminal FDs and use similar code as is made in clvmd
and cleanup warns about missing open/close tests.
FIXME: Looks like we already have 3 instancies of the same code in lvm repo.
When fsadm is test - it needs to execute lvm and fsadm from non-standard path
setting. So adding a support in fsadm script when user set LVM_BINARY, then
the lvm command invoced from fsadm will have the same PATH setting as before
entering fsadm command.
Needed for testing.
In case someone would use filename paths with spaces when changing
this script surround commands with '"'.
With default settings there is no change in behavior.
LVM has huge set of options now - it's approaching 60 short-arg less options
and we get interesting case of misdetection for 'merge' option which has been
put into the middle of options with 'short_arg' - thus certainly past 65. (ASCII 'A').
To avoid confusion of short_arg with long_opt number - add '128' to all such
non-short-arg options.
Revert John patch, which fixed only 1 place where ~LVM_WRITE was in use and
convert ommited LVM_READ/WRITE flags to 64bit constants as well.
(Since both 'status' flags for LV and VG are 64bit.)
lv_mirror_count was not able to handle mirrors of stripes properly. When a
failed device is removed, the MIRRORED status flag is removed from the LV
conditionally based on the results of lv_mirror_count. However, lv_mirror_count
trusted the MIRRORED flag - thinking any such LV must be mirrored. It would
happily assign first_seg(lv)->area_count as the number of mirrors, but when
a mirrored striped LV was reduced to a simple striped LV area_count would be
the number of /stripes/ not the number of /mirrors/. A result higher than 1
would be returned from lv_mirror_count, the MIRRORED flag would not be cleared,
and the LV would fail to be up-converted properly in lvconvert_mirrors_aux
because of it.
The operation of deactivating the residual error target LV after removing a
mirror layer can cause a "device in-use" conflict with udev. Giving udev a
poke before calling deactivate_lv eliminates the conflict. The stick used
to poke udev is 'sync_local_dev_names'.
Kernel requires a mirror to be at least 1 region large. So,
if our mirror log is itself a mirror, it must be at least
1 region large. This restriction may not be necessary for
non-mirrored logs, but we apply the rule anyway.
(The other option is to make the region size of the log
mirror smaller than the mirror it is acting as a log for,
but that really complicates things. It's much easier to
keep the region_size the same for both.)
LVM_WRITE is a 32-bit flag. Now that RAID[_IMAGE|_META] are 64-bit,
and'ing a RAID LV's status against LVM_WRITE can reset the higher order
flags.
A similar thing will affect thinp flags if not careful.
_alloc_init calculates the number of necessary log extents via
'mirror_log_extents'. 'mirror_log_extents' takes 3 arguments: region_size,
pe_size, and size of the mirror LV. Unfortunately, _alloc_init is guessing at
the mirror size by using 'ah->new_extents / ah->area_multiple' - the number of
extents that the mirror images have. However, this is /always/ wrong when
allocating the log separately. Further, the log is always allocated separately
unless we are up-converting the mirror at the same time. It was by luck alone
that a default value of '1' reflects what we want in most cases.
In order to get a decent value computed, we need to pass in the 'lv' argument
to allocate_extents. This would normally imply a desire for cling/contiguous
allocation to the given LV, but since we are not allocating any parallel
extents and only log extents, it works fine.
When an image is split from a 2-way mirror, the original mirror is converted to
a linear device. To do this, the top "layer" must be removed. The segments
are transferred from the sub-lv to the top-level LV and the link is severed.
The former sub-lv - having its segments transferred - now contains a temporary
error target.
When the original LV is resumed, the old sub-lv that now contains an error
segment is activated and scanned. This is what causes the I/O error messages.
There are three ways to fix this problem:
1) Do not set the sub-lv which contains the error target as "visible" before
suspending the original LV. This way, when the original is resumed, the sub-lv
device node is not created and it is not scanned - avoiding the error messages.
The problem with this approach is that if the machine crashes after the
resume, it leaves the *hidden* LV in place and the user has a more difficult
time noticing that it needs to be cleaned up. Thus, this type of processing is
frowned upon.
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it does not
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
3) Flag the sub-lv (error target) with a "do not scan" flag. This seems like
the cleanest approach, but I have been unable to find the method for doing
this. LVs get tagged in such a way by _get_udev_flags, but in this case the
resume of the original LV also resumes the error target LV without running it
through _get_udev_flags (likely because they are no longer linked). Could
there be something wrong in resume_lv?
Option #2 was chosen to fix this bug, but it seems like more of a workaround
for now.