IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
When an LVM mirror is up-converted (an additional image added), it creates
a temporary mirror stack. The lower-level mirror in the stack that is
created was not being activated exclusively - violating the exclusive nature
of the original mirror. We now check for exclusive activation of a mirror
before converting it, and if found, we ensure that the temporary mirror
is also exclusively activated.
Mirrors used to be created by first creating a linear device and then adding
the other images plus the log. Now mirrors are created by creating all the
images in one go and then adding the log separately. The new way ran into
the condition that cluster mirrors cannot change the log type (in the case
of creation, from core -> disk) while the mirror is not active. (It isn't
active because it is in the process of being created.) The reason this
condition is in place is because a remote node may have the mirror active, and
we don't want to alter the log underneath it.
What we really needed was a way of checking if the mirror was active remotely
but not locally, and in that case do not allow a change of the log. I've added
this check, and cluster mirrors can now be created again.
We've used udev fallback code till now to check whether udev
created/removed the entries in /dev correctly and if not,
a repair was done (giving a warning messagea about that).
This patch adds a possibility to enable this additional check
and subsequent fallback only when required (debugging purposes
mostly) and trust udev completely.
So let's disable the fallback code by default and add a new
configuration option "activation/udev_fallback".
(The original code for creating the nodes will still be used
in case the device directory that is set in lvm.conf differs
from the one that udev uses and also when activation/udev_rules
is set to 0 - otherwise we would end up with no nodes/symlinks
at all)
It's useful to keep the partial flag cached - so just move the call
for vg_mark_partil_lvs() into import_vg_from_config_tree() so it gets
evaluated before it goes through the lvmcache.
This patch should not present any functional change.
Note: It is rather temporal solution - proper place is probably inside the
'read' call back - but needs some more discussion.
For now using this minor hack.
As the ACTIVATE_EXCL could be set only in clvmd code - there is no
use for this test in lv_add_mirrors() function only called from
tools context.
FIXME: Add cluster test case for this.
To avoid modification of 'read-only' volume group structure
add a new structure to pass local data around the code for LV
activation.
As origin_only is one such flag - replace this parameter with new
struct lv_activate_opts.
More parameters might eventually become part of lv_activate_opts.
transient error), stemming from the following sequence of events:
1) devices fail IO, triggering repair
2) dmeventd starts fixing up the mirror
3) during the downconversion, a new metadata version is written
--> the devices come back online here
4) the mirror device suspend/resume is called to update DM tables
5) during the suspend/resume cycle, *pre*-commit metadata is read;
however, since the failed devices are now back online, we get back
inconsistent set of precommit metadata and the whole operation fails
The patch relaxes the check that fails in step 5 above, namely by ignoring
inconsistencies coming from PVs that are marked MISSING.
and use this for the LVM critical section logic. Also report an error if
code tries to load a table while any device is known to be in the
suspended state.
(If the variety of problems these changes are showing up can't be fixed
before the next release, the error messages can be reduced to debug
level.)
are affected by the move. (Currently it's possible for I/O to become
trapped between suspended devices amongst other problems.
The current fix was selected so as to minimise the testing surface. I
hope eventually to replace it with a cleaner one that extends the
deptree code.
Some lvconvert scenarios still suffer from related problems.
Currently some operation with striped mirrors lead
to corrupted metadata, this patch just add detection of such
situation.
Example:
# lvcreate -i2 -l10 -n lvs vg_test
# lvconvert -m1 vg_test/lvs
# lvreduce -f -l1 vg_test/lvs
Reducing logical volume lvs to 4.00 MiB
Segment extent reduction 9not divisible by #stripes 2
Logical volume lvs successfully resized
# lvremove vg_test/lvs
Segment extent reduction 1not divisible by #stripes 2
LV segment lvs:0-4294967295 is incorrectly listed as being used by LV lvs_mimage_0
Internal error: LV segments corrupted in lvs_mimage_0.
There's a possibility someone will use the '/' in the hostname. Since we
generate a temporary file name (path) including the hostname, any '/' would
be ambiguous.
We can always set such hostname using 'sethostname' from unistd.h. But the
'hostname' command already includes the check and removes the '/' char.
However, some old versions still allow that.
See: https://bugzilla.redhat.com/show_bug.cgi?id=711445.
Since this is only a temporary name and the possibility of this error is
quite negligible, we don't need any complex escape sequence here, just a
simple char replace.
Before, we used vg_write_lock_held call to determnine the way a device is
opened. Unfortunately, this opened many devices in RW mode when it was not
really necessary. With the OPTIONS+="watch" rule used in the udev rules,
this could fire numerous events while closing such devices (and it caused
useless scans from within udev rules in return).
A common bug we hit with this was with the lvremove command which was unable
to remove the LV since it was being opened from within the udev rules. This
patch should minimize such situations (at least with respect to LVM handling
of devices).
Though there's still a possibility someone will open a device 'outside' in
parallel and fire the event based on the watch rule when closing a device
once opened for RW.
allocates these buffers in such way it adds memory page for each such buffer
and size of unlock memory check will mismatch by 1 or 2 pages.
This happens when we print or read lines without '\n' so these buffers are
used. To avoid this extra allocation, use setvbuf to set these bufffers ahead.
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Also, add a new 'obtain_device_list_from_udev' setting to lvm.conf with which
we can turn this feature on or off if needed.
If set, the cache of block device nodes with all associated symlinks
will be constructed out of the existing udev database content.
This avoids using and opening any inapplicable non-block devices or
subdirectories found in the device directory. This setting is applied
to udev-managed device directory only, other directories will be scanned
fully. LVM2 needs to be compiled with udev support for this setting to
take effect. N.B. Any device node or symlink not managed by udev in
udev directory will be ignored with this setting on.
Avoid using of already released memory when duplicated MDA is found.
As get_pv_from_vg_by_id() may call lvmcache_label_scan() use the local copy
of the vgname and vgid on the stack as vginfo may dissapear and code was
then accessing garbage in memory.
i.e. pvs /dev/loop0
(when /dev/loop0 and /dev/loop1 has same MDA content)
Invalid read of size 1
at 0x523C986: dm_hash_lookup (hash.c:325)
by 0x440C8C: vginfo_from_vgname (lvmcache.c:399)
by 0x4605C0: _create_vg_text_instance (format-text.c:1882)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
by 0x47FBD8: vg_read_internal (metadata.c:3231)
by 0x477594: get_pv_from_vg_by_id (metadata.c:344)
by 0x45F07A: _get_pv_if_in_vg (format-text.c:1400)
by 0x45F0B9: _populate_pv_fields (format-text.c:1414)
by 0x45F40F: _text_pv_read (format-text.c:1493)
by 0x480431: _pv_read (metadata.c:3500)
by 0x4802B2: pv_read (metadata.c:3462)
Address 0x652ab80 is 0 bytes inside a block of size 4 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x442277: _free_vginfo (lvmcache.c:963)
by 0x44235E: _drop_vginfo (lvmcache.c:992)
by 0x442B23: _lvmcache_update_vgname (lvmcache.c:1165)
by 0x443449: lvmcache_update_vgname_and_id (lvmcache.c:1358)
by 0x443C07: lvmcache_add (lvmcache.c:1492)
by 0x46588C: _text_read (text_label.c:271)
by 0x466A65: label_read (label.c:289)
by 0x4413FC: lvmcache_label_scan (lvmcache.c:635)
by 0x4605AD: _create_vg_text_instance (format-text.c:1881)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
Add testing script
pv_manip.c to properly account for case when pe_start=0 and the first
physical extent is to be released (currently skip the first extent to
avoid discarding the PV label).
My previous patch fixed incorrect error check for dm_snprintf.
However in this particular case - dm_snprintf has been used differently -
just like strncpy + setting last char with '\0' - so the code had to return
error - because the buffer was to short for whole string.
Patch replaces it with real strncpy.
Also test for alloca() failure is removed - as the program behaviour
is rather undefined in this case - it never returns NULL.
lvseg properties for lvm2app, 'devices' and 'seg_pe_ranges'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
dm_pool_free incorrectly. This check-in fixes that incorrect usage.
I've also added a WHATS_NEW line to reflect the changes I made to allow
lv_extend to operate on 0 length intrinsically layered LVs (i.e mirrors
and RAID). I forgot that in the last commit.
allows us to allocate all images of a mirror (or RAID array) at one
time during create.
The current mirror implementation still requires a separate allocation
for the log, however.
Actually, we can call vg_set_fid(vg, NULL) instead of calling
destroy_instance for all PV structs and a VG struct - it's the same
code we already have in the vg_set_fid.
Also, allow exactly the same fid to be set again for the same PV/VG
Before, this could end up with the fid destroyed because we destroyed
existing fid first and then we used the new one and we didn't care
whether existing one == new one by chance.
Instead of searching linear list of all LVs, PVs - use created hash tables
also for quick mapping between LV.
(Note - for small number of PVs or LVs the overhead of the hash is bigger).
TODO: Use hash tables in volume_group structure directly.
Instead of regenerating config tree and parsing same data again,
check whether export_vg_to_buffer does not produce same string as
the one already cached - in this case keep it, otherwise throw cached
content away.
For the code simplicity calling _free_cached_vgmetadata() with
vgmetadata == NULL as the function handles this itself.
Note: sometimes export_vg_to_buffer() generates almost the same data
with just different time stamp, but for the patch simplicity,
data are reparsed in this case.
This patch currently helps for vgrefresh.
Isn't usually perfomance critical - but log_error is used i.e.for debuging,
this code noticable slows down the processing.
Added 512KB limit to avoid memory exhastions in case of some endless loop.
TODO: use _lvm_errmsg buffer only when lvm2api needs it.
Avoid locking sum testing with valgrind compilation.
Make memory unaccessible in the valgrind for dm_pool_abadon_object.
Valgrind hinting should not be needed in _free_chunk for dm_free.
Could be reached via few of our lvm2 test cases:
==11501== Invalid read of size 8
==11501== at 0x49B2E0: _area_length (import-extents.c:204)
==11501== by 0x49B40C: _read_linear (import-extents.c:222)
==11501== by 0x49B952: _build_segments (import-extents.c:323)
==11501== by 0x49B9A0: _build_all_segments (import-extents.c:334)
==11501== by 0x49BB4C: import_extents (import-extents.c:364)
==11501== by 0x497655: _format1_vg_read (format1.c:217)
==11501== by 0x47E43E: _vg_read (metadata.c:2901)
cut from t-vgcvgbackup-usage.sh
--
pvcreate -M1 $(cat DEVICES)
vgcreate -M1 -c n $vg $(cat DEVICES)
lvcreate -l1 -n $lv1 $vg $dev1
--
Idea of the fix is rather defensive - to allocate one extra element
to 'map' array which is then used in _area_length() - where the
loop checks, whether next map entry is continuous.
By placing there always one extra zero entry -
we fix the read of unallocated memory, and we make sure the data would
not make a continous block.
FIXME: there could be a problem if some special broken lvm1 data would be imported.
As the format1 is currently not really used - leave it for future fix
and use this small hotfix for now.
Invalid primary_vginfo was supposed to move all its lvmcache_infos to
orphan_vginfo - however it has called _drop_vginfo() inside the loop
that released primary_vginfo itself - thus made the loop using released
memory.
Use _vginfo_detach_info() instead and call _drop_vginfo after
th loop is finished.
Valgrind trace it should fix:
Invalid read of size 8
at 0x41E960: _lvmcache_update_vgname (lvmcache.c:1229)
by 0x41EF86: lvmcache_update_vgname_and_id (lvmcache.c:1360)
by 0x441393: _text_read (text_label.c:329)
by 0x442221: label_read (label.c:289)
by 0x41CF92: lvmcache_label_scan (lvmcache.c:635)
by 0x45B303: _vg_read_by_vgid (metadata.c:3342)
by 0x45B4A6: lv_from_lvid (metadata.c:3381)
by 0x41B555: lv_activation_filter (activate.c:1346)
by 0x415868: do_activate_lv (lvm-functions.c:343)
by 0x415E8C: do_lock_lv (lvm-functions.c:532)
by 0x40FD5F: do_command (clvmd-command.c:120)
by 0x413D7B: process_local_command (clvmd.c:1686)
Address 0x63eba10 is 16 bytes inside a block of size 160 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x41DE70: _free_vginfo (lvmcache.c:980)
by 0x41DEDA: _drop_vginfo (lvmcache.c:998)
by 0x41E854: _lvmcache_update_vgname (lvmcache.c:1238)
by 0x41EF86: lvmcache_update_vgname_and_id (lvmcache.c:1360)
by 0x441393: _text_read (text_label.c:329)
by 0x442221: label_read (label.c:289)
by 0x41CF92: lvmcache_label_scan (lvmcache.c:635)
by 0x45B303: _vg_read_by_vgid (metadata.c:3342)
by 0x45B4A6: lv_from_lvid (metadata.c:3381)
by 0x41B555: lv_activation_filter (activate.c:1346)
by 0x415868: do_activate_lv (lvm-functions.c:343)
problematic line:
dm_list_iterate_items_safe(info2, info3, &primary_vginfo->infos)
Fix 2 more functions sending cluster messages to avoid passing uninitilised bytes
and compensate 1 extra byte attached to the message from the clvm_header.args[1]
member variable.
If _move_lv_segments is passed a 'lv_from' that does not yet
have any segments, it will screw things up because the code
that does the segment copy assumes there is at least one
segment. See copy code here:
lv_to->segments = lv_from->segments;
lv_to->segments.n->p = &lv_to->segments;
lv_to->segments.p->n = &lv_to->segments;
If 'segments' is an empty list, the first statement copies over
the values, but the next two reset those values to point to the
other LV's list structure. 'lv_to' now appears to have one
segment, but it is really an ill-set pointer.
other cases, the code could wind up removing wrong number of mirrors. In yet
other cases, we could remove the right number of mirrors, but fail to respect
the removal preferences (i.e. keep an image that was requested to be removed
while removing an image that was requested to be kept). Under some
circumstances, remove_mirror_images could also get stuck in an infinite loop.
This patch should fix all of the above undesirable behaviours.
Signed-off-by: Petr Rockai <prockai@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
LVM doesn't behave correctly if running as non-root user,
there is warning when it detects it.
Despite this, it produces many error messages, saying nothing.
See https://bugzilla.redhat.com/show_bug.cgi?id=620571
This patch fixes two things:
1) Removes eror message from device_is_usable() which has no
information value anyway (real warning is printed inside it).
2) it fixes device-mapper initialization, if we support
core dm module autoload and device node is present, it should
fail early and not try recreate existing and correct node.
(non-root == permission denied here)
N.B. In future code should support user roles, some more
drastic checks in code are probably contraproductive now.
Attach \0 for proper char* display - otherwise somewhat random message could
be displayed in debug more and read of unpredictable read of uninitilized
memory values could happen.
As code uses strncpy(system_id, NAME_LEN) and doesn't set '\0'
Fix it by always allocating NAME_LEN + 1 buffer size and with zalloc
we always get '\0' as the last byte.
This bug may trigger some unexpected behavior of the string operation
code - depends on the pool allocator.
FIXME: refactor this code to alloc_vg.
We have 3 components and traling '\0' so allocate proper room for all of them.
Problem was nicely hidden by allocation from pool and allocation aligment
offset - so to trigger real problem with this one is actually hard.
Return value of readlink limits valid string size.
Characters after returned size present some garbage to printf.
Fix it by placing '\0' on the return size value.
Missing free_vg on error_path in lvmcache_get_vg fn. Call destroy_instance
only if the fid is not part of the vg in backup_read_vg fn (otherwise it's
part of the VG we're returning and we definitely don't want to destroy it!).
This is necessary for proper format instance ref_count support. We iterate
over vg->pvs and vg->removed_pvs list and the ref_count is decremented and
then it is destroyed if not referenced anymore.
Since format instances will use own memory pool, it's necessary to properly
deallocate it. For now, only fid is deallocated. The PV structure itself
still uses cmd mempool mostly, but anytime we'd like to add a mempool
in the struct physical_volume, we can just rename this fn to free_pv and
add the code (like we have free_vg fn for VGs).
This is essential for proper format instance ref_count support. We must
use these functions to set the fid everywhere from now on, even the NULL
value!
We'd like to use the fid mempool for text_context that is stored
in the instance (we used cmd mempool before, so the order of
initialisation was not a matter, but now it is since we need to
create the fid mempool first which happens in create_instance fn).
The text_context initialisation is not needed anywhere outside the
create_instance fn so move it there.
Format instances can be created anytime on demand and it contains
metadata area information mostly (at least for now, but in the future,
we may store more things here to update/edit in a PV/VG). In case we
have lots of metadata areas, memory consumption will rise. Using cmd
context mempool is not quite optimal here because it is destroyed too
late. So let's use a separate mempool for format instances.
Reference counting is used because fids could be shared, e.g. each PV
has either a PV-based fid or VG-based fid. If it's VG-based, each PV has
a shared fid with the VG - a reference to VG's fid.
Makes the code more readable and has a smaller number of memory
accesses thus it's small optimisation as well.
For _get_token() optimize number parsing. Check for '.' char only
if it's not a digit. Move pointer incrementation into one place.
For _eat_space() check only p->te for '\0' in skipping of comment line.
Avoid check for '\0' when we know it is space. Also master while loop
doesn't need checking p->tb for '\0'. We just need to check p->tb
isn't already at the end of buffer. This could give 'extra' loop cycle
if we are already there - but safes memory access in every other case.
Add _lv_postorder_vg() - for calling _lv_postorder() for every LV from VG.
We use this in 2 places - vg_mark_partial_lvs() and vg_validate()
so make it as a one function.
Benefit here is - to use only one cleanup code and avoid
potentially duplicate scans of same LVs.
Accelerate validation loop by using lvname, lvid, pvid hash tables.
Also merge pvl loop into one cycle now - no need to scan the list twice.
List scan is stopped when dm_hash_insert fails.
The error message with loop_counter1 is no longer provided - however
the message has been misleading anyway.
Create new function alloc_vg() to allocate VG structure.
It takes pool_name (for easier debugging).
and also take vg_name to futher simplify code.
Move remainder of _build_vg_from_pds to _pool_vg_read
and use vg memory pool for import functions.
(it's been using smem -> fid mempool -> cmd mempool)
(FIXME: remove mempool parameter for import functions and use vg).
Move remainder of the _build_vg to _format1_vg_read
lvseg_segtype_dup used memory pool vg memory pool for strind duplication.
However this one gets released before reporting happens so the command like:
pvs -o segtype
prints data from already released memory pool. Thanks to the fact there
is not much allocation happing after the VG is released, the memory
stays unmodified and correct result is printed.
Fix adds support for mempool passed parameter (like other similar
query commands) and uses dm_report memory pool for string duplication.
- returned char not needed to be explicitly const
- warn if pipe() fails in clvmd (more fixes here needed for error paths...)
- assign (and ignore) read() output in drain buffer
We allow writing non-orphan PVs only for resize now. The "orphan PV" assert
in pv_write fn uses the "allow_non_orphan" parameter to control this assert.
However, we should find a more elaborate solution so we can remove this
restriction altogether (pv_write together with vg_write is not atomic, we
need to find a safe mechanism so there's an easy revert possible in case of
an error).
There is a lot to test.
Two new config settings added that are intended to make the code behave
closely to the way it did before - worth a try if you find problems.
Add a small fix that preserves pe_start for lvm1 PVs when being converted.
(this fix needs to be replaced with something more clever, but let's have this working now)
If the PV is already part of the VG (so the pv->fid == vg->fid), it makes no
sense to attach the mdas information from PV to a VG. Instead, we read new
PV metadata information from cache and attach it to the VG fid.
This function also sets a reference to a new VG format instance for all PVs
that are part of the VG so the PV-VG interconnection is consistent after the
change.
Add supporting functions to work with the format instance and metadata area
structures stored within the format instance. Add support for simple indexing
of metadata areas using PV id and mda order (for on-disk PV only for now, we
can extend the indexing even for other mdas if needed - we only need to define
a proper key for the index).
As the kernel seems to be doing weird things during
mlock -> munlock - allow 1 page locking difference without
warning - and log just debug message for a 1 page difference.
Allocation happens outside critical section probably during
log_warn printing.
Should make tests passing for now.
Fixing some const warnings - with API change in:
int vg_extend(struct volume_group *vg, int pv_count, const char *const *pv_names,
Change is needed - as lvm2api expects const behaviour here.
So vg_extend() is doing local strdup for unescaping.
skip_dev_dir return const char* from const char* vg_name.
Rest of the patch is cleanup of related warnings.
Also using dm_report_filed_string() API change to simplify
casting in _string_disp and _lvname_disp.
New strategy for memory locking to decrease the number of call to
to un/lock memory when processing critical lvm functions.
Introducing functions for critical section.
Inside the critical section - memory is always locked.
When leaving the critical section, the memory stays locked
until memlock_unlock() is called - this happens with
sync_local_dev_names() and sync_dev_names() function call.
memlock_reset() is needed to reset locking numbers after fork
(polldaemon).
The patch itself is mostly rename:
memlock_inc -> critical_section_inc
memlock_dec -> critical_section_dec
memlock -> critical_section
Daemons (clmvd, dmevent) are using memlock_daemon_inc&dec
(mlockall()) thus they will never release or relock memory they've
already locked memory.
Macros sync_local_dev_names() and sync_dev_names() are functions.
It's better for debugging - and also we do not need to add memlock.h
to locking.h header (for memlock_unlock() prototyp).
Add configurable option to define minimal size of
of block device usable as a PV.
pv_min_size() is added to lvm-globals and it's being
initialized through _process_config.
Macro PV_MIN_SIZE is unused and removed.
New define DEFAULT_PV_MIN_SIZE_KB is added to lvm-global
and unlike PV_MIN_SIZE it uses KB units.
Should help users with various slow devices attached to the system,
which cannot be easily filtered out (like FDD on /dev/sdX):
https://bugzilla.redhat.com/show_bug.cgi?id=644578
results in clvmd deadlock
When a logical volume is activated exclusively in a cluster, the
local (non-cluster-aware) target is used. However, when creating
a snapshot on the exclusive LV, the resulting suspend/resume fails
to load the appropriate device-mapper table - instead loading the
cluster-aware target.
This patch adds an 'exclusive' parameter to the pertinent resume
functions to allow for the right target type to be loaded.
Fix regresion from 2.02.75 speedup - so currently crc32 is a little bit
more complicated on big-endian CPU as the uint32_t needs to be shifted
on here.
Make configurable default behaviour how to deal with device node creates.
With udev system natural options should be 'resume'.
For older systems where user expect there is node in /dev/mapper immediately
after dmsetup create --notable - use 'create'
FIXME:
Code needs fixing passing this flag through udev cookie.
activated.
In order to achieve this, we need to be able to query whether
the origin is active exclusively (a condition of being able to
add an exclusive snapshot).
Once we are able to query the exclusive activation of an LV, we
can safely create/activate the snapshot.
A change to 'hold_lock' was also made so that a request to aquire
a WRITE lock did not replace an EX lock, which is already a form
of write lock.
Remove temporaly added fs_unlock() calls to fix clmvd usablity.
Now when the message passing is properly working - they are no longer needed.
Simplify no_locking check for VG unlock - as message is always send
for all targets - clustered & non-clustered.
Thanks to CLVMD_CMD_SYNC_NAMES propagation fix the message passing started
to work. So starts to send a message before the VG is unlocked.
Removing also implicit sync in VG unlock from clmvd as now the message
is delievered and processed in do_command().
Also add support for this new message into external locking
and mask this event from further processing.
With the ability to stack many operations in one udev transaction -
in same cases we are adding and removing same device at the same time
(i.e. deactivate followed by activate).
This leads to a problem of checking stacked operations:
i.e. remove /dev/node1 followed by create /dev/node1
If the node creation is handled with udev - there is a problem as
stacked operation gives warning about existing node1 and will try to
remove it - while next operation needs to recreate it.
Current code removes all previous stacked operation if the fs op is
FS_DEL - patch adds similar behavior for FS_ADD - it will try to
remove any 'delete' operation if udev is in use.
For FS_RENAME operation it seems to be more complex. But as we
are always stacking FS_READ_AHEAD after FS_ADD operation -
should be safe to remove all previous operation on the node
when udev is running.
Code does same checking for stacking libdm and liblvm operations.
As a very simple optimization counters were added for each stacked ops
type to avoid unneeded list scans if some operation does not exists in
the list.
Enable skipping of fs_unlock() (udev sync) if only DEL operations are staked.
as we do not use lv_info for already deleted nodes.
This is better way how to fix clustered synchronization with udev.
As the code for message passing needs fixed - put currently
fs_unlock() after every active/deactive command in clvmd to
ensure nodes are properly created in time.
There was no effect from having this wrong yet, because the
tree of callers only ever cared about the answer to the first
condition (!response), which determines whether a lock is
held or not. Correct responses, however, are needed soon.
Instead of implicitly syncing udev operation in clustered and
file locking code - call synchronization directly in lock_vol() when
the operation unlocks VG
The problem is missing implicit fs_unlock() in the no_locking code.
This is used with --sysinit on read-only filesystem locking dir.
In this case vgchange -ay could exit before all udev nodes are properly
synchronised and may cause problems with accessing such node right after
vgchange --sysinint command is finished.
Add test case for vgchange --sysinit.
strncpy (which check each byte for \0) is not need as we always copy
the length size - so using memcpy is a bit cheaper.
Add missing log_error message for failed allocation.
Improve condition within lock_vol so we are not calling extra unlock
if the volume just has been deactivated.
Patch uses lck_type and replaces negative 'and' condition to more
readable 'or' condition.
Few missing strace traces added.
As sync_local_dev_names() cannot be called within activation context,
add new parametr which allows to select if the sync call is needed
before executing new command.
Stop calling fs_unlock() from lv_de/activate().
Start using internal lvm fs cookie for dm_tree.
Stop directly calling dm_udev_wait() and
dm_tree_set/get_cookie() from activate code -
it's now called through fs_unlock() function.
Add lvm_do_fs_unlock()
Call fs_unlock() when unlocking vg where implicit unlock solves the
problem also for cluster - thus no extra command for clustering
environment is required - only lvm_do_fs_unlock() function is added
to call lvm's fs_unlock() while holding lvm_lock mutex in clvmd.
Add fs_unlock() also to set_lv() so the command waits until devices
are ready for regular open (i.e. wiping its begining).
Move fs_unlock() prototype to activation.h to keep fs.h private
in lib/activate dir and not expose other functions from this header.
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating config tree inside the function allow config tree to
be passed as parameter - usable later for caching.
If some allocation for peristent filter fails its memory reference
was lost, fix it by calling filter's destructor.
Fix log_error messages for failing allocation.
Checking for vg being != NULL in this place is not needed.
Pointer vg is already dereferced in this function above this code line.
Also this internal function _read_pv is always called with valid 'vg' pointer.
Add checks for clonning allocation a fail-out when something is
not allocated correctly.
Also move var declaration to the begining of the function
and fix log_error messages.
As const segment_type or const format_type are never released
use their non-const version and remove const downcast from dm_free calls.
This change fixes many gcc warnings we were getting from them.
Change API interface to accept even completely const array patterns.
This should present no change for libdm users and allows to pass
pattern arrays without cast to const char **.
To have better control were the config tree could be modified use more
const pointers and very carefully downcast them back to non-const
(for config tree merge).
Detect existence of new SELinux selabel interface during configure.
Use new dm_prepare_selinux_context instead of dm_set_selinux_context.
We should set the SELinux context before the actual file system object creation.
The new dm_prepare_selinux_context function sets this using the selabel_lookup
fn in conjuction with the setfscreatecon fn. If selinux/label.h interface
(that should be a part of the selinux library) is not found during configure,
we fallback to the original matchpathcon function instead.
Set cmd->independent_metadata_areas if metadata/dirs or disk_areas in use.
- Identify and record this state.
Don't skip full scan when independent mdas are present even if memlock is set.
- Clusters and OOM aren't supported, so no problem doing the proper scans.
Avoid revalidating the label cache immediately after scanning.
- A simple optimisation.
Support scanning for a single VG in independent mdas.
- Not used by the fix but I left it in anyway as later patches might use it.
This reset of vgmem pointer causes access of already released memory.
(_vg_make_handle allocates vg from vgmem pool itself - which is a bit tricky)
Interestingly this memory fault was missed by our test suite.
Add log_error message for lv_info failure and exit from futher
processing.
Replace 'leg' occurence in debug message with 'image' which
is used in other messages.
Add test for NULL from dm_poll_create.
Reorder dm_pool_destroy() before file close and add label out:.
Avoid leaking file descriptor if the allocation fails.
Nicely hidden memory leak in outf macro error path.
This macro is using out_text() and does automagical return_0.
That would leak tag_buffer allocated memory.
As there was same code for tags output - create _out_tags() function.
Set vg to NULL after releasing it as the following memlock() test may
lead to goto for the second call of vg_release() with the already
released vg pointer.
Similar to 'get' property internal functions.
Add specific 'set' function for vg_mda_copies.
Signed-off-by: Dave Wysochanski <wysochanski@pobox.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Patch updates exec_cmd() and adds 3rd parameter with pointer for
status value, so caller might examine returned status code.
If the passed pointer is NULL, behavior is unmodified.
Patch allows to confinue with lvresize if the failure from fsadm check is
caused by mounted filesystem as many of filesystem resize tools do support
online filesystem resize. (originally user had to use flag '-n' to bypass
this filesystem check)
A merged snapshot's DM device is made to use the "error" target as part
of lvm's transaction to merge a snapshot. This snapshot merge use-case
aside, any device using the error target shouldn't be scanned.
Based on review comments, rename a few fields in lvm_property_type.
In particular, change 'is_writeable' to 'is_settable', which is
more intuitive to the intent of the bitfield (a 'set' function
exists for this field/property). Also, remove the char array
for 'id' - unnecessary as we can just use the string passed in
to do the strcmp. Finally rename the union members from n_val
to 'integer' and 's_val' to 'string'.
to lvm.conf in the activation section: 'snapshot_autoextend_threshold' and
'snapshot_autoextend_percent', that define how to handle automatic snapshot
extension. The former defines when the snapshot should be extended: when its
space usage exceeds this many percent. The latter defines how much extra space
should be allocated for the snapshot, in percent of its current size.
Problem:
When both legs of a mirrored log fail, neither the log nor the parent
mirror can proceed. The repair code must be careful to replace the
log with an error target before operating on the parent - otherwise,
the parent can get stuck trying to suspend because it can't push through
any writes. The steps to replace the log device with an error target
were incomplete and resulted in the replacement not happening at all!
The code originally had all the necessary logic to complete the
replacement task, but was pulled out in a effort to clean-up that
section of code, while fixing another bug:
<offending commit msg>
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
</offending commit msg>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/LVM2/lib/metadata/mirror.c?cvsroot=lvm2&f=h#rev1.130
I've reverted the "clean-up" changes associated with that fix, but not what
that commit was actually fixing.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
page.
Add ->target_name to segtype_handler to allow a more specific target
name to be returned based on the state of the segment.
Result of trying to merge a snapshot using a kernel that doesn't have
the snapshot-merge target:
Before:
# lvconvert --merge vg/snap
Can't expand LV lv: snapshot target support missing from kernel?
Failed to suspend origin lv
After:
# lvconvert --merge vg/snap
Can't process LV lv: snapshot-merge target support missing from kernel?
Failed to suspend origin lv
Unable to merge LV "snap" into it's origin.
Use _even_rand() function instead of floor() in _bitset_with_random_bits().
floor() function is missing in dietlibc (on architectures other than x86).
Moreover using floor() to clip rand results does not assure even result
distribution. _even_rand() uses integer arithmetic only and is designed to
return evenly distributed results.
> Looks OK to me. It took a while to decipher what is the exact meaning of
> the loop in _even_rand (to a non-pseudorandomness-expert) but I am
> fairly comfortable with it now. If I understand this correctly, it
> rejects numbers that come from an "incomplete" slice of the RAND_MAX
> space (considering the number space [0, RAND_MAX] is divided into some
> "max"-sized slices and at most a single smaller slice, between [n*max,
> RAND_MAX] for suitable n -- numbers from this last slice are discarded
> because they could distort the distribution in favour of smaller
> numbers).
Signed-off-by: Przemyslaw Iskra <sparky <at> pld-linux.org>
Reviewed-by: Petr Rockai <prockai <at> redhat.com>
In other LVM memory structures such as volume_group, the field
used to store flags is called "status", and on-disk fields are called
'flags', so rename the one inside metadata_area to be consistent.
Not only is it more consistent with existing code but is cleaner
to say "the status of this mda is ignored".
Background for this patch - prajnoha pinged me on IRC this morning
about a fix he was working on related to metadataignore when
metadata/dirs was set. I was reviewing my patches from this year
and realized the 'flags' field was probably not the best choice
when I originally did the metadataignore patches.
Current lvm1 allocation code seems to not properly
map segments on missing PVs.
For now disable this functionality.
(It never worked and previous commit just introduced segfault here.)
So the partial mode in lvm1 can only process missing PVs
with no LV segments only.
Also do not use random PV UUID for missing part but use fixed
string derived from VG UUID (to not confuse clvmd tests).
We need to use a similar function for pv and lv properties, so just make
a generic _get_property() function that contains most of the required
functionality. Also, add a check to ensure the field name matches the
object passed in by re-using report_type_t enum. For pv properties,
the report_type might be either PVS or LABEL.
In addition, add 'const' to 'get' functions object parameter, but not
'set' functions. Add _not_implemented_set() and _not_implemented_get()
functions.
Add 'get' functions based on generic macros for VG, PV, and LV.
Add 'get' functions for vg string fields, vg_name, vg_fmt, vg_sysid,
vg_uuid, vg_attr, and vg_tags, and all numeric fields.
Add supporting functions for vg_name, vg_fmt, vg_system_id.
Append "_dup" to end of supporting functions to make clear the strings
are dup'd and to avoid namespace conflict with vg_name.
Add supporting functions for pv_uuid, vg_uuid, and lv_uuid.
Call new function id_format_and_copy. Use 'const' where appropriate.
Add "_dup" suffix to indicate memory is being allocated.
Call {pv|vg|lv}_uuid_dup from lvm2app uuid functions.
This patch addresses code review request to simplify creation of 'attr'
strings. The simplification is done in this separate patch to more
easily review and ensure the simplification is done without error.
Move the creating of the 'attr' strings into a common function so
they can be called from the 'disp' functions as well as the new
'get' property functions.
Add "_dup" suffix to indicate memory is allocated.
Refactor pvstatus_disp to take pv argument and call pv_attr_dup().
This patch is similar to the other patches for pv and vg
functionality, and separates lv functionality into separate
files, concentrating on reporting fields and simple functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out pv functions and data, particularly
related to the reporting fields calculations.
More code could be moved here but for now I'm stopping at reporting
functions 'get' / 'set' functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out vg functions and data, particularly
related to the reporting fields calculations.
Read complete content of /proc/self/maps into one buffer without
realocation in the middle of reading and before doing any m/unlock
operation with these lines - as some of them gets change.
With previous implementation we've read some mappings twice ([stack])
If some lvm1 device is missing, lvm fails on all operations
# vgcfgbackup -f bck -P vg_test
Partial mode. Incomplete volume groups will be activated read-only.
3 PV(s) found for VG vg_test: expected 4
PV segment VG free_count mismatch: 152599 != 228909
PV segment VG extent_count mismatch: 152600 != 228910
Internal error: PV segments corrupted in vg_test.
Volume group "vg_test" not found
Allow loading of lvm1 partial VG by allocating "new" missing PV,
which covers lost space. Also this fake mising PV inform code
that it is partial VG.
https://bugzilla.redhat.com/show_bug.cgi?id=501390
Revert to old glibc behaviour for vsnprintf used in emit_to_buffer fn.
Otherwise, the check that follows would be wrong for new glibc versions.
This caused the rh bug #633033 to be undetected and pass throught the check,
corrupting the metadata!
In certain configurations, we're not under a VG rw lock while trying to write
a new archive file with VG metadata. A common example is using "vgs" while
having the content of backup and archive directories empty. The code scans the
content of these directories and tries to determine the final index that should
be used in archive name. Since we're not under a lock, we can get into a race
while choosing the index which could end up showing errors about not being able
to rename to final archive name. Let's add random number suffix to these archive
file names so we can avoid the race.
For example, when using '--config "backup { ... }"' line, the values from
lvm.conf (or default values) should be overridden. This patch adds
reinitialisation of archive and backup handling on toolcontext refresh
which makes these settings to be applied.
to block when a mirror under a snapshot suffers a failure.
The problem has to do with label scanning. When a mirror suffers
a failure, the kernel blocks I/O to prevent corruption. When
LVM attempts to repair the mirror, it scans the devices on the
system for LVM labels. While mirrors are skipped during this
scanning process, snapshot-origins are not. When the origin is
scanned, it kicks up I/O to the mirror (which is blocked)
underneath - causing the label scan (an thus the repair operation)
to hang.
This patch simply bypasses snapshot-origin devices when doing
labels scans (while ignore_suspended_devices() is set). This
fixes the issue.
introduced in commit b16b4d92a7
"Improve various log messages."
fixes a lot of
../include/metadata.h:148: warning: type qualifiers ignored on function return type
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
Add 'get' functions based on the simple macro function definition for a
numeric property.
Add 'get' functions for the following: _vg_extent_count_get,
_vg_free_count_get, _max_lv_get, _max_pv_get, _pv_count_get,
_lv_count_get, _snap_count_get, _vg_seqno_get, _vg_size_get,
_vg_free_get, vg_mda_*.
For size functions, multiply by SECTOR_SIZE to return the value in bytes.
Extend the existing reporting infrastructure definitions and structures
to include a 'get' and 'set' function for each field. We will provide
a 'get' and 'set' function for each of these fields, which will be utilized
by exported lvm2app functions.
Define a default _not_implemented 'get' and 'set' function that just sets
an errno and returns 0. Future patches will actually implement the
specific 'get' and 'set' functions for each property. For read-only
properties, only the 'get' function will be implemented.
Define vg_get_property() function to query a property. We will call
this from a lvm2app function.
The 'id' entries in columns.h are the report field names. Since these are
unique, we'd like to use them in generation of 'get' / 'set' functions.
As a step towards using them for this purpose, remove the explicit double
quotes and use the macro '#' character to add the double quotes back when
placing them into the '_fields' array 'id' member.
Add a 'flags' field to columns.h, and set it to 0 by default.
Define FIELD_MODIFIABLE flag to indicate whether a 'set' function exists
to change the field's value.
In all top vg read functions only LCK_VG_READ/WRITE can be used.
All other vg lock definitions are low-level backend machinery.
Moreover, LCK_WRITE cannot be tested through bitmask.
This patch fixes these mistakes.
For _recover_vg() we do not need lock_flags, it can be only
two of above and we always upgrading to LCK_VG_WRITE lock there.
(N.B. that code is racy)
There is no functional change in code (despite wrong masking
it produces correct bits:-)
One shiny day we should use libblkid here. But now using LUKS is
very common together with LVM and pvcreate destroys LUKS completely.
So for user's convenience, try to detect LUKS signature and allow abort.
pvcreate detects MD and swap signature.
The logic hidden there is not only documented but it is also
user unfriendly. Who invented this logic should run pvcreate
on its own critical MD device to see why;-)
This patch
- creates one function instead of duplication code
- asks if user want to overwrite signature
- allows aborting (!)
(Please note that writing LVM signatute without wiping old
is wrong, it confuses blkid, MD will not work anyway and
swap and LUKS is broken too.)
Ignore snapshots when performing mirror recovery beneath an origin.
Pass LCK_ORIGIN_ONLY flag around cluster.
Add suspend_lv_origin and resume_lv_origin using LCK_ORIGIN_ONLY.
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
Introduce --norestorefile to allow user to override the new requirement.
This can also be overridden with "devices/require_restorefile_with_uuid"
in lvm.conf -- however the default is 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We can already detect MD devices internally. But when using MD partitions,
these have "block extended major" (blkext) assigned (259). Blkext major
is also used in general, so we need to check whether the original device
is an MD device actually.
An incorrect fix on July 13, 2010 for an annoyance has caused a regression.
The offending check-in was part of the 2.02.71 release of LVM. That
check-in caused any PVs specified on the command line to be ignored when
performing a mirror split.
This patch reverses the aforementioned check-in (solving the regressions)
and posits a new solution to the list reversal problem. The original
problem was that we would always take the lowest mimage LVs from a mirror
when performing a split, but what we really want is to take the highest
mimage LVs. This patch accomplishes that by working through the list in
reverse order - choosing the higher numbered mimages first. (This also
reduces the amount of processing necessary.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Takahiro Yasui <takahiro.yasui@hds.com>
all but one mirror leg.
<patch header>
To handle a double failure of a mirrored log, Jon's two patches are
commited, however, lvconvert command can't still handle an error
when mirror leg and mirrored log got failure at the same time.
[Patch]: Handle both devices of a mirrored log failing (bug 607347)
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00009.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00027.html
[Patch]: Handle both devices of a mirrored log failing (bug 607347) -
additional fix
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00093.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00101.html
In the second patch, the target type of mirrored log is replaced with
error target when remove_log is set to 1, but this procedure should be
also used in other cases such as the number of mirror leg is 1. This
patch relocates the procedure to the main path.
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
- With Jon's second patch, we think that we don't have to call
remove_mirror_log() in _lv_update_mirrored_log() because will be
handled remove_mirror_images() in _lvconvert_mirrors_repaire().
</patch header>
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
The cluster log daemon (cmirrord) is not multi-threaded and
can handle only one request at a time. When a log is stacked
on top of a mirror (which itself contains a 'core' log), it
creates a situation that cannot be solved without threading.
When the top level mirror issues a "resume", the log daemon
attempts to read from the log device to retrieve the log
state. However, the log is a mirror which, before issuing
the read, attempts to determine the 'sync' status of the
region of the mirror which is to be read. This sync status
request cannot be completed by the daemon because it is
blocked on a read I/O to the very mirror requesting the
sync status.
CMIRRORD_PIDFILE is not defined. This makes the build fail.
Therefore, we need to conditionalize the check for cmirrord
based on if CMIRRORD_PIDFILE is defined.
mirrors, we must also check that the log daemon (cmirrord) is running.
The log module can be auto-loaded, but the daemon cannot be
"auto-started". Failing to check for the daemon produces cryptic
messages that customers have a hard time deciphering. (The system
messages do report that the log daemon is not running, but people
don't seem to find this message easily.)
Here are examples of what is printed when the module is available,
but the log daemon has not been started.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg
Shared cluster mirrors are not available.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg -v
Setting logging type to disk
Finding volume group "vg"
Archiving volume group "vg" metadata (seqno 3).
Creating logical volume lv
Executing: /sbin/modprobe dm-log-userspace
Cluster mirror log daemon is not running
Shared cluster mirrors are not available.
Creating volume group backup "/etc/lvm/backup/vg" (seqno 4).
The main problem with these bugs was that the newly split
off LV was not being suspended properly. This meant that
the memlock count was not being balanced, the DM devices
were not being renamed, and some DM devices which should
have been removed were not.
I've also renamed some of the variables and added comments
to make things clearer as to what is going on. (I can break
this patch in two if it means easier review.)
Switch dmeventd to use dm_create_lockfile and drop duplicate code.
Allow clvmd pidfile to be configurable.
Switch cmirrord and clvmd to use dm_create_lockfile.
This should bring less confusion when there are some settings left and
people just forgot about it and then they run into problems. These messages
should give them a hint of what's really going on.
even though there was no log. A simple run through the in-tree test
suite would have caught this. :(
- if (lv_is_mirrored(detached_log_lv) &&
+ if (detached_log_lv && lv_is_mirrored(detached_log_lv) &&
Also, made some cosmetic changes suggested by kabi after my last check-in
(e.g. s/return 0/return_0/ and adding an error message).
A previous check-in added logic to handle the case where both images
of a mirrored log failed. It solved the problem by simply removing
the log entirely - leaving the parent mirror with a 'core' log. This
worked for most cases. However, if there was a small delay between
the failures of the two mirrored log devices, the mirror would hang,
LVM would hang, and no additional LVM commands could be issued.
When the first leg of the log fails, it signals the need for repair.
Before 'lvconvert --repair' is run by dmeventd, the second leg fails.
'lvconvert' would see both devices as failed and try to remove the
log entirely. When it came time to suspend the parent mirror to
update the configuration, the suspend would hang because it couldn't
get any I/O through the mirrored log, which was plugged waiting for
corrective action. The solution is to replace the log with an error
target to clear any pending writes before removing it. This allows
the parent mirror to suspend and make the proper changes.
Pass metadataignore through PV creation / setup paths.
As a result of this cleanup, we can remove the unnecessary setting
of mda_ignore bits inside pvcreate_single(), after call to pv_create.
For now, just set metadataignore to '0' in some places. This is
equivalent to the prior functionality, although the 0 is given
by the caller not hardcoded in _mda_setup() call.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
- If a PV contained empty mdas, the auto-recovery code was not kicking in.
- The 'inconsistent' state was getting lost when metadata was cached so
recovery didn't kick in. But leave the behaviour alone when using
precommitted metadata because of a warning in a confusing FIXME.
In my testing, pvs and vgs didn't repair inconsistent metadata like they
used to do. (How many other tools fail similarly now?)
And there should be no need to cache inconsistent metadata because it is
supposed to get repaired under the protection of a write lock immediately it is
discovered.
This code is in need of a redesign based on first principles.
I still see bugs in this code and this commit is risky.
Allow metadataignore flag to be passed in to pvcreate.
Ideally, more refactoring of the mda allocation / initialization
is warranted, but for now, we just add another parameter to 'add_mda'
to take an existing mda ignored flag. We need to do this or pv_write
loses the state of the mda 'ignored' flag before copying and writing
to disk.
Print device name when setting or clearing metadata ignore bit.
Example:
label/label.c:160 /dev/loop2: lvm2 label detected
cache/lvmcache.c:1136 lvmcache: /dev/loop2: now in VG #orphans_lvm2 (#orphans_lvm2)
metadata/metadata.c:4142 Setting mda ignored flag for metadata_locn /dev/loop2.
format_text/text_label.c:318 Skipping mda with ignored flag on device /dev/loop2 at offset 4096
Logging isn't ideal, especially for mda_set_ignore. Ideally we'd
like to display the device name and offset in this case but this
requires a bit more work and a per-format 'mda_description' function
pointer definition (we don't have access to mda_context in
metadata.c).
In preparation to call this from both pvcreate as well as pvchange,
move the guts of metadataignore into a library function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
There's an intermittent failure with vgcfgbackup that seems to have been
introduced with the metadataignore / vgmetadatacopies patchset.
Intermittent failures are often the result of uninitialized data,
so this patch calls zalloc in a few places it might matter.
Allowing an 'all' and 'unmanaged' value is more intuitive, and
provides a simple way for users to get back to original LVM behavior
of metadata written to all PVs in the volume group.
If the user requests "--vgmetadatacopies unmanaged", this instructs
LVM not to manage the ignore bits to achieve a specific number of
metadata copies in the volume group. The user is free to use
"pvchange --metadataignore" to control the mdas on a per-PV basis.
If the user requests "--vgmetadatacopies all", this instructs LVM
to do 2 things: 1) clear all ignore bits, and 2) set the "unmanaged"
policy going forward.
Internally, we use the special MAX_UINT32 value to indicate 'all'.
This 'just' works since it's the largest value possible for the
field and so all 'ignore' bits on all mdas in the VG will get
cleared inside _vg_metadata_balance(). However, after we've
called the _vg_metadata_balance function, we check for the special
'all' value, and if set, we write the "unmanaged" value into the
metadata. As such, the 'all' value is never written to disk.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The check in vg_split_mdas will trigger an error if the 'from' vg
list is empty. However, this might be ok in some instances now
that we have ignored mdas. Relax this check so an error is triggered
only in the case where there's truly no more mdas in the 'from'
vg.
One example of where this makes a difference is with vgreduce.
If we try to vgreduce a PV with un-ignored mdas, this should trigger
the balancing function to un-ignore mdas on another PV in the VG.
However, we don't get to vg_write() before we fail because this
list size check fails, and we see an error message indicating:
"Cannot remove final metadata area ..."
Another example is with vgsplit into a new VG, where the PVs
being moved contain all ignored mdas. We must move the mdas on
fid->metadata_areas_ignored from 'vg_from' to 'vg_to'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The vgextend path calls add_pv_to_vg(). Inside add_pv_to_vg(),
we must ensure we pass the correct mdas list into pv_setup(), as
copies of mdas are placed on the vg->fid list. If we don't place
the mdas on the correct vg->fid list, the various counts may be
incorrect and the metadata balance algorithm will not work when
called from vg_write() path.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Compare the value of the newly added vg_mda_copies field
(--vgmetadatacopies parameter) with the current count of
in-use mdas and ignoring or unignoring mdas as necessary to
get to the target count. Also, as a safety check before
returning, ensure we have at least one mda enabled.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the ability to read/write the vg->mda_copies values
from/to the vg metadata.
If we read the VG metadata and this field does not exist, we set
mda_copies to the default value of 0. Later in the code, we use
this special '0' value to indicate a disable of metadata balancing.
This should preserve existing LVM behavior and ensure metadata balancing
can be turned off should the need arise.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the get and partially implemented set function.
The 'set' function should probably ignore or un-ignore metadata areas
based on new values.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a field to struct volume_group to later implement metadata
balancing:
- mda_copies: target # of non-ignored mdas in the VG; default 0 (do
not control pv 'ignore mdas' bit.
This patch just adds the parameter to the structures with the default
values but does not modify any commands. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Arrange mdas so mdas that are to be ignored come first. This is an
optimization that ensures consistency on disk for the longest period of time.
This was noted by agk in review of the v4 patchset of pvchange-based mda
balance.
Note the following example for an explanation of the background:
Assume the initial state on disk is as follows:
PV0 (v1, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
If we did not sort the list, we would have a commit sequence something like
this:
PV0 (v2, non-ignored)
PV1 (v2, ignored)
PV2 (v2, ignored)
PV3 (v2, non-ignored)
After the commit of PV0's mdas, we'd have an on-disk state like this:
PV0 (v2, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is an inconsistent state of the disk. If the machine fails, the next
time it was brought back up, the auto-correct mechanism in vg_read would
update the metadata on PV1-PV3. However, if possible we try to avoid
inconsistent on-disk states. Clearly, because we did not sort, we have
a greater chance of on-disk inconsistency - from the time the commit of
PV0 is complete until the time PV3 is complete.
We could improve the amount of time the on-disk state is consistent by simply
sorting the commit order as follows:
PV1 (v2, ignored)
PV2 (v2, ignored)
PV0 (v2, non-ignored)
PV3 (v2, non-ignored)
Thus, after the first PV is committed (in this case PV1), on-disk we would
have:
PV0 (v1, non-ignored)
PV1 (v2, ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is clearly a consistent state. PV1 will be read but the mda will be
ignored. All other PVs contain v1 metadata, and no auto-correct will be
required. In fact, if we commit all PVs with ignored mdas first, we'll
only have an inconsistent state when we start writing non-ignored PVs,
and thus the chances we'll get an inconsistent state on disk is much
less with the sorted method.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When we are constructing the vg, we may need to adjust the list of
metadata_areas if there are ignored mdas. At label read time, we
do not read the metadata of ignored mdas, and as a result, they do
not get placed on vg->fid->metadata_areas inside _text_create_text_instance
since lvmcache does not have these areas attached to vginfo->infos.
However, when we're checking the pvids inside _vg_read, after having
read another metadata area from another PV, we do have the opportunity
to update the metadata_area and metadata_areas_ignored lists based
on the read metadata_area. We need accurate mda lists for the reporting
functions that count the ignored mdas, as well as general correctness
of mda balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
With the addition of ignored mdas, we replace all checks for an empty
mda list with a new function to look for either an empty mda list or
ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a helper function to consolidate checking for an empty mdas list
or ignored mdas. Ignored mdas should behave almost identically to
an empty mda list - the metadata areas should not be read or written
to. This function will make it easier to implement metadata balancing
and easier to track pvs with an empty mda list or ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We implement ignore of an mda at label_read time by checking for
the ignore bit, and then skipping the reading of the vgname and
other information in the metadata. This will have an effect similar
to a PV found with no mdas. Thus, it will look like an orphan in the
cache until we scan the rest of the system and find a PV with
metadata, and the mda will not be on the vg->fid->metadata_areas
list so no read/writes will be done to the metadata area.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Define a new pvs field, pv_mda_used_count, and a new vgs field,
vg_mda_used_count to match the existing pv_mda_count and vg_mda_count.
These new fields count the number of mdas that have the 'ignored' bit
clear (they are in use on the PV / VG). Also define various supporting
functions to implement the counting as well as setting the ignored
flag and determining if an mda is ignored. These high level functions
call into the lower level location independent mda ignore functions
defined by earlier patches.
Note that counting ignored mdas in a vg requires traversing both lists
and checking for the ignored bit on the mda. The count of 'ignored'
mdas then is defined by having the bit set, not by which list the mda
is on. The list does determine whether LVM actually does read/write to
the mda, though we must count the bits in order to return accurate numbers
for the various counts. Also, pv_mda_set_ignored must search both vg
lists for ignored mda. If the state changes and needs to be committed
to disk, the ignored mda will be on the non-ignored list.
Note also in pv_mda_set_ignored(), we must properly manage the mda lists.
If we change the ignored state of an mda, we must change any mdas on
vg->fid->metadata_areas that correspond to this pv. Also, we may
need to allocate a copy of the mda, as is done when fid->metadata_areas
is populated from _vg_read(), if we are un-ignoring an ignored mda.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a second mda list, metadata_areas_ignored to fid, and a couple
functions, fid_add_mda() and fid_add_mdas() to help manage the lists.
These functions are needed to properly count the ignored mdas and
manage the lists attached to the 'fid' and ultimately the 'vg'.
Ensure metadata_areas_ignored is initialized in other formats, even
if the list is never used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Because of the way mdas are handled internally, where a PV in a VG
has mdas on both info->mdas and vg->fid->metadata_areas list, we
need a location independent copy constructor for struct
metadata_area. Break up the existing format-text specific copy
constructor into a format independent piece and a format dependent
piece.
This function is necessary to properly implement pv_set_mda_ignored().
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
A metadata_area is defined independent of the location. One downside
is that there is no obvious mapping from a pv to an mda. For a PV in
a VG, we need a way to start with a PV and end up with an MDA, if we
are to manage mdas starting with a device/pv. This function provides
us a way to go down the list of PVs on a VG, and identify which ones
match a particular PV.
I'm not entirely happy with this approach, but it does fit into the
existing structures in a reasonable way.
An alternative solution might be to refactor the VG - PV interface such
that mdas are a list tied to a PV. However, this seemed a bit tricky since
a PV does not come into existence until after the list of mdas is
constructed (see _vg_read() - we create a 'fid' and attach mdas to it,
then we go through them and attach pvs).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
We'd like to pass in mda_header to vgname_from_mda(). In order to
do this, we need to call raw_read_mda_header() from text_label.c,
_text_read(), which gets called from the label_read() path, and
peers into the metadata and update vginfo cache. We should check
the disable bit here, and if set, not peer into the vg metadata,
thus reducing the I/O to disk.
In the process, move vgname_from_mda() to layout.h, since the fn
only gets called from format_text code, and we need the mda_header
definition from the private layout.h.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This refactoring moves the device open/close up one level to the caller of
_vg_read_raw_area(). Should be no functional change and facilitate future
changes related to metadata balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
First we add a 'flags' field to the location independent
metadata_area structure, and a MDA_IGNORE flag. The
mda_is_ignored and mda_set_ignored functions are added to
manage the flag. Adding the flag and functions gives a
library interface to ignore metadata areas independent of
the underlying location (disk, file, etc). The location
specific read/write functions must then handle the specifics
of what this flag means to the location.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Adding a flag to the 'rlocn' structure in the mda header of the
text format allows us to flip a bit to ignore an area on disk that
stores the metadata via the text format specific mda_header.
This patch defines the flag and access functions to manage the flag.
Other patches will manage the ignore on a format-independent basis,
by using a flag in the metadata_area structure.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Future patches will make use of a specific flag in the on-disk 'raw_locn'
structure to enable/disable metadata areas, and facilitate metadata
balancing.
Note that 'filler' is always set to '0' (see add_mda() - memset),
so use of this area as a non-zero flags field is a safe way to
provide future code features.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The same region size is used for both mirror volume and mirrored
log volume, but when the physical extent size is bigger than region size,
the size of mirror leg for mirrored log is smaller than the region size
and lvcreate command fails.
This patch adjusts a region size of mirrored log to a smaller value of
region size or physical extent size.
[This patch ensures that the region_size of the mirrored log does not
exceed the size of the mirrored log itself, which would violate the
kernel constraint: (region_size <= ti->len).]
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Preload libc.mo file for localized lvm before taking memory lock - this way
we prevent disk access for some error paths in libdm, that prints localized
errno messages while they are still in memory locked state.
converting from 2-way to 3-way mirror (collapse_mirrored_lv)
was calling '_remove_mirror_images' with the 'remove_log'
parameter set. When the code was put in to fix 599898 to
honor log parameters during conversion, this argument was
suddenly being honored. Thus, when someone would convert from
a 2-way to 3-way mirror, the log would get removed.
'collapse_mirrored_lv' should not be calling '_remove_mirror_images'
with 'remove_log' set.
to 3-way mirror. When conversion operations are performed on
these types of mirrors, log options can be confused/ignored.
In the case of a converting 3-way mirror, we have a top-level
2-way corelog mirror whose legs are 1) a 2-way disk-log mirror
and 2) a linear device. If we wish to convert this 3-way mirror
to a 2-way mirror, the linear device is removed and the extra
top layer is eliminated. If we also wished to convert the disk
log to a core log in the same step, ambiguity creeps in. It is
somewhat obvious what the user wants - a 2-way mirror with a
corelog. However, looking at the top level mirror before
compression, it seems that the mirror already has a core log.
This is why the operation seemed to fail.
This patch simply re-evaluates what mirrored_seg points to after
a compression and then considers the log argument.
This is a fix for bug 599898.
Code is mixing up internal DLM and LVM definitions of lock
modes and flags.
OpenAIS and singlenode locking do not depend on DLM but
code currently cannot be compiled without libdlm.h!
LCK_* flags is LVM abstraction, used through all the code.
Only low-level backend (clvmd-cman etc) should use DLM definitions,
also this code should do all needed conversions.
Because there are two DLM flags used in generic code
(NOQUEUE, CONVERT) we define it similar way like lock modes.
(So all needed binary-compatible flags are on one place in locking.h)
(Further code cleaning still needed, though:-)