This is essential for proper format instance ref_count support. We must
use these functions to set the fid everywhere from now on, even the NULL
value!
We'd like to use the fid mempool for the text_context that is stored
in the instance. (We used the cmd mempool before, so the order of
initialisation did not matter, but now it does, since we need to
create the fid mempool first, which happens in the create_instance fn.)
The text_context initialisation is not needed anywhere outside the
create_instance fn, so move it there.
Format instances can be created anytime on demand and they contain
mostly metadata area information (at least for now; in the future
we may store more things here to update/edit in a PV/VG). If there
are lots of metadata areas, memory consumption rises. Using the cmd
context mempool is not quite optimal here because it is destroyed too
late. So let's use a separate mempool for format instances.
Reference counting is used because fids could be shared, e.g. each PV
has either a PV-based fid or VG-based fid. If it's VG-based, each PV has
a shared fid with the VG - a reference to VG's fid.
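As a hedged sketch of that reference counting (the names below are illustrative stand-ins, not the actual LVM2 structures or helpers):
    #include <stdlib.h>

    struct fid_sketch {                 /* stand-in for a format instance */
        unsigned ref_count;
        /* ... metadata area information ... */
    };

    /* Take a reference, e.g. when a PV shares the VG-based fid. */
    struct fid_sketch *fid_get(struct fid_sketch *fid)
    {
        if (fid)
            fid->ref_count++;
        return fid;
    }

    /* Drop a reference; only the last user releases the instance. */
    void fid_put(struct fid_sketch *fid)
    {
        if (fid && !--fid->ref_count)
            free(fid);              /* the real code returns it to the fid mempool */
    }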
Makes the code more readable and reduces the number of memory
accesses, so it's a small optimisation as well.
For _get_token() optimize number parsing. Check for the '.' char only
if the character is not a digit. Move pointer incrementation into one place.
For _eat_space() check only p->te for '\0' when skipping a comment line.
Avoid the check for '\0' when we know the character is a space. Also the master
while loop doesn't need to check p->tb for '\0'; we just need to check that p->tb
isn't already at the end of the buffer. This could give an 'extra' loop cycle
if we are already there - but saves a memory access in every other case.
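Illustrative only (not the actual _get_token() code), the digit-first check with a single increment site looks roughly like this:
    #include <ctype.h>

    /* Scan a number token starting at p; return a pointer past it.
     * '.' is tested only when the character is not a digit (digits are the
     * common case), and the pointer is advanced in exactly one place. */
    const char *scan_number(const char *p, const char *end)
    {
        while (p < end) {
            if (!isdigit((unsigned char) *p) && *p != '.')
                break;
            p++;        /* the single increment site */
        }
        return p;
    }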
Add _lv_postorder_vg() - for calling _lv_postorder() for every LV in a VG.
We use this in 2 places - vg_mark_partial_lvs() and vg_validate() -
so make it a single function.
The benefit is that we use only one piece of cleanup code and avoid
potentially duplicate scans of the same LVs.
Accelerate validation loop by using lvname, lvid, pvid hash tables.
Also merge pvl loop into one cycle now - no need to scan the list twice.
List scan is stopped when dm_hash_insert fails.
The error message with loop_counter1 is no longer provided - however
that message was misleading anyway.
Create new function alloc_vg() to allocate the VG structure.
It takes pool_name (for easier debugging)
and also vg_name to further simplify code.
Move remainder of _build_vg_from_pds to _pool_vg_read
and use vg memory pool for import functions.
(it's been using smem -> fid mempool -> cmd mempool)
(FIXME: remove mempool parameter for import functions and use vg).
Move remainder of the _build_vg to _format1_vg_read
Fixing a few issues:
struct clvm_header contains 'char args[1]' - so adding '+ 1' here
to the message length calculation makes it 1 byte off.
A message with its last byte uninitialized is then passed to the write function.
Also update the related arglen.
Initialise xid and clientid to 0.
Memory allocation is now checked for NULL.
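A hedged illustration of the '+ 1' off-by-one above (the struct is a stand-in that only mimics the 'char args[1]' tail of the real clvm_header):
    #include <stddef.h>

    struct clvm_header_like {       /* stand-in, not the real clvm_header */
        unsigned char cmd;
        unsigned char flags;
        /* ... other fixed fields ... */
        char args[1];               /* variable-length payload starts here */
    };

    /* sizeof() already counts one byte of args[] (plus any trailing padding),
     * so "sizeof(struct ...) + arglen + 1" overshoots and the extra byte is
     * written out uninitialized. Counting the fixed part up to args[] and
     * then adding the payload length avoids that. */
    size_t msg_len(size_t arglen)
    {
        return offsetof(struct clvm_header_like, args) + arglen;
    }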
lvseg_segtype_dup used the VG memory pool for string duplication.
However this pool gets released before reporting happens, so a command like:
pvs -o segtype
prints data from an already released memory pool. Thanks to the fact there
is not much allocation happening after the VG is released, the memory
stays unmodified and the correct result is printed.
The fix adds support for a mempool passed as parameter (like other similar
query commands) and uses the dm_report memory pool for string duplication.
While STRIPE_SIZE_LIMIT * 2 is basically UINT_MAX, the 32bit integer
value can already overflow during arg size parsing.
(This really happens in a test where --stripesize 4294967291 is used;
on s390x the uint overflows and this test is ineffective.)
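A standalone illustration of that overflow (generic names, not the actual arg-parsing code): the KB-to-sectors conversion has to happen in a 64-bit type before the limit check, otherwise a value such as 4294967291 silently wraps.
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative limit only; the real one is derived from STRIPE_SIZE_LIMIT. */
    #define MAX_STRIPE_SECTORS ((uint64_t) UINT32_MAX)

    int parse_stripesize_kb(uint64_t kb, uint32_t *sectors)
    {
        uint64_t s = kb * 2;            /* 1 KB == 2 sectors, computed in 64 bits */

        /* "(uint32_t) (kb * 2)" would wrap here instead of failing cleanly. */
        if (s > MAX_STRIPE_SECTORS) {
            fprintf(stderr, "Stripe size is too big.\n");
            return 0;
        }
        *sectors = (uint32_t) s;
        return 1;
    }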
We allow writing non-orphan PVs only for resize now. The "orphan PV" assert
in pv_write fn uses the "allow_non_orphan" parameter to control this assert.
However, we should find a more elaborate solution so we can remove this
restriction altogether (pv_write together with vg_write is not atomic, we
need to find a safe mechanism so there's an easy revert possible in case of
an error).
There is a lot to test.
Two new config settings added that are intended to make the code behave
closely to the way it did before - worth a try if you find problems.
Fixing some const warnings - with an API change in:
int vg_extend(struct volume_group *vg, int pv_count, const char *const *pv_names,
The change is needed as lvm2api expects const behaviour here.
So vg_extend() is doing a local strdup for unescaping.
skip_dev_dir now returns const char* from const char* vg_name.
The rest of the patch is cleanup of related warnings.
Also using the dm_report_field_string() API change to simplify
casting in _string_disp and _lvname_disp.
New strategy for memory locking to decrease the number of calls to
un/lock memory when processing critical lvm functions.
Introducing functions for the critical section.
Inside the critical section - memory is always locked.
When leaving the critical section, the memory stays locked
until memlock_unlock() is called - this happens with the
sync_local_dev_names() and sync_dev_names() function calls.
memlock_reset() is needed to reset locking numbers after fork
(polldaemon).
The patch itself is mostly a rename:
memlock_inc -> critical_section_inc
memlock_dec -> critical_section_dec
memlock -> critical_section
Daemons (clvmd, dmeventd) are using memlock_daemon_inc&dec
(mlockall()), thus they never release or relock memory they have
already locked.
sync_local_dev_names() and sync_dev_names() are functions rather than macros.
That is better for debugging - and also we do not need to add memlock.h
to the locking.h header (for the memlock_unlock() prototype).
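A rough sketch of the counting scheme behind the new names (the real code in lib/mm/memlock.c is more involved; mlockall()/munlockall() stand in for its actual memory-locking step):
    #include <sys/mman.h>

    static int _critical_section;       /* nesting counter */
    static int _mem_locked;             /* memory stays locked after leaving */

    void critical_section_inc(void)
    {
        if (!_critical_section++ && !_mem_locked) {
            mlockall(MCL_CURRENT | MCL_FUTURE);
            _mem_locked = 1;
        }
    }

    void critical_section_dec(void)
    {
        _critical_section--;            /* leaving does NOT unlock the memory */
    }

    int critical_section(void)
    {
        return _critical_section;
    }

    /* Called via sync_local_dev_names()/sync_dev_names(). */
    void memlock_unlock(void)
    {
        if (!_critical_section && _mem_locked) {
            munlockall();
            _mem_locked = 0;
        }
    }

    /* After fork() in polldaemon the counters no longer match reality. */
    void memlock_reset(void)
    {
        _critical_section = 0;
        _mem_locked = 0;
    }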
Add a configurable option to define the minimal size of
a block device usable as a PV.
pv_min_size() is added to lvm-globals and it's being
initialized through _process_config.
Macro PV_MIN_SIZE is unused and removed.
New define DEFAULT_PV_MIN_SIZE_KB is added to lvm-globals
and unlike PV_MIN_SIZE it uses KB units.
Should help users with various slow devices attached to the system,
which cannot be easily filtered out (like an FDD on /dev/sdX):
https://bugzilla.redhat.com/show_bug.cgi?id=644578
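Illustrative only (the real check sits behind LVM's device filters and config layer): the new setting, given in KB, is converted to 512-byte sectors and any smaller device is skipped as a PV candidate.
    #include <stdint.h>

    static uint64_t _pv_min_size_sectors;   /* set from devices/pv_min_size */

    void init_pv_min_size(uint64_t kb)
    {
        _pv_min_size_sectors = kb * 2;      /* 1 KB == 2 sectors */
    }

    /* Return 1 if a device of this size (in sectors) may be used as a PV. */
    int dev_size_usable(uint64_t dev_size_sectors)
    {
        return dev_size_sectors >= _pv_min_size_sectors;
    }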
results in clvmd deadlock
When a logical volume is activated exclusively in a cluster, the
local (non-cluster-aware) target is used. However, when creating
a snapshot on the exclusive LV, the resulting suspend/resume fails
to load the appropriate device-mapper table - instead loading the
cluster-aware target.
This patch adds an 'exclusive' parameter to the pertinent resume
functions to allow for the right target type to be loaded.
Fix regression from the 2.02.75 speedup - crc32 is currently a little bit
more complicated on big-endian CPUs as the uint32_t needs to be shifted
here.
activated.
In order to achieve this, we need to be able to query whether
the origin is active exclusively (a condition of being able to
add an exclusive snapshot).
Once we are able to query the exclusive activation of an LV, we
can safely create/activate the snapshot.
A change to 'hold_lock' was also made so that a request to acquire
a WRITE lock does not replace an EX lock, which is already a form
of write lock.
Add new function dm_task_set_add_node() to select between 2 types
of node creation in the device directory.
DM_ADD_NODE_ON_RESUME is now the default and ensures the node is created on
resume. The original behavior is available with DM_ADD_NODE_ON_CREATE.
In this case the node is created during dmsetup create --notable.
For the user, 2 new options for dmsetup create are added:
[{--addnodeonresume | --addnodeoncreate }]
Properly working node creation on resume is needed for proper operation
stacking and the ability to correctly check which state the device should be
in after the whole udev transaction.
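For example, a libdevmapper user creating a device with --notable-like behaviour could select the old node handling explicitly; dm_task_set_add_node() and the DM_ADD_NODE_* values are the interface added here, the rest is a trimmed sketch:
    #include <libdevmapper.h>

    int create_device(const char *name, int node_on_create)
    {
        struct dm_task *dmt;
        int r = 0;

        if (!(dmt = dm_task_create(DM_DEVICE_CREATE)))
            return 0;
        if (!dm_task_set_name(dmt, name))
            goto out;
        /* Default is DM_ADD_NODE_ON_RESUME; this restores the old behaviour
         * where the /dev node appears already at create time. */
        if (node_on_create && !dm_task_set_add_node(dmt, DM_ADD_NODE_ON_CREATE))
            goto out;
        r = dm_task_run(dmt);
    out:
        dm_task_destroy(dmt);
        return r;
    }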
Thanks to the CLVMD_CMD_SYNC_NAMES propagation fix, the message passing started
to work. So start sending the message before the VG is unlocked.
Also remove the implicit sync in VG unlock from clvmd, as the message is now
delivered and processed in do_command().
Also add support for this new message into external locking
and mask this event from further processing.
With the ability to stack many operations in one udev transaction -
in some cases we are adding and removing the same device at the same time
(i.e. deactivate followed by activate).
This leads to a problem when checking stacked operations:
i.e. remove /dev/node1 followed by create /dev/node1
If the node creation is handled by udev - there is a problem as the
stacked operation gives a warning about the existing node1 and will try to
remove it - while the next operation needs to recreate it.
Current code removes all previously stacked operations if the fs op is
FS_DEL - the patch adds similar behavior for FS_ADD - it will try to
remove any 'delete' operation if udev is in use.
For the FS_RENAME operation it seems to be more complex. But as we
always stack FS_READ_AHEAD after the FS_ADD operation -
it should be safe to remove all previous operations on the node
when udev is running.
The code does the same checking for stacking libdm and liblvm operations.
As a very simple optimization, counters were added for each stacked op
type to avoid unneeded list scans if some operation does not exist in
the list.
Enable skipping of fs_unlock() (udev sync) if only DEL operations are stacked,
as we do not use lv_info for already deleted nodes.
Instead of implicitly syncing udev operations in the clustered and
file locking code - call synchronization directly in lock_vol() when
the operation unlocks the VG.
The problem is the missing implicit fs_unlock() in the no_locking code.
This is used with --sysinit on a read-only filesystem locking dir.
In this case vgchange -ay could exit before all udev nodes are properly
synchronised and may cause problems with accessing such a node right after
the vgchange --sysinit command is finished.
Add a test case for vgchange --sysinit.
returns a lockspace reference (even if the lockspace already exists)
and thus increases the DLM lockspace count. It means that after
a clvmd restart the lockspace is still in use.
(The only way to clean the environment to enable a clean cluster
shutdown is to call "dlm_tool leave clvmd" several times.)
Because only one clvmd can run at a time, we can use simpler logic:
try to open the lockspace with dlm_open_lockspace() and only if that fails
try to create a new one. This way the lockspace reference does not
increase.
Very easily reproducible with the "clvmd -S" command.
The patch also fixes the return code when clvmd_restart fails and fixes a
double free if the debug option was specified during restart.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=612862
Improve the condition within lock_vol() so we are not calling an extra unlock
if the volume has just been deactivated.
The patch uses lck_type and replaces a negative 'and' condition with a more
readable 'or' condition.
A few missing stack traces added.
As sync_local_dev_names() cannot be called within the activation context,
add a new parameter which allows selecting whether the sync call is needed
before executing a new command.
Stop calling fs_unlock() from lv_de/activate().
Start using internal lvm fs cookie for dm_tree.
Stop directly calling dm_udev_wait() and
dm_tree_set/get_cookie() from activate code -
it's now called through fs_unlock() function.
Add lvm_do_fs_unlock()
Call fs_unlock() when unlocking the VG; the implicit unlock solves the
problem also for clusters - thus no extra command for the clustered
environment is required - only the lvm_do_fs_unlock() function is added
to call lvm's fs_unlock() while holding the lvm_lock mutex in clvmd.
Add fs_unlock() also to set_lv() so the command waits until devices
are ready for regular open (i.e. wiping its beginning).
Move fs_unlock() prototype to activation.h to keep fs.h private
in lib/activate dir and not expose other functions from this header.
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating the config tree inside the function, allow the config tree
to be passed as a parameter - usable later for caching.
If some allocation for the persistent filter fails, its memory reference
was lost; fix it by calling the filter's destructor.
Fix log_error messages for failing allocations.
a mirror image (or removing a log while adding a mirror). Advise the
user to use two separate commands instead.
This issue becomes especially problematic when PVs are specified, as they
tend to mean different things when adding vs removing. In a command that
mixes adding and removing, it is impossible to discern exactly what the
user wants.
This change prevents bug 603912.
Add checks for cloning allocations and fail out when something is
not allocated correctly.
Also move var declarations to the beginning of the function
and fix log_error messages.
To have better control over where the config tree could be modified, use more
const pointers and very carefully downcast them back to non-const
(for the config tree merge).
As the ternary operator has lower priority than the add operation, this check
was not doing what seemed to be expected.
So enclose the test in parentheses and check for NULL in *buf.
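A generic demonstration of the precedence trap (not the exact LVM code): because '?:' binds more loosely than '+', the unparenthesised form turns the whole sum into the condition instead of adding a conditional term.
    #include <stdio.h>

    int main(void)
    {
        int base = 10, flag = 0;

        int wrong = base + flag ? 1 : 0;        /* parsed as (base + flag) ? 1 : 0 */
        int right = base + (flag ? 1 : 0);      /* what was intended */

        printf("wrong=%d right=%d\n", wrong, right);    /* prints: wrong=1 right=10 */
        return 0;
    }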
We need to be sure that /var/run and /var/lock are always there.
(E.g. these two directories could be on tmpfs, which then loses
all its content after reboot.)
Detect the existence of the new SELinux selabel interface during configure.
Use the new dm_prepare_selinux_context instead of dm_set_selinux_context.
We should set the SELinux context before the actual file system object creation.
The new dm_prepare_selinux_context function sets this using the selabel_lookup
fn in conjunction with the setfscreatecon fn. If the selinux/label.h interface
(which should be a part of the selinux library) is not found during configure,
we fall back to the original matchpathcon function instead.
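The intended calling pattern, as a hedged sketch (dm_prepare_selinux_context() is the new libdevmapper call; pairing it with a NULL reset around the creation of the filesystem object is an assumption based on its description above):
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <libdevmapper.h>

    int make_dir_with_context(const char *path)
    {
        int r = 1;

        /* Look up and set the SELinux create context *before* mkdir(). */
        if (!dm_prepare_selinux_context(path, S_IFDIR))
            return 0;
        if (mkdir(path, 0755) < 0)
            r = 0;
        /* Reset the create context afterwards (assumed NULL-reset usage). */
        if (!dm_prepare_selinux_context(NULL, 0))
            r = 0;
        return r;
    }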
Set cmd->independent_metadata_areas if metadata/dirs or disk_areas in use.
- Identify and record this state.
Don't skip full scan when independent mdas are present even if memlock is set.
- Clusters and OOM aren't supported, so no problem doing the proper scans.
Avoid revalidating the label cache immediately after scanning.
- A simple optimisation.
Support scanning for a single VG in independent mdas.
- Not used by the fix but I left it in anyway as later patches might use it.
This reset of vgmem pointer causes access of already released memory.
(_vg_make_handle allocates vg from vgmem pool itself - which is a bit tricky)
Interestingly this memory fault was missed by our test suite.
Add a log_error message for lv_info failure and exit from further
processing.
Replace the 'leg' occurrence in the debug message with 'image', which
is used in other messages.
Add test for NULL from dm_poll_create.
Reorder dm_pool_destroy() before file close and add label out:.
Avoid leaking file descriptor if the allocation fails.
Nicely hidden memory leak in the outf macro error path.
This macro uses out_text() and does an automagical return_0.
That would leak the tag_buffer allocated memory.
As there was the same code for tags output - create an _out_tags() function.
Set vg to NULL after releasing it as the following memlock() test may
lead to goto for the second call of vg_release() with the already
released vg pointer.
Fix for the last commit, as $MOUNTED is not only used as a bool flag,
but also stores the mounted location for remount - so parse the output
from mount differently than the output from /proc/mounts.
Prefix calls of the 'tunefs' tools with LANG=C to be sure we always get
non-localized strings.
Avoid using forced 'resize2fs' for cleanly unmounted filesystems and
run a regular 'fsck -f' in this case, as required by resize2fs.
For extX filesystems 'fsadm check' compares the date of the last mount
with the date of the last check; if the mount is more recent it runs
'fsck -f', so resize2fs is happy and the user does not need to pass
the '-f' flag.
Updated patch from Florian Haas from Linux-HA project.
Users need to 'configure --enable-ocf' to get the file installed
by the 'make install' target by default.
Users can also use 'make install_ocf' to get only the ocf files installed.
With ocf support disabled (the default) - no ocf files are installed.
FIXME: the ocf installation path needs to be kept in sync with pacemaker.
Find a better way and possibly also a better location.
The patch updates exec_cmd() and adds a 3rd parameter with a pointer for
the status value, so the caller can examine the returned status code.
If the passed pointer is NULL, the behavior is unmodified.
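A simplified standalone sketch of the idea (the real exec_cmd() takes more parameters): with a status pointer the caller gets the raw exit code and decides itself; with NULL the old 'non-zero means failure' behaviour is kept.
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int exec_cmd_sketch(char *const argv[], int *status)
    {
        pid_t pid;
        int wstatus;

        if ((pid = fork()) < 0)
            return 0;
        if (!pid) {
            execvp(argv[0], argv);
            _exit(127);
        }
        if (waitpid(pid, &wstatus, 0) != pid || !WIFEXITED(wstatus))
            return 0;
        if (status) {
            *status = WEXITSTATUS(wstatus);
            return 1;                           /* caller examines *status itself */
        }
        return WEXITSTATUS(wstatus) == 0;       /* old behaviour: non-zero fails */
    }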
The patch allows lvresize to continue if the failure from the fsadm check is
caused by a mounted filesystem, as many filesystem resize tools do support
online filesystem resize. (Originally the user had to use the flag '-n' to
bypass this filesystem check.)
Return status code 3 from 'fsadm check' for a mounted filesystem - used later by
the lvresize update patch to better support online filesystem resize.
Also make user interruption handling more consistent and return status code 2
in this case.
Simultaneous -a and --refresh is not valid.
poll+monitor are valid together with or without -ay* (but not with -an*)
No longer print polling results summary if no LVs in the VG were polled.
We cast (char*) to (uint32_t*), which changes alignment requirements.
In our case the code has been correct, as alloca() returns a properly
aligned buffer; however this patch makes it cleaner and more readable
and avoids warning generation.
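One common way to express this more cleanly, shown generically (the patch itself may differ): copy the bytes into a properly typed variable instead of casting the pointer, which is alignment-safe everywhere and keeps the compiler quiet.
    #include <stdint.h>
    #include <string.h>

    /* Read a 32-bit value from an arbitrarily aligned byte buffer. */
    uint32_t read_u32(const char *buf)
    {
        uint32_t v;

        /* memcpy() avoids the (char *) -> (uint32_t *) cast; compilers turn
         * this into a single load where the target's alignment allows it. */
        memcpy(&v, buf, sizeof(v));
        return v;
    }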
A merged snapshot's DM device is made to use the "error" target as part
of lvm's transaction to merge a snapshot. This snapshot merge use-case
aside, any device using the error target shouldn't be scanned.
Problem:
When both legs of a mirrored log fail, neither the log nor the parent
mirror can proceed. The repair code must be careful to replace the
log with an error target before operating on the parent - otherwise,
the parent can get stuck trying to suspend because it can't push through
any writes. The steps to replace the log device with an error target
were incomplete and resulted in the replacement not happening at all!
The code originally had all the necessary logic to complete the
replacement task, but it was pulled out in an effort to clean up that
section of code, while fixing another bug:
<offending commit msg>
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
</offending commit msg>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/LVM2/lib/metadata/mirror.c?cvsroot=lvm2&f=h#rev1.130
I've reverted the "clean-up" changes associated with that fix, but not what
that commit was actually fixing.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
page.
Add ->target_name to segtype_handler to allow a more specific target
name to be returned based on the state of the segment.
Result of trying to merge a snapshot using a kernel that doesn't have
the snapshot-merge target:
Before:
# lvconvert --merge vg/snap
Can't expand LV lv: snapshot target support missing from kernel?
Failed to suspend origin lv
After:
# lvconvert --merge vg/snap
Can't process LV lv: snapshot-merge target support missing from kernel?
Failed to suspend origin lv
Unable to merge LV "snap" into it's origin.
Try to distinguish between the case of using an interactive shell and
non-interactive running - different combinations of the '-y' and '-p' options
need to be used for fsck.
Update the way fsadm detects a mounted filesystem.
With udev, /dev/dm-XXX paths are now returned - but mount or /proc/mounts
prints names in the form /dev/mapper/vg-lv - so the match was not found.
Fixes RHBZ #638050.
The current solution uses the same trick as mount and detects the vg-lv name
through /sys where available - this should be reasonably safe.
Instead of calling mount without parameters to get the actual mount table,
switch to using /proc/mounts directly.
Fix missing 'dry' execution of lvresize - fixing a problem where the resize
command was 'dry-run' executed - but lvresize was executed for real.
Also adapt the code slightly to better support recursive execution of fsadm
through the lvresize call.
Under certain conditions it was possible to break (^C) fsadm before actually
resizing the filesystem, while lvresize, which executed fsadm, would think the
resize was successful and shrink the partition with the unresized filesystem on it.
Fix by returning error (1) for this case - this stops lvresize from further
proceeding with the resize operation.
In other LVM memory structures such as volume_group, the field
used to store flags is called "status", and on-disk fields are called
'flags', so rename the one inside metadata_area to be consistent.
Not only is it more consistent with existing code, but it is also cleaner
to say "the status of this mda is ignored".
Background for this patch - prajnoha pinged me on IRC this morning
about a fix he was working on related to metadataignore when
metadata/dirs was set. I was reviewing my patches from this year
and realized the 'flags' field was probably not the best choice
when I originally did the metadataignore patches.
The current lvm1 allocation code seems to not properly
map segments on missing PVs.
For now disable this functionality.
(It never worked, and the previous commit just introduced a segfault here.)
So the partial mode in lvm1 can only process missing PVs
with no LV segments.
Also do not use a random PV UUID for the missing part but use a fixed
string derived from the VG UUID (to not confuse clvmd tests).
Read the complete content of /proc/self/maps into one buffer without
reallocation in the middle of reading and before doing any m/unlock
operation on these lines - as some of them get changed.
With the previous implementation we read some mappings twice ([stack]).
If some lvm1 device is missing, lvm fails on all operations
# vgcfgbackup -f bck -P vg_test
Partial mode. Incomplete volume groups will be activated read-only.
3 PV(s) found for VG vg_test: expected 4
PV segment VG free_count mismatch: 152599 != 228909
PV segment VG extent_count mismatch: 152600 != 228910
Internal error: PV segments corrupted in vg_test.
Volume group "vg_test" not found
Allow loading of an lvm1 partial VG by allocating a "new" missing PV,
which covers the lost space. This fake missing PV also informs the code
that it is a partial VG.
https://bugzilla.redhat.com/show_bug.cgi?id=501390
Revert to the old glibc behaviour for vsnprintf used in the emit_to_buffer fn.
Otherwise, the check that follows would be wrong for new glibc versions.
This caused rh bug #633033 to go undetected and pass through the check,
corrupting the metadata!
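A hedged sketch of the check in question (generic shape, not the actual emit_to_buffer() code): with the C99/new-glibc convention vsnprintf() returns the length that would have been written, so truncation has to be caught as 'n < 0 || n >= remaining' rather than relying on the old -1 return.
    #include <stdarg.h>
    #include <stdio.h>

    /* Append formatted text to *buf (remaining space *size), failing on truncation. */
    int emit_to_buffer_sketch(char **buf, size_t *size, const char *fmt, ...)
    {
        va_list ap;
        int n;

        va_start(ap, fmt);
        n = vsnprintf(*buf, *size, fmt, ap);
        va_end(ap);

        /* Old glibc returned -1 on truncation; current glibc returns the full
         * length needed, so both cases must be rejected or the overflow slips
         * through and corrupts the exported metadata. */
        if (n < 0 || (size_t) n >= *size)
            return 0;

        *buf += n;
        *size -= n;
        return 1;
    }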
In certain configurations, we're not under a VG rw lock while trying to write
a new archive file with VG metadata. A common example is using "vgs" while
having the content of backup and archive directories empty. The code scans the
content of these directories and tries to determine the final index that should
be used in archive name. Since we're not under a lock, we can get into a race
while choosing the index which could end up showing errors about not being able
to rename to the final archive name. Let's add a random number suffix to these
archive file names so we can avoid the race.
For example, when using a '--config "backup { ... }"' line, the values from
lvm.conf (or default values) should be overridden. This patch adds
reinitialisation of archive and backup handling on toolcontext refresh,
which makes these settings take effect.
can be opprobriously slow if created with '--nosync'.
One of the ways cluster mirrors coordinate I/O and recovery
among the different machines is by the use of the log
function 'is_remote_recovering()' which lets nodes know if
a region they wish to perform a write on is currently being
recovered on another node. If the region is being recovered,
the I/O is delayed.
The 'is_remote_recovering' routine has been optimized to
avoid the deluge of requests that would be issued to the
userspace log server by maintaining a marker of how far
the recovery has gotten. It can then immediately return
'not recovering' if the region being inquired about is
less than this mark. Additionally, if the region of
concern is greater than the mark, the function will
limit the number of transmissions to userspace by assuming
the region /is/ being recovered when skipping the
transmission. This limits the amount of processing
and updates the mark in 1/4 sec time steps.
This patch fixes a problem where 'the mark' is not being
updated because of faulty logic in the userspace log
daemon. When '--nosync' is used to create a cluster
mirror, the userspace log daemon never has a chance
to update the mark in the normal way. The fix is to set
the mark to "complete" if the mirror was created with
the --nosync flag.
may never complete.
If you convert from a linear to a mirror and then convert that
mirror back to linear /while/ the previous (up)convert is
taking place, the mirror polling process will never complete.
This is because the function that polls the mirror for
completion doesn't check if it is still polling a mirror and
the copy_percent that it gets back from the linear device is
certainly never 100%.
The fix is simply to check if the daemon is still looking at
a mirror device - if not, return PROGRESS_CHECK_FAILED.
The user sees the following output from the first (up)convert
if someone else sneaks in and does a down-convert shortly
after their convert:
[root@bp-01 ~]# lvconvert -m1 vg/lv
vg/lv: Converted: 43.4%
ABORTING: Mirror percentage check failed.
to block when a mirror under a snapshot suffers a failure.
The problem has to do with label scanning. When a mirror suffers
a failure, the kernel blocks I/O to prevent corruption. When
LVM attempts to repair the mirror, it scans the devices on the
system for LVM labels. While mirrors are skipped during this
scanning process, snapshot-origins are not. When the origin is
scanned, it kicks up I/O to the mirror (which is blocked)
underneath - causing the label scan (and thus the repair operation)
to hang.
This patch simply bypasses snapshot-origin devices when doing
label scans (while ignore_suspended_devices() is set). This
fixes the issue.
introduced in commit b16b4d92a7
"Improve various log messages."
fixes a lot of
../include/metadata.h:148: warning: type qualifiers ignored on function return type
Set appropriate Required-Start and Required-Stop at configure time.
Reorder the checks for user-selected cluster managers to match the
auto-detected ones, to be consistent in the output.
Add a special case for qdiskd that's started after cman/lock_gulmd for
RHEL-4/RHEL-5.
If pvmove crashed and the metadata contains the pvmove LV
but without mirrored segments, pvmove --abort
will not repair the situation (and finishes with success!).
Fix it by allowing a metadata update when aborting
(thus removing the pvmove LV) even if no moved LVs are detected.
(Tested on real metadata provided by an lvm user :-))
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
In all top vg read functions only LCK_VG_READ/WRITE can be used.
All other vg lock definitions are low-level backend machinery.
Moreover, LCK_WRITE cannot be tested through bitmask.
This patch fixes these mistakes.
For _recover_vg() we do not need lock_flags, it can be only
two of above and we always upgrading to LCK_VG_WRITE lock there.
(N.B. that code is racy)
There is no functional change in code (despite wrong masking
it produces correct bits:-)
One shiny day we should use libblkid here. But right now using LUKS is
very common together with LVM, and pvcreate destroys LUKS completely.
So for the user's convenience, try to detect a LUKS signature and allow abort.
This is not only undocumented but it is also in violation of the --help
documentation.
Using --yes without --force is useful in pvcreate when it detects an
old signature.
pvcreate detects MD and swap signatures.
The logic hidden there is not only undocumented but it is also
user unfriendly. Whoever invented this logic should run pvcreate
on their own critical MD device to see why ;-)
This patch
- creates one function instead of duplicated code
- asks whether the user wants to overwrite the signature
- allows aborting (!)
(Please note that writing the LVM signature without wiping the old one
is wrong; it confuses blkid, MD will not work anyway, and
swap and LUKS are broken too.)
The lvm repair issues I believe are the superficial symptoms of this
bug - there are worse issues that are not as clearly seen. From my
inline comments:
* If the mirror was successfully recovered, we want to always
* force every machine to write to all devices - otherwise,
* corruption will occur. Here's how:
* Node1 suffers a failure and marks a region out-of-sync
* Node2 attempts a write, gets by is_remote_recovering,
* and queries the sync status of the region - finding
* it out-of-sync.
* Node2 thinks the write should be a nosync write, but it
* hasn't suffered the drive failure that Node1 has yet.
* It then issues a generic_make_request directly to
* the primary image only - which is exactly the device
* that has suffered the failure.
* Node2 suffers a lost write - which completely bypasses the
* mirror layer because it had gone through generic_m_r.
* The file system will likely explode at this point due to
* I/O errors. If it wasn't the primary that failed, it is
* easily possible in this case to issue writes to just one
* of the remaining images - also leaving the mirror inconsistent.
*
* We let in_sync() return 1 in a cluster regardless of what is
* in the bitmap once recovery has successfully completed on a
* mirror. This ensures the mirroring code will continue to
* attempt to write to all mirror images. The worst that can
* happen for reads is that additional read attempts may be
* taken.
Ignore snapshots when performing mirror recovery beneath an origin.
Pass LCK_ORIGIN_ONLY flag around cluster.
Add suspend_lv_origin and resume_lv_origin using LCK_ORIGIN_ONLY.
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
The clvmd daemon itself does the right thing when invoked as non-root, by
returning 4.
The patch removes the use of the daemon function from
/etc/rc.d/init.d/functions, which is unnecessary and has the bad habit of
masking the return codes from the real daemon.
Add a simple and generic check to see if clvmd is executed by root or not.
Our stop/reload/restart paths in the init script are complex and not all
the tools involved in the process are guaranteed to return 4 if executed
by non-root against a process that's running as root (for example kill
-TERM will return -1 and parsing the output to catch the error is
suboptimal at best).
https://bugzilla.redhat.com/show_bug.cgi?id=553381
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
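The heuristic above boils down to something like this illustrative helper (not the actual set_pe_align() code; all values in 512-byte sectors):
    #include <stdint.h>

    uint64_t choose_data_alignment(uint64_t default_align, uint64_t detected_align)
    {
        /* Nothing detected (no MD stripe width, no topology hints): use the default. */
        if (!detected_align)
            return default_align;

        /* If the 1MB default is a multiple of the detected value it already
         * satisfies the device, so keep the default ... */
        if (!(default_align % detected_align))
            return default_align;

        /* ... otherwise honour the detected value. */
        return detected_align;
    }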
Introduce --norestorefile to allow user to override the new requirement.
This can also be overridden with "devices/require_restorefile_with_uuid"
in lvm.conf -- however the default is 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We can already detect MD devices internally. But when using MD partitions,
these have the "block extended major" (blkext) assigned (259). The blkext
major is also used in general, so we need to check whether the original
device is actually an MD device.