IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The str_list_destroy function may be called to cleanup memory when
the list is not used anymore and the list itself was not allocated
from the memory pool.
When checking minimum mda size, make sure the mda_size after alignment
and calculation is more than 0 - if there's no place for an MDA at the
end of the disk, the _text_pv_add_metadata_area does not try to add it
there and it returns (because we already have the MDA at the start of
the disk at least).
Actually, we don't need extra condition as introduced in commit
00348c0a63. We should fix the last
condition:
(mdac->rlocn.size >= mdah->size)
...which should be:
(MDA_HEADER_SIZE + (rlocn ? rlocn->size : 0) + mdac->rlocn.size >= mdah->size))
Where the "mdac" is new metadata, the "rlocn" is old metadata.
So the main problem with the previous condition was that it
didn't count in MDA_HEADER_SIZE properly (and possible existing
metadata - the "rlocn"). This could have caused the error state
where metadata in ring buffer overlap to not be hit.
Replace the new condition introduced in 00348c0a63
with the improved one for the condition that existed there
already but it was just incomplete.
We're already checking whether old and new meta do not overlap in
ring buffer (as we need to keep both old and new meta during vg_write
up until vg_commit).
We also need to check whether the new metadata do not overlap
themselves in case we don't have old metadata yet (...because
we're in vgcreate). This could happen if we're creating a VG so
that the very first metadata written are long enough that it wraps
themselves in metadata ring buffer.
Although we limited the minimum metadata area size better with the
previous commit ccb8da404d which
makes the initial VG metadata overlap in ring buffer to be less
probable, the risk of hitting this overlap condition is still there
if we still manage to generate big enough metadata somehow.
For example, users can provide many and/or long VG tags during vgcreate
so that the VG metadata is long enough to start to wrap in the ring
buffer again...
Drop already tested 'threshold & create' which is in
lvextend-thin-full.sh
Count with now match faster 'dmeventd' wakeup on watermark
as it's now nearly instant after crossing threshold value.
If the underlaying device has actually bigger read-ahead settings,
let it pass.
But anyway switch to 512 strip-size to get really high R-A sector count.
This option could never have been printed in lvm2 metadata, so it could
be safely removed as it could have been set only as 0.
These configurable setting is supported via metadata profile.
Now with correctly functioning dmeventd enable usage of
low_water_mark for faster reaction on pool's threshold.
When user select e.g. 80% as a threshold value,
dmeventd doesn't need to wait 10 seconds till monitoring
timer expires, but nearly instantly resizes thin-pool
to fit bellow threshold.
If plugin's lvm command execution fails too often (>10 times),
there is no point to torture system more then necessary, just log
and drop monitoring in this case.
The recent addition to check for PVs that were
missed during the first iteration of processing
was unintentionally catching duplicate PVs because
duplicates were not removed from the all_devices
list when the primary dev was processed.
Also change a message from warn back to verbose.
If a VG is removed between the time that 'vgs'
or 'lvs' (with no args) creates the list of VGs
and the time that it reads the VG to process it,
then ignore the removed VG; don't report an error
that it could not be found, since it wasn't named
by the command.
The former patch(dab3ebce4c) is a little bit strict. For example, it is
OK to create PV on unpartitioned DASD devices with LDL formatted. So
after lvm version containing the patch, LVs created on those devices
could not be found.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Improve event string parser to avoid unneeded alloc+free.
Daemon talk function uses '-' to mark NULL/missing field.
So restore the NULL pointer back on parser.
This should have made old tools like 'dmevent_tool' work again.
As now 'uuid' or 'dso' could become NULL and then be
properly used in _want_registered_device() function.
Since lvm2 always fill these parameters, this change should
have no effect on lvm2.
PVs could be missing from the 'pvs' output if
their VG was removed at the same time that the
'pvs' command was run. To fix this:
1. If a VG is not found when processed, don't
silently skip the PVs in it, as is done when
the "skip" variable is set.
2. Repeat the VG search if some PVs are not
found on the first search through all VGs.
The second search uses a specific list of
PVs that were missed the first time.
testing:
/dev/sdb is a PV
/dev/sdd is a PV
/dev/sdg is not a PV
each test begins with:
vgcreate test /dev/sdb /dev/sdd
variations to test:
vgremove -f test & pvs
vgremove -f test & pvs -a
vgremove -f test & pvs /dev/sdb /dev/sdd
vgremove -f test & pvs /dev/sdg
vgremove -f test & pvs /dev/sdb /dev/sdg
The pvs command should always display /dev/sdb
and /dev/sdd, either as a part of VG test or not.
The pvs command should always print an error
indicating that /dev/sdg could not be found.
Recognize the target only 'extends' and do not enforce
'flush' in this case. Only the size reduction
still requires flush (so disables usage of no_flush flag).
If some other targets do require flush before suspend,
they have to explicitly ask for it.
While the activation code tries to evaluate which target
really needs flush with suspend and which may go without flush,
it has stayed effectively disabled by original commit:
33f732c5e9 since here
it only allows to pass non-pvmoving 'mirrors'.
So remove check for mirror LV type and only disable
no_flush for 'pvmove'..
TODO: Looking into history - it also seemed like raid target
would have always required flushing but it's been later
removed without clean explanation.
If some more targets really do need 'no_flush' it should
been handle at their 'level' - since we now stack multiple
targets over itself.
Add more functionality to size_changed function.
While 'existing' API only detected 0 for
unchanged, and !0 for changed,
new improved API will also detected if the
size has only went bigger - or there was
size reduction.
Function work for the whole dm-tree - so
no change is size is always 0.
only size extension 1.
and if some size reduction is there - returns -1.
This result can be used for better evaluation
whether we need to flush before suspend.
Use single code to evaluate if the percentage value has
crossed threshold.
Recalculate amount value to always fit bellow
threshold so there are not need any extra reiterations
to reach this state in case policy amount is too small.
Since plugin's percentage compare has been fixed,
it's now revealed wrong compare here.
The logic for threshold is - to allow to go as high
as given value e.g. 80% - so if pool is exactlu 80%
full it's still allowed to use it (dmeventd will not
resize it).
Commit 1a74171ca5 added
a check to ignore a VG that was FAILED_INCONSISTENT
if the command doesn't care if the VG is not found.
Remove that check because that case is never reached
by the current code.
The ONE_VGNAME_ARG was being passed and tested as
vg_read() flag but it's a cmd struct flag.
(It affects command arg processing in toollib,
not vg_read behavior. Flags related to command
processing are generally cmd struct flags, while
vg_read arg flags are generally related to vg_read
behavior.)
Running "vgremove -f VG & pvs" results in the pvs
command reporting that the VG is not found or is
inconsistent. If the VG is gone or being removed,
the pvs command should just skip it and not print
errors about it.
"Not found" is because the pvs command created the
list of VGs to process, including VG, then vgremove
removed the VG, then the pvs command came to to read
the VG to process it and did not find it.
An "inconsistent" error could be reported if vgremove
had only partially completed removing VG when pvs did
vg_read on the VG to process it, causing pvs to find
the VG in a partially-removed state.
This fix adds a flag that pvs uses to ignore a VG
that can't be read or is inconsistent.
When lvmetad is used and lvmcache update function (lvmcache_update_vgname_and_id)
was called to update existing lvmcache records, a condition was met
which made to retun from the update function immediately, effectively
making it NOOP.
It seems there's no reason for such condition and lvmcache should be
update appropriately even when lvmetad used as lvmcache may be reused,
most notably in lvm shell.
It's possible this is a remnant of the lvmetad development code which
didn't get removed for some reason and the bug didn't get spotted
because lvm shell is not used often (the condition dates back to 2012
or so).
Example, lvmetad and lvm shell used:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 124.00m
Before this patch:
==================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
With this patch applied:
========================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
The lvmcache info might be resued, most notably in lvm shell.
We need to be sure that even lvmcache_info marked as invalid
is removed from the lvmcache so it does not confuse any subsequent
code/commands executed later on.
Problematic example with the lvm shell:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Before this patch (/dev/sda still displayed in a way):
======================================================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
No physical volume label read from /dev/sda
(with lvmetad)
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
With this patch applied:
========================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
(with lvmetad)
lvm> pvs
Older pthread library was missing 'trick'
in pthread_cleanup_pop() which lead to
compilation error:
error: label at end of compound statement
Use explicit ';' to fix it.
Make lvm2_disable_dmeventd_monitoring() more explicit.
As memlock_inc_daemon() is also used by clvmd, which
does changes dmeventd and suspend ignore state at
some stages - make updates of these 2 variable
tied to the call of lvm2_disable_dmeventd_monitoring().
Once this call is made dmeventd monitoring
and suspended devices are ignored.
TODO: all lvm-global settings should really be moved
to command context.
Implementing exit when 'dmeventd' is idle.
Default idle timeout set to 1 hour - after this time period
dmeventd will cleanly exit.
On systems with 'systemd' - service is automatically started with
next contact on dmeventd communication socket/fifo.
On other systems - new dmeventd starts again when lvm2 command detects
its missing and monitoring is needed.
Add support to unmonitor device when monitor recognizes there is
nothing to monitor anymore.
TODO: possibly API change with return value could be also used.
Redesign threading code:
- plugin registration runs within its new created thread for
improved parallel usage.
- wait task is created just once and used during whole plugin lifetime.
- event thread is based over 'events' filter being set - when
filter is 0, such thread is 'unused'.
- event loop is simplified.
- timeout thread is never signaling 'processing' thread.
- pending of events filter cnange is properly reported and
running event thread is signalled when possible.
- helgrind is not reporting problems.
Need here to keep control device opened while there is 'any' dso
plugin loaded - otherwise there would a race closing controlfd
inside lvm2 plugin while some other monitoring thread would
tried to execute another WAITEVENT task.
Move all DSO related function in front, so they could be easily
referenced from rest of code.
Add proper error paths with logging and error reporting.
Drop mutex locking when releasing DSO - since DSO is always
allocated and released in main 'event' processing thread.
CONVERTING status flag is a tricky one. It's not set when converting
a non-mirror LV type to the mirror type, i.e.: linear -> two leg mirror.
Also the conversion itself is instant and doesn't require to be polled.
When mirror reaches sync state there's no final update on VG metadata
for lvmpolld to be made thereby report_progress in fact doesn't report
percentage of mirror being converted but percentage of mirror
being in sync. Perhaps we should reword the lvconvert output here.
On the other hand CONVERTING is set while we upconvert the mirror
from i.e. two leg mirror to four leg mirror. In such case the operation
is required to be polled so that lvmpolld can cleanup temporary
conversion log when the conversion is over.
Ignore CONVERTING lv_type for the moment and match LVs only by uuids
during 'mirror conversion'/'waiting for a sync to finish'.
The old code made two loops through the PVs: in the first
loop it found the max PV and VG name lengths, and in the
second loop it printed each PV using the name lengths as
field widths for aligning columns.
The new code uses process_each_pv() which makes one loop
through the PVs. In the *first* call to pvscan_single(),
the max name lengths are found by looping through the
lvmcache entries which have been populated by the generic
process_each code prior to calling any _single functions.
Subsequent calls to pvscan_single() reuse the max lengths
that were found by the first call.
The new report/compact_output_cols setting has exactly the same effect
as report/compact_output setting. The difference is that with the new
setting it's possible to define which cols should be compacted exactly
in contrast to all cols in case of report/compact_output.
In case both compact_output and compact_output_cols is enabled/set,
the compact_output prevails.
For example:
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols=""
$ lvs vg
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=1
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize
lvol0 vg -wi-a----- 4.00m
dm_report_compact_given_fields is the same as dm_report_compact_fields,
but it processes only given fields, not all the fields in the report
like dm_report_compact_field does.
If a host failed while holding a sanlock lease,
sanlock_acquire will by default block and wait
for the lease to expire before returning. We
want it to return with an error so we can retry
instead of blocking, which allows us to process
other lock operations.
(Enclose this in an ifdef until the new flag
appears in a sanlock release.)
Respect lvm2_log_fn prototype. The idea of 'reusing' print_log with
plain cast is causing very strange crashes with some older 'gcc' compilers.
So just do it cleanly...
This reverts commit 1b1c01a27b.
This caused messages to get dropped instead of logged into the log file.
(The log file and log function are independent at the moment.)
Rework thread creation code to better use resources.
New code will not leak 'timeout' registered thread on error path.
Also if the thread already exist - avoid creation of thread
object and it's later destruction.
If the race is noticed during adding new monitoring thread,
such thread is put on cleanup list and -EEXIST is reported.
As we now use 'unified' logging macro system - we no longer need
to protect from change of logging function pointer - it's set
once at the start of dmeventd and not change anymore
(as lvm2 library no longer interferers here).
Some signatures are spread around the disk in several copies, mainly for
backup. Make libblkid to detect these extra copies - there was missing
"blkid_probe_step_back" fn call after successful wipe of previous signature
copy.
An example with FAT table which has copies:
$ mkfs.vfat /dev/sda1
Before this patch:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
With this patch applied:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 0. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 510. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
Make sure log/prefix is set to "" when getting the list of VG names.
We need this for the format to be correct so it's properly searched
through later on.
$ vgcreate vgA /dev/sda
Volume group "vgA" successfully created
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ dd if=/dev/sda of=/dev/sdc bs=1M
(the new VG name is prefix of existing VG name)
$ vgimportclone -n vg /dev/sdb
(the new VG name is suffix of existing VG name)
$ vgimportclone -n gA /dev/sdc
Before this patch:
------------------
(we end up with "vg1" and "gA1" names with the "1" suffix which is not needed)
$ vgs -o vg_name
VG
gA1
vg1
vgA
With this patch applied:
------------------------
(we end up with "vg" and "gA" names as they're unique already and no extra suffix is added)
$ # vgs -o vg_name
VG
gA
vg
vgA
Of course, if the name supplied is not unique, the number is added correctly:
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ vgimportclone -n vgA /dev/sdb
$ vgs -o vg_name
VG
vgA
vgA1
If lvmlockd is running, lvmetad is configured (use_lvmetad=1),
but lvmetad is not running, then commands will seg fault
when trying to send a message to lvmetad.
The difference is lvmetad being "active", not just "used".
We can replace the expressions with awk/grep/cut/tr with --select now and
more suitable reporting options and modes. Also, we don't need to check
the temporary lvm.conf generated within vgimportclone script since we're
generating it ourselves now using lvmconfig, not using sed anymore like
it was before (so we can be pretty sure it's correct - we use lvmconfig
now even for generating the lvm.conf itself).
We already have pv_count to report number of PVs that a VG has based
on metadata.
This patch exposes the information about how many of these PVs are
missing which is also useful information for a VG. Wwe could count
the sum of pv_missing reporting fields for each PV in the VG before,
but the new field is practical when reporting VG as a whole and there's
no need to process each PV from VG alone.
If 'vgcreate --shared' finds both sanlock and dlm are running,
print a more accurate error message:
"Found multiple lock managers, select one with --lock-type."
When neither is running, we still print:
"Failed to detect a running lock manager to select lock type."
Using --lock-type sanlock|dlm implies --shared.
Using --shared selects lock type sanlock|dlm
(by choosing the one that's running.)
Using both --shared and --lock-type sanlock|dlm should
also be allowed (--shared is just redundant information.)
Also, leave out the note about "circular buffer" which is
an internal imeplementation detail anyway and not quite
informational for users:
Before this patch:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large for circular buffer
Failed to write VG vg1.
With this patch applied:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large: size of metadata to write is 691 bytes while PV metadata area size on /dev/sda is 512 bytes.
Failed to write VG vg1.
Before this patch:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear public
With this patch applied:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear private,lockd,sanlock
Avoid running tests, when prefix already exist in the system.
As prefix just uses PID number, we may hit a case for long
running tests, where devices from some previous runs were not
properly cleared away - detect this and fail early.
(Such machine should be inspected and fixed).
Add metadata_devices and seg_metadata_le_ranges report fields.
Currently only defined for raid, but should probably be extended
to all other segment types that don't report all their device
usage in the 'devices' field.
Correct some things, e.g. set mode and policy on
the cache lv, not the pool, lvm.conf field for
mode changed.
Add smq which was missing.
Make the sections on cache mode and cache policy
consistent in structure and style.
When lvmetad_pvscan_vg() reads VG metadata from each PV,
it compares it to the last one to verify it matches.
If the VG metadata does not match on the PVs, an error
is printed and it fails to read the VG. In this error
case, use log_debug to show the differences between
the two unmatching copies of the metadata.
One host changes a VG, making the cached VG on another
host invalid. The other host then rereads the VG from
disk to get the latest copy. If the first host removed
a PV from the VG, the second host attempts to reread the
VG from old PV when rescanning. Reading the VG from the
removed PV fails, causing vg_read to return "VG not found".
The fix is to simply not fail when a VG is not found while
rereading a PV and continue without it.
(This doesn't happen if the second host happens to first
run a command like 'vgs' that triggers a global revalidation
of metadata.)
vgchange --lock-type iterates through LVs to ensure
no LVs are active before changing the lock type of
the VG, but the loop was not checking that an LV
actually has a lock before trying it, so it would
fail if the VG had any LVs that don't use locks,
e.g it would fail on a tmeta LV from a pool.
We want most of our units to be started before any local/remote mount
points are mounted - we used {local,remote}-fs.target for this purpose
before, but it was not 100% correct as there's even {local,remote}-fs-pre.target
special systemd unit reserved for this exact purpose.
See also man 7 systemd.special and "local-fs-pre.target"/"remote-fs-pre.target"
description.
When user specifies '--force' with remove/remove_all/wipe_table
use '--noflush --nolockfs' resume flags, so the operation
will not block when device underneath is blocked.
Also make error messages more consistent:
Before this patch:
(/run/lock exists and is not a directory)
$ pvs
/run/lock/lvm: mkdir failed: Not a directory
File-based locking initialisation failed.
(/run/lock/lvm exists and is not a directory)
$ pvs
Directory "/run/lock/lvm" not found
File-based locking initialisation failed.
With this patch applied:
(/run/lock exists and is not a directory)
$ pvs
Existing path /run/lock is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed
(/run/lock/lvm exists and is not a directory)
$ pvs
Existing path /run/lock/lvm is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed.
When using udev, the /dev/mapper entries are symlinks - fix the code
to count with this.
This patch also fixes the dmsetup mknodes and vgmknodes to properly
repair /dev/mapper content if it sees dangling symlink in /dev/mapper.
$ lvs -o name,tags vg
LV LV Tags
lvol0
lvol1 mytag
Before this patch:
$ lvs -o name,tags vg -S 'tags=""'
Failed to parse string list value for selection field lv_tags.
Selection syntax error at 'tags=""'.
Use 'help' for selection to get more help.
(and the same for -S 'tags={}' and -S 'tags=[]')
With this patch applied:
$ lvs -o name,tags vg -S 'tags=""'
LV LV Tags
lvol0
(and the same for -S 'tags={}' and -S 'tags=[]')
If lvmlockd acquires an lv lock for a command, but the
command exits before the reply, then the command has
not activated the lv and lvmlockd should unlock it.
This only applies when the lv was not already locked.
(There will always be a chance that the lv lock is held
while the lv is not active, i.e. if the command fails in
the small window between getting the lv lock and before
doing the activation. In that case, rerunning the
activation command corrects the inconsistency.)
This commit helps by automatically clearing the
inconsistency (lv locked by not activated) in the most
common case when the lv lock operation is slow to
complete and the command is canceled by the user.
This commit also adds and cleans up references to the
client id in a bunch of log messages, which is useful
to follow processing on each independent lock request.
When using lvm shell, some structures which are cached in memory may be
reused. This happens for the struct label (a part of lvmcache_info
structure) when lvmetad is used in which case the PV scan is not
done that would normally overwrite these label structures in memory
and making them up-to-date.
This is all consequence of the fact that struct lvmcache_info and
struct label are not always assigned in the same part of the code.
For example, if lvmetad *is not* used, parts of the struct label are
reassigned in label_read fn while struct lvmcache_info is created
elsewhere. No part of the code reused struct label (and its "dev"
field) before calling label_read fn. That's why the real bug is
hidden when using lvm shell without lvmetad.
However, with lvmetad and lvm shell, the situation is a bit different.
The label_read fn is not called if lvmetad *is* used, hence the
struct label may have ended up not initialized properly.
There was missing assignment for the dev field in struct label
in _text_pv_write fn which caused this problem to appear in
lvm shell with lvmetad, for example:
Before this patch:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
unknown device lvm2 --- 128.00m 128.00m
With this patch applied:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Also, this problem had not appeared before changes introduced
by commits e1a63905d1 through
3a6f91d713 which, among other
things, added proper label field type reporting. Before, label
reporting was the same as using struct physical_volume which
has its own dev field assigned and so this problem was not exposed.
When a command does a sequence of
vg_write + vg_commit + vg_write + vg_commit,
initialization of non-PV devices happens during the
first vg_write, and does not need to be repeated by
the second vg_write.
When creating a lockd VG, this sequence occurs because
the VG is first created, then the lockd data is created,
then the lockd data is then written to the VG metadata.
Avoid validation of free space in pool, when no messages are passed.
Patch a3c7e326c3 add new check for
pool overload - but this check should not be made if there are
no messages and transaction_id is still within 'bounds' (bigger by 1).
Commit 00b36ef06a had a typo
and missed '{' for shell variable, thus command used slightly
different 'tmp' dir name for cache dir (with extra '}').
Such change was unnoticed until a recent fix in persistent
filter, lvm2 missed to update cache file when --config
was specified.
The result was, /tmp dir was accumulating snap.XXXXX} dirs when
running vgimportclose script.
Since we may want to swap names when LVs are complex types, we cannot
avoid doing full renames on both LV stacks.
Temporarily use 'pvmove_tmeta' as unused name to prevent validation troubles.
Commit 9403edbb93 move location of
configure.h and lvm-version.h.
Let's try even better place then /conf dir which should be left
for user configurable files.
Put these files right into include dir.
This applies the same rule/logic to dlm VGs that has always
existed for sanlock VGs. Allowing a dlm VG to be removed
while its lockspace was still running on other hosts largely
worked, but there were difficult problems if another VG with
the same name was recreated. Forcing the VG lockspace to
be stopped, gives both sanlock and dlm VGs the same behavior.
This shortcut was added for an odd case that I do not
believe is relevant any more. Having an alternate
path for lockspace thread cleanup is a complication
that could lead to problems.
The code was expecting the wrong return value from
compare_config, which returns 0 when equal.
This is a problem for a lockd VG using multiple PVs
when the VG needs to be rescanned.
Somehow raid tests landed in plain cache - separte them out
so they properly check for have_raid.
Check we do not support strip option with cache-pool creation.
ATM allocation can't handle stripping and cache pool allocation.
It's not yet even clear what should be actually result.
Until resolved, disable this option (it's been coredumping
inside allocation anyway).
Certain stacks of cached LVs may have unexpected consequences.
So add a warning function called when LV is cached to detect
such caces and WARN user about them - the best we could do ATM.
When we insert layer we also move status flag-bits for certain LV types,
so internal volume_group structure remains consistent.
(Perhaps it's misuse of 'insert_layer' function and we should have
another similar function for this.)
Basically we aim to maintain the same state as after reading fresh
metadata out of volume group.
Currently we when i.e. cache 'raid' LV - this should transfer 'raidLV' flag
to _corigin LV and cache is no longer a raid.
TODO: bits for stacked devices needs more exact rules.
The dlm will often lose the lvb content, so we need to
check quite a few possibilities for lvb values that
were not being checked before.
Refactoring was required to pass the entire lvb value
back to the core code instead of the single value.
The only functional change should be detecting new
lvb states where metadata is now invalidated where
it wasn't before.
When an action is created by lvmlockd for itself,
there is no client to send the result to. Add
the NO_CLIENT flag to the action to skip sending
the result to a client.
Add a new arg to lockd_start_vg() that indicates
it is being called for a new lockd VG, so that
lvmlockd knows the lockspace being started is new.
(Will be used by a following commit.)
Commit f6473baffc introduced a new
cmd->initialized variable to keep info about which parts of the
cmd_context have been initialized.
A part of this patch was also a change in refresh_filters fn
which checks for cmd->initialized.filters variable and it does
the filter refresh *only* if the filter has already been initialized
before otherwise it's a NOOP (before, the refresh_filters also
initialized filters as a side effect in case it had not been
initialized before which was not quite correct).
However, the commit f6473baffc
did not handle the case in which configuration changes
either via --config argument or when configuration file changed
and its timestamp was higher than the timestamp of the persistent
cache file - the /etc/lvm/cache/.cache.
This patch fixes this issue and it causes the init_filters fn
in lvm_run_command fn to be called with proper value of
"load_persistent_cache" switch even if the configuration changes,
hence causing the persistent cache file to be ignored in this
case.
Ensure make clean cleans any left-over file from their previous
location so they are not in conflict with new ones.
Also hide error message when .commands file is not present.
The regex filter (controlled by devices/filter lvm.conf setting) was
evaluated as the very last filter. However, this is not optimal when
it comes to restricting disk access - users define devices/filter
as well as devices/global_filter to avoid this.
The devices/global_filter is already positioned at the beginning of the
filter chain. We need to do the same for devices/filter.
Filter chains before this patch:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
type_filter -> usable->filter -> mpath_component_filter ->
partition_filter -> md_component_filter -> fw_raid_filter ->
regex_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> usable_filter -> regex_filter
From the chain list above we can see that particularly in case when
lvmetad is not used, the regex filter is the very last one that is
processed. If lvmetad is used, it doesn't matter much as there's
the global_regex_filter which is used instead when updating lvmetad
and when retrieving info from lvmetad, putting regex_filter in front
of usable_filter wouldn't change much since usabled_filter is not
reading disks directly.
This patch puts the regex filter to the front even in case lvmetad
is not used, hence reinstating the state as it was before commit
a7be3b12df (which moved the regex_filter
position in the chain). Still, the arguments for the commit
a7be3b12df still apply and they're
still satisfied since component filters (MD, mpath...) are evaluated
first just before updating lvmetad.
So with this patch, we end up with:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
regex_filter -> type_filter -> usable->filter ->
mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> regex_filter -> usable_filter
This way, specifying the regex_filter in non-lvmetad case causes
the devices to be filtered based on regex first before processing
any other filters which can access disks (like md_component_filter).
This patch also streamlines the code for better readability.
Hopefull 4.3 will be fixed and test will be updated to let
raid test running again.
Meanwhile using md-raid may effectively kill kernel,
so leave at least other tests running.
Split up _build_histogram_arg() into separate functions to allocate
and fill the histogram arg string and remove nested local variable
declarations from the parent function.
Coverity flags a user-after-free in _stats_histograms_destroy():
>>> Calling "dm_pool_free" frees pointer "mem->chunk" which has
>>> already been freed.
This should not be possible since the histograms are destroyed in
reverse order of allocation:
203 for (n = _nr_areas_region(region) - 1; n; n--)
204 if (region->counters[n].histogram)
205 dm_pool_free(mem, region->counters[n].histogram);
It appears that Coverity is unaware that pool->chunk is updated
during the call to dm_pool_free() and valgrind flags no errors in
this function when called with multiple allocated histograms.
Since there is no actual need to free the histograms individually
in this way simplify the code and just free the first allocated
object (which will also free all later allocated histograms in a
single call).
Put include/.symlinks_created as a prerequisite for dep calc.
Otherwise if these are not generated and user enters tests subdir and
runs 'make' he just gets endless loop of dep calculation.
Relocate generated configure.h and lvm-version.h outside
of compilable .c source tree.
The reason is behind - when compiling in builddir != srcdir
the generated file in lib/misc/configure.h was used for all compiled
source file except ones located in lib/misc dir - those would have used
configure.h file located in this dir - if there have existed one (i.e.
from some other build)
This problem was only visible, when srcdir == buildir was used before
trying to use srcdri != builddir (as configure.h appeared then in
srcdir).
The histogram changes adds a new error path to dm_stats_create().
Make sure that the dm_stats handle is properly destroyed if we fail
to create the histogram pool and check for failures setting the
program_id.
Since we are growing an object in the histogram pool the return
value of dm_pool_grow_object() must be checked and error paths need
to abandon the object before returning.
Undo the part of the recent EREMOVED change which
automatically stopped the lockspace for a remotely
removed VG. It didn't always work (would not work
when lvb content was rebuilt in the dlm). This will
be handled better when the lvb content is controlled
more strictly.
As part of fix that came with cf700151eb,
I forgot to add the check whether the result of stat was successful or
not. This bug caused uninitialized buffer to be used for entries
from .cache file which are no longer valid.
This bug may have caused these uninitialized values to be used further,
for example (see the unreal (2567,590944) representing major:minor
pair):
$ pvs
/dev/abc: stat failed: No such file or directory
Path /dev/abc no longer valid for device(2567,590944)
PV VG Fmt Attr PSize PFree
/dev/mapper/test lvm2 --- 104.00m 104.00m
/dev/vda2 rhel lvm2 a-- 9.51g 0
Older versions of gcc aren't able to track the assignments of
local variables as well as the latest versions leading to spurious
warnings like:
libdm-stats.c:2183: warning: "len" may be used uninitialized in this
function
libdm-stats.c:2177: warning: "minwidth" may be used uninitialized in
this function
Both of these variables are in fact assigned in all possible paths
through the function and later compilers do not produce these
warnings.
There's no reason to not initialize these variables though and
it makes the function slightly easier to follow.
Also fix one use of 'unsigned' for a nr_bins value.
Replace the histogram stats subcommand with a --histogram switch
to enable histogram related fields for both list and report output.
To avoid overloading the existing --histogram rename it to --bounds:
this is also a better description of the option.
Remove the optimization/shortcut for starting the dlm global
lockspace when it was already running.
Reenable automatically starting the dlm global lockspace
when a command attempts to use it and it's not yet started.
This had become disabled at some point.
Since we may easily get blocked when checking for percentage
of thin-pool - do not flush and just show current values.
This avoids holding VG locked when pool is overfilled.
Try to detect thin-pool which my block lvm2 command from furher
processing (i.e. lvextend).
Check if pool is read-only or out-of-space and in this case thins
will skipped from being scanned (so user may miss some PVs located
on thin volumes).
Fix regression introduced with commit:
2fc126b00d
This commit has moved pv_min_size() test in front
of device_is_usable(). However pv_min_size needs to open device,
so it may have actually get blocked.
So restore the original order and first validate
dm device to be usable for open.
It's worth to note that such check is not 'race-free',
but it usually eliminates 99.99% of problems ;).
Print [source:handler] in filters' debug messages only if external
device info source other than "none" is used.
$ lvmconfig --type full devices/external_device_info_source
external_device_info_source="none
Before this patch (from the -vvvv log):
filters/filter-usable.c:47 /dev/mapper/test: Skipping: Too small to hold a PV [none:(nil)]
filters/filter-md.c:33 /dev/sdb: Skipping md component device [none:(nil)]
filters/filter-partitioned.c:25 /dev/vda: Skipping: Partition table signature found [none:(nil)]
With this patch applied:
filters/filter-usable.c:44 /dev/mapper/test: Skipping: Too small to hold a PV
filters/filter-md.c:35 /dev/sdb: Skipping md component device
filters/filter-partitioned.c:27 /dev/vda: Skipping: Partition table signature found
librt doesn't have a pkgconfig file so use Libs.private: -lrt instead
to declare the dependency directly.
The same applies for -lm which is also used and which hasn't been
defined in the devmapper.pc file yet.
Make sure that correct 'dmstats create' messages are shown for all
examples and fix LV examples to use correct dmsetup output name
format (vg/lv -> vg-lv).
Improve the names and labels of stats reports columns, ensure that
the minimum field widths allow unambiguos labels to be shown and
update the man page descriptions of these fields.
Add support to dmstats to create and report histograms.
Add a --histogram switch to 'create' that accepts a string
description of bin boundaries and DR_STATS and DR_STATS_META fields
to report bin configuration and absolute and relative histogram
values:
hist_bins
hist_bounds
hist_ranges
hist_count
hist_count_bounds
hist_count_ranges
hist_percent
hist_percent_bounds
hist_percent_ranges
A new 'histogram' subcommand displays a report that emphasizes
histogram data as either counters or percentage values.
Add support for creating, parsing, and reporting dm-stats latency
histograms on kernels that support precise_timestamps.
Histograms are specified as a series of time values that give the
boundaries of the bins into which I/O counts accumulate (with
implicit lower and upper bounds on the first and last bins).
A new type, struct dm_histogram, is introduced to represent
histogram values and bin boundaries.
The boundary values may be given as either a string of values (with
optional unit suffixes) or as a zero terminated array of uint64_t
values expressing boundary times in nanoseconds.
A new bounds argument is added to dm_stats_create_region() which
accepts a pointer to a struct dm_histogram initialised with bounds
values.
Histogram data associated with a region is parsed during a call to
dm_stats_populate() and used to build a table of histogram values
that are pointed to from the containing area's counter set. The
histogram for a specified area may then be obtained and interogated
for values and properties.
This relies on kernel support to provide the boundary values in
a @stats_list response: this will be present in 4.3 and 4.2-stable. A
check for a minimum driver version of 4.33.0 is implemented to ensure
that this is present (4.32.0 has the necessary precise_timestamps and
histogram features but is unable to report these via @stats_list).
Access methods are provided to retrieve histogram values and bounds
as well as simple string representations of the counts and bin
boundaries. Methods are also available to return the total count
for a histogram and the relative value (as a dm_percent_t) of a
specified bin.
For repeating reports field widths should be re-calculated for
each report interval. Not doing so will cause a single row with
wide field data to cause all subsequent rows to share the width:
Name RgID ArID R/s W/s Histogram Bounds
vg_hex-lv_home 0 0 4522.00 834.00 0s: 991, 2ms: 152, 4ms: 161, 6ms: 4052 0s, 2ms, 4ms, 6ms
vg_hex-lv_swap 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_root 0 0 1754.00 683.00 0s: 369, 2ms: 65, 4ms: 90, 6ms: 1913 0s, 2ms, 4ms, 6ms
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e 0 0 4522.00 868.00 0s: 985, 2ms: 152, 4ms: 161, 6ms: 4092 0s, 2ms, 4ms, 6ms
vg_hex-lv_images 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
Name RgID ArID R/s W/s Histogram Bounds
vg_hex-lv_home 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_swap 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_root 0 0 0.00 2.00 0s: 1, 2ms: 0, 4ms: 0, 6ms: 1 0s, 2ms, 4ms, 6ms
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_images 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
^^^^^^^^^^^^^^^^^
This is especially significant for the current histogram fields:
depending on the time since the last clear operation the first
report iteration may contain very large values leading to a very
large minimum field width. Without resetting field widths this
large minimum field width value is used for all subsequent rows.
Previously all stderr messages issued by spawned lvpoll command were reported
as INFO only. This made all such messages invisible in syslog or lvmpolld log
while running default configuration.
All lvpoll stderr messages are loged with WARN priority now and lvpoll
command exiting with retcode != 0 is logged with ERROR priority in
syslog and lvmpolld log
Include both the VG uuid and name in the lvmetad
set_vg_info message. This works around an obscure
problem where the VG uuid in lvmlockd is wrong
when one host removes a dlm VG, then creates a new
VG with the same name. If the dlm lockspace for
the initial VG was never stopped on another host,
that other host will be using the old uuid in its
lvmetad set_vg_info message. (That can be
corrected with a larger change, but this is an
effective workaround.)
set_vg_info previously accepted only vg uuid,
now accept both vg uuid and vg name. If the
uuid is provided, it's used just as before,
but if the uuid is not provided, or if it's
not found, then fall back to using the vg
name if that is provided.
lvmlockd would fail to recognize that the global lockspace
failed to start if the dlm wasn't running, so future attempts
to start the dlm global lockspace would do nothing, thinking
it was already running.
This was only used to return two flags indicating specific
reasons for a lock failure so that a more specific error
message could be printed by the command (lockspace had been
stopped, or lockspace had an error starting.)
Remove the list, given its limited usefulness, the fact it
would easily become inaccurate, and the fact it was causing
misleading error messages. The error conditions it was meant
to help could be reported differently.
Previously, a command would only rescan a lockd VG
when lvmetad returned the "vg_invalid" flag indicating
that the cached copy was invalid (which is done by
lvmlockd.) This is still the only usual reason for
rescanning a lockd VG, but two new special cases are
added where we also do the rescan:
. When the --shared option is used to display lockd VGs
from hosts not using lvmlockd. This is the same case
as using --foreign to display foreign VGs, but --shared
was missing the corresponding bits to rescan the VGs.
. When a lockd VG is allowed to be read for displaying
after failing to acquire the lock from lvmlockd. In
this case, the usual mechanism for validating the
cache is missed, so assume the cache would have been
invalidated. (This had been a previous todo item
that was lost during other cleanup.)
These were long-standing todos that were lost track of.
This makes lvmlockd removal steps for dlm VGs closely match
sanlock VGs. Because dlm lockspaces are not required to be
stopped on all hosts before vgremove, there is an extra bit
for dlm lockspaces, where a flag is set in the VG lock lvb
indicating that the VG was removed. If other hosts happen
to use the VG lock they will see this flag and stop their
lockspace.
Remove the existing lock type using the same functions
used to remove the lockd components during vgremove.
This results in a "clean" VG and lvmlockd state after
the vgchange, i.e. no bits left over from previous
lock type.
Originally when vgdisplay encountered an exported VG it issued a
WARNING. Commit d6b1de30 replaced this with an error message
but still exited with success (incorrect). A backtrace was recently
added in commit b193809987.
As vgdisplay already states that the VG is exported in its output,
just drop these messages completely.
dm_stats_create_region is now assigned to DM_1_02_106 by default:
the DM_1_02_104 .exported_symbols file entry was moved into
libdm-stats.c as:
DM_EXPORT_SYMBOL(dm_stats_create_region, 1_02_104)
so delete it from .exported_symbols.DM_1_02_104.
Since commit 797c18d543 some internal symbols
have been exported in shared libraries by mistake because 'local: *' got
lost. Fix the shell script not to compare the whole filename with
'Base'
All cache args could be specified when caching LV
(means converting LV to cached).
When --cachemode arg is given during cache-pool conversion,
store it in the metadata.
https://bugzilla.redhat.com/show_bug.cgi?id=1255184
Since cache-pool actualy keeps info about caching,
display this info for cache-pool LV as well
(matches info for cache LV when cache-pool is asociated with it).
Commit 82a27a8 introduced a change to the symbol versioning macros
that allows a new version of a function to be introduced while
keeping the old behaviour via a versioned symbol export. The new
symbol is listed in the current .exported_symbols.DM_* file and a
default (@@VERSION) binding is created during linking.
This broke the build on RHEL5, RHEL6 and Debian Lenny. This is
because the make version in these distros returns results from the
$(wildcard *) command in a different order to the RHEL7 and F22
versions: this affects the ordering of the generated .export.sym
version script:
RHEL7/F22
for i in ./.exported_symbols.Base ./.exported_symbols.DM_1_02_99
./.exported_symbols.DM_1_02_98 ./.exported_symbols.DM_1_02_97
./.exported_symbols.DM_1_02_106 ./.exported_symbols.DM_1_02_105
./.exported_symbols.DM_1_02_103 ./.exported_symbols.DM_1_02_101
./.exported_symbols.DM_1_02_104 ./.exported_symbols.DM_1_02_100
290: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
*388: 000000000003cfc7 314 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@@DM_1_02_106
391: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
*552: 000000000003cfc7 314 FUNC GLOBAL DEFAULT 12 dm_stats_create_region
944: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
992: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
RHEL6:
for i in ./.exported_symbols.Base ./.exported_symbols.DM_1_02_100
./.exported_symbols.DM_1_02_101 ./.exported_symbols.DM_1_02_103
./.exported_symbols.DM_1_02_104 ./.exported_symbols.DM_1_02_105
./.exported_symbols.DM_1_02_106 ./.exported_symbols.DM_1_02_97
./.exported_symbols.DM_1_02_98 ./.exported_symbols.DM_1_02_99; do\
290: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
390: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
*479: 000000000003cfa7 314 FUNC LOCAL DEFAULT 12 dm_stats_create_region
944: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
992: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
The F22 build has the correct behaviour (although the sort order is
inconsistent) but on RHEL6 the 1_02_106 symbol file appears after
version 1_02_104 which introduced the original symbol. This causes
the later version of the symbol to lose its version binding and be
reduced to local scope.
If using un-versioned exports of the current version of a symbol
(i.e. exported with the plain symbol name and no macro) and using
the linker script to set the symbol version, the current version
node must appear first in the version script: the un-versioned
symbol will be bound to the first version node found that contains
it.
On RHEL6 and the other older distros the original version of the
dm_stats_create_region() call sorted before the current version
(DM_1_02_104 vs. DM_1_02_106) leading to a subsequent link error for
the later symbol version:
dmsetup.o: In function `_do_stats_create_regions':
/root/src/git/lvm2/tools/dmsetup.c:4658: undefined reference to
`dm_stats_create_region'
Ensure that the ordering of entries in the version script is
consistent to avoid an old implementation shadowing a newer one by
sorting the list of file names before the loop:
$$(echo $(EXPORTED_SYMBOLS) | tr ' ' '\n' | sort -rnt_ -k5 )
This only sorts by patch level but this is sufficient to maintain
the correct order for current version files.
Tested on RHEL5, 6, 7 and F22.
Add support for the kernel precise_timestamps feature. This allows
regions to be created using counters with nanosecond precision.
A new dm_stats method, dm_stats_set_precise_timestamps() causes all
future regions created with this handle to attempt to enable precise
counters.
Fix the version export macros to make it possible to export two
different DM_* versions of a symbol: currently it is only possible for a
DM_* symbol to override a symbol in Base. Attempting to export two
symbols at different DM_* version levels (e.g. DM_1_02_104 and
DM_1_02_106) leads to a linker error due to a duplicate symbol
definition.
This is because the DM_EXPORTED_SYMBOL macro makes each exported symbol
the default (@@VERSION):
__asm__(".symver " #func "_v" #ver ", " #func "@@DM_" #ver )
Fix the macro to use a single '@' for a symbols exported in multiple
versions and rename the macros to DM_EXPORT_*:
DM_EXPORT_SYMBOL(func,ver)
DM_EXPORT_SYMBOL_BASE(func,ver)
For functions that have multiple implementations these macros control
symbol export and versioning.
Function definitions that exist in only one version never need to use
these macros.
Backwards compatible implementations must include a version tag of
the form "_v1_02_104" as a suffix to the function name and use the
macro DM_EXPORT_SYMBOL to export the function and bind it to the
specified version string.
Since versioning is only available when compiling with GCC the entire
compatibility version should be enclosed in '#if defined(__GNUC__)',
for example:
int dm_foo(int bar)
{
return bar;
}
#if defined(__GNUC__)
// Backward compatible dm_foo() version 1.02.104
int dm_foo_v1_02_104(void);
int dm_foo_v1_02_104(void)
{
return 0;
}
DM_EXPORT_SYMBOL(dm_foo,1_02_104)
#endif
A prototype for the compatibility version is required as these
functions must not be declared static.
The DM_EXPORT_SYMBOL_BASE macro is only used to export the base
versions of library symbols prior to the introduction of symbol
versioning: it must never be used for new symbols.
Single messages sent over unix sockets are limited in
size to /proc/sys/net/core/wmem_max, so send the 1MB
debug buffer in smaller chunks to avoid EMSGSIZE.
Also look for EAGAIN and retry sending for a limited
time when the reader is slower than the writer.
Also shift the location of that code so it's the same
as other requests.
With clusters larger than 3 nodes, the 32-byte debug buffer in
cpg_join_callback() is too small to contain all the node IDs, because
32-bit identifiers are generally rendered in 10 decimal digits. No fixed
size is good in all cases, but this is conditionally logged debug info,
so we can simply truncate it. Double the size, nevertheless.
Add a function to test whether the kernel precise_timestamps
feature is available in the current device-mapper driver version.
Presence of precise_timestamps also implies the availability of
latency histograms.
This mainly makes the description text use 80 columns.
There are a few minor adjustments to wording to help
the text layout, and a couple minor improvements to
descriptions.
The unlock call will fail in expected and normal cases,
and should not cause the command to fail. (An actual
unlock in the lock manager should never fail.)
We already use -lm functions in a couple of places (these are
satisfied by gcc built-ins for most builds): add a configure.in
check and explicitly link to -lm.
This reverts commit 70db1d523d.
Since we use 'strncpy' even for case where it exactly matches
the buffer size and \0 is not expected to be added there.
Before printing a commented automatic config value,
print a line describing what it is. Otherwise, the
commented value can look like it's a part of an
example preceding it.
The timerfd guarantees that it will return 8 bytes when a read(2)
is issued (a uint64_t giving the number of timer events during the
call). Check that it does so and log a non-fatal error if the byte
count is not 8.
Several interfaced in libdm-stats return a uint64_t when it is
only used to signal success/failure: change all these uses to
return a simple int instead.
Move code which runtime detects settings for cache_policy
out of config dir to cache seg handling code.
Also mark cache_mode as command profilable setting.
Revert back to already existing behavior which has been slightly
modified by a900d150e4.
At the end however it seem to be equal to change TID right with first
metadata write.
Existing code missed handling for 'unused' thin-pool which would
require to also check empty message list for TID==0.
So with the fix we now again preserve 'active' thin-pool volume
when first thin volume is created - this property was lost and caused
problems in cluster, where the lock was hold, but volume was no longer
active on the node.
Another missing part was the proper support for already increased,
but unfinished TID change.
So going back here with existing logic -
TID is increased with first MDA update.
Code allows start with either same TID or (TID-1).
If there are messages, TID must be lower by 1 for sending,
otherwise messages were already posted.
As cache_policy is evaluated in runtime, we no longer should use
CFG_COMMENTED, but have to switch to CFG_UNDEFINED.
So as long as the value is undefined, it's runtime evaluated.
Once it's set - it's always respected (no runtime fallback).
Also fix version of introduced settings to 2.2.128.
Commit f10ad95 introduced a regression causing the size of regions
passed in on the command line to be truncated to zero. Initialise
the 'this_len' variable to the supplied length to correct this.
Commit f10ad95 introduced a regression in the calculation of the
number of areas in a region created with the --areasize switch:
vg_hex-lv_home: Created new region with 0 area(s) as region ID 1
vg_hex-lv_swap: Created new region with 0 area(s) as region ID 1
Fis this by using the correct region size when calculating the
value.
When dmstats is run with -v or higher enable a per-area reporting
mode for statistics regions. This will output one row per area
(rather than one row per region) and adds additional fields of use
when viewing areas:
area_id - index within the region assigned by libdm-stats
area_start - the start location of the area in the containing
device.
Add a method to retrieve the offset of an area within the
containing region (rather than the offset within the containing
device returned by dm_stats_get_area_start()).
Although users of the library can calculate this themselves it is
better to provide this through a method call to avoid users making
assumptions about the structure of regions and areas.
The dm_stats_get_area_start (and its '_current_' variant) methods
are expected to return the start sector of the area in the
containing device.
Make sure the call adds region->start to the returned value.
Add a '--raw' switch to stats reports that causes us to report the
basic counter values rather than derived metrics for each visible
statistics region.
Add prefixes to all dmsetup report types to allow the 'group_all'
option to be effective:
DR_NAME name_
DR_INFO info_
DR_DEPS deps_
DR_TREE tree_
DR_NAME splitname_
When run with full verbosity dmsetup or dmstats reports will
output a figure that tracks a moving average over a window of the
last two intervals:
Interval #3 time delta: 999991087ns
Interval #3 mean duration: 999907064ns, current err: -8913ns
End interval #3 duration: 999991087ns
Adjusted sample interval duration: 999991087ns
Due to the narrow window this is a very crude estimate and is only
of use to someone debugging or modifying the stats clock: remove
the value and the global variables used to track it.
Anyone with a particular use for this information can construct a
better mean by calculating the value of a greater number of
intervals.
Unlike 'info -c' and 'stats report' the 'dmstats list' subcommand
does its own report processing. This complicates the handling of
the DR_STATS and DR_STATS_META fields and leads to inconsistent
behaviour between the different commands. In particular it causes
'stats list' to segfault when using 'all' field options:
Segmentation fault (core dumped)
Delete _stats_list() entirely and adapt _stats_report so that it
can correctly format a DR_STATS_META-only report request.
This requires passing the subcommand into _report_init() where it
is used in addition to the command name to select the default set
of report fields for the 'list' and 'report' stats subcommands.
With this change both 'list' and 'report' dmstats report will use
the correct report object type and ensure that it is initialised
appropriately for the field selection in use.
Although statistics and meta fields (region and area properties) share
the same object type the state of the handle they expect differs: meta
only expects a dm_stats_list() operation to have been performed whereas
statistics require a fully populated handle.
Distinguish between these requirements by separating the fields into
two distinct report types:
DR_STATS = 32,
DR_STATS_META = 64
The new category is described as "Mapped Device Statistics Region
Information" in the help text.
Make the use of the this_start and this_len variables easier to
follow and clarify the use of zero start and len arguments to
request a whole-device region.
Add a pair of fields to expose the current per-interval duation
estimate. The 'interval' field provides a real value in units of
seconds and the 'interval_ns' field provides the same quantity
expressed as a whole number of nanoseconds.
Do not include bits/time.h as it is an internal libc header file.
A comment at the top of the glibc specific bits/time.h says:
"Never include this file directly; use <time.h> instead."
This fixes the following build error with musl libc:
libdm-timestamp.c:37:23: fatal error: bits/time.h: No such file or directory
---
Compile tested with Alpine Linx (musl libc) and ubuntu 15.04
libdm/libdm-timestamp.c | 1 -
1 file changed, 1 deletion(-)
Introduce enums and global variables to record cleanly which command we
are processing and eliminate the historically inconsistent use of the
shifted argv[0] and fix assorted bugs discovered along the way.
Add dm_report_is_empty() to indicate there is no data awaiting output
and use this to suppress dmsetup report headings when no data is output
so we don't get a stray line saying 'Help' at the end of reporting help.
Define a report type (as the interface requires) so -o all selects
the right fields in splitname. (A fix for stats list will follow.)
Exit immediately if no device is supplied to dmsetup wipe_table instead
of hitting errors later and failing.
Adjust the command name printed in usage/help output to match command
invoked (most of the time).
The '--force' switch is only used by dmstats to allow either
creation or deletion of one or more regions on all devices.
These operations do not carry any risk: just a possible mess of
region IDs to be cleaned up.
Remove the use of '--force' for stats commands and change current
uses to a new '--alldevices' switch.
The region creation message just outputs the new region_id, e.g.:
Created region: 0
This is fine when the device is unambigous (as above) but produces
unhelpful output when creating multiple regions, or regions on
multiple devices:
Created region: 0
Created region: 0
Created region: 1
Created region: 2
Created region: 0
To address this refactor _stats_create_segments() (previously only
used when creating one-region-per-target for --segments) into a
more general _do_stats_create_regions() that can create regions
for each segment, or a single region spanning either the entire
device or a specied start/len range.
This allows us to output all region creation messages from a
single point where both the device name and all information needed
to derive the number of areas is available.
This allows us to log all these facts in the resulting messages:
vg_hex-lv_home: Created new region with 13 area(s) as region ID 0
vg_hex-lv_home: Created new region with 4 area(s) as region ID 1
vg_hex-lv_home: Created new region with 1 area(s) as region ID 2
vg_hex-lv_swap: Created new region with 1 area(s) as region ID 0
vg_hex-lv_root: Created new region with 10 area(s) as region ID 0
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e: Created new region with 17 area(s) as region ID 0
vg_hex-lv_images: Created new region with 20 area(s) as region ID 0
vg_hex-lv_images: Created new region with 4 area(s) as region ID 1
Don't use cryptic abbreviations and make sure that all values can
be understood by someone not familiar with the clock internals.
Include the current interval number (inverse of the _count) in all
interval update messages and attempt to align interval timestamp
logs for interval counts < 99,999.
If _stats_report fails (e.g. due to an invalid device on the
command line) destroy the _report to prevent stats columns headings
from being displayed.
This also requires a change in main to test the return from
_perform_command_for_all_repeatable_args inside the interval loop
and exit immediately in case of error.
The _update_interval_times() function is called once per reported
object: when shutting down at the end of a run only the first call
should free timestamps. Clear the timestamp pointers after free
and use this to signal to other callers that the clock is already
shut down.
If the Linux timerfd interface to POSIX timers is available at compile
time use it for all report interval timekeeping. This gives more
accurate interval timing when the per-interval processing time is less
than the configured interval and simplifies the timestamp bookkeeping
required to keep accurate time.
For systems without timerfd support fall back to the simple usleep based
timer.
Change logic and naming of some internal API functions.
cache_set_mode() and cache_set_policy() both take segment.
cache mode is now correctly 'masked-in'.
If the passed segment is 'cache' segment - it will automatically
try to find 'defaults' according to profiles if the are NOT
specified on command line or they are NOT already set for cache-pool.
These defaults are never set for cache-pool.
Add code to detect available cache features.
Support policy_mq & policy_smq features which might be disabled.
Introduce global_cache_disabled_features_CFG.
Add new profilable configurables:
allocation/cache_policy
allocation/cache_settings
and mark allocation/cache_pool_chunk_size as profilable as well.
Obsolete allocation/cache_pool_cachemode and
introduce new allocation/cache_mode instead.
Rename DEFAULT_CACHE_POOL_POLICY to DEFAULT_CACHE_POLICY.
Request a transient LV lock from lvmlockd when
converting an LV. If the LV is inactive when
lvconvert is run, the LV lock will be acquired
and then released when the command is done.
If the LV is active, a persistent lock exists
already and the transient lock request does nothing.
This fixes the issue that had been mentioned in the
comment previously.
lvrename should not be done if the LV is active on another host.
This check was mistakenly removed when the code was changed to
use LV uuids in locks rather than LV names.
Since libdm-stats only uses fmemopen'd FILE objects the only way
that a close can fail is corruption of the memory containing the
FILE: check for this case and emit a backtrace if it occurs.
libdm/libdm-stats.c: 338 in _stats_parse_list()
libdm/libdm-stats.c: 341 in _stats_parse_list()
libdm/libdm-stats.c: 481 in _stats_parse_region()
libdm/libdm-stats.c: 487 in _stats_parse_region()
libdm/libdm-stats.c: 487 in _stats_parse_region()
- Calling "fclose" without checking return value
Remove an unneccessary conditional operator and simplify the logic
in _nr_areas:
libdm/libdm-stats.c: 501 in _nr_areas() - Control flow issues (DEADCODE)
The error path of _stats_list frees the task and stats objects:
don't try to branch to it before they have been allocated.
tools/dmsetup.c: 4589 in _stats_help() - Null pointer dereferences (FORWARD_NULL)
Make sure the newly created handle is freed if we are unable to
also create the pool for it.
tools/dmsetup.c: 4255 in _stats_list() - Variable "dms" going out of scope leaks the storage it points to.
Make sure comm is closed in the error path of _program_id_from_proc().
libdm/libdm-stats.c: 98 in dm_stats_create() - Variable "comm" going out of scope leaks the storage it points to.
There's no point testing _report here in _stats_report: it's always
initialised before the function is called and if the check did fail
we'd end up freeing an uninitialized dm_task in the error path.
tools/dmsetup.c: 4389 in _stats_report() - Declaring variable "dmt" without initializer.
The check for other sanlock lockspaces was not checking
that the lockspace type was sanlock, so if dlm lockspaces
were visible, they were wrongly included.
When vgremove is used to remove multiple VGs in one command,
e.g. vgremove foo bar, the first VG (foo) that is removed
may have held the sanlock global lock. In this case,
do not continue removing further VGs (bar) without the
global lock.
Add the libdm-stats module to libdm: this implements a simple interface
for creating, managing and interrogating I/O statistics regions and
areas on device-mapper devices.
The library interface is documented in libdevmapper.h and provides a
'dm_stats' handle that is used to perform statistics operations and
obtain data.
Public methods are provided to create and destroy handles and to list,
create, and destroy statistics regions as well as to obtain and parse
counter data and calculate rate-based metrics.
This commit also adds a 'dmsetup stats' (aka 'dmstats') command with
'clear', 'create', 'delete', 'list', 'print', and 'report' sub-commands.
See the library documentation and the dmstats.8 manual page for detailed
API and command descriptions.
Don't do interval management and external timekeeping for stats in
dm_report: let applications handle this on their own.
Since this has not been included in a release remove it from the
library entirely and handle report timing directly inside dmsetup.
Add a function to print column headings regardless of whether they
have already been output. This will be used by dmstats to issue
periodic reminders of the column headings.
This patch removes a check for RH_HEADINGS_PRINTED from
_report_headings that prevents headings being displayed if the flag
is already set; this check is redundant since the only existing
caller (_output_as_columns()) already tests the flag before
calling the function.
Not releasing objects back to the pool is fine for short-lived
pools since the memory will be freed when dm_pool_destroy() is
called.
Any pool that may be long-lived needs to be more careful to free
objects back to the pool to avoid leaking memory that will not be
reclaimed until the pool is destroyed at process exit time.
The report pool currently leaks each headings line and some row
data.
Although dm_report_output() tries to free the first allocated row
this may end up freeing a later row due to sorting of the row list
while reporting. Store a pointer to the first allocated row from
_do_report_obect() instead and free this at the end of
_output_as_columns(), _output_as_rows(), and dm_report_clear().
Also make sure to call dm_pool_free() for the headings line built
in _report_headings().
When dmstats is introduced it will maintain dm_report objects for
the whole lifetime of the process: without these changes a stats
report could leak around 600k in 10m (exact rate depends on field
selection and data values):
top - 12:11:32 up 4 days, 3:16, 15 users, load average: 0.01, 0.12, 0.14
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6473 root 20 0 130196 3124 2792 S 0.0 0.0 0:00.00 dmstats
top - 12:22:04 up 4 days, 3:26, 15 users, load average: 0.06, 0.11, 0.13
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6498 root 20 0 130836 3712 2752 S 0.0 0.0 0:00.60 dmstats
With this patch no increase in RSS is seen:
top - 13:54:58 up 4 days, 4:59, 15 users, load average: 0.12, 0.14, 0.14
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 130196 2996 2688 S 0.0 0.0 0:00.00 dmstats
top - 14:04:31 up 4 days, 5:09, 15 users, load average: 1.02, 0.67, 0.36
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 130196 2996 2688 S 0.3 0.0 0:00.32 dmstats
This also affects report output for repeating reports in the
DM_REPORT_OUTPUT_COLUMNS_AS_ROWS case; row state is not fully cleared for
the next iteration leading to progressive growth of the heading width:
vg_hex-lv_home:vg_hex-lv_swap:vg_hex-lv_root:luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:vg_hex-lv_images
253:253:253:253:253
2:0:1:4:3
L--w:L--w:L--w:L--w:L--w
1:2:1:1:1
3:1:1:1:2
0:0:0:0:0
LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiv08BCGvF4WsJSqWUDUt7qtf2hEmjtVvo:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiKf7XIiwdAYOJfaGhQe9fu26cTEICGgFS:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiEZj7ZXbmrWDuGhd7vvi88VF0NdTMG8iA:CRYPT-LUKS1-797339213f684c929eb7d0aca4c6ba3e-luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOi2rKredlBPnw2X7v1BiCuEpFo6gaE7BRw
:::::vg_hex-lv_home:vg_hex-lv_swap:vg_hex-lv_root:luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:vg_hex-lv_images
:::::253:253:253:253:253
:::::2:0:1:4:3
:::::L--w:L--w:L--w:L--w:L--w
:::::1:2:1:1:1
:::::3:1:1:1:2
:::::0:0:0:0:0
:::::LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiv08BCGvF4WsJSqWUDUt7qtf2hEmjtVvo:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiKf7XIiwdAYOJfaGhQe9fu26cTEICGgFS:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiEZj7ZXbmrWDuGhd7vvi88VF0NdTMG8iA:CRYPT-LUKS1-797339213f684c929eb7d0aca4c6ba3e-luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOi2rKredlBPnw2X7v1BiCuEpFo6gaE7BRw
This adds the infrastructure, code paths, error reporting,
etc. to handle storage errors, or storage loss, under the
sanlock leases in a VG that is being used. The loss of
storage means sanlock cannot renew its leases, which means
that the host needs to stop using the shared VG before its
leases expire.
This still requires manually shutting down a VG that has
lost lease storage, e.g. unmounting file systems,
deactivating LVs in the VG. The next step is to
automatically use a command like blkdeactivate to do that.
tools/polldaemon.c:465: uninit_use_in_call: Using uninitialized value "id.vg_name" when calling "print_log".
tools/polldaemon.c:465: uninit_use_in_call: Using uninitialized value "id.lv_name" when calling "print_log".
/lib/log/log.c:88: warning[invalidScanfArgType_int]: %llu in format string (no. 2) requires 'unsigned long long *' but the argument type is 'long long *'.
daemons/lvmlockd/lvmlockd-core.c:791: error[uninitstring]: Dangerous usage of 'version' (strncpy doesn't always null-terminate it).
The dlm global lockspace is automatically added when the
first dlm VG lockspace is added. Reverse this by removing
the dlm global lockspace after the last dlm VG lockspace
is removed. (Remove old non-working code that did this
based on an old command that could explicitly add/remove
the dlm global lockspace.)
Whenver reporting field name is registered with libdevmapper and if
the field name contains any number of underscores ('_'), libdm
can now automatically recognize any of its variant without any
underscores used.
For example:
..for underscores in prefixes:
pvs -o pv_name
pvs -o name
pvs -o pvname (newly recognized besides pvname)
..for underscores in the name:
lvs -o cache_mode
lvs -o cachemode
..or even multiple underscores:
pvs -o pv___na___me
It's all variant of the same field name.
No commands set has_subcommands yet.
Move multiple device loop to separate function because we'll
soon want to call it repeatedly.
(Based on patch from bmr.)
Use refresh_filters instead of destroy_filters and init_filters
in refresh_toolcontext fn which deals with cmd->initialized.filters
correctly on refresh.
Just shuffle the items and put them into logical groups so it's
visible at first sight what each group contains - it makes it a bit
easier to make heads and tails of the whole cmd_context monster.
When a command is flagged with NO_METADATA_PROCESSING flag, it means
such command does not process any metadata and hence it doens't require
lvmetad, lvmpolld and it can get away with no locking too. These are
mostly simple commands (like lvmconfig/dumpconfig, version, types,
segtypes and other builtin commands that do not process metadata
in any way).
At first, when lvm command is executed, create toolcontext without
initializing connections (lvmetad,lvmpolld) and without initializing
filters (which depend on connections init). Instead, delay this
initialization until we know we need this. That is, until the
lvm_run_command fn is called in which we know what the actual
command to run is and hence we can avoid any connection, filter
or locking initiliazation for commands that would not make use
of it anyway.
For all the other create_toolcontext calls, we keep the original
behaviour - the filters and connections are initialized together
with the toolcontext.
Make it possible to decide whether we want to initialize connections and
filters together with toolcontext creation.
Add "filters" and "connections" fields to struct
cmd_context_initialized_parts and set these in cmd_context.initialized
instance accordingly.
(For now, all create_toolcontext calls do initialize connections and
filters, we'll change that in subsequent patch appropriately.)
Move original lvmetad and lvmpolld initialization code from
_process_config fn to their own functions _init_lvmetad and
_init_lvmpolld (both covered with single _init_connections fn).
Add struct cmd_context_initialized_parts to wrap up information
about which cmd context pieces are initialized and add variable
of this struct type into struct cmd_context.
Also, move existing "config_initialized" variable that was directly
part of cmd_context into the new cmd_context.initialized wrapper.
We'll be adding more items into the struct cmd_context_initialized_parts
with subsequent patches...
This tries harder to avoid creating duplicate global locks in
sanlock VGs by refusing to create a new sanlock VG with a
global lock if other sanlock VGs exist that may have a gl.
vgsummary information contains provisional VG information
that is obtained without holding the VG lock. This info
can be used to lock the VG, and then read it with vg_read().
After the VG is read properly, the vgsummary info should
be verified.
Add the VG lock_type to the vgsummary. It needs to be
known before the VG can be locked and read.
This is a regression introduced by commit
6c0e44d5a2 which changed
the way dev_cache_get fn works - before this patch, when a
device was not found, it fired a full rescan to correct the
cache. However, the change coming with that commit missed
this full_rescan call, causing the lvmcache to still contain
info about PVs which should be filtered now.
Such situation may have happened by coincidence of using
old persistent cache (/etc/lvm/cache/.cache) which does not
reflect the actual state anymore, a device name/symlink which
now points to a device which should be filtered and a fact we
keep info about usable DM devices in .cache no matter what
the filter setting is.
This bug could be hidden though by changes introduced in
commit f1a000a477 as it
calls full_rescan earlier before this problem is hit.
But we need to fix this anyway for the dev_cache_get
to be correct if we happen to use the same code path
again somewhere sometime.
For example, simple reproducer was (before commit
1a000a477558e157532d5f2cd2f9c9139d4f87c):
- /dev/sda contains a PV header with UUID y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M
- lvm.conf: filter = [ "r|.*|" ]
- rm -f .cache (to start with clean state)
- dmsetup create test --table "0 8388608 linear /dev/sda 0" (8388608 is
just the size of the /dev/sda device I use in the reproducer)
- pvs (this will create .cache file which contains
"/dev/disk/by-id/lvm-pv-uuid-y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M"
as well as "/dev/mapper/test" and the target node "/dev/dm-1" - all the
usable DM mappings (and their symlinks) get into the .cache file even
though the filter "is set to "ignore all" - we do this - so far it's OK)
- dmsetup remove test (so we end up with /dev/disk/by-id/lvm-pv-uuid-...
pointing to the /dev/sda now since it's the underlying device
containing the actual PV header)
- now calling "pvs" with such .cache file and we get:
$ pvs
PV VG Fmt Attr PSize PFree
/dev/disk/by-id/lvm-pv-uuid-y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M vg lvm2 a-- 4.00g 0
Even though we have set filter = [ "r|.*|" ] in the lvm.conf file!
Moved out from lib/display and a little documentation added.
It's tuned to LVM's requirements historically and its behaviour
might not always be what you would expect.
There is no longer an "enable" option for the global lock,
so remove the bit of code that was checking for it. It
was an optional variation anyway, and not one that was likely
to be used.
Also update the corresponding comment describing global lock
creation.
Stop removing hyphens when = is seen. With an option
like --profile=thin-performance, the hyphen removal
will stop at = and will not remove - after thin.
Stop removing hyphens altogether when a stand alone arg
of -- appears.
Simply running concurrent copies of 'pvscan | true' is enough to make
clvmd freeze: pvscan exits on the EPIPE without first releasing the
global lock.
clvmd notices the client disappear but because the cleanup code that
releases the locks is triggered from within some processing after the
next select() returns, and that processing can 'break' after doing just
one action, it sometimes never releases the locks to other clients.
Move the cleanup code before the select.
Check all fds after select().
Improve some debug messages and warn in the unlikely event that
select() capacity could soon be exceeded.
When there are duplicate global locks, check if the gl
is still enabled each time a gl or vg lock is acquired
in the lockspace. Once one of the duplicates is disabled,
then other hosts will recognize that the issue is resolved
without needing to restart the lockspaces.
Move the DEBUG_MEM decision inside libdevmapper.so instead of exposing
it in libdevmapper.h which causes failures if the binary and library
were compiled with opposite debugging settings.
pvscan autoactivation does not work for lockd VGs because
lock start is needed on a lockd VG before locking can be
done for it. Add a check to skip the attempt at autoactivate
rather than calling it, knowing it will fail.
Add a comment explaining why pvscan --cache works fine for
lockd VGs without locks, and why autoactivate is not done.
. the poll check will eventually call finish which will
write the VG, so an ex VG lock is needed from lvmlockd.
. fix missing unlock on poll error path
. remove the lockd locking while monitoring the progress
of the command, as suggested by the earlier FIXME comment,
as it's not needed.
The CFG_SECTION_NO_CHECK flag can be used to mark a section
and its whole subtree as containing settings where checks
won't be made (lvmconfig --validate).
These are setting where we don't know the names and and type
in advance and they're recognized in runtime. As we don't know
the type and name in advance, we can't do any checks here
of course.
Use this flag with great care as it disables config checks
for the whole config subtree found under such section.
This flag is going to be used by subsequent patches from
Zdenek to support some cache settings...
Recent change to move the polling outside of core lvconvert
code was wrongly using 'lv' and 'vg' structs which can't be
used outside of the core code, which caused seg fault.
Properly isolate all use of lv structs within the core of
the lvconvert code, saving any information necessary,
(esp lvid). After the core of lvconvert is done, use
the saved information to do polling.
FIXME: the need for is_merging_origin and is_merging_origin_thin
in this patch is ugly, and a cleaner way should be found to deal
with that than what is done here.
Also it effectively removed all hacks in _lvconvert_merge_single
performing ugly: VG reread, unlock, polling, lock sequence.
Moreover all polling operations are postponed after all conversions
are finished.
lvm2 (while locking via lvmlockd) should now be able to run with
or without lvmpolld while performing poll operations originating
in lvconvert command.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Comply with the rules we have for log_error and log_warn...
$ pvcreate /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
1 existing signature left on the device.
Aborting pvcreate on /dev/sda1.
$ pvcreate /dev/sda1 --force
WARNING: Failed to get offset of the xfs_external_log signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
libblkid may return the list of signatures found, but it may not
provide offset and size for each signature detected. This may
happen in case signatures are mixed up or there are more, possibly
overlapping, signatures found.
Make lvm commands pass if such situation happens and we're using
--force (or any stronger force method).
For example:
$ pvcreate /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
1 existing signature left on the device.
Aborting pvcreate on /dev/sda1.
$ pvcreate --force /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
The vgchange/lvchange activation commands read the VG, and
don't write it, so they acquire a shared VG lock from lvmlockd.
When other commands fail to acquire a shared VG lock from
lvmlockd, a warning is printed and they continue without it.
(Without it, the VG metadata they display from lvmetad may
not be up to date.)
vgchange/lvchange -a shouldn't continue without the shared
lock for a couple reasons:
. Usually they will just continue on and fail to acquire the
LV locks for activation, so continuing is pointless.
. More importantly, without the sh VG lock, the VG metadata
used by the command may be stale, and the LV locks shown
in the VG metadata may no longer be current. In the
case of sanlock, this would result in odd, unpredictable
errors when lvmlockd doesn't find the expected lock on
disk. In the case of dlm, the invalid LV lock could be
granted for the non-existing LV.
The solution is to not continue after the shared lock fails,
in the same way that a command fails if an exclusive lock fails.
When lvmlockd is compiled without support for one of the
lock managers (sanlock or dlm), and a command tries to use
one of them, explain that in the error message.
Don't abort test when clvmd processes two VGs concurrently.
CLVMD: ioctl/libdm-iface.c:1940 Internal error: Performing unsafe table load while 3 device(s) are known to be suspended: (253:19)
When lvm is built without lvmlockd support, vgcreate using a
shared lock type would succeed and create a local VG (the
--shared option was effectively ignored). Make it fail.
Fix the same issue when using vgchange to change a VG to a
shared lock type.
Make the error messages consistent.
A segfault was reported when extending an LV with a smaller number of
stripes than originally used. Under unusual circumstances, the cling
detection code could successfully find a match against the excess
stripe positions and think it had finished prematurely leading to an
allocation being pursued with a length of zero.
Rename ix_offset to num_positional_areas and move it to struct
alloc_state so that _is_condition() can obtain access to it.
In _is_condition(), areas_size can no longer be assumed to match the
number of positional slots being filled so check this newly-exposed
num_positional_areas directly instead. If the slot is outside the
range we are trying to fill, just ignore the match for now.
(Also note that the code still only performs cling detection against
the first segment of the LV.)
Replace misleading "not found" in the log message when
devices/preferred_names is set to empty array:
Really not found:
device/dev-cache.c:689 devices/preferred_names not found in config: using built-in preferences
Found, but empty:
config/config.c:1431 Setting devices/preferred_names to preferred_names = [ ]
device/dev-cache.c:689 devices/preferred_names is empty: using built-in preferences
Commit 7e728fe1a1 added a log call
directly in find_config_tree_array when defaults are used.
This patch also adds the log for the value which is found in
existing configuration and for which defaults are not used.
For example:
Defaults used:
config/config.c:1428 devices/scan not found in config: defaulting to scan = [ "/dev" ]
Value defined in configuration used:
config/config.c:1431 Setting devices/scan to scan = [ "/dev", "/mydev", "/abc" ]
This makes the logging consistent with the other find_config_tree_* functions.
Policy name has to be always defined.
Capture it as an internal error before write.
When reading metadata without defined policy name, use default defined policy.
TODO: Unsure, but it might have to be actually always 'mq' in this case.
Keep policy name separate from policy settings and avoid
to mangling and demangling this string from same config tree.
Ensure policy_name is always defined.
Use find_config_tree_array for all config arrays. Also, add
INTERNAL_ERROR in case there should have been at least default
value defined for a setting but it was not returned for some
reason (either config_settings.h misconfiguration or other config
tree error printed by functions called by find_config_tree_array).
Both lock_start filters were being skipped when any lock-opt
values were used. The "auto" lock-opt should cause the
auto_lock_start_list to be used. The lock_start_list should
always be used.
The behavior of lock_start_list/auto_lock_start_list are tested
and verified to behave like volume_list/auto_activation_volume_list.
Since the default was changed to wait for lock-start to finish,
the "wait" and "autowait" lock-opt values are not needed, but a
new "autonowait" is added to the existing "nowait" avoid the
default waiting.
There are two different failure conditions detected in
access_vg_lock_type() that should have different error
messages. This adds another failure flag so the two
cases can be distinguished to avoid printing a misleading
error message.
Require global/{thin,cache}_{check,repair}_options to be always defined.
If not defined directly by user in the configuration and if there's no
concrete default option to use, make "" (empty string) the default one -
it's then clearly visible in the "lvmconfig --type default" (and
generated lvm.conf) and also it makes its handling in the code more
straightforward so we don't need to handle undefined values.
This means, if there are no default values for these settings defined,
we end up with this generated now:
{thin,cache}_{check,repair}_options = [ "" ]
So the value is never undefined and if it is, it's an error.
(The cache_repair_options is actually not used in the code at the moment,
but once the code using this setting is in, it will follow the same logic
as used for thin_repair_options.)
The "exported" state of the VG can be useful with lockd VGs
because the exported state keeps a VG from being used in general.
It's a way to keep a VG protected and out of the way.
Also fix the command flags: ALL_VGS_IS_DEFAULT is not true for
vgimport/vgexport, since they both return errors immediately if
no VG args are specified. LOCKD_VG_SH is not true for vgexport
beause it must use an ex lock to write the VG.
When --nolocking is used (by vgs, lvs, pvs):
. don't use lvmlockd at all (set use_lvmlockd to 0)
. allow lockd VGs to be read
When --readonly is used (by vgs, lvs, pvs, vgdisplay, lvdisplay,
pvdisplay, lvmdiskscan, lvscan, pvscan, vgcfgbackup):
. skip actual lvmlockd locking calls
. allow lockd VGs to be read
. check that only shared gl/vg locks are being requested
(even though the actually locking is being skipped)
. check that no LV locks are requested, because no LVs
should be activated or used in readonly mode
. disable using lvmetad so VGs are read from disk
It is important to note the limited commands that accept
the --nolocking and --readonly options, i.e. no commands
that change/write a VG or change/activate LVs accept these
options, only commands that read VGs.
A new lockd lock needs to be created for the new LV
created by split mirror and split snapshot. Disallow
these options in lockd VGs until that is implemented.
There are at least a couple instances where
the lock_args check does not work correctly,
(listed in the comment), so disable the
NULL check for lock_args until those are
resolved.
log_warn was added recently because no known code used
the given condition, but running pvcreate on an existing
PV uses this case, and should not produce a warning.
Put the change from commit #10d27998b3d2f6100e9e29e83d1d99948c55875f
back so we have working tree again for now. This code needs a bit of
a cleanup to return proper return value to check...
lib/format1/import-export.c:167: var_deref_op: Dereferencing null pointer "vg->lvm1_system_id"
lib/cache/lvmetad.c:1023: var_deref_op: Dereferencing null pointer "this"
daemons/lvmlockd/lvmlockd-core.c:2659: check_after_deref: Null-checking "act" suggests that it may be null, but it has already been dereferenced on all paths leading to the check
/daemons/lvmetad/lvmetad-core.c:1024: check_after_deref: Null-checking "pvmeta" suggests that it may be null, but it has already been dereferenced on all paths leading to the check
If running lvmconf ... --startstopservice --mirrorservice in systemd
environment, handle lvm2-cmirrord accordingly. A typo in the script
caused the lvm2-cmirrord to not start/stop immediately, it was
only enabled/disabled (so the --startstopservice was ignored in this
case).
This prevents 'lvremove vgname' from attempting to remove the
hidden sanlock LV. Only vgremove should remove the hidden
sanlock LV holding the sanlock locks.
tools/polldaemon.c:457: array_null: Comparing an array to null is not useful: "lv->lvid.s"
The lv->lvid.s is never NULL. The check was supposed to be *lv->lvid.s
to check if the string is not empty.
... Using uninitialized value "lockd_state" when calling "lockd_vg"
(even though lockd_vg assigns 0 to the lockd_state, but it looks at
previous state of lockd_state just before that so we need to have
that properly initialized!)
libdm/libdm-report.c:2934: uninit_use_in_call: Using uninitialized value "tm". Field "tm.tm_gmtoff" is uninitialized when calling "_get_final_time".
daemons/lvmlockd/lvmlockctl.c:273: uninit_use_in_call: Using uninitialized element of array "r_name" when calling "format_info_r_action". (just added FIXME as this looks unfinished?)
lib/log/log.c:115: leaked_storage: Variable "st" going out of scope leaks the storage it points to
daemons/lvmpolld/lvmpolld-core.c:573: leaked_storage: Variable "cmdargv" going out of scope leaks the storage it points to
daemons/lvmlockd/lvmlockd-core.c:5341: leaked_handle: Handle variable "fd" going out of scope leaks the handle
daemons/lvmlockd/lvmlockctl.c:575: overwrite_var: Overwriting "able_vg_name" in "able_vg_name = strdup(optarg)" leaks the storage that "able_vg_name" points to
daemons/lvmlockd/lvmlockctl.c:571: overwrite_var: Overwriting "able_vg_name" in "able_vg_name = strdup(optarg)" leaks the storage that "able_vg_name" points to
daemons/lvmlockd/lvmlockctl.c:385: leaked_handle: Handle variable "s" going out of scope leaks the handle
Before, we used general find_config_tree_node function to retrieve
array values. This had a downside where if the node was not found,
we had to insert default values directly in-situ after the
find_config_tree_node call. This way, we had two copies of default
values - one in config_settings.h and the other one directly in the
code where we found out that find_config_tree_node returned NULL and
hence we needed to fall back to defaults.
With separate find_config_tree_array used for array config values,
we keep all the defaults centrally in config_settings.h because
the new find_config_tree_array automatically returns these defaults
if it can't find any value set in the configuration.
This patch just makes the behaviour exactly the same for arrays as
for any other non-array type where we call find_config_tree_<type>
already, hence making the internal interface for handling array
values consistent with the rest of the config types.
if lvm2 is built with debug memory options dm_free() is not
mapped directly to std library's free(). This may cause memory corruption
as a line buffer may get reallocated in getline with realloc.
This is a temporary hotfix. Other debug memory failure needs to
be investigated and explained.
including the allow_override_lock_modes setting.
It was not possible to override default lock modes any longer,
since the command line options had already been removed.
A mechanism will probably be required later that puts part of
this back.
Existing messaging intarface for thin-pool has a few 'weak' points:
* Message were posted with each 'resume' operation, thus not allowing
activation of thin-pool with the existing state.
* Acceleration skipped suspend step has not worked in cluster,
since clvmd resumes only nodes which are suspended (have proper lock
state).
* Resume may fail and code is not really designed to 'fail' in this
phase (generic rule here is resume DOES NOT fail unless something serious
is wrong and lvm2 tool usually doesn't handle recovery path in this case.)
* Full thin-pool suspend happened, when taken a thin-volume snapshot.
With this patch the new method relocates message passing into suspend
state.
This has a few drawbacks with current API, but overal it performs
better and gives are more posibilities to deal with errors.
Patch introduces a new logic for 'origin-only' suspend of thin-pool and
this also relates to thin-volume when taking snapshot.
When suspend_origin_only operation is invoked on a pool with
queued messages then only those messages are posted to thin-pool and
actual suspend of thin pool and data and metadata volume is skipped.
This makes taking a snapshot of thin-volume lighter operation and
avoids blocking of other unrelated active thin volumes.
Also fail now happens in 'suspend' state where the 'Fail' is more expected
and it is better handled through error paths.
Activation of thin-pool is now not sending any message and leaves upto a tool
to decided later how to finish unfinished double-commit transaction.
Problem which needs some API improvements relates to the lvm2 tree
construction. For the suspend tree we do not add target table line
into the tree, but only a device is inserted into a tree.
Current mechanism to attach messages for thin-pool requires the libdm
to know about thin-pool target, so lvm2 currently takes assumption, node
is really a thin-pool and fills in the table line for this node (which
should be ensured by the PRELOAD phase, but it's a misuse of internal API)
we would possibly need to be able to attach message to 'any' node.
Other thing to notice - current messaging interface in thin-pool
target requires to suspend thin volume origin first and then send
a create message, but this could not have any 'nice' solution on lvm2
side and IMHO we should introduce something like 'create_after_resume'
message.
Patch also changes the moment, where lvm2 transaction id is increased.
Now it happens only after successful finish of kernel transaction id
change. This change was needed to handle properly activation of pool,
which is in the middle of unfinished transaction, and also this corrects
usage of thin-pool by external apps like Docker.
Add support for sending message in suspend tree for thin-pools.
When this operation is requested whole subtree suspend is then skipped.
This is experimantal support for new lvm2 code for sending message
in suspend phase where 'thin-pool origin-only suspend' will send
messages instead of really suspending thin-pool tree.
When suspening thin volume origin-only - only thin volume is suspended,
then messages are posted and thin-pool suspend is skipped.
Recognize date and time specification within selection criteria
that is formulated in a more free-form way besides to the original
basic YYYY-MM-DD HH:MM format that libdevmapper supports.
Currently, this free-form format is recognized for lv_time field.
Users are able to use expressions from this set:
- weekday names ("Sunday" - "Saturday" or abbreviated as "Sun" - "Sat")
- labels for points in time ("noon", "midnight")
- labels for a day relative to current day ("today", "yesterday")
- points back in time with relative offset from today (N is a number)
( "N" "seconds"/"minutes"/"hours"/"days"/"weeks"/"years" "ago")
( "N" "secs"/"mins"/"hrs" ... "ago")
( "N" "s"/"m"/"h" ... "ago")
- time specification either in hh:mm:ss format or with AM/PM suffixes
- month names ("January" - "December" or abbreviated as "Jan" - "Dec")
For example:
$ date
Fri Jul 3 10:11:13 CEST 2015
$ lvmconfig --type full report/time_format
time_format="%a %Y-%m-%d %T %z %Z [%s]"
$ lvs
LV VG Time
lvol0 vg Fri 2014-08-22 21:25:41 +0200 CEST [1408735541]
lvol2 vg Sun 2015-04-26 14:52:20 +0200 CEST [1430052740]
root fedora Wed 2015-05-27 08:09:21 +0200 CEST [1432706961]
swap fedora Wed 2015-05-27 08:09:21 +0200 CEST [1432706961]
lvol1 vg Tue 2015-06-30 03:25:43 +0200 CEST [1435627543]
lvol3 vg Tue 2015-06-30 14:52:23 +0200 CEST [1435668743]
lvol6 vg Wed 2015-07-01 13:35:56 +0200 CEST [1435750556]
lvol4 vg Thu 2015-07-02 12:12:02 +0200 CEST [1435831922]
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
$ lvs -S 'time=yesterday'
LV VG Time
lvol4 vg Thu 2015-07-02 12:12:02 +0200 CEST [1435831922]
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
$ lvs -S 'time since "June 30"'
LV VG Time
lvol1 vg Tue 2015-06-30 03:25:43 +0200 CEST [1435627543]
lvol3 vg Tue 2015-06-30 14:52:23 +0200 CEST [1435668743]
lvol6 vg Wed 2015-07-01 13:35:56 +0200 CEST [1435750556]
lvol4 vg Thu 2015-07-02 12:12:02 +0200 CEST [1435831922]
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
$ lvs -S 'time since "noon June 30"'
LV VG Time
lvol3 vg Tue 2015-06-30 14:52:23 +0200 CEST [1435668743]
lvol6 vg Wed 2015-07-01 13:35:56 +0200 CEST [1435750556]
lvol4 vg Thu 2015-07-02 12:12:02 +0200 CEST [1435831922]
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
$ lvs -S 'time since "2 July 9AM"'
LV VG Time
lvol4 vg Thu 2015-07-02 12:12:02 +0200 CEST [1435831922]
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
$ lvs -S 'time since "2 July 1PM"'
LV VG Time
lvol5 vg Thu 2015-07-02 14:30:32 +0200 CEST [1435840232]
...and so on.
Wire the dm_report_reserved_handler instance call in reporting/selection
infrastructure to handle reserved value actions (currently only
DM_REPORT_RESERVED_PARSE_FUZZY_NAME and DM_REPORT_RESERVED_GET_DYNAMIC_VALUE
actions).
With fuzzy names we mean the names for which it's hard or even impossible
to enumerate all possible variations of the name - the name needs to
be evaluated. An example of fuzzy name is a name which has a base
(substring) which matches and it can contain arbitrary variations
around this base. We can cover human language better with fuzzy
names as people may use several different names (or sentences) to
denote the same thing.
With dynamic values we mean the values which are not constants
and they need to be evaluated in runtime. An example of dynamic
value is a value which depends on current system state (e.g. time,
current configuration or any other state which may change and it
needs runtime evaluation).
There's a handler that can be registered with reporting/selection
using dm_report_reserved_handler instance. This is a central point
in which the computation/evaluation happens when processing reserved
values. Currently, there are two actions declared:
DM_REPORT_RESERVED_PARSE_FUZZY_NAME
(translates fuzzy name into canonical name)
DM_REPORT_RESERVED_GET_DYNAMIC_VALUE
(gets value for canonical name)
The handler is then registered as value in struct
dm_report_reserved_value (see explaining comments besided
the struct dm_report_reserved_value in libdevmapper.h).
Also, this patch provides support for simple caching of values
used during report/selection via dm_report_value_cache_{set,get}.
This is supposed to be used mainly in the dm_report_reserved_handler
instances to save values among calls so all the handler calls work
with the same base value used in computation/evaluation and/or
possibly to save resources if the evaluation is more time-consuming.
The cache is attached to the dm_report handle and so the cache is
dropped one dm_report is dropped.
Generic numbers and time values share some operators so make sure
we have the flags correctly adjusted based on expected type if
we're using reserved values.
dm_snprintf() returns upon success the number of characters printed
(excluding the null byte used to end output to strings).
So add extra byte to preserve \0.
This fixes regression when displaying more then a single lv name.
_node_name() prepares into dm_tree internal buffer device
name and it (major:minor) for easy usage for debug messages.
To avoid any allocation a small buffer in struct dm_tree is preallocated
to store this message.
This patch adds support for time values used in reporting fields.
The raw values are always stored as number of seconds since epoch.
The support that comes with this patch is the basic one which allows
only for recognition of strictly formatted date and time in selection
criteria (the format follows a subset of formats defined by ISO 8601):
date time timezone
date:
YYYY-MM-DD (or shortly YYYYMMDD)
YYYY-MM (shortly YYYYMM), auto DD=1
YYYY, auto MM=01 and DD=01
time:
hh:mm:ss (or shortly hhmmss)
hh:mm (or shortly hhmm), auto ss=0
hh (or shortly hh), auto mm=0, auto ss=0
timezone (always with + or - sign):
+hh:mm or -hh:mm (or shortly +hhmm or -hhmm)
+hh or -hh
Or directly the time (number of seconds) since Epoch (1970-01-01 00:00:00 UTC)
when the number value is prefixed by "@":
@number_of_seconds_since_epoch
This patch also adds aliases for comparison operators
used together with time values which are more intuitive
to use:
since (as alias for >=)
after (as alias for >)
until (as alias for <=)
before (as alias for <)
For example:
$ lvmconfig --type full report/time_format
time_format="%Y-%m-%d %T %z %Z [%s]"
$ lvs -o name,time vg
LV Time
lvol0 2015-06-28 21:25:41 +0200 CEST [1435519541]
lvol1 2015-06-30 03:25:43 +0200 CEST [1435627543]
lvol2 2015-04-26 14:52:20 +0200 CEST [1430052740]
lvol3 2015-06-30 14:52:23 +0200 CEST [1435668743]
$ lvs vg -o name,time -S 'time since "2015-04-26 15:00" && time until "2015-06-30"'
LV Time
lvol0 2015-06-28 21:25:41 +0200 CEST [1435519541]
lvol1 2015-06-30 03:25:43 +0200 CEST [1435627543]
lvol3 2015-06-30 14:52:23 +0200 CEST [1435668743]
$ lvs vg -o name,time -S 'time since "2015-04-26 15:00" && time until "2015-06-30 6:00"'
LV Time
lvol0 2015-06-28 21:25:41 +0200 CEST [1435519541]
lvol1 2015-06-30 03:25:43 +0200 CEST [1435627543]
$ lvs vg -o name,time -S 'time since @1435519541'
LV Time
lvol0 2015-06-28 21:25:41 +0200 CEST [1435519541]
lvol1 2015-06-30 03:25:43 +0200 CEST [1435627543]
lvol3 2015-06-30 14:52:23 +0200 CEST [1435668743]
This is basic time recognition support that is directly a part of
libdevmapper. Recognition of more free-form expressions will be a
part of subsequent patches.
Just like we have DEFAULT_USE_LVMETAD (or DEFUALT_USE_LVMPOLLD), use
fallback_to_lvm1=1 lvm.conf setting if we configured lvm2 with
--enable-lvm1-fallback and use fallback_to_lvm1=0 otherwise.
Also, generate proper lvm.conf.in with unconfigured value.
This patch allows for registration and recognition of reserved
values which are ranges, so they're composed of two values actually
to denote the lower and upper bound for the range (stored as an array
with exactly two items to define the boundaries).
Also, this patch allows for flagging reserved values as named-only
which means that such values are not strictly reserved. The strictly
reserved values are reserved values as used before this patch.
Distinction between strictly-reserved and named-only values
is clearly visible with comparisons. Normally, strictly reserved
value is not accounted for if we do "greater than" or "lower than"
comparisons, for example:
1 2 3 ....
|
abc
- we have "abc" as reserved value for field with value "2"
- the value reported for the field is "abc" (or "2", it doesn't matter here)
- the selection we're processing is -S 'field < abc'
- the result of the selection gives nothing as "abc" is strictly
reserved value (bound to "2") and there's no order defined for
it and it would only match if we directly compared the value
(so -S 'field = abc' would match)
With named-only values, the "abc" is named-only value for "2",
so selection -S 'field < abc" is the same as using -S 'field < 2'.
The "abc" is just an alias for some value so the value or its
assigned name can be used equally in selection criteria.
Make it possible to define format for time that is displayed.
The way the format is defined is equal to the way that is used
for strftime function, although not all formatting options as
used in strftime are available for LVM2 - the set is restricted
(e.g. we do not allow newline to be printed). The lvm.conf
comments contain the whole list that LVM2 accepts for time format
together with brief description (copied from strftime man page).
For example:
(defaults used - the format is the same as used before this patch)
$ lvs -o+time vg/lvol0 vg/lvol1
LV VG Attr LSize Time
lvol0 vg -wi-a----- 4.00m 2015-06-25 16:18:34 +0200
lvol1 vg -wi-a----- 4.00m 2015-06-29 09:17:11 +0200
(using 'time_format = "@%s"' in lvm.conf - number of seconds
since the Epoch)
$ lvs -o+time vg/lvol0 vg/lvol1
LV VG Attr LSize Time
lvol0 vg -wi-a----- 4.00m @1435241914
lvol1 vg -wi-a----- 4.00m @1435562231
Commit e587b0677b broke the build on
systems where /bin/sh is Dash, for example.
Origin patch by Ferenc Wágner <wferi@niif.hu> changed later to
avoid using shell call, so makefile add 'server' target when
one of metad or polld daemon is requested.
Do not display settings with undefined default values, but do display
these settings in case the value is defined directly in any part of
the existing config cascade.
For example, the lvmconfig --type current always displays these settings
(as it's somewhere in "current" configuration cascade that makes it defined).
The lvmconfig --type full displays these settings only if it's defined
somewhere in the cascade, but not if default value is used instead
The lvmconfig --type default never displays these settings...
More concrete example - let's have activation/volume_list directly
set in lvm.conf and activation/read_only_volume_list not set.
Both of these settings have *undefined default* values.
$lvmconfig --type full activation/volume_list activation/read_only_volume_list
volume_list="/dev/vg/lv"
(...only volume_list is defined, hence it's printed)
However, the comments will display more info (see also previous commit):
$lvmconfig --type full activation/volume_list activation/read_only_volume_list --withsummary
# Configuration option activation/volume_list.
# Only LVs selected by this list are activated.
# This configuration option does not have a default value defined.
# Value defined in existing configuration has been used for this setting.
volume_list="/dev/vg/lv"
# Configuration option activation/read_only_volume_list.
# LVs in this list are activated in read-only mode.
# This configuration option does not have a default value defined.
Display comment abour value from existing config being used. For example:
$ lvmconfig --type full --withsummary report/compact_output report/buffered
# Configuration option report/compact_output.
# Do not print empty report fields.
# Value defined in existing configuration has been used for this setting.
compact_output=1
# Configuration option report/buffered.
# Buffer report output.
buffered=1
The lvmconfig --type full is actually a combination of --type current
and --type missing together with --mergedconfig options used.
The overall outcome is a configuration tree with settings as LVM sees
it when it looks for the values - that means, if the setting is defined
in some config source (lvm.conf, --config, lvmlocal.conf or any profile
that is used), the setting is used. Otherwise, if the setting is not
defined in any part of the config cascade, the defaults are used.
The --type full displays exactly this final tree with all the values
defined, either coming from configuration tree or from defaults.
Synchronize with udev logic before reusing device as snapshot.
This patch tries to fix the problem with udev, where we manage
to 'active' LV for clearing, then we deactivate such device and
active again as member of 'origin&snapshot' tree all in 1 step.
There needs to be a sync point where udev has time to remove all links,
otherwise we race with scans and we may end-up with mysterious 'free'
links in the system pointing to wrong dm names.
This patch tries to fix failing topology cluster tests..
We shouldn't be adding spaces by default in output as that
may be be used already in scripts and especially for the eval
in shell scripts where spaces are not allowed between key
and value!
Add --withspaces option to lvmconfig for pretty output with
more space in for readability.
It's not an error to define empty values for
{thin,cache}_{check,repair}_options - such empty value means no
options are passed when these external commands are called.
If blkid wiping is possible, than set use_blkid_wiping=1 and
use_blkid_wiping=0 otherwise for its default value. If blkid wiping
is disabled during configure and use_blkid_wiping=1 is set by chance,
it's simply ignored - this patch is just a cleanup that makes it more
obvious for the user (we use similar logic for use_lvmetad and
use_lvmpolld settings).
Default value for lvmetad and lvmpolld has hooks in configure script,
the "lvmconfig --type default --unconfigured" should display:
use_lvmetad = @DEFAULT_USE_LVMETAD@
use_lvmpolld = @DEFAULT_USE_LVMPOLLD@
Note that these settings are not of string type. Recent change (the
DM_CONFIG_VALUE_FMT_STRING_NO_QUOTES formatting flag) makes it
possible to recognize that the setting is not of string type and if
there's unconfigured value defined for it, the enclosing " " is
automatically removed on output.
Do not use "#S" (blank string) as default value as that ends up with
'key = [ "" ]' to be generated which is not what we want in most cases.
Also, fix default values for global/{thin,cache}_{check,repair}_options
and avoid assigning blank values. For example, the thin_check_options
had this set as default value previously:
"#S" DEFAULT_THIN_CHECK_OPTION1 "#S" DEFAULT_THIN_CHECK_OPTION2
If any (or both) of DEFAULT_THIN_CHECK_OPTION* variables was set
to "", we ended up with clumsy default value generated like:
thin_check_options = [ "-q", "" ]
With this patch, we end up with correct:
thin_check_options = [ "-q" ]
or, if all options are undefined:
thin_check_options = [ ]
Which is the correct way to express this.
There are two basic groups of formatting flags (32 bits):
- common ones applicable for all config value types (lower 16 bits)
- type-related formatting flags (higher 16 bits)
With this patch, we initially support four new flags that
modify the the way the config value is displayed:
Common flags:
=============
DM_CONFIG_VALUE_FMT_COMMON_ARRAY - causes array config values
to be enclosed in "[ ]" even if there's only one item
(previously, there was no way to recognize an array with one
item and scalar value, hence array values with one member
were always displayed without "[ ]" which libdm accepted
when reading, but it may have been misleading for users)
DM_CONFIG_VALUE_FMT_COMMON_EXTRA_SPACE - causes extra spaces to
be inserted in "key = value" (or key = [ value, value, ... ] in
case of arrays), compared to "key=value" seen on output before.
This makes the output more readable for users.
Type-related flags:
===================
DM_CONFIG_VALUE_FMT_INT_OCTAL - prints integers in octal form with
"0" as a prefix (libdm's config reading code can handle this via
strtol just fine so it's properly recognized as number in octal
form already if there's "0" used as prefix)
DM_CONFIG_VALUE_FMT_STRING_NO_QUOTES - makes it possible to print
strings without enclosing " "
This patch also adds dm_config_value_set_format_flags and
dm_config_value_get_format_flags functions to set and get
these formatting flags.
This is the client side handling of the global_invalid state
added to lvmetad in commit c595b50cec8a6b95c6ac4988912d1412f3cc0237.
The function added here:
. checks if the global state in lvmetad is invalid
. if so, scans disks to update the state in lvmetad
. clears the global_invalid flag in lvmetad
. updates the local udev db to reflect any changes
and update the lvmetad copy after it is reread from disk.
This is the client side handling of the vg_invalid state
added to lvmetad in commit c595b50cec8a6b95c6ac4988912d1412f3cc0237.
Add the ability to invalidate global or individual VG metadata.
The invalid state is returned to lvm commands along with the metadata.
This allows lvm commands to detect stale metadata from the cache and
reread the latest metadata from disk (in a subsequent patch.)
These changes do not change the protocol or compatibility between
lvm commands and lvmetad.
Global information
------------------
Global information refers to metadata that is not isolated
to a single VG , e.g. the list of vg names, or the list of pvs.
When an external system, e.g. a locking system, detects that global
information has been changed from another host (e.g. a new vg has been
created) it sends lvmetad the message: set_global_info: global_invalid=1.
lvmetad sets the global invalid flag to indicate that its cached data is
stale.
When lvm commands request information from lvmetad, lvmetad returns the
cached information, along with an additional top-level config node called
"global_invalid". This new info tells the lvm command that the cached
information is stale.
When an lvm command sees global_invalid from lvmated, it knows it should
rescan devices and update lvmetad with the latest information. When this
is complete, it sends lvmetad the message: set_global_info:
global_invalid=0, and lvmetad clears the global invalid flag. Further lvm
commands will use the lvmetad cache until it is invalidated again.
The most common commands that cause global invalidation are vgcreate and
vgextend. These are uncommon compared to commands that report global
information, e.g. vgs. So, the percentage of lvmetad replies containing
global_invalid should be very small.
VG information
--------------
VG information refers to metadata that is isolated to a single VG,
e.g. an LV or the size of an LV.
When an external system determines that VG information has been changed
from another host (e.g. an lvcreate or lvresize), it sends lvmetad the
message: set_vg_info: uuid=X version=N. X is the VG uuid, and N is the
latest VG seqno that was written. lvmetad checks the seqno of its cached
VG, and if the version from the message is newer, it sets an invalid flag
for the cached VG. The invalid flag, along with the newer seqno are saved
in a new vg_info struct.
When lvm commands request VG metadata from lvmetad, lvmetad includes the
invalid flag along with the VG metadata. The lvm command checks for this
flag, and rereads the VG from disk if set. The VG read from disk is sent
to lvmetad. lvmetad sees that the seqno in the new version matches the
seqno from the last set_vg_info message, and clears the vg invalid flag.
Further lvm commands will use the VG metadata from lvmetad until it is
next invalidated.
Since our test environment runs also in non-real-udev world,
it's using /etc/.cache file with scanned files.
So in this case it is mandatory the user runs 'vgscan'
after a device reappears in the system.
This 'first' lvm2 command then fixes metadata (just like vgs did).
Use of display_lvname() in plain log_debug() may accumulate memory in
command context mempool. Use instead small ringbuffer which allows to
store cuple (10 ATM) names so upto 10 full names can be used at one.
We are not keeping full VG/LV names as it may eventually consume larger
amount of RAM resouces if vgname is longer and lots of LVs are in use.
Note: if there would be ever needed for displaing more names at once,
the limit should be raised (e.g. log_debug() would need to print more
then 10 LVs on a single line).
With thin-pool kernel target module 1.13 it's now support usage of
external origin with sizes which are not 'alligned' with chunk size
of thin-pool.
Enable lvm2 support for this and also fix reporting of data_percent
usage for case sizes are not alligned.
Note that this is just a quick fix and it needs more robust fix to
encompass any combination, not just the (old) snapshot one!
This started with this report:
https://bugzilla.redhat.com/show_bug.cgi?id=1219222
If we have devices/ignore_suspended_devices=1 set based on which we
filter out suspended devices as unusable (or if we ignore suspended
devices by force, e.g. during lvconvert called from dmeventd) and
when we have snapshot and snapshot origin devices in the play, we
need to look at their components unerneath (*-real and *-cow) to
check if they're not suspended. If they are, the snapshot/snapshot
origin is not usable as well and hence it needs to be filtered out
by filter-usable.c code which does suspended device filtering.
Not going into much details here, more details are in the bugzilla
mentioned above. However, this is a quick fix since snapshot
and this exact situation is not the only one. So this is
something that needs to be revisited and fixed properly
with full dm tree and checking the whole stack to state
whether the device at the very top is usable or not.
This patch fixes segfault which was caused by incorrectly marking some
settings CFG_DEFAULT_COMMENTED instead of CFG_DEFAULT_UNDEFINED - the
ones which have NULL default value, hence they're really undefined.
A regression caused by a98ceceb1d.
For example:
$ lvmconfig log/file
file="/a"
Before this patch:
$ lvmconfig --type diff
Segmentation fault (core dumped)
With this patch applied:
$ lvmconfig --type diff
log {
file="/a"
}
The same applies for these settings:
log/activate_file
global/library_dir
global/system_id_file
<disk_area>/disk_area_id
There were also other settings with NULL default value and marked as
CFG_DEFAULT_COMMENTED instead of CFG_DEFAULT_UNDEFINED, but they were
cfg_array config settings where the NULL value was not causing segfault
(NULL == empty array).
Just as 'e' means activation with an exclusive lock,
add an 's' to mean activation with a shared lock.
This allows the existing but implicit behavior of '-ay'
of clvm LVs to be specified explicitly. For local VGs,
asy simply means ay, just like aey means ay.
For local VGs, ay == aey == asy
For clvm VGs, ay == asy, aey == aey, asy == asy
The hyphens are removed from long option names before
being read. This means that:
- Option name specifications in args.h must not include hyphens.
(The hyphen in 'use-policies' is removed.)
- A user can include hyphens anywhere in the option name.
All the following are equivalent:
--vgmetadatacopies,
--vg-metadata-copies,
--v-g-m-e-t-a-d-a-t-a-c-o-p-i-e-s-
lvmetad_init() should not be called with open connection to the daemon.
Doing so is considered to be an internall error within lvm2 code.
Such coincidence can't occur within current code. Let's assure us it won't
ever happen.
Some of descritpions were misleading at least. Some were completely
off the reality.
lvmetad_init doesn't re-establish or initialise a connection
lvmetad_active and lvmetad_connect_or_warn can do so.
There are reports of unexplained ioctl failures when using dmeventd.
An explanation might be that the wrong value of errno is being used.
Change libdevmapper to store an errno set by from dm ioctl() directly
and provide it to the caller through a new dm_task_get_errno() function.
[Replaced f9510548667754d9209b232348ccd2d806c0f1d8]
Commit b00711e312 improperly
convert _area_missing() replacment and moved check for
AREA_PV seg_type() into same if() section.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
If lvmetad is not used, we generate lvm2-activation{-early,-net}.service
systemd services to activate any VGs found on the system. So far we used
--sysinit during this activation as polling was still forked off of the
lvm activation command.
This has changed with lvmpolld - we have proper lvmpolld systemd
service now (activated via its socket unit). As such, we don't need
to use --sysinit anymore during activation in systemd environment
as polling was the only barrier to remove the need for --sysinit.
There's a race when asking lvmpolld about progress_status and
actually reading the progress info from kernel:
Even with lvmpolld being used we read status info from
LVM2 command issued by a user (client side from lvmpolld perspective).
The whole cycle may look like following:
1) set up an operation that requires polling (i.e. pvmove /dev/sda)
2) notify lvmpolld about such operation (lvmpolld_poll_init())
3) in case 1) was not called with --background it would continue with:
4) Ask lvmpolld about progress status. it may respond with one of:
a) in_progress
b) not_found
c) finished
d) any low level error
5) provided the answer was 4a) try to read progress info from polling LV
(i.e. vg00/pvmove1). Repeat steps 4) and 5) until the answer is != 4a).
And now we got into racy configuration: lvmpolld answered with in_progress
but it may be the that in_between 4) and 5) the operation has already
finished and polling LV is already gone or there's nothing to ask for.
Up to now, 5) would report warning and it could print such warning many
times if --interval was set to 0.
We don't want to scary users by warnings in such situation so let's just
print these messages in verbose mode. Error messages due to error while
reading kernel status info (on existing, active and locked LV) remained
the same.
Avoid using make's $(shell invocation since the eval order is
then somewhat different and use $$( subshell.
This also fixes a problem when more then one symbol is found,
since target shell has been given separate word list
so the 'R' assignment would need "" around it.
currently in wait_for_single_lv() fn trying to poll missing pvmove LV
is considered success. It may have been already finished by another
instance of polldaemon. either by another forked off polldaemon
or by lvmpolld.
Let's try to handle the mirror conversion and snapshot merge the same
way.
These wrappers have been replaced by direct calls
to vg_read() and find_lv() in previous commits.
This commit should have no functional impact since
all bits were already unreachable.
let's call dev_close_all() only before we're about to 'sleep'
for at least one second during the polling.
(it's questionable whether to call dev_close_all() at all in
polldaemon code. Natural extension would be to drop it completely)
Previous patch incorrectly skipped replace of @LOCALEDIR@.
The standard option is --localedir so use --with-localedir
as backward compatible option and set localedir if it's not
yet been set (if the could ever happen).
Use double-eval to translate $datarootdir to $prefix to real dir.
More exact clean of library exported symbols files.
Also use $(firstword) test to check for empty string
so 'make clean' has now cleaner condensed look.
Clean also created include links.
Possibly easier to follow - to have just a single dependency line
and use if() within rule.
Also replace $(words) with $(firstword) which is more commonly used.
Set LVM_TEST_THIN_REPAIR_CMD to /bin/false for test which
doesn't need it.
This way - even if on the system there is no such tool present,
test will not result with warning about missing tool.
Also remove from Makefile settings of TEST vars which are set in
through /lib/paths - this also allows to override them in test.
as of now lvmpolld works as client utility for
querying running instance of lvmpolld server
on metadata, state, etc.
Currently the only request implemented is the '--dump'.
It prints out full lvmpolld state (mimics lvmdump -p command).
we don't want to fail properly set pvmove after metadata
update. failure to copy id components could end with dangling
mirror moving PV segments but no monitoring from lvmpolld or
classical polldaemon.
lvpoll now process passed LV name properly. It respects
LVM_VG_NAME env. variable and is able to process LV name
passed in various formats:
- VG/LV
- LV name only (with LVM_VG_NAME set)
- /dev/mapper/VG-LV
- /dev/VG/LV
Use CFG_DEFAULT_COMMENTED and CFG_DEFAULT_UNDEFINED to
replicate the existing comments in example.conf.
Fix host_list to be cfg_array.
UNDEFINED is only used if the value depends on other
system/kernel values outside of lvm. The most common
case is when dm-thin or dm-cache have built-in default
settings in the kernel, and lvm will use those built-in
default values unless the corresponding lvm config setting
is set.
COMMENTED is used to comment out the default setting in
lvm.conf. The effect is that if the LVM version is
upgraded, and the new version of LVM has new built-in
default values, the new defaults are used by LVM unless
the previous default value was set (uncommented) in lvm.conf.
Introduce new implmentation of dm_task_get_info() function
with support for reading internal_suspend.
.
This time it is done in a 'versioned' way.
We keep the old fashion dm_task_get_info(Base) to implement
the old behavior of 1.02.95 libdm code.
libdm version 1.02.96 introduced 'macro' wrapper
dm_task_get_info_with_deferred_remove with new implementation
of dm_task_get_info() - we cannot do anything else then to
provide compatible version of this symbol.
Now in version 1.02.97 we add new versioned implementation of
dm_task_get_info(DM_1_02_97) symbol.
This has the effect that i.e. rpm build will finaly resolve proper
dependency on a new symbol - so it will be no longer possible,
to build a new binary and use old library
(rpm -q --provides will show libdevmapper.so.1.02(DM_1_02_97)(64bit))
Also the history is now tracked. If a new function is added (or
reimplemented), it needs to be placed in proper file,
so it could be exported with right versioning symbol.
File .exported_symbols.Base should and any existing older DM
should be treated as read-only after a release.
Also - only libdm has been currently enhanced with versioned .Base
file, as soon as other libs (liblvm, libdevmapper-event) needs changes
they should also get their exported symbol files - meanwhile
make.tmpl handles both cases.
Since now we enable those by default when compiled with those daemons,
explicitely disable them in tests when needed.
Alphabetically sort configurables.
Basic support for upstream 'build' of rpm packages.
Make spec file generated.
2 new simple targets:
make dist - create LVM2.MAJOR.MINOR.PATCHLEVEL.tgz from git files.
make rpm - some generic rpmbuilder using spec files.
Create packages in build/ subdir.
DO NOT USE created rpms in any distribution!
Configure provides proper settings for
use_lvmetad and use_lvmpolld conf setttings.
When the build of polld & lvmetad, these settings
are enabled by default unless explicitelly disabled
with --disable-use-lvmetad/--disable-use-lvmpolld.
This is an alternative/equivalent to commit
ca67cf84df
The problem (wrong label->dev after a new preferred
duplicate device is chosen) was isolated to the lvmetad
case (non-lvmetad worked fine), and this fixes the problem
by setting the new label->dev in the lvmetad-specific
code rather than in the general lvmcache code.
In process_each_{vg,lv,pv} when no vgname args are given,
the first step is to get a list of all vgid/vgname on the
system. This is exactly what lvmetad returns from a
vg_list request. The current code is doing a vg_lookup
on each VG after the vg_list and populating lvmcache with
the info for each VG. These preliminary vg_lookup's are
unnecessary, because they will be done again when the
processing functions call vg_read. This patch eliminates
the initial round of vg_lookup's, which can roughly cut in
half the number of lvmetad requests and save a lot of extra work.
Use 64bit arithmentic for PV size calculation (Coverity).
Also remove sector shift for compared PV size, since all
values are already held in sectors.
This fixes validatio of PV size when restoring PV
from vg metadata backup file.
Improve the python unit test case to cover all of the properties of a LV and
the properties of a LV segment.
In addition we also add a 'tag' to the lv so that we can retrieve it
using the 'lv_tags' property to ensure that this works as expected.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Synopsis: STR_LIST needs to be treated as STR for properties.
For any lvm property that was internally 'typed' as a string list we were failing
to return a string in the property API. This was due to the fact that for the
properties to work the value needs to either be evaulated as a string or a
number. This change corrects the macro used to build the memory array of
structures so that the string bitfield is set as needed to ensure that the value
is a string.
https://bugzilla.redhat.com/show_bug.cgi?id=1139920
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When retrieving a property value that is a string, if the character pointer in C
was NULL, we would segfault. This change checks for non-null before creating a
python string representation. In the case where the character pointer is NULL
we will return a python 'None' for the value.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
With the lvm2app C API adding the ability to determine when a property is
signed we can then use this information to construct the correct representation
of the number for python which will maintain value and sign. Previously, we
only represented the numbers in python as positive integers.
Python long type exceeds the range for unsigned and signed integers, we just
need to use the appropriate parsing code to build correctly.
Python part of the fix for:
https://bugzilla.redhat.com/show_bug.cgi?id=838257
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Currently lvm2app properties have the following structure:
typedef struct lvm_property_value {
uint32_t is_settable:1;
uint32_t is_string:1;
uint32_t is_integer:1;
uint32_t is_valid:1;
uint32_t padding:28;
union {
const char *string;
uint64_t integer;
} value;
} lvm_property_value_t;
which assumes that numerical values were in the range of 0 to 2**64-1. However,
some of the properties were 'signed', like LV major/minor numbers and some
reserved values for properties that represent percentages. Thus when the
values were retrieved they were in two's complement notation. So for a -1
major number the API user would get a value of 18446744073709551615. The
API user could cast the returned value to an int64_t to handle this, but that
requires the API developer to look at the source code and determine when it
should be done.
This change modifies the return property structure to:
typedef struct lvm_property_value {
uint32_t is_settable:1;
uint32_t is_string:1;
uint32_t is_integer:1;
uint32_t is_valid:1;
uint32_t is_signed:1;
uint32_t padding:27;
union {
const char *string;
uint64_t integer;
int64_t signed_integer;
} value;
} lvm_property_value_t;
With this addition the API user can interrogate that the value is numerical,
(is_integer = 1) and subsequently check if it's signed (is_signed = 1) too.
If signed, then the API developer should use the union's signed_integer to
avoid casting.
This change maintains backwards compatibility as the structure size remains
unchanged and integer value remains unchanged. Only the additional bit
taken from the pad is utilized.
Bugzilla reference:
https://bugzilla.redhat.com/show_bug.cgi?id=838257
Signed-off-by: Tony Asleson <tasleson@redhat.com>
querying future lvmpolld with zero wait time is highly undesirable
and can cause serious performance drop of the future daemon. The new
wrapper function may avoid immediate return from syscal by
introducing minimal wait time on demand.
Routines responsible for polling of in-progress pvmove, snapshot merge
or mirror conversion each used custom lookup functions to find vg and
lv involved in polling.
Especially pvmove used pvname to lookup pvmove in-progress. The future
lvmpolld will poll each operation by vg/lv name (internally by lvid).
Also there're plans to make pvmove able to move non-overlaping ranges
of extents instead of single PVs as of now. This would also require
to identify the opertion in different manner.
The poll_operation_id structure together with daemon_parms structure they
identify unambiguously the polling task.
Waiting even after _check_lv_status returned success and
'finished' flag was set to true doesn't make much sense.
Note that while we skip the wait() we also skip the
init_full_scan_done(0) inside the routine. This should
have no impact as long as the code after _wait_for_single_lv
doesn't presume anything about the state of the cache.
as a part of bigger effort to unify polling intefaces
poll_get_copy_lv should be able to look up LVs based
on theirs lv->status field.
Effective after pvmove starts using poll_get_copy_lv
fn as well.
When kernel target reports sync status as 0% it might as well mean
it's 100% in sync, just the target is in some race inconsistent
state - so reread once again and take a more optimistic value ;)
Patch tries to work around:
https://bugzilla.redhat.com/show_bug.cgi?id=1210637
Reinstate config settings matching the last release until every
case where the generator produces different output has been reviewed
and fresh decisions made about which defaults to expose as protection
against changes in newer releases. We should be trying to reduce, not
increase, this number.
Introduce LVM_TEST_LVMETAD_DEBUG_OPTS to allow to override
default debug opts for lvmetad.
However could be still overloaded on command line:
make check_lvmetad LVM_TEST_LVMETAD_DEBUG_OPTS="-l all"...
Better name for aux function.
First use normal -TERM, and only after a while use -KILL
(leaving some time to correctly finish)
Print INFO about killed processes.
This patch adds supporting code for handling deprecated settings.
Deprecated settings are not displayed by default in lvmconfig output
(except for --type current and --type diff). There's a new
"--showdeprecated" lvmconfig option to display them if needed.
Also, when using lvmconfig --withcomments, the comments with info
about deprecation are displayed for deprecated settings and with
lvmconfig --withversions, the version in which the setting was
deprecated is displayed in addition to the version of introduction.
If using --atversion with a version that is lower than the one
in which the setting was deprecated, the setting is then considered
as not deprecated (simply because at that version it was not
deprecated).
For example:
$ lvmconfig --type default activation
activation {
...
raid_region_size=512
...
}
$ lvmconfig --type default activation --showdeprecated
activation {
...
mirror_region_size=512
raid_region_size=512
...
}
$ lvmconfig --type default activation --showdeprecated --withversions
activation {
...
# Available since version 1.0.0.
# Deprecated since version 2.2.99.
mirror_region_size=512
# Available since version 2.2.99.
raid_region_size=512
...
}
$ lvmconfig --type default activation --showdeprecated --withcomments
activation {
...
# Configuration option activation/mirror_region_size.
# This has been replaced by the activation/raid_region_size
# setting.
# Size (in KB) of each copy operation when mirroring.
# This configuration option is deprecated.
mirror_region_size=512
# Configuration option activation/raid_region_size.
# Size in KiB of each raid or mirror synchronization region.
# For raid or mirror segment types, this is the amount of
# data that is copied at once when initializing, or moved
# at once by pvmove.
raid_region_size=512
...
}
$ lvmconfig --type default activation --withcomments --atversion 2.2.98
activation {
...
# Configuration option activation/mirror_region_size.
# Size (in KB) of each copy operation when mirroring.
mirror_region_size=512
...
}
A preparatory code for marking configuration nodes as deprecated:
- struct cfg_def_item gains 2 new fields ("deprecated_since_version" and "deprecation_comment"
- cfg* macros to handle new fields
- related config_settings.h edits to add new fields for each item (null for all at the moment)
Patch with implementation will follow...
Before this patch:
$ lvmconfig --type list --withversions --withsummary global/use_lvmetad
global/use_lvmetad - Use lvmetad to cache metadata and reduce disk scanning. [2.2.93]
$ lvmconfig --type list --withversions global/use_lvmetad
global/use_lvmetad
With this patch applied:
$ lvmconfig --type list --withversions --withsummary global/use_lvmetad
global/use_lvmetad - Use lvmetad to cache metadata and reduce disk scanning. [2.2.93]
$ lvmconfig --type list --withversions global/use_lvmetad
global/use_lvmetad - [2.2.93]
We're commenting out settings with undefined default values.
The comment character '#' was printed at the very beginning of
the line, it should be placed just at the beginning of the setting,
after the space/tab prefix is printed.
Before this patch:
$ lvmconfig --type default activation
activation {
...
# volume_list=[]
...
}
With this patch applied:
$ lvmconfig --type default activation
activation {
...
# volume_list=[]
...
}
New lvmconf function is using bash associative arrays - however
older systems like RHEL5 doesn't provide this feature. In this case
stay with older variant.
Restore support for use case like this:
aux lvmconf 'tags/@foo {}'
works with systemd activated daemons only as of now
each daemon implementation may decide to signalize its
internal idle state (i.e. all background tasks unrelated to
client threads are finished)
These settings are in the "unsupported" group:
devices/loopfiles
log/activate_file
metadata/disk_areas (section)
metadata/disk_areas/<disk_area> (section)
metadata/disk_areas/<disk_area>/size
metadata/disk_areas/<disk_area>/id
These settings are in the "advanced" group:
devices/dir
devices/scan
devices/types
global/proc
activation/missing_stripe_filler
activation/mlock_filter
metadata/pvmetadatacopies
metadata/pvmetadataignore
metadata/stripesize
metadata/dirs
Also, this patch causes the --ignoreunsupported and --ignoreadvanced
switches to be honoured for all config types (lvmconfig --type).
By default, the --type current and --type diff display unsupported
settings, the other types ignore them - this patch also introduces
--showunsupported switch for all these other types to display even
unsupported settings in their output if needed.
lvmconfig --type list displays plain list of configuration settings.
Some of the existing decorations can be used (--withsummary and
--withversions) as well as existing options/switches (--ignoreadvanced,
--ignoreunsupported, --ignorelocal, --atversion).
For example (displaying only "config" section so the list is not long):
$lvmconfig --type list config
config/checks
config/abort_on_errors
config/profile_dir
$ lvmconfig --type list --withsummary config
config/checks - If enabled, any LVM configuration mismatch is reported.
config/abort_on_errors - Abort the LVM process if a configuration mismatch is found.
config/profile_dir - Directory where LVM looks for configuration profiles.
$ lvmconfig -l config
config/checks - If enabled, any LVM configuration mismatch is reported.
config/abort_on_errors - Abort the LVM process if a configuration mismatch is found.
config/profile_dir - Directory where LVM looks for configuration profiles.
$ lvmconfig --type list --withsummary --withversions config
config/checks - If enabled, any LVM configuration mismatch is reported. [2.2.99]
config/abort_on_errors - Abort the LVM process if a configuration mismatch is found. [2.2.99]
config/profile_dir - Directory where LVM looks for configuration profiles. [2.2.99]
Example with --atversion (displaying global section):
$ lvmconfig --type list global
global/umask
global/test
global/units
global/si_unit_consistency
global/suffix
global/activation
global/fallback_to_lvm1
global/format
global/format_libraries
global/segment_libraries
global/proc
global/etc
global/locking_type
global/wait_for_locks
global/fallback_to_clustered_locking
global/fallback_to_local_locking
global/locking_dir
global/prioritise_write_locks
global/library_dir
global/locking_library
global/abort_on_internal_errors
global/detect_internal_vg_cache_corruption
global/metadata_read_only
global/mirror_segtype_default
global/raid10_segtype_default
global/sparse_segtype_default
global/lvdisplay_shows_full_device_path
global/use_lvmetad
global/thin_check_executable
global/thin_dump_executable
global/thin_repair_executable
global/thin_check_options
global/thin_repair_options
global/thin_disabled_features
global/cache_check_executable
global/cache_dump_executable
global/cache_repair_executable
global/cache_check_options
global/cache_repair_options
global/system_id_source
global/system_id_file
$ lvmconfig --type list global --atversion 2.2.50
global/umask
global/test
global/units
global/suffix
global/activation
global/fallback_to_lvm1
global/format
global/format_libraries
global/segment_libraries
global/proc
global/locking_type
global/wait_for_locks
global/fallback_to_clustered_locking
global/fallback_to_local_locking
global/locking_dir
global/library_dir
global/locking_library
some tests left dangling bg processes originating in
lvm2 commands being able to spawn any bg polling process
(lvchange, vgchange, pvmove, lvconvert...)
Initial fn 'add_to_kill_list' should collect processes with
specific parameters (proc's command line and parent processes ID).
After testing finishes the fn kill_listed_processes should remove these
listed by 'add_to_kill_list'.
Unfortunately it proved to be prone to an error especially in scenarios
where cmd line of initiating command contained characters required to
be espaced before passing to shell script to make it work correctly.
(Or if cmd spawned more than one bg process with same cmd line. i.e.:
vgchange or lvchange).
The new implementation is much simpler. It uses env. variable (LVM_TEST_TAG)
for marking a process desired to be killed later or during test env. teardown.
(i.e.: LVM_TEST_TAG=kill_me_$PREFIX to kill only processes related to
current test environment)
'lvm dumpconfig' now does a lot more than just dumping configuration
information and is no longer only a support tool. Users now need
to run it to find out about configuration information that has been
removed from the lvm.conf man page so we need to promote this to full
command line status as 'lvmconfig'. Also accept 'lvm config' and mention
it in the usage information of lvmconf (which should also get merged in
eventually).
Example:
/dev/loop0 and /dev/loop1 are duplicates,
created by copying one backing file to the
other.
'identity /dev/loopX' creates an identity
mapping for loopX named idmloopX, which
adds a duplicate for the named device.
The duplicate selection code for lvmetad is
incomplete, and lvmetad is disabled for this
example.
[~]# losetup -f loopfile0
[~]# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 foo lvm2 a-- 308.00m 296.00m
[~]# losetup -f loopfile1
[~]# pvs
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/loop1 not /dev/loop0
Using duplicate PV /dev/loop1 which is more recent, replacing /dev/loop0
PV VG Fmt Attr PSize PFree
/dev/loop1 foo lvm2 a-- 308.00m 308.00m
[~]# ./identity /dev/loop0
[~]# pvs
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/loop1 not /dev/loop0
Using duplicate PV /dev/loop1 without holders, replacing /dev/loop0
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/mapper/idmloop0 not /dev/loop1
Using duplicate PV /dev/mapper/idmloop0 from subsystem DM, replacing /dev/loop1
PV VG Fmt Attr PSize PFree
/dev/mapper/idmloop0 foo lvm2 a-- 308.00m 296.00m
[~]# ./identity /dev/loop1
[~]# pvs
WARNING: duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV is being used from both devices /dev/loop0 and /dev/loop1
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/loop1 not /dev/loop0
Using duplicate PV /dev/loop1 which is more recent, replacing /dev/loop0
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/mapper/idmloop0 not /dev/loop1
Using duplicate PV /dev/mapper/idmloop0 from subsystem DM, replacing /dev/loop1
Found duplicate PV LnSOEqzEYED3RvIOa5PZP2s7uyuBLmAV: using /dev/mapper/idmloop1 not /dev/mapper/idmloop0
Using duplicate PV /dev/mapper/idmloop1 which is more recent, replacing /dev/mapper/idmloop0
PV VG Fmt Attr PSize PFree
/dev/mapper/idmloop1 foo lvm2 a-- 308.00m 308.00m
Describe
thin_check_options, thin_repair_options,
cache_check_option, scache_repair_options
as a "list of options", rather than a "string of options"
because a single string, e.g. "-q --clear-needs-check-flag"
does not work, and needs to be entered as a list,
e.g. ["-q", "--clear-needs-check-flag"]
The settings which have their default value evaluated in runtime should
have their 'unconfigured' counterparts also evaluated in runtime since
those values can be constructed by using other settings.
For example, before this patch:
$ lvm dumpconfig --type default --unconfigured devices/cache_dir devices/cache
cache_dir="@DEFAULT_SYS_DIR@/@DEFAULT_CACHE_SUBDIR@"
cache="/etc/lvm/cache/.cache
With this patch applied:
$ lvm dumpconfig --type default --unconfigured devices/cache_dir devices/cache
cache_dir="@DEFAULT_SYS_DIR@/@DEFAULT_CACHE_SUBDIR@"
cache="@DEFAULT_SYS_DIR@/@DEFAULT_CACHE_SUBDIR@/.cache"
The @something@ used for unconfigured default value is not bound to
CFG_TYPE_STRING settings defined in config_settings.h, it can be
used for any other config type too.
When test is executed on real device - lets try a more complete
cleanup - discard whole device first and try to wipe any
headers it might be left from previous test.
Show full chain of ancestors and descendants for snapshots
(both thick and thin - in case of thick, the "ancestor" field
is actually equal to "origin" field as snapshots can't be
chained for thick snapshots).
These fields display current state as it is, they do not
display any history! If the snapshot chain is broken in
the middle, we don't report the historical origin (this
is going to be a part of another patch and a different
set of fields or just a switch for existing fields to
show ancestors and descendants with history included).
For example:
(origin --> snapshot)
lvol1 --> lvol2 --> lvol3 --> lvol4
\
--> lvol5 --> lvol6 --> lvol7 --> lvol8
$ lvs -o name,pool_lv,origin,ancestors,descendants vg
LV Pool Origin Ancestors Descendants
lvol1 pool lvol2,lvol3,lvol4,lvol5,lvol6,lvol7,lvol8
lvol2 pool lvol1 lvol1 lvol3,lvol4,lvol5,lvol6,lvol7,lvol8
lvol3 pool lvol2 lvol2,lvol1 lvol4
lvol4 pool lvol3 lvol3,lvol2,lvol1
lvol5 pool lvol2 lvol2,lvol1 lvol6,lvol7,lvol8
lvol6 pool lvol5 lvol5,lvol2,lvol1 lvol7,lvol8
lvol7 pool lvol6 lvol6,lvol5,lvol2,lvol1 lvol8
lvol8 pool lvol7 lvol7,lvol6,lvol5,lvol2,lvol1
Scenario:
$ vgs -o+vg_mda_copies
VG #PV #LV #SN Attr VSize VFree #VMdaCps
fedora 1 2 0 wz--n- 9.51g 0 unmanaged
vg 16 9 0 wz--n- 1.94g 1.83g 2
$ lvs -o+read_ahead vg/lvol6 vg/lvol7
LV VG Attr LSize Pool Origin Data% Rahead
lvol6 vg Vwi-a-tz-- 1.00g pool lvol5 0.00 auto
lvol7 vg Vwi---tz-k 1.00g pool lvol6 256.00k
Before this patch:
$vgs -o vg_name,vg_mda_copies -S 'vg_mda_copies < unmanaged'
VG #VMdaCps
vg 2
Problem:
Reserved values can be only used with exact match = or !=, not <,<=,>,>=.
In the example above, the "unamanaged" is internally represented as
18446744073709551615, but this should be ignored while not comparing
field directly with "unmanaged" reserved name with = or !=. Users
should not be aware of this internal mapping of the reserved value
name to its internal value and hence it doesn't make sense for such
reserved value to take place in results of <,<=,> and >=.
There's no order defined for reserved values!!! It's a special
*reserved* value that is taken out of the usual value range
of that type.
This is very similar to what we have already fixed with
2f7f6932dc, but it's the other way round
now - we're using reserved value name in selection criteria now
(in the patch 2f7f693, we had concrete value and we compared it
with the reserved value). So this patch completes patch 2f7f693.
This patch also fixes this problem:
$ lvs -o+read_ahead vg/lvol6 vg/lvol7 -S 'read_ahead > 32k'
LV VG Attr LSize Pool Origin Data% Rahead
lvol6 vg Vwi-a-tz-- 1.00g pool lvol5 0.00 auto
lvol7 vg Vwi---tz-k 1.00g pool lvol6 256.00k
Problem:
In the example above, the internal reserved value "auto" is in the
range of selection "> 32k" - it shouldn't match as well. Here the
"auto" is internally represented as MAX_DBL and of course, numerically,
MAX_DBL > 256k. But for users, the reserved value should be uncomparable
to any number so the mapping of the reserved value name to its interna
value is transparent to users. Again, there's no order defined for
reserved values and hence it should never match if using <,<=,>,>=
operators.
This is actually exactly the same problem as already described in
2f7f6932dc, but that patch failed for
size field types because of incorrect internal representation used.
With this patch applied, both problematic scenarios mentioned
above are fixed now:
$ vgs -o vg_name,vg_mda_copies -S 'vg_mda_copies < unmanaged'
(blank)
$ lvs -o+read_ahead vg/lvol6 vg/lvol7 -S 'read_ahead > 32k'
LV VG Attr LSize Pool Origin Rahead
lvol7 vg Vwi---tz-k 1.00g pool lvol6 256.00k
When test suite is used from installed rpm package
we need to handle things better.
This patch is rather first approach - expecting few more
tweaks needed.
At this moment LVM_LOG_FILE_EPOCH with
LVM_EXPECTED_EXIT_STATUS properly deletes debug logs
only for real commands - support for lvm2 API does not yet
exists
By default these are empty strings, so the config settings
should be flagged as undefined, so they will be commented
out of the generated config. Otherwise, the lines:
thin_repair_options=""
cache_repair_options=""
in the dump output cause a warning when processed since
lvm doesn't want an empty string.
Also regenerate lvm.conf.in.
The specific config settings have been removed
from the lvm.conf(5) man page, and replaced with
a description of how to use lvm dumpconfig to
view the settings.
The sample lvm.conf and lvmlocal.conf files are now generated:
example.conf.base - initial ungenerated part of the file
example.conf.gen - generated portion from dumpconfig
example.conf.in - combination of .base and .gen files
example.conf - result of configure processing .in file
lvmlocal.conf.base - initial ungenerated part of the file
lvmlocal.conf.gen - generated portion from dumpconfig
lvmlocal.conf.in - combination of .base and .gen files
lvmlocal.conf - result of configure processing .in file
Do not edit the .in files, but edit config_settings.h
or the .base files, and then use 'make generate' to create
the new .in files.
- configure
with options
- make
creates tools/lvm
- make generate
uses tools/lvm to create example.conf.in and lvmlocal.conf.in
by combining .base files with dumpconfig output.
- configure
with same options as above
creates example.conf and lvmlocal.conf from .in files
With use_lvmetad=0, duplicate PVs /dev/loop0 and /dev/loop1,
where in this example, /dev/loop1 is the cached device
referenced by pv->dev, the command 'pvs /dev/loop0' reports:
Failed to find physical volume "/dev/loop0".
This is because the duplicate PV detection by pvid is
not working because _get_all_devices() is not setting
any dev->pvid for any entries. This is because the
pvid information has not yet been saved in lvmcache.
This is fixed by calling _get_vgnameids_on_system()
before _get_all_devices(), which has the effect of
caching the necessary pvid information.
With this fix, running pvs /dev/loop0, or pvs /dev/loop1,
produces no error and one line of output for the PV (the
device printed is the one cached in pv->dev, in this
example /dev/loop1.)
Running 'pvs /dev/loop0 /dev/loop1' produces no error
and two lines of output, with each device displayed
on one of the lines.
Running 'pvs -a' shows two PVs, one with loop0 and one
with loop1, and both shown as a member of the same VG.
Running 'pvs' shows only one of the duplicate PVs,
and that shows the device cached in pv->dev (loop1).
The above output is what the duplicate handling code
was previously designed to output in commits:
b64da4d8b5 toollib: search for duplicate PVs only when needed
3a7c47af0e toollib: pvs -a should display VG name for each duplicate PV
57d74a45a0 toollib: override the PV device with duplicates
c1f246fedf toollib: handle duplicate pvs in process_in_pv
As a further step after this, we may choose to change
some of those.
For all of these commands, a warning is printed about
the existence of the duplicate PVs:
Found duplicate PV ...: using /dev/loop1 not /dev/loop0
Rename envvar LVM_LOG_FILE_UNLINK_STATUS to LVM_EXPECTED_EXIT_STATUS
and change compare sign from '!' to '>'.
Validate LVM_LOG_FILE_EPOCH and support strictly only
up-to 32 alpha chars. If the content doesn't pass
epoch is simply ignored.
Enhance 'not' to manage autodeletion of log files in right cases.
Use separately marked epoch log files for clvmd and dmeventd.
Properly manage stack tracing for new debug.log names.
Add support for 2 new envvars for internal lvm2 test suite
(though it could be possible usable for other cases)
LVM_LOG_FILE_EPOCH
Whether to add 'epoch' extension that consist from
the envvar 'string' + pid + starttime in kernel units
obtained from /proc/self/stat.
LVM_LOG_FILE_UNLINK_STATUS
Whether to unlink the log depending on return status value,
so if the command is successful the log is automatically
deleted.
API is still for now experimental to catch various issue.
add info for various commits, most significant were:
- toollib: close connection to lvmetad after fork
(fe30658a4d)
- toollib: do not spawn polling in lv_change_activate
(c26d81d6e6)
- pvmove: split pvmove_update_metadata function
(65623b63a2)
- lvconvert: move poll code in before refactoring
(5190f56605)
- pvmove: move poll code in before refactoring
(a098aa419f)
--withfullcomments prints all comment lines for each config option.
--withcomments prints only the first comment line, which should be
a short one-line summary of the option.
Previous commit has made have_cache & have_thin producing
false return value.
Fix it and at the some time provide much better reconfiguring
warning message.
If the test machine is missing needed and configured binaries
it will produce TEST WARNING result.
When a var like LVM_TEST_THIN_CHECK_CMD is set to ""
(which is valid) we need to correctly use '-'.
Otherwise ':-' replaces such value with built-in default.
There are two reasons for this: first, this allows the client side to notice
that some PV has multiple devices associated with it and print appropriate
warnings. Second, if a duplicate device pops up and disappears, after this
change the original connection between the PV and device is not lost.
This removes dependency on lvm binary - if it's not present, all LVM
processing is skipped (shouldn't normally happen because if lvm binary
is missing then there's obviously nothing that would activate it, but
let's make sure).
Without this tight dependency on lvm, the blkdeactivate script can
be packaged with libdevmapper/dmsetup (in contrast to lvm as it was
before) and as such the script can still be used to handle other DM
devices.
Put in pvmove background process into list quickly.
Update API for aux add_to_kill_list()/kill_listed_processes().
Run on 'background' (&) only non-background pvmoves.
If the system is correctly configure (cache & thin tools are present)
avoid 'extra' rebuild of configuration.
On the other hand - if some tool is missing - duplicate ##LVMCONF should
make it more straighforward to see.
sharing connection between parent command and background
processes spawned from parent could lead to occasional failures
due to unexpected corruption in daemon responses sent to either child
or a parent.
lvmetad issued warning about duplicate config values in request.
LVM commands occasionaly failed w/ internal error after receving
corrupted response.
lvmetad connection is renewed when needed after explicit disconnect
in child
spawning a background polling from within the lv_change_activate
fn went to two problems:
1) vgchange should not spawn any background polling until after
the whole activation process for a VG is finished. Otherwise
it could lead to a duplicite request for spawning background
polling. This statement was alredy true with one exception of
mirror up-conversion polling (fixed by this commit).
2) due to current conditions in lv_change_activate lvchange cmd
couldn't start background polling for pvmove LVs if such LV was
about to get activated by the command in the same time.
This commit however doesn't alter the lvchange cmd so that it works same as
vgchange with regard to not to spawn duplicate background pollings per
unique LV.
Reenable TESTDIR & PREFIX replacement.
Since we need to replace string in proper order (1st. @TESTDIR@,
2nd. @PREFIX@), drop map and use plain string.
Drop timestamp logging when 'stacktracing'
This patch adds new options to lvmconf:
--enable-halvm (just like --enable-cluster, but configure LVM
for use in HA LVM - meaning disabling lvmetad and
making sure we have locking_type=1)
--disable-halvm (just like --disable-cluster, but configure LVM
back from HA LVM - meaning enabling lvmetad if
it's enabled by default and making sure we have
default locking type set)
--services (causes clvmd and lvmetad services to be enabled or
disabled appropriately and conforming to the changes
in lvm configuration we've just made with lvmconf)
--mirrorservice (in addition to clvmd and lvmetad services, also
enable or disable cmirrord service appropriately;
this is a separate option because cmirrord is
optional and it doesn't need to be always enabled
when clvmd is enabled)
--startstopservices (in addition to enabling or disabling services,
start and stop these services immediately)
These options are supposed to help users to make their system ready
for cluster with clvmd (active-active) or HA LVM (active-passive) use
while lvmconf script can handle services as well so users don't need
to bother about setting them manually.
Also, before this patch, we hardcoded global/use_lvmetad=0 as default
value in lvmconf script. Howeverm this default may change by just
flipping the value in config_settings.h and we may forget to edit
the lvmconf. It's better to use lvm dumpconfig --type default global/use_lvmetad
to get the actual default value and use this one instead of hardcoded one.
When performing initial allocation (so there is nothing yet to
cling to), use the list of tags in allocation/cling_tag_list to
partition the PVs. We implement this by maintaining a list of
tags that have been "used up" as we proceed and ignoring further
devices that have a tag on the list.
https://bugzilla.redhat.com/983600
Add A_PARTITION_BY_TAGS set when allocated areas should not share tags
with each other and allow _match_pv_tags to accept an alternative list
of tags. (Not used yet.)
Comments from the sample config files are copied into
the comment field of the config settings structure.
This includes only minimal changes to the text.
With this in place, the sample config files can
be generated from 'lvm dumpconfig', and content
for an lvm.conf man page can also be generated.
pv_write is called both to write orphans and to rewrite PV headers
of PVs in VGs. It needs to select the correct VG id so that the
internal cache state gets updated correctly.
It only affected commands that involved further steps after
the pv_write and was often masked because the metadata would
be re-read off disk and correct itself.
"Incorrect metadata area header checksum" warnings appeared.
Example:
Create vg1 containing dev1, dev2 and dev3.
Hide dev1 and dev2 from the system.
Fix up vg1 with vgreduce --removemissing.
Bring back dev1 and dev2.
In a single operation reinstate dev1 and dev2 into vg1 (vgextend).
Done as separate operations (automatically fix-up dev1 and dev2 as orphans,
then vgextend) it worked, but done all in one go the internal cache got
corrupted and warnings about checksum errors appeared.
When clvmd starts, it starts it's own command logging into debug.log.
This is interferring with our other command debug.log.
As as sideeffect we may experience log from command,
followed but lots of zeros and continued with clvmd log.
Fix it by renaming debug.log and now we could also print this trace
to get full list of clvmd activity nicely.
Also improve some post-mortem prints from udevadm and dmsetup to
make the output more usable.
There is no benefit in waking-up all the waiters
when there is no actual change in lock state.
This avoid some unnecessarily ping-pong effects like:
Resource V_LVMTEST15724vg retrying lock in mode:WRITE...
Resource V_LVMTEST15724vg already locked lockid=40, mode:WRITE
Resource V_LVMTEST15724vg retrying lock in mode:WRITE...
Resource V_LVMTEST15724vg already locked lockid=40, mode:WRITE
If the user provides '-m #' (# > 0) with mappings
raid4/5/6, the command silently creates
'#mirrors * #stripes + #parity' image component pairs.
Patch rejects '-m #' altogether for those mappings
in order to avoid LV creation with unexpected layout.
- resolves bz#1209445
Since now we have metadata parts running with normal speed,
we could avoid reinitilising delayed dev for every test.
(Saving seconds on cookie waits...)
If the device name is not found in our metadata,
we cannot call strdup few lines later with NULL name.
More intersting story goes behind how it happens -
pvmove removal is unfortunatelly 'multi-state' process
and at some point (for now) we have in lvm2 metadata
LV pvmove0 as stripe and mirror image as error.
If such metadata are left - we fail with any further removal.
Use common code for error_dev & delay_dev.
Both functions now take list of sectors.
From now on we could delay just 'extent' section, while
keeping running lvm commands fast (having native metadata area).
We cannot print TEST WARNING within test shell script
(since it's running in debug mode and thus always prints it)
Use 'should false' trick let the string printed in this case.
So 'non cluster cases' should now end properly.
If the kernel has 'new lcm()' (3.19) it provides wrong
optimal_io_size value for dm device so lvm2 command cannot
create properly aligned devices.
Use 'should' for this case - so test ends with 'TEST WARNING'.
Commit 80f4b4b803
introduced undesirable side-effects for lvm2app user
which happens to be our own python binding.
It appear obtaing pvs list keeps global lock.
So restricting this to VG_GLOBAL READ locks and skip
the drop skip if WRITE lock is held.
we do not allow 0 interval for pvmove command issued
without parameters with classical polldaemon. It would
query the kernel too often with possibly many pvmoves
in-progress.
So far pvmove_update_metadata (originaly _update_metadata) was
used for both initial and subsequent metadata updates during polling.
With a new polldaemon (lvmpolld) all operations that require polling
have to be split in two parts: The initiating one and the polling one.
The later step will be used from lvm command spawned by lvmpolld to
monitor and advance the mirror on next segment if required.
1) The initiation part is _update_metadata in pvmove.c which performs
only the first update, setting up the pvmove itself in metadata.
2) pvmove_update_metadata in pvmove_poll.c now handles all other
subsequent metadata updates except the last one.
Due to the split we could remove some code. Also some functions were
moved back to pvmove.c as they were suited for initialisation of pvmove
only.
This commit has no impact on functionality. Code required to
be visible outside lvconvert.c is just moved into new file
lvconvert_poll.c and some calls are made non-static and
declared in new header file lvconvert.h
This commit has no impact on functionality. Code required to
be visible outside pvmove.c is just moved into new file
pvmove_poll.c and some calls are made non-static and declared in
new header file pvmove.h
When lvm2-pvscan@.service and lvm2-lvmetad.service are scheduled to be
stopped lvm2-pvscan@.service should be stopped first since pvscan uses
lvmetad.
This is especially important if lvm2-lvmetad.socket is also scheduled to
be stopped as in this case connection requests are suppressed causing
pvscan to fail.
_check_lv_status was called from within dm_list_iterate_items cycle.
This was utterly wrong! _check_lv_status may remove more than one LV from
vg->lvs list we iterated in the same time.
In some scenarios this could lead to deadlock iterationg over same LV
indefinitely or segfault depending on the circumstances.
Fixed by moving the _check_lv_status outside iterating the vg->lvs
list.
Note that commit 6e7b24d34f was not enough
as _check_lv_status may result in removal of more than one LV from the list.
Improve testing for condition that pvmove0 is already running in the
table (so we do not kill pvmove while it has loaded target, but
it's not yet Live).
Also delay_dev for 200ms.
When we use /dev/loopX device - shift first PV1 sector by 1M
so /dev/loop0 and dm device do not appear as same device.
Also notify lvmetad once 'devs' are created - so in case this
command is called in the middle of test - lvmetad properly
drops its metadata for these devices.
Drop used test.img file between reuse so the 'prepare_vg'
always starts with zeroed disks.
When LVM_TEST_AUX_TRACE is set, allow shell tracing of aux commands.
Do not keep dangling LVs if they're removed from the vg->lvs list and
move them to vg->removed_lvs instead (this is actually similar to already
existing vg->removed_pvs list, just it's for LVs now).
Once we have this vg->removed_lvs list indexed so it's possible to
do lookups for LVs quickly, we can remove the LV_REMOVED flag as
that one won't be needed anymore - instead of checking the flag,
we can directly check the vg->removed_lvs list if the LV is present
there or not and to say if the LV is removed or not then. For now,
we don't have this index, but it may be implemented in the future.
This avoids a problem in which we're using selection on LV list - we
need to do the selection on initial state and not on any intermediary
state as we process LVs one by one - some of the relations among LVs
can be gone during this processing.
For example, processing one LV can cause the other LVs to lose the
relation to this LV and hence they're not selectable anymore with
the original selection criteria as it would be if we did selection
on inital state. A perfect example is with thin snapshots:
$ lvs -o lv_name,origin,layout,role vg
LV Origin Layout Role
lvol1 thin,sparse public,origin,thinorigin,multithinorigin
lvol2 lvol1 thin,sparse public,snapshot,thinsnapshot
lvol3 lvol1 thin,sparse public,snapshot,thinsnapshot
pool thin,pool private
$ lvremove -ff -S 'lv_name=lvol1 || origin=lvol1'
Logical volume "lvol1" successfully removed
The lvremove command above was supposed to remove lvol1 as well as
all its snapshots which have origin=lvol1. It failed to do so, because
once we removed the origin lvol1, the lvol2 and lvol3 which were
snapshots before are not snapshots anymore - the relations change
as we're processing these LVs one by one.
If we do the selection first and then execute any concrete actions on
these LVs (which is what this patch does), the behaviour is correct
then - the selection is done on the *initial state*:
$ lvremove -ff -S 'lv_name=lvol1 || origin=lvol1'
Logical volume "lvol1" successfully removed
Logical volume "lvol2" successfully removed
Logical volume "lvol3" successfully removed
Similarly for all the other situations in which relations among
LVs are being changed by processing the LVs one by one.
This patch also introduces LV_REMOVED internal LV status flag
to mark removed LVs so they're not processed further when we
iterate over collected list of LVs to be processed.
Previously, when we iterated directly over vg->lvs list to
process the LVs, we relied on the fact that once the LV is removed,
it is also removed from the vg->lvs list we're iterating over.
But that was incorrect as we shouldn't remove LVs from the list
during one iteration while we're iterating over that exact list
(dm_list_iterate_items safe can handle only one removal at
one iteration anyway, so it can't be used here).
When we're iterating over LVs in _poll_vg fn, we need to use the safe
version of iteration - the LV can be removed from the list which we're
just iterating over if we're finishing or aborting pvmove operation.
The code never mixes reads of committed and precommitted metadata,
so there's no need to attempt to set PRECOMMITTED when
*use_previous_vg is being set.
Refactor the recent metadata-reading optimisation patches.
Remove the recently-added cache fields from struct labeller
and struct format_instance.
Instead, introduce struct lvmcache_vgsummary to wrap the VG information
that lvmcache holds and add the metadata size and checksum to it.
Allow this VG summary information to be looked up by metadata size +
checksum. Adjust the debug log messages to make it clear when this
shortcut has been successful.
(This changes the optimisation slightly, and might be extendable
further.)
Add struct cached_vg_fmtdata to format-specific vg_read calls to
preserve state alongside the VG across separate calls and indicate
if the details supplied match, avoiding the need to read and
process the VG metadata again.
Fixes segfault when 'pvs' encounters two different PVs sharing the same
uuid but one an orphan, the other in a VG.
If VG_GLOBAL is held, there seems no point in doing a full scan more
than once.
If undesirable side-effects show up, we can try restricting this to
VG_GLOBAL READ locks. The original code dates back to 2.02.40.
When pvscan --cache --major --minor command is issued from
udev REMOVE event, it basically resulted into a whole device
scan since the device was missing. So avoid such scan
and first check via /sysfs (when available) if such device actually
exists.
When available use nanosecond stat info.
If commands are running closely enough after config update,
the .cache file from persistent filter could have been ignored.
This happens sometimes during i.e. synthetic test suite run.
There is no reason to support persistent major/minor numbers
for pool volumes - it's only meant to be supported for filesystems
(since i.e. nfs may need to keep volume on a persistent device node.)
Support for pools is now explicitely disabled and documented.
Metadata areas which are marked as ignored should not be scanned
and read during pvscan --cache. Otherwise, this can cause lvmetad
to cache out-of-date metadata in case other PVs with fresh metadata
are missing by chance.
Make this to work like in non-lvmetad case where the behaviour would
be the same as if the PV was orphan (in case we have no other PVs
with valid non-ignored metadata areas).
Simplify the function usage and clean up parameter parsing.
There were 2 significant changes made in the test itself
(they passed before because of incorrect shell string handling)
-pvs_sel 'tags="pv_tag1"' "$dev1 $dev2"
+sel pv 'tags="pv_tag1"' "$dev1" "$dev6"
-lvs_sel '(lv_name=vol1 || lv_name=vol2) || vg_tags=vg_tag1' "vol1 vol2
abc orig snap"
+sel lv '(lv_name=vol1 || lv_name=vol2) || vg_tags=vg_tag1' vol1 vol2
orig snap xyz
The iscsi-shutdown.service is the one responsible for logging out
iscsi sessions so blk-availability.service (running the blkdeactivate
script) should be run before that on shutdown (so we need to use
After=iscsi-shutdown.service because "After" relates to starting
the service and the opposite order is automatically applied on
stopping the service at shutdown).
Avoid busy-looping on CPU while reading socket pipe
and always call read only when select tells there is
something for read.
Change the batch output to old nicer output.
When lvm1 PVs are visible, and lvmetad is used, and the foreign
option was included in the reporting command, the reporting
command would fail after the 'pvscan all devs' function saw
the lvm1 PVs. There is no reason the command should fail
because of the lvm1 PVs; they should just be ignored.
Return 1 on success in pvdisplay_short() and lvdisplay_full()
so commands like vgdisplay are not printinig stracktraces
on successful passes.
As the results of fail/success have been internally ignored for those
calls, it had no other visible side effect - command's return value was
still 0 (success).
Detect an lvm1 system id by looking at the WRITE_LOCKED flag.
Don't copy this lvm1 system id into vg->system_id so that the
restrictions associated with the new system id are not applied
to the old VG with the inherited lvm1 system id.
The string reported by uname -n may include characters
that lvm omits from the system id (like parens, as seen
on a test machine.) Check against the final system id
string that lvm uses.
Since we take a lock inside vg_lock_newname() and we do a full
detection of presence of vgname inside all scanned labels,
there is no point to do this for second time to be sure
there is no such vg.
The only side-effect of such call would be a full validation of
some already exising VG metadata - but that's not the task for
vgcreate when create a new VG.
This call noticable reduces number of scans during 'vgcreate'.
Use similar logic as with text_vg_import_fd() and avoid repeated
parsing of same mda and its config tree for vgname_from_mda().
Remember last parsed vgname, vgid and creation_host in labeller
structure and if the metadata have the same size and checksum,
return this stored info.
TODO: The reuse of labeller struct is not ideal, some lvmcache API for
this functionality would be nicer.
When reading VG mda from multiple PVs - do all the validation only
when mda is seen for the first time and when mda checksum and length
is same just return already existing VG pointer.
(i.e. using 300PVs for a VG would lead to create and destroy 300 config trees....)
Previous versions of lvm will not obey the restrictions
imposed by the new system_id, and would allow such a VG
to be written. So, a VG with a new system_id is further
changed to force previous lvm versions to treat it as
read-only. This is done by removing the WRITE flag from
the metadata status line of these VGs, and putting a new
WRITE_LOCKED flag in the flags line of the metadata.
Versions of lvm that recognize WRITE_LOCKED, also obey the
new system_id. For these lvm versions, WRITE_LOCKED is
identical to WRITE, and the rules associated with matching
system_id's are imposed.
A new VG lock_type field is also added that causes the same
WRITE/WRITE_LOCKED transformation when set. A previous
version of lvm will also see a VG with lock_type as read-only.
Versions of lvm that recognize WRITE_LOCKED, must also obey
the lock_type setting. Until the lock_type feature is added,
lvm will fail to read any VG with lock_type set and report an
error about an unsupported lock_type. Once the lock_type
feature is added, lvm will allow VGs with lock_type to be
used according to the rules imposed by the lock_type.
When both system_id and lock_type settings are removed, a VG
is written with the old WRITE status flag, and without the
new WRITE_LOCKED flag. This allows old versions of lvm to
use the VG as before.
The seg_monitor did not display monitored status for thick snapshots
and mirrors (with mirror log *not* mirrored). The seg monitor did work
correctly even before for other segtypes - thins and raids.
Before (mirrors and snapshots, only mirrors with mirrored log properly displayed monitoring status):
[0] f21/~ # lvs -a -o lv_name,lv_layout,lv_role,seg_monitor vg
LV Layout Role Monitor
mirror mirror public
[mirror_mimage_0] linear private,mirror,image
[mirror_mimage_1] linear private,mirror,image
[mirror_mlog] linear private,mirror,log
mirror_with_mirror_log mirror public monitored
[mirror_with_mirror_log_mimage_0] linear private,mirror,image
[mirror_with_mirror_log_mimage_1] linear private,mirror,image
[mirror_with_mirror_log_mlog] mirror private,mirror,log monitored
[mirror_with_mirror_log_mlog_mimage_0] linear private,mirror,image
[mirror_with_mirror_log_mlog_mimage_1] linear private,mirror,image
thick_origin linear public,origin,thickorigin
thick_snapshot linear public,snapshot,thicksnapshot
With this patch applied (monitoring status displayed for all mirrors and snapshots):
[0] f21/~ # lvs -a -o lv_name,lv_layout,lv_role,seg_monitor vg
LV Layout Role Monitor
mirror mirror public monitored
[mirror_mimage_0] linear private,mirror,image
[mirror_mimage_1] linear private,mirror,image
[mirror_mlog] linear private,mirror,log
mirror_with_mirror_log mirror public monitored
[mirror_with_mirror_log_mimage_0] linear private,mirror,image
[mirror_with_mirror_log_mimage_1] linear private,mirror,image
[mirror_with_mirror_log_mlog] mirror private,mirror,log monitored
[mirror_with_mirror_log_mlog_mimage_0] linear private,mirror,image
[mirror_with_mirror_log_mlog_mimage_1] linear private,mirror,image
thick_origin linear public,origin,thickorigin
thick_snapshot linear public,snapshot,thicksnapshot monitored
If configuration setting is marked in config_setting.h with CFG_DISABLED
flag, default value is always used for such setting, no matter if it's defined
by user (in --config/lvm.conf/lvmlocal.conf).
A warning message is displayed if this happens:
For example:
[1] f21/~ # lvm dumpconfig --validate
WARNING: Configuration setting global/system_id_source is disabled. Using default value.
LVM configuration valid.
[1] f21/~ # pvs
WARNING: Configuration setting global/system_id_source is disabled. Using default value.
PV VG Fmt Attr PSize PFree
/dev/sdb lvm2 --- 128.00m 128.00m
...
Though vgremove operates per VG by definition, internally, it
actually means iterating over each LV it contains to do the
remove.
So we need to direct selection a bit in this case so that the
selection is done per-VG, not per-LV.
That means, use processing handle with void_handle.internal_report_for_select=0
for the process_each_lv_in_vg that is called later in vgremove_single fn.
We need to disable internal selection for process_each_lv_in_vg
here as selection is already done by process_each_vg which calls
vgremove_single. Otherwise selection would be done per-LV and not
per-VG as we intend!
An intra-release fix for commit 00744b053f.
Set ACCESS_NEEDS_SYSTEM_ID VG status flag whenever there is
a non-lvm1 system_id set. Prevents concurrent access from
older LVM2 versions.
Not set on VGs that bear a system_id only due to conversion
from lvm1 metadata.
Export _lvm1_system_id as generate_lvm1_system_id and call it in
vg_setup() so it is set before writing the metadata to disk
and not missing from the initial metadata backup file.
format_text processes both lvm2 on-disk metadata and metadata read
from other sources such as backup files. Add original_fmt field
to retain the format type of the original metadata.
Before this patch, /etc/lvm/archives would contain backups of
lvm1 metadata with format = "lvm2" unless the source was lvm1 on-disk
metadata.
The vg->lvm1_systemd_id needs to be initialized as all the code around
counts with that. Just like we initialize lvm1_system_id in vg_create
(no matter if it's actually LVM1 or LVM2 format), this patch adds this
init in alloc_vg as well so the rest of the code does not segfaul
when trying to access vg->lvm1_system_id.
In log messages refer to it as system ID (not System ID).
Do not put quotes around the system_id string when printing.
On the command line use systemid.
In code, metadata, and config files use system_id.
In lvmsystemid refer to the concept/entity as system_id.
Two new functions added in the init script: rh_status and rh_status_q.
First one to be used in status() and second one to be used in start(),
stop(), force_stop(). Check for 'dmeventd' added and print list of
lvs being monitored in status().
"!dev_cache_get(argv[i], cmd->full_filter) && !rescan_done" --> "!rescan_done && !dev_cache_get(argv[i], cmd->full_filter)
Check the simple condition first (variable), then the function return value
(which in this case certainly takes more time to evaluate) - save some time.
Two problems fixed by this patch:
- PV tags were not recognized at all when using them with pvs
report that has only label fields (regression since 2.02.105)
- incorrect persistent .cache file to be generated after pvs
report that has only label fields (regression since 2.02.106)
These bugs come from the transition from process_each_pv to
process_each_label introduced by commit
67a7b7a87d and commit
490226fc47 and related.
Commands that can never use foreign VGs begin with
cmd->error_foreign_vgs = 1. This tells the vg_read
lib layer to print an error as soon as a foreign VG
is read.
The toollib process_each layer also prints an error if a
foreign VG is read, but is more selective about it. It
won't print an error if the command did not explicitly
name the foreign VG. We want to silently ignore foreign VGs
unless a command attempts to use one explicitly.
So, foreign VG errors are printed from two different layers:
vg_read (lower layer) and process_each (upper layer).
Commands that use toollib process_each, only want errors from
the process_each layer, not from both layers. So, process_each
disables the lower layer vg_read error message by setting
error_foreign_vgs = 0.
Commands that do not use toollib process_each, want errors
from the vg_read layer, otherwise they would get no error
message. The original cmd->error_foreign_vgs setting
enables this error.
(Commands that are allowed to operate on foreign VGs always
begin with cmd->error_foreign_vgs = 0, and all the commands
in this group use toollib process_each with the selective
error reporting.)
If an LV is already rw but still ro in the kernel, allow -prw to issue a
refresh to try to change the kernel state to rw.
Intended for use after clearing activation/read_only_volume_list in
lvm.conf.
The only realistic way for a host to have active LVs in a
foreign VG is if the host's system_id (or system_id_source)
is changed while LVs are active.
In this case, the active LVs produce an warning, and access
to the VG is implicitly allowed (without requiring --foreign.)
This allows the active LVs to be deactivated.
In this case, rescanning PVs for the VG offers no benefit.
It is not possible that rescanning would reveal an LV that
is active but wasn't previously in the VG metadata.
cmirror uses the CPG library to pass messages around the cluster and maintain
its bitmaps. When a cluster mirror starts-up, it must send the current state
to any joining members - a checkpoint. When mirrors are large (or the region
size is small), the bitmap size can exceed the message limit of the CPG
library. When this happens, the CPG library returns CPG_ERR_TRY_AGAIN.
(This is also a bug in CPG, since the message will never be successfully sent.)
There is an outstanding bug (bug 682771) that is meant to lift this message
length restriction in CPG, but for now we work around the issue by increasing
the mirror region size. This limits the size of the bitmap and avoids any
issues we would otherwise have around checkpointing.
Since this issue only affects cluster mirrors, the region size adjustments
are only made on cluster mirrors. This patch handles cluster mirror issues
involving pvmove, lvconvert (from linear to mirror), and lvcreate. It also
ensures that when users convert a VG from single-machine to clustered, any
mirrors with too many regions (i.e. a bitmap that would be too large to
properly checkpoint) are trapped.
A foreign VG should be silently ignored by a reporting/display
command like 'vgs'. If the reporting/display command specifies
a foreign VG by name on the command line, it should produce an
error message.
Scanning commands pvscan/vgscan/lvscan are always allowed to
read and update caches from all PVs, including those that belong
to foreign VGs.
Other non-report/display/scan commands always ignore a foreign
VG, or report an error if they attempt to use a foreign VG.
vgimport should always invalidate the lvmetad cache because
lvmetad likely holds a pre-vgexported copy of the VG.
(This is unrelated to using foreign VGs; the pre-vgexported
VG may have had no system_id at all.)
Add --foreign to the remaining reporting and display commands plus
vgcfgbackup.
Add a NEEDS_FOREIGN_VGS flag for vgimport to always set --foreign.
If lvmetad is being used with --foreign, scan foreign VGs (currently
implemented as a full PV scan).
Handle these things centrally in lvmcmdline.c.
Also allow lvchange and vgchange -an/-aln to deactivate any foreign
LVs that happen to be active if something went wrong.
Remember to set the system ID when creating a new VG in vgsplit.
When checking whether the system ID permits access to a VG, check for
each permitted situation first, and only then issue the appropriate
error message. Always issue a message for now. (We'll try to
suppress some of those later when the VG concerned wasn't explicitly
requested.)
Add more messages to try to ensure every return code is checked and
every error path (and only an error path) contains a log_error().
Add self-correction to vgchange -c to deal with situations where
the cluster state and system ID state are out-of-sync (e.g. if
old tools were used).
Move the lvm1 sys ID into vg->lvm1_system_id and reenable the #if 0
LVM1 code. Still display the new-style system ID in the same
reporting field, though, as only one can be set.
Add a format feature flag FMT_SYSTEM_ON_PVS for LVM1 and disallow
access to LVM1 VGs if a new-style system ID has been set.
Treat the new vg->system_id as const.
Allow cmd->unknown_system_id to be cleared during toolcontext
refresh.
Set a default value of "none" for global/system_id_source.
Allow local/system_id to be empty so it's not impossible for
a later config file to remove it.
In a file containing a system ID:
Any whitespace at the start of a line is ignored;
Blank lines are ignored;
Any characters after a # are ignored along with the #.
The system ID is obtained by processing the first line with
non-ignored characters.
If further lines with non-ignored characters follow, a warning is
issued.
Add WARNING messages if there are problems setting the requested
system ID.
Ban "localhost" as a prefix regardless of the system_id_source.
Use cmd->hostname instead of calling uname again.
Make system_id_source values case-insensitive (as with new settings like
log_debug_classes) and also accept machine-id to match the filename.
Require system ID to begin with an alphanumeric character.
Rename fn to make clear it's only validation for systemid
and always terminate result rather than imposing this on the caller.
In 2.02.99, _init_tags() inadvertently began to ignore the
dm_config_tree struct passed to it. "tags" sections are not
merged together, so the "tags" section in the main config file was
being processed repeatedly and other "tags" sections were ignored.
Dop unused value assignments.
Unknown is detected via other combination
(!linear && !striped).
Also change the log_error() message into a warning,
since the function is not really returning error,
but still keep the INTERNAL_ERROR.
Ret value is always set later.
The dev ext source must be reset for the dev_cache_get call
(which evaluates filters), not lvmcache_label_scan - so fix
original commit 727c7ff85d.
Also, add comments in _pvcreate_check fn explaining why
refresh filter and rescan is needed and exactly in which
situations.
We exclude some signatures from being wiped when using blkid wiping.
These are signatures which we simply overwrite. For example, the
LVM2_member signature which denotes a PV - if we call pvcreate on
existing PV, we just overwrite the PV header, no need to wipe it.
Previously, we counted such signatures as if they were wiped
and they were counted in the final number of wiped signatures
that _wipe_known_signatures_with_blkid fn returned in the "wiped"
output arg. Then the code checking this output arg could be
mislead that wiping happened while no wiping took place in real
and we could fire some code uselessly based on this information
(e.g. refreshing filters/rescanning - see also
commit 6b4066585f).
Before, we refreshed filters and we did full rescan of devices if
we passed through wiping (wipe_known_signatures fn call). However,
this fn returns success even if no signatures were found and so
nothing was wiped. In this case, it's not necessary to do the
filter refresh/rescan of devices as nothing changed clearly.
This patch exports number of wiped signatures from all the
wiping functions below. The caller (_pvcreate_check) then checks
whether any wiping was done at all and if not, no refresh/rescan
is done, saving some time and resources.
pvcreate code path executes signature wiping if there are any signatures
found on device to prepare the device for PV. When the signature is wiped,
the WATCH udev rule triggers the event which then updates udev database
with fresh info, clearing the old record about previous signature.
However, when we're using udev db as dev-ext source, we'd need to wait
for this WATCH-triggered event. But we can't synchronize against such
events (at least not at this moment). Without this sync, if the code
continues, the device could still be marked as containing the old
signature if reading udev db. This may end up even with the device
to be still filtered, though the signature is already wiped.
This problem is then exposed as (an example with md components):
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb --run
$ mdadm -S /dev/md0
$ pvcreate -y /dev/sda
Wiping linux_raid_member signature on /dev/sda.
/dev/sda: Couldn't find device. Check your filters?
$ echo $?
5
So we need to temporarily switch off "udev" dev-ext source here
in this part of pvcreate code until we find a way how to sync
with WATCH events.
(This problem does not occur with signature wiping which we do
on newly created LVs since we already handle this properly with
our udev flags - the LV_NOSCAN/LV_TEMPORARY flag. But we can't use
this technique for non-dm devices to keep WATCH rule under control.)
(This reverts patch #d95c6154)
Filter complete device list through full_filter unconditionally when
we're getting the list of *all* devices even in case we're interested
only in fraction of those devices - the PVs, not the other devices
which are not PVs yet (e.g. pvs vs. pvs -a).
We need to do this full filtering whenever we're handling *complete*
list of devices, we need to be safe here, mainly if there are any
future changes and we'd forgot to change to use proper filtering then.
Also properly preventing duplicates if there are any block subsystem
components used (mpath, MD ...).
Thing here is that (under use_lvmetad=1), cmd->filter can be used
only if we're sure that the list of devices we're filtering contains
only PVs. We have to use cmd->full_filter otherwise (like it is in
case of _get_all_devices fn which acquires complete list of devices,
no matter if it is a PV or not).
Of course, cmd->full_filter is more extensive than cmd->filter
which is only a subset of full_filter.
We could optimize this in a way that if we're interested in PVs only
during process_each_pv processing (e.g. using pvs in contrast to pvs -a),
we'd get the list of PV devices directly from lvmetad from the
lvmcache_seed_infos_from_lvmetad fn call which currently updates
lvmcache only. We'd add an additional output arg for this fn to get
the list of PV devices directly in addition, without a need to iterate
over all devices which include non-PVs which we're not interested in
anyway, hence we could use only cmd->filter, not the cmd->full_filter.
So the code would look something like this:
static int _get_all_devices(....)
{
struct device_id_list *dil;
if (interested_in_pvs_only)
lvmcache_seed_infos_from_lvmetad(cmd, &dil); /* new "dil" arg */
/* the "dil" list would be filtered through cmd->filter inside lvmcache_seed_infos_from_lvmetad */
else {
lvmcache_seed_infos_from_lvmetad(cmd, NULL);
dev_iter_create(cmd->full_filter)
while (dev = dev_iter_get ...) {
dm_list_add(all_devices, &dil->list);
}
}
}
pvchange now uses process_each_pv so uncomment parts of the test
which check proper functionality of intersection between selection
result and PVs or PV tags directly provided on command line. This
didn't work properly before when pvchange was not using process_each_pv.
For example:
pvchange -u -S 'pv_name=/dev/sda' /dev/sdb
..changes nothing since clearly the intersection of /dev/sda and
/dev/sdb is empty set. The same applies for tags:
pvchange -u -S 'pv_name=/dev/sda' @some_tag
..changes nothing if /dev/sda is not tagged with some_tag.
It's cleaner this way - do not mix static and dynamic
(init_processing_handle) initializers. Use the dynamic one everywhere.
This makes it easier to manage the code - there are no "exceptions"
then and we don't need to take care about two ways of initializing the
same thing - just use one common initializer throughout and it's clear.
Also, add more comments, mainly in the report_for_selection fn explaining
what is being done and why with respect to the processing_handle and
selection_handle.
Invalid devices no longer included in the counters printed at the end.
May now need to use --ignoreskippedcluster if relying upon exit status.
If more than one change is requested per-PV, attempt to perform them
all. Note that different arguments still handle exit status
differently.
When lvm2 is build with valgrind pool detection - always disable
memcheck, since pool memory allocation are unconditionaly passed
into valgrind library.
We still need to get the list as the calls underneath process_each_pv
rely on this list. But still keep the change related to the filters -
if we're processing all devices, we need to use cmd->full_filter.
If we're processing only PVs, we can use cmd->filter only to save
some time which would be spent in filtering code.
When lvmetad is used and at the same time we're getting list of all
PV-capable devices, we can't use cmd->filter (which is used to filter
out lvmetad responses - so we're sure that the devices are PVs already).
To get the list of PV-capable devices, we're bypassing lvmetad (since
lvmetad only caches PVs, not all the other devices which are not PVs).
For this reason, we have to use the "full_filter" filter chain (just
like we do when we're running without lvmetad).
Example scenario:
- sdo and sdp components of MD device md0
- sdq, sdr and sds components of mpatha multipath device
- mpatha multipath device partitioned
- vda device partitioned
=> sdo,sdp,sdr,sds, mpatha and vda should be filtered!
$ lsblk -o NAME,TYPE
NAME TYPE
sdn disk
sdo disk
`-md0 raid0
sdp disk
`-md0 raid0
sdq disk
`-mpatha mpath
`-mpatha1 part
sdr disk
`-mpatha mpath
`-mpatha1 part
sds disk
`-mpatha mpath
`-mpatha1 part
vda disk
|-vda1 part
`-vda2 part
|-fedora-swap lvm
`-fedora-root lvm
Before this patch:
==================
use_lvmetad=0 (correct behaviour!)
$ pvs -a
PV VG Fmt Attr PSize PFree
/dev/fedora/root --- 0 0
/dev/fedora/swap --- 0 0
/dev/mapper/mpatha1 --- 0 0
/dev/md0 --- 0 0
/dev/sdn --- 0 0
/dev/vda1 --- 0 0
/dev/vda2 fedora lvm2 a-- 9.51g 0
use_lvmetad=1 (incorrect behaviour - sdo,sdp,sdq,sdr,sds and mpatha not filtered!)
$ pvs -a
PV VG Fmt Attr PSize PFree
/dev/fedora/root --- 0 0
/dev/fedora/swap --- 0 0
/dev/mapper/mpatha --- 0 0
/dev/mapper/mpatha1 --- 0 0
/dev/md0 --- 0 0
/dev/sdn --- 0 0
/dev/sdo --- 0 0
/dev/sdp --- 0 0
/dev/sdq --- 0 0
/dev/sdr --- 0 0
/dev/sds --- 0 0
/dev/vda --- 0 0
/dev/vda1 --- 0 0
/dev/vda2 fedora lvm2 a-- 9.51g 0
With this patch applied:
========================
use_lvmetad=1
$ pvs -a
PV VG Fmt Attr PSize PFree
/dev/fedora/root --- 0 0
/dev/fedora/swap --- 0 0
/dev/mapper/mpatha1 --- 0 0
/dev/md0 --- 0 0
/dev/sdn --- 0 0
/dev/vda1 --- 0 0
/dev/vda2 fedora lvm2 a-- 9.51g 0
List of all devices is only needed if we want to process devices
which are not PVs (e.g. pvs -a). But if this is not the case, it's
useless to get the list of all devices and then discard it without
any use, which is exactly what happened in process_each_pv where
the code was never reached and the list was unused if we were
processing just PVs, not all PV-capable devices:
int process_each_pv(...)
{
...
process_all_devices = process_all_pvs &&
(cmd->command->flags & ENABLE_ALL_DEVS) &&
arg_count(cmd, all_ARG);
...
/*
* If the caller wants to process all devices (not just PVs), then all PVs
* from all VGs are processed first, removing them from all_devices. Then
* any devs remaining in all_devices are processed.
*/
_get_all_devices(cmd, &all_devices);
...
ret = _process_pvs_in_vgs(...);
...
if (!process_all_devices)
goto out;
ret = _process_device_list(cmd, &all_devices, handle, process_single_pv);
...
}
This patch adds missing check for "process_all_devices" and it gets the
list of all (including non-PV) devices only if needed:
This makes a difference when using selection criteria based on
these fields - if those fields are defined as DM_REPORT_FIELD_TYPE_SIZE
(in contrast to DM_REPORT_FIELD_TYPE_NUMBER), units are also
recognize in selection clause.
For example:
$ lvs -o+seg_start vg1/lv2
LV VG Attr LSize Start
lv2 vg1 -wi-a----- 12.00m 0
lv2 vg1 -wi-a----- 12.00m 8.00m
Before this patch:
$ lvs -o+seg_start --select 'seg_start=8m'
Found size unit specifier but numeric value expected for selection field seg_start.
Selection syntax error at 'seg_start=8m'.
Use 'help' for selection to get more help.
With this patch applied:
$lvs -o+seg_start --select 'seg_start=8m'
LV VG Attr LSize Start
lv2 vg1 -wi-a----- 12.00m 8.00m
(the same applies for ba_start and vg_free fields)
The LVM_COMMAND_PROFILE env var is new - mention it in dumpconfig's
man page.
Also, dumpconfig always displays the top of the config cascade.
To display all the config found in the cascade merged (just like
it's used during LVM command processing), --mergedconfig option
must be used - this one's already described in that man page,
just make sure it's clear and add reference for this option also
in --profile/--commandprofile/--metadataprofile description.
This is a followup patch for previous patchset that enables selection in
process_each_* fns to fix an issue where field prefixes are not
automatically used for fields in selection criteria.
Use initial report type that matches the intention of each process_each_* functions:
- _process_pvs_in_vg - PVS
- process_each_vg - VGS
- process_each_lv and process_each_lv_in_vg - LVS
This is not normally needed for the selection handle init, BUT we would
miss the field prefix matching, e.g.
lvchange -ay -S 'name=lvol0'
The "name" above would not work if we didn't initialize reporting with
the LVS type at its start. If we pass proper init type, reporting code
can deduce the prefix automatically ("lv_name" in this case).
This report type is then changed further based on what selection criteria we
have. When doing pure selection, not report output, the final report type
is purely based on combination of this initial report type and report types
of the fields used in selection criteria.
We already allowed -S|--select with {vg,lv,pv}display -C (which
was then equal to {vg,lv,pv}s command. Since we support selection
in toolib now, we can support -S also without using -C in *display
commands now.
pvchange is an exception that does not use toollib yet for iterating
over the list of PVs (process_each_pv) so intialize the
processing_handle and use just like it's used in toollib.
We have 3 input report types:
- LVS (representing "_select_match_lv")
- VGS (representing "_select_match_vg")
- PVS (representing "_select_match_pv")
The input report type is saved in struct selection_handle's "orig_report_type"
variable.
However, users can use any combination of fields of different report types in
selection criteria - the resulting report type can thus differ. The struct
selection_handle's "report_type" variable stores this resulting report type.
The resulting report_type can end up as one of:
- LVS
- VGS
- PVS
- SEGS
- PVSEGS
This patch adds logic to report_for_selection based on (sensible) combination
of orig_report_type and report_type and calls appropriate reporting functions
or iterates over multiple items that need reporting to determine the selection
result.
The report_for_selection does the actual "reporting for selection only".
The selection status will be saved in struct selection_handle's "selected"
variable.
The code to determine final report type based on combination of input
report type (determined from fields used for reporting to output and selection)
can be reused for pure reporting for selection - factor out this code into
_get_final_report_type function.
This applies to:
- process_each_lv_in_vg - the VG is selected only if at least one of its LVs is selected
- process_each_segment_in_lv - the LV is selected only if at least one of its LV segments is selected
- process_each_pv_in_vg - the VG is selected only if at least one of its PVs is selected
- process_each_segment_in_pv - the PV is selected only if at least one of its PV segments is selected
So this patch causes the selection result to be properly propagated up to callers.
Call _init_processing_handle, _init_selection_handle and
_destroy_processing_handle in process_each_* and related functions to
set up and destroy handles used while processing items.
The init_processing_handle, init_selection_handle and
destroy_processing_handle are helper functions that allocate and
initialize the handles used when processing items in process_each_*
and related functions.
The "struct processing_handle" contains handles to drive the selection/matching
so pass it to the _select_match_* functions which are entry points to the
selection mechanism used in process_each_* and related functions.
This is revised and edited version of former Dave Teigland's patch which
provided starting point for all the select support in process_each_* fns.
The new "report_init_for_selection" is just a wrapper over
dm_report_init_with_selection that initializes reporting for selection
only. This means we're not going to do the actual reporting to output
for display and as such we intialize reporting as if no fields are reported
or sorted. The only fields "reported" are taken from the selection criteria
string and all such fields are marked as hidden automatically (FLD_HIDDEN flag).
These fields are used solely for selection criteria matching.
Also, modify existing report_object function that was used for reporting to
output for display. Now, it can either cause reporting to output or reporting
for selection only. The selection result is stored in struct selection_handle's
"selected" variable which can be handled further by any report_object caller.
This patch replaces "void *handle" with "struct processing_handle *handle"
in process_each_*, process_single_* and related functions.
The struct processing_handle consists of two handles inside now:
- the "struct selection_handle *selection_handle" used for
applying selection criteria while processing process_each_*,
process_single_* and related functions (patches using this
logic will follow)
- the "void* custom_handle" (this is actually the original handle
used before this patch - a pointer to custom data passed into
process_each_*, process_single_* and related functions).
The new dm_report_object_is_selected fn makes it possible to opt whether the
object reported should be displayed on output or not. Also, in addition to
that, it makes it possible to save the result of selection (either 0 or 1).
So dm_report_object_is_selected is simply more general form of object
reporting fn - combinations now allow for:
dm_report_object_is_selected(rh, object, 1, NULL):
This is exactly the original dm_report_object fn and it's fully equal
to it.
dm_report_object_is_selected(rh, object, 0, selected):
Do not display the result on output, but save info whether the object
is selected or not in 'selected' variable.
dm_report_object_is_selected(rh, object, 1, selected):
Display the result on output (if it passes selection criteria) and save
whether the object is selected or not in 'selected' variable.
dm_report_object(rh, object, 0, NULL):
This combination is not allowed - it will end up with internal error.
We're either interested in selection status or we want to display the
result on output or both, but never nothing of the two.
Once LVM_COMMAND_PROFILE environment variable is specified, the profile
referenced is used just like it was specified using "<lvm command> --commandprofile".
If both --commandprofile cmd line option and LVM_COMMAND_PROFILE env
var is used, the --commandprofile cmd line option gets preference.
- The RPM build and the tests are now executed in separate VMs.
- Run the testsuite by using the new lvm2-testsuite RPM.
- The VM running the tests is restarted from the outside if it hangs, and the
runner keeps a journal to avoid running a bad test ad infinitum.
- TODO: lcov reports and more intelligent VM rebooting (track the journal)
all sockets opened by a daemon or handed over by systemd
have to have CLOEXEC flag set. Otherwise we get nasty
warnings about leaking descriptors in processes spawned by
daemon.
Older lvm2 tools where always providing linear mapping for thin pool.
Recent lvm2 version however support external usage of thin pool and
empty/unused pools are loaded without such external linear mapping.
So this patch covers 'upgrade' problem, where older tool has activated
thin-pool with 'linear' layer mapping, and newer tools didn't expected
such mapping to exist and were not able to deactivate such table.
So before checking for new layout in dm-table, check if there is not
an old one already there.
After commit 158e998876 where we may
start to readlv_attr with a 'shared' ioctl call for a single lvs line
we where obtaing single status for thin pools.
However this is not properly reflecting lvm2 reality.
Correcting this by reading lv status from layered thin pool, but lv info
from non-layered (linear) mapped device which is maintained for proper
cluster locking.
Just like MD filtering that detects components of software RAID (md),
add detection for firmware RAID.
We're not adding any native code to detect this - there are lots of
firmware RAIDs out there which is just out of LVM scope. However,
with current changes with which we're able to get device info from
external sources (e.g. external_device_info_source="udev"), we can
do this easily if the external device status source has this kind
of information - which is the case of "udev" source where the results
of blkid scans are stored.
This detection should cover all firmware RAIDs that blkid can detect and
which are identified as:
ID_FS_TYPE = {adaptec,ddf,hpt45x,hpt37x,isw,jmicron,lsi_mega,nvidia,promise_fasttrack,silicon_medley,via}_raid_member
Partitioned devices are marked in udev db as:
ID_PART_TABLE="<partition table type name>"
and at the same time they are *not* marked with:
ID_PART_ENTRY_DISK="<parent disk major:minor>"
Where partition table type name is dos/gpt/... But checking the presence
of this variable is enough for LVM here - it just needs to know whether
there's a partition table or not, not interested in the actual type.
The same applies for parent disk major:minor.
The filter-partitioned code should contain only checks in "partition" domain.
The check for pv_min_size should actually be a part of filter-usable.
If the device size is less than pv_min_size, such device is not usable
as a PV so this check clearly belongs here logically.
With udev external info source, we can get device size via libudev's
sysfs reading interface and we can avoid opening the device this way
effectively.
mpath components are marked in udev db as:
ID_FS_TYPE="mpath_member"
or
DM_MULTIPATH_DEVICE_PATH="1"
(it depends on udev rule/blkid version used for handling mpath)
Composite filter is a filter that can put several filters in one set.
This patch adds a switch when creating the composite filter which will
enable or disable external device info handles for all the filters
the composite filter encompasses.
We want to use this external device info for majority of the filters
which are in the "lvmetad filter chain" (or the respective part if
we're not using lvmetad).
Following patches will use the enabled external device handle in
concrete filters from the composite filter...
for_each_sub_lv() now scans in depth also pools, however for
rename we actually do want to skip pools.
So add a new for_each_sub_lv_except_pools() to be used by rename,
every other user of for_each_sub_lv() scans every sub LV with pools
included.
This is i.e. necessary for properly working preload of pools
that are using raid arrays.
LVSINFO, LVSSTATUS and LVSINFOSTATUS is the same as LVS, just with some
extra info/status decoration attached to it. Recognize this when looking
for properties for lvm2app. This fixes lvm_lv_get_property lvm2app call
for fields which already use LVS{INFO,STATUS,INFOSTATUS} - currently,
this is lv_attr field which was converted to LVSINFOSTATUS from
pure LVS type.
This is a regression from v115 where some of the fields/properties
were converted to using the common "struct lvinfo" and
"struct lv_seg_status" so we don't need to issue info and status
ioctl several times per one reported line. Not all fields are
converted yet, but one that *is* converted is the lv_attr field
with the lv_attr_dup counterpart used in lvm_lv_get_attr lvm2app fn.
These changes were introduced with e34b004422
and later - this patch introduced the "info_ok" field in the
lv_with_info_and_seg_status structure which encapsulates the lvinfo
and lv_seg_status struct.
For the lv_attr_dup, the lv_attr_dup code missed the
assignment for the "info_ok" flag which saves the result of the
lv_info_with_seg_status call. Hence such info was marked
as unusable - unknown and it was returned as such via lvm_lv_get_attr
lvm2app fn.
When cache_mode is undefined, the read of metadata will miss to
set a bit with mode and fails to process metadata on internal
error:
Internal error: LV vg/lvol1 has uknown feature flags 0.
Fix it by setting it to writethrough mode.
Check splitted leg is active before preload.
(Since splitmirrors currently only does work active raid volumes
it's not a change for current code flow).
Minor optimization included - when already positively checked
for raid image don't check again for raid metadata.
for_each_sub_lv() normally does not put pool_lv into deps.
So for now go around it in 'lv_preload()' and add explicit
call with pool.
TODO: think about a better way, we want pool_lv deps only in certain
moments, so maybe for_each_sub_lv() needs new arg for this.
When repairing thin pool or swapping thin pool metadata,
preserve chunk_size property and avoid to be automatically changed
later in the code to better match thin pool metadata size.
When raid leg is extracted, now the preload code handles this state
correctly and put proper new table entry into dm tree,
so the activation of extracted leg and removed metadata works
after commit.
When raid is being splitted, extracted leg & metadata
is still floating in the table - and thus we need to
detect this case and properly preload their matching
table so consequent activation of extracted LVs properly
renames (and FREES) existing raid images, so ongoing
image name shifting will work.
For example, with dmeventd/executable set to "" which is not allowed for
this setting, the config validation now ends up with:
$ lvm dumpconfig --validate
Configuration setting "dmeventd/executable" invalid. It cannot be set to an empty value.
LVM configuration invalid.
This check for empty values for string config settings was not
done before (we only checked empty arrays, but not scalar strings).
Rename original lv_error_when_full field to lv_when_full and also
convert it from binary field to string field displaying three
possible values: "error", "queueu" or "" (blank for undefined).
$ lvs vg/pool vg/pool1 vg/linear_lv -o+lv_when_full
LV VG Attr LSize Data% Meta% WhenFull
linear_lv vg -wi-a----- 4.00m
pool vg twi-aotz-- 4.00m 0.00 0.98 queue
pool1 vg twi-a-tz-- 4.00m 0.00 0.88 error
For -S|--select these synonyms are recognized:
"error" -> "error when full", "error if no space"
"queue" -> "queue when full", "queue if no space"
"" -> "undefined"
The arg check using pvs is unnecessary. If the arg is not a PV,
the command will just fail later. Using the pvs command at this
point in the command is a problem when lvmetad is running, because
the pvs command does not report duplicate PVs when using lvmetad.
(Alternatively, use_lvmetad could be disabled by adding a --config
override to this pvs command.)
Add separate LVSINFOSTATUS field type for fields which display both
dm info-like and dm status-like information.
The internal interface is there with the introduction of LVSSTATUS
field type which can cope with the combination of LVSSTATUS
and LVSINFO field types (several fields).
However, till now, we considered that *single* field can display
either LVSINFO or LVSSTATUS, but not both at the same time.
Till now, we haven't had single field which needs both - hence
add LVSINFOSTATUS field type for such fields as we currently
need this for the lv_attr field which requires combination of
info and status.
This patch just adds interface for an ability to register such fields
(the code that copes with this is already in).
Recently the single 'status' code has been used for number of cache
features.
Extend the API a little bit to allow usage also for lv_attr_dup.
As the function itself is used in lvm2api - add a new function:
lv_attr_dup_with_info_and_seg_status() that is able to use
grabbed info & status information.
report_init() is now using directly passed lvdm struct pointer
which holds the infomation whether lv_info() was correctly obtained or
there was some error when trying to read it.
Move 'healt' attribute to status.
TODO convert raid function to use the already known status.
The previous patch felt short WRT disabling allocation on PVs holding other
legs of the RAID LV persistently; this patch introduces an internal,
transient PV flag PV_ALLOCATION_PROHIBITED to address this very problem.
General problem description for completeness:
An 'lvconvert --repair $RAID_LV" to replace a failed leg of a multi-segment
RAID10/4/5/6 logical volume can lead to allocation of (parts of) the replacement
image component pair on the physical volume of another image component
(e.g. image 0 allocated on the same PV as image 1 silently impeding resilience).
Patch fixes this severe resilince issue by prohibiting allocation on PVs
already holding other legs of the RAID set. It allows to allocate free space
on any operational PV already holding parts of the image component pair.
A full search for duplicate PVs in the case of pvs -a
is only necessary when duplicates have previously been
detected in lvmcache. Use a global variable from lvmcache
to indicate that duplicate PVs exist, so we can skip the
search for duplicates when none exist.
Previously, 'pvs -a' displayed the VG name for only the device
associated with the cached PV (pv->dev), and other duplicate
devices would have a blank VG name. This commit displays the
VG name for each of the duplicate devices. The cost of doing
this is not small: for each PV processed, the list of all
devices must be searched for duplicates.
When multiple duplicate devices are specified on the
command line, the PV is processed once for each of them,
but pv->dev is the device used each time.
This overrides the PV device to reflect the duplicate
device that was specified on the command line. This is
done by hacking the lvmcache to replace pv->dev with the
device of the duplicate being processed. (It would be
preferable to override pv->dev without munging the content
of the cache, and without sprinkling special cases throughout
the code.)
This override only applies when multiple duplicate devices are
specified on the command line. When only a single duplicate
device of pv->dev is specified, the priority is to display the
cached pv->dev, so pv->dev is not overridden by the named
duplicate device.
In the examples below, loop3 is the cached device referenced
by pv->dev, and is given priority for processing. Only after
loop3 is processed/displayed, will other duplicate devices
loop0/loop1 appear (when requested on the command line.)
With two duplicate devices, loop0 and loop3:
# pvs
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop0
PV VG Fmt Attr PSize PFree
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m
# pvs /dev/loop3
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop0
PV VG Fmt Attr PSize PFree
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m
# pvs /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop0
PV VG Fmt Attr PSize PFree
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m
# pvs -o+dev_size /dev/loop0 /dev/loop3
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop0
PV VG Fmt Attr PSize PFree DevSize
/dev/loop0 loopa lvm2 a-- 12.00m 12.00m 16.00m
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
With three duplicate devices, loop0, loop1, loop3:
# pvs -o+dev_size
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop3
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop1
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop3 /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop0 loopa lvm2 a-- 12.00m 12.00m 16.00m
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop3 /dev/loop1
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop1 loopa lvm2 a-- 12.00m 12.00m 32.00m
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop0 /dev/loop1
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop1 loopa lvm2 a-- 12.00m 12.00m 32.00m
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
# pvs -o+dev_size /dev/loop0 /dev/loop1 /dev/loop3
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop1 not /dev/loop0
Found duplicate PV XhLbpVo0hmuwrMQLjfxuAvPFUFZqD4vr: using /dev/loop3 not /dev/loop1
PV VG Fmt Attr PSize PFree DevSize
/dev/loop0 loopa lvm2 a-- 12.00m 12.00m 16.00m
/dev/loop1 loopa lvm2 a-- 12.00m 12.00m 32.00m
/dev/loop3 loopa lvm2 a-- 12.00m 12.00m 32.00m
Processes a PV once for each time a device with its PV ID
exists on the command line.
This fixes a regression in the case where:
. devices /dev/sdA and /dev/sdB where clones (same PV ID)
. the cached VG references /dev/sdA
. before the regression, the command: pvs /dev/sdB
would display the cached device clone /dev/sdA
. after the regression, pvs /dev/sdB would display nothing,
causing vgimportclone /dev/sdB to fail.
. with this fix, pvs /dev/sdB displays /dev/sdA
Also, pvs /dev/sdA /dev/sdB will report two lines, one for each
device on the command line, but /dev/sdA is displayed for each.
This only works without lvmetad.
Support error_if_no_space feature for thin pools.
Report more info about thinpool status:
(out_of_data (D), metadata_read_only (M), failed (F) also as health
attribute.)
API for seg reporting is breaking internal lvm coding - it cannot
use vgmem mem pool for allocation of reported value.
So use separate pool instead of 'vgmem' for non vg related allocations
Add consts for many function params - but still many other are left
for now as non-const - needs deeper level of change even on libdm side.
An 'lvconvert --repair $RAID_LV" to replace a failed leg of a multi-segment
RAID10/4/5/6 logical volume can lead to allocation of (parts of) the replacement
image component pair on the physical volume of another image component
(e.g. image 0 allocated on the same PV as image 1 silently impeding resilience).
Patch fixes this severe resilince issue by prohibiting allocation on PVs
already holding other legs of the RAID set. It allows to allocate free space
on any operational PV already holding parts of the image component pair.
It's not an error if the device is filtered out and hence cleared from
lvmetad cache - "pvscan --cache DevPath" has now the same behaviour in
this case as "pvscan --cache major:minor" (which is more consistent).
Before, the tests expected failure return code for "pvscan --cache DevicePath"
if the device was filtered (which is a different situation if the device
is missing in the system completely!).
Normally, if there are partitions defined on top of device-mapper
device, there should be a device-mapper device created for each
partiton on top of the old one and once the underlying DM device
is used by another devices (partition mappings in this case),
it can't be used as a PV anymore.
However, sometimes, it may happen the partition mappings are
missing - either the partitioning tool is not creating them if
it does not contain full support for device-mapper devices or
the mappings were removed.
Better safe than sorry - check for partition header on DM devs
and filter them out as unsuitable for PVs in case the check is
positive. Whatever the user is doing, let's do our best to prevent
unwanted corruption (...by running pvcreate on top of such device
that would corrupt the partition header).
If pvscan is run with device path instead of major:minor pair and this
device still exists in the system and the device is not visible anymore
(due to a filter that is applied), notify lvmetad properly about this.
This makes it more consistent with respect to existing pvscan with
major:minor which already notifies lvmetad about device that is gone
due to filters.
However, if the device is not in the system anymore, we're not able
to translate the original device path into major:minor pair which
lvmetad needs for its action (lvmetad_pv_gone fn). So in this case,
we still need to use major:minor pair only, not device path. But at
least make "pvscan --cache DevicePath" as near as possible to "pvscan
--cahce <major>:<minor>" functionality.
Also add a note to pvscan man page about this difference when using
pvscan --cache with DevicePath and major:minor pair.
When processing PVs specified on the command line, the arg
name was being matched against pv_dev_name, which will not
always work:
- The PV specified on the command line could be an alias,
e.g. /dev/disk/by-id/...
- The PV specified on the command line could be any random
path to the device, e.g. /dev/../dev/sdb
To fix this, first resolve the named PV args to struct device's,
then iterate through the devices for processing.
No need to use awk now to get appropriate VGs/LVs, use LVM's
own --select - it's quicker, it removes a need for external
dependency on awk and it's also more readable.
Better than previous patch which changed log_warn to log_error -
we can have multiple MDAs and if one of them fails to be written,
we can still continue with other MDAs if we're in a mode where
we can handle missing PVs - so keep the log_warn for single
failed MDA write as it was before.
However, add log_error with "Failed to write VG <vg_name>." in
case we're not handling missing PVs or no MDA was written at all
during VG write process. This also prevents an internal error in
which the vg_write fails and we're not issuing any other log_error
in vg_write caller or above, so we end up with:
"Internal error: Failed command did not use log_error".
At first, all snapshot-origins where marked as unusable unconditionally
here, but we can't cut off whole snapshot-origin use in a stack just
because of this possible mirror state. This whole "device_is_usable"
check was even incorrectly part of persistent filter before commit
a843d0d97c66aae1872c05b0f6cf4bda176aae2 (where filter cleanup was
done).
The persistent filter is used only if obtain_device_list_from_udev=0,
which means that the former check for snapshot-origin here had not even
been hit with default configuration for a few years before commit
a843d0d97c66aae1872c05b0f6cf4bda176aae2 (the check for snapshot-origin and
skipping of this LV was introduced with commit a71d6051ed
back in 2010).
The obtain_device_list_from_udev=1 (and hence not using persistent
filter and hence not hitting this check for snapshot-origins and skipping) has been
in action since commit edcda01a1e (that is 2011).
So for 3 years this condition was not even checked with default configuration,
making it superfluous.
This all changed in 2014 with commit 8a843d0d97
where "filter-usable" is introduced and since then all snapshot-origins
have been marked as unusable more often than before and making snapshot-origins
practically unusable in a stack.
This patch removes this incorrect check from commit a71d6051ed
which caused snapshot-origins to be unusable more often recently.
If we want to fix this eventually in a correct way, we need to look
down the stack and if snapshot-origin is hit and there's a blocked
mirror underneath, only then mark the device as unusable. But mirrors
in stack are not supported anymore so it's questionable whether it's
worth spending more time on this at all...
$ lvcreate -l1 -m1 --type mirror vg
Logical volume "lvol0" created.
$ lvconvert --type raid1 vg/lvol0
Before:
$ lvs -a vg
LV VG Active Attr LSize Cpy%Sync Layout Role
lvol0 vg active rwi-a-r--- 4.00m 100.00 raid,raid1 public
[lvol0_mimage_0_rimage_0] vg active iwi-aor--- 4.00m linear private,raid,image
[lvol0_mimage_1_rimage_1] vg active iwi-aor--- 4.00m linear private,raid,image
[lvol0_rmeta_0] vg active ewi-aor--- 4.00m linear private,raid,metadata
[lvol0_rmeta_1] vg active ewi-aor--- 4.00m linear private,raid,metadata
Incorrect name: lvol0_mimage_0_rimage_0
With this patch applied:
$ lvs -a vg
LV VG Active Attr LSize Cpy%Sync Layout Role
lvol0 vg active rwi-a-r--- 4.00m 100.00 raid,raid1 public
[lvol0_rimage_0] vg active iwi-aor--- 4.00m linear private,raid,image
[lvol0_rimage_1] vg active iwi-aor--- 4.00m linear private,raid,image
[lvol0_rmeta_0] vg active ewi-aor--- 4.00m linear private,raid,metadata
[lvol0_rmeta_1] vg active ewi-aor--- 4.00m linear private,raid,metadata
Proper name: lvol0_rimage_0
When mirror has missing PVs and there are mirror images on those missing
PVs, we delete the images and during this delete operation, we also
reactivate the LV. But if we're trying to reactivate the LV in cluster
which is not active and at the same time cmirrord is not running (which
is OK since we may have created the mirror LV as inactive), we end up
with:
"Error locking on node <node_name>: Shared cluster mirrors are not available."
That is because we're trying to activate the mirror LV without cmirrord.
However, there's no need to do this reactivation if the mirror LV (and
hence it's sub LVs) were not activated before.
This issue caused failure in mirror-vgreduce-removemissing.sh test
recently with this sequence (excerpt from the test script):
prepare_lvs_
lvcreate -an -Zn -l2 --type mirror -m1 --nosync -n $lv1 $vg "$dev1" $dev2" "$dev3":$BLOCKS
mimages_are_on_ $lv1 "$dev1" "$dev2"
mirrorlog_is_on_ $lv1 "$dev3"
aux disable_dev "$dev2"
vgreduce --removemissing --force $vg
The important thing about that test is that we're not running cmirrord,
we're activating the mirror with "-an" so it's inactive and then
vgreduce --removemissing tries to reactivate the mirror images
as part of the _delete_lv function call inside and since cmirrord
is not running, we end up with the "Shared cluster mirrors are not
available." error.
When creating cluster mirrors while they're not supposed to be activated
immediately after creation, we don't need to check for cmirrord availability.
We can just create these mirrors and let the check to be done on activation
later on. This is addendum for commit cba6186325.
When creating/activating clustered mirrors, we should have cmirrord
available and running. If it's not, we ended up with rather cryptic
errors like:
$ lvcreate -l1 -m1 --type mirror vg
Error locking on node 1: device-mapper: reload ioctl on failed: Invalid argument
Failed to activate new LV.
$ vgchange -ay vg
Error locking on node node 1: device-mapper: reload ioctl on failed: Invalid argument
This patch adds check for cmirror availability and it errors out
properly, also giving a more precise error messge so users are able
to identify the source of the problem easily:
$ lvcreate -l1 -m1 --type mirror vg
Shared cluster mirrors are not available.
$ vgchange -ay vg
Error locking on node 1: Shared cluster mirrors are not available.
Exclusively activated cluster mirror LVs are OK even without cmirrord:
$ vgchange -aey vg
1 logical volume(s) in volume group "vg" now active
Since GET_FIELD_RESERVED_VALUE always returns a pointer, don't reference
it with "&" when used - we already have that pointer value (this is an
addendum to recent commit 028ff30947).
Only GET_TYPE_RESERVED_VALUE needs to be referenced with "&" as it
returns directly the value of that type.
We have to use empty list, not NULL if we want to denote that the list
has no items. Otherwise, the code further can segfault as it expects
there's always a sane value (= some list), including empty list,
but never NULL.
Use helper macros to handle reserved values and also define "undefined"
reserved value as:
FIELD_RESERVED_VALUE(cache_policy, cache_policy_undef, "", "", "undefined")
Which means:
- print "" if the cache_policy value is undefined (the first name for this reserved value is "")
- recognize "undefined" reserved name as synonym to ""
(so statements like "lvs -S cache_policy=undefined" are still recognized)
Avoid making a copy of the keyword which is already registered in
values.h for "unmanaged" (vg_mda_copies field) and "auto" reserved
value (lv_read_ahead field). Also use helper macros to handle these
reserved - this is the correct approach - just do not copy the same
thing again and do not mix it! The GET_FIELD_RESERVED_VALUE and
GET_FIRST_RESERVED_NAME macros guarantees this - use it!
In addition to that, rename reserved values:
vg_mda_copies --> vg_mda_copies_unmanaged
lv_read_ahead --> lv_read_ahead_auto
So the field reserved values follows this scheme:
"<field_name>_<reserved_value_name>".
The same applies for type reserved values with this scheme:
"<report type name in lowercase>_<reserved_value_name>"
Add a comment about this scheme for others to follow as well
when adding new fields and their reserved values. This makes
it a bit easier to read the code then.
RESERVED(id) --> GET_TYPE_RESERVED_VALUE(id)
FIRST_NAME(id) --> GET_FIRST_RESERVED_NAME(id)
Also add GET_FIELD_RESERVED_VALUE(id) macro to get per-field reserved value.
This makes it much more readable and hopefully it'll make it
easier to use these helper macros when adding new reporting
fields with reserved values if needed.
The cache policy name taken as LV segment property must be duped
for report as the VG/LV/seg structure is destroyed after processing,
reporting happens later:
$ valgrind lvs -o+cache_policy
...
==16589== Invalid read of size 1
==16589== at 0x54ABCC3: dm_report_compact_fields
(libdm-report.c:1739)
==16589== by 0x153FC7: _report (reporter.c:619)
==16589== by 0x1540A6: lvs (reporter.c:641)
==16589== by 0x148021: lvm_run_command (lvmcmdline.c:1452)
==16589== by 0x1495CB: lvm2_main (lvmcmdline.c:1907)
==16589== by 0x164712: main (lvm.c:21)
==16589== Address 0x7d465f2 is 8,338 bytes inside a block of size
16,384 free'd
==16589== at 0x4C2ACE9: free (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==16589== by 0x54B8C85: _free_chunk (pool-fast.c:318)
==16589== by 0x54B84FB: dm_pool_destroy (pool-fast.c:78)
==16589== by 0x1E59C7: _free_vg (vg.c:78)
==16589== by 0x1E5A6D: release_vg (vg.c:95)
==16589== by 0x159B6E: _process_lv_vgnameid_list (toollib.c:1967)
==16589== by 0x159DD7: process_each_lv (toollib.c:2030)
==16589== by 0x153ED8: _report (reporter.c:598)
==16589== by 0x1540A6: lvs (reporter.c:641)
==16589== by 0x148021: lvm_run_command (lvmcmdline.c:1452)
==16589== by 0x1495CB: lvm2_main (lvmcmdline.c:1907)
==16589== by 0x164712: main (lvm.c:21)
We only checked global per-report-type reserved values for compatibility
with selection code. This patch also adds a check for per-report-field
reserved values. This avoids problems where unsupported report type is
used as reserved value which could cause hard to debug problems
otherwise. So this additional check stops from registering unsupported
and unhandled per-field reserved values.
Registerting such unsupported reserved value is a programmatic error,
so report internal error in this case to stop us from making a mistake
here in the future or even today where STR_LIST fields can't have
reserved values yet.
The {pv,vg,lv}display *do* use reporting in case "-C|--columns" is used.
The man page was correct, the recognition for the --binary was missing
in the code though!
The {pv,vg,lv}display commands don't use reporting capabilites and
as such they can't use --binary. This got into the man pages by
mistake - the display commands do not recognize --binary option.
All the LVM commands are run in mode without lvmetad use (since lvmetad
can't handle duplicates). When we're finished with vgimportclone, we
need to notify lvmetad about changes.
Before this patch (/dev/sda and /dev/sdb contains a copy VG called "vg"):
$ vgimportclone --basevgname vg_snap /dev/sdb
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: Activation disabled. No device-mapper interaction will be attempted.
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
Physical volume "/tmp/snap.zcJ8LCmj/vgimport0" changed 1 physical volume changed / 0 physical volumes not changed
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: Activation disabled. No device-mapper interaction will be attempted.
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
Volume group "vg" successfully changed
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
Volume group "vg" successfully renamed to "vg_snap"
Reading all physical volumes. This may take a while...
Found volume group "vg" using metadata type lvm2
Found volume group "fedora" using metadata type lvm2
$ vgs
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 9.50g 0
vg 1 1 0 wz--n- 124.00m 120.00m
(...lvmetad doesn't see the new "vg_snap"!)
With this patch applied:
$ vgimportclone --basevgname vg_snap /dev/sdb
...
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
Volume group "vg" successfully renamed to "vg_snap"
Notifying lvmetad about changes since it was disabled temporarily.
Reading all physical volumes. This may take a while...
Found volume group "vg_snap" using metadata type lvm2
Found volume group "fedora" using metadata type lvm2
Found volume group "vg" using metadata type lvm2
$ vgs
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 9.50g 0
vg 1 1 0 wz--n- 124.00m 120.00m
vg_snap 1 1 0 wz--n- 124.00m 120.00m
The "restart lvmetad before enabling it" message is a bit misleading
here - we should probably suppress this one, but we can't suppress
warning messages selectively at the moment and we don't want to lose
other warning/error messages printed...
With current dumpconfig, we can generate lvm.conf easily - we can merge
current lvm.conf with the config given on cmd line:
lvm dumpconfig --mergedconfig --config "..."
This is a bit simpler than using awk and it also avoids problems when some of
the configuration is missing in existing lvm.conf file and hardcoded defaults
are used instead. The dumpconfig handles this transparently.
Under certain circumstances, the selection code can segfault:
$ vgs --select 'pv_name=~/dev/sda' --unbuffered vg0
VG #PV #LV #SN Attr VSize VFree
vg0 6 3 0 wz--n- 744.00m 588.00m
Segmentation fault (core dumped)
The problem here is the use of --ubuffered together with regex used in
selection criteria. If the report output is not buffered, each row is
discarded as soon as it is reported. The bug is in the use of report
handle's memory - in the example above, what happens is:
1) report handle is initialized together with its memory pool
2) selection tree is initialized from selection criteria string
(using the report handle's memory pool!)
2a) this also means the regex is initialized from report handle's mem pool
3) the object (row) is reported
3a) any memory needed for output is intialized out of report handle's mem pool
3b) selection criteria matching is executed - if the regex is checked the
very first time (for the very first row reported), some more memory
allocation happens as regex allocates internal structures "on-demand",
it's allocating from report handle's mem pool (see also step 2a)
4) the report output is executed
5) the object (row) is discarded, meaning discarding all the mem pool
memory used since step 3.
Now, with step 5) we have discarded the regex internal structures from step 3b.
When we execute reporting for another object (row), we're using the same
selection criteria (step 3b), but tihs is second time we're using the regex
and as such, it's already initialized completely. But the regex is missing the
internal structures now as they got discarded in step 5) from previous
object (row) reporting (because we're using "unbuffered" reporting).
To resolve this issue and to prevent any similar future issues where each
object/row memory is discarded after output (the unbuffered reporting) while
selection tree is global for all the object/rows, use separate memory pool
for report's selection.
This patch replaces "struct selection_node *selection_root" in struct
dm_report with new struct selection which contains both "selection_root"
and "mem" for separate mem pool used for selection.
We can change struct dm_report this way as it is not exposed via libdevmapper.
(This patch will have even more meaning for upcoming patches where selection
is used even for non-reporting commands where "internal" reporting and
selection criteria matching happens and where the internal reporting is
not buffered.)
Fix incorrect test in configure which sets --enable-udev-systemd-background-jobs
automatically if proper systemd version is available.
The UDEV_SYSTEMD_BACKGROUND_JOBS variable was not properly set to "yes" in
case systemd is available and we had "maybe" for this variable before.
When we split leg from raid - we take a proper new lock for a new LV.
However for now activation checks only 'existince' of device UUID,
but it's not validating device has a proper name.
As a quick fix call suspend()/resume() to rename after split mirror.
Add new dm_report_compact_fields function to cause report outout
(dm_report_output) to ignore fields which don't have any value set
in any of the rows reported. This provides support for compact report
output where only fields which have something to report are displayed.
The dm_report_set_output_selection was not implemented in the end -
we have dm_report_init_with_selection instead. This is just a remnant
from development code that got into libdevmapper.h by mistake.
Free (and clear) h.protocol string on daemon_open() error paths
so it's OK for caller to skip calling daemon_close() if returned
h.socket_fd is -1.
Close h.socket_fd in daemon_close() to avoid possible leak.
https://bugzilla.redhat.com/1164234
The call to dm_config_destroy can derefence result->mem
while result is still NULL:
struct dm_config_tree *get_cachepolicy_params(struct cmd_context *cmd)
{
...
int ok = 0;
...
if (!(result = dm_config_flatten(current)))
goto_out;
...
ok = 1;
out:
if (!ok) {
dm_config_destroy(result)
...
}
...
}
Just call return 0 directly on error path, without using
"goto" - the code is short, no need to use it this way
(the dead code appeared as part of further changes in this
function).
When chunk size needs to be estimated, the code missed to round
to proper 64kb boundaries (or power of 2 for older thin pool driver).
So for some data and metadata size (i.e. 10GB and 4MB) it resulted
in incorrect chunk size (not being a multiple of 64KB)
Fix it by adding proper rounding and also use 1 routine for 2 places
where the same calculation is made.
Fix also incorrect printed warning that has used 'ffs()'
(which returns first 'least significant' bit in word)
and it was not really giving any useful size info and replace it
with properly estimated chunk size.
Use lvm2 standard TARGETS.
Make liblvm_python.c as intermediate target (gets deleted after use)
Properly delete build dir on make distclean.
Mark install_python_bindings as .PHONY.
Fix regression introduced with a2c1024f6a
_setup_task(mknodes ? name : NULL...
has been replaced with:
_setup_task(type != MKNODES ? name : NULL....
Use '=='
Commit d2c116058e introduced regression
with CLVMD_PATH.
+ CLVMD_PATH="$clvmd_prefix/sbin/clvmd"
test "$prefix" != NONE && clvmd_prefix=$prefix
It has set CLVMD_PATH before clvmd_prefix got its final value.
Move it one line below.
The order of the resulting tree is based on the first appearance of
sections. With no section repeats, the sections stay as listed in the
config file. Sections using the brace syntax 'section { key = value }' are
treated the same way: 'section { x = 1 } section { y = 2 }' is the same as
'section/x = 1 section/y = 2' is the same as 'section { x = 1 y = 2 }'
Make sure there is 'control' node before clvmd is started.
Somehow 'clvmd' is not allowed by selinux to create one.
TODO: Check is selinux policy is right here...
In case someone wants to use @some_path@ that contains
prefix or exec_prefix and it's used in other files than Makefiles
(Makefiles used make.tmpl to expand these).
systemd-run is available in systemd>=205. Also, this fix prevents
systemd-specific udev rules in 69-dm-lvm-metad.rules to appear in
case systemd environment is not available - make configure to check
this automatically and use these systemd specific rules only if it
is applicable.
- closer to the recommendation of man-pages (7) if possible
- Add crossrefs
- Sort options and crossrefs
- Fix default timeout (60 secs) of -t
- Documents -I[auto]
Signed-off-by: Stéphane Aulery <saulery@free.fr>
- Closer to the recommendation of man-pages and groff_man (7) if
possible
- Sort options and crossrefs
- Relocate sub-options on the right places
Signed-off-by: Stéphane Aulery <saulery@free.fr>
ignore_vg now returns 0 for the FAILED_CLUSTERED case,
so all the ignore_vg 1 cases will return vg's with an
empty vg->pvs, so we do not need to iterate through
vg->pvs to remove the entries from the devices list.
Clean up whitespace problems in that area from the
previous commit.
- Fix problems with recent changes related to skipping in:
. _process_vgnameid_list
. _process_pvs_in_vgs
- Undo unnecessary changes to the code structure and readability.
- Preserve valid but minor changes:
. testing FAILED bit values in ignore_vg
. using "skip" value from ignore_vg instead of "ret" value
. applying the sigint check to the start of all loops
. setting stack backtrace when ECMD_PROCESSED is not returned,
i.e. apply the following pattern:
ret = process_foo();
if (ret != ECMD_PROCESSED)
stack;
if (ret > ret_max)
ret_max = ret;
Extend/fix d8923457b8 commit.
'skip'-ed VG is not holding any lock - so don't unlock such VG.
At the same time simplify the code around and relase VG at a single
place and unlock only not skiped and not ignored VGs.
Use log_warn when we are effectively not creating an error -
we 'allowed' inconsistent read for a reason - so it's just warning
level we process inconsistent VG - it's upto caller later to decide
error level of command return value and in case of error it needs
to use log_error then.
Rework ignore_vg() API so it properly handles
multiple kind of vg_read_error() states.
Skip processing only otherwise valid VG.
Always return ECMD_FAILED when break is detected.
Check sigint_caught() in front of dm iterator loop.
Add stack for _process failing ret codes.
Failed recovery provides different (NULL) VG then FAILED_INCONSISTENT.
Mark it with different failure bit - since FAILED_INCONSISTENT is
supposed to contain something 'usable' (thought inconsistent).
Move common code into shared internal fn so the logic for getting the
LV info as well LV segment status is not scattered around - call common
_do_info_and_status to gather required parts in reporting handlers.
- Add separate lv_status fn (if we're interested only in seg status,
but not lv info at the same time as it is with existing
lv_info_with_seg_status fn). So we 3 fns:
- lv_info (existing one, runs only info ioctl, fills in struct lvinfo only)
- lv_status (new one, runs status ioctl, fills in struct lv_seg_status only)
- lv_info_with_seg_status (existing one, runs status ioctl, fills
in struct lvinfo as well as lv_seg_status)
- Add more comments in the code explaining the difference between lv_info,
lv_status and lv_info_with_seg_status and their return values.
- Move decision whether lv_info_with_seg_status needs to call only
status ioctl (in case the segment for which we require status is from
the LV for which we require info) or separate status and info ioctl
(in case the segment for which we require status is from different
LV that the one for which we require info) into
lv_info_with_seg_status fn so caller doesn't need to bother about
this at all.
- Cleanup internal interface for this seg status so it's more readable.
Since we support device stack of pools over pool
(thin-pool with cache data volume) the existing code
is no longer able to detect orphan _pmspare.
So instead do a _pmspare check after volume removal,
and remove spare afterwards.
We need to stop guessing deleted names - so rather collect
deleted UUID into a string list - and then remove them properly
in _clean_tree. Restore origin _clean_tree behaviour them for
currently unconverted removal of snapshots.
Pending delete feature now properly tracks whole subtree of cache
(so i.e. data or metadata as raid volumes).
It properly replaces all related volumes with 'errors' in suspend
preload, then resume them as error and remove collected UUIDs
from root - since they are not longer part of any volume deps.
This would be in case the pool segment was not found.
LVM2.2.02.112/lib/metadata/pool_manip.c:238:36: warning: Access to field 'segtype' results in a dereference of a null pointer (loaded from variable 'pool_seg')
LVM2.2.02.112/lib/metadata/cache_manip.c:73: overflow_before_widen: Potentially overflowing expression "*pool_metadata_extents *vg->extent_size" with type "unsigned int" (32 bits, unsigned) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type "uint64_t" (64 bits, unsigned).
LVM2.2.02.112/lib/activate/dev_manager.c:217: overflow_before_widen: Potentially overflowing expression "seg_status->seg->len * extent_size" with type "unsigned int" (32 bits, unsigned) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type "uint64_t" (64 bits, unsigned).
LVM2.2.02.112/lib/activate/dev_manager.c:217: overflow_before_widen: Potentially overflowing expression "seg_status->seg->le * extent_size" with type "unsigned int" (32 bits, unsigned) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type "uint64_t" (64 bits, unsigned).
LVM2.2.02.112/lib/activate/dev_manager.c:196:5: warning: 'dmtask' may be used uninitialized in this function [-Wmaybe-uninitialized]
In _info_run fn:
switch (type) {
case INFO:
...
case STATUS:
...
case MKNODES:
...
}
The "type" is enum and currently only those three types are supported,
but if we added a new type in the future, this would end up with a bug
(if we forgot to add the new "case" in that "switch"). So let's make
sure proper internal error is printed:
default:
log_error(INTERNAL_ERROR "_info_run: unhandled info type");
return 0;
LVM2.2.02.112/daemons/clvmd/clvmd.c:1131: warning[arrayIndexOutOfBoundsCond]: Array 'row[8]' accessed at index 8, which is out of bounds. Otherwise condition 'j==8' is redundant.
This code:
int i,j = 0;
...
for (i = 0; i < len; ++i) {
...
if ((j == 8) || (i + 1 == len)) {
for (;j < 8; ++j) {
...
}
...
j = 0;
}
}
Indeed - j is 0 at the beginning, then iterating till j < 8,
then always zeroed at the end of the outer loop - so "j" never
reaching value of 8 - the j == 8 condition is redundant.
LVM2.2.02.112/tools/toollib.c:1991: leaked_storage: Variable "iter" going out of scope leaks the storage it points to.
LVM2.2.02.112/lib/filters/filter-usable.c:89: leaked_storage: Variable "f" going out of scope leaks the storage it points to.
LVM2.2.02.112/lib/activate/dev_manager.c:1874: leaked_handle: Handle variable "fd" going out of scope leaks the handle.
When getting status for LV segment types, we need to be sure
that proper segment is selected for the status ioctl.
When reporting fields that require status ioctl,
the "_choose_lv_segment_for_status_report" fn in tools/reporter.c
must be completed properly to choose the proper segment for all
the LV types (at the moment, it just takes the first LV segment
by default).
This works fine with cache LVs surely. The other segment types
need more auditing. We use this status ioctl only for cache status
fields at the moment only, so restrict it to the cache only.
Once the _choose_lv_segment_for_status_report is completed
properly, release the restriction in _get_segment_status_from_target_params.
Similar to LVSINFO type which gathers LV + its DM_DEVICE_INFO, the
new LVSSTATUS/SEGSSTATUS report type will gather LV/segment + its
DM_DEVICE_STATUS.
Since we can report status only for certain segment, in case
of LVSSTATUS we need to choose which segment related to the LV
should be processed that represents the "LV status". In case of
SEGSSTATUS type it's clear - the status is reported for the
segment just processed.
The former struct lv_with_info is renamed to lv_with_info_and_seg_status as it can
hold more than just "info", there's lv's segment status now in addition:
struct lv_with_info_and_seg_status {
struct logical_volume *lv;
struct lvinfo *info;
struct lv_seg_status *seg_status;
}
Where struct lv_seg_status is:
struct lv_seg_status {
struct dm_pool *mem;
struct lv_segment lv_seg;
lv_seg_status_type_t type;
void *status; /* struct dm_status_* */
}
Where lv_seg points to lv's segment that is being reported or
processed in general.
New struct lv_seg_status keeps the information about segment status -
the status retrieved via DM_DEVICE_STATUS ioctl. This information will
be used for reporting dm device target status for the LV segment
specified.
So this patch introduces third level of LV information that is
kept for reuse while reporting fields within one reporting line,
causing only one DM_DEVICE_STATUS ioctl call per LV segment line
reported (otherwise we'd need to call the DM_DEVICE_STATUS for each
segment status field in one LV segment/reporting line which is not
efficient).
This is following exactly the same principle as already introduced
by commit ecb2be5d16.
So currently we have three levels of information that can be used
to report an LV/LV segment:
- LV metadata itself (struct logical_volume *lv)
- LV's DM_DEVICE_INFO ioctl result (struct lvinfo *info)
- LV's segment DM_DEVICE_STATUS ioctl result (this status must be
bound to a segment, not the whole LV as the whole LV may be
composed of several segments of course)
(this is the new struct lv_seg_status *seg_status)
Do not use 'any' policy name as a value in config tree - so we stick
with 'policy_settings' and extra 'policy_name' for libdm params.
Update lvm2 API as well.
Example of supported metadata:
policy = "mq"
policy_settings {
migration_threshold = 2048
sequential_threshold = 512
random_threshold = 4
read_promote_adjustment = 10
}
Support new PASSTHROUGH 'feature' flag.
Add dm_config_node to pass in policy args.
Really use origin_uuid instead of using extra call
to pass seg_areas.
Switch to 64bit feature flag bit set so there is
enough space in future for new bits...
More efficient spare volume creation. Save 1 extra commit
and properly activate this volume according to our cluster
activation rules (using lv_active_change() for this).
Since we 'layer' for cache origin which and we support dropping
cache layer - we need to restore origin name in case
the origin LV is more complex target - i.e. raid.
Drop _corig from name
Cleanup and rename parent -> parent_lv.
Revert part of commit 51a29e6056,
it's probably bad idea to continue with any recovery, when
vg_write() or vg_commit() fail - so it's better to leave it as it is.
When deactivating origin, we may have possibly left table in broken state,
where origin is not active, but snapshot volume is still present.
Let's ensure deactivation of origin detects also all associated
snapshots are inactive - otherwise do not skip deactivation.
(so i.e. 'vgchange -an' would detect errors)
Let's use this function for more activations in the code.
'needs_exlusive' will enforce exlusive type for any given LV.
We may want to activate LV in exlusive mode, even when we know
the LV (as is) supports non-exlusive activation as well.
lvcreate -ay -> exclusive & local
lvcreate -aay -> exclusive & local
lvcreate -aly -> exclusive & local
lvcreate -aey -> exclusive (might be on any node).
LVSINFO is just a subtype of LVS report type with extra "info" ioctl
called for each LV reported (per output line) so include its processing
within "case LVS" switch, not as completely different kind of reporting
which may be misleading when reading the code.
There's already the "lv_info_needed" flag set in the _report fn, so
call the approriate reporting function based on this flag within the
"case LVS" switch line.
Actually the same is already done for LV is reported per segments
within the "case SEGS" switch line. So this patch makes the code more
consistent so it's processed the same way for all cases.
Also, this is a preparation for another and new subtype that will
be introduced later - the "LVSSTATUS" and "SEGSSTATUS" report type.
When responding to DM_EVENT_CMD_GET_REGISTERED_DEVICE no longer
ignore threads that have already been unregistered but which
are still present.
This means the caller can unregister a device and poll dmeventd
to ensure the monitoring thread has gone away before removing
the device. If a device was registered and unregistered in quick
succession and then removed, WAITEVENT could run in parallel with
the REMOVE.
Threads are moved to the _thread_registry_unused list when they
are unregistered.
Activate of new/unused/empty thin pool volume skips
the 'overlay' part and directly provides 'visible' thin-pool LV to the user.
Such thin pool still gets 'private' -tpool UUID suffix for easier
udev detection of protected lvm2 devices, and also gets udev flags to
avoid any scan.
Such pool device is 'public' LV with regular /dev/vgname/poolname link,
but it's still 'udev' hidden device for any other use.
To display proper active state we need to do few explicit tests
for this condition.
Before it's used for any lvm2 thin volume, deactivation is
now needed to avoid any 'race' with external usage.
Call check_new_thin_pool() to detect in-use thin-pool.
Save extra reactivation of thin-pool when thin pool is not active.
(it's now a bit more expensive to invoke thin_check for new pools.)
For new pools:
We now active locally exclusively thin-pool as 'public' LV.
Validate transaction_id is till 0.
Deactive.
Prepare create message for thin-pool and exclusively active pool.
Active new thin LV.
And deactivate thin pool if it used to be inactive.
Allowing 'external' use of thin-pools requires to validate even
so far 'unused' new thin pools.
Later we may have 'smarter' way to resolve which thin-pools are
owned by lvm2 and which are external.
When transaction_id is set 0 for thin-pool, libdm avoids validation
of thin-pool, unless there are real messages to be send to thin-pool.
This relaxes strict policy which always required to know
in front transaction_id for the kernel target.
It now allows to activate thin-pool with any transaction_id
(when transaction_id is passed in)
It is now upto application to validate transaction_id from life
thin-pool volume with transaction_id within it's own metadata.
Show some stats with 'lvs'
Display same info for active cache volume and cache-pool.
data% - #used cache blocks/#total cache blocks
meta% - #used metadata blocks/#total metadata blocks
copy% - #dirty/#used cache blocks
TODO: maybe there is a better mapping
- should be seen as first-try-and-see.
When the cache pool is unused, lvm2 code will internally
allow to activate such cache-pool.
Cache-pool is activate as metadata LV, so lvm2 could easily
wipe such volume before cache-pool is reused.
Replace lv_cache_block_info() and lv_cache_policy_info()
with lv_cache_status() which directly returns
dm_status_cache structure together with some calculated
values.
After use mem pool stored inside lv_status_cache structure
needs to be destroyed.
Add init of no_open_count into _setup_task().
Report problem as warning (cannot happen anyway).
Also drop some duplicated debug messages - we have already
printed the info about operation so make log a bit shorter.
Tool will use internal activation of unused cache pool to
clear metadata area before next use of cache-pool.
So allow to deactivation unused pool in case some error
case happend and we were not able to deactivation pool
right after metadata wipe.
Add API call to calculate extents from percentage value.
Size is based in DM_PERCENT_1 units.
(Supporting decimal point number).
This commit is preparing functionality for more global
usage of % with i.e. --size option.
New size_mb_arg_with_percent is able to read size_mb_arg
but also it's able to read % values.
Percent parsing is share with int_arg_with_sign_and_percent.
If root has locales with different decimal point then '.'
(i.e. Czech with ',') lets be tolerant and retry with
"C" locales in the case '.' is found during parse of number.
Locales are then restored back.
Support compile type configurable defaults for creation
of sparse volumes.
By default now create 'thin-pools' for sparse volumes.
Use the global/sparse_segtype_default to switch back to old
snapshots if needed.
Apply the same compile logic for newly introduces mirror/raid1 options.
Some values are reserved for special purpose like 'undefined', 'unmanaged' etc.
When using >, <, >= and < comparison operators where the range is considered,
do not include reserved values as proper values in this range which
would otherwise result in not so obvious criteria match (as the reserved value is
actually transparent for the user). It's incorrect.
Example scenario:
$ vgs -o vg_name,vg_mda_copies vg1 vg2
VG #VMdaCps
vg1 1
vg2 unmanaged
The "unmanaged" is actually mapped onto reserved value
18446744073709551615 (2^64 - 1) internally.
Such reseved value is already caught on selection criteria input
properly:
$ vgs -o name,vg_mda_copies vg1 vg2 -S 'vg_mda_copies=18446744073709551615'
Numeric value 18446744073709551615 found in selection is reserved.
However, we still need to fix situaton where the reserved value may be
included in resulting range:
Before this patch:
$ vgs -o vg_name,vg_mda_copies vg1 vg2 -S 'vg_mda_copies >= 1'
VG #VMdaCps
vg1 1
vg2 unmanaged
With this patch applied:
$ vgs -o vg_name,vg_mda_copies vg1 vg2 -S 'vg_mda_copies >= 1'
VG #VMdaCps
vg1 1
From the examples above, we can see that without this patch applied,
the vg_mda_copies >= 1 also matched the reserved value 18446744073709551615
(which is represented by the "unamanged" string on report). When
applying the operators, such values must be skipped! They're meant to
be matched only against their string representation only, e.g.:
$ vgs -o name,vg_mda_copies vg1 vg2 -S 'vg_mda_copies=unmanaged'
VG #VMdaCps
vg2 unmanaged
...or any synonyms:
$ vgs -o name,vg_mda_copies vg1 vg2 -S 'vg_mda_copies=undefined'
VG #VMdaCps
vg2 unmanaged
Unlike with thin-pool - with cache we support all args also
directly when create cache volume.
So the result of 'separate' cache-pool creation and setting its
options should give same result as specifying those args
during cache creation.
Cache-pool values are used as defaults if the params are
not specified with cache creation.
Move code for creation of thin volume into a single place
out of lv_extend(). This allows to drop extra pool arg
for alloc_lv_segment() && lv_extend() and makes code
more easier to read and follow.
When we create volumes with chunk size bigger then extent size
we try to round up to some nearest chunk boundary.
Until now we did this for thins - use same logic for
cache volumes.
Fixed syntax parsing means that some commands that used to work are now
failing. Particullary this case:
$ invalid lvcreate -l1 --type thin vg/pool
> Needs to fail becase thin type LV needs --virtualsize
$ invalid lvcreate --type snapshot vg/lv1
> Needs to fail because old-snapshot segment type needs --size
Some reported error messages have been also updated.
Use segment flags to avoid zeroing of cache, cache pool
snapshot and thin pool segments.
We never want to zero these segment types.
Note:
Snapshot COW and Cache origin are created as stripes
thus are then properly zeroed.
Let the finaly state of zero & wipe_signature to be
resolved later together with all the types.
Don't play with zero assigment and segtype flag
(i.e. thin-pool -Z has different meaning).
Check if the passed options do allow requested zeroing/wiping.
lvcreate without -Z or -W will fallback to warning if the device
cannot be zeroed, however if user requested them explicitely
it will give user error.
Refactor lvcreate code.
Prefer to use arg_outside_list_is_set() so we get automatic 'white-list'
validation of supported options with different segment types.
Drop used lp->cache, lp->cache and use seg_is_cache(), seg_is_thin()
Draw clear border where is the last moment we could change create
segment type.
When segment type is given with --type - do not allow it to be changed
later.
Put together tests related to individual segment types.
Finish cache conversion at proper part of lv_manip code after
the vg_metadata are written - so we could correcly clean-up created
stripe LV for cache volume.
Move test for size of new LV names in front before
any creation of LV.
Properly check striped segtype kernel presence,
since passed 'segtype' is already tested.
Keep deactivation error path local to wiping part of the function.
Create metadata with temporary flag (it's activated, zeroed
and deactivated).
Introduce new option to specify pool data size.
This will be user to create i.e. cache & cachepool at once.
And possible for thin external origin snapshot.
This is only very basic patch to enable options, the
real working code will come later.
We want to print smarter warning message only when
the zeroing was not provided on the first zeroable segment
of newly created LV.
Put warning within _should_wipe_lv function to avoid reevaluation
of same conditions twice.
Hide creation of temporary LVs and print them only in verbose mode.
e.g. hides confusing message about creation of _pmspare
device during creation of pool.
Use new libdm macro DM_LIST_HEAD_INIT().
Embeded 'free' segment type (so it's not needed in the list)
Drop assignments of 0,NULL since they are defaults.
Ask for lock the proper LV.
Use the top-most LV to query for locally exclusive lock.
The rest of operations are then using 'lv_info()'
TODO:
Check all devices are reloaded from proper level.
In general any query on lv_is_active is supposed to be running
ona lv_lock_holder() volume.
Now we reference segment name via lvseg_name() and
we can drop default implementation and leave its
function pointer to be NULL.
Default give us 'return seg->segtype->name'.
Instead of segtype->ops->name() introduce lvseg_name().
This also allows us to leave name() function 'empty' for default
return of segtype->name.
TODO: add functions for rest of ops->
There was a bug in value and their synonym definition for these two fields
causing selections on these fields to not work correctly - nothing matched
against vg/lv_permissions fields even if selection criteria should have
matched.
Scenario:
$ lvs -o name,lv_permissions vg
LV LPerms
lvol0 read-only
lvol1 writeable
Before this patch:
$ lvs -o name,lv_permissions vg -S 'permissions=read-only'
(blank)
$ lvs -o name,lv_permissions vg -S 'permissions=writeable
(blank)
With this patch applied:
$ lvs -o name,lv_permissions vg -S 'permissions=read-only'
LV LPerms
lvol0 read-only
$ lvs -o name,lv_permissions vg -S 'permissions=writeable'
LV LPerms
lvol1 writeable
Also synonyms match correctly now:
$ lvs -o name,lv_permissions vg -S 'permissions=rw'
LV LPerms
lvol1 writeable
Fix lvm_split that is called when cmd line string is separated into
argv fields to recognize quote chars ('\'" and '"') properly and
when these quotes are used, consider the text within quotes as one
argument, do not separate it based on space characters inside.
The lvm_split is used during processing lvm shell command line or
when calling lvm commands through cmdlib (e.g. dmeventd plugins).
For example, the lvm shell scenario:
Before this patch:
$lvm
lvm> lvs --config 'global{ suffix=0 }'
Parse error at byte 9 (line 1): unexpected token
Failed to set overridden configuration entries.
With this patch applied:
$lvm
lvm> lvs --config 'global{ suffix=0 }'
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
root fedora -wi-ao---- 9.00g
swap fedora -wi-ao---- 512.00m
(Exactly the same problem is hit when calling LVM commands with
quoted arguments via lvm2cmd lib in dmeventd plugins.)
Because of the recent change to process_each_pv(), the vg is always
provided when the pv is in a vg. is_pv(pv) means the pv is in a vg,
which means that the vg arg will not be NULL, which means the removed
block of code is not needed.
Bug https://bugzilla.redhat.com/show_bug.cgi?id=843587 is handled better
now - the hang does not occur anymore. There are still error messages
issued though during shutdown if someone stops lvm2-lvmetad.service
manually that lvm2-monitor.service depends on (even during shutdown).
These errors are correct though and will point to incorrect
configuration (still having use_lvmetad=1 in lvm.conf and stopping
lvm2-lvmetad.service manually).
The workaround to prevent the hang is not needed now. So the
'--config "global{use_lvmetad=0}"' is now removed from the
lvm2-monitor.service's ExecStop line.
We can't hang on blocked or suspended devices when the scan is done
for lvmetad update - when the device gets unblocked or resumed, there's
always CHANGE event generated which will fire the udev rule to run
extra pvscan --cache for that device which makes sure that lvmetad
is up-to-date.
When we are given an existing LV name - it needs to be allowed
to pass in even restricted name as the LV could have existed
long before we introduced some new restriction on prefix/suffix.i
Fix the regression on name limits and drop restriction to be applied
on any existing LVs - only the new created LV names have to be
complient with current name restrictions.
FIXME: we are currently using restricted names incorrectly in few
other places - device_is_usable() skips restricted names,
and udev flags are also incorrectly set for restricted names
so these LVs are not getting links properly.
find_pv_in_vg fn iterates over the list of PVs covered by the VG and
each PV's pvl->pv->dev is compared with device acquired from device
cache. However, in case pvl->pv->dev is NULL as well as device cache
returns NULL (e.g. when device is filtered), we'll get incorrect match
and the code calling find_pv_in_vg uses incorrect PV (as it thinks
it's the exact PV with the pv_name). The INTERNAL_ERROR covers this
situation and errors out immediately.
The warnings arg was used to enable logging of warnings
when reading a PV. This arg is turned into a set of flags
with the WARN_PV_READ flag matching the existing behavior.
A new flag WARN_INCONSISTENT is added that will cause
vg_read_internal() to log the "VG is not consistent"
warning so the various callers do not need to log
this warning themselves.
A new vg_read flag READ_WARN_INCONSISTENT is used from
reporting to enable the WARN_INCONSISTENT flag in
vg_read_internal.
[Committed by agk with cosmetic changes and tweaks.]
Process PVs by iterating through VGs, then iterating through
devices if the command needs to process non-PV devices.
The process_single function can always use the VG and PV args.
[Committed by agk with cosmetic changes and tweaks.]
lvcreate of thin pools can now use '-n lv vg' like other lv types,
or it can name the new thin pool in the free arg as 'vg/lv', which
is not allowed with other lv types.
Introduce pool function for validation of chunk size.
It's good idea to be able to reject invalid chunk size
when entered on command line before we open VG.
Move code to better locations.
Improve test and remove invalid ones
(i.e. no reason to require cache size to be >= then origin).
Correctly comment where the code is doing actual conversion
of other existing volume - we do already a similar thing with
external origins.
Lots of new command line options and combinations is now supported.
Hopefully older syntax still works as well.
lvcreate --cache --cachepool vg/pool -l1
lvcreate --type cache --cachepool vg/pool -l1
lvcreate --type cache-pool vg/pool -l1
lvcreate --type cache-pool --name pool vg -l1
... and many many more ...
Since _pmspare is internal volume move it to
lv_remove_single - so it's automatically removed with
last remove thin-pool.
lv_remove_with_dependencies() is not always used for pool removal.
--splitcache
Splits only cached LV (also pool could be specified).
Detaches cachepool from cached LV.
--split
Should be univerzal command to split various complex targets.
At this moment it knows cache.
--uncache
Opposite command to --cache. Detaches and DELETES cachepool for
cached LV.
Note: we support thin pool cached metadata device for uncaching.
Also use may specify wither cached LV or association cachepool device
to request split of cache.
Over the time lvcreate code has accumulated various hacks.
So try to move that code in right places.
Detect all types early in _lvcreate_params() so functions like
_read_size_params() do not need to change volume types.
Also ultimately respect give volume --type, that its shortcut
(-T, H, -m, -s) and after that options which do type estimation.
(i.e. --cachepool, --thinpool)
Avoid repeative tests - if we know all types are decode at once
place we can 'optimize' number of validations.
Copy the same form as the new process_each_vg.
Replace unused struct cmd_vg and cmd_vg_read() replicator
code with struct vg and vg_read() directly.
The failed_lvnames arg is no longer used since the
cmd_vg replicator wrapper was removed.
[Committed by agk with cosmetic changes and tweaks.]
Split VG argument collection from processing.
This allows the two different loops through VGs to
be replaced by a single loop.
Replace unused struct cmd_vg and cmd_vg_read() replicator
code with struct vg and vg_read() directly.
[Committed by agk with cosmetic changes and tweaks.]
The cache mode of a new cache pool is always explicitly
included in the vg metadata. If a cache mode is not
specified on the command line, the cache mode is taken
from lvm.conf allocation/cache_pool_cachemode, which
defaults to "writethrough".
The cache mode can be displayed with lvs -o+cachemode.
filters/filter-usable.c:22: warning: "ucp.check_..." may be used uninitialized in this function
This can't actually be hit in real, but let's clean this up for the compiler
to be happy again.
There are actually three filter chains if lvmetad is used:
- cmd->lvmetad_filter used when when scanning devices for lvmetad
- cmd->filter used when processing lvmetad responses
- cmd->full_fiilter (which is just cmd->lvmetad_filter + cmd->filter chained together) used
for remaining situations
This patch adds the third one - "cmd->full_filter" - currently this is
used if device processing does not fall into any of the groups before,
for example, devices which does not have the PV label yet and we're just
creating a new one or we're processing the devices where the list of the
devices (PVs) is not returned by lvmetad initially.
Currently, the cmd->full_filter is used exactly in these functions:
- lvmcache_label_scan
- _pvcreate_check
- pvcreate_vol
- lvmdiskscan
- pvscan
- _process_each_label
If lvmetad is used, then simply cmd->full_filter == cmd->filter because
cmd->lvmetad_filter is NULL in this case.
The ENABLE_ALL_DEVS flag is added to the command structure
for commands that should process all devs (pvs and non-pvs)
when they call process_each_pv and the command includes the
--all arg. This will be used in a later process_each_pv patch.
The ALL_VGS_IS_DEFAULT flag is added to the command structure
for commands that should process all vgs when they call
process_each_vg or process_each_lv with no args.
This will be used in later patches to process_each functions.
We need to use proper filter chain when we disable lvmetad use
explicitly in the code by calling lvmetad_set_active(0) while
overriding existing configuration. We need to reinitialize filters
in this case so proper filter chain is used. The same applies
for the other way round - when we enable lvmetad use explicitly in
the code (though this is not yet used).
With this change, the filter chains used look like this now:
A) When *lvmetad is not used*:
- persistent filter -> regex filter -> sysfs filter ->
global regex filter -> type filter ->
usable device filter(FILTER_MODE_NO_LVMETAD) ->
mpath component filter -> partitioned filter ->
md component filter
B) When *lvmetad is used* (two separate filter chains):
- the lvmetad filter chain used when scanning devs for lvmetad update:
sysfs filter -> global regex filter -> type filter ->
usable device filter(FILTER_MODE_PRE_LVMETAD) ->
mpath component filter -> partitioned filter ->
md component filter
- the filter chain used for lvmetad responses:
persistent filter -> usable device filter(FILTER_MODE_POST_LVMETAD) ->
regex filter
Usable device filter is responsible for filtering out unusable DM devices.
The filter has 3 modes of operation:
- FILTER_MODE_NO_LVMETAD:
When this mode is used, we check DM device usability by looking:
- whether device is empty
- whether device is blocked
- whether device is suspended (only on devices/ignore_suspended_devices=1)
- whether device uses an error target
- whether device name/uuid is reserved
- FILTER_MODE_PRE_LVMETAD:
When this mode is used, we check DM device usability by looking:
- whether device is empty
- whether device is suspended (only on devices/ignore_suspended_devices=1)
- whether device uses an error target
- whether device name/uuid is reserved
- FILTER_MODE_POST_LVMETAD:
When this mode is used, we check DM device usability by looking:
- whether device is blocked
- whether device is suspended (only on devices/ignore_suspended_devices=1)
These modes will be used by subsequent patch to create different
instances of this filter, depending on lvmetad use.
Currently, there are 5 things that device_is_usable function checks
(for DM devices only, of course):
- is device empty?
- is device blocked? (mirror)
- is device suspended?
- is device composed of an error target?
- is device name/uuid reserved?
If answer to any of these questions is "yes", then the device is not usable.
This patch just adds possibility to choose what to check for exactly - the
device_is_usable function now accepts struct dev_usable_check_params make
this selection possible. This is going to be used by subsequent patches.
When compiled with valgrind pool support - don't waste time
with preallocation of memory - it just waste of CPU cycles to
trace access to this memory.
We also may get slightly better estimation about real memory usage
during command processing.
Split internals of extract_vgname into _extract_vgname.
This common code will be used for other similar function.
Reuse skip_dev_dir() instead of less mature coded to skip
device dir.
Instead of duplicating full vg/lv name - allocate string
only vg portion of lv name.
Previously, this was the recommended form for creating a thin pool:
lvconvert --thinpool VG/ThinDataLV --poolmetadata VG/ThinMetaLV
but this is confusing, because --thinpool does not actually take
an arg, and is more naturally used to specify an existing thin pool.
The new recommended form is:
lvconvert --type thin-pool --poolmetadata VG/ThinMetaLV VG/ThinDataLV
Previously, this was the recommended form for creating a cache pool:
lvconvert --cachepool VG/CacheDataLV --poolmetadata VG/CacheMetaLV
but this is confusing, because --cachepool does not actually take
an arg, and is more natually used to specify an existing cache pool.
The new recommended form is:
lvconvert --type cache-pool --poolmetadata VG/CacheMetaLV VG/CacheDataLV
We are not using already defined segement type names where we could.
There is a lot of other places in device-mapper and LVM2 we have those
hardcoded so we should better finally have a common interface in
libdevmapper to avoid this.
Use of lv_info() internally in lv_check_not_in_use(),
so it always could use with_open_count properly.
Skip sysfs() testing in open_count == 0 case.
Accept just 'lv' pointer like other functions.
The function has 'built-in' lv_is_active_locally check,
which however is not what we need to check in many place.
For now at least remotely active snapshot merge is
detected and for this case merge on next activation is scheduled.
Move check for snapshot-merge support before archiving.
Split code on 2 paths - with merge_on_activate
using vg_write & vg_commit
and lv_update_reload call for instant merging.
Move printing after backup.
Before leaving _activate_lvs_in_vg() wait till devices
are active - so we do not print message about active
devices earlier then it really happens for a user.
More validations before any thin or cache related conversion begins.
We allow to use and stack:
pool data: cache or raid
pool metadata: raid
pool: linear, striped
cache: linear, striped, raid
thin(extorig): linear, origin, cow, virtual, thin
We use adjusted_mirror_region_size() in two different contexts.
Either on command line -
here we do want to inform user about reduction of size.
Or in pvmove activation context -
here we should only use 'verbose' info.
When requesting to reload an LV imrove this API to
automatically reload its lock holding LV as in cluster
only top-level LVs are addressable with lock.
When vg_ondisk is NULL we do not need to search
through the whole VG to find out the same LV.
NOTE: as of now - VG locking is not enabled as some code parts
are breaking memory locking rules (lvm2app).
Once we enforce VG locking for read-only commands the effect
will be much better for larger VGs.
Do not let fly metadata with just 'minor' set
(since they would not be readable on older version)
Be permissive with invalid major/minor number and
just report them as problem, but allow to use
such metadata with default major:minor.
Use nice instruction_HLT macro
Use log_debug_mem()
Don't actually log things after we prohibit 'mmap'.
Move initialization of strerror & udev before blocking mmap.
We do not need to restore LV content on error path - since
for reactivation we always use ondisk/commited metadata,
so passed data are never used.
Drop some unneded extra message, since the called function
repeated logs same info.
Move common code for reading and processing
of --persistent arguments for lvcreate and lvchange
into lvmcmdline.
Reuse validate_major_minor() routine for validation.
Don't blindly activate LVs after change in cluster
and instead only local reactivation is supported.
(we have now many limited targets now).
Dropping 'sigint_caught()' handling, since
prompt() is resolving this case itself.
Add code to trap both mmap implementation on 32bit arch.
Use dlsym()
Use hlt instraction instead of int3 - generates usable stack trace
when problem is catched.
If we want to support conversion of VG to clustered type,
we currently need to relock active LV to get proper DLM lock.
So add extra loop after change of VG clustered attribute
to exlusively activate all active top level LVs.
When doing change -cy -> -cn we should validate LVs are not
active on other cluster nodes - we could be sure about this only
when with local exclusive activation - for other types
we require user to deactivate volumes first.
As a workaround for this limitation there is always
locking_type = 0 which amongs other skip the detection
of active LVs.
FIXME:
clvmd should handle looks for cluster locking type all the time.
Failure to copy the 'feature_flags' lvconvert_param to the matching
lv_segment field meant that when a user specified the cachemode argument,
the request was not honored.
While we could probably reacquire some type of lock when
going from non-clustered to clustered vg, we don't have any
single road back to drop the lock and keep LV active.
For now keep it safe and prohibit conversion when LV
is active in the VG.
Try to enforce consistent macro usage along these lines:
lv_is_mirror - mirror that uses the original dm-raid1 implementation
(segment type "mirror")
lv_is_mirror_type - also includes internal mirror image and log LVs
lv_is_raid - raid volume that uses the new dm-raid implementation
(segment type "raid")
lv_is_raid_type - also includes internal raid image / log / metadata LVs
lv_is_mirrored - LV is mirrored using either kernel implementation
(excludes non-mirror modes like raid5 etc.)
lv_is_pvmove - internal pvmove volume
Cmirrord has endian bugs, which cause failure to lvcreate a mirrored lv
on s390.
- data_size is uint32, should not use xlate64 to convert, which will
cause data_size 0 after xlate.
- request_type and data_size still used by local(v5_data_switch),
should convert later. If request_type xlate too early, it will
cause request_type judge error; if data_size xlate too early, it
will cause coredump in case DM_ULOG_CLEAR_REGION.
- when receiving package in clog_request_from_network. vp[0] will always
be little endian. We could use xlate64(vp[0]) == vp[0] to decide if
the local node is little endian or not.
Signed-off-by: Lidong Zhong<lzhong@suse.com> & Liuhua Wang <lwang@suse.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Use lv_is_* macros throughout the code base, introducing
lv_is_pvmove, lv_is_locked, lv_is_converting and lv_is_merging.
lv_is_mirror_type no longer includes pvmove.
This is probably better approach than 3880ca5eca.
If dm module is not loaded during dm_is_dm_major call, there are no
lines for dm in /proc/devices, of course. Normally, dm_is_dm_major
is called to check existing devices, hence if module is not loaded,
we can expect there's no DM device present at the same time so we
can directly return 0 here (meaning the major number being inspected
is not dm device's one).
See also https://bugzilla.redhat.com/show_bug.cgi?id=1059711.
For dm_is_dm_major to determine whether the major number given as
an argument belongs to a DM device, libdm code needs to know what
the actual DM major is to do the comparison.
It may happen that the dm-mod module is not loaded during this
call and so for the completness let's try our best before we start
giving various errors - we can still make use of dm-mod autoloading,
though only since kernels 2.6.36 where this feature was introduced.
Caused by recent changes - a7be3b12df.
If global filter was not defined, then part of the code
creating composite filter (the cmd->lvmetad_filter) incorrectly
increased index value even if this global filter was not created
as part of the composite filter. This caused a gap with "NULL"
value in the composite filter array which ended up with the rest
of the filters after the gap to be ignored and also it caused a mem
leak when destroying the composite filter.
We used to print an error message whenever we tried to deal with devices that
lvmetad knew about but were rejected by a client-side filter. Instead, we now
check whether the device is actually absent or only filtered out and only print
a warning in the latter case.
If a PV label is exposed both through a composite device (MD for example) and
through its component devices, we always want the PV that lvmetad sees to be the
composite, since this is what all LVM commands (including activation) will then
use. If pvscan --cache is triggered for multiple clones of the same PV, the last
to finish wins. This patch basically re-arranges the filters so that
component-device filters are part of the global_filter chain, not of the
client-side filter chain. This has a subtle effect on filter evaluation order,
but should not alter visible semantics in the non-lvmetad case.
The code in dev_iter_create assumes that if a filter can be wiped, doing so will
always trigger a call to _full_scan. This is not true for composite filters
though, since they can always be wiped in principle, but there is no way to know
that a component filter inside will exist that actually triggers the scan.
This message should be printed only for activation commands,
however since the handling of this flag is not correct
(rhbz 1140029) and will require further changes,
do now just a minor change and switch message into log_debug
(so it's not printed i.e. with every 'lvs -v')
Use lv_update_and_reload() and lv_update_and_reload_origin()
to handle write/suspend/commit/resume sequence.
In few places this properly handle vg_revert() after suspend failure,
and also ensures there is metadata backup after successful vg_commit().
Fix rename operation for snapshot (cow) LV.
Only the snapshot's origin has the lock and by mistake suspend
and resume has been called for the snapshot LV.
This further made volumes unusable in cluster.
So instead of suspend and resuming list of LVs,
we need to just suspend and resume origin.
As the sequence write/suspend/commit/resume
is widely used in lvm2 code base - move it to
new lv_update_and_reload function.
Commit 94786a3bbf introduced
another bug - since sscanf needs extra 1 byte for \0.
Since there is no easy way to do a macro evaluation for (PATH_MAX-1)
and string concatation of this number to get resulting (%4095s) - let's
go with easiest path and restore extra byte for 0.
Other option would be to prepare sscanf parsing string in runtime.
But lets resolve it when we look at PATH_MAX handling later...
Coverity noticed this function may return untouched buffer,
however in this state can't really happen, but anyway
ensure on error path the buffer will have zero lenght string.
Add extra safety detection for thin pool transaction id
and query pool status after confirmed message.
In case there is a missmatch, immeditelly abort further
processing.
Fixing problem, when user sets volume_list and excludes thin pools
from activation. In this case pool return 'success' for skipped activation.
We need to really check the volume it is actually active to properly
to remove queued pool messages. Otherwise the lvm2 and kernel
metadata started to go async since lvm2 believed, messages were submitted.
Add also better check for threshold when create a new thin volume.
In this case we require local activation of thin pool so we are able
to check pool fullness.
This patch makes the keyword combinations found in "lv_layout" and
"lv_role" much more understandable - there were some ambiguities
for some of the combinations which lead to confusion before.
Now, the scheme used is:
LAYOUTS ("how the LV is laid out"):
===================================
[linear] (all segments have number of stripes = 1)
[striped] (all segments have number of stripes > 1)
[linear,striped] (mixed linear and striped)
raid (raid layout always reported together with raid level, raid layout == image + metadata LVs underneath that make up raid LV)
[raid,raid1]
[raid,raid10]
[raid,raid4]
[raid,raid5] (exact sublayout not specified during creation - default one used - raid5_ls)
[raid,raid5,raid5_ls]
[raid,raid5,raid6_rs]
[raid,raid5,raid5_la]
[raid,raid5,raid5_ra]
[raid6,raid] (exact sublayout not specified during creation - default one used - raid6_zr)
[raid,raid6,raid6_zr]
[raid,raid6,raid6_nc]
[raid,raid6,raid6_ns]
[mirror] (mirror layout == log + image LVs underneath that make up mirror LV)
thin (thin layout always reported together with sublayout)
[thin,sparse] (thin layout == allocated out of thin pool)
[thin,pool] (thin pool layout == data + metadata volumes underneath that make up thin pool LV, not supposed to be used for direct use!!!)
[cache] (cache layout == allocated out of cache pool in conjunction with cache origin)
[cache,pool] (cache pool layout == data + metadata volumes underneath that make up cache pool LV, not supposed to be used for direct use!!!)
[virtual] (virtual layout == not hitting disk underneath, currently this layout denotes only 'zero' device used for origin,thickorigin role)
[unknown] (either error state or missing recognition for such layout)
ROLES ("what's the purpose or use of the LV - what is its role"):
=================================================================
- each LV has either of these two roles at least: [public] (public LV that users may use freely to write their data to)
[public] (public LV that users may use freely to write their data to)
[private] (private LV that LVM maintains; not supposed to be directly used by user to write his data to)
- and then some special-purpose roles in addition to that:
[origin,thickorigin] (origin for thick-style snapshot; "thick" as opposed to "thin")
[origin,multithickorigin] (there are more than 2 thick-style snapshots for this origin)
[origin,thinorigin] (origin for thin snapshot)
[origin,multithinorigin] (there are more than 2 thin snapshots for this origin)
[origin,extorigin] (external origin for thin snapshot)
[origin,multiextoriginl (there are more than 2 thin snapshots using this external origin)
[origin,cacheorigin] (cache origin)
[snapshot,thicksnapshot] (thick-style snapshot; "thick" as opposed to "thin")
[snapshot,thinsnapshot] (thin-style snapshot)
[raid,metadata] (raid metadata LV)
[raid,image] (raid image LV)
[mirror,image] (mirror image LV)
[mirror,log] (mirror log LV)
[pvmove] (pvmove LV)
[thin,pool,data] (thin pool data LV)
[thin,pool,metadata] (thin pool metadata LV)
[cache,pool,data] (cache pool data LV)
[cache,pool,metadata] (cache pool metadata LV)
[pool,spare] (pool spare LV - common role of LV that makes it used for both thin and cache repairs)
This makes it a bit more readable since we can report more general
layouts/roles first and keywords describing the LV more precisely
afterwards in the list.
The 'lv_type' field name was a bit misleading. Better one is 'lv_role'
since this fields describes what's the actual use of the LV currently -
its 'role'.
Sort out the lvresize calculation code to handle size changes
specified as physical extents as well as logical extents
and to process mirror resizing and raid extensions correctly.
The 'approx alloc' option was masking the underlying problem.
Commit 5ebff6cc9f seemed to introduce
new 'for' loop but the mode is not yet used.
But the access to /dev dir needs to go through $DM_DEV_DIR
and whole path needs to be in "".
When testing conversion sanity, we checked lv->status & MIRRORED
which encompasses both old mirrors and raid1 mirrors. But we need to
ban only the old mirrors here hence allow raid1 mirrors.
Avoid playing with +1.
PATH_MAX code needs probably more thinking anyway, since
there is no MAX path in Linux - user may easily create path
with 64kB chars - so 4kB buffer is surelly not enough for
such dirs.
Note:
http://insanecoding.blogspot.cz/2007/11/pathmax-simply-isnt.html
The lv_type_name function is remnant from old code that reported
only single string for the LV type. LV types are now reported
in a more extended way as keyword list that describe the type
precisely (using lv_layout_and_type fn).
The lv_type_name was used in some error messages to display the
type of the LV so just reinstate the old messages back referencing
the type directly with a string - this is enough for error messages.
They don't need to display the LV type as precisely as it's used
on lvs output (which is optimized for selection anyway).
$ lvs -a -o name,vg_name,attr,layout,type
LV VG Attr Layout Type
lvol0 vg -wI-a----- linear linear
[pvmove0] vg p-C-aom--- mirror mirror,pvmove
(added "mirror" for pvmove LV)
$ lvs -a -o name,vg_name,attr,layout,type
LV VG Attr Layout Type
lvol0 vg ori------- linear external,multiple,origin,thin
[lvol1_pmspare] vg ewi------- linear metadata,pool,spare
lvol2 vg Vwi-a-tz-- thin snapshot,thin
lvol3 vg Vwi-a-tz-- thin snapshot,thin
pool vg twi-a-tz-- pool,thin pool,thin
[pool_tdata] vg Twi-ao---- linear data,pool,thin
[pool_tmeta] vg ewi-ao---- linear metadata,pool,thin
(added "multiple" for external origin used for more than one
thin snapshot - lvol0 in the example above)
Thin snapshots having external origins missed the "snapshot" keyword for
lv_type field. Also, thin external origins which are thin devices (from
another pool) were not recognized properly.
For example, external origin itself can be either non-thin volume (lvol0
below) or it can be a thin volume from another pool (lvol3 below):
Before this patch:
$ lvs -o name,vg_name,attr,pool_lv,origin,layout,type
Internal error: Failed to properly detect layout and type for for LV vg/lvol3
Internal error: Failed to properly detect layout and type for for LV vg/lvol3
LV VG Attr Pool Origin Layout Type
lvol0 vg ori------- linear external,origin,thin
lvol2 vg Vwi-a-tz-- pool lvol0 thin thin
lvol3 vg ori---tz-- pool unknown external,origin,thin,thin
lvol4 vg Vwi-a-tz-- pool1 lvol3 thin thin
pool vg twi-a-tz-- pool,thin pool,thin
pool1 vg twi-a-tz-- pool,thin pool,thin
- lvol2 as well as lvol4 have missing "snapshot" in type field
- lvol3 has unrecognized layout (should be "thin"), but has double
"thin" in lv_type which is incorrect
- (also there's double "for" in the internal error message)
With this patch applied:
$ lvs -o name,vg_name,attr,pool_lv,origin,layout,type
LV VG Attr Pool Origin Layout Type
lvol0 vg ori------- linear external,origin,thin
lvol2 vg Vwi-a-tz-- pool lvol0 thin snapshot,thin
lvol3 vg ori---tz-- pool thin external,origin,thin
lvol4 vg Vwi-a-tz-- pool1 lvol3 thin snapshot,thin
pool vg twi-a-tz-- pool,thin pool,thin
pool1 vg twi-a-tz-- pool,thin pool,thin
The maximum stripe size is equal to the volume group PE size. If that
size falls below the STRIPE_SIZE_MIN, the creation of RAID 4/5/6 volumes
becomes impossible. (The kernel will fail to load a RAID 4/5/6 mapping
table with a stripe size less than STRIPE_SIZE_MIN.) So, we report an
error if it is attempted.
This is very rare because reducing the PE size down that far limits the
size of the PV below that of modern devices.
This patch adds a new flag --deferred to dmsetup remove. If this flag is
specified and the device is open, it is scheduled to be deleted on
close.
struct dm_info is extended.
The existing dm_task_get_info() is converted into a wrapper around the
new version dm_task_get_info_with_deferred_remove() so existing binaries
can still use the old smaller structure.
Recompiled code will pick up the new larger structure.
From: Mikulas Patocka <mpatocka@redhat.com>
metadata/lv_manip.c:269: warning: declaration of "snapshot_count" shadows a global declaration
There's existing function called "snapshot_count" so rename the
variable to "snap_count".
When ignoring 'listed' volume, print info message.
(So the final command error message is a bit less confusing,
i.e. when user tries to deactive virtual origin:
> lvchange -an vg/lvol2_vorigin
Ignoring virtual origin logical volume vg/lvol2_vorigin.
One or more specified logical volume(s) not found.
The lv_layout and lv_type fields together help with LV identification.
We can do basic identification using the lv_attr field which provides
very condensed view. In contrast to that, the new lv_layout and lv_type
fields provide more detialed information on exact layout and type used
for LVs.
For top-level LVs which are pure types not combined with any
other LV types, the lv_layout value is equal to lv_type value.
For non-top-level LVs which may be combined with other types,
the lv_layout describes the underlying layout used, while the
lv_type describes the use/type/usage of the LV.
These two new fields are both string lists so selection (-S/--select)
criteria can be defined using the list operators easily:
[] for strict matching
{} for subset matching.
For example, let's consider this:
$ lvs -a -o name,vg_name,lv_attr,layout,type
LV VG Attr Layout Type
[lvol1_pmspare] vg ewi------- linear metadata,pool,spare
pool vg twi-a-tz-- pool,thin pool,thin
[pool_tdata] vg rwi-aor--- level10,raid data,pool,thin
[pool_tdata_rimage_0] vg iwi-aor--- linear image,raid
[pool_tdata_rimage_1] vg iwi-aor--- linear image,raid
[pool_tdata_rimage_2] vg iwi-aor--- linear image,raid
[pool_tdata_rimage_3] vg iwi-aor--- linear image,raid
[pool_tdata_rmeta_0] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_1] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_2] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_3] vg ewi-aor--- linear metadata,raid
[pool_tmeta] vg ewi-aor--- level1,raid metadata,pool,thin
[pool_tmeta_rimage_0] vg iwi-aor--- linear image,raid
[pool_tmeta_rimage_1] vg iwi-aor--- linear image,raid
[pool_tmeta_rmeta_0] vg ewi-aor--- linear metadata,raid
[pool_tmeta_rmeta_1] vg ewi-aor--- linear metadata,raid
thin_snap1 vg Vwi---tz-k thin snapshot,thin
thin_snap2 vg Vwi---tz-k thin snapshot,thin
thin_vol1 vg Vwi-a-tz-- thin thin
thin_vol2 vg Vwi-a-tz-- thin multiple,origin,thin
Which is a situation with thin pool, thin volumes and thin snapshots.
We can see internal 'pool_tdata' volume that makes up thin pool has
actually a level10 raid layout and the internal 'pool_tmeta' has
level1 raid layout. Also, we can see that 'thin_snap1' and 'thin_snap2'
are both thin snapshots while 'thin_vol1' is thin origin (having
multiple snapshots).
Such reporting scheme provides much better base for selection criteria
in addition to providing more detailed information, for example:
$ lvs -a -o name,vg_name,lv_attr,layout,type -S 'type=metadata'
LV VG Attr Layout Type
[lvol1_pmspare] vg ewi------- linear metadata,pool,spare
[pool_tdata_rmeta_0] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_1] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_2] vg ewi-aor--- linear metadata,raid
[pool_tdata_rmeta_3] vg ewi-aor--- linear metadata,raid
[pool_tmeta] vg ewi-aor--- level1,raid metadata,pool,thin
[pool_tmeta_rmeta_0] vg ewi-aor--- linear metadata,raid
[pool_tmeta_rmeta_1] vg ewi-aor--- linear metadata,raid
(selected all LVs which are related to metadata of any type)
lvs -a -o name,vg_name,lv_attr,layout,type -S 'type={metadata,thin}'
LV VG Attr Layout Type
[pool_tmeta] vg ewi-aor--- level1,raid metadata,pool,thin
(selected all LVs which hold metadata related to thin)
lvs -a -o name,vg_name,lv_attr,layout,type -S 'type={thin,snapshot}'
LV VG Attr Layout Type
thin_snap1 vg Vwi---tz-k thin snapshot,thin
thin_snap2 vg Vwi---tz-k thin snapshot,thin
(selected all LVs which are thin snapshots)
lvs -a -o name,vg_name,lv_attr,layout,type -S 'layout=raid'
LV VG Attr Layout Type
[pool_tdata] vg rwi-aor--- level10,raid data,pool,thin
[pool_tmeta] vg ewi-aor--- level1,raid metadata,pool,thin
(selected all LVs with raid layout, any raid layout)
lvs -a -o name,vg_name,lv_attr,layout,type -S 'layout={raid,level1}'
LV VG Attr Layout Type
[pool_tmeta] vg ewi-aor--- level1,raid metadata,pool,thin
(selected all LVs with raid level1 layout exactly)
And so on...
_pvcreate_check() has two missing requirements:
After refreshing filters there must be a rescan.
(Otherwise the persistent filter may remain empty.)
After wiping a signature, the filters must be refreshed.
(A device that was previously excluded by the filter due to
its signature might now need to be included.)
If several devices are added at once, the repeated scanning isn't
strictly needed, but we can address that later as part of the command
processing restructuring (by grouping the devices).
Replace the new pvcreate code added by commit
54685c20fc "filters: fix regression caused
by commit e80884cd080cad7e10be4588e3493b9000649426"
with this change to _pvcreate_check().
The filter refresh problem dates back to commit
acb4b5e4de "Fix pvcreate device check."
Using "[ ]" operator together with "&&" (or ",") inside causes the
string list to be matched if and only if all the items given match
the value reported and the number of items also match. This is
strict list matching and the original behaviour we already have.
In contrast to that, the new "{ }" operator together with "&&" inside
causes the string list to be matched if and only if all the items given
match the value reported but the number of items don't need to match.
So we can provide a subset in selection criteria and if the subset
is found, it matches.
For example:
$ lvs -o name,tags
LV LV Tags
lvol0 a
lvol1 a,b
lvol2 b,c,x
lvol3 a,b,y
$ lvs -o name,tags -S 'tags=[a,b]'
LV LV Tags
lvol1 a,b
$ lvs -o name,tags -S 'tags={a,b}'
LV LV Tags
lvol1 a,b
lvol3 a,b,y
So in the example above the a,b is subset of a,b,y and therefore
it also matches.
Clearly, when using "||" (or "#") inside, the { } and [ ] is the
same:
$ lvs -o name,tags -S 'tags=[a#b]'
LV LV Tags
lvol0 a
lvol1 a,b
lvol2 b,c,x
lvol3 a,b,y
$ lvs -o name,tags -S 'tags={a#b}'
LV LV Tags
lvol0 a
lvol1 a,b
lvol2 b,c,x
lvol3 a,b,y
Also in addition to the above feature, fix list with single value
matching when using [ ]:
Before this patch:
$ lvs -o name,tags -S 'tags=[a]'
LV LV Tags
lvol0 a
lvol1 a,b
lvol3 a,b,y
With this patch applied:
$ lvs -o name,tags -S 'tags=[a]'
LV LV Tags
lvol0 a
In case neither [] or {} is used, assume {} (the behaviour is not
changed here):
$ lvs -o name,tags -S 'tags=a'
LV LV Tags
lvol0 a
lvol1 a,b
lvol3 a,b,y
So in new terms 'tags=a' is equal to 'tags={a}'.
If using persistent filter and we're refreshing filters (just like we
do for pvcreate now after commit 54685c20fc),
we can't rely on getting the primary device of the partition from the cache
as such device could be already filtered by persistent filter and we get
a device cache lookup failure for such device.
For example:
$ lvm dumpconfig --type diff
devices {
obtain_device_list_from_udev=0
}
$lsblk /dev/sda
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128M 0 disk
`-sda1 8:1 0 127M 0 part
$cat /etc/lvm/cache/.cache | grep sda
"/dev/sda1",
$pvcreate /dev/sda1
dev_is_mpath: failed to get device for 8:1
Physical volume "/dev/sda1" successfully created
The problematic part of the code called dev_cache_get_by_devt
to get the device for the device number supplied. Then the code
used dev_name(dev) to get the name which is then used in check
whether there's any mpath on top of this dev...
This patch uses sysfs to get the base name for the partition
instead, hence avoiding the device cache which is a correct
approach here.
The message "Cannot deactivate remotely exclusive device locally." makes
sense only for clustered LV. If the LV is non-clustered, then it's
always exclusive by definition and if it's already deactivated, this
message pops up inappropriately as those two conditions are met.
So issue the message only if the conditions are met AND we have clustered VG.
With cmirrord, we can do pvmove of clustered mirror. The code checking
suitability of LVs on the PV being moved issued a message if a mirror
LV was found and the VG was clustered. However, the actual pvmove did
work correctly.
The top-level mirror LV is actually skipped in the code since it's
always layered on top of internal LVs making up the mirror LV and for pvmove
we consider these internal devices only as they're actually layered on
top of concrete PVs then. But we don't need to issue any message here
about skipping the top-level mirror LV - it's misleading here.
Commit e80884cd08 tried to dump filters
for them to be reevaluated when creating a PV to avoid overwriting
any existing signature that may have been created after last
scan/filtering.
However, we need to call refresh_filters instead of
persistent_filter->dump since dump requires proper rescannig to fill
up the persistent filter again. However, this is true only for pvcreate
but not for vgcreate with PV creation where the scanning happens before
this PV creation and hence the next rescan (if not full scan), does not
fill the persistent filter.
Also, move refresh_filters so that it's called sooner and only for
pvcreate, vgcreate already calls lvmcache_label_scan(cmd, 2) which
then calls refresh_filters itself, so no need to reevaluate this again.
This caused the persistent filter (/etc/lvm/cache/.cache file) to be
wrong and contain only the PV just being processed with
vgcreate <vg_name> <pv_name_to_create>.
This regression caused other block devices to be filtered out in case
the vgcreate with PV creation was used and then the persistent filter
is used by any other LVM command afterwards.
Make lvresize -l+%FREE support approximate allocation.
Move existing "Reducing/Extending' message to verbose level
and change it to say 'up to' if approximate allocation is being used.
Replace it with a new message that gives the actual old and new size or
says 'unchanged'.
This is addendum to commit 2e82a070f3
which fixed these spurious messages that appeared after commit
651d5093ed ("avoid pv_read in
find_pv_by_name").
There was one more "not found" message issued in case the device
could not be found in device cache (commit 2e82a07 fixed this only
for PV lookup itself). But if we "allow_unformatted" for
find_pv_by_name, we should not issue this message even in case
the device can't be found in dev cache as we just need to know
whether there's a PV or not for the code to decide on next steps
and we don't want to issue any messages if either device itself
is not found or PV is not found.
For example, when we were creating a new PV (and so allow_unformatted = 1)
and the device had a signature on it which caused it to be filtered
by device filter (e.g. MD signature if md filtering is enabled),
or it was part of some other subsystem (e.g. multipath), this message
was issued on find_pv_by_name call which was misleading.
Also, remove misleading "stack" call in case find_pv_by_name
returns NULL in pvcreate_check - any error state is reported
later by pvcreate_check code so no need to "stack" here.
There's one more and proper check to issue "not found" message if
the device can't be found in device cache within pvcreate_check fn
so this situation is still covered properly later in the code.
Before this patch (/dev/sda contains MD signature and is therefore filtered):
$ pvcreate /dev/sda
Physical volume /dev/sda not found
WARNING: linux_raid_member signature detected on /dev/sda at offset 4096. Wipe it? [y/n]:
With this patch applied:
$ pvcreate /dev/sda
WARNING: linux_raid_member signature detected on /dev/sda at offset 4096. Wipe it? [y/n]:
Non-existent devices are still caught properly:
$ pvcreate /dev/sdx
Device /dev/sdx not found (or ignored by filtering).
2.02.106 added suffixes to some LV uuids in the kernel.
If any of these LVs is activated with 2.02.105 or earlier,
and then a later version is used, the LVs appear invisible and
activation commands fail.
The code now has to check the kernel for both old and new uuids.
Fix get_pool_params to only read params.
Add poolmetadataspare option to get_pool_params.
Move all profile code into update_pool_params.
Move recalculate code into pool_manip.c
Revert logic and rename new arg_ functions to:
arg_from_list_is_set()
arg_outside_list_is_set()
When err_found is given, log_error message is automaticaly
printed.
Few unecessary comments were written to on-disc metadata.
Use outfc() to have comments only in archived files.
(may also save couple bytes in ringbuffer).
TODO: needed validation against newline char...
Support --repair and --use-policies with mirrors.
(fixes another regression from lvconvert change for thin and cache).
TODO: the code path for mirror needs update.
Major update of lvconvert code to handle cache and thin.
related targets.
Code tries to unify handling of cache and thin pools.
Better supports lvm2 syntax:
lvconvert --type cache --cachepool vg/pool vg/cache
lvconvert --type thin --thinpool vg/pool vg/extorg
lvconvert --type cache-pool vg/pool
lvconvert --type thin-pool vg/pool
as well as:
lvconvert --cache --cachepool vg/pool vg/cache
lvconvert --thin --thinpool vg/pool vg/extorg
lvconvert --cachepool vg/pool
lvconvert --thinpool vg/pool
While catching much more command line errors.
(Yet couple paths still needs more tests)
Detects as much cmdline errors prior opening VG.
Uses single lvconvert_name_params to convert LV names.
Detects as much incompatibilies in VG prior prompting.
Uses single prompt to confirm whole conversion.
TODO: still the code needs fixes...
Cache pools are similar as with thin pools.
Add (needs %s) - since cache has currently
a bit strange need for extra few kb over
our default 4M extent size so make it more obvious.
Since the type passed LV is changed and content of data detroyed,
query user with prompt to confirm this operation.
Also add a proper wiping of header.
Using '--yes' will skip this prompt:
lvconvert -s --yes vg/lv vg/lvcow
When EOF is detect - it could be either 'Ctrl+C'
or empty stdin.
For Ctrl+C there is visual ^C sign.
For EOF print 'n' so decision is clear in debug print.
This is addendum for commit 6dc7b783c8.
LVM1 format stores the ALLOCATABLE flag directly in PV header, not
in VG metadata. So the code needs to be fixed further to work
properly for lvm1 format so that the correct PV header is written
(the flag is set only if the PV is in some VG, unset otherwise).
Before the patch:
$ lvs -o name,active vg/lvol1 --driverloaded n
WARNING: Activation disabled. No device-mapper interaction will beattempted.
LV Active
lvol1 active
With this patch applied:
$ lvs -o name,active vg/lvol1 --driverloaded n
WARNING: Activation disabled. No device-mapper interaction will be attempted.
LV Active
lvol1 unknown
The same for active_{locally,remotely,exclusively} fields.
Also, rename headings for these fields (ActLocal/ActRemote/ActExcl).
If the lv_info call fails for whatever reason/INFO dm ioctl fails or
the dm driver communication is disabled (--driverloaded n), make
sure we always display "unknown" for LVSINFO fields as that's exactly
what happens - we don't know the state.
Before the patch:
$ lvs -o name,device_open --driverloaded n
WARNING: Activation disabled. No device-mapper interaction will be attempted.
Command failed with status code 5.
With this patch applied:
$ lvs -o name,device_open --driverloaded n
WARNING: Activation disabled. No device-mapper interaction will be attempted.
LV DevOpen
lvol1 unknown
Commit 33d69162e4 reduced the number of
PVs to a level where the test could not function. (It is impossible
to replace 3 PVs of a 4-way RAID1 LV if there are only 5 PVs.)
Like other binary fields we already have:
$ lvs -o name,zero vg/lvx vg/pool vg/pool1
LV Zero
lvx unknown
pool
pool1 zero
$ lvs -o name,zero vg/lvx vg/pool vg/pool1 --binary
LV Zero
lvx -1
pool 0
pool1 1
Currently, we have two modes of activation, an unnamed nominal mode
(which I will refer to as "complete") and "partial" mode. The
"complete" mode requires that a volume group be 'complete' - that
is, no missing PVs. If there are any missing PVs, no affected LVs
are allowed to activate - even RAID LVs which might be able to
tolerate a failure. The "partial" mode allows anything to be
activated (or at least attempted). If a non-redundant LV is
missing a portion of its addressable space due to a device failure,
it will be replaced with an error target. RAID LVs will either
activate or fail to activate depending on how badly their
redundancy is compromised.
This patch adds a third option, "degraded" mode. This mode can
be selected via the '--activationmode {complete|degraded|partial}'
option to lvchange/vgchange. It can also be set in lvm.conf.
The "degraded" activation mode allows RAID LVs with a sufficient
level of redundancy to activate (e.g. a RAID5 LV with one device
failure, a RAID6 with two device failures, or RAID1 with n-1
failures). RAID LVs with too many device failures are not allowed
to activate - nor are any non-redundant LVs that may have been
affected. This patch also makes the "degraded" mode the default
activation mode.
The degraded activation mode does not yet work in a cluster. A
new cluster lock flag (LCK_DEGRADED_MODE) will need to be created
to make that work. Currently, there is limited space for this
extra flag and I am looking for possible solutions. One possible
solution is to usurp LCK_CONVERT, as it is not used. When the
locking_type is 3, the degraded mode flag simply gets dropped and
the old ("complete") behavior is exhibited.
Change the help heading from 'Common Fields' to 'Special Fields' for
the fields: selected, help, ?
Remove the code that does 'all' processing with these special fields as
each of them changes the behaviour of the command in an undesirable way.
'lvs -o all,selected' was of course just printing help.
(via internal expansion to 'lv_all,common_all')
and if we ignored the help fields, then '-o common_all' would still
pull in 'selected' and change the way rows were output.
We have 1/"descriptive word"/"yes" for 1 and 0/"no" for 0.
For example (the new recognized values are "yes" and "no"):
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2
LV DevOpen
root open
swap open
lvol1 open
lvol2
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2 -S 'device_open=open'
LV DevOpen
root open
swap open
lvol1 open
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2 -S 'device_open=1'
LV DevOpen
root open
swap open
lvol1 open
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2 -S 'device_open=yes'
LV DevOpen
root open
swap open
lvol1 open
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2 -S 'device_open=0'
LV DevOpen
lvol2
$ lvs -o name,device_open fedora vg/lvol1 vg/lvol2 -S 'device_open=no'
LV DevOpen
lvol2
So all attribute reporting functions are all in one section of code
for quick orientation (all these functions are defined in the order
of their attribute character displayed in pv/vg/lv_attr field).
lv_active_{locally,remotely,exclusively} display the original
"lv_active" field in a more separate way so that we can create
selection criteria in a binary-based form (yes/no).
The macros for reserved value definition makes the process a bit easier,
but there's still a place for improvement and make this even more
transparent. We can optimize and provide better automatism here later on.
Also respect --binary arg and/or report/binary_values_as_numeric
when displaying unknown values. If textual form is used, use "unknown",
if numeric value is used, use "-1" (which we already use to denote
unknown numeric values in other reports like lv_kernel_major and
lv_kernel_minor).
This avoids creating void matchers which have no effect anyway and
they just use resources. Also, it makes lvm dumpconfig --type diff
to mark these settings properly as not being different from defaults
(where by default, devices/preferred_names as well as devices/filter
are void).
Also, add a few comments about builtin rules used to select device
alias in case preferred_names is not defined or it doesn't match
any of device aliases.
There was missing lv_info call for situations where there were
mixed PV/LV segment fields together with LVSINFO fields which
require extra lv_info call for LV device status. This ended up
with NULL lvinfo passed to the field reporting functions, hence
the segfault.
All binary attr fields have synonyms so selection criteria can use
either 0/1 or words to match against the field value (base type
for these binary fields is numeric one - DM_REPORT_FIELD_TYPE_NUMBER
so words are registered as reserved values):
pv_allocatable - "allocatable"
pv_exported - "exported"
pv_missing - "missing"
vg_extendable - "extendable"
vg_exported - "exported"
vg_partial - "partial"
vg_clustered - "clustered"
lv_initial_image_sync - "initial image sync", "sync"
lv_image_synced_names - "image synced", "synced"
lv_merging_names - "merging"
lv_converting_names - "converting"
lv_allocation_locked - "allocation locked", "locked"
lv_fixed_minor - "fixed minor", "fixed"
lv_merge_failed - "merge failed", "failed"
For example, these three are all equivalent:
$ lvs -o name,fixed_minor -S 'fixed_minor=fixed'
LV FixMin
lvol8 fixed minor
$ lvs -o name,fixed_minor -S 'fixed_minor="fixed minor"'
LV FixMin
lvol8 fixed minor
$ lvs -o name,fixed_minor -S 'fixed_minor=1'
LV FixMin
lvol8 fixed minor
The same with binary output - it has no effect on this functionality:
$ lvs -o name,fixed_minor --binary -S 'fixed_minor=fixed'
LV FixMin
lvol8 1
$ lvs -o name,fixed_minor --binary -S 'fixed_minor="fixed
minor"'
LV FixMin
lvol8 1
[1] f20/~ # lvs -o name,fixed_minor --binary -S 'fixed_minor=1'
LV FixMin
lvol8 1
In contrast to per-type reserved values that are applied for all fields
of that type, per-field reserved values are only applied for concrete
field only.
Also add 'struct dm_report_field_reserved_value' to libdm for per-field
reserved value definition. This is defined by field number (an index
in the 'fields' array which is given for the dm_report_init_with_selection
function during report initialization) and the value to use for any
of the specified reserved names.
The --binary option, if used, causes all the binary values reported
in reporting commands to be displayed as "0" or "1" instead of descriptive
literal values (value "unknown" is still used for values that could not be
determined).
Also, add report/binary_values_as_numeric lvm.conf option with the same
functionality as the --binary option (the --binary option prevails
if both --binary cmd option and report/binary_values_as_numeric lvm.conf
option is used at the same time). The report/binary_values_as_numeric is
also profilable.
This makes it easier to use and check lvm reporting command output in scripts.
Physical Volume Fields:
pv_allocatable - Whether this device can be used for allocation.
pv_exported - Whether this device is exported.
pv_missing - Whether this device is missing in system.
Volume Group Fields:
vg_permissions - VG permissions.
vg_extendable - Whether VG is extendable.
vg_exported - Whether VG is exported.
vg_partial - Whether VG is partial.
vg_allocation_policy - VG allocation policy.
vg_clustered - Whether VG is clustered.
Logical Volume Fields:
lv_volume_type - LV volume type.
lv_initial_image_sync - Whether mirror/RAID images underwent initial resynchronization.
lv_image_synced - Whether mirror/RAID image is synchronized.
lv_merging - Whether snapshot LV is being merged to origin.
lv_converting - Whether LV is being converted.
lv_allocation_policy - LV allocation policy.
lv_allocation_locked - Whether LV is locked against allocation changes.
lv_fixed_minor - Whether LV has fixed minor number assigned.
lv_merge_failed - Whether snapshot merge failed.
lv_snapshot_invalid - Whether snapshot LV is invalid.
lv_target_type - Kernel target type the LV is related to.
lv_health_status - LV health status.
lv_skip_activation - Whether LV is skipped on activation.
Logical Volume Info Fields
lv_permissions - LV permissions.
lv_suspended - Whether LV is suspended.
lv_live_table - Whether LV has live table present.
lv_inactive_table - Whether LV has inactive table present.
lv_device_open - Whether LV device is open.
LVSINFO is exactly the same as existing LVS report type,
but it has the "struct lvinfo" populated in addition for
use - this is useful for fields that display the status
of the LV device itself (e.g. suspended state, tables
present/missing...).
Currently, such properties are reported within the "lv_attr"
field so separation is unnecessary - the "lvinfo" call
to populate the "struct lvinfo" is directly a part of the
field reporting function - _lvstatus_disp/lv_attr_dup.
With upcoming patches, we'd like the lv_attr field bits
to be separated into their own fields. To avoid calling
"lvinfo" fn as many times as there are fields requiring
the "lv_info" structure to be populated while reporting
one row related to one LV, we're separating former LVS
into LVS and LVSINFO report type. With this, there's
just one "lvinfo" call for one report row and LV reporting
fields will take the info needed from this struct then,
hence reusing it and not calling "lvinfo" fn on their own.
The get_lv_type_name helps with translating volume type
to human readable form (can be used in reports or
various messages if needed).
The lv_is_linear and lv_is_striped complete the set of
lv_is_* functions that identify exact volume types.
Mention parent LV as well as the LV triggering the warning.
Still leaves some confusing cases but its not worth fixing them
at the moment.
(Thin pool inactive but a thin volume active => deactivate thin vol.
Inactive mirror/raid with pvmove in progress => complete pvmove and
active&deactivate mirror/raid.
If new VG already exists it requires some LVs to be inactive
unnecessarily.)
Support remove of thin volumes With --force --force
when thin pools is damaged.
This way it's possible to remove thin pool with
unrepairable metadata without requiring to
manually edit lvm2 metadata.
lvremove -ff vg/pool
removes all thin volumes and pool even when
thin pool cannot be activated (to accept
removal of thin volumes in kernel metadata)
mirror_or_raid_type_requested really checks for mirror type.
Convert macros mirror_or_raid_type_requested() and
snapshot_type_requested() into inline functions.
Use builddir not srcdir with make pofile.
Append 'incfile:' lines to %.d files to handle newly-missing dependencies
without 'make clean' after a file is moved or deleted.
Using suffixes for mirrors and raids will need more work,
before this could be enabled.
Meanwhile revert to previous behavior.
Keep suffixes for thins and caches.
Since vg_name inside /lib function has already been ignored mostly
except for a few debug prints - make it and official internal API
feature.
vg_name is used only in /tools while the VG is not yet openned,
and when lvresize/lvcreate /lib function is called with VG pointer
already being used, then vg_name becomes irrelevant (it's not been
validated anyway).
So any internal user of lvcreate_params and lvresize_params does not
need to set vg_name pointer and may leave it NULL.
Use suffixes for easier detection of private volumes.
This commit makes older volume UUIDs incompatible and
it most probably needs machine reboot after upgrade.
When creating pool's metadata - create initial LV for clearing with some
generic name and after the volume is create & cleared - rename it to
reserved name '_tmeta/_cmeta'.
We should not expose 'reserved' names for public LVs.
When repairing RAID LVs that have multiple PVs per image, allow
replacement images to be reallocated from the PVs that have not
failed in the image if there is sufficient space.
This allows for scenarios where a 2-way RAID1 is spread across 4 PVs,
where each image lives on two PVs but doesn't use the entire space
on any of them. If one PV fails and there is sufficient space on the
remaining PV in the image, the image can be reallocated on just the
remaining PV.
Previously, the seg_pvs used to track free and allocated space where left
in place after 'release_pv_segment' was called to free space from an LV.
Now, an attempt is made to combine any adjacent seg_pvs that also track
free space. Usually, this doesn't provide much benefit, but in a case
where one command might free some space and then do an allocation, it
can make a difference. One such case is during a repair of a RAID LV,
where one PV of a multi-PV image fails. This new behavior is used when
the replacement image can be allocated from the remaining space of the
PV that did not fail. (First the entire image with the failed PV is
removed. Then the image is reallocated from the remaining PVs.)
I've changed build_parallel_areas_from_lv to take a new parameter
that allows the caller to build parallel areas by LV vs by segment.
Previously, the function created a list of parallel areas for each
segment in the given LV. When it came time for allocation, the
parallel areas were honored on a segment basis. This was problematic
for RAID because any new RAID image must avoid being placed on any
PVs used by other images in the RAID. For example, if we have a
linear LV that has half its space on one PV and half on another, we
do not want an up-convert to use either of those PVs. It should
especially not wind up with the following, where the first portion
of one LV is paired up with the second portion of the other:
------PV1------- ------PV2-------
[ 2of2 image_1 ] [ 1of2 image_1 ]
[ 1of2 image_0 ] [ 2of2 image_0 ]
---------------- ----------------
Previously, it was possible for this to happen. The change makes
it so that the returned parallel areas list contains one "super"
segment (seg_pvs) with a list of all the PVs from every actual
segment in the given LV and covering the entire logical extent range.
This change allows RAID conversions to function properly when there
are existing images that contain multiple segments that span more
than one PV.
If a RAID LV has images that are spread across more than one PV
and you allocate a new image that requires more than one PV,
parallel_areas is only honored for one segment. This commit
adds a test for this condition.
...to avoid using cached value (persistent filter) and therefore
not noticing any change made after last scan/filtering - the state
of the device may have changed, for example new signatures added.
$ lvm dumpconfig --type diff
allocation {
use_blkid_wiping=0
}
devices {
obtain_device_list_from_udev=0
}
$ cat /etc/lvm/cache/.cache | grep sda
$ vgscan
Reading all physical volumes. This may take a while...
Found volume group "fedora" using metadata type lvm2
$ cat /etc/lvm/cache/.cache | grep sda
"/dev/sda",
$ parted /dev/sda mklabel gpt
Information: You may need to update /etc/fstab.
$ parted /dev/sda print
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sda: 134MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
$ cat /etc/lvm/cache/.cache | grep sda
"/dev/sda",
====
Before this patch:
$ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
With this patch applied:
$ pvcreate /dev/sda
Physical volume /dev/sda not found
Device /dev/sda not found (or ignored by filtering).
A field where it has no meaning to do any type of comparison is the
implicit "help" or "?" field. The error given was a bit cryptic
before this patch, the FLD_UNCOMPARABLE flag makes it easier to identify
this situation anywhere in the code and provide much better error message.
This flag can be applied to other fields that may appear in the future -
mostly usable for implicit fields as they always have special purpose
(so we're not exporting it in libdevmapper for now - usual reporting
fields don't need this).
Before this patch:
$ vgs -S help=1
dm_report_object: no data assigned to field help
dm_report_object: no data assigned to field help
(...which is true actually, but let's provide something better...)
With this patch applied:
$vgs -S help=1
Selection field is uncomparable: help.
Selection syntax error at 'help=1'.
$vgs -S '(name=vg && help=1) || vg_size > 1g'
Selection field is uncomparable: help.
Selection syntax error at 'help=1) || vg_size > 1g'.
Take a local file lock to prevent concurrent activation/deactivation of LVs.
Thin/cache types and an extension for cluster support are excluded for
now.
'lvchange -ay $lv' and 'lvchange -an $lv' should no longer cause trouble
if issued concurrently: the new lock should make sure they
activate/deactivate $lv one-after-the-other, instead of overlapping.
(If anyone wants to experiment with the cluster patch, please get in touch.)
'lvs' would segfault if trying to display the "move pv" if the
pvmove was run with '--atomic'. The structure of an atomic pvmove
is different and requires us to descend another level in the
LV tree to retrieve the PV information.
In 'find_pvmove_lv', separate the code that searches the atomic
pvmove LVs from the code that searches the normal pvmove LVs. This
cleans up the segment iterator code a bit.
It's better to have implicit fields at the very end of the output
so users can see them without scrolling back if the list of fields
is long (the "help" is also an implicit field now so it should be
easily visible).
We have "help" and "?" defined as implicit fields now. As such, we
don't need to export these names in libdevmapper (as it was introduced
by commit 7c86131233 within this release).
If anyone uses these field names by mistake, the libdevmapper code can
error out correctly if it detects that the set of explicit field names
(the ones supplied by "fields" arg in dm_report_init/dm_report_init_with_selection)
contains any of the implicit field names (the ones defined internally
by libdevmapper itself).
Making "help" and "?" implicit also simplifies code since the
dm_report_init caller (lvm/dmsetup) doesn't need to check on
dm_report_init return whether "help" or "?" was hit while parsing
fields/sort keys in libdevmapper.
The libdevmapper now sets internal "RH_ALREADY_REPORTED" flag
after it reports the "help" or "?" implicit field. Then libdevmapper
itself checks for this flag in dm_report_object and if found,
the actual reporting is skipped (because the "help" implicit field
was reported instead of the actual report).
replicator/replicator.c:338:2: warning: passing argument 2 of 'build_dm_uuid' from incompatible pointer type [enabled by default]
replicator/replicator.c:629:3: warning: passing argument 2 of 'build_dm_uuid' from incompatible pointer type [enabled by default]
replicator/replicator.c:644:6: warning: passing argument 2 of 'build_dm_uuid' from incompatible pointer type [enabled by default]
replicator/replicator.c:668:7: warning: passing argument 2 of 'build_dm_uuid' from incompatible pointer type [enabled by default]
replicator/replicator.c:677:4: warning: passing argument 2 of 'build_dm_uuid' from incompatible pointer type [enabled by default]
Fix gcc warnings:
libdm-report.c:1952:5: warning: "end_op_flag_hit" may be used uninitialized in this function [-Wmaybe-uninitialized]
libdm-report.c:2232:28: warning: "custom" may be used uninitialized in this function [-Wmaybe-uninitialized]
And snap_percent is not 0% in dm < 1.10.0 so
don't test comparison with 0% here.
Ordering string list items on reports is also new compared to
previous state where items were not ordered at all and they
got reported simply as they appeared/were processed.
pvmove can be used to move single LVs by name or multiple LVs that
lie within the specified PV range (e.g. /dev/sdb1:0-1000). When
moving more than one LV, the portions of those LVs that are in the
range to be moved are added to a new temporary pvmove LV. The LVs
then point to the range in the pvmove LV, rather than the PV
range.
Example 1:
We have two LVs in this example. After they were
created, the first LV was grown, yeilding two segments
in LV1. So, there are two LVs with a total of three
segments.
Before pvmove:
--------- --------- ---------
| LV1s0 | | LV2s0 | | LV1s1 |
--------- --------- ---------
| | |
-------------------------------------
PV | 000 - 255 | 256 - 511 | 512 - 767 |
-------------------------------------
After pvmove inserts the temporary pvmove LV:
--------- --------- ---------
| LV1s0 | | LV2s0 | | LV1s1 |
--------- --------- ---------
| | |
-------------------------------------
pvmove0 | seg 0 | seg 1 | seg 2 |
-------------------------------------
| | |
-------------------------------------
PV | 000 - 255 | 256 - 511 | 512 - 767 |
-------------------------------------
Each of the affected LV segments now point to a
range of blocks in the pvmove LV, which purposefully
corresponds to the segments moved from the original
LVs into the temporary pvmove LV.
The current implementation goes on from here to mirror the temporary
pvmove LV by segment. Further, as the pvmove LV is activated, only
one of its segments is actually mirrored (i.e. "moving") at a time.
The rest are either complete or not addressed yet. If the pvmove
is aborted, those segments that are completed will remain on the
destination and those that are not yet addressed or in the process
of moving will stay on the source PV. Thus, it is possible to have
a partially completed move - some LVs (or certain segments of LVs)
on the source PV and some on the destination.
Example 2:
What 'example 1' might look if it was half-way
through the move.
--------- --------- ---------
| LV1s0 | | LV2s0 | | LV1s1 |
--------- --------- ---------
| | |
-------------------------------------
pvmove0 | seg 0 | seg 1 | seg 2 |
-------------------------------------
| | |
| -------------------------
source PV | | 256 - 511 | 512 - 767 |
| -------------------------
| ||
-------------------------
dest PV | 000 - 255 | 256 - 511 |
-------------------------
This update allows the user to specify that they would like the
pvmove mirror created "by LV" rather than "by segment". That is,
the pvmove LV becomes an image in an encapsulating mirror along
with the allocated copy image.
Example 3:
A pvmove that is performed "by LV" rather than "by segment".
--------- ---------
| LV1s0 | | LV2s0 |
--------- ---------
| |
-------------------------
pvmove0 | * LV-level mirror * |
-------------------------
/ \
pvmove_mimage0 / pvmove_mimage1
------------------------- -------------------------
| seg 0 | seg 1 | | seg 0 | seg 1 |
------------------------- -------------------------
| | | |
------------------------- -------------------------
| 000 - 255 | 256 - 511 | | 000 - 255 | 256 - 511 |
------------------------- -------------------------
source PV dest PV
The thing that differentiates a pvmove done in this way and a simple
"up-convert" from linear to mirror is the preservation of the
distinct segments. A normal up-convert would simply allocate the
necessary space with no regard for segment boundaries. The pvmove
operation must preserve the segments because they are the critical
boundary between the segments of the LVs being moved. So, when the
pvmove copy image is allocated, all corresponding segments must be
allocated. The code that merges ajoining segments that are part of
the same LV when the metadata is written must also be avoided in
this case. This method of mirroring is unique enough to warrant its
own definitional macro, MIRROR_BY_SEGMENTED_LV. This joins the two
existing macros: MIRROR_BY_SEG (for original pvmove) and MIRROR_BY_LV
(for user created mirrors).
The advantages of performing pvmove in this way is that all of the
LVs affected can be moved together. It is an all-or-nothing approach
that leaves all LV segments on the source PV if the move is aborted.
Additionally, a mirror log can be used (in the future) to provide tracking
of progress; allowing the copy to continue where it left off in the event
there is a deactivation.
With recent changes introduced with the report selection support,
the content of lv_modules field is of string list type (before
it was just string type).
String list elements are always ordered now so update lvcreate-thin
test to expect the elements to be ordered.
The differentiation of the original number field into number, size and
percent field types has been introduced with recent changes for report
selection support.
Implicit fields are fields that are registered with the report
and reported internally by libdevmapper itself (compared to explicit
fields that are registered by the layer above libdevmapper - e.g. LVM,
dmsetup...).
The "selected" field is the implicit field (for now the only one)
that reports the result of the selection. Since the selection itself
is the property of the libdevmapper, the upper layer using dm_report_init
can't register this field itself and it must be done directly at
libdevmapper layer.
The "selected" field is internally registered as part of the "common"
report type with id 0x80000000 (the last bit in uin32_t) which is then
reserved (the explicit report types are then checked if they do not
contain this id and if yes, we error out).
This way, the "selected" field is recognized by all libdevmapper users
that initialize the reporting with "dm_report_init_with_selection".
If reporting is initialized with the classical "dm_report_init",
there's no functional change (so the "selected" field is not defined
and it's not recognized).
Make dm_report_init_with_selection to accept an argument with an
array of reserved values where each element contains a triple:
{dm report field type, reserved value, array of strings representing this value}
When the selection is parsed, we always check whether a string
representation of some reserved value is not hit and if it is,
we use the reserved value assigned for this string instead of
trying to parse it as a value of certain field type.
This makes it possible to define selections like:
... --select lv_major=undefined (or -1 or unknown or undef or whatever string representations are registered for this reserved value in the future)
... --select lv_read_ahead=auto
... --select vg_mda_copies=unmanaged
With this, each time the field value of certain type is hit
and when we compare it with the selection, we use the proper
value for comparison.
For now, register these reserved values that are used at the moment
(also more descriptive names are used for the values):
const uint64_t _reserved_number_undef_64 = UINT64_MAX;
const uint64_t _reserved_number_unmanaged_64 = UINT64_MAX - 1;
const uint64_t _reserved_size_auto_64 = UINT64_MAX;
{
{DM_REPORT_FIELD_TYPE_NUMBER, _reserved_number_undef_64, {"-1", "undefined", "undef", "unknown", NULL}},
{DM_REPORT_FIELD_TYPE_NUMBER, _reserved_number_unmanaged_64, {"unmanaged", NULL}},
{DM_REPORT_FIELD_TYPE_SIZE, _reserved_size_auto_64, {"auto", NULL}},
NULL
}
Same reserved value of different field types do not collide.
All arrays are null-terminated.
The list of reserved values is automatically displayed within
selection help output:
Selection operands
------------------
...
Reserved values
---------------
-1, undefined, undef, unknown - Reserved value for undefined numeric value. [number]
unmanaged - Reserved value for unmanaged number of metadata copies in VG. [number]
auto - Reserved value for size that is automatically calculated. [size]
Selection operators
-------------------
...
When the field list is displayed as help for constructing selection
criteria, show also the field value type. This is useful for users
to know what set of operators are allowed for the type - the subsequent
"Selection operands" section in the help output summarize all known
types that can be used in selection.
The "<lvm command> -S/--select help" shows help (including list of fields to match against):
...field list here including the field type name...
Selection operands
------------------
field - Reporting field.
number - Non-negative integer value.
size - Floating point value with units specified.
string - Characters quoted by ' or " or unquoted.
string list - Strings enclosed by [ ] and elements delimited by either
"all items must match" or "at least one item must match" operator.
regular expression - Characters quoted by ' or " or unquoted.
Selection operators
-------------------
Comparison operators:
=~ - Matching regular expression.
!~ - Not matching regular expression.
= - Equal to.
!= - Not equal to.
>= - Greater than or equal to.
> - Greater than
<= - Less than or equal to.
< - Less than.
Logical and grouping operators:
&& - All fields must match
, - All fields must match
|| - At least one field must match
# - At least one field must match
! - Logical negation
( - Left parenthesis
) - Right parenthesis
[ - List start
] - List end
Selection list items are enclosed in '[' and ']' (if there's only
one item, the '[' and ']' can be omitted). Each element of the list
is a string (either quoted or unquoted, like the usual string operand
used in selection) and each element is delimited either by conjunction
(meaining "match all") or disjunction operator (meaning "match any").
For example, if "," is the conjuction operator and "/" is the
disjunction operator then:
lv_tags=[a,b,c]
...will match all fields where tags contain *all* a, b and c.
lv_tags=[a/b/c]
...will match all fields where tags contain *any* of a, b, or c.
Mixing operators within the list is not supported:
lv_tags=[a,b/c]
...will give an error.
The order in which items are defined in the selection do not matter.
This patch enhances the selection parsing functionality to recognize
such lists.
The {pv,vg,lv,seg}_tags and lv_modules fields are reported as string
lists using the new dm_report_field_string_list - so we just pass
the list to the fn that takes care of reporting and item sorting itself.
Add a separate dm_report_field_string_list fn to libdevmapper to
support reporting string lists. Before, the code used libdevmappers's
dm_report_field_string fn which required formatting the list to a
single string. This functionality is now moved to libdevmapper
and the code that needs to report the string list just needs
to pass the list itself and libdevmapper will take care of this.
This also enhances code reuse.
The dm_report_field_string_list also accepts an argument to define
custom delimiter to use. If not defined, a default "," (comma) is
used as item delimiter in the string list reported.
The dm_report_field_string_list automatically sorts the items in
the list before formatting it to a final string. It also encodes
the position and length within the final string where each element
can be found. This can be used to support checking against each
list item reported since since when formatted as a single string
for the actual report, we would lose this information otherwise
(we don't want to copy each item, the position and length within
the final string is enough for us to get the original items back).
When such lists are checked against the selection tree, we can check
each item individually this way and we can support operators like
"match any" and "match all".
The list of strings is used quite frequently and we'd like to reuse
this simple structure for report selection support too. Make it part
of libdevmapper for general reuse throughout the code.
This also simplifies the LVM code a bit since we don't need to
include and manage lvm-types.h anymore (the string list was the
only structure defined there).
This is rebased and edited version of the original design and
patch proposed by Jun'ichi Nomura:
http://www.redhat.com/archives/dm-devel/2007-April/msg00025.html
The dm_report_init_with_selection is the same as dm_report_init
but it contains an additional argument to set the selection
in the form of a string that contains field names to check against and
selection operators. The selection string is parsend and a selection
tree is composed for use in the checks against individual fields when
the report is processed. The parsed selection tree is stored in dm_report
structure as "selection_root".
This is rebased and edited version of the original design and
patch proposed by Jun'ichi Nomura:
http://www.redhat.com/archives/dm-devel/2007-April/msg00025.html
Add support for parsing numbers, strings (quoted or unquoted), regexes
and operators amogst these operands in selection condition supplied.
This is rebased and edited version of the original design and
patch proposed by Jun'ichi Nomura:
http://www.redhat.com/archives/dm-devel/2007-April/msg00025.html
This patch defines operators and structures that will be used
to store the report selection against which the actual values
reported will be checked.
Selection operators
-------------------
Comparison operators:
=~ - Matching regular expression.
!~ - Not matching regular expression.
= - Equal to.
!= - Not equal to.
>= - Greater than or equal to.
> - Greater than
<= - Less than or equal to.
< - Less than.
Logical and grouping operators:
&& - All fields must match
, - All fields must match
|| - At least one field must match
# - At least one field must match
! - Logical negation
( - Left parenthesis
) - Right parenthesis
This makes it easier to check against the fields (following patches for
report selection) and check whether size units are allowed or not
with the field value.
Use NAME_LEN constant to simplify creation of device name.
Since the max size should be already tested in validation,
throw INTERNAL_ERROR if the size of vg/lv is bigger then NAME_LEN.
Let's use the size of origin as the real base for percenta calculation,
and 'silenly' add needed metadata space for snapshot.
So now command 'lvcreate -s -l100%ORIGIN vg/lv' should always create a
snapshot to handle full device overwrite.
Expresing -lXX%LV is not valid for snapshot, but error message for
snapshost case was not complete and missed %ORIGIN.
Also document correct settings for in manpage properly where
it missed %PVS.
While polling for snapshot, detect first the snapshot still
exits. It's valid to have multiple polling threads watching
for the same thing and just 1 can 'win' the finish part.
All others should nicely 'fail'.
If the we are polling an LV due to some sort of conversion and it
becomes inactive, a rather worrisome message is produced, e.g.:
" ABORTING: Mirror percentage check failed."
We can cleanly exit if we do a simple check to see if the LV is
active before performing the check. This eliminates the scary
message.
When creating a cache LV with a RAID origin, we need to ensure that
the sub-LVs of that origin properly change their names to include
the "_corig" extention of the top-level LV. We do this by first
performing a 'lv_rename_update' before making the call to
'insert_layer_for_lv'.
The thin-generic.profile contains settings for thin/thin pool volumes
suitable for generic environment/use containing default settings.
This allows users to change the global lvm.conf settings at will
and still keep the original settings for volumes that have this
thin profile assigned already.
Internal reporting function cannot handle NULL reporting value,
so ensure there is at least dummy label.
So move dummy_lable from tools/reporter.c and use it for all
report_object() calls in lib/report/report.c.
(Fixes RHBZ 1108394)
Simlify lvm_report_object initialization.
More updates to manglename option.
Add reference to LVM2 resource page, since for a long time,
this is the right places for sources for libdevmapper....
Add more time for tests, since debug kernels are getting slower...
and we add more and more tests.
However many test should be shortened to avoid testing disk-perfomance
and focus on lvm functionality...
(Often we should probably test with inactive volumes when we check
metadata operation of lvm2)
We may need to support option for 'DEEP' longer testing.
Also something like LVM_TEST_TIMEOUT_FACTOR might be useful
though it would be much better if test suite could approximete
from system perfomance test lenght...
Enable 'retry' deactivation also in 'cleanup' phase.
It shouldn't be mostly needed - however udev now produces
more and more completelny non-synchronizable device opens,
so even for orphan devices we can't easily predict where
udevd opens devices.
So it's more preferable here to log error about device being open
and retry clean, but let the command proceed.
And use ifdefs there, not exposing it in the tool code itself.
Later in the future, we should probably make the PIDFILE and
daemon checking code available also in case the daemon itself
is not built.
If the user supplies a '--yes' argument, then don't bother them with
a question to confirm whether to change the cluster attribute (even
if clvmd isn't running).
If clvmd is not running or the locking type is not clustered and someone
attempts to set the cluster attribute on a volume group, prompt them to
see if they are sure. (Only prompt for one though. If neither are true,
simply ask them once.)
Instead of rereading device list via cat - keep
the list in bash array. (Also solves problem
with spaces in device path)
Move usage of "$path" out of lvm shell usage,
since we don't support such thing there...
Replace AC_PATH_PROG with AC_PATH_TOOL.
Drop 'x' when already using "" around shell variable.
Simlify some long line and ifs.
Merge multiple test evaluation with '-a', '-o'.
Use 'case' instead if several ifs when it's more elegant.
Improve usage of pkg_config_init and add it where it's been missing.
Check for UDEV_HAS_BUILTIN_BLKID and when building udev-rules.
Since we advertise 'none' as mangling name, accept it.
Keep it backward compatible and leave disabled option still working
(though I guess there is likely no user of this option...)
When directly corrupting RAID images for the purpose of testing,
we must use direct I/O (or a 'sync' after the 'dd') to ensure that
the writes are not caught in the buffer cache in a way that is not
reachable by the top-level RAID device.
As part of better error handling, remove DM devices that have been
sucessfully created but failed to load a table. This can happen
when pvmove'ing in a cluster and the cluster mirror daemon is not
running on a remote node - the mapping table failing to load as a
result. In this case, any revert would work on other nodes running
cmirrord because the DM devices on those nodes did succeed in loading.
However, because no table was able to load on the non-cmirrord nodes,
there is no table present that points to what needs to be reverted.
This causes the empty DM device to remain on the system without being
present in any LVM representation.
This patch should only be considered a partial fix to the overall
problem. This is because only the device which failed to load a
table is removed. Any LVs that may have been loaded as requirements
to the DM device that failed to load may be left in place. Complete
clean-up will require tracking those devices which have been created
as dependencies and removing them along with the device that failed
to load a table.
Accidently it's been commited - but it has also shown,
that on heavy loaded systems (like our test machine could be)
slightly bigger timeouts which waits longer for udev rules
processing does help and avoids occasional refuse of deactivation
because device is still being open.
(i.e. lvcreate...; lvchange -an...)
Unsure how we could now synchronize for this. On very slow(/loaded)
system 5 second timeout is simply not enough.
TODO: introduce at least lvm.conf configurable setting to
allow longer 'retry' loops.
Unsure if this is feature or bug of syncaction,
but it needs to be present physically on the media
and it ignores content of buffer cache...
(maybe lvchange should implicitely fsync all disks
that are members of raid array before starting test??)
Check we know how to handle same UUID
Test currently does NOT work on lvmetad
(or it's unclear it even should - thus test error
is currently lowered to 'test warning')
TODO: replace lib/test with a better shell script name
Reindent lv_check_not_in_use to simplify internal loop code.
Also return always '0/1' (drop -1) - since we only
check for failure (0) - and we don't really know
why lv_info() has failed.
disable_dev can't use transaction - since it may lead occasionaly to
weird error - example could be nomda-missing.sh test case.
Here occasionaly device instead of being removed was left as
error device and testing different code path (which is unfortunatelly
buggy)
When we want to test 'error' device - 'aux error_dev()' should be used.
Warn when --udevcookie/DM_UDEV_COOKIE is used with 'dmsetup remove --force'.
When command is doing multiple ioctl operations on a single device,
it may invoke udev activity, that is colliding with further ioctl commands.
The result of such operation becomes unpredictable.
Use of --retry could partially help...
Disable code which has postprocessed whole tree and reset udev flags.
We need to find out which case was troublesome - since this loop
was just hidding bug in other code parts (most probably preload tree)
LVM has restricter character set that is allowed for VG-LV names
and the dm names constructed do not contain any blacklisted characters
that would require name mangling.
Also, when any other device-mapper device is scanned that could
possibly contain such blacklisted characters, we reference the
device by its major:minor instead of dm name (e.g. _device_is_usable fn).
The "default.profile" name was misleading. It's actually a helper
*template* that can be used for copying and further editing to create
a new profile.
Also, we have separate command and metadata profiles now so the templates
are separated as well - we can't mix profile settings from one group with
another - such profile is rejected by lvm tools.
Delay udev notification after the point udev transaction
is finished - since otherwise some device may still
be found missing until udev transaction is finished.
When MAN7, MAN8CLUSTER or MAN8SYSTEMD_GENERATORS would be empty,
don't call respective INSTALL tools.
On older systems they even generate error causing abort
of makefile target.
Warn user before converting volume to different type.
WARNING: Converting vg/lvol0 logical volume to pool's meta/data volume.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Since the content of volume is lost we have to query user to confirm
such operation. If user is 100% sure, he may use '--yes' to avoid prompts.
The dumpconfig now understands --commandprofile/--profile/--metadataprofile
The --commandprofile and --profile functionality is almost the same
with only one difference and that is that the --profile is just used
for dumping the content, it's not applied for the command itself
(while the --commandprofile profile is applied like it is done for
any other LVM command).
We also allow --metadataprofile for dumpconfig - dumpconfig *does not*
touch VG/LV and metadata in any way so it's OK to use it here (just for
dumping the content, checking the profile validity etc.).
The validity of the profile can be checked with:
dumpconfig --commandprofile/--profile/--metadataprofile --validate
...depending on the profile type.
Also, mention --config in the dumpconfig help string so users know
that dumpconfig handles this too (it did even before, but it was not
documented in the help string).
- When defining configuration source, the code now uses separate
CONFIG_PROFILE_COMMAND and CONFIG_PROFILE_METADATA markers
(before, it was just CONFIG_PROFILE that did not make the
difference between the two). This helps when checking the
configuration if it contains correct set of options which
are all in either command-profilable or metadata-profilable
group without mixing these groups together - so it's a firm
distinction. The "command profile" can't contain
"metadata profile" and vice versa! This is strictly checked
and if the settings are mixed, such profile is rejected and
it's not used. So in the end, the CONFIG_PROFILE_COMMAND
set of options and CONFIG_PROFILE_METADATA are mutually exclusive
sets.
- Marking configuration with one or the other marker will also
determine the way these configuration sources are positioned
in the configuration cascade which is now:
CONFIG_STRING -> CONFIG_PROFILE_COMMAND -> CONFIG_PROFILE_METADATA -> CONFIG_FILE/CONFIG_MERGED_FILES
- Marking configuration with one or the other marker will also make
it possible to issue a command context refresh (will be probably
a part of a future patch) if needed for settings in global profile
set. For settings in metadata profile set this is impossible since
we can't refresh cmd context in the middle of reading VG/LV metadata
and for each VG/LV separately because each VG/LV can have a different
metadata profile assinged and it's not possible to change these
settings at this level.
- When command profile is incorrect, it's rejected *and also* the
command exits immediately - the profile *must* be correct for the
command that was run with a profile to be executed. Before this
patch, when the profile was found incorrect, there was just the
warning message and the command continued without profile applied.
But it's more correct to exit immediately in this case.
- When metadata profile is incorrect, we reject it during command
runtime (as we know the profile name from metadata and not early
from command line as it is in case of command profiles) and we
*do continue* with the command as we're in the middle of operation.
Also, the metadata profile is applied directly and on the fly on
find_config_tree_* fn call and even if the metadata profile is
found incorrect, we still need to return the non-profiled value
as found in the other configuration provided or default value.
To exit immediately even in this case, we'd need to refactor
existing find_config_tree_* fns so they can return error. Currently,
these fns return only config values (which end up with default
values in the end if the config is not found).
- To check the profile validity before use to be sure it's correct,
one can use :
lvm dumpconfig --commandprofile/--metadataprofile ProfileName --validate
(the --commandprofile/--metadataprofile for dumpconfig will come
as part of the subsequent patch)
- This patch also adds a reference to --commandprofile and
--metadataprofile in the cmd help string (which was missing before
for the --profile for some commands). We do not mention --profile
now as people should use --commandprofile or --metadataprofile
directly. However, the --profile is still supported for backward
compatibility and it's translated as:
--profile == --metadataprofile for lvcreate, vgcreate, lvchange and vgchange
(as these commands are able to attach profile to metadata)
--profile == --commandprofile for all the other commands
(--metadataprofile is not allowed there as it makes no sense)
- This patch also contains some cleanups to make the code handling
the profiles more readable...
Mark profilable settings with a separate CFG_PROFILABLE_METADATA
flag where the profile can be attached to VG/LV. This makes it possible
to differentiate global command-profilable settings (CFG_PROFILABLE flag)
and contextual metadata-profilable (per VG/LV) settings (CFG_PROFILABLE_METADATA flag).
The dumpconfig reuses existing config_def_check results in case
the check is done during general lvm command context initialization
(when enabled by config/checks=1) so dumpconfig does not need to run
the same check again during its execution, hence saving some time.
However, we don't check for differences from defaults during general
lvm command initialization as it's useless at that time. It makes
sense only in case when such a check is directly requested (like in
the case of lvm dumpconfig --type diff). We need to take care that
the reused information was already produced with this "diff" checking
before and if not, we need to force the check so the check status also
gathers the new "diff" info now.
Also, do not do diff checking for any other dumpconfig command that
is run after dumpconfig --type diff.
When cmd refresh is called, we need to move any already loaded profiles
to profiles_to_load list which will cause their reload on subsequent
use. In addition to that, we need to take into account any change
in config/profile configuration setting on cmd context refresh
since this setting could be overriden with --config.
Also, when running commands in the shell, we need to remove the
global profile used from the configuration cascade so the profile
is not incorrectly reused next time when the --profile option is
not specified anymore for the next command in the shell.
This bug only affected profile specified by --profile cmd line
arg, not profiles referenced from LVM metadata.
Before, the cft_check_handle used to direct configuration checking
was part of cmd_context. It's better to attach this as part of the
exact config tree against which the check is done. This patch moves
the cft_check_handle out of cmd_context and it attaches it to the
config tree directly as dm_config_tree->custom->config_source->check_handle.
This change makes it easier to track the config tree check results
and provides less space for bugs as the results are directly attached
to the tree and we don't need to be cautious whether the global value
is correct or not (and whether it needs reinitialization) as it was
in the case when the cft_check_handle was part of cmd_context.
Add CONFIG_FILE_SPECIAL config source id to make a difference between
real configuration tree (like lvm.conf and tag configs) and special purpose
configuration tree (like LVM metadata, persistent filter).
This makes it easier to attach correct customized data to the config
tree that is created out of the source then.
Since decisions in the silent mode may not be always obvious,
print skipped prompt with answer 'n'.
Also document '-qq' behaviour (single -q only shuts
logging, while -qq sets silent mode).
Share DM_REPORT_FIELD_RESERVED_NAME_{HELP,HELP_ALT} between libdm and
any libdm user to handle reserved field names, in this case the virtual
field name to show help instead of failing on unrecognized field.
The libdm user also needs to check the field name so it can fire
proper code in this case (cleanup, exit etc.).
Improve testing for needs_check - when old tools are installed,
issue proper warning from configure, but stop sending needs_check
flag to the old tool.
Support upto 3 levels os nesting signal blocking.
As of today - code blocks signals immediatelly when it opens
VG in read-write mode - this however makes current prompt usage
then partially unusable since user may not 'break' command
during prompt (something most user would expect).
Until a better fix for prompting is implemented, put in support
for signal nesting - thus when prompt enables signal acceptance,
make it possible to really break command at this point.
Adding log_sys_debug for eventual logging of system errors.
(Using debug level, since currently signal handling functions
do not fail when any error is encoutered).
The sysinit.target is ordered even before sockets.target and there
may be situations in which the lvmetad is needed early, for example
in rescue.target to activate some LVs on which mountpoints reside.
Also, like in the case of rescue.target, the sockets.target is not
pulled in (unless some other service pulls it in explicitly).
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1087586#c26
for the summary of the problem.
When quering for dmeventd monitoring status, check first
if lvm2 is configured to monitor to avoid unwanted start
of dmeventd process for answering monitoring status.
When showing ACTIVE status for snapshot's origin,
avoid testing all its snapshot - it's not useful
to tell user origin is inactivate, while it's clearly
available and running - just one of its snapshot leg
is invalid...
Relocate info from thin pool and thin volume segments
to proper code section for segments.
Add discards and thin count status info.
Info is shown with 'lvdisplay --maps' (like for other segments).
For percentage display we need -tpool - so check for layered
device presence here instead of plain pool device.
Also update 'info' - so when pool is 'available' we
display open count for -tpool device instead of mostly
irrelevant pool.
TODO: Maybe we should actually display this open info always?
(even when just -tpool is available, but pool is not)
Emphesize virtual extents for virtual LVs and for
those use 'Virtual extents' instead of 'Logical extents',
so it's immeditatelly visible, which extents do have
straighforward physical backend.
While the 'raid1/10' segment types were being handled inadvertently
by '_move_mirrors()', the parity RAIDs were not being properly checked
to ensure that the user had specified all necessary PVs when moving
them. Thus, internal errors were being triggered when only part of
a RAID LV was moved to the new VG. I've added a new function,
'_move_raid()', which properly checks over any affected RAID LVs and
ensures that all the necessary PVs are being moved.
vgsplit of RAID volumes was problematic because the metadata area
and data areas were always on the same PVs. This problem is similar
to one that was just fixed for mirrors that had log and images sharing
a PV (commit 9ac858f). We can now add RAID LVs to the tests for
vgsplit.
Note that there still seems to be an issue when specifying an
incomplete set of PVs when moving RAID LVs.
Given a named mirror LV, vgsplit will look for the PVs that compose it
and move them to a new VG. It does this by first looking at the log
and then the legs. If the log is on the same device as one of the mirror
images, a problem occurs. This is because the PV is moved to the new VG
as the log is processed and thus cannot be found in the current VG when
the image is processed. The solution is to check and see if the PV we are
looking for has already been moved to the new VG. If so, it is not an
error.
ignore_suspended_devices=0 is already used in lvm.conf we distribute,
but it was still "1" in the code (so it was used when lvm.conf value
was not defined). It should be "0" too.
- better add reference to lvm dumpconfig --type default
than stating that lvmetad is not enabled by default
- substitute #DEFAULT_PID_DIR# with concrete value
Switch to allocate buffer from heap, since it might be potentially
bigger when extremaly large set of volumes would be monitored.
In case of allocation failure send ENOMEM message.
Also implicitelly ignore msg->size when msg->data is NULL.
Ensure daemon-io.h is used as a generic header included
with configure defines before other headers.
(In future all lvm2 libraries should settle on a single lib.h header)
Rename couple defines to better match header file names.
Config variables that are processed during setup prior to calling into
particular tools must not be accessed directly afterwards in case the
values already got overridden.
_process_config() already used the tests I'm removing here to call
lvmetad_set_active() and set up lvmetad_used().
This should be the preferred way of configuring lvm2 for udev/systemd
since otherwise one can end up with the processes run from udev (the
pvscan we run for lvmetad update on events) to be killed prematurely
and this can end up with LVM volumes not activated in the end.
This is sort of info we always ask people to retrieve when
inspecting problems in systemd environment so let's have this
as part of lvmdump directly.
The -s option does not need to be bound to systemd only. We could
add support for initscripts or any other system-wide/service tracking
info that can help us with debugging problems.
By default, the thin_pool_chunk_size is automatically calculated.
When defined, it disables the automatic calculation. So to be more
precise here, we should comment it out for the default.profile.
Also, "lvm dumpconfig --type profilable" was used here to generate
the default.profile content. This will be done automatically in the
future once we have the infrastructure for this in (see also
https://bugzilla.redhat.com/show_bug.cgi?id=1073415).
Perform two allocation attempts with cling if maximise_cling is set,
first with then without positional fill.
Avoid segfaults from confusion between positional and sorted sequential
allocation when number of stripes varies as reported here:
https://www.redhat.com/archives/linux-lvm/2014-March/msg00001.html
Set A_POSITIONAL_FILL if the array of areas is being filled
positionally (with a slot corresponding to each 'leg') rather
than sequentially (with all suitable areas found, to be sorted
and selected from).
Since the kill may take various amount of time,
(especially when running with valgrind)
check it's really pvmoved LV.
Restore initial restart of clvmd - it's currently
broken at various moments - basically killed lvm2
command may leave clvmd and confusing state leading
to reports of internal errors.
Prior adding new reply to the list, check
if the reply thread is not already finished.
In that case discard adding message
(which would otherwise be leaked).
Use mutex to access localsock values, so check
num_replies when the thread is not yet finished.
Check for threadid prior the mutex taking
(though this check is probably not really needed)
Added complexity with extra reply mutex is not worth the troubles.
The only place which may slightly benefit from this mutex is timeout
and since this is rather error case - let's convert it to
localsock.mutex and keep it simple.
Move the pthread mutex and condition creation and destroy
to correct place right after client memory is allocatedd
or is going to be released.
In the original place it's been in race with lvm thread
which could have still unlock mutex while it's been already
destroyed.
When TEST_MODE flag is passed around the cluster,
it's been use in thread unprotected way, so it may have
influenced behaviour of other running parallel lvm commands
(activation/deactivation/suspend/resume).
Fix it by set/query function only under lvm mutex.
For hold_un/lock function calls check lock_flags bits directly.
When pvmove0 is finished, it replaces temporarily pvmove0
with error segment, however in this case, pvmove0 remains
unremovable in case pvmove --abort is interrupted in this
moment - since it's not a pvmove anymore and normal
lvremove can't be used to remove LOCKED lv.
There were two bugs before when using pvcreate --restorefile together
with data alignment and its offset specified:
- the --dataalignment was always ignored due to missing braces in the
code when validating the divisibility of supplied --dataalignment
argument with pe_start which we're just restoring:
if (pp->rp.pe_start % pp->data_alignment)
log_warn("WARNING: Ignoring data alignment %" PRIu64
" incompatible with --restorefile value (%"
PRIu64").", pp->data_alignment, pp->rp.pe_start);
pp->data_alignment = 0
The pp->data_alignment should be zeroed only if the pe_start is not
divisible with data_alignment.
- the check for compatibility of restored pe_start was incorrect too
since it did not properly count with the dataalignmentoffset that
could be supplied together with dataalignment
The proper formula is:
X * dataalignment + dataalignmentoffset == pe_start
So it should be:
if ((pp->rp.pe_start % pp->data_alignment) != pp->data_alignment_offset) {
...ignore supplied dataalignment and dataalignment offset...
}
In general for non-toplevel LVs we shouldn't allow any _tree_action.
For now error on request for cache_pool activation which
doesn't even exist in dm-table.
If there ever would be a second call to dm_lib_init()
and envvar would be improperly set, some last set value
would be used while it should reset to default mangling mode.
When the node enters dtree with implicit dependency, it
automatically has udev flags from parent node
and could not be changed later when the node has been
entered again via i.e lvm's preload tracking.
Resolve this by tracking whether the node has been
created by implicit dependency tracking or has been
entered explicitely. Implicit node could be later
upgraded by an explicit _add_dev() with proper udev_flags.
For implicit devices add special udev flags to avoid
any scan and udev rule processing if we resume such device.
Patch allows easier removing of orphan nodes.
When down-converting a RAID1 LV, if the user specifies too few devices,
they will get a confusing message.
Ex:
[root]# lvcreate -m 2 --type raid1 -n raid -L 500M taft
Logical volume "raid" created
[root]# lvconvert -m 0 taft/raid /dev/sdd1
Unable to extract enough images to satisfy request
Failed to extract images from taft/raid
This patch makes the error message a bit clearer by telling the user
the count they are trying to remove and the number of devices they
supplied.
[root@bp-01 lvm2]# lvcreate --type raid1 -m 3 -L 200M -n lv vg
Logical volume "lv" created
[root@bp-01 lvm2]# lvconvert -m -3 vg/lv /dev/sdb1
Unable to remove 3 images: Only 1 device given.
Failed to extract images from vg/lv
[root@bp-01 lvm2]# lvconvert -m -3 vg/lv /dev/sd[bc]1
Unable to remove 3 images: Only 2 devices given.
Failed to extract images from vg/lv
[root@bp-01 lvm2]# lvconvert -m -3 vg/lv /dev/sd[bcd]1
[root@bp-01 lvm2]# lvs -a -o name,attr,devices vg
LV Attr Devices
lv -wi-a----- /dev/sde1(1)
This patch doesn't work in all cases. The user can specify the right
number of devices, but not a sufficient amount of devices from the LV.
This will produce the old error message:
[root@bp-01 lvm2]# lvconvert -m -3 vg/lv /dev/sd[bcf]1
Unable to extract enough images to satisfy request
Failed to extract images from vg/lv
However, I think this error message is sufficient for this case.
Since the usability problem were fixed, we can use this function.
Cleanup orphan LVs with TEMPORARY flags
(reduces couple blkid error reports, but couple of them
is still left...)
Since cache segment is purely virtual mapping, it has nothing for
discard. Discardable is cache origin here which is now
properly removed on 'delete' phase.
Plain lv_empty() call needs to only detach cache origin and leave
origin unchanged.
This test for LV name restriction check name of device is below 128
chars (which is enforced by dm target).
Thus it should not count with device name.
(Though the test for PATH_MAX size should be probably also added,
but this is runtime test, since theoretically devpath might differ in cluster)
It's unclear why we should prohibit use of -v output.
So reenable (like with other 'display' tools)
But -c -m is really unsupported - return invalid cmd.
Sort order for -C|--columns as with other options,
and use short capital name as the first (as with other options).
Also drop multiple reference for pvs/lvs/vgs, since now
the text for -C is really close to referrence of lvm anyway.
TODO:
It seems commit 7e685e6c70
has changed the old logic, when 'pvs device_name' used
to work. (regression from 2.02.104)
Currently put in extra pvscan.
Drop unused passed cmd pointer from function.
TODO:
We have two similar functions (though not identical)
lv_manip.c: for_each_sub_lv()
metadata.c: _lv_each_dependency()
They seem to not always match - we should probably convert
to use only a single function.
Use proper vgmem memory pool for allocation of LV name in the vg
and check if new renamed LV is a valid name.
TODO: validation should really use also VG name, othewise we are not
able to tell "vgname-lvname" will be valid.
Commit 1a832398a7 moved
some code from _pvchange_single() to main pvchange() and
introduced exit code regression as return codes have not
been properly changed, thus pvchange command exited
with '0' exit code, even though it has reported error.
Also there is a missing vg unlock in error path.
Fix it by counting the total number of expected calls before
checking for pvname and also unlock and relase vg when
pv is not found.
For a quick overview of config when debugging and to quickly check
which values are different from defaults and which are not defined
in the config and for which defaults are used.
If clvmd does not hold any lock, it should also not keep any opened
device.
The reason for this patch is, that refresh_toolcontext calls
dev_cache_exit() which destroys whole device cache (even those with
opened file) - previous patch added recovery path to avoid memory
corruption, but opened files are still bugs that need to be fixed.
So this patch certainly kills many internal mirror & raid tests,
since they leak opened file descriptors (when tests are executed
with 'abort_on_error').
Operate with lvm_thread_exit while holding lvm_thread_mutex.
Don't leave unfinished work in the lvm thread queue
and always finish all queued tasks before exit,
so no cmd struct is left in the list.
(in-release fix)
When lvm2 command works with clvmd and uses locking in wrong way,
it may 'leak' certain file descriptors in opened (incorrect) state.
dev_cache_exit then destroys memory pool of cached devices, while
_open_devices list in dev-io.c was still referencing them if they
were still opened.
Patch properly calls _close() function to 'self-heal' from this
invalid state, but it will report internal error (so execution
with abort_on_internal_error causes immediate death). On the
normal 'execution', error is only reported, but memory state is
corrected, and linked list is not referencing devices from
released mempool.
For crash see: https://bugzilla.redhat.com/show_bug.cgi?id=1073886
When we're trying to search for certain tree node
in daemon's reply, we default to a blank string ""
if the node is not found. This happens during lvmetad
initialization.
However, when the default blank string is used, we
can't use dm_config_find_str at the same time - the
dm_config_find_str_allow_empty should be used instead.
Otherwise a a warning message:
"WARNING: Ignoring unsupported value for ..."
is issued.
Before:
thin_disabled_features = ""
Now:
thin_disabled_features = []
Which is a more correct and consistent way of specifying void array
though parses can handle both forms.
Smallest supported size for swap device is 40KB, however current
test skipped devices smaller then 4096 sectors (2MB).
Since page is in bytes, convert it to sectors before comparing
with device size (in sectors).
We have to close cluster in some predicatable way,
otherwise we may access released memory from different
threads.
So move closedown till the point we know all thread
are closed. New messages from cluster are discarded.
When multiple threads act on the same 'quit' variable
the order of exit becomes unpredictable.
So let the main_loop() finish first and then clean up
all queued lvm jobs.
Do not add any new work, when lvm_thread_exit is set.
Properly clean 'client' structure only for LOCAL_SOCK type.
(Fixes bug from commit 460c19df62)
(in release fix)
Also cleanup-up associated pthreads by using cleanup_zombie() function.
Since this function may change the list, restart scanning always from
the list header.
Note: couple following patches are necessary to make this working properly.
When daemon releases memory and it is still in critical
section, issue an error message and drop memory.
We cannot do anything better for now and we at least
release allocated resource.
FIXME:
This code is triggered when i.e. clvmd is killed while
some LVs are suspend - in this case suspended devices leak,
so if this happens during i.e. clvmd upgrade we have
unresolved problem - even locked rootfs...
This function is typically called for cmd context refresh or destroy.
On the non-clustered case we already unlocked all messages,
however when i.e. 'clvmd' gets break signal it may have
still couple messages queued.
For now just report an error.
Recent debug tracing commit introduce read of uninitialized memory,
since VGID is not really a proper string which ends with '\0'.
Enforce at most 32 (ID_LEN) chars are read from vgid.
(in release fix)
Since commit f12ee43f2e call destroy,
it start to check all VGs are unlocked. However when we become_daemon,
we simply reset locking (since lock is still kept by parent process).
So implement a simple 'reset' flag.
There are two types of CPG communications in a corosync cluster:
messages and state transitions. Cmirrord processes the state
transitions first.
When a cluster mirror issues a POSTSUSPEND, it signals the end of
cluster communication with the rest of the nodes in the cluster.
The POSTSUSPEND marks the last communication of the 'message'
type that will go around the cluster. The node then calls
cpg_leave which causes a final 'state transition' communication to
all of the nodes. Once the out-going node receives its own state
transition notice from the cluster, it finalizes the leave. At this
point, the state of the log is 'INVALID'; but it is possible that
there remains some cluster trafic that was queued up behind the
state transition that still wants to be processed. It is harmless
to attempt to dispatch any remaining messages - they won't be
delivered because the node is no longer in the cluster. However,
there was a warning message that was being printed in this case
that is now removed by this patch. The failure of the dispatch
created a false positive condition that triggered the message.
pvscan --uuid was broken since it was using only 128 char buffers
without checking any write size, so any longer device path leads to
crash.
Also ansure format is properly aligned into columns with this option.
The PV size is displayed in sectors, not kilobytes
for 'pvdisplay -c'
Signed-off-by: Thomas Fehr <fehr@suse.de>
Acked-by: Hannes Reinecke <hare@suse.de>
Instead of sending repeatedly LOCAL_SYNC commands to clvmds
like 'lvs', rememeber the last sent commmand, and if there was no other
clvmd command, drop this redundant SYNC call message.
The problem has started with commit:
56cab8cc03
This introduced correct synchronisation of name, when user requests to know
open_count (needs to wait for udev), however it is also executed for
read-only cases like 'lvs' command.
For now implement very simple solution, which is only monitoring
outgoing clvmd command, and when sequence of LOCAL sync names are
recognized, they are skipped automatically.
TODO:
Future solution might move this variable info 'cmd_context' and
use 'needs_sync' flag also i.e. in file locking code.
When the backup is disabled, avoid testing backup presence.
This only leads to errors being logged in debug trace and the missing
backup can't be fixed, since it's disabled.
Check whether lvm dumpconfig --mergedconfig is used only
with --type current (where we're merging current config and
the config supplied on command line). With other types
the config was merged, but it was thrown away since we're
generating other type of config anyway. This lead to a memleak.
Error out if --mergedconfig is used with anything else than
--type current (or without specifying --type in which case
the --type current is used by default).
Decorate NULL returns with debug_cache output so the
debug log doesn't contain spurios <bactrace> line without
any reason for it.
Add internal errors when cache is misused.
The global/suffix was missing from example lvm.conf but it can
be very useful when using lvm in scripts and now in profiles as well
Let's expose it more.
man/lvm2-activation-generator.8:
Generator Specification -> Generators Specification
(this is the exact word in the systemd man page)
conf/example.conf.in:
cleanup recent edits related to report section
man/lvm.conf.5:
add a line about a possibility to generate a new
profile with lvm dumpconfig command as an alternative
to copying the default profile
Delay archiving of metadata until we really start to
update metadata when converting volume into a snapshot.
Archive is not necessary when we abort operation early.
Do not allow conversion of too small LV into a COW snapshot device.
Without this patch snapshot target is generating these kernel
messages before creation fails:
attempt to access beyond end of device
dm-9: rw=16, want=8, limit=2
attempt to access beyond end of device
...
device-mapper: table: 253:11: snapshot: Failed to read snapshot metadata
device-mapper: ioctl: error adding target to table
device-mapper: reload ioctl on failed: Input/output error
Create a separate function to validation snapshot min chunk value
and relocate code into snapshot_manip file.
This function will be shared with lvconvert then.
Usage of origin as a snapshot 'COW' volume is unsupported.
Without this test lvm2 is able to generate this ugly internal error message.
To test this:
lvcreate -L1 -n lv1 vg
lvcreate -L1 -n lv2 -s vg/lv1
lvcreate -L1 -n lv3 vg
lvconvert -s vg/lv3 vg/lv1
Internal error: LVs (5) != visible LVs (1) + snapshots (1) + internal LVs (0) in VG vg
Users can create several profiles for how the tools report
the output very easily and then just use
<lvm reporting command> --profile <report_profile_name>
We need to use "--verifyudev" for dmsetup mangle command used in
the name-mangling test since without the --verifyudev, we'd end up
with the failed rename.
Also, add direct check for the dev nodes - node with old name must
be gone and node with new name must be present. Before, we checked
just the output of the command.
One bug popped up here when renaming with udev and libdevmapper
fallback checking the udev when target mangle mode is "none"
(fixme added in the libdevmapper's node rename code).
This prevents numerous VG refreshes on each "pvscan --cache -aay" call
if the VG is found complete. We need to issue the refresh only if the PV:
- is new
- was gone before and now it reappears (device "unplug/plug back" scenario)
- the metadata has changed
"%d" in buffer_append_vf is 64 bit wide. Using just `int` for the
variable will fetch more from va_list than intended and shifting
remaining arguments resulting in errors like:
Internal error: Bad format string at '#orphan'
This breaks brew/koji as DESTDIR should not be contained in any file and
results in message like:
+ /usr/lib/rpm/check-buildroot
/builddir/build/BUILDROOT/lvm2-2.02.106-0.311.el7.x86_64/usr/share/man/man8/lvm2-activation-generator.8:.B /builddir/build/BUILDROOT/lvm2-2.02.106-0.311.el7.x86_64/usr/lib/systemd/system-generators/lvm2-activation-generator
Found '/builddir/build/BUILDROOT/lvm2-2.02.106-0.311.el7.x86_64' in installed files; aborting
error: Bad exit status from /var/tmp/rpm-tmp.UfX2SX (%install)
Reorder detection for internal device - since this test
is much simpler then target analysis, check it sooner.
Replace test for '68' with sizeof & ID_LEN
Add FIXME about device alias problem with is_reserved_lvname,
since this test fails on devices like /dev/dm-X
so we need to convert tests to UUID.
Let's do this the other way round - this makes more logic than commit b995f06.
So let's allow empty values for global/thin_disabled_features where
such an empty value now means "none of this features are disabled".
The global/thin_disabled_features should be marked as having no default
value. Otherwise the output from 'lvm dumpconfig --type default' would
have 'thin_disabled_features=""' which will produce an error message
'Ignoring empty string in config file ...' if such output is feed
back to lvm.
Even though we make pool volume as a public visible LV,
we still do not want tools to look at this volume.
While we do not create /dev/vg/lv link, device is still
accessible via /dev/mapper/vg-lv and there is no easy
way to recognize it's private without lvm2 metadata.
Enhance UUID with -pool suffix and directly skip
any LV with a suffix in device_is_usable() call.
TODO: enhance other targets with this logic.
blkid may probably use same simple logic.
When we create thin-pool we have used trick to keep
volume active, but since we now support TEMPORARY flag,
we could just localy active & deactive metadata LV,
and let the thinpool through normal activation process.
The empty pool is also the pool which has yet queued list of messages
and transaction_id == 1.
Problem is exposed when pool is created inactive.
lvcreate -L10 -T vg/pool -an
lvcreate -V10 -T vg/pool
When pool_has_message() is queried with NULL lv and 0 device_id
it should just return 'true' when there is any message queued.
So it needs to return negative value dm_list_empty().
Since there is no user for this code path in code currently,
this bug has not been triggered.
When the last entry in the timeout queue is unregistered,
wakeup sleeping condition, so the thread is deleted earlier.
So the thread resource is release earlier.
Also when monitored with tools like valgrind this eliminites reported
leak.
Individual events are handled through separate threads,
so once we have more then a single thread in this eventwait
sleeping, we got race on the dm_log setting, since
if one event is timeout out on alarm, while another is still waiting,
then dm log has been restored to NULL and the next sigalarm
has been reported as error.
Fix it by introducing counter which is protected via mutex,
and only when the last event is released, logging is restored.
TODO: libdm seems to have some static vars which may audit
for this type of use.
This patch will releases allocated private resources from
startup. Needs previous dm_zalloc patch to ensure unset
private pointer is NULL.
TODO: check on real cluster.
We can't use mempool for temporary variable for configuration path inside
find_config_tree_* functions since these functions can use the mempool
themselves deeper in the code and we can free mempool chunks only from
top to bottom which is not the case here (some default string
configuration values can be allocated from the mempool).
Don't add blkid to every linkage.
Link udev library just with lvm tools.
Drop extra linkage of udev library, since deps from libdevmapper
are already resolved in linked -ldevmapper.
Based on patch:
https://www.redhat.com/archives/lvm-devel/2014-March/msg00015.html
The CPPFunction typedef (among others) have been deprecated in favour of
specific prototyped typedefs since readline 4.2 (circa 2001).
It's been working since because compatibility typedefs have been in
place until they where removed in the recent readline 6.3 release.
Switch to the new style to avoid build breakage.
But also add full backward compatibility with define.
Signed-off-by: Gustavo Zacarias <gustavo zacarias com ar>
The same as for allocation/thin_pool_chunk_size - the default value
used is just a starting point. The calculation continues using the
properties of the devices actually used.
The allocation/thin_pool_chunk_size is a bit more complex. It's default
value is evaluated in runtime based on selected thin_pool_chunk_size_policy.
But the value is just a starting point. The calculation then continues
with dependency on the properties of the devices used. Which means for
such a default value, we know only the starting value.
If the config setting is defined as having no default value, but it's
still not NULL, it means such a value acts as a *hint* only
(e.g. a starting value from which the default value is calculated).
The new "cfg_def_get_default_value_hint" will always return the value
as defined in config_settings.h.
The original "cfg_def_get_default_value" will always return 0/NULL if
the config setting is defined with CFG_DEFAULT_UNDEFINED flag (hence
ignoring the hint).
This is needed for proper distiction between a correct default value
and the value which is just a hint or a starting point in calculation,
but it's not the final value (yes, we do have such settings!).
The devices/cache and devices/cache_dir are evaluated in runtime this way:
- if devices/cache is set, use it
- if devices_cache/dir or devices/cache_file_prefix is set, make up a
path out of that for devices/cache in runtime, taking into account
the LVM_SYSTEM_DIR environment variable if set
- otherwise make up the path out of default which is:
<LVM_SYSTEM_DIR>/<cache_dir>/<cache_file_prefix>.cache
With the runtime defaults, we can encode this easily now. Also, the lvm
dumpconfig can show proper and exact information about this setting then
(the variant that shows default values).
Previously, we declared a default value as undefined ("NULL") for
settings which require runtime context to be set first (e.g. settings
for paths that rely on SYSTEM_DIR environment variable or they depend
on any other setting in some way).
If we want to output default values as they are really used in runtime,
we should make it possible to define a default value as function which
is evaluated, not just providing a firm constant value as it was before.
This patch defines simple prototypes for such functions. Also, there's
new helper macros "cfg_runtime" and "cfg_array_runtime" - they provide
exactly the same functionality as the original "cfg" and "cfg_array"
macros when defining the configuration settings in config_settings.h,
but they don't set the constant default value. Instead, they automatically
link the configuration setting definition with one of these functions:
typedef int (*t_fn_CFG_TYPE_BOOL) (struct cmd_context *cmd, struct profile *profile);
typedef int (*t_fn_CFG_TYPE_INT) (struct cmd_context *cmd, struct profile *profile);
typedef float (*t_fn_CFG_TYPE_FLOAT) (struct cmd_context *cmd, struct profile *profile);
typedef const char* (*t_fn_CFG_TYPE_STRING) (struct cmd_context *cmd, struct profile *profile);
typedef const char* (*t_fn_CFG_TYPE_ARRAY) (struct cmd_context *cmd, struct profile *profile);
(The new macros actually set the CFG_DEFAULT_RUNTIME flag properly and
set the default value link to the function accordingly).
Then such configuration setting requires a function of selected type to
be defined. This function has a predefined name:
get_default_<id>
...where the <id> is the id of the setting as defined in
config_settings.h. For example "backup_archive_dir_CFG" if defined
as a setting with default value evaluated in runtime with "cfg_runtime"
will automatically have "get_default_backup_archive_dir_CFG" function
linked to this setting to get the default value.
Using mempool is much safer than using the global static variable.
The global variable would be rewritten on each find_config_tree_* call
and we need to be very careful not to get into this problem (we don't
do now, but we can with the patches for "runtime defaults" that will follow).
These settings don't have any default value predefined:
log/file
log/activate_file
global/library_dir
This settings has default value but not yet declared in config_settings.h:
global/locking_library (default is DEFAULT_LOCKING_LIB)
cmirrord polls for messages on the kernel and cluster interfaces.
Sometimes it is possible for messages to be received on the cluster
interface and be waiting for processing while the node is in the
process of leaving the cluster group. When this happens, the
messages received on the cluster interface are attempted to be
dispatched, but an error is returned because the connection is no
longer valid. It is a harmless situation. So, if we get the
specific error (CS_ERR_BAD_HANDLE) and we know that we have left
the group, then simply don't print the message.
If the PV label is lost (e.g. by doing a dd on the device), call
"systemd-run pvscan --cache <major>:<minor>" in 69-dm-lvm-metad.rules
to inform lvmetad about this state.
The reason for this is that ENV{SYSTEMD_WANTS}="lvm2-pvscan@<major>:<minor>"
logic will not cause the pvscan to be fired in this case since this works
only on proper device addition/removal cycle - the lvm2-pvscan service's
ExecStop is called only on proper REMOVE event - the service is bound to
device existence. Hence we need pvscan call via systemd-run (that
instantiates a quick transient service just to call the command).
See also https://bugzilla.redhat.com/show_bug.cgi?id=1063813.
It's not so easy to recongnize unusable /dev/kmsg
Reorder the code in a way if the first regular read of /dev/kmsg
fail, fallback to klogctl interface.
Call drain_dmesg also for the case there is no user log output.
Add a bit more complexity here - Switch to use /dev/kmsg
which has been introduced in 3.5 kernels and could run without
lossing lines from /proc/kmsg.
On older systems user may set env var LVM_TEST_CAN_CLOBBER_DMESG=1
to get kernel messages via klogctl() call (which deletes dmesg buffer)
otherwise no logging of kernel messages is provided.
Since there could be multiple readers of kmsg (test & journald) it needs
to be fast, to capture things like sysrq trace.
But to capture whole output it would need to prioritize reading of kmsg,
thus we would first log kernel messages and followed by command output.
As a trade-off always log command output first and use large drain
buffer so is captures most of messages, but occasionaly miss some
lines.
Use separate files for raid1, raid456, raid10.
They need different target versions to work, so support
more precise test selection.
Optimize duplicate tests of target avalability and skip
unsupported test cases sooner.
Basically reverts commit af8580d756.
"test: Use klogctl in the harness instead of reading /var/log/messages."
Problem is - this interface clears dmesg buffer
(just like call of dmesg -c)
Thus after running lvm2 test suitedmesg is empty - while all the
messages are usually logged in the journal/message, it's still not nice to
clear dmesg buffer.
It's not a pure revert, but switch to use /proc/kmsg directly instead of
reading /var/log/messages.
In cases where PV appears on a new device without disappearing from an old one
first, the device->pvid pointers could become ambiguous. This could cause the
ambiguous PV to be lost from the cache when a different PV comes up on one of
the ambiguous devices.
This is probably not optimal, but makes the lvmetad case mimic non-lvmetad code
more closely. It also fixes vgremove of a partially corrupt VG with lvmetad, as
_vg_write_raw (and consequently, entire vg_write) currently panics when it
encounters a corrupt MDA. Ideally, we'd be able to explicitly control when it is
safe to ignore them.
%FREE allocation has been broken for RAID. At 100%FREE, there is
still an extent left for certain tests. For now, change the test
to warn rather than completely fail.
Move flags for segments to segtype header where it seems more closely
related as the features are related to segtype and not activation.
Use unsigned #define - since it's more common in lvm2 source code
for bit flags.
Condition was swapped - however since it's been based on 'random'
memory content it's been missed as attribute has not been set.
So now we have quite a few possible results when testing.
We have old status without separate metadata and
we have kernels with fixed snapshot leak bug.
(in-release update)
Code uses target driver version for better estimation of
max size of COW device for snapshot.
The bug can be tested with this script:
VG=vg1
lvremove -f $VG/origin
set -e
lvcreate -L 2143289344b -n origin $VG
lvcreate -n snap -c 8k -L 2304M -s $VG/origin
dd if=/dev/zero of=/dev/$VG/snap bs=1M count=2044 oflag=direct
The bug happens when these two conditions are met
* origin size is divisible by (chunk_size/16) - so that the last
metadata area is filled completely
* the miscalculated snapshot metadata size is divisible by extent size -
so that there is no padding to extent boundary which would otherwise
save us
Signed-off-by:Mikulas Patocka <mpatocka@redhat.com>
While stripe size is twice the physical extent size,
the original code will not reduce stripe size to maximum
(physical extent size).
Signed-off-by: Zhiqing Zhang <zhangzq.fnst@cn.fujitsu.com>
Switch to use ext2 to make it usable on older systems.
Previous test has not been able to catching problem.
Multiple tests are now put in.
FIXME: validate what is doing kernel target when
the header is undeleted and same chunk size is used.
It seems snapshot target successfully resumes and
just complains COW is not big enough:
kernel: dm-8: rw=0, want=40, limit=24
kernel: attempt to access beyond end of device
When chunk size is different it fails instantly.
For checking this with lvm2 and this test case use this patch:
--- a/tools/lvcreate.c
+++ b/tools/lvcreate.c
@@ -769,7 +769,7 @@ static int _read_activation_params(struct
lvcreate_params *lp,
lp->permission = arg_uint_value(cmd, permission_ARG,
LVM_READ | LVM_WRITE);
- if (lp->snapshot) {
+ if (0 && lp->snapshot) {
/* Snapshot has to zero COW header */
lp->zero = 1;
lp->wipe_signatures = 0;
---
and switch to use -c 4 for both snapshots
When read-only snapshot was created, tool was skipping header
initialization of cow device. If it happened device has been
already containing header from some previous snapshot, it's
been 'reused' for a newly created snapshot instead of being cleared.
To make "lvm dumpconfig --type default" output to be usable like any
other config, we need to comment out lines that have no default value
defined. Otherwise, we'd have the output with config options
with blank or zero values which is not the same as when the value
is not defined! And such configuration can't be feed into lvm again
without further edits. So let's fix this.
Currently this covers these configuration options exactly:
devices/loopfiles
devices/preferred_names
devices/filter
devices/global_filter
devices/types
allocation/cling_tag_list
global/format_libraries
global/segment_libraries
activation/volume_list
activation/auto_activation_volume_list
activation/read_only_volume_list
activation/mlock_filter
metadata/dirs
metadata/disk_areas
metadata/disk_areas/<disk_area>
metadata/disk_areas/<disk_area>/start_sector
metadata/disk_areas/<disk_area>/size
metadata/disk_areas/<disk_area>/id
tags/<tag>
tags/<tag>/host_list
The code seems to work fine for the most trivial case - moving a
simple cache LV. However, it can cause problems when trying to
split out other LVs on different VGs and there hasn't been sufficient
testing for LV stacks that contain cache to enable the code. So,
we actively disable what is already broken and wait for the next
release to fix it.
Start to convert percentage size handling in lvresize to the new
standard. Note in the man pages that this code is incomplete.
Fix a regression in non-percentage allocation in my last check in.
This is what I am aiming for:
-l<extents>
-l<percent> LV/ORIGIN
sets or changes the LV size based on the specified quantity
of logical logical extents (that might be backed by
a higher number of physical extents)
-l<percent> PVS/VG/FREE
sets or changes the LV size so as to allocate or free the
desired quantity of physical extents (that might amount to a
lower number of logical extents for the LV concerned)
-l+50%FREE - Use up half the remaining free space in the VG when
carrying out this operation.
-l50%VG - After this operation, this LV should be using up half the
space in the VG.
-l200%LV - Double the logical size of this LV.
-l+100%LV - Double the logical size of this LV.
-l-50%LV - Reduce the logical size of this LV by half.
Reorder detection of cmirrord. Now if cmirrord is not
running, target will not try to load kernel log module,
for communication with cmirrord.
Whole check for attrs now also happens just once.
Test raid10 availability as a target feature (instead of doing
it in all the places where raid10 should be checked).
TODO: activation needs runtime validation - so metadata with raid10
are skipped from activation in user-friendly way in lvm2.
Parsing vg structure during supend/commit/resume may require a lot of
memory - so move this into vg_write.
FIXME: there are now multiple cache layers which our doing some thing
multiple times at different levels. Moreover there is now different
caching path with and without lvmetad - this should be unified
and both path should use same mechanism.
Reuse _node_send_messages for just checking
for valid transaction_id with preload.
This allows earlier detection of incosistent thin pool.
Code does the same thing, except for sending messages.
Improve testing of transation_id to not allow other difference
then either kernel TID is equal or is lower by oned and there
are queued messages for transaction.
Mark messages as submitted if the transaction_id is already matching.
Do not try to deactivate node on failure here and leave it on
proper error path of the caller.
Deactivation of top level node has to happen,
before traversing subtree.
Swap list logic and rather append new nodes to the head
and then use normal iteration.
(in-release update)
Skip over LVs that have a cache LV in their tree of LV dependencies
when performing a pvmove.
This means that users cannot move a cache pool or a cache LV's origin -
even when that cache LV is used as part of another LV (e.g. a thin pool).
The new test (pvmove-cache-segtypes.sh) currently builds up various LV
stacks that incorporate cache LVs. pvmove tests are then performed to
ensure that cache related LVs are /not/ moved. Once pvmove is enabled
for cache, those tests will switch to ensuring that the LVs /are/
moved.
Several fixes for the recent changes that treat allocation percentages
as upper limits.
Improve messages to make it easier to see what is happening.
Fix some cases that failed with errors when they didn't need to.
Fix crashes when first_seg() returns NULL.
Remove a couple of log_errors that were actually debugging messages.
Test currently fails with make check_cluster - so uses 'should'
CLVMD[4f435880]: Feb 19 23:27:36 Send local reply
format_text/archiver.c:230 WARNING: This metadata update is NOT backed up
metadata/mirror.c:1105 Failed to initialize log device
metadata/mirror.c:1145 <backtrace>
lvconvert.c:1547 <backtrace>
lvconvert.c:3084 <backtrace>
Remove 'skip' argument passed into the function.
We always used '0' - as this is the only supported
option (-K) and there is no complementary option.
Also add some testing for behaviour of skipping.
We already have /dev/disk/by-id/dm-uuid-... (which encompasses the
VG UUID and LV UUID in case of LVs since the mapping's UUID is
VG+LV UUID together) and /dev/disk/by-id/dm-name-... (which encompasses
the VG and LV name in case of LVs).
This patch addds /dev/disk/by-id/lvm-pv-uuid-<PV_UUID> that completes
this scheme and makes navigation a bit easier using PV UUIDs since
one can navigate using PV UUIDs only and there's no need to do extra
PV UUID <--> kernel name matching (the PV UUID is stable across reboots).
This may come in handy in various scripts.
Since we already have the PV UUID stored in udev database (as a result
of blkid call - returned in ID_FS_UUID blkid's variable), this operation
is very cheap indeed, just creating the extra one symlink.
There are typically 2 functions for the more advanced segment types that
deal with parameters in lvcreate.c: _get_*_params() and _check_*_params().
(Not all segment types name their functions according to this scheme.)
The former function is responsible for reading parameters before the VG
has been read. The latter is for sanity checking and possibly setting
parameters after the VG has been read.
This patch adds a _check_raid_parameters() function that will determine
if the user has specified 'stripe' or 'mirror' parameters. If not, the
proper number is computed from the list of PVs the user has supplied or
the number that are available in the VG. Now that _check_raid_parameters()
is available, we move the check for proper number of stripes from
_get_* to _check_*.
This gives the user the ability to create RAID LVs as follows:
# 5-device RAID5, 4-data, 1-parity (i.e. implicit '-i 4')
~> lvcreate --type raid5 -L 100G -n lv vg /dev/sd[abcde]1
# 5-device RAID6, 3-data, 2-parity (i.e. implicit '-i 3')
~> lvcreate --type raid6 -L 100G -n lv vg /dev/sd[abcde]1
# If 5 PVs in VG, 4-data, 1-parity RAID5
~> lvcreate --type raid5 -L 100G -n lv vg
Considerations:
This patch only affects RAID. It might also be useful to apply this to
the 'stripe' segment type. LVM RAID may include RAID0 at some point in
the future and the implicit stripes would apply there. It would be odd
to have RAID0 be able to auto-determine the stripe count while 'stripe'
could not.
The only draw-back of this patch that I can see is that there might be
less error checking. Rather than informing the user that they forgot
to supply an argument (e.g. '-i'), the value would be computed and it
may differ from what the user actually wanted. I don't see this as a
problem, because the user can check the device count after creation
and remove the LV if they have made an error.
When clustered VG is available in the system but we don't have
clustering set up for whatever reason, the lvm2-monitor scripts should
not fail completely just because these clustered VGs are skipped during
vgs/vgchange calls in lvm2-monitor initscript/systemd unit.
Avoid introducing libdm structure allocated in library user.
Use direct call with all currently supported args.
When new arg is added, new function will cover it.
When an origin exists and the 'lvcreate' command is used to create
a cache pool + cache LV, the table is loaded into the kernel but
never instantiated (suspend/resume was never called). A user running
LVM commands would never know that the kernel did not have the
proper state unless they also ran the dmsetup 'table/status' command.
The solution is to suspend/resume the cache LV to make the loaded
tables become active.
Do not use default dependencies that systemd adds to the units
so we have better control of when the service is started/stopped
and we don't end up with unexpected behaviour.
When the activation units are generated if use_lvmetad=0 (no
autoactivation), use --ignoreskippedcluster option for vgchange calls
since the cluster with cLVM is set up by separate units.
This avoids a situation in which the generated activation units are
improperly in failed state just because of the vgchange return value
when clustered VGs are encountered while the activation of non-clustered
VGs does proceed normally.
Introduce a new parameter called "approx_alloc" that is set when the
desired size of a new LV is specified in percentage terms. If set,
the allocation code tries to get as much space as it can but does not
fail if can at least get some.
One of the practical implications is that users can now specify 100%FREE
when creating RAID LVs, like this:
~> lvcreate --type raid5 -i 2 -l 100%FREE -n lv vg
I've added an "Advanced Logical Volume Types" section that I hope
to contain information on the logical volume types that may use
multiple steps and multiple commands to create. Cache is the
first entry into this section. I'd like to see thin and RAID in
here in the future.
Update the man page so the user knows that dm-cache 1.3.0 module
is needed. Also, enforce that in the code and print a warning if
the module is not new enough.
Users now have the ability to convert their existing logical volumes
into cached logical volumes. A cache pool LV must be specified using
the '--cachepool' argument. The cachepool is the small, fast LV used
to cache the large, slow LV that is being converted.
This patch allows users to convert existing logical volumes into
cache pool LVs. Since cache pool LVs consist of data and metadata
sub-LVs, there is also the '--poolmetadata' (similar to thin_pool)
which allows for the specification of the metadata device.
Replace some in-test use of lvs commands with their check
and get equivalent.
Advantage is these 'checking' commands are not necessarily always
valiadated via extensive valgrind testing and also the output noice
is significantly reduces since the output of check/get is suppressed.
lv_active_change will enforce proper activation.
Modification of activation was wrong and lead to misuse of
autoactivation. Fix allows to use proper local exclusive activation,
while the removed code turned this into just exclusive
activation (losing required local property).
lvm2_cluster_activation_red_hat.service.in -> lvm2_cluster_activation_systemd_red_hat.service.in
lvm2_clvmd_red_hat.service.in -> lvm2_clvmd_red_hat.service.in
Edit lvm2-cluster-activation reference on cmirror - take new
lvm2-cmirrord.service, it was just cmirrord(.service) before
as the old initscript was used in compatibility mode.
Also, use WantedBy=multi-user.target instead of sysinit.target
in lvm2-cluster-activation.service.
The commit splits original clvmd service in two new native services
for systemd enabled systems while original init scripts remain unaltered.
New systemd native services:
1) clvmd daemon itself (lvm2_clvmd_red_hat.service.in)
2) (de)activation of clustered VGs (lvm2_cluster_activation_red_hat.service.in)
There're several reasons to split it. First, there's no support for conditional
stop in systemd and AFAIK they don't plan to support it. In other words:
if the deactivation fails for some reason, systemd doesn't care and will simply
kill all remaining processes in original cgroup (by default). Killing the
remaining procs can be suppressed however it doesn't solve the following problem:
You can't repeat the stop command of a failed service. The repeated stop command
is simply not propagated to the service in a failed state. You would have to start
and then try to stop the service again. Unfortunately, this can't be done while
the daemon is still running (and we need the daemon to stay active until all
clustered VGs are deactivated properly).
In a separated setup we need only to restart the failed activation service and
that's fine.
No need to fork lvmetad when running under systemd.
Also, the "lvmetad -R" support has been removed in lvm2 v2.02.98
so remove the ExecReload line that called it on "systemctl reload".
The libblkid can detect DM_snapshot_cow signature and when creating
new LVs with blkid wiping used (allocation/use_blkid_wiping=1 lvm.conf
setting and --wipe y used at the same time - which it is by default).
Do not issue any prompts about this signature when new LV is created
and just wipe it right away without asking questions. Still keep the
log in verbose mode though.
gcc reports:
metadata/merge.c:229:58: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
metadata/merge.c:232:58: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
The DM_EVENT_GET_PARAMETERS requests the parameters under which
the running dmeventd is run and the it sends them to caller.
The parameters sent:
- the pid of the running dmeventd
- foreground state
- exec_method (currently either "direct" or "systemd")
The exact message sent back:
pid=<pid> daemon=<no/yes> exec_method=<direct/systemd>
Trying to restart dmeventd as a reload action is causing problems
under systemd environment. The systemd loses track of new dmeventd
this way. See also https://bugzilla.redhat.com/show_bug.cgi?id=1060134
for more info.
We need to call dmeventd -R directly instead of "systemctl reload dm-event.service"
that was used before (the reload is aimed at configuration reload anyway,
not stateful restart of the daemon - we did this before just because
there's no ExecRestart in systemd and there's only ExecStart and
ExecStop with which we'd lose the state).
Also, use ExecStart="dmeventd -f" to run dmeventd in foreground
(and let's rely on systemd to daemonize it) and change the
service type from "forking" to "simple".
This patch allows users to create cache LVs with 'lvcreate'. An origin
or a cache pool LV must be created first. Then, while supplying the
origin or cache pool to the lvcreate command, the cache can be created.
Ex1:
Here the cache pool is created first, followed by the origin which will
be cached.
~> lvcreate --type cache_pool -L 500M -n cachepool vg /dev/small_n_fast
~> lvcreate --type cache -L 1G -n lv vg/cachepool /dev/large_n_slow
Ex2:
Here the origin is created first, followed by the cache pool - allowing
a cache LV to be created covering the origin.
~> lvcreate -L 1G -n lv vg /dev/large_n_slow
~> lvcreate --type cache -L 500M -n cachepool vg/lv /dev/small_n_fast
The code determines which type of LV was supplied (cache pool or origin)
by checking its type. It ensures the right argument was given by ensuring
that the origin is larger than the cache pool.
If the user wants to remove just the cache for an LV. They specify
the LV's associated cache pool when removing:
~> lvremove vg/cachepool
If the user wishes to remove the origin, but leave the cachepool to be
used for another LV, they specify the cache LV.
~> lvremove vg/lv
In order to remove it all, specify both LVs.
This patch also includes tests to create and remove cache pools and
cache LVs.
This patch allows the creation and removal of cache pools. Users are not
yet able to create cache LVs. They are only able to define the space used
for the cache and its characteristics (chunk_size and cache mode ATM) by
creating the cache pool.
A cache LV - from LVM's perpective - is a user accessible device that
links the cachepool LV and the origin LV. The following functions
were added to facilitate the creation and removal of this top-level
LV:
1) 'lv_cache_create' - takes a cachepool and an origin device and links
them into a new top-level LV of 'cache' segment type. No allocation
is necessary in this function, as the sub-LVs contain all of the
necessary allocated space. Only the top-level layer needs to be
created.
2) 'lv_cache_remove' - this function removes the top-level LV of a
cache LV - promoting the cachepool and origin sub-LVs to top-level
devices and leaving them exposed to the user. That is, the
cachepool is unlinked and free to be used with another origin to
form a new cache LV; and the origin is no longer cached.
(Currently, if the cache needs to be flushed, it is done in this
function and the function waits for it to complete before proceeding.
This will be taken out in a future patch in favor of polling.)
Cache pools require a data and metadata area (like thin pools). Unlike
thin pool, if 'cache_pool_metadata_require_separate_pvs' is not set to
'1', the metadata and data area will be allocated from the same device.
It is also done in a manner similar to RAID, where a single chunk of
space is allocated and then split to form the metadata and data device -
ensuring that they are together.
Building on the new DM function that parses DM cache status, we
introduce the following LVM level functions to aquire information
about cache devices:
- lv_cache_block_info: retrieves information on the cache's block/chunk usage
- lv_cache_policy_info: retrieves information on the cache's policy
I am reverting the commit below - removing the new 'dm_config_get_int'
function and simply calling 'dm_config_get_uint32' while casting the
'int *' pointer parameter.
Commit being reverted:
commit 94377dfd5e
Author: Jonathan Brassow <jbrassow@redhat.com>
Date: Mon Jan 27 05:26:19 2014 -0600
Misc: New function for reading lvm config file fields
Introduce 'dm_config_get_int', which will be used by the upcoming
cachepool segment type.
Avoid use of external origin with size unaligned/incompatible with
thin pool chunk size, since the last chunk is not correctly provisioned
when it is overwritten.
Avoid starting conversion of the LV to the thin pool and thin volume
at the same time. Since this is mostly a user mistake, do not try
to just convert to one of those type, since we cannot assume if the
user wanted LV to become thin volume or thin pool.
Before the fix tool reported pretty strange internal error:
Internal error: Referenced LV lvol1_tdata not listed in VG mvg.
Fixed output:
lvconvert --thinpool lvol0 -T mvg/lvol0
Can't use same LV mvg/lvol0 for thin pool and thin volume.
Since we are currently incapable of providing zeroes for
reextended thin volume area, let's disable extension of
such already reduce thin volumes.
(in-release change)
This patch defines a structure for holding all of the device-mapper
cache target's status information. The associated function provides
an easy way for higher levels (LVM) to consume the information.
This patch finishes the device-mapper interface for the cache and
cachepool segment types (i.e. the cache target).
This patch adds the cache segment type - the second of two necessary
to create cache logical volumes. This segment type references the
cachepool (the small fast device) and the origin (the large slow device);
linking them to create the cache device. The cache device is the
hierarchical device-mapper device that the user ulitmately makes use
of.
The cache segment sources the information necessary to construct the
device-mapper cache target from the origin and cachepool segments to
which it links.
This patch adds the new cachepool segment type - the first of two
necessary to eventually create 'cache' logical volumes. In addition
to the new segment type, updates to makefiles, configure files, the
lv_segment struct, and some necessary libdevmapper flags.
The cachepool is the LV and corresponding segment type that will hold
all information pertinent to the cache itself - it's size, cachemode,
cache policy, core arguments (like migration_threshold), etc.
When lvm2 command forks, it calls reset_locking(),
which as an unwanted side effect unlinked lock file from filesystem.
Patch changes the behavior to just close locked file descriptor
in children - so the lock is being still properly hold in the parent.
Test LVM_LVMETAD_PIDFILE for pid for lvm command.
Fix WHATS_NEW envvar name usage
Fix init order in prepare_lvmetad to respect set vars
and avoid clash with system settings.
Update test to really test the 'is running' message.
Comparing for available feature missed the code path, when
maj is already bigger.
The bug would be only hit in the case, thin pool target would have
increased major version.
When thin volume is using external origin, current thin target
is not able to supply 'extended' size with empty pages.
lvm2 detects version and disables extension of LV past the external
origin size in this case.
Thin LV could be however still reduced and extended freely bellow
this size.
In preparation for other segment types that create and use "pools", we
s/create_thin_pool/create_pool/. This way it is not awkward when creating
a cachepool, for example, to use "create_thin_pool".
Functions that handle set-up, tear-down and creation of thin pool
volumes will be more generally applicable when more targets exist
that make use of device-mapper's persistent data format. One of
these targets is the dm-cache target. I've selected some functions
that will be useful for the cache segment type to be moved, since
they will no longer be thin pool specific but are more broadly
useful to any segment type that makes use of a 'pool' LV.
We need both offset and length when trying to wipe detected signatures.
The libblkid can fail so it's good to have an error message issued for
this state instead of being silent (libblkid does not issue any error
messages here). We just issued "stack" here before but that was not
quite useful if some error occurs...
Clear temporary DM_DISABLE_OTHER_RULES_FLAG properly. This did not
cause any bug or problem as the temporary variable is overwritten next
time it's used again, but we should still clean it properly!
Only flag thin LV for no scanning in udev if this LV is about
to be wiped. This happens only in case the thin LV's pool was not
created with zeroing of the new blocks enabled.
Since support for xfs_check is going to be obsoleted,
replace its usage with xfs_repair -n tool.
However this tool needs further intrumentation, since for really small
xfs devices (having just 1 allocation group) it needs to pass in
flag: "-o force_geometry". As we run the tool with '-n', it should
be safe to pass this flag always.
FIXME: figure way without always passing this flag.
Revert activated volumes if callback fails.
This is currently used only for thin_check failure support.
When thin_check detects failure in thin metadata device, it deactivate
volumes in reversed order that have been preloaded for thin pool activation.
After this change lvm command will not leave active pool subvolumes
in dm table.
The size of any metadata must be ignored when calculating the size of an
orphan PV.
Bug introduced by 603b45e0ed ("pvresize: Do
not use pv_read (get the PV from orphan VG).")
Block creations of archive and backup files for internal orphan VGs.
Bug introduced by 603b45e0ed ("pvresize: Do
not use pv_read (get the PV from orphan VG).")
Do not drop device's flag to report readiness for systemd
processing if there's any event that follows the activatiion
event itself. Otherwise, systemd would lost track of this device
on any other event that follows the activating event (IOW, we
need to make SYSTEMD_READY variable change level-based, not edge-based).
This patch applies for MD and loop devices used as PVs.
(intra-release fix for commit 4c267c7286)
DO NOT USE LVMETAD IF YOU HAVE ANY LVM1-FORMATTED PVS.
You may continue to use it without lvmetad, but do please schedule
an upgrade to the lvm2 format (with 'vgconvert').
Sending the original LVM1 formatted metadata to lvmetad is breaking
assumptions made by the code, so I am marking the format as obsolete for
now and no longer sending it to lvmetad.
This means that if you are using lvmetad, lvm1 volumes will usually
appear invisible - though not always: it depends on exactly what
sequence of commands you run!
The current situation is not satisfactory.
We'll either fix lvmetad and reenable this or we'll fix the code to
issue appropriate warning messages when lvm1 PVs are encountered
to avoid accidents.
(The latest unfixed problem is that lvmetad assumes metadata sequence
numbers exist and always increase - but the lvm1 format does not define
or store any sequence number, confusing both the daemon and client
when default values get passed to-and-fro.)
Several fields used to display 0 if undefined. Recent changes
to the way the fields are reported threw away some tests for
valid pointers, leading to segfaults with 'pvs -o all'.
Reinstate the original behaviour.
If a PV in an existing VG becomes orphaned (with 'pvcreate -ff', for
example) the VG struct cached against its vginfo must be invalidated.
This is because the struct device it references no longer contains
the PV label so becomes incorrect.
This triggers the error:
Internal error: PV $dev unexpectedly not in cache.
when the PV from the cached VG metadata is subsequently looked up
in the cache.
Bug introduced in 2.02.87 by commit 7ad0d47c3c
("Cache and share generated VG structs").
Before:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/loop3 vg12 lvm2 a-- 28.00m 28.00m
/dev/loop4 vg12 lvm2 a-- 28.00m 28.00m
lvm> pvcreate -ff /dev/loop3
Really INITIALIZE physical volume "/dev/loop3" of volume group "vg12" [y/n]? y
WARNING: Forcing physical volume creation on /dev/loop3 of volume group "vg12"
Physical volume "/dev/loop3" successfully created
lvm> pvs
Internal error: PV /dev/loop3 unexpectedly not in cache.
PV VG Fmt Attr PSize PFree
/dev/loop3 vg12 lvm2 a-- 28.00m 28.00m
/dev/loop3 lvm2 a-- 32.00m 32.00m
/dev/loop4 vg12 lvm2 a-- 28.00m 28.00m
After:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/loop3 vg12 lvm2 a-- 28.00m 28.00m
/dev/loop4 vg12 lvm2 a-- 28.00m 28.00m
lvm> pvcreate -ff /dev/loop3
Really INITIALIZE physical volume "/dev/loop3" of volume group "vg12" [y/n]? y
WARNING: Forcing physical volume creation on /dev/loop3 of volume group "vg12"
Physical volume "/dev/loop3" successfully created
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/loop3 lvm2 a-- 32.00m 32.00m
/dev/loop4 vg12 lvm2 a-- 28.00m 28.00m
unknown device vg12 lvm2 a-m 28.00m 28.00m
Pass dnode pointer instead of rather unknown child pointer.
The pointer is currently unused and passing child pointer
is quite undefined, while dnode has at least some usability.
Make this code a bit more readable for Coverity as otherwise
it marks the "type" variable in the "_thin_pool_add_message" fn
as undefined for certain path (...which is normally unreachable anyway,
but let's clean this up).
Introduce FMT_OBSOLETE to identify pool metadata and use it and FMT_MDAS
instead of hard-coded format names.
Explain device accesses on pvscan --cache man page.
lvm has a default umask value in the config file that defaults
to 0077 which lvm changes to during normal operation. This
causes a problem when the code is used as a library with
liblvm as it is changing the umask for the process. This
patch saves off the current umask, sets to what is specified
in the config file and restores it what it was on library
function call exit.
The user is now free to change the umask in their application at
anytime including between library calls.
This fix address BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1012113
Tested by setting umask to 0777 and running the python unit
test and verifying that umask is still same value as expected
at the test completion and with a successful run.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When using filters for the pvscan --cache (the global_filter),
there's a difference between:
pvscan --cache -aay /dev/block/<major>:<minor>
and
pvscan --cache -aay <major>:<minor> (or --major <major> --minor <minor>)
In the first case, we need to be sure to have an exact matching line
in the filter for the device to be used, no aliases are considered
So for example even if we have accept rule for "/dev/sda" present,
this won't apply for "/dev/block/8:0" even though it's the same device!
This is because we're comparing the path used on command line directly
with the path written in the rule.
For the second one, any alias mentioned in the filter will apply
as we're comparing the major and minor pair, not looking at actual
device names - so any alias mentioned in the rules will suffice for
the filtering rule to apply.
For the global_filter to be properly used, we need to call the
second one in the lvm2-pvscan@.service - nobody is able to tell
what value of major:minor the kernel assignes next time, hence
this bug makes the use of global_filter quite unusable!
If there is no define for BLKPBSZGET - we have hard time how to
decrypt physical block size - we can't use here block_size,
since this is usually 4k while we need to use 512b.
FIXME: find some better way, until that enforce value 512.
Eventually we could also try to put in:
+#ifndef BLKPBSZGET
+# define BLKPBSZGET _IO(0x12,123)
+#endif
but this will still not work well on old kernels.
This reverts commit 24639be558.
Ok - seems we could be here a bit too active - and we
may remove devices which are unsuable for reasons we are not
aware of - thus taking down whole device could be way to big hammer.
So we still need some solution to recover from failing preload
and activation - but it needs more tunning.
When activation fails - we may leak large tree of partially loaded
devices in the dm table (i.e. failure in snapshot activation)
The best we can do here is try to deactivate whole device and
remove as much inactive table entries as we can.
When LV is scanned for its dependencies - scan also origin's snapshots,
and thin external origins.
So if any PV from snapshot or external origin device is missing - lvm2 will
avoid trying to activate such device.
The metadata/disk_areas setting was incorrectly registered as
"string" configuration option but it's a section where each area
is defined in its own subsection with "start_sector", "size" and "id"
setting.
This setting is not officialy supported, it's undocumented and it's
used solely for debugging.
Note: At this moment, it does not seem to be working with lvmetad!
When the device is inserted in dev_name_confirmed() stat() is
called twice as _insert() has it's own stat() call.
Extend _insert() parameter with struct stat* - which could be used
if it has been just obtained. When NULL is passed code is
doing its own stat() call as before.
Use internal type by default for thin provisioning.
If user is not interested in thin provisiong and doesn't
have thin provisining supporting tools installed,
configure will just print warning at the end of configure
process about limited support.
Boolean algebra changes for process_each_lv_in_vg().
1st.
Drop process_lv variable since it's not needed.
2nd.
process_lv was always initilized to 0 - so the condition was always true.
It the condition (!tags_supplied && !lvargs_supplied) evaluates as "true",
process_all is already set to 1, so skip vg tags evaluation.
3rd.
Move check for matching lv name in the front of lv tags check
since this check can't be skipped for lvargs_matched counter.
If this filter evaluates to true, skip lv tags evaluation.
Some devices, similarly to us, are not prepared after ADD event, but
after an extra CHANGE event when the device is properly set up.
This includes MD and loop devices. This patch fixes the
SYSTEMD_READY assignment that is crucial for proper functionality
of SYSTEMD_WANTS that we use to instantiate a lvm2-pvscan@.service
systemd service to activate the VG/LVs (see also bug
info).
All that extra handling of foreign devices should eventually be moved
to rules which process those devices primarily (MD and loop)! We should
only check a dedicated variable whether the device is usable or not.
Thin kernel target 1.9 still does not support online resize of
thin pool metadata properly - so disable it with expectation
for much higher version - and reenable after fixing kernel.
If we are stuck in user for too long without output,
grab kernel stack traces.
If we just produce too many lines of output, it's
not probably kernel related bug.
When the test is interrupted because debug.log has got to big,
and the test doesn't react on SIGINT - and needs to be only
killed with SIGKILL - it's still valuable to print at least
a portion of this debug.log (currently 4MB).
LVM_TEST_UNLIMITED could be set to avoid this limitation
(i.e. when busy-looping lvm command needs to be running
for gdb attachment)
Since activation takes only read-lock, there could be
multiple activation running in parallel.
So instead of checking before taking any real lock,
let the locking resolve the problem and just
detect if the reason for failure has been remote
exlusive activation.
It should be also faster, since each activation does
not need to do explicit lock query.
For merging thin snapshot we have to do couple extra
checks before we allow this operation.
We pretend thin-snapshot and thin-origin
are tied together and we have to properly
maintain locking.
The PIE and RELRO compiler/linker options can be used to produce a code
some techniques applied that makes the code more immune to some attacks:
- PIE (Position Independent Executable). It can make use of the ASLR
(Address Space Layout Randomization) provided by kernel to avoid
static locations for .text regions of executables (this is the 'pie'
compiler and linker option)
- RELRO (Relocation Read-Only). This prevents overwrite attacks of
the GOT (Global Offset Table) and PLT (Procedure Lookup Table)
used for relocations by making it read-only after all relocations
are resolved (these are the 'relro' and 'now' linker options) -
hence all symbols are resolved at the very start so there's no
need for those tables to be writeable later.
These compiler/linker options are now used by default for daemons
if the compiler/linker supports it.
In the case we have a dir with multiple objects and for
an individual object file we need special define -
allow to define it without adding extra rules.
To ensure dmeventd.o compilation will use EXTRA_FLAGS:
CFLAGS_dmeventd.o += $(EXTRA_FLAGS)
Then it's better to use:
dmeventd.o: CFLAGS += $(EXTRA_FLAGS)
At the end of lvconvert --snapshot with an active origin, the origin
gets reloaded.
Commit 57c0f72b1d ("lvconvert: use
_reload_lv on more places") accidentally replaced this with a snapshot
LV reload (which does nothing because only the origin is active).
Replacement of pv_read by find_pv_by_name in commit
651d5093ed caused spurious
error messages when running pvcreate or vgextend against an
unformatted device.
Physical volume /dev/loop4 not found
Physical volume "/dev/loop4" successfully created
Physical volume /dev/loop4 not found
Physical volume /dev/loop4 not found
Physical volume "/dev/loop4" successfully created
Volume group "vg1" successfully extended
Make it easier to run a live lvmetad in debugging mode and
to avoid conflicts if multiple test instances need to be run
alongside a live one.
No longer require -s when -f is used: use built-in default.
Add -p to lvmetad to specify the pid file.
No longer disable pidfile if -f used to run in foreground.
If specified socket file appears to be genuine but stale, remove it
before use.
On error, only remove lvmetad socket file if created by the same
process. (Previous code removes socket even while a running instance
is using it!)
Do not use signature wiping for newly created LVs in tests - we're
reusing the devs in tests and such detection could just interfere
inappropriately. We'd need to modify all tests to anwer the prompt
whether any signature found should be removed or not or we'd need
to use "-y" option for all lvcreates in tests. It's better to disable
this feature then and let's do a separate test to test this signature
wiping functionality.
If we're calling pvcreate on a device that already has a PV label,
the blkid detects the existing PV and then we consider it for wiping
before we continue creating the new PV label and we issue a warning
with a prompt whether such old PV label should be removed. We don't
do this with native signature detection code. Let's make it consistent
with old behaviour.
But still keep this "PV" (identified as "LVM1_member" or "LVM2_member"
by blkid) detection when creating new LVs to avoid unexpected PV label
appeareance inside LV.
Collapse 2 ifs and replace log_error() with log_warn(), since\
the reported message is not causing tools error.
(and cannot be probably triggered anyway).
Optimize and cleanup recently introduced new function wipe_lv.
Use compound literals to get nicely initialized wipe_params struct.
Pass in lv as explicit argument for wipe_lv.
Use cmd from lv structure.
Initialize only non-null members so it's easy to see what
is the special arg.
Drop find_merging_snapshot() function. Use find_snapshot()
called after check for lv_is_merging_origin() which
is the commonly used code path - so we avoid duplicated
tests and potential risk of derefering NULL point
in unhandled error path.
This is actually the wipefs functionailty as a matter of fact
(wipefs uses the same libblkid calls).
libblkid is more rich when it comes to detecting various
signatures, including filesystems and users can better
decide what to erase and what should be kept.
The code is shared for both pvcreate (where wiping is necessary
to complete the pvcreate operation) and lvcreate where it's up
to the user to decide.
The verbose output contains a bit more information about the
signature like LABEL and UUID.
For example:
raw/~ # lvcreate -L16m vg
WARNING: linux_raid_member signature detected on /dev/vg/lvol0 at offset 4096. Wipe it? [y/n]
or more verbose one:
raw/~ # lvcreate -L16m vg -v
...
Found existing signature on /dev/vg/lvol0 at offset 4096: LABEL="raw.virt:0" UUID="da6af139-8403-5d06-b8c4-13f6f24b73b1" TYPE="linux_raid_member" USAGE="raid"
WARNING: linux_raid_member signature detected on /dev/vg/lvol0 at offset 4096. Wipe it? [y/n]
The verbose output is the same output as found in blkid.
Use common wipe_lv (former set_lv) fn to do zeroing as well as signature
wiping if needed. Provide new struct wipe_lv_params to define the
functionality.
Bind "lvcreate -W/--wipesignatures y" with proper wipe_lv call.
Also, add "yes" and "force" to lvcreate_params so it's possible
to apply them for the prompt: "WARNING: %s detected on %s. Wipe it? [y/n]".
The wipe_known_signatures fn now wraps the _wipe_signature fn that is called
for each known signature (currently md, swap and luks). This patch makes the
code more readable, not repeating the same sequence when used anywhere in the
code. We're going to reuse this code later...
If the refresh fails for any reason before autoactivation, let's not
make this a stopper for autoactivation itself - just log the error
message if it appears.
The reason is that in some rare situations, we can still hit the
problem with the suspend call to fail (as already described in
commit d8085edf65, also
https://bugzilla.redhat.com/show_bug.cgi?id=1027314). The refresh
itself is done for only one reason - to refresh any dm tables
for LVs for which the underlying PVs got unplugged/disconnected
and then plugged/connected back (see also
https://bugzilla.redhat.com/show_bug.cgi?id=954061 for more info).
In this case, the major:minor pair is changed and we need to
update dm tables for LVs accordingly.
Now if refresh fails, the error is still logged, but autoactivation
continues.
If using lv/vgchange --sysinit -aay and lvmetad is enabled, we'd like to
avoid the direct activation and rely on autoactivation instead so
it fits system initialization scripts.
But if we're calling lv/vgchange --sysinit -aay too early when even
lvmetad service is not started yet, we just need to do the direct
activation instead without printing any error messages (while
trying to connect to lvmetad and not finding its socket).
This patch adds two helper functions - "lvmetad_socket_present" and
"lvmetad_used" which can be used to check for this condition properly
and avoid these lvmetad connections when the socket is not present
(and hence lvmetad is not yet running).
Whole struct will be set to 0, just
if the first member is array, gcc gives warning
we should initialized this element as array,
so pick any later simple type.
It will likely not fail to duplicate empty string, but
just keep the test of result of this function consistent.
Also on error path restore extent_size if in some
case someone would still use that variable.
Put common printf() case into a function and use
the string with text format as direct arg to make
the compile time validation of args easier and
code shorter.
Switch log_error() to log_warn(), since 'return 0'
doesn't cause any failure here.
Revert 4777eb6872 which put
target_present check into init_snapshot_merge(). However
this function is also used when parsing metadata. So we would
get this present test performed even when target is not really
needed. So move this target_present test directly into lvconvert.
Changed naming of methods from camel case to all lower case with
underscores per guidelines. Changed any methods that can be
static methods to static.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The error buffer will stack error messages which is fine. However,
once you retrieve the error messages it doesn't make sense to keep
appending for each additional error message when running in the
context of a library call.
This patch clears and resets the buffer after the user retrieves
the error message.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Add a PV create which takes a paramters object that
has get/set method to configure PV creation.
Current get/set operations include:
- size
- pvmetadatacopies
- pvmetadatasize
- data_alignment
- data_alignment_offset
- zero
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=880395
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Moving the core functionality of vgreduce single into
lib/metadata/vg.c so that the command line and lvm2app library
can call the same core functionality. New function is
vgreduce_single.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
In a previous commit we added the ability for the library to do garbage
collection, to free all the memory used by the library handle buffer by closing
and re-opening the library handle. When we introduced this functionality we
also opened up the opportunity that the user of the python bindings to have
an object that references the old library handle. In this case if the user
tried to use the old object a segmentation fault could occur because the
memory had been previously freed.
This patch tries to mitigate this by storing a copy of the library handle that
was used when the object was created so that it can compare the current in
use pointer with what existed when the object was created. In the case where
they match the operation is permitted to continue, otherwise an exception is
throw, thus avoiding a segmentation fault.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
3.12.0 kernel prevents raid test to be usable,
leaving unremovable devices in table.
This needs to be fixed ASAP, meanwhile disable test to make
test machines at least usable.
Add 'can_use_16T' to detect systems where we could
safely use 16T devices without causing system deadlocks.
16T size leads on those to endless loops in udevd
- it calls blkid which tries cached read from such device
- this ends in endless loop.
Related problems:
https://bugzilla.redhat.com/show_bug.cgi?id=1015028
All labellers always use the "private" (void *) field as the fmt pointer. Making
this fact explicit in the type of the labeller simplifies the label reporting
code which needs to extract the format. Moreover, it removes a number of
error-prone casts from the code.
Only reading a single PV works correctly only in very limited circumstances.
Moreover, we can't rely on the MDA available on the PV either, since it may be
out of date in some circumstances (until now, we believed that PVs that have an
empty MDA are always orphans, but this is not 100% reliable either).
Add internal error warning when string value is used
as sort value for numerical field.
Using log_warn since the function itself does not return error,
so we do not confuse log_error() checker.
Fix buggy usage of "" (empty string) as a numerical string
value used for sorting.
On intel 64b platform this was typically resolve
as 0xffffff0000000000 - which is already 'close' to
UINT64_MAX which is used for _minusone64.
On other platforms it might have been giving
different numbers depends on aligment of strings.
Use proper &_minusone64 for sorting value when the reported
value is NUM.
Note: each numerical value needs to be thought about if it needs
default value &_zero64 or &_minusone64 since for cases, were
value of zero is valid, sorting should not be mixing entries
together.
Add wrapper function for dm_report_field_set_value() which returns void
and return 1, so the code could be shorter.
Add wrapper function for percent display _field_set_percent().
There's a tiny race when suspending the device which is part
of the refresh because when suspend ioctl is performed, the
dm kernel driver executes (do_suspend and dm_suspend kernel fn):
step 1: a check whether the dev is already suspended and
if yes it returns success immediately as there's
nothing to do
step 2: it grabs the suspend lock
step 3: another check whether the dev is already suspended
and if found suspended, it exits with -EINVAL now
The race can occur in between step 1 and step 2. To prevent
premature autoactivation failure, we're using a simple retry
logic here before we fail completely. For a complete solution,
we need to fix the locking so there's no possibility for suspend
calls to interleave each other to cause this kind of race.
This is just a workaround. Remove it and replace it with proper
locking once we have that in!
Failures in the temporary mirror used when up-converting cause dmeventd
to issue 'lvconvert --repair' on the sub-LV, <lv_name>_mimagetmp_?. The
'lvconvert' command refuses to deal with this sub-LV outright - it
expects to be given the name of the top-level LV. So, just like we do
with mirrored logs, we strip-off the portion of the name that is not
the top-level LV and issue the command on the top-level LV instead.
Send error message on stdout, since after _display_info_long()
command return errors.
Patch makes consistent behavior for command:
dmsetup info -c non-existing-dev
&
dmsetup info non-existing-dev
Now both commands report error on stderr when they return error status
for non-existing device.
This patch fixes mostly cluster behavior but also updates
non-cluster reaction where calls like 'lvchange -aln'
lead to incorrect errors for some segment types.
Fix the implicit activation rules where some segment types could
be activated only in exclusive mode in cluster.
lvm2 command was not preserver 'local' property and incorrectly
converted local activations in to plain exclusive, so the local
activation could have activate volumes exclusively, but remotely.
If the volume_list filters out volume from activation,
it is still success result for this function.
Change the error message back to verbose level.
Detect if the volume is active localy before zeroing,
so we report error a bit later for cases, where volume
could not be activated because it doesn't pass through volume
list (but user still could create volume when he disables
zeroing)
Correct return code of activate_lv_excl().
Function is not supposed to return activation state of
activated volume, but return code of the operation.
Since i.e. when activation filter is allowing to activate
volume on current system, it is still success even though
no volume is activated.
MD can directly create partition devices without a need to run
an extra kpartx or partprobe call. We need to react to this event in
a different way as for bare MD devices - we need to handle the ADD event
for KERNEL=="md[0-9]*p[0-9]*" kernel name and trigger the LVM scanning
to update lvmetad to trigger autoactivation and so on...
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1023250
This is an addition to original patch for lvcreate - commit 039bdad.
The same principle applies to lvconvert where there are several steps
during which we need to wipe the existing LV that's being converted
to thin pool, making sure there's no other interference from outside (udev).
Reset the DM_UDEV_OTHER_RULES_FLAG to original value right at the
time of dropping the DM_NOSCAN flag.
When DM_NOSCAN is set, the DM_UDEV_DISABLE_OTHER_RULES_FLAG is also set
to avoid udev processing in "other/foreign" rules. If the noscan flag
is dropped, the DM_UDEV_DISABLE_OTHER_RULES_FLAG should be reset to
its original value.
Also, lvmetad should respect the DM_UDEV_DISABLE_OTHER_RULES_FLAG
because if the volume is set with this flag it:
- definitely is not a top-level device (so makes no sense for lvmetad scanning)
- is not supposed to be scanned further (for any stacking on top of
it, including LVM stacking itself and any autoactivation of stacked LVs)
Remove conditional that boils down to "if yes or no, then do". The
previous condition in the statement is sufficient and the extra
(always true) condition is unnecessary.
This fixes a bug in commit 19baf842 where verify_message
was rejecting the CLVMD_FLAG_REMOTE flag. It was missed
since the patch was ported from an lvm version where that
flag does not exist.
There is a problem with the way mirrors have been designed to handle
failures that is resulting in stuck LVM processes and hung I/O. When
mirrors encounter a write failure, they block I/O and notify userspace
to reconfigure the mirror to remove failed devices. This process is
open to a couple races:
1) Any LVM process other than the one that is meant to deal with the
mirror failure can attempt to read the mirror, fail, and block other
LVM commands (including the repair command) from proceeding due to
holding a lock on the volume group.
2) If there are multiple mirrors that suffer a failure in the same
volume group, a repair can block while attempting to read the LVM
label from one mirror while trying to repair the other.
Mitigation of these races has been attempted by disallowing label reading
of mirrors that are either suspended or are indicated as blocking by
the kernel. While this has closed the window of opportunity for hitting
the above problems considerably, it hasn't closed it completely. This is
because it is still possible to start an LVM command, read the status of
the mirror as healthy, and then perform the read for the label at the
moment after a the failure is discovered by the kernel.
I can see two solutions to this problem:
1) Allow users to configure whether mirrors can be candidates for LVM
labels (i.e. whether PVs can be created on mirror LVs). If the user
chooses to allow label scanning of mirror LVs, it will be at the expense
of a possible hang in I/O or LVM processes.
2) Instrument a way to allow asynchronous label reading - allowing
blocked label reads to be ignored while continuing to process the LVM
command. This would action would allow LVM commands to continue even
though they would have otherwise blocked trying to read a mirror. They
can then release their lock and allow a repair command to commence. In
the event of #2 above, the repair command already in progress can continue
and repair the failed mirror.
This patch brings solution #1. If solution #2 is developed later on, the
configuration option created in #1 can be negated - allowing mirrors to
be scanned for labels by default once again.
Add LV_TEMPORARY flag for LVs with limited existence during command
execution. Such LVs are temporary in way that they need to be activated,
some action done and then removed immediately. Such LVs are just like
any normal LV - the only difference is that they are removed during
LVM command execution. This is also the case for LVs representing
future pool metadata spare LVs which we need to initialize by using
the usual LV before they are declared as pool metadata spare.
We can optimize some other parts like udev to do a better job if
it knows that the LV is temporary and any processing on it is just
useless.
This flag is orthogonal to LV_NOSCAN flag introduced recently
as LV_NOSCAN flag is primarily used to mark an LV for the scanning
to be avoided before the zeroing of the device happens. The LV_TEMPORARY
flag makes a difference between a full-fledged LV visible in the system
and the LV just used as a temporary overlay for some action that needs to
be done on underlying PVs.
For example: lvcreate --thinpool POOL --zero n -L 1G vg
- first, the usual LV is created to do a clean up for pool metadata
spare. The LV is activated, zeroed, deactivated.
- between "activated" and "zeroed" stage, the LV_NOSCAN flag is used
to avoid any scanning in udev
- betwen "zeroed" and "deactivated" stage, we need to avoid the WATCH
udev rule, but since the LV is just a usual LV, we can't make a
difference. The LV_TEMPORARY internal LV flag helps here. If we
create the LV with this flag, the DM_UDEV_DISABLE_DISK_RULES
and DM_UDEV_DISABLE_OTHER_RULES flag are set (just like as it is
with "invisible" and non-top-level LVs) - udev is directed to
skip WATCH rule use.
- if the LV_TEMPORARY flag was not used, there would normally be
a WATCH event generated once the LV is closed after "zeroed"
stage. This will make problems with immediated deactivation that
follows.
The blkdeactivate script iterates over the list of devices if they're
given as an argument and it tries to umount/deactivate them one by one.
This iteration failed to proceed if any of the umount/deactivation
was unsuccessful - there was a missing "shift" call to move to the
next argument (device) for processing. As a result of this, the same
device was tried again and again, causing an endless loop, never
proceeding to the next device given.
When using ENV{SYSTEMD_WANTS}=lvm2-pvscan@... to instantiate a service
for lvmetad scan when the new PV appears in the system, the service
is started and executed. However, to track device removal, we need
to bind it (the "BindsTo" systemd directive) to a certain .device
systemd unit.
In default systemd setup, the device is tracked by it's name and
sysfs path (there's normally a sysfs path .device systemd unit for
a device and then the device name .device unit as an alias for it).
Neither of these two is useful for lvmetad update as we need to bind
it to device's <major>:<minor> pair.
The /dev/block/<major>:<minor> is the essential symlink under /dev
that exists for each block device (created by default udev rules
provided by udev directly). So let's use this as an alias for
the device's .device unit as well by means of "ENV{SYSTEMD_ALIAS}"
declaration within udev rules which systemd understands (this will
create a new alias "dev-block-<major>:<minor>.device".
Then we can easily bind the "dev-block-<major>:<minor>" device
systemd unit with instantiated lvm2-pvscan@<major>:<minor>.service.
So once the device is removed from the systemd, the
lvm-pvscan@<major>:<minor>.service executes it's ExecStop action
(which in turn notifies lvmetad about the device being gone).
This completes the udev-systemd-lvmetad interaction then.
Before, pvscan recognized either:
pvscan --cache --major <major> --minor <minor>
or
pvscan --cache <DevicePath>
When the device is gone and we need to notify lvmetad about device
removal, only --major/--minor works as we can't translate DevicePath
into major/minor pair anymore. The device does not exist in the system
and we don't keep DevicePath index in lvmetad cache to make the
translation internally into original major/minor pair. It would be
useless to keep this index just for this one exact case.
There's nothing bad about using "--major <major> --minor <minor>",
but it makes our life a bit harder when trying to make an
interconnection with systemd units, mainly with instantiated services
where only one and only one arg can be passed (which is encoded in the
service name).
This patch tries to make this easier by adding support for recognizing
the "<major>:<minor>" as a shortcut for the longer form
"--major <major> --minor <minor>". The rule here is simple: if the argument
starts with "/", it's a DevicePath, otherwise it's a <major>:<minor> pair.
There is no point eating stderr for these commands. In fact the
redirect causes confusion and hurts dubugging.
Also reword an error message if the pvs command fails so as not be
certain that a device is not a PV. Coupled with removing the stderr
redirect this will improve the user experience in the face of errors.
The new lvm2-pvscan@.service is responsible for on-demand execution
of "pvscan --cache --activate ay" which causes lvmetad to be
updated and LVM activation done if the VG is complete.
Also, use udev-systemd mechanism to instantiate the job as the
lvm2-pvscan@$devnode.service on each newly appeared PV in the system.
This prevents the background job to be killed (that would happen
if it was directly forked from udev rule - this behaviour is seen
in recent versions of udev with the help of systemd that can track
detached processes - the detached process would still be in the same
cgroup).
To enable this official udev-systemd protocol for instantiating
background jobs, use new --enable-udev-systemd-background-jobs
configure switch (it's disabled by default). This option is highly
recommended wherever systemd is used!
Reverts previously added udevsettle call.
Seems to be unrelated, while udev on old system may take over 10
minutes, to finish it's very slow and CPU intensive work, it doesn't
interact directly with created device, only access /dev/mapper/control
node via dmsetup, so the device is ocasionaly blocked by something else.
Patch helps a bit when lvm2 is build with disabled udev_sync support,
but udevd runs in the system - so it randomly influences unrelated tests
even - so before every test wait at least till udevd is settled.
On modern systems udev manages nodes in /dev/mapper directory.
It creates, deletes and renames the nodes according to the
state of the kernel driver.
When the dmsetup is compiled without udev support (--enable-udev_sync)
and runs on the system with running udevd it tries to manage nodes in
/dev/mapper too, so it can race with udev.
dmsetup checks if the node was created/deleted/renamed with the stat
syscall, and skips the operation if it was. However, if udev
creates/deletes/renames the node after the stat syscall and before the
mknod/unlink/rename syscall, dmsetup reports an error.
Since in the system everything happened as expected, skip reporting
error for such case.
These races can be easily provoked by inserting sleep at appropriate
places.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
This file may be included by other programs, so it should be compliant
with the C standard.
* use __linux__ instead of linux - __linux__ is always defined, linux is
not defined when gcc runs in standard-compliant mode (with -std=c89 or
-std=c99) because the C standard doesn't allow polluting namespace
with arbitrary defines.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Initial testing of thin pool's metadata with thin repairing tools.
Try to use tools from configuration settings, but allow them
to be overriden by settings of these variables:
LVM_TEST_THIN_CHECK_CMD,
LVM_TEST_THIN_DUMP_CMD,
LVM_TEST_THIN_REPAIR_CMD
FIXME: test reveals some more important bugs:
pvremove -ff also needs --yes
vgremove -ff doesn not remove metadata when there are no real LVs.
vgreduce is not able to reduce VG with pool without pool's PVs
Prohibit conversion of pool device with active thin volumes.
Properly restore active states only for active thin pool volume.
Use new LV_NOSCAN when converting volume into thin pool's metadata.
This patch reinstates the lv_info call to check for open count of
the LV we're removing/deactivating - this was changed with commit 125712b
some time ago and we relied on the ioctl retry logic deeper in the libdm
while calling the exact 'remove' ioctl.
However, there are still some situations in which it's still required to
check for open count before we do any 'remove' actions - this mainly
applies to LVs which consist of several sub LVs, like it is for
virtual snapshot devices.
The commit 1146691 fixed the issue with ordering of actions during
virtual snapshot removal while the snapshot is still open. But
the check for the open status of the snapshot is still prone to
marking the snapshot as in use with an immediate exit even though
this could be a temporary asynchronous open only, most notably
because of udev and its WATCH udev rule with accompanying scans
for the event which is asynchronous. The situation where this crops
up most often is when we're closing the LV that was open for read-write
and then calling lvremove immediately.
This patch reinstates the original lv_info call for the open status
of the LV in the lv_check_not_in_use fn that gets called before
we do any LV removal/deactivation. In addition to original logic,
this patch adds its own retry loop with a delay (25x0.2 seconds)
besides the existing ioctl retry loop.
Component LVs of a thinpool can be RAID LVs. Users who attempt a
scrubbing operation directly on a thinpool will be prompted to
specify the sub-LV they wish the operation to be performed on. If
neither of the sub-LVs are RAID, then a message telling them that
the operation can only be performed on a RAID LV will be given.
Split image should have an out-of-sync attr ('I') - always. Even if
the RAID LV has not been written to since the LV was split off, it is
still not part of the group that makes up the RAID and is therefore
"out-of-sync".
Reshape code a bit to make sockepair 'swappable' with plain old pipe
call.
Display status for FAILED error.
Increase buffer to hold always at least 1 page size.
Print error results with capitals.
Since the virtual snapshot has no reason to stay alive once we
detach related snapshot - deactivate whole thing in front of
snapshot removal - otherwice the code would get tricky for
support in cluster.
The correct full solution would require to have transactions
for libdm operations.
Also enable to the check for snapshot being opened prior
the origin deactivation, otherwise we could easily end
with the origin being deactivate, but snapshot still kept
active, desynchronizing locking state in cluster.
Addendum to commit ce7489e which introduced a new *internal* LV_NOSCAN
flag and so it needs to be marked that way properly otherwise it
ends up unrecognized and improperly handled during metadata export.
Recognize DM_SUBSYSTEM_UDEV_FLAG0 which for LVM is the "LVM_NOSCAN"
flag that causes the scanning to be skipped (mainly blkid) and
also directs all the foreign rules to be skipped as well.
Important thing here is that the "watch" udev rules is still set
as well as the /dev/disk/by-id content created (which does not
require any scanning to be done). Also, the flag is dropped on
any subsequent event and scanning done...
A common scenario is during new LV creation when we need to wipe the
newly created LV and avoid any udev scanning before this stage otherwise
it could cause the device (the LV) to be claimed by some other subsystem
for which there were stale metadata within LV data.
This patch adds possibility to mark the LV we're just about to wipe with
a flag that gets passed to udev via DM_COOKIE as a subsystem specific
flag - DM_SUBSYSTEM_UDEV_FLAG0 (in this case the subsystem is "LVM")
so LVM udev rules will take care of handling that.
Patch 562ad293fd introduced code regression
when LV was converted to a thin LV with external origin and at the same time,
conversion of LV to a thin pool has been requested.
(RHBZ: #997704)
data_lv needs to be assigned after test for external conversion find pool.
Accept --ignoreskippedcluster with pvs, vgs, lvs, pvdisplay, vgdisplay,
lvdisplay, vgchange and lvchange to avoid the 'Skipping clustered
VG' errors when requesting information about a clustered VG
without using clustered locking and still exit with success.
The messages can still be seen with -v.
Just like we have symbolic names assigned to general DM udev flags
(DM_UDEV_* flags), we have the same for any subsystem flags now
(DM_SUBSYSTEM_UDEV_FLAG*), making it easier to use.
Each subsystem rule that needs to import any of DM_SUBSYSTEM_UDEV_FLAG*
flags is responsible for doing so. This simply moves control of these
flags from general 10-dm.rules to any subsystem rule using these flags
as each subsystem knows better how to handle these flags on its own.
Some code has been added recently which makes it impossible to compile
when "configure --disable-devmapper" is used. This patch just shuffles
the code around so it's under proper #ifdef DEVMAPPER_SUPPORT.
lib/metadata/lv_manip.c:_sufficient_pes_free() was calculating the
required space for RAID allocations incorrectly due to double
accounting. This resulted in failure to allocate when available
space was tight.
When RAID data and metadata areas are allocated together, the total
amount is stored in ah->new_extents and ah->alloc_and_split_meta is
set. '_sufficient_pes_free' was adding the necessary metadata extents
to ah->new_extents without ever checking ah->alloc_and_split_meta.
This often led to double accounting of the metadata extents. This
patch checks 'ah->alloc_and_split_meta' to perform proper calculations
for RAID.
This error is only present in the function that checks for the needed
space, not in the functions that do the actual allocation.
1) When converting from an x-way mirror/raid1 to a y-way mirror/raid1,
the default behaviour should be to stay the same segment type.
2) When converting from linear to mirror or raid1, the default behaviour
should honor the mirror_segtype_default.
3) When converting and the '--type' argument is specified, the '--type'
argument should be honored.
catch such conditions, but errors in the tests caused the issue to go
unnoticed. The code has been fixed to perform #2 properly, the tests
have been corrected to properly test for #2, and a few other tests
were changed to explicitly specify the '--type mirror' when necessary.
If "default" thin pool chunk size calculation method is selected,
use minimum_io_size, otherwise optimal_io_size for "performance"
device hint exposed in sysfs. If there appear to be PVs with
different hints presented, use their least common multiple.
If the hint is less than the default value defined for the
calculation method, use the default value instead.
Add allocation/thin_pool_chunk_size_calculation lvm.conf
option to select a method for calculating thin pool chunk
sizes and define two possible values - "default" and "performance".
A previous commit (b6bfddcd0a) which
was designed to prevent segfaults during lvextend when trying to
extend striped logical volumes forgot to include calculations for
RAID4/5/6 parity devices. This was causing the 'contiguous' and
'cling_by_tags' allocation policies to fail for RAID 4/5/6.
The solution is to remember that while we can compare
ah->area_count == prev_lvseg->area_count
for non-RAID, we should compare
(ah->area_count + ah->parity_count) == prev_lvseg->area_count
for a general solution.
lvmetad is not yet supported in clustered environment so
disable it automatically if using lvmconf --enable-cluster
and reset it to default value if using lvmconf --disable-cluster.
Also, add a few comments in lvm.conf about locking_type vs. use_lvmetad
if setting it for clustered environment.
The corosync cluster interface for clvmd did not correctly
deal with node up/down events so that when a node was removed
from the cluster clvmd would prevent remote operations
from happening, as it thought the node was up but not
running clvmd.
This patch fixes that code by simplifying the case to node
being up or down - which was the original intention
and is supported by pacemaker and CPG in the higher layers.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
When NULL info struct is passed in - function is usable
as a quick query for lv_is_active_locally() - with a bonus
we may query for layered device.
So it could be seen as a more efficient lv_is_active_locally().
A know issue with kmem_cach is causing failures while testing
RAID 4/5/6 device replacement. Blacklist the offending kernel
so that these tests are not performed there.
Add internal devtypes reporting command to display built-in recognised
block device types. (The output does not include any additional
types added by a configuration file.)
> lvm devtypes -o help
Device Types Fields
-------------------
devtype_all - All fields in this section.
devtype_name - Name of Device Type exactly as it appears in /proc/devices.
devtype_max_partitions - Maximum number of partitions. (How many device minor numbers get reserved for each device.)
devtype_description - Description of Device Type.
> lvm devtypes
DevType MaxParts Description
aoe 16 ATA over Ethernet
ataraid 16 ATA Raid
bcache 1 bcache block device cache
blkext 1 Extended device partitions
...
The traditional style used for optional editable definitions
/* #define X /* */
produces a bogus warning from gcc -Wall.
Rather than suppressing this with -Wno-comment, switch over to
the // comment style.
Since our current vgcfgbackup/restore doesn't deal
with difference of active volumes between current and
restored set of volumes - run test with inactive LVs.
The lvm2-activation-net.service was ordered only with respect to iscsi
and fcoe service before. In addition to that, we also need ordering
with respect to lvm2-activation.service to prevent parallel vgchange -aay
runs which may cause some problems during activation.
See also https://bugs.gentoo.org/show_bug.cgi?id=480066.
With this patch, the ordering is firmly set to:
lvm2-activation-early.service -> lvm2-activation.service -> lvm2-activation-net.service
Thanks to Alexander Tsoy for the original patch (modified a bit here):
https://www.redhat.com/archives/lvm-devel/2013-September/msg00049.html
Rewrite check lv_on and add new lv_tree_on
Move more pvmove test unrelated code out to check & get sections
(so they do not obfuscate trace output unnecesserily)
Use new lv_tree_on()
NOTE: unsure how the snapshot origin should be accounted here.
Split pmove-all-segments into separate tests for raid and thins
(so the test output properly shows what has been skipped in test)
Update usage of "" around shell vars.
trim needs to trim both sides now.
trim also removes debug.log since it's only called when lvm command
has finished properly (so if something fails afterward, there
is no missleading debug trace in the log)
'die' evaluates given string - so \n could be used for
multiline error report
Also remove debug.log since the command finished properly when we
call 'die'
Note: we should not call 'die' after lvm command failure.
lvchange-raid.sh checks to ensure that the 'p'artial flag takes
precedence over the 'w'ritemostly flag by disabling and reenabling
a device in the array. Most of the time this works fine, but
sometimes the kernel can notice the device failure before it is
reenabled. In that case, the attr flag will not return to 'w', but
to 'r'efresh. This is because 'r'efresh also takes precedence over
the 'w'ritemostly flag. So, we also do a quick check for 'r' and
not just 'w'.
The DM_ACTIVATION and DM_UDEV_PRIMARY_SOURCE_FLAG needs to be kept the
way it was for backward compatibility (e.g. the old rules are still
in initramfs). This way the check in whether the device should be
scanned in 69-dm-lvmetad.rules is even easier.
Add a very simple hack for embeding /var/log/messages into
the tests output - it's not ideal since it sometimes breaks lines,
but still gives valuable info.
Add more 'realistic' simulation of dlm locking.
Previous version was not capable to maintain multiple locks.
Current version doesn't handle multiqueues for locks,
so the ordering is different.
Older gcc is giving misleading warning:
metadata/lv_manip.c:4018: warning: ‘seg’ may be used uninitialized in
this function
But warning free compilation is better.
The same corner cases that exist for snapshots on mirrors exist for
any logical volume layered on top of mirror. (One example is when
a mirror image fails and a non-repair LVM command is the first to
detect it via label reading. In this case, the LVM command will hang
and prevent the necessary LVM repair command from running.) When
a better alternative exists, it makes no sense to allow a new target
to stack on mirrors as a new feature. Since, RAID is now capable of
running EX in a cluster and thin is not active-active aware, it makes
sense to pair these two rather than mirror+thinpool.
As further background, here are some additional comments that I made
when addressing a bug related to mirror+thinpool:
(https://bugzilla.redhat.com/show_bug.cgi?id=919604#c9)
I am going to disallow thin* on top of mirror logical volumes.
Users will have to use the "raid1" segment type if they want this.
This bug has come down to a choice between:
1) Disallowing thin-LVs from being used as PVs.
2) Disallowing thinpools on top of mirrors.
The problem is that the code in dev_manager.c:device_is_usable() is unable
to tell whether there is a mirror device lower in the stack from the device
being checked. Pretty much anything layered on top of a mirror will suffer
from this problem. (Snapshots are a good example of this; and option #1
above has been chosen to deal with them. This can also be seen in
dev_manager.c:device_is_usable().) When a mirror failure occurs, the
kernel blocks all I/O to it. If there is an LVM command that comes along
to do the repair (or a different operation that requires label reading), it
would normally avoid the mirror when it sees that it is blocked. However,
if there is a snapshot or a thin-LV that is on a mirror, the above code
will not detect the mirror underneath and will issue label reading I/O.
This causes the command to hang.
Choosing #1 would mean that thin-LVs could never be used as PVs - even if
they are stacked on something other than mirrors.
Choosing #2 means that thinpools can never be placed on mirrors. This is
probably better than we think, since it is preferred that people use the
"raid1" segment type in the first place. However, RAID* cannot currently
be used in a cluster volume group - even in EX-only mode. Thus, a complete
solution for option #2 must include the ability to activate RAID logical
volumes (and perform RAID operations) in a cluster volume group. I've
already begun working on this.
New versions of udev changed the default event timeout to 30s
from original 3min. This causes problems with LVM processes that
starve because of the IO load caused by some LVM actions (e.g.
mirror/raid synchronization).
Reinstate the 3min udev timeout for now until we optimize this
in a way that even the 30s timeout is sufficient.
Creation, deletion, [de]activation, repair, conversion, scrubbing
and changing operations are all now available for RAID LVs in a
cluster - provided that they are activated exclusively.
The code has been changed to ensure that no LV or sub-LV activation
is attempted cluster-wide. This includes the often overlooked
operations of activating metadata areas for the brief time it takes
to clear them. Additionally, some 'resume_lv' operations were
replaced with 'activate_lv_excl_local' when sub-LVs were promoted
to top-level LVs for removal, clearing or extraction. This was
necessary because it forces the appropriate renaming actions the
occur via resume in the single-machine case, but won't happen in
a cluster due to the necessity of acquiring a lock first.
The *raid* tests have been updated to allow testing in a cluster.
For the most part, this meant creating devices with '-aey' if they
were to be converted to RAID. (RAID requires the converting LV to
be EX because it is a condition of activation for the RAID LV in
a cluster.)
This patch fixes the way the special devices are handled
(special in this context means that they're not usable
after the usual ADD event like other generic devices):
- DM and MD devices are pvscanned only when they are just set up.
This is the first CHANGE event that makes the device usable
(the DM_UDEV_PRIMARY_SOURCE_FLAG is set for DM and the
md/array_state sysfs attribute is present for MD).
Whether the device is activated is remembered via
DM_ACTIVATED (for DM) and LVM_MD_PV_ACTIVATED (for MD)
udev environment variable. This is then used to decide
whether we should fire the pvscan on ADD event to
support coldplugging. For any (artificial) ADD event
generated during coldplug, the device must be already
set up properly to fire the pvscan on it.
- Similar for loop devices. For loop devices, only CHANGE
events are relevant (so there's a CHANGE after the loop
device is set up as well as detached). Whether the loop
has just been activated is detected via loop/backing_file
sysfs attribute presence. The activation state is remembered
via LVM_LOOP_PV_ACTIVATED udev environment variable.
- Do not pvscan multipath device components (underlying paths).
- Do not pvscan RAID device components.
- Also, set LVM_SCANNED="1" udev environment variable for
debug purposes (it's visible in the lvmdump -u that takes
the current udev database). This variable is set once
the pvscan is triggered.
The table below summarises when the pvscan is triggered
(marked with X, X* means fire only if the special dev is properly set up):
| real ADD | real CHANGE | artificial ADD | artificial CHANGE | remove
=============================================================================
DM | | X | X* | | X
MD | | X | X* | |
loop | | X | X* | |
other | X | | X | | X
When images and their associated metadata are removed from a RAID1 LV,
the remaining sub-LVs are "shifted" down to fill the gaps. For
example, if there is a 3-way mirror:
[0][1][2]
and we remove device#0, the devices will be shifted down
[1][2]
and renamed.
[0][1]
This can create a problem for resume_lv (specifically,
dm_tree_activate_children) during the renaming process though. This
is because it will attempt to rename the higher indexed sub-LVs first
and find that it cannot because there are currently other sub-LVs with
that name. The solution is to check for a conflicting name before
attempting to rename. If a conflict is found and that conflicting
sub-LV is also in the process of renaming, we can defer the current
rename until the conflicting sub-LV has renamed and cleared the
conflict.
Now that resume_lv can handle these types of rename conflicts, we can
remove the workaround in RAID that was attempting to resume a RAID1
LV from the bottom-up in order to force a proper rename in assending
order before attempting a resume on the top-level LV. This "hack"
only worked for single machine use-cases of LVM. Clearing this up
paves the way for exclusive activation of RAID LVs in a cluster.
Properly skip unmonitoring of thin pool volume in deactivation code
path. Code makes sure if there is just any thin pool user
it stays monitored with all its resources.
When the pool is created from non-linear target the more complex rules
have to be used and stacking needs to properly decode args for _tdata
LV. Also proper allocation policies are being used according to those
set in lvm2 metadata for data and metadata LVs.
Also properly check for active pool and extra code to active it
temporarily.
With this fix it's now possible to use:
lvcreate -L20 -m2 -n pool vg --alloc anywhere
lvcreate -L10 -m2 -n poolm vg --alloc anywhere
lvconvert --thinpool vg/pool --poolmetadata vg/poolm
lvresize -L+10 vg/pool
Udev daemon has recently introduced a limit on the number of udev
processes (there was no limit before). This causes a problem
when calling pvscan --cache -aay in lvmetad udev rules which
is supposed to activate the volumes. This activation is itself
synced with udev and so it waits for the activation to complete
before the pvscan finishes. The event processing can't continue
until this pvscan call is finished.
But if we're at the limit with the udev process count, we can't
instatiate any more udev processes, all such events are queued
and so we can't process the lvm activation event for which the
pvscan is waiting.
Then we're in a deadlock since the udev process with the
pvscan --cache -aay call waits for the lvm activation udev
processing to complete, but that will never happen as there's
this limit hit with the number of udev processes.
The process with pvscan --cache -aay actually times out eventually
(3min or 30sec, depends on the version of udev).
This patch makes it possible to run the pvscan --cache -aay
in the background so the udev processing can continue and hence
we can avoid the deadlock mentioned above.
The commit 82d83a01ce
"autoactivation: refresh existing VG before autoactivation"
causes problems (dangling udev_sync cookies, slow processing
of the pvscan --cache --major --minor call from udev rules)
when the autoactivation handler is run in parallel on
several PVs that belong to the same VG. Revert this patch
until the exact source of the problem is found and then
properly fixed and handled.
Simulate crash of the system and restarted pvmove after next VG
activation.
Test is catching regression introduced in 2.02.99 for partial tree
creation changes.
Function to create slower responsive device.
Useful for testing things which needs to happen something during on
going operation - with 'delayed' device - much smaller sizes of devices
are needed and its much more deterministic (though still not optimal)
Do not allow passing '' names to kernel.
This test was missing also in kernel, so it has allowed
to create device with '' name. This then confused dmsetup tool,
since such name is unexpected and unsupported. To remove
such name from table, user has to use -j -m to specify which device
should be removed.
This patch fixes the posibility to run this operation:
dmsetup rename existingdev ''
after this operation commands like 'dmsetup table' are failing.
This patch prohibits to use such name.
After enable_dev, the following commands were not
consistently seeing the pv on it.
Alasdair explained, "whenever enabling/disabling devs
outside the tools (and you aren't trying to test how
the tools cope with suddenly appearing/disappering
devices) use "vgscan""
Remove default "/tmp" as destination directory if no args
specified for lvm2-activation-generator. Require all the
args to be specified directly for proper functionality.
The original "check" target stays confined to a local device directory, while
check_full does 6 flavours, 3 with a local device directory and 3 with the
global /dev directory (the latter are prefixed with "s" for
"system"). I.e.: normal, cluster, lvmetad, snormal, scluster, slvmetad.
Patch includes RAID1,4,5,6,10 tests for:
- setting writemostly/writebehind
* syncaction changes (i.e. scrubbing operations)
- refresh (i.e. reviving devices after transient failures)
- setting recovery rate (sync I/O throttling)
while the RAID LVs are under a thin-pool (both data and metadata)
* not fully tested because I haven't found a way to force bad
blocks to be noticed in the testsuite yet. Works just fine
when dealing with "real" devices.
Test moving linear, mirror, snapshot, RAID1,5,10, thinpool, thin
and thin on RAID. Perform the moves along with a dummy LV and
also without the dummy LV by specifying a logical volume name as
an argument to pvmove.
The patch allows the user to also pvmove snapshots and origin logical
volumes. This means pvmove should be able to move all segment types.
I have, however, disallowed moving converting or merging logical volumes.
Top-level LVs (like RAID, mirror or thin) are ignored when determining which
portions of an LV to pvmove. If the user specified the name of an LV to
move and it was one of the above types, it would be skipped. The code would
never move on to check whether its sub-LVs needed moving because their names
did not match what the user specified.
The solution is to check whether a sub-LVs is part of the LV whose name was
specified by the user - not just if there was a name match.
In stacked environment where we have a PV layered on top of a
snapshot LV and then removing the LV, lvmetad still keeps information
about the PV:
[0] raw/~ $ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
[0] raw/~ $ vgcreate vg /dev/sda
Volume group "vg" successfully created
[0] raw/~ $ lvcreate -L32m vg
Logical volume "lvol0" created
[0] raw/~ $ lvcreate -L32m -s vg/lvol0
Logical volume "lvol1" created
[0] raw/~ $ pvcreate /dev/vg/lvol1
Physical volume "/dev/vg/lvol1" successfully created
[0] raw/~ $ lvremove -ff vg/lvol1
Logical volume "lvol1" successfully removed
[0] raw/~ $ pvs
No device found for PV BdNlu2-7bHV-XcIp-mFFC-PPuR-ef6K-yffdzO.
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 92.00m
[0] raw/~ $ pvscan --cache --major 253 --minor 3
Device 253:3 not found. Cleared from lvmetad cache.
This is because of the reactivation that is done just before
snapshot removal as part of the process (vg/lvol1 from the example above).
This causes a CHANGE event to be generated, but any scan done
on the LV does not see the original data anymore (in this case
the stacked PV label on top) and consequently the ID_FS_TYPE="LVM2_member"
(provided by blkid scan) is not stored in udev db anymore for the LV.
Consequently, the pvscan --cache is not run anymore as the dev is not
identified as LVM PV by the "LVM2_member" id - lvmetad loses this info
and still keeps records about the PV.
We can run into a very similar problem with erasing the PV label directly:
[0] raw/~ $ lvcreate -L32m vg
Logical volume "lvol0" created
[0] raw/~ $ pvcreate /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[0] raw/~ $ dd if=/dev/zero of=/dev/vg/lvol0 bs=1M
dd: error writing '/dev/vg/lvol0': No space left on device
33+0 records in
32+0 records out
33554432 bytes (34 MB) copied, 0.380921 s, 88.1 MB/s
[0] raw/~ $ pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 92.00m
/dev/vg/lvol0 lvm2 a-- 32.00m 32.00m
[0] raw/~ $ pvscan --cache --major 253 --minor 2
No PV label found on /dev/vg/lvol0.
This patch adds detection of this change from ID_FS_LABEL="LVM2_member"
to ID_FS_LABEL="<whatever_else>" and hence informing the lvmetad
about PV being gone.
These test the toollib functions that select
vgs/lvs to process based on command line args:
empty, vg name(s), lv names(s), vg tag(s),
lv tags(s), and combinations of all.
This patch allows pvmove to operate on RAID, mirror and thin LVs.
The key component is the ability to avoid moving a RAID or mirror
sub-LV onto a PV that already has another RAID sub-LV on it.
(e.g. Avoid placing both images of a RAID1 LV on the same PV.)
Top-level LVs are processed to determine which PVs to avoid for
the sake of redundancy, while bottom-level LVs are processed
to determine which segments/extents to move.
This approach does have some drawbacks. By eliminating whole PVs
from the allocation list, we might miss the opportunity to perform
pvmove in some senarios. For example, if we have 3 devices and
a linear uses half of the first, a RAID1 uses half of the first and
half of the second, and a linear uses half of the third (FIGURE 1);
we should be able to pvmove the first device (FIGURE 2).
FIGURE 1:
[ linear ] [ -RAID- ] [ linear ]
[ -RAID- ] [ ] [ ]
FIGURE 2:
[ moved ] [ -RAID- ] [ linear ]
[ moved ] [ linear ] [ -RAID- ]
However, the approach we are using would eliminate the second
device from consideration and would leave us with too little space
for allocation. In these situations, the user does have the ability
to specify LVs and move them one at a time.
The pool metadata LV must be accounted for when determining what PVs
are in a thin-pool. The pool LV must also be accounted for when
checking thin volumes.
This is a prerequisite for pvmove working with thin types.
The function 'get_pv_list_for_lv' will assemble all the PVs that are
used by the specified LV. It uses 'for_each_sub_lv' to traverse all
of the sub-lvs which may compose it.
Do not print success status for lvm2-activation-generator:
"LVM: Activation generator successfully completed."
"LVM: Logical Volume autoactivation enabled." (if use_lvmetad=1)
Though this information is quite useful during boot, it may
be confusing for users if it happens anytime later and it
actually happens if systemd reloads. This is usually on package
update to update the systemd state and load any new units that are
newly installed in the system. The systemd reload is global and
so any existing generators are rerun at that moment too.
This is a regression caused by commit 3bd9048854.
The error message added with that commit "mpath major %d is not dm major %d" is
superfluous.
When scanning for mpath components, we're looking for a parent device.
But this parent device is not necessarily an mpath device (so the dm device)
if it exists - it can be any other device layered on top (e.g. an MD RAID device).
The bug addressed by this patch manifested itself during testing
by showing a mirror that never became 'in-sync' after creation.
The bug is isolated to distributions that do not have support
for openAIS checkpointing (i.e. > RHEL6, > F16).
When a node joins a group that is managing a mirror log, the other
machines in the group send it a checkpoint representing the current
state of the bitmap. More than one machine can send a checkpoint,
but only the initial one should be imported. Once the bitmap state
has been imported from the initial checkpoint, operations (such
as resync, mark, and clear operations) can begin. When subsequent
checkpoints are allowed to be imported, it has the effect of erasing
all the log operations between the initial checkpoint and the ones
that follow.
When cmirrord was updated to handle the absence of openAIS
checkpointing (commit 62e38da133),
the new import_checkpoint() function failed to honor the 'no_read'
parameter. This parameter was designed to avoid reading all but
the initial checkpoint. Honoring this parameter has solved the
issue of corrupting bitmap data with secondary checkpoints.
Recent kernels allow messages to respond with a string.
Add dm_task_get_message_response() to libdevmapper to perform some
basic sanity checks and return this.
Have 'dmsetup message' display any response.
DM statistics will make extensive use of this.
(From Mikulas.)
If loop device is first configured on systems where /dev/loop-control
is used to dynamically create the loop device itself, there's an
ADD+CHANGE even generated. But next time the existing /dev/loop[0-9]*
is reused, there's only a CHANGE event since the device representing
it is already present in kernel (so no ADD event in this case).
We can't ignore this CHANGE event for loop devices! This is a regression
caused by 756bcabbfe. We already had
a similar problem with MD devices which was fixed by
2ac217d408 (but that one was
only an intra-release fix).
libdm-common.c:883:42: warning: pointer/integer type mismatch in conditional expression
define log_sys_error(x, y) log_err("%s%s%s failed: %s", y, *y ? ": " : "", x, strerror(errno))
So the "y" which was 'path ? : "SELinux context reset"' from
previous commit did not quite fit the other "? :" in the log_sys_macro.
- null_fd resource leak on error path in _reopen_fd_null fn
- dead code in verify_message in clvmd code
- dead code in _init_filter_components in toolcontext code
- null dereference in dm_prepare_selinux_context on error path if
setfscreatecon fails while resetting SELinux context
When the system has no PVs we don't have access to
the cmd pointer and it remains NULL which causes
a seg. fault when we try to free the VG lock.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When autoactivating a VG, there could be an existing VG with exactly
the same PV UUIDs. The PVs could be reappeared after previous
loss/disconnect (for example disconnecting and reconnecting iscsi).
Since there's no "autodeactivation" yet, the mappings for the LVs
from the VG were left in the system even if the device was disconnected.
These mappings also hold the major:minor of the underlying device.
So if the device reappears, it is assigned a different major:minor
pair (...and kernel name). We need to cope with this during
autoactivation so any existing mappings are corrected for any changes.
The VG refresh does that (the vgchange --refresh functionality) -
call this before VG autoactivation.
(If the VG does not exist yet, the VG refresh is NOP)
Split out the partitioned device filter that needs to open the device
and move the multipath filter in front of it.
When a device is multipathed, sending I/O to the underlying paths may
cause problems, the most obvious being I/O errors visible to lvm if a
path is down.
Revert the incorrect <backtrace> messages added when a device doesn't
pass a filter.
Log each filter initialisation to show sequence.
Avoid duplicate 'Using $device' debug messages.
Recent version of util-linux/umount (v2.23+) provides
umount --all-targets that can unmount all the mount targets of
the same device (the bind mounts). Use this if available when
calling the umount blkdeactivate.
Otherwise, for older versions of util-linux, use findmnt
(that is also a part of the util-linux) to iterate over all
mount targets of the same device - this is the manual way.
The blkdeactivate now suppresses error messages from external
tools that are called. Instead, only a summary message "done"
or "skipped" is issued by blkdeactivate as any error in calling
the external tool (e.g. unmounting or deactivating a device) causes
the device to be skipped and the blkdeactivate continues with the
next device in the tree.
Add new -e/--errors switch to display any error messages from
external tools.
Also, suppress any output given by the external tools and add
new -v/--verbose switch to display it including the verbose
output of the tools called (this will enable error reporting
as well).
Also add blkdeactivate -vv for even more debug (the script's debug).
84 files changed, 1540 insertions(+), 442 deletions(-)
Mostly bug fixes this time.
Also note:
md raid replaces dm mirroring as the default implementation.
Can call out to thin_repair to fix thin metadata.
Improved clvmd error detection/debugging information.
According to bug 995193, if a volume group
1) contains a mirror
2) is clustered
3) 'locking_type' = 0 is used
then it is not possible to remove the 'c'luster flag from the VG. This
is due to the way _lv_is_active behaves.
We shouldn't allow the cluster flag to be flipped unless the mirrors in
the cluster are not active. This is because different kernel modules
are used depending on whether a mirror is cluster or not. When we
attempt to see if the mirror is active, we first check locally. If it
is not, then we attempt to check for remotely active instances if the VG
is clustered. Since the no_lock locking type is LCK_CLUSTERED, but does
not implement 'query_resource', remote_lock_held will always return an
error in this case. An error from remove_lock_held is treated as though
the lock _is_ held (i.e. the LV is active remotely). This blocks the
cluster flag from changing.
The solution is to implement 'query_resource' for the no_lock type. It
will report a message and return 1. This will allow _lv_is_active to
function properly. The LV would be considered not active remotely and
the VG can change its flag.
Commit ID 8615234c0f failed to include
the actual code changes that were made to fix the bug. Instead, all
tests went in to validate the bug fix. This patch adds the missing
code changes.
1) Since the min|maxrecoveryrate args are size_kb_ARGs and they
are recorded (and sent to the kernel) in terms of kB/sec/disk,
we must back out the factor multiple done by size_kb_arg. This
is already performed by 'lvcreate' for these arguments.
2) Allow all RAID types, not just RAID1, to change these values.
3) Add min|maxrecoveryrate_ARG to the list of 'update_partial_unsafe'
commands so that lvchange will not complain about needing at
least one of a certain set of arguments and failing.
4) Add tests that check that these values can be set via lvchange
and lvcreate and that 'lvs' reports back the proper results.
gcc -O2 v4.8 on 32 bit architecture is causing a bug in parameter
passing. It does not happen with -01 nor -O0.
The problematic part of the code was strlen use in config.c in
the config_def_check fn and the call for _config_def_check_tree in it:
<snip>
rplen = strlen(rp);
if (!_config_def_check_tree(handle, vp, vp + strlen(vp), rp, rp + rplen, CFG_PATH_MAX_LEN - rplen, cn, cmd->cft_def_hash)) ...
</snip>
If compiled with -O0 (correct):
Breakpoint 1, config_def_check (cmd=0x819b050, handle=0x81a04f8) at config/config.c:775
(gdb) p vp
$1 = 0x8189ee0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb)
_config_def_check_tree (handle=0x81a04f8, vp=0x8189ee0 <_cfg_path>
"config", pvp=0x8189ee6 <_cfg_path+6> "", rp=0xbfffe1e8 "config",
prp=0xbfffe1ee "", buf_size=58, root=0x81a2568, ht=0x81a65
48) at config/config.c:680
(gdb) p vp
$4 = 0x8189ee0 <_cfg_path> "config"
(gdb) p pvp
$5 = 0x8189ee6 <_cfg_path+6> ""
If compiled with -O2 (incorrect):
Breakpoint 1, config_def_check (cmd=cmd@entry=0x8183050, handle=0x81884f8) at config/config.c:775
(gdb) p vp
$1 = 0x8172fc0 <_cfg_path> "config"
(gdb) p strlen(vp)
$2 = 6
(gdb) p vp + strlen(vp)
$3 = 0x8172fc6 <_cfg_path+6> ""
(gdb)
_config_def_check_tree (handle=handle@entry=0x81884f8, pvp=0x8172fc7
<_cfg_path+7> "host_list", rp=rp@entry=0xbffff190 "config",
prp=prp@entry=0xbffff196 "", buf_size=buf_size@entry=58, ht=0x
818e548, root=0x818a568, vp=0x8172fc0 <_cfg_path> "config") at
config/config.c:674
(gdb) p pvp
$4 = 0x8172fc7 <_cfg_path+7> "host_list"
The difference is in passing the "pvp" arg for _config_def_check_tree.
While in the correct case, the value of _cfg_path+6 is passed
(the result of vp + strlen(vp) - see the snippet of the code above),
in the incorrect case, this value is increased by 1 to _cfg_path+7,
hence totally malforming the string that is being processed.
This ends up with incorrect validation check and incorrect warning
messages are issued like:
"Configuration setting "config/checks" has invalid type. Found integer, expected section."
To workaround this issue, remove the "static" qualifier from the
"static char _cfg_path[CFG_PATH_MAX_LEN]". This causes the optimalizer
to be less aggressive (also shuffling the arg list for
_config_def_check_tree call helps).
Commit b248ba0a39 attempted to
prevent mirror devices which had a failed device in their
mirrored log from being usable/readable by LVM. This was to
protect against circular dependancies where one LVM command
could be blocked trying to read one of these affected mirrors
while the LVM command to fix/unblock that mirror was stuck
behind the currently running command.
The above commit went wrong when it used 'device_is_usable()' to
recurse on the mirrored log device to check if it was suspended
or blocked. The 'device_is_usable' function also contains a check
for reserved names - like *_mlog, etc. This last check always
triggered when checking a mirror's log simply because of the name,
not because it was suspended or blocked - a false positive.
The solution is to create a new function like 'device_is_usable',
but without the check for reserved names. Using this new function
(device_is_suspended_or_blocked), we can check the status of a
mirror's log device properly.
If there is no RAID support in the kernel but the default mirror
segtype is "raid1", converting legacy mirrors can be problematic.
For example, changing the log type or converting a mirror to a linear
LV does not require the RAID modules to be present. However, because
lp->segtype is set to be RAID1 by the configuration file, the command
fails.
We should only be setting lp->segtype when converting mirrors if it is
going to change (e.g. to linear or between mirror types).
In those places where mirrors were being created while assuming
a default segment type of "mirror", we include the '--type mirror'
argument to explicitly set the segment type. This will preserve
the mirror testing that is performed even though the default
mirroring segment type is now "raid1".
When both the '-i' and '-m' arguments are specified on the command
line, use the "raid10" segment type. This way, the native RAID10
personality is used through dm-raid rather than layering a mirror
on striped LVs. If the old behavior is desired, the '--type'
argument to use would be "mirror" rather than "raid10".
When reading an info about MDAs from lvmetad, we need to use 64 bit
int to read the value of the offset/size, otherwise the value is
overflows and then it's used throughout!
This is dangerous if we're trying to write such metadata area then,
mostly visible if we're using 2 mdas where the 2nd one is at the end
of the underlying device and hence the value of the mda offset is
high enough to cause problems:
(the offset trimmed to value of 0 instead of 4096m, so we write
at the very start of the disk (or elsewhere if the offset has
some other value!)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 4097m
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
[1] raw/~ # pvchange -u /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" changed
1 physical volume changed / 0 physical volumes not changed
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 d43e d2a5 4c20 4d56 2032 5b78 4135 7225
0000010 4e30 3e2a 0001 0000 0000 0000 0000 0000
0000020 0000 0010 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
=======
(the offset overflows to undefined values which is far behind
the end of the disk)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 100g
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # pvchange -u /dev/vg/lvol0
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
Failed to store physical volume "/dev/vg/lvol0"
0 physical volumes changed / 1 physical volume not changed
When creating a new thin pool and there's no profile requested
via "lvcreate --profile ...", inherit any VG profile if it's attached.
Currently this applies to these settings:
allocation/thin_pool_chunk_size
allocation/thin_pool_discards
allocation/thin_pool_zero
Check that fields in clvm_header are valid when
local or remote messages are received. If not,
log an error, dump the message data and ignore
the message.
When creating a timeout thread for snapshots, the thread is not
tracked and thus never joined. This means that the exit status
of the timeout thread is held indefinitely. Saves a bit of
memory to set PTHREAD_CREATE_DETACHED when creating this thread.
I've also added pthread_attr_init|destroy to setup the creation
pthread_attr_t.
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Initial basic support for repair.
It currently takes pool metadata spare volume, which
is used for recovery. New spare is created if the volume
is successfuly repaired.
After the operation the previous _tmeta volume is moved
into _tmeta%d volume and if everything is ok, this volume
could be removed.
New _tmeta needs to be pvmoved to proper place and also
converted to i.e. mirror if it should be mirrored.
Later version will try to automate some steps here.
Add new configure lvm.conf options for binaries thin_repair
and thin_dump.
Those are part of device-mapper-persistent-data package
and will be used for recovery of thin_pool.
Support tests with abort when libdm encounters internal
error - i.e. for dmsetup tool.
Code execution will be aborted when
env var DM_ABORT_ON_INTERNAL_ERRORS is set to 1
The PREFERRED allocation mechanism requires the number of areas in the
previous LV segment to match the number in the new segment being
allocated. If they do not match, the code may crash.
E.g. https://bugzilla.redhat.com/989347
Introduce A_AREA_COUNT_MATCHES and when not set avoid referring
to the previous segment with the contiguous and cling policies.
When running in the context of the test framework
we need to limit our PVs to use to those created
in the framework.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
There were a few places where the code was incorrectly
using parse arguments for the supplied variable type &
size. Changing the variables to be declared exactly
like python is expecting so if we build on an arch
where the size of type is different than typically
expected we will continue to match. In addition the
parse character needed to be corrected in a few spots
too.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When using a global_filter and if this filter is incorrectly
specified, we ended up with a segfault:
raw/~ $ pvs
Invalid filter pattern "r|/dev/sda".
Segmentation fault (core dumped)
In the example above a closing '|' character is missing at the end
of the regex. The segfault itself was caused by trying to destroy
the same filter twice in _init_filters fn within the error path
(the "bad" goto target):
bad:
if (f3)
f3->destroy(f3);
if (f4)
f4->destroy(f4);
Where f3 is the composite filter (sysfs + regex + type + md + mpath filter)
and f4 is the persistent filter which encompasses this composite filter
within persistent filter's 'real' field in 'struct pfilter'.
So in the end, we need to destroy the persistent filter only as
this will also destroy any 'real' filter attached to it.
363 files changed, 19863 insertions(+), 9055 deletions(-)
This is a very large release - so expect bugs!
Please treat this release like a release candidate.
Changes to the external interfaces since 2.02.98 are not yet frozen.
Updated releases will follow quickly (days not weeks) as any problems
are handled.
commit d00d45a8b6 introduced changes
that are causing cluster mirror tests to fail. Ultimately, I think
the change was right, but a proper clean-up will have to wait.
The portion of the commit we are reverting correlates to the
following commit comment:
2) lib/metadata/mirror.c:_delete_lv() - should have been calling
_activate_lv_like_model() with 'mirror_lv'. This is because
'mirror_lv' is the LV that the overall operation is being
performed on. We need to use this LV as the basis for
determining whether to activate locally, or across the
cluster, etc.
It appears that when legs or logs are removed from a mirror, they
are being activated before they are deleted in order to make them
top-level LVs that can be acted upon. When doing this, it appears
they are not activated based on the characteristics of the mirror
from which they came. IOW, if the mirror was exclusively active,
the sub-LVs are activated globally. This is a no-no. This then
made it impossible to activate_lv_like_model if the model was
"mirror_lv" instead of "lv" in _delete_lv(). Thus, at some point
this change should probably be put back and those location where
the sub-LVs are being improperly activated "shared" instead of
EX should be corrected.
Suggest to use _tdata and _tmeta devices for that.
This fixes regression from too relaxed change in
f1d5f6ae81
Without this patch there are some empty LVs created before
mirror code recognizes it cannot continue.
(in release fix)
In case lvmetad is not used, we need to wait for udev to complete
after net-attached storage is initialized (after iscsi/fcoe service).
N.B. This also requires the storage to be attached synchronously
in the kernel itself.
Three fixme's addressed in this commit:
1) lib/metadata/lv_manip.c:_calc_area_multiple() - this could be
safely changed to a comment explaining that currently because
RAID10 can only have a 2-way mirror, we don't need to know the
number of stripes. However, we will need to know that in the
future if RAID10 is to support more than 2-way mirroring.
2) lib/metadata/mirror.c:_delete_lv() - should have been calling
_activate_lv_like_model() with 'mirror_lv'. This is because
'mirror_lv' is the LV that the overall operation is being
performed on. We need to use this LV as the basis for
determining whether to activate locally, or across the
cluster, etc.
3) tools/lvcreate.c:_lvcreate_params() - Minor clean-up. If
'-m 0' is given, treat it as though the mirroring argument
was not given (i.e. as though the requested segment type
was 'stripe' and not mirror).
The function lvm_vg_list_lvs was returning all logical
volumes, including *_tmeta and *_tdata. Added check
to verify that LV is visible before including in list
of returned logical volumes.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The --type mirror requires -m/--mirrrors:
lvconvert --type mirror vg/lvol0
--type mirror requires -m/--mirrors
Run `lvconvert --help' for more information.
The --type raid* is allowed (the checks already existed):
lvconvert --type raid10 vg/lvol0
Converting the segment type for vg/lvol0 from linear to raid10 is not yet supported.
The --type snapshot is a synonym to -s/--snapshot:
lvconvert -s vg/lvol0 vg/lvol1
Logical volume lvol1 converted to snapshot.
lvconvert --type snapshot vg/lvol0 vg/lvol1
Logical volume lvol1 converted to snapshot.
All the other segment types are not supported, e.g.:
lvconvert --type zero vg/lvol0
Conversion using --type zero is not supported.
Run `lvconvert --help' for more information.
Activation is needed only for clustered VG.
For non-clustered VG skip activation, since deactivate_lv()
is called without problems (no testing for lock presence).
(updates f6ded62291)
When resuming a node needed by a higher layer of the tree,
if the resume fails, only remove it if the node did not
originally have a live table.
Ref. 97f8454ecc
Some of the names for various RAID options have been changed and the man
pages need to reflect it.
lvcreate:
from: minrecoveryrate to: [raid]minrecoveryrate
from: maxrecoveryrate to: [raid]maxrecoveryrate
* plus better clarity on what the arg to these options is
specifying
lvchange:
from: minrecoveryrate to: [raid]minrecoveryrate
from: maxrecoveryrate to: [raid]maxrecoveryrate
* plus better clarity on what the arg to these options is
specifying
from: syncaction to: [raid]syncaction
from: writebehind to [raid]writebehind
* plus change arg from BehindCount to IO/count
from: writemostly to: [raid]writemostly
* plus addition to document [:{t|n|y}] option
lvs:
from: mismatches to: raid_mismatch_count
from: sync_action to: raid_sync_action
add : raid_min|max_recovery_rate, raid_write_behind
When the merging of snapshot is finished, we need to clean dm table
intries for snapshot and -cow device. So for merging snapshot
we have to activate_lv plain 'cow' LV and let the table
resolver to its work - shortly deactivation_lv() request
will follow - in cluster this needs LV lock to be held by clvmd.
Also update a test - add small wait - if lvremove is not 'fast enough'
and merging process has not been stopped and $lv1 removed in background.
Ortherwise the following lvcreate occasionally finds name $lv1 still in use.
(in release fix)
We check the version number of dm-raid before testing certain
features to make sure they are present. However, this has
become somewhat complicated by the fact that the version #'s
in the upstream kernel and the REHL6 kernel have been diverging.
This has been a necessity because the upstream kernel has
undergone ABI changes that have necessitated a bump in the
'Y' component of the version #, while the RHEL6 kernel has not.
Thus, we need to know that the ABI has not changed but the
features have been added. So, the current version #'ing stands
as follows:
RHEL6 Upstream Comment
======|==========|========
** Same until version 1.3.1 **
------|----------|--------
N/A | 1.4.0 | Non-functional change.
| | Removes arg from mapping function.
------|----------|--------
1.3.2 | 1.4.1 | RAID10 fix redundancy validation checks.
------|----------|--------
1.3.5 | 1.4.2 | Add RAID10 "far" and "offset" algorithm support.
| | Note this feature came later in RHEL6 as part of
| | a separate update/feature.
------|----------|--------
1.3.3 | 1.5.0 | Add message interface to allow manipulation of
| | the sync_action.
| | New status (STATUSTYPE_INFO) fields: sync_action
| | and mismatch_cnt.
------|----------|--------
1.3.4 | 1.5.1 | Add ability to restore transiently failed devices
| | on resume.
------|----------|--------
1.3.5 | 1.5.2 | 'mismatch_cnt' is zero unless [last_]sync_action
| | is "check".
------|----------|--------
To simplify, writemostly/writebehind, scrubbing, and transient device
failure restoration are all tested based on the same version
requirements: (1.3.5 < V < 1.4.0) || (V > 1.5.2). Since kernel
support for writemostly/writebehind has been around for some time,
this could mean a reduction in the scope of kernels tested for this
feature. I don't view this as much of a problem, since support for
this feature was only recently added to LVM. Thus, the user would
have to be using a very recent LVM version with an older kernel.
The mismatch count reported by a dm-raid kernel target used
to be effectively random, unless it was checked after a
"check" scrubbing action had been performed. Updates to the
kernel now mean that the mismatch count will be 0 unless a
check has been performed and discrepancies had been found.
This has been the intended behaviour all along.
This patch updates the test suite to handle the change.
The status printed for dm-raid targets on older kernels does not include
the syncaction field. This is handled by dev_manager_raid_status() just
fine by populating the raid status structure with NULL for that field.
However, lv_raid_sync_action() does not properly handle that field being
NULL. So, check for it and return 0 if it is NULL.
As the library handle has a dm pool associated with
it for long running processes it is required to close and
re-open the library to free the dm pool. Call added so
python clients can reclaim memory, lvm.gc().
Note: Any python objects on the heap become invalid at the
time this call is made. If referenced, a seg. fault could
occur. No simple way to make this safe with the current
memory management in the C library. Use with caution!
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Add --poolmetadataspare option and creates and handles
pool metadata spare lv when thin pool is created.
With default setting 'y' it tries to ensure, spare has
at least the size of created LV.
Pool creation involves clearing of metadata device
which triggers udev watch rule we cannot udev synchronize with
in current code.
This metadata devices needs to be activated localy,
so in cluster mode deactivation and reactivation
is always needed.
However for non-clustered mode we may reload table
via suspend/resume path which avoids collision with
udev watch rule which was occasionaly triggering
retry deactivation loop.
Code has been also split into 2 separate code paths
for thin pools and thin volumes which improved readability
of the code as well.
Deactivation has been moved out of extend_pool() and
decision is now in _lv_create_an_lv() which knows
the change mode.
Since we vg_write&commit metadata LV inside lv_extend() call,
proper restore is needed in case something fails.
So add bad: section which deactivates activated LV
and removes it from VG.
Also check early for metadata LV name lengh fail.
The new lvm2-activation-net.service activates LVM volumes
after network-attached devices are set up (iSCSI and FCoE)
if lvmetad is disabled and hence the autoactivation is not
used.
When tree for thin LVs was using external_lv, there has been
far less optimal solution, that has tried to add certain
existing dependencie only when new node was added.
However this has lead to way to complex tree construction since
many repeated checks have been made during such tree build.
This patch move this detection to the proper _partial_tree generation
code and uses for it new 'activation' flag, which is set when
tree for ACTIVATION or PRELOAD is generated.
It increases performance when thins with external origins are used.
(in release update)
Created dlid for test is not needed afterward, so lower a memory
usage of this call is repeatedly used for building some large tree.
TODO: create function to use given buffer on stack as much cheaper.
Code needs to check if the layer origin device is suspended,
It's valid to create thinvolume snapshot of thinvolume which is also
used as an old-style snapshot. In this case we need to check -real
is suspended.
When adding origin_only - add only layer thin volume.
(in case it's also old-snapshot add only -real device)
Remove backup() call from update_pool_lv() as it's been there
duplicated and preperly order backup() call after lvresize,
so there is just one such call.
If the thin pool is known to be active, messages can be passed
to the pool even when the created thin volume is not going to be
activated.
So we do not need to stack large list of message and validate
and catch creation errors earlier in this case.
Replace the test for valid activation combination with simpler list of
deactivation combinations.
Clear send_messages flag when they have been delivered successfully.
There is no need to validate it for all other activations of the same
node in the dm_tree.
Also add extra debug message which shows the reason for skipping
sending of messages because the transaction_id has already the matching
value.
cfg_def_get_path uses a global static var to store the result (for efficiency).
So we need to apply the profile first and then get the path for the config item
when calling find_config_tree_* fns.
Also activation/auto_set_activation is not profilable (at least not now,
maybe later if we need that).
Add thin and thin pool lv creation support to lvm library
This is Mohan's thinp patch, re-worked to include suggestions
from Zdenek and Mohan.
Rework of commit 4d5de8322b
which uses refactored properties handling.
Based on work done by M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The common bits from lib/report/properties.[c|h] have
been moved to lib/properties/prop_common.[c|h] to allow
re-use of property handling functionality without
polluting the report handling functionality.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
- lvs -o lv_attr has now 10 indicator bits
- use '--ignoremonitoring' instead of the shortcut '--ig' used before (since
it would be ambiguous with new '--ignoreactivationskip')
The activation/auto_set_activation_skip enables/disables automatic
adding of the ACTIVATION_SKIP LV flag. By default thin snapshots
are flagged to be skipped during activation.
And by default, the auto_set_activation_skip is enabled.
The lvchange has both -k/--setactivationskip and
-K/--ignoreactivationskip option available for use.
The vgchange has only -K/--ignoreactivationskip, but
not the -k/--setactivationskip as the ACTIVATION_SKIP
flag is an LV property, not a VG one and so we change it
only by using the lvchange...
Also add -k/--setactivationskip y/n and -K/--ignoreactivationskip
options to lvcreate.
The --setactivationskip y sets the flag in metadata for an LV to
skip the LV during activation. Also, the newly created LV is not
activated.
Thin snapsots have this flag set automatically if not specified
directly by the --setactivationskip y/n option.
The --ignoreactivationskip overrides the activation skip flag set
in metadata for an LV (just for the run of the command - the flag
is not changed in metadata!)
A few examples for the lvcreate with the new options:
(non-thin snap LV => skip flag not set in MDA + LV activated)
raw/~ $ lvcreate -l1 vg
Logical volume "lvol0" created
raw/~ $ lvs -o lv_name,attr vg/lvol0
LV Attr
lvol0 -wi-a----
(non-thin snap LV + -ky => skip flag set in MDA + LV not activated)
raw/~ $ lvcreate -l1 -ky vg
Logical volume "lvol1" created
raw/~ $ lvs -o lv_name,attr vg/lvol1
LV Attr
lvol1 -wi------
(non-thin snap LV + -ky + -K => skip flag set in MDA + LV activated)
raw/~ $ lvcreate -l1 -ky -K vg
Logical volume "lvol2" created
raw/~ $ lvs -o lv_name,attr vg/lvol2
LV Attr
lvol2 -wi-a----
(thin snap LV => skip flag set in MDA (default behaviour) + LV not activated)
raw/~ $ lvcreate -L100M -T vg/pool -V 1T -n thin_lv
Logical volume "thin_lv" created
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg
LV Attr
pool twi-a-tz-
thin_lv Vwi-a-tz-
thin_snap Vwi---tz-
(thin snap LV + -K => skip flag set in MDA (default behaviour) + LV activated)
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap -K
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg/thin_lv
LV Attr
thin_lv Vwi-a-tz-
(thins snap LV + -kn => no skip flag in MDA (default behaviour overridden) + LV activated)
[0] raw/~ # lvcreate -s vg/thin_lv -n thin_snap -kn
Logical volume "thin_snap" created
[0] raw/~ # lvs -o name,attr vg/thin_snap
LV Attr
thin_snap Vwi-a-tz-
...when creating config trees while calling config_def_create_tree fn
that constructs a tree out of config_settings.h definition
(CFG_DEF_TREE_NEW/MISSING/DEFAULT/PROFILABLE).
Normally, the lvm dumpconfig processes only the configuration tree
that is at the top of the cascade. Considering the cascade is:
CONFIG_STRING -> CONFIG_PROFILE -> CONFIG_MERGED_FILES/CONFIG_FILE
...then:
(dumpconfig of lvm.conf only)
raw/~ $ lvm dumpconfig allocation
allocation {
maximise_cling=1
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=64
}
(dumpconfig of selected profile configuration only)
raw/~ $ lvm dumpconfig --profile test allocation
allocation {
thin_pool_chunk_size=8
thin_pool_discards="passdown"
thin_pool_zero=1
}
(dumpconfig of given --config configuration only)
raw/~ $ lvm dumpconfig --config 'allocation{thin_pool_chunk_size=16}' allocation
allocation {
thin_pool_chunk_size=16
}
The --mergedconfig option causes the configuration cascade to be
merged before processing it with dumpconfig:
(dumpconfig of merged selected profile and lvm.conf)
raw/~ $ lvm dumpconfig --profile test allocation --mergedconfig
allocation {
maximise_cling=1
thin_pool_zero=1
thin_pool_discards="passdown"
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=8
}
(dumpconfig merged given --config and selected profile and lvm.conf)
raw/~ $ lvm dumpconfig --profile test --config 'allocation{thin_pool_chunk_size=16}' allocation --mergedconfig
allocation {
maximise_cling=1
thin_pool_zero=1
thin_pool_discards="passdown"
mirror_logs_require_separate_pvs=0
thin_pool_metadata_require_separate_pvs=0
thin_pool_chunk_size=16
}
Hence with the --mergedconfig, we are able to see the
configuration that is actually used when processing any
LVM command while using any combination of --config/--profile
options together with lvm.conf file.
Till now, we needed the config tree merge only for merging
tag configs with lvm.conf. However, this type of merging
did a few extra exceptions:
- leaving out the tags section
- merging values in activation/volume_list
- merging values in devices/filter
- merging values in devices/types
Any other config values were replaced by new values.
However, we'd like to do a 'raw merge' as well, simply
bypassing the exceptions listed above. This will help
us to create a single tree representing the cascaded
configs like CONFIG_STRING -> CONFIG_PROFILE -> ...
The reason for this patch is that when trees are cascaded,
the first value found while traversing the cascade is used,
not making any exceptions like we do for tag configs.
When CFG_DEF_TREE_MISSING is created, it needs to know the status
of the check done on the tree used (the CFG_USED flag).
This bug was introduced with f1c292cc38
"make it possible to run several instances of configuration check at
once". This patch separated the CFG_USED and CFG_VALID flags in
a separate 'status' field in struct cft_check_handle.
However, when creating some trees, like CFG_DEF_TREE_MISSING,
we need this status to do a comparison with full config definition
to determine which items are missing and for which default values
were used. Otherwise, all items would be considered missing.
So, pass this status in a new field called 'check_status' in
struct config_def_tree_spec that defines how the (dumpconfig) tree
should be constructed (and this struct is passed to
config_def_create_tree fn then).
Start separating the validation from the action in the basic lvresize
code moved to the library.
Remove incorrect use of command line error codes from lvresize library
functions. Move errors.h to tools directory to reinforce this,
exporting public versions of the error codes in lvm2cmd.h for dmeventd
plugins to use.
Update code to match lvm coding standards
Disable/skip test - since it's accessing VGs available in the system.
Before reenable - validate it's not touching any PV outside those
created during test.
Condition needs to check for passed in pool_metadata_lv_name
which needs to be renamed to _tmeta, for !pool_metadata_lv_name
it's already created with correct _tmeta name.
The idea of using 'eprefix' unfortunatelly fails on /usr moved distros.
But there is also default --sysconfdir which is normaly set to /etc.
If unset it's PREFIX/etc.
For now revert.
FIXME: replace lvm vars with standard ones everyone is using.
Document following items:
configuration cascade (man lvm.conf)
--profile ProfileName (man lvm)
--detachprofile (man vgchange/lvchange)
-o vg_profile/lv_profile (man vgs/lvs)
Also document --config a bit so we can see where it fits in the
configuration cascade - will be documented more in following commit...
Fix and improve handling on sigint.
Always check for signal presence *before* calling of command,
so it will not call the command when break was hit.
If the command has been finished succesfully there is
no problem to mark the command ok and not report interrupt at all.
Fix cuple related stack; reports and assignments.
Seems like we have a bit overcomplicated set of options
for deducing individual install dirs.
This patch fixes installation issues with DESTDIR,
but the whole set of configure options should be simplified.
Create/remove PV
Create/remove snapshots (old type & thin)
PV lookups based on name and UUID
PV resize
Signed-off-by: Tony Asleson <tasleson@redhat.com>
The pv resize code required that a lvm_vg_write be done
to commit the change. When the method to add the ability
to list all PVs, including ones that are not assocated with
a VG we had no way for the user to make the change persistent.
Thus additional resize code was move and now liblvm calls into
a resize function that does indeed write the changes out, thus
not requiring the user to explicitly write out he changes.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
After the last rebase, existing unit test case was
run which uncovered a number of errors that required
attention.
Uninitialized variables and changes to type of numeric
return type were addressed.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Code move and changes to support calling code from
command line and from library interface.
V2 Change lock_vol call
Signed-off-by: Tony Asleson <tasleson@redhat.com>
As locks are held, you need to call the included function
to release the memory and locks when done transversing the
list of physical volumes.
V2: Rebase fix
V3: Prevent VGs from getting cached and then write protected.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Simplified version of lv resize.
v3: Rebase changes to make work. Needed to set sizeargs = 1
to indicate to resize that we are asking for a size based
resize.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Extend the lv resize parameter structure to contain everything
the re-size functions need so that the command line does not
need to be present for lower level calls when we call from
library functions.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Add thin and thin pool lv creation support to lvm library
This is Mohan's thinp patch, re-worked to include suggestions
from Zdenek and Mohan.
V2: Remove const lvm_lv_params_create_thin
Add const lvm_lv_params_skip_zero_get
V3: Changed get/set to use generic functions like current
property
V4: Corrected macro in properties.c
V5: Fixed a bug in liblvm/lvm_lv.c function lvm_lv_create.
incorrectly used pool instead of lv_name when doing the
find_lv_in_vg call.
Based on work done by M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
These settins are customizable by profiles:
allocation/thin_pool_zero
allocation/thin_pool_discards
allocation/thin_pool_chunk_size
activation/thin_pool_autoextend_threshold
activation/thin_pool_autoextend_percent
Besides the classical configuration checks (type checking and
checking whether the item is recognized by lvm tools) for profiles,
do an extra check whether the configuration setting is customizable
by a profile at all. Give a warning message if not.
Before, the status of the configuration check (config_def_check fn call)
was saved directly in global configuration definitinion array (as part
of the cfg_def_item_t/flags)
This patch introduces the "struct cft_check_handle" that defines
configuration check parameters as well as separate place to store
the status (status here means CFG_USED and CFG_VALID flags, formerly
saved in cfg_def_item_t/flags). This struct can hold config check
parameters as well as the status for each config tree separately,
thus making it possible to run several instances of config_def_check
without interference.
Just to make it more clear and also not to confuse
config_valid with check against config definition
(and its 'valid' flag within the config defintion tree).
The command to change the profile for existing VG/LV:
"vgchange/lvchange --profile <profile_name>"
The command to detach any existing profile from VG/LV:
"vgchange/lvchange --detachprofile"
If "vgcreate/lvcreate --profile <profile_name>" is used, the profile
name is automatically stored in metadata for making it possible to
load it automatically next time the VG/LV is used.
This is per VG/LV profile loading on demand. The profile itself is saved
in struct volume_group/logical_volume as "profile" field so we can
reference it whenever needed.
When placing the profile in a configuration cascade, this sequence is
used exactly:
CONFIG_STRING -> CONFIG_PROFILE -> CONFIG_FILE/MERGED_FILES
So if the profile is used, it overloads the lvm.conf (and any
existing tag configs). However, if "--config" is used to define
a custom configuration on command line, this overloads even the
profile config!
This patch adds --profile arg to lvm cmds and adds config/profile_dir
configuration setting to select the directory where profiles are stored
By default it's /etc/lvm/profile.
The profiles are added by using new "add_profile" fn and then loaded
using the "load_profile" fn. All profiles are stored in a cmd context
within the new "struct profile_params":
struct profile_params {
const char *dir;
struct profile *global_profile;
struct dm_list profiles_to_load;
struct dm_list profiles;
};
...where "dir" is the directory with profiles, "global_profile" is
the profile that is set globally via the --profile arg (IOW, not
set per VG/LV basis based on metadata record) and the "profiles"
is the list with loaded profiles.
Configuration profiles are selected configuration items that can
be loaded dynamically on demand and overlayed over existing
configuration on demand (either on cmd line by selecting the profile
name to be used globally or retrieved from metadata and used per
VG/LV basis only).
The default directory where profiles are stored is configurable
at compile time with --with-default-profile-subdir.
A helper type that helps with identification of the configuration source
which makes handling the configuration cascade a bit easier, mainly
removing and adding configuration trees to cascade dynamically.
Currently, the possible types are:
CONFIG_UNDEFINED - configuration is not defined yet (not initialized)
CONFIG_FILE - one file configuration
CONFIG_MERGED_FILES - configuration that is a result of merging more files into one
CONFIG_STRING - configuration string typed on cmd line directly
CONFIG_PROFILE - profile configuration (the new type of configuration, patches will follow...)
Also, generalize existing "remove_overridden_config_tree" to work with
configuration type identification in a cascade. Before, it was just
the CONFIG_STRING we used. Now, we need some more to add in a
cascade (like the CONFIG_PROFILE). So, we have:
struct dm_config_tree *remove_config_tree_by_source(struct cmd_context *cmd, config_source_t source);
config_source_t config_get_source_type(struct dm_config_tree *cft);
... for removing the tree by its source type from the cascade and
simply getting the source type.
In the last update not all code paths have set the archived flag.
If we run in test mode or without archiving enabled - set the bit
as well - so test whether archiving has been called succesfully
will be ok. (in relase fix).
Do not keep multiple archives for the executed command.
Reuse the ALLOCATABLE_PV from pv status for
ARCHIVED_VG vg status. Mark VG with the bit with the
first archivation.
Instead of calling syslog() from signal event handler,
run all logging code in the main loop.
Also it needs to take the lock and check for list
only when really needed.
...not the other way round as it was before. This way it makes
more sense as BA use is exceptional and it's useless to
contaminate the log with messages about BA not being found
in metadata.
Show 'at' pointer address with pool name.
It's useful for debugging to be able to locate pointer address in the
debug trace log. It's only available when compiled with extra debug
compilation flag DEBUG_POOL in make.tmpl.
Since reduce the message has informational character and doesn't lead
to exit of the command - reduce the log level to info print as we
use for other similar types.
Reindent next print message.
Test the different RAID lvchange scenarios under snapshot as well.
This patch also updates calculations for where to write to an
underlying PV when testing various syncactions.
If the user would upconvert a linear LV to a mirror without specifying
the segment type ("--type mirror" vs "--type raid1"), the "mirror"
segment type would be chosen without consulting the 'default_mirror_segtype'
setting in lvm.conf. This is now used as the basis for determining
which should be used if left unspecified.
Check for enough space in preallocated buffer.
Fixes problem, when lvm code started to suddenly allocate
too big memory chunks.
TODO: lvmetad protocol should announce needed size ahead,
so if metadata have 1MB we are not reallocating memory...
When vgname has not existed in metadata, it has crashed on double free
in format_instance destroy() - since VG was created, used FID and was
released - which also released FID, so further use was accessing bad
memory.
Fix it for this code path before release_vg() so FID will exists
when _vg_read_file_name() returns NULL.
Revert commit 37ffe6a. If static variables are to be used then we
will put them elsewhere and limit the optimization to reporting
code, rather that have it be used in the general case.
Assumed size of 4M was too large and the test was failing because
'dd' was failing to perform its write.
Calculate the size we need to write with 'dd' instead, so we
don't overrun the device.
aux updates:
prepare_vg now created clustered VG for cluster tests.
since dm-raid doesn't work in cluster, skip the cluster
test when someone checks for dm-raid target until fixed.
Reinstate the previous sort order for origin_size, so that LVs with
an empty origin_size continue to appear at the start of the list
not the end.
Ref. 9d445f371c
Support vgsplit for VGs with thin pools and thin volumes.
In case the thin data and thin metadata volumes are moved to a new VG,
move there also all related thin volumes and check that external origins
are also present in this new VG.
We use mpath filtering (enabled by devices/multipath_component_detection=1
lvm.conf setting) to avoid a situation in which we could end up with
duplicate PVs found. We need to filter out the mpath components and
use only the top-level multipath mapping instead for PV scans.
However, if the there are partitions on multipath components, we need
to filter out these partitions. This patch fixes it so those
partitions found on multipath components are filtered as well.
For example, let's consider following configuration:
The sda and sdb are mpath components, sda1 and sdb1 the partitions
on these components, mpath-test the mpath mapping and mpath-test1
the partition mapping - created automatically by kpartx right
after mpath-test creation. The PV resides on top.
(LVM PV)
|
mpath-test1
|
mpath-test
|
sda1 ---------- sdb1
\ | |/
sda sdb
E.g. for sda1 and sdb1, the code will detect this and it skips
the partition that belongs to the multipath component:
<snippet from the log>
#filters/filter-mpath.c:156 /dev/sda1: Device is a partition, using primary device /dev/sda for mpath component detection
130 #ioctl/libdm-iface.c:1724 dm status (253:2) OF[16384](*1)
131 #filters/filter-mpath.c:196 /dev/sda1: Skipping mpath component device
</snippet from the log>
Othewise, we'd see the same PV label on sda1/sdb1 and mpath-test1
at the same time ending up with "Duplicate PV found...".
The dev_get_primary_dev fn now returns:
0 if the dev is already a primary dev
1 if the dev is a partition, primary dev is returned in "result" (output arg)
-1 on error
This way, we can better differentiate between the error state
and the state in which the dev supplied is not a partition
in the caller (this was same return value before).
Also, if we already have information about the device type,
we can check its major number against the list of known device
types (cmd->dev_types) directly, so we don't need to go through
the sysfs - we only check the major:minor pair which is a bit
more straightforward and faster. If the dev_types does not have
any info about this device type, the code just fallbacks to
the original sysfs interface to get the partition info.
Changes:
- move device type registration out of "type filter" (filter.c)
to a separate and new dev-type.[ch] for common use throughout the code
- the structure for keeping the major numbers detected for available
device types and available partitioning available is stored in
"dev_types" structure now
- move common partitioning detection code to dev-type.[ch] as well
together with other device-related functions bound to dev_types
(see dev-type.h for the interface)
The dev-type interface contains all common functions used to detect
subsystems/device types, signature/superblock recognition code,
type-specific device properties and other common device properties
(bound to dev_types), including partitioning support.
- add dev_types instance to cmd context as cmd->dev_types for common use
- use cmd->dev_types throughout as a central point for providing
information about device types
Fix the usecase when only PV list is specified.
With --poolmetadatasize PV list is used for metadata extents.
Without --poolmetadatasize PV list is used for 100% extension of LV.
Handle the case, when nothing could be resized (i.e. in dmeventd)
Add limit for buffer so if the test is running in some loop
and generating lots of output, the output log gets too large
for browser to display (at least Firefox is quite confused
to display 300MB logs).
For now limit max output of one test to 32MB.
If there is need to see full log set LVM_TEST_UNLIMITED to
disable interruption of test.
harness now also prints max memory used during test,
it user and system time and amount of read/write operations.
Support 'clasic' way of resizing of metadata LV.
Normally we disallow to work with internal 'invisible' devices.
But in this case we can make an exception and if user has some
special needs how to extend thin pool metadata LV - support it.
After resize of metadata LV, the pool will be suspended and resumed,
to be notified of this change.
Add support for lvresize of thin pool metadata device.
lvresize --poolmetadatasize +20 vgname/thinpool_lv
or
lvresize -L +20 vgname/thinpool_lv_tmeta
Where the second one allows all the args for resize (striping...)
and the first option resizes accoding to the last metadata lv segment.
Giving volume type information about being 'metadata' type of volume
has higher priority then i.e. 'mirror' or 'thin' flag - for those
type we have 'target attr' (7th. field).
The special suspend/resume code in lv_remove for LVM1 snapshots was interpsersed
with a vg_commit call. However, while with LVM1 metadata, vg_commit is
technically a no-op, the activation code relied on the ondisk and incore
metadata being the same, since on LVM1, a "commit" happens in vg_write
already. Since the "ondisk" metadata was previously not available with format1
(and incore was silently used instead, via lvmcache), the problem was masked.
This ties the two preceding changes together, actually using the "ondisk"
version of VG metadata instead of calling into lvmcache when activating
volumes. The cache hooks are still used as a fallback, because we don't have an
uncached scanning API yet.
Previously, we have relied on UUIDs alone, and on lvmcache to make getting a
"new copy" of VG metadata fast. If the code which triggers the activation has
the correct VG metadata at hand (the version which is currently on disk), it can
now hand it to the activation code directly.
This allows us to get the current on-disk version of the metadata whenever we
have the current in-flight version, without a recourse to scanning or lvmcache.
When the init scripts are run from within systemd, the systemd
needs to know the pidfile for it to work correctly when the
daemon itself is killed. Otherwise, systemd keeps these services
in "active" and "exited state" at the same time
(it assumes RemainAfterExit=yes without the pidfile reference in
chkconfig header).
See also https://bugzilla.redhat.com/show_bug.cgi?id=971819#c5.
For non udev path use DM_DEFAULT_NAME_MANGLING_MODE.
Skip this test when using real /dev dir, since udev is not able
to create such device name unless mangled...
Last commit made dump filter only partially composable.
Add remaining functionality and also support composable wipe,
which is needed, when i.e. vgscan needs to remove cache.
(in release fix)
This means that a test failure in one flavour no longer prevents all the
subsequent flavours from running. We also get an aggregate summary at
the end of the entire batch, instead of summaries interspersed with
progress output. Do not fail when flavour overrides are empty
Add a generic dump operation to filters and make the composite filter call
through to its components. Previously, when global filter was set, the code
would treat the toplevel composite filter's private area as if it belonged a
persistent filter, trying to write nonsense into a non-sensical file.
Also deal with NULL cmd->filter gracefully.
The global filter in system's lvm.conf may conflict with the custom filter we
set up in vgimportclone (they can easily fail to intersect). Since we explicitly
avoid talking to lvmetad in vgimportclone, it is safe and reasonable to do so.
Merge duplicate code that was validating lvcreate args
for creation of thin and snapshot.
Keep most of thin checks in _check_thin_parameters().
Update couple error messages.
This patch adds the ability to set the minimum and maximum I/O rate for
sync operations in RAID LVs. The options are available for 'lvcreate' and
'lvchange' and are as follows:
--minrecoveryrate <Rate> [bBsSkKmMgG]
--maxrecoveryrate <Rate> [bBsSkKmMgG]
The rate is specified in size/sec/device. If a suffix is not given,
kiB/sec/device is assumed. Setting the rate to 0 removes the preference.
This patch may not be fully correct. It tries to solve
the imbalanced suspend counter.
The problem starts when some LV is created and fails in resume path.
(i.e. resuming to large PV (enforced) over small loop devices)
This fails in _resume_node() after dm_task_run(). And while
existing device with empty table is left in inactive table,
further calls are reporting this device is in suspend state.
When later the lvm2 tries to rollback created device and deactivate it,
it will end with internal error, when we try to decrement
never incremented suspend counter.
As an 'easy fix' for now update suspend counter only for live nodes.
TODO: explore better fix.
There is no point in creation of 2chunks snapshot,
since the snapshot is invalidated immeditelly with the first write
as there is no free chunk for COW blocks
(2 chunks are used by the snap header and the 1st. metadata chunk).
Enhance error message about the lowest usable size.
Instead of seeing wierd overflows inside the lvm code,
giving false error messages, kill the user experiment in the begining.
Who needs to use more then 16EiB with lvm2 and 64bit anyway...
Avoid hitting memory corruption (double free) in code path,
where PV FID has been already destroyed and the released pointer
was left in PV structure and could have been tried to be released
from there 2nd. time with final context destruction.
Switch to use libdm dm_get_status_snapshot() function for
reading status info.
This fixes bug, where the code was using 32bit integers,
while the snapshot target is able to return 64bit sizes.
However this also means, someone is using >1TB snapshot
cow devices, which is actually very bad idea anyway, since the
perfomance and memory usage in this case is very bad.
Since we use get_status also in dmeventd, which may use one pool
for a single device, in case it would be repeatedly returning error,
it may not be freeing the pool and would cause slow but steady growth.
To stay safe in the error path release any allocated memory.
Check for mounted fs also for vgchange command, not just lvchange.
NOTE: Code is using lv_info() just like lvs_in_vg_opened().
It should be probably converted into lv_is_active_locally().
To detect mounted device, use also /proc/self/mountinfo
as so far the check was only able to detect ext4 mounted filesystem.
TODO:
Once proper testing for this feature is added, it may appear,
mountinfo check is enough and covers all cases and sysfs check
could be removed.
There are places where 'lv_is_active' was being used where it was
more correct to use 'lv_is_active_locally'. For example, when checking
for the existance of a kernel instance before asking for its status.
Most of the time these would work correctly. (RAID is only allowed on
non-clustered VGs at the moment, which means that 'lv_is_active' and
'lv_is_active_locally' would give the same result.) However, it is
more correct to use the proper variant and it helps with future
scenarios where targets might be allowed exclusively (or clustered) in
a cluster VG.
If calling _snap_target_present on 2nd and later call and for
a segment with MERGING flag set, we must return the status of
snapshot as well as snapshot-merge target presence, not just
the snapshot one.
Accept --yes on all commands, even ones that don't today have prompts,
so that test scripts that don't care about interactive prompts no
longer need to deal with them.
But continue to mention --yes only in the command prototypes that
actually use it.
This fixes a long standing regression since LVM2 2.02.74 (commit 4efb1d9c,
"Update heuristic used for default and detected data alignment.")
The default PE alignment could be used (via MAX()) even if it was
determined that the device's MD stripe width, or minimal_io_size or
optimal_io_size were not factors of the default PE alignment (either 64K
or the newer default of 1MB, etc). This bug would manifest if the
default PE alignment was larger than the overriding hint that the
device provided (e.g. default of 1MB vs optimal_io_size of 768K).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This is just a temporary fix to support allocation of -l%FREE.
The number of free extent serves to calculate estimated metadata
size. This value is then substracted twice to keep some
free space for recover.
If the dm_realloc would fail, the already allocate _maps_buffer
memory would have been lost (overwritten with NULL).
Fix this by using temporary line buffer.
Also add a minor cleanup to set end of buffer to '\0',
only when we really know the file size fits the preallocated buffer.
Setting the cmd->default_settings.udev_fallback also requires DM
driver version check. However, this caused useless mapper/control
access with ioctl if not needed actually. For example if we're not
using activation code, we don't need to know the udev_fallback as
there's no node and symlink processing.
For example, this premature mapper/control access caused problems
when using lvm2app even when no activation happens - there are
situations in which we don't need to use mapper/control, but still
need some of the lvm2app functionality. This is also the case for
lvm2-activation systemd generator which just needs to look at the
lvm2 configuration, but it shouldn't touch mapper/control.
For reporting stacked or joined devices properly in cluster,
we need to report their activation state according the lock,
which activated this device tree.
This is getting a bit complex - current code tries simple approach -
For snapshot - return status for origin.
For thin pool - return status of the first known active thin volume.
For the rest of them - try to use dependency list of LVs and skip
known execptions. This should be able to recursively deduce top level
device for given LV.
(in release fix)
Commit 756bcabbfe restricted the
situations at which the LVM autoactivation fires - only on ADD
event for devices other than DM. However, this caused a problem
for MD devices...
MD devices are activated in a very similar way as DM devices:
the MD dev is created on first appeareance of MD array member
(ADD event) and stays *inactive* until the array is complete.
Just then the MD dev turns to active state and this is reported
to userspace by CHANGE event.
Unfortunately, we can't differentiate between the CHANGE event
coming from udev trigger/WATCH rule and CHANGE event coming from
the transition to active state - MD would need to add similar logic
we already use to detect this in DM world. For now, we just have
to enable pvscan --cache on *all* CHANGE events for MD so the
autoactivation of the LVM volumes on top of MD works.
A downside of this is that a spurious CHANGE event for MD dev
can cause the LVM volumes on top of it to be automatically activated.
However, one should not open/change the device underneath until
the device above in the stack is removed! So this situation should
only happen if one opens the MD dev for read-write by mistake
(and hence firing the CHANGE event because of the WATCH udev rule),
or if calling udev trigger manually for the MD dev.
(No WHATS_NEW here as this fixes the commit mentioned
above and which has not been released yet.)
Add new lvs segment field 'Monitor' showing 3 states:
"monitored" - LV is monitored by dmeventd.
"not monitored" - LV is currently not being monitored by dmeventd
"" (empty) - LV does not support monitoring, or dmeventd support
is not compiled in.
Support for exclusive activation of snapshots revealed some problems.
When snapshot is created, COW LV is activated first (for clearing) and
then it's transformed into snapshot's COW LV, but it has left the lock
for such LV active in cluster and this lock could not have been removed
from dlm, unless snapshot has been removed within same dlm session.
If the user tried to remove snapshot after rebooting node, the lock was
missing, and COW LV could not have been detached.
Patch modifes the approach in this way:
Always deactivate COW LV for clustered vg after clearing (so it's
activated again via imlicit snapshot activation rule when snapshot is activated).
When snapshot is removed, activate COW LV as independend LV, so the lock
will exist for such LV, but only when the snapshot is active.
Also add test case for testing snapshot removal after cluster reboot.
Patch fixes hidden problem with lvm metadata caching.
When the pretest was made, only the commited data have been cached back
since the call lv_info_by_lvid() triggers mda read operation.
However call of lv_suspend_if_active() also reads precommited metadata.
The problem is visible in this sequence of calls:
vg_write(), suspend_lv(), vg_commit(), resume_lv()
which may end with leaving outdated mda in lvm cache, since vg_write()
drops cached metadata and vg_commit() only transforms precommited
to commited metadata, but in the case of pretesting we have
no precommited mda available so the cache will continue to use
old metadata. This happens, when suspend LV is inactive.
Merging multiple config files together needs to know newest (highest)
timestamp of merged files. Persistent cache file is being used
only in case, the config file is older then .cache file.
Add verbose message when we will not obtain devices from udev
(i.e. testing is using different udev dir, and the log was
giving misleading info about using udev)
Add proper error message if zalloc from pull would have failed.
Fix typo obolete -> obsolete
Assign fid as the last step before returning VG.
Make the format reader for 'lvm1' and 'pool' equal to 'lvm2' format reader.
It has caused memory corruption to lvmetad as it later calls
destroy_instance() to allocated fid. This patch should fix problems
with crashing test lvmetad-lvm1.sh.
Sharing char* with field has a problem in error path,
when we allocate event, but fail to allocate timeout string.
Instead of creating complicated error paths to resolve
it individually stop using unions, and let the resource
to be released in a simple _free_message().
Commit 756bcabbfe fixed autoactivation
to not trigger on each uevent for a PV that appeared in the system
most notably the events that are triggered artificially (udevadm
trigger or as the result of the WATCH udev rule being applied that
consequently generates CHANGE uevents). This fixed a situation in
which VGs/LVs were activated when they should not.
BUT we still need to care about the coldplug used at boot to
retrigger the ADD events - the "udevadm trigger --action=add"!
For non-DM-based PVs, this is already covered as for these we
run the autoactivation on ADD event only.
However, for DM-based PVs, we still need to run the
autoactivation even for the artificial ADD event, reusing
the udev DB content from previous proper CHANGE event that
came with the DM device activation.
Simply, this patch fixes a situation in which we run extra
"udevadm trigger --action=add" (or echo add > /sys/block/<dev>/uevent)
for DM-based PVs (cryptsetup devices, multipath devices, any
other DM devices...).
Without this patch, while using lvmetad + autoactivation,
any VG/LV that has a DM-based PV and for which we do not
call the activation directly, the VG/LV is not activated.
For example a VG with an LV with root FS on it which is directly
activated in initrd and then missing activation of the rest
of the LVs in the VG because of unhandled uevent retrigger on
boot after switching to root FS (the "coldplug").
(No WHATS_NEW here as this fixes the commit mentioned
above and which was not released yet.)
There is no need to strdup a key when inserting into
the hash table as the table allocates memory and copies
the string. This was causing memory to be lost.
'lvchange' is used to alter a RAID 1 logical volume's write-mostly and
write-behind characteristics. The '--writemostly' parameter takes a
PV as an argument with an optional trailing character to specify whether
to set ('y'), unset ('n'), or toggle ('t') the value. If no trailing
character is given, it will set the flag.
Synopsis:
lvchange [--writemostly <PV>:{t|y|n}] [--writebehind <count>] vg/lv
Example:
lvchange --writemostly /dev/sdb1:y --writebehind 512 vg/raid1_lv
The last character in the 'lv_attr' field is used to show whether a device
has the WriteMostly flag set. It is signified with a 'w'. If the device
has failed, the 'p'artial flag has priority.
Example ("nosync" raid1 with mismatch_cnt and writemostly):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg Rwi---r-m 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-w 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-- 1 linear 4.00m
Example (raid1 with mismatch_cnt, writemostly - but failed drive):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg rwi---r-p 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-p 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-p 1 linear 4.00m
A new reportable field has been added for writebehind as well. If
write-behind has not been set or the LV is not RAID1, the field will
be blank.
Example (writebehind is set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r-- 512
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
Example (writebehind is not set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r--
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
This reverts commit 0396ade38b.
The original code also handled len==1, which the new code doesn't.
Press <TAB> in the lvm shell to get a list of the possible
flag completions for a single hyphen.
Commit 9fd7ac7d03 introduced a way a
method of avoiding reading from mirrors with a device failure. If
a device was found to be dead, the mapping table was checked for
'handle_errors' or 'block_on_error'. These strings were checked for
in the table string via 'strstr', which could also match on strings
like, 'no_handle_errors' or 'no_block_on_error'. No such strings
exist, but we don't want to have problems in the future if they do.
So, we check for ' <string>{'\0'|' '}'.
Move common code for changing activation state from
vgchange and lvchange to one function.
Fix the order of checks - so we always implicitelly
activate snapshots and thin volumes in exclusive mode,
and we do not allow local deactivation for them.
Revert commit 31c24dd9f2. This commit
was used to force a RAID device-mapper table to be loaded into the
kernel despite the fact that it was identical to the one already
loaded. The effect allowed a RAID array with a transiently failed
device to refresh and reintegrate the failed device. This operation
is better done in the kernel on a 'resume'. Since,
'lvchange --refresh' already performs a suspend/resume cycle, the
above commit is not needed once the kernel change is made. Reverting
the commit removes an unnecessary (at least for now) change to the
device-mapper interface.
New options to 'lvchange' allow users to scrub their RAID LVs.
Synopsis:
lvchange --syncaction {check|repair} vg/raid_lv
RAID scrubbing is the process of reading all the data and parity blocks in
an array and checking to see whether they are coherent. 'lvchange' can
now initaite the two scrubbing operations: "check" and "repair". "check"
will go over the array and recored the number of discrepancies but not
repair them. "repair" will correct the discrepancies as it finds them.
'lvchange --syncaction repair vg/raid_lv' is not to be confused with
'lvconvert --repair vg/raid_lv'. The former initiates a background
synchronization operation on the array, while the latter is designed to
repair/replace failed devices in a mirror or RAID logical volume.
Additional reporting has been added for 'lvs' to support the new
operations. Two new printable fields (which are not printed by
default) have been added: "syncaction" and "mismatches". These
can be accessed using the '-o' option to 'lvs', like:
lvs -o +syncaction,mismatches vg/lv
"syncaction" will print the current synchronization operation that the
RAID volume is performing. It can be one of the following:
- idle: All sync operations complete (doing nothing)
- resync: Initializing an array or recovering after a machine failure
- recover: Replacing a device in the array
- check: Looking for array inconsistencies
- repair: Looking for and repairing inconsistencies
The "mismatches" field with print the number of descrepancies found during
a check or repair operation.
The 'Cpy%Sync' field already available to 'lvs' will print the progress
of any of the above syncactions, including check and repair.
Finally, the lv_attr field has changed to accomadate the scrubbing operations
as well. The role of the 'p'artial character in the lv_attr report field
as expanded. "Partial" is really an indicator for the health of a
logical volume and it makes sense to extend this include other health
indicators as well, specifically:
'm'ismatches: Indicates that there are discrepancies in a RAID
LV. This character is shown after a scrubbing
operation has detected that portions of the RAID
are not coherent.
'r'efresh : Indicates that a device in a RAID array has suffered
a failure and the kernel regards it as failed -
even though LVM can read the device label and
considers the device to be ok. The LV should be
'r'efreshed to notify the kernel that the device is
now available, or the device should be 'r'eplaced
if it is suspected of failing.
Attempting to up-convert an inactive mirror when there is insufficient
space leads to the following message:
Unable to allocate extents for mirror(s).
ABORTING: Failed to remove temporary mirror layer inactive_mimagetmp_3.
Manual cleanup with vgcfgrestore and dmsetup may be required.
This is caused by a failure to execute the 'deactivate_lv' function in
the error condition. The deactivate returns an error because the LV is
already inactive. This patch checks if the LV is activate and calls
deactivate_lv only if it is. This allows the error cleanup code to work
properly in this condition.
It wasn't that big of a deal anyway, since there was no previous vg_commit
that needed to be reverted. IOW, no harm was done if the allocation failed.
The message was scary and useless.
I've updated the dm_status_raid structure and dm_get_status_raid()
function to make it handle the new kernel status fields that will
be coming in dm-raid v1.5.0. It is backwards compatible with the
old status line - initializing the new fields to '0'. The new
structure is also more amenable to future changes. It includes a
'reserved' field that is currently initialized to zero but could
be used to hold flags describing new features. It also now uses
pointers for the character strings instead of attempting to allocate
their space along with the structure (causing the size of the
structure to be variable). This allows future fields to be appended.
The new fields that are available are:
- sync_action : shows what the sync thread in the kernel is doing
(idle, frozen, resync, recover, check, repair, or
reshape)
- mismatch_count: shows the number of discrepancies which were
found or repaired by a "check" or "repair"
process, respectively.
...to not pollute the common and format-independent code in the
abstraction layer above.
The format1 pv_write has common code for writing metadata and
PV header by calling the "write_disks" fn and when rewriting
the header itself only (e.g. just for the purpose of changing
the PV UUID) during the pvchange operation, we had to tweak
this functionality for the format1 case and we had to assign
the PV the orphan state temporarily.
This patch removes the need for this format1 tweak and it calls
the write_disks with appropriate flag indicating whether this is
a PV write call or a VG write call, allowing for metatada update
for the latter one.
Also, a side effect of the former tweak was that it effectively
invalidated the cache (even for the non-format1 PVs) as we
assigned it the orphan state temporarily just for the format1
PV write to pass.
Also, that tweak made it difficult to directly detect whether
a PV was part of a VG or not because the state was incorrect.
Also, it's not necessary to backup and restore some PV fields
when doing a PV write:
orig_pe_size = pv_pe_size(pv);
orig_pe_start = pv_pe_start(pv);
orig_pe_count = pv_pe_count(pv);
...
pv_write(pv)
...
pv->pe_size = orig_pe_size;
pv->pe_start = orig_pe_start;
pv->pe_count = orig_pe_count;
...this is already done by the layer below itself (the _format1_pv_write fn).
So let's have this cleaned up so we don't need to be bothered
about any 'format1 special case for pv_write' anymore.
The pv_by_path might be also dangerous to use as it does not
count with any other metadata areas but the ones found on the PV
itself. If metadata was not found on the PV referenced by the path,
it returned no PV though it might have been referenced by metadata
elsewhere (on other PVs...).
If extending a VG and including a PV with 0 MDAs that was already
a part of a VG, the vgextend allowed that PV to be added and we
ended up *with one PV in two VGs*!
The vgextend code used the 'pv_by_path' fn that returned a PV for
a given path. However, when the PV did not have any metadata areas,
the fn just returned a PV without any reference to existing VG.
Consequently, any checks for the existing VG failed.
[0] raw/~ # pvcreate --metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[0] raw/~ # pvcreate --metadatacopies 1 /dev/sdb
Physical volume "/dev/sdb" successfully created
[0] raw/~ # vgcreate vg1 /dev/sda /dev/sdb
Volume group "vg1" successfully created
[0] raw/~ # pvcreate --metadatacopies 1 /dev/sdc
Physical volume "/dev/sdc" successfully created
[0] raw/~ # vgcreate vg2 /dev/sdc
Volume group "vg2" successfully created
Before this patch (incorrect):
[0] raw/~ # vgextend vg2 /dev/sda
Volume group "vg2" successfully extended
With this patch (correct):
[0] raw/~ # vgextend vg2 /dev/sda
Physical volume '/dev/sda' is already in volume group 'vg1'
Unable to add physical volume '/dev/sda' to volume group 'vg2'.
Before, the find_pv_by_name call always failed if the PV found was orphan.
However, we might use this function even for a PV that is not part of any VG.
This patch adds 'allow_orphan' arg to find_pv_by_name fn that allows that.
_find_pv_by_name -> find_pv_by_name
_find_pv_in_vg -> find_pv_in_vg
_find_pv_in_vg_by_uuid -> find_pv_in_vg_by_uuid
The only callers of the underscored variants were their wrappers
without the underscore. No other part of the code referenced the
underscored variants.
Usage of layer was not the best plan here - for proper devices stack
we have to keep correct reference in volume_group structure and
make the new thin pool LV appear as a new volume.
Keep the flag whether given thin pool argument has been given on command
line or it's been 'estimated'
Call of update_pool_params() must not change cmdline given args and
needs to know this info.
Since there is a need to move this update function into /lib, we cannot
use arg_count().
FIXME: we need some generic mechanism here.
This was a regression introduced with e33fd978a8
(libdm v1.02.68/lvm2 v2.02.89) with the introduction of new output
fields blkdevname and blkdevs_used for ls and deps dmsetup commands.
A new common '_process_options' fn was added with that commit, but the
fn was called prematurely which then broke processing of
'dmsetup splitname -o' which should implicitly use '-c' option
and this was failing after the commit:
alatyr/~ $ dmsetup splitname -o lv_name /dev/mapper/vg_data-test
Option not recognised: lv_name
Couldn't process command line.
The '-c' had to be used for correct operation:
alatyr/~ $ dmsetup splitname -c -o lv_name /dev/mapper/vg_data-test
LV
test
Now fixed to work as it did before:
alatyr/~ $ dmsetup splitname -o lv_name /dev/mapper/vg_data-test
LV
test
'in_sync' was using the last field in the RAID status output as
the location for the sync ratio field. The sync ratio may not always
be the last field, but it will always be the 7th field. So we switch
to using the absolute value rather than computing the last field
index number.
Previous commit included changes to WHATSNEW, but the code changes
were missing. Here is the description from the previous commit:
commit bbc6378b73
Author: Jonathan Brassow <jbrassow@redhat.com>
Date: Thu Feb 21 11:31:36 2013 -0600
RAID: Make 'lvchange --refresh' restore transiently failed RAID PVs
A new function (dm_tree_node_force_identical_table_reload) was added to
avoid the suppression of identical table reloads. This allows RAID LVs
to reload the on-disk superblock information that contains which devices
have failed and the bitmaps. If the failed device has returned, this has
the effect of restoring the device and initiating recovery. Without this
patch, the user had to completely deactivate their RAID LV and re-activate
it in order to restore the failed device. Now they simply need to
suspend and resume (which is done by 'lvchange --refresh').
The identical table suppression is only avoided if the LV is not PARTAIL
(i.e. all of it's devices can be seen and read by LVM) and the kernel
status of the array contains failed devices. In other words, the function
will only be called in the case where we may have success in restoring
a failed device in the array.
lvm dumpconfig [--ignoreadvanced] [--ignoreunsupported]
--ignoreadvanced causes the advanced configuration options to be left
out on dumpconfig output
--ignoreunsupported causes the options that are not officially supported
to be lef out on dumpconfig output
This shows up in the output as a short commentary:
$ lvm dumpconfig --type default --withcomments metadata/disk_areas
# Configuration option metadata/disk_areas.
# This configuration option is advanced.
# This configuration option is not officially supported.
disk_areas=""
lvm dumpconfig [--withcomments] [--withversions]
The --withcomments causes the comments to appear on output before each
config node (if they were defined in config_settings.h).
The --withversions causes a one line extra comment to appear on output
before each config node with the version information in which the
configuration setting first appeared.
There's a possibility to interconnect the dm_config_node with an
ID, which in our case is used to reference the configuration
definition ID from config_settings.h. So simply interconnecting
struct dm_config_node with struct cfg_def_item.
This patch also adds support for enhanced config node output besides
existing "output line by line". This patch adds a possibility to
register a callback that gets called *before* the config node is
processed line by line (for example to include any headers on output)
and *after* the config node is processed line by line (to include any
footers on output). Also, it adds the config node reference itself
as the callback arg in addition to have a possibility to extract more
information from the config node itself if needed when processing the
output callback (e.g. the key name, the id, or whether this is a
section or a value etc...).
If the config node from lvm.conf/--config tree is recognized and valid,
it's always coupled with the config node definition ID from
config_settings.h:
struct dm_config_node {
int id;
const char *key;
struct dm_config_node *parent, *sib, *child;
struct dm_config_value *v;
}
For example if the dm_config_node *cn holds "devices/dev" configuration,
then the cn->id holds "devices_dev_CFG" ID from config_settings.h, -1 if
not found in config_settings.h and 0 if matching has not yet been done.
To support the enhanced config node output, a new structure has been
defined in libdevmapper to register it:
struct dm_config_node_out_spec {
dm_config_node_out_fn prefix_fn; /* called before processing config node lines */
dm_config_node_out_fn line_fn; /* called for each config node line */
dm_config_node_out_fn suffix_fn; /* called after processing config node lines */
};
Where dm_config_node_out_fn is:
typedef int (*dm_config_node_out_fn)(const struct dm_config_node *cn, const char *line, void *baton);
(so in comparison to existing callbacks for config node output, it has
an extra dm_config_node *cn arg in addition)
This patch also adds these functions to libdevmapper:
- dm_config_write_node_out
- dm_config_write_one_node_out
...which have exactly the same functionality as their counterparts
without the "out" suffix. The "*_out" functions adds the extra hooks
for enhanced config output (prefix_fn and suffix_fn mentioned above).
One can still use the old interface for config node output, this is
just an enhancement for those who'd like to modify the output more
extensively.
lvm dumpconfig [--type {current|default|missing|new}] [--atversion] [--validate]
This patch adds above-mentioned args to lvm dumpconfig and it maps them
to creation and writing out a configuration tree of a specific type
(see also previous commit):
- current maps to CFG_TYPE_CURRENT
- default maps to CFG_TYPE_DEFAULT
- missing maps to CFG_TYPE_MISSING
- new maps to CFG_TYPE_NEW
If --type is not defined, dumpconfig defaults to "--type current"
which is the original behaviour of dumpconfig before all these changes.
The --validate option just validates current configuration tree
(lvm.conf/--config) and it writes a simple status message:
"LVM configuration valid" or "LVM configuration invalid"
Configuration checking is initiated during config load/processing
(_process_config fn) which is part of the command context
creation/refresh.
This patch also defines 5 types of trees that could be created from
the configuration definition (config_settings.h), the cfg_def_tree_t:
- CFG_DEF_TREE_CURRENT that denotes a tree of all the configuration
nodes that are explicitly defined in lvm.conf/--config
- CFG_DEF_TREE_MISSING that denotes a tree of all missing
configuration nodes for which default valus are used since they're
not explicitly used in lvm.conf/--config
- CFG_DEF_TREE_DEFAULT that denotes a tree of all possible
configuration nodes with default values assigned, no matter what
the actual lvm.conf/--config is
- CFG_DEF_TREE_NEW that denotes a tree of all new configuration nodes
that appeared in given version
- CFG_DEF_TREE_COMPLETE that denotes a tree of the whole configuration
tree that is used in LVM2 (a combination of CFG_DEF_TREE_CURRENT +
CFG_DEF_TREE_MISSING). This is not implemented yet, it will be added
later...
The function that creates the definition tree of given type:
struct dm_config_tree *config_def_create_tree(struct config_def_tree_spec *spec);
Where the "spec" specifies the tree type to be created:
struct config_def_tree_spec {
cfg_def_tree_t type; /* tree type */
uint16_t version; /* tree at this LVM2 version */
int ignoreadvanced; /* do not include advanced configs */
int ignoreunsupported; /* do not include unsupported configs */
};
This tree can be passed to already existing functions that write
the tree on output (like we already do with cmd->cft).
There is a new lvm.conf section called "config" with two new options:
- config/checks which enables/disables checking (enabled by default)
- config/abort_on_errors which enables/disables aborts on any type of
mismatch found in the config (disabled by default)
Add support for configuration checking - type checking and recognition
of registered configuration settings that LVM2 understands and also
check the structure of the configuration. Log error on any mismatch
found.
A hash over all allowed configuration paths is created which helps
with matching the exact configuration (lvm.conf/--config tree) with
the configuration item definition from config_settings.h in an
efficient and one-step way.
Two more helper flags are introduced for each configuration definition
item:
- CFG_USED which marks the item as being used (lvm.conf/--config)
This helps with identifying missing configuration options
(and for which defaults were used) when traversing the tree later.
- CFG_VALID which denotes that the item has already been checked and
it was found valid. This improves performance, so if the check
is called once again on the same tree which was not reloaded, we
can just return the state from previous check (with a possibility
to force the check if needed).
The new function that config.h exports and which is going to be used
to perform the configuration checking is:
int config_def_check(struct cmd_context *cmd, int force, int skip, int suppress_messages)
...which is exported internally via config.h.
Export this functionality from libdevmapper just for
convenience and general use when reading boolean values
which could be defined either in a numeric way with 0/1
or by using strings with "true"/"false", "yes"/"no",
"on"/"off", "y"/"n".
For example, the old call and reference:
find_config_tree_str(cmd, "devices/dir", DEFAULT_DEV_DIR)
...now becomes:
find_config_tree_str(cmd, devices_dir_CFG)
So we're referring to the named configuration ID instead
of passing the configuration path and the default value
is taken from central config definition in config_settings.h
automatically.
This patch adds basic structures that encapsulate the config_settings.h
content - it takes each item and puts it in structures:
- cfg_def_type_t to define config item type
- cfg_def_value_t to define config item (default) value
- flags used to define the nature and use of the config item:
- CFG_NAME_VARIABLE for items with variable names (e.g. tags)
- CFG_ALLOW_EMPTY for items where empty value is allowed
- CFG_ADVANCED for items which are considered as "advanced settings"
- CFG_UNSUPPORTED for items which are not officially supported
(config options mostly for internal use and testing/debugging)
- cfg_def_item_t to encapsulate the whole definition of the config
definition itself
Each config item is referenced by named ID, e.g. "devices_dir_CFG"
instead of directly typing the path "devices/dir" as it was before.
This patch also adds cfg_def_get_path helper function to get the
config setting path up to the root for given config ID
(it returns the path in form of "abc/def/.../xyz" where the "abc"
is the topmost element).
This file centrally defines all recognized LVM2 configuration
sections and settings. Each item here has its parent, set of
allowed types, default value, brief comment, version the setting
first appeared in and flags that further define the nature of
the configuration setting and its use.
When a section was empty in a configuration tree (no children - this is
allowed) and we were looking for a config node inside that section, the
_find_config_node function incorrectly returned the section itself if
the node inside that section was not found.
For example the configuration below:
The config:
abc {
}
And a function call to get the "def" node inside "abc" section:
_find_config_node(..., "abc/def")
...returned the "abc" node instead of NULL ("def" not found).
This in turn caused segfaults in the code using lookups in such
a configuration tree as we (correctly) expected that the node
returned was always the one we were looking for or NULL if not
found. But if incorrect node was returned instead, we processed
that as if this was the node we were looking for and so we
processed its value as well. But sections don't have values => segfault.
Just to prevent accidental and improper use when reading the layout
from disk because of the already existing disk_areas_xl[0] lists
that are variable in size. We can read pv_header_extension only
after we know exactly where the lists end...
There are new reporting fields for Embedding Area: ea_start and ea_size.
An example of 1m Embedding Area and relevant reporting fields:
raw/~ # pvs -o pv_name,pe_start,ea_start,ea_size
PV 1st PE EA start EA size
/dev/sda 2.00m 1.00m 1.00m
To create an Embedding Area during PV creation (pvcreate or as part of
the vgconvert operation), we need to define the Embedding Area size.
The Embedding Area start will be calculated automatically by the tools.
This patch adds --embeddingareasize argument to pvcreate and vgconvert.
The PV header extension information (PV header extension version, flags
and list of Embedding Area locations) is stored just beyond the PV header base.
When calculating the Embedding Area start value (ea_start), the same logic is
used as when calculating the pe_start value for Data Area - the value must
follow exactly the same alignment restrictions for its start value
(the alignment detected automatically or provided via command line using
the --dataalignment and --dataalignmentoffset arguments).
The Embedding Area is placed at the very start of the PV, starting at
ea_start. The Data Area starting at pe_start is placed next. The pe_start is
still properly aligned. Due to the pe_start alignment, it's possible that the
resulting Embedding Area size (ea_size) ends up bigger in size than requested
(but never less than requested).
New tools with PV header extension support will read the extension
if it exists and it's not an error if it does not exist (so old PVs
will still work seamlessly with new tools).
Old tools without PV header extension support will just ignore any
extension.
As for the Embedding Area location information (its start and size),
there are actually two places where this is stored:
- PV header extension
- VG metadata
The VG metadata contains a copy of what's written in the PV header
extension about the Embedding Area location (NULL value is not copied):
physical_volumes {
pv0 {
id = "AkSSRf-difg-fCCZ-NjAN-qP49-1zzg-S0Fd4T"
device = "/dev/sda" # Hint only
status = ["ALLOCATABLE"]
flags = []
dev_size = 262144 # 128 Megabytes
pe_start = 67584
pe_count = 23 # 92 Megabytes
ea_start = 2048
ea_size = 65536 # 32 Megabytes
}
}
The new metadata fields are "ea_start" and "ea_size".
This is mostly useful when restoring the PV by using existing
metadata backups (e.g. pvcreate --restorefile ...).
New tools does not require these two fields to exist in VG metadata,
they're not compulsory. Therefore, reading old VG metadata which doesn't
contain any Embedding Area information will not end up with any kind
of error but only a debug message that the ea_start and ea_size values
were not found.
Old tools just ignore these extra fields in VG metadata.
PV header extension comes just beyond the existing PV header base:
PV header base (existing):
- uuid
- device size
- null-terminated list of Data Areas
- null-terminater list of MetaData Areas
PV header extension:
- extension version
- flags
- null-terminated list of Embedding Areas
This patch also adds "eas" (Embedding Areas) list to lvmcache (lvmcache_info)
and it also adds support for common operations on the list (just like for
already existing "das" - Data Areas list):
- lvmcache_add_ea
- lvmcache_update_eas
- lvmcache_foreach_ea
- lvmcache_del_eas
Also, add ea_start and ea_size to struct physical_volume for processing
PV Embedding Area location throughout the code (currently only one
Embedding Area is supported, though the definition on disk allows for
more if needed in the future...).
Also, define FMT_EAS format flag to mark that the format actually
supports Embedding Areas (currently format-text only).
Extract restorable PV creation parameters from struct pvcreate_params into
a separate struct pvcreate_restorable_params for clarity and also for better
maintainability when adding any new items later.
Add basic support for converting LV into an external origin volume.
Syntax:
lvconvert --thinpool vg/pool --originname renamed_origin -T origin
It will convert volume 'origin' into a thin volume, which will
use 'renamed_origin' as an external read-only origin.
All read/write into origin will go via 'pool'.
renamed_origin volume is read-only volume, that could be activated
only in read-only mode, and cannot be modified.
Use the field 'origin' for reporting external origin lv name.
For thin volumes with external origin, report the size of
external origin size via:
lvs -o+origin_size
Do not allow conversion of external origin into writeable LV,
and prohibit changing the external origin size.
If the snapshot origin is also external origin, merge is prohibited.
Reorder activation code to look similar for preload tree and
activation tree.
Its also give much better suppport for device stacking,
since now we also support activation of snapshot which might
be then used for other devices.
A new function (dm_tree_node_force_identical_table_reload) was added to
avoid the suppression of identical table reloads. This allows RAID LVs
to reload the on-disk superblock information that contains which devices
have failed and the bitmaps. If the failed device has returned, this has
the effect of restoring the device and initiating recovery. Without this
patch, the user had to completely deactivate their RAID LV and re-activate
it in order to restore the failed device. Now they simply need to
suspend and resume (which is done by 'lvchange --refresh').
The identical table suppression is only avoided if the LV is not PARTAIL
(i.e. all of it's devices can be seen and read by LVM) and the kernel
status of the array contains failed devices. In other words, the function
will only be called in the case where we may have success in restoring
a failed device in the array.
When there are missing PVs in a volume group, most operations that alter
the LVM metadata are disallowed. It turns out that 'vgimport' is one of
those disallowed operations. This is bad because it creates a circular
dependency. 'vgimport' will complain that the VG is inconsistent and that
'vgreduce --removemissing' must be run. However, 'vgreduce' cannot be run
because it has not been imported. Therefore, 'vgimport' must be one of
the operations allowed to change the metadata when PVs are missing. The
'--force' option is the way to make 'vgimport' happen in spite of the
missing PVs.
If zero metadata copies are used, there's no further recalculation of
PV alignment that happens when adding metadata areas to the PV and
which actually calculates the alignment correctly as a matter of fact.
So fix this for "PV without MDA" case as well.
Before this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 8.00m
After this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
Also, remove a superfluous condition "pv->pe_start < pv->pe_align" in:
if (pe_start == PV_PE_START_CALC && pv->pe_start < pv->pe_align)
pv->pe_start = pv->pe_align ...
This part of the condition is not reachable as with the PV_PE_START_CALC,
we always have pv->pe_start set to 0 from the PV struct initialisation
(...the pv->pe_start value is just being calculated).
If '--mirrors/-m' and '--stripes/-i' are used together when creating
a logical volume, mirrors-over-stripes is currently chosen. The user
can override this by using the '--type raid10' option on creation.
However, we want a place where we can set the default behavior to
'raid10' explicitly - similar to the "mirror" and "raid1" tunable,
mirror_segtype_default.
A follow-on patch should use this new setting to change the default
from "mirror" to "raid10", as this is the preferred segment type.
When a device fails, we may wish to replace those segments with an
error segment. (Like when a 'vgreduce --removemissing' removes a
failed device that happens to be a RAID image/meta.) We are then left
with images that we will eventually want to remove or replace.
This patch allows us to pull out these virtual "error" sub-LVs. This
allows a user to 'lvconvert -m -1 vg/lv' to extract the bad sub-LVs.
Sub-LVs with error segments are considered for extraction before other
possible devices so that good devices are not accidentally removed.
This patch also adds the ability to replace RAID images that contain error
segments. The user will still be unable to run 'lvconvert --replace'
because there is no way to address the 'error' segment (i.e. no PV
that it is associated with). However, 'lvconvert --repair' can be
used to replace the image's error segment with a new PV. This is also
the most appropriate way to do it, since the LV will continue to be
reported as 'partial'.
Currently it is impossible to remove a failed PV which has a RAID LV
on it. This patch fixes the issue by replacing the failed PV with an
'error' segment within the affected sub-LVs. Once there is no longer
a RAID LV using the PV, it can be removed.
Most often, it is better to replace a failed RAID device with a spare.
(You can use 'lvconvert --repair <vg>/<LV>' to accomplish that.)
However, if there are no spares in the volume group and none will be
added, it is useful to be able to removed the failed device.
Following patches address the ability to perform 'lvconvert' operations
on RAID LVs that contain sub-LVs composed of 'error' segments.
We have been using 'mirror_region_size' in lvm.conf as the default region
size for RAID logical volumes as well as mirror logical volumes. Since,
"raid" is more inclusive and representative than "mirror", I have changed
the name of this setting. We must still check for the old setting and warn
the user if we are overriding it with the new setting if both happen to be
present.
Instead of check for lv_is_active() for thin pool LV,
query the whole pool via new pool_is_active().
Fixes a problem when we cannot change discards settings
for active pool device where the actual layer for pool
device was inactive, but thin volumes using thin pool
have been active.
This internal function check for active pool device.
For cluster it checks every thin volume,
On the non-clustered VG we need to check just
for presence of -tpool device.
Update the error path after problems with suspend_lv or vg_commit.
It's not exactly well defined what should happen, and this
code seems to appear in many different instancies<F2> in the
whole source code tree - we should probably pick the best version.
On glibc, those are erroneously (namespace pollution) pulled in via
other headers. this doesn't work with conformant libcs (musl libc in
this case), we simply need to include all needed headers.
Signed-Off-By: John Spencer <maillist-lvm@barfooze.de>
This patch fixes problem reported here:
https://www.redhat.com/archives/dm-devel/2013-January/msg00311.html
Fixing it by separating function for duplicating string token.
---
When /etc/lvm/lvm.conf is truncated at the first '"' of a line, all LVM
utilities crash with a segfault.
The segfault only seems to occur if the last character is the first '"'
(double quote) of a line. If you truncate it at any other point, lvm
detects the error and report parse error
lvm.conf ends like this.
$hexdump -C lvm.conf
....
69 72 20 3d 20 22 2f 64 65 76 22 0a 0a 0a 20 20 |ir = "/dev"... |
20 20 23 20 41 6e 20 61 72 72 61 79 20 6f 66 20 | # An array of |
64 69 72 65 63 74 6f 72 69 65 73 20 74 68 61 74 |directories that|
20 63 6f 6e 74 61 69 6e 20 74 68 65 20 64 65 76 | contain the dev|
69 63 65 20 6e 6f 64 65 73 20 79 6f 75 20 77 69 |ice nodes you wi|
73 68 0a 20 20 20 20 23 20 74 6f 20 75 73 65 20 |sh. # to use |
77 69 74 68 20 4c 56 4d 32 2e 0a 20 20 20 20 73 |with LVM2.. s|
63 61 6e 20 3d 20 5b 20 22 2f 78 22 2c 0a 20 20 |can = [ "/x",. |
20 20 20 20 20 20 20 20 20 20 20 22 | "|
...
Reported-by: dongmao zhang <dmzhang suse com>
There are currently a few issues with the reporting done on RAID LVs and
sub-LVs. The most concerning is that 'lvs' does not always report the
correct failure status of individual RAID sub-LVs (devices). This can
occur when a device fails and is restored after the failure has been
detected by the kernel. In this case, 'lvs' would report all devices are
fine because it can read the labels on each device just fine.
Example:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
However, 'dmsetup status' on the device tells us a different story:
[root@bp-01 lvm2]# dmsetup status vg-lv
0 1024000 raid raid1 2 DA 1024000/1024000
In this case, we must also be sure to check the RAID LVs kernel status
in order to get the proper information. Here is an example of the correct
output that is displayed after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-p 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-p /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-p /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
The other case where 'lvs' gives incomplete or improper output is when a
device is replaced or added to a RAID LV. It should display that the RAID
LV is in the process of sync'ing and that the new device is the only one
that is not-in-sync - as indicated by a leading 'I' in the Attr column.
(Remember that 'i' indicates an (i)mage that is in-sync and 'I' indicates
an (I)mage that is not in sync.) Here's an example of the old incorrect
behaviour:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg Iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg Iwi-aor-- /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0) ** Note that all the images currently are marked as 'I' even though it is
only the last device that has been added that should be marked.
Here is an example of the correct output after this patch is applied:
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[root@bp-01 lvm2]# lvconvert -m +1 vg/lv; lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg rwi-a-r-- 0.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg iwi-aor-- /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-- /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0)
** Note only the last image is marked with an 'I'. This is correct and we can
tell that it isn't the whole array that is sync'ing, but just the new
device.
It also works under snapshots...
[root@bp-01 lvm2]# lvs -a -o name,vg_name,attr,copy_percent,devices vg
LV VG Attr Cpy%Sync Devices
lv vg owi-a-r-p 33.47 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
[lv_rimage_0] vg iwi-aor-- /dev/sda1(1)
[lv_rimage_1] vg Iwi-aor-p /dev/sdb1(1)
[lv_rimage_2] vg Iwi-aor-- /dev/sdc1(1)
[lv_rmeta_0] vg ewi-aor-- /dev/sda1(0)
[lv_rmeta_1] vg ewi-aor-p /dev/sdb1(0)
[lv_rmeta_2] vg ewi-aor-- /dev/sdc1(0)
snap vg swi-a-s-- /dev/sda1(51201)
We can avoid many dev_manager (ioctl) calls by caching the results of
previous calls to lv_raid_dev_health. Just considering the case where
'lvs -a' is called to get the attributes of a RAID LV and its sub-lvs,
this function would be called many times. (It would be called at least
7 times for a 3-way RAID1 - once for the health of each sub-LV and once
for the health of the top-level LV.) This is a good idea because the
sub-LVs are processed in groups along with their parent RAID LV and in
each case, it is the parent LV whose status will be queried. Therefore,
there only needs to be one trip through dev_manager for each time the
group is processed.
Similar to the way thin* accesses its kernel status, we add a method
for RAID to grab the various values in its status output without the
higher levels (LVM) having to understand how to parse the output.
Added functions include:
- lib/activate/dev_manager.c:dev_manager_raid_status()
Pulls the status line from the kernel
- libdm/libdm-deptree.c:dm_get_status_raid()
Parses status line and puts components into dm_status_raid struct
- lib/activate/activate.c:lv_raid_dev_health()
Accesses dm_status_raid to deliver raid dev_health string
The new structure and functions can provide a more unified way to access
status information. ('lv_raid_percent' could switch to using these
functions, for example.)
If there was a nested mountpoint inside an existing mount path,
blkdeactivate could fail to unmount such a mountpoint as it
needs to deactivate the deepest path first and continue upwards.
For example the simplest reproducer:
[root@rhel6-a ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 4G 0 disk
|-vg-lvol0 (dm-2) 253:2 0 32M 0 lvm /mnt/a
`-vg-lvol1 (dm-3) 253:3 0 32M 0 lvm /mnt/a/b
Before this patch:
[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/a
umount: /mnt/a: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
UMOUNT: unmounting vg-lvol1 (dm-3) mounted on /mnt/a/b
LVM: deactivating Logical Volume vg/lvol1
(deactivation of vg/lvol0 is skipped as /mnt/a that is on lvol0
can't be unmounted - it still has /mnt/a/b as nested mountpoint!)
With this patch applied:
[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
UMOUNT: unmounting vg-lvol1 (dm-3) mounted on /mnt/a/b
UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/a
LVM: deactivating Logical Volume vg/lvol0
LVM: deactivating Logical Volume vg/lvol1
===
Also, this patch contains a fix for processing mangled mount paths:
[root@rhel6-a ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 4G 0 disk
`-vg-lvol0 (dm-2) 253:2 0 32M 0 lvm /mnt/x y z
[root@rhel6-a ~]# lsblk -r
vg-lvol0 253:2 0 32M 0 lvm /mnt/x\x20y\x20z
(the mount path is mangled with \xNN that is visible in raw
lsblk output only and which is used in blkdeactive as well)
Before this patch:
[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
umount: /mnt/x\x20y\x20z: not found
After this patch applied:
[root@rhel6-a ~]# blkdeactivate -u
Deactivating block devices:
UMOUNT: unmounting vg-lvol0 (dm-2) mounted on /mnt/x\x20y\x20z
LVM: deactivating Logical Volume vg/lvol0
For reseting locale environment into significantly less memory
consuming version 'C' - use LC_ALL instead of LANG since it has
higher priority in locale settings.
Otherwise we may observe whole locale-archive which might be
over 100MB on i.e. Fedora systems locked in memory with
some daemons.
The idea is to avoid a period when an existing VG is not mapped to either the
old or the new name. (Note that the brief "blackout" was present even if the
name did not actually change.) We instead allow a brief overlap of a VG existing
under both names, i.e. a query for a VG might succeed but before a lock is
acquired the VG disappears.
Add log/debug_classes to lvm.conf to allow debug messages to be
classified and filtered at runtime.
The dm_errno field is only used by log_error(), so I've redefined it
for log_debug() messages to hold the message class.
By default, all existing messages appear, but we can add categories that
generate high volumes of data, such as logging all traffic to/from
lvmetad.
fmt1 doesn't have a separate commit function: updates take effect
immediately vg_write is called, so we must update lvmetad at this
point if we're going to go on and ask lvmetad for the VG metadata
again before calling the commit function (though that's probably an
unsupported and pointless thing to do anyway as the client must
already have that data and it cannot have changed because it's locked
and with devs suspended we shouldn't be communicating with lvmetad;
so when that's fixed properly, this fix here can be reverted).
This problem showed up as an internal error when lvremoving an LVM1
snapshot.
> Internal error: LV snap1 (00000000000000000000000000000001) missing from preload metadata
https://bugzilla.redhat.com/891855
Rename lvmetad_warning() to lvmetad_connect_or_warn().
Log all connection attempts on the client side, whether successful or not.
Reduce some nesting and remove a redundant assertion.
We need to call sync_local_dev_names directly as pvscan uses
VG_GLOBAL lock and this one *does not* cause the synchronization
(sync_dev_names) to be called on unlock (VG_GLOBAL is not a real VG):
define unlock_vg(cmd, vol)
do { \
if (is_real_vg(vol)) \
sync_dev_names(cmd); \
(void) lock_vol(cmd, vol, LCK_VG_UNLOCK); \
} while (0)
Without this fix, we end up without udev synchronization for the
pvscan --cache (mainly for -aay that causes the VGs/LVs to be
autoactivated) and also udev synchronization cookies are then left
in the system since they're not managed properly (code before sets
up udev sync cookies, but we have to call dm_udev_wait at least once
after that to do the wait and cleanup).
Before, the pvscan --cache -aay was called on each ADD and CHANGE
uevent (for a device that is not a device-mapper device) and each CHANGE
event (for a PV that is a device-mapper device).
This causes troubles with autoactivation in some cases as CHANGE event
may originate from using the OPTION+="watch" udev rule that is defined
in 60-persistent-storage.rules (part of the rules provided by udev
directly) and it's used for all block devices
(except fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md* devices). For example, the
following sequence incorrectly activates the rest of LVs in a VG if one
of the LVs in the VG is being removed:
[root@rhel6-a ~]# pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
[root@rhel6-a ~]# vgcreate vg /dev/sda
Volume group "vg" successfully created
[root@rhel6-a ~]# lvcreate -l1 vg
Logical volume "lvol0" created
[root@rhel6-a ~]# lvcreate -l1 vg
Logical volume "lvol1" created
[root@rhel6-a ~]# vgchange -an vg
0 logical volume(s) in volume group "vg" now active
[root@rhel6-a ~]# lvs
LV VG Attr LSize Pool Origin Data% Move Log
Cpy%Sync Convert
lvol0 vg -wi------ 4.00m
lvol1 vg -wi------ 4.00m
[root@rhel6-a ~]# lvremove -ff vg/lvol1
Logical volume "lvol1" successfully removed
[root@rhel6-a ~]# lvs
LV VG Attr LSize Pool Origin Data% Move Log
Cpy%Sync Convert
lvol0 vg -wi-a---- 4.00m
...so the vg was deactivated, then lvol1 removed, and we end up with
lvol1 removed (which is ok) BUT with lvol0 activated (which is wrong)!!!
This is because after lvol1 removal, we need to write metadata to the
underlying device /dev/sda and that causes the CHANGE event to be
generated (because of the WATCH udev rule set on this device) and this
causes the pvscan --cache -aay to be reevaluated.
We have to limit this and call pvscan --cache -aay to autoactivate
VGs/LVs only in these cases:
--> if the *PV is not a dm device*, scan only after proper device
addition (ADD event) and not with any other changes (CHANGE event)
--> if the *PV is a dm device*, scan only after proper mapping
activation (CHANGE event + the underlying PV in a state "just
activated")
If a RAID array is not in-sync, replacing devices should not be allowed
as a general rule. This is because the contents used to populate the
incoming device may be undefined because the devices being read where
not in-sync. The kernel enforces this rule unless overridden by not
allowing the creation of an array that is not in-sync and includes a
devices that needs to be rebuilt.
Since we cannot know the sync state of an LV if it is inactive, we must
also enforce the rule that an array must be active to replace devices.
That leaves us with the following conditions:
1) never allow replacement or repair of devices if the LV is in-active
2) never allow replacement if the LV is not in-sync
3) allow repair if the LV is not in-sync, but warn that contents may
not be recoverable.
In the case where a user is performing the repair on the command line via
'lvconvert --repair', the warning is printed before the user is prompted
if they would like to replace the device(s). If the repair is automated
(i.e. via dmeventd and policy is "allocate"), then the device is replaced
if possible and the warning is printed.
It isn't possible to choose a sane default for snapshot size, so just
play it straight and use the passed size instead of adding special
behavior for 0.
Also revert change to Python lib, size parameter must be supplied.
Signed-off-by: Andy Grover <agrover@redhat.com>
This reverts commit ed23da95b6.
Hash table device_to_pvid seems to contain references to
already deleted pvids and so revert to the older
behaviour using allocated memory.
All operations on shared hash tables need to be protected by mutexes. Moreover,
lookup and subsequent key removal need to happen atomically, to avoid races (and
possible double free-ing) between multiple threads trying to manipulate the same
VG.
If the lvmcache_info_from_pvid() fails to find valid
info, invoke the lookup by dev, and only in this case
call lvmcache_info_from_pvid() again.
Also check for the result of info and return
error directly, so the NULL is not passed
to lvmcache_get_label().
If we fail to get memory for mutex, hash the mutex
or fail somewhere along pthread function calls
return allocated resources back and unlock vg_lock_map mutex.
Detect failure of dm_pool_strdup() and print error in fail path.
Save one extra strchr call - since we already know the distance
for the '=' character.
Drop stack trace from return after log_error().
When the abort_on_internal_errors is enabled, we aborted prior
the syslog logging output.
Since such fatal error gets level _LOG_FATAL it should
not be blocked by debug_level() check so lets move it further,
to get abort error logged also via syslog.
Since we are doing just dump and function doesn't report
any error, explicitely ignore return values from
dm_config_write_node and dm_asprintf.
Same applies for the logging function.
If no size is given, size defaults to 0, which in lvm_lv_snapshot will
allocate extents equal to the original LV be allocated for the new
snapshot.
Signed-off-by: Andy Grover <agrover@redhat.com>
Tabify
Remove use of asize, unneeded.
Don't initialize lvobj->parent_vgobj to NULL, the object ctor already
zeroed everything on alloc.
Redo call to lvm_lv_snapshot to use the liblvm snapshot implementation
we went with.
Add {}s to silence warning in lv_dealloc.
Rename snapshot function for consistency.
Update WHATS_NEW.
Signed-off-by: Andy Grover <agrover@redhat.com>
We can also use this for conversion between different mirror segment
types. Each new segment type converter then needs to check itself
whether the --stripes is applicable.
The motivation to grab the global lock is to avoid a scan and metadata parsing
for each PV, but the cost of obtaining metadata is _mostly_ mitigated by having
lvmetad around. Not taking the global lock improves throughput when multiple pvs
or related commands are running in parallel, like in RHEV.
Calling pvscan --cache with -aay on a PV without an MDA would spuriously fail
with an internal error, because of an incorrect assumption that a parsed VG
structure was always available. This is not true and the autoactivation handler
needs to call vg_read to obtain metadata in cases where the PV had no MDAs to
parse. Therefore, we pass vgid into the handler instead of the (possibly NULL)
VG coming from the PV's MDA.
Arghh, this was bad last-minute shortening of if() expression
in the commit 1ef9831018.
dm_tree_node_set_thin_pool_discard() must not run in the same
expression as check for non-power-2 discard, otherwise
there are 2 calls for dm_tree_node_set_thin_pool_discard
and whole setting of discards is missinterpretted.
In-relase fix it by using proper parentheses {}.
Remove no longer needed warning for unsuppoted discards
for non-power-2 lvcreate commands.
(Missed from the patch for the same update in lvchange made
by commit dde5a6c52b)
Function _ignore_blocked_mirror_devices was not release
allocated strings images_health and log_health.
In error paths it was also not releasing dm_task structure.
Swaped return code of _ignore_blocked_mirror_devices and
use 1 as success.
In _parse_mirror_status use log_error if memory allocation
fails and few more errors so they are no going unnoticed
as debug messages.
On error path always clear return values and free strings.
For dev_create_file use cache mem pool to avoid memleak.
Attempting pvmove on RAID LVs replaces the kernel RAID target with
a temporary pvmove target, ultimately destroying the RAID LV. pvmove
must be prevented on RAID LVs for now.
Use 'lvconvert --replace old_pv vg/lv new_pv' if you want to move
an image of the RAID LV.
In case we don't want to activate, autoactivate or have the
VG/LV read-only. Primarily targeted for the auto_activation_volume_list,
but it makes no harm for other settings (the part of the code
that reads these three settings is shared, but there's no
reason to separate it only for this change).
Rework thin feature detection to support runtime
section to allow to disable them selectively.
New lvm.conf option is born: global/thin_disabled_features
Support swapping of metadata device if the thin pool already
exists. This way it's easy to i.e. resize metadata or their
repair operation.
User may create some empty LV, replace existing metadata
or dump and restore them into bigger LV.
If the resume of preloaded node fails, do not leave such
node in the table - since it may not be easy to detach such
node later when the node is i.e. internal.
i.e. failing activation of the thin pool with mismatching
chunk size may leave -tpool device in the table, which
could have been then removed only by dmsetup command.
Aux function to replace PV with specifically damaged device.
Usage:
aux error_dev "$dev1" 8:32 96:8
Replaces from 8 sector 32 error 512b sectors
and from 96 sector next 8 sectors will fail on rw.
Rest of device is preserved.
For testing:
dd if="$dev1" of=x bs=512 count=104 conv=sync,noerror iflag=direct
Do a better job explaining the '--stripes/-i' option to lvcreate
when it comes to RAID 4/5/6. The extra devices needed for parity
are implicitly added to the argument given. So a 5-device RAID6
logical volume is created with '-i 3' - indicating 3 stripes plus
the implicit 2 devices needed for RAID6.
$ export DM_DISABLE_UDEV=1
$ dmsetup create test --table "0 1 zero"
Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, device-mapper library will manage device nodes in device directory.
$ lvchange -ay vg/lvol0
Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory.
Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory.
Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, device-mapper library will manage device nodes in device directory.
Setting this environment variable will cause a full fallback
to old direct node and symlink management in libdevmapper and lvm2.
It means:
- disabling udev synchronization
(--noudevsync in dmsetup and --noudevsync + activation/udev_sync=0
lvm2 config)
- disabling dm and any subsystem related udev rules
(--noudevrules in dmsetup and activation/udev_rules=0 lvm2 config)
- management of nodes/symlinks under /dev directly by libdevmapper/lvm2
(--verifyudev in dmsetup and activation/verify_udev_operations=1
lvm2 config)
- not obtaining any device list from udev database
(devices/obtain_device_list_from_udev=0 lvm2 config)
Note: we could set all of these before - there's no functional change!
However the DM_DISABLE_UDEV environment variable is a nice shortcut
to make it easier for libdevmapper users so that one can switch off all
of the udev management off at one go directly on the command line,
without a need to modify any source or add any extra switches.
If udev synchronization is disabled by means of --noudevsync
option, we should disable just the synchronization and nothing else.
The udev fallback (verifying udev operations and fixing the
nodes/symlinks if found incorrect) is orthogonal and controlled
by a separate activation/verify_udev_operations configuration option.
Allow restoring metadata with thin pool volumes.
No validation is done for this case within vgcfgrestore tool -
thus incorrect metadata may lead to destruction of pool content.
Configurable settings for thin pool create
if they are not specified on command line.
New supported lvm.conf options are:
allocation/thin_pool_chunk_size
allocation/thin_pool_discards
allocation/thin_pool_zero
Check if target supports discards for chunk sizes,
that are not power of 2 (just multiple of 64K),
and enable it in case it's supported by thin kernel target.
Similar to the way the 'mirror', 'raid1' and 'raid10' segment types set
the number of mirrors to 2 ('-m 1') if the argument is not specified,
here we set the number of stripes to 2 if not given on the command line
when creating a RAID10 LV.
Commit bf2741376d started to use
lv_is_active() instead of call for lv_info & info.exists so
we cover also cluster activated devices.
For snapshost the conversion was not correct and introduced
regression by blocking creation of snapshot of inactive LV.
Fix it by assigning lv_is_active() directly.
Note: we still have minor issue to fix - to make
lv_is_???? function able to return error states since
lv_info() may fail.
Move common functions for lvcreate and lvconvert.
get_pool_params() - read thin pool args.
update_pool_params() - updates/validates some thin args.
It is getting complicated and even few more things will be
implemented, so to avoid reimplementing things differently
in lvcreate and lvconvert code has been splitted
into 2 common functions that allow some future extension.
Target tells us its version, and we may allow different set of options
to be supported with different version of driver.
Idea is to provide individual feature flags and later be
able to query for them.
This patch is intended to fix bug 825323 - FS turns read-only during a double
fault of a mirror leg and mirrored log's leg at the same time. It only
affects a 2-way mirror with a mirrored log. 3+-way mirrors and mirrors
without a mirrored log are not affected.
The problem resulted from the fact that the top level mirror was not
using 'noflush' when suspending before its "down-convert". When a
mirror image fails, the bios are queue until a suspend is recieved. If
it is a 'noflush' suspend, the bios can be safely requeued in the DM
core. If 'noflush' is not used, the bios must be pushed through the
target and if a device is failed for a mirror, that means issuing an
error. When an error is received by a file system, it results in it
turning read-only (depending on the FS).
Part of the problem was is due to the nature of the stacking involved in
using a mirror as a mirror's log. When an image in each fail, the top
level mirror stalls because it is waiting for a log flush. The other
stalls waiting for corrective action. When the repair command is issued,
the entire stacked arrangement is collapsed to a linear LV. The log
flush then fails (somewhat uncleanly) and the top-level mirror is suspended
without 'noflush' because it is a linear device.
This patch allows the log to be repaired first, which in turn allows the
top-level mirror's log flush to complete cleanly. The top-level mirror
is then secondarily reduced to a linear device - at which time this mirror
is suspended properly with 'noflush'.
Fix previous commit 360c569ce8.
Remove only fedora-storage-init/fedora-storage-init-late.service, but
not lvm2-activation.service.
fedora-storage-init.service fedora-storage-init-late.service
Don't use lvmetad in lvm2-monitor.service ExecStop to avoid a systemd issue.
- a systemd design issue while processing dependencies
with socket-based activation that ends up with a hang
- https://bugzilla.redhat.com/show_bug.cgi?id=843587
(also tracker bug https://bugzilla.redhat.com/show_bug.cgi?id=871527)
- not using lvmetad in this case is just a workaround, once the bug
above is resolved, we should enable the lvmetad in that specific case
Remove dependency on fedora-storage-init.service in lvm2 systemd units.
- fedora-storage-init.service and fedora-storage-init-late.service is
going to be separated into respective units that belong to each block
device subsystem:
- mpath + mdraid activated via udev solely
- dmraid with its own dmraid-activation.service unit
- lvm2 with the lvm2-activation-generator to generate the
activation units runtime if lvmetad disabled
(global/use_lvmetad=0 set in lvm.conf) and activation done
via udev+lvmetad if lvmetad enabled (global/use_lvmetad=1 set
in lvm.conf)
Depend on lvm2-lvmetad.socket in lvm2-monitor.service systemd unit.
- as lvm2-monitor uses lvmetad if lvmetad is enabled
Issues found (thus far) in unit test developemnt for python bindings.
Added Py_DECREF(ptr) in liblvm_lvm_vg_open & liblvm_lvm_vg_create
in error paths so that we correctly clean up memory.
Added a call to lvm_vg_close when we remove a vg. The code was
clearing out the vg pointer which prevented us from actually
calling lvm_vg_close in the close path.
liblvm_lvm_vg_create_lv_linear was not initializing
lvobj->parent_vgobj and if lvm_vg_create_lv_linear failed
we went through liblvm_lv_dealloc on clean up and tried to
Py_DECREF an invalid pointer.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Commit 9fd7ac7d03 did not handle mirrors
that contained mirrored logs. This is because the status line of the
mirror does not give an indication of the health of the mirrored log,
as you can see here:
[root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
Thus, the possibility for LVM commands to hang still persists when mirror
have mirrored logs. I discovered this while performing some testing that
does polling with 'pvs' while doing I/O and killing devices. The 'pvs'
managed to get between the mirrored log device failure and the attempt
by dmeventd to repair it. The result was a very nasty block in LVM
commands that is very difficult to remove - even for someone who knows
what is going on. Thus, it is absolutely essential that the log of a
mirror be recursively checked for mirror devices which may be failed
as well.
Despite what the code comment says in the aforementioned commit...
+ * _mirrored_transient_status(). FIXME: It is unable to handle mirrors
+ * with mirrored logs because it does not have a way to get the status of
+ * the mirror that forms the log, which could be blocked.
... it is possible to get the status of the log because the log device
major/minor is given to us by the status output of the top-level mirror.
We can use that to query the log device for any DM status and see if it
is a mirror that needs to be bypassed. This patch does just that and is
now able to avoid reading from mirrors that have failed devices in a
mirrored log.
Addresses: rhbz855398 (Allow VGs to be built on cluster mirrors),
and other issues.
The LVM code attempts to avoid reading labels from devices that are
suspended to try to avoid situations that may cause the commands to
block indefinitely. When scanning devices, 'ignore_suspended_devices'
can be set so the code (lib/activate/dev_manager.c:device_is_usable())
checks any DM devices it finds and avoids them if they are suspended.
The mirror target has an additional mechanism that can cause I/O to
be blocked. If a device in a mirror fails, all I/O will be blocked
by the kernel until a new table (a linear target or a mirror with
replacement devices) is loaded. The mirror indicates that this condition
has happened by marking a 'D' for the faulty device in its status
output. This condition must also be checked by 'device_is_usable()' to
avoid the possibility of blocking LVM commands indefinitely due to an
attempt to read the blocked mirror for labels.
Until now, mirrors were avoided if the 'ignore_suspended_devices'
condition was set. This check seemed to suggest, "if we are concerned
about suspended devices, then let's ignore mirrors altogether just
in case". This is insufficient and doesn't solve any problems. All
devices that are suspended are already avoided if
'ignore_suspended_devices' is set; and if a mirror is blocking because
of an error condition, it will block the LVM command regardless of the
setting of that variable.
Rather than avoiding mirrors whenever 'ignore_suspended_devices' is
set, this patch causes mirrors to be avoided whenever they are blocking
due to an error. (As mentioned above, the case where a DM device is
suspended is already covered.) This solves a number of issues that weren't
handled before. For example, pvcreate (or any command that does a
pv_read or vg_read, which eventually call device_is_usable()) will be
protected from blocked mirrors regardless of how
'ignore_suspended_devices' is set. Additionally, a mirror that is
neither suspended nor blocking is /allowed/ to be read regardless
of how 'ignore_suspended_devices' is set. (The latter point being the
source of the fix for rhbz855398.)
The heading 'Copy%' is specific to PVMOVE volumes, but can be generalized
to apply to LVM mirrors also. It is a bit awkward to use 'Copy%' for
RAID 4/5/6, however - 'Sync%' would be more appropriate. This is why
RAID 4/5/6 have not displayed their sync status by any means available to
'lvs' yet.
Example (old):
[root@hayes-02 lvm2]# lvs vg
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sy Convert
lv vg -wi-a---- 1.00g
raid1 vg rwi-a-r-- 1.00g 100.00
raid4 vg rwi-a-r-- 1.01g
raid5 vg rwi-a-r-- 1.01g
raid6 vg rwi-a-r-- 1.01g
This patch changes the heading to 'Cpy%Sync' and allows RAID 4/5/6 to print
their sync percent in this field.
Example (new):
[root@hayes-02 lvm2]# lvs vg
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
lv vg -wi-a---- 1.00g
raid1 vg rwi-a-r-- 1.00g 100.00
raid4 vg rwi-a-r-- 1.01g 100.00
raid5 vg rwi-a-r-- 1.01g 100.00
raid6 vg rwi-a-r-- 1.01g 100.00
The 'copy_percent' function takes the 'extents_copied' field from each
segment in an LV to create the numerator for the ratio that is to
become the copy_percent. (Otherwise known as the 'sync' percent for
non-pvmove uses, like mirror LVs and RAID LVs.) This function safely
works on RAID - not just mirrors - so it is better to have it in
lv_manip.c rather than mirror.c.
There's a lot of different functions that do a lot of different things
in lv_manip.c, so I placed the function near a function in lv_manip.c
that it was close to in metadata-exported.h. Different placement in the
file or a different name for the function may be useful.
The first parameter needs to be present even if it isn't being
used, otherwise the one any only parameter we get is null
which causes PyArg_ParseTuple to seg. fault.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
cookie_set variable found in the struct dm_task should be always
set to 1 after dm_task_set_cookie_call, even if udev_sync is disabled
as the cookie itself carries synchronization informations *as well as*
extra flags to control other aspects of udev support.
For example, one could disable the synchronization itself, but still
direct the libdm code to disable library fallback via
DM_UDEV_DISABLE_LIBRARY_FALLBACK flag. These extra flags still need
to be carried out!
A concrete example:
$ dmsetup create test --table "0 1 zero" --noudevsync
This disables synchronization with udev. As the --verifyudev option is
not used, we don't want to do any corrections. In other words, we
need DM_UDEV_DISABLE_LIBRARY_FALLBACK flag to be used. However,
with --noudevsync this was not the case - the flag was ignored!
This patch fixes the case when noudevsync is used but there are still
some extra flags passed within the cookie flag part. The synchronization
part of the cookie stays zero (which is ok as dm_udev_wait call on such a
cookie is simply a NOOP).
clvmd -d option parsing was not working properly.
clvmd -d 2 (with space) has been ignored because of
'::' used in getopt string, and as failsafe it's been used '1'.
Later this debug_arg has been ignored and debug_opt was used
instead which happend to have value '1'.
Submitted-by: Robert Milasan <rmilasan at suse.com>
Reported-by: Robert Milasan <rmilasan at suse.com>
Our object nesting:
lib -> VG -> LV -> lvseg
-> PV -> pvseg
Implement refcounting and checks to ensure parent objects are not
dealloced before their children. Also ensure calls to self or child's
methods are handled cleanly for objects that have been closed or removed.
Ensure all functions that are object methods have a first parameter named
'self', for consistency
Rename local vars that reference a Python object to '*obj', in order to
differentiate from liblvm handles
Fix a misspelled pv method name
Signed-off-by: Andy Grover <agrover@redhat.com>
Use log_warn to print non-fatal warning messages.
Use of log_error would confuse checker for testing
whether proper error has been reported for some real error.
Instead of requiring users to create a liblvm object, and then calling
methods on it, the module acquires a liblvm handle as part of
initialization. This makes it impossible to instantiate a liblvm object
with a different systemdir, but there is an alternate envvar method for
that obscure use case.
Signed-off-by: Andy Grover <agrover@redhat.com>
I'm not sure what 'BUG's were being encountered when the restriction
to limit lvconvert-raid.sh tests to kernels > 3.2 was added. I do know
that there were BUG's that could be triggered when testing snapshots and
some of the earliest DM RAID available in the kernel. I've taken out
the 3.2 kernel restriction and added a dm-raid >= 1.2 restriction instead.
This will allow the test to run on patched production kernels.
A message is printed when the region_size of a RAID LV is adjusted
to allow for large (> ~1TB) LVs. The message wasn't very clear.
Hopefully, this is better.
Try to bring the lvmetad usage text and man page closer to the code.
There seem to be 3 useful ways to use -d with lvmetad at the moment:
-d all
-d wire
-d debug
(They can also be comma-separated like -d wire,debug.)
Prior to the last release, -d, -dd and -ddd were supported.
Fail if an unrecognised debug arg is supplied on the command line.
Change -V to report the same version as the lvm binary: previously it
just reported version 0.
Reset counter after thin pool resize failure.
If the pool goes above threshold, support unmounting
of all thin volumes if the lvextend fails to avoid
overfilling of the pool.
When valgrind usage is desired by user (--enable-valgrind-pool)
skip playing/closing/reopenning with descriptors - it makes
valgridng useless.
Make sleep delay for clvmd start longer.
blkdeactivate - utility to deactivate block devices
Traverses the tree of block devices and tries to deactivate them.
Currently, it supports device-mapper-based devices together with LVM.
See man/blkdeactivate.8 for more info.
It is targeted for use during shutdown to properly deactivate the
whole block device stack - systemd and init scripts are provided as
well. However, it might be used directly on command line too.
Please, see the commentary at the top of the blkdeactivate script
for dependencies and versions of other utilities required.
lvm2-activation-early.service (generated by activation generator) should
be ordered before cryptsetup.target.
lvm2-monitor.service should be ordered after lvm2-activation.service,
if used. The lvm2-activation.service will replace fedora-storage-init.service
and fedora-storage-init-late.service in the end, but let's have it
prepared now.
On each ioctl return, the device UUID is decoded from \xNN format.
If the UUID of the device being *removed* is malformed (e.g. it
hasn't been corrected before), just remove it without any error
as the UUID is not needed anymore - the device is gone anyway.
Otherwise a misleading error message would be issued just after
the removal:
# dmsetup remove test
The UUID "a b" should be mangled but it contains blacklisted characters.
Command failed
Use configure --enable-python_bindings to generate them.
Note that the Makefiles do not yet control the owner or permissions of
the two new files on installation.
Commit 3501f17fd0 enables a limited set
of metadata updates for partial LV/VGs when issuing lvchange or vgchange.
These tests verify those changes operate as intended.
A while back, the behavior of LVM changed from allowing metadata changes
when PVs were missing to not allowing changes. Until recently, this
change was tolerated by HA-LVM by forcing a 'vgreduce --removemissing'
before trying (again) to add tags to an LV and then activate it. LVM
mirroring requires that failed devices are removed anyway, so this was
largely harmless. However, RAID LVs do not require devices to be removed
from the array in order to be activated. In fact, in an HA-LVM
environment this would be very undesirable. Device failures in such an
environment can often be transient and it would be much better to restore
the device to the array than synchronize an entirely new device.
There are two methods that can be used to setup an HA-LVM environment:
"clvm" or "tagging". For RAID LVs, "clvm" is out of the question because
RAID LVs are not supported in clustered VGs - not even in an exclusively
activated manner. That leaves "tagging". HA-LVM uses tagging - coupled
with 'volume_list' - to ensure that only one machine can have an LV active
at a time. If updates are not allowed when a PV is missing, it is
impossible to add or remove tags to allow for activation. This removes
one of the most basic functionalities of HA-LVM - site redundancy. If
mirroring or RAID is used to replicate the storage in two data centers
and one of them goes down, a server and a storage device are lost. When
the service fails-over to the alternate site, the VG will be "partial".
Unable to add a tag to the VG/LV, the RAID device will be unable to
activate.
The solution is to allow vgchange and lvchange to alter the LVM metadata
for a limited set of options - --[add|del]tag included. The set of
allowable options are ones that do not cause changes to the DM kernel
target (like --resync would) or could alter the structure of the LV
(like allocation or conversion).
Compared to names, UUIDs can't be renamed once they are created
for a device. The 'mangle' command will just issue an error message
about a need for manual intervention in this case - reactivating the
device (remove + create) does the job as the defualt mangling mode
used is "auto" and that will assign a correct mangled form the UUID.
Just like we already have existing mangling support for
device-mapper names, we need exactly the same for device-mapper
UUIDs as their character whitelist is wider than what udev supports.
In case udev is used to create entries in /dev based on UUIDs
and these UUIDs contain characters not supported by udev,
we'll end up with incorrect /dev content for such devices.
So we need to mangle them to a form that is supported by udev.
The mangling used for UUIDs follows the mangling used for names
(that is already supported and used throughout). That means,
setting the name mangling mode via dm_set_name_mangling_mode
affects mangling used for UUIDs in exactly the same manner.
It would be useless to add a new and separate
dm_set_uuid_mangling_mode fn, we'll reuse existing interface.
(un)mangle_name -> (un)mangle_string
check_multiple_mangled_name_allowed -> check_multiple_mangled_string_allowed
Just for clarity as the same functions will be reused to (un)mangle dm UUIDs.
Separate original raid test and new raid10 test,
so the old could be tested on platforms without raid10 support.
Replace test-unfriendly `ls /dev/mapper` with dmsetup ls
The ExecStartPost with pvscan --cache in lvm2-lvmetad.service
is not needed now as this is called transparently within the
first LVM command that queries lvmetad.
Revert changes to origin lvcreate-large test and use separate
test scripts for raid - so they can be properly skipped when
kernel doesn't support raid targets.
For now this convertions is not supported, thus disabled.
The only supported conversion for now is to create mirrored thin pools
from mirrored devices.
It would be possible to activate a RAID LV exclusively in a cluster
volume group, but for now we do not allow RAID LVs to exist in a
clustered volume group at all. This has two components:
1) Do not allow RAID LVs to be created in a clustered VG
2) Do not allow changing a VG from single-machine to clustered
if there are RAID LVs present.
Update code for lvconvert.
Change the lvconvert user interface a bit - now we require 2 specifiers
--thinpool takes LV name for data device (and makes the name)
--poolmetadata takes LV name for metadata device.
Fix type in thin help text -z -> -Z.
Supported is also new flag --discards for thinpools.
Patch clears the flag if thin pool is stacked over mirror.
Since thin pool could be used to stack device over mirrors,
it needs resume properly i.e. mirrors with corelog which are otherwise
unconditionally skipped (for pvmove functionality).
MD's bitmaps can handle 2^21 regions at most. The RAID code has always
used a region_size of 1024 sectors. That means the size of a RAID LV was
limited to 1TiB. (The user can adjust the region_size when creating a
RAID LV, which can affect the maximum size.) Thus, creating, extending or
converting to a RAID LV greater than 1TiB would result in a failure to
load the new device-mapper table.
Again, the size of the RAID LV is not limited by how much space is allocated
for the metadata area, but by the limitations of the MD bitmap. Therefore,
we must adjust the 'region_size' to ensure that the number of regions does
not exceed the limit. I've added code to do this when extending a RAID LV
(which covers 'create' and 'extend' operations) and when up-converting -
specifically from linear to RAID1.
We were using daemon_send_simple until now, but it is no longer adequate, since
we need to manipulate requests in a generic way (adding a validity token to each
request), and the tree-based request interface is much more suitable for this.
- move common dm_config_tree manipulation functions from lvmetad-core to
daemon-shared
- add config-tree-based request manipulation APIs to daemon-client
- factor out _v (va_list) variants of most variadic functions in libdaemon
When reformatting the 'lvchange_resync' code in commit
05131f5853, a '!' should have been removed
from the condition that checks for the LV_NOTSYNCED flag on a corelog
mirror LV. The presence of this '!' caused the LV_NOTSYNCED flag to be
cleared when it wasn't present and left when it was present.
It is not allowed to add images to a 'mirror' or 'raid1' LV if the
LV_NOTSYNCED flag is set. We add some up-convert tests to ensure this
behavior is being enforced and that the LV_NOTSYNCED flag is being
properly cleared by 'lvchange --resync'.
(Not updating WHATS_NEW because this is intrarelease.)
Don't try to issue discards to a missing PV to avoid segfault.
Prevent lvremove from removing LVs that have any part missing.
https://bugzilla.redhat.com/857554
Failing to clear the LV_NOTSYNCED flag when converting a RAID1 LV to
linear can result in the flag being present after an upconvert - even
if the sync is performed when upconverting.
Mirrors do not allow upconverting if the LV has been created with --nosync.
We will enforce the same rule for RAID1. It isn't hugely critical, since
the portions that have been written will be copied over to the new device
identically from either of the existing images. However, the unwritten
sections may be different, causing the added image to be a hybrid of the
existing images.
Also, we are disallowing the addition of new images to a RAID1 LV that has
not completed the initial sync. This may be different from mirroring, but
that is due to the fact that the 'mirror' segment type "stacks" when adding
a new image and RAID1 does not. RAID1 will rebuild a newly added image
"inline" from the existant images, so they should be in-sync.
The "fedora-wait-storage.service" that the "lvm2-activation.service"
had as a dependency (which was fedora-specific solution anyway)
is obsolete now as this unit called "modprobe scsi_wait_scan"
which is not used anymore.
The "fedora-wait-storage.service" had "systemd-udev-settle" as
its dependency, so let's depend on this one directly now,
bypassing the out-dated "fedora-wait-storage.service".
Using 'activation/auto_activation_volume_list = [ "vg/lvol1" ]'.
Before this patch:
3 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi-a--- 4.00m
lvol3 vg -wi-a--- 4.00m
(vg/lvol1 activated as it passes the list and all subsequent volumes too - wrong!)
With this patch:
1 logical volume(s) in volume group "vg" now active
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
lvol0 vg -wi----- 4.00m
lvol1 vg -wi-a--- 4.00m
lvol2 vg -wi----- 4.00m
lvol3 vg -wi----- 4.00m
(only vg/lvol1 activated as it passes the list and no other - correct!)
Issuing a 'lvchange --resync <VG>/<RAID_LV>' had no effect. This is
because the code to handle RAID LVs was not present. This patch adds
the code that will clear the metadata areas of RAID LVs - causing them
to resync upon activation.
When an LV is to be resynced, the metadata areas are cleared and the
LV is reactivated. This is true for mirroring and will also be true
for RAID LVs. We restructure the code in lvchange_resync() so that we
keep all the common steps necessary (validation of ability to resync,
deactivation, activation of meta/log devices, clearing of those devices,
etc) and place the code that will be divergent in separate functions:
detach_metadata_devices()
attach_metadata_devices()
The common steps will be processed on lists of metadata devices. Before
RAID capability is added, this will simply be the mirror log device (if
found).
This patch lays the ground-work for adding resync of RAID LVs.
By changing the conditional for resyncing mirrors with core-logs a
bit, we can short-circuit the rest of the function for that case
and reduce the amount of indenting in the rest of the function.
This cleanup will simplify future patches aimed at properly handling
the resync of RAID LVs.
We cannot add images to a RAID array while it is not in-sync. The
kernel will simply reject the table, saying:
'rebuild' specified while array is not in-sync
Now we check to ensure the LV is in-sync before attempting image
additions.
It is necessary when creating a RAID LV to clear the new metadata areas.
Failure to do so could result in a prepopulated bitmap that would cause
the new array to skip syncing portions of the array. It is a requirement
that the metadata LVs be activated and cleared in the process of creating.
However in test mode, this requirement should be lifted - no new LVs should
be created or written to.
When printing a message for the user and the lv_segment pointer is available,
use segtype->ops->name() instead of segtype->name. This gives a better
user-readable name for the segment. This is especially true for the
'striped' segment type, which prints "linear" if there is an area_count of
one.
The 'test' subdir needs to be processed before 'tools' subdir
for distclean as all the cmd names are read from 'tools/.commands'
file. Otherwise we'd end up with dangling symlinks in 'tools' subdir.
If we were defining a section (which is a node without a value) and
the value was created automatically on dm_config_create_node call,
we were wasting resources as the next step after creating the config
node itself was assigning NULL for the node's value.
The dm_config_node_create + dm_config_create_value sequence should be
used instead for settings and dm_config_node_create alone for sections.
The majority of the code already used the correct sequence. Though
with dm_config_node_create fn creating the value as well, the pool
memory was being trashed this way.
This patch removes the node value initialization on dm_config_create_node
fn call and keeps it for the direct dm_config_create_value fn call.
We should check whether the fd is opened before trying to reopen it.
For example, the stdin is closed in test/lib/harness.c causing the
test suite to fail.
Fix setvbuf code by closing and reopening stream before changing buffer.
But we need to review what this code is doing embedded inside a library
function rather than the simpler original form being run independently
at the top of main() by tools that need it.
Accept -q as the short form of --quiet.
Suppress non-essential standard output if -q is given twice.
Treat log/silent in lvm.conf as equivalent to -qq.
Review all log_print messages and change some to
log_print_unless_silent.
When silent, the following commands still produce output:
dumpconfig, lvdisplay, lvmdiskscan, lvs, pvck, pvdisplay,
pvs, version, vgcfgrestore -l, vgdisplay, vgs.
[Needs checking.]
Non-essential messages are shifted from log level 4 to log level 5
for syslog and lvm2_log_fn purposes.
This patch adds support for RAID10. It is not the default at this
stage. The user needs to specify '--type raid10' if they would like
RAID10 instead of stacked mirror over stripe.
Adding couple INTERNAL_ERROR reports for unwanted parameters:
Ensure the 'top' metadata node cannot be NULL for lvmetad.
Make obvious vginfo2 cannot be NULL.
Report internal error if handler and vg is undefined.
Check for handle in poll_vg().
Ensure seg is not NULL in dev_manager_transient().
Report missing read_ahead for _lv_read_ahead_single().
Check for report handler in dm_report_object().
Check missing VG in _vgreduce_single().
- logging is not controlled by "levels" but by "types"; types are
independent of each other... implementation of the usual "log level"
user-level semantics can be simply done on top; the immediate
application is enabling/disabling wire traffic logging independently
of other debug data, since the former is rather bulky and can easily
obscure almost everything else
- all logs go to "outlets", of which we currently have 2: syslog and
stderr; which "types" go to which "outlets" is entirely configurable
Always store discard setting in LV metadata. (Note that lvcreate_params
doesn't yet use --discard to set the initial value.)
Remove undocumented env var LVM_THIN_VERSION_MIN that has no use on a
live system.
Change verbose 'feature not found' messages to debug.
Use discard_str for string value of discard.
I think it's better not to abbreviate human-readable fields like
'discard' to a single character. Users can truncate it to the
first character themselves if they wish.
It's confusing to use the variable name discard for different things in
different places - use discard_str when it's a string not the enum.
Respond with "unknown" rather than a NULL pointer if there's an
internal error and the discard value is invalid.
Don't accept 'no_passdown' or 'no-passdown' variants in the LVM
metadata: this is written by the program so should only ever contain
"nopassdown" and should be validated strictly against that.
Remove the limit for major and minor number arguments used while specifying
persistent numbers via -My --major <major> --minor <minor> option which
was set to 255 before. Follow the kernel limit instead which is 12 bits
for major and 20 bits for minor number (kernel >= 2.6 and LVM formats
that does not have FMT_RESTRICTED_LVIDS - so still keep the old limit
of 255 for lvm1 format).
The lvm2 activation generator generates systemd units conditionally
based on the global/use_lvmetad lvm.conf setting.
If use_lvmetad=0, the lvm2-activation-early.service and lvm2-activation.service
units will be generated. These units are responsible for direct volume activation
by calling "vgchange -aay --sysinit" (this is actually the original on-boot
activation as it was used before). If use_lvmetad=1, no units will be generated
as we're relying on autoactivation.
Important thing to note is that the lvm2-activation units normally bring
in the udev-settle ("storage-wait") service that waits for udev to settle
(with block devices). We don't need this if lvmetad is used in conjunction
with autoactivation feature... but systemd units can't be enabled or disabled
(or dependencies added/removed) dynamically based on external configuration.
Therefore, we need the unit generator which adds support for such situations:
the units as a whole either exist or not based on the external configuration.
Allowing people to add devices to a VG that has PVs missing helps
people avoid the inability to repair RAID LVs in certain cases.
For example, if a user creates a RAID 4/5/6 LV using all of the
available devices in a VG, there will be no spare devices to
repair the LV with if a device should fail. Further, because the
VG is missing a device, new devices cannot be added to allow the
repair. If 'vgreduce --removemissing' were attempted, the
"MISSING" PV could not be removed without also destroying the RAID
LV.
Allowing vgextend to operate solves the circular dependency.
When the PV is added by a vgextend operation, the sequence number is
incremented and the 'MISSING' flag is put on the PVs which are missing.
Monitoring is handled using "vgchange --monitor" call. Ensure that lvmetad is up
and running at the time of this call to prevent any fallback to direct scan
within the vgchange. The same applies for shutdown sequence but the other way
round - switch monitoring off and lvmetad afterwards.
Commit 8767435ef8 allowed RAID 4/5/6
LV to be extended properly, but introduced a regression in device
replacement - a critical component of fault tolerance.
When only 1 or 2 drives are being replaced, the 'area_count' needed
can be equal to the parity_count. The 'area_multiple' for RAID 4/5/6
was computed as 'area_count - parity_devs', which could result in
'area_multiple' being 0. This would ultimately lead to a division by
zero error. Therefore, in calc_area_multiple, it is important to take
into account the number of areas that are being requested - just as
we already do in _alloc_init.
A regression introduced in 2.02.89 (11e520256b)
caused the lvm dumpconfig <node> to print out
the node as well as its subsequent siblings.
The information about "only_one" mode got lost.
Before this patch (just an example node):
# lvm dumpconfig global/use_lvmetad
use_lvmetad=1
thin_check_executable="/usr/sbin/thin_check"
thin_check_options="-q"
(...all nodes to the end of the section)
With this patch applied:
# lvm dumpconfig global/use_lvmetad
use_lvmetad=1
If a daemon (like lvmetad that is using common daemon-server code)
received a kill signal that was supposed to shut the daemon down,
a spurious message was issued: "Failed to handle a client connection".
This happened if the kill signal came just in the middle of waiting
for a client request in "select" - the request that was supposed to
be handled was blank at that moment of course.
Update lvchange to allow change of 'zero' flag for thinpool.
Add support for changing discard handling.
N.B. from/to ignore could be only changed for inactive pool.
Add arg support for discard.
Add discard ignore, nopassdown, passdown (=default) support.
Flags could be set per pool.
lvcreate [--discard {ignore|no_passdown|passdown}] vg/thinlv
When --sysinit -a ay is used with vg/lvchange and lvmetad is up and running,
we should skip manual activation as that would be a useless step - all volumes
are autoactivated once all the PVs for a VG are present.
If lvmetad is not active at the time of the vgchange --sysinit -a ay
call, the activation proceeds in standard 'manual' way.
This way, we can still have vg/lvchange --sysinit -a ay called
unconditionally in system initialization scripts no matter if lvmetad
is used or not.
Reducing a RAID 4/5/6 LV or extending it with a different number of
stripes is still not implemented. This patch covers the "simple" case
where the LV is extended with the same number of stripes as the orginal.
In process_each_pv() if we haven't yet scanned and the PV appears
to be an orphan, we must scan the other PVs looking for mdas that
reference it to find out what VG it is in.
1. If the PV has no mdas, we must scan.
2. If the PV has an mda that is not ignored we do not need to scan.
3. If the PV has an mda that is ignored, we do need to scan.
This patch fixes case 3.
> pvs -o +mda_count,vg_mda_count /dev/loop[0123]
PV VG Fmt Attr PSize PFree #PMda #VMda
/dev/loop0 vg3 lvm2 a- 96.00m 96.00m 0 1
/dev/loop1 vg3 lvm2 a- 96.00m 96.00m 1 1
/dev/loop2 vg2 lvm2 a- 96.00m 96.00m 1 2
/dev/loop3 vg2 lvm2 a- 28.00m 28.00m 1 2
Before:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 lvm2 a-- 100.00m 100.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 lvm2 a-- 100.00m 100.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
After:
> pvs /dev/loop2 /dev/loop3 /dev/loop0 /dev/loop1 --unbuffered
PV VG Fmt Attr PSize PFree
/dev/loop2 vg2 lvm2 a-- 96.00m 96.00m
/dev/loop3 vg2 lvm2 a-- 28.00m 28.00m
/dev/loop0 vg3 lvm2 a-- 96.00m 96.00m
/dev/loop1 vg3 lvm2 a-- 96.00m 96.00m
Change 'lv_passes_volumes_filter' fn back to static as it's not
actually needed in the other code (a remnant from devel version).
Fix lvm.conf comment referencing '--autoactivate' which was finally
decided to be '--activate ay'.
If _alloc_parallel_area for raid devices chooses an area already used
up, it doesn't notice that it has no space left in it and leaves
later code trying to place a zero-length area into the LV.
https://bugzilla.redhat.com/832596
The clmvd init script called "vgchange -aly" before to activate
all VGs in cluster environment. This activated all VGs, no matter
if it was clustered or not.
Auto activation for clustered VGs is not supported yet so the behaviour
of -aay is still the same as before for clustered VGs. However, for
non-clustered VGs, we need to check with the activation/auto_activation_volume_list
whether the VG/LV should be activated on boot or not.
One can use "lvcreate --aay" to have the newly created volume
activated or not activated based on the activation/auto_activation_volume_list
this way.
Note: -Z/--zero is not compatible with -aay, zeroing is not used in this case!
When using lvcreate -aay, a default warning message is also issued that zeroing
is not done.
Define auto_activation_handler that activates VGs/LVs automatically
based on the activation/auto_activation_volume_list (activating all
volumes by default if the list is not defined).
The autoactivation is done within the pvscan call in 69-dm-lvmetad.rules
that watches for udev events (device appearance/removal).
For now, this works for non-clustered and complete VGs only.
Normally, the 'vgchange -ay' activates all volume groups (that pass
the activation/volume_list filter if set).
This call can appear in two scenarios:
- system boot (so activation within a script in general)
- manual call on command line (so activaton on user's direct request)
For the former one, we would like to select which VGs should be actually
activated. One can define the list of VGs directly to do that. But that
would require the same list to be provided in all the scripts.
The 'vgchange -aay' will check for the activation/auto_activation_volume_list
in adition and it will activate only those VGs/LVs that pass this
filter (assuming all to be activated if the list is not defined - the
same logic we already have for activation/volume_list).
Init/boot scripts should use this form of activation primarily
(which, anyway, becomes only a fallback now with autoactivation done
on PV appearance in tandem with lvmetad in place).
Define an 'activation_handler' that gets called automatically on
PV appearance/disappearance while processing the lvmetad_pv_found
and lvmetad_pv_gone functions that are supposed to update the
lvmetad state based on PV availability state. For now, the actual
support is for PV appearance only, leaving room for PV disappearance
support as well (which is a more complex problem to solve as this
needs to count with possible device stack).
Add a new activation change mode - CHANGE_AAY exposed as
'--activate ay/-aay' argument ('activate automatically').
Factor out the vgchange activation functionality for use in other
tools (like pvscan...).
We're refererring to 'activation' all over the code and we're talking
about 'LVs being activated' all the time so let's use 'activation/activate'
everywhere for clarity and consistency (still providing the old
'available' keyword as a synonym for backward compatibility with
existing environments).
Update release_lv_segment_area not to discard any PV extents,
as it also gets used when moving extents between LVs.
Instead, call a new function release_and_discard_lv_segment_area() in
the two places where data should be discarded - lv_reduce() and
remove_mirrors_from_segments().
Remove executable path detection in udev rules and use sbindir that
is configured, but still provide the original functionality by means
of 'configure --enable-udev-rule-exec-detection'.
Normally, the exec path for the tools called in udev rules should
not differ from the sbindir used, however, there are cases this is
necessary. For example different environments could be assembled
in a way that these path differ for some reason (distribution installer,
initrd ...).
This functionality is kept for compatibility only. Any environment
moving the binaries around and using different paths should be fixed
eventually!
There were several hard-coded values for run directory around the code.
Also, some tools are DM specific only, others are LVM specific and there
was no distinction made here before. With this patch applied, we have
this cleaned up a bit (subsystem in brackets, defaults in parentheses):
[common] configurable PID_DIR (/var/run)
lvm [lvm] configurable RUN_DIR (/var/run/lvm)
configurable locking dir (/var/lock/lvm)
clvmd [lvm] configurable pid file (PID_DIR/clvmd.pid)
socket (RUN_DIR/clvmd.sock)
lvmetad [lvm] configurable pid file (PID_DIR/lvmetad.pid)
socket (RUN_DIR/lvmetad.socket)
dm [dm] configurable DM_RUN_DIR (/var/run)
cmirrord [dm] configurable pid file (PID_DIR/cmirrord.pid)
dmeventd [dm] configurable pid file (PID_DIR/dmeventd.pid)
server fifo (DM_RUN_DIR/dmeventd-server)
client fifo (DM_RUN_DIR/dmeventd-client)
The changes briefly:
- added configure --with-default-pid-dir
- added configure --with-default-dm-run-dir
- added configure --with-lvmetad-pidfile
- by default, using one common pid directory for everything
(only lvmetad was not following this before)
There's no need to have the device open RW while obtaining the readahead value.
The RW open used before caused the CHANGE udev event to be generated if the
WATCH udev rule was set for the underlying device (and that is normally the
case both for non-dm and dm devices by default).
This did not cause any problems before since we were not interested in
*underlying* devices. However, with upcoming changes (autoactivation), we're
watching for events on underlying devices marked as PVs and such a spurious
event could cause the autoactivation code to be triggered. So when trying
to deactivate the volume, we could end up with immediate activation just after
that because of the CHANGE event originated in the WATCH udev rule since the
underlying device was open RW during the deactivation process.
Though maybe a better solution would be to completely filter such spurious
events out of the autoactivation process somehow, it's still useful if there
are as least spurious events generated as possible in the system itself.
If the user would set bigger reserved stack size then what
is allowed in resources (ulimit -s), then he would get coredump
So avoid coredump and ignore creation of such large stack size
(lvm should work properly, with just 64KB, so the option could
be eliminated).
If the user specifies number in the range of [4G/1024, 4G>,
the used value would wrap around (32bit math).
So keep the math 64bit.
Note, using such large lvm.conf values is pointless with lvm2.
With latest changes in the udev, some deprecated functions were removed
from libudev amongst which there was the "udev_get_dev_path" function
we used to compare a device directory used in udev and directore set in
libdevmapper. The "/dev" is hardcoded in udev now (udev version >= 183).
Amongst other changes and from packager's point of view, it's also
important to note that the libudev development library ("libudev-devel")
could now be a part of the systemd development library ("systemd-devel")
because of the udev + systemd merge.
Support has many limitations and lots of FIXMEs inside,
however it makes initial task when user creates a separate LV for
thin pool data and thin metadata already usable, so let's enable
it for testing.
Easiest API:
lvconvert --chunksize XX --thinpool data_lv metadata_lv
More functionality extensions will follow up.
TODO: Code needs some rework since a lot of same code is getting copied.
When mirrors are up-converted, a transient mirror layer is put in so that
only the new devices are sync'ed. That transient layer must carry the tags
of the original mirror LV, otherwise it will fail to activate when activation
is regulated by lvm.conf:activation/volume_list. The conversion would then
fail.
The fix is to do exactly the same thing that is being done for linear ->
mirror converting (lib/metadata/mirror.c:_init_mirror_log()). We copy the
tags temporarily for the new LV and remove them after the activation.
Snapshots of RAID logical volumes are allowed (including "raid1"). However,
snapshots of "mirror" logical volumes has been disallowed due to unsolvable
issues inherent to the design. The fact that mirroring (dm-raid1.c) must
stop all I/O as the result of a failure and wait for userspace intervention
can lead to a circular dependency if userspace is simultaneously waiting for
snapshots (on mirrors) to make an I/O update before proceeding.
Various snapshot on mirror tests have been removed as a result.
Looking at the code in cmirrord/local.c, we can see the various different
request types handled in different ways. Some information that is non-changing
does not need to go around the cluster and can be short-circuited. For
example, once the cluster mirror is in-sync, it is pointless to continue
sending that query around the cluster. We can save network bandwidth and reply
directly back to the kernel. When it comes to status information, there are
two types 'TABLE' and 'INFO'. The 'TABLE' information never changes and
belongs to the group of requests that can be safely short-circuited. The
'STATUS' information can change - and will change if a device fails. Thus it
cannot be short-circuited, but this is exactly what was found. The 'STATUS'
information request was being short-circuited and therefore never reporting the
failure condition to anyone other than the "server" that experienced it
directly.
If two devices in an array failed, it was previously impossible to replace
just one of them. This patch allows for the replacement of some, but perhaps
not all, failed devices.
Thanks to agk for providing the patch that prevents resume from attempting
(and then failing) to create error devices which already exist; having been
created by a corresponding suspend operation.
In some occasional case dmevent restart was experiencing problems
with obtaining pid lockfile. So this patch tries to send several more kill
message until daemon kills itself so there is would reponse.
With this small loop the restart seems to work reliable,
although the loopsize and usleep are just randomly picked for now.
Just to make it clearer since there is the "dmsetup info -c -o blkdevname"
as well that shows the "block device name for this mapping", having a
"BlkDevName" header on output.
It's a bit confusing then if the "dmsetup info -c -o devs_used,blkdevs_used"
is named with a plural "DevNames"/"BlkDevNames" but at the same time having
a totally different meaning than the singular form "BlkDevName".
DevNames --> DevNamesUsed
BlkDevNames --> BlkDevNamesUsed
...makes it much more comprehensible.
The 'mirror' segtype and 'raid1' segtype both set the 'MIRRORED' flag.
However, due to differences in the way these device-mapper targets behave
'mirror' must be suspended with the 'noflush' option and 'raid1' does not
have to be.
This patch ensures that when the 'MIRRORED' flag is checked to see if
'noflush' is needed that it does not also set it for 'raid1' by mistake.
The logic for resuming the original and newly split LVs was not properly
done to handle situations where anything but the last device in the array
was split. It did not take into account the possible name collisions that
might occur when the original LV undergoes the shifting and renaming of its
sub-LVs.
When resizing thin pool - we need to use strip info from _tdata volume.
In future more generic solution will be necessary once we start to support
lvconvert (resize of stacked devices and stay properly aligned).
For now we just allow striped or linear LV so this code will work.
When given lvresize new size - round upward for stripes - unless we use % and
we are at the border of free extents.
This patch is not a complete fix and few more cases will need special care.
Libudev does not provide transactions when querying udev database - once we
get the list of block devices (devices/obtain_device_list_from_udev=1) and
we iterate over the list to get more detailed information about device node
and symlink names used etc., the device could be removed just in between we
get the list and put a query for more info. In this case, libudev returns
NULL value as the device does not exist anymore.
Recently, we've added a warning message to reveal such situations. However,
this could be misleading if the device is not related to the LVM action
we're just processing - the non-related block device could be removed in
parallel and this is not an error but a possible and normal operation.
(N.B. This "missing info" should not happen when devices are related to
the LVM action we're just processing since all such processing should be
synchronized with udev and the udev db must always be in consistent state
after the sync point. But we can't filter this situation out from others,
non-related devices, so we have to lower the message verbosity here for a
general solution.)
various dmeventd plug-ins into a new function called 'dmeventd_lvm2_command',
but the new function did not strip off the "_mlog" extentions that the
mirror plug-in had been doing. This created bug 794904 - failure to replace
devices in a redundant log.
The test suite did catch this scenario because it performs repair tests (mainly)
through the CLI and not dmeventd. It's also not easy to test because the test
itself will hang if the bug is encountered.
When vg_read fails, it internally unlocks VG if it's been locked,
so in error path we should skip unlock_vg for this case.
(user would see ugly internal warning)
Add make help target.
Add LVM_TEST_PARALLEL to support parallel runs of tests
Work around the problem the dmsetup table/info may return error
by using dmtable and dminfo function that will use 'should'.
(Error happens when some concurently running process removes table
entry while dmsetup command resolves table entries inside the loop.)
Activation on remote node should be tried only if it is masked by tags
locally (like when hosttags enabled, IOW activate_lv_excl_local()
doesn't return error.)
Introduced change caused that lvchange -aey succeeded even if volume was
activated exclusively remotely.
Calling vgscan alone should reuse information from the lvmetad (if running).
The --cache option should initiate direct device scan and update lvmetad
appropriately (if running).
This is mainly for vgscan to behave consistently compared to pvscan.
locally or on more nodes while others are activated exclusively.
Current pvmove code can either use local mirror (for exclusive
activation) or cmirror (for clustered LVs).
Because the whole intenal pvmove LV is just segmented LV containing
segments of several top-level LVs, code cannot properly handle
situation if some segment need to be activated exclusively.
Previously, it wrongly activated exclusive LV on all nodes
(locing code allowed it) but now this is no lnger possible.
If there is exclusively activated LV, pvmove is only
possible if all affected LVs are aslo activated exclusively.
(Note that in non-exclusive mode pvmove still activates LVs
on other nodes during move.)
# lvchange -aly vg_test/lv1
# lvchange -aey vg_test/lv2
# pvmove -i 1 /dev/sdc
Error locking on node bar-01: Device or resource busy
Error locking on node bar-03: Volume is busy on another node
...
Failed to activate lv2
In this case we should allow to use local mirror, check for cmirror
should apply only for lvconvert/lvcreate.
Introduced in 2.02.86 by removing !(lv->status & ACTIVATE_EXCL).
(Partially workaround, it is minimalistic patch for now.)
Code adds better support for monitoring of thin pool devices.
update_pool_lv uses DMEVENTD_MONITOR_IGNORE to not manipulate with monitoring.
vgchange & lvchange are checking real thin pool device for existance
as we are using _tpool real device and visible LV pool device might not
be even active (_tpool is activated implicitely for any thin volume).
monitor_dev_for_events is another _lv_postorder like code it might be worth
to think about reusing it here - for now update the code to properly
monitory thin volume deps.
For unmonitoring add extra code to check the usage of thin pool - in case it's in use
unmonitoring of thin volume is skipped.
Make the teardown really usable - it will try down to remove all the left
devices even from previous test runs
(the only missing piece is probably proper mdadm teardown)
Add few more local vars
Try to setup PATH and LD_LIBRARY_PATH just once.
Try shorter sleeps.
Actually restart was failing for different reason - so pass in proper
location of dmeventd for restart from lvm command and avoid using
the one from /sbin location.
Update pv create test with "" around path.
There are kernel drivers (smblk) which set '-1' as their device major number.
This number is listed in /proc/devices then - but the kernel itself is using
just 12 bits - thus device is accessible via 4095 - there is posted patch
for 3.4 to fix this behavior (0 for auto allocation was mean to be used).
However to still allow using such devices with older kernels add some code
to use same behavior - so cut 12 bits from the major number from /proc/devices.
For now use log_warn() - maybe the severity of the message could be lowered
to just verbose level.
Indent
Shell improvements - use internal function for checks
Use PVs in "" (LV and VG cannot have spaces)
Several test very starting 'dmeventd' without annoucing
it via prepade_dmeventd.
Fix some of test actually.
Indent
Better shell usage
Function simplification
More usage of 'get' functions
Don't use valgrind tracing for check and get function (faster)
Update shell debugging (PS4, better stacktrace)
Support paths with spaces
Export SCRIPTNAME for external usage
Watch for dmeventd unexpectedly started during test
Indent
Add valgrind support:
env LVM_TEST_VALGRIND={0123} (the higher level, more commands tested)
env LVM_TEST_CLVMD=1 runs clvmd within valgrind.
env VALGRIND script name executed for each lvm command (def. is valg).
Smarted teardown - should minimize occurence of left dev entries
(using dmsetup remove -f only as last resort)
sort removed devices by open count before actual removal
Use "" around string that may contain spaces.
Set log/verbose and activation/retry_deactivation to defined value.
Remove debug.log after successful lvm command (easier to check output).
Currently we could not test special prefixes in our test suite.
As teardown will not find such device and basicaly busyloops here,
as at cannot remove such names.
This patch adds possibity to use:
vgcreate V_$vg1 $dev
Note: you still need to use $PREFIX somewhere in the name.
(And of course, it's really bad idea to use $PREFIX (=LVMTEST)
for normally used LVs)
The only purpose of this patch is to allow testing cluster with
special vg names that begins with V_ , P_....
Fix propagation of -e option - pass it via internal shell variable.
Fix parsing of /proc/mounts files (don't check for substrings).
as reported by O.Mangold with suggested patch:
https://www.redhat.com/archives/linux-lvm/2012-February/msg00030.html
Properly pass arguments with spaces ("$@")
Add validation for YES and EXTOFF variable content.
When down-converting a RAID1 device, it is the last device that is extracted
and removed when the user does not specify a particular device. However,
when a device is specified (and it is not the last), the device is removed and
the remaining sub-LVs are "shifted down" to fill the hole. This cause problems
when resuming the LV because if the shifted devices were resumed (and thus
renamed) before the sub-LV being extracted, there would be a name conflict.
The solution is to resume the extracted sub-LVs first so that they can be
properly renamed preventing a possible conflict.
This addresses bug 801967.
Update a way we handle option passing - so we now support path and options
with space inside.
Fix dm name usage for thin pools with '-' in name.
Use new lvm.conf option thin_check_options to pass in options as string array.
LISTEN_PID and LISTEN_FDS environment variables are defined only during systemd
"start" action. But we still need to know whether we're activated during
"reload" action as well - we use the reload action to call "dmeventd -R"/"lvmetad -R"
for statefull daemon restart. We can't use normal "restart" as that is simply
composed of "stop" and "start" and we would lose any state the daemon has.
Save some relocation entries and use directly char[].
Since we do not need yes more then 127 partitions per device, use just int8_t.
Move lvm_type_filter_destroy into local static function.
Support timestamping with harness - using VERBOSE=2
Fix also logging in several situation
(i.e. continue logging multiple test in VERBOSE mode,
do not coredump with empty output).
Never return unfinished toolcontext - since error path is hit on
various stages of initialization we cannot leave it partially uninitialized,
since we would need to spread many more test across the code for config_valid.
Instead return NULL and properly release udev library resources as well.
We can't use 'DM_SBIN_PATH'. This one is set only for DM devices but not
for all block devices - the pvscan is run on all relevant block devices!
This LVM_SBIN_PATH (as well as DM_SBIN_PATH) detection should be removed
eventually but for upstream solution, we still have to do that as there are
known cases where the binaries are put either in /sbin or /usr/sbin
(some installation systems, initrd systems etc.).
Fix regression in man page. The chunk size is in kilobyte units on command line
input though in the source code we work with sector size unit
so make it clear in the man page.
Update chunksize for thin pool in man page - it's max value is 1024M == 1G.
Fix warning range message to show proper max value.
If the lvcreate may decide some automagical values for a user,
try to keep the pool metadata size into 128MB range for optimal
perfomance (as suggested by Joe).
So if the pool metadata size and chunk_size were not specified,
try to select such values they would fit into 128MB size.
Use thin_dump --repair suggestion in log error message
and use just warning on deactivation path without repair info
(since node has been deactivated).
Also check whether there is not 16 args for thin_check configured.
Auto mode can't deal with multiple mangled names. We can do that while working
in hex mode, but in auto mode, this would lead to device name ambiguity.
Be more strict when unmangling names on ioctl return - require the name to be
properly mangled in 'auto' and 'hex' mode. There really should not be any
blacklisted character since the names should be renamed already (by means of
renaming it directly or running 'dmsetup mangle' for automatic rename).
Avoid using NULL pointers from udev. It seems like some older versions of udev
were improperly returning NULL in some case, so do not silently break here,
and give at least a warning to the user.
if the thin_check fail on thin pool - still return successful deactivation,
since lvremove would currently fail.
TODO: find some way to not run check with lvremove.
Test if thin_check is present in system and disable its use, when its missing.
Add testing for poolmetadatasize.
FIXME: Allocation policy for metadata pool might need some relaxing.
(For now it needs to put all block on one PV.)
Reduce disc excercise for some test and focus on LVM testing by
using smaller extent size.
Reduce number of teardown_devs calls and use vg/lvremove instead.
Don't sleep for seconds on pvmove.
FIXME: shell/lvconvert-mirror-basic.sh seems to need more checking.
Test fails for smalled extent size then 512k.
Add some hack math to allow 16GB devices to be passed as thinpool metadata.
Since kernel has put in limit to not allow which are just bigger then
some predefined constant in kernel but not matching 16GB so any device bigger
is rejected.
FIXME: Current code still might need more tweaks to be more generic.
Use libdm callback to execute thin_check before activation
thin pool and after deactivation as well.
Supporting thin_check_executable which may pass in extra options for
the tool.
Add 3rd daemon return state "unknown" for lookups that are carried out
successfully but don't find the item requested.
Avoid issuing error messages when it's expected that a device that's
being looked up in lvmetad might not be there.
Move the code for poolmetadatasize operation into one place.
Report override for minimum and maximum size.
Drop _read_thin_params function its error reporting is handled elsewhere.
If no size was give the later added minimal size check efectively
disable this code. Also the argument for size now must be kept
in sector_size, so adding division by SECTOR_SIZE (moved into
a const expression)
Hold global lock in pvscan --lvmetad. (This might need refinement.)
Add PV name to "PV gone" messages.
Adjust some log message severities. (More changes needed.)
F17 is getting rid of OpenAIS libraries (and checkpointing). While the
CPG stuff is staying, some if its constants are being removed. So, we
must adjust and use the remaining constants which the CPG constants were based on.
[~]# egrep 'CPG_DISPATCH_ALL|CPG_OK' /usr/include/*/*
corosync/corotypes.h:#define CPG_DISPATCH_ALL CS_DISPATCH_ALL
corosync/corotypes.h:#define CPG_OK CS_OK
Since lvm seems to call driver_version(NULL, 0) this would lead
to crash. Though the combination of the code is probably very hard to hit.
If the user doesn't supply version buffer, just skip printing to buffer.
If the thin pool has disabled zeroing (created with -Zn), we at least
clear initial 4KiB of such thin volume (provisions 1st block).
If lvcreate is executed with '-an' command will abort (same way like we for
normal LV - however for normal LV option -Zn may skip clearing completely,
for thin volumes this option is not supported (applies only for pools).
The OpenAIS checkpoint library is going away; therefore, cmirrord must
operate without it. The algorithms the handle the timing of when to send
a checkpoint, the determination of what to send, and which ongoing cluster
requests are relevent with respect to the checkpoints are unaffected. We
need only replace the functions that actually perform the storing/transmitting
and retrieving/receiving of the checkpoint data. Rather than store the
checkpoint data in an OpenAIS checkpoint file, we simply transmit it along
with the message that notifies the incoming node that the checkpoint is
ready.
pvcreate gives
WARNING: Ignoring unsupported value for metadata/pvmetadataignore.
It was warning if there is no config file entry instead of only if the node
exists but is empty.
Drop whole buffer clearing (most messages at <100 bytes).
Just make sure we have always \0 terminated string for strlen() operations.
(before for PIPE_BUF sized messages this was not set).
Addressing somewhat tricky bug here.
Since stdin,stdout,stderr were closed it's been occasionally possible to
see some unexpected messages to be flowing into a clvmd and generating some
randomly sized allocation of many megabytes. Since the message was not
being generated by standard send_message() construction, after some more
testing it apperead to be a debug log message - thus something has flown
to local socket opened on strandard out descriptor.
To fix the issue - use standard file descriptor duplication code for daemons.
For making easier debugging of polling daemon - developer might want to recompile
without modifition of standard file descriptors.
This could be seen as some sort of simple validation - it's not easy to
recognize a valid message for now - but we definitely do not want to
allocate a lot of megabytes in clvmd memory locked daemon when broken
message gets in.
Size of 8000 is just selected for now - possibly there could be much
lower value put in.
As API is passing structures by value, do not leave
the function which created buffer and keeps valid pointer
look like it would be some memory leak and move
free of buffer from inner function - makes more obvious,
how is the memory management handled.
Using report_type_t for bitmask is not correct, since we have not defined types
for all bit combinations - so switching to unsigned type, since values of
report_type_t enum are unsigned.
Make sure both hash tables are initialized before _read_sections() call.
Presents no functional change (since PV scan phase was not adding LV hashes),
but makes the code easier to handle mem failing case, and static analyzer is
hapier as well.
Adding at least stack traces with some FIXMEs for cases,
where we might want to do something cleaver - maybe fail command
or give user hints something is not going well ?
For remote_backup is stack probably 'good' enough for now.
We cannot do anything better here anyway - we are already in logging function,
so just ignore this issue here - it will most likely stop application later.
Code cannot proceed if oldname would be NULL.
Since lvmetad currently doesn't use logging mechanism of lvm to report
internal errors - stay with current code style of lvmetad which uses
plain asserts for cases like this.
Why using the order 69:
- Storage processing in general happens in 60-persistent-storage.rules,
including the blkid call that adds some usable information we can use
for filtering and speedup (these rules are part of upstream udev and
the order is preserved on most distros)
- There's still some other storage-related processing done after
60-persistent-storage.rules in general. These might add some detailed
storage-related information we might use to filter devices effectively
(e.g. MD udev rules, ...).
- We need lvmetad rules to be processed before any consumers can use the
output - so the metadata cache is ready soon enough (e.g. udisks rules).
- There's no official (upstream udev) document about assigning the order,
so this number is chosen in best belief it will suit all scenarios.
Should be faster then strncpy - since we could avoid clearing 4KB pages
with each strncpy(...,PATH_MAX).
Also it's easy to check whether string fit - and eventually avoid
to continue working we incomplete string.
If we have good enough glibc to return number of needed chars, do not
loop try to reach good size, but use this size directly for allocation,
saving also last strdup.
Since now we start with 16 bytes - skip buffer realloc for shorter string.
make devices invisible to lvm, but the behaviour of those is slightly different
than of actual missing devices. Running vgscan after re-enabling the device
triggers a metadata repair which is not done by vgremove -ff. This is not a
regression, merely an odd behaviour that has been around even before lvmetad.
The code fail to account for the case where we just need a single device
in a RAID 4/5/6 array. There is no good way to tell the allocation functions
that we don't need parity devices when we are allocating just a single device.
So, I've used a bit of a hack. If we are allocating an area_count that is <=
the parity count, then we can assume we are simply allocating a replacement
device (i.e. no need to include parity devices in the calculations). This
should make sense in most cases. If we need to allocate replacement devices
due to failure (or moving), we will never allocate more than the parity count;
or we would cause the array to become unusable. If we are creating a new device,
we should always create more stripes than parity devices.
/etc/tmpfiles.d directory holds configuration files for temporary/volatile
files and directories that should be automatically managed. For example,
if we have some parts of the fs hierarchy on tmpfs, we'd like to recreate
some files or directories on every boot so they're always prepared for use.
Systemd can read such configuration files. For now, the lock and run directory
are the ones that are most probably placed on tmpfs. If this is the case, we
can install the configuration by 'make install_tmpfiles_configuration'.
There were no messages printed upon completiion of RAID device replacement.
This could cause confusion/concern during automated recovery, because the
user sees the failure messages but no other messages indicating correction.
Built-in blkid is supported since udev v176 - set the UDEV_HAS_BUILTIN_BLKID
variable appropriately so we can use it in the rules to call the built-in
blkid conditionaly.
config tree per PV which is mostly provided by the client, so it can be used to
keep track of things like label_sector, PV format, mda count / offsets and so
on.
Read lvm.conf setting for monitoring for each command. So we should not
activate monitoring if the default compilation is set to monitor during
lvconvert commnads.
Patch also removes check for clustered VG and allows to disable monitoring
for clustered VG with the assumption, the problem with monitoring and dmeventd
flag passing for INGNORE is already fixed.
We don't have anything better yet...
The problems the watch rule caused when removing devices should be covered
now with the "retry remove" logic. It's also better to have this maintained
by us, rather than having this rule anywhere else without proper control.
Device-mapper in kernel uses '\' as escape character so it's better
to double it to avoid any confusion when using existing device names
with '\' in the table specification.
For example:
dmsetup create x --table "0 8 linear /dev/mapper/a\x20b 0"
should pass just fine now without a need to explicitly escape the '\' char
like this:
dmsetup create x --table "0 8 linear /dev/mapper/a\\x20b 0"
If dm_task_get_name or dm_task_get_names gets called, these will return
unmangled form of the names so the name mangling stays totally transparent
to any libdevmapper user (unless DM_STRING_MANGLING_NONE is used in which
case the name is not touched and it is is returned as it is in kernel).
For example:
dmsetup create "a b" - will create a\x20b device in kernel and so udev will
create /dev/mapper/a\x20b
dm_task_get_name/names will still return "a b"
In AUTO mode, the libdevmapper user can still query the device by using
the mangled ("a\x20b") or unmangled form of the name when calling dm_task_set_name.
If mangled name is provided, it's detected and the name is kept as it is.
If unmangled name is provided, it will be mangled. IOW in AUTO mode it's
totally transparent and it should not require any changes in the code
using libdevmapper.
However, any libdevmapper user must be aware of the fact that the mangled form
of the name appears in /dev/mapper (udev just can't deal with those blacklisted
characters).
- rename the hashes to be explicit about the mapping
- add VG/PV listing calls to the protocol
- cache slightly more of the per-PV state
- filter cached metadata
- compare the metadata upon metadata_update
dm_task_get_name_mangled will always return mangled form of the name while
the dm_task_get_name_unmangled will always return unmangled form of the name
irrespective of the global setting (dm_set/get_name_mangling_mode).
This is handy in situations where we need to detect whether the name is already
mangled or not. Also display functions make use of it.
Use the DEV_NAME macro to use the mangled form of the name if present,
use normal name otherwise (we store both forms - mangled and unmangled in
struct dm_task). Mangled form should be always preferred over unmangled
with the exception of the situations where we divide one task into several
others (like "create and load") - we need to avoid mangling the name twice
(because of multiple dm_task_set_name calls)!
If dm_task_set_name/newname is called, the name provided will be
automatically translated to correct encoded form with the hex enconding
so any character not on udev whitelist will be mangled with \xNN
format where NN is hex value of the character used.
By default, the name mangling mode used is the one set during
configure with the '--with-default-name-mangling' option.
This option configures the default name mangling mode used, one of:
AUTO, NONE and HEX.
The name mangling is primarily used to support udev character whitelist
(0-9, A-Z, a-z, #*-.:=@_) so any character that is not on udev whitelist
will get translated into an encoded form \xNN where NN is the hex value
of the character.
It was not possible to pass down the DM_[FORCE|NO]SYNC flags to
'dm_tree_node_add_raid_target'. This meant that converting to 'raid1' from
'mirror' would cause a full resync. (It also meant that '--nosync' was
ineffective when creating a 'raid1' LV.)
I've taken the 'reserved' parameter in 'dm_tree_node_add_raid_target' and
used it for the "flags" parameter. Now it is possible to pass the sync
flags and any other flags that may come up.
s/Issue/Use/, otherwise it is easy to misread "Issue" as "Issuing" - causing
the user confusion as to whether the action was performed automatically or
whether they need to issue the command.
'_lv_update_log_type' takes a lvconvert_params argument so that it can pass
down the user's preference of 'region_size' and allocation_policy. When
'mirror_remove_missing' was introduced (commit ID
95986e42a1) it didn't make sense to pass down
user preferences - so NULL was given instead. While it may never happen in
practice, static analysis reveals that this argument could be dereferenced.
So, if the user preferences were not passed in, glean the necessary fields
from what is already set in the LV.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
(Not updating WHATSNEW for this simple clean-up.)
Commit 02f6f4902f introduced a bug that caused
RAID devices to fail to activate if the device for a single sub-LV failed.
The special case of LVM mirror was handled, but not LVM RAID.
EXAMPLE:
[root@bp-01 ~]# devices vg
LV Copy% Devices
lv 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] /dev/sde1(1)
[lv_rimage_1] /dev/sdh1(1)
[lv_rmeta_0] /dev/sde1(0)
[lv_rmeta_1] /dev/sdh1(0)
[root@bp-01 ~]# vgchange -an vg
0 logical volume(s) in volume group "vg" now active
[root@bp-01 ~]# off.sh sdh
Turning off sdh
[root@bp-01 ~]# vgchange -ay vg --partial
Partial mode. Incomplete logical volumes will be processed.
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
Cannot activate vg/lv_rimage_1: all segments missing.
0 logical volume(s) in volume group "vg" now active
AFTER this patch:
[root@bp-01 ~]# vgchange -ay vg --partial
Partial mode. Incomplete logical volumes will be processed.
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
1 logical volume(s) in volume group "vg" now active
[root@bp-01 ~]# devices vg
Couldn't find device with uuid fbI0YO-GX7x-firU-Vy5o-vzwx-vAKZ-feRxfF.
LV Copy% Devices
lv 100.00 lv_rimage_0(0),lv_rimage_1(0)
[lv_rimage_0] /dev/sde1(1)
[lv_rimage_1] unknown device(1)
[lv_rmeta_0] /dev/sde1(0)
[lv_rmeta_1] unknown device(0)
[root@bp-01 ~]# dmsetup table vg-lv; dmsetup status vg-lv
0 1024000 raid raid1 3 0 region_size 1024 2 253:2 253:3 - -
0 1024000 raid raid1 2 AD 1024000/1024000
No WHATSNEW update necessary because this is an intrarelease fix.
brassow
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
In case of zero bytes would be read from sysfs, it would store '\0' on
temp_buf[-1] address.
Simplify some buffer length calculation and use strcpy if we've just
checked string fits in give buffer.
Replace jump label error: with bad: commonly used in libdm.
Replace asserts with test for failing memory allocation.
Add at least stack traces.
Index counter starts from 1 (0 reserved for error), so replacing fingerprint.
label_exit() is called destroy_toolcontext() and we are now
using standard dm_list function for destroy, we have to make sure
dm_list gets initialized properly.
Tested condition has been already evaluated before
For strlen() code has already excluded <ID_LEN.
For repairing, already tested (!argc && !repairing) before.
As atoi may return negative value - test for both limits.
Test log_args for limits before calling alloca().
Code from dmeventd mirror plugin should probably share same code as
we have in mirrored.c.
Since the function dm_get_next_target() returns NULL as 'next' pointer
so it's not a 'real' error - set 0 to all parameters when NULL is
returned because of missing head.
i.e. one of use case::
do {
next = dm_get_next_target(dmt, next, &start, &length,
&target_type, ¶ms);
size += length;
} while (next);
Since the function dev_close() has code path, which really could close
file (for unlocked vg) and destroy dev handler, stay on safe side and move
the close few lines later, even our current use case shouldn't trigger
such scenario.
Count number of error and existing areas and if there is no existing area
for the LV avoid its activation.
Always disable partial activatio for thin volumes.
For mirrors currently put in hack to let it pass with a special name
since current mirror code needs to activate such LV during some operations.
For reading % of mapped size of thin volume use as origin for
old style snapshot '-real' device needs to be queried.
Fix log_error report given for lvs -a in this case.
Before removing thin pool LV always make sure, stacked message
for previous run are cleared - but allow to remove any
device that should have been created
(i.e. creation of snapshot failed - so the message for snapshot creation
may be replaced with delete message within unfinished transaction).
Also commit messages after lv remove - so free space is released in pool.
Extend the usage of origin_only flag to allow resume of thin pool LV
(when it's active) to pass only the messages.
origin_only flag will skip detection of already resumed tree for thin_pool,
so we do not need to suspend the tree and we just send messages.
Pass in the origin_only flag also for thin volumes - but curently the flag
is not used to its best.
FIXME: achieve the state where only thin volume snapshot origin is
suspended without its childrens - let's explore whether this may
happen automatically inside libdm (might be generic for other targets).
So the code would not need to annotate the node for this.
Extend lv_activate_opts with bool flag to know for which purpose
dtree is created - and add message only for activation tree
(since that's the only place that may send them).
Extend validation check for thin snapshot creation and test whether
active snapshot origin is suspended before its snapshot is created
(useful in recover scenarios) - in this case also detect, whether
transaction has been already completed and avoid such suspend check
failure in that case.
Add pool_has_message and use it in attach_pool_message.
Also update header to make more obvious which segment type is
expected as parameter.
Rename 'read_only' to 'no_update' (no auto update transaction_id)
to better fit how it's used.
Fix problem when there was only one stacked message replaced with delete
message that caused unwanted transaction_id increase.
Using PRELOAD part would lead to problems when the problem
would happen before vg_write and vg_commit.
Also this change is necessary for snapshot creation sequence.
Failure to do so results in "Performing unsafe table load while X device(s) are
known to be suspended" errors. While fixing the problem in this way works and
is consistent with the way the mirror segment type does it, it would be nice
to find a solution that uses the generic suspend/resume calls.
Also included in this check-in are additions to the test suite that perform
conversions on RAID LVs under a snapshot. These tests are disabled for the
time being due to a kernel bug that is yet to be tracked down.
Similar to the "mirror" segment type's log device, _add_dev_to_dtree should
be called and not _add_lv_to_dtree when adding metadata sub-LVs to the deptree.
Since _add_lv_to_dtree was being called, 'origin_only' could be set if a
snapshot sits on top of the RAID device. This would cause the actual device
that needed to be added to be skipped in favor of the non-existant device,
"<foo>-real".
Reformat name and path how the LV is represented with lvm1 compatible option,
to switch to the old way - which had number of problem - i.e. many links
do not exist - since for private devices we are not creating them.
Add more info about thin pools and volumes.
Since striped name function knows when to report 'linear' instead of
'stripe' type name - drop it from this place.
This fixes problem when reporting segtype e.g. for thin-pool which
is also using area_count=1 to store thin data device reference.
It also returns properly strduped memory instead of badly casted const char*.
This patch to the suspend code - like the similar change for resume -
queries the lock mode of a cluster volume and records whether it is active
exclusively. This is necessary for suspend due to the possibility of
preloading targets. Failure to check to exclusivity causes the cluster target
of an exclusively activated mirror to be used when converting - rather than
the single machine target.
Since snapshot needs to suspend origin - it might lead to pool userspace
deadlock (as the pool will wait for new space in case it would be overfilled,
but dmeventd would not be able to resize it, as the lvcreate operation would
have kept the VG lock.)
To minimize the risk of such scenario - we prevent to create new snapshot
in case we are over the threshold - but beware, there is still small timewindow,
so keep threshold at some reasonable level!
New field Data% is able to display info about
thin_pool, thin, snapshot and has generic meaning here.
Simple Time/Host field are here to display host and time creation.
Basic support to keep info when the LV was created.
Host and time is stored into LV mda section.
FIXME: Current version doesn't support configurable string via lvm.conf
and used fixed version strftime "%Y-%m-%d %T %z".
This value returns percentage of 'mapped' size compared with total LV size.
(Without passed seg pointer it return highest mapped size - but it's
not used yet.)
used to be a few mis-ordered memory accesses (release and access in the next
block). Fix that set_flag could have sometimes corrupted the flags being
modified.
A few issues with metadata tracking are sorted out as well now, and there are
only a few problems remaining before we can integrate lvmetad, mostly on the
client side:
- metadata areas need to be tracked in lvmetad (most likely to be addressed
through an extension of metadata, meaning no special support in lvmetad would
be needed)
- non-udev scanning code needs to be taught about telling lvmetad about device
disappearance (pvscan most importantly)
- this last item also needs to mesh with metadata inconsistencies and
suddenly-incomplete volume groups (aux disable_dev in tests); udev-based
scanning should address this separately and more elegantly
Add 'blkdevname' and 'blkdevs_used' field to dmsetup info -c -o.
Add 'blkdevname' option to dmsetup ls --tree to see block device names.
Add '-o options' to dmsetup deps and ls to select device name type on output.
This is accomplished by reading associated sysfs information. For a dm device,
this is /sys/dev/block/major:minor/dm/name (supported in kernel version >= 2.6.29,
for older kernels, the behaviour is the same as for non-dm devices).
For a non-dm device, this is a readlink on /sys/dev/block/major:minor, e.g.
/sys/dev/block/253:0 --> ../../devices/virtual/block/dm-0.
The last component of the path is a proper kernel name (block device name).
One can request to read only kernel names by setting the 'prefer_kernel_name'
argument if needed.
LVM- prefix.
Try harder not to leave stray empty devices around (locally or remotely) when
reverting changes after failures while there are inactive tables.
If we know major:minor number of device (which is known after resume) we will
try to use sysfs to set/get read ahead parameters of device.
This avoid potential problem of blocking commands like 'dmsetup info' awaiting
for device being usable for open/close - i.e. overfilled thin pool may block
such command.
We want to keep this logic -
when LV is extend - extend the LV by at least given amount,
when LV is reduced - reduce the LV by at most given amount.
So for this the rounding needs to be used.
Current logic which seems to satisfy give rule is to round up all
extent values for LV resize upward except for values with '-' sign
that are round downward.
This patch also fixes the problem when lvextend --use-polices tried
to extend LV the by i.e. 20% - but the resulting 20% were smaller
the extent size thus before this patch no extension happened.
For snapshot, prepare whole command in front into private buffer.
Add also some missing '\n' for syslog messages.
For raid and mirror only convert creation of command line string.
This should avoid any unbound growth of mempool for dm_split_names.
Fix some minor outstading issue from thin plugin introduction -
Call dmeventd_lvm2_exit() in failpath for registration.
Add some missing '\n' in syslog messages.
Since the !(dev->flags & DEV_REGULAR) code path just called
dev_name_confirmed() which has just called 'stat()' inside,
remove duplicate second stat() call here.
When both path have identical prefix i.e. /dev/disk/by-id
skip 2 x lstat() for /dev /dev/disk /dev/disk/by-id
and directly lstat() only different part of the path.
Reduces amount of lstat calls on system with lots of devices.
Add dm_get_status_thin_pool and dm_get_status_thin functions to
parse 'params' argument which is received via dm_get_next_target.
Returns filed structure allocated from given mempool.
The RAID plug-in for dmeventd now calls 'lvconvert --repair' to address failures
of devices in a RAID logical volume. The action taken can be either to "warn"
or "allocate" a new device from any spares that may be available in the
volume group. The action is designated by setting 'raid_fault_policy' in
lvm.conf - the default being "warn".
Also, don't allow a splitmirror operation on a RAID LV that is already tracking
a split, unless the operation is to stop the tracking and complete the split.
Example:
~> lvconvert --splitmirrors 1 --trackchanges vg/lv /dev/sdc1
# Now tracking changes - image can be merged back or split-off for good
~> lvconvert --splitmirrors 1 -n new_name vg/lv /dev/sdc1
# ^ Completes split ^
If a split is performed on a RAID that is tracking an already split image and
PVs are provided, we must ensure that
1) the already split LV is represented in the PVs
2) we are careful to split only the tracked image
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
LVM metadata knows only of striped segments - not linear ones.
The activation code detects segments with a single stripe and switches
them to use the linear target.
If the new lvm.conf setting is set to 0 (e.g. in a test script), this
'optimisation' is turned off.
Cleanup generated files from coverage testing.
Do not skip standard .o compilation for lib/not and lib/harness.
Make a bit longer string in harness to fit new shell/ in.
Simplify /api makefile and use SUBDIRS target for test dir.
Properly cleanup Makefiles with distclean in /test.
Use symbolic links for shell scripts for non-srcdir compilation.
Use gcc warning options only for .c -> .o compilation
So it makes the output more clear.
Do not use INCLUDES and DEFS for .o -> .so.
Do not use CFLAGS for deps .d generation.
bitset_t.c:39: warning: 'last' may be used uninitialized in this function
Compiler is not smart enough to see the code path which avoid using
unitialized 'last'.
tests from unit-tests/*/*_t.c (now under test/unit). The valgrind/pool test is
missing, since it's not really a unit test and probably not too valuable
either. Available via "make unit" (and if --enable-testing was passed to
configure, also executed by make check).
Remove FIXMES - there should not be any pool free call since
the memory pool is from device manager, and pool is detroyed
after the operation, so doing extra free here would not help here.
However lv_has_target_type() is using cmd mempool so here the extra
call for dm_pool_free makes sence.
"result_independent_of_operands: ((dev->dev & 0xfff00UL) >> 8) ==
18446744073709551615UL /* -1 */ is always false regardless of the values
of its operands (logical operand of if)."
'dev->dev' is set in dev-cache.c _insert() and it's not expectable
st_rdev would have '-1'
This code has been introduced with drbd support commit and code never
worked - so eliminated.
Avoid creation of target type name when it's longer then
DM_MAX_TYPE_NAME (noticed by static analyzer where the
sp.target_type might be missing '\0' at the end.)
Before patch:
$> dmsetup create long
0 1000 looooooooooooooooooooooooooong
^D
device-mapper: reload ioctl failed: Invalid argument
After patch:
$> dmsetup create xxx
0 1000 looooooooooooooooooooooooooong
Target type name looooooooooooooooooooooooooong is too long.
Command failed
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
For write we do not need to hold memory locked.
This relaxes many conditions and avoid problems when allocating
a lot of memory for writting metadata buffers.
(In case of huge MDA size this would lead to mismatch between
locked and unlocked memory region size).
Add also internal check we are not writing in critical section.
Removal of an inactive origin removes also all related snapshots.
When we now support 'old' external snapshots with thin volumes,
removal of pool will not only drop all thin volumes, but as
a consequence also all snapshots - which might be seen a bit
unexpected for the user - so add a query to confirm such action.
lvremove -f will skip the prompt.
Update region_size only for mirror and raid targets.
This fixes warning messages when vg is using small
extent size like 1KiB and no mirror/raid is created,
but the user still got the message:
$> vgcreate -s 1K vg <pvs>
$> lvcreate -L10K vg
Using reduced mirror region size of 4 sectors
Since we support snapshots of thin volumes, we could have more layers,
so we have to check whether tpool layer is going to be inserted.
As the _add_segment_to_dtree() is the only place that adds tpool
segment, we may just check pointer (no strcmp for layer).
Switch to use seg_is_ function instead of lv_is_.
When a PV label write is deferred to a vg_write call (as introduced by a patch
in 2.02.86), the PV is flagged with the internal UNLABELLED_PV flag. However,
when calling vg_archive before vg_write, we still have the PV labelled with the
UNLABELLED_PV flag which was not recognised as a proper flag while exporting
VG metadata:
# vgcreate vg /dev/sda
No physical volume label read from /dev/sda
Metadata inconsistency: Not all flags successfully exported.
Metadata inconsistency: Not all flags successfully exported.
Writing physical volume data to disk "/dev/sda"
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
udev may also need to be disabled if you didn't build it statically too.
dmeventd.static could be fixed with some more work but I don't really see the
point: without dlopen() it's useless, and if you have dlopen(), why not support
normal shared libraries too?
Any documentation less-detailed than Documentation/device-mapper is
dangerous for the non-trivial ctr lines. And anyway, this should be in s4
not here.
Use standard manpage style.
Keep options and commands in alphabetic order.
Added at least a very simply info about some other targets.
TODO: documenting targest needs far more work...
Remove DM_THIN_ERROR_DEVICE_ID from API.
Remove API warning.
Drop code that was using DM_THIN_ERROR_DEVICE_ID (already commented)
Remove debug message which slipped in through some previous commit.
Add filter which tries to check if scanned device is part
of active multipath.
Firstly, only SCSI major number devices are handled in filter.
Then it checks if device has exactly one holder (in sysfs) and
if it is device-mapper device and DM-UUID is prefixed by "MPATH-".
If so, this device is filtered out.
The whole filter can be switched off by setting
mpath_component_detection in lvm.conf.
https://bugzilla.redhat.com/show_bug.cgi?id=597010
Signed-off-by: Milan Broz <mbroz@redhat.com>
If the extent_size is smaller then the chunk_size we may try
to find better aligment (wasting less space).
i.e. using 4KB extent_size and 64KB chunk size will
lead to creation of 64KB aligned thin volume.
Since we finaly recognize thin creation only after
_determine_snapshot_type() - move _read_activation_params()
after it - so we can support lvcreate -an thin snapshot.
Always make sure table gets reloaded.
For now activate and deactivate pool volume if it's not active.
FIXME: we could do this only if we are sure some thin volume is alive.
that we can also test clustered volume groups (vgcreate -c y) in the test
suite. Unfortunately we can't make this the testing default since cluster
mirrors require further infrastructure, and snapshots probably don't work at
all. I'll eventually add a few test cases that create clustered VGs
specifically.
Support allocation of metadata from the same PV, if the VG
is build only from one PV.
As thinp is not mirror - we do not require 2 PVs
for basic thin usage as user is losing only perfomance.
Since activation of pool is now independent on thin activation,
user may do whatever he needs - thought preferable thin should stay alive,
but it it will be found inactivate, update_pool will bring the pool up.
Replace detach_pool_messages with update_pool_lv.
Move creation code from to 'if' condition into 1.
Ensure creation has finished all previous message operations.
Let's put the overlay device over real thin pool device.
So we can get the proper locking on cluster.
Overwise the pool LV would be activate once implicitely
and in other case explicitely, confusing locking mechanism.
This patch make the activation of pool LV independent on
activation of thin LV since they will both implicitely use
real -thin pool device.
A little code shuffling and adding support for
DM_THIN_ERROR_DEVICE_ID which might be eventually be used
for activation of thin which is going to be deleted.
For now we do not need it lvm.
Normally, restart simply means "stop and start" for systemd. However, if
we're installing new versions of the dmeventd binary/libdevmapper, we need
to restart dmeventd. This fails if we have some devices monitored - we need
to call "dmeventd -R" instead.
The "ExecReload" did not work quite well in some old versions of systemd,
systemd assumed that only the configuration is reloaded on "ExecReload",
not the whole binary itself so it lost track of dmeventd daemon (it lost new
dmeventd PID). This is fixed and seems to be working fine now with recent
versions of dmeventd.
All thins are created with the next activation and VG is updated
without messages. Only some basic commands works.
(i.e. lvcreate -an -V10 -T mvg/pool)
There can be some combination to confuse this system.
This functionality for snapshots is going to be interesting.
Add a new node flag send_messages that is used to simplify
test when to call _node_send_messages().
Add call to _node_send_messages when pool is deeper in the tree.
If something fails during creation of thin LV remove such LV
and deactivate in case it's been already tried to activate
(i.e. thin kernel driver fails for some reason.)
To ensure we properly handle LV cluster locking - explicitely do
not allow to change the availability of the thin pool that is in use
for some thin LV.
As soon as the thin volume is created the only way to activate pool
is via implicit dependency.
Ignore thinpool open count for lv/vgchange operations.
Before adding a new virtual segment to LV, check first whether
the last segment isn't already of the same type. In this case
extend last segment instead of creating the new one.
Thin volumes should have always only 1 virtual segment, but it
helps also to virtual snapshot or error segtype..
There should be no need for retry for our internal devices - it would be hinding
our own bug in the tree processing.
Update error messages to show also also device name.
No WHATS_NEW - in release fix.
Git commit ID 0864378250 was meant to disallow
'mirrored' logs for cluster mirrors. However, when add_mirror_log is used
to create the log (as is now the case when using 'lvcreate' or converting only
the log) the check is bypassed.
This patch adds the check to add_mirror_log.
pvck prints 'extra' character from the label since there is no '\0'
after the struct label entry and just uint64_t follows directly.
So avoid it by limiting 8 chars to be printed.
https://www.redhat.com/archives/lvm-devel/2011-January/msg00109.html
Signed-off-by: Paul Bolle <pebolle tiscali nl>
grep need -F to check what we really want to test.
Add better test for existing device.
Currently this test DOES NOT work with real /dev handle via udev
since our tool does not see such device listet through udev.
FIXME: We might be able to see it at least through dmsetup table and
use for lvm.
Since thin is not able to use _lv_insert_empty_sublvs,
remove its appearence from this function.
Start to use extend_pool() function for desired functionality
and modify lv_extend() for this.
Code in _lv_insert_empty_sublvs was not able to provide proper
initialization order for thin pool LV.
New function extend_pool() first adds metadata segment to pool LV which
is still visible. Such LV is activate and cleared.
Then new meta LV is created and metadata segments are moved there.
Now the preallocated pool data segment is attached to the pool LV
and layer _tpool is created. Finaly segment is marked as thin_pool.
This function could be useful for other _manip source files.
Use dm_list manipulation function for provided functionality,
which make the code more readable and avoid touching list
internal details here.
This was the intended behaviour, as described in the lvchange man page, so you
have complete control through volume_list in lvm.conf, but the code seems to
have been treating -ae as local-only for a very long time.
When DEBUG_MEM is used, the memory is trashed with extra pattern before real
free() is called, and as this memory was marked as non accessible when used with
valgrind, make it again usable.
Couple FIXMEs put into the code for parts of the code which may be
improved later, since we might be able to add 'lazy' device creation later.
For now require exclusive activation.
Split syntax for thin-pool since it cannot be fully matched with snapshot.
So to avoid more confusion - take thin support into separate line.
Though still significant updates are needed for thin provisioning.
Certain errno codes could be expected in some situations thus
add experimental support for them.
When expected errno is set after ioctl error - function skips error
printing and exits succefully.
Currently only useful for thin pool messages.
monitoring state of the logical volumes they are currently acting on.
Until now, every time a logical volume has been changed by a dmeventd plugin,
this plugin would have called back to dmeventd through the external FIFO
mechanism. I am fairly sure this was superfluous, inefficient and possibly even
dangerous.
The '--merge' option to lvconvert works on snapshots and RAID1. The man
pages correctly reflect this, but the CLI help output still used the term,
'SnapshotLogicalVolume'.
lvm part of messaging.
Each message is now stored it's own thin pool section:
message1 {
create = lv
}
Messages are queued to thin pool dm target when this target
is going to be resumed or used through some dependency.
Currently 'delete' message are purely queued and processed
with next thin pool resume operation (i.e. create_thin).
WARNING - thin provisioning support is developmental code.
Version 2 of the userspace log protocol accepts return information during the
DM_ULOG_CTR exchange. The return information contains the name of the log
device that is being used (if there is one). The kernel can then register the
device via 'dm_get_device'. Amoung other things, this allows for userspace to
assemble a correct dependency tree of devices - critical for LVM handling of
suspend/resume calls.
Also, update dm-log-userspace.h to match the kernel header associated with
this protocol change. (Includes a version inc.)
The upstream kernel version that this file mirrors has changed, here is the
commit message:
commit 86a54a4802df10d23ccd655e2083e812fe990243
Author: Jonathan Brassow <jbrassow@redhat.com>
Date: Thu Jan 13 19:59:52 2011 +0000
dm log userspace: add version number to comms
This patch adds a 'version' field to the 'dm_ulog_request'
structure.
The 'version' field is taken from a portion of the unused
'padding' field in the 'dm_ulog_request' structure. This
was done to avoid changing the size of the structure and
possibly disrupting backwards compatibility.
The version number will help notify user-space daemons
when a change has been made to the kernel/userspace
log API.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When verify_udev_operations was disable, code for stacking fs operation for
lvm links was completely disable - but this code was also used for collecting
information, that a new node is being created.
Add a new flag which is set when a creation of lv symlinks is requested which
should restore old behaviour of lv_info function, that has called fs_sync()
before quere for open count on device.
Code here is using thread write protected variable without locking.
So add locking, for proper synchronization and a FIXME, since the
code needs closer look.
Barrier is supposed to be used in situation like this
and replace tricky mutex usage, where mutex has been unlocked
by a different thread than the locking thread.
Usage of thread unprotected init_test is not correct and needs probably lvm lock
since it part of lvm library. Current implementation may probably fail with
test mode and actually create something unexpectedly (and vice versa).
Since default thread stack size is around 8MB and clvmd creates for now thread
for message, clvmd may easily reach multi GB size of in-memory locked pages
(runs with mlockall()).
This patch significantly reduces memory usage to just tens of MB,
and now different reasons are the cause of server overloading.
Now we are running out of free file descriptors mostly.
Go with just 64KiB for stack.
Closer inspection should be made, whether we actually need to play with
settings at all.
Since default stack size is 8MB and gets mapped via page locking thus,
it seems there is no big help with preallocation of stack to some value.
Properly detect if the filters were refreshed properly.
(May needs few more fixes ??)
Filter refresh may fail because it may be out of free file descriptors
when clvmd gets overloaded.
Replace usleep with pthread condition to increase speed testing
(for simplicity just 1 condition for all locks).
Use thread mutex also for unlock resource (so it wakes up awaiting
threads)
Better check some error states and return error in fail case with
unlocked mutex.
Using log_warn to report missing symlinks as warning, since the command
itself returns as successful, we should not produce log_error().
log_warn is better fit here.
Cosmetic, since r is already 0 for the error path, no need to assign it there,
and r is assigned to 1 after switch command.
Also makes the code more readable.
Workaround for the current code with big FIXME,
since proper solution for pvmove needs to be developed.
Commiting this only for the purpose to get cluster testing covered.
Example:
~> lvconvert --type raid1 vg/mirror_lv
Steps to convert "mirror" to "raid1"
1) Allocate a RAID metadata LV for each mirror image from the same PVs
on which they are located.
2) Clear the metadata LVs. This involves writing LVM metadata, so we don't
change any aspects of the mirror LV before this so that the user can easily
remove LVs from the failed convert attempt while retaining the original
mirror.
3) Remove the mirror log, if it exists.
4) Add metadata LVs to mirror LV
5) Rename mirror sub-lvs (s/mimage/rimage/)
6) Change flags and segtype from mirror to raid1
Example:
~> lvconvert --type raid1 -m 1 vg/lv
The following steps are performed to convert linear to RAID1:
1) Allocate a metadata device from the same PV as the linear device
to provide the metadata/data LV pair required for all RAID components.
2) Allocate the required number of metadata/data LV pairs for the
remaining additional images.
3) Clear the metadata LVs. This performs a LVM metadata update.
4) Create the top-level RAID LV and add the component devices.
We want to make any failure easy to unwind. This is why we don't create the
top-level LV and add the components until the last step. Should anything
happen before that, the user could simply remove the unnecessary images. Also,
we want to ensure that the metadata LVs are cleared before forming the array to
prevent stale information from polluting the new array.
A new macro 'seg_is_linear' was added to allow us to distinguish linear LVs
from striped LVs.
This patch allows a mirror to be extended without an initial resync of the
extended portion. It compliments the existing '--nosync' option to lvcreate.
This action can be done implicitly if the mirror was created with the '--nosync'
option, or explicitly if the '--nosync' option is used when extending the device.
Here are the operational criteria:
1) A mirror created with '--nosync' should extend with 'nosync' implicitly
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv ; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 10.00g lv_mlog 100.00
2) The 'M' attribute ('M' signifies a mirror created with '--nosync', while 'm'
signifies a mirror created w/o '--nosync') must be preserved when extending a
mirror created with '--nosync'. See #1 for example of 'M' attribute.
3) A mirror created without '--nosync' should extend with 'nosync' only when
'--nosync' is explicitly used when extending.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 5.02g lv_mlog 0.39
vs.
[EXAMPLE]# lvs vg; lvextend -L +5G vg/lv --nosync; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg mwi-a-m- 20.00m lv_mlog 100.00
Extending 2 mirror images.
Extending logical volume lv to 5.02 GiB
Logical volume lv successfully resized
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.02g lv_mlog 100.00
4) The 'm' attribute must change to 'M' when extending a mirror created without
'--nosync' is extended with the '--nosync' option. (See #3 examples above.)
5) An inactive mirror's sync percent cannot be determined definitively, so it
must not be allowed to skip resync. Instead, the extend should ask the user if
they want to extend while performing a resync.
[EXAMPLE]# lvchange -an vg/lv
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv is not active. Unable to get sync percent.
Do full resync of extended portion of vg/lv? [y/n]: y
Logical volume lv successfully resized
6) A mirror that is performing recovery (as opposed to an initial sync) - like
after a failure - is not allowed to extend with either an implicit or
explicit nosync option. [You can simulate this with a 'corelog' mirror because
when it is reactivated, it must be recovered every time.]
[EXAMPLE]# lvcreate -m1 -L 5G -n lv vg --nosync --corelog
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "lv" created
[EXAMPLE]# lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 100.00
[EXAMPLE]# lvchange -an vg/lv; lvchange -ay vg/lv; lvs vg
LV VG Attr LSize Pool Origin Snap% Move Log Copy% Convert
lv vg Mwi-a-m- 5.00g 0.08
[EXAMPLE]# lvextend -L +5G vg/lv
Extending 2 mirror images.
Extending logical volume lv to 10.00 GiB
vg/lv cannot be extended while it is recovering.
7) If 'no' is selected in #5 or if the condition in #6 is hit, it should not
result in the mirror being resized or the 'm/M' attribute being changed.
NOTE: A mirror created with '--nosync' behaves differently than one created
without it when performing an extension. The former cannot be extended when
the mirror is recovering (unless in-active), while the latter can. This is
a reasonable thing to do since recovery of a mirror doesn't take long (at
least in the case of an on-disk log) and it would cause far more time in
degraded mode if the extension w/o '--nosync' was allowed. It might be
reasonable to add the ability to force the operation in the future. This
should /not/ force a nosync extension, but rather force a sync'ed extension.
IOW, the user would be saying, "Yes, yes... I know recovery won't take long
and that I'll be adding significantly to the time spent in degraded mode, but
I need the extra space right now!".
This patch also does some clean-up of the splitmirrors code.
I've attempted to clean-up the splitmirrors code to make it easier to
understand with fewer operations. I've tried to reduce the number of
metadata operations without compromising the intermediate stages which
are necessary for easy clean-up in the even of failure.
These changes now correctly handle cluster situations - including exclusive
cluster mirrors. Whereas before, a splitmirror operation would result in
remote nodes having LVM commands report the newly split LV with a proper
name while DM commands would report the old (pre-split) names of the device.
IOW, there was a kernel/userspace mismatch.
The original commit comments can be located via this git commit ID:
7d8e615c0b
There were three possible solutions to the original problem proposed in the
initial check-in. The one chosen was as follows:
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it doesn't
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
Turns out, the cluster also views the extra suspend/resume operations as
pointless too and ignores them. So, this solution doesn't work in a cluster.
Further, I've noticed that in addition to the remote cluster nodes still getting
I/O errors from scanning the error target, they also have a different LVM and
DM views of the same LV. IOW, while the LVM level (gotten from the LVM metadata)
sees the correct name for the newly split LV, device-mapper still maintains the
old names.
Because the original fix failed to completely fix the problem (or work-around it)
and because a better solution must be found to address the additional cluster
issue of device renaming, I am reverting the above mentioned commit.
The current code does not always assign proper udev flags to sub-LVs (e.g.
mirror images and log LVs). This shows up especially during a splitmirror
operation in which an image is split off from a mirror to form a new LV.
A mirror with a disk log is actually composed of 4 different LVs: the 2
mirror images, the log, and the top-level LV that "glues" them all together.
When a 2-way mirror is split into two linear LVs, two of those LVs must be
removed. The segments of the image which is not split off to form the new
LV are transferred to the top-level LV. This is done so that the original
LV can maintain its major/minor, UUID, and name. The sub-lv from which the
segments were transferred gets an error segment as a transitory process
before it is eventually removed. (Note that if the error target was not put
in place, a resume_lv would result in two LVs pointing to the same segment!
If the machine crashes before the eventual removal of the sub-LV, the result
would be a residual LV with the same mapping as the original (now linear) LV.)
So, the two LVs that need to be removed are now the log device and the sub-LV
with the error segment. If udev_flags are not properly set, a resume will
cause the error LV to come up and be scanned by udev. This causes I/O errors.
Additionally, when udev scans sub-LVs (or former sub-LVs), it can cause races
when we are trying to remove those LVs. This is especially bad during failure
conditions.
When the mirror is suspended, the top-level along with its sub-LVs are
suspended. The changes (now 2 linear devices and the yet-to-be-removed log
and error LV) are committed. When the resume takes place on the original
LV, there are no longer links to the other sub-lvs through the LVM metadata.
The links are implicitly handled by querying the kernel for a list of
dependencies. This is done in the '_add_dev' function (which is recursively
called for each dependency found) - called through the following chain:
_add_dev
dm_tree_add_dev_with_udev_flags
<*** DM / LVM divide ***>
_add_dev_to_dtree
_add_lv_to_dtree
_create_partial_dtree
_tree_action
dev_manager_activate
_lv_activate_lv
_lv_resume
lv_resume_if_active
When udev flags are calculated by '_get_udev_flags', it is done by referencing
the 'logical_volume' structure. Those flags are then passed down into
'dm_tree_add_dev_with_udev_flags', which in turn passes them to '_add_dev'.
Unfortunately, when '_add_dev' is finding the dependencies, it has no way to
calculate their proper udev_flags. This is because it is below the DM/LVM
divide - it doesn't have access to the logical_volume structure. In fact,
'_add_dev' simply reuses the udev_flags given for the initial device! This
virtually guarentees the udev_flags are wrong for all the dependencies unless
they are reset by some other mechanism. The current code provides no such
mechanism. Even if '_add_new_lv_to_dtree' were called on the sub-devices -
which it isn't - entries already in the tree are simply passed over, failing
to reset any udev_flags. The solution must retain its implicit nature of
discovering dependencies and be able to go back over the dependencies found
to properly set the udev_flags.
My solution simply calls a new function before leaving '_add_new_lv_to_dtree'
that iterates over the dtree nodes to properly reset the udev_flags of any
children. It is important that this function occur after the '_add_dev' has
done its job of querying the kernel for a list of dependencies. It is this
list of children that we use to look up their respective LVs and properly
calculate the udev_flags.
This solution has worked for single machine, cluster, and cluster w/ exclusive
activation.
The problem as reported by "ben <benscott@nwlink.com>" on lvm-devel:
vgsplit fails with mirrored mirror log
#lvs --all -o lv_name,lv_attr,devices
LV Attr Devices
MyMirror mwi--
[MyMirror_mimage_0] Iwi--- /dev/sdq(0)
[MyMirror_mimage_1] Iwi--- /dev/sdo(0)
[MyMirror_mimage_2] Iwi--- /dev/sdi(0)
[MyMirror_mlog] mwi---
[MyMirror_mlog_mimage_0] Iwi--- /dev/sds(0)
[MyMirror_mlog_mimage_1] Iwi--- /dev/sde(0)
#vgsplit -v "TestA" "TestB" "/dev/sdq" "/dev/sdo" "/dev/sdi" "/dev/sds"
"/dev/sde"
Checking for volume group "TestA"
Checking for new volume group "TestB"
Archiving volume group "TestA" metadata (seqno 213).
Can't split mirror MyMirror between two Volume Groups
AFTER FIX:
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
Volume group "new" not found
Skipping volume group new
LV VG Devices
lv vg lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] vg /dev/sdb1(0)
[lv_mimage_1] vg /dev/sdc1(0)
[lv_mlog] vg lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] vg /dev/sdh1(0)
[lv_mlog_mimage_1] vg /dev/sdi1(0)
[root@bp-01 ~]# vgsplit vg new /dev/sd[bchi]1
New volume group "new" successfully split from "vg"
[root@bp-01 ~]# lvs -a -o name,vg_name,devices vg new
LV VG Devices
lv new lv_mimage_0(0),lv_mimage_1(0)
[lv_mimage_0] new /dev/sdb1(0)
[lv_mimage_1] new /dev/sdc1(0)
[lv_mlog] new lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
[lv_mlog_mimage_0] new /dev/sdh1(0)
[lv_mlog_mimage_1] new /dev/sdi1(0)
Make limits for thin data_block_size and device_id part of public API.
FIXME: read them possible from some kernel header file in the future ?
But we may need to support different values for different versions ?
Since execve passed only NULL as environ, we had lost all environment vars on
restart - thus actually running 'different' clvmd then the one at start.
Preserving environ allows to restart clvmd with the same settings
(i.e. LD_LIBRARY_PATH)
Add test for second restart.
Since it's internal function and we always check for NULL value
before call - this is safe.
Just for case add nonnull attribute so analyzer might better
catch error.
It's 100% equivalent test - since it always happen for the first iteration.
But the check for 'l' is understandable with analyzers - since analyzer
is not smart enough to deduce connection between root->child == NULL.
Add named cluster_ops to easily learn the name of the active cluster manager,
so we are able to restart singlenode manager in testing.
Add simple test for clvmd -S (restart) and -R (refresh)
(though it needs some extensions).
When running tests it might be useful to have an override option when
testing on real /dev and some broken system (i.e. Debian and its rules).
So one can use:
LVM_TEST_DEVDIR=/dev LVM_VERIFY_UDEV=1 make check
When read in drain returned <0 value, terminal content has been trashed.
Remove unneeded memset() and use whole buffer.
Free readbuf before exit (valgrind).
Next iteration for better fit of lvmetad compilation.
Move build of libdaemon.a into common subdir Makefile.
libdaemon.a is device-mapper target.
Build and install lvmetad as lvm2 target.
Test whether nodes could be used on given filesystem where TMP
dir is being used and skip teardown quicker in fail case.
(makes the problem quickly obvious if you try to such fs).
Skip teardown_dev if we have not created any devs yet.
and do not mkdir /dev/mapper dir when LVM_TEST_DEVDIR is set.
Drop this test from t-000-basic.sh.
Read 2 environmental vars to learn about overide position for
CLVMD and LVM binaries.
We support LVM_BINARY in other script - and this way we could easily
test restart in our test-suite.
Bugfix:
Add (most probably unfinished) support for -E arg with list of exclusive
locks. (During clvmd restart all exclusive locks would have been lost and
in fact, if there would have been an exclusive lock, usage text would be
printed and clvmd exits.)
Instead of parsing list options multiple times every time some lock UUID is
checked - put them straight into the hash table - make the code easier to
understand as well.
Remove was_ex_lock() function (replaced with dm_hash_lookup()).
Swap return value for get_initial_state() (1 means success).
Update man pages and usage info for -E option.
Before, we used to display "Can't remove open logical volume" which was
generic. There 3 possibilities of how a device could be opened:
- used by another device
- having a filesystem on that device which is mounted
- opened directly by an application
With the help of sysfs info, we can distinguish the first two situations.
The third one will be subject to "remove retry" logic - if it's opened
quickly (e.g. a parallel scan from within a udev rule run), this will
finish quickly and we can remove it once it has finished. If it's a
legitimate application that keeps the device opened, we'll do our best
to remove the device, but we will fail finally after a few retries.
Add dm_device_has_mounted_fs fn to check mounted filesystem on a device.
This requires sysfs directory to be correctly set via dm_set_sysfs_dir
(/sys by default). If sysfs dir is not used or it's set incorrectly,
dm_device_has_{holders,mounted_fs} will return 0!
If you specify the segment type (e.g. --type mirror) and the mirrors argument
as zero, it would result in a mirrored LV with only one image. While the device
may be valid in theory, it should not be allowed in practice. It also makes it
difficult on the conversion tools, since they react badly to single-image
mirrors.
seg->areas and seg->meta_areas. We also need to copy the memory from the
old arrays to the newly allocated arrays. The amount of memory to copy was
determined by seg->area_count. However, seg->area_count was being set to the
higher value after copying the 'seg->areas' information, but before copying
the 'seg->meta_areas' information. This means we were copying more memory
than necessary for 'seg->meta_areas' - something that could lead to a segfault.
Patch fixes Clang warnings about possible access via lv_name NULL pointer.
Replaces allocation of memory (strdup) with just pointer assignment
(since execve is being called anyway).
Checks for !*lv_name only when lv_name is defined.
(and as I'm not quite sure what state this really is - putting a FIXME
around - as this rather looks suspicios ??).
Add debug print of passed clvmd args.
(since data in statbuf are invalid).
Check whether sysconf managed to find _SC_PAGESIZE.
Report at least debug warning about failing unlink
(logging scheme here seems to be a different then in lvm).
Duplicate terminal FDs and use similar code as is made in clvmd
and cleanup warns about missing open/close tests.
FIXME: Looks like we already have 3 instancies of the same code in lvm repo.
It seems like test-suite is significantly slowed down on removal phase.
Maybe it should be part of generic teardown - though the reason of slowdown
should be also discovered - probably related to retry loop ?
When the srcdir == builddir we get the link on non-exectable file.
So make always sure fsadm.sh is executable script file.
Add cleanup target for lib/fsadm.
Compiler says variable may be used uninitialized. It can't be, but we
initialize the variable to NULL anyway. Also, remove the double initialization
of another variable.
When fsadm is test - it needs to execute lvm and fsadm from non-standard path
setting. So adding a support in fsadm script when user set LVM_BINARY, then
the lvm command invoced from fsadm will have the same PATH setting as before
entering fsadm command.
Needed for testing.
In case someone would use filename paths with spaces when changing
this script surround commands with '"'.
With default settings there is no change in behavior.
This bug showed up when trying to add a log to a mirror whose images are on
multiple devices. This is an intra-release regression and no WHATS_NEW
entry will be added. The error was introduce in the following commit:
2d8a2f35c7
The solution is to recognise in _alloc_init that if there are no mirrors
or stripes specified, then 'new_extents' should be zero.
to settle udev before calling deactivate_lv.
This is an intra-release regression (no WHATS_NEW entry required). It is
part of the fix for the current WHATS_NEW entry:
Work around resume_lv causing error LV scanning during splitmirror operation.
Currently test is skipped by default (since it needs code hack to work)
Check command line options to create & remove thin pools and thin volumes.
Activation code for thin LV support is missing, thus it only works without
driver loaded.
When user wants to remove thin pool - check if there are no thin volumes using it.
If so - query before removal (or -ff for no question) and remove them first.
LVM has huge set of options now - it's approaching 60 short-arg less options
and we get interesting case of misdetection for 'merge' option which has been
put into the middle of options with 'short_arg' - thus certainly past 65. (ASCII 'A').
To avoid confusion of short_arg with long_opt number - add '128' to all such
non-short-arg options.
When LV is unlinked, we want to catch problem in vg_validate,
that LV has changed.
i.e. catch LV has been removed and is no long thin_pool while still
being referenced by some thin volume.
Transfer of build_dm_uuid() function into libdm made uuid_prefix as parameter,
thus sizeof() was replaced with strlen() and room for '\0' missed.
As it's only fix in current version - no whatsnew.
Revert John patch, which fixed only 1 place where ~LVM_WRITE was in use and
convert ommited LVM_READ/WRITE flags to 64bit constants as well.
(Since both 'status' flags for LV and VG are 64bit.)
Changing lv_mirror_count to only count the AREA_LVs made the function
stop working for PVMOVE mirrors. A conditional has been added to fix
that problem. Additionally, when counting the images in a mirror stack,
we don't need to subtract 1 from the count we get back from the
lv_mirror_count call on the temporary mirror layer. (This is because we
are no falsely counting the top layer of the temporary mirror.)
lv_mirror_count was not able to handle mirrors of stripes properly. When a
failed device is removed, the MIRRORED status flag is removed from the LV
conditionally based on the results of lv_mirror_count. However, lv_mirror_count
trusted the MIRRORED flag - thinking any such LV must be mirrored. It would
happily assign first_seg(lv)->area_count as the number of mirrors, but when
a mirrored striped LV was reduced to a simple striped LV area_count would be
the number of /stripes/ not the number of /mirrors/. A result higher than 1
would be returned from lv_mirror_count, the MIRRORED flag would not be cleared,
and the LV would fail to be up-converted properly in lvconvert_mirrors_aux
because of it.
The operation of deactivating the residual error target LV after removing a
mirror layer can cause a "device in-use" conflict with udev. Giving udev a
poke before calling deactivate_lv eliminates the conflict. The stick used
to poke udev is 'sync_local_dev_names'.
Kernel requires a mirror to be at least 1 region large. So,
if our mirror log is itself a mirror, it must be at least
1 region large. This restriction may not be necessary for
non-mirrored logs, but we apply the rule anyway.
(The other option is to make the region size of the log
mirror smaller than the mirror it is acting as a log for,
but that really complicates things. It's much easier to
keep the region_size the same for both.)
WHATS_NEW entry:
Fix log size calculation when only a log is being added to a mirror.
The original fix pass the mirror LV to allocate_extents (rather than
passing NULL) so that _alloc_init could correctly determine the necessary
size of the mirror log. In the previous check-in, I noted:
In order to get a decent value computed, we need to pass in the 'lv' argument
to allocate_extents. This would normally imply a desire for cling/contiguous
allocation to the given LV, but since we are not allocating any parallel
extents and only log extents, it works fine.
However, passing in the LV did have unintended consequences on the placement of
the log. The better solution is to pass in the number of extext that are in
the mirror LV instead of the LV itself. This will not cause the allocator to
reserve that number of extents, because 'stripes' and 'mirrors' are specified
as 0. Thus, 'extents' is used to calculate the size of the log, but won't
affect how much is allocated.
LVM_WRITE is a 32-bit flag. Now that RAID[_IMAGE|_META] are 64-bit,
and'ing a RAID LV's status against LVM_WRITE can reset the higher order
flags.
A similar thing will affect thinp flags if not careful.
This is a workaround for long-lasting problem with using the WATCH udev
rule. When trying to remove a DM device, this one can still be opened
while processing the event in parallel (generated based on the WATCH
udev rule).
Let's use this until we have a proper solution.
_alloc_init calculates the number of necessary log extents via
'mirror_log_extents'. 'mirror_log_extents' takes 3 arguments: region_size,
pe_size, and size of the mirror LV. Unfortunately, _alloc_init is guessing at
the mirror size by using 'ah->new_extents / ah->area_multiple' - the number of
extents that the mirror images have. However, this is /always/ wrong when
allocating the log separately. Further, the log is always allocated separately
unless we are up-converting the mirror at the same time. It was by luck alone
that a default value of '1' reflects what we want in most cases.
In order to get a decent value computed, we need to pass in the 'lv' argument
to allocate_extents. This would normally imply a desire for cling/contiguous
allocation to the given LV, but since we are not allocating any parallel
extents and only log extents, it works fine.
When an image is split from a 2-way mirror, the original mirror is converted to
a linear device. To do this, the top "layer" must be removed. The segments
are transferred from the sub-lv to the top-level LV and the link is severed.
The former sub-lv - having its segments transferred - now contains a temporary
error target.
When the original LV is resumed, the old sub-lv that now contains an error
segment is activated and scanned. This is what causes the I/O error messages.
There are three ways to fix this problem:
1) Do not set the sub-lv which contains the error target as "visible" before
suspending the original LV. This way, when the original is resumed, the sub-lv
device node is not created and it is not scanned - avoiding the error messages.
The problem with this approach is that if the machine crashes after the
resume, it leaves the *hidden* LV in place and the user has a more difficult
time noticing that it needs to be cleaned up. Thus, this type of processing is
frowned upon.
2) Do like _remove_mirror_images does and suspend the original, then suspend
the sub-lv (the error target), then resume the sub-lv, and finally resume the
original LV. This seems like extra pointless operations to me, but it does not
produce the error message (although, I'm not sure why) and it allows us to
leave the visible flag in place.
3) Flag the sub-lv (error target) with a "do not scan" flag. This seems like
the cleanest approach, but I have been unable to find the method for doing
this. LVs get tagged in such a way by _get_udev_flags, but in this case the
resume of the original LV also resumes the error target LV without running it
through _get_udev_flags (likely because they are no longer linked). Could
there be something wrong in resume_lv?
Option #2 was chosen to fix this bug, but it seems like more of a workaround
for now.
A gentle reminder that anyone relying on the output of reporting commands
like lvs in scripts must use -o to guarantee they get the fields they expect.
The default sequence of fields can change from release to release.
Equally, the 'attr' fields can have new values introduced and/or characters
appended to them.
Some major distributions are still using 'mawk' and they are not using
the latest version - we end here with hidden dependency on the latest
version of mawk (1.3.4) while i.e. Debian Lenny seems to stay with 1.3.3.
So we end with completely broken vgimportclone script on such system.
We would need to check for proper support of :space: and abort build if
it doesn't work or simplier replace [:space:] with [ \t] which seems
sufficient to make it work (as can be seen in this patch)
A better fix would be to use command line parameter override - leaving
as FIXME comment.
This patch makes t-vgimportclone.sh test passing on Lenny.
Makes dumpconfig whole-section output wrong in a different way from before,
but we should be able to merge cft_cmdline properly into cmd->cft now and
remove cascade.
There was a bad sequence:
*) Make changes to LV layout to split images (e.g. 4-way -> 2-way/2-way)
1) vg_write, suspend_lv(original_mirror), vg_commit
2) activate_lv(newly_split_lv)
3) resume_lv(original_mirror)
Step #2 is not allowed. However, without it, the resume of the original
mirror will also resume its former sub-LVs - making it impossible to
activate the newly split LV due to the changes in layering, pointers, and
names that had already been made. Additionally, the resume or the original
brings the sub-lv's online with names that differ from the metadata on disk -
also a no-no. Thus, the split must be done in stages such that the active LVs
always reflect what is in the committed LVM metadata.
First, alter the original mirror by releasing the images. The images are made
visible and independent as an intermediate stage. (This way, we can have
consistency between LVM metadata and active LVs.) The second stage collects
the recently split LVs, deactivates them, forms them into a mirror if necessary,
and then activates them. It is a bit of a circuitous method, but it is the only
way to split a mirror from a mirror and obey these general rules:
1) Never [de]activate sub-lvs when the top-level LV is suspended
2) Avoid having active LVs that differ from the description in the LVM metadata
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
functionality. A number of bugs (copied and pasted all over the code) should
disappear:
- most string lookup based on dm_config_find_node would segfault when
encountering a non-zero integer (the intention there was to print an
error message instead)
- check for required sections in metadata would have been satisfied by
values as well (i.e. not sections)
- encountering a section in place of expected flag value would have
segfaulted (due to assumed but unchecked cn->v != NULL)
daemon/common code in a single libdaemon.a, which is completely private. This
is currently linked into the lvmetad binary, and will be linked into LVM (the
client part, since static linking only picks up only symbols that are actually
used). I have also added --enable/disable-lvmetad to ./configure; although the
current default is off, I expect this to be flipped to on shortly. There's no
LVM-side support yet, but when there is, even when built, it'll still need to
be enabled by an lvm.conf option.
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
There's a very high memory usage when calling _pv_analyse_mda_raw (e.g. while
executing pvck) that can end up with "out of memory".
_pv_analyse_mda_raw scans for metadata in the MDA, iteratively increasing the
size to scan with SECTOR_SIZE until we find a probable config section or we're
at the edge of the metadata area. However, when using a memory pool, we're also
iteratively chasing for bigger and bigger mempool chunk which can't be found
and so we're always allocating a new one, consuming more and more memory...
This patch just changes the mempool to direct memory allocation in this
problematic part of the code.
There is duplicate code in t-lvconvert-raid.sh and t-lvcreate-raid.sh.
This should be moved into a common file which is then sourced by these two
files. I'll wait to move the duplicate code until I can talk to mornfall.
This patch adds the ability to upconvert a raid1 array - say from 2-way to
3-way. It does not yet support upconverting linear to n-way.
The 'raid' device-mapper target allows for individual components (images) of
an array to be specified for rebuild. This mechanism is used when adding
new images to the array so that the new images can be resync'ed while the
rest of the images in the array can remain 'in-sync'. (There is no
mirror-on-mirror layering required.)
~> lvconvert --splitmirrors 1 --trackchanges vg/lv
The '--trackchanges' option allows a user the ability to use an image of
a RAID1 array for the purposes of temporary read-only access. The image
can be merged back into the array at a later time and only the blocks that
have changed in the array since the split will be resync'ed. This
operation can be thought of as a partial split. The image is never completely
extracted from the array, in that the array reserves the position the device
occupied and tracks the differences between the array and the split image via
a bitmap. The image itself is rendered read-only and the name (<LV>_rimage_*)
cannot be changed. The user can complete the split (permanently splitting the
image from the array) by re-issuing the 'lvconvert' command without the
'--trackchanges' argument and specifying the '--name' argument.
~> lvconvert --splitmirrors 1 --name my_split vg/lv
Merging the tracked image back into the array is done with the '--merge'
option (included in a follow-on patch).
~> lvconvert --merge vg/lv_rimage_<n>
The internal mechanics of this are relatively simple. The 'raid' device-
mapper target allows for the specification of an empty slot in an array
via '- -'. This is what will be used if a partial activation of an array
is ever required. (It would also be possible to use 'error' targets in
place of the '- -'.) If a RAID image is found to be both read-only and
visible, then it is considered separate from the array and '- -' is used
to hold it's position in the array. So, all that needs to be done to
temporarily split an image from the array /and/ cause the kernel target's
bitmap to track (aka "mark") changes made is to make the specified image
visible and read-only. To merge the device back into the array, the image
needs to be returned to the read/write state of the top-level LV and made
invisible.
Users already have the ability to split an image from an LV of "mirror"
segtype. This patch extends that ability to LVs of "raid1" segtype.
This patch only allows a single image to be split off, however. (The
"mirror" segtype allows an arbitrary number of images to be split off.
e.g. 4-way => 3-way/linear, 2-way/2-way, linear,3-way)
of top-level LV.
We can't activate sub-lv's that are being removed from a RAID1 LV while it
is suspended. However, this is what was being used to have them show-up
so we could remove them. 'sync_local_dev_names' is a sufficient and
proper replacement and can be done after the top-level LV is resumed.
1) add new function 'raid_remove_top_layer' which will be useful
to other conversion functions later (also cleans up code)
2) Add error messages if raid_[extract|add]_images fails
3) Add function prototypes to prevent compiler warnings when
compiling with '--with-raid=shared'
Fix a couple more issues that kabi found.
- Add some error messages in failure cases
- s/malloc/zalloc/
- use vg->vgmem for lv names instead of vg->cmd->mem
Skip decoding of DM flags when device is removed.
We currently need DM flags only for add|change events. So forking
dmsetup process for removed devices is a waste of CPU time.
Udev is already quite slow, so make it just a tiny bit faster.
Add config option to enable crc checking of VG structures.
Currently it's disabled by default.
For the internal test-suite this check it is enabled.
Note: In the case the internal error is detected, debug build with
compile option DEBUG_ENFORCE_POOL_LOCKING helps to catch the source
of the problem.
Use debug pool locking functionality. So the command could check,
whether the memory in the pool has not been modified.
For lv_postoder() instead of unlocking and locking for every changed
struct status member do it once when entering and leaving function.
(mprotect would trap each such memory access).
Currently lv_postoder() does not modify other part of vg structure
then status flags of each LV with flags that are reverted back to
its original state after function exit.
Adding debuging functionality to lock and unlock memory pool.
2 ways to debug code:
crc - is default checksum/hash of the locked pool.
It gets slower when the pool is larger - so the check is only
made when VG is finaly released and it has been used more then
once.Thus the result is rather informative.
mprotect - quite fast all the time - but requires more memory and
currently it is using posix_memalign() - this could be
later modified to use dm_malloc() and align internally.
Tool segfaults when locked memory is modified and core
could be examined for faulty code section (backtrace).
Only fast memory pools could use mprotect for now -
so such debug builds cannot be combined with DEBUG_POOL.
Extend vginfo cache with cached VG structure. So if the same metadata
are use, skip mda decoding in the case, the same data are in use.
This helps for operations like activation of all LVs in one VG,
where same data were decoded giving the same output result.
Patch adds 1-to-1 connection between volume_group and lvmcache_vginfo.
Move the free_vg() to vg.c and replace free_vg with release_vg
and make the _free_vg internal.
Patch is needed for sharing VG in vginfo cache so the release_vg function name
is a better fit here.
As this flag could not have been set by the current code - removing it.
Note: because of the wrong code logic this call:
lvmcache_update_vg(correct_vg, correct_vg->status & PRECOMMITTED &
(inconsistent ? INCONSISTENT_VG : 0));
had always passed '0' - now after flag removal it's passing
PRECOMMITTED flag in - this present functinal change in this patch.
To match the original functionality - 0 had to be always passed.
More testing is needed here.
Compiler complaining that meta_lv could be used uninitialized. (Not true
because it is protected by 'clear_metadata'.) I switched to using 'lv->vg',
as it makes no difference to vg_[write|commit].
(here clvmd crashed in the middle of operation),
lock is not removed from cache - here is one example:
locking/cluster_locking.c:497 Locking VG V_vg_test UN (VG) (0x6)
locking/cluster_locking.c:113 Error writing data to clvmd: Broken pipe
locking/locking.c:399 <backtrace>
locking/locking.c:461 <backtrace>
Internal error: Volume Group vg_test was not unlocked
Code should always remove lock info from lvmcache and update counters
on unlock, even if unlock fails.
Today, we use "suppress_messages" flag (set internally in init_locking fn based
on 'ignorelockingfailure() && getenv("LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES")'.
This way, we can suppress high level messages like "File-based locking
initialisation failed" or "Internal cluster locking initialisation failed".
However, each locking has its own sequence of initialization steps and these
could log some errors as well. It's quite misleading for the user to see such
errors and warnings if the "--sysinit" is used (and so the ignorelockingfailure
&& LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES environment variable). Errors and
warnings from these intermediary steps should be suppressed as well if requested.
This patch propagates the "suppress_messages" flag deeper into locking init
functions. I've also added these flags for other locking types for consistency,
though it's not actually used for no_locking and readonly_locking.
Last usage was removed in Petr's commit related to VG mda repair fix
where relaxed check starts to ignore inconsistencies coming from
PVs that are marked MISSING - thus removing unused variable.
Defer the test of the function return value after the string memory is released.
Otherwise in this error path the string would present memory leak.
(Thought in this case we are already out of memory...)
Implementation described in doc/lvm2-raid.txt.
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
This should be set by default! Normally we have "activation/udev_sync = 1"
in lvm.conf (example.conf.in). But if we use lvm2 without any config file
(or without a definition within '--config' option) the DEFAULT_UDEV_SYNC
is used instead. Together with verify_udev_operations=0 (when we rely on
udev fully), this can cause races as the node could be missing when needed.
(See also https://bugzilla.redhat.com/show_bug.cgi?id=723144)
Systemd preloads file descriptors for us and passes them in for
newly spawned daemon when using on-demand fifo (or socket)
based activation.
This patch adds checks for file descriptors preloaded by
systemd and uses them instead of opening the FIFOs again
to properly support on-demand FIFO-based activation.
(We'll change FIFOs to sockets soon - but still this
part of the code will stay almost the same.)
The filename to adjust the oom score was changed in 2.6.36.
We should use oom_score_adj instead of oom_adj (which is still
there under /proc, but it's scheduled for removal in August 2012).
New oom_score_adj uses a range from -1000 (OOM_SCORE_ADJ_MIN,
disable oom killing) to 1000 (OOM_SCORE_ADJ_MAX).
Clvmd detects modifed config file before it takes lv_lock.
If the config file is changed rapidly - the change was ignored within
a seocnd ranged. This patch adds also compare of file size.
So change like some flag for 0 to 1 would pass unnoticed - but
it's quick fix for failing test suite.
FIXME: Implement inotify solution.
very reasonable amount of parallel access, although the hash tables may become
a point of contention under heavy loads. Nevertheless, there should be orders
of magnitude less contention on the hash table locks than we currently have on
block device scanning.
where the naming is left completely on lvm.
(Commited code has been different version of test).
So here it should be able to figure out new free name and create a new LV.
device that has been brought back from the dead: this sometimes fails with
clvmd (the cache is updated "too soon"). Instead, force a pvscan and rely on an
up-to-date cache as usual.
If MD linear device has set rounding (overload chunk size attribute),
the pvcreate command prints this warning:
/dev/md0 sysfs attr level not in expected format: linear
teardown sequence. (Previously the snapshot was deactivated while its
origin was active and before its removal was committed to disk, so
restarting after a crash at the point would leave corruption.)
When some target is passing empty parameters to some dm target,
report this as an internal error to better catch some broken
table construction (some mirror conversions seem to be doing
this for now).
Since some test may leave devices in suspend mode which would require
carefull order of resume operation - use '-f' to replace them with
error targets
For disable_dev - when 'error' target is used for open count - treat
return code as ok (|| true) to avoid breaking futher test processing.
The conditional is not just unnecessary, it would have been wrong. The code
is suppose to be checking if the 'splitmirrors_ARG' is negative, but it
instead is checking 'mirrors_ARG'. Rather than changing the argument being
checked, I've pulled the check entirely because 'splitmirrors_ARG' is already
guarenteed to not be negative by virtue of the fact that it is a 'int_arg'.
Negative values will be caught in _process_command_line().
When an LVM mirror is up-converted (an additional image added), it creates
a temporary mirror stack. The lower-level mirror in the stack that is
created was not being activated exclusively - violating the exclusive nature
of the original mirror. We now check for exclusive activation of a mirror
before converting it, and if found, we ensure that the temporary mirror
is also exclusively activated.
Mirrors used to be created by first creating a linear device and then adding
the other images plus the log. Now mirrors are created by creating all the
images in one go and then adding the log separately. The new way ran into
the condition that cluster mirrors cannot change the log type (in the case
of creation, from core -> disk) while the mirror is not active. (It isn't
active because it is in the process of being created.) The reason this
condition is in place is because a remote node may have the mirror active, and
we don't want to alter the log underneath it.
What we really needed was a way of checking if the mirror was active remotely
but not locally, and in that case do not allow a change of the log. I've added
this check, and cluster mirrors can now be created again.
This fn calls rm_dev_node directly - an exceptional case. It needs to check
the DM_UDEV_DISABLE_LIBRARY_FALLBACK flag directly (it's called in dm_task_run
normally where it's checked already).
We've used udev fallback code till now to check whether udev
created/removed the entries in /dev correctly and if not,
a repair was done (giving a warning messagea about that).
This patch adds a possibility to enable this additional check
and subsequent fallback only when required (debugging purposes
mostly) and trust udev completely.
So let's disable the fallback code by default and add a new
configuration option "activation/udev_fallback".
(The original code for creating the nodes will still be used
in case the device directory that is set in lvm.conf differs
from the one that udev uses and also when activation/udev_rules
is set to 0 - otherwise we would end up with no nodes/symlinks
at all)
It's useful to keep the partial flag cached - so just move the call
for vg_mark_partil_lvs() into import_vg_from_config_tree() so it gets
evaluated before it goes through the lvmcache.
This patch should not present any functional change.
Note: It is rather temporal solution - proper place is probably inside the
'read' call back - but needs some more discussion.
For now using this minor hack.
As the ACTIVATE_EXCL could be set only in clvmd code - there is no
use for this test in lv_add_mirrors() function only called from
tools context.
FIXME: Add cluster test case for this.
To avoid modification of 'read-only' volume group structure
add a new structure to pass local data around the code for LV
activation.
As origin_only is one such flag - replace this parameter with new
struct lv_activate_opts.
More parameters might eventually become part of lv_activate_opts.
transient error), stemming from the following sequence of events:
1) devices fail IO, triggering repair
2) dmeventd starts fixing up the mirror
3) during the downconversion, a new metadata version is written
--> the devices come back online here
4) the mirror device suspend/resume is called to update DM tables
5) during the suspend/resume cycle, *pre*-commit metadata is read;
however, since the failed devices are now back online, we get back
inconsistent set of precommit metadata and the whole operation fails
The patch relaxes the check that fails in step 5 above, namely by ignoring
inconsistencies coming from PVs that are marked MISSING.
This was missing in liblvm and it caused all udev-related operations to
not take effect when using liblvm, e.g. obtaining the list of devices from udev
db instead of scanning the whole /dev which also recreated the .cache as a side
effect. This was also the case with udisks-lvm-pv-export prober which is run
from within udev rules whenever the CHANGE event is fired.
and use this for the LVM critical section logic. Also report an error if
code tries to load a table while any device is known to be in the
suspended state.
(If the variety of problems these changes are showing up can't be fixed
before the next release, the error messages can be reduced to debug
level.)
are affected by the move. (Currently it's possible for I/O to become
trapped between suspended devices amongst other problems.
The current fix was selected so as to minimise the testing surface. I
hope eventually to replace it with a cleaner one that extends the
deptree code.
Some lvconvert scenarios still suffer from related problems.
Patch adds check for stripe not only in direct
LV segment but also in mirror image segment.
This prevents bugs like:
# lvcreate -i2 -l10 -n lv vg_test
# lvconvert -m1 -i1 vg_test/lv
# lvreduce -f -l1 vg_test/lv
WARNING: Reducing active logical volume to 4.00 MiB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Reducing logical volume lv to 4.00 MiB
Segment extent reduction 9 not divisible by #stripes 2
Logical volume lv successfully resized
# lvremove -f vg_test
Segment extent reduction 1 not divisible by #stripes 2
LV segment lv:0-4294967295 is incorrectly listed as being used by LV lv_mimage_0
Internal error: LV segments corrupted in lv_mimage_0.
Currently some operation with striped mirrors lead
to corrupted metadata, this patch just add detection of such
situation.
Example:
# lvcreate -i2 -l10 -n lvs vg_test
# lvconvert -m1 vg_test/lvs
# lvreduce -f -l1 vg_test/lvs
Reducing logical volume lvs to 4.00 MiB
Segment extent reduction 9not divisible by #stripes 2
Logical volume lvs successfully resized
# lvremove vg_test/lvs
Segment extent reduction 1not divisible by #stripes 2
LV segment lvs:0-4294967295 is incorrectly listed as being used by LV lvs_mimage_0
Internal error: LV segments corrupted in lvs_mimage_0.
We should never remove more extents than requested by user,
so round up to next stripe boundary during lvreduce.
Also this fixes round to zero sized LV bug:
# lvcreate -i2 -I 64k -l10 -n lvs vg_test
# lvreduce -f -l1 vg_test/lvs
Rounding size (1 extents) down to stripe boundary size for segment (0 extents)
WARNING: Reducing active logical volume to 0
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Reducing logical volume lvs to 0
Failed to suspend lvs
There's a possibility someone will use the '/' in the hostname. Since we
generate a temporary file name (path) including the hostname, any '/' would
be ambiguous.
We can always set such hostname using 'sethostname' from unistd.h. But the
'hostname' command already includes the check and removes the '/' char.
However, some old versions still allow that.
See: https://bugzilla.redhat.com/show_bug.cgi?id=711445.
Since this is only a temporary name and the possibility of this error is
quite negligible, we don't need any complex escape sequence here, just a
simple char replace.
Before, we used vg_write_lock_held call to determnine the way a device is
opened. Unfortunately, this opened many devices in RW mode when it was not
really necessary. With the OPTIONS+="watch" rule used in the udev rules,
this could fire numerous events while closing such devices (and it caused
useless scans from within udev rules in return).
A common bug we hit with this was with the lvremove command which was unable
to remove the LV since it was being opened from within the udev rules. This
patch should minimize such situations (at least with respect to LVM handling
of devices).
Though there's still a possibility someone will open a device 'outside' in
parallel and fire the event based on the watch rule when closing a device
once opened for RW.
allocates these buffers in such way it adds memory page for each such buffer
and size of unlock memory check will mismatch by 1 or 2 pages.
This happens when we print or read lines without '\n' so these buffers are
used. To avoid this extra allocation, use setvbuf to set these bufffers ahead.
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Also, add a new 'obtain_device_list_from_udev' setting to lvm.conf with which
we can turn this feature on or off if needed.
If set, the cache of block device nodes with all associated symlinks
will be constructed out of the existing udev database content.
This avoids using and opening any inapplicable non-block devices or
subdirectories found in the device directory. This setting is applied
to udev-managed device directory only, other directories will be scanned
fully. LVM2 needs to be compiled with udev support for this setting to
take effect. N.B. Any device node or symlink not managed by udev in
udev directory will be ignored with this setting on.
This fails on lenny buildslave for some reason.
For now disable the vgimportclone part of the test until proper fix.
Let the first part of the test still run though, which shows pvs working
with duplicate pvs.
Related to rhbz 697959.
This test fails prior to these two commits:
commit af112eb2c9
Author: Zdenek Kabelac <zkabelac@redhat.com>
Date: Thu Apr 21 13:15:26 2011 +0000
Skip check for NULL before dm_free
dm_free makes this test itself.
commit 91419c3e86
Author: Zdenek Kabelac <zkabelac@redhat.com>
Date: Thu Apr 21 13:13:40 2011 +0000
Fix use of released vgname and vgid
Avoid using of already released memory when duplicated MDA is found.
As get_pv_from_vg_by_id() may call lvmcache_label_scan() use the local copy
of the vgname and vgid on the stack as vginfo may dissapear and code was
then accessing garbage in memory.
i.e. pvs /dev/loop0
(when /dev/loop0 and /dev/loop1 has same MDA content)
Invalid read of size 1
at 0x523C986: dm_hash_lookup (hash.c:325)
by 0x440C8C: vginfo_from_vgname (lvmcache.c:399)
by 0x4605C0: _create_vg_text_instance (format-text.c:1882)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
by 0x47FBD8: vg_read_internal (metadata.c:3231)
by 0x477594: get_pv_from_vg_by_id (metadata.c:344)
by 0x45F07A: _get_pv_if_in_vg (format-text.c:1400)
by 0x45F0B9: _populate_pv_fields (format-text.c:1414)
by 0x45F40F: _text_pv_read (format-text.c:1493)
by 0x480431: _pv_read (metadata.c:3500)
by 0x4802B2: pv_read (metadata.c:3462)
Address 0x652ab80 is 0 bytes inside a block of size 4 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x442277: _free_vginfo (lvmcache.c:963)
by 0x44235E: _drop_vginfo (lvmcache.c:992)
by 0x442B23: _lvmcache_update_vgname (lvmcache.c:1165)
by 0x443449: lvmcache_update_vgname_and_id (lvmcache.c:1358)
by 0x443C07: lvmcache_add (lvmcache.c:1492)
by 0x46588C: _text_read (text_label.c:271)
by 0x466A65: label_read (label.c:289)
by 0x4413FC: lvmcache_label_scan (lvmcache.c:635)
by 0x4605AD: _create_vg_text_instance (format-text.c:1881)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
Add testing script
pv_manip.c to properly account for case when pe_start=0 and the first
physical extent is to be released (currently skip the first extent to
avoid discarding the PV label).
My previous patch fixed incorrect error check for dm_snprintf.
However in this particular case - dm_snprintf has been used differently -
just like strncpy + setting last char with '\0' - so the code had to return
error - because the buffer was to short for whole string.
Patch replaces it with real strncpy.
Also test for alloca() failure is removed - as the program behaviour
is rather undefined in this case - it never returns NULL.
lvseg properties for lvm2app, 'devices' and 'seg_pe_ranges'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
dm_pool_free incorrectly. This check-in fixes that incorrect usage.
I've also added a WHATS_NEW line to reflect the changes I made to allow
lv_extend to operate on 0 length intrinsically layered LVs (i.e mirrors
and RAID). I forgot that in the last commit.
allows us to allocate all images of a mirror (or RAID array) at one
time during create.
The current mirror implementation still requires a separate allocation
for the log, however.
Actually, we can call vg_set_fid(vg, NULL) instead of calling
destroy_instance for all PV structs and a VG struct - it's the same
code we already have in the vg_set_fid.
Also, allow exactly the same fid to be set again for the same PV/VG
Before, this could end up with the fid destroyed because we destroyed
existing fid first and then we used the new one and we didn't care
whether existing one == new one by chance.
As now the FID management is more complex, the code inside free_vg needs
to access some parts of memory pools which were not needed before.
For this - makes the order of unlock_and_free_vg() unconditional.
Keek using unlock_and_free_vg() API function.
For properly working VG locking mechanism only the alphabeting locking
orderer needs to be preserved.
TODO: there could be few more code parts simplified when we 'officially'
support of referencies between different memory pools.
Instead of searching linear list of all LVs, PVs - use created hash tables
also for quick mapping between LV.
(Note - for small number of PVs or LVs the overhead of the hash is bigger).
TODO: Use hash tables in volume_group structure directly.
Instead of regenerating config tree and parsing same data again,
check whether export_vg_to_buffer does not produce same string as
the one already cached - in this case keep it, otherwise throw cached
content away.
For the code simplicity calling _free_cached_vgmetadata() with
vgmetadata == NULL as the function handles this itself.
Note: sometimes export_vg_to_buffer() generates almost the same data
with just different time stamp, but for the patch simplicity,
data are reparsed in this case.
This patch currently helps for vgrefresh.
Align strdup char* allocation just on 2 bytes.
It looks like wasting space to align strings on 8 bytes.
(Could be even 1byte - but for hashing it might eventually get better
perfomance - but probably hardly measurable).
TODO: check on various architectures it's not making any problems.
Isn't usually perfomance critical - but log_error is used i.e.for debuging,
this code noticable slows down the processing.
Added 512KB limit to avoid memory exhastions in case of some endless loop.
TODO: use _lvm_errmsg buffer only when lvm2api needs it.
Avoid locking sum testing with valgrind compilation.
Make memory unaccessible in the valgrind for dm_pool_abadon_object.
Valgrind hinting should not be needed in _free_chunk for dm_free.
'a small step' towards cleaner shutdown sequence.
Normally clvmd doens't care about unreleased memory on exit -
but for valgrind testing it's better to have them cleaned all.
So - few things are left on exit path - this patch starts to remove
just some of them.
1. lvm_thread_fs is made as a thread which could be joined on exit()
2. memory allocated to local_clien_head list is released.
(this part is somewhat more complex if the proper reaction is
needed - and as it requires some heavier code moving - it will
be resolved later.
Could be reached via few of our lvm2 test cases:
==11501== Invalid read of size 8
==11501== at 0x49B2E0: _area_length (import-extents.c:204)
==11501== by 0x49B40C: _read_linear (import-extents.c:222)
==11501== by 0x49B952: _build_segments (import-extents.c:323)
==11501== by 0x49B9A0: _build_all_segments (import-extents.c:334)
==11501== by 0x49BB4C: import_extents (import-extents.c:364)
==11501== by 0x497655: _format1_vg_read (format1.c:217)
==11501== by 0x47E43E: _vg_read (metadata.c:2901)
cut from t-vgcvgbackup-usage.sh
--
pvcreate -M1 $(cat DEVICES)
vgcreate -M1 -c n $vg $(cat DEVICES)
lvcreate -l1 -n $lv1 $vg $dev1
--
Idea of the fix is rather defensive - to allocate one extra element
to 'map' array which is then used in _area_length() - where the
loop checks, whether next map entry is continuous.
By placing there always one extra zero entry -
we fix the read of unallocated memory, and we make sure the data would
not make a continous block.
FIXME: there could be a problem if some special broken lvm1 data would be imported.
As the format1 is currently not really used - leave it for future fix
and use this small hotfix for now.
With recent update of dm_report_field_string() API call to accept
completely const objects - we no longer need loose constness here
and keep it forwarding.
Invalid primary_vginfo was supposed to move all its lvmcache_infos to
orphan_vginfo - however it has called _drop_vginfo() inside the loop
that released primary_vginfo itself - thus made the loop using released
memory.
Use _vginfo_detach_info() instead and call _drop_vginfo after
th loop is finished.
Valgrind trace it should fix:
Invalid read of size 8
at 0x41E960: _lvmcache_update_vgname (lvmcache.c:1229)
by 0x41EF86: lvmcache_update_vgname_and_id (lvmcache.c:1360)
by 0x441393: _text_read (text_label.c:329)
by 0x442221: label_read (label.c:289)
by 0x41CF92: lvmcache_label_scan (lvmcache.c:635)
by 0x45B303: _vg_read_by_vgid (metadata.c:3342)
by 0x45B4A6: lv_from_lvid (metadata.c:3381)
by 0x41B555: lv_activation_filter (activate.c:1346)
by 0x415868: do_activate_lv (lvm-functions.c:343)
by 0x415E8C: do_lock_lv (lvm-functions.c:532)
by 0x40FD5F: do_command (clvmd-command.c:120)
by 0x413D7B: process_local_command (clvmd.c:1686)
Address 0x63eba10 is 16 bytes inside a block of size 160 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x41DE70: _free_vginfo (lvmcache.c:980)
by 0x41DEDA: _drop_vginfo (lvmcache.c:998)
by 0x41E854: _lvmcache_update_vgname (lvmcache.c:1238)
by 0x41EF86: lvmcache_update_vgname_and_id (lvmcache.c:1360)
by 0x441393: _text_read (text_label.c:329)
by 0x442221: label_read (label.c:289)
by 0x41CF92: lvmcache_label_scan (lvmcache.c:635)
by 0x45B303: _vg_read_by_vgid (metadata.c:3342)
by 0x45B4A6: lv_from_lvid (metadata.c:3381)
by 0x41B555: lv_activation_filter (activate.c:1346)
by 0x415868: do_activate_lv (lvm-functions.c:343)
problematic line:
dm_list_iterate_items_safe(info2, info3, &primary_vginfo->infos)
Fix 2 more functions sending cluster messages to avoid passing uninitilised bytes
and compensate 1 extra byte attached to the message from the clvm_header.args[1]
member variable.
If _move_lv_segments is passed a 'lv_from' that does not yet
have any segments, it will screw things up because the code
that does the segment copy assumes there is at least one
segment. See copy code here:
lv_to->segments = lv_from->segments;
lv_to->segments.n->p = &lv_to->segments;
lv_to->segments.p->n = &lv_to->segments;
If 'segments' is an empty list, the first statement copies over
the values, but the next two reset those values to point to the
other LV's list structure. 'lv_to' now appears to have one
segment, but it is really an ill-set pointer.
When I see 'seg_is_mirrored', I expect the argument to be an lv_segment.
In this case, it is lvcreate_params. Both structures, have a 'segtype'
entry which the macro dereferences. However, it just seems easier to
understand if we do 'segtype_is_mirrored' instead.
other cases, the code could wind up removing wrong number of mirrors. In yet
other cases, we could remove the right number of mirrors, but fail to respect
the removal preferences (i.e. keep an image that was requested to be removed
while removing an image that was requested to be kept). Under some
circumstances, remove_mirror_images could also get stuck in an infinite loop.
This patch should fix all of the above undesirable behaviours.
Signed-off-by: Petr Rockai <prockai@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
LVM doesn't behave correctly if running as non-root user,
there is warning when it detects it.
Despite this, it produces many error messages, saying nothing.
See https://bugzilla.redhat.com/show_bug.cgi?id=620571
This patch fixes two things:
1) Removes eror message from device_is_usable() which has no
information value anyway (real warning is printed inside it).
2) it fixes device-mapper initialization, if we support
core dm module autoload and device node is present, it should
fail early and not try recreate existing and correct node.
(non-root == permission denied here)
N.B. In future code should support user roles, some more
drastic checks in code are probably contraproductive now.
Attach \0 for proper char* display - otherwise somewhat random message could
be displayed in debug more and read of unpredictable read of uninitilized
memory values could happen.
As code uses strncpy(system_id, NAME_LEN) and doesn't set '\0'
Fix it by always allocating NAME_LEN + 1 buffer size and with zalloc
we always get '\0' as the last byte.
This bug may trigger some unexpected behavior of the string operation
code - depends on the pool allocator.
FIXME: refactor this code to alloc_vg.
We have 3 components and traling '\0' so allocate proper room for all of them.
Problem was nicely hidden by allocation from pool and allocation aligment
offset - so to trigger real problem with this one is actually hard.
Return value of readlink limits valid string size.
Characters after returned size present some garbage to printf.
Fix it by placing '\0' on the return size value.
Missing free_vg on error_path in lvmcache_get_vg fn. Call destroy_instance
only if the fid is not part of the vg in backup_read_vg fn (otherwise it's
part of the VG we're returning and we definitely don't want to destroy it!).
This is necessary for proper format instance ref_count support. We iterate
over vg->pvs and vg->removed_pvs list and the ref_count is decremented and
then it is destroyed if not referenced anymore.
Since format instances will use own memory pool, it's necessary to properly
deallocate it. For now, only fid is deallocated. The PV structure itself
still uses cmd mempool mostly, but anytime we'd like to add a mempool
in the struct physical_volume, we can just rename this fn to free_pv and
add the code (like we have free_vg fn for VGs).
This is essential for proper format instance ref_count support. We must
use these functions to set the fid everywhere from now on, even the NULL
value!
We'd like to use the fid mempool for text_context that is stored
in the instance (we used cmd mempool before, so the order of
initialisation was not a matter, but now it is since we need to
create the fid mempool first which happens in create_instance fn).
The text_context initialisation is not needed anywhere outside the
create_instance fn so move it there.
Format instances can be created anytime on demand and it contains
metadata area information mostly (at least for now, but in the future,
we may store more things here to update/edit in a PV/VG). In case we
have lots of metadata areas, memory consumption will rise. Using cmd
context mempool is not quite optimal here because it is destroyed too
late. So let's use a separate mempool for format instances.
Reference counting is used because fids could be shared, e.g. each PV
has either a PV-based fid or VG-based fid. If it's VG-based, each PV has
a shared fid with the VG - a reference to VG's fid.
Makes the code more readable and has a smaller number of memory
accesses thus it's small optimisation as well.
For _get_token() optimize number parsing. Check for '.' char only
if it's not a digit. Move pointer incrementation into one place.
For _eat_space() check only p->te for '\0' in skipping of comment line.
Avoid check for '\0' when we know it is space. Also master while loop
doesn't need checking p->tb for '\0'. We just need to check p->tb
isn't already at the end of buffer. This could give 'extra' loop cycle
if we are already there - but safes memory access in every other case.
Add _lv_postorder_vg() - for calling _lv_postorder() for every LV from VG.
We use this in 2 places - vg_mark_partial_lvs() and vg_validate()
so make it as a one function.
Benefit here is - to use only one cleanup code and avoid
potentially duplicate scans of same LVs.
Copy this file as '.gdbinit' to your home directory or your working
directory. It adds the following commands to gdb:
- first_seg
- lv_status
- lv_status_r
- lv_is_mirrored
- seg_item
- seg_status
- segs_using_this_lv
You can get a list of these user-defined commands by typing:
(gdb) help user-defined
You can get more information on each command by typing:
(gdb) help <command>
Accelerate validation loop by using lvname, lvid, pvid hash tables.
Also merge pvl loop into one cycle now - no need to scan the list twice.
List scan is stopped when dm_hash_insert fails.
The error message with loop_counter1 is no longer provided - however
the message has been misleading anyway.
dm_hash binary functions takes void* key - so there is no need to cast
pointers to char* (also the hash key does not have trailing '\0').
This is slight API change, but presents no change for the API user side
it just allows to write code easier as the casting could be removed.
Create new function alloc_vg() to allocate VG structure.
It takes pool_name (for easier debugging).
and also take vg_name to futher simplify code.
Move remainder of _build_vg_from_pds to _pool_vg_read
and use vg memory pool for import functions.
(it's been using smem -> fid mempool -> cmd mempool)
(FIXME: remove mempool parameter for import functions and use vg).
Move remainder of the _build_vg to _format1_vg_read
Fixing few issues:
struct clvm_header contains 'char args[1]' - so adding '+ 1' here
for message length calculation is 1 byte off.
Message with last byte uninitialized is then passed to write function.
Update also related arglen.
Initialise xid and clintid to 0.
Memory allocation is checked for NULL
When the ->params string is empty - memory access is made on the byte
before allocated buffer (catched by valgrind) - in the case it would
constain 0x20 - it would even overwrite this buffer.
So fix by checking len > 0 before doing such access.
Also slightly optimise this loop from repeated strlen call.
lvseg_segtype_dup used memory pool vg memory pool for strind duplication.
However this one gets released before reporting happens so the command like:
pvs -o segtype
prints data from already released memory pool. Thanks to the fact there
is not much allocation happing after the VG is released, the memory
stays unmodified and correct result is printed.
Fix adds support for mempool passed parameter (like other similar
query commands) and uses dm_report memory pool for string duplication.
Failure on pipe kills lv command and produces memory stacktrace.
Add 'true' for empty grep/sed to avoid generating stacktrace.
This should help with long time (7min) for this test on Lenny testing maching
(mimage was filler with stacktrace and each word was used as LV name).
This lvconvert is now able to complete operation automatically without question.
So remove 'echo y |' - it's breaking pipe on some bash versions.
i.e.:
+ echo y
./t-lvconvert-mirror-basic.sh: line 104: echo: write error: Broken pipe
while command has completed this operation succesfully
While STRIPE_SIZE_LIMIT * 2 is basically UINT_MAX, 32bit integer
value can already overflow durin arg size parsing.
(This really happens in test where --stripesize 4294967291 is used,
in s390x uint overflow and this test is ineffective.)
- returned char not needed to be explicitly const
- warn if pipe() fails in clvmd (more fixes here needed for error paths...)
- assign (and ignore) read() output in drain buffer
Currently the test are not properly working with new code.
Use settings for the previous behaviour.
FIXME: update tests to properly test new allocation policies.
We allow writing non-orphan PVs only for resize now. The "orphan PV" assert
in pv_write fn uses the "allow_non_orphan" parameter to control this assert.
However, we should find a more elaborate solution so we can remove this
restriction altogether (pv_write together with vg_write is not atomic, we
need to find a safe mechanism so there's an easy revert possible in case of
an error).
There is a lot to test.
Two new config settings added that are intended to make the code behave
closely to the way it did before - worth a try if you find problems.
Add a small fix that preserves pe_start for lvm1 PVs when being converted.
(this fix needs to be replaced with something more clever, but let's have this working now)
If the PV is already part of the VG (so the pv->fid == vg->fid), it makes no
sense to attach the mdas information from PV to a VG. Instead, we read new
PV metadata information from cache and attach it to the VG fid.
This function also sets a reference to a new VG format instance for all PVs
that are part of the VG so the PV-VG interconnection is consistent after the
change.
Add supporting functions to work with the format instance and metadata area
structures stored within the format instance. Add support for simple indexing
of metadata areas using PV id and mda order (for on-disk PV only for now, we
can extend the indexing even for other mdas if needed - we only need to define
a proper key for the index).
As the kernel seems to be doing weird things during
mlock -> munlock - allow 1 page locking difference without
warning - and log just debug message for a 1 page difference.
Allocation happens outside critical section probably during
log_warn printing.
Should make tests passing for now.
Fixing some const warnings - with API change in:
int vg_extend(struct volume_group *vg, int pv_count, const char *const *pv_names,
Change is needed - as lvm2api expects const behaviour here.
So vg_extend() is doing local strdup for unescaping.
skip_dev_dir return const char* from const char* vg_name.
Rest of the patch is cleanup of related warnings.
Also using dm_report_filed_string() API change to simplify
casting in _string_disp and _lvname_disp.
As dm_report_field_string() doesn't modify content of data pointer,
it can be marked as const.
It's slight API change - but doesn't require any change on the user side
and supports wider range of arguments without const casting.
(i.e. we may use as paramater const lv struct this way: &lv->name)
New strategy for memory locking to decrease the number of call to
to un/lock memory when processing critical lvm functions.
Introducing functions for critical section.
Inside the critical section - memory is always locked.
When leaving the critical section, the memory stays locked
until memlock_unlock() is called - this happens with
sync_local_dev_names() and sync_dev_names() function call.
memlock_reset() is needed to reset locking numbers after fork
(polldaemon).
The patch itself is mostly rename:
memlock_inc -> critical_section_inc
memlock_dec -> critical_section_dec
memlock -> critical_section
Daemons (clmvd, dmevent) are using memlock_daemon_inc&dec
(mlockall()) thus they will never release or relock memory they've
already locked memory.
Macros sync_local_dev_names() and sync_dev_names() are functions.
It's better for debugging - and also we do not need to add memlock.h
to locking.h header (for memlock_unlock() prototyp).
Add configurable option to define minimal size of
of block device usable as a PV.
pv_min_size() is added to lvm-globals and it's being
initialized through _process_config.
Macro PV_MIN_SIZE is unused and removed.
New define DEFAULT_PV_MIN_SIZE_KB is added to lvm-global
and unlike PV_MIN_SIZE it uses KB units.
Should help users with various slow devices attached to the system,
which cannot be easily filtered out (like FDD on /dev/sdX):
https://bugzilla.redhat.com/show_bug.cgi?id=644578
results in clvmd deadlock
When a logical volume is activated exclusively in a cluster, the
local (non-cluster-aware) target is used. However, when creating
a snapshot on the exclusive LV, the resulting suspend/resume fails
to load the appropriate device-mapper table - instead loading the
cluster-aware target.
This patch adds an 'exclusive' parameter to the pertinent resume
functions to allow for the right target type to be loaded.
As functions compiled within this define are apparently stil part of the public API,
(though lvm2 code is never using them unless this define is used for compilation),
keep functions available in the code for now -> revert.
Fix regresion from 2.02.75 speedup - so currently crc32 is a little bit
more complicated on big-endian CPU as the uint32_t needs to be shifted
on here.
Make configurable default behaviour how to deal with device node creates.
With udev system natural options should be 'resume'.
For older systems where user expect there is node in /dev/mapper immediately
after dmsetup create --notable - use 'create'
FIXME:
Code needs fixing passing this flag through udev cookie.
activated.
In order to achieve this, we need to be able to query whether
the origin is active exclusively (a condition of being able to
add an exclusive snapshot).
Once we are able to query the exclusive activation of an LV, we
can safely create/activate the snapshot.
A change to 'hold_lock' was also made so that a request to aquire
a WRITE lock did not replace an EX lock, which is already a form
of write lock.
Add new function dm_task_set_add_node() to select between 2 types
of node creation in device directory.
DM_ADD_NODE_ON_RESUME is now default and ensures node is created on
resume. Old original behavior is accessible with DM_ADD_NODE_ON_CREATE.
In this case node would be created during dmsetup create --notable.
For the user 2 new options for dmsetup create are added:
[{--addnodeonresume | --addnodeoncreate }]
Properly working node creation on resume is needed for proper operation
stacking and ability to correctly check in which state the device should
after whole udev transation.
Remove temporaly added fs_unlock() calls to fix clmvd usablity.
Now when the message passing is properly working - they are no longer needed.
Simplify no_locking check for VG unlock - as message is always send
for all targets - clustered & non-clustered.
Thanks to CLVMD_CMD_SYNC_NAMES propagation fix the message passing started
to work. So starts to send a message before the VG is unlocked.
Removing also implicit sync in VG unlock from clmvd as now the message
is delievered and processed in do_command().
Also add support for this new message into external locking
and mask this event from further processing.
With the ability to stack many operations in one udev transaction -
in same cases we are adding and removing same device at the same time
(i.e. deactivate followed by activate).
This leads to a problem of checking stacked operations:
i.e. remove /dev/node1 followed by create /dev/node1
If the node creation is handled with udev - there is a problem as
stacked operation gives warning about existing node1 and will try to
remove it - while next operation needs to recreate it.
Current code removes all previous stacked operation if the fs op is
FS_DEL - patch adds similar behavior for FS_ADD - it will try to
remove any 'delete' operation if udev is in use.
For FS_RENAME operation it seems to be more complex. But as we
are always stacking FS_READ_AHEAD after FS_ADD operation -
should be safe to remove all previous operation on the node
when udev is running.
Code does same checking for stacking libdm and liblvm operations.
As a very simple optimization counters were added for each stacked ops
type to avoid unneeded list scans if some operation does not exists in
the list.
Enable skipping of fs_unlock() (udev sync) if only DEL operations are staked.
as we do not use lv_info for already deleted nodes.
This is better way how to fix clustered synchronization with udev.
As the code for message passing needs fixed - put currently
fs_unlock() after every active/deactive command in clvmd to
ensure nodes are properly created in time.
It would be most useful to add "dmsetup ls --tree" to the commands run.
This command helps in answering the question "which devices are actually
underneath a given LV?"
Although the info is available with other existing dmsetup commands,
adding this command gives a much clearer summary of complex setups.
Here's an example of an LVM mirror, with mirror images on partitions
created on top of multipath devices. The multipath devices are on
simple block devices. As you can see, it is easy to see the stacking
from the "dmsetup ls --tree" output:
vgmpathtest-lvmpathmir (253:14)
├─vgmpathtest-lvmpathmir_mimage_1 (253:13)
│ └─mpath5p1 (253:5)
│ └─mpath5 (253:2)
│ ├─ (8:16)
│ └─ (8:0)
├─vgmpathtest-lvmpathmir_mimage_0 (253:12)
│ └─mpath6p1 (253:6)
│ └─mpath6 (253:3)
│ ├─ (8:48)
│ └─ (8:32)
└─vgmpathtest-lvmpathmir_mlog (253:11)
└─mpath7 (253:4)
├─ (8:80)
└─ (8:64)
VolGroup00-LogVol01 (253:1)
└─ (202:2)
vgtest-lvmir (253:10)
├─vgtest-lvmir_mimage_1 (253:9)
│ └─ (7:1)
├─vgtest-lvmir_mimage_0 (253:8)
│ └─ (7:0)
└─vgtest-lvmir_mlog (253:7)
└─ (7:3)
VolGroup00-LogVol00 (253:0)
└─ (202:2)
But it is much harder to see the stacking with only the commands today
("dmsetup info", "dmsetup status", and "dmsetup table"). We could
piece together the stacking from "dmsetup table" but it requires
further processing (take output from "dmsetup info to get
map name to major/minor, then parse "dmsetup table", etc).
There was no effect from having this wrong yet, because the
tree of callers only ever cared about the answer to the first
condition (!response), which determines whether a lock is
held or not. Correct responses, however, are needed soon.
Instead of implicitly syncing udev operation in clustered and
file locking code - call synchronization directly in lock_vol() when
the operation unlocks VG
The problem is missing implicit fs_unlock() in the no_locking code.
This is used with --sysinit on read-only filesystem locking dir.
In this case vgchange -ay could exit before all udev nodes are properly
synchronised and may cause problems with accessing such node right after
vgchange --sysinint command is finished.
Add test case for vgchange --sysinit.
As the option 'set -e -o pipefail' is very sensite on pipe breaking
stop using '-q' for grep commands.
Otherwise this command (with large enough table) would fail:
dmsetup table | egrep -q
with exit code 141 (128 + SIGPIPE)
As Peter suggested, he prefers to keep '-o pipefail' - so make sure all
piped commands will read the whole output and will not exit too early.
This is to avoid any scanning and processing of DM devices while they are in
suspended state (e.g. a rename while the device is suspended - a CHANGE event
is generated!). Otherwise, any scanning in the rules could end up with locking
the calling process until the device is resumed and so we don't receive a
notification about udev rules completion until then (and that effectively
locks out the process awaiting the notification!).
However, we still keep 'disk' and any 'subsystem' related udev rules running.
We trust these and these should check themselves whether a device is suspended
or not, not trying to run any scanning if it is.
strncpy (which check each byte for \0) is not need as we always copy
the length size - so using memcpy is a bit cheaper.
Add missing log_error message for failed allocation.
Fix assert abort of dmsetup (when compiled with pool debug)
dmsetup splitname --nameprefixes --noheadings --rows gvg-a2
Move pool begin in the inner loop - otherwise it would using
already 'ended' pool object.
return lockspace reference (even if lockspace already exists)
and thus increases DLM lockspace count. It means that after
clvmd restart the lockspace is still in use.
(The only way to clean environment to enable clean cluster
shutdown is call "dlm_tool leave clvmd" several times.)
Because only one clvmd can run in time, we can use simpler logic,
try to open lockspace with dlm_open_lockspace() and only if it fails
try to create new one. This way the lockspace reference doesn not
increase.
Very easily reproducible with "clvmd -S" command.
Patch also fixes return code when clvmd_restart fails and fixes
double free if debug option was specified during restart.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=612862
Improve condition within lock_vol so we are not calling extra unlock
if the volume just has been deactivated.
Patch uses lck_type and replaces negative 'and' condition to more
readable 'or' condition.
Few missing strace traces added.
As sync_local_dev_names() cannot be called within activation context,
add new parametr which allows to select if the sync call is needed
before executing new command.
Instead of checking $LOCAL_CLVMD set during some 'aux' execution which
doesn't seem to be propagated to this shell - check for existance of pid
file of clvmd process - so this test is skipped in singlenode cluster test.
Harness seems to be able to busyloop in while cycle and not moving forward
for certain buffer - so check whethere there was some progress.
This fix allows to continue after failed cluster test.
Fix gcc warning for hiding global variable 's' -> sig.
Stop calling fs_unlock() from lv_de/activate().
Start using internal lvm fs cookie for dm_tree.
Stop directly calling dm_udev_wait() and
dm_tree_set/get_cookie() from activate code -
it's now called through fs_unlock() function.
Add lvm_do_fs_unlock()
Call fs_unlock() when unlocking vg where implicit unlock solves the
problem also for cluster - thus no extra command for clustering
environment is required - only lvm_do_fs_unlock() function is added
to call lvm's fs_unlock() while holding lvm_lock mutex in clvmd.
Add fs_unlock() also to set_lv() so the command waits until devices
are ready for regular open (i.e. wiping its begining).
Move fs_unlock() prototype to activation.h to keep fs.h private
in lib/activate dir and not expose other functions from this header.
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating config tree inside the function allow config tree to
be passed as parameter - usable later for caching.
If some allocation for peristent filter fails its memory reference
was lost, fix it by calling filter's destructor.
Fix log_error messages for failing allocation.
Variable 'ret' assigned from _do_event() was actually not used and replaced with next
assignment without any read of the returned value.
Code is reformated - so the error path is put in the if() branch and normal
code is put after the 'if' together with FIXME comment.
FIXME lowprio: logging needs to be fixed in this code,
- multiple log_errors are printed, stacks are missing...
a mirror image (or removing a log while adding a mirror). Advise the
user to use two separate commands instead.
This issue become especially problematic when PVs are specified, as they
tend to mean different things when adding vs removing. In a command that
mixes adding and removing, it is impossible to decern exactly what the
user wants.
This change prevents bug 603912.
- somewhat neater, more consistent and more readable output
- possible to set any lvm.conf value: aux lvmconf "section/key = value"
- LVM_TEST_NODEBUG to suppress the (lengthy) "## DEBUG" output
- back-substitution on test output ($TESTDIR/$PREFIX -> @TESTDIR@/@PREFIX@)
- support code moved from test/ to test/lib/ --> less clutter
Checking for vg being != NULL in this place is not needed.
Pointer vg is already dereferced in this function above this code line.
Also this internal function _read_pv is always called with valid 'vg' pointer.
Call for pthread_join() does not set errno value even though return values
looks like that. For now assign errno from return value and still use
strerror() to print some error message as this seems to be commonly used.
Add also log_sys_error() message for error close of local pipe.
Add checks for clonning allocation a fail-out when something is
not allocated correctly.
Also move var declaration to the begining of the function
and fix log_error messages.
As 'const' types are also passed to macro dm_list_struct_base -
keep offset calculation with const char pointers.
Fixes several gcc constness warnings.
As const segment_type or const format_type are never released
use their non-const version and remove const downcast from dm_free calls.
This change fixes many gcc warnings we were getting from them.
Change API interface to accept even completely const array patterns.
This should present no change for libdm users and allows to pass
pattern arrays without cast to const char **.
To have better control were the config tree could be modified use more
const pointers and very carefully downcast them back to non-const
(for config tree merge).
As ternary operator has lower priority then add operation, this check
was not doing what seemed to be expected.
So enclose the test in braces and check for NULL in *buf.
We need to be sure that /var/run and /var/lock is always there.
(E.g. these two directories could be using tmpfs which then loose
all the content after reboot.)
Detect existence of new SELinux selabel interface during configure.
Use new dm_prepare_selinux_context instead of dm_set_selinux_context.
We should set the SELinux context before the actual file system object creation.
The new dm_prepare_selinux_context function sets this using the selabel_lookup
fn in conjuction with the setfscreatecon fn. If selinux/label.h interface
(that should be a part of the selinux library) is not found during configure,
we fallback to the original matchpathcon function instead.
Set cmd->independent_metadata_areas if metadata/dirs or disk_areas in use.
- Identify and record this state.
Don't skip full scan when independent mdas are present even if memlock is set.
- Clusters and OOM aren't supported, so no problem doing the proper scans.
Avoid revalidating the label cache immediately after scanning.
- A simple optimisation.
Support scanning for a single VG in independent mdas.
- Not used by the fix but I left it in anyway as later patches might use it.
This reset of vgmem pointer causes access of already released memory.
(_vg_make_handle allocates vg from vgmem pool itself - which is a bit tricky)
Interestingly this memory fault was missed by our test suite.
Add log_error message for lv_info failure and exit from futher
processing.
Replace 'leg' occurence in debug message with 'image' which
is used in other messages.
Patch adds extra check for lv_name not being NULL.
Test avoids unneeded strlen call for this case.
Otherwise there is no functional change as test would fail on
size_t comparation even for NULL lv_name (thus there is no risk
of NULL dereference when taking 'true' if branch.
Add test for NULL before passing uuid as src argument to memcpy.
As memcpy function is declared as function not accepting NULL.
Though we pass NULL only with zero length so this patch presents
no functional change to the code.
Add test for NULL from dm_poll_create.
Reorder dm_pool_destroy() before file close and add label out:.
Avoid leaking file descriptor if the allocation fails.
Nicely hidden memory leak in outf macro error path.
This macro is using out_text() and does automagical return_0.
That would leak tag_buffer allocated memory.
As there was same code for tags output - create _out_tags() function.
Set vg to NULL after releasing it as the following memlock() test may
lead to goto for the second call of vg_release() with the already
released vg pointer.
LCK_CACHE is defined as 0x100 so it cannot be passed through
unsigned char parameter - remove it from the sprintf code.
If the LCK_CLUSTER should be printed here - lot of code need
to be reworked - so adding FIXME comment.
was lacking the (vgmem) pool. We now create that pool. There is at least one
more such VG (_dummy_vg) which is pool-less. I am not sure what is the right
way to go about this, but this is currently necessary to fix a segfault
introduced by using vgmem in the reporter in Dave's lvseg lvm2app patches.
Signed-off-by: Petr Rockai <prockai@redhat.com>
Similar to 'get' property internal functions.
Add specific 'set' function for vg_mda_copies.
Signed-off-by: Dave Wysochanski <wysochanski@pobox.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Fix for the last commit as $MOUNTED is not only used as bool flag,
but also store mounted location for remount - so parsing output
from mount differently then from /proc/mounts.
Prefix calls of 'tunefs' tools with LANG=C to be sure we always do get
some nonlocalized strings.
Avoid using forced 'resize2fs' for cleanly unmounted filesystems and
run regular fsck -f for this case as required by resize2fs.
'fsadm check' uses date difference for extX filesystems between
the last mount and last check of 'fsck -f' execution and if the mount
was later run 'fsck' with -f so resize2fs is happy and user does not
need to pass '-f' flag.
As util-linux package seems to give all the time different names,
try harder to figure out, where is the given lv possible mounted
and scan /proc/mounts and if not found there, test also 'mount' output.
/dev/dm-xxx
/dev/mapper/vg-lv
/dev/vg/lv
All of them could be used different combination in /proc/mount and mount output.
Patch fixes regression for older systems where new detection code failed to
find valid combination.
Fixing warning introduced by 'include make.tmpl' commit.
Produced this warning:
Makefile:29: warning: overriding commands for target `distclean'
../make.tmpl:366: warning: ignoring old commands for target `distclean'
Updated patch from Florian Haas from Linux-HA project.
User needs to 'configure --enable-ocf' to get file installed
by 'make install' target by default.
User can also use 'make install_ocf' to get only ocf files installed.
With disabled (default) ocf support - no ocf files are installed.
FIXME: ocf installation path needs to be kept in sync with pacemaker.
find better way and possible also better location.
Patch updates exec_cmd() and adds 3rd parameter with pointer for
status value, so caller might examine returned status code.
If the passed pointer is NULL, behavior is unmodified.
Patch allows to confinue with lvresize if the failure from fsadm check is
caused by mounted filesystem as many of filesystem resize tools do support
online filesystem resize. (originally user had to use flag '-n' to bypass
this filesystem check)
Return status code 3 for fsadm check of mounted filesystem - used later with
lvresize update patch to better support online filesystem resize.
Also makes a more consistent user interruption and returns status code 2
in this case.
Fix memory leak of field_id in _output_field function.
There's been a patch added recently to use dynamic allocation for metadata
tags buffer to remove the 4k limit (for writing metadata out). However, when
using reporting commands like vgs and lvs, we still need to fix libdm reporting
functions themselves to support such long outputs. So the buffer used in those
reporting functions is dynamic now.
The patch also includes a fix for field_id memory leak which was found in
the _output_field function.
The management threads (main_loop, the socket thread) could close a single fd
twice in a row sometimes. At least one other thread can be running at the same
time as the threads doing the double close. That one running thread also
happens to do some IO (namely, open /proc/devices, read from it, close it). If
there was enough "demand" for the local socket, this could happen:
- a connection to clvmd is about to finish, let's say the fd is 13 (it often
happens to be in my test script, don't ask why)
- the local_sock thread calls close(13)
- the lvm thread calls open("/proc/devices"...) and gets 13
- the main_loop thread calls close(13) [OOPS!]
- new connection arrives, and is accept'd by a (new) local_sock thread
- the accept gives an fd of 13 (since it's the lowest free fd at this point)
- the lvm thread gets around to read from it's /proc/devices handle... 13,
again
- the lvm thread hangs forever trying to read from the socket instead of
/proc/devices
Signed-off-by: Petr Rockai <prockai@redhat.com>
Reviewed-by: Milan Broz <mbroz@redhat.com>
Function pull_stateo() checks for NULL 'buf' - but return for this error
path was missing. cmirror code never calls this function with NULL 'buf',
so this fix has no effect on current code base, but makes clang happier.
It's quite new feature which is not supported by older compilers.
So until some better macros are introduced into LVM code - hotfix current
compilation problems and compile this code only for __clang__ defining compilers.
Simultaneous -a and --refresh is not valid.
poll+monitor are valid together with or without -ay* (but not with -an*)
No longer print polling results summary if no LVs in the VG were polled.
Add a generic LV property function to lvm2app, similar to VG function.
Return lvm_property_value and require caller to check 'is_valid' flag
and lvm_errno() for API error.
Add a generic PV property function to lvm2app, similar to VG function.
Return lvm_property_value and require caller to check 'is_valid' flag
before using the value. If 'is_valid' is not set, then lvm_errno()
should be used to obtain the specific error.
Add a generic VG property function to lvm2app. Call the internal library
vg_get_property() function. Strings are dup'd internally.
Rework lvm_vg_get_property to return lvm_property_value and require caller
to check 'is_valid' flag. If !is_valid, the caller can check lvm_errno()
for the specific error.
Create a 'get_property' function, local to lvm2app, that factors out
most of the common code that copies the components of lvm_property_type
into lvm_property_value. This allows for a 1-line function for each
of the generic property functions exported by lvm2app.
Makes clang happier as it covers all code paths and avoids NULL pointer
dereference through the 'com' pointer (which is NULL by default static
initialisation).
Reported by clang as: Argument with 'nonnull' attribute passed null
Reuse the result of the last strchr() call - make sure, 'st' point is not
null for the next strchr() call.
We cast (char*) to (uint32_t*) that changes alignment requierements.
For our case the code has been correct as alloca() returns properly
aligned buffer, however this patch make it cleaner and more readable
and avoids warning generation.
A merged snapshot's DM device is made to use the "error" target as part
of lvm's transaction to merge a snapshot. This snapshot merge use-case
aside, any device using the error target shouldn't be scanned.
Based on review comments, rename a few fields in lvm_property_type.
In particular, change 'is_writeable' to 'is_settable', which is
more intuitive to the intent of the bitfield (a 'set' function
exists for this field/property). Also, remove the char array
for 'id' - unnecessary as we can just use the string passed in
to do the strcmp. Finally rename the union members from n_val
to 'integer' and 's_val' to 'string'.
The signalling code (pthread_cond_signal/pthread_cond_wait) in the
pre_and_post_thread was using the wait mutex (see man pthread_cond_wait)
incorrectly, and this could cause clvmd to deadlock when timing was
right. Detailed explanation of the problem follows.
There is a single mutex around (L for Lock, U for Unlock), a signal (S) and a
wait (W). C for pthread_create. Time flows from left to right, each arrow is a
thread.
So first the "naive" scenario, with no mutex (PPT = pre_and_post_thread, MCT =
main clvmd thread; well actually the thread that does read_from_local_sock). I
will also use X, for a moment when MCT actually waits for something to happen
that PPT was supposed to do.
MCT -----C ------S--X-----S----X----------------------S------XXXXXXXXX
| everything OK up to this --> <-- point...
PPT -----WWW-----WWWW------------------------------WWWWWWWWWWWWW
Ok, so pthread API actually does not let you use W/S like that. It goes out of
its way to tell you that you need a mutex to protect the W so that the above
cannot happen. *But* if you are creative and just lock around the W's and S's,
this happens:
MCT ----C-----LSU----X-----LSU----X------------LSU------XXXXXXX
|
PPT ---LWWWU-------LWWWWU-----------------------LWWWWWWWWW
Ooops. Nothing changed (the above is what actually was done by clvmd before
this satch). So let's do it differently, holding L locked *all* the time in
PPT, unless we are actually in W (this is something that the pthread API does
itself, see the man page).
MCT ----C-----LSU------X---LSU---X-----LLLLLLLSU----X----
| (and they live happily ever after)
PPT L---WWWWW---------WWWW----------------W----------
So W actually ensures that L is unlocked *atomically* together with entering
the wait. That means that unless PPT is actually waiting, it cannot be
signalled by MCT. So if MCT happens to signal it too soon (it wasn't waiting
yet), it (MCT) will be blocked on the mutex (L), until PPT is actually ready to
do something.
to lvm.conf in the activation section: 'snapshot_autoextend_threshold' and
'snapshot_autoextend_percent', that define how to handle automatic snapshot
extension. The former defines when the snapshot should be extended: when its
space usage exceeds this many percent. The latter defines how much extra space
should be allocated for the snapshot, in percent of its current size.
Reorder linked libraries so we better support --as-needed linker flag used
by some distributions (i.e. Gentoo).
Patch suggested by Diego Elio Pettenò <flameeyes <at> gmail.com>
Problem:
When both legs of a mirrored log fail, neither the log nor the parent
mirror can proceed. The repair code must be careful to replace the
log with an error target before operating on the parent - otherwise,
the parent can get stuck trying to suspend because it can't push through
any writes. The steps to replace the log device with an error target
were incomplete and resulted in the replacement not happening at all!
The code originally had all the necessary logic to complete the
replacement task, but was pulled out in a effort to clean-up that
section of code, while fixing another bug:
<offending commit msg>
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
</offending commit msg>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/LVM2/lib/metadata/mirror.c?cvsroot=lvm2&f=h#rev1.130
I've reverted the "clean-up" changes associated with that fix, but not what
that commit was actually fixing.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
page.
Add ->target_name to segtype_handler to allow a more specific target
name to be returned based on the state of the segment.
Result of trying to merge a snapshot using a kernel that doesn't have
the snapshot-merge target:
Before:
# lvconvert --merge vg/snap
Can't expand LV lv: snapshot target support missing from kernel?
Failed to suspend origin lv
After:
# lvconvert --merge vg/snap
Can't process LV lv: snapshot-merge target support missing from kernel?
Failed to suspend origin lv
Unable to merge LV "snap" into it's origin.
Use _even_rand() function instead of floor() in _bitset_with_random_bits().
floor() function is missing in dietlibc (on architectures other than x86).
Moreover using floor() to clip rand results does not assure even result
distribution. _even_rand() uses integer arithmetic only and is designed to
return evenly distributed results.
> Looks OK to me. It took a while to decipher what is the exact meaning of
> the loop in _even_rand (to a non-pseudorandomness-expert) but I am
> fairly comfortable with it now. If I understand this correctly, it
> rejects numbers that come from an "incomplete" slice of the RAND_MAX
> space (considering the number space [0, RAND_MAX] is divided into some
> "max"-sized slices and at most a single smaller slice, between [n*max,
> RAND_MAX] for suitable n -- numbers from this last slice are discarded
> because they could distort the distribution in favour of smaller
> numbers).
Signed-off-by: Przemyslaw Iskra <sparky <at> pld-linux.org>
Reviewed-by: Petr Rockai <prockai <at> redhat.com>
re-add a physical volume that has gone missing previously, due to a transient
device failure, without re-initialising it.
Signed-off-by: Petr Rockai <prockai@redhat.com>
Reviewed-by: Alasdair Kergon <agk@redhat.com>
Try to distinguish between the case of using interactive shell and non
interactive running - different combinations of '-y' and '-p' option
needs to be used for fsck.
Update the way how fsadm detects mounted filesystem.
With udev /dev/dm-XXX paths are now returned - but mount or /proc/mounts
prints names in form of /dev/mapper/vg-lv - so the match was not found.
Fixex RHBZ #638050.
Current solution uses same trick as mount and detects vg-lv name through
/sys where available - this should be reasonable safe.
Instead of calling mount without parameter to get actual mount table,
switch to use /proc/mounts directly.
Fix missing 'dry' execution of lvresize - fixing problem where resize
command were 'dry-run' executed - but lvresize has been executed for real.
Also adapt code slightly to support better recursive execution of fsadm
through lvresize call.
Under certain conditions it was possible to break (^C) fsadm before actually
resizing filesystem, but lvresize which executed fsadm will think resize
was succesful and shrinks partitions with unresized filesystem on it.
Fix by returning error (1) for this case - this stops lvresize from futher
proceding in resize operation.
In other LVM memory structures such as volume_group, the field
used to store flags is called "status", and on-disk fields are called
'flags', so rename the one inside metadata_area to be consistent.
Not only is it more consistent with existing code but is cleaner
to say "the status of this mda is ignored".
Background for this patch - prajnoha pinged me on IRC this morning
about a fix he was working on related to metadataignore when
metadata/dirs was set. I was reviewing my patches from this year
and realized the 'flags' field was probably not the best choice
when I originally did the metadataignore patches.
Current lvm1 allocation code seems to not properly
map segments on missing PVs.
For now disable this functionality.
(It never worked and previous commit just introduced segfault here.)
So the partial mode in lvm1 can only process missing PVs
with no LV segments only.
Also do not use random PV UUID for missing part but use fixed
string derived from VG UUID (to not confuse clvmd tests).
We need to use a similar function for pv and lv properties, so just make
a generic _get_property() function that contains most of the required
functionality. Also, add a check to ensure the field name matches the
object passed in by re-using report_type_t enum. For pv properties,
the report_type might be either PVS or LABEL.
In addition, add 'const' to 'get' functions object parameter, but not
'set' functions. Add _not_implemented_set() and _not_implemented_get()
functions.
Add 'get' functions based on generic macros for VG, PV, and LV.
Add 'get' functions for vg string fields, vg_name, vg_fmt, vg_sysid,
vg_uuid, vg_attr, and vg_tags, and all numeric fields.
Add supporting functions for vg_name, vg_fmt, vg_system_id.
Append "_dup" to end of supporting functions to make clear the strings
are dup'd and to avoid namespace conflict with vg_name.
Add supporting functions for pv_uuid, vg_uuid, and lv_uuid.
Call new function id_format_and_copy. Use 'const' where appropriate.
Add "_dup" suffix to indicate memory is being allocated.
Call {pv|vg|lv}_uuid_dup from lvm2app uuid functions.
This patch addresses code review request to simplify creation of 'attr'
strings. The simplification is done in this separate patch to more
easily review and ensure the simplification is done without error.
Move the creating of the 'attr' strings into a common function so
they can be called from the 'disp' functions as well as the new
'get' property functions.
Add "_dup" suffix to indicate memory is allocated.
Refactor pvstatus_disp to take pv argument and call pv_attr_dup().
This patch is similar to the other patches for pv and vg
functionality, and separates lv functionality into separate
files, concentrating on reporting fields and simple functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out pv functions and data, particularly
related to the reporting fields calculations.
More code could be moved here but for now I'm stopping at reporting
functions 'get' / 'set' functions.
The metadata.[ch] files are very large. This patch makes a first
attempt at separating out vg functions and data, particularly
related to the reporting fields calculations.
Read complete content of /proc/self/maps into one buffer without
realocation in the middle of reading and before doing any m/unlock
operation with these lines - as some of them gets change.
With previous implementation we've read some mappings twice ([stack])
If some lvm1 device is missing, lvm fails on all operations
# vgcfgbackup -f bck -P vg_test
Partial mode. Incomplete volume groups will be activated read-only.
3 PV(s) found for VG vg_test: expected 4
PV segment VG free_count mismatch: 152599 != 228909
PV segment VG extent_count mismatch: 152600 != 228910
Internal error: PV segments corrupted in vg_test.
Volume group "vg_test" not found
Allow loading of lvm1 partial VG by allocating "new" missing PV,
which covers lost space. Also this fake mising PV inform code
that it is partial VG.
https://bugzilla.redhat.com/show_bug.cgi?id=501390
Revert to old glibc behaviour for vsnprintf used in emit_to_buffer fn.
Otherwise, the check that follows would be wrong for new glibc versions.
This caused the rh bug #633033 to be undetected and pass throught the check,
corrupting the metadata!
In certain configurations, we're not under a VG rw lock while trying to write
a new archive file with VG metadata. A common example is using "vgs" while
having the content of backup and archive directories empty. The code scans the
content of these directories and tries to determine the final index that should
be used in archive name. Since we're not under a lock, we can get into a race
while choosing the index which could end up showing errors about not being able
to rename to final archive name. Let's add random number suffix to these archive
file names so we can avoid the race.
For example, when using '--config "backup { ... }"' line, the values from
lvm.conf (or default values) should be overridden. This patch adds
reinitialisation of archive and backup handling on toolcontext refresh
which makes these settings to be applied.
can be opprobriously slow if created with '--nosync'.
One of the ways cluster mirrors coordinate I/O and recovery
amoung the different machines is by the use of the log
function 'is_remote_recovering()' which lets nodes know if
a region they wish to perform a write on is currently being
recovered on another node. If the region is being recovered,
the I/O is delayed.
The 'is_remote_recovering' routine has been optimized to
avoid the deluge of requests that would be issued to the
userspace log server by maintaining a marker of how far
the recovery has gotten. It can then immediately return
'not recovering' if the region being inquired about is
less than this mark. Additionally, if the region of
concern is greater than the mark, the function will
limit the number of transmissions to userspace by assuming
the region /is/ being recovered when skipping the
transmission. This limits the amount of processing
and updates the mark in 1/4 sec time steps.
This patch fixes a problem where 'the mark' is not being
updated because of faulty logic in the userspace log
daemon. When '--nosync' is used to create a cluster
mirror, the userspace log daemon never has a chance
to update the mark in the normal way. The fix is to set
the mark to "complete" if the mirror was created with
the --nosync flag.
For now force it in lvm.conf (otherwise it fails on older systems like RHEL5
where are these options disabled by default).
FIXME: it should test and detect both versions.
may never complete.
If you convert from a linear to a mirror and then convert that
mirror back to linear /while/ the previous (up)convert is
taking place, the mirror polling process will never complete.
This is because the function that polls the mirror for
completion doesn't check if it is still polling a mirror and
the copy_percent that it gets back from the linear device is
certainly never 100%.
The fix is simply to check if the daemon is still looking at
a mirror device - if not, return PROGRESS_CHECK_FAILED.
The user sees the following output from the first (up)convert
if someone else sneaks in and does a down-convert shortly
after their convert:
[root@bp-01 ~]# lvconvert -m1 vg/lv
vg/lv: Converted: 43.4%
ABORTING: Mirror percentage check failed.
to block when a mirror under a snapshot suffers a failure.
The problem has to do with label scanning. When a mirror suffers
a failure, the kernel blocks I/O to prevent corruption. When
LVM attempts to repair the mirror, it scans the devices on the
system for LVM labels. While mirrors are skipped during this
scanning process, snapshot-origins are not. When the origin is
scanned, it kicks up I/O to the mirror (which is blocked)
underneath - causing the label scan (an thus the repair operation)
to hang.
This patch simply bypasses snapshot-origin devices when doing
labels scans (while ignore_suspended_devices() is set). This
fixes the issue.
introduced in commit b16b4d92a7
"Improve various log messages."
fixes a lot of
../include/metadata.h:148: warning: type qualifiers ignored on function return type
set appropriate Required-Start and Required-Stop at configure time.
Reorder the checks for user selected cluster managers to match auto
detected ones, to be consistent in the output.
Add special case for qdiskd that´s started after cman/lock_gulmd for
RHEL-4/RHEL-5.
If pvmove crashed and metadata contains pvmove LV
but without miorrored segments, pvmove --abort
will not repair the situation (and finish wth success!).
Fix it by allowing metadata update if aborting
(thus removing pvmove LV) even if no moved LVs detected.
(Tested on real metadata provided by an lvm user:-)
detected alignment.
NOTE: lvm2 doesn't detect MD 1.2 metadata (now the default on RHEL6) so
for now I'm forcing 1.0 metadata. This was needed to be able to reuse
the existing loop devices but recreate the md device with different
raid0 striping.
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
Add 'get' functions based on the simple macro function definition for a
numeric property.
Add 'get' functions for the following: _vg_extent_count_get,
_vg_free_count_get, _max_lv_get, _max_pv_get, _pv_count_get,
_lv_count_get, _snap_count_get, _vg_seqno_get, _vg_size_get,
_vg_free_get, vg_mda_*.
For size functions, multiply by SECTOR_SIZE to return the value in bytes.
Extend the existing reporting infrastructure definitions and structures
to include a 'get' and 'set' function for each field. We will provide
a 'get' and 'set' function for each of these fields, which will be utilized
by exported lvm2app functions.
Define a default _not_implemented 'get' and 'set' function that just sets
an errno and returns 0. Future patches will actually implement the
specific 'get' and 'set' functions for each property. For read-only
properties, only the 'get' function will be implemented.
Define vg_get_property() function to query a property. We will call
this from a lvm2app function.
Rather than hard code the size of the field, use a #define, so we can re-use.
The #define will be needed in a future patch when we extend the reporting
infrastructure to have 'get' and 'set' functions for each field, allowing
lvm2app functions which query any report field. In order to provide a
generic lookup based on the field id, we will define a type containing this
field id, and thus, we will need to re-use the length of this string as
it's defined inside libdevmapper.h.
The 'id' entries in columns.h are the report field names. Since these are
unique, we'd like to use them in generation of 'get' / 'set' functions.
As a step towards using them for this purpose, remove the explicit double
quotes and use the macro '#' character to add the double quotes back when
placing them into the '_fields' array 'id' member.
Add a 'flags' field to columns.h, and set it to 0 by default.
Define FIELD_MODIFIABLE flag to indicate whether a 'set' function exists
to change the field's value.
In all top vg read functions only LCK_VG_READ/WRITE can be used.
All other vg lock definitions are low-level backend machinery.
Moreover, LCK_WRITE cannot be tested through bitmask.
This patch fixes these mistakes.
For _recover_vg() we do not need lock_flags, it can be only
two of above and we always upgrading to LCK_VG_WRITE lock there.
(N.B. that code is racy)
There is no functional change in code (despite wrong masking
it produces correct bits:-)
One shiny day we should use libblkid here. But now using LUKS is
very common together with LVM and pvcreate destroys LUKS completely.
So for user's convenience, try to detect LUKS signature and allow abort.
This is not only undocumented but is is also in violation with --help
documentation.
Using --yes without --force is useful in pvcreate when it detects
old signature.
pvcreate detects MD and swap signature.
The logic hidden there is not only documented but it is also
user unfriendly. Who invented this logic should run pvcreate
on its own critical MD device to see why;-)
This patch
- creates one function instead of duplication code
- asks if user want to overwrite signature
- allows aborting (!)
(Please note that writing LVM signatute without wiping old
is wrong, it confuses blkid, MD will not work anyway and
swap and LUKS is broken too.)
We can't rely on the fact that udev should prepare the node with right major
and minor number to trigger the module autoloading. We have to take into
account that the node could be missing or it could exist with improper
major and minor number assigned (e.g. from previous kernel versions in
an environment with static nodes and without udev). Make any corrections
if needed!
The lvm repair issues I believe are the superficial symptoms of this
bug - there are worse issues that are not as clearly seen. From my
inline comments:
* If the mirror was successfully recovered, we want to always
* force every machine to write to all devices - otherwise,
* corruption will occur. Here's how:
* Node1 suffers a failure and marks a region out-of-sync
* Node2 attempts a write, gets by is_remote_recovering,
* and queries the sync status of the region - finding
* it out-of-sync.
* Node2 thinks the write should be a nosync write, but it
* hasn't suffered the drive failure that Node1 has yet.
* It then issues a generic_make_request directly to
* the primary image only - which is exactly the device
* that has suffered the failure.
* Node2 suffers a lost write - which completely bypasses the
* mirror layer because it had gone through generic_m_r.
* The file system will likely explode at this point due to
* I/O errors. If it wasn't the primary that failed, it is
* easily possible in this case to issue writes to just one
* of the remaining images - also leaving the mirror inconsistent.
*
* We let in_sync() return 1 in a cluster regardless of what is
* in the bitmap once recovery has successfully completed on a
* mirror. This ensures the mirroring code will continue to
* attempt to write to all mirror images. The worst that can
* happen for reads is that additional read attempts may be
* taken.
Ignore snapshots when performing mirror recovery beneath an origin.
Pass LCK_ORIGIN_ONLY flag around cluster.
Add suspend_lv_origin and resume_lv_origin using LCK_ORIGIN_ONLY.
DM devices were not handled properly on nodes in a cluster that were not
where the splitmirrors command was issued. This was happening because
suspend_lv/resume_lv were being used in a place where activate_lv should
have been used.
When the suspend/resume are issued on (effectively) new LVs, their
'resource' (UUID) is not located in the lv_hash. Thus, both operations
turn into no-ops. You can see this from the output of clvmd from one
of the remote nodes:
<snip>
do_suspend_lv, lock not already held
<snip>
do_resume_lv, lock not already held
'activate_lv' enjoins the other nodes in the cluster to process the lock
and activate the new LV. clvmd output from remote node as follows:
do_lock_lv: resource 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S', cmd = 0x19 LCK_LV_ACTIVATE (READ|LV|NONBLOCK), flags = 0x84 (DMEVENTD_MONITOR ), memlock = 1
sync_lock: 'zMseY7CBuO3Ty09vXlplPAHzD0Y0CovjrTdv0R1VcwggMwPdYhutHErRcwm5Nd2S' mode:1 flags=1
sync_lock: returning lkid 27b0001
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
This can happen with older rules (without support for synthesized events)
that are still part of initrd while using new udev rules in the system itself.
The consequence was that new udev rules incorrectly assumed that not having
DM_UDEV_PRIMARY_SOURCE_FLAG set always means the uevent is synthesized and
inappropriate (device is still not properly activated) and so it should be
ignored. However, initrd is not updated automatically while updating the
libdevmapper/udev rules in the system and so we end up with the rules not
detecting and setting crucial parts in the initrd environment and the rules
in the system that rely on the information that should have been stored in
udev db (which is incorrect in this configuration, of course).
The overall consequence is that the update of libdevmapper/lvm2 without
regenerating the initrd could end up with a boot failure! Ignoring the event
means removing any existing symlinks in /dev!
To fix this, increase udev rules version to make a difference. So from now on,
mark rules without proper support for synthesized events as
DM_UDEV_RULES_VSN="1" and 2 (or higher) if that support is included.
We still need to detect this one! We're not so strict with CHANGE events as
with the ADD events while applying filters in the rules so this one would
pass and it would process the rules prematurely (because it appears *before*
the actual CHANGE event used when resuming a DM device while setting read-only
state at the same time).
clvmd daemon itself does the right thing when invoked as non-root, by
returning 4.
The patch removes the use daemon function from
/etc/rc.d/init.d/functions that´s unnecessary and has th bad habit to
mask the return codes from the real daemon.
Add a simple and generic check to see if clvmd is executed by root or not.
Our stop/reload/restart paths in the init script are complex and not all
the tools involved in the process are guaranteed to return 4 if executed
by non-root against a process that´s running as root (for example kill
-TERM will return -1 and parsing the output to catch the error is
suboptimal at best).
https://bugzilla.redhat.com/show_bug.cgi?id=553381
The new standard in the storage industry is to default alignment of data
areas to 1MB. fdisk, parted, and mdadm have all been updated to this
default.
Update LVM to align the PV's data area start (pe_start) to 1MB. This
provides a more useful default than the previous default of 64K (which
generally ended up being a 192K pe_start once the first metadata area
was created).
Before this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 188.00k 192.00k
After this patch:
# pvs -o name,vg_mda_size,pe_start
PV VMdaSize 1st PE
/dev/sdd 1020.00k 1.00m
The heuristic for setting the default alignment for LVM data areas is:
- If the default value (1MB) is a multiple of the detected alignment
then just use the default.
- Otherwise, use the detected value.
In practice this means we'll almost always use 1MB -- that is unless:
- the alignment was explicitly specified with --dataalignment
- or MD's full stripe width, or the {minimum,optimal}_io_size exceeds
1MB
- or the specified/detected value is not a power-of-2
Introduce --norestorefile to allow user to override the new requirement.
This can also be overridden with "devices/require_restorefile_with_uuid"
in lvm.conf -- however the default is 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We can already detect MD devices internally. But when using MD partitions,
these have "block extended major" (blkext) assigned (259). Blkext major
is also used in general, so we need to check whether the original device
is an MD device actually.
An incorrect fix on July 13, 2010 for an annoyance has caused a regression.
The offending check-in was part of the 2.02.71 release of LVM. That
check-in caused any PVs specified on the command line to be ignored when
performing a mirror split.
This patch reverses the aforementioned check-in (solving the regressions)
and posits a new solution to the list reversal problem. The original
problem was that we would always take the lowest mimage LVs from a mirror
when performing a split, but what we really want is to take the highest
mimage LVs. This patch accomplishes that by working through the list in
reverse order - choosing the higher numbered mimages first. (This also
reduces the amount of processing necessary.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Takahiro Yasui <takahiro.yasui@hds.com>
corruption bug in cmirror. 'dm_bit' is only ever used as a boolean operation
within LVM, but it can return a range of values. If the bit is set, a power of
2 is returned. If the bit is unset, 0 is returned.
'log_test_bit' (a function in the cluster mirror log daemon code) has switched
to using the dm bit operations in rhel6. There are two places in the daemon
code where 'log_test_bit' is not used merely as a boolean, but rather the
return value is used as the return value for the log functions 'is_clean' and
'in_sync' - having assumed that 'dm_bit' was returning 0 or 1 only.
One place the 'in_sync' function is utilized is in 'dm_rh_get_state' - a
function that informs the mirroring code how to treat I/O and which devices to
read/write from. 'dm_rh_get_state' was checking if the return value of
'in_sync' was 1 to determine if the region was DM_RH_CLEAN. Since 'dm_bit'
(and by extension 'log_test_bit' and 'in_sync') was returning powers of 2,
DM_RH_CLEAN was rarely being reported as it should have been. Thinking the
region was out-of-sync, the mirroring code would write only to the primary
device. When the primary device was failed, all of those writes were lost -
leaving the entire mirror corrupted.
udev_sync feature requires semaphores (part of System V IPC) to be configured
in kernel (CONFIG_SYSVIPC). Check whether it is supported and if not, give
a warning message and disable udev synchronisation code automatically to
avoid any further error states and associated problems.
One should use the kernel with System V IPC support enabled or libdevmapper
with udev_sync feature disabled.
all but one mirror leg.
<patch header>
To handle a double failure of a mirrored log, Jon's two patches are
commited, however, lvconvert command can't still handle an error
when mirror leg and mirrored log got failure at the same time.
[Patch]: Handle both devices of a mirrored log failing (bug 607347)
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00009.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00027.html
[Patch]: Handle both devices of a mirrored log failing (bug 607347) -
additional fix
posted: https://www.redhat.com/archives/lvm-devel/2010-July/msg00093.html
commit: https://www.redhat.com/archives/lvm-devel/2010-July/msg00101.html
In the second patch, the target type of mirrored log is replaced with
error target when remove_log is set to 1, but this procedure should be
also used in other cases such as the number of mirror leg is 1. This
patch relocates the procedure to the main path.
In addition, I added following three changes.
- Removed tmp_orphan_lvs handling procedure
It seems that _delete_lv() can handle detached_log_lv properly
without adding mirror legs in mirrored log to tmp_orphan_lvs.
Therefore, I removed the procedure.
- Removed vg_write()/vg_commit()
Metadata is saved by vg_write()/vg_commit() just after detached_log_lv
is handled. Therefore, I removed vg_write()/vg_commit().
- With Jon's second patch, we think that we don't have to call
remove_mirror_log() in _lv_update_mirrored_log() because will be
handled remove_mirror_images() in _lvconvert_mirrors_repaire().
</patch header>
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
The cluster log daemon (cmirrord) is not multi-threaded and
can handle only one request at a time. When a log is stacked
on top of a mirror (which itself contains a 'core' log), it
creates a situation that cannot be solved without threading.
When the top level mirror issues a "resume", the log daemon
attempts to read from the log device to retrieve the log
state. However, the log is a mirror which, before issuing
the read, attempts to determine the 'sync' status of the
region of the mirror which is to be read. This sync status
request cannot be completed by the daemon because it is
blocked on a read I/O to the very mirror requesting the
sync status.
With mirror_log_fault_policy of 'remove' and mirror_image_fault_policy
of 'allocate', the log type of the mirror volume is converted from
'disk' or 'mirrored' to 'core' when all mirror legs but one in a mirror
volume broke.
Keep new_log_count as a number of valid log devices by using log_count
variable for a temporary usage in the first phase of error recovery
in _lvconvert_mirrors_repair().
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Reviewed-by: Petr Rockai <prockai@redhat.com>
There was missing "revert" call in _create_and_load_v4 fn while the preparation
for table load ends up with failure in create/load/resume sequence. Otherwise
we could end up with a device being created, but not table-loaded nor resumed.
Even though the table is not loaded and the device is not resumed at this
stage, we still need to synchronize with udev when calling the revert
"remove" ioctl - there's still a remove uevent generated! The "revert"
code does exactly that.
CMIRRORD_PIDFILE is not defined. This makes the build fail.
Therefore, we need to conditionalize the check for cmirrord
based on if CMIRRORD_PIDFILE is defined.
mirrors, we must also check that the log daemon (cmirrord) is running.
The log module can be auto-loaded, but the daemon cannot be
"auto-started". Failing to check for the daemon produces cryptic
messages that customers have a hard time deciphering. (The system
messages do report that the log daemon is not running, but people
don't seem to find this message easily.)
Here are examples of what is printed when the module is available,
but the log daemon has not been started.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg
Shared cluster mirrors are not available.
[root@bp-01 LVM2]# lvcreate -m1 -l1 -n lv vg -v
Setting logging type to disk
Finding volume group "vg"
Archiving volume group "vg" metadata (seqno 3).
Creating logical volume lv
Executing: /sbin/modprobe dm-log-userspace
Cluster mirror log daemon is not running
Shared cluster mirrors are not available.
Creating volume group backup "/etc/lvm/backup/vg" (seqno 4).
When splitting off mirror images from a mirror, we always take
LVs from the end of a list. For example, if the mirror sub-devices
are lv_mimage_[012], we should select lv_mimage_2 if splitting off
one image. However, lv_mimage_0 was being selected instead.
The problem came from calling '_move_removable_mimages_to_end'
when it was unnecessary to do so. When the user /does/ specify
specific devices to be removed, this function properly moved the
appropriate LVs to the end of the list for extraction. However,
if the user /doesn't/ give any specific PVs, the function should
do nothing. '_move_removable_mimages_to_end' was keying off of
whether 'removable_pvs' was NULL or not and this value was
improperly being populated with the set of all available PVs.
This was causing '_move_removable_mimages_to_end' to completely
reverse the list, which in turn caused us to extract the
hithertofore front-of-the-list LVs.
An unhandled condition allowed the command to terminate
cleanly without a warning. Added a check for the
'--splitmirrors' argument to allow execution to the lower
level function that has the check to see if the user is
trying to split a linear device. You should now see a
message if you try to use --splitmirrors on a linear device.
The main problem with these bugs was that the newly split
off LV was not being suspended properly. This meant that
the memlock count was not being balanced, the DM devices
were not being renamed, and some DM devices which should
have been removed were not.
I've also renamed some of the variables and added comments
to make things clearer as to what is going on. (I can break
this patch in two if it means easier review.)
pvchange: Add --metadataignore description
vgchange: Fix minor formatting
pvcreate: Update metadataignore description to refer to pvchange
lvm.conf: Refer to pvcreate and pvchange for metadata options.
Switch dmeventd to use dm_create_lockfile and drop duplicate code.
Allow clvmd pidfile to be configurable.
Switch cmirrord and clvmd to use dm_create_lockfile.
This should bring less confusion when there are some settings left and
people just forgot about it and then they run into problems. These messages
should give them a hint of what's really going on.
even though there was no log. A simple run through the in-tree test
suite would have caught this. :(
- if (lv_is_mirrored(detached_log_lv) &&
+ if (detached_log_lv && lv_is_mirrored(detached_log_lv) &&
Also, made some cosmetic changes suggested by kabi after my last check-in
(e.g. s/return 0/return_0/ and adding an error message).
A previous check-in added logic to handle the case where both images
of a mirrored log failed. It solved the problem by simply removing
the log entirely - leaving the parent mirror with a 'core' log. This
worked for most cases. However, if there was a small delay between
the failures of the two mirrored log devices, the mirror would hang,
LVM would hang, and no additional LVM commands could be issued.
When the first leg of the log fails, it signals the need for repair.
Before 'lvconvert --repair' is run by dmeventd, the second leg fails.
'lvconvert' would see both devices as failed and try to remove the
log entirely. When it came time to suspend the parent mirror to
update the configuration, the suspend would hang because it couldn't
get any I/O through the mirrored log, which was plugged waiting for
corrective action. The solution is to replace the log with an error
target to clear any pending writes before removing it. This allows
the parent mirror to suspend and make the proper changes.
Pass metadataignore through PV creation / setup paths.
As a result of this cleanup, we can remove the unnecessary setting
of mda_ignore bits inside pvcreate_single(), after call to pv_create.
For now, just set metadataignore to '0' in some places. This is
equivalent to the prior functionality, although the 0 is given
by the caller not hardcoded in _mda_setup() call.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Indent updates.
Use AC_HELP_STRING for help string.
Start help string with lower letter.
Add [] around some default values i.e. [TYPE=internal].
Skip some "" around shell assigment when not needed.
Fix typo --with-device-gid=UID string.
When using vgmetadatacopies value other than "umanaged" (0), prompt
the user if the usage of --metadataignore would change the value of
vgmetadatacopies. The main 2 cases are:
1) pvchange --metadataignore
2) vgextend --metadataignore
We leave the prompt check in the tools, and do not change anything
if the user says 'n'.
Examples:
vgextend --metadataignore y vgtest /dev/loop0
Setting metadataignore will override preferred number of copies of VG vgtest metadata.
Are you sure? [y/n]: y
No physical volume label read from /dev/loop0
Physical volume "/dev/loop0" successfully created
Volume group "vgtest" successfully extended
pvchange --metadataignore y /dev/loop3
Setting metadataignore on /dev/loop3 will override preferred number of copies of VG vgtest metadata.
Are you sure? [y/n]: y
WARNING: Changing preferred number of copies of VG vgtest metadata from 3 to 2
Physical volume "/dev/loop3" changed
1 physical volume changed / 0 physical volumes not changed
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Test the auto-repair capability when we fail committing to an mda
on a new pv adding to a vg. This test should fail until we fix
the auto-repair in this case.
For now, this is just a precaution. Normally, all the other (non-dm) rules
should check DM_UDEV_DISABLE_OTHER_RULES_FLAG and therefore avoid setting
any inotify watches as well. But let's make sure.
Support for final assignment of the "nowatch" rule (the use of ":=") will
appear in next udev release, v160. This should also work in previous udev
versions but the setting won't be sealed so any further OPTIONS="watch" will
always prevail there.
We may want to add more specific "nowatch" rules later if needed.
- If a PV contained empty mdas, the auto-recovery code was not kicking in.
- The 'inconsistent' state was getting lost when metadata was cached so
recovery didn't kick in. But leave the behaviour alone when using
precommitted metadata because of a warning in a confusing FIXME.
In my testing, pvs and vgs didn't repair inconsistent metadata like they
used to do. (How many other tools fail similarly now?)
And there should be no need to cache inconsistent metadata because it is
supposed to get repaired under the protection of a write lock immediately it is
discovered.
This code is in need of a redesign based on first principles.
I still see bugs in this code and this commit is risky.
Rather than attempting to remove all the images of a mirrored
log volume via remove_mirror_images, simply remove the log
if all its devices have failed.
Taka was the first to report that there is still an outstanding
issue with handling this case. I've managed to reproduce it
only very rarely, and am still working on identifying the problem.
Failing to handle the problem rarely is better than not handling
the scenario at all, so I'm checking this in.
Moreover, in current mirror handling, when it calls activate
on removed but suspended detached log this counter drops below zero
and confuses debug log.
When a mirror is being downconverted in a cluster, a series of suspends and
resumes is executed.
With the change to using UUIDs in dev_manager instead of names, the behaviour
has changed with regards to including an _mlog in the deptree of a logical
volume. In the old (pre-UUID-enabled) code, the _mlog would appear in a deptree
of any volume purely based on a name match: a linear volume foo would include
foo_mlog in its dependencies if that happened to exist. This behaviour was
fixed and the mlog is now only included for mirrors.
By a coincidence, this mlog bug had been hiding a different bug in clvmd. When
a mirror is being dismantled (and converted to a linear volume), it is first
suspended as a whole, then later resumed in parts. Nevertheless, the overall
memlock balance is maintained in this operation. The problem kicks in, because
even though the mirror log was suspended as part of the mirror, when the
dismantled mirror is resumed again, it is no longer a mirror and therefore the
mirror log stays suspended. This would not be a problem in itself, since
_delete_lv (from metadata/mirror.c) is called on it subsequently, which does an
activate/deactivate cycle and removes the LV. The activate/deactivate cycle
correctly prompts clvmd to resume the device: however, in doing this, it will
issue an unpaired resume operation (the suspend that caused the mirror log to
be suspended is paired with resuming the dismantled mirror later). We have
concluded that the path in clvmd should never affect memlock_count, since there
should never be an unmatched explicit suspend preceding this resume.
Allow metadataignore flag to be passed in to pvcreate.
Ideally, more refactoring of the mda allocation / initialization
is warranted, but for now, we just add another parameter to 'add_mda'
to take an existing mda ignored flag. We need to do this or pv_write
loses the state of the mda 'ignored' flag before copying and writing
to disk.
Print device name when setting or clearing metadata ignore bit.
Example:
label/label.c:160 /dev/loop2: lvm2 label detected
cache/lvmcache.c:1136 lvmcache: /dev/loop2: now in VG #orphans_lvm2 (#orphans_lvm2)
metadata/metadata.c:4142 Setting mda ignored flag for metadata_locn /dev/loop2.
format_text/text_label.c:318 Skipping mda with ignored flag on device /dev/loop2 at offset 4096
Logging isn't ideal, especially for mda_set_ignore. Ideally we'd
like to display the device name and offset in this case but this
requires a bit more work and a per-format 'mda_description' function
pointer definition (we don't have access to mda_context in
metadata.c).
In preparation to call this from both pvcreate as well as pvchange,
move the guts of metadataignore into a library function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
There's an intermittent failure with vgcfgbackup that seems to have been
introduced with the metadataignore / vgmetadatacopies patchset.
Intermittent failures are often the result of uninitialized data,
so this patch calls zalloc in a few places it might matter.
Update example.conf to describe vgmetadatacopies. Provide an
explanation for the '0' ("unmanaged") value.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Allowing an 'all' and 'unmanaged' value is more intuitive, and
provides a simple way for users to get back to original LVM behavior
of metadata written to all PVs in the volume group.
If the user requests "--vgmetadatacopies unmanaged", this instructs
LVM not to manage the ignore bits to achieve a specific number of
metadata copies in the volume group. The user is free to use
"pvchange --metadataignore" to control the mdas on a per-PV basis.
If the user requests "--vgmetadatacopies all", this instructs LVM
to do 2 things: 1) clear all ignore bits, and 2) set the "unmanaged"
policy going forward.
Internally, we use the special MAX_UINT32 value to indicate 'all'.
This 'just' works since it's the largest value possible for the
field and so all 'ignore' bits on all mdas in the VG will get
cleared inside _vg_metadata_balance(). However, after we've
called the _vg_metadata_balance function, we check for the special
'all' value, and if set, we write the "unmanaged" value into the
metadata. As such, the 'all' value is never written to disk.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Now that we have both --pvmetadatacopies and --vgmetadatacopies,
we need to make sure --metadatacopies gets interpreted correctly.
For pv commands, --metadatacopies should imply --pvmetadatacopies,
and for vg commands, --vgmetadatacopies.
Note: this will change the behavior of vgcreate with --metadatacopies
to be a synonym for --vgmetadatacopies. Previously, --metadatacopies
would apply to any PVs given with vgcreate that needed an implicit
pvcreate. As a result, one small change is needed to one of the nightly
tests - t-vgcreate-usage.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When vgmerge is called we move the mdas from the source to the
destination. With metadata balancing we now have another mda
list, fid->metadata_areas_ignored, so move the mdas on this list
as well.
This patch should not matter as the code is written today. However
we include it for completeness in the case that _vgmerge_single()
is refactored and/or moved into a library function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The check in vg_split_mdas will trigger an error if the 'from' vg
list is empty. However, this might be ok in some instances now
that we have ignored mdas. Relax this check so an error is triggered
only in the case where there's truly no more mdas in the 'from'
vg.
One example of where this makes a difference is with vgreduce.
If we try to vgreduce a PV with un-ignored mdas, this should trigger
the balancing function to un-ignore mdas on another PV in the VG.
However, we don't get to vg_write() before we fail because this
list size check fails, and we see an error message indicating:
"Cannot remove final metadata area ..."
Another example is with vgsplit into a new VG, where the PVs
being moved contain all ignored mdas. We must move the mdas on
fid->metadata_areas_ignored from 'vg_from' to 'vg_to'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The vgextend path calls add_pv_to_vg(). Inside add_pv_to_vg(),
we must ensure we pass the correct mdas list into pv_setup(), as
copies of mdas are placed on the vg->fid list. If we don't place
the mdas on the correct vg->fid list, the various counts may be
incorrect and the metadata balance algorithm will not work when
called from vg_write() path.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Allow parsing of --vgmetadatacopies for vgcreate. Accept
--metadatacopies as a synonym for --vgmetadatacopies.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When a user explicitly sets a new mda ignore value for a PV, we
should update vg_mda_copies accordingly. When the VG is written
out, the user would not want the new ignore state to get lost as
a result of the vg_mda_copies value and logic in the vg_write
path.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Compare the value of the newly added vg_mda_copies field
(--vgmetadatacopies parameter) with the current count of
in-use mdas and ignoring or unignoring mdas as necessary to
get to the target count. Also, as a safety check before
returning, ensure we have at least one mda enabled.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Update logic in vgchange to handle --vgmetadatacopies, allow
--metadatacopies as a synonym to --vgmetadatacopies,
and add these parameters to args.h and commands.h
Forbit both --vgmetadatacopies and --metadatacopies as only
one allowed.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the ability to read/write the vg->mda_copies values
from/to the vg metadata.
If we read the VG metadata and this field does not exist, we set
mda_copies to the default value of 0. Later in the code, we use
this special '0' value to indicate a disable of metadata balancing.
This should preserve existing LVM behavior and ensure metadata balancing
can be turned off should the need arise.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the get and partially implemented set function.
The 'set' function should probably ignore or un-ignore metadata areas
based on new values.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a field to struct volume_group to later implement metadata
balancing:
- mda_copies: target # of non-ignored mdas in the VG; default 0 (do
not control pv 'ignore mdas' bit.
This patch just adds the parameter to the structures with the default
values but does not modify any commands. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds a vgmetadatacopies parameter for metadata balancing.
This parameter provides a simple way for users to create a policy for
placing metadata on PVs automatically by LVM. The behavior is implemented
inside LVM by managing the 'ignore' mda bits. We chose the name
'vgmetadatacopies' as this is a natural extension to the existing parameter
'pvmetadatacopies' / 'metadatacopies' in pvcreate.
This is a first step at VG parameter based metadata balancing. Most users
will probably want to state that they want a certain number of PVs to contain
metadata, and they may be less concerned about a specific number of metadata
copies in the volume group. However, for default values (pvmetadatacopies
is 1 by default), the number of metadatacopies in the volume group, and the
number of PVs with metadata are the same. In the future we could add
vgmetadatacopiespvs to define more specifically the number of pvs in the
VG that contain metadata, but for now we start with this parameter.
Another possible future extension would be to define a specific pv tag
to mark the set of PVs that should be used for metadata balancing. This
tag based approach could be used in conjunction with 'vgmetadatacopies'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Arrange mdas so mdas that are to be ignored come first. This is an
optimization that ensures consistency on disk for the longest period of time.
This was noted by agk in review of the v4 patchset of pvchange-based mda
balance.
Note the following example for an explanation of the background:
Assume the initial state on disk is as follows:
PV0 (v1, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
If we did not sort the list, we would have a commit sequence something like
this:
PV0 (v2, non-ignored)
PV1 (v2, ignored)
PV2 (v2, ignored)
PV3 (v2, non-ignored)
After the commit of PV0's mdas, we'd have an on-disk state like this:
PV0 (v2, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is an inconsistent state of the disk. If the machine fails, the next
time it was brought back up, the auto-correct mechanism in vg_read would
update the metadata on PV1-PV3. However, if possible we try to avoid
inconsistent on-disk states. Clearly, because we did not sort, we have
a greater chance of on-disk inconsistency - from the time the commit of
PV0 is complete until the time PV3 is complete.
We could improve the amount of time the on-disk state is consistent by simply
sorting the commit order as follows:
PV1 (v2, ignored)
PV2 (v2, ignored)
PV0 (v2, non-ignored)
PV3 (v2, non-ignored)
Thus, after the first PV is committed (in this case PV1), on-disk we would
have:
PV0 (v1, non-ignored)
PV1 (v2, ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is clearly a consistent state. PV1 will be read but the mda will be
ignored. All other PVs contain v1 metadata, and no auto-correct will be
required. In fact, if we commit all PVs with ignored mdas first, we'll
only have an inconsistent state when we start writing non-ignored PVs,
and thus the chances we'll get an inconsistent state on disk is much
less with the sorted method.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When we are constructing the vg, we may need to adjust the list of
metadata_areas if there are ignored mdas. At label read time, we
do not read the metadata of ignored mdas, and as a result, they do
not get placed on vg->fid->metadata_areas inside _text_create_text_instance
since lvmcache does not have these areas attached to vginfo->infos.
However, when we're checking the pvids inside _vg_read, after having
read another metadata area from another PV, we do have the opportunity
to update the metadata_area and metadata_areas_ignored lists based
on the read metadata_area. We need accurate mda lists for the reporting
functions that count the ignored mdas, as well as general correctness
of mda balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
With the addition of ignored mdas, we replace all checks for an empty
mda list with a new function to look for either an empty mda list or
ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a helper function to consolidate checking for an empty mdas list
or ignored mdas. Ignored mdas should behave almost identically to
an empty mda list - the metadata areas should not be read or written
to. This function will make it easier to implement metadata balancing
and easier to track pvs with an empty mda list or ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We implement ignore of an mda at label_read time by checking for
the ignore bit, and then skipping the reading of the vgname and
other information in the metadata. This will have an effect similar
to a PV found with no mdas. Thus, it will look like an orphan in the
cache until we scan the rest of the system and find a PV with
metadata, and the mda will not be on the vg->fid->metadata_areas
list so no read/writes will be done to the metadata area.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch just modifies pvchange to call the underlying ignore
functions for mdas. Ensure special cases do not reflect changes
in metadata (PVs with 0 mdas, setting ignored when already ignored,
clearing ignored when not ignored).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Define a new pvs field, pv_mda_used_count, and a new vgs field,
vg_mda_used_count to match the existing pv_mda_count and vg_mda_count.
These new fields count the number of mdas that have the 'ignored' bit
clear (they are in use on the PV / VG). Also define various supporting
functions to implement the counting as well as setting the ignored
flag and determining if an mda is ignored. These high level functions
call into the lower level location independent mda ignore functions
defined by earlier patches.
Note that counting ignored mdas in a vg requires traversing both lists
and checking for the ignored bit on the mda. The count of 'ignored'
mdas then is defined by having the bit set, not by which list the mda
is on. The list does determine whether LVM actually does read/write to
the mda, though we must count the bits in order to return accurate numbers
for the various counts. Also, pv_mda_set_ignored must search both vg
lists for ignored mda. If the state changes and needs to be committed
to disk, the ignored mda will be on the non-ignored list.
Note also in pv_mda_set_ignored(), we must properly manage the mda lists.
If we change the ignored state of an mda, we must change any mdas on
vg->fid->metadata_areas that correspond to this pv. Also, we may
need to allocate a copy of the mda, as is done when fid->metadata_areas
is populated from _vg_read(), if we are un-ignoring an ignored mda.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a second mda list, metadata_areas_ignored to fid, and a couple
functions, fid_add_mda() and fid_add_mdas() to help manage the lists.
These functions are needed to properly count the ignored mdas and
manage the lists attached to the 'fid' and ultimately the 'vg'.
Ensure metadata_areas_ignored is initialized in other formats, even
if the list is never used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Because of the way mdas are handled internally, where a PV in a VG
has mdas on both info->mdas and vg->fid->metadata_areas list, we
need a location independent copy constructor for struct
metadata_area. Break up the existing format-text specific copy
constructor into a format independent piece and a format dependent
piece.
This function is necessary to properly implement pv_set_mda_ignored().
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
A metadata_area is defined independent of the location. One downside
is that there is no obvious mapping from a pv to an mda. For a PV in
a VG, we need a way to start with a PV and end up with an MDA, if we
are to manage mdas starting with a device/pv. This function provides
us a way to go down the list of PVs on a VG, and identify which ones
match a particular PV.
I'm not entirely happy with this approach, but it does fit into the
existing structures in a reasonable way.
An alternative solution might be to refactor the VG - PV interface such
that mdas are a list tied to a PV. However, this seemed a bit tricky since
a PV does not come into existence until after the list of mdas is
constructed (see _vg_read() - we create a 'fid' and attach mdas to it,
then we go through them and attach pvs).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
We'd like to pass in mda_header to vgname_from_mda(). In order to
do this, we need to call raw_read_mda_header() from text_label.c,
_text_read(), which gets called from the label_read() path, and
peers into the metadata and update vginfo cache. We should check
the disable bit here, and if set, not peer into the vg metadata,
thus reducing the I/O to disk.
In the process, move vgname_from_mda() to layout.h, since the fn
only gets called from format_text code, and we need the mda_header
definition from the private layout.h.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This refactoring moves the device open/close up one level to the caller of
_vg_read_raw_area(). Should be no functional change and facilitate future
changes related to metadata balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
First we add a 'flags' field to the location independent
metadata_area structure, and a MDA_IGNORE flag. The
mda_is_ignored and mda_set_ignored functions are added to
manage the flag. Adding the flag and functions gives a
library interface to ignore metadata areas independent of
the underlying location (disk, file, etc). The location
specific read/write functions must then handle the specifics
of what this flag means to the location.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Adding a flag to the 'rlocn' structure in the mda header of the
text format allows us to flip a bit to ignore an area on disk that
stores the metadata via the text format specific mda_header.
This patch defines the flag and access functions to manage the flag.
Other patches will manage the ignore on a format-independent basis,
by using a flag in the metadata_area structure.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Future patches will make use of a specific flag in the on-disk 'raw_locn'
structure to enable/disable metadata areas, and facilitate metadata
balancing.
Note that 'filler' is always set to '0' (see add_mda() - memset),
so use of this area as a non-zero flags field is a safe way to
provide future code features.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The same region size is used for both mirror volume and mirrored
log volume, but when the physical extent size is bigger than region size,
the size of mirror leg for mirrored log is smaller than the region size
and lvcreate command fails.
This patch adjusts a region size of mirrored log to a smaller value of
region size or physical extent size.
[This patch ensures that the region_size of the mirrored log does not
exceed the size of the mirrored log itself, which would violate the
kernel constraint: (region_size <= ti->len).]
Signed-off-by: Takahiro Yasui <takahiro.yasui@hds.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Preload libc.mo file for localized lvm before taking memory lock - this way
we prevent disk access for some error paths in libdm, that prints localized
errno messages while they are still in memory locked state.
the failure of a device that contained both a image of
a mirror and an image of the mirrored log. The order
of the handling of those faults was important (and
wrong), this patch corrects that.
Patch-From: Takahiro Yasui <tyasui@redhat.com>
We can use DM_UDEV_PRIMARY_SOURCE_FLAG to identify the spurious events
and use it as an indication that the device has already been activated before
(and hence we can find this property in udev database).
WARNING: This change requires udev startup script to preserve udev database
from initrd. All the information stored there during activation of devices
is important for the initial "udevadm trigger --action=add" call that is
used in udev startup script. If not done this way, udev startup script needs
to define DM_UDEV_PRIMARY_SOURCE_FLAG=1 property for any ADD events it uses.
- s/Active clustred VG/clustered VG/ (only LV can be active)
- print only active LVs (not all) in status command
(In the lvdisplay form /dev/vg/lv.)
For now, still use awk (already used in clustered_vgs).
https://bugzilla.redhat.com/show_bug.cgi?id=598495
converting from 2-way to 3-way mirror (collapse_mirrored_lv)
was calling '_remove_mirror_images' with the 'remove_log'
parameter set. When the code was put in to fix 599898 to
honor log parameters during conversion, this argument was
suddenly being honored. Thus, when someone would convert from
a 2-way to 3-way mirror, the log would get removed.
'collapse_mirrored_lv' should not be calling '_remove_mirror_images'
with 'remove_log' set.
to 3-way mirror. When conversion operations are performed on
these types of mirrors, log options can be confused/ignored.
In the case of a converting 3-way mirror, we have a top-level
2-way corelog mirror whose legs are 1) a 2-way disk-log mirror
and 2) a linear device. If we wish to convert this 3-way mirror
to a 2-way mirror, the linear device is removed and the extra
top layer is eliminated. If we also wished to convert the disk
log to a core log in the same step, ambiguity creeps in. It is
somewhat obvious what the user wants - a 2-way mirror with a
corelog. However, looking at the top level mirror before
compression, it seems that the mirror already has a core log.
This is why the operation seemed to fail.
This patch simply re-evaluates what mirrored_seg points to after
a compression and then considers the log argument.
This is a fix for bug 599898.
Because execve stops the command loop,
we never receive response (only socket close) for clvmd -S,
so waiting for response here makes no sense.
But if the calling process (clvmd -S) exits too early, connection
is closed from client side, clvmd takes this as an error and
never run restart code.
Ugly hack(TM).
When using clustered mirrors, we need device nodes to be created during
processing of device tree, not at its end like we normally do (we need to
access the nodes in cmirror prematurely). Therefore we use a new flag called
"immediate_dev_node" stored in deptree's load_properties struct to instruct the
device tree processing code to immediately synchronize with udev and flush all
stacked node operations so the nodes are prepared for use.
For now, the immediate_dev_node is used for clustered mirrors during
processing the dm_tree_preload_children code only. We can add more later if
needed.
linux/kdev_t.h even though it wasn't needed. Strangely, it seems
to be causing problems on various architectures (i686) in the
function daemons/cmirrord/functions.c:disk_status_info()->sprintf.
I'm not sure why this is a problem since none of the macros in
kdev_t.h are used in that code, but it certainly doesn't hurt to
pull an unnecessary header and it seems to fix the problem.
Code is mixing up internal DLM and LVM definitions of lock
modes and flags.
OpenAIS and singlenode locking do not depend on DLM but
code currently cannot be compiled without libdlm.h!
LCK_* flags is LVM abstraction, used through all the code.
Only low-level backend (clvmd-cman etc) should use DLM definitions,
also this code should do all needed conversions.
Because there are two DLM flags used in generic code
(NOQUEUE, CONVERT) we define it similar way like lock modes.
(So all needed binary-compatible flags are on one place in locking.h)
(Further code cleaning still needed, though:-)
Introduce lvm_exec_prefix with resolved exec_prefix.
(using same ac_default_prefix as for CLVMD_PATH)
Use lvm_exec_prefix instead of dmeventd_prefix (fixes missing ac_default_prefix)
Note: This patch is rather hot-fix as currently generate code
does not create correct code for make exec_prefix=
- allocate environment dynamically (still missing some limit?)
- try to recover, if destroy failed (do not destroy lvm here) and free memory
- check strdup() return codes
- report failure to log
- do not print NULL in exclusive lock loop
A kernel patch is on its way for 2.6.35 adding support for dm-mod module
autoload. Udev v155 and higher is able to read static node information given
in modules.devname (extracted by depmod before) and will create such nodes
at its start. The first access to such node will load the module automatically
(directly in kernel) before the actual read/write operation is processed.
Activate only the first replicator-dev LV, that activates all other
related LVs from Replicator. In case of error during this activation,
it will not retry again for other heads (less confusing error log).
length(array) is specific to GNU awk and doesn't work in mawk.
Use a return value of "split" function to indicate array size, this is
supported in both gawk and mawk.
This patch fixes the following errors during "make install" when mawk is
installed as a default awk.
mawk: scripts/relpath.awk: line 25: illegal reference to array from
mawk: scripts/relpath.awk: line 25: illegal reference to array to
mawk: scripts/relpath.awk: line 27: illegal reference to array from
mawk: scripts/relpath.awk: line 32: illegal reference to array to
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Adding function _add_partial_replicator_to_dtree() to create
partial tree for Replicator target.
Using dm_tree_node_set_presuspend_node() for Replicator.
As for _process_one_vg() we need similar retry loop for
process_each_lv_in_vg(). This patch retries to process
failed LVs with reopened VGs.
Patch does not add any extra repeated invocations if there is not
found any missing VG during LV processing.
Patch modifes behavior of _process_one_vg().
In the first pass vg_read() collectis for replicator sorted list of
additional VGs during lock_vol().
If any other VG is needed by the replicator and it is not yet opened
then next iteration loop is taken with all collected VGs.
Flag vg->cmd_missing_vgs detects missing VGs.
Introduce struct cmd_vg to store information about needed
volume group name, vgid, flags and the pointer to opened VG.
Keep VGs list in alphabetical order for locking order.
Introduce functions:
cmd_vg_add() add new cmd_vg entry.
cmd_vg_lookup() search cmd_vgs for vg_name.
cmd_vg_read() open VGs in cmd_vgs list.
cmd_vg_release() close VGs in reversed order.
Add pointer to linked list of opened VGs. List temporarily keeps
the information about needed or locked and opened VGs for replicator target.
Also add cmd_missing_vgs flag information for quick check and
also for possible continuos process_each_lv() usage where we need
to detect whether failure has been caused by missing VG or
some other reason.
Adding configure.in support for Replicators.
Adding basic lib lvm support for Replicators.
Adding flags REPLICATOR and REPLICATOR_LOG.
Adding segments SEG_REPLICATOR and SEG_REPLICATOR_DEV.
Adding basic methods for handling replicator metadata.
For deactivation of Replicator check in advance that all heads
have open_count == 0. For this presuspend_node is used as all
head nodes are linking this control node.
Introducing dm_tree_node_set_presuspend_node() for presuspending child
node (i.e. replicator control target) before deactivation of parent node
(i.e. replicator-dev target).
This patch presents no functional change to current dtree - only
replicator target currently sets presuspend node for dev nodes.
Patch adds failed_lvnames to the list of parameters for process_each_lv_in_vg().
If the list is not NULL it will be filled with LV names of failing LVs
during function execution.
Application could later reiterate only on failed LVs.
lvm2app forces applications to start with a volume group name,
open the volume group, then operate on individual pvs. In some
cases the application may want to start with a device name rather
than the volume group name. Today, if an application wants to
do this, it must iterate through all the volume groups to find
the volume group that the specific device is attached to.
These new interfaces allow the application to avoid such overhead.
Bump the lvm2app version number to 3.
Earlier patches added some infrastructure to lookup a vgname from
a pvname. We now can cleanup some of the pvchange and other code
by requiring callers that want to modify some pv property:
1) lookup the vgname by the pvname
2) use the vgname to obtain a vg handle
3) get the pv handle from the vg handle
This should work going forward and be a much cleaner interface,
as we move away from pvs as standalone objects.
Some commands start with a pvname, but we'd like to force users to
start with a vg handle to obtain a pv handle. Our best option seems
to be providing a way to look up the vgname from the pvname, and then
require them to use vg_read/vg_open.
In addition to the pvname lookup function, this patch also provides a
lookup by pvid. The lookup by pvid can be used in conjunction with
lvmcache_get_pvids to process all pvs in the system.
The pvid find function first calls lvmcache_vgname_from_pvid, which may
cause the label to be read if it is not in the cache. If the vgname is
returned is an orphan, we then check to see if there are metadata areas,
and if not, we scan every PV on the system by calling scan_vgs_for_pvs().
In most cases we should not need to do this, and by using the info->mdas
count, we avoid calling pv_read() as prior code did. So this patch is a
bit cleaner and should allow us to refactor more of the pv code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
are active mirrors or snapshots.
We don't have the mechanisms in place to change the device-mapper
tables for those targets that have behavioral differences between
cluster and single machine instances. Allowing users to change
the attribute but not changing the target's behavior can lead to
data corruption.
The following bugs are fixed/avoided by this patch:
235123 - vgchange -c [ny] do not change target types when necessary
289331 - RFE: switching from cluster domain to local domain needs to deactivate volume somehow
289541 - when changing from local to cluster, volumes can not appear to be deactivated
This should avoid various races between dmeventd on multiple nodes
in cluster where one node already repairing device and another
run full scan and locks the device.
the device cache file is dumped both in vgscan and clvmd process.
Unfortunately, clvmd calls lvmcache_label_scan,
it properly destroys persistent filter, but during
persistent_filter_dump it merges old cache content back!
This causes that change in filters is not properly propagated
into device cache after vgscan on cluster.
(Only new devices are added.)
https://bugzilla.redhat.com/show_bug.cgi?id=591861
to a file using log/file, with log/overwrite set and dump this file in
STACKTRACE. The overall effect is that only the command that ran last before
the failure has been triggered will get its debug output logged. This is
similar to how we treat coredumps.
lvconvert testing is now in its own test, t-lvconvert-mirror-basic ... it
doesn't do anything fancy but it does run lvconvert through a lot of
combinations.
I have also merged the remaining t-mirror-lvconvert tests into
t-lvconvert-mirror and abolished the former. The latter will be split again
later into more thematic divisions. (The previous split was rather arbitrary,
may I even say random...)
Use Requires.private: instead of Libs.private:
Use UDEV_PC and SELINUX_PC for Require.private:
It looks like usage of Requires.private is prefered from Libs.private.
However pkg-config documentation is really poor here. But here is
short outcome:
There is a difference in Libs.private: and Requires.private: where
we specify libselinux instead of -lselinux -lsepol.
We leave resolving of query like 'pkg-config --libs --static devmapper'
on taking proper selinux and udev libs to their .pc files instead of
hardcoding them into our .pc file which is might give incorrect answer.
- i.e. dependency of libselinux package might change and we may return
wrong list of linked libraries.
http://bugs.freedesktop.org/show_bug.cgi?id=4738http://err.no/personal/blog/tech/2008-03-25-18-07_pkg-config,_sonames_and_Requires.private
A shortcut for --ignorelockingfailure, --ignoremonitoring, --poll n options
and LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES environment variable used all at
once in initialisation scripts (e.g. rc.sysinit or initrd).
Target install_dm_plugin installs files to libdir/device-mapper.
Target install_lvm2_plugin installs files to libdir/lvm2.
Both targets creates relative links to libdir to keep the code
compatible with current dlopen handling.
Once we will be able to read plugins from subdir, links
could be removed.
(with table provided).
This remove ioctl generates udev events like any other hence it needs to be
synchronized properly as well. Also, add dm task type in debug log when
setting a cookie (for better debugging).
fails, the test will carry on but will issue a warning. The harness detects
such warnings from tests and marks tests that passed with warnings with a
special status.
We can use it even in read-only environment where a try to initialise
file-based locking fails (not to mention other processing related with
lvm2 init). Simply, we want to output the version only, nothing else.
And this should always work.
is active and while it is in-active.
+for i in $(seq 0 4); do
+ for j in $(seq 0 4); do
+ for k in core disk mirrored; do
+ for l in core disk mirrored; do
The testing code still needs some improvement. I'd like to add
the ability to test specifying the PVs to be added/removed during
a convert. It will also be important to test partial PV
specification during down converts (i.e. request to remove more
mirror images than we have provided PVs for).
This rule appeared in udev v152 and it helps us to support spurious events
where we didn't have any flags set (events originated in udevadm trigger
or the watch rule). These flags are important to direct the rule application.
Now, with the help of this rule, we can regenerate old udev db content.
To implement this correctly, we need to flag all proper DM udev events with
DM_UDEV_PRIMARY_SOURCE_FLAG. That happens automatically for all ioctls
generating events originated in libdevmapper.
being able to remove more images from a mirror than the
number of PVs directly specified for removal.
The effort to fix bug 581611 corrected a bug that was unnoticed
at the time. The loop in _remove_mirror_images that looks over
the specified PVs was allowing devices that were previously
counted and moved to the end of the list to be double-counted.
This resulted in the number of devices needed for removal always
being satisfied - even if the user did not specify enough PVs
for removal to satisfy the request. When 581611 was fixed, this
double-counting no longer took place and the result was to remove
only the minimum of the number of PVs specified or the number
that was asked to be removed.
By simply always setting 'new_area_count' (as used to be done
only in the else statement), we return to the previous behavior.
Indeed, this is exactly what the double-counting was allowing
to happen before the fix of 581611.
Allow lv_remove_with_dependencies() to know the top-level LV that was
requested to be removed (otherwise it recurses and we lose context).
A merging snapshot cannot be removed directly but the associated origin
can be. Disallow removal of a merging snapshot unless the associated
origin is also being removed.
(process_each_lv_in_vg) provides it.
Also, removing this check from _lvs_single now allows displaying hidden
LVs that are specifically named on the command line.
There's no need for foreign udev rules to touch LVM reserved devices
(snapshot, pvmove, _mlog, _mimage, _vorigin) even if they happen to
be visible. The same applies for /dev/disk content - no need to create
any content for these devices (and so no need to run any "blkid" etc.).
This also prevents setting any inotify "watch" from udev rules on such
devices that is a source of race conditions (the rules need to honor
DM_UDEV_DISABLE_OTHER_RULES_FLAG for this to work though).
1) Test that the primary mirror image cannot be removed while
the mirror set is sync'ing.
2) Test that you cannot start a second mirror up-convert while
one is already in progress.
The trouble is that if the sync/conversion finishes before the
tests occur, the tests will fail by why of success where there
should have been failure. This means the sync/conversion must
happen very quickly, but this is possible because the test
mirrors we are creating are so small.
In order to decrease the likelyhood of these test failing (or
more correctly, failing to test the right thing), I've increase
the size of the mirrors. It will still be remotely possible that
the tests will fail (by way of failing to test the right thing).
If this continues to happen, more involved mechanisms will need
to be put in place. (Perhaps these will still be created, but
this change should be a remedy until that time.)
These two old-test/regex utils are usable for testing output of regex
processing at core level. For getting them usable configure need to create
Makefile. This is currently disable by default.
Reintroduce split teardown (teardown() calls teardown_devs()) because
t-topology-support.sh only needs the teardown_devs() subset of the full
teardown() between each iteration of the topology tests -- in particular
the $TESTDIR must not get removed between each topology test iteration.
prepare_loop() must return if prepare_scsi_debug_dev() already
established $LOOP.
Also fix (and simplify) the unsafe scsi-debug device discovery in
prepare_scsi_debug_dev().
Ensure we can create devices for use in tests before running the real
tests. This may in various cases, and may involve machine configuration
rather than a failure in some specific test.
For example, if a test is run with LVM_TEST_DIR=/tmp, selinux is enabled,
and the default security context is set to "<<none>>" for /tmp, all the
tests will fail, unable to create devices, since dmsetup will fail, a
result of machpathcon() returning an error code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This version number change reflects the memory handling change
for string-based pv/vg/lv string based attributes.
In addition, when adding support for tags, I forgot to increase
the version number.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Everywhere else in the API the caller can rely on lvm2app taking care of
memory allocation and free, so make the 'name' and 'uuid' properties of a
vg/lv/pv use the vg handle to allocate memory.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We should write metadata into next position in the ring buffer while calling
vgrename and vgcfgrestore. At this code level (_vg_write_raw), we were not able
to determine if this is a rename or not. If yes, then accompanying VG structure
passed here has a new name set, not the old one.
When looking for a location where to put metadata next, we were given a NULL
value because of failed VG name comparison (in _find_vg_rlocn) between the
name in existing metadata and metadata we're just about to write.
This resets the position in the ring buffer, overwriting any existing metadata
(and also incorrectly updates the cache to "orphan" afterwards).
This patch just adds old_name item in struct volume_group that we can check and use
if necessary and detect renames at lower layers as well.
The same applies for vgcfgrestore, but here we're using a special value of
old_name, an empty string, to disable the check with existing metadata totally.
Internally, we used DM names instead of UUIDs while processing event
handlers. This caused problems while trying to vgrename a VG with active LVs
where the names are being changed and so the devices were not found then.
The patch also contains a little bit of refactoring, moving "build_dlid" code
found in dev_manager.c to "build_dm_uuid", now in lvm-string.c (so we have
build_dm_uuid and build_dm_name at one place).
lvm2app needs a link back to the vg in order to use the vg handle for
memory allocations as well as other things. This patch adds the field
to struct physical_volume, and sets pv->vg when reading a vg from disk or
extending a vg by using the helper function previously added,
add_pvl_to_vgs(). Moves and renames are handled with separate code
inside move_pv() and vgmerge(). Add pv->vg check to vg_validate().
A NULL value in pv->vg signifies membership in the orphan VG.
Note though in the case of pv_read() on a device with metadatacopies == 0,
more devices may need to be read for an authoritative answer.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Now that we have library functions to add/delete a pv from the vg->pvs
list, call them from everywhere.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a delete function to manage the vg->pvs list.
NOTE: It may be possible to do further cleanup to these add/del functions
by passing a 'pv' as input instead of 'pv_list'. The pv_list is used for
functions which do allocations (lvcreate) while other places in the code
just manage a list of 'pv' (e.g. import functions, vgextend, etc).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Move the increment of vg->pv_count next to the place where we add to
vg->pvs. It looks safe to do this since the only caller of import_pool_vg()
calls import_pool_pvs() immediately afterward, and there is no way
import_pool_vg() can fail (always returns 1). However, if there's a
memory allocation failure inside import_pool_pvs(), we will end up with
a different count in vg->pv_count that with the original code. In any
case, vg->pv_count should be as close to dm_list_size(&vg->pvs) as
possible, as is the case everywhere else in the code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The dm_list * parameter is unnecessary since we are passing in 'vg'
and the only caller of import_pool_pvs() passes '&vg->pvs' in the
dm_list * parameter. Just use vg->pvs directly in the function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
{core|disk|mirrored}. However, under the options section they are
described as {core|disk} - even though the 'mirrored' argument is
described.
s/'{core|disk}'/'{core|disk|mirrored}'/
Fix unwanted modification of $(top_builddir)/make.tmpl.
Using dependency rules to install rules for udev.
There is minor problem, with concurent usage of builddir
and srcdir could lead to missuse of 10-dm.rules which
could be found in VPATH from different builddir.
However current solution uses intermediate target so
the generated 10-dm.rules exists only for short period of time
during make install execution.
Patch is inspired by Debian's extra patch.
- removes OWNER & GROUP make vars they are parts of INSTALL command.
- adds INSTALL_PROGRAM for executable, uses $(INSTALL)
- adds INSTALL_DATA for non-executable data, uses ($INSTALL)
- adds INSTALL_WDATA for writable non-executable data, uses ($INSTALL)
- adds configure option --enable-write_install - to support
installatin of writable files used by distribution
- replaces usage of ifeq @LIB_SUFFIX@ with $(LIB_SUFFIX)
- installs .a files from static builds without executable flag
- installs .a files to $(usrlibdir) instead of $(libdir)
- installs all static binaries to $(staticdir)
- create .so links for devel package in $(usrlibdir) instead of
$(libdir)
- makes .so and .so.LIB_VERSION files within builddir
- removes VERSIONED_SHLIB and created versioned LIB_SHARED automagicaly
- install LIB_SHARED via install_lib_shared target
- install plugins via install_lib_shared_plugin target
- prints whole 'install' command during installation instead of less
informative "Installing $(something) $(somewhere)"
- install multiple man pages with one INSTALL command
- use DISTCLEAN_TARGETS instead of creating multiple distclean targets
Usage of VPATH makes troubles when used within $(builddir).
Not only source files are being found through VPATH,
but targets as well. (make --debug=v)
Thus if user builds the code in $(srcdir) and also in some $(builddir)
he gets mangled results as some generated files (i.e. .export.sym)
are 'reused' from $(srcdir) instead of $(builddir).
This patch switches to use vpath were we could explicitly name
suffixes that should be looked via vpath - we must take care,
we do not generate files with these suffixes:
.c, .in, .po, .exported_symbols
A user specifying duplicate paths on the cmdline of vgcreate will
get a message similar to the following:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Internal error: Duplicate PV id jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8 detected for /dev/loop3 in vgtest2.
This is caught by vg_validate(), but it would be good to find
this condition earlier in the vgcreate code. add_pv_to_vg()
currently checks by pvname, but does not look for duplcate pvids.
This patch adds the check for duplicate pvids and results in new
error output as follows:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Physical volume '/dev/loop5 (jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8)' listed more than once.
Unable to add physical volume '/dev/loop5' to volume group 'vgtest2'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When moving parts of striped LVs, pvmove wouldn't care about leaving you with
two stripes on the same disk. Now --alloc anywhere is needed for that.
(Tried and gave up on two alternative approaches before the one committed here.)
Small refactor of main places in the code where a pv is added to a
vg into a small function which adds the pv to the list and updates
the vg counts.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Refactor adding to the vg->pvs list and incrementing the count, which
will allow further refactoring. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Simple refactor to mov code that updates the vg extent counts from a
single pv's counts close to the code that adds a pv to vg->pvs and
updates vg->pv_count. No functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
In add_pv_to_vg(), we should only add the pv to vg->pvs after all
internal checks have passed. The check for vg->extent_count exeeding
maximum was after we added the pv to the list, so this function could
return a state of vg->pvs that did not reflect other parameters such
as vg->pv_count.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
to check for presence of this module and avoid using --frames
option for genhtml in this case.
Fix arg list for AC_PATH_PROG for lcov and genhtml.
(detecting empty LCOV and GENHTML string in Makefiles).
Patch fixes generation of coverage files for dmeventd and adds support for clvmd.
Path names are stripped, so the the html looks better.
Frames 'previews' is enabled for generated pages.
Using top_srcdir was wrong here - though we still can't easily use builddir.
Requiers using shell variables before execution of binaries build outside
of srcdir.
Because we have now strong rule for lock ordering:
- VG locks must be taken in alphabetical order
- ORPHAN locks must be the last
vgs_locked() is now not needed.
This fixes problem with orphan locking, e.g.
vgremove VG1 | vgremove VG2
lock(VG1) | lock(VG2)
lock(ORPHAN) | lock(ORPHAN) -> fail, non-blocking
https://bugzilla.redhat.com/show_bug.cgi?id=578413
(More similar places in code.)
Physical segments were still allocated from global
command context mempool.
This leads to very high memory usage when
activating large VG (vgchange).
(Memory usage was about 2G when >3000LVs).
Fix it by properly using vg->vgmem private pool,
so all the memory is released early.
New memory pool parameter is needed here for pv_split_segment
function.
Also fix the same problem in some minor allocations
(vg description, lv segment split).
In addition to previous patch, we really do not need
to search for segment which was just allocated in
split request.
Make pv_split_segment function return newly allocated
(split) segment also.
(So after this patch, there is only one user
of slow find_peg_by_pe).
The function find_peg_by_pe is incredibly inefficient
for Pvs with many segments.
In shiny future there should be binary (or interval) tree
instead of sorted linked list (volunteers?).
Anyway, for now, we can use dirty trick here to optimise this case:
- Allocations are usually applied from the beginning
of PV (we have no alloocation policy which allocates areas
"backwards")
- The only user of find_peg_by_pe is pv_split_segment()
call. In *most* cases it need to split *last* PV segment.
So if we search sorted pv segment list backwards, we
hit the requested segment immediatelly.
This patch applies this tiny change.
(and saves >30% of processing time when >3000LVs segments are on one PV!)
To discourage using this inefficient function from other code,
it is moved to pv_manip.c and used static for now:-)
vg_validate call is an adept to optimisation, it is very
ineeficient and slow.
Anyway, we should call it only before writing data to disk.
The call in lvmcache was just temporary validation,
we realy do not need to revalidate cached metadata
every time.
(Actually, I added that there just to prove that cache works
properly and forgot to remove it.)
Patch removes it from lvmcache completely, this can hit only
internal bug in export function (and this bug must
be detected in any vg_write call anyway before).
The _read_vg uses already hash for PVs to optimise
reading of large VGs and avoiding repeated PV list traversing.
Use the same aproach to speed up parsing VG with many LVs.
If dmeventd runs with -d flag, it doesn't fork into backgroud.
The command kill(getppid(), SIGTERM) attempts to kill the parent dmeventd
process, however, if there is no parent, it kills whatever process spawned
dmeventd. In case of debugging with gdb, the parent is gdb, thus
kill(getppid(), SIGTERM) kills the debugger.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
we are running the repair manually. If we don't ignore, then dmeventd
and the manually run repair can collide. (We should still get clean
results in such a case, but it makes it harder to validate the test
results.)
Code moves initilization of stats values to _memlock_maps().
For dmeventd we need to use mlockall() - so avoid reading config value
and go with _use_mlockall code path.
Patch assumes dmeventd uses C locales!
Patch needs the call or memlock_inc_daemon() before memlock_inc()
(which is our common use case).
Some minor code cleanup patch for _un/_lock_mem_if_needed().
If the error path of _register_for_event() calls _free_thread_status()
_lib_put() call is missing.
To make thing simpler move this _lib_put() into common error path code.
As the header file <sys/mman.h> was not included in dmeventd.c
thus missed definition of MCL_CURRENT so this patch only makes
it obvious we were not locking memory here.
This patch has no functional change.
Later part of this patch set handles mlockall() via memlock_inc_daemon().
clvmd does not propagate DMEVENTD_MONITOR_IGNORE.
Update get_activation_monitoring_mode() to check if the VG that the
LV is being activated in is clustered. If so, skip it.
Any get_activation_monitoring_mode() error will cause the associated LV
(or VG) to be skipped during activation. Both vgchange_single() and
lvchange_single(), which call get_activation_monitoring_mode(), are
called by their respective process_each_..() method.
in clvmd, dmevend, man, tests.
Don't include dependency files for clow and cscope.out targets
Improve dependency tracking for dmeventd and liblvm2cmd sources.
to obtain sources. Create make.tmpl target for
simplier generation of cflow files with the help of
CFLOW_LIST, CFLOW_LIST_TARGET, CFLOW_TARGET.
Still cflow usage is not perfect.
Move daemons/ and lib/ subtargets to their Makefiles so we don't get
double cleanup error during execution of distclean target.
Instead of duplicating clean target inside distclean target,
just use it as a subtarget and avoid add duplicating code.
This check-in enables the 'mirrored' log type. It can be specified
by using the '--mirrorlog' option as follows:
#> lvcreate -m1 --mirrorlog mirrored -L 5G -n lv vg
I've also included a couple updates to the testsuite. These updates
include tests for the new log type, and some fixes to some of the
*lvconvert* tests.
clvmd's do_lock_lv() already properly controls dmeventd monitoring based
on LCK_DMEVENTD_MONITOR_MODE in lock_flags -- though one small fix was
needed for this to work: _lock_for_cluster() must treat
dmeventd_monitor_mode()'s return as a tri-state value.
Also cleanup do_lock_lv() to:
- explicitly init_dmeventd_monitor() based on LCK_DMEVENTD_MONITOR_MODE
- no longer reset init_dmeventd_monitor() to default at the end of
do_lock_lv() -- it is unnecessary
This is the next preparatory step towards better --alloc anywhere
support and is not intended to break anything that currently works so
please report any problems - segfaults, bogus data in the new debug
messages, or if the code now chooses bizarre allocation layouts.
. Add "monitoring" option to "activation" section of lvm.conf
. Have clvmd consult the lvm.conf "activation/monitoring" too.
. Introduce toollib.c:get_activation_monitoring_mode().
. Error out when both --monitor and --ignoremonitoring are provided.
. Add --monitor and --ignoremonitoring support to lvcreate. Update
lvcreate man page accordingly.
. Clarify that '--monitor' controls the start and stop of monitoring in
the {vg,lv}change man pages.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This prevents some confusion when libudev was not found so udev_sync was disabled
automatically. Configure was successful though giving only a tiny warning.
Also, if "dmsetup udevcreatecookie" is used, never return 0x000000 as a result if
udev is not running and keep the output blank.
We need to know whether we should wait for any uevent or not when
using udev_sync. A kernel patch was posted recently that changed the
way uevents are sent on dm device resume - it is sent only if the
device has been suspended before. There's also a new DM_UEVENT_GENERATED_FLAG
in the ioctl to notify userspace whether the event was generated.
If the uevent was not generated (e.g. the situation where the device is
*not* suspended and we call a resume), we just call dm_udev_complete
explicitly from within libdevmapper itself to prevent infinite waiting
while trying to synchronise with udev processing.
Prevent lvresize from being able to resize internal LVs: mirror legs
(*_mimage_*), mirror log (*_mlog), snapshot placeholder LVs (snapshot*)
and others. Resizing these would leads to unexpected metadata and
sometimes crashes (in case of growing snapshot*).
When we pv_read() a device that has an orphan vgname, we might need to scan
the system to be sure this is true. However, if the PV has mdas, there's
no way possible for it to have an orphan vgname unless it is a true orphan.
Some areas of the code were optimized to take advantage of this fact, while
others were not (we would still do the expensive scan if a device had mdas
but had an orphan VG).
This patch unifies the code so that every place we are operating on such
a PV, we skip the expensive scan if there are mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Petr Rockai <prockai@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
This option should be configurable, but for now
do not set it at all.
(lvm2app is used in udisks probers and there
cac cause several nasty races when trying to update
lvmcache during rescan.)
If user try to vgcreate or vgextend non-existent VG,
these messages appears:
# vgcreate xxx /dev/xxx
Internal error: Volume Group xxx was not unlocked
Device /dev/xxx not found (or ignored by filtering).
Unable to add physical volume '/dev/xxx' to volume group 'xxx'.
Internal error: Attempt to unlock unlocked VG xxx.
(the same with existing VG and non-existing PV & vgextend)
# vgextend vg_test /dev/xxx
...
It is caused because code tries to "refresh" cache if
md filter is switched on using cache destroy.
But we can change filters and rescan even without this
machinery now, just use refresh_filters
(and reset md filter afterwards).
(Patch also discovers cache alias bug in vgsplit test,
fix it by using better filter line.)
(VDSO on 32bit is VSyscall on 64bit)
It seems it could be locked on 64bit kernels running 32bit binaries,
but it makes troubles on real 32bit machines where mlock() returns
error when trying to lock such map area. (0xffffe000)
Behavior of mlockall() seems to be similar.
This patch adds a new implementation of locking function instead
of mlockall() that may lock way too much memory (>100MB).
New function instead uses mlock() system call and selectively locks
memory areas from /proc/self/maps trying to avoid locking areas
unused during lock-ed state.
Patch also adds struct cmd_context to all memlock() calls to have
access to configuration.
For backward compatibility functionality of mlockall()
is preserved with "activation/use_mlockall" flag.
As a simple check, locking and unlocking counts the amount of memory
and compares whether values are matching.
For static builds dependency for SELinux libs is not handled by 'ar'.
Till better solution is found, for static builds STATIC_LIBS is used.
Patch updates SELinux detection to use 3rd & 4th parameter for Success/Fail.
Also removes detection of pthread from this check as we know which
version of libdevmapper we are going to link with lvm after merge.
SELinux header check moved to the SELinux test code.
Create new substituted variable PTHREAD_LIBS and link this library
only with tools/libs which really needs it - i.e. dmeventd.
Check for libpthread only for builds with clvmd or dmeventd.
Remove variable LIB_PTHREAD
Modify linking of readline library. Create new substituted varible
READLINE_LIBS - readline library is linked ONLY with tools that really use
it - i.e. lvm. (Static lvm does not use readlin).
Previous behaviour put this library into the variable LIBS and thus
linked it with all created object files of lvm project (i.e. plugins...).
READLINE detection is simplified.
Termcap library is linked in only if readline library doesn't have its own
dependency (i.e. old distributions).
The kernel's blk_stack_limits() function may flag a device as
'misaligned'. If it does the alignment_offset will be -1.
Update set_pe_align_offset() to accommodate this corner case.
- increase timeout to 30 secs (on Chrissie request)
- source both cluster and clvmd for options (like all the other cluster
init scripts)
- add clustered_vgs and _lvs commodity fns
- move rh_status* fns at the top, so they can be reused
- heavily cleanup start and stop fns from redundant code and unnecessary
loops
- improve output from different operations
- make the init script lsb compliant
- don´t force kill of the daemon, send only a TERM signal and then wait
for it to exit
- Resolves rhbz#533247
lvm2 devices have always UUID set even if imported from lvm1 metadata.
Patch removes name argument from dev_manager_info call and converts
all activation related calls to use query by UUID.
Also it simplifies mknode call (which is the only user on mknodes parameter).
Fix add/remove tag function headers.
Fix a lot of little problems with doxygen comments.
Clarify the basic objects and their handles, and place functions with their
appropriate handles/objects.
All this cleanup moves automatic documentation of lvm2app much closer to being
useful as official documentation. In the future I will add some examples
and plan to build the examples as part of the unit tests.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add lvm2app functions to manage LV tags.
For lvm_lv_get_tags(), we return a list of tags, similar to other
functions that return lists. An empty list is returned if there
are no tags. NULL is returned if there is a problem obtaining
the list of tags.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add lvm2app functions to manage VG tags.
For lvm_vg_get_tags(), we return a list of tags, similar to other
functions that return lists. An empty list is returned if there
are no VG tags. NULL is returned if there is a problem obtaining
the list of tags.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a supporting function to copy a list of internal tags to lvm2app list.
We need to put this here because of the lvm_str_list_t type which we export
in lvm2app.h. If we didn't export this type, we could put this in the
internal library and use struct str_list.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We need to allocate memory for the tag and copy the tag value before we
add it to the list of tags. We could put this inside lvm2app since the
tools keep their memory around until vg_write/vg_commit is called, but
we put it inside the internal library to minimize code in lvm2app.
We need to copy the tag passed in by the caller to ensure the lifetime of
the memory until the {vg|lv} handle is released.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Similar refactoring to vgchange - pull out common parts and put into
library function for reuse. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Pull out common code to be called from tools as well as lvm2app.
Leave archive() at tool level so we can use from vgcreate
as well as vgchange. Should be no functional change.
- add stack macro in vgchange
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a merging snapshot to the deptree, using the "error" target, rather
than avoid adding it entirely. This allows proper cleanup of the -cow
device without having to rename the -cow to use the origin's name as a
prefix.
Move the preloading of the origin LV, after a merge, from
lv_remove_single() to vg_remove_snapshot(). Having vg_remove_snapshot()
preload the origin allows the -cow device to be released so that it can
be removed via deactivate_lv(). lv_remove_single()'s deactivate_lv()
reliably removes the -cow device because the associated snapshot LV,
that is to be removed when a snapshot-merge completes, is always added
to the deptree (and kernel -- via "error" target).
Now when the snapshot LV is removed both the -cow and -real devices
get removed using uuid rather than device name. This paves the way
for us to switch over to info-by-uuid queries.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There's a tiny period of time when the _mimage device is visible during
downconversion from mirror to linear. Since it is visible, we need to
create the symlinks, otherwise warning messages will be issued about udev
not creating those symlinks. We have to rely on udev flags completely.
- add DM_UDEV_DISABLE_LIBRARY_FALLBACK udev flag to rely on udev only
- export dm_udev_create_cookie function to create new cookies on demand
- add --udevcookie, udevcreatecookie and udevreleasecookie for dmsetup
(to support "udev transactions" where one cookie value can be used for
several dmsetup calls)
- don't use DM_UDEV_DISABLE_CHECKING env. var. anymore and set the state
automatically (based on udev and libdevmapper dev path comparison)
Internally we store sizes in sectors, but lvm2app exports sizes
in bytes. We could get fancier and allow units configuration but
this fix should do for now.
Fixes rhbz561422.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We unfortunately don't yet _know_, in dev_manager_snapshot_percent(), if
a snapshot-merge target is active (activation is deferred if dev is
open); so we can't short-circuit origin devices based purely on existing
LVM LV attributes.
Set 'fail_if_percent_unsupported' in dev_manager_snapshot_percent() for
a merging origin LV, otherwise passing unsupported LV types to _percent
will lead to a default successful return with percent_range as
PERCENT_100.
For a merging origin, PERCENT_100 will result in a polldaemon that runs
infinitely (because completion is PERCENT_0).
When activating a merging origin it is valid, and expected, to not have
a node in the deptree for both the origin and its merging snapshot. The
_cached_info() caller is only concerned with whether a device is open.
If there isn't a node in the tree the associated device is definitely
not open.
This change was deferred to help ease the review of previous refactoring
related to using process_each_lv() for lvconvert's merge support. Not
that doing so _really_ helped but...
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Switch lvconvert's --merge code over to using process_each_lv(). Doing
so adds support for a single 'lvconvert --merge' to start merging
multiple LVs (which includes @tag expansion).
Add 'lvconvert --merge @tag' testing to test/t-snapshot-merge.sh
Adjust man/lvconvert.8.in to reflect these expanded capabilities.
The lvconvert.c implementation requires rereading the VG each iteration
of process_each_lv(). Otherwise a stale VG instance associated with
the LV passed to lvconvert_single_merge() would result in stale VG
metadata being written back out to disk. This overwrote new metadata
that was written when a previous snapshot LV finished merging (via
lvconvert_poll). This is only an issue when merging multiple LVs that
share the same VG (a single VG is typical for most LVM configurations on
system disks).
In the end this new support is very useful for performing a "system
rollback" that requires multiple snapshot LVs be merged to their
respective origin LV.
The yum-utils 'fs-snapshot' plugin tags all snapshot LVs that it creates
with a common 'snapshot_tag' that is unique to the yum transaction.
Rolling back a yum transaction, that created LVM snapshots with the tag
'yum_20100129133223', is as simple as:
lvconvert --merge @yum_20100129133223
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
refactoring.
Document the need to cleanup the "name" args passed around polldaemon,
lvconvert and pvmove. It is quite a mess.
Annotate the unused nature of the existing poll_fns->get_copy_vg
methods' 'uuid' arg.
depending on if the mirror has a 'core' or 'disk' log. When there
is a disk log, the new leg is added by stacking a new mirror on
top of the old (one leg is the old mirror and the other leg is the newly
added device). When the log is a 'core' log, the new leg is simply added
to the existing mirror and all the devices are re-synced.
The logic that handles collapsing the stacked 'disk' log mirror was
having the effect of causing 'core' logged mirrors to begin resync'ing
for a second time. I have used the 'CONVERTING' flag to indicate that
a mirror is converting by way of stacking. This is no longer set for
up-converting core logs. The final 'collapse' logic can safely be skipped
for 'core' log mirrors - getting rid of the second resync.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
where we should not expose internal VG names/uuids (the ones with "#" prefix )through the
interface. Otherwise, we could end up with library users opening internal VGs which will
initiate locking mechanism that won't be cleaned up properly.
"#orphans_{lvm1, lvm2, pool}" names are treated in a special way, they are truncated first
to "orphans" and this is used as a part of the lock name then (e.g. while calling lvm_vg_open()).
When library user calls lvm_vg_close(), the original name "orphans_{lvm1, lvm2, pool}"
is used directly and therefore no unlock occurs.
We should exclude internal VG names and uuids in the lists provided by lvmcache:
lvmcache_get_vgids() and lvmcache_get_vgnames().
Allow the number of logical extents to be expressed (for a snapshot) as
a percentage of the total space in the Origin Logical Volume with the
suffix %ORIGIN.
Update the relevant man pages accordingly. Eliminate inconsistencies
between the man pages and tools/commands.h
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*_safe. This had the effect of segfaulting the log daemon when
converting a mirror from one log type to another.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
When activation of pvmove mirror fails on cluster, some nodes
still possibly succeeded in activation.
- Explicitly deactivate that mirror to be sure
- properly pair suspend/resume calls to not cause memory lock problems in clvmd
Code cannot simply call _finish_pvmove on cluster in this situation, because
changed LVs are suspended twice (causing memory inbalance) and also temporary
mirror is activated when it is not expected (and we know that it failed already).
Patch prepares special function which remove temporary mirror references from
metadata and then resumes changed LVs.
In dev_manager_info 0 means error and 1 info is returned,
not that device exists (that value is part of info struct).
Fix query by uuid only (no name) which returns 0 when device
does not exist.
Support "wait before testing" using '+' in pvmove and lvconvert
interval. Doing so overrides the new default of sleeping after checking
the LV's progress.
Sleeping before checking progress can lead to extraneous polldaemons
being left running. These polldaemons would have otherwise exited had
they checked before sleeping. Checking progress before sleeping helps
workaround the subtly unreliable nature of "finished" state checking
in _percent_run.
Update test/t-mirror-names.sh to use '+' when providing its lvconvert
interval.
more descriptive message if locking fails instead of
"Locking type -1 initialisation failed."
Use read-only locking instead of misleading ignorelocking option
in message.
All this seems to do is provide a memory leak so remove it.
The only caller of _alloc_pv() later explicitly sets
pv->vg_name = fmt->orphan_vg_name so clearly this allocation
should be removed. I also saw no where in the code where
strncpy was used to assign pv->vg_name - only direct assignments
and strdup's.
DSO is currently not dl_close-ing pluing during it is unregister handling,
so clear structure and related counter, so there are no memory problems.
Futher fixes are needed.
merge completes. This narrows the scope of this "hack" (which still
needs a proper fix within the deptree).
This stops dmeventd from trying to access snapshot devices that were
already removed.
LVM2 test (rather than using the traditional loop device).
prepare_scsi_debug_dev currently assumes exclussive access to the
scsi_debug module. Any script that tries to use prepare_scsi_debug_dev
when scsi_debug is unavailable or already loaded into the kernel will be
skipped.
t-topology-support.sh shows how prepare_scsi_debug_dev function can be
used repeatedly (within a script) to test LVM2 ontop of a ramdisk-based
SCSI device w/ arbitrary scsi_debug features.
isn't already available (in /proc/mdstat).
switch to requiring 2.6.33 for the alignment_offset tests; 2.6.{31,32}
alignment_offset values aren't reliable. 2.6.33 _should_ have mkp's
alignment_offset fixes but so far it doesn't (as of 2.6.33-rc4).
For mirror repair (and similar tasks) it can happen that full
device rescan is issued from clvmd.
Because code can be in the middle of repair (calling suspend)
clvmd should never try to scan suspended devices
(otherwise it causes deadlock).
Also code must not change ignore_suspended_device flag when
doing refresh_filters (called from lvmcache scan code).
'const'. Be consistent with its use (and dev_manager_snapshot_percent()).
Pass 'lv' from dev_manager_snapshot_percent() to _percent() to
_percent_run(). _percent_run() always dereferenced 'lv' (when
initializing segh) even though it may have been NULL (as was the case
until now for dev_manager_snapshot_percent()).
If a "snapshot-origin" LV (snapshot-merge whose merge was deferred
becuase it was open) was passed to _percent_run() it would always return
100%.
Update _percent_run() to NOT return PERCENT_100 et. al. if
->target_percent() wasn't ever called and supplied 'lv' is a merging
origin. A default return of 100% does not work for snapshot-merge.
Also tweak a related lvconvert log_error() to include "Aborting merge."
bitmap tracking was switched from the e2fsprogs implementation to
the device-mapper implementation (dm_bitset_t). The latter has a
leading uin32_t field designed to hold the number of bits that are
being tracked. The code was not properly handling this change in
all places. Specifically, when getting the bitmap to/from disk.
Endian adjustments will likely need to be made on the accounting
field as well, since bitmaps are passed between machines on
start-up.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
that were necessary to be passed on to userspace.
The cluster mirror table (log portion only) used to look like this:
clustered-disk <parm_count> <disk> <region_size> <uuid> \
[[no]sync] [block_on_error]
Now it looks like this:
userspace <parm_count> <uuid> clustered-disk <disk> <region_size> \
[[no]sync]
So, there is one extra argument in the latter case - this was
unaccounted for.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Eliminate 'merging_snapshot' from 'struct logical_volume' and just use
'snapshot' for origin lv's reference to the merging snapshot; also set
MERGING in the origin lv's status.
If either the origin or snapshot that is to be merged is open the merge
will not start; only the merge metadata will be written. The merge will
start on the next activation of the origin (or via lvchange --refresh)
IFF both the origin and snapshot are closed.
Merge on activate is particularly important if we want to merge over a
mounted filesystem that cannot be unmounted (until next boot) --- for
example root.
snapshots are suspended, new origin is created, snapshots are resumed, new
origin is resumed. So it allocates memory while suspended.
To fix it, move vg_commit after suspend_lv, so that the suspend code will
treat it as precommitted vg and will preload new origin prior to suspend.
NOTE: agk doesn't like this "hack"; need to revisit and fix
This is useful for when the snapshot is still active and merging hasn't
started yet; it shows a merge is pending. Once merging starts the
merging snapshot will be hidden but can still be displayed with 'lvs -a'
Report snapshot origin with merging snapshot as 'O' instead of 'o':
Before merge starts this shows that a merge is pending. While merging
the snapshot will be hidden, 'O' enables a user to see that there is a
snapshot merging.
"snapshot-merge" target based on whether the LV is a merging snapshot.
When activating a snapshot-merge target do not attempt to monitor the
LV for events; the polldaemon will monitor the snapshot as it is
merged.
Allow "snapshot-merge" target's usage to be parsed via standard
"snapshot" methods.
NOTE: follow on fixes to the _percent_run change are still needed
Introduces new libdevmapper function dm_tree_node_add_snapshot_merge_target
Verifies that the kernel (dm-snapshot) provides the 'snapshot-merge'
target.
Activate origin LV as snapshot-merge target. Using snapshot-origin
target would be pointless because the origin contains volatile data
while a merge is in progress.
Because snapshot-merge target is activated in place of the
snapshot-origin target it must be resumed after all other snapshots
(just like snapshot-origin does) --- otherwise small window for data
corruption would exist.
Ideally the merging snapshot would not be activated at all but if it is
to be activated (because snapshot was already active) it _must_ be done
after the snapshot-merge. This insures that DM's snapshot-merge target
will perform exception handover in the proper order (new->resume before
old->resume). DM's snapshot-merge does support handover if the reverse
sequence is used (old->resume before new->resume) but DM will fail to
resume the old snapshot; leaving it suspended.
To insure the proper activation sequence dm_tree_activate_children() was
updated to accommodate an additional 'activation_priority' level. All
regular snapshots are 0, snapshot-merge is 1, and merging snapshot is 2.
Make 'merging_snapshot' pointer that points from the origin to the
segment that represents the merging snapshot.
Import/export 'merging_store' metadata.
Do not allow creating snapshots while another snapshot is merging.
Snapshot created in this state would certainly contain invalid data.
NOTE: patches at the end of this series will remove 'merging_snapshot'
and will introduce helpful wrappers and cleanups.
This spurious 'break' has been here since this code was first committed
in June 2005 and stopped the algorithm behaving as described in the
comment above it and rendered the variable 'already_found_one' useless.
1. Found bug in 'redundant log' implementation that caused
problems when converting a linear that spanned multiple
devices to a mirror (wasn't checking for NULL value of
provided parameter in _alloc_parallel_area)
2. Testsuite was failing to perform tests when 'not' modifier
was used. This allowed a couple issues to slip through.
Added a 'not_sh' modifier that negates tests performed by
functions defined in the shell source file.
3. Was initializing a variable to far down, which cause
previously set value to be overridden. (This was the
result of the collision of the "redundant log" and
lvconvert fix patches.)
successfully created it must _exit() once it completes.
Update _become_daemon() to differentiate between a failed fork() and a
successful fork().
Added lvm_return_code() to lvmcmdline.[ch]
Upon successful fork(), _become_daemon() must assert that the locks that
are currently held belong to the parent, not the child. All of the
child's internal state saying 'this process holds a lock' has to be
reset.
A proper lvmcache_locking_reset() should follow later.
date: 2010/01/07 20:42:55; author: jbrassow; state: Exp; lines: +11 -0
The patch fixes some lvconvert issues (WRT mirror <-> mirror).
1) 'exisiting_mirrors' and 'lp->mirrors' where taken to be in 'n-1'
notation (i.e a 2-way mirror is '1' and a linear is '0'), but the
variables were in 'n' notation.
2) After adding the redundant mirror log support, I was calculating
log_count by looking at the mirror log LV, but didn't take into
account the fact that there could be no mirror log!
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Sometimes it is really needed to switch off udev checking and the warnings we show when
we detect that udev has not done its job right - the messages like "Udev should have done
this and that. Falling back to direct node creation/removal. " etc.
This would be especially handy while setting DM_DEV_DIR env var that could be set to a
different location than standard /dev (udev can't create nodes/symlinks out of that one
directory that is configured into udevd). The exact same situation happens while we're
running our tests.
It is pretty much the same as reducing the number of
mirror legs, but we just don't delete them afterwards.
The following command line interface is enforced:
prompt> lvconvert --splitmirror <n> -n <name> <VG>/<LV>
where 'n' is the number of images to split off, and
where 'name' is the name of the newly split off logical volume.
If more than one leg is split off, a new mirror will be the
result. The newly split off mirror will have a 'core' log.
Example:
[root@bp-01 LVM2]# !lvs
lvs -a -o name,copy_percent,devices
LV Copy% Devices
lv 100.00 lv_mimage_0(0),lv_mimage_1(0),lv_mimage_2(0),lv_mimage_3(0)
[lv_mimage_0] /dev/sdb1(0)
[lv_mimage_1] /dev/sdc1(0)
[lv_mimage_2] /dev/sdd1(0)
[lv_mimage_3] /dev/sde1(0)
[lv_mlog] /dev/sdi1(0)
[root@bp-01 LVM2]# lvconvert --splitmirrors 2 --name split vg/lv /dev/sd[ce]1
Logical volume lv converted.
[root@bp-01 LVM2]# !lvs
lvs -a -o name,copy_percent,devices
LV Copy% Devices
lv 100.00 lv_mimage_0(0),lv_mimage_2(0)
[lv_mimage_0] /dev/sdb1(0)
[lv_mimage_2] /dev/sdd1(0)
[lv_mlog] /dev/sdi1(0)
split 100.00 split_mimage_0(0),split_mimage_1(0)
[split_mimage_0] /dev/sde1(0)
[split_mimage_1] /dev/sdc1(0)
It can be seen that '--splitmirror <n>' is exactly the same
as '--mirrors -<n>' (note the minus sign), except there is the
additional notion to keep the image being detached from the
mirror instead of just throwing it away.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Made .update_metadata optional in 'struct poll_functions' definitions;
eliminated _update_lvconvert_mirror() stub.
Tweak a mirror-specific error message in the generic polldaemon code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The default log option for a mirror is 'disk'. If the log
type is not explicitly stated on the command line when
converting from an X-way mirror to a Y-way mirror, 'disk'
is chosen. So, if you have a 'core' log mirror and you
convert, your result will contain a 'disk' log.
This patch remembers what the old log type was. If the
user is merely trying to switch the number of mirror
images, the log type is now kept the same.
There is one historical behaviour I left in place...
If you have a 2-way, core-log mirror and you use lvconvert to
specify you want a 2-way mirror - without specifying the
log type - you will get a 2-way, disk-log mirror.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Informal-IRC-ACK-by: agk
The logic was that lvconvert repair volumes, marking
PV as MISSING and following vgreduce --removemissing
removes these missing devices.
Previously dmeventd mirror DSO removed all LV and PV
from VG by simply relying on
vgreduce --removemissing --force.
Now, there are two subsequent calls:
lvconvert --repair --use-policies
vgreduce --removemissing
So the VG is locked twice, opening space for all races
between other running lvm processes. If the PV reappears
with old metadata on it (so the winner performs autorepair,
if locking VG for update) the situation is even worse.
Patch simply adds removemissing PV functionality into
lvconcert BUT ONLY if running with --repair and --use-policies
and removing only these empty missing PVs which are
involved in repair.
(This combination is expected to run only from dmeventd.)
Version >= 1.8.0 of the DM snapshot target appends metadata sectors used
to a snapshot's status. This patch allows LVM2 to accurately determine
if the snapshot store is empty. Knowing when a snapshot store is empty
is important in the context of snapshot-merge (means merge is complete).
Also update LVM2 to be aware of the possibility for "Merge failed" in
the snapshot-merge target's status.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_tree_activate_children() callers.
Otherwise resume_lv and its variants can fail silently.
Catching these failures is especially important now that dm targets like
crypt and snapshot-merge can fail in .preresume
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
the background polldaemon is allowed to start. It can be used
standalone or in conjunction with --refresh or --available y.
Control over when the background polldaemon starts will be particularly
important for snapshot-merge of a root filesystem.
Dracut will be updated to activate all LVs with: --poll n
The lvm2-monitor initscript will start polling with: --poll y
NOTE: Because we currently have no way of knowing if a background
polldaemon is active for a given LV the following limitations exist and
have been deemed acceptable:
1) it is not possible to stop an active polldaemon; so the lvm2-monitor
initscript doesn't stop running polldaemon(s)
2) redundant polldaemon instances will be started for all specified LVs
if vgchange or lvchange are repeatedly used with '--poll y'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This patch tries to correctly track changes in lvmcache related to commit/revert.
For vg_commit: if there is cached precommitted metadata, after successfull commit
these metadata must be tracked as committed.
For vg_revert: remote nodes must drop precommitted metadata and its flag in lvmcache.
(N.B. Patch do not touch LV locks here in any way.)
All this machinery is needed to properly solve remote node cache invalidaton which
cause several problems recently observed.
Lock mode is int masked by LCK_TYPE_MASK, always.
Patch also remove uneccessary masking lock flag on sender side,
if masking is needed, it is don on client side already.
- Add drop_precommitted flag to force drop precommitted metadata
- add lvmcache_commit_metadata() which upgrades precommitted metadata in cache
No functional change in this patch - just preparation for following change.
And decode flags in humar readable form in client.
And clean some trailing whitespaces.
No functional change in this patch (only debugging messages changed).
The use_precommitted flag indicates, that we want to use precommitted metadata
(used in suspend call to preload table with precommitted data).
But if there are no such data, committed metadata are read but the cache
still contains that precommitted flag.
(The problem is that later possible drop_metadata call will not invalidate
device in cache.)
The wrong precommitted state is stored in on remote nodes during normal
suspend/resume cycle _without_ vg_write/commit.
Use the PRECOMMITTED status flag here instead (which is always set if using
precommited metadata here).
If renaming snapshot with virtual origin, the origin is renamed too.
But the code must resume LVs in reverse order to properly
pair memlock (in cluster locking).
(The resume of snapshot resumes origin too and later resume
is ignored otherwise.)
(So the tests can run under cluster locking and do not require
cluster mirror or snapshots.)
Add vgscan before block device readahead change
(flush long running process - clvmd - dev cache.)
When PV device reappears with old metadata, it is
always updated to new version byt atutomatic metadata
repair.
Remove missing flag if device is empty.
If device contains allocated extents, issue warning that
user must remove volumes and re-add this PV before
manipulating with this volume.
This partially solves bug 547842 when one PV (log) is failed,
dmeventd removes that device and later this device reappears and
is wrongly added into VG marked missing.
The memlock_inc() fix is wrong, memlock count is not
propagated to long living process (clvmd) and just
it underflow there.
Also suspend is needed to pre-load precommited metadata
on other nodes (remapping to error taget in this case).
With explicit suspend we generate lock request and code
can update memlock count.
(Infinitely "locked" memory caused that fs_unlock() was not
called properly and on cluster nodes remains
old links in /dev/mapper for not active devices.)
(N.B. failing of suspend call here is not handled as fatal
error - the LV is going to be removed later anyway.)
The new recovery code first tries to repair LV and then removes failed PV
from VG. It means that during operation there can be VG with PV missing,
and vg_read code handles it like not consistent VG.
We already allows returning "inconsistent" commited metadata,
for mirror repair we need this for precommited too.
(The suspend call prepares precommited metadata to inactive table on
other cluster nodes.)
"Inconsistent" here means - correct metadata, just with some metadata areas
not found (obviously on missing or failed PVs).
The LV locks make sense only for clustered LVs.
Properly check cluster flag and never issue cluster lock here.
There are several places in code, where it is already checked, this
patch add this check to all needed calls.
In previous code the lock behaviour was inconsistent,
for example, the pre/post callback can take lock even for local volume,
but deactivate call do not released this lock and it remains held forever.
The local LV lock request now just let run the underlying activation code
on local node, the same process like in local locking.
(Again, this is important for new mirror repair calls, here for local
mirrors but with cluster locking enabled.)
This is unnoticed regression from commit 31672ff60e
The pre/post callback need to convert lock always, local node
is going to modify metadata in this case, it it fails conversion,
the call is ignored.
Also it fixes bug when the lock is not yet held, we cannot set LKF_CONVERT
in this case, it will fail because this lock do not exist.
Note that the automatic conversion is still disabled in activate
call, so the original fix (reactivation of exlusive LV) should
be still in place.
(Code already not fail if unlocking not locked resource.)
This is needed in pre/post lock_lv call, where we can
request the same lock on local node becuase of suspend call.
- do_command and lock_vg expect flags (no change here)
Bug fixes:
- lock_vg should check for NONBLOCK on lock_cmd, flags have this bit masked-out
- do_pre/post_command expect do not mask flag at all, this causes that
the code inside is never run! (see following patches, these functions
expect plain command without flags)
Patch should not cause any problems, only real change is
removing LCK_LOCAL bit from lock type flag, it is never used there.
(LCK_LOCAL is part arg[1] bits anyway.)
If there is problem deactivate LV and
_init_mirror_log is called with remove_on_failure = 1,
remove the newly created log LV from metadata.
(This can happen if there is active device with the same name
but different UUID.)
The main reason for this "workaround" patch is to
- do not keep _mlog volume in metadata, so user can repeat the action
- print better error message describing the real problem
# lvcreate -m 2 -n lv1 -l 1 --nosync vg_bar
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
/dev/vg_bar/lv1_mlog: not found: device not cleared
Aborting. Failed to wipe mirror log.
Error locking on node bar-01: Input/output error
Unable to deactivate mirror log LV. Manual intervention required.
Failed to create mirror log.
# lvcreate -m 2 -n lv1 -l 1 --nosync vg_bar
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Aborting. Unable to deactivate mirror log.
Failed to initialise mirror log.
I see "Deactivated" message when I activate and "Activated" message when
I deactivate. The code uses "activate" as boolean but it can be any one
of the enum values from CHANGE_AY, CHANGE_AN, CHANGE_AE, etc.
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
There's a new change udev event generated since kernel 2.6.32 that
notifies userspace about a change in read-only attribute for block
devices (with DISK_RO=1 environment variable set).
We need to detect this and disable the rule application so the
meaning of this change event is not interchanged with the regular
change event used while resuming/renaming DM devices.
If there's anybody awaiting this notification in foreign rules,
he can still check for this env var and do the appropriate actions
separately.
At this point they probably do not matter but going forward they
may - depends on future patches for replicator, etc. I think
these probably got missed because they were 'flags' so I changed
the name to 'status' to be consistent. So the on-disk
things 'flags' and the in structure 'status' (bits).
NOTE: WHATS_NEW already has entry for this in current release.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
pvmove suspends all moved LVs + pvmoveX mirrored LV itself.
This suspends even underlying pvmoveX and following explicit
suspend call is just noop.
But in resume the pvmoveX volume is no longer underlying
device for moved LVs, so it performs full resume with memlock
decrease.
Code must call memlock_inc() if suspend is requested, volume
is already suspended and error is not requested.
These are no longer used by anyone. The dm_list defines are all in
libdevmapper.h and libdm/datastruct/list.c contains any function definitions.
There is some code in "old-tests" that still use this but this code is not
being maintained.
Thanks to Zdenek for spotting this.
The physical_volume, volume_group, logical_volume and lv_segment
structures' 'status' member is now uint64_t.
The alignment of these structures was also audited to remove holes. The
movement of some members in 'volume_group' and 'lv_segment' eliminates
holes. The 'physical_volume' structure still has one 4-byte hole after
'pe_size'; the other structures no longer have any holes. Each
structures' size has not changed.
If the vg_read() returned error, no lock was taken,
so always call vg_release().
Otherwise this can happen because of missing FAILED_*:
# vgchange -a y x --ignorelockingfailure
Volume group "x" not found
Internal error: Attempt to unlock unlocked VG x
The sysfs filter initialise hash of available devices using
scan of /sys/block. We need to refresh even this hash
when performing full scan otherwise the newly appeared
device could be rejected, because there is no entry
in sysfs filter.
This easily could happen when attaching new device
to cluster node. (Only force refresh of context
in clvmd -R works here now).
Unfortunately consequences of this are much worse,
missing device part on that node is replaced with missing segment
(even when no partial arg is selected) and this directly
lead to data corruption.
See https://bugzilla.redhat.com/show_bug.cgi?id=538515
Simply fix it by refreshing device filters in lvmcache
before performing the full device scan.
(on one node a storage connection failed):
# vgchange -a y vg_bar ; echo $?
Error locking on node bar-02: Refusing activation of partial LV lv1. Use --partial to override.
1 logical volume(s) in volume group "vg_bar" now active
0
So activation fails on one node, error is correctly printed but
status code is wrong.
This patch fixes the top level (vgchange) to return proper code
(and print # of activated LVs).
(lvchange returns error properly here.)
(This affects only cluster locking because only cluster
locking module set LCK_PRE_MEMLOCK.)
With currect code you get
# vgchange -a n
Internal error: _memlock_count has dropped below 0.
when using cluster locking.
It is caused by _unlock_memory calls here
if ((flags & (LCK_SCOPE_MASK | LCK_TYPE_MASK)) == LCK_LV_RESUME)
memlock_dec();
Unfortunately it is also (wrongly) called in immediate unlock
(when LCK_HOLD is not set) from lock_vol
(LCK_UNLOCK is misinterpreted as LCK_LV_RESUME).
Avoid this by comparing original flags and provide memlock
code type of operation (suspend/resume).
All hidden (not visible) volumes should be activated through
other visible volumes.
(There are already exceptions like snapshot, mirror log and image,
which should be cleaned one day...)
This solves problems for future types of hidden volumes,
which can have special meaning and must not be activated implicitly
(e.g. key store volume).
This provides better support for environments where udev rules are installed
but udev_sync is not compiled in (however, using udev_sync is highly
recommended). It also provides consistent and expected functionality even
when '--noudevsync' option is used.
There is still requirement for kernel >= 2.6.31 for the flags to work though
(it uses DM cookies to pass the flags into the kernel and set them in udev
event environment that we can read in udev rules).
'last_rule' option has been removed from udev (version >= 147).
From now on, we require foreign rules to check and honor
ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG} instead. Foreign
rules should be skipped totally when this flag is set.
- fix missing unlocking of VG
lvcreate -l 100%PVS -n lv1 vg_test
Please specify physical volume(s) with %PVS
Internal error: Volume Group vg_test was not unlocked
- if no PVS specified, use all available
Fix segfault if %PVS in lvresize without PVs list.
indented metadata lines.
Macro outnl() is using exported out_newline() instead of direct
call f->fn(), that required the visibility of the internal
struct formatter.
Rename fill_default_pvcreate_params to pvcreate_params_set_defaults.
Rename pvcreate_validate_restore_params to pvcreate_restore_params_validate.
Rename pvcreate_validate_params to pvcreate_params_validate.
- add copyright notice for 10-dm.rules.in,
- set DM_UDEV_DISABLE_{DISK, OTHER}_RULES_FLAG in 11-dm-lvm.rules directly
for inappropriate and non-top-level subdevices in case we use older kernels
where DM_COOKIE is not used (and therefore there are no flags passed from
the LVM process itself). This applies for older kernels (version < 2.6.31),
- remove unnecessary filters in 95-dm-notify.rules - the DM_COOKIE env var
itself is set for change/remove udev events and for DM devices only so
there's no need to double-check this.
Similar to other vg_set_* functions, we create a vg_set_clustered() function
which does a few checks and sets a flag. This is where we check for
any limitations of clusters.
The DRBD uses underlying device so code should prefer top
device if duplicate is found.
Patch also introduce
dev_subsystem_part_major and dev_subsytem_name
functions to easily handle all these replication susbystems
and not hardcode md_major call.
See https://bugzilla.redhat.com/show_bug.cgi?id=530881
for full problem description.
Option --all is only partially documented currently, so document in all
commands. Also make {pv|vg|lv}{display|s} man pages consistent with help
output. Remove ununsed 'disk_ARG' parameter. Leave --trustcache out of
the man page output. Update --units argument to show all possible units.
- we have these levels when the udev rules are processed:
10-dm.rules --> [11-dm-<subsystem>.rules] --> [12-dm-permissions.rules] -->
13-dm-disk.rules --> [...all the other foreign rules...] --> 95-dm-notify.rules
- each level can be disabled now by
DM_UDEV_DISABLE_{DM, SUBSYSTEM, DISK, OTHER}_RULES_FLAG
- add DM_UDEV_DISABLE_DM_RULES_FLAG to disable 10-dm.rules
- add DM_UDEV_DISABLE_OTHER_RULES_FLAG to disable all the other (non-dm) rules.
We cutoff these rules by using the 'last_rule', so this one should really be
used with great care and in well-founded situations. We use this for lvm's
hidden and layer devices now.
- add a parameter for add_dev_node, rm_dev_node and rename_dev_node so it's
possible to switch on/off udev checks
- use DM_UDEV_DISABLE_DM_RULES_FLAG and DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG
if there's no cookie set and we have resume, remove and rename ioctl.
This could happen when someone uses the libdevmapper that is compiled with
udev_sync but the software does not make use of it. This way we can switch
off the rules and fallback to libdevmapper node creation so there's no
udev/libdevmapper race.
- remove default permissions set in 95-dm-notify.rules (and add a hint in 12-dm-permissions.rules to set it by the user directly)
- add multipath DM_ACTION=="PATH_FAILED" filter
- remove unnecessary filters in the headers of the rules (we can simply use DM_UDEV_RULES_VSN instead)
- fix symlink priorities in /dev/disk/ (snapshot volumes have low priority for FS UUID symlinks so it will not overwrite symlinks for the origin)
[root@xxxx-01 ~]# lvconvert -m 1 --corelog VG/cmirror
Unable to convert the log of inactive cluster mirror cmirror
I've tried to clean-up the message a little more, so the name
of the mirror stands out more while preserving the sense that
it's not a problem with the specific device, but the fact that
it is inactive that is causing the problem.
New msg:
Unable to convert the log of an inactive cluster mirror, cmirror
Per discussion on lvm-devel mailing list and part of debian patch set,
don't set defaults for owner and group, since nobody seems to use them, and
still allow override.
Going forward, we would like to allow users to specify the total
number of metadatacopies in a VG rather than on a per-PV basis. In
order to facilitate that, introduce --pvmetadatacopes to replace
--metadatacopies everywhere. We still allow --metadatacopies for
pv commands, but require --pvmetadatacopies for vg commands.
Eventually we will introduce --vgmetadatacopies. Once we do that,
we should either deprecate --metadatacopies or make it a synonym based
on the command (pvmetadatacopies for pv commands, and vgmetadatacopies
for vg commands). The latter option would likely just require a simple
'strncpy' check against cmd->command->name to qualify the merge_synonym
call.
Update nightly tests to cover the pvmetadatacopies synonym.
Note that this patch is the result of an eariler review comment for
the implicit pvcreate patches. Should apply cleanly on top of the
implicit pvcreate patches (I applied after patch 10/10 in that series).
NOTE: This patch will require --pvmetadatacopies for vgconvert as
--metadatacopies is no longer accepted.
Going forward, we would like to allow users to specify the total
number of metadatacopies in a VG rather than on a per-PV basis. In
order to facilitate that, introduce --pvmetadatacopes to replace
--metadatacopies everywhere. We still allow --metadatacopies for
pv commands, but require --pvmetadatacopies for vg commands.
Eventually we will introduce --vgmetadatacopies. Once we do that,
we should either deprecate --metadatacopies or make it a synonym based
on the command (pvmetadatacopies for pv commands, and vgmetadatacopies
for vg commands). The latter option would likely just require a simple
'strncpy' check against cmd->command->name to qualify the merge_synonym
call.
Update nightly tests to cover the pvmetadatacopies synonym.
Note that this patch is the result of an eariler review comment for
the implicit pvcreate patches. Should apply cleanly on top of the
implicit pvcreate patches (I applied after patch 10/10 in that series).
NOTE: This patch will require --pvmetadatacopies for vgconvert as
--metadatacopies is no longer accepted.
vgcreate and vgextend now implicitly call pvcreate on devices not
currently initialized for LVM use. Previously these commands would
fail with an error, so clarify the new behavior in the man page.
Adds implicit pvcreate support when calling vgcreate or vgextend with
device paths that are not yet PVs. This changes the behavior of vgcreate
and vgextend from failing with an error message to implicitly pvcreating.
Split pvcreate_validate_params into recovery and non-recovery parameters.
This is necessary so we can call the non-recovery validate function from
vgextend / vgcreate. Note in the pvcreate tool case, we must call the
recovery validation function first (see treatment of pe_start and --zero),
and that we add a call to fill_default_pvcreate_params before the validation
functions.
We need defaults for pvcreate_params at a higher level - this will
allow us to use a common function from the tools to take defaults,
then fill in any non-defaults from the commandline.
Future patches will refactor vgcreate/vgextend to call this function
if one or more pvcreate parameters are given on the commandline.
Another refactoring for implicit pvcreate support. We need to get
the pvcreate parameters somehow to the vg_extend routine. Options
seemed to be:
1. Attach the parameters to struct volume_group. I personally
did not like this idea in most cases, though one could make an
agrument why it might be ok at least for some of the parameters
(e.g. metadatacopies).
2. Pass them in to the extend routine. This second route seemed
to be the best approach given the constraints.
Future patches will parse the command line and fill in the actual
values for the pvcreate_single call.
Should be no functional change.
Should be no functional change. If this parameter is set to NULL, just fail
the extend if the device is not already a PV. If non-NULL, try pvcreate_single
before failing. Note that pvcreate_single() handles the log_error in case
of failure so we just return 0 if pvcreate_single() fails.
is granted at one mode and an attempt to convert it wthout the LCK_CONVERT
flag set then it will return errno=EBUSY.
This fixes a pretty bad bug in which an LV could be activated exclusively on
one node and lvchange -ay on another would convert it to shared!
It might break some things in other areas, but I doubt it.
Now uppercase letters imply Si units, so use lowercase everywhere.
We could stay with uppercase, but then we'd have to deal with rounding, etc.
Also, some output / error messages change slightly (instead of "GB" we're
now saying "GiB").
One test enhancement might be to add some new tests for the units changes
but for now let's just get the test back to passing.
lv_deactivate now returns always success, because tree deactivation
functions (see dm_tree_deactivate_children) always returns success.
Because code should return failure in lv_deactivate at least,
fix it by checking for device existence after real deactivation call.
(After discussion this was prefered solution to dm tree function rewrite
which affects snapshots and mirrors.)
Add configure --enable-units-compat to set si_unit_consistency off by default.
Use standard output units for 'PE Size' and 'Stripe size' in pv/lvdisplay.
confuses me, so I've added a comment at the top of the function to
remind me of this.
I also found that 'mirror_emit_segment_line' was returning 0 (return_0)
on failure /and/ success. It now returns 1 for success and 0 for failure -
just like '_emit_areas_line' and its calling function, '_emit_segment_line'.
Clean up VG_RESIZEABLE flag by creating vg_is_resizeable().
Update comment - we no longer have ALLOW_RESIZEABLE.
Also use vg_is_exported() in one place missed by earlier patch.
Should be no functional change.
Remove the checks for vg_read_error() in most of the tools callback
functions and instead make the check in _process_one_vg() more general.
In all but vgcfgbackup, we do not want to proceed if we get any error
from vg_read(). In vgcfgbackup's case, we may proceed if the backup
is to proceed with inconsistent VGs. This is a special case though,
and we mark it with the READ_ALLOW_INCONSISTENT flag passed to
process_each_vg (and subsequently to _process_one_vg).
NOTE: More cleanup is needed in the vg_read_error() path cases.
This patch is a start.
This patch is all just cleanup and no other patch depends on it.
Replace explicit dereference and check with vg_is_exported().
Update a few copyrights and remove unnecessary whitespace.
Should be no functional change.
Of the vgs field vg_attr, a few of the most likely to be used attributes
are clustered, exported, and partial. This patch adds the following 3
functions:
uint64_t lvm_vg_is_clustered(const vg_t vg)
uint64_t lvm_vg_is_exported(const vg_t vg)
uint64_t lvm_vg_is_partial(const vg_t vg)
We don't have to call dm_cookie_supported with dm_udev_get_sync_support
this way. Also, it's necessary for the current code to work properly on
systems without cookie support (older kernels).
- add DM_UDEV_RULES_VSN to provide a variable to be checked for in the other
rules (e.g. to check that DM rules are actually installed, we can alternate
functionality in the other rules based on this information, also we have
versioning support for the rules)
- set proper sbin path for dmsetup and blkid, /sbin first, then /usr/sbin.
This is necessary for anaconda to work properly.
- add 'last_rule' for cryptsetup's temporary devices (symlinks in /dev/mapper
only)
The test/api directory TARGET line will be reserved for non-interactive
unit tests. Building the interactive test can still be done with "make test"
from the test/api dir.
When clogd was renamed to cmirrord, somehow git got the remove of the old
files but not the add of the new files. This patch adds the new files.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The test/api directory TARGET line will be reserved for non-interactive
unit tests. Building the interactive test can still be done with "make test"
from the test/api dir.
Now that we've refactored the internal library functions that do the
vg_remove, we can handle the deferred commit of a lvm_vg_remove() inside
lvm_vg_write(). This makes the VG create/remove API more consistent in
terms of disk commits - they now both require an lvm_vg_write() to commit
the create or remove to disk.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
Now that we've split vg_remove_single into two routines, in the first routine
that only manipulates memory, we move the PVs from the vg->pvs list to the
vg->removed_pvs list. Then later, we iterate through this list to write the
removed PVs to disk, which removes them from the volume group and places them
into the internal ORPHAN VG.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
Split vg_remove_single into vg_remove_check (mandatory checks before
vgremove) and vg_remove (do actual remove by committing to disk).
In liblvm, we'd like to provide an consistent API that allows multiple
changes in memory, then let lvm_vg_write() control the commit to disk. In
some cases (for example, lvresize calls fsadm) this may not be possible.
However, since we are using an object model and dividing things into small
operations, the most logical model seems to be the lvm_vg_write model, and
handling the special cases as they arrive. So as best as possible
we move towards this end.
A possible optimization would be to consolidate vg_remove (committing)
code with vgreduce code. A second possible optimization is making vgreduce
of the last device equivalent to vgremove. Today, lvm_vg_reduce fails if
vgreduce is called with the last device, but from an object model perspective
we could view this as equivalent to vgremove and allow it. My gut feel is
we do not want to do this though.
Author: Dave Wysochanski <dwysocha@redhat.com>
Later patches should consolidate the vgremove / vgreduce functions but for
now let's clarify what vg_remove actually does by changing the name.
Author: Dave Wysochanski <dwysocha@redhat.com>
Add a new constraint that vgname locks must be obtained in
alphabetical order. At this point, we have test coverage for
the 3 commands affected - vgsplit, vgmerge, and vgrename.
Tests have been updated to cover these commands.
Going forward any command or library call that must obtain
more than one vgname lock must do so in alphabetical order.
Future patches will update lvm2app to enforce this ordering.
Author: Dave Wysochanski <dwysocha@redhat.com>
These functions are really identical but for clarity I made them separate
functions in this patch.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
If the destination vgname comes before the source vgname, we must open the
destination first because of the locking rules. Thus, do a strcmp and set
the flag based on the comparison.
Author: Dave Wysochanski <dwysocha@redhat.com>
Slight functional change. If we open the destination first, we cannot
know the 'fmt'. In this case we use the default metadata type unless
the user has specified -M on the cmdline. If not, in most cases this
is fine since we use the LVM2 default metadata type. However, if the
user is specifying a non-default metadata type (e.g. lvm1) and the order
of the names is such that we have to open the destination (vg_to) first,
we have a problem. So in this case, we require the use of -M and vgsplit
will fail with an error if not. I've updated the man page to recommend
the usage of -M in this case.
Author: Dave Wysochanski <dwysocha@redhat.com>
Move the creating/opening of the destination vg into its own function so later
we can reorder the source / destination vg opening based on the alphabetical
lock order rule.
Should be no functional change but code is a bit tricky.
Author: Dave Wysochanski <dwysocha@redhat.com>
Introduce 'lock_vg_from_first' flag to retain which vg was locked first.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
This will aid in future refactorings and allow for us to reorder the source
and destination vg based on alphabetical names.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
interface it should be using, it can still be overriden with -I.
If corosync isn't running or there is no information then the usual
checking will happen.
This code only builds if corosync is available.
# pvcreate -u udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi /dev/sdc
uuid udwxr7BoKYEeKMr033xK6o4og7F13sGi|��� already in use on "/dev/sdb1"
is now
# pvcreate -u udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi /dev/sdc
uuid udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi already in use on "/dev/sdb1"
Eliminate busy loop during pvcreate of a "normal" partition.
_md_sysfs_attribute_snprintf() would busy loop if the device it was
given was not a blkext-based MD partition.
Rather than being cute with a busy-loop prone 'goto check_md_major' in
_md_sysfs_attribute_snprintf(): explicitly check if the provided device
is a blkext-based partition (blkext_major()); and then check that the
get_primary_dev() determined parent is an MD device (md_major()).
The device-mapper mirror CTR table has been changing over time. This has
now been corrected to handle the old and new methods for invoking the
'block_on_errors' and 'cluster' features. (The code that does this was
accidentally committed in the previous check-in. This check-in finishes
the job.)
This check-in includes the touch-ups, make file changes, copyrights,
and other necessities to include the cluster log daemon into the
build system.
[autoconf still needs to be run to generate the 'configure' and
'Makefile' files.]
Eliminate dependency on outside library, since the same functionality
exists in our tree.
[It is important that the bitops work in the same way, as the bitmaps
must remain backwards compatible. I haven't tested every architecture,
but the x86* archs work. My test involved using the old ext2fsprogs
bitops, memcpy'ing the bits over to the LVM bitset array and ensuring
that only the bits set via the old methods were set.]
This patch update pv_t handle to be consistent with lvm_t - define as a pointer
to internal struct physical_volume.
Author: Dave Wysochanski <dwysocha@redhat.com>
This patch update lv_t handle to be consistent with lvm_t - define as a pointer
to internal struct logical_volume.
Author: Dave Wysochanski <dwysocha@redhat.com>
This patch update vg_t handle to be consistent with lvm_t - define as a pointer
to internal struct volume_group.
Author: Dave Wysochanski <dwysocha@redhat.com>
Full changes
- Fix vgextend error path when lock_vol(VG_ORPHANS) fails
- Move lock_vol(VG_ORPHANS) before archive(vg) - safe & simpler error paths
- Remove legacy comment/code that no longer applies
Found in review - Milan Broz <mbroz@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The changes to remove LCK_NONBLOCK from the LVM locks broke clvmd because the
code was clearly wrong but working anyway! The constant was being masked rather
than the variable that was supposed to match against it.
Rename private _primary_dev() to a public get_primary_dev() and reuse it
to allow retrieval of the MD sysfs attributes (raid level, etc) for MD
partitions.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Improve lib/device/device.c:_primary_dev()'s ability to look up the
primary device associated with all partitions; including blkext
(e.g. partitions directly on MD). The same will also work for obscure
sysfs paths; e.g.: paths with mangled names like the HP cciss driver
uses: /sys/block/cciss!c0d0/cciss!c0d0p1/
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Adds 'data_alignment_detection' config option to the devices section of
lvm.conf. If your kernel provides topology information in sysfs (linux
>= 2.6.31) for the Physical Volume, the start of data area will be
aligned on a multiple of the ’minimum_io_size’ or ’optimal_io_size’
exposed in sysfs.
minimum_io_size is used if optimal_io_size is undefined (0). If both
md_chunk_alignment and data_alignment_detection are enabled the result
of data_alignment_detection is used.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pvcreate --dataalignmentoffset option is not specified the start
of a PV's aligned data area will be shifted by the associated
'alignment_offset' exposed in sysfs (unless
devices/data_alignment_offset_detection is disabled in lvm.conf).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Documented which use-cases force the reinstatement of the nuanced
handling of pe_start. As soon as orphan PVs are eliminated much of this
will no longer be a concern ('preserve_pe_start' can be reenabled in
.pv_setup).
Added defensive 'if (pv->pe_align)' check in _text_pv_write()'s pe_start
loop.
If pv_setup was given a non-zero pe_start it would short-circuit
establishing a default pv->pe_align. pv->pe_align=0 would result
in a divide by zero in _mda_setup(). 'vgconvert -M2 $vgname' hit this.
.pv_write still properly preserves pe_start if it was supplied.
Adds pe_align_offset to 'struct physical_volume'; is initialized with
set_pe_align_offset(). After pe_start is established pe_align_offset is
added to it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
areas.
This preserved pe_start would quickly be readjusted to follow the first
mda anyway. An example use-case that hit this code path is: running
pvcreate on an already existing PV _without_ a preceeding pvremove.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Without this fix rounding the end of the first mda to a pe_align
boundary could silently exceed the disk_size.
Final 'if (start1 + mda_size1 > disk_size)' block serves as a safety
net.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Document existing pe_start policy.
Fix issue in _text_pv_setup() where existing pe_start case could have
the pv->pe_start set to pv->pe_align even though pe_start shouldn't ever
change.
vgconvert and pvcreate have a facility to preserve the existing start
of the on-disk data extents, known as pe_start.
They indicate this by passing the existing value to the pvsetup function
which must preserve it.
This patch avoids one particular case where the value could get
changed incorrectly now that the alignment settings are configurable.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A patch to the kernel, adding the 'luid' field to dm_ulog_request,
will allow us to properly identify log instances. We will now
be able to definitively identify which logs are to be removed/
suspended/resumed. This replaces the old faulty behavior of
assuming the logs were the same if they had the same UUID and
incrementing/decrementing a reference count.
For now, a simple way to enforce the read/write semantics is to just save the
open mode of the VG. If the caller uses lvm_vg_create, the mode is write.
The caller using lvm_vg_open can use either read or write to open the VG.
Once we have this, we enforce the permissions on each API call and don't allow
a caller to modify a VG that has not been opened properly.
This may be better combined with the locking mode, but I view that as future
cleanup, past this initial release. The intial release should enforce the
basic object semantics though, as described in the lvm.h file.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Adding the ability to get the seqno is important for an application to
determine if something has changed in a VG. Otherwise, the only way to
know is to open the VG with write permission and hold the handle.
This addresses a a large amount of Alasdair's review. Subsequent patches
will address remaining issues.
Addressed:
// FIXME Mention that's also required on error.
// FIXME Be consistent in terminology. It's called "system_dir" then last sentence says "system directory setting". Is it referring to "system_dir" there or something else?
// FIXME Mention it frees all resources and cannot be used subsequently?
// FIXME What does "any system configuration" mean?
// FIXME Expand on that explanation a bit, now that we know what the other fns look like.
// FIXME Not sure about that - it needs to scan sometimes. "will not" or "might not" ?
// FIXME: That's a FIXME in the code!!!
// FIXME What does "copied" mean in this context???
// FIXME Say what struct the returned struct dm_list is a list of...
// FIXME "This API" ? This function creates an object in memory?
// FIXME This function commits the Volume Group object referenced by the VG handle to disk?
// FIXME Where is "Name" defined? Absolute pathname?
Outstanding:
// FIXME Version function first? No structs or handles needed for that.
// FIXME Sort out this alignment. "Set an" directly below "system_dir" looks awful. Indent differently? More blank lines?
// FIXME Check how doxygen processes this. E.g. "return: LVM handle. You must use lvm_error() to check there were no errors and confirm that the handle is valid for passing to other functions."
// FIXME Find a better name. lvm_init.
// FIXME Consider renaming according to the new name for lvm_create.
// FIXME Please can we use dm_malloc throughout?
Allowing the caller to override the LVM configuration with an API will
enable them to use things such as device filters.
While very flexible, there is some danger to this API in that it will
make it harder to debug setups that have a changing config and deduce
what might have happened. At some point we may want to limit the scope
of this API but for now it is as open as the --config option to lvm commands.
Update exported symbols. When I renamed lvm_reload_config to lvm_config_reload
I forgot to rename so I renamed that one here.
This I believe is the last liblvm API for now. ;-)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
These lower-priority interfaces are not currently implemented in liblvm
but are on the TODO list in the near term.
Author: Thomas Woerner <twoerner@redhat.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
Extend lvm_vg_write to remove pvs removed in lvm_vg_reduce. The lvm
volume_group internal structure removed_pvs is used for that. The list is
empty afterwards.
Add new test for vg->pvs emptyness to lvm_vg_write to prevent to write empty
vgs. This is needed because of lvm_vg_reduce and lv_vg_create, which creates
empty vgs.
Signed-off-by: Thomas Woerner <twoerner@redhat.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
This function behaves a little bit different than vg_reduce_single, because
it allowes to remove even the latest pv. This has been done to be consistent
to lvm_vg_create, which creates an empty vg.
removed_pvs has been added to the volume_group struct. vg_reduce adds remove
pvs to this list to be able to commit the changes for the pvs in lvm_vg_comm
in liblvm2app.
Initialize removed_pvs list in format-specific volume_group constructors.
Ideally, we should have a base constructor here that initializes the general
non-format specific members of struct volume_group. But until then, there
are multiple places to initialize these members. Maybe a better patch would
be a base constructor patch for struct volume_group. That is more work
though.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Thomas Woerner <twoerner@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
The two liblvm functions that return a list of vgnames and vguuids use
cmd->mem to allocate the list. Make it clear to the caller that this
memory will be freed when the LVM handle is freed.
Clean up and clarify the return value of the functions. In the
case of a memory allocation error, add a couple log_errnos to the internal
code, and make it clear that memory allocation returns a NULL pointer.
If there are no VGs in the system, the list returned is an empty list.
Make a note of the fact that currently we return hidden VG names, how
these can be detected (always start with "#"), and that they should
not be used.
Author: Dave Wysochanski <dwysocha@redhat.com>
The general naming scheme for most liblvm APIs is:
lvm_<object>_<action>
As there are likely to be other things to do on the lvm 'config' object
(i.e. lvm_config_set_device_filter), we should use consistent naming.
Author: Dave Wysochanski <dwysocha@redhat.com>
Limited implementation but other types of activation should probably have
separate calls. We also currently do not handle pvmoves or lvconverts.
Author: Dave Wysochanski <dwysocha@redhat.com>
For now, liblvm will return -1 (fail) / 0 (success) or
NULL (fail) / non-NULL (success). Upon failure, lvm_errno and
lvm_errmsg should be used to determine the precise error.
Author: Dave Wysochanski <dwysocha@redhat.com>
For now, we use the following scheme.
For APIs that return an int, success is 0, fail is -1.
APIs that return handles, success is non-NULL, fail is NULL.
At this early stage, liblvm error handling mechanism is subject to change,
but for now we go with this simple scheme consistent with system
programming.
Author: Dave Wysochanski <dwysocha@redhat.com>
Hard to divide into smaller patches because of the moving around and
commenting, so just put in the new file diffs. We will review and can
update as necessary.
Author: Dave Wysochanski <dwysocha@redhat.com>
Add a very simple version of lvm_vg_remove_lv.
Since we currently can only create linear LVs, this simple remove function
is adequate. We must refactor lvremove_single a bit.
Author: Dave Wysochanski <dwysocha@redhat.com>
Add the most straightforward 'get' functions required for anaconda.
These are the ones that return simple uint64_t values.
The other more complex ones involve the lv_attr bits. These will
come in a separate patch series since each lv_attr bit will be returned
in a separate API instred of returning the string and requiring the
user to parse it.
Author: Dave Wysochanski <dwysocha@redhat.com>
For liblvm 'get' functions, we should share code with the reporting functions.
This means we need common code to return the values for the fields.
In this patch we refactor a few of the fields needed in liblvm.
Unfortunately, for the simple fields that do derefernces of structure
members (for example, vg_extent_count), we cannot call the common function
from the reporting infrastructure without more refactoring. The reason is
that the dereference of the simple fields is done deep inside the reporting
code (to get the generic "data" pointer), and the display function is a
generic 'size32' function. We can fix these issues later with more
refactoring.
Should be no functional change and the testsuite should cover any possible
regressions. The only fields in the report affected by this patch are:
vg_size, vg_free, and pv_mda_count.
Author: Dave Wysochanski <dwysocha@redhat.com>
After some refactorings, we can now move the bulk of _lvcreate into the
internal library, and we can call from liblvm. In the future, we should
refactor lv_create_single further, probably by segtype, to reduce the
size of struct lvcreate_params. For now this is a reasonable refactor
and allows us to re-use the function from liblvm.
Author: Dave Wysochanski <dwysocha@redhat.com>
The main _lvcreate function should deal with extents - the 'size' parameter
is just an intermediate step.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
Create a new structure, lvcreate_cmdline_params, to store parameters only
relevant for the cmdline, not the library call to lvcreate.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
Move extents calculation adjustments into their own local functions
right after we read the vg. This calculation really is not part of
the LV create function but is rather an adjustment to the parameters
based on what is given on the cmdline. So we move it outside the main
_lvcreate.
Should be no functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
A couple simple refactorings of _lvcreate - should be no functional change.
Move tags_ARG parsing into _lvcreate_params. Also use lp->voriginsize
instread of arg_count(). These refactorings make it easier to move the
bulk of _lvcreate into the library.
Author: Dave Wysochanski <dwysocha@redhat.com>
Although the tools do not currently do this, we update lvm_vg_extend
to do an implicit pvcreate on an uninitialized device. The tools will
soon be refactored to do this as well, but more work is needed in the
tools. For now we update lvm_vg_extend since this is the behavior
required by liblvm.
With this change, the simple liblvm unit test, test/api/vgtest.c
should pass whether or not the device is initialized.
Author: Dave Wysochanski <dwysocha@redhat.com>
The implicit pvcreate require either moving the ORPHAN_VG lock outside
pvcreate_single or somehow having the function know or detect whether
the ORPHAN_VG lock is already held.
Author: Dave Wysochanski <dwysocha@redhat.com>
Passing NULL for pvcreate parameters gives you default parameters for
pvcreate_single.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
In preparation for implicit pvcreate during vgcreate / vgextend,
move bulk of pvcreate logic inside library.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
We must hold the VG_ORPHAN lock until we commit to disk. Otherwise,
we risk a race condition on vgcreate / vgextend. Reverts the following
commit:
commit 72a41480ba
Author: Dave Wysochanski <dwysocha@redhat.com>
Date: Fri Jul 10 20:09:21 2009 +0000
Move orphan lock obtain/release inside vg_extend().
With this change we now have vgcreate/vgextend liblvm functions.
Note that this changes the lock order of the following functions as the
orphan lock is now obtained first. With our policy of non-blocking
second locks, this should not be a problem.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The lvm_list_vg_{names|ids} functions do not do a scan so we provide
a liblvm function that does a scan.
Author: Dave Wysochanski <dwysocha@redhat.com>
These functions provide the capability of enumerating all vgnames and
vgids in the system. They do not do a scan of the system.
Author: Dave Wysochanski <dwysocha@redhat.com>
Caller must free the memory of the uuid / name returned.
This may not be the best memory management policy since it may lead to
memory leaks if the caller has code like this:
if (!lvm_vg_get_name(vg))
Maybe we don't care - if we do we can use pools tied to handles later
or some other scheme.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Thomas Woerner <twoerner@redhat.com>
- Use vgmem pool to allocate a list of lvm_*_list structs
- Allocate a new list each call (list may have changed since last call)
- Add to liblvm's exported symbols
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Thomas Woerner <twoerner@redhat.com>
- pv_t, vg_t, lv_t
- include libdevmapper.h: needed for struct dm_list
These list structures will be needed in later APIs to return a list of
handles to one object, given another object. For example, lvm_vg_list_lvs()
will return a list of LV handles (lv_t's) given a VG handle (vg_t). We
need a structure to do this so we define the LV structure, as well as the
other structures at this point.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Some of the error interface is still TBD. Rather than exporting a lot
of codes, etc, just use a simple pass / fail. The allows our unit test
to not segfault if trying to create a VG that already exists.
lvm_vg_open() calls internal vg_read() function which is the entry point
for reading an existing VG. In addition to the mode, we include a 'flags'
parameter for future extensions.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reverts some of my 'cleanup' from last night. For now we will use pass/fail
on API calls (either 'int' return or NULL/non-NULL handle), then use
lvm_errno() to get more specific errors.
Author: Dave Wysochanski <dwysocha@redhat.com>
made to compensate for the changes in the kernel-side component
that recently went upstream. (Things like: renamed structures,
removal of structure fields, and changes to arguments passed
between userspace and kernel.)
These messages are unnecessary in the set functions. We check for this
condition and print a message in the vgchange tool but not the library
functions.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When converting to the new liblvm functions, the vgcreate code path
changed to create a new vg, then set values. As a result of this
change, and the fact that we give a user a message if they try to
set the same value of a VG attribute (extent_size, alloc_policy, etc),
you'll see these 2 extraneous "is already" messages with vgcreate:
tools/lvm vgcreate vg2 /dev/loop2
Physical extent size of VG vg2 is already 4.00 MB
Volume group allocation policy is already normal
Volume group "vg2" successfully created
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
Since we are using errno values, we should use '0' as a default value
which indicates a non-error, rather than defining some made-up default
value that is not defined in errno. If we need to deviate from errno
values, this will most likely indicate a flaw in our design.
Add prototypes for lvm_errno and lvm_errmsg inside lvm.h and provide
a basic description of their function. This fixes a couple compile
warnings.
Author: Dave Wysochanski <dwysocha@redhat.com>
We provide a lock type that behaves like no_locking, but is not
clustered. Moreover, it also forbids any write locks. This magically (and
consistently) prevents use of clustered VGs, or changing local VGs with
--ignorelockingfailure. As a bonus, we can remove the special hacks in a few
places. Of course, people looking for trouble can always set their locking_type
to 0 to override.
In _process_one_vg, we should never proceed if the VG read fails with certain
conditions. If we cannot allocate or construct the volume_group structure,
we should not proceed - this is true regardless of the tool calling the
iterator. In other cases, when the volume group structure is constructed but
there is some error (PVs missing, metadata corrupted, etc), some tools may
want to process the VG while others may not.
Author: Dave Wysochanski <dwysocha@redhat.com>
In vg_backup_single, we should error out if we vg_read_error(vg) and the
error code we received was anything other than FAILED_INCONSISTENT.
Original code contained an error because C operator precedence.
Note - this was part of the vg_read() so no WHATS_NEW entry neceesary.
Author: Dave Wysochanski <dwysocha@redhat.com>
liblvm unit test case uses the following APIs:
- lvm_create, lvm_destroy
- lvm_vg_create, lvm_vg_extend, lvm_vg_set_extent_size, lvm_vg_write,
lvm_vg_remove, lvm_vg_close
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Add the following VG APIs to liblvm/.exported_symbols:
-lvm_reload_config
-lvm_vg_create
-lvm_vg_extend
-lvm_vg_set_extent_size
-lvm_vg_write
-lvm_vg_close
-lvm_vg_remove
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Add some liblvm APIs for VGs. Most of these APIs simply call into the internal
liblvm library. Ideally we should call the liblvm functions directly from
the tools. However, until we convert more of the code to liblvm functions,
things like the cmd_context will get in the way. For now just implement the
liblvm functions as wrappers around the internal functions, with a little
error checking and return code handling. We put all these vg APIs into a
new file, lvm_vg.c
The following APIs are implemented:
lvm_vg_create, lvm_vg_extend, lvm_vg_set_extent_size, lvm_vg_write,
lvm_vg_remove, lvm_vg_close.
Still TODO:
- cleanup error handling by using lvm_errno() and related APIs
- cleanup naming / clarify which functions commit to disk vs not
- implement more 'set' functions
- decide on 'set' / 'change' nomenclature
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
This needs initialized to non-NULL before using the archive() call.
Normally this is set to the cmdline string when lvm is called from a tool.
We could think about using it in another way, as a potential audit trail
of liblvm calls, or just leave it set to the default "liblvm", similar to
what clvmd does. For now, just set it to "liblvm".
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Define the 5 main liblvm objects to be the pv, vg, lv, lvseg, and pvseg.
We need handles defined to all these objects in order for liblvm to be
equivalent to the reporting commands pvs, vgs, and lvs.
- move vg_t, lv_t, and pv_t from metadata-exported.h into lvm.h
- move lv_segment and pv_segment forward declarations into lvm.h
- add lvseg_t and pvseg_t to lvm.h
NOTE: We currently have an inconsistency in handle definitions.
lvm_t is defined as a pointer, while these other handles are just
structures. We should pick one scheme and be consistent - perhaps
define all handles as pointers (this is what I've seen elsewhere).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
The checks for RESIZEABLE_VG should now be inside the various functions that
have to do such operations.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Remove READ_REQUIRE_RESIZEABLE flag from vgsplit similar to the removal from
vgextend. Move the check inside the functions that actually move pvs from
one vg structure to another. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
In the future we may export these functions or something like them in liblvm
For now this helps in cleaning up the checks for RESIZEABLE since we can
use the internal library function vg_bad_status_bits.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Move the check for the RESIZEABLE flag inside the vg_extend function.
When we consolidated the vg locking, reading, and status flag checking,
we tied the check for the RESIZEABLE flag to the vg_read() call. The problem
with this is you cannot know what other APIs the application my or may not
call after a vg_read() call. Thus the READ_REQUIRE_RESIZEABLE flag is not
really ideal - ideally we should be checking for this flag on a specific
operation, not inside the vg_read() call. This patch moves one check inside
the library.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
xlate64 produces unsigned long long type, but PRIu64 is defined
to accept argument unsigned long type (on 64-bit machines).
On existing machines, both types have the same size, so it works,
but it is still wrong and produces a warning.
Fix it by using a cast to uint64_t --- according to the standard,
PRIu64 argument matches type uint64_t.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Orphan lock is now obtained second and released first, and all tools
are consistent in this regard.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
With this change we now have vgcreate/vgextend liblvm functions.
Note that this changes the lock order of the following functions as the
orphan lock is now obtained first. With our policy of non-blocking
second locks, this should not be a problem.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Move the vg orphan lock inside vg_remove_single, now a complete liblvm
function. Note that this changes the order of the locks - originally
VG_ORPHAN was obtained first, then the vgname lock. With the current
policy of non-blocking second locks, this could mean we get a failure
obtaining the orphan lock. In the case of a vg with lvs being removed,
this could result in the lvs being removed but not the vg. Such a
scenario could have happened prior though with a different failure.
Other tools were examined for side-effects, and no major problems
were noted.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Move check for active LVs outside of library function. The vgremove
liblvm function function will fail if there are active LVs. It will
be the application's responsibility to check this condition and remove
the LVs individually before calling vgremove. Note also that we've
duplicated the EXPORTED_VG check in vgremove_single (tools) and
vg_remove_single (library). Duplication seemed the only option here
since we don't want to do the automatic removal of LVs (in the tools)
if the vg is exported, and we still need to protect the library call
from removal if the vg is exported.
We still need to deal with the ORPHAN lock but vg_remove_single is now
very close to our liblvm function.
TODO: Refactor lvremove in a similar way.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
If there is syntax error in metadata, it now prints messages
like:
Couldn't read 'start_extent' for segment 'extent_count'.
Couldn't read all logical volumes for volume group vg_test.
The segment specification is wrong and confusing.
Patch fixes it by introducing "parent" member in config_node which
points to parent section and config_parent_name function, which
provides pointer to node section name.
Also it adds several LV references where possible.
vg_t *vg_create(struct cmd_context *cmd, const char *vg_name);
This is the first step towards the API called to create a VG.
Call vg_lock_newname() inside this function. Use _vg_make_handle()
where possible.
Now we have 2 ways to construct a volume group:
1) vg_read: Used when constructing an existing VG from disks
2) vg_create: Used when constructing a new VG
Both of these interfaces obtain a lock, and return a vg_t *.
The usage of _vg_make_handle() inside vg_create() doesn't fit
perfectly but it's ok for now. Needs some cleanup though and I've
noted "FIXME" in the code.
Add the new vg_create() plus vg 'set' functions for non-default
VG parameters in the following tools:
- vgcreate: Fairly straightforward refactoring. We just moved
vg_lock_newname inside vg_create so we check the return via
vg_read_error.
- vgsplit: The refactoring here is a bit more tricky. Originally
we called vg_lock_newname and depending on the error code, we either
read the existing vg or created the new one. Now vg_create()
calls vg_lock_newname, so we first try to create the VG. If this
fails with FAILED_EXIST, we can then do the vg_read. If the
create succeeds, we check the input parameters and set any new
values on the VG.
TODO in future patches:
1. The VG_ORPHAN lock needs some thought. We may want to treat
this as any other VG, and require the application to obtain a handle
and pass it to other API calls (for example, vg_extend). Or,
we may find that hiding the VG_ORPHAN lock inside other APIs is
the way to go. I thought of placing the VG_ORPHAN lock inside
vg_create() and tying it to the vg handle, but was not certain
this was the right approach.
2. Cleanup error paths. Integrate vg_read_error() with vg_create and
vg_read* error codes and/or the new error APIs.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
NOTE: vg_set_alloc_policy() returns success if you try to set a value that
is already stored. The behavior of vgchange is the same though - it fails.
There is a fixme noted in the code about this inconsistency, which should
be resolved if possible.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
In liblvm, we will reserve the word 'change' to mean an API that
both sets one or more values, and commits to disk. This will be
consistent with the LVM commandline. The existing vg_change_pesize()
function does not commit to disk, but just changes the extent_size
and ensures all internal structures are updated. This logic should
be contained in a function that sets the value.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
It would be nice to have one function that does all the validation
and setting of the VG's pesize. However, currently some checks
are in the higher-level function _vgchange_pesize(), and some
checks are in the lower function vg_change_pesize().
This patch moves most of the higher-level checks inside
vg_change_pesize. In one case a failure return code is
changed from ECMD_FAILED to EINVALID_CMD_LINE.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove unneeded LOCK_NONBLOCKING from vg_read() API and tools that
use it. We no longer need this flag anywhere since we now automatically
set LCK_NONBLOCK inside lock_vol() if vgs_locked().
For further details, see:
commit d52b3fd3fe
Author: Dave Wysochanski <dwysocha@redhat.com>
Date: Wed May 13 13:02:52 2009 +0000
Remove NON_BLOCKING lock flag from tools and set a policy to auto-set.
As a simplification to the tools and further liblvm, this patch pushes
the setting of NON_BLOCKING lock flag inside the lock_vol() call.
The policy we set is if any existing VGs are currently locked, we
set the NON_BLOCKING flag.
At some point it may make sense to add this flag back if we get an
RFE from a liblvm user, but for now let's keep it as simple as
possible.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove READ_CHECK_EXISTENCE and vg_might_exist().
This flag and API is no longer used now that we have a separate
API to check for existence.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove unneeded LOCK_KEEP from vg_read() interface.
Update comment to clarify cases where _vg_lock_and_read() may return
with an error but the lock held. Would be nice to make the vg_read()
interface consistent with regards to lock held and error behavior.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove LOCK_KEEP and READ_CHECK_EXISTENCE from vgsplit.
These flags are no longer necessary. We now check for existence
in a differnet function, and it is not necessary to keep the lock.
Removing these flags simplifies the new vg_read() interface.
After this patch, we can fully remove LOCK_KEEP.
READ_CHECK_EXISTENCE needs a bit more work before full removal.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Update units_to_bytes() to support (S)ectors: 500 bytes.
- 500 byte (S)ectors is of questionable value but it adds to consistency
if a user happens to use --units S. This seems better than an error.
Updated test/t-covercmd.sh to test --units [hS]
Document the units that can be displayed via --units uniformly.
- (p)etabytes and (e)xabytes were missing in pvs, vgs and lvs man pages.
Made lvreduce man page "... in units of megabytes." consistent (with the
lvextend and lvresize man pages).
Fix vg_read() error paths to properly release upon vg_read_error().
Note that in the iterator paths (process_each_*()), we release
inside the iterator so no individual cleanup is needed. However there
are a number of other places we missed the cleanup. Proper cleanup
when vg_read_error() is true should be calling vg_release(vg), since
there should be no locks held if we get an error (except in certain
special cases, which IMO we should work to remove from the code).
Unfortunately the testsuite is unable to detect these types of memory
leaks. Most of them can be easily seen if you try an operation
(e.g. lvcreate) with a volume group that does not exist. Error
message looks like this:
Volume group "vg2" not found
You have a memory leak (not released memory pool):
[0x1975eb8]
You have a memory leak (not released memory pool):
[0x1975eb8]
Author: Dave Wysochanski <dwysocha@redhat.com>
Sun May 3 12:54:28 CEST 2009 Petr Rockai <me@mornfall.net>
* Convert vgrename to vg_read_for_update.
Rebased 6/26/2009 - Dave W.
Author: Petr Rockai <prockai@redhat.com>
Committer: Dave Wysochanski <dwysocha@redhat.com>
Sun May 3 12:32:30 CEST 2009 Petr Rockai <me@mornfall.net>
* Rework the toollib interface (process_each_*) on top of new vg_read.
Rebased 6/26/09 by Dave W.
- Add skipping message to process_each_lv
- Remove inconsistent_t.
Sun May 3 11:40:51 CEST 2009 Petr Rockai <me@mornfall.net>
* Convert the straight instances of vg_lock_and_read to new vg_read(_for_update).
Rebased 6/26/09 by Dave W.
Sun May 3 11:40:51 CEST 2009 Petr Rockai <me@mornfall.net>
* Convert the straight instances of vg_lock_and_read to new vg_read(_for_update).
Author: Petr Rockai <prockai@redhat.com>
Committer: Dave Wysochanski <dwysocha@redhat.com>
Is an application uses query and set major:minor
to device, it should not fallback to default major by default.
Add new function whoich allows that (and use it in lvm2).
- validate the specified device is a PV and that it is in a VG
- automatically enable DEBUG (-d) if >= 4 -v instances were supplied
- preserve TMP_LVM_SYSTEM_DIR if it contains an lvm.conf and -d was
specified
- fix handling of special-case where PV is listed as "unknown device"
- more descriptive error when a PV is missing ("unknown device")
- unset LVM_SYSTEM_DIR if it was not originally set
- skip final vgscan if no changes were made
http://bugzilla.redhat.com/show_bug.cgi?id=500861
- Update list of fields/columns for each command (a few missing).
- Update list order to match "-o help" output (easier to verify field list)
- Add "{pv|vg|lv}_all" description.
- Move "-o help" sentence above column/field list.
New sample man page for lvs (pvs and vgs are similar):
-o, --options
Comma-separated ordered list of columns. Precede the list with ’+’ to append to the
default selection of columns instead of replacing it.
Use -o lv_all to select all logical volume columns, and -o seg_all to select all logical
segment columns.
Use -o help to view the full list of columns available.
Column names include: lv_uuid, lv_name, lv_attr, lv_major, lv_minor, lv_read_ahead,
lv_kernel_major, lv_kernel_minor, lv_kernel_read_ahead, lv_size, seg_count, origin, ori-
gin_size, snap_percent, copy_percent, move_pv, convert_lv, lv_tags, mirror_log, modules,
segtype, stripes, stripesize, regionsize, chunksize, seg_start, seg_start_pe, seg_size,
seg_tags, seg_pe_ranges, devices.
With --segments, any "seg_" prefixes are optional; otherwise any "lv_" prefixes are
optional. Columns mentioned in vgs (8) can also be chosen.
The lv_attr bits are:
E.g.
# vgscan
Parse error at byte 2360 (line 54): expected a value
Failed to load config file /etc/lvm/lvm.conf
You have a memory leak (not released memory pool):
[0x818c788] library (12 bytes)
...
status flag after reading the volume group. Thus, no need to set the flag
in vg_read() or clear it later before calling _vg_bad_status_bits().
Also, add back in the !lockingfailed() as part of the CLUSTERED flag check.
It's unclear why it was removed when the check was moved from
_vg_bad_status_bits() to inside _vg_lock_and_read().
There was an open question about the last check in the 'if' stmt for
lockingfailed() with a previous patch submitted. However, I would
defer that right now as it is a separate item and this patch should
be no functional change by including the !lockingfailed().
Petr acked this patch on 5/26 I just forgot to check it in.
Acked-by: Petr Rockai <prockai@redhat.com>
- it can support multiple segments, but note that
to work properly, correct IV (initialization vector)
offset parameter must be set properly.
Because most usage of IV start offset is when we join
several crypto segments together (so iv_offset is the segment
start offset), DM_CRYPT_IV_DEFAULT is defined to simplify
the process.
Function accepts the string in cipher agrument (already
including chainmode and iv type; chainmode and iv parameters are NULL
in this case) or user can provide split parameters which will
join into dm-crypt cipher specification "cipher-chainmode-iv".
All these parameters must be supplied in correct dm-crypt format.
Various tools need to check for existence of a VG before doing something
(vgsplit, vgrename, vgcreate). Currently we don't have an interface to
check for existence, but the existence check is part of the vg_read* call(s).
This patch is an attempt to pull out some of that functionality into a
separate function, and hopefully simplify our vg_read interface, and
move those patches along.
vg_lock_newname() is only concerned about checking whether a vg exists in
the system. Unfortunately, we cannot just scan the system, but we must first
obtain a lock. Since we are reserving a vgname, we take a WRITE lock on
the vgname. Once obtained, we scan the system to ensure the name does
not exist. The return codes and behavior is in the function header.
You might think of this function as similar to an open() call with
O_CREAT and O_EXCL flags (returns failure with -EEXIST if file already
exists).
NOTE: I think including the word "lock" in the function name is important,
as it clearly states the function obtains a lock and makes the code more
readable, especially when it comes to cleanup / unlocking. The ultimate
function name is somewhat open for debate though so later we may rename.
- PPC uses 64k page, some caculations are wrong in tests
- file utility is buggy on PPC and cannot detect swap, use blkid instead
- read ahead is quietly rounded down according to page size
(for now use some common divisor value in test)
Because preload of table for snapshot can produce snapshot
metadata (in kernel cow header) read.
Code should suspend origin first to avoid possible deadlock
when preloading (thus calling snapshot in-kernel constructor)
for origin with suspended cow device.
(fixes previous commit)
Several commands calls process_each_vg() and in provided
callback it explicitly recovers VG if inconsistent.
(vgchange, vgconvert, vgscan)
It means that old VG is released and reread but the function
above (process_one_vg) tries to unlock and release old VG.
Patch moves the repair logic into _process_one_vg() function.
It always tries to read vg (even inconsistent) and then decides
what to do according new defined parameter.
Also patch unifies inconsistent error messages.
The only slight change if for vgremove command, where
it now tries to repair VG before it removes if force arg is given.
(It works similar way before, just the order of operation changed).
When some parts of lvm are built as shared libraries (for example with
--with-snapshots=shared), the 'make' command does not build these parts.
The shared parts are built with 'make install' command.
This bug can be seen if you go to 'lib' subdirectory and type 'make'.
If you type 'make', the shared libraries are not built, if you type
'make all', the shared libraries are built.
The reason for the bug is the line $(SUBDIRS): $(LIB_STATIC)
If make is executed without any arguments, it makes the first target
in the Makefile. If the first target is '$(SUBDIRS): $(LIB_STATIC)',
it only builds static libraries.
This patch moves '$(SUBDIRS): $(LIB_STATIC)' after
include $(top_srcdir)/make.tmpl. make.tmpl contains the 'all' target
as its first target, so 'make' will be equivalent to 'make all' and
shared libraries will be build with 'make' command.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
When mirror convert polling is started (mainly as backgound process,
in lvchange -a y or in lvconvert itself) it tries to read VG
and LV identified by its name.
Unfortunatelly, the VG can have already different LV under the same name,
and various more or less funny things can happen (note that
_finish_lvconvert_mirror suspends the volume for example).
(the typical example is our testing script which continuously recreates
LVs under the same name in the same VG.)
This patch adds optional uuid parameter which helps to properly
select the monitoring object. For lvconvert polling it is set to LV UUID
and both _get_lvconvert_vg and _get_lvconvert_lv uses it to read proper VG/LV.
(In the pvmove case it is NULL, here we poll for physical volume name).
If there is no free area for log, code should break the loop.
(Otherwise it uses uninitializes areas later.)
Easily reproducible using lvconvert --repair
- kill device with log
- run lvconvert --repair vg/lv (with no PV usable for log)
During vgreduce is failed mirror image replaced with error segment,
this segmant type has always area_count == 0.
Current code expects that there is at least one area with device,
patch fixes it by additional check (fixes segfault during vgreduce).
Also do not calculate readahead in every lv_info call, we only need
to cache PV readahead before activation calls which locks memory.
Use lvconvert --repair in dmeventd mirror DSO.
for now.
It replaces bad behaviour in one set of circumstances with bad behaviour
in a different set. We think the behaviour needs to be more configurable.
When we are stacking LV over device, which has for some reason
increased read_ahead (e.g. MD RAID), the read_ahead hint
for libdevmapper is wrong (it is zero).
If the calculated read_ahead hint is zero, patch uses read_ahead of underlying device
(if first segment is PV) when setting DM_READ_AHEAD_MINIMUM_FLAG.
Because we are using dev-cache, it also store this value to cache for future use
(if several LVs are over one PV, BLKRAGET is called only once for underlying device.)
This should fix all the reamining problems with readahead mismatch reported
for DM over MD configurations (and similar cases).
Current code, when need to ensure that volume is not
active on remote node, it need to try to exclusive
activate volume.
Patch adds simple clvmd command which queries all nodes
for lock for given resource.
The lock type is returned in reply in text.
(But code currently uses CR and EX modes only.)
This means two things:
1) Non-mirrored LVs will be no longer affected by mirror monitoring. (Before,
if you had a LV that went partially missing on a VG where a mirror leg failed,
this LV would be removed automatically by dmeventd... Probably not an
unrecoverable dataloss bug, but still quite unpleasant.)
2) If enough parallel PV space is available at the time of the mirror failure,
the failed devices will be automatically replaced using this spare space. Which
(and whether) free space may be used is still not configurable, but is a
planned feature. Since it is relatively easy to undo the action by converting
the mirror manually, I don't consider this to be a showstopper. In fact, I
think the compromise is much better than what we have now.
pvmove now keep suspended devices if temporary mirror creation fails.
We can try to restore previous state if it is first attempt to activate
pvmove (code basically run the same code as --abort automatically).
Currently code uses pv_dev_name() for hash when getting internal
"pvX" name.
This produce corrupted metadata if PVs are missing, pv->dev
is NULL and all these missing devices returns one name
(using "unknown device" for all missing devices as hash key).
We can temporarily violate max_lv during mirror conversion etc.
(If the operation fails, orphan mirror images are visible to administrator
for manual remove for example. Not that this should ever happen:-)
Force limit only for lvcreate (and vg merge) command.
Patch also adds simple max_lv tests into testsuite
The vg->lv_count parameter now includes always number of visible
logical volumes.
Note that virtual snapshot volume (snapshotX) is never visible,
but it is stored in metadata with visible flag.
link_lv_to_vg and unlink_lv_from_vg are the only functions
for adding/removing logical volume from volume group.
Only these function should manipulate with vg->lvs list.
Later patch initializes lv->vg after the LV structure is prepared,
so pass through cmd context and do not use vg->cmd here.
Also move LV id calculation (which uses lv->vg too).
Also properly free memory pool if operation fails.
The snapshot segment (snapshotX) is created twice
during the text metadata segment processing.
This can cause temporary violation of max_lv count.
Simplify the code, snapshot segment is properly initialized
in init_snapshot_seg function now and do not need to be replaced
by vg_add_snapshot call.
The vg_add_snapshot() is now usefull only for adding new
snapshot and it shares the same initialization function.
The snapshot name is always generated, name paramater can be
removed from function call.
As a simplification to the tools and further liblvm, this patch pushes
the setting of NON_BLOCKING lock flag inside the lock_vol() call.
The policy we set is if any existing VGs are currently locked, we
set the NON_BLOCKING flag.
Should be no functional change.
The seg variable is temporary variable for list iterator,
code cannot expect that after iteration it remains NULL
(it contains non-NULL pointer here id list is empty).
Patch fixes first_seg function so it now correctly returns NULL
for empty segment list.
Buildsystem support device-mapper only install,
but generic install tagret includes both dm+lvm2.
For distribution which uses separate install_device-mapper,
there is no way how to install lvm2 only
(so after installing lvm2 for packaging purposes
built system must remove installed device-mapper files).
Fix it by allowing lvm2_install target, similarily like
install_cluster for clvmd.
(install = install_device-mapper + install_lvm2)
The dataalign value must always be aligned according
to MDA area.
The currect code checks if calculated value collides with
MDA area but not if the value is so small that it is
located before MDA starts.
Unfortunatelly there can be also MDA in the end of the device.
The patch adds simple check to avoid this miscalculation.
Patch expects that first MDA always starts on <= pagesize boundary
(this is true for all allowed label sector parameters).
Add lvs origin_size field.
Fix linux configure --enable-debug to exclude -O2.
Still a few rough edges, but hopefully usable now:
lvcreate -s vg1 -L 100M --virtualoriginsize 1T
Run backup of metadata on remote nodes in the
same place like local node - when calling backup().
Introduce backup_locally() which calls only
local backup if needed.
Remote backup is now trigerred by LCK_VG_BACKUP flag
combination (special VG lock).
This lock type will call check_current_backup()
(including backup_locally() call) and updates
metadata on all nodes.
(Patch fixes non-functional remote backup,
current call during VG lock never triggers.)
The backup() call store metadata from memory.
But in cluster backup() call performs
remote nodes metadata backup and it reads data from disk.
For metadata backup consistency,
patch moves all backup() calls after vg_commit.
(Moreover, some tools already do that this way.)
- Rename unlock_all to destroy_lvhash,
this function is called in cluster shutdown
unlocks everything and clean up allocated info space.
- Tidy lv_hash_lock use
.
Except adding free(lvi) in lv_has destructror
there is no functional change.
If user requests report attribute from PVSEG type
and PV is orphan (or all devices is set), the report
is empty.
Try for example (when only orphan PV are present)
#pvs
#pvs -o +devices
# pvs /dev/sdb1
PV VG Fmt Attr PSize PFree
/dev/sdb1 lvm2 -- 46.58G 46.58G
# pvs -o +devices /dev/sdb1
(no output)
The problem is caused by empty pv->segments list.
Fix it by providing fake segment here (similar to fake structures
in _pvsegs_sub_single() calls.
# pvs -a -o devices
Volume group name (null) has invalid characters
Skipping volume group (null)
...
_pvsegs_sub_single creates fake vg, we need to check
that pv is real here.
Since now, all code reading volume group is responsible for releasing
the memory allocated by calling vg_release(vg).
(For simplicity of use, vg_releae can be called for vg == NULL,
the same logic like free(NULL)).
Also providing simple macro for unlocking & releasing in one step,
tools usualy uses this approach.
The global memory pool (cmd->mem) should be used only for global
physical volume operations.
This patch have to be applied with all subsequent patches to complete
memory pool per vg logic.
Using separate memory pool has quite bit memory saving impact when
using large VGs, this is mainly needed when we have to use
preallocated and locked memory (and should not overflow from that
memory space).
The all_pvs list, used in vg_read, should make its own private
copy of pv structures, otherwise (when vg will use its own pool)
it will point to released memory pool.
The same applies for get_pvs() call.
Patch adds pv_list copy helper and adds explicit memory pool
parameter into _copy_pv.
(Please note that all these helper functions cannot guarantee that
vg related fields are valid - proper vg read & lock must be used
if it is requested.)
Currently PV commands, which performs full device scan, repeatly
re-reads PVs and scans for all devices.
This behaviour can lead to OOM for large VG.
This patch allows using internal metadata cache for pvs & pvdisplay,
so the commands scan the PVs only once.
(We have to use VG_GLOBAL otherwise cache is invalidated on every
VG unlock in process_single PV call.)
If the vg in process_each_segment_in_pv is NULL, the pv struct
can be incomplete (for example lv_segs are not copied in get_pvs()
call).
We need use the new pv from just read-in volume group.
(The same code is in pvdisplay already.)
In libdm/Makefile.in, we need to cleanup the symlink properly.
Adding to CLEAN_TARGETS seemed like the simplest way to do this
in the current build framework. We could redo dependencies for
VERSIONED_SHLIB, but for now just add to CLEAN_TARGETS.
For scripts/Makefile.in, we should be adding to DISTCLEAN_TARGETS.
The generic rule in make.tmpl.in takes care of the cleanup.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Takahiro Yasui <tyasui@redhat.com>
By gnu coding stds, 'distclean' should remove all files generated
by ./configure in addition to what 'clean' does.
Author: Takahiro Yasui <tyasui@redhat.com>
Patch fixes these problems:
- during the snapshot creation process, it needs create 2 LVs,
one is cow, second becomes snapshot.
If the code fails in vg_add_snapshot, code lvcreate will not remove
LV cow volume.
- if max_lv is set and VG contains snapshot, it can happen that
during the activation lv_count is temporarily increased over the limit
and VG metadata are not properly processed
see https://bugzilla.redhat.com/show_bug.cgi?id=490298
- vgcfgrestore alows restore with max_lv set to lower valuer that actual
LV count. This later leads to situation that max_lv is completely ignored.
- vgck doesn't call vg_validate(). It should at least try:-)
Signed-off-by: Milan Broz <mbroz@redhat.com>
We would like to declare our handles pv_t, vg_t, and lv_t in
the external library header lvm.h. However, these are already
defined in metadata-exported.h for the use of some of the
in-progress liblvm APIs. Thus, we cannot both define
them in lvm.h and include metadata-exported.h in the external
library C files. We could use preprocessor tricks (#ifndef)
but for now we just avoid the include.
The original liblvm.a has been moved to liblvm-internal.a.
We now use liblvm.a for the new application library and build
it inside liblvm directory.
Change dependencies so tools depend on liblvm application library,
and application library depends on liblvm internal.
(for example when CLVMD_CMD_LOCK_VG for is called during vgscan).
If clvmd calls LV lock, it calls
/* clean the pool for another command */
dm_pool_empty(cmd->mem);
to clean up memory pool after command.
Unfortunately, do_refresh_cache() do not call this
and because during it operation it allocates some memory,
the pool increases.
Also do_refresh_cache should use lvm_lock
(it manipulates with lvm internal data).
The same applies for lvm_backup command.
Signed-off-by: Milan Broz <mbroz@redhat.com>
if rlocn not defined (there is no metadata area).
In most cases it fails in validate_name(),
unfortunately there are situatuions, when
validate_name is ok and later code fails with
checksum error.
Reproducer:
# dd if=/dev/zero of=/dev/loop0
# pvcreate --metadatasize 637k /dev/loop0
Physical volume "/dev/loop0" successfully created
# pvs /dev/loop0
/dev/loop0: Checksum error
PV VG Fmt Attr PSize PFree
/dev/loop0 lvm2 -- 1.00M 1.00M
Signed-off-by: Milan Broz <mbroz@redhat.com>
-
Using argv[] list in exec_cmd() to allow more params for external commands.
Fsadm does not allow checking mounted filesystem.
Fsadm no longer accepts 'any other key' as 'no' answer to y/n.
Fsadm improved handling of command line options.
New structure lvm (used as an alias to cmd_context), new type definition lvm_t
for the lvm handle. Added functions lvm_create, lvm_destroy and
lvm_reload_config using the new handle.
Modified test/api/test.c:
Use new lvm.h header file and lvm_t handle.
Removed lib/lvm2.h
Author: Thomas Woerner <twoerner@redhat.com>
This patch is not fully tested and leaves some related bugs unfixed.
Intended behaviour of the code now:
pe_start in the lvm2 format PV label header is set only by pvcreate (or
vgconvert -M2) and then preserved in *all* operations thereafter.
In some specialist cases, after the PV is added to a VG, the pe_start
field in the VG metadata may hold a different value and if so, it
overrides the other one for as long as the PV is in such a VG.
Currently, the field storing the size of the data area in the PV label
header always holds 0. As it only has meaning in the context of a
volume group, it is calculated whenever the PV is added to a VG (and can
be derived from extent_size and pe_count in the VG metadata).
When reporting explicitly label attributes (pv_uuid for example), we do not
need to read metadata.
This patch separate the label fileds and removes scan_vgs_for_pvs
in process_each_pv() if not needed.
(There should be no user visible change in output.)
In libdm, we only ever use 'fields', while the tools use 'options' and
'fields' interchangeably.
Ideally it would be good to use 'fields' consistently everywhere.
However, 'options' most likely comes from the tool commandline '-o' and
'--options' which cannot be changed.
We display '0' for these fields now in this case. Ideally these values are
undefined for an orphan PV but today there is no way to specify undefined
with display functions such as _size64_disp().
"test" never prints anything. Therefore, "return $(test ...)" is equivalent to
just "return;" which means success in sh (same as return 0). We can however,
thanks to set -e, use "test foo = bar" as an assertion.
PS: test a == b is invalid syntax. It is either = or -eq: = is textual and -eq
is numeric comparison.
For example in LVM2, "pv_all" gives all PV fields.
"seg_all" gives all LV segment fields.
"all" gives all fields of the final report type. I think this is more
useful than just adding the current prefix.
So "lvs -o seg_all" gives all the LV segment fields, whilst
"lvs --segments -o all" adds in LV and VG fields too.
"lvs -o all -O vg_name" has report type LVS+VGS so includes all LV and all
VG fields.
Reports the size of the smallest metadata area in a PV or a VG.
Useful to confirm pvcreate --metadatasize or pvmetadatasize setting in
/etc/lvm/lvm.conf file.
NOTE: Actual value in these fields will most always differ from that
given in pvcreate options due to rounding and alignment effects.
There is a rudimentary make file in place so people can build by hand
from 'LVM2/daemons/clogd'. It is not hooked into the main build system
yet. I am checking this in to provide people better access to the
source code.
There is still work to be done to make better use of existing code in
the LVM repository. (list.h could be removed in favor of existing list
implementations, for example. Logging might also be removed in favor
of what is already in the tree.)
I will probably defer updating WHATS_NEW_DM until this code is linked
into the main build system (unless otherwise instructed).
Checks added for DM device names to allow only names < DM_NAME_LEN,
otherwise a part of lengthy name would be silently ignored and could
cause confusion while using dmsetup. Also, the name should not contain
'/' character, if it is used in context of creating a new device
or renaming the existing one (because we do not consider full path
to devices, they do not exist in filesystem yet) and appropriate error
messages are shown.
pvcreate $DEV
vgcreate -s 1k vg_test $DEV
lvcreate -l 1 -n lv1 vg_test
..
/dev/vg_test/lv1: write failed after 1024 of 4096 at 0: No space left on device
Just check for maximum write size in set_lv.
It fails for 1k PE now.
Patch adds log_region_size into allocation habdle struct
and use it in _alloc_parallel_area() for proper log size calculation
instead of hardcoded 1 extent - which can fail.
Reproducer for incorrect log size calculation:
DEV=/dev/sd[bcd]
pvcreate $DEV
vgcreate -s 1k vg_test $DEV
lvcreate -m1 -L 12M -n mirr vg_test
https://bugzilla.redhat.com/show_bug.cgi?id=477040
The log size calculation is mostly copied from kernel code.
Check for major/minor collision is added in _add_dev_to_dtree()
where we already read info by uuid,
so in the case of requesting major/minor it queries device-mapper
by major/minor for device availability.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=204992
Very simple / crude method of removing 'is_static' from initialization.
Why should we require an application tell us whether it is linked
statically or dynamically to libLVM? If the application is linked
statically, but libraries exist and dlopen() calls succeed, why
do we care if it's statically linked?
This allows us to remove one argument from create_toolcontext() and
moves it closer to a generic library init function.
In the arg_*() functions, we just use _the_args() directly.
For now we leave the first parameter to these
arg_*() functions (struct cmd_context *) because
of the number of files involved in removing the
parameter.
In preparation for removing cmd->args.
IMO, it makes more sense to put these accessor functions
in the same location as the static array _the_args.
Next patch will update arg_* functions to use _the_args[]
directly and remove cmd->args.
Problem is dm_report_init() may return NULL and subsequent call to
dm_report_set_output_field_name_prefix() doesn't handle NULL value.
Example:
pvs --nameprefixes --rows --unquoted --noheadings -opv_name,fred
Logical Volume Fields
---------------------
lv_uuid - Unique identifier
lv_name - Name. LVs created for internal use are enclosed in brackets.
...
Physical Volume Segment Fields
------------------------------
pvseg_start - Physical Extent number of start of segment.
pvseg_size - Number of extents in segment.
Unrecognised field: fred
Segmentation fault
Move init_full_scan_done(0) and init_mirror_in_sync(0) from init_lvm()
after call to create_toolcontext() to _init_globals(), called from bottom
of create_toolcontext(). No functional change.
Author: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: James Cameron <james.cameron@hp.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
init_formats() sets up the command formats, and currently sets cmd->fmt_backup
but does not set cmd->fmt to a default value. This seems incorrect so we
set it to cmd->default_settings.fmt before returning.
The call we remove here may set cmd->fmt based on a command line setting.
But it is safe to remove this, because the only caller of init_lvm() that
cares about the cmdline override is the cmdline tools (clvmd does not care),
called from lvm2_main(). After lvm2_main() calls init_lvm(), it later calls
lvm_run_command(). In lvm_run_command(), we have a call to _apply_settings(),
which has the identical assignment of cmd->fmt that this patch removes.
This is very obvious - _init_logging() makes the identical init_msg_prefix()
and init_cmd_name() calls with cmd->default_settings so these calls are
clearly redundant after calling create_toolcontext().
Very similar argument to removal of init_debug() and other calls.
create_toolcontext() calls _process_config() which sets
cmd->default_settings.activation, then calls
set_activation(cmd->default_settings.activation). Later, create_toolcontext()
sets cmd->current_settings = cmd->default_settings. So these calls
set_activation(cmd->current_settings.activation) are redundant.
Identical argument to previous patch which removed archive_enable() calls.
We add a new parameter to backup_init() which sets the enable value based
on the cmd->default_settings.backup value. This value was used to set
cmd->current_settings.backup, used in the removed backup_enable() call.
_init_backup() calls archive_init(), which originally set 'enabled' to
a hardcoded '1' value. This seems incorrect based on my read of other
areas of the code so here we add a 'enabled' paramter to archive_init().
We pass in cmd->default_settings.archive, which is obtained from the
config tree. Later in create_toolcontext, cmd->current_settings is
set to cmd->default_settings. The archive_enable() call we remove
here was using cmd->current_settings to set the 'archive' enable
value. The final value of cmd->archive_params->enabled should thus
be equivalent to the original code.
This one we actually need to move. _init_logging() is called from
create_toolcontext(), which makes this call:
/* Test mode */
cmd->default_settings.test =
find_config_tree_int(cmd, "global/test", 0);
But it does not call init_test(). So we need an init_test() somewhere.
The most logical place is to put it inside _init_logging(), since this
is where the config value is read and default_settings are set. Placing
the init_test() call here matches what is done with other variables and
seems to make sense.
This variable is set at the top of create_toolcontext() to 0.
Nothing later in create_toolcontext() changes the value.
In init_lvm(), nothing between create_toolcontext() call and this assignment
changes the value. Thus, the assignment is redundant.
The rationale for removing init_verbose() call is very similar to removing
init_debug() call. create_toolcontext() calls _init_logging() which
makes these calls:
/* Verbose level for tty output */
cmd->default_settings.verbose =
find_config_tree_int(cmd, "log/verbose", DEFAULT_VERBOSE);
init_verbose(cmd->default_settings.verbose + VERBOSE_BASE_LEVEL);
And being that create_toolcontext() copies default_settings into
current_settings at the bottom, the init_verbose() call we are removing:
init_verbose(cmd->current_settings.verbose + VERBOSE_BASE_LEVEL);
is redundant.
We can safely remove because create_toolcontext() calls _init_logging(),
which makes these calls:
/* Debug level for log file output */
cmd->default_settings.debug =
find_config_tree_int(cmd, "log/level", DEFAULT_LOGLEVEL);
init_debug(cmd->default_settings.debug);
Then at the bottom of create_toolcontext() we do this:
cmd->current_settings = cmd->default_settings;
So the call we are removing from init_lvm() functions (clvmd and lvmcmdline):
init_debug(cmd->current_settings.debug);
Just sets the value of debug based on 'cmd->current_settings.debug'.
Since cmd->current_settings is equivalent to cmd->default_settings, and
init_debug() was called with cmd->default_settings, the call we remove is
redundant.
for losetup, break out of the loop when successful setup of loop device,
and only look at 7 loop devices (default loop module setting)
for blockdev, use old option if new one is not available
with the second vgcreate overwriting the first.
Obtain lock before calling vg_create(), which checks for existence of vgname
and fails if it already exists.
Prior to this patch, "lvremove -f vgname" would fail if vgname contained
one or more snapshot LVs. Now this passes, but has a side-effect.
If you issue "lvremove vgname" where vgname contains one or more snaps,
you will get an extra "y/n" prompt to remove the same snapshot.
Example:
$ lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lvsnap vgtest swi-a- 16.00M lvtest 0.05
lvtest vgtest owi-a- 64.00M
$ lvremove vgtest
Do you really want to remove active logical volume "lvsnap"? [y/n]: n
Logical volume "lvsnap" not removed
Do you really want to remove active logical volume "lvsnap"? [y/n]: n
Logical volume "lvsnap" not removed
Command failed with status code 5.
Fixing this will most likely require modification of the iterator
function, process_each_lvs_in_vg() to iterate over snaps in some
cases (e.g. lvs, vgdisplay -v) but not in others (lvremove).
(Avoids having same mirror table loaded twice concurrently by first
using a 'zero' table to set the size of the device so when mirror
table is preloaded it doesn't have to be activated immediately.)
snapshot DSO unregistered itself when snapshot changed state to invalid.
This can cause a race (and several timeouts), when for example another snapshot
is added and in the middle of operation (suspend/resume) the monitoring thread
unregister itself.
Fix it by keeping the snapshot monitored after invalidation - just reset
threshold to not really print any messages to syslog.
If the PV has two metadata areas, second one is located at the end of the device.
Do not allow resize of PV or second metadata area can be overwritten.
(The check was active only for orphan PVs.)
Fixes problem when after downconvert to lvm1 VG is broken:
# lvcreate -n lv1 -l 4 vg_test
Invalid LV in extent map (PV /dev/sdb1, PE 0, LV 0, LE 0)
...
- add lvcreate rejects repeated invocation test
- fix pvs metadata test for partial failure test
- add pvchange reject --addtag to lvm1 pv test
(All fixes by Jaroslav Stava)
Original code would create "*.so" symbolic links if there were no actual
files ending in "so". The second iteration would then cause an error
in the test logs.
Function _text_pv_write doesn't use memory pool but static buffer,
call dm_pool_free in error path in _raw_write_mda_header is wrong.
Move pool free only to path where is the memory pool used.
Do not override the default action of AC_CHECK_LIB([readline],...
(i.e., leave the ACTION-IF-FOUND parameter blank) so that the
subsequent check for rl_completion_matches can use -lreadline.
Also, replace AC_CHECK_FUNC+AC_DEFINE with an equivalent AC_CHECK_FUNCS call.
Failure to check for label_write() return code caused the following test
to indicate it passed when it really failed:
pvcreate rejects labelsector > 1000000000000
The "status" field is treated as it ever has been, unknown flags there are
treated as fatal metadata errors. However, in the "flags" field, any unknown
flags will be ignored and silently dropped. This improves
backward-compatibility possibilities. (Any versions without support for this
new "flag" field will drop the field altogether, which is same as ignoring all
the flags there.)
failed to link against liblvm2cmd.
Dmeventd DSOs *require* lvm2cmd to be linked in.
For the future:
1) AC_SUBST does not create Makefile variables, only @foo@-style substitutions
2) When using `test', whitespace around `=' is essential:
test a=b is true, as is test a=a
The warning is bogus and is only seen on certain versions of gcc.
However using the enum does seem to clarify the intent of the code - only
3 possible md minor superblock versions.
Related compiler warning:
device/dev-md.c:53: warning: 'sb_offset' may be used uninitialized in this function
* configure.in (LVM2CMD_LIB): Define if --enable-cmdlib.
* dmeventd/mirror/Makefile.in (CLDFLAGS): Use $(LVM2CMD_LIB) rather
than hard-coding -llvm2cmd.
* dmeventd/snapshot/Makefile.in (CLDFLAGS): Likewise.
* configure.in: Define READLINE_SUPPORT not when processing
--enable-readline or --disable-readline, but rather only after
determining that readline support is desired and the readline
library is available/usable.
Related compiler warning:
log/log.c:242: warning: declaration of 'error_message_produced' shadows a global declaration
../include/log.h:98: warning: shadowed declaration is here
The new error checking code caught some commands that were returning '0' as
an exit status for success. This is incorrect and resulted in a benign error
message displayed (see below). As of today, all commands should return a
value defined in lib/commands/errors.h (1-5). This results in an exit code of
0 on success, or > 0 on failure (as stated in the lvm.8 man page).
Before change:
1. Make sure no PVs are on the system
2. Run 'pvs'
Command failed with status code 0.
After change:
<no output>
Specific test case:
1. pvcreate /dev/loop1; vgcreate vg1 /dev/loop1; lvcreate -L 64M -n lv1 vg1
2. vgremove vg1 (will prompt user)
3. CTRL-C
Code will exit with:
Do you really want to remove volume group "vg2" containing 2 logical volumes? [y/n]:
Volume group "vg2" not removed
Command failed with status code 5.
Internal error: Volume Group vg2 was not unlocked
Device '/dev/loop1' has been left open.
After change:
Do you really want to remove volume group "vg2" containing 2 logical volumes? [y/n]:
Volume group "vg2" not removed
Command failed with status code 5.
which is not used since the switch away from async version saLck
. num_nodes should equal to member_list_entries, i.e.
joined_list_entires is 0 when a node leaves the group.
Thanks to Xinwei Hu for the patch.
It does 2 things.
1. The cpg_deliver_callback make a compare between target_nodeid and our_nodeid.
It turns out openais set target_nodeid to 0 sometimes. for broadcasting ? I change the behavior so that lvm will process_remote also on target_nodeid == 0
2. The joined_list passed to cpg_confchg_callback doesn't include the already exist nodes in the group, which leads to an incomplete node_hash. I simply add all other nodes in member_list to node_hash also.
Thanks to Xinwei Hu for this patch.
Handles non-clustered as well as clustered. For clustered,
the best we can do is try exclusive local activation. If this
succeeds, we know it is not active elsewhere in the cluster.
Otherwise, we assume it is active elsewhere.
This bug has been around for a long time as far as I can tell.
Without this fix, a vgsplit would unconditionally move the
'hidden/internal' snapshot LVs, and result in corrupted metadata
in the following case:
vg1: contains lv1, lv1snap, both on pvset1
vg1: contains lv2, on pvset2
"vgsplit vg1 vg2 pvset2"
would result in "snapshot0" hidden LV being moved to vg2, and
the origin and cow being left in vg1. The tools detect the
corruption in vg2, but not in vg1.
When vg_lock_and_read() calls were added, they were done so incorrectly for
the destination VG (vg_to). This resulted in the VG lock not obtained when
a new VG was the destination (vg_lock_and_read() would fail in the vg_read()
clause, which would then release the lock before returning NULL), and could
result in corrupted destination VG.
The fix was to put back the original lock_vol() and vg_read() calls for 'vg_to'.
The failure of vg_read() indicates "vg does not exist", and we key off that
to determine whether we are dealing with a new or existing VG as the
destination.
The first two error messages were also the result of the incorrect
vg_lock_and_read() calls:
Volume group "new" not found
cluster request failed: Invalid argument
New volume group "new" successfully split from "vg"
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=438249
BEFORE:
tools/lvm lvresize -l +4 vg22/lv1linear
Volume group "vg22" not found
Volume group vg22 doesn't exist
AFTER:
tools/lvm lvresize -l +4 vg22/lv1linear
Volume group "vg22" not found
- Add validation on pv_count, lv_count, and snap_count after split
NOTE: Some of these counts are misleading. If you compare "lvs" output
with these counts you will be left scratching your head what a "logical volume"
really is. ;-)
Fix unfilled paramater passed to fsadm from lvresize
Update fsadm to call lvresize if the partition size differs (with option -l)
Fix fsadm to support vg/lv name (like the rest of lv-tools)
Fix 'make check' to use DMDIR to check DM_DEV_DIR support in dmsetup.
Add basic test cases for mirrored LV.
Add basic test cases for lvconvert mirror.
Add basic test cases for pvmove.
Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Add new vgsplit and vgmerge tests.
Dave Wysochanski <dwysocha@redhat.com>
* test/t-000-basic.sh: Invoke initial test of lvm with its "version"
argument, so that the behavior of the tool doesn't depend on whether
readline was enabled at configure time.
Author: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Committer: Jim Meyering <meyering@redhat.com>
Fix missing VG unlocks in some pvchange error paths.
Add some missing validation of VG names.
Rename validate_vg_name() to validate_new_vg_name().
Change orphan lock to VG_ORPHANS.
Change format1 to use ORPHAN as orphan VG name.
This makes the tests more reproducible and helps isolate
them from any existing LVM set-up.
* test/Makefile.in (abs_builddir): Define.
(init.sh): Emit definition of abs_builddir.
* test/lvm-utils.sh (unsafe_losetup_): Keep only the portable,
iterative approach.
(dmsetup_has_dm_devdir_support_): New function.
(init_root_dir_): New function.
Invoke init_root_dir_ for all but the first test.
* test/test-lib.sh (this_test_): Adapt to test-name change.
Invoke lvm-utils.sh much later (after tmpdir creation), and
only if the current test is not being skipped.
Remove useless abs_top_srcdir definition.
Rename t0->test_dir_rand_.
* test/t-lvcreate-pvtags.sh: Skip this test if the available
version of dmsetup is not new enough.
Use global, $G_dev_, rather than hard-coded "/dev".
* test/t-lvcreate-usage.sh: Make --verbose output more useful.
Author: Jim Meyering <jim@meyering.net>
Committer: Jim Meyering <meyering@redhat.com>
* dmsetup/dmsetup.c (DEV_PATH): Remove definition.
(parse_loop_device_name): Add parameter: dev_dir.
Declare the "dev" parameter to be "const".
Use dev_dir, not DEV_PATH. Handle the case in which dev_dir
does not end in a "/".
(_get_abspath): Declare "path" parameter "const", to match.
(_process_losetup_switches): Add parameter: dev_dir.
Pass dev_dir to parse_loop_device_name.
(_process_switches): Add parameter: dev_dir.
Pass dev_dir to _process_losetup_switches.
(main): Set dev_dir from the DM_DEV_DIR envvar, else to "/dev".
Call dm_set_dev_dir.
* lib/libdm-common.c (dm_set_dev_dir): Rewrite to be careful
about boundary conditions, now that dev_dir may be tainted.
* man/dmsetup.8: Mention $DM_DEV_DIR.
Author: Jim Meyering <meyering@redhat.com>
DMSETUP=: to disable dmsetup checks (but let the script run
nevertheless); warn the user if this is the case
b) put the non-root and dmsetup warnings both at start and end of
output
Print just one line:
Use `COMMAND --help' for more information.
after "real" diagnostic(s), rather than all of the usage lines.
Otherwise, the 30-40+ lines of --help output could obscure the real diagnostic.
Author: Jim Meyering <jim@meyering.net>
* test/Makefile.in (srcdir, top_srcdir): Use @srcdir@, etc.
(top_builddir, abs_srcdir, abs_top_builddir, abs_top_srcdir): Likewise.
(so_name): Remove definition.
(.bin-dir-stamp): No longer create symlink in $(DMDIR) tree.
Prompted by suggestions from Alasdair Kergon.
* test/t1000-lvcreate-usage.sh (cleanup_): Redirect to a file,
rather than to /dev/null.
Change wording of some test titles.
Suggestions from Alasdair Kergon.
* test/Makefile.in: Add a copyright notice.
* test/lvm-utils.sh: Likewise.
* test/mkdtemp: Likewise.
* test/t0000-basic.sh: Likewise.
* test/t1000-lvcreate-usage.sh: Likewise.
* test/t3000-lvcreate-pvtags.sh: Likewise.
* test/t4000-pv-range-overflow.sh: Likewise.
* test/test-lib.sh: Likewise.
Author: Jim Meyering <jim@meyering.net>
* test/t1000-lvcreate-usage.sh: New tests.
* test/Makefile.in (T): Add it.
Derived from test cases by Dave Wysochanski.
Author: Jim Meyering <jim@meyering.net>
* test/Makefile.in (so_name): Use @DMDIR@.
(.bin-dir-stamp): Create symlink only if @DMDIR@ is nonempty.
(lvm-wrapper): Emit LD_LIBRARY_PATH setting only if @DMDIR@ is nonempty.
Based on a patch from Jun'ichi Nomura.
Author: Jim Meyering <jim@meyering.net>
* configure.in: Convert a relative dmdir directory name to the required
absolute form, e.g. in ./configure --with-dmdir=../device-mapper
Suggestion from Jun'ichi Nomura.
* configure: Regenerate.
Author: Jim Meyering <jim@meyering.net>
* Makefile.in (check): New target.
* configure.in (AC_CONFIG_FILES): Add test/Makefile.
* configure: Regenerate.
* test/.gitignore: New file.
* test/Makefile.in: New file.
* test/lvm-utils.sh: New script.
* test/mkdtemp (die, rand_bytes, mkdtemp): New script.
* test/t0000-basic.sh: New tests.
* test/t3000-lvcreate-pvtags.sh: New, failing test.
Derived from a script by Jun'ichi Nomura.
* test/t4000-pv-range-overflow.sh: New test.
* test/test-lib.sh: Testing framework, based on the one from git.
Author: Jim Meyering <jim@meyering.net>
* tools/toollib.c (xstrtouint32): New function.
(_parse_pes): Use xstrtouint32; don't cast strtoul's unsigned
long to uint32_t. Detect overflow.
Author: Jim Meyering <jim@meyering.net>
* lib/device/dev-io.c (dev_open_flags):
Use log_sys_error after failed stat to report strerror(errno).
Use a slightly different diagnostic to report mismatched device number.
(MIRROR_NOTSYNCED) is added to the LVM metadata. This flag is
not cleared when converting to linear. Subsequently, if you
up-convert the linear to a mirror, the flag remains - even though
an up-convert will always force a complete resync.
Move yes_no_prompt() into library (display.c, display.h).
Fixup includes as a result of movement of prior two functions.
Fixup force_t enum to be more descriptive.
log type. Previously, we had a '--corelog' argument that
would change the default type from 'disk' to 'core'. I
think that creates too much confusion - especially when
doing conversions on mirrors.
The new argument '--log' takes either "disk" or "core"
as a parameter. This could be expanded in the future
for additional logging types as well.
Examples:
# Creating a 2-way mirror
$> lvcreate -m1 ... # implicitly use default disk logging
$> lvcreate -m1 --log disk ... # explicit disk logging
$> lvcreate -m1 --log core ... # specify core logging
$> lvcreate -m1 --corelog ... # old way still works
# Conversion examples
$> lvconvert --log core ... # convert to core logging
$> lvconvert --log disk ... # convert to disk logging
$> lvconvert -mX --corelog ... # old way still works
$> lvconvert -mX ... # old way of converting to disk logging still works
Changes are reflected in the man pages.
* lib/misc/lvm-file.c (lvm_fclose): New function.
* lib/misc/lvm-file.h (lvm_fclose): Declare it.
* lib/config/config.c (write_config_file): Use the new function to detect
and diagnose unlikely write failure.
* lib/filters/filter-persistent.c (persistent_filter_dump): Likewise.
* lib/format_text/archive.c (archive_vg): Likewise.
* lib/format_text/format-text.c (_vg_write_file): Likewise.
* lib/log/log.c (fin_log): Similar, but use dm_fclose directly.
Include "\n" at end of each fprintf format string.
* dmeventd/dmeventd.c (_set_oom_adj): When writing to /proc/self/oom_adj,
detect failure even if it's hidden behind ferror. [Using dm_fclose's
extra ferror test here is probably not needed, since the amount written
is nowhere near BUFSIZ, but use it regardless, for consistency. ]
* lib/fs/libdevmapper.c (do_suspend): Detect fclose failure when
writing to suspend.
* lib/misc/lvm-file.h (is_same_inode): Define.
* lib/filters/filter-persistent.c (persistent_filter_dump): Use is_same_inode
in place of a direct st_ino-only comparison.
* lib/locking/file_locking.c (_release_lock, _lock_file): Likewise.
config/config.c:493: warning: format '%d' expects type 'int', but argument 5 has type 'long int'
Modified original patch from Jim Meyering <jim@meyering.net>
There are two fixes other than improving variable names and updating code
layout etc.
The loop counter is incremented by area_len instead of area_len * stripes;
the 3rd _check_stripe parameter is no longer multiplied by number of stripes.
Use loop to iterate through the now-ordered policy list in _allocate().
Check for failure to allocate just the mirror log.
Introduce calc_area_multiple().
Support mirror log allocation when there is only one PV: area_count now 0.
(See lvm-devel list archives for further details.)
e.g. lvcreate -l 100%FREE to create an LV using all available space.
lvextend -l 50%LV to increase an LV by 50% of its existing size.
lvcreate -l 20%VG to create an LV using 20% of the total VG size.
Include mirror log (untested) in _for_each_pv() processing.
Use MIRROR_LOG_SIZE constant.
Remove struct seg_pvs from _for_each_pv() for generalisation.
Avoid adding duplicates to list of parallel PVs to avoid.
event-driven model. Without changes to the way the cache gets updated, the
option is currently unreliable without a global lock to prevent any lvm2
commands from running concurrently.
Add --config for overriding most config file settings from cmdline.
Quote arguments when printing command line.
Remove linefeed from 'initialising logging' message.
Add 'Completed' debug message.
Don't attempt library exit after reloading config files.
Always compile with libdevmapper, even if device-mapper is disabled.
Fix some memory leaks in error paths found by coverity.
Use C99 struct initialisers.
Move DEFS into configure.h.
Clean-ups to remove miscellaneous compiler warnings.
[Some activation-related features will stop working for a while now.
Some types of activation are getting split into two steps, with the
first step using the precommitted metadata.]
The daemon side of this is mostly the same as the patch I sent out. To select
a timeout period different than the default and to get the timeout period,
I added two library calls, dm_set_event_timeout() and dm_get_event_timeout().
If people are against them, the other option is to tack extra arguments onto
dm_regiser_for_event() and dm_get_registered_device(). I also added a
-t option to dmevent, so people can try out timeouts.
log types. This means the threaded_syslog type is no longer valid. A new
fxn multilog_async is available to toggle between the two modes. If an
app is compiled without pthreads and tries to use async logging, no logging
will occur while async is enabled.
dmeventd has been modified to use the new code
I'm not positive I like the way the async_logger code calls the log fxn,
but it works for now. Suggestions for other ways to do it would be helpful
- don't echo after an 'action' call - action does the echo itself
- use vgdisplay/vgs to determine which VGs are marked clustered and only
deactivate those VGs (unless the LVM_VGS var is set in
/etc/sysconfig/cluster)
- multilog_add_type, multilog_del_type, multilog_custom, and
multilog_init_verbose all have different arguments.
- Primary change is that caller only passes in config info, and the
lib keeps track of state internally. No more exporting of
'struct log_data'.
- Custom callers now only get the custom data pointer passed into their
log fxn (that is set with multilog_custom)
- Added basic README that describes libmultilog
# dmevent -l
Also, changed the behaviour of dm_get_registered_device(), so that it doesn't
change the pointer you passed in without freeing the memory on a non-next call,
and doesn't free your pointer without setting it to NULL on a failed next call.
multilog_add_type()/multilog_del_type cycles correctly.
o fixed segfault in multilog_add_type()
o fixed test-multilog.c
o cleaned up libmultilog (list macros, indentation, braces, comments)
- add event_nr to thread_status struct and set appropriately so that the
thread actually waits for an event
- essentially make error_detected return true. Let the DSOs determine
how to interpret the status info
o more tweaks to libmultilog calls - the api isn't set in stone yet, so
don't get too comfortable.
o not sure the dmeventd in device-mapper/dmeventd works - i've been using
the one in lib/event/
o currently both daemons are set to log only to syslog
o changed
int dm_get_next_registered_device(char **dso_name, char **device,
enum event_type *events);
to
int dm_get_registered_device(char **dso_name, char **device,
enum event_type *events, int next)
so that the daemon is able to retrive the next one of the list without
running into locking issues.
o changed dmevent.c to use dm_get_registered_device()
o couldn't test this yet because of the comms issues
(daemon exits in do_process_request())
Improve reporting of node-specific locking errors so you'll get
somthing a little more helpfiul than "host is down" - it will now tell
you /which/ host it thinks is down.
./configure --with-clvmd
wil do this by default. Or you can choose which you want with
./configure --with-clvmd=gulm or
./configure --with-clvmd=cman
When clvmd with both included is run, it will automatically detect the cluster
manager in use.
Additional verbosity level -vvvv includes line numbers and backtraces.
Verbose messages now go to stderr not stdout.
Close any stray file descriptors before starting.
Refine partitionable checks for certain device types.
Allow devices/types to override built-ins.
and further reduce the number of ioctl calls made.
o Metadata area struct change.
o Make config file accessible to activation functions & get stripe_filler
from it.
o Allow kernel to return snapshot status as a fraction or a percentage.
checks clearer (incl. variable renaming); using a flag to indicate when
output data doesn't fit into supplied buffer instead of returning an error etc.
Clear many compiler warnings (i386) & associated bugs - hopefully without
introducing too many new bugs:-) (Same exercise required for other archs.)
Default compilation has optimisation - or else use ./configure --enable-debug
to back this out so you can do that commit, let me know. Also, if there's
an issue with the error message that's displayed, just change it in tools.h.
This causes a "device-mapper driver/module not loaded?" error message to
be displayed for the commands that require dm-mod, if the tools can't get
the driver version. It's not done for commands that don't require dm-mod.
This should clear up some problems people have had attempting to use lvm2
without rtfm'ing.
allocation policy. This can currently take one of three values:
typedef enum {
ALLOC_NEXT_FREE,
ALLOC_STRICT,
ALLOC_CONTIGUOUS
} alloc_policy_t;
Notice that 'SIMPLE' has turned into the slightly more meaningful NEXT_FREE.
ii) Put code into display.[hc] for converting one of these enums to a
text representation and back again.
ii) Updated the text format so this also has the alloc_policy field.
o Various other kernel side tidy-ups.
o Version number changes so we have the option of adding new ioctl commands
in future without affecting the use of existing ones should you later
revert to an older kernel but not revert the userspace library/tools.
o Better separation of kernel/userspace elements in the build process to
prepare for independent distribution of the kernel driver.
preparsed status info, shove it all into a string, and then parse it
again to get the info back out (which is what i was doing before)
o basically that's it...i like this *much* better than the previous
method and i think it makes the _status fxn more flexible if we need
to use it to get other info out.
o Not sure if the code in dev_manager is really optimal, but it works..
will look at adjusting it a bit now.
o I *think* it works right when one snapshot if full but others aren't,
but I haven't really been able to test it because the full snapshot
somehow resets itself and weird things start happening to the system...
o There is still a bit missing
+ all are missing the {PV,VG,LV} # - that is not applicable in LVM2
+ pvdisplay doesn't show how many LVs are contained on it
+ much of the snapshot information isn't available for lvdisplay
o Look at the code for other potiential FIXMEs :)
Lots of changes/very little testing so far => there'll be bugs!
Use 'vgcreate -M text' to create a volume group with its metadata stored
in text files. Text format metadata changes should be reasonably atomic,
with a (basic) automatic recovery mechanism if the system crashes while a
change is in progress.
Add a metadata section to lvm.conf to specify multiple directories if
you want (recommended) to keep multiple copies of the metadata (eg on
different filesystems).
e.g. metadata {
dirs = ["/etc/lvm/metadata1","/usr/local/lvm/metadata2"]
}
Plenty of refinements still in the pipeline.
Patrick, can you see if this fixes your cluster syncing problem please ?
If so I'll make it so it only syncs if you have actually written to the
device.
from lock_vol() - otherwise it now attempts to acquire the lock and then
immediately releases it.
o Extend the id field in struct logical_volume to hold VG uuid + LV uuid
for format1. This unique lvid can be used directly when calling lock_vol().
o Add the VG uuid to vgcache to make VG uuid lookups possible. (Another
step towards using them instead of VG names internally.)
o #defines for common lock flag combinations
o Try out hyphens instead of colons in device-mapper names - does this
make messages containing filenames easier to read?
o rewrote activate.c to use dev-manager, I'm sure these two will merge
at some point.
o Rename is broken ATM
o dev-manager puts the calls through to fs.c for layers that have the
'visible' flag set.
o Use first unused lv_number when creating new LV
o Use lv_number for refs to snapshots
o Update persistent minor logic after the lvcreate restructure
o Reset all parameters before use in lvcreate.
and add severe warning if it's used to make a device seem bigger than
it really is. This is not an option people should be using as it
breaks metadata integrity.
o Use uint64_t throughout (rather than unsigned long long)
o Convert a few messages that contain pathnames into the more common form:
pathname: message
I'm taking a different route from LVM1 here in that snapshots are a
seperate entity from the logical volumes, I think of them as an
application of an LV (or two lvs rather). As such there is a list of
snapshots held against the vg, and there is *not* a SNAPSHOT, or
SHAPSHOT_ORG flag in lv->status.
all since it only supports vg_write. It has been replaced with:
int archive_vg(struct volume_group *vg,
const char *dir,
const char *desc,
uint32_t retain_days,
uint32_t min_archive);
which is now called directly by tools/archive.c
o Text format now has a description and time field at the top level.
o archiving and backup set the description appropriately. eg,
for an archive:
description = "Created *before* executing 'lvextend test_vg/lvol0 -l +1'."
creation_time = 1013166332
for a backup:
description = "Created *after* executing 'lvextend test_vg/lvol0 -l +1'."
creation_time = 1013166332
This is preparing the way for a simple vgcfgundo command.
slightly different from the current LVM1 method.
lvcreate --persistent y --minor 10 (to specify when created)
lvchange --persistent n (to turn off)
lvchange --persistent y --minor 11 (to change)
--persistent uses a new LV status flag stored on disk
minor number is stored on disk the same way as LVM1 does
(but major number stored is 0; any LVM1 major/minor setting gets lost)
lvchange -ay --minor 12 (to activate using minor 12, regardless of the
on-disk setting, which doesn't get changed)
--minor == -m
--persistent == -M
missing from a VG. (Linear targets use the device-mapper 'error' target
which returns ioerror; striped targets use '/dev/ioerror' for now - which must
already exist e.g. as a sufficiently large block device version of /dev/zero).
by allocating the data block with an additional dbg_malloc.
o Added an assertion to check that no one is requesting alternate
alignment for memory allocated from pool. I can't see us needing this
for LVM2.
Otherwise LVM1 decides the PV structure is corrupt.
But do we need to keep both pv->pe_size and vg->extent_size
in internal metadata or can we generate pvd->pe_size when writing out
a PV that belongs to a VG?
This should be a rare occurrence so the aim is to recover if it's
straightforward to do so, otherwise just to abort the operation.
If people *knowingly* change device names, they should always run vgscan
afterwards.
A few bytes of memory gets leaked inside a pool each time an alias
has to be discarded - it's not worth restructuring the code to reuse it.
More of LVM2 needs updating to pass device objects (or uuids) about
instead of pathnames so that resolution of pathname->object only happens
once per operation.
dev_cache_get() should now always return the *current* device at the path given
dev_name_confirmed() replaces dev_name() whenever it's important to
know that name for the device is still current (ie when opening it).
If the cache doesn't know a current name, the function fails.
dev_open() guarantees that the file descriptor returned is for the dev_t
of the device structure it was passed.
o When opening device, return error if its cached name is incorrect (eg if
it's changed since the cache was generated). This prevents use until
the cache is rebuilt (eg with vgscan). Doesn't catch every case.
struct pv_list {
struct list list;
struct physical_volume pv;
};
to
struct pv_list {
struct list list;
struct physical_volume *pv;
};
o New function in toollib 'create_pv_list', which creates a list of pv's
from a given command line array of pv's.
o Changed lvcreate/extend to use this (fixes lvextend [pv list] bug).
Kernel driver has a version number (stored in kernel/VERSION).
The first two components of this (0.94) give the version number of the
ioctl interface. This number must be changed whenever a change is
made to the ioctl interface that breaks backwards compatibility.
The library has a version number (stored in VERSION) which is
used for linking.
The first and/or second component of this must be changed whenever
a change is made to the library API that breaks backwards
compatibility.
onto a new device). uuid specified must not already exist on the system.
o More message tidying.
o When checking for label, only read PV metadata.
o Add ataraid. [Needs moving into config/defaults files.]
o updated vgcfgrestore args
o change _check_for_open_devices only to check devices present in the hash
table instead of using dev_iter which triggers a full scan even when only
displaying command line help
Supply offset to start of variable data area (so struct size can change
without breaking backward compatibility)
Add command that just returns the driver version
o roll vgcache back to agk's implementation, we'll revisit this as part
of the cluster integration.
o change the extra_info field in a label to be a void *
is active in the device-mapper.
o Many operations can be carried out regardless of whether the VG is
active or not.
o vgscan does not activate anything - use vgchange.
o Change lvrename to support renaming of active LVs.
o Remove '//' appearing in some pathnames.
o Dummy lv_check_segments() for compilation.
I've split the old autobackup function into two seperate areas:
'archiving' is performed *before* a vg configuration is changed. This
produces a numbered backup in /etc/lvm/archive.
A 'backup' is performed *after* a vg change. So the directory /etc/lvm/backup
will hold the a copy of the current configuration.
Current version of LVM2 instead relies on /usr/include/libdevmapper.h
which gets installed by the device mapper package.
(Should this location now be configurable?)
at the top of the file.
o Changed completion_matches -> rl_completion_matches, and added some consts.
This will probably break things on pre readline 4.2 systems.
o There is now a _default_debug, and _default_verbose level, when
using lvm interactively -vv and -dd switches just effect the current
command.
o Added a --quiet switch which sets both verbose and debug to zero.
o Introduced the LVM_SYSTEM_DIR variable.
This makes more sense because the persistent cache, and backup directories
are config specific.
eg, I use /etc/lvm for running my real LV's
but I have another directory /dev/lvm_loops that contains a config
that allows only loopback devices, I use this for testing.
o You must list long args with no short option (eg. --version) at the
front of the args.h file.
o If an argument has no short option, set the short option in args.h to '\0'
o The index into the 'the_args' var is now stored as the option value
for getopt, iff there is no short opt.
A substantial speed-up - particularly in readline mode.
If the hints turn out to be wrong, the relevant parts get thrown away.
vgscan destroys it totally. In both cases it then rebuilds itself as
required.
- The iterator can find labels by string and also appropriate version number (==,
<= or any) if you want.
- Add labels_match() call that compares the two labels and returns an error if
they do not match.
- Write labels in sector 1 & last rather than 2 & last as per Joe.
o Changed pv_map.c to maintain the list of free areas in size order, which
is more helpful to the allocators. If you want to allocate a bit of an
area call consume_area(area, size), this will adjust the area if there's
some space left and shuffle it to the correct place in the list.
Not tested.
old dmfs-lv.c thats gone.
o Dropped out support for multiple tables in line with ioctl interface
o Some reordering to better support the userland library
o Updated to 2.4.16
I'm fairly happy with the way that this is working now, so the next job is
to start the integration with the ioctl interface so there is a single
common dm.[ch] and selectable interfaces (fs or ioctl).
Further improvements can be made even now, but I hope to wait until we've
got this going and integrated and the libdm parts working as well before
investigating other avenues.
a seperate chunk of memory from dbg_malloc for each pool_alloc. This
will allow the bounds checking code in dbg_malloc to do it's stuff.
o The normal implementation moved to pool-fast.c
o pool.c now just contains a #ifdef and includes the appropriate .c file.
Alasdair, could you make sure that gcc -MM get's passed all the
CFLAGS please, otherwise the dependencies get calculated incorrectly.
logical volumes. It includes:
format1 changes.
metadata.h changes.
lv_manip.c changed (striped allocation still not done though).
activate.c changes.
Nothing has been near a compiler as yet.
Alasdair can you look at changing display.c to use to output the mappings
in a more segment oriented format please ?
I haven't put the span list into struct physical_volume to represent allocated
extents. I think the burden of maintaining it for things like lv_extend may
out weigh it's uses.
o Changed disk-rep to use these
o if NDEBUG is not defined the dev_cache will check for open devices on
teardown.
I was hoping this would speed things up. But I'm still getting:
reti:/home/joe/sistina/LVM2/tools# time ./lvm vgchange -a n
Volume group vg0 successfully changed
real 0m5.751s
user 0m0.060s
sys 0m0.070s
even though I have only 1 device with the vg on it passing the filters.
N.B. This means that you have to take very great care in the event that
you want to access the dcache tree from in kernel
o Added extra field to allow out of memory conditions to result in the
correct error code. (This hasn't received a lot of testing...)
I've ditched the final project (which would have cleared my whole list)
since its got other complications which I don't have time to fix right
now. Still as Meatloaf says, two out of three ain't bad!
o Full signed arguments to lvreduce/lvextend
o Consistent lv_number/pe map use
o Populate pv->pe_allocated
o Fixes for allocation/writing of multiple LVs
o Created dmfs.h as a private header for the filesystem code
o Using seq_file.[ch] written by Al Viro as a generic mechanism for /proc
style files which have one record per line. We use a slight modification
here, so if you are using a recent -ac kernel you'll need to replace the
existing seq_file.[ch] with the ones here and do a bit of editing to make
it work. I'll submit the changes to Al Viro shortly as they are very
small and I think make sense generally.
o Using fail_writepage()
o Init code for filesystem now all in dmfs-super.c
o Some common code reduction amoung the dmfs-*.c files
o Auto allocation of major device number (default). You can specify a
particular major by using a module argument. If built in then you don't
get this option at the moment but it could be added if required.
o Hotplug support
o General tidying
o Updated projects.txt file
o Patches updated to 2.4.14
Please add to/edit this file as you think of new ideas or discover bugs. The
items in it are in no particular order. They are also only ideas and hence may
never get implemented depending on whether they turn out to be good ideas or
not.
filters in order.
eg,
f = composite_filter_create(2, regex_filter, persistent_filter);
ownership of the filters passes, they will be destroyed when f's
destroy method is called.
devices {
# first match is final, eg. /dev/ide/cdrom
# get's rejected due to the first pattern
filter=["r/cdrom/", # don't touch the music !
"a/hd[a-d][0-9]+/",
"a/ide/",
"a/sd/",
"a/md/",
"a|loop/[0-9]+|", # accept devfs style loop back
"r/loop/", # and reject old style
"a/dasd/",
"a/dac960/",
"a/nbd/",
"a/ida/",
"a/cciss/",
"a/ubd/",
"r/.*/"] # reject all others
}
Alasdair this is ready to roll into the tools now.
and builds a *very* efficient engine that will tell you which regex a string
matches with only a single pass through the string. To be used in the config
file when specifying devices.
o Anchor's aren't supported yet (^ and $) but that won't take long.
o Also when we get some realistic config files we may want to consider adding an
extra level of indirection to the dfa state in order to compress the table.
It all depends on how large typical tables get.
Things to note:
o Changes to the dm-*.c files have been kept as small as possible during
the development of the new fs interface and there are a few places where
the new code does odd things to give the original code what it wants. These
places will gradually go away during the next few days once we are sure the
new code is sound.
o I've spent most of my testing time looking at the parser since thats where
a lot of the changes are, I've not checked the actual I/O very much, but
then that code hasn't changed at all.
o The print operation in the target type operations is there to help in
debugging and will go away eventually
o There are some other printk's which will also go away once we are sure that
things are working correctly.
o I've tagged the old code with PRE_DMFS if you want to use that until this is
stable.
o There are no kernel patches for this yet (will fix after lunch... :-)
o Makefile needs some changes
o need to EXPORT_SYMBOL(deny_write_access); in ksyms.c
How to use the new interface ?
mount -t dmfs dmfs /mnt/dm
cd /mnt/dm
mkdir fish fish/tank
cd fish/tank
cat ~/my.table > table
cd ..
ln -s tank ACTIVE
Creates a logical volume called fish and activates a table called tank, if
there is a problem doing the link, look in /mnt/dm/fish/tank/errors to see
what is wrong.
If you see any odd things happening, let me know right away as I'm sure there'll
be one or two things that slipped through my testing.
o Error file routines (initial idea)
o Various fixes for bugs
o Tidy a few things
o Added a bit of debugging code ready for when this gets tested
o get_exclusive_write_access() function which will get moved into namei.c
I hope (and rewritten accordingly), should this become the final version
used.
Still a few more areas need thinking about, but in general much closer now I
think. Last area to sort out before testing is the symlink code which is
pretty close now... just a few more checks needed and the actual calls to
the core code.
o Fixed to work with highmem
o Added dmfs private inode struct for lv and table directories
o Fixed a number of errors/typos
o Status file read returns 0 so we can leave this until we've actually got
something to report in this now.
o New locking on tables.... still some issues to be worked out here but
closer now I think.
o Now use mapping of table directory to hold pages rather than mapping of
table file inode. Need to write a note to myself to fix issues with the
file length at the same time....
Well thats enough for tonight I think. The error file will be part of
tomorrows work.
o Redo write logic for table file
o Relax rules for symlink content by removing the rewriting function
Well I probably won't get a chance to work on this tomorrow, so this is my
changeset to date.
non-obvious, its time to simplify :-)
o Moving towards a simpler and more obviously correct interface
o Removed some fs operations in directories representing volumes
o Changed some file names
o Made things cleaner
more changes to follow...
o Changed dm_remove() to accept a struct mapped_device argument rather than
a name
o We no longer have to look up devices by name, the dcache handles that
nicely for us
o Fixed a bug where we were freeing a structure before we'd finished with
it.
o The name field in struct mapped_device is now only used in a very few
places in dm.c and will be replaced in future with a back reference to
the dentry rather than keeping the name in two places.
o dmfs-dir.c becomes dmfs-lv.c
o dmfs-file.c becomes dmfs-table.c
o A few tweeks and updates
The main reason for the slow progress on these files (which are not yet used
by the device mapper) is that we are working out what this interface should
look like as we go along.
Once this has evolved a bit further and in a state where it can be used we'll
announce it on the lists for further comment.
o Only one list of block devices for all tables
o Locking to ensure that block devices only get opened once
o Block device handling is now in dm-blkdev.c
o We open block devices when we create the tables and hold them open until
the table is destroyed (this prevents the module for the device being
unloaded after table parsing and before the table is used)
o We compute the hardsect size when the table is created rather than when
someone requests it.
Still to fix/change:
o Probably want to hash the device lists in dm-blkdev.c and also remove refs
to struct dm_bdev outside this file.
o Need to ensure that hardsect_size doesn't change when new tables are
swapped in (maybe this ought to be a per volume parameter and the tables
will only parse if they match the value for the volume?).
Things are changing fast here, so if you want a stable version of thic code
try checking out yesterdays.
o Moved the linear target into its own module (not really because it needs to
be there, but because its useful to have a simple example so people can see
what we are doing)
Btw, this needs testing properly.
to up_write() and down_write() etc so that you can see what kind of a lock
it is (otherwise it could be anything.. semaphore, spinlock, spinlock_bh,
spinlock_irq, br_lock, etc.)
Mount the dm-fs filesystem on /device-mapper (will fix later). mkdir
to create a device, inside that directory every file you create is a table
file. If there are errors <table>.err will appear automagically. Mv a table
file to ACTIVE to activeate the device. I'm not happy with mv being the
binding command, symlink would be better.
devices based on /proc/devices
+ The dev_mgr structure now has a 256 element char array that is
initially all 0s
+ When a match is found, the array element corresponding to the major
number of the match is set to a non-zero value
+ to check for a match, all one has to do is check that the array
element at the major number in question is non-zero.
o I'm wondering if we should do this with bitwise operators instead? Does
anyone expect the major numbers to grow larger than 8-bits?
o I don't like it, but I'm committing it so I can go back and laugh at
myself later
o I have a (hopefully) better idea that i'll try to commit yet today.
o I'm working on caching the /proc/devices entries now, and should have
that in by the end of today or early tomorrow.
o There will be much cleanup involved with that...
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.