Corrupt metadata text (with good mda header) was being handled
in the label_scan phase, but not in the vg_read phase. This
was sufficient because metadata areas would always be read and
checksummed during label_scan (metadata parsing was skipped
previously as an optimization.)
This changed with the optimization in
commit 61a6f9905e
"metadata: optimize reading metadata copies in scan"
Now, some metadata areas will not be read and checksummed
at all during the label_scan phase, only during the vg_read
phase. This means that bad metadata text may first be detected
in the vg_read phase. So, add equivalent bad metadata handling
to the vg_read path to match the label_scan path.
While in the lockless scanning phase, we can avoid reading and checking
matching metadata copies if we already know them from another PV,
and just rely on matching metadata header information.
These copies will be examined later during locked metadata read/write
access.
This patch may postpone discovering some read failures to the locked phase.
When creating lvm2 metadata for a VG, lvm2 allocates a buffer,
and if the buffer is not big enough, it is 'reallocated' bigger,
and the whole metadata creation is repeated until the metadata fits.
Use the 'previous' metadata size as a hint to reduce looping here.
Preserve the computed crc32 checksum from the first written PV, just like we
preserve the generated metadata.
There is also no need to call crc32 twice on a wrapping buffer with two
calculations; the result is always the same as a single crc32 check.
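As a minimal illustration of that chaining property (a sketch using zlib's
crc32() with made-up buffers; lvm2's own CRC routine is not shown here, but
any chainable CRC behaves the same way):
```
/* Continuing the CRC across the two segments of a wrapping buffer gives
 * the same result as a single CRC over the concatenated data. */
#include <assert.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
	unsigned char part1[] = "metadata tail at end of area";
	unsigned char part2[] = "metadata head wrapped to start";
	unsigned char whole[sizeof(part1) - 1 + sizeof(part2) - 1];

	memcpy(whole, part1, sizeof(part1) - 1);
	memcpy(whole + sizeof(part1) - 1, part2, sizeof(part2) - 1);

	/* Two calls, carrying the CRC state across the wrap point. */
	uLong crc_two = crc32(0L, part1, sizeof(part1) - 1);
	crc_two = crc32(crc_two, part2, sizeof(part2) - 1);

	/* One call over the same bytes laid out contiguously. */
	uLong crc_one = crc32(0L, whole, sizeof(whole));

	assert(crc_two == crc_one);
	return 0;
}
```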
sysconf() may also return -1, although only theoretically.
Default to 4K should that happen.
Also call it just once in the function and keep the result in a static variable.
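A minimal sketch of that idea (not the lvm2 implementation; names are
illustrative):
```
#include <unistd.h>

/* Query the page size once, cache it, and fall back to 4K if sysconf()
 * fails (returns -1) or reports a nonsensical value. */
static long _get_pagesize(void)
{
	static long _pagesize = 0;

	if (!_pagesize) {
		_pagesize = sysconf(_SC_PAGESIZE);
		if (_pagesize <= 0)
			_pagesize = 4096;
	}

	return _pagesize;
}
```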
Ensure only a non-NULL 'du' pointer is dereferenced, although the comment
at the last assignment of the 'du' pointer already suggests the NULL case
should not happen.
So just being explicit.
The error path in _lockd_retrive_vg_pv_list() did not zero the released path,
causing a possible double-free later in the code.
Fix it by using a single function to free the lock_pvs structure.
When cache creation fails on the table reload path, implement a more
advanced revert solution that tries to restore the state of the LVM
metadata to how it looked before the actual caching started.
Loading a table line with invalid MQ/SMQ policy settings causes immediate
rejection - to prevent such failure, automatically filter out invalid
settings before the line is uploaded by lvm2.
For invalid settings, issue a warning informing the user how to remove
them.
This solution keeps running cached LVs that might have been created
in the past with invalid settings that were actually unused due to
another code bug.
New versions of the kvdo module expose statistics at a new location:
/sys/block/dm-XXX/vdo/statistics/...
Enhance lvm2 to check this location first.
Also, if the statistics info is missing, log it at 'debug' level,
so it does not fail the 'lvs' command.
Since VDO always returns 'zero' on unprovisioned reads
and every provisioned block is always 'zeroed' on partial writes,
we can avoid 'zeroing' of such LVs.
Some device id types can only be used with specific device major
numbers, so use this restriction to avoid some comparisons.
This is more efficient, and can avoid some incorrect matches.
pvid and vgid are sometimes a null-terminated string, and
other times a 'struct id', and the two types were often
cast between each other. When a struct id was cast to a char
pointer, the resulting string would not necessarily be null
terminated. Casting a null-terminated string id to a
struct id is fine, but is still avoided when possible.
A struct id is: int8_t uuid[ID_LEN]
A string id is: char pvid[ID_LEN + 1]
A convention is introduced to help distinguish them:
- variables and struct fields named "pvid" or "vgid"
should be null-terminated strings.
- variables and struct fields named "pv_id" or "vg_id"
should be struct id's.
- examples:
char pvid[ID_LEN + 1];
char vgid[ID_LEN + 1];
struct id pv_id;
struct id vg_id;
Function names also attempt to follow this convention.
Avoid casting between the two types as much as possible,
with limited exceptions when known to be safe and clearly
commented.
Avoid using variations of strcpy and strcmp, and instead
use memcpy/memcmp with ID_LEN (with similar limited
exceptions possible.)
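A small illustrative sketch of the convention (structures simplified;
ID_LEN is 32 as in lvm2, but the types and helpers here are stand-ins):
```
#include <stdint.h>
#include <string.h>

#define ID_LEN 32

struct id {
	int8_t uuid[ID_LEN];		/* binary form, not null-terminated */
};

struct example {
	char pvid[ID_LEN + 1];		/* string form, null-terminated */
	struct id pv_id;		/* binary form */
};

static int example_pvid_matches(const struct example *ex, const struct id *pv_id)
{
	/* Compare exactly ID_LEN bytes; never rely on strcmp, since a
	 * struct id carries no terminating NUL. */
	return !memcmp(ex->pvid, pv_id->uuid, ID_LEN);
}

static void example_set_pvid(struct example *ex, const struct id *pv_id)
{
	memcpy(ex->pvid, pv_id->uuid, ID_LEN);
	ex->pvid[ID_LEN] = '\0';	/* string form stays null-terminated */
	memcpy(&ex->pv_id, pv_id, sizeof(ex->pv_id));
}
```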
When splitting a VG with a thin/cache pool volume, handle pmspare during
such a split: allocate a new pmspare in the new VG or extend the existing
pmspare there, and eventually drop the pmspare in the original VG if it is
no longer needed there.
When checking the active component of a thinLV with an external origin,
we need to check whether the external origin isn't already active.
For this, however, we need to use a layered check for the -real device.
sysfs-based multipath component detection quit if a
device had multiple holders, and in this case would
fail to detect a device was an mpath component.
For the pv_min_size check, always use dev_get_size()
which is commonly used elsewhere, and don't bother
asking libudev for the device size when
external_device_info_source=udev.
related to config settings:
obtain_device_list_from_udev (controls if lvm gets
a list of devices from readdir /dev or from libudev)
external_device_info_source (controls if lvm asks
libudev for device information)
. Make the obtain_device_list_from_udev setting
affect only the choice of readdir /dev vs libudev.
The setting no longer controls if udev is used for
device type checks.
. Change obtain_device_list_from_udev default to 0.
This helps avoid boot timeouts due to slow libudev
queries, avoids reported failures from
udev_enumerate_scan_devices, and avoids delays from
"device not initialized in udev database" errors.
Even without errors, for a system booting with 1024 PVs,
lvm2-pvscan times improve from about 100 sec to 15 sec,
and the pvscan command from about 64 sec to about 4 sec.
. For external_device_info_source="none", remove all
libudev device info queries, and use only lvm
native device info.
. For external_device_info_source="udev", first check
lvm native device info, then check libudev info.
. Remove sleep/retry loop when attempting libudev
queries for device info. udev info will simply
be skipped if it's not immediately available.
. Only set up a libudev connection if it will be used by
obtain_device_list_from_udev/external_device_info_source.
. For native multipath component detection, use
/etc/multipath/wwids. If a device has a wwid
matching an entry in the wwids file, then it's
considered a multipath component. This is
necessary to natively detect multipath
components when the mpath device is not set up.
How to trigger:
```
~ # export LVM_SYSTEM_DIR=_
~ # pvscan
No matching physical volumes found
double free or corruption (!prev)
Aborted (core dumped)
```
When LVM_SYSTEM_DIR is empty, _load_config_file() won't be called.
When LVM_SYSTEM_DIR is not empty, cfl->cft is linked into cmd->config_files
by _load_config_file()@lib/commands/toolcontext.c.
The code that core dumps: _destroy_config()@lib/commands/toolcontext.c
```
/* CONFIG_FILE/CONFIG_MERGED_FILES */
if ((cft = remove_config_tree_by_source(cmd, CONFIG_MERGED_FILES)))
config_destroy(cft);
else if ((cft = remove_config_tree_by_source(cmd, CONFIG_FILE)))
config_destroy(cft); <=== first free the cft
dm_list_iterate_items(cfl, &cmd->config_files)
config_destroy(cfl->cft); <=== double free the cft
```
Fixes: c43f2f8ae0
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
expands commit d5a06f9a7d
"pvscan: skip indexing devices used by LVs"
The dev cache index is expensive and slow, so limit it
to commands that are used to observe the state of lvm.
The index is only used to print warnings about incorrect
device use by active LVs, e.g. if an LV is using a
multipath component device instead of the multipath
device. Commands that continue to use the index and
print the warnings:
fullreport, lvmdiskscan, vgs, lvs, pvs,
vgdisplay, lvdisplay, pvdisplay,
vgscan, lvscan, pvscan (excluding --cache)
A couple other commands were borrowing the DEV_USED_FOR_LV
flag to just check if a device was actively in use by LVs.
These are converted to the new dev_is_used_by_active_lv().
dev_cache_index_devs() is taking a large amount of time
when there are many PVs. The index keeps track of
devices that are currently in use by active LVs. This
info is used to print warnings for users in some limited
cases.
The checks/warnings that are enabled by the index are not
needed by pvscan --cache, so disable it in this case.
This may be expanded to other cases in future commits.
dev_cache_index_devs should also be improved in another
commit to avoid the extreme delays with many devices.
There have been two separate checks for metadata
validity: first that the metadata text begins with
a valid VG name, and second the checksum of the
metadata text. These happen in different places,
which means there have been two separate error paths
for invalid metadata. This also causes large metadata
to be read in multiple parts, the first part is read
just to check the vgname, and then remaining parts are
read later when the full metadata is needed.
This patch moves the vg name verification so it's
done just before the checksum verification, which
results in a single error path for invalid metadata,
and causes the entire metadata to be read together
rather than in parts from different parts of the code.
If label_scan encounters bad vg metadata, invalidate
bcache data for the device and reread the mda_header
and metadata text back to back. With concurrent commands
modifying large metadata, it's possible that the entire
metadata area can be rewritten in the time between a
command reading the mda_header and reading the metadata
text that the header points to. Since the label_scan
is just assembling an initial overview of devices, it
doesn't use locking to serialize with other commands
that may be modifying the vg metadata at the same time.
Add a profilable configurable setting for the vdo pool header size, which is
used as 'extra' empty space at the front and end of the vdo-pool device
to avoid having a disk in the system that may carry the same data as the real
vdo LV.
For some conversion cases, however, we may need to allow using a '0' header size.
TODO: in this case we may eventually avoid adding the 'linear' mapping layer
in the future - but this requires further modification of the lvm code base.
When adding a device to the devices file with --adddev, lvm
by default chooses the best device ID type for the new device.
The new --deviceidtype option allows the user to override the
built in preference. This is useful if there's a problem with
the default type, or if a secondary type is preferable.
If the specified deviceidtype does not produce a device ID,
then lvm falls back to the preference it would otherwise use.
Previously an explicit call of backup was necessary (often
either forgotten or over-used). With this patch the need to
store a backup is remembered at vg_commit, and once the VG is unlocked,
the committed metadata is automatically stored in the backup file.
This may possibly alter some printed messages from the command since the
backup is now taken later.
Instead of calling archive explicitly from command processing logic,
move this step to the first vg_write() call, which will automatically
store an archive of the committed metadata.
This slightly changes some error paths: where an error in archiving
used to be detected earlier in the command, some ongoing command
'actions' might now have happened, but they will simply be scratched in case
of error (since the new metadata would not have been written anyway).
So the general effect should be only some change in command message ordering.
For shared VG or LV locking, the IDM locking scheme needs to use the PV
list associated with the VG or LV for sending SCSI commands, thus it requires
some place to generate the PV list.
In reviewing the flow of LVM commands, the best place to generate the PV
list is in the locking lib. This is why this patch parses the PV list as
shown. It iterates over all the PV nodes one by one and compares them with
the VG name or LV prefix string. If any PV matches, the PV is
added to the PV list. Finally the PV list is sent to the lvmlockd daemon.
Here as mentioned, it compares the LV prefix string with the format
"lv_name_"; the reason is that it needs to find all relevant PVs, e.g.
for a thin pool, which has LVs for metadata, pool, error, and the raw LV, so
we can use the prefix string to find all PVs belonging to the thin
pool.
The global lock is not covered in this patch. To avoid the chicken
and egg issue, we need to prepare the global lock ahead of any
locking use. So the global lock's PV list is established in the
lvmlockd daemon by iterating over all drives with a partition labeled
"propeller".
Signed-off-by: Leo Yan <leo.yan@linaro.org>
We can consider the drive firmware a server handling locking
requests from nodes; this is essentially a client-server model.
DLM uses the kernel as a central place to manage locks, so it also
complies with the client-server model for locking operations. This is
why the IDM and DLM wrappers are similar to each other.
This patch largely works by generalizing the DLM code paths and then
providing degeneralized functions as wrappers for both IDM and DLM.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Error reading the dev and no pvid on the dev were both
returning 0. Make it easier for callers to
know which, if they care.
Return 1 if the device could be read, regardless
of whether a pvid was found or not.
Set has_pvid=1 if a pvid is found and 0 if no
pvid is found.
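A hypothetical caller sketch (dev_read_pvid() and struct device here are
stand-ins, not the actual lvm2 identifiers) showing how the two cases can
now be told apart:
```
struct device;
int dev_read_pvid(struct device *dev, int *has_pvid);	/* assumed prototype */

static int check_dev(struct device *dev)
{
	int has_pvid = 0;

	if (!dev_read_pvid(dev, &has_pvid))
		return 0;	/* 0: the device could not be read at all */

	if (!has_pvid) {
		/* 1 with has_pvid == 0: readable, but not a PV */
	}

	return 1;	/* readable; has_pvid says whether a PVID was found */
}
```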
This case happens when we convert an LV to another type,
i.e. when we change an existing LV into a different type - so change
the message to debug level and avoid confusing users with a message about
the device path not matching.
We may eventually enhance the caching code to drop cached info
after taking the lock and reading the VG.
Commit 0b18c25d93 introduced
'zalloc()' for allocation of outdated pvs,
but no matching 'free()' is present.
Switch to use cmd mempool instead of adding free() code into
several places.
The autoactivation property can be specified in lvcreate
or vgcreate for new LVs/VGs, and the property can be changed
by lvchange or vgchange for existing LVs/VGs.
--setautoactivation y|n
enables|disables autoactivation of a VG or LV.
Autoactivation is enabled by default, which is consistent with
past behavior. The disabled state is stored as a new flag
in the VG metadata, and the absence of the flag allows
autoactivation.
If autoactivation is disabled for the VG, then no LVs in the VG
will be autoactivated (the LV autoactivation property will have
no effect.) When autoactivation is enabled for the VG, then
autoactivation can be controlled on individual LVs.
The state of this property can be reported for LVs/VGs using
the "-o autoactivation" option in lvs/vgs commands, which will
report "enabled", or "" for the disabled state.
Previous versions of lvm do not recognize this property. Since
autoactivation is enabled by default, the disabled setting will
have no effect in older lvm versions. If the VG is modified by
older lvm versions, the disabled state will also be dropped from
the metadata.
The autoactivation property is an alternative to using the lvm.conf
auto_activation_volume_list, which is still applied to VGs/LVs
in addition to the new property.
If VG or LV autoactivation is disabled either in metadata or in
auto_activation_volume_list, it will not be autoactivated.
An autoactivation command will silently skip activating an LV
when the autoactivation property is disabled.
To determine the effective autoactivation behavior for a specific
LV, multiple settings would need to be checked:
the VG autoactivation property, the LV autoactivation property,
the auto_activation_volume_list. The "activation skip" property
would also be relevant, since it applies to both normal and auto
activation.
If we are signaled with SIGTERM it should be treated at least as seriously
as SIGINT - the command should stop ASAP.
So whenever an lvm2 command allows signal handling we also
enable SIGTERM handling. If there are other signals
we should handle equally, we can just extend the array.
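A minimal sketch of that approach (not the lvm2 signal code; names are
illustrative):
```
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t _sigint_caught = 0;

static void _catch_signal(int sig)
{
	(void) sig;
	_sigint_caught = 1;
}

static void _install_handlers(void)
{
	/* Extend this array to handle additional signals the same way. */
	static const int _handled[] = { SIGINT, SIGTERM };
	struct sigaction act;
	unsigned i;

	memset(&act, 0, sizeof(act));
	act.sa_handler = _catch_signal;
	sigemptyset(&act.sa_mask);

	for (i = 0; i < sizeof(_handled) / sizeof(_handled[0]); i++)
		(void) sigaction(_handled[i], &act, NULL);
}
```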
Not all libc libraries (like musl, uclibc, dietlibc) support full symbol
version resolution at runtime like glibc does.
Add support for not generating symbol versions when compiling against them.
Additionally libdevmapper.so was broken when compiled against
uclibc. Runtime linker loader caused calling dm_task_get_info_base()
function recursively, leading to segmentation fault.
Introduce the --with-symvers=STYLE option, which allows choosing
between gnu and disabled symbol versioning. By default gnu symbol
versioning is used.
__GNUC__ check is replaced now with GNU_SYMVER.
Additionally ld version script is included only in
case of gnu option, which slightly reduces output size.
Providing --without-symvers to configure script when building against
uclibc library fixes segmentation fault error described above, due to
lack of several versions of the same symbol in libdevmapper.so
library.
Based on:
https://patchwork.kernel.org/project/dm-devel/patch/20180831144817.31207-1-m.niestroj@grinn-global.com/
Suggested-by: Marcin Niestroj <m.niestroj@grinn-global.com>
The function does not have the best name, since it checks not just raid LVs
for being in sync.
Restore the mirror percentage checking - although without retries,
since only the raid target is currently known to need it - for other
types it would ATM be a bug to get an inconsistent result.
While waiting for the raid to return a consistent status,
use interruptible sleep - so the command can break quickly.
Use lv_raid_status() to get percentage easily from status.
Enable extension/mixing of striped/linear, error and zero
segtype LVs with striped/linear, error and zero segtypes.
It is not very useful in practice, as the user cannot store any real
data on error or zero segtypes, but it may get some use in
scenarios where e.g. some portion of the device should not be
readable. Mixing of types happens at 'extent_size' level:
lvcreate -L1 -n lv vg
lvextend --type error -L+1 vg/lv
lvextend --type zero -L+1 vg/lv
lvextend --type linear -L+1 vg/lv
lvextend --type striped -L+1 vg/lv
lvs -o+segtype,seg_size vg
Note: when the type is not specified, the last segment type is
automatically selected.
It's also a small 'can of worms' since we can't tell from the LV whether
it is linear/error/zero or a mixture of them. So the meaning behind
them may need some updates.
We already have these types of LVs created, e.g. by:
vgreduce --removemissing --force
where missing LV segments have been replaced by either the
error or zero segtype (lvm.conf).
TODO: it might be worth adding a message while such device is activated.
Add some extra code to handle a thin-pool sized differently
from its thin-pool data volume.
ATM this can't really happen, but once we start to use multiple
commits while resizing a stacked LV, we may actually get into
the position where the data LV has already been resized,
but the thin-pool stayed at the old size.
But for now - report difference as internal error.
Devices made only from the 'error' target cannot be used,
and if a device is combined from 'error' and 'zero' targets,
the same rule applies: such a device cannot be used.
Just like with other segtypes, use this function to get the whole
raid status info available per a single ioctl call.
It also nicely simplifies reading the percentage info about the
in_sync portion of the raid volume.
TODO: drop use of calls other than lv_raid_status,
since all such calls could already use the status - so they just
add unnecessary duplication.
Reduce ioctl count and avoid separate info check,
when we can get the same info from status ioctl.
When devmanager calls return 0, then the exists value 0
means the reason for failure is a missing device in the table.
In such a case we avoid a stack trace.
Swap the flush parameter for the vdo status function
to match thin pool status.
Reuse similar 'acceleration' as used for dependent volumes also
for snapshots - so when an origin is being removed with all its thick
snapshots, don't bother with individual 'COW' detachments
and write&commits, and when possible handle this all within
a single commit.
Move the code for prompting about a removed LV to a single function
and use it also to prompt for removal of an origin and all its thick
snapshots, and also when removing a merging origin.
The function handles postponed write_and_commit so there is
no 'in-flight' operation while waiting on the [y|n] answer.
Compat code to handle the unusual case where a
thin snapshot is also a 'thick snapshot origin' and such a
snapshot gets merged into a thin origin.
However, since lv_is_visible() (which is a complex function)
now replaced the &VISIBLE_LV check, this whole check seems to be
no longer useful, as the sum of all 3 will always match??
Since a cached LV is going to be removed together with its cache,
there is not much to gain by trying to flush the cache first.
The user may use 'vgcfgrestore' to get back origin + cache,
assuming issue_discards is not used -
when data is discarded after removal there is nothing to restore!
This change allows further reducing the number of commits
during lvremove/vgremove.
When lvremove/vgremove removes thin volumes together with their thin-pool,
try to skip any updates of such a thin-pool, so when everything properly
deactivates, no messages are sent to this thin-pool and the whole
thin-pool is removed with a single commit.
Returning NULL for lv_committed is basically an instant crash,
so try the passed LV instead.
It shouldn't matter, as this is an internal error path anyway,
but coverity should be happier.
Another step towards better automatic handling of backup:
automatically set needs_backup after commit.
In some next step we should reduce the number of backups and take
them only at the command finish, with the vg_committed content.
Since commit b44db5d1a7, the allocated
pointer needs to be checked for a failed malloc().
The existing check was actually not checking anything, so a failing
malloc here would result in a segfault (although with a very
low chance of ever happening).
Match the VG uuid just once per list of all LVs in the VG.
TODO: maybe some more efficient tree or hash could be better here,
but since it's not used so often, the total benefit is not so great,
so ATM just reduce the amount of checked bytes.
Use different 'hint' size for dm_hash_create() call - so
when debug info about hash is printed we can recognize which
hash was in use.
This patch doesn't change actual used size since that is always
rounded to be power of 2 and >=16 - so as such is only a
help to developer.
We could eventually use the 'name' arg, but since this would change the
API and this patchset will be routed to libdm & stable - we will
just use this small trick.
Just like with deactivation, the call of 'lv_is_not_in_use()'
now has an embedded report for an inactive LV.
Note: this patch cannot be backported to stable-2.02 - as
there lv_is_active() has a 'cluster' meaning and differs from lvinfo().
When an LV is deactivated, we check for presence, and later
for some LV types also for being in use.
We can however do this check in one step for them and remove an extra ioctl.
Add return value '2' to lv_check_not_in_use() to recognize LV is not
present.
Existing users were just testing for 0, so no change for them.
When parsing VG metadata, we can create from a single config tree
also the 'vg_committed' that is always created for a writable VG.
This avoids the extra unnecessary step of serializing and deserializing
the just-parsed VG.
Every vg_write stores new 'metadata' into the precommitted slot.
For this step we use a 'serialized buffer' of ascii metadata.
Instead of recreating this buffer after the whole 'vg_write()', we
use this buffer immediately for creating the precommitted VG.
This has also the advantage of catching any problems with
reparsing ascii metadata back to a VG early, before any write.
This patch postpones the update of lvm metadata for each removed
LV to a later moment, depending on the LV type.
It also queues messages to be printed after such a write & commit.
As such there is some change in the behavior - although before a
prompt we do make write & commit happen automatically, in some
other error cases we rather keep the 'existing' state - so there
could be a difference in the amount of removed & committed LVs.
IMHO the introduced logic is slightly better and safer.
But some cases still need the early commit - i.e. thin removal -
and fixing this needs some more thinking.
TODO: improve removal at least for the case of the whole thin-pool,
i.e. we can simply recognize removal of 'all LVs/whole VG'.
Make the generic "device is not usable" message from filter-usable
more specific in case the device is not usable because it's an LV.
(i.e. when scan_lvs=0)
When 'lv_info()' is called with an &info structure,
the presence of the node has to be checked from this structure.
Without this we were needlessly trying to look up the 0:0 device.
When lvm2 calls archive() or backup(), it can be useful to allow handling
the break signal so the command can be interrupted at some consistent point.
The signal is accepted during processing of these calls - and can be evaluated
later, even during lengthy processing loops.
So now the user can interrupt a lengthy lvremove().
Taking a backup with each removed LV slows down the process
considerably and is largely unneeded. We are supposed to take
a backup only at significant points, making sure the backup
is correct when the command is finished.
TODO: check how many other commands can be improved.
Use 'C' for alphasort - there is no need to use localized and slower
sorting for internal directory scanning.
Ensure allocated dirent entries are released on all code paths.
Optimize full path construction.
Drop the comment "This setting is no longer used." which
was printed just before the standard deprecation comment:
"This configuration option is deprecated."
When lvmconfig --typeconfig full printed a deprecated
entry it would attempt to print a non-existing
deprecation comment resulting in output like:
# (null) # This setting is no longer used.
The LVM devices file lists devices that lvm can use. The default
file is /etc/lvm/devices/system.devices, and the lvmdevices(8)
command is used to add or remove device entries. If the file
does not exist, or if lvm.conf includes use_devicesfile=0, then
lvm will not use a devices file. When the devices file is in use,
the regex filter is not used, and the filter settings in lvm.conf
or on the command line are ignored.
LVM records devices in the devices file using hardware-specific
IDs, such as the WWID, and attempts to use subsystem-specific
IDs for virtual device types. These device IDs are also written
in the VG metadata. When no hardware or virtual ID is available,
lvm falls back to using the unstable device name as the device ID.
When devnames are used, lvm performs extra scanning to find
devices if their devname changes, e.g. after reboot.
When proper device IDs are used, an lvm command will not look
at devices outside the devices file, but when devnames are used
as a fallback, lvm will scan devices outside the devices file
to locate PVs on renamed devices. A config setting
search_for_devnames can be used to control the scanning for
renamed devname entries.
Related to the devices file, the new command option
--devices <devnames> allows a list of devices to be specified for
the command to use, overriding the devices file. The listed
devices act as a sort of devices file in terms of limiting which
devices lvm will see and use. Devices that are not listed will
appear to be missing to the lvm command.
Multiple devices files can be kept in /etc/lvm/devices, which
allows lvm to be used with different sets of devices, e.g.
system devices do not need to be exposed to a specific application,
and the application can use lvm on its own set of devices that are
not exposed to the system. The option --devicesfile <filename> is
used to select the devices file to use with the command. Without
the option set, the default system devices file is used.
Setting --devicesfile "" causes lvm to not use a devices file.
An existing, empty devices file means lvm will see no devices.
The new command vgimportdevices adds PVs from a VG to the devices
file and updates the VG metadata to include the device IDs.
vgimportdevices -a will import all VGs into the system devices file.
LVM commands run by dmeventd do not use a devices file by default,
and will look at all devices on the system. A devices file can
be created for dmeventd (/etc/lvm/devices/dmeventd.devices). If
this file exists, lvm commands run by dmeventd will use it.
Internal implementation:
- device_ids_read - read the devices file
. add struct dev_use (du) to cmd->use_devices for each devices file entry
- dev_cache_scan - get /dev entries
. add struct device (dev) to dev_cache for each device on the system
- device_ids_match - match devices file entries to /dev entries
. match each du on cmd->use_devices to a dev in dev_cache, using device ID
. on match, set du->dev, dev->id, dev->flags MATCHED_USE_ID
- label_scan - read lvm headers and metadata from devices
. filters are applied, those that do not need data from the device
. filter-deviceid skips devs without MATCHED_USE_ID, i.e.
skips /dev entries that are not listed in the devices file
. read lvm label from dev
. filters are applied, those that use data from the device
. read lvm metadata from dev
. add info/vginfo structs for PVs/VGs (info is "lvmcache")
- device_ids_find_renamed_devs - handle devices with unstable devname ID
where devname changed
. this step only needed when devs do not have proper device IDs,
and their dev names change, e.g. after reboot sdb becomes sdc.
. detect incorrect match because PVID in the devices file entry
does not match the PVID found when the device was read above
. undo incorrect match between du and dev above
. search system devices for new location of PVID
. update devices file with new devnames for PVIDs on renamed devices
. label_scan the renamed devs
- continue with command processing
The user can use 'lvconvert -Zn --type vdo-pool' to convert an existing
vdo-formatted volume and skip lvm2's internal formatting.
This however requires that the user passes proper matching parameters.
For this the user can use the --profile|--metadataprofile options, whose
support has also been enhanced.
TODO: add support to read values directly from the formatted volume.
In some cases we use 'creation' also during conversion.
Here it can actually have the unwanted side effect that we may remove
not just newly created layers - but also the original converted LV.
So until we make clear how to properly revert from some errors
in the middle of a conversion, disable removal for any 'lvconvert' command.
When lvdisplay was executed while a thin snapshot had been merged into
a thin origin and the operation had been postponed until devices
are closed, the command crashed.
Check LV is COW before trying to check snapshot percentage.
Fix clearing persistent filter state when clearing all
the state from a label_scan.
label_scan reads devs and saves info in bcache, lvmcache,
and in the persistent filter. In some uncommon cases, an
lvm command wants to clear all info from a prior label_scan,
and repeat label_scan from scratch. In these cases, info
in lvmcache, bcache and the persistent filter all need to
be cleared before repeating label_scan.
Because the persistent filter wiping was missed, outdated persistent
filter info from a prior label_scan could cause lvm to
incorrectly filter devices that change between polling intervals.
(i.e. if the device changes in such a way that the filtering
results change.)
A case where lvm wants to do multiple label_scans is a
polling command (like lvconvert --merge), when lvmpolld
has been disabled, so that the command itself needs to
do repeated polling checks.
Automatically figure out the resizable layer in the LV stack and
resize it online.
Split the check for reshaped raids and postpone removal of unused space
left after finished reshaping until after metadata archiving.
Drop the warning about unsupported automatic resize of a monitored thin-pool.
Currently there is not yet support for resize of writecache.
Move extra md component detection into the label scan phase.
It had been in set_pv_devices which was deep within the vg_read
phase, which wasn't a good place (better to detect that earlier.)
Now that pv metadata info is available in the scan phase, the pv
details (size and device_hint) can be used for extra md checking.
Use the device_hint from the pv metadata to trigger a full md
component check if the device_hint begins with /dev/md.
Stop triggering full md component checks based on missing
udev info for a dev.
Changes to tests to reflect that the code is now detecting
md components in some test case that it wasn't before.
The current allocation limitation requires a metadata/log LV to fit on
a single PV. This is usually not a big problem, but since
thin-pool and cache-pool use this for allocating extents
for their metadata LVs, it might eventually cause errors
when the remaining free space for a large metadata size is spread
over several PVs.
When passing 'pvmove --name arg' try to automatically move
all associated dependencies with given LV.
i.e. 'pvmove --name thinpool vg vgnew'
moves all thins and data and metadata LV into a new VG vgnew.
Use update_pool_metadata_min_max() which is shared with
thin-pool metadata min-max updating.
Gives improved messages when converting volumes to metadata.
There is not much point in allocating more than this size
even when e.g. the converted LV is bigger than 16GiB (%extent_size).
ATM neither thin-pool nor cache-pool supports bigger metadata.
Initial support for thin-pool used a slightly smaller max size of 15.81GiB
for thin-pool metadata. However the real limit later settled at 15.88GiB
(the difference is ~64MiB - 16448 4K blocks).
lvm2 could not simply increase the size, as it has been using hard cropping
of the loaded metadata device to avoid the kernel printing warnings
when the size was bigger (i.e. due to a bigger extent_size).
This patch adds the new lvm.conf configurable setting:
allocation/thin_pool_crop_metadata
which defaults to 0 -> no crop of metadata beyond 15.81GiB.
Only users with these sizes of metadata will be affected.
Without cropping lvm2 now limits metadata allocation size to 15.88GiB.
Any space beyond is currently not used by thin-pool target.
Even if e.g. a bigger LV is used for metadata via lvconvert,
or it is allocated bigger because of too large an extent size.
With cropping enabled (=1) lvm2 preserves the old 15.81GiB
limitation and should allow working in an environment with
older lvm2 tools (i.e. an older distribution).
Thin-pool metadata with a size bigger than 15.81GiB now uses the CROP_METADATA
flag within the lvm2 metadata, so older lvm2 recognizes an
incompatible thin-pool and cannot activate such a pool!
Users should use the uncropped version, as it does not suffer
from the various issues between thin_repair results and the allocated
metadata LV, since the thin_repair limit is 15.88GiB.
Users should use cropping only when really needed!
The patch also better handles resize of thin-pool metadata and prevents resizing
beyond the usable size of 15.88GiB. Resizing beyond 15.81GiB automatically
switches the pool to the no-crop version. Even with existing bigger thin-pool
metadata, the command 'lvextend -l+1 vg/pool_tmeta' makes the change.
The patch gives better control over the 'converted' metadata LV and
reports a less confusing message during conversion.
The patch set also moves the code for updating min/max into pool_manip.c
for better sharing with the cache_pool code.
When detaching writecache, make the first stage send a message
to dm-writecache to set the cleaner option. This is instead of
reloading the dm table with the cleaner option set. Reloading
the table causes udev to process/probe the dm dev, which gets
stalled because of the writeback activity, and the stalled udev
in turn stalls the lvconvert command when it tries to sync with
udev events.
When getting writecache status we do not need to get
open_count or read_head info, which can cause extra steps.
In case legs of a raid0 LV are removed, the lvdisplay command still
reports 'available', though raid0 does not provide any resilience
compared to the other raid levels.
Also lvdisplay does not display '(partial)' in case of missing raid0
legs, as opposed to the lvs command.
Enhance lvdisplay to report "NOT available" for any RaidLV type in case
too many legs are inaccessible hence causing data loss. I.e. any leg
for raid0, all for raid1, more than 1 for raid4/5, more than 2 for raid6
and in case of completely lost mirror groups for raid10.
Add test/shell/lvdisplay-raid.sh.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1872678
The new VDO target v6.2.3 corrects support for online rename of a VDO device.
If needed, it can be disabled via the new lvm.conf setting:
vdo_disabled_features = [ "online_rename" ]
When removing a pool LV from a stacked LV setup, it's been possible
to leak _pmspare, and such a hidden LV then required manual
removal by the user.
Fix it by moving automatic removal into _lv_reduce().
When adding replacement raid+integrity images (lvconvert --repair
after a raid image is lost), various errors can cause the function
to exit with an error. On this exit path, the function attempts
to revert new images that had been created but not yet used. The
cleanup failed to account for the fact that not all images needed
to be reverted.
Since commit 77fdc17d70 we always include
the log_len size in the needed extents - however now we may sometimes need
more extents than necessary - mainly when multiple PVs are involved
in the allocation.
Add logs_still_needed into the calculation of sufficient_pes_free().
When a writecache sublv or an integrity metadata sublv
are partial (missing a dev), set the partial flag on
the upper level LV also, as is done for other sublvs.
When using cache with a cachevol, the cache_check tool was
not being run on the cache metadata during activation.
cache_check clears the needs_check flag in the cache
metadata, so if the flag was set due to an unclean
shutdown, the activation would fail.
Each integrity image in a raid LV reports its own number
of integrity mismatches, e.g.
lvs -o integritymismatches vg/lv_rimage_0
lvs -o integritymismatches vg/lv_rimage_1
In addition to this, allow the total number of integrity
mismatches from all images to be displayed for the raid LV.
lvs -o integritymismatches vg/lv
shows the number of mismatches from both lv_rimage_0 and
lv_rimage_1.
The args for pvcreate/pvremove (and vgcreate/vgextend
when applicable) were not efficiently opened, scanned,
and filtered. This change reorganizes the opening
and filtering in the following steps:
- label scan and filter all devs
. open ro
. standard label scan at the start of command
- label scan and filter dev args
. open ro
. uses full md component check
. typically the first scan and filter of pvcreate devs
- close and reopen dev args
. open rw and excl
- repeat label scan and filter dev args
. using reopened rw excl fd
- wipe and write new headers
. using reopened rw excl fd
In some cases the dev size may not have been read yet
in set_pv_devices(). In this case get the dev size
before comparing the dev size with the pv size.
To read the lvm headers and set dev->pvid if the
device is a PV. Difference from label_scan_ functions
is this does not read any vg metadata or add any info
to lvmcache.
Filtering in label_scan was controlled indirectly by
the fact that bcache was not yet set up when label_scan
first ran. The result is that filters that needed data
would not run and would return -EAGAIN, which would
result in the dev flag FILTER_AFTER_SCAN being set.
After the dev header was read for checking the label,
filters would be rechecked because of FILTER_AFTER_SCAN.
All filters would be checked this time because bcache
was now set up, and the filters needing data would
largely use data already scanned for reading the label.
This design worked but is hard to adjust for future
cases where bcache is already set up.
Replace this method (based on setting up bcache, or not)
with a new cmd flag filter_nodata_only. When this flag
is set filters that need data will not run. This allows
the same label_scan behavior when bcache has been set up.
There are no expected changes in behavior.
Touching of stack allocation validated the given size against the rlimit,
and if reserved_stack was above the rlimit, it was completely
ignored - now we always touch the stack up to rlimit/2 size.
Since the BLKZEROOUT ioctl is supposedly the fastest
way to clear a block device, start using this ioctl
for zeroing a device. Commonly we zero only a typically
small portion of a device (8KiB) - however since we now
also started to zero metadata devices, in the case
of e.g. thin-pool metadata this can go up to ~16GiB
and here the performance starts to be noticeable.
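A minimal sketch of the ioctl usage (not the lvm2 code path; offset and
length must be sector-aligned, and the fallback error handling is trimmed):
```
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Zero a byte range of an open block device, preferring BLKZEROOUT and
 * falling back to writing zeroes if the ioctl is not supported. */
static int zero_range(int fd, uint64_t offset, uint64_t len)
{
	uint64_t range[2] = { offset, len };

	if (!ioctl(fd, BLKZEROOUT, range))
		return 1;		/* kernel zeroed the range for us */

	char buf[4096] = { 0 };
	if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
		return 0;
	while (len) {
		size_t chunk = len > sizeof(buf) ? sizeof(buf) : (size_t) len;
		if (write(fd, buf, chunk) != (ssize_t) chunk)
			return 0;
		len -= chunk;
	}
	return 1;
}
```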
Since dev_set_bytes() now closes the dev on its error path itself,
remove this now-unneeded call (introduced a few commits back
in history, thus also removing the comment from WHATS_NEW).
lvm2 normally blocks signals during the protected
phase where it does not want to be interrupted.
Support interruptible processing when it is allowed,
in the section between sigint_allow() ... sigint_restore(),
and let 'io_getevents()' finish with EINTR.
When bcache tries to write data to a faulty device,
it may run out of caching blocks and then just busy-loop
on a CPU - so protect against this by checking
whether there are already max_io (~64) errored blocks.
Call _wait_all(), which checks whether there is still
some pending IO, before sleeping. Otherwise it may happen that
our submitted IO operations have already been dispatched
and this call then endlessly waits for IO which is all done.
This can be reproduced when a device quickly returns errors
on write requests.
When detaching a writecache, use the cleaner setting
by default to writeback data prior to suspending the
lv to detach the writecache. This avoids potentially
blocking for a long period with the device suspended.
Detaching a writecache first sets the cleaner option, waits
for a short period of time (less than a second), and checks
if the writecache has quickly become clean. If so, the
writecache is detached immediately. This optimizes the case
where little writeback is needed.
If the writecache does not quickly become clean, then the
detach command leaves the writecache attached with the
cleaner option set. This leaves the LV in the same state
as if the user had set the cleaner option directly with
lvchange --cachesettings cleaner=1 LV.
After leaving the LV with the cleaner option set, the
detach command will wait and watch the writeback progress,
and will finally detach the writecache when the writeback
is finished. The detach command does not need to wait
during the writeback phase, and can be canceled, in which
case the LV will remain with the writecache attached and
the cleaner option set. When the user runs the detach
command again it will complete the detach.
To detach a writecache directly, without using the cleaner
step (which has been the approach previously), add the
option --cachesettings cleaner=0 to the detach command.
Since we already detect the transaction_id before starting
to build the dm tree, this extra check is a duplicate
that would only capture a very tiny 'race', and we later
validate the transaction_id with the suspended snapshot origin.
Introduce structures lv_status_thin_pool and
lv_status_thin (pair to lv_status_cache, lv_status_vdo)
Convert lv_thin_percent() -> lv_thin_status()
and lv_thin_pool_percent() + lv_thin_pool_transaction_id() ->
lv_thin_pool_status().
This way a function user can see not only percentages, but also
other important status info about thin-pool.
TODO:
This patch tries not to change too many other things,
but pool_below_threshold() now uses the new thin-pool info to return
failure if the thin-pool cannot actually be modified.
This should be handled separately in a better way.
LVM2 is distributed under GPLv2 only. The readline library changed its
license long ago to GPLv3. Given that those licenses are incompatible
and you follow the FSF in their interpretation that dynamically linking
creates a derivative work, distributing LVM2 linked against a current
readline version might be legally problematic.
Add support for the BSD licensed editline library as an alternative for
readline.
Link: https://thrysoee.dk/editline
Cover the case where two copies of metadata have the
same seqno but different checksums. Also elaborate
on an existing fixme in the code for this case, since
we should be doing something better for this case.
This had been uncovering an issue with reopening
fds in readwrite mode.
Improve error response and reporting when creating thin snapshots.
If the thin pool kernel metadata already has a device with the ID lvm2
tries to create, give a more meaningful error message and also properly
restore the transaction id to the value known to the thin-pool in this case.
Before, it was possible to diverge by one from the kernel TID value,
and lvm2 stacked a delete message for such a thin device.
Since ATM the kernel does not support this operation,
disable 'lvrename' of an active vdopool.
As a workaround, the user may simply deactivate, rename and activate.
When the user tries to extend a vdo pool, they always need to go
at least by 1 full VDO slab (defined as vdo_slab_size_mb).
To avoid all the trouble around finding a 'workable' size, lvm2 automatically
increases the passed (or by --use-policies calculated) extension size
(and informs the user about a sometimes possibly large increase, as the slab
size can go up to 32GiB).
With VDO, users need to always 'think big' anyway and expect such
operations to be in the GiB range.
When the table reload fails during suspend(), we were only calling
plain resume() - and this will reload only those devices
which were left suspended, but will not try to restore
the metadata state according to lvm2's reverted metadata.
So if we were reloading a device tree, we restored
only the top-level LV, and the rest of the reverted device manipulation
was left alone and possibly mismatched what is in the committed
metadata.
FIXME: There are several cases where such a revert will likely not work
properly anyway, as some operations are currently handled in a single commit
while they need multiple commits, but it's a step towards better correctness.
At least we now catch these errors earlier.
lvm opens devices readonly to scan them, but
needs to open them readwrite to update the metadata.
Previously, the ro fd was closed before the rw fd
was opened, leaving a small gap where the dev was
not held open, and during which the dev could
possibly change which storage it referred to.
With the bcache_change_fd() interface, lvm opens a
rw fd on a device to be written, tells bcache to
change to the new rw fd, and closes the ro fd.
. open dev ro
. read dev with the ro fd (label_scan)
. lock vg (ex for writing)
. open dev rw
. close ro fd
. rescan dev to check if the metadata changed
between the scan and the lock
. if the metadata did change, reread in full
. write the metadata
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets an fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
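A simplified sketch of the di-to-fd translation idea (not the actual bcache
code; sizes and names are illustrative):
```
/* A small table maps a device index (di) to its current file descriptor,
 * so the fd can be swapped without invalidating cached blocks keyed by di. */
#define FD_TABLE_SIZE 1024

static int _fd_table[FD_TABLE_SIZE];

static void fd_table_init(void)
{
	for (int di = 0; di < FD_TABLE_SIZE; di++)
		_fd_table[di] = -1;	/* -1 marks an unused slot */
}

static int bcache_set_fd_sketch(int fd)
{
	for (int di = 0; di < FD_TABLE_SIZE; di++)
		if (_fd_table[di] == -1) {
			_fd_table[di] = fd;
			return di;	/* caller uses di in the bcache api */
		}
	return -1;			/* table full */
}

static void bcache_change_fd_sketch(int di, int fd)
{
	_fd_table[di] = fd;		/* cached blocks for di remain valid */
}

static void bcache_clear_fd_sketch(int di)
{
	_fd_table[di] = -1;
}
```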
During removal of a lot of locking code the signal blocking got lost
and signal processing got broken, leading to unpredictable
behavior, i.e. activation code that can get interrupted in the
middle of DM table processing.
lvm2 code always expects signals to be blocked while a lock is held
unless it is explicitly placed into a section of:
sigint_allow(); ....; sigint_restore();
For checking for a caught interrupt there is sigint_caught();
Metadata size was calculated correctly only for raids.
This fixes a crash during lvcreate when a thin-pool was created
on a VG where the remaining free space had a size to fit only a single
metadata LV and not also its _pmspare.
Lvcreate crashed with this assert message:
lvcreate: metadata/pv_map.c:198: consume_pv_area: Assertion `to_go <= pva->count' failed.
Aborted (core dumped)
TODO: there is probably too large an overload of several alloc_handle
variables.
Reported-by: Wu Guanghao <wuguanghao3@huawei.com>
Reported-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
When using --use-policy for automatic extension of a thin-pool,
the extension of the thin-pool's metadata itself can actually take
some extra space.
Since I'm not aware of an exact compensation formula, add just
1% extra to the calculated amount and hope it fits.
The wanted target is to always have a usable thin-pool that fits
below pool_metadata_min_threshold().
Since we query on regular code these:
lv_raid_has_integrity()
lv_has_integrity_recalculate_metadata()
without prior checking for lv_is_raid() - these 'return 0' should
not use <stacktrace> as they are expected.
Correcting rounding rules for percentage evaluation.
Validate supported range of percentage.
(although ranges are already validated earlier on code path)
This is probably a somewhat experimental patch - but when e.g. a raid device
is just extended, there should not be a technical need for a flush,
unless the target strictly needs it. It should allow faster
processing of lvm commands, not being blocked by a possibly longer flush.
Since we do not support rimage & rmeta for snapshots, we can
avoid querying for -cow devices and add them as origin_only,
since their snapshots (-cow) could never have existed.
This reduces several ioctl operations during table preloading.
Switch remaining zero-sized structs to flexible arrays to be C99
compliant.
These simple rules should apply:
- The incomplete array type must be the last element within the structure.
- There cannot be an array of structures that contain a flexible array member.
- Structures that contain a flexible array member cannot be used as a member of another structure.
- The structure must contain at least one named member in addition to the flexible array member.
Although some of the code pieces should be still improved.
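A minimal sketch of the C99 flexible array member pattern that replaces a
zero-sized array (names here are illustrative, not from libdm/lvm2):
```
#include <stdlib.h>
#include <string.h>

struct name_entry {
	unsigned len;
	char name[];		/* flexible array member: must be last */
};

static struct name_entry *name_entry_create(const char *name)
{
	size_t len = strlen(name);
	struct name_entry *ne;

	/* Allocate the fixed header plus space for the trailing characters. */
	if (!(ne = malloc(sizeof(*ne) + len + 1)))
		return NULL;

	ne->len = (unsigned) len;
	memcpy(ne->name, name, len + 1);

	return ne;
}
```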
While 'mmap' file reading normally utilizes resources better,
it also has its odd side when handling errors - so while we normally
use mmap only for reading regular files from the root filesystem
(i.e. lvm.conf), we can't prevent an error from happening during the read
of these files - and such an error unfortunately ends with a SIGBUS error.
Maintaining a signal handler would be complicated - so switch to the slightly
less efficient but more error-resistant read() functionality.
Reproducible steps:
1. vgcreate vg1 /dev/sda /dev/sdb
2. lvcreate --type raid0 -l 100%FREE -n raid0lv vg1
3. remove the /dev/sdb device
4. lvdisplay shows the wrong 'LV Status'
After removing a raid0-type LV's underlying dev, lvdisplay still displays
'available'. This is the wrong status for raid0.
This patch adds a new function raid_is_available(), which handles
all raid cases.
With this patch, lvdisplay will show
from:
LV Status available
to:
LV Status NOT available (partial)
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
It's better to set most options as 'commented' with some
documented defaults instead of providing strict values.
This has the advantage that we can eventually 'change' defaults
and get them working in the future. Otherwise, once the setting
is stored in lvm.conf in /etc, such a setting has a strictly
defined value and that can only be changed with a file update.
merge.c:_check_lv_segment() was checking regionsize vs. mirrored LV size on
any 'mirror/raid1/raid10' segment type including type 'mirrored' mirror logs.
Avoid the check only for 'mirrored' mirror logs to allow conversion from log
type 'disk' with regionsize > mirror log SubLV size.
As we disabled support for 'mirrored' mirror logs with
commit e82303fd6a which still conditionally
allows to enable it via global/support_mirrored_mirror_logs=1,
patch is mandatory for all distributions.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1712983
Currently lvm2 is not wiping signatures when creating 'metadata' volumes
and raid _rmeta was the only exception - so make the behavior consistent
with other metadata devices and drop wiping ATM.
Also drop some extra debug messages since they are now more explanatory in
the wipe_lv() function.
Also note - although lvm2 now does not wipe signatures - the error
from such wiping used to be actually 'ignored' before wipe_lv()
started to return an error (with a recent commit), and raid creation
continued with an 'unzeroed' metadata device.
TODO: Several issues to resolve:
1. We may want to flip to wiping with all LVs (in that case we need to
support passing --yes & --force).
2. We may also want to clear the whole metadata device - however the current
function is also used for wiping i.e. the snapshot COW device, which
is likely not a good candidate for full device zeroing.
We may also need to think about better logic when the extent size
enforces very large LVs, when only a small portion of the LV is ever
being used.
3. Using TRIM instead of zeroing the metadata device might be worth
implementing.
To avoid pollution of metadata with some 'garbage' content, or eventually
some leak of stale data in case the user wants to upload metadata somewhere,
ensure upon allocation that the metadata device is fully zeroed.
The behaviour may slow down allocation of a thin-pool or cache-pool a bit,
so the old behaviour can be restored with the lvm.conf setting:
allocation/zero_metadata=0
TODO: add zeroing for extension of metadata volume.
A failure in wiping/zeroing stops the command.
If the user wants to avoid command abortion, they should use -Zn or -Wn
to avoid wiping.
Note: there is no easy way to distinguish which kind of failure has
happened - so it's safe not to proceed any further.
When a larger write request is initiated, it may happen that bcache
runs out of free chunks - fix the loop that is supposed to wait
until the next free chunk becomes available.
To create a new cache or writecache LV with a single command:
lvcreate --type cache|writecache
-n Name -L Size --cachedevice PVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, a new cachevol LV is created internally, using PVfast
specified by the cachedevice option.
- Then, the cachevol is attached to the main LV, converting the
main LV to type cache|writecache.
Include --cachesize Size to specify the size of cache|writecache
to create from the specified --cachedevice PVs, otherwise the
entire cachedevice PV is used. The --cachedevice option can be
repeated to create the cache from multiple devices, or the
cachedevice option can contain a tag name specifying a set of PVs
to allocate the cache from.
To create a new cache or writecache LV with a single command
using an existing cachevol LV:
lvcreate --type cache|writecache
-n Name -L Size --cachevol LVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, the cachevol LVfast is attached to the main LV, converting
the main LV to type cache|writecache.
In cases where more advanced types (for the main LV or cachevol LV)
are needed, they should be created independently and then combined
with lvconvert.
Example
-------
user creates a new VG with one slow device and one fast device:
$ vgcreate vg /dev/slow1 /dev/fast1
user creates a new 8G main LV on /dev/slow1 that uses all of
/dev/fast1 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1
-n main -L 8G vg /dev/slow1
Example
-------
user creates a new VG with two slow devs and two fast devs:
$ vgcreate vg /dev/slow1 /dev/slow2 /dev/fast1 /dev/fast2
user creates a new 8G main LV on /dev/slow1 and /dev/slow2
that uses all of /dev/fast1 and /dev/fast2 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1 --cachedevice /dev/fast2
-n main -L 8G vg /dev/slow1 /dev/slow2
Example
-------
A user has several slow devices and several fast devices in their VG,
the slow devs have tag @slow, the fast devs have tag @fast.
user creates a new 8G main LV on the slow devs with a
2G writecache on the fast devs:
$ lvcreate --type writecache -n main -L 8G
--cachedevice @fast --cachesize 2G vg @slow
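For illustration of the --cachevol form described earlier (a hedged sketch;
LV/VG names, sizes and devices are placeholders), assuming the fast LV is
created first on the fast PV:
$ lvcreate -n fast -L 2G vg /dev/fast1
$ lvcreate --type writecache --cachevol fast -n main -L 8G vg /dev/slow1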
It's possible for a dev-cache entry to remain after all
paths for it have been removed, and other parts of the
code expect that a dev always has a name. A better fix
may be to remove a device from dev-cache after all paths
to it have been removed.
When either logical block size or physical block size is 4K,
then lvmlockd creates sanlock leases based on 4K sectors,
but the lvm client side would create the internal lvmlock LV
based on the first logical block size it saw in the VG,
which could be 512. This could cause the lvmlock LV to be
too small to hold all the sanlock leases. Make the lvm client
side use the same sizing logic as lvmlockd.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum. When used on
raid images, dm-raid will correct the error by reading
the block from another image, and the device user sees
no error. The integrity metadata (checksums) are stored
on an internal LV allocated by lvm for each linear image.
The internal LV is allocated on the same PV as the image.
Create a raid LV with an integrity layer over each
raid image (for raid levels 1,4,5,6,10):
lvcreate --type raidN --raidintegrity y [options]
Add an integrity layer to images of an existing raid LV:
lvconvert --raidintegrity y LV
Remove the integrity layer from images of a raid LV:
lvconvert --raidintegrity n LV
Settings
Use --raidintegritymode journal|bitmap (journal is default)
to configure the method used by dm-integrity to ensure
crash consistency.
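For example (a hedged sketch; LV/VG names and size are placeholders), the
bitmap mode could be selected at creation time:
$ lvcreate --type raid1 -m1 --raidintegrity y --raidintegritymode bitmap -n rr -L1G vg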
Initialization
When integrity is added to an LV, the kernel needs to
initialize the integrity metadata/checksums for all blocks
in the LV. The data corruption checking performed by
dm-integrity will only operate on areas of the LV that
are already initialized. The progress of integrity
initialization is reported by the "syncpercent" LV
reporting field (and under the Cpy%Sync lvs column.)
Example: create a raid1 LV with integrity:
$ lvcreate --type raid1 -m1 --raidintegrity y -n rr -L1G foo
Creating integrity metadata LV rr_rimage_0_imeta with size 12.00 MiB.
Logical volume "rr_rimage_0_imeta" created.
Creating integrity metadata LV rr_rimage_1_imeta with size 12.00 MiB.
Logical volume "rr_rimage_1_imeta" created.
Logical volume "rr" created.
$ lvs -a foo
LV VG Attr LSize Origin Cpy%Sync
rr foo rwi-a-r--- 1.00g 4.93
[rr_rimage_0] foo gwi-aor--- 1.00g [rr_rimage_0_iorig] 41.02
[rr_rimage_0_imeta] foo ewi-ao---- 12.00m
[rr_rimage_0_iorig] foo -wi-ao---- 1.00g
[rr_rimage_1] foo gwi-aor--- 1.00g [rr_rimage_1_iorig] 39.45
[rr_rimage_1_imeta] foo ewi-ao---- 12.00m
[rr_rimage_1_iorig] foo -wi-ao---- 1.00g
[rr_rmeta_0] foo ewi-aor--- 4.00m
[rr_rmeta_1] foo ewi-aor--- 4.00m
When a vdopool is activated standalone - we use a wrapping linear device
to keep the actual vdo device active - for this we can set up a read-only
device to ensure no write can be made through this device to the
actual pool device.
Creating a snapshot was using a persistent LV lock
on the origin, so if the origin LV was inactive at
the time of the snapshot the LV lock would remain.
(Running lvchange -an on the inactive LV would
clear the LV lock.) Use a transient LV lock so it
will be dropped if it was not locked previously.
When formatting a VDO volume, the calculated amount of bits
for the 'vdoformat --slab-bits' parameter was shifted by 2 bits
(the calculated size made a 2MiB vdo_slab_size_mb value appear as if
the user had specified only 512KiB).
Fixed by properly converting the internal size_mb value to KiB.
Fix the annoying kernel message reported:
device-mapper: cache: 253:2: metadata operation 'dm_cache_commit' failed: error = -5
which has been reported while a cachevol has been removed.
It happened via a confusing variable - so switch to the commonly used '_size'
variable which holds a value in sector units, and avoid 'scaling' it as an
extent count by the vg extent size when placing the 'error' target on the removal path.
The patch shouldn't have an impact on actual user data, since at this moment
of removal all data should have already been flushed to the origin device.
The previous patch improved reading of the pipe when lvm2 was looking
for the default logical size, but we clearly must read the pipe also
for the -V case, when the logical size is already defined.
There is still room to improve this place to block only the particular reshape
operations which ATM cause kernel problems.
We check if the new number of images is higher - and prevent the
conversion if the volume is in use (i.e. as a thin-pool's data LV).
clang: it's a supposedly impossible path to hit, as we should always
have origin_lv defined when running this path, but adding protection
is not a big deal and makes this obvious to the analyzer.
Since _reserve_area() may fail due to an allocation failure,
add support to propagate this already reported failure upward.
FIXME: it's log_error() without causing direct command failure.
Although we expect min_chunk_size to be a 32bit value, for
large cache sizes it might be useful to do the calculations in 64bit.
So to avoid doing the shift as signed 32bit - use unsigned 64bit
from the start.
reporting fields (-o) directly from kernel:
writecache_total_blocks
writecache_free_blocks
writecache_writeback_blocks
writecache_error
The data_percent field shows used cache blocks / total cache blocks.
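A hedged usage sketch (VG/LV names are placeholders) selecting the fields
listed above:
$ lvs -o name,data_percent,writecache_total_blocks,writecache_free_blocks,writecache_writeback_blocks,writecache_error vg/main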
Until we resolve reshape for 'stacked' devices, we need to disable it.
So users can no longer reshape i.e. thin-pool data volumes, which ATM
causes bad thin-pool problems.
After the VG lock is taken for vg_read, reread the mda_header
and compare the metadata text offset and checksum to what was
seen during label scan. If it is unchanged, then the metadata
has not changed since the label scan, and the metadata does not
need to be reread under the lock for command processing.
For commands that do not make changes (e.g. reporting), the
mda_header is reread and checked on one mda to decide if the
full metadata rereading can be skipped. For other commands
(e.g. modifying the vg) the mda_header is reread and checked
from all PVs. (These could probably just check one mda also.)
When pvcreate/pvremove prompt the user, they first release
the global lock, then acquire it again after the prompt,
to avoid blocking other commands while waiting for a user
response. This release/reacquire changes the locking
order with respect to the hints flock (and potentially other
locks). So, to avoid deadlock, use a nonblocking request
when reacquiring the global lock.
Avoid leaking the hint memory on every loop continue and
allocate the hint only when it's going to be added to the list.
Switch to using 'dm_strncpy()' and validate sizes.
For dev_in_device_list() != 0 the allocated 'devl' was
actually leaking - so instead allocate 'devl' only
when !dev_in_device_list() and indent the surrounding code.
Since we check for NULL pointers earlier we need
to be consistent across the function - since the NULL
check applies across the whole function.
When dropping the 'mda' check - we are actually
already dereferencing it before - so it can't
be NULL at those places (and it's validated
before entering _read_mda_header_and_metadata).
dev_unset_last_byte() must be called while the fd is still valid.
After a write error, dev_unset_last_byte() must be called before
closing the dev and resetting the fd.
In the write error path, dev_unset_last_byte() was being called
after label_scan_invalidate() which meant that it would not unset
the last_byte values.
After a write error, dev_unset_last_byte() is now called in
dev_write_bytes() before label_scan_invalidate(), instead of by
the caller of dev_write_bytes().
In the common case of a successful write, the sequence is still:
dev_set_last_byte(); dev_write_bytes(); dev_unset_last_byte();
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
When resizing 2 volumes like a thin-pool and its metadata and they
would be of a different type - the command was actually expecting
both LVs to be of the same segtype - and would throw an error in
case they are different.
This patch fixes it by setting a new segtype from the last segment of
the 2nd. extended device.
Also it fixes the possible 'percentage' extension setup that
might have been used for the 'primary' volume - while the 'secondary'
LV always goes with a direct size - as we do not support 'percentage'
setup for it.
This mainly affects usage of thin-pool where the extension of
thin-pool data size may also lead to extension of metadata size.
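For illustration (a hedged sketch; the pool name is a placeholder), a
percentage-based extension of thin-pool data, during which the metadata LV
may also be extended internally by a direct size:
$ lvextend -l +50%FREE vg/pool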
Instead of checking all LVs in a VG - do just a direct copy of LVs
from the existing list ->segs_using_thin_lv.
TODO: maybe it could be better to expose seg_list to /tools...
Enhance lv_info with lv_info_with_name_check.
This 'variant' not only checks the existence of the UUID in the DM table
but also compares its DM name with the expected LV name.
Otherwise activation may 'skip' the activation with rename in case the
DM UUID already exists, just under a different device name.
This change makes it considerably easier to manipulate i.e. a detached mirror
leg which ATM uses the same UUID - only the LV name has been changed.
The previous code was not able to run 'activation' (and do a rename) and just
skipped the call. So the code used to do a workaround and 'tried'
to deactivate such an LV first - this however worked only in the non-clvmd case,
as the cluster was not holding the lock for the deactivated LV.
With this extended lv_info the code will run 'activation' and will
synchronize the name to match the expected LV name.
The patch extends _lv_info() with a new parameter 'with_name_check',
which is later translated into the 'name_check' argument for
_info_run(), which in case of a name mismatch evaluates the
check as if the device does not exist.
Such a call is only used in one place, _lv_activate(), which then
lets activation run. All other invocations of _info()
are left intact.
TODO: fix mirror table manipulation (and raid)....
The return value from bcache_invalidate_fd() was not being checked.
So I've introduced a little function, _invalidate_fd() that always
calls bcache_abort_fd() if the write fails.
The resume of the 'released' 'COW' should precede the resume of the origin.
The fact we needed to do the sequence differently for merge was
caused by bugs fixed in the 2 previous commits - so we no longer need
to recognize 'merging' and we should always go with a single
sequence.
The importance of this order is to properly remove the '-real' device
from the origin LV. When the COW is activated 2nd., the '-real' device is
kept in the table as it cannot be removed during the 1st. resume of the origin,
and the later activation of the COW LV no longer builds the tree associated
with the origin LV.
When checking the device id of a thin device that is just being
merged - the snapshot actually could have already finished,
which means the '-real' suffix for the LV is already gone and just the LV
is there - so check explicitly for this condition and use
the correct UUID for this case.
When a cachevol LV is attached, have the LV keep its lock
allocated. The lock on the cachevol won't be used while
it's attached. When the cachevol is split a new lock does
not need to be allocated. (Applies to cachevol usage by
both dm-cache and dm-writecache.)
When LV gets cached and uses cache-pool - such cache-pool
will now get _cpool suffix automatically.
Thus 'Pool' column for cached LV will now show either _cvol
or _cpool LV.
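A hedged sketch (VG name is a placeholder) showing the 'Pool' column with the
new suffixes:
$ lvs -a -o name,pool_lv vg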
Before 'archive()' is called, lvm2 must not touch/modify metadata.
So move setting CACHE_VOL related flags past this point.
Also make sure reading of cache segtype always restores this
flag properly (even if compatible flag would be lost).
Since the code is using -cdata and -cmeta UUID suffixes, it does not need
any new 'extra' ID to be generated and stored in metadata.
Since the introduction of the new 'segtype' cache+CACHE_USES_CACHEVOL we can
safely assume a 'new' cache with cachevol will now be created
without the extra metadata_id and data_id in metadata.
For backward compatibility, the code still reads them in case an older
version of metadata has them - so it should still be able
to activate such volumes.
A bonus is the lowered size of the lv structure used to store info about an LV
(noticeable with big volume groups).
The first part of a cachevol LV is used for metadata,
and the rest of the space is used for data. The
division of space between metadata and data depends
on the total size of the cachevol.
The previous division gave more space than needed to
metadata, it was:
cachevol size 8M to 128M -> metadata size 16M *
cachevol size 128M to 1G -> metadata size 32M
cachevol size 1G and up -> metadata size 64M
(* if this resulted in over half the LV used as
metadata, then half the cachevol would be used
for metadata, and the other half for data.)
The division of space now gives less space to
metadata, it is:
cachevol size 8M to 16M -> metadata size 4M
cachevol size 16M to 4G -> metadata size 8M
cachevol size 4G to 16G -> metadata size 16M
cachevol size 16G to 32G -> metadata size 32M
cachevol size 32G and up -> metadata size 64M
When a VG contains some LVs with unknown segtypes, the user
should still be allowed to activate other LVs in the VG that
are understood.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi------- 4.00m
other foo vwi---u--- 48.00m
$ lvcreate -l1 foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Cannot change VG foo with unknown segments in it!
Cannot process volume group foo
$ lvchange -ay foo/lvol0
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
$ lvchange -ay foo/other
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
Refusing activation of LV foo/other containing an unrecognised segment.
$ lvs foo
WARNING: Unrecognised flag CACHE_USES_CACHEVOL in segment type cache+CACHE_USES_CACHEVOL.
WARNING: Unrecognised segment type cache+CACHE_USES_CACHEVOL
LV VG Attr LSize
lvol0 foo -wi-a----- 4.00m
other foo vwi---u--- 48.00m
A cachevol LV had the CACHE_VOL status flag in metadata,
and the cache LV using it had no new flag. This caused
problems if the new metadata was used by an old version
of lvm. An old version of lvm would have two problems
processing the new metadata:
. The old lvm would return an error when reading the VG
metadata when it saw the unknown CACHE_VOL status flag.
. The old lvm would return an error when reading the VG
metadata because it would not find an expected cache pool
attached to the cache LV (since the cache LV had a
cachevol attached instead.)
Change the use of flags:
. Change the CACHE_VOL flag to be a COMPATIBLE flag (instead
of a STATUS flag) so that old versions will not fail when
they see it.
. When a cache LV is using a cachevol, the cache LV gets
a new SEGTYPE flag CACHE_USES_CACHEVOL. This flag is
appended to the segtype name, so that old lvm versions
will fail to use the LV because of an unknown segtype,
as opposed to failing to read the VG.
Enhance activation of cached devices using cachevol.
Correctly instantiate cachevol -cdata & -cmeta devices with
'-' in the name (as they are only layered devices).
The code is also a bit more compact (although still not ideal,
as the usage of extra UUIDs stored in metadata is troublesome
and will be repaired later).
NOTE: this patch may bring a potentially minor incompatibility for 'runtime' upgrade.
Let vgck --updatemetadata repair cases where different mdas
hold independently valid but mismatched copies of the metadata,
i.e. different text metadata checksums or text metadata sizes.
vgck --updatemetadata would write the same correct
metadata to good mdas, and then to bad mdas, but the
sequence of vg_write/vg_commit calls between good and
bad mdas could cause a different description field to
be generated for good/bad mdas. (The description field
describing the command was recently included in the
ondisk copy of the metadata text.)
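The repair itself is invoked as named above (VG name is a placeholder):
$ vgck --updatemetadata vg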
If a VG is forcibly changed from lock_type sanlock to
lock_type none, the internal lvmlock LV is left behind.
If that LV is not removed before vgremove is run on the
VG, then an internal check will be triggered by the
hidden lvmlock LV. So, check for and remove a left over
lvmlock LV during vgremove.
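A hedged sketch of the scenario (VG name is a placeholder), assuming the
documented forced lock-type change followed by vgremove, which now cleans up
the leftover hidden lvmlock LV:
$ vgchange --lock-type none --lockopt force vg
$ vgremove vg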
Add lots of vdo fields:
vdo_operating_mode - For vdo pools, its current operating mode.
vdo_compression_state - For vdo pools, whether compression is running.
vdo_index_state - For vdo pools, state of index for deduplication.
vdo_used_size - For vdo pools, currently used space.
vdo_saving_percent - For vdo pools, percentage of saved space.
vdo_compression - Set for compressed LV (vdopool).
vdo_deduplication - Set for deduplicated LV (vdopool).
vdo_use_metadata_hints - Use REQ_SYNC for writes (vdopool).
vdo_minimum_io_size - Minimum acceptable IO size (vdopool).
vdo_block_map_cache_size - Allocated caching size (vdopool).
vdo_block_map_era_length - Speed of cache writes (vdopool).
vdo_use_sparse_index - Sparse indexing (vdopool).
vdo_index_memory_size - Allocated indexing memory (vdopool).
vdo_slab_size - Increment size for growing (vdopool).
vdo_ack_threads - Acknowledging threads (vdopool).
vdo_bio_threads - IO submitting threads (vdopool).
vdo_bio_rotation - IO enqueue (vdopool).
vdo_cpu_threads - CPU threads for compression and hashing (vdopool).
vdo_hash_zone_threads - Threads for subdivide parts (vdopool).
vdo_logical_threads - Logical threads for subdivide parts (vdopool).
vdo_physical_threads - Physical threads for subdivide parts (vdopool).
vdo_max_discard - Maximum discard size volume can receive (vdopool).
vdo_write_policy - Specified write policy (vdopool).
vdo_header_size - Header size at front of vdopool.
Previously only 'lvdisplay -m' was exposing them.
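A hedged usage sketch (VG/LV names are placeholders) selecting a few of the
new fields listed above:
$ lvs -o name,vdo_operating_mode,vdo_compression_state,vdo_index_state,vdo_used_size,vdo_saving_percent vg/vdopool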
Since we now support activation of a 'vdo' volume
without explicit activation of the 'vdopool', it's now possible
to have the layered vdopool (-vpool) volume active
while the vdopool itself is inactive - yet still in this
case we can show available stats for this volume.
But we need to show the correct activation status and other
standard info.
Avoid checking 'lv_is_active()' since special LV types do this
validation anyway when calling the _percent() function, and call it
ONLY when none of the special types is queried.
This restores support for VDO resize (as with support for
separate VDO pool activation, a plain query for lv_is_active()
does not work in this case).
If the linear mapping is lost (for whatever reason, i.e. the
test suite forcibly running 'dmsetup remove' on the linear LV),
lvm2 had a hard time figuring out how to deactivate such a DM table.
So add a function which, in the case of an inactive VDO pool LV, checks if
the pool is actually still active (-vpool device present) and
has open count == 0. In this case deactivation is allowed
to continue and clean up the DM table.
- use internal CACHE_VOL flag on cachevol LV
- add suffixes to dm uuids for internal LVs
- display appropriate letters in the LV attr field
- display writecache's cachevol in lvs output
. For dm-cache in writethrough, always allow splitcache,
whether the cache is missing PVs or not.
. For dm-cache in writeback, if the cache is missing PVs,
allow splitcache with force and yes.
. For dm-writecache, if the cache is missing PVs,
allow splitcache with force and yes.
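A hedged sketch of the forced case described in the rules above (VG/LV names
are placeholders):
$ lvconvert --splitcache --force --yes vg/main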
Enhance the 'activation' experience for a VDO pool to more closely match
what happens for thin-pools, where we do use a 'fake' LV to keep the pool
running even when no thinLVs are active. This gives the user a choice
whether to keep the thin-pool running (without a possibly lengthy
activation/deactivation process).
As we do plan to support multiple VDO LVs mapped into a single VDO,
we want to give the user the same experience and 'use-pattern' as with thin-pools.
This patch gives the option to activate the VDO pool only, without activating
the VDO LV.
Also, due to the 'fake' layering LV, we can protect usage of the VDO pool from
commands like 'mkfs' which do require exclusive access to the volume,
which is no longer possible.
Note: the VDO pool contains 1024 initial sectors as an 'empty' header - such
a header is also exposed in the layered LV (as a read-only LV).
For blkid we are identified as an LV with a UUID suffix - thus a private DM
device of lvm2 - so we do not need to store any extra info in this
header space (aka zero is good enough).
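A hedged sketch (VG/LV names are placeholders) of activating only the VDO
pool, and separately the VDO LV on top of it:
$ lvchange -ay vg/vdopool
$ lvchange -ay vg/vdolv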
When lvm2 is activating the layered pool LV (basically to keep the pool open;
the other function used to be keeping 'locking' in sync with the DM table),
use this LV in read-only mode - this prevents 'write' access into
the data volume content of the thin-pool.
Note: an EMPTY/unused thin-pool is created as a 'public LV' for generic
use by any user who i.e. wishes to maintain the thin-pool and thins themself.
At this moment the thin-pool appears as a writable LV. As soon as the 1st.
thinLV is created, the layer volume will appear as a 'read-only' LV from that moment on.
When an online PV completed a VG, the standard
activation functions were used to activate the VG.
These functions use a full scan of all devs.
When many pvscans are run during startup and need
to activate many VGs, scanning all devs from all
the pvscans can take a long time.
Optimize VG activation in pvscan to scan only the
devs in the VG being activated. This makes use of
the online file info that was used to determine
the VG was complete.
The downside of this approach is that pvscan activation
will not detect duplicate PVs and block activation,
where a normal activation command (which scans all
devices) would.
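A hedged sketch of the per-device autoactivation call as typically issued by
udev rules (the device path is a placeholder):
$ pvscan --cache -aay /dev/sdb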
Fixes a regression from commit ba7ff96faf
"improve reading and repairing vg metadata"
where the error path for a vg name with invalid
characters was missing an error flag, which led
to the caller not recognizing an error occurred.
Previously, an error flag was hidden in the old
_vg_make_handle function.
Only the first entry of the filter array was being
included in the copy of the filter, rather than the
entire thing. The result is that hints would not be
refreshed if the filter was changed but the first
entry was unchanged.