Since cached LV is going to be removed together with its cache,
there is not much to gain if we try to flush cache first.
User may use 'vgcfgrestore' to get back origin + cache,
assuming user is not using issue_discards.
When data are discarded after remove there is nothing to restore!
This change allows to further reduce the number of commits
during lvremove/vgremove.
When lvremove/vgremove removes thin volumes with its thin-pool as well,
try to skip any updates of such thin-pool, so when everything properly
deactivates, there is no message sent to this thin-pool and the whole
thin-pool is removed with a single commit.
Returning NULL for lv_committed is basically an instant crash,
so try with the passed LV instead.
It shouldn't matter as this is an internal error path anyway,
but coverity should be happier.
Another step towards better automatic handling of backup,
and automatically setting needs_backup after commit.
In some next step we should reduce the number of backups and take
them only at the command finish with vg_committed content.
With commit b44db5d1a7 the code
needs to check the allocated pointer for failed malloc().
Existing check was actually not checking anything, so a failing
malloc here would result in segfault (although with very
low chance to ever happen).
Match VG uuid just once per list of all LVs in VG.
TODO: maybe some more efficient tree or hash could be better here,
but since it's used not so often, the total benefit is not so great,
so ATM just reducing amount of checked bytes.
Use different 'hint' size for dm_hash_create() call - so
when debug info about hash is printed we can recognize which
hash was in use.
This patch doesn't change the size actually used, since that is always
rounded up to a power of 2 and >=16 - so as such it is only a
help to developers.
We could eventually use 'name' arg, but since this would have changed
API and this patchset will be routed to libdm & stable - we will
just use this small trick.
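The rounding mentioned above can be sketched as follows. This is only a
minimal illustration, assuming the hint is rounded up to the next power
of 2 with a floor of 16; sketch_hash_slots() is a hypothetical name and
not the libdm function.

```c
#include <assert.h>

/* Hypothetical helper, not the libdm function: round a size hint up to
 * the next power of two, never going below 16 slots. */
unsigned sketch_hash_slots(unsigned size_hint)
{
	unsigned slots = 16u;

	/* Keep the shift below from overflowing. */
	assert(size_hint <= (1u << 30));

	while (slots < size_hint)
		slots <<= 1;

	return slots;
}
```

Under this assumption hints such as 61 and 64 end up with the same
number of slots, so a distinct hint only changes what gets printed.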
Just like with deactivation, call of 'lv_is_not_in_use()'
now has an embedded report for an inactive LV.
Note: this patch cannot be backported to stable-2.02 - as
there lv_is_active() has 'cluster' meaning and differs from lvinfo().
When LV is deactivated, we check for presence, and later
for some LV types also for being in use.
We can however do this check in 1 step for them and remove an extra ioctl.
Add return value '2' to lv_check_not_in_use() to recognize LV is not
present.
Existing users were just testing for 0, so no change for them.
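A minimal sketch of the three-way result, assuming 0 means "in use",
1 means "present and not in use" and 2 means "not present". All names
below are hypothetical stand-ins and not the lvm2 functions.

```c
#include <assert.h>

/* Hypothetical names for the three outcomes described above. */
enum sketch_in_use_result {
	SKETCH_LV_IN_USE = 0,		/* check failed, LV is open */
	SKETCH_LV_NOT_IN_USE = 1,	/* LV is present and unused */
	SKETCH_LV_NOT_PRESENT = 2	/* LV has no active device */
};

/* One call answers both "is it present" and "is it in use". */
int sketch_check_not_in_use(int present, int open_count)
{
	assert(open_count >= 0);

	if (!present)
		return SKETCH_LV_NOT_PRESENT;

	return open_count ? SKETCH_LV_IN_USE : SKETCH_LV_NOT_IN_USE;
}

/* Old-style caller: tests only for 0, so the new value 2 is treated
 * the same way as 1 and nothing changes for it. */
int sketch_can_remove(int present, int open_count)
{
	if (!sketch_check_not_in_use(present, open_count))
		return 0;

	return 1;
}
```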
When parsing VG metadata we can create from a single config tree
also 'vg_committed' that is always created for writable VG.
This avoids the extra unnecessary step of serializing and deserializing
the just parsed VG.
Every vg_write stores new 'metadata' into precommitted slot.
For this step we use 'serialized buffer' to ascii metadata.
Instead of recreating this buffer after whole 'vg_write()' we
use this buffer instantly for creating of precommitted VG.
This has also the advantage of catching any problems with
reparsing of ascii metadata back to VG early before any write.
This patch postpones update of lvm metadata for each removed
LV for later moment depending on LV type.
It also queues messages to be printed after such write & commit.
As such there is some change in the behavior - although before a
prompt we make write & commit happen automatically, in some
other error cases we rather keep the 'existing' state - so there
could be a difference in the amount of removed & committed LVs.
IMHO the introduced logic is slightly better and safer.
But some cases still need the early commit - i.e. thin-removal
and fixing this needs some more thinking.
TODO: improve removal at least with the case of the whole thin-pool.
i.e. we can simply recognize removal of 'all LVs/whole VG'.
Make the generic "device is not usable" message from filter-usable
more specific in case the device is not usable because it's an LV.
(i.e. when scan_lvs=0)
When 'lv_info()' is called with &info structure,
the presence of node has to be checked from this structure.
Without this we were needlessly trying to look up the 0:0 device.
When lvm2 calls archive() or backup() it can be useful to allow handling
break signal so the command can be interrupted at some consistent point.
Signal is accepted during processing these calls - and can be evaluated
later during even lengthy processing loops.
So now user can interrupt lengthy lvremove().
Taking backup with each removed LV is slowing down the process
considerably and is largely unneeded. We are supposed to take
backup only on significant points and making sure the backup
is correct when the command is finished.
TODO: check how many other commands can be improved.
Use 'C' for alphasort - there is no need to use localized and slower
sorting for internal directory scanning.
Ensure on all code paths allocated dirent entries are released.
Optimize full path construction.
Drop the comment "This setting is no longer used." which
was printed just before the standard deprecation comment:
"This configuration option is deprecated."
When lvmconfig --typeconfig full printed a deprecated
entry it would attempt to print a non-existing
deprecation comment resulting in output like:
# (null) # This setting is no longer used.
The LVM devices file lists devices that lvm can use. The default
file is /etc/lvm/devices/system.devices, and the lvmdevices(8)
command is used to add or remove device entries. If the file
does not exist, or if lvm.conf includes use_devicesfile=0, then
lvm will not use a devices file. When the devices file is in use,
the regex filter is not used, and the filter settings in lvm.conf
or on the command line are ignored.
LVM records devices in the devices file using hardware-specific
IDs, such as the WWID, and attempts to use subsystem-specific
IDs for virtual device types. These device IDs are also written
in the VG metadata. When no hardware or virtual ID is available,
lvm falls back to using the unstable device name as the device ID.
When devnames are used, lvm performs extra scanning to find
devices if their devname changes, e.g. after reboot.
When proper device IDs are used, an lvm command will not look
at devices outside the devices file, but when devnames are used
as a fallback, lvm will scan devices outside the devices file
to locate PVs on renamed devices. A config setting
search_for_devnames can be used to control the scanning for
renamed devname entries.
Related to the devices file, the new command option
--devices <devnames> allows a list of devices to be specified for
the command to use, overriding the devices file. The listed
devices act as a sort of devices file in terms of limiting which
devices lvm will see and use. Devices that are not listed will
appear to be missing to the lvm command.
Multiple devices files can be kept in /etc/lvm/devices, which
allows lvm to be used with different sets of devices, e.g.
system devices do not need to be exposed to a specific application,
and the application can use lvm on its own set of devices that are
not exposed to the system. The option --devicesfile <filename> is
used to select the devices file to use with the command. Without
the option set, the default system devices file is used.
Setting --devicesfile "" causes lvm to not use a devices file.
An existing, empty devices file means lvm will see no devices.
The new command vgimportdevices adds PVs from a VG to the devices
file and updates the VG metadata to include the device IDs.
vgimportdevices -a will import all VGs into the system devices file.
LVM commands run by dmeventd do not use a devices file by default,
and will look at all devices on the system. A devices file can
be created for dmeventd (/etc/lvm/devices/dmeventd.devices). If
this file exists, lvm commands run by dmeventd will use it.
Internal implementation:
- device_ids_read - read the devices file
. add struct dev_use (du) to cmd->use_devices for each devices file entry
- dev_cache_scan - get /dev entries
. add struct device (dev) to dev_cache for each device on the system
- device_ids_match - match devices file entries to /dev entries
. match each du on cmd->use_devices to a dev in dev_cache, using device ID
. on match, set du->dev, dev->id, dev->flags MATCHED_USE_ID
- label_scan - read lvm headers and metadata from devices
. filters are applied, those that do not need data from the device
. filter-deviceid skips devs without MATCHED_USE_ID, i.e.
skips /dev entries that are not listed in the devices file
. read lvm label from dev
. filters are applied, those that use data from the device
. read lvm metadata from dev
. add info/vginfo structs for PVs/VGs (info is "lvmcache")
- device_ids_find_renamed_devs - handle devices with unstable devname ID
where devname changed
. this step only needed when devs do not have proper device IDs,
and their dev names change, e.g. after reboot sdb becomes sdc.
. detect incorrect match because PVID in the devices file entry
does not match the PVID found when the device was read above
. undo incorrect match between du and dev above
. search system devices for new location of PVID
. update devices file with new devnames for PVIDs on renamed devices
. label_scan the renamed devs
- continue with command processing
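The device_ids_match and filter-deviceid steps listed above can be
sketched as follows. This is a simplified illustration only: the
structs, the flag value and the function names are hypothetical
stand-ins for struct dev_use, struct device and MATCHED_USE_ID, and
the device ID is reduced to a single string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MATCHED_USE_ID 0x1	/* stand-in for the MATCHED_USE_ID flag */

struct sketch_device {			/* stand-in for a /dev entry in dev_cache */
	const char *id;			/* device ID read from the system, e.g. a WWID */
	unsigned flags;
};

struct sketch_dev_use {			/* stand-in for one devices file entry (du) */
	const char *idname;		/* device ID stored in the devices file */
	struct sketch_device *dev;	/* set when a system device matches */
};

/* Match each devices file entry to a system device by device ID.
 * Returns the number of entries that found their device. */
unsigned sketch_device_ids_match(struct sketch_dev_use *dus, size_t n_du,
				 struct sketch_device *devs, size_t n_dev)
{
	size_t i, j;
	unsigned matched = 0;

	assert(dus && devs);

	for (i = 0; i < n_du; i++) {
		dus[i].dev = NULL;
		for (j = 0; j < n_dev; j++) {
			if (devs[j].flags & SKETCH_MATCHED_USE_ID)
				continue;	/* already claimed by another entry */
			if (!strcmp(dus[i].idname, devs[j].id)) {
				dus[i].dev = &devs[j];
				devs[j].flags |= SKETCH_MATCHED_USE_ID;
				matched++;
				break;
			}
		}
	}

	return matched;
}

/* Filter step: a dev that is not listed in the devices file is skipped. */
int sketch_filter_deviceid_passes(const struct sketch_device *dev)
{
	return (dev->flags & SKETCH_MATCHED_USE_ID) != 0;
}
```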
User can use 'lvconvert -Zn --type vdo-pool' to convert an existing
VDO formatted volume and skip lvm2 internal formatting.
This however requires that the user passes proper matching parameters.
For them the user can use the --profile|--metadataprofile option, whose
support has also been enhanced.
TODO: add support to read values directly from the formatted volume.
In some cases we use 'creation' also during conversion.
Here it can have the unwanted side effect that we may remove
not just newly created layers - but also the original converted LV.
So until we make clear how to properly revert from some errors
in middle of conversion, disable removal for any 'lvconvert' commands.
When lvdisplay was executed and a thin snapshot had been merged to
thin origin and the operation had been postponed till devices
were closed, the command crashed.
Check LV is COW before trying to check snapshot percentage.
Fix clearing persistent filter state when clearing all
the state from a label_scan.
label_scan reads devs and saves info in bcache, lvmcache,
and in the persistent filter. In some uncommon cases, an
lvm command wants to clear all info from a prior label_scan,
and repeat label_scan from scratch. In these cases, info
in lvmcache, bcache and the persistent filter all need to
be cleared before repeating label_scan.
By missing the persistent filter wiping, outdated persistent
filter info, from a prior label_scan, could cause lvm to
incorrectly filter devices that change between polling intervals.
(i.e. if the device changes in such a way that the filtering
results change.)
A case where lvm wants to do multiple label_scans is a
polling command (like lvconvert --merge), when lvmpolld
has been disabled, so that the command itself needs to
do repeated polling checks.
Automatically figure out resizable layer in the LV stack and
resize it online.
Split check for reshaped raids and postpone removal of
unused space after finished reshaping after metadata archiving.
Drop warning about unsupported automatic resize of monitored thin-pool.
Currently there is not yet support for resize of writecache.
Move extra md component detection into the label scan phase.
It had been in set_pv_devices which was deep within the vg_read
phase, which wasn't a good place (better to detect that earlier.)
Now that pv metadata info is available in the scan phase, the pv
details (size and device_hint) can be used for extra md checking.
Use the device_hint from the pv metadata to trigger a full md
component check if the device_hint begins with /dev/md.
Stop triggering full md component checks based on missing
udev info for a dev.
Changes to tests to reflect that the code is now detecting
md components in some test case that it wasn't before.
Current allocation limitation requires fitting the metadata/log LV on
a single PV. This is usually not a big problem, but since
thin-pool and cache-pool are using this for allocating extents
for their metadata LVs it might eventually cause errors
when the remaining free space for a large metadata size is spread
over several PVs.
When passing 'pvmove --name arg' try to automatically move
all associated dependencies with given LV.
i.e. 'pvmove --name thinpool vg vgnew'
moves all thins and data and metadata LV into a new VG vgnew.
Use update_pool_metadata_min_max() which is shared with
thin-pool metadata min-max updating.
Gives improved messages when converting volumes to metadata.
There is not much point in allowing allocation of more than this size
even when i.e. the converted LV is bigger than 16GiB (%extent_size).
ATM neither thin-pool nor cache-pool supports bigger metadata.
Initial support for thin-pool used slightly smaller max size 15.81GiB
for thin-pool metadata. However the real limit later settled at 15.88GiB
(difference is ~64MiB - 16448 4K blocks).
lvm2 could not simply increase the size as it has been using hard cropping
of the loaded metadata device to avoid the kernel printing warnings
when the size was bigger (i.e. due to bigger extent_size).
This patch adds the new lvm.conf configurable setting:
allocation/thin_pool_crop_metadata
which defaults to 0 -> no crop of metadata beyond 15.81GiB.
Only user with these sizes of metadata will be affected.
Without cropping lvm2 now limits metadata allocation size to 15.88GiB.
Any space beyond is currently not used by thin-pool target.
Even if i.e. a bigger LV is used for metadata via lvconvert,
or is allocated bigger because of too large extent size.
With cropping enabled (=1) lvm2 preserves the old limitation of
15.81GiB and should allow working in an environment with
older lvm2 tools (i.e. older distribution).
Thin-pool metadata with size bigger than 15.81G is now using CROP_METADATA
flag within lvm2 metadata, so older lvm2 recognizes an
incompatible thin-pool and cannot activate such pool!
Users should use uncropped version as it is not suffering
from various issues between thin_repair results and allocated
metadata LV as thin_repair limit is 15.88GiB.
Users should use cropping only when really needed!
Patch also better handles resize of thin-pool metadata and prevents resize
beyond usable size 15.88GiB. Resize beyond 15.81GiB automatically
switches pool to no-crop version. Even with existing bigger thin-pool
metadata, command 'lvextend -l+1 vg/pool_tmeta' does the change.
Patch gives better control over the 'converted' metadata LV and
reports less confusing message during conversion.
Patch set also moves the code for updating min/max into pool_manip.c
for better sharing with cache_pool code.
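The two limits and the clamping can be sketched as below. The KiB
constants are assumptions of this sketch, picked so that they round to
15.81GiB and 15.88GiB and differ by exactly 16448 4K blocks
(16448 * 4KiB = 65792KiB = 64.25MiB). The macro and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limits in KiB, chosen to match the rounded figures above:
 * 16580608 KiB = 15.8125 GiB, 16646400 KiB = 15.875 GiB. */
#define SKETCH_MAX_CROPPED_KB	UINT64_C(16580608)
#define SKETCH_MAX_UNCROPPED_KB	UINT64_C(16646400)

/* crop mirrors the 0/1 value of allocation/thin_pool_crop_metadata. */
uint64_t sketch_max_metadata_kb(int crop)
{
	assert(crop == 0 || crop == 1);

	return crop ? SKETCH_MAX_CROPPED_KB : SKETCH_MAX_UNCROPPED_KB;
}

/* Clamp a requested metadata size to what the pool can actually use. */
uint64_t sketch_usable_metadata_kb(uint64_t requested_kb, int crop)
{
	uint64_t max_kb = sketch_max_metadata_kb(crop);

	return requested_kb > max_kb ? max_kb : requested_kb;
}
```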
When detaching writecache, make the first stage send a message
to dm-writecache to set the cleaner option. This is instead of
reloading the dm table with the cleaner option set. Reloading
the table causes udev to process/probe the dm dev, which gets
stalled because of the writeback activity, and the stalled udev
in turn stalls the lvconvert command when it tries to sync with
udev events.
When getting writecache status we do not need to get
open_count or read_head info, which can cause extra steps.
In case legs of a raid0 LV are removed, the lvdisplay command still
reports 'available' though raid0 is not providing any resilience
compared to the other raid levels.
Also lvdisplay does not display '(partial)' in case of missing raid0
legs as opposed to the lvs command.
Enhance lvdisplay to report "NOT available" for any RaidLV type in case
too many legs are inaccessible hence causing data loss. I.e. any leg
for raid0, all for raid1, more than 1 for raid4/5, more than 2 for raid6
and in case of completely lost mirror groups for raid10.
Add test/shell/lvdisplay-raid.sh.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1872678
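The per-level rule above can be sketched as follows. This is a
simplified illustration with hypothetical names, not the code added by
the patch. For raid10 the sketch assumes the legs are laid out as
consecutive mirror groups of 'copies' legs each.

```c
#include <assert.h>

enum sketch_raid_level {
	SKETCH_RAID0, SKETCH_RAID1, SKETCH_RAID4, SKETCH_RAID5, SKETCH_RAID6
};

/* Returns 1 while data is still reachable, 0 once too many legs are gone. */
int sketch_raid_is_available(enum sketch_raid_level level,
			     unsigned legs, unsigned missing)
{
	assert(missing <= legs);

	switch (level) {
	case SKETCH_RAID0:
		return missing == 0;	/* any lost leg loses data */
	case SKETCH_RAID1:
		return missing < legs;	/* one surviving leg is enough */
	case SKETCH_RAID4:
	case SKETCH_RAID5:
		return missing <= 1;
	case SKETCH_RAID6:
		return missing <= 2;
	}

	return 0;
}

/* raid10: data is lost once every leg of one mirror group is missing. */
int sketch_raid10_is_available(const int *leg_missing,
			       unsigned legs, unsigned copies)
{
	unsigned g, i, lost;

	assert(copies > 0 && legs % copies == 0);

	for (g = 0; g < legs; g += copies) {
		lost = 0;
		for (i = 0; i < copies; i++)
			lost += leg_missing[g + i] ? 1 : 0;
		if (lost == copies)
			return 0;
	}

	return 1;
}
```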
New VDO target v6.2.3 corrects support for online rename of VDO device.
If needed it can be disabled via new lvm.conf setting:
vdo_disabled_features = [ "online_rename" ]
When removing pool LV from a stacked LV setup, it's been possible
to leak _pmspare and such hidden LV then required manual
user removal.
Fix it by moving automatic removal into _lv_reduce().
When adding replacement raid+integrity images (lvconvert --repair
after a raid image is lost), various errors can cause the function
to exit with an error. On this exit path, the function attempts
to revert new images that had been created but not yet used. The
cleanup failed to account for the fact that not all images needed
to be reverted.
Since commit 77fdc17d70 we always include
log_len size in needed extents - however now we may sometimes need
more extents than necessary - mainly when multiple PVs are involved
in allocation.
Add logs_still_needed into calculation of sufficient_pes_free()
When a writecache sublv or an integrity metadata sublv
are partial (missing a dev), set the partial flag on
the upper level LV also, as is done for other sublvs.
When using cache with a cachevol, the cache_check tool was
not being run on the cache metadata during activation.
cache_check clears the needs_check flag in the cache
metadata, so if the flag was set due to an unclean
shutdown, the activation would fail.
Each integrity image in a raid LV reports its own number
of integrity mismatches, e.g.
lvs -o integritymismatches vg/lv_rimage_0
lvs -o integritymismatches vg/lv_rimage_1
In addition to this, allow the total number of integrity
mismatches from all images to be displayed for the raid LV.
lvs -o integritymismatches vg/lv
shows the number of mismatches from both lv_rimage_0 and
lv_rimage_1.
The args for pvcreate/pvremove (and vgcreate/vgextend
when applicable) were not efficiently opened, scanned,
and filtered. This change reorganizes the opening
and filtering in the following steps:
- label scan and filter all devs
. open ro
. standard label scan at the start of command
- label scan and filter dev args
. open ro
. uses full md component check
. typically the first scan and filter of pvcreate devs
- close and reopen dev args
. open rw and excl
- repeat label scan and filter dev args
. using reopened rw excl fd
- wipe and write new headers
. using reopened rw excl fd
In some cases the dev size may not have been read yet
in set_pv_devices(). In this case get the dev size
before comparing the dev size with the pv size.
To read the lvm headers and set dev->pvid if the
device is a PV. Difference from label_scan_ functions
is this does not read any vg metadata or add any info
to lvmcache.
Filtering in label_scan was controlled indirectly by
the fact that bcache was not yet set up when label_scan
first ran. The result is that filters that needed data
would not run and would return -EAGAIN, which would
result in the dev flag FILTER_AFTER_SCAN being set.
After the dev header was read for checking the label,
filters would be rechecked because of FILTER_AFTER_SCAN.
All filters would be checked this time because bcache
was now set up, and the filters needing data would
largely use data already scanned for reading the label.
This design worked but is hard to adjust for future
cases where bcache is already set up.
Replace this method (based on setting up bcache, or not)
with a new cmd flag filter_nodata_only. When this flag
is set filters that need data will not run. This allows
the same label_scan behavior when bcache has been set up.
There are no expected changes in behavior.
Touch of stack allocation validated given size with rlimit
and if the reserved_stack was above rlimit, it has been completely
ignored - now we will always touch stack up to rlimit/2 size.
Since the BLKZEROOUT ioctl should supposedly be the fastest
way to clear a block device, start using this ioctl
for zeroing a device. Commonly we zero only a
small portion of a device (8KiB) - however since we now
also started to zero metadata devices, in the case
of i.e. thin-pool metadata this can go up to ~16GiB
and here the performance starts to be noticeable.
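A minimal sketch of zeroing a range with BLKZEROOUT on Linux. The
ioctl takes a uint64_t[2] holding start and length in bytes, and both
are expected to be 512-byte aligned. The fallback to plain zero writes
and the name sketch_zero_range() are assumptions of this sketch, not
the lvm2 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKZEROOUT */

#define SKETCH_ZERO_CHUNK 4096

/* Zero 'len' bytes at 'offset' on fd. Returns 1 on success, 0 on failure. */
int sketch_zero_range(int fd, uint64_t offset, uint64_t len)
{
	uint64_t range[2] = { offset, len };
	char *zeros;
	size_t chunk;
	ssize_t n;

	assert(fd >= 0);

	/* Fast path: let the kernel zero the range on a block device. */
	if (ioctl(fd, BLKZEROOUT, range) == 0)
		return 1;

	/* Fallback (assumption of this sketch): write zeros in chunks. */
	if (!(zeros = calloc(1, SKETCH_ZERO_CHUNK)))
		return 0;

	while (len) {
		chunk = len < SKETCH_ZERO_CHUNK ? (size_t) len : SKETCH_ZERO_CHUNK;
		n = pwrite(fd, zeros, chunk, (off_t) offset);
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			break;
		offset += (uint64_t) n;
		len -= (uint64_t) n;
	}

	free(zeros);

	return len == 0;
}
```

On anything that is not a block device the ioctl is expected to fail,
so the sketch then takes the slower write path.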
Since dev_set_bytes() now closes dev on error path itself,
remove this unneeded call now (introduced few commits back
in history thus removing comment from WHATS_NEW)
lvm2 normally blocks signals during the protected
phase where it does not want to be interrupted.
Support interruptible processing when allowed
(in the section between sigint_allow() ... sigint_restore())
and let 'io_getevents()' finish with EINTR.
When bcache tries to write data to a faulty device,
it may run out of cache blocks and then just busy-loop
on a CPU - so this check protects against this by checking
if there are already max_io (~64) errored blocks.
Call _wait_all() which does check whether there is still
some pending IO before sleep. Otherwise it may happen that
our submitted IO operations have already been dispatched
and this call then endlessly waits for IO which is all done.
This can be reproduced when a device quickly returns errors
on write requests.
When detaching a writecache, use the cleaner setting
by default to writeback data prior to suspending the
lv to detach the writecache. This avoids potentially
blocking for a long period with the device suspended.
Detaching a writecache first sets the cleaner option, waits
for a short period of time (less than a second), and checks
if the writecache has quickly become clean. If so, the
writecache is detached immediately. This optimizes the case
where little writeback is needed.
If the writecache does not quickly become clean, then the
detach command leaves the writecache attached with the
cleaner option set. This leaves the LV in the same state
as if the user had set the cleaner option directly with
lvchange --cachesettings cleaner=1 LV.
After leaving the LV with the cleaner option set, the
detach command will wait and watch the writeback progress,
and will finally detach the writecache when the writeback
is finished. The detach command does not need to wait
during the writeback phase, and can be canceled, in which
case the LV will remain with the writecache attached and
the cleaner option set. When the user runs the detach
command again it will complete the detach.
To detach a writecache directly, without using the cleaner
step (which has been the approach previously), add the
option --cachesettings cleaner=0 to the detach command.
Since we already detect the transaction id before starting
to build the dm tree - this extra check is a duplicate
that would only capture very tiny 'race' and we later
validate transaction_id with suspended snapshot origin.
Introduce structures lv_status_thin_pool and
lv_status_thin (pair to lv_status_cache, lv_status_vdo)
Convert lv_thin_percent() -> lv_thin_status()
and lv_thin_pool_percent() + lv_thin_pool_transaction_id() ->
lv_thin_pool_status().
This way a function user can see not only percentages, but also
other important status info about thin-pool.
TODO:
This patch tries to not change too many other things,
but pool_below_threshold() now uses new thin-pool info to return
failure if thin-pool cannot be actually modified.
This should be handled separately in a better way.
LVM2 is distributed under GPLv2 only. The readline library changed its
license long ago to GPLv3. Given that those licenses are incompatible,
and if you follow the FSF in their interpretation that dynamically linking
creates a derivative work, distributing LVM2 linked against a current
readline version might be legally problematic.
Add support for the BSD licensed editline library as an alternative for
readline.
Link: https://thrysoee.dk/editline
Cover the case where two copies of metadata have the
same seqno but different checksums. Also elaborate
on an existing fixme in the code for this case, since
we should be doing something better for this case.
This had been uncovering an issue with reopening
fds in readwrite mode.
Improve error response and reporting, when creating thin snapshots.
If the thin pool kernel metadata already have a device with the ID lvm2
tries to create, give a more meaningful error message and also properly
restore transaction id to the value known to thin-pool in this case.
Before, it's been possible to diverge by one from the kernel TID value,
and lvm2 stacked a delete message for such thin device.
Since ATM kernel does not support this operation,
disable 'lvrename' of an active vdopool.
As a workaround, user may simply deactivate, rename and activate.
When user tries to extend vdo pool - he needs to always go
by at least 1 full VDO slab (defined as vdo_slab_size_mb).
To avoid all trouble around finding a 'workable' size - lvm2 automatically
increases the passed (or by --use-policies calculated) extension size
(and informs the user about a sometimes possibly large increase as slab
size can go up to 32GiB).
With VDO users need to always 'think-big' anyway and expect such
operation to be in GiB domain range.
When the table reload fails during suspend() - we were only calling
plain resume() - and this will reload only those devices,
which were left suspended, but will not try to restore
metadata state according to lvm2 reverted metadata.
So if we were reloading device tree - we have restored
only top-level LV and rest of reverted device manipulation
were left alone and possibly mismatched what is in committed
metadata.
FIXME: There are several cases where such revert will likely not work
properly anyway as some operations are currently handled in a single commit,
while they need multiple commits, but it's a step towards better correctness.
At least we catch these errors earlier now.
lvm opens devices readonly to scan them, but
needs to open them readwrite to update the metadata.
Previously, the ro fd was closed before the rw fd
was opened, leaving a small gap where the dev was
not held open, and during which the dev could
possibly change which storage it referred to.
With the bcache_change_fd() interface, lvm opens a
rw fd on a device to be written, tells bcache to
change to the new rw fd, and closes the ro fd.
. open dev ro
. read dev with the ro fd (label_scan)
. lock vg (ex for writing)
. open dev rw
. close ro fd
. rescan dev to check if the metadata changed
between the scan and the lock
. if the metadata did change, reread in full
. write the metadata
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets an fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
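The di-to-fd translation described above can be sketched as a small
table. This is a simplified illustration only: the sketch_ names are
hypothetical stand-ins for the bcache functions, and reusing cleared
slots is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified di-to-fd table; -1 marks an unused slot. */
static int *sketch_fd_table;
static int sketch_fd_table_size;

/* Register fd and return its device index, or -1 if out of memory. */
int sketch_set_fd(int fd)
{
	int *tmp;
	int di, i, new_size;

	assert(fd >= 0);

	for (di = 0; di < sketch_fd_table_size; di++)
		if (sketch_fd_table[di] < 0)
			break;

	if (di == sketch_fd_table_size) {
		new_size = sketch_fd_table_size ? sketch_fd_table_size * 2 : 8;
		if (!(tmp = realloc(sketch_fd_table, (size_t) new_size * sizeof(*tmp))))
			return -1;
		for (i = sketch_fd_table_size; i < new_size; i++)
			tmp[i] = -1;
		sketch_fd_table = tmp;
		sketch_fd_table_size = new_size;
	}

	sketch_fd_table[di] = fd;

	return di;
}

static int sketch_di_valid(int di)
{
	return di >= 0 && di < sketch_fd_table_size && sketch_fd_table[di] >= 0;
}

/* Translate di to the fd used for io, or -1 for an unknown di. */
int sketch_di_to_fd(int di)
{
	return sketch_di_valid(di) ? sketch_fd_table[di] : -1;
}

/* Swap the fd behind a di (e.g. ro -> rw); anything keyed by di stays. */
int sketch_change_fd(int di, int fd)
{
	if (!sketch_di_valid(di) || fd < 0)
		return 0;

	sketch_fd_table[di] = fd;

	return 1;
}

/* Forget the fd for a di when the device is closed. */
int sketch_clear_fd(int di)
{
	if (!sketch_di_valid(di))
		return 0;

	sketch_fd_table[di] = -1;

	return 1;
}
```

Because callers only hold the di, swapping the fd behind it does not
need to touch anything that is indexed by di.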
During removal of a lot of locking code the signal blocking got lost
and signal processing got broken, leading to unpredictable
behavior of i.e. activation code that can get interrupted in the
middle of DM table processing.
lvm2 code always expects signals are blocked while lock is held
unless it is explicitly placed into a section of:
sigint_allow();....;sigint_restore();
For checking a caught interrupt there is sigint_catched();
Metadata size was calculated correctly only for raids.
Fixes a crash during lvcreate when thin-pool was created
on a VG where remaining free space had the size to only fit a single
metadata LV and not also its _pmspare.
Lvcreate crashed with this assert message:
lvcreate: metadata/pv_map.c:198: consume_pv_area: Assertion `to_go <= pva->count' failed.
Aborted (core dumped)
TODO: there is probably too large overload of several alloc_handle
variables.
Reported-by: Wu Guanghao <wuguanghao3@huawei.com>
Reported-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
When using --use-policies for automatic extension of thin-pool,
the extension of thin-pool's metadata itself can actually take
some extra space.
Since I'm not aware of the exact compensation formula, just add
1% extra to the calculated amount and hope it fits.
The wanted target is to always have a usable thin-pool that fits
below pool_metadata_min_threshold().
Since we query on regular code these:
lv_raid_has_integrity()
lv_has_integrity_recalculate_metadata()
without prior checking for lv_is_raid() - these 'return 0' should
not use <stacktrace> as they are expected.
Correcting rounding rules for percentage evaluation.
Validate supported range of percentage.
(although ranges are already validated earlier on code path)
This is probably a somewhat experimental patch - but when i.e. raid device
is just extended, there should not be a technical need for flush,
unless the target would strictly need it. It should allow faster
processing of lvm command not being blocked by possibly longer flush.
Since we do not support rimage & rmeta for snapshots - we can
avoid querying for -cow devices and add them as origin_only -
since their snapshots (-cow) could have never existed.
This reduces several ioctl operations during table preloading.
Switch remaining zero sized struct arrays to flexible arrays to be C99
compliant.
These simple rules should apply:
- The incomplete array type must be the last element within the structure.
- There cannot be an array of structures that contain a flexible array member.
- Structures that contain a flexible array member cannot be used as a member of another structure.
- The structure must contain at least one named member in addition to the flexible array member.
Although some of the code pieces should be still improved.
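A minimal example of the flexible array member form that follows the
rules above. The struct and function names are hypothetical and only
illustrate the pattern.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example type: a named member first, flexible array last. */
struct sketch_name_entry {
	size_t len;
	char name[];	/* C99 flexible array member (was: char name[0]) */
};

/* Allocate the struct together with room for the string and its NUL. */
struct sketch_name_entry *sketch_name_entry_create(const char *name)
{
	struct sketch_name_entry *e;
	size_t len;

	assert(name);

	len = strlen(name);
	if (!(e = malloc(sizeof(*e) + len + 1)))
		return NULL;

	e->len = len;
	memcpy(e->name, name, len + 1);

	return e;
}
```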
While normally the 'mmap' file reading is better at utilizing resources,
it also has its odd side with handling errors - so while we normally
use the mmap only for reading regular files from root filesystem
(i.e. lvm.conf) we can't prevent an error from happening during the read
of these files - and such an error unfortunately ends with SIGBUS.
Maintaining a signal handler would be complicated - so switch to slightly
less efficient but more error resistant read() functionality.
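A minimal sketch of reading a whole regular file with read(), where an
I/O error comes back as a return value instead of a SIGBUS.
sketch_read_file() is a hypothetical helper and not the lvm2 config
reading code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a whole regular file into a malloc'ed, NUL-terminated buffer.
 * Returns NULL on any error; the caller frees the buffer. */
char *sketch_read_file(const char *path, size_t *len_out)
{
	struct stat st;
	char *buf = NULL;
	size_t done = 0;
	ssize_t n;
	int fd;

	assert(path && len_out);

	if ((fd = open(path, O_RDONLY)) < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || !(buf = malloc((size_t) st.st_size + 1)))
		goto bad;

	while (done < (size_t) st.st_size) {
		n = read(fd, buf + done, (size_t) st.st_size - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto bad;	/* error is reported to the caller */
		}
		if (n == 0)
			break;		/* file got shorter while reading */
		done += (size_t) n;
	}

	close(fd);
	buf[done] = '\0';
	*len_out = done;

	return buf;
bad:
	free(buf);
	close(fd);

	return NULL;
}
```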
Steps to reproduce:
1. vgcreate vg1 /dev/sda /dev/sdb
2. lvcreate --type raid0 -l 100%FREE -n raid0lv vg1
3. remove the /dev/sdb device
4. lvdisplay shows wrong 'LV Status'
After removing an underlying dev of a raid0 type LV, lvdisplay still
displays 'available'. This is the wrong status for raid0.
This patch adds a new function raid_is_available(), which will handle
all raid cases.
With this patch, lvdisplay output changes
from:
LV Status available
to:
LV Status NOT available (partial)
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
It's better to set most of the options as 'commented' with some
documented defaults instead of providing strict values.
This has the advantage we can eventually 'change' defaults
and get them working in future. Otherwise once the setting
is stored in lvm.conf in /etc, such setting has a strictly
defined value that can only be changed with a file update.
merge.c:_check_lv_segment() was checking regionsize vs. mirrored LV size on
any 'mirror/raid1/raid10' segment type including type 'mirrored' mirror logs.
Avoid the check only for 'mirrored' mirror logs to allow conversion from log
type 'disk' with regionsize > mirror log SubLV size.
As we disabled support for 'mirrored' mirror logs with
commit e82303fd6a, which still conditionally
allows enabling it via global/support_mirrored_mirror_logs=1,
this patch is mandatory for all distributions.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1712983
Currently lvm2 does not wipe signatures when creating 'metadata' volumes,
and raid _rmeta was the only exception - so make the behavior consistent
with other metadata devices and drop the wiping ATM.
Also drop some extra debug messages, since the ones in the
wipe_lv() function are now more explanatory.
Also note - although lvm2 now does not wipe signatures - the error
from such wiping used to be 'ignored' before wipe_lv()
started to return an error (with a recent commit), and raid creation
continued with an 'unzeroed' metadata device.
TODO: Several issues to resolve:
1. We may want to switch to wiping all LVs (in that case we need to
support passing --yes & --force).
2. We may also want to clear the whole metadata device - however the current
function is also used for wiping e.g. a snapshot COW device, which
is likely not a good candidate for full device zeroing.
We may also need to think about better logic for when the extent size
enforces very large LVs of which only a small portion is ever
used.
3. Using TRIM instead of zeroing the metadata device might be worth
implementing.
To avoid polluting metadata with 'garbage' content, or eventually
leaking stale data in case the user wants to upload metadata somewhere,
ensure the metadata device is fully zeroed upon allocation.
This behaviour may slow down allocation of a thin-pool or cache-pool a bit,
so the old behaviour can be restored with the lvm.conf setting:
allocation/zero_metadata=0
TODO: add zeroing for extension of a metadata volume.
A failure in wiping/zeroing stops the command.
If the user wants to avoid the command being aborted, they should use -Zn or -Wn
to avoid wiping.
Note: there is no easy way to distinguish which kind of failure has
happened - so it is safer not to proceed any further.
When a larger write request was initiated, bcache could
run out of free chunks - fix the loop that is supposed to wait
until the next free chunk becomes available again.
To create a new cache or writecache LV with a single command:
lvcreate --type cache|writecache
-n Name -L Size --cachedevice PVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, a new cachevol LV is created internally, using PVfast
specified by the cachedevice option.
- Then, the cachevol is attached to the main LV, converting the
main LV to type cache|writecache.
Include --cachesize Size to specify the size of cache|writecache
to create from the specified --cachedevice PVs, otherwise the
entire cachedevice PV is used. The --cachedevice option can be
repeated to create the cache from multiple devices, or the
cachedevice option can contain a tag name specifying a set of PVs
to allocate the cache from.
To create a new cache or writecache LV with a single command
using an existing cachevol LV:
lvcreate --type cache|writecache
-n Name -L Size --cachevol LVfast VG [PVslow ...]
- A new main linear|striped LV is created as usual, using the
specified -n Name and -L Size, and using the optionally
specified PVslow devices.
- Then, the cachevol LVfast is attached to the main LV, converting
the main LV to type cache|writecache.
In cases where more advanced types (for the main LV or cachevol LV)
are needed, they should be created independently and then combined
with lvconvert.
Example
-------
user creates a new VG with one slow device and one fast device:
$ vgcreate vg /dev/slow1 /dev/fast1
user creates a new 8G main LV on /dev/slow1 that uses all of
/dev/fast1 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1
-n main -L 8G vg /dev/slow1
Example
-------
user creates a new VG with two slow devs and two fast devs:
$ vgcreate vg /dev/slow1 /dev/slow2 /dev/fast1 /dev/fast2
user creates a new 8G main LV on /dev/slow1 and /dev/slow2
that uses all of /dev/fast1 and /dev/fast2 as a writecache:
$ lvcreate --type writecache --cachedevice /dev/fast1 --cachedevice /dev/fast2
-n main -L 8G vg /dev/slow1 /dev/slow2
Example
-------
A user has several slow devices and several fast devices in their VG,
the slow devs have tag @slow, the fast devs have tag @fast.
user creates a new 8G main LV on the slow devs with a
2G writecache on the fast devs:
$ lvcreate --type writecache -n main -L 8G
--cachedevice @fast --cachesize 2G vg @slow
It's possible for a dev-cache entry to remain after all
paths for it have been removed, and other parts of the
code expect that a dev always has a name. A better fix
may be to remove a device from dev-cache after all paths
to it have been removed.
When either logical block size or physical block size is 4K,
then lvmlockd creates sanlock leases based on 4K sectors,
but the lvm client side would create the internal lvmlock LV
based on the first logical block size it saw in the VG,
which could be 512. This could cause the lvmlock LV to be
too small to hold all the sanlock leases. Make the lvm client
side use the same sizing logic as lvmlockd.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum. When used on
raid images, dm-raid will correct the error by reading
the block from another image, and the device user sees
no error. The integrity metadata (checksums) are stored
on an internal LV allocated by lvm for each linear image.
The internal LV is allocated on the same PV as the image.
Create a raid LV with an integrity layer over each
raid image (for raid levels 1,4,5,6,10):
lvcreate --type raidN --raidintegrity y [options]
Add an integrity layer to images of an existing raid LV:
lvconvert --raidintegrity y LV
Remove the integrity layer from images of a raid LV:
lvconvert --raidintegrity n LV
Settings
Use --raidintegritymode journal|bitmap (journal is default)
to configure the method used by dm-integrity to ensure
crash consistency.
Initialization
When integrity is added to an LV, the kernel needs to
initialize the integrity metadata/checksums for all blocks
in the LV. The data corruption checking performed by
dm-integrity will only operate on areas of the LV that
are already initialized. The progress of integrity
initialization is reported by the "syncpercent" LV
reporting field (and under the Cpy%Sync lvs column.)
Example: create a raid1 LV with integrity:
$ lvcreate --type raid1 -m1 --raidintegrity y -n rr -L1G foo
Creating integrity metadata LV rr_rimage_0_imeta with size 12.00 MiB.
Logical volume "rr_rimage_0_imeta" created.
Creating integrity metadata LV rr_rimage_1_imeta with size 12.00 MiB.
Logical volume "rr_rimage_1_imeta" created.
Logical volume "rr" created.
$ lvs -a foo
LV VG Attr LSize Origin Cpy%Sync
rr foo rwi-a-r--- 1.00g 4.93
[rr_rimage_0] foo gwi-aor--- 1.00g [rr_rimage_0_iorig] 41.02
[rr_rimage_0_imeta] foo ewi-ao---- 12.00m
[rr_rimage_0_iorig] foo -wi-ao---- 1.00g
[rr_rimage_1] foo gwi-aor--- 1.00g [rr_rimage_1_iorig] 39.45
[rr_rimage_1_imeta] foo ewi-ao---- 12.00m
[rr_rimage_1_iorig] foo -wi-ao---- 1.00g
[rr_rmeta_0] foo ewi-aor--- 4.00m
[rr_rmeta_1] foo ewi-aor--- 4.00m
When a vdopool is activated standalone, we use a wrapping linear device
to keep the actual vdo device active. This wrapper can be set up as a read-only
device to ensure that no write can be made through it to the
actual pool device.
Creating a snapshot was using a persistent LV lock
on the origin, so if the origin LV was inactive at
the time of the snapshot the LV lock would remain.
(Running lvchange -an on the inactive LV would
clear the LV lock.) Use a transient LV lock so it
will be dropped if it was not locked previously.
When formatting a VDO volume, the calculated number of bits
for the 'vdoformat --slab-bits' parameter was off by 2 bits
(the calculated size made a 2MiB vdo_slab_size_mb value appear as if the
user had specified only 512KiB).
Fixed by properly converting the internal size_mb value to KiB.
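The arithmetic behind this fix can be sketched as follows. This is only an illustration under the assumption that a slab is counted in 4 KiB blocks and that the slab-bits value is the base-2 logarithm of that block count. The function name slab_bits_from_kib is hypothetical and does not appear in lvm2.

```python
def slab_bits_from_kib(slab_size_kib, block_size_kib=4):
    """Return log2 of the number of blocks in a slab.

    Assumes 4 KiB blocks (an assumption of this sketch) and a slab size
    that is a power-of-two number of blocks.
    """
    blocks = slab_size_kib // block_size_kib
    if blocks <= 0 or blocks & (blocks - 1):
        raise ValueError("slab size must be a power-of-two number of blocks")
    return blocks.bit_length() - 1


# Correct conversion: a size given in MiB is multiplied by 1024 to get KiB.
correct = slab_bits_from_kib(2 * 1024)  # 2 MiB slab
# Treating the same slab as only 512 KiB loses a factor of 4, i.e. 2 bits.
shifted = slab_bits_from_kib(512)
print(correct, shifted, correct - shifted)
```

Under these assumptions a 2 MiB slab gives 9 bits and a 512 KiB slab gives 7 bits, which is the 2-bit difference the message describes.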
Fix the annoying kernel message:
device-mapper: cache: 253:2: metadata operation 'dm_cache_commit' failed: error = -5
which was reported while a cachevol was being removed.
It was caused by a confusing variable - so switch to the commonly used '_size' variable,
which holds a value in sector units, and avoid 'scaling' it as an extent length
by the vg extent size when placing the 'error' target on the removal path.
The patch shouldn't have any impact on actual user data, since at this point
of the removal all data should already have been flushed to the origin device.
The previous patch improved read of pipe when lvm2 was looking
for default logical size, but we clearly must read pipe also
for -V case, when the logical size is already defined.
Still, it would be better to block only the particular reshape
operations which ATM cause kernel problems.
We check whether the new number of images is higher, and prevent the
conversion if the volume is in use (i.e. as a thin-pool's data LV).
clang: this path is supposedly impossible to hit, as we should always
have origin_lv defined when running it, but adding protection
to make this obvious to the analyzer isn't a big issue.
Since _reserve_area() may fail due to an allocation failure,
add support for passing this already reported failure upward.
FIXME: it's log_error() without causing direct command failure.
Although we expect min_chunk_size to be 32bit value, for
large size of caches it might be useful to do calcs 64bit.
So to avoid doing shift as signed 32bit - use unsigned 64bit
from the start.
reporting fields (-o) directly from kernel:
writecache_total_blocks
writecache_free_blocks
writecache_writeback_blocks
writecache_error
The data_percent field shows used cache blocks / total cache blocks.
Until we resolve reshape for 'stacked' devices, we need to disable it.
So users can no longer reshape e.g. thin-pool data volumes, which
ATM causes serious thin-pool problems.
After the VG lock is taken for vg_read, reread the mda_header
and compare the metadata text offset and checksum to what was
seen during label scan. If it is unchanged, then the metadata
has not changed since the label scan, and the metadata does not
need to be reread under the lock for command processing.
For commands that do not make changes (e.g. reporting), the
mda_header is reread and checked on one mda to decide if the
full metadata rereading can be skipped. For other commands
(e.g. modifying the vg) the mda_header is reread and checked
from all PVs. (These could probably just check one mda also.)
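The decision described above can be sketched roughly as below. This is a simplified model, not the lvm2 data structures: MdaSummary, its two fields, and can_skip_full_reread are names invented for this example, and each metadata area is identified here by an arbitrary string key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MdaSummary:
    """Hypothetical summary of one mda_header: where the metadata text
    starts and the checksum recorded for it."""
    text_offset: int
    text_checksum: int


def can_skip_full_reread(seen_at_scan, reread_under_lock):
    """Return True if every header reread under the VG lock matches what
    was seen during label scan (same offset and same checksum).

    Both arguments map an mda identifier to an MdaSummary. A read-only
    command might pass a single reread entry, a modifying command one
    entry per PV.
    """
    if not reread_under_lock:
        return False  # nothing was verified, so do not skip
    return all(
        seen_at_scan.get(mda) == summary
        for mda, summary in reread_under_lock.items()
    )
```

Any mismatch, or a header that was not seen during the scan, falls back to rereading the full metadata.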
When pvcreate/pvremove prompt the user, they first release
the global lock, then acquire it again after the prompt,
to avoid blocking other commands while waiting for a user
response. This release/reacquire changes the locking
order with respect to the hints flock (and potentially other
locks). So, to avoid deadlock, use a nonblocking request
when reacquiring the global lock.
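A minimal sketch of the nonblocking reacquire idea, assuming an flock-style file lock. The function name reacquire_nonblocking is made up for this illustration and the real lvm2 locking code is C and more involved. Instead of sleeping on the lock while another lock is already held (which is where the changed ordering could deadlock), the request returns immediately and the caller decides what to do on failure.

```python
import fcntl


def reacquire_nonblocking(fd):
    """Try to take an exclusive flock on fd without blocking.

    Returns True if the lock was obtained, False if another holder has
    it. The caller is expected to fail the command or retry, rather than
    wait while still holding other locks.
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
```

With a blocking request, two commands that each hold one lock and wait for the other's would never make progress. With LOCK_NB, one of them simply sees False and can back off.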
Avoid leaking the hint memory on every loop 'continue', and
allocate the hint only when it is going to be added to the list.
Switch to 'dm_strncpy()' and validate sizes.
For dev_in_device_list() != 0 the allocated 'devl' was
leaking - so allocate 'devl' only
when !dev_in_device_list() and reindent the surrounding code.
Since we check for NULL pointers earlier, we need
to be consistent across the function - since the NULL
check would apply across the whole function.
The 'mda' check can be dropped - we are
already dereferencing it earlier, so it can't
be NULL at those places (and it's validated
before entering _read_mda_header_and_metadata).
dev_unset_last_byte() must be called while the fd is still valid.
After a write error, dev_unset_last_byte() must be called before
closing the dev and resetting the fd.
In the write error path, dev_unset_last_byte() was being called
after label_scan_invalidate() which meant that it would not unset
the last_byte values.
After a write error, dev_unset_last_byte() is now called in
dev_write_bytes() before label_scan_invalidate(), instead of by
the caller of dev_write_bytes().
In the common case of a successful write, the sequence is still:
dev_set_last_byte(); dev_write_bytes(); dev_unset_last_byte();
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
When resizing 2 volumes, like a thin-pool and its metadata, that
are of different types, the command expected
both LVs to be of the same segtype and would throw an error if
they differed.
This patch fixes it by setting a new segtype from the last segment of
the 2nd extended device.
It also fixes the possible 'percentage' extension setup that
might have been used for the 'primary' volume - the 'secondary'
LV always goes with a direct size, as we do not support 'percentage'
setup for it.
This mainly affects usage of thin-pools, where the extension of the
thin-pool data size may also lead to extension of the metadata size.
Instead of checking all LVs in a VG - do just a direct copy of LVs
from the existing list ->segs_using_thin_lv.
TODO: maybe it could be better to expose seg_list to /tools...
Enhance lv_info with lv_info_with_name_check.
This 'variant' not only checks for the existence of the UUID in the DM table
but also checks whether its DM name matches the expected LV name.
Otherwise activation may be 'skipped', along with the rename, in case the
DM UUID already exists and only the device name is different.
This change makes it considerably easier to manipulate e.g. a detached mirror
leg, which ATM uses the same UUID - only the LV name has been changed.
The existing code was not able to run 'activation' (and do a rename) and just
skipped the call. So the code used a workaround and 'tried'
to deactivate such an LV first - this however works only in the non-clvmd case,
as the cluster did not hold the lock for the deactivated LV.
With this extended lv_info, the code will run 'activation' and will
synchronize the name to match the expected LV name.
The patch extends _lv_info() with a new parameter 'with_name_check',
which is later translated into the 'name_check' argument for
_info_run(), which in case of a name mismatch evaluates the
check as if the device does not exist.
Such a call is only used in one place, _lv_activate(), which then
lets the activation run. All other invocations of _info() calls
are left intact.
TODO: fix mirror table manipulation (and raid)....
The return value from bcache_invalidate_fd() was not being checked.
So I've introduced a little function, _invalidate_fd() that always
calls bcache_abort_fd() if the write fails.
The resume of the 'released' 'COW' should precede the resume of the origin.
The fact that we needed to do the sequence differently for merge was
caused by bugs fixed in the 2 previous commits - so we no longer need
to recognize 'merging' and we should always go with a single
sequence.
The importance of this order is to properly remove the '-real' device
from the origin LV. When the COW is activated second, the '-real' device is
kept in the table, as it cannot be removed during the 1st resume of the origin,
and the later activation of the COW LV no longer builds the tree associated
with the origin LV.
When checking the device id of a thin device that is just being
merged, the snapshot merge could actually have already finished,
which means the '-real' suffix for the LV is already gone and just the LV
is there - so check explicitly for this condition and use the
correct UUID for this case.
When a cachevol LV is attached, have the LV keep its lock
allocated. The lock on the cachevol won't be used while
it's attached. When the cachevol is split a new lock does
not need to be allocated. (Applies to cachevol usage by
both dm-cache and dm-writecache.)
When LV gets cached and uses cache-pool - such cache-pool
will now get _cpool suffix automatically.
Thus 'Pool' column for cached LV will now show either _cvol
or _cpool LV.
Before 'archive()' is called, lvm2 must not touch/modify metadata.
So move setting CACHE_VOL related flags past this point.
Also make sure reading of cache segtype always restores this
flag properly (even if compatible flag would be lost).
Since code is using -cdata and -cmeta UUID suffixes, it does not need
any new 'extra' ID to be generated and stored in metadata.
Since the introduction of the new 'segtype' cache+CACHE_USES_CACHEVOL we can
safely assume a 'new' cache with cachevol will now be created
without the extra metadata_id and data_id in metadata.
For backward compatibility, the code still reads them in case an older
version of metadata has them - so it should still be able
to activate such volumes.
A bonus is the reduced size of the lv structure used to store info about an LV
(noticeable with big volume groups).