IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Effectively revert whole 76f6951c3e.
We need to figure out some other solution.
At this moment usage of --config with 'repair' of blocked mirror
is 'freezing' combination.
For each section 8 man page, a .8_gen file is created from one of:
.8_main - Old-style man page - content used directly
.8_des and .8_end - Description and end section of a generated page
.8_pregen - Pre-generated page used if the generator fails
Other man sections are not generated and use the suffix .5_main or .7_main.
Developers should use 'make generate' to regenerate the .8_pregen files.
When a given command:
- matches command definition A based on required options,
but uses optional options that are not accepted by A.
- matches command definition B based on required options,
(but fewer required items than A), and uses no
unaccepted optional options.
then B is the correct choice. Command A would fail because
of the unaccepted optional options. The logic was mistakenly
letting A win because it had a greater number of required option
matches, without accounting for the optional option mismatches.
Systematically outline every combination of:
--type striped, --type mirror, --type raid, --mirrors, --stripes
and make sure each is assigned to one specific cmd def.
This revealed that a new command def is needed for
lvcreate that uses both --mirrors and --stripes
but no --type option.
The use of LVCREATE_RAID shortcut for an option set
resulted in mirrors/stripes being included in optional
opts set when they were already in the required list.
Sending %d as format argument in lvmetad_vg_remove_pending() will cause
segfaults in config_make_nodes_v() when va_arg() casts to int64_t. Also, it is
clearly advertised in the lvm source code that using plain %d is prohibited, so
let's switch to FMTd64.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Require that the path argument to dmfilemapd be an absolute path
and document this in tool output, libdevmapper.h and dmfilemapd.8.
The check is also enforced by dm_stats_start_filemapd() to avoid
forking a new process with an invalid path argument.
The initial check on argc incorrectly returns 1 when the wrong
number of arguments are present, causing a segfault in main()
when no arguments are given:
# dmfilemapd
Wrong number of arguments.
usage: dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Segmentation fault (core dumped)
Commit f4b30b0dae was about displaying visible LV size
when reshape space is allocated. Take parity devices
into account when displaying the user visible LV size.
As was recently done with relative signes for sizes/extents,
limit the signs used with the mirrors option, e.g.
lvcreate --mirrors now does not accept or advertise an
optional minus sign with the value. lvconvert --mirrors
accepts +|-.
Add a warning about maximum supported numbers of stripes
with striped LVs realtive to RAID conversions.
Add examples for a more elaborate, multi-step conversion
from linear to striped (and vice versa).
Shrink lvs examples output.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The LVCONVERT_RAID shortcut for including options ended
up including mirrors/stripes as optional opts in defs where
they were already required, or in defs where they would
not be used. Remove the option set and specify in each
case only the options wanted.
Better support for lvdisplay.
By default info about running (in kernel) cache status is printed.
To get 'segtype' info, user runs: 'lvdisplay -m', example:
--- Logical volume ---
LV Path /dev/vg/lvol0
LV Name lvol0
VG Name vg
LV UUID Y4uWuN-TBGk-duer-aPWl-yBWn-iFFR-RU1gg1
LV Write Access read/write
LV Creation host, time linux, 2017-03-01 20:52:39 +0100
LV Cache pool name lvol2
LV Cache origin name lvol0_corig
LV Status available
# open 0
LV Size 12,00 MiB
Cache used blocks 10,42%
Cache metadata blocks 0,49%
Cache dirty blocks 0,00%
Cache read hits/misses 112 / 34
Cache wrt hits/misses 133 / 0
Cache demotions 0
Cache promotions 20
Current LE 3
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Segments ---
Logical extents 0 to 2:
Type cache
Chunk size 64,00 KiB
Metadata format 1
Mode writethrough
Policy smq
Setting migration_threshold=100000
OO_LVCREATE_CACHE accepts --cachemetadataformat.
Support new option --cachemetadataformat auto|1|2 for caching.
Word 'auto' can be also be given as '0'.
Report CMFmt column with cache metadata format version.
Report KMFmt column with 'kernel cache metadata format version' for device.
(a value reported from status).
(Update 'CacheMode' to name 'Cache' as primary segtype).
Only cache-pool segtype may store cache_metadata_format.
Only supported values are 0,1,2
Format 2 requires LV status uses LV_METADATA_FORMAT.
Format 0 (unselected) or 1 shall not set this 'incompatible' status.
Cache pool read/writes metadata_format within its segment type..
For CachePoolLV unselected metadata format is NOT stored in metadata.
For CacheLV when metadata format is not present/selected in lvm2 metadata,
it's automatically assumed to be the version 1 (backward compatible).
To ensure older lvm2 will not 'miss-read' metadata with new version 2,
such LV is marked with METADATA_FORMAT status flag (segment is
specifying metadata format). So when cache uses metadata format 2,
it will become inaccesible on older system without such support.
(kernel dm cache < 1.10, lvm2 < 2.02.169).
Add new profilable configation setting to let user select
which metadata format of a created cache pool he wish to use.
By default the 'best' available format is autodetected at runtime,
but user may enforce format 1 or 2 ATM.
Code also detects availability for metadata2 supporting cache target.
In case of troubles user may easily Disable usage of this feature
by placing 'metadata2' into global/cache_disabled_features list.
Older library version was not detecting unknown 'feature' bits
and could let start target without needed option.
New versioned symbol now checks for supported feature bits.
_Base version keeps accepting only previously known features and
mask/ignores unknown bits.
NB: if the older binary passed in 'random' bits, it will not get
metadata2 by chance. New linked binary get new validation function.
Library user is required to not pass 'trash' for unsupported bits,
as such calls will be rejected.
Dm cache target version 1.10 introduces new cache metadata format
(upstream kernel >=4.11).
New format is enable by passing new target feature flag metadata2.
Interace side on libdm uses DM_CACHE_FEATURE_METADATA2.
This feature bit is now also recognized on status
and set in 'feature_flags' field of dm_status_cache structure.
Code also adds check for 'highest' supported feature flag bit.
So it rejects properly any 'unknown' feature bit set by application.
Better code to enforce writethrough caching for cleaner policy.
Only check for cleaner when DM_CACHE_FEATURE_PASSTHROUGH or
DM_CACHE_FEATURE_WRITEBACK is set.
As now we can properly recognize all paramerters for pool creation,
we may drop PASS_ARG_ defines and rely on '_UNSELECTED' or 0 entries
as being those without user given args.
When setting are not given on command line - 'update' function
fill them from profiles or configuration. For this 'profile' arg
was needed to be passed around and since 'VG' itself is not needed,
it's been all replaced with 'cmd, profile, extents_size' args.
Reuse same code for getting/setting cache parameters with lvcreate.
So there is single one place how to get vars from profiles and configs.
Variables declarations are moved to start of function and
there is no need to initialize them as they are always
defined by functions get_cache_params() and get_pool_params().
Since cache chunk might be huge and there is no technical need
to enforce rounding and there is actually more 'real' VG space
used then necessary - keep rounding on 'chunk' bounrary only
for thin volumes - where it's the space used anyway.
NB: we support conversion of any-size 'existing' LV into cached LV.
Fix missing reset of '*settings' pointer when no args were given.
Handle cache_chunk settings like all other settings, so it is properly
updated only with non-zero settings and the existing cache-pool
chunk_size is not being reconfigured.
User can specify metadata profile which stores important cache
geometry data for easy configuration.
Fix missing support for getting chunk_size, cache_mode, cache_policy
for a cache/cache pools volumes from configuration or metadata profile.
Split-off patch that just replaces returns with 'goto bad'
so there is single place to release policy_settings.
In the next patch we will start to use some shared
function call between lvconvert and lvcreate
(effectively restoring previous logic which has got lost).
To more easily recognize unselected state from select '0' state
add new 'THIN_ZERO_UNSELECTED' enum.
Same applies to THIN_DISCARDS_UNSELECTED.
For those we no longer need to use PASS_ARG_ZERO or PASS_ARG_DISCARDS.
Basically code moving operation to have a single place resolving
thin_pool_chunk_size_policy.
Supported are generic & performance profiles.
Function is now shared between thin manipulation code and configuration
_CFG logic to obtain defaults and handle correct reporting upward coding
stack.
Test for NULL in dm_stats_destroy() and return immediately if
the struct dm_stats pointer is NULL (similar to free(NULL)).
This simplifies cleanup code which otherwise needs to:
out:
if (dms)
dm_stats_destroy(dms);
return;
Older compilers cannot tell that the 'mode' variable is only
used in branches in which it is assigned:
dmsetup.c:5651: warning: "mode" may be used uninitialized in this function
dmsetup.c:5023: warning: "mode" may be used uninitialized in this function
Avoid this by always assigning the variable a value.
Older compilers are not able to determine that although group_id
is only assigned in one branch of a conditional, it is never used
used when the other branch is taken:
libdm-stats.c:3319: warning: "group_id" may be used uninitialized in this function
Avoid this by always initialising the variable when it is
declared.
Utilizing the --config option we will utilize global/notify_dbus=0 so
that the service itself doesn't generate change events which it then needs to
process.
We need to place query operations in the queue to prevent the case where
a client knows of something before the service does. For example if a
client creates a PV/VG/LV outside of the dbus API and then immediately
tries to lookup and use that resource in the lvm dbus service it should
be present. By placing the queries in the work queue any previous
refresh operation will complete before we process the query.
In addition to the already supported conversion between 2-legged
raid1 and raid5, raid1 and raid4 can be also converted into each
other with 2 legs (raid4/5 are limited to map a 2-legged raid1).
This patch supports the missing raid4 conversion in the sequence
linear -> 2-legged raid1 -> raid4/5, then restripe to more than one
data stripes for performance and resilience reasons and optionally
convert to striped/raid0.
The other conversion sequence is also possible by converting N-way
striped/raid0 to raid4/5, then restripe to 2 legs followed by a
conversion to raid1 and optionally to linear (loosing all resilience).
Add examples for reshaping number of stripes
and converting from raid6 to striped to raid10.
Remove trailing spaces.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The filemap daemon takes its program_id from the regions it is
managing: use DM_STATS_ALL_PROGRAMS when retrieving an initial
listing and then obtain the correct program_id from the group
leader.
On conversion from striped to raid0, data LVs are created
and all segments and their respective areas of the striped
LV are moved across to new segments allocated for the raid0
image LVs. This can cause non-canonical segments to be added
to the image LVs.
Add a call to lv_merge_segments() once all segments have been
added to an image LV to compensate for that. This avoids
unsafe table loads on activation.
Fix comments.
Automatic dmeventd repair of mirrors with active lvmetad configured
(mirror_image_fault_policy = "allocate") fails because the lvscan
run before the repair in the mirror DSO does not update the
lvmetad cache properly thus "lvconvert --repair ..." fails.
Need to scan the mirror LV before and after the repair
to have proper cache content after the repair finished.
The cache can't be relied on or the repair will fail.
Resolves: rhbz1380521
Launch an instance of the filemap monitoring daemon when creating,
or updating, a file mapped group, unless the --nomonitor switch is
given.
Unless --foreground is given the daemon will detach from the
terminal and run in the background until it is signaled or the
daemon termination conditions are met.
The --follow={inode|path} switch is added to control the daemon
behaviour when files are moved, unlinked, or renamed while they
are being monitored.
The daemon runs with the same verbosity as the dmstats command
that starts it.
Add a daemon that can be launched to monitor a group of regions
corresponding to the extents of a file, and to update the regions as the
file's allocation changes.
The daemon is intended to be started from a library interface, but can
also be run from the command line:
dmfilemapd <fd> <group_id> <path> <mode> [<foreground>[<log_level>]]
Where fd is a file descriptor open on the mapped file, group_id is the
group identifier of the mapped group and mode is either "inode" or
"path". E.g.:
# dmfilemapd 3 0 vm.img inode 1 3 3<vm.img
...
If foreground is non-zero, the daemon will not fork to run in the
background. If verbose is non-zero, libdm and daemon log messages will
be printed.
It is possible for the group identifier to change when regions are
re-mapped: this occurs when the group leader is deleted (regroup=1 in
dm_stats_update_regions_from_fd()), and another region is created before
the daemon has a chance to recreate the leader region.
The operation is inherently racey since there is currently no way to
atomically move or resize a dm_stats region while retaining its
region_id.
Detect this condition and update the group_id value stored in the
filemap monitor.
A function is also provided in the the stats API to launch the filemap
monitoring daemon:
int dm_stats_start_filemapd(int fd, uint64_t group_id, const char *path,
dm_filemapd_mode_t mode, unsigned foreground,
unsigned verbose);
This carries out the first fork and execs dmfilemapd with the arguments
specified.
A dm_filemapd_mode_t value is specified by the mode argument: either
DM_FILEMAPD_FOLLOW_INODE, or DM_FILEMAPD_FOLLOW_PATH. A helper function,
dm_filemapd_mode_from_string(), is provided to parse a string containing
a valid mode name into the appropriate dm_filemapd_mode_t value.
It's not an error to call dm_stats_group_present() on a handle
that contains no regions.
This causes dmfilemap to log a false backtrace during shutdown
if all regions are removed from the corresponding device:
exiting _filemap_monitor_get_events() with deleted=0, check=0
waiting for FILEMAPD_WAIT
dm message (253:1) [ opencount flush ] @stats_list dmstats [32768] (*1)
<backtrace>
Filemap group removed: exiting.
Change this to only emit a backtrace if the handle is NULL.
Splitting off an image LV of a 2-legged
raid1 LV causes loss of resilience.
Ask user to avoid uninformed loss of all resilience.
Don't ask for N > 2 legged raid1 LVs.
Adjust tests.
Splitting off an image LV of a 2-legged raid1 LV tracking changes
causes loosing partial resilience for any newly written data set.
Full resilience will be provided again after the split off image LV
got merged back in and the new data set got fully synchronized.
Reason being that the data is only stored on the remaining single
writable image during the split.
Ask user to avoid uninformed loss of such partial resilience.
Don't ask for N > 2 legged raid1 LVs.
In case N images fail (N <= parity chunks) _and_
a "vgreduce --removemissing --force VG" was applied
a following repair of the RaidLV fails:
Unable to remove N images: Only 0 devices given.
Failed to remove the specified images from tb/r.
Failed to replace faulty devices in tb/r.
Fix as of this commit results in correct repair:
Faulty devices in tb/r successfully replaced.
Add new values for different sign variations, resulting in:
size_VAL no sign accepted
ssize_VAL accepts + or -
psize_VAL accepts +
nsize_VAL accpets -
extents_VAL no sign accepted
sextents_VAL accepts + or -
pextents_VAL accepts +
nextents_VAL accepts -
Depending on the command being run, change the option
values for --size, --extents, --poolmetadatasize to
use the appropriate value from above.
lvcreate uses no sign (but accepts + and ignores it).
lvresize accepts +|- for a relative change.
lvextend accepts + for a relative change.
lvreduce accepts - for a relative change.
Like opt and val arrays in previous commit, combine duplicate
arrays for lv types and props in command.c and lvmcmdline.c.
Also move the command_names array to be defined in command.c
so it's consistent with the others.
command.c and lvmcmdline.c each had a full array defining
all options and values. This duplication was not removed
when the command.c code was merged into the run time.
seg->extents_copied has to be defined properly on reducing
the size of a raid LV or conversion from raid5 with 1 stripe
to raid1 will fail.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Reshaping a raid5 LV to one stripe aiming to convert it to
raid1 (and optionally to linear) reports the wrong LV size
when still having reshape space allocated.
The lv_extend/_lv_reduce API doesn't cope with resizing RaidLVs
with allocated reshape space and ongoing conversions. Prohibit
resizing during conversions and remove the reshape space before
processing resize. Add missing seg->data_copies initialisation.
Fix typo/comment.
For the time being raid10 is limited to even number of total stripes
as is and 2 data copies. The number of stripes provided on creation
of a raid10(_near) LV with -i/--stripes gets doubled to define
that even total number of stripes (i.e. images).
Apply the same on disk adding conversions (reshapes) with
"lvconvert --stripes RaidLV" (e.g. 2 stripes = 4 images
total converted to 3 stripes = 6 images total).
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Part of the UNIT description was still living in the
--size description. Move it to the Size[UNIT] description
since it is used by other options, not just --size.
In lvcreate/lvconvert, --poolmetdatasize can only be an
absolute value, but in lvresize/lvextend, --poolmetadatasize
can be given a + relative value.
The val types only currently support relative values that
go in both directions +|-. Further work is needed to add
val types that can be relative in only one direction, and
switching various option values to one those depending on
the command name. Then poolmetdatasize will not appear
with a +|- option in lvcreate/lvconvert, and will
appear with only the + option in lvresize/lvextend.
Enhance the raid report functions for the recently added LV fields
reshape_len, reshape_len_le, data_offset, new_data_offset, data_copies,
data_stripes and parity_chunks to cope with "lvs --select".
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Clean up and correct the details around --extents and --size.
lvcreate/lvresize/lvreduce/lvextend all now display the
extents option in usages.
The Size and Number value variables for --size and --extents
are now displayed without the [+|-] prefix for lvcreate.
A special case is needed to display
--extents for lvcreate since the cmd defs
treat --extents as an automatic alternative
to --size (to avoid duplicating every cmd def).
Now that we got the "data_stripes" field key, adjust the "stripes" field description.
Enhance the "regionsize" field description to cover raids as well.
When a cmd def includes multiple sets of options (OO_FOO),
allow multiple OO_FOO sets to contain the same option and
avoid repeating it in the cmd def.
There was confusion in the code about whether or not the
--size option accepted a sign. Make it consistent and clear
that it does.
This exposes a new problem in that an option can only
accept one value type, e.g. --size can only accept a
signed number, it cannot accept a positive or negative
number for some commands and reject negative numbers for
others.
In practice, lvcreate accepts only positive --size
values and lvresize accepts positive or negative --size
values. There is currently no way to encode this
difference. Until that is fixed, the man page output
is hacked to avoid printing the [+|-] prefix for sizes
in lvcreate.
Settings specified in other command line args take precedence over
profiles and --config, which takes precedence over settings in actual
config files.
Since commit 1e2420bca8 ('commands: new
method for defining commands') commands like this:
lvchange --config 'global/test=1' -ay vg
have been printing the 'TEST MODE' message, but nevertheless making
real changes.
Commit 48778bc503 introduced new RAID reshaping related report fields.
The inclusioon of segtype.h in properties.c isn't mandatory, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 80a6de616a versioned the dm_tree_node_add_raid_target_with_params()
and dm_tree_node_add_raid_target() APIs for compatibility reasons.
There's no user of the latter function, remove it.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
During an ongoing reshape, the MD kernel runtime reads stripes relative
to data_offset and starts storing the reshaped stripes (with new raid
layout and/or new stripesize and/or new number of stripes) relative
to new_data_offset. This is to avoid writing over any data in place
which is non-atomic by nature and thus be recoverable without data loss
in the transition. MD uses the term out-of-place reshaping for it.
There's 2 other areas we don't have report capability for:
- number of data stripes vs. total stripes
(e.g. raid6 with 7 stripes toal has 5 data stripes)
- number of (rotating) parity/syndrome chunks
(e.g. raid6 with 7 stripes toal has 2 parity chunks; one
per stripe for P-Syndrome and another one for Q-Syndrome)
Thus, add the following reportable keys:
- reshape_len (in current units)
- reshape_len_le (in logical extents)
- data_offset (in sectors)
- new_data_offset ( " )
- data_stripes
- parity_chunks
Enhance lvchange-raid.sh, lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh, lvconvert-raid-reshape.sh
and lvconvert-raid-takeover.sh to make use of new keys.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add/remove the SECONDARY_SYNTAX flag to cmd defs.
cmd defs with this flag will be listed under the
ADVANCED USAGE man page section, so that the main
USAGE section contains the most common commands
without distraction.
- When multiple cmd defs do the same thing, one variant
can be displayed in the first list.
- Very advanced, unusual or uncommon commands should be
in the second list.
Recently added check for reshaping in this function called for
a cleanup to avoid proliferating it with more explicit conditionals.
Base the reshaping check on the given _features array.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
An initialization was missing when converting striped to raid0(_meta)
causing unitialized reshape_len in the new component LVs first segment.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The imposed minimum region size can cause rejection on
disk removing reshapes. Lower it to avoid that.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Commit 27384c52cf lowered the maximum number of devices
back to 64 for compatibility.
Because more members have been added to the API in
'struct dm_tree_node_raid_params *', we have to version
the public libdm RAID API to not break any existing users.
Changes:
- keep the previous 'struct dm_tree_node_raid_params' and
dm_tree_node_add_raid_target_with_params()/dm_tree_node_add_raid_target()
in order to expose the already released public RAID API
- introduce 'struct dm_tree_node_raid_params_v2' and additional functions
dm_tree_node_add_raid_target_with_params_v2()/dm_tree_node_add_raid_target_v2()
to be used by the new lvm2 lib reshape extentions
With this new API, the bitfields for rebuild/writemostly legs in
'struct dm_tree_node_raid_params_v2' can be raised to 256 bits
again (253 legs maximum supported in MD kernel).
Mind that we can limit the maximum usable number via the
DEFAULT_RAID{1}_MAX_IMAGES definition in defaults.h.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
- Combine the equivalent lvconvert --type raid defs.
(Two cmd defs must be different without relying
on LV type, which are not known at the time the
cmd def is matched.)
- Remove unused optional options from lvconvert --stripes,
and lvconvert --stripesize.
- Use Number for --stripes_long val type.
- Combine the cmd def for raid reshape cleanup into the
existing start_poll cmd def (they were duplicate defs).
Calls into the raid code from a poll opertion will be
added.
Commit 64a2fad5d6 raised the maximum number of RAID devices to 64.
Commit e2354ea344 introduced RAID_BITMAP_SIZE as 4 to have
256 bits (4 * 64 bit array members), thus changing the libdm API
unnecessarilly for the time being.
To not change the API, reduce RAID_BITMAP_SIZE to 1.
Remove an unneeded definition of it from libdm-common.h.
If we ever decide to raise past 64, we'll version the API.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
The fedorahosted git repository shuts down tomorrow:
https://communityblog.fedoraproject.org/fedorahosted-sunset-2017-02-28/
Our upstream git repository has moved back to sourceware.org.
Mailing list hosting is not changing.
Gitweb:
https://www.sourceware.org/git/?p=lvm2
Git:
git://sourceware.org/git/lvm2.git
ssh://sourceware.org/git/lvm2.git
http://sourceware.org/git/lvm2.git
Example command to change the origin of a repository clone:
Public:
git remote set-url origin git://sourceware.org/git/lvm2.git
Committers:
git remote set-url origin git+ssh://sourceware.org/git/lvm2.git
The options list was sorted as:
- options with both long and short forms, alphabetically
- options with only long form, alphabetically
This was done only for the visual effect. Change to
sort alphabetically by long opt, without regard to
short forms.
Fixes commit 286d39ee3c, which was correct except
for a reversed strstr. Now uses strchr, and modifies
a copy of the name so the original argv is preserved.
When requesting a regionsize change during conversions, check
for constraints or the command may fail in the kernel n case
the region size is too smalle or too large thus leaving any
new SubLVs behind.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Allow regionsize on upconvert from linear:
fix related commit 2574d3257a to actually work
Related: rhbz1394427
Remove setting raid5_n on conversions from raid1
as of commit 932db3db53 because any raid5 mapping
may be requested.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Add aforementioned but forgotten new test scripts
lvconvert-raid-reshape-linear_to_striped.sh,
lvconvert-raid-reshape-striped_to_linear.sh and
lvconvert-raid-reshape.sh
Those presume dm-raid target version 1.10.2
provided by a following kernel patch.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
Because of contraints in renaming shifted rimage/rmeta LV names
the current RaidLV limit is a maximum of 10 SubLV pairs.
With the previous introduction of reshaping infratructure that
constriant got removed.
Kernel supports 253 since dm-raid target 1.9.0, older kernels 64.
Raise the maximum number of RaidLV rimage/rmeta pairs to 64.
If we want to raise past 64, we have to introdce a check for
the kernel supporting it in lvcreate/lvconvert.
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces the changes to call the reshaping infratructure
from lv_raid_convert().
Changes:
- add reshaping calls from lv_raid_convert()
- add command definitons for reshaping to tools/command-lines.in
- fix raid_rimage_extents()
- add 2 new test scripts lvconvert-raid-reshape-linear_to_striped.sh
and lvconvert-raid-reshape-striped_to_linear.sh to test
the linear <-> striped multi-step conversions
- add lvconvert-raid-reshape.sh reshaping tests
- enhance lvconvert-raid-takeover.sh with new raid10 tests
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- allow raid_rimage_extents() to calculate raid10
- remove an __unused__ attribute
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- add missing raid1 <-> raid5 conversions to support
linear <-> raid5 <-> raid0(_meta)/striped conversions
- rename related new takeover functions to
_takeover_from_raid1_to_raid5 and _takeover_from_raid5_to_raid1,
because a reshape to > 2 legs is only possible with
raid5 layout
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Change:
- enhance _clear_meta_lvs() to support raid0 allowing
raid0_meta -> raid10 conversions to succeed by clearing
the raid0 rmeta images or the kernel will fail because
of discovering reordered raid devices
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- enhance _raid45610_to_raid0_or_striped_wrapper() to support
raid5_n with 2 areas to raid1 conversion to allow for
striped/raid0(_meta)/raid4/5/6 -> raid1/linear conversions;
rename it to _takeover_downconvert_wrapper to discontinue the
illegible function name
- enhance _striped_or_raid0_to_raid45610_wrapper() to support
raid1 with 2 areas to raid5* conversions to allow for
linear/raid1 -> striped/raid0(_meta)/raid4/5/6 conversions;
rename it to _takeover_upconvert_wrapper for the same reason
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add missing possible reshape conversions and conversion options
to allow/prohibit changing stripesize or number fo stripes
- enhance setting convenient riad types in reshape conversions
(e.g. raid1 with 2 legs -> radi5_n)
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add _raid_reshape() using the pre/post callbacks and
the stripes add/remove reshape functions introduced before
- and _reshape_requested function checking if a reshape
was requested
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add vg metadata update functions
- add pre and post activation callback functions for
proper sequencing of sub lv activations during reshaping
- move and enhance _lv_update_reload_fns_reset_eliminate_lvs()
to support pre and post activation callbacks
- add _reset_flags_passed_to_kernel() which resets anyxi
rebuild/reshape flags after they have been passed into the kernel
and sets the SubLV remove after reshape flags on legs to be removed
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function to support disk adding reshapes
- add function to support disk removing reshapes
- add function to support layout (e.g. raid5ls -> raid5_rs)
or stripesize reshaping
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add function providing state of a reshaped RaidLV
- add function to adjust the size of a RaidLV was
reshaped to add/remove stripes
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces more local infrastructure to raid_manip.c
used by followup patches.
Changes:
- add lv_raid_data_copies returning raid type specific number;
needed for raid10 with more than 2 data copies
- remove _shift_and_rename_image_components() constraint
to support more than 10 raid legs
- add function to calculate total rimage length used by out-of-place
reshape space allocation
- add out-of-place reshape space alloc/relocate/free functions
- move _data_rimages_count() used by reshape space alloc/realocate functions
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces local infrastructure to raid_manip.c
used by followup patches.
Add functions:
- to check reshaping is supported in target attibute
- to return device health string needed to check
the raid device is ready to reshape
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
In order to support striped raid5/6/10 LV reshaping (change
of LV type, stripesize or number of legs), this patch
introduces infrastructure prerequisites to be used
by raid_manip.c extensions in followup patches.
This base is needed for allocation of out-of-place
reshape space required by the MD raid personalities to
avoid writing over data in-place when reading off the
current RAID layout or number of legs and writing out
the new layout or to a different number of legs
(i.e. restripe)
Changes:
- add members reshape_len to 'struct lv_segment' to store
out-of-place reshape length per component rimage
- add member data_copies to struct lv_segment
to support more than 2 raid10 data copies
- make alloc_lv_segment() aware of both reshape_len and data_copies
- adjust all alloc_lv_segment() callers to the new API
- add functions to retrieve the current data offset (needed for
out-of-place reshaping space allocation) and the devices count
from the kernel
- make libdm deptree code aware of reshape_len
- add LV flags for disk add/remove reshaping
- support import/export of the new 'struct lv_segment' members
- enhance lv_extend/_lv_reduce to cope with reshape_len
- add seg_is_*/segtype_is_* macros related to reshaping
- add target version check for reshaping
- grow rebuilds/writemostly bitmaps to 246 bit to support kernel maximal
- enhance libdm deptree code to support data_offset (out-of-place reshaping)
and delta_disk (legs add/remove reshaping) target arguments
Related: rhbz834579
Related: rhbz1191935
Related: rhbz1191978
(Change to recent commit 3f4ecaf8c2.)
Use --foo Number[k|UNIT] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Previously, underlined/italic Unit was used,
like other of variables, but this UNIT is more
like a shortcut than an actual variable.
The MD kernel raid1 personality does no use any writemostly leg as the primary.
In case a previous linear LV holding data gets upconverted to
raid1 it becomes the primary leg of the new raid1 LV and a full
resynchronization is started to update the new legs.
No writemostly and/or writebehind setting may be allowed during
this initial, full synchronization period of this new raid1 LV
(using the lvchange(8) command), because that would change the
primary (i.e the previous linear LV) thus causing data loss.
lvchange has a bug not preventing this scenario.
Fix rejects setting writemostly and/or writebehind on resychronizing raid1 LVs.
Once we have status in the lvm2 metadata about the linear -> raid upconversion,
we may relax this constraint for other types of resynchronization
(e.g. for user requested "lvchange --resync ").
New lvchange-raid1-writemostly.sh test is added to the test suite.
Resolves: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855895
We use --foo Number[k|Units] to indicate that
the default units of the number is k, but other
units listed below are also accepted.
Capitalize and underline Units so it is consistent
with other variables, and reference it at the end.
Technically, the k should be bold, but this
tends to make the text visually hard to read
because of the excessive highlights scattered
everywhere. So it's left normal text for now
(it's unlikely to confuse anyone.)
Print ... after a repeatable option in the OPTIONS section.
An alternative would be to just mention in the text description
that the option is repeatable.
When the test would need to try to write some large amount of data
we can give it 'zero' segments - for obvious reason such written data
can't be verified but in some test cases it doesn't really matter.
Usage follows 'error_dev' style.
For now test suite is not supporting any combination of
error/delay/zero segments so only 1 type could be used per PV.
Previously when lvremove tried to remove 'active' origin,
it had been asking for every 'snapshot' LV separately
and doing individual single snapshot removals first.
To be faster it now deactivates origin before removal
all connected snapshots.
This avoids multiple reloads of dm table for origin volume
which were unnecessary as origin was meant to be removed as well.
There are two kinds of common options:
1. options common to all variants of a given command name
2. options common to all lvm commands
Previously, both kinds of common options were listed together
under "Common options". Now the first are printed under
"Common options for command" (when needed), and the second
are printed under "Common options for lvm" (always).
Remove the "usage notes" which should just
live in the man pages.
When there are 3 or more variants of a command,
print all the options produces a lot of output,
so require --longhelp to print all the options
in these cases.
Check 'dmsetup version' is called before starting any more
advanced logic in $DM_DEV_DIR.
Call also replaces mkdir as it creates needed path with control node.
The --type option has previously been accepted for
lvresize/lvextend. Using it did not affect the operation
of the command. The value was simply verified as matching
the current seg type of the LV.
Commit f45b689406 caused regression
of lvresize -m and --type parameter
After fix this sequence may work when we also fix syntax description:
lvcreate -l1 -m1 -n lv1 vg
lvextend --type mirror -m1 -l+1 vg/lv1
For this syntax:
lvconvert --thinpool LV1 --poolmetadata LV2
lvconvert --cachepool LV1 --poolmetadata LV2
Restore the metadata swapping behavior in addition to
the pool creation behavior. When LV1 is already a pool,
the metadata LV will be swapped with LV2.
When LV1 is not a pool, it will be converted to a
pool using the specified LV for metadata.
This syntax is no longer advertised because of the
ambiguous behavior. The primary syntaxes for pool
creation and metadata swapping will be the advertised
methods.
As we now user binary search - it's nondeterministict
which of the same 'args' will be give - so duplicates
need 'extra' care.
So provide same hack for output for --uuidstr_ARG as
for input.
Solves 'pvscan -u'.
Since there is a lot of options and lot of searches,
use binary search to keep strcmp at minimum.
The interesting part is - alphabetically sorted array contains
duplicates and some of them are not the 'right anwer', so
after we find matching string but not matching long_ARG,
we may need to check if the surrouding strings are the right matching
one.
The single loops is used also for strictly define --foo_long
(i.e. --stripes) and just differs at final part.
TODO1: replace strstr call with some flag (just like short_opt).
TODO2: drop '--' from being stored and tests by strcmp.
When parsing command defs, track and report all
errors that are found. Add an error return case
from define_commands so the standard error exit
path is used.
When using liblvm2cmd, a process can do multiple
init/exit calls, i.e.
lvm2_init(); lvm2_run(); lvm2_exit();
lvm2_init(); lvm2_run(); lvm2_exit();
...
define_commands() needs to set up the global commands[]
definitions only the first time.
The old ad hoc arg parsing when combining a split snapshot
allowed the first lv to have a vgname, but not the second.
Since lvconvert now uses the standard arg parsing in
process_each_lv(), the old one-off behavior requires a
work around.
This reverts commit 717363bb94.
These alternate forms for swapping metadata cannot be
distinguished from the command for creating a pool.
If we were to add these alternate forms for swapping
metadata, we would need to overload the pool creation
command defs, making those definitions ambiguous.
Change run time access to the command_name struct
cmd->cname instead of indirectly through
cmd->command->cname. This removes the two run time
fields from struct command.
All lvconvert functionality has been moved out of the
previous monolithic lvconvert code, except conversions
related to raid/mirror/striped/linear. This switches
that remaining code to be based on command defs, and
standard process_each_lv arg processing. This final
switch results in quite a bit of dead code that is
also removed.
This is a new explicit version of 'lvconvert LV'
which has been an obscure way of triggering polling
to be restarted on an LV that was previously converted.
Lift all the snapshot utilities (merge, split, combine)
out of the monolithic lvconvert implementation, using
the command definitions. The old code associated with
these commands is now unused and will be removed separately.
This lifts the lvconvert --repair and --replace commands
out of the monolithic lvconvert implementation. The
previous calls into repair/replace can no longer be
reached and will be removed in a separate commit.
The new check_single_lv() function is called prior to the
existing process_single_lv(). If the check function returns 0,
the LV will not be processed.
The check_single_lv function is meant to be a standard method
to validate the combination of specific command + specific LV,
and decide if the combination is allowed. The check_single
function can be used by anything that calls process_each_lv.
As commands are migrated to take advantage of command
definitions, each command definition gets its own entry
point which calls process_each for itself, passing a
pair of check_single/process_single functions which can
be specific to the narrowly defined command def.
. Define a prototype for every lvm command.
. Match every user command with one definition.
. Generate help text and man pages from them.
The new file command-lines.in defines a prototype for every
unique lvm command. A unique lvm command is a unique
combination of: command name + required option args +
required positional args. Each of these prototypes also
includes the optional option args and optional positional
args that the command will accept, a description, and a
unique string ID for the definition. Any valid command
will match one of the prototypes.
Here's an example of the lvresize command definitions from
command-lines.in, there are three unique lvresize commands:
lvresize --size SizeMB LV
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync, --reportformat String, --resizefs,
--stripes Number, --stripesize SizeKB, --poolmetadatasize SizeMB
OP: PV ...
ID: lvresize_by_size
DESC: Resize an LV by a specified size.
lvresize LV PV ...
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --resizefs, --stripes Number, --stripesize SizeKB
ID: lvresize_by_pv
DESC: Resize an LV by specified PV extents.
FLAGS: SECONDARY_SYNTAX
lvresize --poolmetadatasize SizeMB LV_thinpool
OO: --alloc Alloc, --autobackup Bool, --force,
--nofsck, --nosync, --noudevsync,
--reportformat String, --stripes Number, --stripesize SizeKB
OP: PV ...
ID: lvresize_pool_metadata_by_size
DESC: Resize a pool metadata SubLV by a specified size.
The three commands have separate definitions because they have
different required parameters. Required parameters are specified
on the first line of the definition. Optional options are
listed after OO, and optional positional args are listed after OP.
This data is used to generate corresponding command definition
structures for lvm in command-lines.h. usage/help output is also
auto generated, so it is always in sync with the definitions.
Every user-entered command is compared against the set of
command structures, and matched with one. An error is
reported if an entered command does not have the required
parameters for any definition. The closest match is printed
as a suggestion, and running lvresize --help will display
the usage for each possible lvresize command.
The prototype syntax used for help/man output includes
required --option and positional args on the first line,
and optional --option and positional args enclosed in [ ]
on subsequent lines.
command_name <required_opt_args> <required_pos_args>
[ <optional_opt_args> ]
[ <optional_pos_args> ]
Command definitions that are not to be advertised/suggested
have the flag SECONDARY_SYNTAX. These commands will not be
printed in the normal help output.
Man page prototypes are also generated from the same original
command definitions, and are always in sync with the code
and help text.
Very early in command execution, a matching command definition
is found. lvm then knows the operation being done, and that
the provided args conform to the definition. This will allow
lots of ad hoc checking/validation to be removed throughout
the code.
Each command definition can also be routed to a specific
function to implement it. The function is associated with
an enum value for the command definition (generated from
the ID string.) These per-command-definition implementation
functions have not yet been created, so all commands
currently fall back to the existing per-command-name
implementation functions.
Using per-command-definition functions will allow lots of
code to be removed which tries to figure out what the
command is meant to do. This is currently based on ad hoc
and complicated option analysis. When using the new
functions, what the command is doing is already known
from the associated command definition.
Some archs can use even 64K pages and then lvm2 runs into trouble if
the stack is 'too small' to fit extra page capturing stack overwrite.
So when lvm2 limits stack - add extra mem page - be it 4K or 64K.
Relates to ppc64le bug: https://bugzilla.redhat.com/1387279
Remove allocate_pvs from raid_manip.c:_region_size_change_request() API
and lv_extend() using it added for temporary test purpose.
Related: rhbz1366296
Add:
- region size checks to raid_manip.c types array and supporting functions
- tests to lvconvert-raid-takeover.sh to check bogus
"lvconvert --type --regionsize N " requests
Related: rhbz1366296
When we preload device with smaller size, we avoid its resume,
so later suspend/resume of full device tree my process all
existing in flight bios.
Also update comment and avoid using confusing opposite meaning.
Kernel 4.10 (dm-crypt v1.15.0) and later supports loading device
tables with crypt segment having key in kernel keyring retention
service.
dmsetup hid key section of tables output. With this patch dmsetup
no longer hides key section if it detects kernel key description
instead of hex byte representation of key itself.
Add:
- conversion support from striped/raid0/raid0_meta to/from raid10;
raid10 goes by the near format (same as used in creation of
raid10 LVs), which groups data copies together with their original
blocks (e.g. 3-way striped, 2 data copies resulting in 112233 in the
first stripe followed by 445566 in the second etc.) and is limited
to even numbers of legs for now
- related tests to lvconvert-raid-takeover.sh
- typo
Related: rhbz1366296
- support shrinking of raid0/1/4/5/6/10 LVs
- enhance lvresize-raid.sh tests: add raid0* and raid10
- fix have_raid4 in aux.sh to allow lv_resize-raid.sh
and other scripts to test raid4
Resolves: rhbz1394048
Commit cfb6ef654d introduced
support to change RAID region size.
Fix:
- don't change region_size until after prompting the user
- use log_print_unless_silent() instead of log_warn()
- avoid superfluous sigint() calls which are already
covered in yes_no_prompt()
- typo
Related: rhbz1392947
Commit cfb6ef654d introduced
support to change RAID region size.
Add:
- missing conditions to support any types to function with
it in lv_raid_convert(); temporary workaround used until
cli validation patches get merged
- tests requesting "-R " to lvconvert-raid-takeover.sh
involving a cleanup of the script
Related: rhbz1392947
Cleanups as of Jons review:
- enhance comment about mandatory raid4 <-> raid5_n activation w/o metadata SubLVs
- remove bogus segment flag setting
- fix to sync related comments on conversions to raid0/striped and amongst raid4/5
- add missing error message for non-synced conversion to raid0/striped
Related: rhbz1366296
Add:
- support to change region size of existing RaidLVs
(all RAID LV types but raid0/raid0_meta)
- lvconvert-raid-regionsize.sh with test variations
for different RAID types and region sizes
Resolves: rhbz1392947
Well waiting for zeroing may take enough time to finish 'raid' sync.
So make the test running faster without zeroing and better avoid race
to have chance to happen (i.e. lvcreate is finished after array
gets already in sync).
Add:
- support for segment types raid6_{ls,rs,la,ra}_6
(striped raid with dedicated last Q-Syndrome SubLVs)
- conversion support from raid5_{ls,rs,la,ra} to/from raid6_{ls,rs,la,ra}_6
- setting convenient segtypes on conversions from/to raid4/5/6
- related tests to lvconvert-raid-takeover.sh factoring
out _lvcreate,_lvconvert funxtions
Related: rhbz1366296
Add:
- support for segment type raid6_n_6 (striped raid with dedicated last parity/Q-Syndrome SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid6_n_6
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add:
- support for segment type raid5_n (striped raid with dedicated last parity SubLVs)
- conversion support from striped/raid0/raid0_meta/raid4 to/from raid5_n
- related tests to lvconvert-raid-takeover.sh
Related: rhbz1366296
Add a new update_filemap command to dmstats that allows a filemap
group to be updated:
# dmstats update_filemap --groupid 0 vm.img
/var/lib/libvirt/images/vm.img: Updated group ID 0 with 137 region(s).
This will update the set of regions mapped to the file to reflect
the current file system allocation.
Currently this needs to be run manually - a future update will add
support for monitoring file maps via a daemon, allowing them to be
automatically updated when the underlying file is modified.
Add a call to update the regions corresponding to a file mapped
group of regions. The regions to be updated must be grouped, to
allow us to correctly identify extents that have been deallocated
since the map was created.
Tables are built of the file extents, and the extents currently
mapped to dmstats regions: if a region no longer has a matching
file extent, it is deleted, and new regions are created for any
file extents without a matching region.
The FIEMAP call returns extents that are currently in-memory (or
journaled) and awaiting allocation in the file system. These have
the FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC flag bits set
in the fe_flags field - these extents are skipped until they
have a known disk location.
Since it is possile for the 0th extent of the file to have been
deallocated this must also handle the possible deletion and
re-creation of the group leader: if no other region allocation
is taking place the group identifier will not change.
If the group_id passed to _stats_group_id_present is equal to the
special value DM_STATS_GROUP_NOT_PRESENT there is no need to perform
any further tests: return false immediately.
For more advanced support we need to ensure better logic for calling
external much more advanced script for maintanance of thin-pool.
So this new code ensures:
When thin-pool data or metadata is bigger then 50%,
then with each 5% increment, action is called.
This is independent from autoextend_threshold.
This action always happens when thin-pool is over threshold,
(so no action when it's exactly i.e. 60%).
The only exception is 100% full thin-pool - which invokes 'last'
action.
Since thin-pool occupancy may change also downward, code needs
to also handle possibly reduction of occupancy of thin-pool.
So when usage drop from 90% to 50%, thin-pool will start to call
again action when it will pass 55% threshold.
This give external commands lot of option i.e. to call 'fstrim'
before actual resize is needed.
Default internal logic will stop trying to do any 'rescue' action
when executed command fails.
This will be now fully in hands of external script if such
behaviour is needed.
Instead of stopping monitoring after couple failing retries,
keep monitoring forever, just make larger delays between command
retries (ATM upto ~42 minutes).
So syslog is not spammed too often, yet commands have a chance to
be retried and succeed eventually...
When dmeventd configured command does not start with 'lvm ' prefix,
it's going to be an 'external' command.
In this case we split command by spaces to argv strings.
When showing sizes with 'H|human' units we do use standard rounding.
This however is confusing users from time to time,
when the printed number uses some biger units i.e. GiB and there is just
tiny fraction of space missing.
So here is some real-life example with new 'r' unit.
$lvs
LV VG Attr LSize Pool Origin
lvol0 vg -wi-a----- 1.99g
lvol1 vg -wi-a----- <2.00g
lvol2 vg -wi-a----- <2.01g
Meaning is - lvol1 has 'slightly' less then 2.00g - from sign '<' user
can be aware the LV doesn't have full 2.00GiB in size so he
will be less surpriced allocation of 2G volume will not succeed.
$ vgs
VG #PV #LV #SN Attr VSize VFree
vg 2 2 0 wz--n- <6,00g <2,01g
For uses needing 'old' undecorated human unit simply will continue
to use 'H|h' units.
The new R|r may further change when we would recongnize some
other way how to improve readability.
With recent commit d6a74025df using
INTERNAL_ERROR while cheking layer LV - it's been noticed mirror
logic currently doesn't do a correct thing during upconversion and
does a full-try instead of checking only allocator capabilities.
This leads to invalid usage of layer.
To keep existing code running before providing a fix, relax
INTERNAL_ERROR just an error and keep the 'code' running.
Once mirror code is fixed, these all check should be switched
to internal errors.
Since we still experience occasiaonal test failure - slow
things down even more to avoid race.
Add support for 'quick' table changes between normal & delayed tables.
This was missing piece in 77997c7673.
When merging origin is inactive (while driver is loaded) we
could already report merge in progress values as there is
no way to activate 'old state' now.
Show proper internal error for failing command when there are some
inconsitencies in sizes of LV and its layer instead of rather
meaningless error code 5.
(Could be hit i.e. if user tried to 'resize' cached LV and then
uncache such LV.)
During rework of resize code this validation check
has been lost (in my resize branch). Upstream
is still not supporting resize of any cache type LV
so needs to be prevented.
When we need to clear dirty cache content of cached LV, there
is table reload which usually is shortly followed by next metadata
change. However udev can't (as of now) process udev event
while device is 'suspended'.
So whenever sequence of 'suspend/resume/suspend' is needed,
we need to wait first for finishing of 'resume' processing before
starting next 'suspend'. Otherwise there is 'race' danger of triggering
unwantend umount by systemd as such event will trigger
SYSTEMD_READY=0 state for a moment for such changed device.
Such race is pretty ugly to trace so we may need to review more
sequencies for missing 'sync'.
(Other option is to enhnace 'udev' rules processing to avoid
such dramatic actions to be happening for suspended devices).
Solves: https://bugzilla.redhat.com/1280496
The only reasonable behaviour here is to error on
any number out of accepted range (i.e. now numbers
wrapping around with some hidden logic).
As this is plain bug there is no support for
backward compatibility since noone should
set numbers >UINT32_MAX and expect 0 or error
depending on how big number was used....
TODO: more fields might need to be converted.
Add simple function to wrap usage for only uint32 numbers.
Unlike 'int_arg' which accepts full range of 64bit number
this function will error on numbers out of this range:
<0, UINT32_MAX>
Add to commits 87117c2b25 and 0b8bf73a63 to avoid refreshing two
times altogether, thus avoiding issues related to clustered, remotely
activated RaidLV. Avoid need to repeat "lvchange --refresh RaidLV"
two times as a workaround to refresh a RaidLV. Fix handles removal
of temporary *-missing-* devices created for any missing segments
in RAID SubLVs during activation.
Because the kernel dm-raid target isn't able to handle transiently
failing devices properly we need
"[dm-devel][PATCH] dm raid: fix transient device failure processing"
as well.
test: add lvchange-raid-transient-failures.sh
and enhance lvconvert-raid.sh
Resolves: rhbz1025322
Related: rhbz1265191
Related: rhbz1399844
Related: rhbz1404425
When thin-pool processes event and 'lvextend --use-policies' fails
rather capture up-to-date new info as the fullness percentage may
have jumped noticable. This way we could use 'more' correct numbers
when checking for thresholds.
When there is 'merging' of an origin in progress, but metadata stil
do provide both origin and snapshot, we should show data from merged
snapshot. This is important mainly for thin case, where there was
a window, where i.e. 'lvs -o+device_id' would report information
about 'already gone' origin thin LV.
This race window is usually hard to trigger but can be ocasionally hit.
Usually shortly after activation, but before polling process manages
to update metadata after merge.
Before starting polling process, validate the merge has actually started
so there is not pointless invoke of lvmpolld.
This also fixes reported message from command, so user has
correct info whether merging has already started or
if it's delayed for next activation.
Move individual segment validation to a separate function
executed for 'complete_vg'.
Move some 'extra' validation bits from 'raid' validation to global
segtype validation (so extending existing normal validation)
TODO: still some test are left to be moved.
Reduce some duplication in validation process - there are still
some left thought so still room for improving overal speed.
The function timeout_add_seconds has quite a bit of variability. Using
timeout_add which specifies the timeout in ms instead of seconds. Testing
shows that this is much more consistent which should improve clients that
are using shorter timeouts for the API and the connection.
We can't keep 'display_lvname' for too long - it's using
ringbuffer and keeps limited number of names. So it's
safe only per few simple tests, but can't be used anymore
after some function calls..
(Fixes 00e641ef37)
Call _stats_regions_destroy() from dm_stats_list() if dms->regions
is non-NULL. This avoids leaking any pool allocations and ensures
the handle is in a known state: if an error occurs during the list,
dms->regions will be NULL and the handle will appear empty.
It could be actually better to use even cache origin in
read-only mode so there could no be some 'acidental'
change being done on this volume.
This however need further tools enhancment - where we would need
to handle whole subtree on 'lvchange -pr/-prw'.
When command calls backup() more then once (which is actually not
wanted) this warning message is shown repeatedly:
"WARNING: This metadata update is NOT backed up."
Instead now print message just once and less confuse user.
Add this functionality to lvconvert:
'lvconvert --thin cachedLV --thinpool vg/poll'
Converts cachedLV to external origin (which will be read-only).
New thin volume is created in thinpool LV and it's using external
origin as source for unprovisioned chunks.
This conversion happens online (while volume is in use).
Thin LV remains fully writable.
Cached external origin no longer could be written so cache will be used
ONLY for read operations. For this limitation we require cache mode
to be writethrough (as writeback cannot write to read-only volumes).
When thinLV is later removed cached external origin is again
fully usable, just note, LV remain in 'read-only' mode.
When read-write is needed, 'lvchange -prw' has to be used.
Single external origin could be user by multiple thinLV in
multiple differen thin pool.
When cache volume may be converted from normal to -real layer LV
we need to improve logic for call cache_check.
With this patch, we register call for cache_check only when metadata LV
is not yet present in active table slot (should match initial table
load).
This avoids unwanted checking when cache would become layer device
online.
The system is likely in some very inconsisten state.
Do not try to make it even more problematic with trying
to invoke tools like thin_check via callback.
External origin could be reloaded via more locks.
It's actually even more complex then thin-pool,
as it may be active on more nodes for linear LVs
(and maybe even more types).
External origin is always read-only thus unmodifiable
device so there should not be a problem accesing it
through multiple nodes.
Also for thin-pool check first presence of active thin-pool.
FIXME:
It's not easy to detect on which nodes this device is active
Thus manipulation with such device may require checking every
node and it active state and refresh.
But since such setup is quite complex to prepare and use,
hopefully there are not user trying to 'explore' this usage yet.
To be ready to show status of cache volume, call the status
with layer. Layer is automatically detected in this case when
cache volume is used in 'layered' form (needs -real suffix).
Avoid printing misleading message about single dirty block.
Instead properly detect condition where the 'cleaner' policy
needs to be installed without 'overloading' dirty variable.
Also print warning if we would be clearing read-only volume.
(it really shouldn't happen).
External origin could be activated as stand-alone device.
When the last thin LV is removed, external origin is no longer
the external origin and it's layer property was dropped.
Ensure dm table is correct by reloading external origin
(when it's active).
When LV is external origin, show info for LV but
status for -layer. So we expose more info to a user
as otherwise active external origin is only linear
mapping of -real layer.
We do the same for i.e. old snaphost origin.
Activation of raid has brough up also splitted image with tracing
(without taking lock for this).
So when raid is now activate - such image is not put into
table (with _rmeta). When user needs such device, just active it.
When --count=0 interval numbers are miscalculated:
Interval #18446744069414584325 time delta: 999920887ns
Interval #18446744069414584325 current err: -79113ns
End interval #18446744069414584325 duration: 999920887ns
This is because the command line argument is cast through the
uint32_t type, and fixed to UINT32_MAX:
_count = ((uint32_t)_int_args[COUNT_ARG]) ? : UINT32_MAX;
We also need to handle --count=0 specially when calculating the
interval number: since intervals count from #1, this must account
for the implicit "minus one" when converting from zero to the
UINT64_MAX value used (which is too large to store in _int_args).
The time management code mixes tests of the _timer_fd value with
code that should be timer agnostic: this causes problems for users
of the usleep() timer, since it cannot properly detect the start
of a new interval:
Beginning first interval
Interval #18446744069414584348 time delta: 1000000000ns
Interval #18446744069414584348 current err: 0ns
End interval #18446744069414584348 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
[...]
Beginning first interval
Interval #18446744069414584349 time delta: 1000000000ns
Interval #18446744069414584349 current err: 0ns
End interval #18446744069414584349 duration: 1000000000ns
Adjusted sample interval duration: 1000000000ns
Separate these out, by defining a _timer_running() call that each
timer implements, and only define _timer_fd if we are compiling
with TIMERFD enabled.
Although the usleep() interval timer is not used if the Linux
TIMERFD interface is available it should still provide reasonably
good timing.
Instead of trying to estimate the error from the duration of the
last sleep, peg it to the start time of the program, and use the
value of ((start_time - now) % interval) to correct the current
interval duration.
This always pulls us back into sync at the end of each interval,
rather than relying on trying to incrementally adjust the time
duration at each interval start.
This greatly reduces drift when the usleep() clock is used.
The dm_bit_copy() macro uses the source (bs1) bitset size as the
limit for memcpy:
memcpy((bs1) + 1, (bs2) + 1, ((*(bs1) / DM_BITS_PER_INT) + 1)..)
This is safe if the destination bitset is smaller than the source,
or if the two bitsets are of the same size.
With a destination that is larger (e.g. when resizing a bitmap to
add more capacity), the memcpy will overrun the source bitset and
set garbage bits in the destination.
There are nine uses of the macro currently (8 in libdm/regex, and
1 in daemons/cmirrord): in each case the two bitsets are always of
equal size so the behaviour is unchanged.
Fix the macro to use bs2's size to simplify resizing bitsets and
avoid the need for another copy macro.
Commit 0690392040 revealed a problem
in raid metadata manipulation.
We do two operations in one table reload:
- raid leg/image extraction
- rename remaining raid legs
This should be made in separate steps. Otherwise we do an
uncorrectable table change on error path (leaving tables
for admin and dmsetup).
As a hotfix - restore the previous logic and use a single
new function _lv_update_and_reload_list which activates exclusively
extracted LVs on the list before resuming suspended raid LV.
This restore 'rename' functionality upon resume.
Also still preserve the 'origin_only' logic - although we know
it's not working properly for cluster and LV stacking.
Further fixes are needed.
If FIEMAP returns a single extent after the first call, no extent
boundary is detected and the first extent is not counted by the
normal mechanism.
In this case, increment nr_extents at the same time the extent is
added to the region table, before returning.
backup is not 'tested' for success and also it should
actually happen just when command is finished.
We do not target to make backups with each inter-step
metadata change.
RAID is LV property
TODO: only 2 flags are seg->status: PVMOVE & MERGING
At least the second one should be soon elimanted as again
we merge LV not a segment.
This is another place for 'common' use pattern or
reload and activation of deleted devices.
(Moving the exclusive activation to _deactivate_and_remove_lvs()).
TODO: looks like halve of raid function is reloading
just 'origin' - and the other full LV.
It's useful to be able to specify a minimum number of bits for a
new bitmap parsed from a list, for e.g. to allow for expansing a
group without needing to copy/reallocate the bitmap.
Add a backwards compatible symbol for programs linked against old
versions of the library.
It is sometimes convenient to iterate over the set bits in a dm
bitset in reverse order (from the highest set bit toward zero), or
to quickly find the last set bit.
Add dm_bit_get_last() and dm_bit_get_prev(), mirroring the existing
dm_bit_get_first() and dm_bit_get_next().
dm_bit_get_prev() uses __builtin_clz when available to efficiently
test the bitset in reverse.
Add a macro for the clz (count leading zeros) operation.
Use the GCC __builtin_clz() for clz() if it is available and fall
back to a shift based implementation on systems that do not set
HAVE___BUILTIN_CLZ.
Split out the loop that iterates over each batch of FIEMAP
extent data from the function that sets up and calls the ioctl
to reduce nesting and simplify local variable use:
_stats_get_extents_for_file()
-> _stats_map_extents()
The _stats_map_extents() function is responsible for detecting
eof and extent boundaries and adding whole, allocated extents
to the file extent table for region creation.
Check that all region_id values specified in a group bitmap are
actually present: although this should not normally happen when
using the dmstats tool, it is possible as a result of manual
changes (or bugs) for a group descriptor to contain one or more
group_id values that do not exist.
Check for this situation when reading group descriptors, warn
the user the user, and clear these bits in the bitmap when
formatting it for output.
If a region has a a DMS_GROUP tag in aux_data where the first
region_id in the bitmap is not the same as the containing region,
dmstats will segfault:
# '2' is never a valid group bitset list for region_id == 0
# dmsetup message vg_hex/root 0 "@stats_set_aux 0 DMS_GROUP=img:2#"
# dmsetup message vg_hex/root 0 "@stats_list"
0: 45383680+16384 16384 dmstats DMS_GROUP=img:2#
1: 46071808+32768 32768 dmstats -
2: 47382528+16384 16384 dmstats -
# dmstats list
Segmentation fault (core dumped)
The crash will occur in some arbitrary dm_stats_get_* property
method - this happens while processing the 1st region_id in the
bitset, because the region is marked as grouped, but there is
no group bitmap present at dms->groups[2]->regions.
Fix this by detecting a mismatch between the expected region_id
and dm_bit_get_first() for the parsed bitset during
_parse_aux_data_group().
Handle files that contain multiple logical extents in a single
physical extent properly:
- In FIEMAP terms a logical extent is a contiguous range of
sectors in the file's address space.
- One or more physically adjacent logical extents comprise a
physical extent: these are the disk areas that will be mapped
to regions.
- An extent boundary occurs when the start sector of extent
n+1 is not equal to (n.start + n.length).
This requires that we accumulate the length values of extents
returned by FIEMAP until a discontinuity is found (since each
struct fiemap_extent returned by FIEMAP only represents a single
logical extent, which may be contiguous with other logical
extents on-disk).
This avoids creating large numbers of regions for physically
adjacent (logical) extents and fixes the earlier behaviour which
would only map the first logical extent of the physical extent,
leaving gaps in the region table for these files.
To be able to detect lvm2 command is not leaking some
'unexpected' device - remove all devices before
test exits by its own command so test teardown
now can check what was 'left' unexpectedly.
Fix order of operation when converting raid1 into old mirror.
Before any later metadata modification are initiated prepare
mirror_log device with all clearing.
Then directly convert raid1 into mirror with mirror_log.
This convertion now properly see as precommitted metadata
new 'mirror' and committed old 'raid' and is able to
preload all LVs.
When mapping regions to a file descriptor, a temporary table of
extent descriptors is built using the dm_pool object building
interface.
Previously this use borrowed the dms->mem region and counter
table pool (since nothing can interleave with the allocation
while the caller is still in dm_stats_create_regions_from_fd()).
This turns out to be problematic for error recovery. When a
region creation operation fails partway through file mapping,
we need to roll back the set of already created regions and
this requires a listed handle: the dm_stats_list() will then
allocate from the same pool as the extents; we either have
to throw away valid list data, or leak the extent table, to
return the handle in a valid state.
Avoid this problem by creating a new, temporary mem pool in
_stats_create_file_regions() to hold the extent data, and
discarding it on exit from the function.
While cleaning up the table of already created regions during a
failed dm_stats_create_regions_from_fd(), list the handle once,
and call _stats_delete_region() directly. This avoids sending a
@stats_list message for each region deleted, reducing runtime
from 6s to 0.7s when cleaning up ~250 out of ~10000 regions:
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
<< pauses here >>
Command failed
real 0m6.267s
user 0m3.770s
sys 0m2.487s
# time dmstats create --filemap b.img
device-mapper: message ioctl on (253:0) failed: Cannot allocate memory
Failed to create region 246 of 309 at 9388032.
Could not create regions from file /root/b.img
Command failed
real 0m0.716s
user 0m0.034s
sys 0m0.581s
Testing the error path requires region creation to start to
fail part way through the operation (in order to have regions
to clean up): the simplest way is to ensure the system is
close to the kernel limit of 1/4 RAM or 1/2 vmalloc space
consumed by dmstats data.
Split dm_stats_delete_region() so that internal callers can manage
the handle state themselves.
dm_stats_delete_region() now just handles checking the state of the
handle, reporting validation errors, and calling dm_stats_list() if
necessary, before calling _stats_delete_region().
The new _stats_delete_region() function performs the actual group
member removal and region deletion, and requires a fully listed
handle to operate.
Callers that repeatedly delete regions can use a single listed
handle for many operations on the same device, avoiding one
message ioctl per region deleted: since @stats_list with many
regions is expensive, this yields large runtime improvements.
If we fail to create a region during dm_stats_create_regions_from_fd(),
we must remove all regions that were created to do this to date. This
needs to loop over the table of region_id values that were populated
by _stats_create_file_regions() before the error.
The code for this failure case in the out_remove branch incorrectly
uses the table index as the region_id:
for (--i; i != DM_STATS_REGION_NOT_PRESENT; i--) {
if (!dm_stats_delete_region(dms, i))
log_error("Could not delete region " FMTu64 ".", i);
}
This causes the cleanup code to delete a completely unrelated set
of regions (since the index here will always be nr_regions..0).
Fix it to pass the actual region_id stored in regions[i] instead.
Fix a silly bug in dm_stats_delete_region() that hugely inflates
runtimes when deleting a large number of regions.
For ~50,000 regions this change reduces the runtime from 98s to
6s on my test systems (a ~93% reduction).
The bug exists because dm_stats_delete_region() applies a truth
test to the return value of dm_stats_get_nr_areas(); this is
never correct usage - it will walk the entire region table and
calculate area counts for each region (which is roughly O(n^2)
in the number of regions, as dm_stats_delete_region() is being
called inside a region walk).
Although the individual area calculation is not that costly,
uselessly running anything 2,500,000,000 times over gets a bit
slow.
A much cheaper test (which is always true if the areas check is
true) is to just test dm_stats_get_nr_regions() or dms->regions;
if either is true it implies at least one area exists.
Old:
Performance counter stats for 'dmstats delete --allregions --alldevices':
98117.791458 task-clock (msec) # 1.000 CPUs utilized
127 context-switches # 0.001 K/sec
3 cpu-migrations # 0.000 K/sec
6,631 page-faults # 0.068 K/sec
307,711,724,562 cycles # 3.136 GHz
544,762,959,577 instructions # 1.77 insn per cycle
84,287,824,115 branches # 859.047 M/sec
2,538,875 branch-misses # 0.00% of all branches
98.119578733 seconds time elapsed
New:
Performance counter stats for 'dmstats delete --allregions --alldevices':
6427.251074 task-clock (msec) # 1.000 CPUs utilized
6 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
6,634 page-faults # 0.001 M/sec
21,613,018,724 cycles # 3.363 GHz
3,794,755,445 instructions # 0.18 insn per cycle
852,974,026 branches # 132.712 M/sec
808,625 branch-misses # 0.09% of all branches
6.428953647 seconds time elapsed
Simplify info run for use only for INFO & STATUS.
Drop handling MKNODES within _info_run() call
and use more advanced _setup_task_run() directly.
This allows to further simplify _info_run().
Integrate also query for inactive table and
handle dm_task_run() and dm_task_get_info()
(thus switching to setup_task_run)
Add one exception case for DM_DEVICE_TARGET_MSG.
This allows further shortening and simplification of all
other users of this function.
It's actually not needed to call extra lv_has_target_type() to detect
snapshot merge is in progress - decode this right during status
capturing and save even few extra ioctl calls.
Drop LV from passed API arg - it's always segment being checked.
Also use_layer is now in full control of lv_info_with_seg_status().
It decides which device needs to be checked to get 'the most info'.
TODO: future version should be able to expose status from
Start moving selection of status taken for a LV into a single place.
The logic for showing info & status has been spread over multiple
places and were doing too complex decision going agains each other.
Unify selection of status of origin & cow scanned device.
TODO: in future we want to grab status for LV and layered LV and have
both statuses present for display - i.e. when 'old snapshot'
of thinLV is takes and there is ongoing merge - at some moment
we are not capable to show all needed info.
When lvm2 wants to see a status, it needs to validate,
segment for status reading is matching whan lvm2 expects in
metadata.
Also ensure status failure will not cause '0' from info reading
when actual info was collected properly.
Failure in 'status' reading is considered to be
a 'log_warn()' event only.
When we can't parse status, switch to warning as this is not
considered an errornous case. LVS is not supposed to return
error status code when device is not what it's been expected to
be - but it should be WARNING a user there is something unexpected.
Convert lvs -o lv_merge_failed,lv_snapshot_invalid to use
lv_info_and_status function.
This makes it equal to attr value showing this info
(as they were different since they were derived from
different data set and different logic as well).
Also saves couple extra ioctl that were needed to obtain this info.
When displaying <reporting_command> -o help, we'd like to have fields
grouped nicely, not starting having groups interleaved as it was before.
The code that displays the help output for fields takes the order as
written in columns.h file - this caused output like:
$ lvs -o help
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Fields
---------------------
...field list...
Instead, let's have it without groups interleaved which may be
a bit confusing, so:
Logical Volume Fields
---------------------
...field list...
Logical Volume Device Status Fields
-----------------------------------
...field list...
Logical Volume Device Info and Status Combined Fields
-----------------------------------------------------
...field list...
..and so on.
Use LVM_DBUSD_TEST_MODE env variable to customize what we test.
Default is the same where we try to test all combinations of all
modes. Renamed to make it consistent with the other env variables
that are used in the unit test.
- We check that all properties match the introspection data. We
don't verify values for every property as only lvm knows what they
should be.
- We are testing vg.Move
Added a properties changed signal on the job dbus object so that client
can wait for a signal that the job is complete instead of polling or
blocking on the wait method.
Allows the user to override the number of commands that get dumped
to the log when we encounter a lvm error. Also useful during
development when you don't want to see the blackbox output.
In case any SubLV of a RaidLV transiently fails, it needs
two "lvchange --refresh RaidLV" runs to get it to fully
operational mode again. Reason being, that lvm reloads all
targets for the RaidLV tree but doesn't resume the SubLVs
until after the whole tree has been reloaded in the first
refresh run. Thus the live mapping table of the SubLVs
still point to an "error" mapping and the dm-raid target
can't retrieve any superblock from the MetaLV(s) in processing
the constructor during this preload thus not discovering the
again accessible SubLVs. In the second run, the SubLV targets
map proper (meta)data, hence the constructor discovers those
fine now.
Solve by resuming the SubLVs of the RaidLV before
preloading the respective top-level RaidLV target.
Resolves: rhbz1399844
When reading data from stdout & stderr we were reading until the
reading until we got None back which really isn't needed as the
read will return everything that is available.
Avoid code duplication and use exiting commonly used
lv_update_and_reload() function.
There is still one place left where mirror is doing strange
double suspend call - needs there more thinking what's wrong with
that code.
When lvconvert adds a new leg - it's doing it free 'temporary' image
layer - however this temporary 'internal' mirror is also MIRRORED LV.
But the status bit was not properly transfered through layer.
We need to acquire a lock which can block us which in turn causes
the dbus request handling to block as well. Place the request on
the work queue instead.
Our expectation was that when using the lvm shell that when the lvm prompt
was read from stdout, that all other ouput had been written and flushed.
However, this doesn't appear to be the case. Add extra read passes to
retrieve delayed report data.
The default dbus python library mode of operation is to leverage
introspection. However, this introspection data isn't accessible
for users of the library and they have to specifically retrieve
the introspection data too. This resulted in many introspection
calls being made. This change eliminates introspection calls if
we are testing multiple concurrent test clients. If it's a single
client we will leverage a reduced amount of introspection data to
verify the introspection data is correct. Typically clients don't
leverage introspection data nearly as much as this test client.
The env variable LVM_DBUSD_PV_DEVICE_LIST when present and filled in
with at least 4 physical devices will run concurrently with other
instances running as long as they specify different devices in their
env variable.
When the env variable is not present the test runs as it did before.
In preparation to have more than one thread issuing commands to lvm
at the same time we need to serialize updates to the dbus state and
retrieving the global lvm state. To achieve this we have one thread
handling this with a thread safe queue taking and coalescing requests.
Looks like this isn't support across versions. Need to add functionality
to service to return the supported segment types, so we only use the
supported ones.
There are two possible errors in _dm_stats_populate_region():
* No region struct in dms->regions[region_id]
* Failure to parse data from @stats_print
These have very different causes: the first occurs where a client
program is populating one region at a time (region_id is a single
region identifier), and has not previously called dm_stats_list()
to dimension the region tables; this is an API usage error.
The second occurs when either we read unparseable data from the
kernel (kernel bug), or where various resource allocations fail.
Separate these two cases out and log separate messages for each
(allocation failures in the path already have their own distinct
message), since the "failed to parse.." message in the un-listed
handle case is confusing and misleading.
Do not emit warning message but only log debug message if
lvm2-lvmdbusd.service unit is missing and at the same time
we have global/notify_dbus=1 (which is used by default if we
configured sources with "--enable-notify-dbus"). We don't want
hard dependency between LVM2 and lvmdbusd so it's enough to log
only debug message in this case.
0 interval leads as of now to a busy loop with lvmetad and command.
Avoid testing this patological case.
TODO: Code should possibly translate zero interval into some small
sleep. With lvmpolld it's already 1/10s
Add new targets:
make check_lvmpolld
make check_cluster_lvmpolld
make check_lvmetad_lvmpolld
make check_all_lvmpolld
So check_lvmetad runs only base lvmetad test - to much
logic of remaining targets.
Previous behavior is available via check_all_lvmpolld.
Make it easier to replace missing segments with 'zero' returning
target - otherwise user would have to create some extra target
to provide zeros as /dev/zero can't be used (not a block device).
Also break code loop when segment is found and make it an INTERNAL_ERROR
where it's missing.
Instead of clearing multiple rmeta device with sequential activation
process and waiting for udev for every _rmeta device separately,
activate all _rmeta devices first and then clear them and deactivate
afterwards.
Also update some tracing messages.
When anyhing goes wrong during clearing process, always try to
deactivate as much _rmeta devices as possible before fail.
This code is no longer needed because the back ground task has been
removed. Will add back if we change the design and end up utilizing
multiple worker threads.
There is no reason to create another background task when the task that
created it is going to block waiting for it to finish. Instead we will
just execute the logic in the worker thread that is servicing the worker
queue.
Translate log_info() into log_very_verbose() which is macro
supposed to be used by our code.
log_info() is internal macro with eventually some 'symbolic' meaning
in syslogging daemons.
Instead of compiling 2 log call for 2 different logging functions,
and runtime decide which version to use - use only 'newer' function
and when user sets his own OLD dm_log logging translate it runtime
for old arg list set.
The positive part is - we get shorter generated library,
on the negative part this translation means, we always have evaluate
all args and print the message into local on stack buffer, before
we can pass this buffer to the users' logging function with proper
expected parameters (and such function may later decide to discard
logging based on message level so whole printing was unnecessary).
Ensure different logging function for dmeventd.c logging
and dm and lvm library.
We can recognize we want to show every log_info() and
log_notice() message from dmeventd.c code while not
exposing those from libdm/libdevmapper-event
Also switch to use log with errno - it's not changing
anything and doesn't bring any more features yet to dmeventd
logging but we just properly pass dm_errno_or_class properly
through the whole code stack for possible future use
(i.e. support of class logging for dmeventd).
Reword the logging logic and try to restore previous logging
behavior for 'standalone' running daemon while preserving
debuggable feautures it has gained.
So actual rules:
dmeventd without any '-d' option will syslog all messages
from dmeventd.c it dmeventd plugins.
log_notice()==log_verbose()
log_info()==log_very_verbose()
But to show also log_debug() used has to give '-ddd'.
When user specified '-d, -dd, -ddd, -dddd' it
will also enable tracing of messages from libdm & lib
executed code - which is mainly useful for testing
i.e.: 'dmeventd -fldddd'
Introduce macros:
log_level(), log_stderr(), log_once(), log_bypass_report()
For easier and more consisten way how to 'decoder' bits
of info from passed 'level'.
This patch fixes potential problem when 'level' of message
might not have always masked right bits.
Instead of creating a thread to handle the case where a client
is calling job.Wait, we will utilize a timer. This significantly
reduces the number of threads that get created and destroyed while
the service is running.
We will fetch the lvm state in non-main thread and only process the new
data with the main thread to prevent hanging the main thread event loop.
ref. https://bugs.freedesktop.org/show_bug.cgi?id=98521
pvscan --cache -aay was activating LVs in exported VGs
when it should not.
It appears that this was a regression from commit 9b640c3684
"pvscan: use process_each_vg for autoactivate".
If blkdeactivate finds out that the device on top of device stack
is already unmounted, it still proceeds with device stack deactivation
underneath now.
This situation can happen if blkdeactivate is started and the mount
point is unmounted in parallel by chance (so when blkdeactivate
gets the the actual umount call, the device is not mounted anymore).
Before, the blkdeactivate added such device to skip list which caused
all the stack underneath to be skipped too on deactivation. Now, we
proceed just as if blkdeactivate did the umount itself.
For example, in the example below, the vg-lvol0 is mounted on /mnt/test
when blkdeactivate is called, but it gets unmounted in parallel later
on when blkdeactivate gets to the actual umount call.
Before this patch (vg-lvol0 underneath not deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... skipping
With this patch applied (vg-lvol0 underneath still deactivated):
$ blkdeactivate -u
Deactivating block devices:
[UMOUNT]: unmounting vg-lvol0 (dm-2) mounted on /mnt/test... already unmounted
[LVM]: deactivating Logical Volume vg/lvol0... done
(Automatic) repair may not be allowed during the initial sync of an upconverted
linear LV, because the data on the failing, primary leg hasn't been completely
synchronized to the N-1 other legs of the raid1 LV (replacing failed legs during
repair involves discontinuing access to any replaced legs data, thus preventing
data recovery on the primary leg e.g. via dd_rescue).
Even though repair would not cause data loss when adding legs to a fully synced
raid1 LV, we don't have information yet defining this state yet (e.g. a raid1
LV flag telling the fully synchronized status before any legs were added),
hence can't automatically decide to allow to repair.
If nonetheless a repair on a non-synced raid1 LVs is intended, the "--force"
option has to be provided.
Resolves: rhbz1311765
Validate kernel support for raid0/raid4 on given and
requested segtype before requesting conversions on them.
Because raid10 wasn't present in old RAID targets, add
the same validation to be prepared once we support them.
Check for dm-raid target version with non-standard raid4 mapping expecting the dedicated
parity device in the last rather than the first slot and prohibit to create, activate or
convert to such LVs from striped/raid0* or vice-versa in order to avoid data corruption.
Add related tests to lvconvert-raid-takeover.sh
Resolves: rhbz1388962
On conversions between striped/raid0* and raid4, the kernel expects
the dedicated raid4 parity SubLVs in the first segment area rather than
in the last it's been allocated to, thus the data mapping ain't proper.
Enhance lvconvert (lib/metadata/raid_manip.c) to shift the dedicated
parity SubLVs on conversions from striped/raid0* to raid4 and vice-versa.
In case of raid0_meta -> raid4 where the MD raid0 personality already has
stored RAID array device positions in the superblocks, the MetaLVs have to
be cleared so that the kernel doesn't fail validating the array positions
after lvm has shifted them up by one.
Add more tests to lvconvert-raid-takeover.sh including one to check for
mapping flaws by converting a created raid4 with filesystem -> striped
and fsck it.
Whilst on it:
- add missing direct striped -> raid4 conversion to the takeover array
to avoid an intermim conversion from striped -> raid0*
- clean up the takeover array
- allow lvconvert to actually call lv_raid_convert() on all takeover requests
in order to check parameters and display messages provided by takeover
functions rather than just "...not supported" from within lvconvert
- fix a typo
Resolves: rhbz1386148
If a device disappears after obtaining the list of devices but before
processing it as a member of that list, dmsetup exits with a failure code.
Most commands still produce what output they can in these circumstances,
but 'ls --tree' and 'info -c' with fields depending on device dependencies
didn't. Change this.
Keep for now function logic making its decision on string content.
We need bigger patch converting all things to bit-checks later.
This needs however bigger refactoring.
So this commit reverts some changes from:
c8b6c13015
Commit 088b3d036a allowed repair on cache origin RAID LVs
and restricted lvconvert actions on RAID SubLVs to change number of mirrors, repair,
replace and type changes in order to avoid unsuitable coversions on them.
This introduced a regression prohibiting --splitmirrors on any RAID SubLVs
(e.g. of cache or thin LVs; lvconvert-{cache,thin}-raid.sh tests failing).
Fix allows split mirrors again.
Fix some indenting whilst on it.
When we have already decoded arg_is_set into a local var
or already set segment type - already use these
values instead of repeating calls and string checks.
Seems some error path where not converted to 'new' ECMD return value.
Fix them to always 'goto out'.
Also drop unneeded 'ret = 0' when ret already is 0.
This test never passes on loop back, so we will skip unless the
pv devices are real devices which contain `/dev/sd`.
We always fail because we need lvm to run slow to get a timer to
pop, and loopback are too fast.
If you run multiple runs of unittest.main, unless you don't pass exit=true
the test case always ends with a 0 exit code. Add ability to store the
result of each invocation of the test and exit with a non-zero exit code
if anyone of them fail.
In case a RAID orig LV is being cached and fails, repair is impossible because
"lvconvert --repair" gets rejected.
Fix by allowing repair on cache orig RAID LVs and
"lvconvert --replace/--mirrors/--type {raid*|mirror|striped|linear}" as well.
Allow the same lvconvert actions on any cache pool and metadata RAID SubLVs.
Resolves: rhbz1380532
The following LvCommon properties were added so that the API
would have the same functionality as lvm2app has.
LvCommon.MetaDataSizeBytes
LvCommon.Attr
LvCommon.MetaDataPercent
LvCommon.CopyPercent
LvCommon.SnapPercent
LvCommon.SyncPercent
Fix missing wait so we have paired waiting.
Also 'wait' for precise PID to get 'exit' code.
Test for 'error' replacing only with newer snapshot targets.
The old one will wait for resume.
Note: 'wait -n' is not always available so can't be used..
Integrate back _unblock_sigalrm() and check for error code of
pthread_sigmask() function so we do not use uninitialized
sigmask_t on error path (Coverity).
When a PV device is missing lvm will return '[unknown]' for the device
path. The object manager keeps a hash table lookup for uuid and for PV's
device name. When we had multiple PVs with the same device path we
we only had 1 key in the table for the lvm id (device path). This caused
a problem when the PV device transitioned from '[unknown]' to known as any
subsequent transitions would cause an exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 66, in run_cmd
result = self.method(*self.arguments)
File "/usr/lib/python3.5/site-packages/lvmdbusd/manager.py", line 205, in _pv_scan
cfg.load()
File "/usr/lib/python3.5/site-packages/lvmdbusd/fetch.py", line 24, in load
cache_refresh=False)[1]
File "/usr/lib/python3.5/site-packages/lvmdbusd/pv.py", line 48, in load_pvs
emit_signal, cache_refresh)
File "/usr/lib/python3.5/site-packages/lvmdbusd/loader.py", line 80, in common
cfg.om.remove_object(cfg.om.get_object_by_path(k), True)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 153, in remove_object
self._lookup_remove(path)
File "/usr/lib/python3.5/site-packages/lvmdbusd/objectmanager.py", line 97, in _lookup_remove
del self._id_to_object_path[lvm_id]
KeyError: '[unknown]'
when trying to delete a key that wasn't present. In this case we don't add a
lookup key for the device path and the PV can only be located by UUID.
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1379357
Works if the pool is inactive.
Activation code doesn't notice a new raid dependency in on-disk metadata
when a thin LV is already active.
https://bugzilla.redhat.com/1365286
The dm_stats_delete_region() call removes a region from the bound
device, and, if the region is grouped, from the group leader
group descriptor stored in aux_data.
To do this requires a listed handle: previous versions of the
library do not since no dependencies exist between regions without
grouping.
This leads to strange behaviour when a command built against an old
version of the library is used with one supporting groups. Deleting
a region with dmstats succeeds, but logs errors:
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
vg_hex-root 2 2.00g 1.00g 1 1.00g dmstats
# dmstats delete --regionid 2 vg_hex/root
Region ID 2 does not exist
Could not delete statistics region.
Command failed
# dmstats list
Name RgID RgSta RgSiz #Areas ArSize ProgID
vg_hex-root 0 0 1.00g 1 1.00g dmstats
vg_hex-root 1 1.00g 1.00g 1 1.00g dmstats
This happens because the call to dm_stats_delete_region() is inside
a dm_stats_walk_*() iterator: upon entry to the call, the iterator
is at its end conditions and about to terminate. Due to the call to
dm_stats_list() inside the function, it returns with an iterator at
the beginning of a walk and performs a further iteration before
exiting. This final loop makes a further attempt to delete the
(already deleted) region, leading to the confusing error messages.
The current dmsetup.c handles DR_STATS and DR_STATS_META reports
separately in _display_info_cols(), meaning that the stats walk
functions are never called for these report types.
Versions before v2.02.159 have a loop using dm_stats_walk_do() and
dm_stats_walk_while(), that executes once for non-stats reports,
and once per region, or area, for DR_STATS/DR_STATS_META reports.
This older behaviour relies on the documented behaviour that the
walk functions will accept a NULL pointer as the struct dm_stats*
argument.
This was broken by commit f1f2df7b: the NULL test on dms and
dms->regions were incorrectly moved from the dm_stats_walk_end()
wrapper to the internal '_stats_walk_end()' helper.
Since the pointer is dereferenced in between these points, using
an older dmsetup with current libdm results in a segfault when
running a non-stats report:
# dmsetup info -c vg00/lvol0
Segmentation fault (core dumped)
Restore the NULL checks to the wrapper function as intended.
We shouldn't be losing pvscans just because of the fact that the
underlying device (PV) appears and disappears quickly in the system,
otherwise lvmetad may not see the device if it appears again (or it may
still keep the device in cache even it's already gone).
We added lightweight toolcontext handle to avoid useless initialization
of some parts of the context and also to avoid problems when using the
handle very soon at system boot, like in lvm2-activation-generator
through lvm2app interface. However, we missed reading all the other
config sources like lvmlocal.conf as well as any tag config - we need to
read these too to get the final config value which may be overriden in
any of these additional config sources.
Currently, we use this lightweight toolcontext handle to read
global/use_lvmetad and global/use_lvmpolld config values in
lvm2-activation-generator using lvm2app interface (lvm_config_find_bool
lvm2app function).
Pre 1.9 dm-raid targets status output was racy, which caused
the device status chars to be unreliable _during_ synchronization.
This shows paritcularly with tiny test devices used.
Enhance lvchange-rebuild-raid.sh to not check status
chars _during_ synchronization. Just check afterwards.
Though I'm not quite sure why we push this limit on user,
current --repair logic requires user to wait outside of command.
TODO: I'm not quite sure this repair logic is 'the most wanted'.
Make sure that the temporary dm_histogram used for the bounds
argument is freed in the case that the user provided a --bounds
argument on the command line.
The dm-raid target now rejects device rebuild requests during ongoing
resynchronization thus causing 'lvconvert --repair ...' to fail with
a kernel error message. This regresses with respect to failing automatic
repair via the dmeventd RAID plugin in case raid_fault_policy="allocate"
is configured in lvm.conf as well.
Previously allowing such repair request required cancelling the
resynchronization of any still accessible DataLVs, hence reasoning
potential data loss.
Patch allows the resynchronization of still accessible DataLVs to
finish up by rejecting any 'lvconvert --repair ...'.
It enhances the dmeventd RAID plugin to be able to automatically repair
by postponing the repair after synchronization ended.
More tests are added to lvconvert-rebuild-raid.sh to cover single
and multiple DataLV failure cases for the different RAID levels.
- resolves: rhbz1371717
Commit 199697accf rerouted funtion
for priting cache volume origin to lvm2app app function - which
however had a bug. So restore the original functionality
and print correct LV as cache origin LV.
Gris debugged that when we don't have a method the introspection
data is missing the interface itself eg.
<interface name="<your_obj_iface_name>" />
When adding the properties to the dbus object introspection we will
add the interface too if it's missing. This now allows us the
ability to have a dbus object with only properties.
Because of the different code paths we need to test job handling with
all operations. Test now runs virtually everything with timeout == 0
and timeout == 15 so that we test both use cases.
Note: If a client passes -1 for the timeout value they need to make
sure that their connection time out is also infinite. Otherwise, if the client
times out the service side hangs any new dbus calls until the job
that is in progress completes. Not sure why this behavior is occuring
at this time, but it appears a limitation/bug of the dbus-python library.
When we register a failure we need to use a valid value which will be
returned with the object manager. Otherwise we will raise an Exception
because we are trying to construct an object path from None.
The methods were returning an instance of the object instead of the
object path which was causing an exception when the result was returned
with the job object as we are explicity trying to return an object path.
Unit test added which re-creates the issue and verifies the fix.
Add simple helper/wrapper check function to check result
of dmsetup call i.e.:
check grep_dmsetup table vg-lv "grep_expected"
check grep_dmsetup status vg-lv -v "grep_unexpected"
Reload of thin-pool origin_only is designed to only post messages
to a thin-pool. It's not intended to be used for reload of thin-pool
table. Fix it by using standard call 'lv_update_and_reload()'.
Unconditionally guard there is at least 1/4 of metadata volume
free (<16Mib) or 4MiB - whichever value is smaller.
In case there is not enough free space do not let operation proceed and
recommend thin-pool metadata resize (in case user has not
enabled autoresize, manual 'lvextend --poolmetadatasize' is needed).
In the case there is no active thin volume, report thin pool
as lock holder. This fixed function like lvextend
which either expecte lock holder LV is some active thin
or 'possibly' inactive thin pool.
The existing code doesn't understand that mirror logs should cling to
parallel LVs (like extending them) instead of avoiding them.
As a quick workaround to avoid lvcreate failures, hard-code
--alloc normal for mirror logs even if the rest of the allocation
used a stricter policy.
https://bugzilla.redhat.com/show_bug.cgi?id=1376532
When rescanning a VG from disk, the metadata read from
each PV was compared as a sanity check. The comparison
is done by exporting the vg metadata from each dev to
a config tree, and then comparing the config trees.
The function to create the config tree inserts
extraneous information along with the actual VG metadata.
This extra info includes creation_time. The config
trees for two devs can easily be created one second
apart in which case the different creation_times would
cause the metadata comparison to fail. The fix is to
exclude the extraneous info from the metadata comparison.
Correction for aux test result ([] -> if;then;fi)
Use issue_discard to lower memory demands on discardable test devices
Use large devices directly through prepare_pvs
I'm still observing more then 0.5G of data usage through.
Particullary:
'lvcreate' followed by 'lvconvert' (which doesn't yet support --nosync
option) is quite demanging, and resume returns quite 'late' when
a lot of data has been already written on PV.
Reinstantiate reporting of metadata percent usage for cache volumes.
Also show the same percentage with hidden cache-pool LV.
This regression was caused by optimization for a single-ioctl in
2.02.155.
Allow RAID scrubbing on cache origin sub-LV
This patch adds the ability to perform RAID scrubbing on the cache
origin sub-LV (https://bugzilla.redhat.com/1169495). Cache origin
operations are restricted to non-clustered RAID LVs until there can
be further testing in a cluster (even for exclusive activation).
User can either specify directly _corig LV
or he can specify cache LV and operation --syncation is
passed ONLY to _corig LV.
If users wants to manipulation with cache-pool devices - he
needs to specify this object name.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Older udev versions (udev < v165), don't have the official
udev_device_get_is_initialized function available to query for
device initialization state in udev database. Also, devices don't
have USEC_INITIALIZED udev db variable set - this is bound to the
udev_device_get_is_initialized fn functionality.
In this case, check for "DEVLINKS" variable instead - all block devices
have at least one symlink set for the node (the "/dev/block/<major:minor>".
This symlink is set by default basic udev rules provided by udev directly.
We'll use this as an alternative for the check that initial udev
processing for a device has already finished.
It's possible (mainly during boot) that udev has not finished
processing the device and hence the udev database record for that
device is still marked as uninitialized when we're trying to look
at it as part of multipath component check in pvscan --cache code.
So check several times with a short delay to wait for the udev db
record to be initialized before giving up completely.
When scanning devs to populate lvmetad during system startup,
filter-mpath with native sysfs multipath component detection
may not detect that a dev is multipath component. This is
because the multipath devices may not be set up yet.
Because of this, pvscan will scan multipath components during
startup, will see them as duplicate PVs, and will disable
lvmetad. This will leave lvmetad disabled on systems using
multipath, unless something or someone runs pvscan --cache
to rescan.
To avoid this problem, the code that is scanning devices to
populate lvmetad will now check the udev db to see if a
dev is a multipath component that should be skipped.
(This may not be perfect due to inherent udev races, but will
cover most cases and will be at least as good as it's ever
been.)
The lsblk is just a nice helper here - it's not crucial for lvmdump so
do best effort here and use the most we can from current version of
lsblk that is installed on system. The lsblk -s option was added a bit
later after lsblk introduction and lsblk -O support even more later -
so if these are not available, use only pure lsblk output without any
extras.
- Prevent --lvmshell with --nojson, not a valid combination
- If user is preventing json, then no lvmshell usage
- Return boolean on Manager.UseLvmShell
The normal mode of operation will be to monitor for udev events until an
ExternalEvent occurs. In that case the service will disable monitoring
for udev events and use ExternalEvent exclusively.
Note: User specifies --udev the service will always monitor udev regardless
if ExternalEvent is being called too.
With the addition of JSON and the ability to get output which is known to
not contain any extraneous text we can now leverage lvm shell, so that we
don't fork and exec lvm command line repeatedly.
When we are running in a terminal it's useful to have a date & ts on log
output like you get when output goes to the journal. Check if we are
running on a tty and if we are, add it in.
Avoid monitoring of activated cache-pool - where the only purpose ATM
is to clear metadata volume which is actually activate in place
of cache-pool name (using public LV name).
Since VG lock is held across whole clear operation, dmeventd cannot
be used anyway - however in case of appliction crash we may
leave unmonitored device.
In future we may provide better mechanism as the current name
replacemnet is creating 'uncommon' table setups in case the metadata
LV is more complex type like raid (needs some futher thinking about
error path results).
Another point to think about is the fact we should not clear device
while holding lock (i.e. dmeventd mirror repair cannot work in cases
like this).
Introduce 'hard limit' for max number of cache chunks.
When cache target operates with too many chunks (>10e6).
When user is aware of related possible troubles he
may increase the limit in lvm.conf.
Also verbosely inform user about possible solution.
Code works for both lvcreate and lvconvert.
Lvconvert fully supports change of chunk_size when caching LV
(and validates for compatible settings).
Commit e947c362dd introduced
config_settings.h file for central place to store all definitions for
config options. By mistake, it used report/colums_as_rows instead
of report/columns_as_rows (missing "n" in "columns").
If the number of stripes requested is incompatible with the requested
type of raid, give an error instead of adjusting it.
If no stripes argument is supplied, continue to use an appropriate
default.
a579ba2ac2 fixed a regression causing a segfault if no external
origin existed but broke the logic leading to erroneous error
messages and creations of split off exported VGs in case the
external origin and the pool LVs were allocated on different PVs.
- resolves rhbz1367459
Creating a RaidLV in VGs with very small extent sizes caused
late failure in the kernel giving a not very informative error
message. Catch the attempt early and display failure message
'Unable to create RAID LV: requires minimum VG extent size 4.00 KiB'.
- resoves rhbz1179970
'pvmove -n name pv1 pv2' called with the name of a top-level LV
failed with mentioned commit.
Enhance pvmove-raid-segtypes.sh to test for prohibited RAID SubLV moves.
'pvmove -n name pv1 pv2' allows to collocate multiple RAID SubLVs
on pv2 (e.g. results in collocated raidlv_rimage_0 and raidlv_rimage_1),
thus causing loss of resilence and/or performance of the RaidLV.
Fix this pvmove flaw leading to potential data loss in case of PV failure
by preventing any SubLVs from collocation on any PVs of the RaidLV.
Still allow to collocate any DataLVs of a RaidLV with their sibling MetaLVs
and vice-versa though (e.g. raidlv_rmeta_0 on pv1 may still be moved to pv2
already holding raidlv_rimage_0).
Because access to the top-level RaidLV name is needed,
promote local _top_level_lv_name() from raid_manip.c
to global top_level_lv_name().
- resolves rhbz1202497
Adding MetaLVs to given DataLVs (e.g. raid0 -> raid0_meta takeover),
_avoid_pvs_with_other_images_of_lv() was missing code to prohibit
allocation when called with a just allocated MetaLV to prohibit
collaocation of the next allocated MetaLV on the same PV.
- resolves rhbz1366738
When lvm is compiled with --enable-notify-dbus and a user uses lvm
shell, after they issue 200+ commands the lvm shell will hang for
~30 seconds trying to notify the lvm dbus service that a change
has occurred. This appears to be caused by resource exhaustion,
because the sockets used for dbus communication are not be closed.
Enforce mirror/raid0/1/10/4/5/6 type specific maximum images when
creating LVs or converting them from mirror <-> raid1.
Document those maxima in the lvcreate/lvconvert man pages.
- resolves rhbz1366060
Some settings are not suitable for override in interactive/shell
mode because such settings may confuse the code and it may end
up with unexpected behaviour. This is because of the fact that
once we're in the interactive/shell mode, we have already applied
some settings for the shell itself and we can't override them
further because we're already using those settings to drive the
interactive/shell mode. Such settings would get ignored silently
or, in worse case, they would mess up the existing configuration.
When lvm commands are executed in lvm shell, we cover the whole lvm
command execution within this shell now. That means, all messages logged
and status caught during each command execution is now recorded in the
log report, including overall command's return code.
The dm_report_group_output_and_pop_all calls dm_report_output and
dm_report_group_pop for all the items that are currently in report
group. This is just a shortcut that makes it easier to output and
pop group's content so the group handle can be reused again without
a need to initialize and configure it again.
The functionality of dm_report_group_output_and_pop_all is the
same as dm_report_destroy but without destroying the report group
handle.
This patch moves printing of starting '{' character for JSON output up
untili it's known there's any further output following - either the
content or ending '}' character.
Also, remove unnecessary switch for different report group types and
calling individual functions to handle dm_report_group_create as that
code is shared for all existing types at the moment.
Calling dm_report_destroy_rows makes it possible to destroy any report
content we have but at the same time it doesn't destroy the report
handle itself, thus it's possible to reuse that handle again for new
report content.
Functionally, this is the same as calling dm_report_output with the
report handle but omitting the output iself. This functionality may
be useful if we, for whatever reason, need to discard the report
content and start a fresh new one but with the same report configuration
and initialization and thus we can just reuse the existing handle.
We may call arg_count/grouped_arg_count/arg_value soon enough that
cmd->arg_values is not set yet.
Normally, when running a command, we execute lvm_run_command which in
turn calls _process_command_line to allocate and parse the command line
values and stores them in cmd->arg_values.
However, if we run lvm shell, this one doesn't accept any command line
options and we parse the command line for each command that is executed
within the lvm shell then. If we used any code that tries to access
cmd->arg_values through any of the the arg handling functions too
early, we could end up with a segfault due to uninitialized (NULL)
cmd->arg_values.
This patch just saves extra checks in all the code where arg handling
may be called too early so that the cmd->arg_values is not set up yet.
This does not apply to any of existing code, but subsequent patches
will need that.
With patches that will follow, this will make it possible to widen log
report coverage when commands are executed from lvm shell so the amount
of messages that may end up in stderr/stdout instead of log report are
minimized.
Add new log_context=shell and with log_object_type=cmd and
log_object_name=<command_name> for command log report to collect
overall return code from last command (this is reported under
log_type=status).
Currently, the output is separated in 3 parts and each part can go into
a separate and user-defined file descriptor:
- common output (stdout by default, customizable by LVM_OUT_FD environment variable)
- error output (stderr by default, customizable by LVM_ERR_FD environment variable)
- report output (stdout by default, customizable by LVM_REPORT_FD environment variable)
For example, each type of output goes to different output file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvs fedora vg/abc 1>out 2>err 3>report
[0] fedora/~ # cat out
[0] fedora/~ # cat err
Volume group "vg" not found
Cannot process volume group vg
[0] fedora/~ # cat report
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
Another example in LVM shell where the report goes to "report" file:
[0] fedora/~ # export LVM_REPORT_FD=3
[0] fedora/~ # lvm 3>report
(in lvm shell)
lvm> vgs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
(in lvm shell)
lvm> lvs
(content of "report" file)
[1] fedora/~ # cat report
VG #PV #LV #SN Attr VSize VFree
fedora 1 2 0 wz--n- 19.49g 0
LV VG Attr LSize Layout Role CTime
root fedora -wi-ao---- 19.00g linear public Wed May 27 2015 08:09:21
swap fedora -wi-ao---- 500.00m linear public Wed May 27 2015 08:09:21
RAID6 LVs may not be created with --nosync or data corruption
may occur in case of device failures. The underlying MD raid6
personality used to drive the RaidLV performs read-modify-write
updates on stripes and thus relies on properly written parity
(P and Q Syndromes) during initial synchronization.
Once on it, enhance test to create/extend more and
larger RaidLVs and check sync/nosync status.
Commit 76ef2d15d8 introduced
raid0 <-> raid4 takeover and full mirror <-> raid1 support.
Add tests for these conversions.
Tests exposed a kernel semantics change freezing resynchronization
on conversions from raid0[_meta] -> raid4 or adding raid1 legs
because kernel kept the RAID mapped device in 'frozen' state unless
an 'idle' message was sent or the table was reloaded (kernel patch pending).
Although the use of the first region_id in a group to store the
DMS_GROUP=... aux_data tag is an internal implementation detail,
it has a user visible consequence in that deleting this region will
cause the group to disappear: add an explanation of this to the
'group' command and 'Regions, areas, and groups' section.
The MD raid6 personality being used to drive lvm raid6 LVs does
read-modify-write updates to any stripes and thus relies on correct
P and Q Syndromes being written during initial synchronization or
it may fail reconstructing proper user data in case of SubLVs failing.
We may not allow the '--nosync' option on
creation of raid6 LVs for that reason.
Update/fix 'man lvcreate' in that regard.
add lvcreate-raid-nosync.sh test script.
- Resolves rhbz1358532
We don't need to refresh whole cmd context if we drop profile after
processing LVM command - just like we don't refresh cmd context when
we're applying the profile. It's because profiles contain only safe
subset of settings which do not require complete cmd context refresh.
This patch calls process_profilable_config instead of
refresh_toolcontext if there was profile applied for the LVM
command only, not --config which requires toolcontext refresh.
The process_profilable_config just sets proper values based on
values of profilable settings, but it does not do complete
reinitialization of various parts (e.g. filters, logging etc.).
'lvchange --resync LV' or 'lvchange --syncaction repair LV' request the
RAID layout specific parity blocks in raid4/5/6 to be recreated or the
mirrored blocks to be copied again from the master leg/copy for raid1/10,
thus not allowing a rebuild of a particular PV.
Introduce repeatable option '--[raid]rebuild PV' to allow to request
rebuilds of specific PVs in a RaidLV which are known to contain corrupt
data (e.g. rebuild a raid1 master leg).
Add test lvchange-rebuild-raid.sh to test/shell doing rebuild
variations on raid1/10 and 5; add aux function check_status_chars
to support the new test.
- Resolves rhbz1064592
Prepare for new segment type conversion functionality in cases that
currently fail. In the short-term, we need to do this while limiting
the changes to the code paths for the conversions that are already
supported.
introduced with commit 8f62b7bfe5 rely on complete
defintions of the relations between the LVs of a VG.
Hence only run these checks when the complete_vg flag
is set on calls to check_lv_segments().
lvconvert failed in test lvconvert-thin-raid.sh when
calling check_lv_segments() from _read_segments() without
providing a complete definition.
on any thin snap external origin LV which caused a segfault
when none existed as exposed by the vgsplit-thin.sh test.
Only call lv_is_on_pvs() if an external origin LV actually
exists and correct the related splitting logic.
When converting to a cache lv, tests were hanging with a prompt for
"Do you want wipe existing metadata of cache pool volume
To preserve cache metadata add option "--zero n".
WARNING: Reusing mismatched cache pool metadata MAY DESTROY YOUR DATA!"
This is new.
When a client is doing a wait on a job, any other clients will hang
when trying to do anything with the service. This is caused by
the wait code which was placing the thread that handles
incoming dbus requests to sleep until either the timeout expired or
the job operation completed.
This change creates a thread for the wait request, so that the thread
processing incoming requests can continue to run.
with respect to the changed, configurable default behaviour
introduced with commit 7eb7909193.
E.g. raid default of 2 stripes rather than number of PVs in the VG
or on the command line minus one.
The syntax for converting an LV to a thin LV
included an unnecessary --thin option. I was
probably still confused about these options
when writing this.
General RAID and RAID segment type specific checks are added
to merge.c. New static _check_raid_seg() is called on each segment
of a RaidLV (which have just one) from check_lv_segments().
New checks caught some unititialized segment members
which are addressed here as well:
- initialize seg->region_size to 0 in lvcreate.c for raid0/raid0_meta
- initialize list seg->origin_list in lv_manip.c
Add matching support for -Z option also we doing full conversion
to cache-pool.
Extending coversion message to show which pool type is created
and whether the metadata will be wiped or remain unmodified.
Follow-up to 27a767d5e8.
Tunning behavior in a way we always prompt when option --zero is NOT specified.
Without -Z lvm expects user wants to 'reset' cache-pool metadata
(they could have been splitted from some cached LV)
If user doesn't want to zero metadata he needs to specify -Zn.
User may also avoid prompting for zeroing by using -Zy for
cache-pool (basically equals using --yes without -Z being given)
(unlike full convert case, there is no cache-pool being converted,
so there is not 'uncoditional' prompt in this case).
When volume was lvconvert-ed to a thin-volume with external origin,
then in case thin-pool was in non-zeroing mode
it's been printing WARNING about not zeroing thin volume - but
this is wanted and expected - so nothing to warn about.
So in this particular use case WARNING needs to be suppressed.
Adding parameter support for lvcreate_params.
So now lvconvert creates 'normal thin LV' in read-only mode
(so any read will 'return 0' for a moment)
then deactivate regular thin LV and reacreate in 'final R/RW' mode
thin LV with external origin and activate again.
Before, the automatic update from older to newer version of PV extension
header happened within vg_write call. This may have caused problems under
some circumnstances where there's a code in between vg_write and vg_commit
which may have failed. In such situation, we reverted precommitted metadata
and put back the state to working version of VG metadata.
However, we don't have revert for PV write operation at the moment. So
if we updated PV headers already and we reverted vg_write due to failure
in subsequent code (before vg_commit), we ended up with lost VG metadata
(because old metadata pointers got reset by the PV write operation).
To minimize problematic situations here, we should put vg_write and
vg_commit that is done after PV header rewrites as close to each
other as possible.
This patch moves the automatic PV header rewrite for new extension
header part from vg_write to _vg_read where it's done the same way
as we do any other VG repairs if detected during VG read operation
(under VG write lock).
If the VG holding the global lock is removed, we can indicate
that as the reason for not being able to acquire the global
lock in subsequent error messages, and can suggest enabling
the global lock in another VG. (This helpful error message
will go away if the global lock is enabled in another VG,
or if lvmlockd is restarted.)
When cache pool is reused for a new cached volume, there is
normally no need to 'keep' old cache-pool metadata as this
could cause major data lose.
Unlike with 'lvcreate -H -LX --cachepool' conversion, this lvconvert
path left the metadata unzeroed - partly for making easier some
debugging, but this was rather a bug.
So to keep possible reattach of 'unzeroed' metadata, user
now has to use 'lvconvert -Zn' for such conversion. In this case
the prompt will appear about possibe data loss and to proceed,
user has to confirm such operation. Without -Zn metadata are wiped.
In some cases, the command will update VG metadata
in lvmetad without writing it. In these cases there
is no vg->vg_committed and it should use 'vg' directly.
This happens when the command finds that the lvmetad
VG has been invalidated, rereads the metadata from disk,
then updates lvmetad with that metadata. This happens
often with lvmlockd or foreign VGs, and can happen without
lvmlockd if a previous command fails after invalidating
the VG in lvmetad.
Commit 3928c96a37 introduced
new defaults for raid number of stripes, which may cause
backwards compatibility issues with customer scripts.
Adding configurable option 'raid_stripe_all_devices' defaulting
to '0' (i.e. off = new behaviour) to select the old behaviour
of using all PVs in the VG or those provided on the command line.
In case any scripts rely on the old behaviour, just set
'raid_strip_all_devices = 1'.
- resolves rhbz1354650
This fixes a regression from commit a7c45ddc5, which moved
the lvmetad VG update from vg_commit() to unlock_vg().
The lvmetad VG update needs to send the version of metadata
that was committed rather than sending the state of struct 'vg'.
The 'vg' may have been partially modified since vg_commit(),
and contain non-committed metadata that shouldn't be sent
to lvmetad.
Any failing stripes in raid0/raid0_meta type LVs cause data loss,
thus replacement via 'lvconvert --replace...' does not make sense.
Patch prohibits replacement on raid0/raid0_meta LVs.
- resolves rhbz1356734
The --uuid, --major and --alldevices arguments were incorrectly tested
after confirming argc is > 0, in a branch that only executes if argc
== 0 (i.e. they were unreachable).
Move all device checks before the test for argc and log an appropriate
error before returning.
Including major and minor numbers in pvs and lvs output when calling
lvmdump -a makes it a bit easier to match these items with possible
system log/journal.
Commit ca878a3426 changed behavior
or resize operation. Later the code has been futher changed
to skip fs resize completely when size of LV is already matching
and finaly at the most recent resize changeset for resize the
check for matching size has been eliminated as well so we ended
with a request call to resize fs to 0 size in some cases.
This commit reoders some test so the prompt happens just once before
resize of possibly 2 related volumes.
Also extra test for having LV already given size is added, and
whole metadata update is skipped for this case as the only
result would be an increment of seqno.
However the filesystem is still resized when requested,
so if the LV has some size and the resize is resolved to
the same size, the filesystem resize is called so in case FS
would not match, the resize will happen.
raid0/raid0_meta type LVs don't have a default number of stripes when
created without '-i/--stripes Stripes' whereas other raid types have one.
Patch sets the default for raid0/raid0_meta to 2 stripes.
The default amount of stripes for raid4/5/10 is changed to 2 and for raid6 to 3
rather than using all PVs in the VG or those provided on the command line.
This is to avoid unintended high number of stripes in case of many PVs.
To select a different amount of stripes from the default,
use 'lvcreate -i/--stripes Stripes'.
- resolves rhbz1354650
A livelock occurs on extension in lv_manip when adjusting the region size,
which doesn't apply to any raid0/raid0_meta LVs (these don't have a bitmap).
Fix by prohibiting the region size adjustment on any such LVs.
- resolves rhbz1354604
An unconditional access to the non-existing MetaLV of a raid0 LV in
lv_raid_remove_missing() was causing the segfault.
Only call log_debug() on replacements of existing MetaLVs.
- resolves rhbz1354646
Resync attempts on raid0/raid0_meta via 'lvchange --resync ...'
cause segfaults.
'lvchange --syncaction ...' doesn't get rejected either.
Prohibit both on raid0/raid0_meta LVs.
- resolves rhbz1354656
4420d41fea introduced recursive split of lvs which
splits a top-level LV together with it's sub LVs.
This lead to invalid temporary list pointers
causing hangs/OOM situations.
Patch updates the temporary list pointer
referencing a moved sub LV.
- resolves rhbz1354686
Synopsis are very useful for quick orientation and also
we provide then for all remaining command.
Also list ALL supported options in a single ordered list,
user should not seek for them.
blkdeactivate -m disablequeueing causes "multipathd disablequeueing maps"
call inside blkdeactivate script before deactivating devices. This
avoids a situation where blkdeactivate may wait for paths to appear if
multipath is set to queueing and there's a stack of other devices and/or
mount points on top of such multipath device.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1344381.
Reduce number of evualted pvcreate commands.
Since 0 is default value used to fill missing params,
and 0 is also 1st. value in array, it's being tested.
Drop unused data_alignment_offset.
When logging to epoch files we would like to prevent creating too large
log files otherwise a spining command could fulfill available space
very easily and quickly.
Limit for to 100000 per command.
Make the --filemap switch take no arguments and instead accept one
or more files on the command line to be mapped and placed into
groups.
This allows --filemap to be used with a glob:
# dmstats create --filemap *
rhel5.10-1.qcow2: Created new group with 87 region(s) as group ID 1564.
rhel5.10.qcow2: Created new group with 8 region(s) as group ID 1651.
rhel7.0-1.qcow2: Created new group with 11 region(s) as group ID 1659.
rhel7.0.qcow2: Created new group with 1454 region(s) as group ID 1670.
vm.img: Created new group with 2 region(s) as group ID 3124.
lvconvert --splitcache VG/CachePool_corig
Allow the split via the hidden/used cache pool for the time being,
since the new lvconvert code did intend to allow it, but was just
missing the exception in the list of hidden LVs that were allowed.
The preferred method for splitcache is to run it on the visible
cache LV, not the hidden cache pool. That may eventually become
the only method since we try to avoid running commands on
hidden LVs.
When a 'dmstats create --filemap' operation fails (e.g. during
open(2), close(2), or dm_stats_create_regions_from_fd()), use the
canonical version of the path. This avoids cryptic/confusing error
messages when symbolic links exist in the path argument given:
# findmnt /var/lib/libvirt/images -otarget,source
TARGET SOURCE
/var/lib/libvirt/images /dev/mapper/vg_hex-lv_images
# readlink /var/lib/libvirt/images/my.img
/boot/my.img
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /var/lib/libvirt/images/my.img
Command failed
Using the canonical path the error is immediately obvious:
# dmstats create --filemap /var/lib/libvirt/images/my.img
Cannot map file: not a device-mapper device.
Could not create regions from file /boot/my.img
Command failed
Grouping is also useful in combination with --segments: creating a
group allows both individual segment data and data for the device
as a whole to be presented in the same report.
Support grouping for 'create --segments' in the same manner as for
'create --filemap'; group regions by default, applying an optional
alias specified with --alias, unless the user specifies --nogroup.
Support aggregate group and region histograms by allocating a new
histogram from the pool and populating it with a sum of the histogram
data for the areas contained in the region or group.
To avoid repeatedly summing the same histogram data, cache the pointer
in the group and regions structs for subsequent access. The aggregate
histograms are allocated from the same pool as the area histograms in
the corresponding handle and will be discarded at each list or populate
operation.
Add a new option to the create command to create regions that map the
extents of a file:
# dmstats create --filemap /path/to/file
/path/to/file: Created new group with 10 region(s) as group ID 0.
When performing a --filemap no device argument is required (and
supplying one results in error) since the device to bind to is implied
by the file path and is obtained directly from an fstat().
Grouping may be optionally disabled by the --nogroup switch: in this
case the command will report each region individually:
# dmstats create --nogroup --filemap /path/to/file
/path/to/file: Created new region with 1 area as region ID 0.
/path/to/file: Created new region with 1 area as region ID 1.
/path/to/file: Created new region with 1 area as region ID 2.
When grouping regions the group alias is automatically set to the
basename (as returned by dm_basename()) of the provided file.
This can be overridden to a user-defined value at the command line by
use of the --alias option.
If grouping is disabled no alias can be set.
Use of offset and subdivision options (--start, --length, --segments,
--areas, --areasize).
Setting aux_data and histograms for groups is possible but is not
currently implemented.
Add a call to create dmstats regions that correspond to the extents
present in a file descriptor open on a file in a local file system.
The file must reside on a file system type that correctly supports
physical extent location data in the FIEMAP ioctl.
Regions are optionally placed into a group with a user-defined alias.
File systems that do not support physical offsets in FIEMAP (btrfs
currently) are detected via fstatfs() - although attempting to map
a --filemap group on btrfs will fail anyway with the generic error
"Not on a device-mapper device" this is confusing; the file system
mount is on a device-mapper device, but btrfs' volume layer masks
this in the returned st_dev field since the returned logical file
extents may span multiple physical devices.
The function _stats_remove_region_id_from_group() incorecctly set
the group_id to DM_STATS_GROUP_NOT_PRESENT _before_ the call to
_stats_group_destroy(). This will cause the destroy function to
return immediately without doing anything:
339 static void _stats_group_destroy(struct dm_stats_group *group)
340 {
341 if (!_stats_group_present(group))
342 return;
Invalidating the ID in _stats_region_region_id_from_group() is
redundant anyway; it is rightly done as the last operation in
_stats_group_destroy (and it is not possible for anything to see
the old value between the two calls).
Remove the change to group_id to ensure that the alias and bitset
resources are correctly freed.
The call to dm_stats_walk_start() before the do statement makes
dm_stats_walk_do() behave inconsistently depending on context;
wrap them in an additional do { } while (0) so that the macro
always expands to a valid statement.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/RaidLV" would fail, but
"lvconvert --mirrors 0 VG/RaidLV" would succeed.
The code could perform this conversion but ironically
did not recognize the standard command form, only the
the unpreferred "implication-based" command form.
"lvconvert --type linear VG/MirrorLV" would fail, but
"lvconvert --mirrors 0 VG/MirrorLV" would succeed.
If after extracting stats arguments and group tags nothing remains
of aux_data but '-' set the region->aux_data field to the empty
string to match behaviour for non-grouped regions.
Although not harmful do not allow a group containing regions with
histograms since it is not currently possible to present histogram
data aggregated for the group.
Although a non-zero value for the number of ticks spent doing IO
should imply a non-zero number of IOs in the interval test for
this explicitly to avoid a divide-by-zero in the event of bad
counter data.
It's possible for interval_ns to be zero if the interval is not
set or the clock is misconfigured. Test for this before using the
value as the divisor in the utilisation calculation.
Walk flags are ULL constants; cast the result to a uint64_t before
logging with a FMTx64 format specifier to avoid a compiler warning:
warning: format ‘%lx’ expects argument of type ‘long unsigned int’,
but argument 5 has type ‘long long unsigned int’
Make it clear that the "aux data" presented in reports is the user
data stored in the field (and does not include any library-internal
state such as group descriptors) by renaming the field to user_data
and changing the heading to "UserData".
Make it clear in libdevmapper.h, and in function argument names, that
libdm-stats uses the aux_data field internally and that any values set
for user_data are appended to the library values before being stored
with a region, and similarly, that internal data fields will be stripped
prior to returning any previously stored user_data.
Replace --statstype=area,region,group with a separate switch for
each object type: --area, --region, --group. Omitting any object
type switch will use the defaults for the current command (regions
and groups for list, and regions, groups and areas for verbose list).
Replace the 'name' field with 'statsname' in order to report alias
names for groups, and include the 'group_id' field between statsname
and the 'region_id' field to make it clear to the user when groups
are in use.
Walk avaiable groups and regions (in addition to areas) and report
aggregate statistics and properties.
A new switch is added to filter the type of obects inclued in the
report:
--statstype={all,area,region,group}
The type of the current row is also available in a new
DR_STATS_META field 'type'.
To allow the names used to describe statistics report objects to
change (for e.g to support groups and region and group aliases)
introduce a new "stats_name" field that evaluates to the correct
name for the object being reported.
Add a pair of commands to create and delete stats groups:
dmstats group --regions REGIONS
dmstats ungroup --groupid ID
REGIONS specifies a list of regions to be included in the group.
Regions are specified as a comma separated list in order of
increasing region ID. Ranges may be specified as a hypen separated
pair of values giving the first and last member of the range.
Add support do dm_stats_walk*() to walk over the set of
available groups using the cursor embedded in the dm_stats
handle, and to obtain the type of the object at the current
stats cursor location. A set of flags is introduced to
control which objects are visited:
DM_STATS_WALK_AREA
DM_STATS_WALK_REGION
DM_STATS_WALK_GROUP
DM_STATS_WALK_ALL
A final flag suppresses visits to regions that contain only a
single area - since the aggregate of such a region is idential
to the area it contains this allows these duplicates to be
filtered out:
DM_STATS_WALK_SKIP_SINGLE_AREA
If flags are not initialised before beginning a walk the default
set matches the behaviour of previous versions of the library.
Also accept group identifiers as immediate arguments to the
counter, metric, and property functions by adding control
flags to the region and area identifiers passed in.
Region and area properties are mapped to their equivalents for
the group (for example: group size is reported as the sum of
all regions contained in the group). Counter and metric values
are aggregated for the region or group.
Introduce constants for the buffer sizes that libdm-stats uses:
one for messages sent to the kernel, one for rows of response data
returned, and a pair for the "start+len" range and histogram bounds
strings.
Add a grouping facility to the libdm-stats library that allows the
user to bind several regions together as a group. Groups may be
used to aggregate data from several regions for reporting, or to
select and sort among large sets of regions.
A textual descriptor ("group tag") is associated with each group
and is stored in the first group member's aux_data field. The
tag contains the group member list and an optional alias for the
group, allowing the user to assign meaningful names to groups of
regions.
These descriptors are parsed in @stats_list message responses and
populate the resulting region and area tables with the group
structure.
Groups with overlapping regions are permitted but since this will
result in some events being counted more than once a warning is
printed in this case.
Nested and overlapping groups are not currently supported and
attempting to create these configurations results in error.
Add a new enum based interface for accessing counter and metric
values that uses a single function for each:
uint64_t dm_stats_get_counter(const struct dm_stats *dms,
dm_stats_counter_t counter
uint64_t region_id, uint64_t area_id);
int dm_stats_get_metric(const struct dm_stats *dms, int metric,
uint64_t region_id, uint64_t area_id,
double *value);
This simplifies the implementation of value aggregation for
groups of regions. The named function interface now calls the
enum interface internally so that all new functionality is
available regardless of the method used to retrieve values.
Cache the device-mapper name of a bound device in the dm_stats
handle.
This will be used by stats groups to report a device name or
user defined alias for groups.
The device-mapper name, device numbers and uuid stored in the
dm_stats handle are used only to bind the handle to a specific
device in order to issue ioctls.
Rename them to "bind_*" to reflect this usage in preparation
for caching the device-mapper name of the bound device in the
dm_stats handle.
This will be used to allow optional aliases to be set for
dmstats groups.
Add a function to parse a list of integer values and ranges into
a dm_bitset representation. Individual values signify that that bit
is set in the resulting mask and ranges are given as a pair of
start and end values, M-N, such that M and N are the first and
last members of the range (inclusive).
The implementation is based on the kernel's __bitmap_parselist()
that is used for cpumasks and other set configuration passed in
string form from user space.
TODO: it might be better to log dmeventd messages with test output
just like we do with clvmd - maybe we will switch to this one
instead of extra DMEVENTD log file in future....
Add config for mkfs to get more predicatable results
when using mkfs across variety of distributions.
In future maybe use this per all tests as default.
For now user has to specify in a test MKE2FS_CONFIG envvar to use it.
When force removing thin-pool we loose 'real' access to hidden device,
and if such pool is in suspended state, any thin volume cannot be
dropped. It likely should be also checked by dmsetup, but meanwhile
apply simple logic - try to force remove first all higher minors first
with assumption we first create thin-pool and then thin volume
and there are usually not being released lower dm numbers to
get the order wrong.
With a single report (--count=1) no timerfd is set up and the cycle
and current timestamps should be freed during the single call to
_update_interval_times().
This patch fixes link validation for used thin-pool.
Udev rules correctly creates symlinks only for unused new thin-pool.
Such thin-pool can be used by foreing apps (like Docker) thus
has /dev/vg/lv link.
However when thin-pool becomes used by thinLV - this link is no
longer exposed to user - but internal verfication missed this
and caused messages like this to be printed upon 'vgchange -ay':
The link /dev/vg/pool should have been created by udev but it was not
found. Falling back to direct link creation.
And same with 'vgchange -an':
The link /dev/vg/pool should have been removed by udev but it is still
present. Falling back to direct link removal.
This patch ensures only unused thin-pool has this link.
Run umount code only when either thin data or metadata are
above 95% - so if there are resize failures with 60%.
system fill keep running.
Also umount will only be tried with lvm2 LVs.
Foreign users are ATM unsuppored.
Add new logic to identify each unique operation and route
it to the correct function to perform it. The functions
that perform the conversions remain unchanged.
This new code checks every allowed combination of LV type
and requested operation, and for each valid combination
calls the function that performs that conversion.
The first stage of option validation which checks for
incompatible combinations of command line options, is done
done before process_each is called. This is unchanged.
(This new code will allow that first stage validation to
be simplified in a future commit.)
The second stage of checking options against the specific
LV type is done by this new code. For each valid combination
of operation + LV type, the new code calls an existing
function that implements it.
With this in place, the ad hoc checks for valid combinations
of LV types and operations can be removed from the existing
code in a future commit.
(The #if 0 is used to keep the patch clean, and the
disabled code will be removed by a following patch.)
There are detailed messages inside _create_dir_recursive that
dm_create_dir calls (except EROFS which where the message is not
generated, like anywhere else in the code).
When we test Vg.LvCreateRaid some of the hidden LVs volume type go from
'I' to 'i' between the time it takes us to create the LV and
the time it takes to call into refresh to verify the service is up to date.
This is a fairly rare occurance.
We call 'lvm help' to find out if fullreport is supported. Lvm
dumps help to stderr. Common code prints a warning if we exit
with 0, but have something in stderr so we are skipping the warning
message.
The following operations would hang if lvm was compiled with
'enable-notify-dbus' and the client specified -1 for the timeout:
* LV snapshot merge
* VG move
* LV move
This was caused because the implementation of these three dbus methods is
different. Most of the dbus method calls are executed by gathering information
needed to fulfill it, placing that information on a thread safe queue and
returning. The results later to be returned to the client with callbacks.
With this approach we can process an arbitrary number of commands without any
of them blocking other dbus commands. However, the 3 dbus methods listed
above did not utilize this functionality because they were implemented with a
separate thread that handles the fork & exec of lvm. This is done because these
operations can be very slow to complete. However, because of this the lvm
command that we were waiting on is trying to call back into the dbus service to
notify it that something changed. Because the code was blocking the process
that handles the incoming dbus activity the lvm command blocked. We were stuck
until the client timed-out the connection, which then causes the service to
unblock and continue. If the client did not have a timeout, we would have been
hung indefinitely.
The fix is to always utilize the worker queue on all dbus methods. We need to
ensure that lvm is tested with 'enable-notify-dbus' enabled and disabled.
Apply the same idea as vg_update.
Before doing the VG remove on disk, invalidate
the VG in lvmetad. After the VG is removed,
remove the VG in lvmetad. If the command fails
after removing the VG on disk, but before removing
the VG metadata from lvmetad, then a subsequent
command will see the INVALID flag and not use the
stale metadata from lvmetad.
Previously, a command sent lvmetad new VG metadata in vg_commit().
In vg_commit(), devices are suspended, so any memory allocation
done by the command while sending to lvmetad, or by lvmetad while
updating its cache could deadlock if memory reclaim was triggered.
Now lvmetad is updated in unlock_vg(), after devices are resumed.
The new method for updating VG metadata in lvmetad is in two phases:
1. In vg_write(), before devices are suspended, the command sends
lvmetad a short message ("set_vg_info") telling it what the new
VG seqno will be. lvmetad sees that the seqno is newer than
the seqno of its cached VG, so it sets the INVALID flag for the
cached VG. If sending the message to lvmetad fails, the command
fails before the metadata is committed and the change is not made.
If sending the message succeeds, vg_commit() is called.
2. In unlock_vg(), after devices are resumed, the command sends
lvmetad the standard vg_update message with the new metadata.
lvmetad sees that the seqno in the new metadata matches the
seqno it saved from set_vg_info, and knows it has the latest
copy, so it clears the INVALID flag for the cached VG.
If a command fails between 1 and 2 (after committing the VG on disk,
but before sending lvmetad the new metadata), the cached VG retains
the INVALID flag in lvmetad. A subsequent command will read the
cached VG from lvmetad, see the INVALID flag, ignore the cached
copy, read the VG from disk instead, update the lvmetad copy
with the latest copy from disk, (this clears the INVALID flag
in lvmetad), and use the correct VG metadata for the command.
(This INVALID mechanism already existed for use by lvmlockd.)
This fixes commit 0ba5f4b8e9 which moved
field recalculation (field width and sort position) from
dm_report_object to dm_report_output but it didn't handle the case when
dm_report_column_headings was used separately to report headings (before
dm_report_outpout call) and hence we ended up with intial widths for
fields in the headings.
If we're using dm_report_column_headings, we need to recalculate
fields if we haven't done so yet, the same way as we do in
dm_report_output.
Simplify code around _do_get_report_selection - remove "expected_idxs[]"
argument which is superfluous and add "allow_single" switch instead to
allow for recognition of "--configreport <report_name> -S" as well as
single "-S" if needed.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 961 in _do_report_get_selection()
Null pointer dereferences (FORWARD_NULL) Dereferencing null pointer "single_args".
Uninitialized variables (UNINIT) /safe/guest2/covscan/LVM2.2.02.158/tools/toollib.c: 3520 in _process_pvs_in_vgs()
Uninitialized variables (UNINIT) Using uninitialized value "do_report_ret_code".
Null pointer dereferences (REVERSE_INULL) /safe/guest2/covscan/LVM2.2.02.158/libdm/libdm-report.c: 4745 in dm_report_output()
Null pointer dereferences (REVERSE_INULL) Null-checking "rh" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
Incorrect expression (MISSING_COMMA) /safe/guest2/covscan/LVM2.2.02.158/lib/log/log.c: 280 in _get_log_level_name()
Incorrect expression (MISSING_COMMA) In the initialization of "log_level_names", a suspicious concatenated string ""noticeinfo"" is produced.
Null pointer dereferences (FORWARD_NULL) /safe/guest2/covscan/LVM2.2.02.158/tools/reporter.c: 816 in_get_report_options()
Null pointer dereferences (FORWARD_NULL) Comparing "mem" to null implies that "mem" might be null.
This reverts commit fa69ed0bc8.
This code sometimes expects to be presented with a read-only filesystem
(during some boot sequences for example) and copes appropriately with
this and it should not lead to expected error messages that might cause
unnecessary alarm.
lv_name arg is only used without known LV for resolving '*lv'.
Once we know *lv, never use lv_name ever again.
So setting it when passing *lv has not needed.
Add code to support more LVs to be resized through a same code path
using a single lvresize_params struct.
(Now it's used for thin-pool metadata resize,
next user will be snapshot virtual resize).
Update code to adjust percent amount resize for use_policies.
Properly activate inactive thin-pool in case of any pool resize
as the command should not 'deffer' this operation to next activation.
Use common API design and pass just LV pointer to lv_manip.c functions.
Read cmd struct via lv->vg->cmd when needed.
Also do not try to return EINVALID_CMD_LINE error when we
have already openned VG - this error code can only be returned before
locking VG.
We have only 2 users of _lv_active() - one was already checking for ==1
while the other use (_lv_is_active()) could have take '-1' as a sign of having
an LV active. So return 0 and log_debug also the reason while detection
has failed (i.e. in case --driverload n - it's kind of expectable,
but might have confused user seeing just <backtrace>).
This fixes commit f50d4011cd which
introduced a problem when using older lvm2 code with newer libdm.
In this case, the old LVM didn't recognize new _LOG_BYPASS_REPORT flag
that libdm-report code used. This ended up with no output at all
from libdm where log_print_bypass_report was called because the
_LOG_BYPASS_REPORT was not masked properly in lvm2's print_log fn
which was called as callback function for logging.
With this patch, the lvm2 registers separate print_log_libdm logging
function for libdm instead. The print_log_libdm is exactly the same
as print_log (used throughout lvm2 code) but it checks whether we're
printing common line on output where "common" means not going to stderr,
not a warning and not an error and if we are, it adds the
_LOG_BYPASS_REPORT flag so the log_print goes directly to output, not
to any log report.
So this achieves the same goal as in f50d4011cd,
just doing it in a way that newer libdm is still compatible with older
lvm2 code (libdm-report is the only code using log_print).
Looking at the opposite mixture - older libdm with newer lvm2 code,
that won't be compilable because the new log report functionality
that is in lvm2 also requires new dm_report_group_* libdm functions
so we don't need to care here.
Move code from original print_log fn to a separate _vprint_log function
that accepts va_list and make print_log a wrapper over _vprint_log.
The print_log just initializes the va_list and uses it for _vprint_log
call now. This way, we can reuse _vprint_log if needed.
In commit 6ae22125, vgcfgrestore began disabling lvmetad
while running, and rescanned to enable it again at the end,
but missed the rescanning/enabling in the error case.
Reconnect to lvmetad if either the send fails (e.g. lvmetad
was restarted since lvmlockd last connected), or if no
lvmetad connection exists (e.g. lvmetad was started after
lvmlockd so no previous connection existed.)
Currently, a shutdown signal will cause lvmetad to quit
responding to new connections, but not actually exit until
all connections are gone. If a program is maintaining a
long running connection (e.g. lvmlockd, or even an lvm
command) when lvmetad gets a shutdown signal, then all
further commands will hang indefinately waiting for a
response that won't be sent.
With this patch, make lvmetad continue handling new
connections even after a shutdown signal. It will exit
once all connections are gone.
Previously, vgcfgrestore would attempt to vg_remove the
existing VG from lvmetad and then vg_update to add the
restored VG. But, if there was a failure in the command
or with vg_update, the lvmetad cache would be left incorrect.
Now, disable lvmetad before the restore begins, and then
rescan to populate lvmetad from disk after restore has
written the new VG to disk.
Reporting commands can be of different types (even if the command name
is the same):
- pvs command can be either of PVS, PVSEGS or LABEL report type,
- vgs command is of VGS report type,
- lvs command is of LVS or SEGS report type.
Use basic report type when looking for report prefix used for
--configreport option.
This means that:
- 'pvs --configreport pv' applies to PVS, PVSEGS or LABEL report type
- 'vgs --configreport vg' applies to VGS report type
- 'lvs --configreport lv' applies to LVS and SEGS report type
The DM_REPORT_OUTPUT_MULTIPLE_TIMES instructs reporting code to
keep rows even after dm_report_output call - the rows are not
destroyed in this case which makes it possible to call dm_report_output
multiple times.
This allows for moving parts of the code from dm_report_object to
dm_report_output which is important for subsequent patches that allow
for repeated dm_report_output, not destroying rows on each
dm_report_output call.
log_print is used during cmd line processing to log the result of the
operation (e.g. "Volume group vg successfully changed" and similar).
We don't want output from log_print to be interleaved with current
reports from group where log is reported as well. Also, the information
printed by log_print belongs to the log report too, so it should be
rerouted to log report if it's set.
Since the code in libdm-report which is responsible for doing the report
output uses log_print too, we need to use a different kind of log_print
which bypasses any log report currently used for logging (...simply,
we can't call log_print to output the log report itself which in turn
would again reroute to report - the report would never get on output
this way).
This patch adds structures and functions to reroute error and warning
logs to log report, if it's set.
There are 5 new functions:
- log_set_report
Set log report where logging will be rerouted.
- log_set_report_context
Set context globally so any report_cmdlog call will use it.
- log_set_report_object_type
Set object type globally so any report_cmdlog call will use it.
- log_set_report_object_name_and_id
Set object ID and name globally so any report_cmdlog call will use it.
- log_set_report_object_group_and_group_id
Set object group ID and name globally so any report_cmdlog call will use it.
These functions will be called during LVM command processing so any logs
which are rerouted to log report contain proper information about current
processing state.
The lvm fullreport works per VG and as such, the vg, lv, pv, seg and
pvseg subreport is done for each VG. However, if the PV is not part of
any VG yet, we still want to display pv and pvseg subreports for these
"orphan" PVs - so enable this for lvm fullreport's process_each_vg call.
If we have fullreport, make sure that the options/sort keys used for
each report doesn't change its type - we want to preserve the original
type so it's always 5 different subreports within fullreport (vg, lv, pv,
seg, pvseg). Since we have all report types within fullreport, users
should add fields under proper subreport type - this minimizes
duplication of info displayed on output.
Groupable args (the ones marked with ARG_GROUPABLE flag) start a new
group of args if:
- this is the first time we hit such a groupable arg,
- or if non-countable arg is repeated.
However, there may be cases where we want to give priorities when
forming groups and hence force new group creation if we hit an arg
with higher grouping priority.
For example, let's assume (for now) hypothetical sequence of args used:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
Without giving any priorites, we end up with:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\__________GROUP1___________/ \________GROUP2___________/ \_GROUP3_/
This is because we hit "-o" as the first groupable arg. The --configreport,
even though it's groupable too, it falls into the previous "-o" group.
While we may need to give priority to the --configreport arg that should
always start a new group in this scenario instead:
lvs -o lv_name --configreport log -o log_type --configreport lv -o +vg_name
| | | | | |
\_GROUP1_/ \_________GROUP2___________/ \_________GROUP3__________/
So here "-o" started a new group but since "--configreport" has higher
priority than "-o", it starts fresh new group now and hence the rest of
the command line's args are grouped by --configreport now.
lvm fullreport executes 5 subreports (vg, pv, lv, pvseg, seg) per each VG
(and so taking one VG lock each time) within one command which makes it
easier to produce full report about LVM entities.
Since all 5 subreports for a VG are done under a VG lock, the output is
more consistent mainly in cases where LVM entities may be changed in
parallel.
Add any report (pvs/vgs/lvs) currently processed to current report group
which is part of processing handle and which already contains log
report. This way both log report and pvs/vgs/lvs report will be
reported as a whole within a group, thus having same output format as
selected by --reportformat option.
If there's parent processing handle, we don't need to create completely
new report group and status report - we'll just reuse the one already
initialized for the parent.
Currently, the situation where this matter is when doing internal report
to do the selection for processing commands where we have parent processing
handle for the command itself and processing handle for the selection
part (that is selection for non-reporting tools).
Wire up report group creation with log report in struct
processing_handle and call report_format_init during processing handle
initialization (init_processing_handle fn) and destroy it while
destroing processing handle (destroy_processing_handle fn).
This way, all the LVM command processing using processing handle
has access to log report via which the current command log
can be reported as items are processed.
Separating common report and per-report arguments prepares the code for
handling several reports per one command (for example, the command log
report and LVM command report itself).
Each report can have sort keys, options (fields), list of fields to
compact and selection criteria set individually. Hooks for setting these
per report within one command will be a part of subsequent patches, this
patch only separates new struct single_report_args out of existing
struct report_args.
New report/output_format configuration sets the output format used
for all LVM commands globally. Currently, there are 2 formats
recognized:
- basic (the classical basic output with columns and rows, used by default)
- json (output is in json format)
Add new --reportformat option and new report_format_init function that
checks this option and creates new report group accordingly, also
preparing log report handle and adding it to the report group just
created.
This is a preparation for new CMDLOG report type which is going to be
used for reporting LVM command log.
The new report type introduces several new fields (log_seq_num, log_type,
log_context, log_object_type, log_object_group, log_object_id, object_name,
log_message, log_errno, log_ret_code) as well as new configuration settings
to set this report type (report/command_log_sort and report/command_log_cols
lvm.conf settings).
This patch also introduces internal report_cmdlog helper function
which is a wrapper over dm_report_object to report command log via
CMDLOG report type and which is going to be used throughout the code
to report the log items.
This patch introduces DM_REPORT_GROUP_JSON report group type. When using
this group type and when pushing a report to such a group, these flags
are automatically unset:
DM_REPORT_OUTPUT_ALIGNED
DM_REPORT_OUTPUT_HEADINGS
DM_REPORT_OUTPUT_COLUMNS_AS_ROWS
...and this flag is set:
DM_REPORT_OUTPUT_BUFFERED
The whole group is encapsulated in { } for the outermost JSON object
and then each report is reported on output as array of objects where
each object is the row from report:
{
"report_name1": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
],
"report_name2": [
{field1="value", field2="value",...},
{field1="value", field2="value",...}
...
]
...
}
This patch introduces DM_REPORT_GROUP_BASIC report group type. This
type has exactly the classical output format as we know from before
introduction of report groups. However, in addition to that, it allows
to put several reports into a group - this is the very basic grouping
scheme that doesn't change the output format itself:
Report: report1_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
Report: report2_name
Header1 Header2 ...
value value ...
value value ...
... ... ...
There's no change in output for this report group type - with this type,
we only make sure there's always only one report in a group at a time,
not more.
This patch introduces DM report group (represented by dm_report_group
structure) that is used to group several reports to make a whole. As a
whole, all the reports in the group follow the same settings and/or
formatting used on output and it controls that the output is properly
ordered (e.g. the output from different reports is not interleaved
which would break readability and/or syntax of target output format
used for the whole group).
To support this feature, there are 4 new functions:
- dm_report_group_create
- dm_report_group_push
- dm_report_group_pop
- dm_report_group_destroy
From the naming used (dm_report_group_push/pop), it's clear the reports
are pushed onto a stack. The rule then is that only the report on top
of the stack can be reported (that means calling dm_report_output).
This way we make sure that the output is not interleaved and provides
determinism and control over the output.
Different formats may allow or disallow some of the existing report
flags controlling output itself (DM_REPORT_OUTPUT_*) to be set or not so
once the report is pushed to a group, the grouping code makes sure that
all the reports have compatible flags set and then these flags are
restored once each report is popped from the report group stack.
We also allow to push/pop non-report item in which case such an item
creates a structure (e.g. to put several reports together with any
opening and/or closing lines needed on output which pose as extra
formatting structure besides formatting the reports).
The dm_report_group_push function accepts an argument to pass any
format-specific data needed (e.g. handle, name, structures passed
along while working with reports...).
We can call dm_report_output directly anytime we need (with the only
restriction that we can call dm_report_output only for the report that
is currently on top of the group's stack). Or we don't need to call
dm_report_output explicitly in which case all the reports in a stack are
reported on output automatically once we call dm_report_group_destroy.
This fixes a problem in commit ae0a8740c. The problem
in that commit was that all existing PVs are initially
dropped from lvmetad. This works if the VG is updated
at the end, which replaces the dropped PVs, but if the
rescan finds that the VG seqno is unchanged, it leaves
the cached VG in place. So, we should only drop the
existing PVs in lvmetad when the VG is going to be updated.
commit 15da467b was meant to address the case where
use_lvmetad=1 in lvm.conf, and lvmetad is not available,
in which case, pvscan --cache -aay should activate LVs.
But the commit unintentionally also changed the case
where use_lvmetad=0 in lvm.conf, in which case
pvscan --cache -aay should not activate LVs, so fix
that here.
When pvscan --cache -aay fails to connect to lvmetad it will
simply exit and do nothing. Change this so that it will
skip the lvmetad cache step and do the activation step from
disk.
We were initially looking to see if an LV was hidden and if it was we were
creating an instance of a LvCommon object to represent it. Thus if we
had a hidden cache pool for example we were missing the methods and
properties for the cache pool. However, when we create the object path,
any hidden LVs, regardless of type/functionality will be placed in the
hidden path.
The object manager method get_object_by_lvm_id was used in many cases for
the sole reason of getting the object path for the object. Instead of
retrieving the object and then calling 'dbus_object_path' on the object, we
are adding a method which returns the object path.
When we are processing the LVs we need to build up dbus objects from least
dependent to most dependent, so that we have information available when
constructing.
Original code missed to catch all apperances of SIGINT.
Also enhance logging when running in shell without tty.
Accept this regex as valid input:
'^[ ^t]*([Yy]([Ee]([Ss]|)|)|[Nn]([Oo]|))[ ^t]*$'
Some commands scan labels to populate lvmcache multiple
times, i.e. lvmcache_init, scan labels to fill lvmcache,
lvmcache_destroy, then later repeat
Each time labels are scanned, duplicates are detected,
and preferred devices are chosen. Each time this is done
within a single command, we want to choose the same
preferred devices. So, check for existing preferences
when choosing preferred devices.
This also fixes a problem with the list of unused duplicate
devs when run in an lvm shell. The devs had been allocated
from cmd memory, resulting in invalid list entries between
commands.
A number of places are working on a specific dev when they
call lvmcache_info_from_pvid() to look up an info struct
based on a pvid. In those cases, pass the dev being used
to lvmcache_info_from_pvid(). When a dev is specified,
lvmcache_info_from_pvid() will verify that the cached
info it's using matches the dev being processed before
returning the info. Calling code will not mistakenly
get info for the wrong dev when duplicate devs exist.
This confusion was happening when scanning labels when
duplicate devs existed. label_read for the first dev
would add an info struct to lvmcache for that dev/pvid.
label_read for the second dev would see the pvid in
lvmcache from first dev, and mistakenly conclude that
the label_read from the second dev can be skipped
because it's already been done. By verifying that the
dev for the cached pvid matches the dev being read,
this mismatch is avoided and the label is actually read
from the second duplicate.
If a command gets stuck during an lvmetad update, lvmetad
will cancel that update after the timeout. The next command
to check the lvmetad will see that lvmetad needs to be
populated because lvmetad will return token of "none" after
a timed out update (same as when lvmetad is not populated
at all after starting.)
If a command gets an error during an lvmetad update, it
will now just quit and leave its updating token in place.
That update will be cancelled after the timeout.
Commit #5b3a4a9 caused the "name" variable to be cleared if
declaration and assignment is on two lines so put it back
so it's on one line for it to work again.
pvmove began processing tags unintentionally from commit,
6d7dc87cb pvmove: use toollib
pvmove works on a single PV, but tags can match multiple PVs.
If we allowed tags, but processed only the first matching PV,
then the resulting PV would be unpredictable.
Also, the current processing code does not allow us to simply
report an error and do nothing if more than one PV matches the tag,
because the command starts processing PVs as they are found,
so it's too late to do nothing if a second PV matches.
If configuration consists of several sources in config cascade
("config cascade" defined in man lvmconfig(8)), lvmconfig displayed
only difference from defaults of the topmost config in the cascade.
Fix lvmconfig to display complete difference, considering all
the configuration in the cascade.
For example, before this patch:
(use_lvmetad=0 set in lvm.conf which differs from defaults)
$ lvmconfig --type diff
global {
use_lvmetad=0
}
(compact_output=1 set on cmd line)
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
(headings=0 set in profile)
$ lvmconfig --type diff --commandprofile test
report {
headings=0
}
(difference in topmost configuration source is displayed)
$ lvmconfig --type diff --commandprofile test --config report/compact_output=1
report {
compact_output=1
}
With this patch applied (the config cascade is merged before looking for
difference from defaults in configuration):
$ lvmconfig --type diff
global {
use_lvmetad=0
}
$ lvmconfig --type diff --config report/compact_output=1
report {
compact_output=1
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test
report {
headings=0
}
global {
use_lvmetad=0
}
$ lvmconfig --type diff --profile test --config report/compact_output=1
report {
headings=0
compact_output=1
}
global {
use_lvmetad=0
}
Treat loop device created with 'losetup -P' as regular
partitioned device - so if it has partition table,
prevent its usage in commands like 'pvcreate'.
Before 'pvcreate /dev/loop0' could have erased and formated as PV,
after this patch, device is filtered out and cannot be used.
All the variables for sscanf in lvmlockctl.c and lvmlockd-sanlock.c are
zeroed before sscanf call so the failure is detected by seeing the zero
value instead of proper one in subsequent code - so use (void) for
sscanf calls to ignore return value here.
Before this fix, when reporting 'lvm devtypes', the report was
initialized with incorrect reserved values - the ones used for
pvs/vgs/lvs report were used instead of NULL value (because devtypes
doesn't have any reserved values).
For example, trying to (incorrectly) use lv_name for the -S|--select
with lvm devtypes which doesn't have this field at all:
Before this patch (internal error issued):
$ lvm devtypes -S 'lv_name=lvol0'
Internal error: _check_reserved_values_supported: field-specific reserved value of type 0x0 for field not supported
Internal error: dm_report_init_with_selection: trying to register unsupported reserved value type, skipping report selection
DevType MaxParts Description
aoe 16 ATA over Ethernet
ataraid 16 ATA Raid
bcache 1 bcache block device cache
...
With this patch applied (correct error displayed about
unrecognized selection field):
$ lvm devtypes -S 'lv_name=lvol0'
Device Types Fields
-------------------
devtype_name - Name of Device Type exactly as it appears in /proc/devices. [string]
devtype_max_partitions - Maximum number of partitions. (How many device minor numbers get reserved for each device.) [number]
devtype_description - Description of Device Type. [string]
Special Fields
--------------
selected - Set if item passes selection criteria. [number]
help - Show help. [unselectable number]
? - Show help. [unselectable number]
Unrecognised selection field: lv_name
Selection syntax error at 'lv_name=lvol0'.
Use 'help' for selection to get more help.
Convert fields into using a single status ioctl call per LV.
This is a bit tricky since when there are more complicated
stacks, at this moment its undefined which values should be shown.
It's clear we need to cache more then single ioctl per LV,
but also we need to define more explicitely relation between
reported values for snapshots.
This patch is not a final state, rather a transitional step.
It should not be giving more 'worst' values then previous
many-ioctl-calls-per-lv solution.
Add function to obtain percentage value for cache lv_seg_status.
This API is rather evolving 'middle' step as the ultimate goal
is segment API fuctionality.
But first we need to be clear at reporting level which values
are needed to be reported for which LVs and segments.
Add more code to properly store status for snapshot segment
maintaining lvm2 fiction of COW and snapshot internal volumes.
The key issue here is however not though-through reporting
logic - as there is no single answer for whole line state.
It not counting with layer and we may need few more ioctl to
cover all reporting needs depending upon what is actually
needed.
In reality we need to 'cache' more ioctl status queries for
individual LVs and their segments (so they checked at most once).
The other 'hard' topic for conversion is mirror segment handling.
Also we definitelly need to relocate some logic into segment's methods,
yet it might be complex as we have not clear border between targets.
TODO: define more clearly how are reporting fields defined in case
we 'stack' volumes like - cache of stacked thin LV snapshot origin.
lv_refresh_suspend_resume() has escaped with fail ret code
after failing suspend and could have left many volumes in suspend state.
So always unconditionally call resume also when suspend has failed.
To get better control when flushing is used add extra arg when
setting up dm task.
By default now check dm device status without flush.
(At this moment this should effect only thin and cache volumes).
Also switch dev_manager_thin_pool_status() to use more
readable 'flush' parameter instead of 'no_flush'.
Check first the LV is cow before even checking it's a merging COW.
Note: previosly merging_cow was also merging origin, so without
this explicit check it used to return '1' also when passed
LV has been merging origin.
When mirror/raid called copy_percent function to return,
when 100% was supposed to be returned, wrong float 100.0 value
could have been reported back instead of dm_percent_t DM_PERCENT_100.
There is broken API somewhere, since the function here rely on
actively being modifid VG content even when doing 'lvs' operation.
(extents_copies)
This reverts commit 8fd886f735.
This was a deliberate omission because logging token-by-token metadata
parsing greatly increases the amount of logging for hardly any benefit.
In general, only LVM config file settings need to be logged, and in
places where it's considered important to log particular elements of
metadata that should be done using specific log_* lines.
This area can be revisited.
In the same way that process_each_vg() can be passed
a single VG name to process, also allow process_each_lv()
to be passed a single VG name and LV name to process.
This refactors the code for autoactivation. Previously,
as each PV was found, it would be sent to lvmetad, and
the VG would be autoactivated using a non-standard VG
processing function (the "activation_handler") called via
a function pointer from within the lvmetad notification path.
Now, any scanning that the command needs to do (scanning
only the named device args, or scanning all devices when
there are no args), is done first, before any activation
is attempted. During the scans, the VG names are saved.
After scanning is complete, process_each_vg is used to do
autoactivation of the saved VG names. This makes pvscan
activation much more similar to activation done with
vgchange or lvchange.
The separate autoactivate phase also means that if lvmetad
is disabled (either before or during the scan), the command
can continue with the activation step by simply not using
lvmetad and reverting to disk scanning to do the
activation.
Add support for active cache LV.
Handle --cachemode args validation during command line processing.
Rework some lvm2 internal to use lvm2 defined CACHE_MODE enums
indepently on libdm defines and use enum around the code instead
of passing and comparing strings.
Don't use lvm_init() to create a full command context, which
does a lot of command setup (like connecting to daemons), which
is unnecessary for simply reading a value from lvm.conf.
Passing a NULL context arg to the lvm_config_ function is now
allowed, in which case lvm.conf is read without doing lvm
command setup.
A program may be using liblvm2app for simply checking a config
setting in lvm.conf. In this case, a full lvm context is not
needed, only cmd->cft (which are the config settings read from
lvm.conf).
lvm_config_find_bool() can now be passed a NULL lvm context
in which case it will only create cmd->cft, check the config
setting asked for, and destroy the cmd.
Only call lvm_init() when it's needed so that simply
loading the lvm python code in another program doesn't
make that program do lvm initialization.
The version call doesn't need a handle.
The garbage collection can just do lvm_quit to destroy
the command. The next call that needs lvm_init will
do it first.
When setting up a toolcontext, the lib init function
was detecting an error when there was none, and then
it was returning an incompletely initialized cmd struct
instead of NULL. The effect was that the lib would try
to use the uninitialized cmd struct and segfault.
This would happen if a non-fatal error occurred during
cmd setup, e.g. user permission failed on lvmetad socket,
causing cmd to fall back to scanning and not use lvmetad.
The only real error condition is when create_toolcontext
returns NULL. If cmd is returned, the lib can use it.
The _report fn is getting big - separate it in two:
- _report fn to get all the options and arguments
- _do_report fn for reporting itself
Also, place all the variables/arguments in one structure for easier
handling of the variables around.
If a command begins repopulating the lvmetad cache,
and fails part way through, it should set the disabled
state in lvmetad so other commands don't use bad data.
If a subsequent scan succeeds, the disabled state is
cleared.
If duplicate devices exist for a PV, and one device's
size matches the PV size, but the other doesn't, then
prefer the matching device.
If one device is used by an active LV, prefer that device.
When there are duplicate devices for a PV, one device
is preferred and chosen to exist in the VG. The other
devices are not used by lvm, but are displayed by pvs
with a new PV attr "d", indicating that they are
unchosen duplicate PVs.
The "duplicate" reporting field is set to "duplicate"
when the PV is an unchosen duplicate, and that field
is blank for the chosen PV.
Previously, duplicate PVs were processed as a side effect
of processing the "chosen" PV in lvmcache. The duplicate
PV would be hacked into lvmcache temporarily in place of
the chosen PV.
In the old way, we had to always process the "chosen" PV
device, even if a duplicate of it was named on the command
line. This meant we were processing a different device than
was asked for. This could be worked around by naming
multiple duplicate devs on the command line in which case
they were swapped in and out of lvmcache for processing.
Now, the duplicate devs are processed directly in their
own processing loop. This means we can remove the old
hacks related to processing dups as a side effect of
processing the chosen device. We can now simply process
the device that was named on the command line.
When the same PVID exists on two or more devices, one device
is preferred and used in the VG, and the others are duplicates
and are not used in the VG. The preferred device exists in
lvmcache as usual. The duplicates exist in a specical list
of unused duplicate devices.
The duplicate devs have the "d" attribute and the "duplicate"
reporting field displays "duplicate" for them.
'pvs' warns about duplicates, but the formal output only
includes the single preferred PV.
'pvs -a' has the same warnings, and the duplicate devs are
included in the output.
'pvs <path>' has the same warnings, and displays the named
device, whether it is preferred or a duplicate.
Wait to compare and choose alternate duplicate devices until
after all devices are scanned. During scanning, the first
duplicate dev is kept in lvmcache, and others are kept in a
new list (_found_duplicate_devs).
After all devices are scanned, compare all the duplicates
available for a given PVID and decide which is best.
If the dev used in lvmcache is changed, drop the old dev
from lvmcache entirely and rescan the replacement dev.
Previously the VG metadata from the old dev was kept in
lvmcache and only the dev was replaced.
A new config setting devices/allow_changes_with_duplicate_pvs
can be set to 0 which disallows modifying a VG or activating
LVs in it when the VG contains PVs with duplicate devices.
Set to 1 is the old behavior which allowed the VG to be
changed.
The logic for which of two devs is preferred has changed.
The primary goal is to choose a device that is currently
in use if the other isn't, e.g. by an active LV.
. prefer dev with fs mounted if the other doesn't, else
. prefer dev that is dm if the other isn't, else
. prefer dev in subsystem if the other isn't
If neither device is preferred by these rules, then don't
change devices in lvmcache, leaving the one that was found
first.
The previous logic for preferring a device was:
. prefer dev in subsystem if the other isn't, else
. prefer dev without holders if the other has holders, else
. prefer dev that is dm if the other isn't
When duplicate PVs are detected, set the disabled
flag so that commands will disable use of lvmetad.
This duplicate detection is done by lvmetad itself
when it's told about a single new PV with a PVID
that matches an existing PV on another device.
(This is different from the case where the command
is scanning all devices and detects the duplicate.)
Remove the "altdev" logic that attempted to keep
track of multiple devices for a single PV. It
is no longer used since lvmetad is disabled in
this case.
For raid1 use chunksize as bitmap-chunk specification.
Always enforce usage of bitmap - getting comparable outcome
as lvm2 raid support uses.
Add udev_wait after stopping md array - as in fact leg-device
are still in use by target even command has finished.
(mdadm --stop causes WATCH rule wakeup, and
ioctl(STOP_ARRAY) returns IMHO to early - it should finish
and fsync work on leg devices first).
Support parsing --chunksize option also when converting.
Now user can use cache pool created with i.e. 32K chunksize,
while in caching user can select 512K blocks.
Tool is supposed to validate cache metadata size is big enough
to support such chunk size. Otherwise error is shown.
When creating LV - in some case we change created segment type
(ATM for cache and snapshot) and we then manipulate with
lv segment according to 'lp' segtype.
Fix this by checking for proper type before accessing segment members.
This makes command like:
lvcreate --type cache-pool -L10 vg/cpool
lvcreate -H -L10 --cachesettings migtation_threshold=10000 vg/cpool
to pass since now tool correctly selects default cache policy.
Rather than doing repeated translations from name to
device when comparing args to existing PVs, do one
translation of the arg names and saving the device,
before checking existing PVs.
If there's an activation volume_filter, it might not be possible
to activate the rmeta LVs to wipe them. At least inherit any
LV tags from the parent LV while attempting this.
pvscan autoactivation has its own VG processing implementation,
so it can't properly handle things like foreign or shared VGs,
so make it ignore those VG types (or errors from them) as best
as possible.
Add a FIXME stating that pvscan autoactivation must really be
moved to the standard VG processing by calling process_each_vg
to do activation once the scanning / cache update is finished.
Checking for devices uses is_missing_pv() to check
if there is a device for the PV. is_missing_pv()
is based on the MISSING_PV flag, which does not
always correspond to !pv->dev. When using lvmetad,
a command like:
pvs --config 'devices/filter=["a|/dev/sdb|", "r|.*|"]'
will cause a number of PVs to have NULL pv->dev, but
not the MISSING_PV flag. So, NULL pv->dev needs to
also be checked.
Before executing modprobe for given module name, just check
if the module is not already present in /sys/module.
Useful when checking dm-cache-policy modules as we do not
having matching interface like for targets.
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Probably not worth mentioning "segments" here, just state that devices
for an LV can't be all found during the check - it's less mysterious for
user then:
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find all devices for LV fedora/root while checking used and assumed devices.
WARNING: Couldn't find all devices for LV fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
When checking assumed PVs against real devices used for LVs and if
there's no device assigned for an assumed PV (e.g. due to filters),
do log_warn instead of log_error and continue checking LV segments
and associated assumed PVs further, just like we do log_warn elsewhere
in this situation.
This way user will see the warning for each LV which couldn't be
checked completely against real PVs used. Before, we logged only
the very first occurence of missing device for an LV in a VG and we
returned from the function doing this check for all the LVs in VG
immediately which may be a bit misleading because it didn't tell
user about all the other LVs and whether they could be checked
or not.
For example, we have this setup:
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
/dev/vda2 fedora lvm2 a-- 19.49g 0
[0] fedora/~ # lvs -o+devices
LV VG Attr LSize Devices
root fedora -wi-ao---- 19.00g /dev/vda2(0)
swap fedora -wi-ao---- 500.00m /dev/vda2(4864)
Before this patch (only the very first LV in a VG is logged to have a
problem while checking used and assumed devices):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
With this patch applied (all LVs where we hit problem while checking
used and assumed devices are logged and it's warning, not error):
[0] fedora/~ # pvs --config 'devices/filter=["a|/dev/sda|", "r|.*|"]'
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Device for PV Qcxpcy-XgtP-UD3s-PmG0-qLyE-Z0ho-DYsxoz not found or rejected by a filter.
WARNING: Couldn't find device for segment belonging to fedora/root while checking used and assumed devices.
WARNING: Couldn't find device for segment belonging to fedora/swap while checking used and assumed devices.
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
[unknown] fedora lvm2 a-m 19.49g 0
Check that @stats_list and @stats_print returned data in the
_stats_parse_list() and _stats_parse_region() functions before
attempting to operate on region and area values.
This avoids a coverity warning since fgets() could potentially
return no data from the memory buffer returned by the ioctl.
In both cases the ioctl would return an error, preventing these
functions from running, however it is cleaner to test for the
condition explicitly and fail in those cases.
If lvmetad is running, and a command opts to not use it
(--config global/use_lvmetad=0), and the command changes
metadata, then the metadata change is not visible to
lvmetad. Subsequent commands using lvmetad to change
metadata may cause corruption based on the invalid
lvmetad state.
Eventually we can set the disabled state in lvmetad
to prevent this problem, but for now print a warning
about the possibility.
When command is not using lvmetad because
use_lvmetad=0 in the config, but the lvmetad
pidfile exists, print a warning (previously
this checked for the socket existing instead
of the pidfile existing.)
vg/snapshotN should not appear anywhere.
No code should be showing this, but it was noticed in some logs last
week and we can deal with it in display_lvname().
When user requested on cmdline disabling of lvmetad/lvmpoll,
respect it and when lvmlockd requires these daemon,
Error configure with clear message about misconfiguration.
The lvmetad connection is created within the
init_connections() path during command startup,
rather than via the old lvmetad_active() check.
The old lvmetad_active() checks are replaced
with lvmetad_used() which is a simple check that
tests if the command is using/connected to lvmetad.
The old lvmetad_set_active(cmd, 0) calls, which
stopped the command from using lvmetad (to revert to
disk scanning), are replaced with lvmetad_make_unused(cmd).
The test was a weak attempt at verifying the special
combination of lvchange/vgchange -aay --sysinit, but
was only looking for lvmetad connection warnings.
Update the warning checks, and check the LV activation
state directly which is the main point.
Rename the test to reflect its purpose of checking
the -aay --sysinit combination.
Update the check about lvmetad running but not used.
Also add tests related to the new lvmetad disabled state.
lvm1 metadata is used here to test the disabled state
because lvm1 metadata is the first condition using the
disabled state.
After a device rescan that repopulates lvmetad,
if no reason for disabling lvmetad was seen
(lvm1 metadata or duplicate PVs), then clear
the disabled flag in lvmetad. This allows
commands to resume using the lvmetad cache
after the cause for disabling it has been removed.
Commands already check if the lvmetad token is valid,
and if not, they rescan devices to repopulate lvmetad
before running. Now, in addition to checking the
lvmetad token, they also check if the lvmetad disabled
flag is set. If so, they do not use the lvmetad cache
and revert to disk scanning.
A global flag in lvmetad indicates it has been disabled.
Other flags indicate the reason it was disabled.
These flags can be queried using get_global_info.
The lvmetactl debugging utility can set and clear the
disabled flag in lvmetad. Nothing else sets the
disabled flag yet.
Commands will check these flags after connecting to
lvmetad. If the disabled flag is set, the command
will not use the lvmetad cache, but revert to disk
scanning.
To test this feature:
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 0
disable_reason = "none"
token = "filter:3041577944"
$ vgs
(should report VGs from lvmetad)
$ lvmetactl set_global_disable 1
$ lvmetactl get_global_info
response = "OK"
global_invalid = 0
global_disable = 1
disable_reason = "DIRECT"
token = "filter:3041577944"
$ vgs
WARNING: Not using lvmetad because the disable flag was set directly.
(should report VGs without contacting lvmetad)
$ lvmetactl set_global_disable 0
$ vgs
(should report VGs from lvmetad)
process_each_pv was doing:
1. lvmcache_seed_infos_from_lvmetad()
sends pv_list request to lvmetad.
2. get_vgnameids()
sends vg_list request to lvmetad.
3. _get_all_devices()
first calls lvmcache_seed_infos_from_lvmetad(),
which is a no-op if it's already been called.
Because get_vgnameids() does not use the information
from lvmcache_seed_infos_from_lvmetad(), it does not
need to be called prior to get_all_devices where
it is actually needed.
Improve code for snapshot merge for readabilty
and also reduce number of tests needed to decide
if merging can or cannot be started.
(Further improving 9cccf5245a)
When dm_tree_find_node_by_uuid() fails to find passed uuid,
report in lof_debug the complete original uuid,
not the one stripped of LVM- prefix.
TODO: inspect manipulation with LVM- prefix here.
To recognize in runtime if we are merging or not
to make consistent decision between suspend and resume
add function to parse thin table line when add
merging thin device to the table.
A snapshot merge into its origin cannot be initiated while the devices
are in use. If there is outside interference (such as from udev),
the suspend (preload) and resume stages can reach conflicting decisions
about whether or not to proceed.
Try to make the logic more robust by checking the inactive or live
table during resume. (This is still not perfect.)
Commit 971ab733b7 ("thin: activation of
merging thin snapshot") also added an incorrect deactivation attempt
for non-thin LVs: find_snapshot(lv)->lv is not designed to be
activated and any attempt to deactivate it is incorrect.
Move checking the lvmetad state, and the possible rescan,
out of lvmetad_send() to the start of the command.
Previously, the token mismatch and rescan would occur
within lvmetad_send() for some other request. Now,
the token mismatch is detected earlier, so the
rescan can be done before the main command is in
progress. Rescanning deep within the processing of
another command will disturb the lvmcache state of
that other command.
A rescan already exists at the start of the command
for the case where foreign VGs are going to be read.
This same rescan is now also performed when there is
an lvmetad token mismatch (from a changed global_filter).
The commands pvscan/vgscan/lvscan/vgimport are excluded
from this preemptive checking/rescanning for lvmetad
because they want to do rescanning themselves explicitly.
If rescanning devices fails, then lvmetad has not been
correctly repopulated and should not be used, so make
the command revert to not using lvmetad.
When not obtaining device from udev, we are doing deep devdir scan,
and at the same time we try to insert everything what /sys/dev/block
knows about. However in case lvm2 is configured to use nonstardard
devdir this way it will see (and scan) devices from a real system.
lvm2 test suite is using its own test devdir with its
own device nodes. To avoid touching real /dev devices, validate
the device node exist in give dir and do not insert such device
into a cache.
With obtain list from udev this patch has no effect
(the normal user path).
We have _insert_dirs() for udev and non-udev compilation.
Compiling without udev missed to call dev_cache_index_devs().
Move the call after _insert_dirs() call so both compilation
gets it.
When scanning if device is being usable as PV,
we call STATUS - but this status should not cause
any flushing.
Skip also open_count information as it's not needed.
We don't have any report field of this type yet. Return this patch into
the play if we really need that. Currenly we always report status
(result of "status" dm ioctl) for an LV as a whole where we choose
segment which represents the LV, not calling status for each possible
segment it contains - we don't need this now so I'm removing it to
not make the code more complex uselessly.
Devices without "LVM-" uuid prefix have been generated by very old
version of lvm2 2.00 and 2.01.
Since version 2.02 all lvm2 devices are using prefix "LVM-".
However checking for present of ancient non prefixed devices does
take extra IOCTL per every call and for majority of todays user
it will not find anything new.
So use the assumption that users with kernel 3.X and newer are not
really using such old versions of lvm2 (year <2005) and with their
new kernel they are also using new version of lvm2 and skip
checking for them.
This change also makes trace logs more readable.
This is hotfix for RHBZ: https://bugzilla.redhat.com/1324537
However already the %FREE is not a good fit and we need something
better. Meanwhile make -l%PVS work at least as good as %FREE
for thin-pool.
TODO: this needs rework - it should be allocator to do all the size
decisions at one place.
Improved test script to verify lost mirror image does not
cause mirror corruption while mirror is in use.
TODO: add more cases (lost mlog...), lost image from 3leg mirror...
When leaving preload routine it should not altet state of flush required
when it's been already set to 1 and drop it to 0.
The API here is unclean, but current usage expects the same
variable pointer is for all preload calls and combines 'flush_required'
across all of them.
Commit 844b009584 tried to move
limit for usage of noflush to 'preload' however this was not
correctly processed.
Intead explicitly check for which types we do not want noflush
and also add debug message in this case.
Fix regression caused by commit ba41ee1dc9.
The idea was to use no_flush for changed device only for thin volumes
and thin pools but also to merge this with change made in commit
844b009584.
However the resulting condition has caused misbehavior for the mirror
suspend - as that has been before the ONLY allowed target type
that could have been suspended with noflush.
Result was badly working repair for --type mirror that has been
passing 'flush' to the repaired mirror target whenever preload
wrongly set flush_required.
The origin code has required the flush_required to be set whenever
deivce size is changed.
Now it first detects if device size got smaller
'dm_tree_node_size_changed(root) < 0' - this requires flush.
Otherwise it checks device is not thin volume nor thin pool and its
size has changed (got bigger) and requires flush.
This mean upsize of thin-pool or thin volume will not require flush.
/sys/dev/block is available since kernel version 2.2.26 (~ 2008):
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-dev
The VGID/LVID indexing code relies on this feature so skip indexing
if it's not available to avoid error messages about inability to open
/sys/dev/block directory.
We're not going to provide fallback code to read the /sys/block/
instead in this case as that's not that efficient - it needs extra
reads for getting major:minor and reading partitions would also
pose further reads and that's not worth it.
If compiling without udev_sync support, udev_get_library_context simply
returns NULL so we don't need to remember putting ifdef UDEV_SYNC_SUPPORT
in the code all the time we just need to check whether there's any udev
context initialized or not.
If obtain_device_list_from_udev=0, LVM can make use of persistent .cache
file. This cache file contains only devices which underwent filters in
previous LVM command run. But we need to iterate over all block devices
to create the VGID/LVID index completely for the device mismatch check
to be complete as well.
This patch iterates over block devices found in sysfs to generate the
VGID/LVID index in dev cache if obtain_device_list_from_udev=0
(if obtain_device_list_from_udev=1, we always read complete list of
block devices from udev and we ignore .cache file so we don't need
to look in sysfs for the complete list).
For the case when we print device name associated with struct device
that was not found in /dev, but in sysfs, for example when printing
devices where LV device mismatch is found.
Use meta% to expose highest mapped sector in thinLV.
so showing there 100.00% means thinLV maps latest sector.
Currently using a 'trick' with total_numerator to pass-in
device size when 'seg==NULL'
TODO: Improve device status API per target - current 'percentage'
is not really extensible.
Previous fix missed the fact the we do query for 'percent' with
seg value either set or unset (API overload...)
When 'seg' was unset, we still issue flush with status.
Fix it by cheking segtype by target_type.
As we check for segtype - we could also skip whole percentage
if the 'segtype' is unknown by code directly.
Reported-by: Ming-Hung Tsai <mingnus gmail com
It's correct to have a DM device that has no DM UUID assigned
so no need to issue error message in this case. Also, if the
device doesn't have DM UUID, it's also clear it's not an LVM LV
(...when looking for VGID/LVID while creating VGID/LVID indices
in dev cache).
For example:
$ dmsetup create test --table "0 1 linear /dev/sda 0"
And there's no PV in the system.
Before this patch (spurious error message issued):
$ pvs
_get_sysfs_value: /sys/dev/block/253:2/dm/uuid: no value
With this patch applied (no spurious error message):
$ pvs
If we're using persistent .cache file, we're reading this file instead
of traversing the /dev content. Fix missing indexing by VGID and LVID
here - hook this into persistent_filter_load where we populate device
cache from persistent .cache file instead of scanning /dev.
For example, inducing situation in which we warn about different device
actually used than what LVM thinks should be used based on metadata:
$ lsblk -s /dev/vg/lvol0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol0 253:4 0 124M 0 lvm
`-loop1 7:1 0 128M 0 loop
$ lvmconfig --type diff
global {
use_lvmetad=0
}
devices {
obtain_device_list_from_udev=0
}
(obtain_device_list_from_udev=0 also means the persistent .cache file is used)
Before this patch - pvs is fine as it does the dev scan, but lvs relies
on persistent .cache file and it misses the VGID/LVID indices to check
and warn about incorrect devices used:
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
With this patch applied - both pvs and lvs is fine - the indices are
always created correctly (lvs just an example here, other LVM commands
that rely on persistent .cache file are fixed with this patch too):
$ pvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
PV VG Fmt Attr PSize PFree
/dev/loop0 vg lvm2 a-- 124.00m 0
$ lvs
Found duplicate PV B9gXTHkIdEIiMVwcOoT2LX3Ywh4YIHgR: using /dev/loop0 not /dev/loop1
Using duplicate PV /dev/loop0 without holders, ignoring /dev/loop1
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/loop1 instead of /dev/loop0.
LV VG Attr LSize
lvol0 vg -wi-a----- 124.00m
It's possible that while a device is already referenced in sysfs, the node
is not yet in /dev directory.
This may happen in some rare cases right after LVs get created - we sync
with udev (or alternatively we create /dev content ourselves) while VG
lock is held. However, dev scan is done without VG lock so devices may
already be in sysfs, but /dev may not be updated yet if we call LVM command
right after LV creation (so the fact that fs_unlock is done within VG
lock is not usable here much). This is not a problem with devtmpfs as
there's at least kernel name for device in /dev as soon as the sysfs
item exists, but we still support environments without devtmpfs or
where different directory for dev nodes is used (e.g. our test suite).
This patch covers these situations by tracking such devices in
_cache.sysfs_only_names helper hash for the vgid/lvid check to work still.
This also resolves commit 6129d2e64d
which was then reverted by commit 109b7e2095
due to performance issues it may have brought (...and it didn't resolve
the problem fully anyway).
dmeventd daemon may call further code itself that looks at /dev, e.g.
via dmeventd_lvm2_command call. We need to have a consistent view of
the /dev content at that time. Therefore, sync /dev content before
calling monitoring hook which contacts dmeventd.
This problem was quite hidden before, but now it has manifested itself
because of recent additions to dev-cache code where we started looking
at device holders as seen in sysfs. What happened here was that the
device was already in sysfs, but not yet under /dev and this triggered
the new error message sometimes:
log_error("%s: failed to find associated device structure for holder %s.", devname, devpath);
This problem has manifested recently in our api/pytest.sh test from
testsuite where we create thin pool LVs and thin LVs and hence it also
causes dmeventd to be used as well and these error messages were
visible there.
UUID for LV is either "LVM-<vg_uuid><lv_uuid>" or "LVM-<vg_uuid><lv_uuid>-<suffix>".
The code before just checked the length of the UUID based on the first
template, not the variant with suffix - so LVs with this suffix were not
processed properly.
For example a thin pool LV (as an example of an LV that contains
sub LVs where UUIDs have suffixes):
[0] fedora/~ # lsblk -s /dev/vg/lvol1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vg-lvol1 253:8 0 4M 0 lvm
`-vg-pool-tpool 253:6 0 116M 0 lvm
|-vg-pool_tmeta 253:2 0 4M 0 lvm
| `-sda 8:0 0 128M 0 disk
`-vg-pool_tdata 253:3 0 116M 0 lvm
`-sda 8:0 0 128M 0 disk
Before this patch (spurious warning message about device mismatch):
[0] fedora/~ # pvs
WARNING: Device mismatch detected for vg/lvol1 which is accessing /dev/mapper/vg-pool-tpool instead of (null).
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
With this patch applied (no spurious warning message about device mismatch):
[0] fedora/~ # pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 0
To help out with debug, when an exception is thrown in the dbus service we
will dump all the information we have on the last 16 commands that were
executed along with the stack strace.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
With commit 2d5dc6512e the dbus server
no longer needs to utilize udev to know when to update its internal
state.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Check if the value we read from sysfs is not blank and replace the '\n'
at the end only when needed ('\n' should usually be there for sysfs values,
but better check this).
It's possible for an LVM LV to use a device during activation which
then differs from device which LVM assumes based on metadata later on.
For example, such device mismatch can occur if LVM doesn't have
complete view of devices during activation or if filters are
misbehaving or they're incorrectly set during activation.
This patch adds code that can detect this mismatch by creating
VG UUID and LV UUID index while scanning devices for device cache.
The VG UUID index maps VG UUID to a device list. Each device in the
list has a device layered above as a holder which is an LVM LV device
and for which we know the VG UUID (and similarly for LV UUID index).
We can acquire VG and LV UUID by reading /sys/block/<dm_dev_name>/dm/uuid.
So these indices represent the actual state of PV device use in
the system by LVs and then we compare that to what LVM assumes
based on metadata.
For example:
[0] fedora/~ # lsblk /dev/sdq /dev/sdr /dev/sds /dev/sdt
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdq 65:0 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev1 253:3 0 104M 0 mpath
sdr 65:16 0 104M 0 disk
`-mpath_dev1 253:3 0 104M 0 mpath
sds 65:32 0 104M 0 disk
|-vg-lvol0 253:2 0 200M 0 lvm
`-mpath_dev2 253:4 0 104M 0 mpath
sdt 65:48 0 104M 0 disk
`-mpath_dev2 253:4 0 104M 0 mpath
In this case the vg-lvol0 is mapped onto sdq and sds becauset this is
what was available and seen during activation. Then later on, sdr and
sdt appeared and mpath devices were created out of sdq+sdr (mpath_dev1)
and sds+sdt (mpath_dev2). Now, LVM assumes (correctly) that mpath_dev1
and mpath_dev2 are the PVs that should be used, not the mpath
components (sdq/sdr, sds/sdt).
[0] fedora/~ # pvs
Found duplicate PV xSUix1GJ2SK82ACFuKzFLAQi8xMfFxnO: using /dev/mapper/mpath_dev1 not /dev/sdq
Using duplicate PV /dev/mapper/mpath_dev1 from subsystem DM, replacing /dev/sdq
Found duplicate PV MvHyMVabtSqr33AbkUrobq1LjP8oiTRm: using /dev/mapper/mpath_dev2 not /dev/sds
Using duplicate PV /dev/mapper/mpath_dev2 from subsystem DM, ignoring /dev/sds
WARNING: Device mismatch detected for vg/lvol0 which is accessing /dev/sdq, /dev/sds instead of /dev/mapper/mpath_dev1, /dev/mapper/mpath_dev2.
PV VG Fmt Attr PSize PFree
/dev/mapper/mpath_dev1 vg lvm2 a-- 100.00m 0
/dev/mapper/mpath_dev2 vg lvm2 a-- 100.00m 0
If we're using non-standard /dev layout so we can't get the dm-X name
easily, we can't also look at the /sys/blocl/dm-X/dev to get the major:minor
pair. Use "stat" in this case even though it triggers automounts
(but there's no better way for now).
The /proc/self/mountinfo is not bound to device names like /proc/mounts
and it uses major:minor pairs instead.
This fixes a situation in which a volume is mounted and then renamed
later on - that makes /proc/mounts unreliable when detecting mounted
volumes.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1196910.
Commit b64703401d cause regression
when handling stacked resize of pool metadata volume that would
be a raid LV.
Fix it by properly setting up size also for layer extension.
Allow pvchange and pvresize to process exported VGs,
and have them check for the exported state in their
single function.
Previously, the exported VG state would trigger a
failure in vg_read()/ignore_vg() because the VGs are
being read with READ_FOR_UPDATE. Because these commands
read all VGs to search for the intended PVs, any
exported VG would trigger a failure, even if it was
not related to the intended PV.
Use <> around user entered option parameters to make it visually
different (just like we use Italic style in man page).
TODO:
In the future we should consistently provide such notation and
possibly generate it automagically from some internal data structures.
Preferably for man pages as well so we report actual set of supported
options.
Recent kernel (4.4) start to report values smaller then sector size
(but in reporting size for SSD which support data zeroing on discard).
For now log warning and assume it really means 1 sector.
Addressing RHBZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1313377
Continuing with lvconvert style updates.
Minimize usage of '\:' as zero-width break space, since html renderers
do not handle this groff sign well (some of them seems to even render
':' there).
But since we do not want see badly rendered man pages itself, prefer
the 'man visual' appearence over html page generation (man2html should
be actually fixed).
Start to use 'user input option' with Capitals more consistently.
Fixing vpath usage as it has been checking for existance of
generated file also in the $(scrdir) e.g.:
No need to remake target '10-dm.rules.in'; using VPATH name '...'
If the $(srcdir) had been also $(builddir) and contained already
generated rules file, it's been used instead generating new
one.
(See: http://make.mad-scientist.net/papers/how-not-to-use-vpath/)
Commit abd9618dd8 tried to improve
parsing of vg name from logical path - but still missed couple
corner cases.
This patch further improves the logic and reuses
validate_lvname_param() for parsing of lv_name.
Also explicitly checks for LVM_VG_NAME in the right case.
So now also properly parses cases like:
'lvconvert --repairt vg/'
and will provide correct error message.
There's a window between doing VG read and checking PV device size
against real device size. If the device is removed in this window,
the dev cache still holds struct device and pv->dev still references
that and that PV is not marked as missing. However, if we're trying
to get size for such device, the open fails because that device
doesn't exists anymore.
We called existing pv_dev_size in _check_pv_dev_sizes fn. But
pv_dev_size assigned a size of 0 if the dev_get_size it called failed
(because the device is gone).
So call the dev_get_size directly and check for the return code
in _check_pv_dev_sizes and go further only if we really know the
device size. This is to avoid confusing warning messages like:
Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
One or more devices used as PVs in VG helter_skelter have changed sizes.
While running on F24 a number of warnings were being emitted from using the
deprecated GObject instead of GLib. Tested on python 3.4 and 3.5.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Python 3.5 in F24 was throwing the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/lvmdbusd/main.py", line 73, in process_request
req.run_cmd()
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 73, in run_cmd
self.register_error(-1, st)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 123, in register_error
self._reg_ending(None, error_rc, error)
File "/usr/lib/python3.5/site-packages/lvmdbusd/request.py", line 115, in _reg_ending
self.cb_error(self._rc_error)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 669, in <lambda>
keywords[error_callback] = lambda exception: _method_reply_error(connection, message, exception)
File "/usr/lib64/python3.5/site-packages/dbus/service.py", line 293, in _method_reply_error
exception))
File "/usr/lib64/python3.5/traceback.py", line 136, in format_exception_only
return list(TracebackException(etype, value, None).format_exception_only())
File "/usr/lib64/python3.5/traceback.py", line 442, in __init__
if (exc_value and exc_value.__cause__ is not None
AttributeError: 'str' object has no attribute '__cause__'
This was caused because we were calling the dbus error callback with a
string instead of an actual exception. On python 3.4 this was apparently
OK, but not with 3.5. Corrected to pass the exception to error callback.
Change tested on both python 3.4 and 3.5.
Reported-by: Vratislav Podzimek <vpodzime@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Specifying an output width of 0 now leads to a default minimum width
taken from the width of the column heading. (Most fields should use
this.)
Components of field names are generally separated by underscores (which
are optional at run-time). (Dropped 3 duplicate fields now covered by
this rule.)
Each field heading must be unique and generally doesn't have spaces
between words (which get capitalised) unless they are already short and
the fields are normally longer or clarity demands it.
With the recent conversion of pvcreate/pvremove to the
common toollib processing function, skipping in-use PVs
in _process_pvs_in_vg prevented them from being protected
as intended by the in-use flag.
The processing code for pvcreate/pvremove checks for the
in-use state itself and prevents using an in-use PV.
If a PV is skipped, it looks like an unused device and
is not protected from being used in pvcreate/pvremove.
This command option can be used to trigger a D-Bus
notification independent of the usual notifications
that are sent from other commands as an effect of
changes to PV/VG/LV state. If lvm is not built with
dbus notification support or if notify_dbus is disabled
in the config, this command will exit with an error.
When a command modifies a PV or VG, or changes the
activation state of an LV, it will send a dbus
notification when the command is finished. This
can be enabled/disabled with a config setting.
The code in _print_historical_lv function works with temporary
"descendants_buffer" that is allocated and freed within this
function.
When printing text out, we used "outf" macro which called
"out_text" fn and it checked return value and if failed,
the macro called "return_0" automatically. But since we
use the temporary buffer, if any of the out_text calls
fails, we need to deallocate this buffer properly - that's
the "goto_out", otherwise we'll be leaking memory.
So add new "outfgo" helper macro which does the same as "outf",
but it calls "goto_out" instead of "return_0" so we can jump
to a cleanup hook at the end.
- stdout/stderr for test output
- debug.log* for daemon output
- use install instead of cp for DBus config
- use system bus
- DBus reloads on new file. Better having it created with correct
permissions.
Notes:
- Squashed commits from mcsontos development branch
- Still disabled at this time
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Using lvm1 metadata with lvmetad is not generally allowed,
but nothing has prevented creating new lvm1 metadata with
lvmetad (missing error checking in pvcreate/vgcreate.)
Various tests are using lvm1 with lvmetad and happen to
work because of the missing error checks.
This commit fixes the tests so they won't fail when the
lvm1/lvmetad error checking is fixed. A new variable
LVM_TEST_LVM1 is defined and is used in the scripts to
decide if lvm1 metadata should be tested. LVM_TEST_LVM1
is not defined when lvmetad is being tested, and the
combination of LVM_TEST_LVM1 and LVM_TEST_LVMETAD can
be used to verify the desired lvmetad+lvm1 behavior.
Historical LV is valid as long as there is at least one live LV among
its ancestors. If we find any invalid (dangling) historical LVs, remove
them automatically.
The vg_strip_outdated_historical_lvs iterates over the list of historical LVs
we have and it shoots down the ones which are outdated.
Configuration hook to set the timeout will be in subsequent patch.
The metadata/record_lvs_history is global switch which enables or
disables recording historical LVs in metadata.
If both metadata/record_lvs_history=1 lvm.conf option and
--nohistory command switch is used at the same time, the
--nohistory prevails.
Report proper values for historical LVs in lv_layout and lv_role fields.
Any historical LV doesn't have any layout anymore and the role is "history".
For example:
$ lvs -H -o name,lv_attr,lv_layout,lv_role vg/-lvol1
LV Attr Layout Role
-lvol1 ----h----- none public,history
The lv_full_ancestors reporting field is just like the existing
lv_ancestors field but it also takes into account any history and
indirect relations recorded.
All names for historical LVs are prefixed with '-' character to make clear
difference between live and historical LVs. The '-' can't be set by users
for live LV names during lvcreate hence we never get into a conflict with
the names that user defines for live LVs.
The lv_historical reporting field is a simple binary field that reports
whether an LV is historical one ("historical" value or value of "1" displayed)
or not (blank string "" or value of "0" displayed).
When processing LVs in a VG and when the -H|--history switch is used,
make process_each_lv_in_vg to iterate over historical volumes too.
For each historical LV, we use dummy struct logical_volume instance with
the "this_glv" reference set to a wrapper over proper struct
historical_logical_volume representation. This makes it possible to process
historical LVs just like normal live LVs (though a dummy one without any
segments and all the other fields zeroed and blank) and it also allows
for using all historical LV related information via lv->this_glv->historical
reference.
One can use a simple call to lv_is_historical to make a difference between
live and historical LV in all the code that is called by process_each_* fns.
This patch adds "include_historical_lvs" field to struct cmd_context to
make it possible for the command to switch between original funcionality
where no historical LVs are processed and functionality where historical
LVs are taken into account (and reported or processed further). The switch
between these modes is done using the '-H|--history' switch on command
line.
The include_historical_lvs state is then passed to process_each_* fns
using the "include_historical_lvs" field within struct processing_handle.
Add support for making an interconnection between thin LV segment and
its indirect origin (which may be historical or live LV) - add a new
"indirect_origin" argument to attach_pool_lv function.
Also export historical LVs when exporting LVM2 metadata.
This is list of all historical LVs listed in
"historical_logical_volumes" metadata section with all
the properties exported for each historical LV.
For example, we have this thin snapshot sequence:
lvol1 --> lvol2 --> lvol3
\
--> lvol4
We end up with these metadata:
logical_volume {
...
(lvol1, lvol3 and lvol4 listed here as usual - no change here)
...
}
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "lvol1"
descendants = ["lvol3", "lvol4"]
}
}
By removing lvol1 further, we end up with:
historical_logical_volumes {
lvol2 {
id = "S0Dw1U-v5sF-LwAb-W9SI-pNOF-Madd-5dxSv5"
creation_time = 1456919613 # 2016-03-02 12:53:33 +0100
removal_time = 1456919620 # 2016-03-02 12:53:40 +0100
origin = "-lvol1"
descendants = ["lvol3", "lvol4"]
}
lvol1 {
id = "me0mes-aYnK-nRfT-vNlV-UiR1-GP7r-ojbROr"
creation_time = 1456919608 # 2016-03-02 12:53:28 +0100
removal_time = 1456919767 # 2016-03-02 12:56:07 +0100
}
}
When an LV is being removed, we create an instance of
"struct historical_logical_volume" wrapped up in
"struct generic_logical_volume".
All instances of "struct historical_logical_volume" are then recorded in
"historical_lvs" list which is part of "struct volume_group".
The "historical LV" is then interconnected with "live LVs" to
connect a history chain for the live LV.
The add_glv_to_indirect_glvs is a helper function that registers a
volume represented by struct generic_logical_volume instance ("glv")
as an indirect user of another volume ("origin_glv") and vice versa -
it also registers the other volume ("origin_glv") as indirect_origin
of user volume ("glv").
The remove_glv_from_indirect_glvs does the opposite.
The get_or_create_glv is helper function that retrieves any existing
generic_logical_volume wrapper for the LV. If the wrapper does not exist
yet, it's created.
The get_org_create_glvl is the same as get_or_create_glv but it creates
the glv_list wrapper in addition so it can be added to a list.
Add new structures and new fields in existing structures to support
tracking history of LVs (the LVs which don't exist - the have been
removed already):
- new "struct historical_logical_volume"
This structure keeps information specific to historical LVs
(historical LV is very reduced form of struct logical_volume +
it contains a few specific fields to track historical LV
properties like removal time and connections among other LVs).
- new "struct generic_logical_volume"
Wrapper for "struct historical_logical_volume" and
"struct logical_volume" to make it possible to handle volumes
in uniform way, no matter if it's live or historical one.
- new "struct glv_list"
Wrapper for "struct generic_logical_volume" so it can be
added to a list.
- new "indirect_glvs" field in "struct logical_volume"
List that stores references to all indirect users of this LV - this
interconnects live LV with historical descendant LVs or even live
descendant LVs.
- new "indirect_origin" field in "struct lv_segment"
Reference to indirect origin of this segment - this interconnects
live LV (segment) with historical ancestor.
- new "this_glv" field in "struct logical_volume"
This references an existing generic_logical_volume wrapper for this
LV, if used. It can be NULL if not needed - which means we're not
handling historical LVs at all.
- new "historical_lvs" field in "struct volume group
List of all historical LVs read from VG metadata.
Pair kernel_cache_settings with kernel_cache_policy.
There is very small chance in error case that the value in table
might be differnet from the value stored in metadata
so make it 'checkable'.
Showing 'u' in the pv_attr reporting field is mostly unnecessary because
most PVs are allocatable, and being allocatable implies it is (u)sed,
and this is already obvious from other fields in the default 'pvs'
output like the VG name.
So move the new (u)sed pv_attr from character position 4 to 1, and only
show it in those rare cases when the PV is not (a)llocatable or the
relevant metadata is missing.
(Scripts should not be using pv_attr, but rather pv_allocatable,
pv_exported, pv_missing, pv_in_use etc.)
Helping with understanding we will not try to deref NULL pointer,
as if the sizes are initialized to NULL it also means 'mem' would
be NULL, but thats too hard to model so make it obvious.
Correct name is lvm2-lvmdbusd.service not lvmdbusd.service.
This makes the bus-activation (auto-activation) work.
Signed-off-by: Vratislav Podzimek <vpodzime@redhat.com>
Make the data_alignment variable 64 bits so it
can hold the invalid command line arg used in
pvreate-usage.sh pvcreate --dataalignment 1e.
On 32 bit arches, the smaller variable wouldn't
hold the invalid value so the error would not
trigger as expected by the test.
This simple tool calls the Manager.Refresh method on the dbus service
to check and see if the dbus service has the most up to date state.
This is to be used for testing to ensure that event driven updates are
working as planned.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When we use udev or have lvm call back into the dbus service when a
change occurs, even if that change originated from the dbus service
we end up refreshing the state of the system twice which is not
needed or wanted. This change handles this case by removing any
pending refreshes in the worker queue if the state of the system
was just updated.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Since we want to read env LVM_VG_NAME vg names,
we cannot just check LV names which do contain '/'.
So before the patch commands like:
> lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Path required for Logical Volume "vg".
Please provide a valid volume group name
> LVM_VG_NAME=vg lvconvert --repair vg
Before:
Please provide a valid volume group name
After:
Can't find LV vg in VG vg
Commit ca878a3426 introduced an issue
that zero sized extesion suddenly started to be accepted and
missed to return error.
Properly check invalid input values for sizes.
Add tests for the "dmstats report" command:
* report
* report --count
* report --histogram
So far the tests just check the command runs as expected when a
correctly configured stats region exists: validation of output
can be added later.
Add tests for the "dmstats create" command:
* simple whole-device region
* region using --start/--len options
* region using --segments option
* region with precise timestamps (--precise)
* region with histogram bounds (--bounds)
The install target already depends on .tests-stamp - since this
in turn depends on lib/version-expected there is no need to have
this as a dependency of install.
Add initial dmstats tests to 000-basic.sh. These tests ensure that
the dmsetup binary is built and linked correctly when called as
'dmstats' and that the version of the binary matches the expected
library version used for the build.
"pvcreate_each_params" was a temporary name used
to transition from the old "pvcreate_params".
Remove the old pvcreate_params struct and rename the
new pvcreate_each_params struct to pvcreate_params.
Rename various pvcreate_each_params terms to simply
pvcreate_params.
Use the new pvcreate_each_device() function from
toollib, previously added for pvcreate, in place
of the old pvcreate_vol().
This also requires shifting the location where the
lock is acquired for the new VG name. The lock for
the new VG is supposed to be acquired before pvcreate.
This means splitting the vg_lock_newname() out of
vg_create(), and calling vg_lock_newname() directly
before pvcreate, and then calling the remainder of
vg_create() after pvcreate.
The new function vg_lock_and_create() now does
vg_lock_newname() + vg_create(), like the previous
version of vg_create().
The lock on the new VG name is released before the
pvcreate and reacquired after the pvcreate because
pvcreate needs to reset lvmcache, which doesn't work
when locks are held. An exception could likely be
made for the new VG name lock, which would allow
vgcreate to hold the new VG name lock across the
pvcreate step.
This is common code for handling PV create/remove
that can be shared by pvcreate/vgcreate/vgextend/pvremove.
This does not change any commands to use the new code.
- Pull out the hidden equivalent of process_each_pv
into an actual top level process_each_pv.
- Pull the prompts to the top level, and do not
run any prompts while locks are held.
The orphan lock is reacquired after any prompts are
done, and the devices being created are checked for
any change made while the lock was not held.
Previously, pvcreate_vol() was the shared function for
creating a PV for pvcreate, vgcreate, vgextend.
Now, it will be toollib function pvcreate_each_device().
pvcreate_vol() was called effectively as a helper, from
within vgcreate and vgextend code paths.
pvcreate_each_device() will be called at the same level
as other process_each functions.
One of the main problems with pvcreate_vol() is that
it included a hidden equivalent of process_each_pv for
each device being created:
pvcreate_vol() -> _pvcreate_check() ->
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
pvcreate_each_device() reorganizes the code so that
each-VG-each-PV loop is done once, and uses the standard
process_each_pv function at the top level of the function.
This uses the vg->pv_write_list in place of the
vg->pvs_to_write list, and eliminates the use of
pvcreate_params. The label remove and zeroing
steps are shifted out of vg_write() to the higher
level like pvcreate will do.
The vg->pv_write_list contains pv_list structs for which
vg_write() should call pv_write().
The new list will replace vg->pvs_to_write that contains
vg_to_create structs which are used to perform higher-level
pvcreate-related operations. The higher level pvcreate
operations will be moved out of vg_write() to higher levels.
Since we already check in few other places 'info' is not NULL,
do the same for others - however when info would be NULL
it more or less looks like internal error.
Use #define instead, since we do not require actually buffer needs
to exists to eliminated new gcc6 warning:
clvm.h:53:19: warning: ‘CLVMD_SOCKNAME’ defined but not used
[-Wunused-const-variable]
Reshuffle messages during pvremove.
Always print WARNING: when PV is in use so using options
--force --force doesn't make this important user
notification go away.
Simplify variable 'used' usage (so older gcc doesn't warn
about the use of unitilizied variable).
Add some '.' into messages.
Currently it's been checked for 'zero' header for thin-pool,
but lets use it always for cache as well - since it's relatively 'cheap'
detection of read 'error' problems as thin/cache tools
currently do not work fast enough in this case.
export LVMDBUSD_SESSION=True to run on the session bus instead
of the system bus so that we can run the unit test without
installing the dbus conf file.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Reduced the size of LVs created and use actual PE numbers instead of hard
coding them to allow us to work with the loop back devices.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
It appears that the output of lvconvert --merge can vary some. The code
was blowing up as it was trying to parse a line of stdout to retrieve the
% complete, but the line did not have the needed format and an execption
was thrown. The uncaught exception caused the background thread to exit
without updating the job object, which caused the client to hang forever
waiting. Added a default exception handler to prevent unhandled execptions
causing hangs and removed the parameter skip_first_line as it's no longer
needed. The code checks to see if the line can be parsed before doing so.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
After the lockspace has been successfully removed,
invalidate the name field in the lockspace struct.
The struct remains on the list of lockspaces until
the struct can be freed later. Until the struct is
freed, its name will prevent another new lockspace
from being created with the same name.
When update fails in suspend() (sending of messages
fails because metadata space is full) call resume(),
so the locking sequence works properly for clustering.
Also failing deactivation should unlock memory.
Fix reporting of Fail thin-pool target status
as attr[8] letter 'F'.
Report 'needs_check' status from thin-pool target via
attr field [4] (letter 'c'/'C'), and also via CheckNeeded field.
TODO: think about better name here?
TODO: lots of prop_not_implemented_set
Fix parsing of 'Fail' status (using capital letter) for thin-pool.
Add also parsing of 'Error' state for thin-pool.
Add needs_check test for thin-pool.
Detect Fail state for thin.
Ask for confirmation when using pvcreate/pvremove on a PV which is
marked as belonging to a VG, just like we do in case of a PV which
belongs to known VG:
$ pvcreate -ff /dev/sda
Really INITIALIZE physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume not initialized
$ pvremove -ff /dev/sda
Really WIPE LABELS from physical volume "/dev/sda" that is marked as belonging to a VG [y/n]? n
/dev/sda: physical volume label not removed
The host that owns foreign VGs is responsible for fixing up PV_EXT_USED
flag - the same already applies to repairing any inconsistent VG.
This patch also moves the iteration over vg->pvs inside
_check_or_repair_pv_ext fn - it's cleaner this way.
pv->vg is not set yet during pvcreate processing. Use pv->fmt instead to
check for these fake PVs (all normal PVs have format defined, devices
which are not PVs don't have this set).
This fixes commit 0000db7f98.
If we know that a PV belongs to some VG and we're missing metadata
(because we have only those PV(s) from VG present in the system that
don't have metadata areas), we should skip such PV when processing
under system ID.
This is because we know that the PV belongs to some VG, but we
really can't decide whether it matches system ID unless the VG
metadata is present again.
Some of the PVs are not even orphan PVs - they're fake PVs - this can
happen if we're listing all devices with "pvs -a". Such PV must not
be marked as used.
The backup_restore_vg is used directly for restoring the VG from backup.
It's also used to do the VG conversions from one metadata format to
another which means vgconvert calls backup_restore_vg too.
When restoring VG from backup, we need to rewrite/write PV headers as
PVs may have been orphans before and now they're becoming part of some
VG - we need to write the PV_EXT_USED flag at least.
When using the backup_restore_vg for vgconvert, we need to write
completely new PV header in different format.
Avoid the special "pv_write" call and handling that was used before
this patch in vgconvert (vgconvert_single function to be more precise)
and reuse existing internal interface to register PV header for writing
(or rewriting) via vg->pvs_to_write list instead like we do it elsewhere
in the code.
This patch also resolves a problem in which PV headers with target
format were written in the vgconvert_single fn as orphans and VG
metadata were added later on - this was a tiny hack actually.
We can't do this now - we need to write the PV as belonging
to a VG because otherwise the PV_EXT_USED flag won't be written
properly (if the PV header is written as orphan, the PV_EXT_USED
is set to 0, of course, even though metadata are attached later).
So this patch removes this tiny inconsistency which was passing
just fine before because we didn't have any relation to the VG
in PV header before. Now we have the PV_EXT_USED flag which says
the "PV is used in some VG".
The same check as we already do for orphan PVs, just the other way
round now: if the PV is surely part of some VG and any PV the VG
contains does not have the PV_EXT_USED flag set, repair it.
For example - /dev/sda here is in VG vg and it's incorrectly not
marked as used by PV_EXT_USED flag:
pvs --binary -o pv_ext_vsn,pv_in_use
WARNING: Volume Group vg is not consistent.
WARNING: Repairing Physical Volume /dev/sda that is in Volume Group vg but not marked as used.
PV VG Fmt Attr PSize PFree ExtVsn PInUse
/dev/sda vg lvm2 a-- 124.00m 124.00m 2 1
PV header extension versions:
0 - the original PV without any extensions
1 - bootloader area support added
2 - PV_EXT_USED flag support added
So do the associated checks related to PV_EXT_USED flag only if
PV header extension found is of version 2 and higher.
If we know that the PV is orphan, meaning there's at least one MDA on
that PV which does not reference any VG and at the same time there's
PV_EXT_USED flag set, we're certainly in an inconsistent state and we
need to fix this.
For example, such situation can happen during vgremove/vgreduce if we
removed/reduced the VG, but we haven't written PV headers yet because
vgremove stopped abruptly for whatever reason just before writing new
PV headers with updated state, including PV extension flags (and so the
PV_EXT_USED flag).
However, in case the PV has no MDAs at all, we can't double-check
whether the PV_EXT_USED is correct or not - if that PV is marked
as used, it's either:
- really used (but other disks with MDAs are missing)
- or the error state as described above is hit
User needs to overwrite the PV header directly if it's really clear
the PV having no MDAs does not belong to any VG and at the same time
it's still marked as being in use (pvcreate -ff <dev_name> will fix this).
For example - /dev/sda here has 1 MDA, orphan and is incorrectly marked
with PV_EXT_USED flag:
$ pvs --binary -o+pv_in_use
WARNING: Found inconsistent standalone Physical Volumes.
WARNING: Repairing flag incorrectly marking Physical Volume /dev/sda as used.
PV VG Fmt Attr PSize PFree InUse
/dev/sda lvm2 --- 128.00m 128.00m 0
For example:
$ pvs -o pv_name,vg_name,pv_in_use
PV VG InUse
/dev/sda vg used
/dev/sdb
/dev/sdc used
(sda is part of vg - it's used
sdb is not part of vg - it's not used
sdc is part of vg, but MDAs missing - it's used)
Scenario:
$ pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
We're adding the PV to a VG.
Before this patch:
$ vgcreate vg /dev/sda
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
With this path applied:
$ vgcreate vg /dev/sda
Volume group "vg" successfully created
...and verbose log containing: "Physical volume "/dev/sda" successfully written"
Make sure we won't use a PV that is already marked as used. Normally,
VG metadata would stop us from doing that, but we can run into a
situation where such metadata is missing because PVs with MDAs
are missing and the PVs left are the ones with 0 MDAs.
(/dev/sda in this example has 0 MDAs and it belongs to a VG,
but other PVs with MDA are missing)
$ pvs -o pv_name,pv_mda_count /dev/sda
PV #PMda
/dev/sda 0
$ pvcreate /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't initialize PV '/dev/sda' without -ff.
$ pvchange -u /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Can't change PV '/dev/sda' without -ff.
Physical volume /dev/sda not changed
0 physical volumes changed / 1 physical volume not changed
$ pvremove /dev/sda
PV '/dev/sda' is marked as belonging to a VG but its metadata is missing.
(If you are certain you need pvremove, then confirm by using --force twice.)
$ vgcreate vg /dev/sda
Physical volume '/dev/sda' is marked as belonging to a VG but its metadata is missing.
Unable to add physical volume '/dev/sda' to volume group 'vg'.
We'll use this struct in subsequent patches for PVs which should
be rewritten, not just created. So rename struct pv_to_create to
struct pv_to_write for clarity.
This is a hotfix for a bug introduced in
6d7dc87cb3.
The bug description: First we allocate memory for
processing handle (at an address 1) then we
allocate some memory on the same pool for later use
in pvmove_poll function inside the process_each_pv
function (at an address 2). After we jump out of
process_each_pv we called destroy_processing_handle.
As a result of destroying the handle memory pool could
deallocate all memory at address 1 or higher. The
pvmove_poll function tried to copy a memory allocated
at address 2 that could be returned to the system.
If it was so it led to segfault.
We need to rethink proper fix but in the same time
cmd->mem pool is recreated per each lvm command so
this should not cause problems even when we run
multiple commands in lvm shell.
A valgrind snapshot of the corruption:
Invalid read of size 1
at 0x4C29F92: strlen (mc_replace_strmem.c:403)
by 0x5495F2E: dm_pool_strdup (pool.c:51)
by 0x1592A7: _create_id (pvmove.c:774)
by 0x159409: pvmove_poll (pvmove.c:796)
by 0x1599E3: pvmove (pvmove.c:931)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address 0xf15df8a is 138 bytes inside a block of size 8,192 free'd
at 0x4C28430: free (vg_replace_malloc.c:446)
by 0x5494E73: dm_free_wrapper (dbg_malloc.c:357)
by 0x5495DE2: _free_chunk (pool-fast.c:318)
by 0x549561C: dm_pool_free (pool-fast.c:151)
by 0x164451: destroy_processing_handle (toollib.c:1837)
by 0x1598C1: pvmove (pvmove.c:903)
by 0x15105B: lvm_run_command (lvmcmdline.c:1655)
by 0x1523C3: lvm2_main (lvmcmdline.c:2121)
by 0x1754F3: main (lvm.c:22)
Address this gcc warning:
metadata/lv.c:243: warning: initialized field overwritten
metadata/lv.c:243: warning: (near initialization for 'status.seg_status')
Present with e.g.: gcc version 4.3.2 (Debian 4.3.2-1.1)
Simplify calculation of extents rounding needed for
segment size.
Segment size has to divisible by 'extent count' needed to contain
whole stripe. LVM currently does not support stripes across segment.
In case the stripe size is bigger then extent size,
require bigger rounding.
Fixing regression caused by 197b5e6dc7.
So the 'TODO' part now finally know the answer - there is 'sparc64'
architecture which imposes limitation to read 64b words only through
64b aligned address.
Since we never could know how is the user going to use the returned
pointer and the userusually expects it's aligned on the highest CPU
required alignement, preserve it also for char*.
Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=809685
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Wrong thin-pool feature flag ordering in dm table: It will lead to
unnecessary table reload.
Fix it by placeing feature flags in order they are returned from the
kernel so current 'table line diff' code will not see a difference.
'verbose' was marked as a boolean option while it
takes integer args - so it has limited usage to 0 or 1,
but we supported 0-4 at least.
Fix it by switching to corrent int type.
(Hopefully noone was trying to use this variable as true/yes/false/no
way - as the would be unsupported/undocumented).
Reporter noticed lvm2 incorrectly translated
lvm2 threshold value to water mark in commit:
99237f0908
Fix it by properly translating size to number of
blocks in thin-pool and then calc for free blocks
matching configured lvm2 threshold value.
Reported-by: Ming-Hung Tsai <mingnus@gmail.com>
Normally, we generate and provide lvm.conf file where use_blkid_wiping
is set based on whether support for this is compiled in or not. This was
generated properly based on configure.
However, if lvm.conf is not used at all (someone deletes it) or the value
in lvm.conf is commented out (user edited it), we still need to use
proper default value that is based on DEFAULT_USE_BLKID_WIPING taken
from configure script - we used hardcoded value of "1" in this case
by mistake.
We already do check for suspended devs within udev rules where
the pvscan is to update lvmetad. So the check for suspended devs
in "pre-lvmetad" chain is not useful here - remove it - it may
be a source of hardly to detect races anyway (if udev rule detects
the device is not suspended and then the pvscan instance sees the
dev as suspended, we may end up not reacting to the event properly).
lvm1 and pool format do not support bootloader areas and we need to
remove any existing associated bootloader areas when we read lvm1 and
pool labels.
This has its importance if we're converting from one format to another
and we're reusing lvmcache in long-running commands (e.g. clvmd or lvm
shell) and we need to make lvmcache consistent and valid for current format.
Non-dm devices have ID_PART_TABLE_TYPE variable exported in
udev db from blkid scan for *both* whole devices and partitions.
We used ID_PART_ENTRY_DISK in addition to decide whether this
is the whole device or partition and then we filtered out only
whole devices where the partition table really is.
However, ID_PART_ENTRY_DISK was added in blkid 2.20 so we need
to use a different set of variables to decide on whole devices
and partitions on systems where older blkid is still used.
Now, we use ID_PART_TABLE_TYPE to detect that there's something
related to partitioning with this device and we use DEVTYPE variable
instead to decide between whole device (DEVTYPE="disk") and partition
(DEVTYPE="partition").
For dm devices it's simpler, we have ID_PART_TABLE_TYPE variable\
set in udev db for whole devices. It's not set for partitions,
hence we don't need more variable in addition to make the decision
on whole device vs. partition (dm devices do not have regular
partitions, hence DEVTYPE can't be used anyway, it's always set
to "disk" for whole disks and partitions).
Add "size" and "size_seqno" to struct device to cache device's size
and also to control its lifetime - the cached value is valid as long
as the global _dev_size_seqno is equal to the device's size_seqno,
otherwise we need to get the size again and cache the new value.
This patch also adds new dev_size_seqno_inc() fn for the appropriate
parts of the code to increment current global value of _dev_size_seqno
and hence to cause all currently cached values for device sizes to
be invalidated.
The device size is now cached because we're planning to reuse this
information for further checks and we want to avoid checking it more
than necessary to save resources.
Fix regression caused by c9f021de0b.
This commit actually transfered real-action (e.g. device removal)
into the next loop which has however missed to check for break.
So add check for break also there.
When creating a list in 'context of command' - use proper mempool.
vg->vgmem is mempool related to VG metadata - and can be eventually
locked read-only when VG struct is shared.
W: manual-page-warning /usr/share/man/man8/lvm.8.gz 491: warning: macro `_cdata',' not defined
rpmlint actually notices we had few hidden word in man page.
the line cannot start with apostrophe as it has then a different
meaning.
If not using explicit --enable-blkid-wiping/--disable-blkid-wiping
configure option, the configure script tries to enable/disable blkid
wiping feature automatically based on blkid library version found.
The script incorrectly set default value for lvm.conf's
allocation/use_blkid_wiping" setting to "1" (enabled) if proper
blkid library version was not found or the version found was less
than the minimum required. It should be set to "0" in this case.
The extent size must fits all blocks in 4294967295 sectors
(in 512b units) this is 1/2 KiB less then 2TiB.
So while previous statement 'suggested' 2TiB is still acceptable value,
make it clear it's not.
As now we support any multiples of 128KB as extent size -
values like 2047G will still 'flow-in' otherwise the largest power-of-2
supported value is 1TiB.
With 1TiB user needs 8388608 extents for 8EiB device.
(FYI such device is already unusable with todays glibc-2.22.90-27)
4GiB extent size is currently the smallest extent size which allows
a user to create 8EiB devices (with 2GiB it's less then 8EiB).
TODO: lvm2 may possibly print amount of 'lost/unused space' on a PV,
since using such ridiculously sized extent size may result in huge
space being left unaccessible.
Since commit 2fc126b00d, the library
code requires udev to be initialised for device scanning and
clvmd can fail to find VGs if devices/external_device_info_source
is set to "udev".
There are two basic groups of fields for LV segment device reporting:
- related to LV segment's devices: devices and seg_pe_ranges
- related to LV segment's metadata devices: metadata_devices and seg_metadata_le_ranges
The devices and metadata_devices report devices in this format:
"device_name(extent_start)"
The seg_pe_ranges and seg_metadata_le_ranges report devices in
this format:
"device_name:extent_start-extent_end"
This patch reverts partly what commit 7f74a99502
(v 2.02.140) introduced in this area - it added [] for
hidden devices to mark them for all four fields mentioned above.
We won't be marking hidden devices in devices and metadata_devices
fields.
The seg_metadata_le_ranges field will have hidden devices marked -
it's new enough that we don't need to care about compatibility much
yet.
The seg_pe_ranges is old enough that we shouldn't be changing this
one - so we're reverting to not marking hidden devices here.
Instead, there's going to be a new field "seg_le_ranges" which
is going to replace the seg_pe_ranges and it will mark hidden devices -
this is going to be introduced in a patch later.
So in the end we'll end up with:
(LV segment's devices)
devices field with "device_name(extent_start)" format, not marking hidden devices
seg_pe_ranges field with "device_name:extent_start-extent_end" format, not marking hidden devices (deprecated, new seg_le_ranges should be used instead for standardized format)
seg_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
(LV segment's metadata devices)
metadata_devices field with "device_name:extent_start-extent_end" format, not marking hidden devices
seg_metadata_le_ranges field with "device_name:extent_start-extent_end" format, marking hidden devices
Also, both seg_le_ranges and seg_metadata_le_ranges will honour the
report/list_item_separator setting which can be used to configure
the delimiter used for list items.
So, to sum it up, we will recommend using the new seg_le_ranges and
seg_metadata_le_ranges fields because they display devices with
standard extent range format, they can mark hidden devices and they
honour the report/list_item_separator setting.
We'll be keeping devices,seg_pe_ranges and metadata_devices fields
for compatibility.
The associated devices,metadata_devices,seg_pe_ranges and
seg_metadata_le_ranges are reported as genuine string lists now.
This allows for using the items separately in -S|--select
(so searching for subsets etc.) and also it allows for
configuring the separator using report/list_item_separator
which may be useful in scripts (however, we'll enable this
only for seg_le_metadata_ranges and not for devices,seg_pe_ranges
and seg_metadata_devices for compatibility reasons - see following
patch).
Add a comment in _process_pvs_in_vg() to document the
place where there have been problems with processing
PVs twice.
For a while we had a hacky workaround here where we'd
skip processing a PV if its device wasn't found in
all_devices (and !is_missing_pv since we want to
process PVs with missing devices.). That workaround
was removed in commit 5cd4d46f because it was no
longer needed.
The workaround had originally been needed to prevent
a device from being processed twice when the PV had
no MDAs -- it would be processed once in its real VG
and then the workaround would prevent it from being
processed a second time in the orphan VG.
Wrongly appearing as an orphan likely happened because
lvmcache would consider the no-MDA PV an orphan unless
the real VG holding that PV was also in lvmcache.
This issue is also mentioned in pvchange where holding
the global lock allows VGs to remain in lvmcache so
PVs with 0 mdas are not considered orphans.
The workaround in _process_pvs_in_vg() was originally
intended for reporting commands, not for pvchange.
But, it was accidentally helping pvchange also because
the method described by the pvchange global lock
comment had been subverted by commit 80f4b4b8.
Commit 80f4b4b8 was found to be unnecessary, and was
reverted in commit e710bac0. This restored the
intended global lock lvmcache effect to pvchange, and
it no longer relied on the workaround in toollib.
When reporting on LVs, take the end of the range from the size of the
underlying (hidden) LV rather than the logical size of the current
segment (that PVs use).
Previously, pvmove used the function find_pv_in_vg() which did the
equivalent of process_each_pv() by doing:
find_pv_by_name() -> get_pvs() ->
get_pvs_internal() -> _get_pvs() -> get_vgids() ->
/* equivalent to process_each_pv */
dm_list_iterate_items(vgids)
vg = vg_read_internal()
dm_list_iterate_items(&vg->pvs)
With the found 'pv', it would do vg_read() on pv_vg_name(pv),
and then do the actual pvmove processing.
This commit simplifies by using process_each_pv() and putting
the actual pvmove processing into the "single" function.
This eliminates both find_pv_by_name() and the vg_read().
The processing code that followed vg_read remains the same.
The return code for the pvmove command is not based on the
process_each_pv return code, but is based on the success/fail
conditions in the existing code.
Make the lvb validation rules for convert match
those for unlock (even though it would be very
unlikely or impossible for convert to deal with
zero lvb.)
When an orphan PV is changed/resized, the
lvmlockd global lock is converted from sh
to ex. If the command is changing two
orphan PVs, the conversion to ex should
be done only once.
Existing cache_settings field displays the settings which are
saved in metadata. Add new kernel_cache_settings fields to display
the settings which are currently used by kernel, including fields
for which default values are used.
This way users have complete view of the set of cache settings
supported (and which they can set) and their values which are used
at the moment by kernel.
For example:
$ lvs -o name,cache_policy,cache_settings,kernel_cache_settings vg
LV Cache Policy Cache Settings KCache Settings
cached1 mq migration_threshold=1024,write_promote_adjustment=2 migration_threshold=1024,random_threshold=4,sequential_threshold=512,discard_promote_adjustment=1,read_promote_adjustment=4,write_promote_adjustment=2
cached2 smq migration_threshold=1024 migration_threshold=1024
cached3 smq migration_threshold=2048
Fix lvm2app to return either 0 or 1 for lvm_vg_is_{clustered,exported},
including internal functions pvseg_is_allocated and vg_is_resizeable
which are not yet exposed in lvm2app but make them consistent with the
rest.
This reverts e28e22b9e1
The problem that that commit was fixing (pytest failure)
no longer appears with the current code, so the commit is
not needed.
That commit is a problem for pvchange, because it prevents
lvmcache from retaining VG metadata even while the global
lock is held. pvchange holds the global lock to ensure
that VG metadata is kept in lvmcache throughout processing.
If the cache is not kept, a PV with zero MDAs will appear
first in its actual VG and then appear again in the orphan VG.
It wrongly appears a second time in the orphan VG only if
the actual VG is dropped from lvmcache.
Thin pool discard mode set in metadata can be different from the one
actually used if any device underneath does not support that mode. Add
kernel_discard report field to make it possible to see this difference.
Internal _alloc_init() is only called from allocate_extents(),
which already does prevent usage of virtual segments.
So mark as internal error early and do not process it any further.
Add new test for lv_is_snapshot().
Also move few other bitchecks into same place as remaining bit tests.
TODO: drop lv_is_merging_origin() and keep using lv_is_merging().
The problem addressed by this workaround no longer
seems to exist, so remove it. PVs with no mdas
no longer appear in both their actual VG and in
the orphan VG.
Include brackets for the name if the dev is invisible.
This change applies to all callers of _format_pvsegs fn:
- lvseg_devices (the "lvs -o devices")
- lvseg_metadata_devices (the "lvs -o metadata_devices)
- lvseg_seg_pe_ranges (the "lvs -o seg_pe_ranges")
- lvseg_seg_metadata_le_ranges (the "lvs -o seg_metadata_le_ranges")
The common lv_pool_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_metadata_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_data_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_mirror_log_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_origin_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
The common lv_convert_lv fn avoids code duplication and also
the reporting part now uses _lvname_disp and _uuid_disp to display
name and uuid respectively, including brackets for the name if the
dev is invisible.
Use common _lvname_disp to report lv_parent. The _lvname_disp
takes care of properly marking LVs which are not visible - such
LVs are always enclosed in brackets when reported within any
other field.
For example, thin pool over RAID.
Before:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] cache_pool_tdata
[cache_pool_tdata_rimage_1] cache_pool_tdata
[cache_pool_tdata_rmeta_0] cache_pool_tdata
[cache_pool_tdata_rmeta_1] cache_pool_tdata
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] cache_pool_tmeta
[cache_pool_tmeta_rimage_1] cache_pool_tmeta
[cache_pool_tmeta_rmeta_0] cache_pool_tmeta
[cache_pool_tmeta_rmeta_1] cache_pool_tmeta
[lvol0_pmspare]
With this patch applied:
$ lvs -a -o name,lv_parent,data_lv,metadata_lv vg
LV Parent Data Meta
cache_pool [cache_pool_tdata] [cache_pool_tmeta]
[cache_pool_tdata] cache_pool
[cache_pool_tdata_rimage_0] [cache_pool_tdata]
[cache_pool_tdata_rimage_1] [cache_pool_tdata]
[cache_pool_tdata_rmeta_0] [cache_pool_tdata]
[cache_pool_tdata_rmeta_1] [cache_pool_tdata]
[cache_pool_tmeta] cache_pool
[cache_pool_tmeta_rimage_0] [cache_pool_tmeta]
[cache_pool_tmeta_rimage_1] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_0] [cache_pool_tmeta]
[cache_pool_tmeta_rmeta_1] [cache_pool_tmeta]
[lvol0_pmspare]
Do not mix dm_report_field_set_value and _field_set_value and
use single function call throughout for clarity. The same applies
for dm_report_field_string and _string_disp.
Fix regression caused by commit c2d4330f27
which removed the dm_pool_strdup for the cache policy name in
_cache_policy_disp report function.
This regression was hit with buffered reporting only (which is
used by default). The reason is that for buffered reporting, we're
iterating over LVs in VG (process_each_lv) while gathering
all the information that is needed for the report. In this case,
the LV's cache policy name has not been duped, but only the pointer
to the original VG buffer was stored. When the LV iteration finished,
the VG buffer was freed and any report to output called later
(dm_report_output call) accessed already freed VG data.
This didn't appear if unbuffered reporting was used (--unbuffered)
because in this case, the data were reported to output as
soon as they were processed, hence it was reported to output
before the VG data was freed.
The lvm2-activation{-early,-net}.service systemd unit statuses were missing
in dump gathered by lvmdump -s. These are quite important when debugging
scenarios with systemd environment and where lvmetad is not used.
Have commands send lvmlockd the update message
in vg_write instead of vg_commit, so that it's
not done while LVs are suspended. If the vg_write
is not committed, and the seqno sent to lvmlockd
is not used, then lvmlockd can detect this when
the next update uses the same seqno.
Use process_each_vg() to lock and read the old VG,
and then call the main vgrename code.
When real VG names are used (not a UUID in place of the
old name), the command still pre-locks the new name
(when strcmp wants it locked first), before calling
process_each_vg on the old name.
In the case where the old name is replaced with a UUID,
process_each_vg now translates that UUID into the real
VG name, which it locks and reads. In this case, we
cannot do pre-locking to maintain lock ordering because
the old name is unknown. So, in this case the strcmp
based lock ordering is suppressed and the old name is
always locked first. This opens a remote chance for
lock ordering conflict between racing vgrenames between
two names where one or both commands use the UUID.
Also always clear the internal lvmcache after rescanning, and
reinstate a test for --trustcache so that 'pvs --trustcache'
(for example) avoids rescanning.
Before commit c1f246fedf,
_get_all_devices() did a full device scan before
get_vgnameids() was called. The full scan in
_get_all_devices() is from calling dev_iter_create(f, 1).
The '1' arg forces a full scan.
By doing a full scan in _get_all_devices(), new devices
were added to dev-cache before get_vgnameids() began
scanning labels. So, labels would be read from new devices.
(e.g. by the first 'pvs' command after the new device appeared.)
After that commit, _get_all_devices() was called
after get_vgnameids() was finished scanning labels.
So, new devices would be missed while scanning labels.
When _get_all_devices() saw the new devices (after
labels were scanned), those devices were added to
the .cache file. This meant that the second 'pvs'
command would see the devices because they would be
in .cache.
Now, the full device scan is factored out of
_get_all_devices() and called by itself at the
start of the command so that new devices will
be known before get_vgnameids() scans labels.
Since we mark cache-pool as 'hidden/private' while it is in-use,
we may still allow user to change it's name.
It should not cause any harm and user may prefer better naming
for a cache-pool in use.
If an existing fifo has the wrong attributes it cannot be trusted
so we must unlink it and recreate it correctly.
(Replaces 2c8d6f5c90: if the other end of
the fifo already got opened while its mode was insecure, delaying the
chmod isn't going to make any difference!)
Reinstate and extend checks removed by e1b111b02a.
The code has always assumed that only root has access to the directory
containing the fifos and that they are under the complete control of
dmeventd code. If anything is found not to be as expected, then open()
should certainly not be attempted!
It's getting a bit more complex here.
Basic idea behind is - check_current_backup() should not
log error when a user is using a read-only filesystem,
so e.g. vgscan will not report any error when it tries
to take missing backup.
We still have cases when error could be reported though,
e.g. the backup this would be a symbolic link, but these
are rather misconfiguration and unexpected case.
We have to modes of 'archive()' usage -
1. compulsory - fail stops command and user may try '-An' option
to do a command.
2. non-compulsory - some fails in archiving are ignorable (i.e.
read-only filesystem where archive dir is located).
Those 2 cases needs to be properly handle - i.e. the non-compulsory
logging should not be tampering error logging message production.
So more work here is needed
Pass full buffer size to printf() function - no reason to make
buffer 1 char smaller.
Also rename locn buffer to message buffer directly since it's
not used for anything else.
TODO: we may use same buffer also for 'buf[]' since there is
no collision - so may safe 1K on stack usage.
In general, --select should be used to specify a VG by UUID,
but vgrename already allows a uuid to be substituted for
the name, so continue to allow it in that case.
If the VG arg from the command line does not match the
name of any known VGs, then check if the arg looks like
a UUID. If it's a valid UUID, then compare it to the
UUID of known VGs. If it matches the UUID of a known VG,
then process that VG.
Pass the single vgname as a new process_each_vg arg
instead of setting a cmd flag to tell process_each_vg
to take only the first vgname arg from argv.
Other commands with different argv formats will be
able to use it this way.
When two different VGs with the same name exist,
they are both stored in lvmcache using the vginfo->next
list. Previously, the code would print warnings (sometimes)
when adding VGs to this list. Now the duplicate VG names
are handled by higher level code, so this list no longer
needs to print warnings about duplicate VG names being found.
After recent changes to process_each, vg_read() is usually
given both the vgname and vgid for the intended VG.
However, in some cases vg_read() is given a vgid with
no vgname, or is given a vgname with no vgid.
When given a vgid with no vgname, vg_read() uses lvmcache
to look up the vgname using the vgid. If the vgname is
not found, vg_read() fails.
When given a vgname with no vgid, vg_read() should also
use lvmcache to look up the vgid using the vgname.
If the vgid is not found, vg_read() fails.
If the lvmcache lookup finds multiple vgids for the
vgname, then the lookup fails, causing vg_read() to fail
because the intended VG is uncertain.
Usually, both vgname and vgid for the intended VG are passed
to vg_read(), which means the lvmcache translations
between vgname and vgid are not done.
If two different VGs with the same name exist on the system,
a command that just specifies that ambiguous name will fail
with a new error:
$ vgs -o name,uuid
...
foo qyUS65-vn32-TuKs-a8yF-wfeQ-7DkF-Fds0uf
foo vfhKCP-mpc7-KLLL-Uh08-4xPG-zLNR-4cnxJX
$ lvs foo
Multiple VGs found with the same name: foo
Use the --select option with VG UUID (vg_uuid).
$ vgremove foo
Multiple VGs found with the same name: foo
Use the --select option with VG UUID (vg_uuid).
$ lvs -S vg_uuid=qyUS65-vn32-TuKs-a8yF-wfeQ-7DkF-Fds0uf
lv1 foo ...
This is implemented for process_each_vg/lv, and works
with or without lvmetad. It does not work for commands
that do not use process_each.
This change includes one exception to the behavior shown
above. If one of the VGs is foreign, and the other is not,
then the command assumes that the intended VG is the local
one and uses it.
This makes process_each_vg/lv always use the list of
vgnames on the system. When specific VGs are named on
the command line, the corresponding entries from
vgnameids_on_system are moved to vgnameids_to_process.
Previously, when specific VGs were named on the command
line, the vgnameids_on_system list was not created, and
vgnameids_to_process was created from the arg_vgnames
list (which is only names, without vgids).
Now, vgnameids_on_system is always created, and entries
are moved from that list to vgnameids_to_process -- either
some (when arg_vgnames specifies only some), or all (when
the command is processing all VGs, or needs to look at
all VGs for checking tags/selection).
This change adds one new lvmetad lookup (vg_list) to a
command that specifies VG names. It adds no new work
for other commands, e.g. non-lvmetad commands, or
commands that look at all VGs.
When using lvmetad, 'lvs foo' previously sent one
request to lvmetad: 'vg_lookup foo'.
Now, 'lvs foo' sends two requests to lvmetad:
'vg_list' and 'vg_lookup foo <uuid>'.
(The lookup can now always include the uuid in the request
because the initial vg_list contains name/vgid pairs.)
When not using lvmetad, this uses the system_id field in
the cached vginfo structs that are populated during a scan.
When using lvmetad, this requests the VG from lvmetad, and
checks the system_id field in the returned metadata.
When the command already knows both the vgid and vgname,
it should send both to lvmetad for a more exact request,
and it can save lvmetad the work of a name lookup.
Remove long outstand unused code lines, which were already
been obsoleted by other code.
Statuses and snapshot tree creation is already handled differently.
Also drop some 'extra' log_error() and use only stack;
since error has already been reported.
Since we do not use dev_manager in a way we would have destroyed VG
content while in-use - we could safely keep just pointer.
So dropping strdup.
Also it seems we actually no longer use vg_name for anything
so it may possibly go away completely unless it would be useful
for debugging...
Just for convenience to display all new configuration settings
introduced since given version (before, there was only --atversion
to display settings introduced in concrete version).
For example:
$ lvmconfig --type new --sinceversion 2.2.120
allocation {
# cache_mode="writethrough"
# cache_settings {
# }
}
global {
use_lvmlockd=0
# lvmlockd_lock_retries=3
# sanlock_lv_extend=256
use_lvmpolld=1
}
activation {
}
# report {
# compact_output_cols=""
# time_format="%Y-%m-%d %T %z"
# }
local {
# host_id=0
}
Unifying terminology.
Since all the metadata in-use are ALWAYS on disk - switch
to terminology committed and precommitted.
Patch has no functional change inside.
lv preload for detached LVs started to be used also
for various other types which just happens to pass through
weak if() condition.
TODO: find here better solution to rather explicitly check
for types we really need to preload.
We do not won't to 'expose' internals of VG struct.
ATM we use lists to keep all LVs - we may want to switch
to better struct for quicker 'search'.
Since we do not need 'lists' but always actual LV,
switch find_lv_in_vg_by_lvid() to return LV,
and replaces some use case of find_lv_in_vg()
with 'better' working find_lv() which already
returns LV.
When 'lvextend -L+XX vg/thinpool' do not leave inactive table
loaded for 'wrapping' LV on top of resized thin-pool
(ATM we use linear LV for this with same size as thin-pool).
Udev recently start to 'link-in' major amount of useless libs.
(Seem to be faulty 'systemd' link-in all issue)
Anyway - avoid locking those libs in RAM.
When preloading thin-pool device node for already
existing/running thin-pool do not resume such thin-pool.
This allows to properly schedule commit point for metadata,
when thin-pool data or metadata volume is resized.
Extra space between 'cache' target and metadata device caused
string comparation being not equal and thus always causing
table reload even when uneeded.
Avoid internal error message where thin pool repair code tries to
fix cache pool - was catched later in code stack, so rather
catch this early and make the repair function exlusive
to thin pools.
So far we have no code for repairing cache pools
(other then the automatic during activation/deactivation).
To handle multiple VGs with the same name.
Simply using the VG name is ambiguous, and
lvmetad requires the VG uuid be used to
specify which one is meant.
Coverity here is a bit 'blind' here and cannot resolve which
code paths are actually able to hit this code path.
(It's using 'statistic' to resolve all possible paths,
and it's not scanning 'individual' code paths.)
This just cleans warns and add 'cheap' tests.
Use 'mda' instead of NULL to quite Coverity warn.
However this code seems to be actually not even possible to hit.
With proper analysis it may possibly be dropped from code to
simplify logic.
Skip testing target_pvs for NULL, we already
dereference it in many other places.
If check would ever be needed - it needs to be
in front of _raid_extract_images().
In lookup, return a count of entries with the
same key rather than the value from a second
entry with the same key.
Using some slightly different names.
Simply use lookup_withval right away rather than doing a
standard lookup, checking for the wrong mapping, then
repeating with lookup_withval to get the right mapping.
Coverity is not able to understand assembly language in
system's header file, so provide model for such macro.
Note: to really see model in-use: #nodef FD_ZERO model_FD_ZERO
need to go to coverity/config/user_nodefs.h
Add missing display_lvname in _lvconvert_merge_thin_snapshot().
Also when we detect missing origin, report Internal error,
which would likely be the primary fault here
(and avoid dereft of NULL origin as noticed by Coverity).
When reading older lvm2 metadata for cache-pool - we now handle more
extended syntax - basically we want to enter most setting when
actually creating cached LV.
For this new validation code has been added. However older metadata
without new settings set is now found as invalid.
Fix it by adding default settings for cache policy mq
and cache mode writethrough.
If the data len is passed into the hash table
and saved there, then the hash table internals
do not need to assume that the data value is
a string at any point.
New hash table functions are added that allow for
multiple entries with the same key. Use of the
vgname_to_vgid hash table is converted to these
new functions since there are multiple entries
in vgname_to_vgid that have the same key (vgname).
When multiple VGs with the same name exist, commands
that reference only a VG name will fail saying the
VG could not be found (that error message could be
improved.) Any command that works with the select
option can access one of the VGs with -S vg_uuid=X.
vgrename is a special case that allows the first VG
name arg to be replaced by a uuid, which also works.
(The existing hash table implementation is not well
suited for handling this case, but it works ok with
the new extensions. Changing lvmetad to use its own
custom hash tables may be preferable at some point.)
Here Coverity cannot see the pointer cannot be NULL in this
code path - opened coverity case #00531860.
We could make a model to avoid seeing related reports,
but then we loose coverage for modeled function.
So decided to add minor hint for this case.
Recent change 2c8d6f5c90
actually droped restart when the reason of failing open is missing
device completely - check for ENOENT now as another reason
to start new dmeventd server (when there is no systemd to maintain it).
The udev_device_get_is_initialized is available since libudev version
165. Older versions are still used somewhere (e.g. RHEL6). So better
check for this fn and use it only if it's available.
Udev db records are marked as not initialized (incomplete) on timeout.
Issue an error message whenever LVM finds such records so users are
aware that something's going wrong with udev db.
This is important in case we use devices/external_device_info_source="udev"
where udev database records are used to do various filtering decisions.
For example:
udev log of timed out worker:
Nov 11 13:02:25 raw.virt systemd-udevd[607]: seq 1997 '/devices/virtual/block/dm-2' is taking a long time
Nov 11 13:04:25 raw.virt systemd-udevd[607]: seq 1997 '/devices/virtual/block/dm-2' killed
Nov 11 13:04:25 raw.virt systemd-udevd[607]: worker [11221] terminated by signal 9 (Killed)
Nov 11 13:04:25 raw.virt systemd-udevd[607]: worker [11221] failed while handling '/devices/virtual/block/dm-2'
...
LVM also issues error message visibly if incomplete udev db record is found,
devices/external_device_info_source="udev" is set:
$ pvs
Udev database has incomplete information about device /dev/dm-2.
Failed to get external handle for device /dev/dm-2 [udev].
...
Coverity noticed this condition is always false and the error
path could never be visited.
So check for all mismatches of supported messages
and actually mark log_error as internal error.
When the first arg is a UUID and vgrename translates
that UUID to a current VG name, the old and new VG
names are not being checked for equality. If they
are equal, it produces an internal error rather than
a proper error.
Use delay_dev to slow down mirror sync so we could more
easily check for race and proper reject of parallel mirror
leg addition/reduction.
Also expose fail in mirror allocation of parallel leg.
Rather then skipping whole test - just do not use it.
Failing tests that have required delay need to deal with reality
and shell either check for HAVE_DM_DELAY and skip portion
of test or using should when needed.
While through all codepaths we never 'read' lock_id unless LCKF_CONVERT,
coverity cannot decrypt this.
As since it's usually better to pass in 'well-defined' data structures
preset lock_id to 0.
Use fputs() when printing plain string,
easier then fprintf which needs to parse it.
Also check fd before close is >= 0 -
it is - but coverity fail to see it, so eliminate
this false-positive warning.
Doing 'stat' checking first and later opening is racy.
And since we do not really care about any 'status' info
here and we read 'sysfs' here - just drop whole 'stat()'
call and directly handle error from failing 'fopen()'.
Coverity here is not fully-in-picture - but please it
with validation of pointer which currently cannot be null,
since we always return at least empty string.
Check for arg_vgid_lookup and arg_name_lookup not being NULL.
Drop checking arg_vgid and arg_name for NULL since they
are already dereference earlier - thus mostly must be NOT NULL.
(If that would be possible larger rework of this function would be
required).
Put calls related to fifo opening into a single function.
Fix Time-Of-Check-Time-Of-Use and use fstat()
and fchmod() on already opened fd instead of
checking first path and then risking to open something
different.
Currently the code creates the log separately after allocating space for
the data and as no data allocation is needed this second time,
total_extents ends up holding zero so use new_extents directly instead.
When reading a foreign VG we cannot write it, since
it belongs to another host. When reading a shared VG
we cannot write it because we may not have an ex lock.
(Or we may be reading the shared VG while not using
lvmlockd in which case it's like reading a foreign VG.)
Add the same checks for wiping outdated PVs. We may
read a foreign or shared VG, or see the PVs, while
another host is part way through writing a new version
of the VG to the PVs. This might cause us to think
some of the PVs are outdated. We do not want to
write another host's PVs, especially when we may
wrongly conclude they are outdated.
When the command gets a list of alternate devices
from lvmetad, log each one directly. This is not
the same as the warnings when adding lvmcache,
which are related to which duplicate is preferred.
If two PVs have different VGs with the same name
(different uuids), one of the VGs is ignored by
lvmetad. A FIXME exists in lvmetad to find a
better response.
update_metadata and pv_found update the cached metadata;
these are both reworked to improve the code, organize it
by each possible state and transition, make it much more
clear what's changing, add more error checking and
handling, and add comments.
The state and content of the cache (hash tables) does not
change (apart from some things that didn't work before),
and the communication to/from commands does not change.
The implementation and organization of the code making
the state changes does change significantly.
One detail related to the content of the cache does change:
different hash tables do not reference the same memory any more;
the target values in each hash table are allocated and freed
individually.
The str_list_destroy function may be called to cleanup memory when
the list is not used anymore and the list itself was not allocated
from the memory pool.
When checking minimum mda size, make sure the mda_size after alignment
and calculation is more than 0 - if there's no place for an MDA at the
end of the disk, the _text_pv_add_metadata_area does not try to add it
there and it returns (because we already have the MDA at the start of
the disk at least).
Actually, we don't need extra condition as introduced in commit
00348c0a63. We should fix the last
condition:
(mdac->rlocn.size >= mdah->size)
...which should be:
(MDA_HEADER_SIZE + (rlocn ? rlocn->size : 0) + mdac->rlocn.size >= mdah->size))
Where the "mdac" is new metadata, the "rlocn" is old metadata.
So the main problem with the previous condition was that it
didn't count in MDA_HEADER_SIZE properly (and possible existing
metadata - the "rlocn"). This could have caused the error state
where metadata in ring buffer overlap to not be hit.
Replace the new condition introduced in 00348c0a63
with the improved one for the condition that existed there
already but it was just incomplete.
We're already checking whether old and new meta do not overlap in
ring buffer (as we need to keep both old and new meta during vg_write
up until vg_commit).
We also need to check whether the new metadata do not overlap
themselves in case we don't have old metadata yet (...because
we're in vgcreate). This could happen if we're creating a VG so
that the very first metadata written are long enough that it wraps
themselves in metadata ring buffer.
Although we limited the minimum metadata area size better with the
previous commit ccb8da404d which
makes the initial VG metadata overlap in ring buffer to be less
probable, the risk of hitting this overlap condition is still there
if we still manage to generate big enough metadata somehow.
For example, users can provide many and/or long VG tags during vgcreate
so that the VG metadata is long enough to start to wrap in the ring
buffer again...
Drop already tested 'threshold & create' which is in
lvextend-thin-full.sh
Count with now match faster 'dmeventd' wakeup on watermark
as it's now nearly instant after crossing threshold value.
If the underlaying device has actually bigger read-ahead settings,
let it pass.
But anyway switch to 512 strip-size to get really high R-A sector count.
This option could never have been printed in lvm2 metadata, so it could
be safely removed as it could have been set only as 0.
These configurable setting is supported via metadata profile.
Now with correctly functioning dmeventd enable usage of
low_water_mark for faster reaction on pool's threshold.
When user select e.g. 80% as a threshold value,
dmeventd doesn't need to wait 10 seconds till monitoring
timer expires, but nearly instantly resizes thin-pool
to fit bellow threshold.
If plugin's lvm command execution fails too often (>10 times),
there is no point to torture system more then necessary, just log
and drop monitoring in this case.
The recent addition to check for PVs that were
missed during the first iteration of processing
was unintentionally catching duplicate PVs because
duplicates were not removed from the all_devices
list when the primary dev was processed.
Also change a message from warn back to verbose.
If a VG is removed between the time that 'vgs'
or 'lvs' (with no args) creates the list of VGs
and the time that it reads the VG to process it,
then ignore the removed VG; don't report an error
that it could not be found, since it wasn't named
by the command.
The former patch(dab3ebce4c) is a little bit strict. For example, it is
OK to create PV on unpartitioned DASD devices with LDL formatted. So
after lvm version containing the patch, LVs created on those devices
could not be found.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Improve event string parser to avoid unneeded alloc+free.
Daemon talk function uses '-' to mark NULL/missing field.
So restore the NULL pointer back on parser.
This should have made old tools like 'dmevent_tool' work again.
As now 'uuid' or 'dso' could become NULL and then be
properly used in _want_registered_device() function.
Since lvm2 always fill these parameters, this change should
have no effect on lvm2.
PVs could be missing from the 'pvs' output if
their VG was removed at the same time that the
'pvs' command was run. To fix this:
1. If a VG is not found when processed, don't
silently skip the PVs in it, as is done when
the "skip" variable is set.
2. Repeat the VG search if some PVs are not
found on the first search through all VGs.
The second search uses a specific list of
PVs that were missed the first time.
testing:
/dev/sdb is a PV
/dev/sdd is a PV
/dev/sdg is not a PV
each test begins with:
vgcreate test /dev/sdb /dev/sdd
variations to test:
vgremove -f test & pvs
vgremove -f test & pvs -a
vgremove -f test & pvs /dev/sdb /dev/sdd
vgremove -f test & pvs /dev/sdg
vgremove -f test & pvs /dev/sdb /dev/sdg
The pvs command should always display /dev/sdb
and /dev/sdd, either as a part of VG test or not.
The pvs command should always print an error
indicating that /dev/sdg could not be found.
Recognize the target only 'extends' and do not enforce
'flush' in this case. Only the size reduction
still requires flush (so disables usage of no_flush flag).
If some other targets do require flush before suspend,
they have to explicitly ask for it.
While the activation code tries to evaluate which target
really needs flush with suspend and which may go without flush,
it has stayed effectively disabled by original commit:
33f732c5e9 since here
it only allows to pass non-pvmoving 'mirrors'.
So remove check for mirror LV type and only disable
no_flush for 'pvmove'..
TODO: Looking into history - it also seemed like raid target
would have always required flushing but it's been later
removed without clean explanation.
If some more targets really do need 'no_flush' it should
been handle at their 'level' - since we now stack multiple
targets over itself.
Add more functionality to size_changed function.
While 'existing' API only detected 0 for
unchanged, and !0 for changed,
new improved API will also detected if the
size has only went bigger - or there was
size reduction.
Function work for the whole dm-tree - so
no change is size is always 0.
only size extension 1.
and if some size reduction is there - returns -1.
This result can be used for better evaluation
whether we need to flush before suspend.
Use single code to evaluate if the percentage value has
crossed threshold.
Recalculate amount value to always fit bellow
threshold so there are not need any extra reiterations
to reach this state in case policy amount is too small.
Since plugin's percentage compare has been fixed,
it's now revealed wrong compare here.
The logic for threshold is - to allow to go as high
as given value e.g. 80% - so if pool is exactlu 80%
full it's still allowed to use it (dmeventd will not
resize it).
Commit 1a74171ca5 added
a check to ignore a VG that was FAILED_INCONSISTENT
if the command doesn't care if the VG is not found.
Remove that check because that case is never reached
by the current code.
The ONE_VGNAME_ARG was being passed and tested as
vg_read() flag but it's a cmd struct flag.
(It affects command arg processing in toollib,
not vg_read behavior. Flags related to command
processing are generally cmd struct flags, while
vg_read arg flags are generally related to vg_read
behavior.)
Running "vgremove -f VG & pvs" results in the pvs
command reporting that the VG is not found or is
inconsistent. If the VG is gone or being removed,
the pvs command should just skip it and not print
errors about it.
"Not found" is because the pvs command created the
list of VGs to process, including VG, then vgremove
removed the VG, then the pvs command came to to read
the VG to process it and did not find it.
An "inconsistent" error could be reported if vgremove
had only partially completed removing VG when pvs did
vg_read on the VG to process it, causing pvs to find
the VG in a partially-removed state.
This fix adds a flag that pvs uses to ignore a VG
that can't be read or is inconsistent.
When lvmetad is used and lvmcache update function (lvmcache_update_vgname_and_id)
was called to update existing lvmcache records, a condition was met
which made to retun from the update function immediately, effectively
making it NOOP.
It seems there's no reason for such condition and lvmcache should be
update appropriately even when lvmetad used as lvmcache may be reused,
most notably in lvm shell.
It's possible this is a remnant of the lvmetad development code which
didn't get removed for some reason and the bug didn't get spotted
because lvm shell is not used often (the condition dates back to 2012
or so).
Example, lvmetad and lvm shell used:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda vg lvm2 a-- 124.00m 124.00m
Before this patch:
==================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
With this patch applied:
========================
lvm> vgremove vg
Volume group "vg" successfully removed
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
The lvmcache info might be resued, most notably in lvm shell.
We need to be sure that even lvmcache_info marked as invalid
is removed from the lvmcache so it does not confuse any subsequent
code/commands executed later on.
Problematic example with the lvm shell:
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Before this patch (/dev/sda still displayed in a way):
======================================================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
No physical volume label read from /dev/sda
(with lvmetad)
lvm> pvs
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
With this patch applied:
========================
lvm> pvremove /dev/sda
Labels on physical volume "/dev/sda" successfully wiped
(without lvmetad)
lvm> pvs
(with lvmetad)
lvm> pvs
Older pthread library was missing 'trick'
in pthread_cleanup_pop() which lead to
compilation error:
error: label at end of compound statement
Use explicit ';' to fix it.
Make lvm2_disable_dmeventd_monitoring() more explicit.
As memlock_inc_daemon() is also used by clvmd, which
does changes dmeventd and suspend ignore state at
some stages - make updates of these 2 variable
tied to the call of lvm2_disable_dmeventd_monitoring().
Once this call is made dmeventd monitoring
and suspended devices are ignored.
TODO: all lvm-global settings should really be moved
to command context.
Implementing exit when 'dmeventd' is idle.
Default idle timeout set to 1 hour - after this time period
dmeventd will cleanly exit.
On systems with 'systemd' - service is automatically started with
next contact on dmeventd communication socket/fifo.
On other systems - new dmeventd starts again when lvm2 command detects
its missing and monitoring is needed.
Add support to unmonitor device when monitor recognizes there is
nothing to monitor anymore.
TODO: possibly API change with return value could be also used.
Redesign threading code:
- plugin registration runs within its new created thread for
improved parallel usage.
- wait task is created just once and used during whole plugin lifetime.
- event thread is based over 'events' filter being set - when
filter is 0, such thread is 'unused'.
- event loop is simplified.
- timeout thread is never signaling 'processing' thread.
- pending of events filter cnange is properly reported and
running event thread is signalled when possible.
- helgrind is not reporting problems.
Need here to keep control device opened while there is 'any' dso
plugin loaded - otherwise there would a race closing controlfd
inside lvm2 plugin while some other monitoring thread would
tried to execute another WAITEVENT task.
Move all DSO related function in front, so they could be easily
referenced from rest of code.
Add proper error paths with logging and error reporting.
Drop mutex locking when releasing DSO - since DSO is always
allocated and released in main 'event' processing thread.
CONVERTING status flag is a tricky one. It's not set when converting
a non-mirror LV type to the mirror type, i.e.: linear -> two leg mirror.
Also the conversion itself is instant and doesn't require to be polled.
When mirror reaches sync state there's no final update on VG metadata
for lvmpolld to be made thereby report_progress in fact doesn't report
percentage of mirror being converted but percentage of mirror
being in sync. Perhaps we should reword the lvconvert output here.
On the other hand CONVERTING is set while we upconvert the mirror
from i.e. two leg mirror to four leg mirror. In such case the operation
is required to be polled so that lvmpolld can cleanup temporary
conversion log when the conversion is over.
Ignore CONVERTING lv_type for the moment and match LVs only by uuids
during 'mirror conversion'/'waiting for a sync to finish'.
The old code made two loops through the PVs: in the first
loop it found the max PV and VG name lengths, and in the
second loop it printed each PV using the name lengths as
field widths for aligning columns.
The new code uses process_each_pv() which makes one loop
through the PVs. In the *first* call to pvscan_single(),
the max name lengths are found by looping through the
lvmcache entries which have been populated by the generic
process_each code prior to calling any _single functions.
Subsequent calls to pvscan_single() reuse the max lengths
that were found by the first call.
The new report/compact_output_cols setting has exactly the same effect
as report/compact_output setting. The difference is that with the new
setting it's possible to define which cols should be compacted exactly
in contrast to all cols in case of report/compact_output.
In case both compact_output and compact_output_cols is enabled/set,
the compact_output prevails.
For example:
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols=""
$ lvs vg
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=0
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize Log Cpy%Sync Convert
lvol0 vg -wi-a----- 4.00m
---
$ lvmconfig --type full report/compact_output report/compact_output_cols
compact_output=1
compact_output_cols="data_percent,metadata_percent,pool_lv,move_pv,origin"
$ lvs vg
LV VG Attr LSize
lvol0 vg -wi-a----- 4.00m
dm_report_compact_given_fields is the same as dm_report_compact_fields,
but it processes only given fields, not all the fields in the report
like dm_report_compact_field does.
If a host failed while holding a sanlock lease,
sanlock_acquire will by default block and wait
for the lease to expire before returning. We
want it to return with an error so we can retry
instead of blocking, which allows us to process
other lock operations.
(Enclose this in an ifdef until the new flag
appears in a sanlock release.)
Respect lvm2_log_fn prototype. The idea of 'reusing' print_log with
plain cast is causing very strange crashes with some older 'gcc' compilers.
So just do it cleanly...
This reverts commit 1b1c01a27b.
This caused messages to get dropped instead of logged into the log file.
(The log file and log function are independent at the moment.)
Rework thread creation code to better use resources.
New code will not leak 'timeout' registered thread on error path.
Also if the thread already exist - avoid creation of thread
object and it's later destruction.
If the race is noticed during adding new monitoring thread,
such thread is put on cleanup list and -EEXIST is reported.
As we now use 'unified' logging macro system - we no longer need
to protect from change of logging function pointer - it's set
once at the start of dmeventd and not change anymore
(as lvm2 library no longer interferers here).
Some signatures are spread around the disk in several copies, mainly for
backup. Make libblkid to detect these extra copies - there was missing
"blkid_probe_step_back" fn call after successful wipe of previous signature
copy.
An example with FAT table which has copies:
$ mkfs.vfat /dev/sda1
Before this patch:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
With this patch applied:
$ pvcreate /dev/sda1
WARNING: vfat signature detected on /dev/sda1 at offset 54. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 0. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
WARNING: vfat signature detected on /dev/sda1 at offset 510. Wipe it? [y/n]: y
Wiping vfat signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
Make sure log/prefix is set to "" when getting the list of VG names.
We need this for the format to be correct so it's properly searched
through later on.
$ vgcreate vgA /dev/sda
Volume group "vgA" successfully created
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ dd if=/dev/sda of=/dev/sdc bs=1M
(the new VG name is prefix of existing VG name)
$ vgimportclone -n vg /dev/sdb
(the new VG name is suffix of existing VG name)
$ vgimportclone -n gA /dev/sdc
Before this patch:
------------------
(we end up with "vg1" and "gA1" names with the "1" suffix which is not needed)
$ vgs -o vg_name
VG
gA1
vg1
vgA
With this patch applied:
------------------------
(we end up with "vg" and "gA" names as they're unique already and no extra suffix is added)
$ # vgs -o vg_name
VG
gA
vg
vgA
Of course, if the name supplied is not unique, the number is added correctly:
$ dd if=/dev/sda of=/dev/sdb bs=1M
$ vgimportclone -n vgA /dev/sdb
$ vgs -o vg_name
VG
vgA
vgA1
If lvmlockd is running, lvmetad is configured (use_lvmetad=1),
but lvmetad is not running, then commands will seg fault
when trying to send a message to lvmetad.
The difference is lvmetad being "active", not just "used".
We can replace the expressions with awk/grep/cut/tr with --select now and
more suitable reporting options and modes. Also, we don't need to check
the temporary lvm.conf generated within vgimportclone script since we're
generating it ourselves now using lvmconfig, not using sed anymore like
it was before (so we can be pretty sure it's correct - we use lvmconfig
now even for generating the lvm.conf itself).
We already have pv_count to report number of PVs that a VG has based
on metadata.
This patch exposes the information about how many of these PVs are
missing which is also useful information for a VG. Wwe could count
the sum of pv_missing reporting fields for each PV in the VG before,
but the new field is practical when reporting VG as a whole and there's
no need to process each PV from VG alone.
If 'vgcreate --shared' finds both sanlock and dlm are running,
print a more accurate error message:
"Found multiple lock managers, select one with --lock-type."
When neither is running, we still print:
"Failed to detect a running lock manager to select lock type."
Using --lock-type sanlock|dlm implies --shared.
Using --shared selects lock type sanlock|dlm
(by choosing the one that's running.)
Using both --shared and --lock-type sanlock|dlm should
also be allowed (--shared is just redundant information.)
Also, leave out the note about "circular buffer" which is
an internal imeplementation detail anyway and not quite
informational for users:
Before this patch:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large for circular buffer
Failed to write VG vg1.
With this patch applied:
$ vgcreate vg1 /dev/sda
VG vg1 metadata too large: size of metadata to write is 691 bytes while PV metadata area size on /dev/sda is 512 bytes.
Failed to write VG vg1.
Before this patch:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear public
With this patch applied:
$ lvs -a -o name,layout,role test/lvmlock
LV Layout Role
[lvmlock] linear private,lockd,sanlock
Avoid running tests, when prefix already exist in the system.
As prefix just uses PID number, we may hit a case for long
running tests, where devices from some previous runs were not
properly cleared away - detect this and fail early.
(Such machine should be inspected and fixed).
Add metadata_devices and seg_metadata_le_ranges report fields.
Currently only defined for raid, but should probably be extended
to all other segment types that don't report all their device
usage in the 'devices' field.
Correct some things, e.g. set mode and policy on
the cache lv, not the pool, lvm.conf field for
mode changed.
Add smq which was missing.
Make the sections on cache mode and cache policy
consistent in structure and style.
When lvmetad_pvscan_vg() reads VG metadata from each PV,
it compares it to the last one to verify it matches.
If the VG metadata does not match on the PVs, an error
is printed and it fails to read the VG. In this error
case, use log_debug to show the differences between
the two unmatching copies of the metadata.
One host changes a VG, making the cached VG on another
host invalid. The other host then rereads the VG from
disk to get the latest copy. If the first host removed
a PV from the VG, the second host attempts to reread the
VG from old PV when rescanning. Reading the VG from the
removed PV fails, causing vg_read to return "VG not found".
The fix is to simply not fail when a VG is not found while
rereading a PV and continue without it.
(This doesn't happen if the second host happens to first
run a command like 'vgs' that triggers a global revalidation
of metadata.)
vgchange --lock-type iterates through LVs to ensure
no LVs are active before changing the lock type of
the VG, but the loop was not checking that an LV
actually has a lock before trying it, so it would
fail if the VG had any LVs that don't use locks,
e.g it would fail on a tmeta LV from a pool.
We want most of our units to be started before any local/remote mount
points are mounted - we used {local,remote}-fs.target for this purpose
before, but it was not 100% correct as there's even {local,remote}-fs-pre.target
special systemd unit reserved for this exact purpose.
See also man 7 systemd.special and "local-fs-pre.target"/"remote-fs-pre.target"
description.
When user specifies '--force' with remove/remove_all/wipe_table
use '--noflush --nolockfs' resume flags, so the operation
will not block when device underneath is blocked.
Also make error messages more consistent:
Before this patch:
(/run/lock exists and is not a directory)
$ pvs
/run/lock/lvm: mkdir failed: Not a directory
File-based locking initialisation failed.
(/run/lock/lvm exists and is not a directory)
$ pvs
Directory "/run/lock/lvm" not found
File-based locking initialisation failed.
With this patch applied:
(/run/lock exists and is not a directory)
$ pvs
Existing path /run/lock is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed
(/run/lock/lvm exists and is not a directory)
$ pvs
Existing path /run/lock/lvm is not a directory.
Failed to create directory /run/lock/lvm.
File-based locking initialisation failed.
When using udev, the /dev/mapper entries are symlinks - fix the code
to count with this.
This patch also fixes the dmsetup mknodes and vgmknodes to properly
repair /dev/mapper content if it sees dangling symlink in /dev/mapper.
$ lvs -o name,tags vg
LV LV Tags
lvol0
lvol1 mytag
Before this patch:
$ lvs -o name,tags vg -S 'tags=""'
Failed to parse string list value for selection field lv_tags.
Selection syntax error at 'tags=""'.
Use 'help' for selection to get more help.
(and the same for -S 'tags={}' and -S 'tags=[]')
With this patch applied:
$ lvs -o name,tags vg -S 'tags=""'
LV LV Tags
lvol0
(and the same for -S 'tags={}' and -S 'tags=[]')
If lvmlockd acquires an lv lock for a command, but the
command exits before the reply, then the command has
not activated the lv and lvmlockd should unlock it.
This only applies when the lv was not already locked.
(There will always be a chance that the lv lock is held
while the lv is not active, i.e. if the command fails in
the small window between getting the lv lock and before
doing the activation. In that case, rerunning the
activation command corrects the inconsistency.)
This commit helps by automatically clearing the
inconsistency (lv locked by not activated) in the most
common case when the lv lock operation is slow to
complete and the command is canceled by the user.
This commit also adds and cleans up references to the
client id in a bunch of log messages, which is useful
to follow processing on each independent lock request.
When using lvm shell, some structures which are cached in memory may be
reused. This happens for the struct label (a part of lvmcache_info
structure) when lvmetad is used in which case the PV scan is not
done that would normally overwrite these label structures in memory
and making them up-to-date.
This is all consequence of the fact that struct lvmcache_info and
struct label are not always assigned in the same part of the code.
For example, if lvmetad *is not* used, parts of the struct label are
reassigned in label_read fn while struct lvmcache_info is created
elsewhere. No part of the code reused struct label (and its "dev"
field) before calling label_read fn. That's why the real bug is
hidden when using lvm shell without lvmetad.
However, with lvmetad and lvm shell, the situation is a bit different.
The label_read fn is not called if lvmetad *is* used, hence the
struct label may have ended up not initialized properly.
There was missing assignment for the dev field in struct label
in _text_pv_write fn which caused this problem to appear in
lvm shell with lvmetad, for example:
Before this patch:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
unknown device lvm2 --- 128.00m 128.00m
With this patch applied:
lvm> pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
lvm> pvs /dev/sda
PV VG Fmt Attr PSize PFree
/dev/sda lvm2 --- 128.00m 128.00m
Also, this problem had not appeared before changes introduced
by commits e1a63905d1 through
3a6f91d713 which, among other
things, added proper label field type reporting. Before, label
reporting was the same as using struct physical_volume which
has its own dev field assigned and so this problem was not exposed.
When a command does a sequence of
vg_write + vg_commit + vg_write + vg_commit,
initialization of non-PV devices happens during the
first vg_write, and does not need to be repeated by
the second vg_write.
When creating a lockd VG, this sequence occurs because
the VG is first created, then the lockd data is created,
then the lockd data is then written to the VG metadata.
Avoid validation of free space in pool, when no messages are passed.
Patch a3c7e326c3 add new check for
pool overload - but this check should not be made if there are
no messages and transaction_id is still within 'bounds' (bigger by 1).
Commit 00b36ef06a had a typo
and missed '{' for shell variable, thus command used slightly
different 'tmp' dir name for cache dir (with extra '}').
Such change was unnoticed until a recent fix in persistent
filter, lvm2 missed to update cache file when --config
was specified.
The result was, /tmp dir was accumulating snap.XXXXX} dirs when
running vgimportclose script.
Since we may want to swap names when LVs are complex types, we cannot
avoid doing full renames on both LV stacks.
Temporarily use 'pvmove_tmeta' as unused name to prevent validation troubles.
Commit 9403edbb93 move location of
configure.h and lvm-version.h.
Let's try even better place then /conf dir which should be left
for user configurable files.
Put these files right into include dir.
This applies the same rule/logic to dlm VGs that has always
existed for sanlock VGs. Allowing a dlm VG to be removed
while its lockspace was still running on other hosts largely
worked, but there were difficult problems if another VG with
the same name was recreated. Forcing the VG lockspace to
be stopped, gives both sanlock and dlm VGs the same behavior.
This shortcut was added for an odd case that I do not
believe is relevant any more. Having an alternate
path for lockspace thread cleanup is a complication
that could lead to problems.
The code was expecting the wrong return value from
compare_config, which returns 0 when equal.
This is a problem for a lockd VG using multiple PVs
when the VG needs to be rescanned.
Somehow raid tests landed in plain cache - separte them out
so they properly check for have_raid.
Check we do not support strip option with cache-pool creation.
ATM allocation can't handle stripping and cache pool allocation.
It's not yet even clear what should be actually result.
Until resolved, disable this option (it's been coredumping
inside allocation anyway).
Certain stacks of cached LVs may have unexpected consequences.
So add a warning function called when LV is cached to detect
such caces and WARN user about them - the best we could do ATM.
When we insert layer we also move status flag-bits for certain LV types,
so internal volume_group structure remains consistent.
(Perhaps it's misuse of 'insert_layer' function and we should have
another similar function for this.)
Basically we aim to maintain the same state as after reading fresh
metadata out of volume group.
Currently we when i.e. cache 'raid' LV - this should transfer 'raidLV' flag
to _corigin LV and cache is no longer a raid.
TODO: bits for stacked devices needs more exact rules.
The dlm will often lose the lvb content, so we need to
check quite a few possibilities for lvb values that
were not being checked before.
Refactoring was required to pass the entire lvb value
back to the core code instead of the single value.
The only functional change should be detecting new
lvb states where metadata is now invalidated where
it wasn't before.
When an action is created by lvmlockd for itself,
there is no client to send the result to. Add
the NO_CLIENT flag to the action to skip sending
the result to a client.
Add a new arg to lockd_start_vg() that indicates
it is being called for a new lockd VG, so that
lvmlockd knows the lockspace being started is new.
(Will be used by a following commit.)
Commit f6473baffc introduced a new
cmd->initialized variable to keep info about which parts of the
cmd_context have been initialized.
A part of this patch was also a change in refresh_filters fn
which checks for cmd->initialized.filters variable and it does
the filter refresh *only* if the filter has already been initialized
before otherwise it's a NOOP (before, the refresh_filters also
initialized filters as a side effect in case it had not been
initialized before which was not quite correct).
However, the commit f6473baffc
did not handle the case in which configuration changes
either via --config argument or when configuration file changed
and its timestamp was higher than the timestamp of the persistent
cache file - the /etc/lvm/cache/.cache.
This patch fixes this issue and it causes the init_filters fn
in lvm_run_command fn to be called with proper value of
"load_persistent_cache" switch even if the configuration changes,
hence causing the persistent cache file to be ignored in this
case.
Ensure make clean cleans any left-over file from their previous
location so they are not in conflict with new ones.
Also hide error message when .commands file is not present.
The regex filter (controlled by devices/filter lvm.conf setting) was
evaluated as the very last filter. However, this is not optimal when
it comes to restricting disk access - users define devices/filter
as well as devices/global_filter to avoid this.
The devices/global_filter is already positioned at the beginning of the
filter chain. We need to do the same for devices/filter.
Filter chains before this patch:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
type_filter -> usable->filter -> mpath_component_filter ->
partition_filter -> md_component_filter -> fw_raid_filter ->
regex_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> usable_filter -> regex_filter
From the chain list above we can see that particularly in case when
lvmetad is not used, the regex filter is the very last one that is
processed. If lvmetad is used, it doesn't matter much as there's
the global_regex_filter which is used instead when updating lvmetad
and when retrieving info from lvmetad, putting regex_filter in front
of usable_filter wouldn't change much since usabled_filter is not
reading disks directly.
This patch puts the regex filter to the front even in case lvmetad
is not used, hence reinstating the state as it was before commit
a7be3b12df (which moved the regex_filter
position in the chain). Still, the arguments for the commit
a7be3b12df still apply and they're
still satisfied since component filters (MD, mpath...) are evaluated
first just before updating lvmetad.
So with this patch, we end up with:
A: when lvmetad is not used:
persistent_filter -> sysfs_filter -> global_regex_filter ->
regex_filter -> type_filter -> usable->filter ->
mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B: when lvmetad is used:
B1: to update lvmetad:
sysfs_filter -> global_regex_filter -> type_filter ->
usable_filter -> mpath_component_filter -> partition_filter ->
md_component_filter -> fw_raid_filter
B2: to retrieve info from lvmetad:
persistent_filter -> regex_filter -> usable_filter
This way, specifying the regex_filter in non-lvmetad case causes
the devices to be filtered based on regex first before processing
any other filters which can access disks (like md_component_filter).
This patch also streamlines the code for better readability.
Hopefull 4.3 will be fixed and test will be updated to let
raid test running again.
Meanwhile using md-raid may effectively kill kernel,
so leave at least other tests running.
Split up _build_histogram_arg() into separate functions to allocate
and fill the histogram arg string and remove nested local variable
declarations from the parent function.
Coverity flags a user-after-free in _stats_histograms_destroy():
>>> Calling "dm_pool_free" frees pointer "mem->chunk" which has
>>> already been freed.
This should not be possible since the histograms are destroyed in
reverse order of allocation:
203 for (n = _nr_areas_region(region) - 1; n; n--)
204 if (region->counters[n].histogram)
205 dm_pool_free(mem, region->counters[n].histogram);
It appears that Coverity is unaware that pool->chunk is updated
during the call to dm_pool_free() and valgrind flags no errors in
this function when called with multiple allocated histograms.
Since there is no actual need to free the histograms individually
in this way simplify the code and just free the first allocated
object (which will also free all later allocated histograms in a
single call).
Put include/.symlinks_created as a prerequisite for dep calc.
Otherwise if these are not generated and user enters tests subdir and
runs 'make' he just gets endless loop of dep calculation.
Relocate generated configure.h and lvm-version.h outside
of compilable .c source tree.
The reason is behind - when compiling in builddir != srcdir
the generated file in lib/misc/configure.h was used for all compiled
source file except ones located in lib/misc dir - those would have used
configure.h file located in this dir - if there have existed one (i.e.
from some other build)
This problem was only visible, when srcdir == buildir was used before
trying to use srcdri != builddir (as configure.h appeared then in
srcdir).
The histogram changes adds a new error path to dm_stats_create().
Make sure that the dm_stats handle is properly destroyed if we fail
to create the histogram pool and check for failures setting the
program_id.
Since we are growing an object in the histogram pool the return
value of dm_pool_grow_object() must be checked and error paths need
to abandon the object before returning.
Undo the part of the recent EREMOVED change which
automatically stopped the lockspace for a remotely
removed VG. It didn't always work (would not work
when lvb content was rebuilt in the dlm). This will
be handled better when the lvb content is controlled
more strictly.
As part of fix that came with cf700151eb,
I forgot to add the check whether the result of stat was successful or
not. This bug caused uninitialized buffer to be used for entries
from .cache file which are no longer valid.
This bug may have caused these uninitialized values to be used further,
for example (see the unreal (2567,590944) representing major:minor
pair):
$ pvs
/dev/abc: stat failed: No such file or directory
Path /dev/abc no longer valid for device(2567,590944)
PV VG Fmt Attr PSize PFree
/dev/mapper/test lvm2 --- 104.00m 104.00m
/dev/vda2 rhel lvm2 a-- 9.51g 0
Older versions of gcc aren't able to track the assignments of
local variables as well as the latest versions leading to spurious
warnings like:
libdm-stats.c:2183: warning: "len" may be used uninitialized in this
function
libdm-stats.c:2177: warning: "minwidth" may be used uninitialized in
this function
Both of these variables are in fact assigned in all possible paths
through the function and later compilers do not produce these
warnings.
There's no reason to not initialize these variables though and
it makes the function slightly easier to follow.
Also fix one use of 'unsigned' for a nr_bins value.
Replace the histogram stats subcommand with a --histogram switch
to enable histogram related fields for both list and report output.
To avoid overloading the existing --histogram rename it to --bounds:
this is also a better description of the option.
Remove the optimization/shortcut for starting the dlm global
lockspace when it was already running.
Reenable automatically starting the dlm global lockspace
when a command attempts to use it and it's not yet started.
This had become disabled at some point.
Since we may easily get blocked when checking for percentage
of thin-pool - do not flush and just show current values.
This avoids holding VG locked when pool is overfilled.
Try to detect thin-pool which my block lvm2 command from furher
processing (i.e. lvextend).
Check if pool is read-only or out-of-space and in this case thins
will skipped from being scanned (so user may miss some PVs located
on thin volumes).
Fix regression introduced with commit:
2fc126b00d
This commit has moved pv_min_size() test in front
of device_is_usable(). However pv_min_size needs to open device,
so it may have actually get blocked.
So restore the original order and first validate
dm device to be usable for open.
It's worth to note that such check is not 'race-free',
but it usually eliminates 99.99% of problems ;).
Print [source:handler] in filters' debug messages only if external
device info source other than "none" is used.
$ lvmconfig --type full devices/external_device_info_source
external_device_info_source="none
Before this patch (from the -vvvv log):
filters/filter-usable.c:47 /dev/mapper/test: Skipping: Too small to hold a PV [none:(nil)]
filters/filter-md.c:33 /dev/sdb: Skipping md component device [none:(nil)]
filters/filter-partitioned.c:25 /dev/vda: Skipping: Partition table signature found [none:(nil)]
With this patch applied:
filters/filter-usable.c:44 /dev/mapper/test: Skipping: Too small to hold a PV
filters/filter-md.c:35 /dev/sdb: Skipping md component device
filters/filter-partitioned.c:27 /dev/vda: Skipping: Partition table signature found
librt doesn't have a pkgconfig file so use Libs.private: -lrt instead
to declare the dependency directly.
The same applies for -lm which is also used and which hasn't been
defined in the devmapper.pc file yet.
Make sure that correct 'dmstats create' messages are shown for all
examples and fix LV examples to use correct dmsetup output name
format (vg/lv -> vg-lv).
Improve the names and labels of stats reports columns, ensure that
the minimum field widths allow unambiguos labels to be shown and
update the man page descriptions of these fields.
Add support to dmstats to create and report histograms.
Add a --histogram switch to 'create' that accepts a string
description of bin boundaries and DR_STATS and DR_STATS_META fields
to report bin configuration and absolute and relative histogram
values:
hist_bins
hist_bounds
hist_ranges
hist_count
hist_count_bounds
hist_count_ranges
hist_percent
hist_percent_bounds
hist_percent_ranges
A new 'histogram' subcommand displays a report that emphasizes
histogram data as either counters or percentage values.
Add support for creating, parsing, and reporting dm-stats latency
histograms on kernels that support precise_timestamps.
Histograms are specified as a series of time values that give the
boundaries of the bins into which I/O counts accumulate (with
implicit lower and upper bounds on the first and last bins).
A new type, struct dm_histogram, is introduced to represent
histogram values and bin boundaries.
The boundary values may be given as either a string of values (with
optional unit suffixes) or as a zero terminated array of uint64_t
values expressing boundary times in nanoseconds.
A new bounds argument is added to dm_stats_create_region() which
accepts a pointer to a struct dm_histogram initialised with bounds
values.
Histogram data associated with a region is parsed during a call to
dm_stats_populate() and used to build a table of histogram values
that are pointed to from the containing area's counter set. The
histogram for a specified area may then be obtained and interogated
for values and properties.
This relies on kernel support to provide the boundary values in
a @stats_list response: this will be present in 4.3 and 4.2-stable. A
check for a minimum driver version of 4.33.0 is implemented to ensure
that this is present (4.32.0 has the necessary precise_timestamps and
histogram features but is unable to report these via @stats_list).
Access methods are provided to retrieve histogram values and bounds
as well as simple string representations of the counts and bin
boundaries. Methods are also available to return the total count
for a histogram and the relative value (as a dm_percent_t) of a
specified bin.
For repeating reports field widths should be re-calculated for
each report interval. Not doing so will cause a single row with
wide field data to cause all subsequent rows to share the width:
Name RgID ArID R/s W/s Histogram Bounds
vg_hex-lv_home 0 0 4522.00 834.00 0s: 991, 2ms: 152, 4ms: 161, 6ms: 4052 0s, 2ms, 4ms, 6ms
vg_hex-lv_swap 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_root 0 0 1754.00 683.00 0s: 369, 2ms: 65, 4ms: 90, 6ms: 1913 0s, 2ms, 4ms, 6ms
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e 0 0 4522.00 868.00 0s: 985, 2ms: 152, 4ms: 161, 6ms: 4092 0s, 2ms, 4ms, 6ms
vg_hex-lv_images 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
Name RgID ArID R/s W/s Histogram Bounds
vg_hex-lv_home 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_swap 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_root 0 0 0.00 2.00 0s: 1, 2ms: 0, 4ms: 0, 6ms: 1 0s, 2ms, 4ms, 6ms
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
vg_hex-lv_images 0 0 0.00 0.00 0s: 0, 2ms: 0, 4ms: 0, 6ms: 0 0s, 2ms, 4ms, 6ms
^^^^^^^^^^^^^^^^^
This is especially significant for the current histogram fields:
depending on the time since the last clear operation the first
report iteration may contain very large values leading to a very
large minimum field width. Without resetting field widths this
large minimum field width value is used for all subsequent rows.
Previously all stderr messages issued by spawned lvpoll command were reported
as INFO only. This made all such messages invisible in syslog or lvmpolld log
while running default configuration.
All lvpoll stderr messages are loged with WARN priority now and lvpoll
command exiting with retcode != 0 is logged with ERROR priority in
syslog and lvmpolld log
Include both the VG uuid and name in the lvmetad
set_vg_info message. This works around an obscure
problem where the VG uuid in lvmlockd is wrong
when one host removes a dlm VG, then creates a new
VG with the same name. If the dlm lockspace for
the initial VG was never stopped on another host,
that other host will be using the old uuid in its
lvmetad set_vg_info message. (That can be
corrected with a larger change, but this is an
effective workaround.)
set_vg_info previously accepted only vg uuid,
now accept both vg uuid and vg name. If the
uuid is provided, it's used just as before,
but if the uuid is not provided, or if it's
not found, then fall back to using the vg
name if that is provided.
lvmlockd would fail to recognize that the global lockspace
failed to start if the dlm wasn't running, so future attempts
to start the dlm global lockspace would do nothing, thinking
it was already running.
This was only used to return two flags indicating specific
reasons for a lock failure so that a more specific error
message could be printed by the command (lockspace had been
stopped, or lockspace had an error starting.)
Remove the list, given its limited usefulness, the fact it
would easily become inaccurate, and the fact it was causing
misleading error messages. The error conditions it was meant
to help could be reported differently.
Previously, a command would only rescan a lockd VG
when lvmetad returned the "vg_invalid" flag indicating
that the cached copy was invalid (which is done by
lvmlockd.) This is still the only usual reason for
rescanning a lockd VG, but two new special cases are
added where we also do the rescan:
. When the --shared option is used to display lockd VGs
from hosts not using lvmlockd. This is the same case
as using --foreign to display foreign VGs, but --shared
was missing the corresponding bits to rescan the VGs.
. When a lockd VG is allowed to be read for displaying
after failing to acquire the lock from lvmlockd. In
this case, the usual mechanism for validating the
cache is missed, so assume the cache would have been
invalidated. (This had been a previous todo item
that was lost during other cleanup.)
These were long-standing todos that were lost track of.
This makes lvmlockd removal steps for dlm VGs closely match
sanlock VGs. Because dlm lockspaces are not required to be
stopped on all hosts before vgremove, there is an extra bit
for dlm lockspaces, where a flag is set in the VG lock lvb
indicating that the VG was removed. If other hosts happen
to use the VG lock they will see this flag and stop their
lockspace.
Remove the existing lock type using the same functions
used to remove the lockd components during vgremove.
This results in a "clean" VG and lvmlockd state after
the vgchange, i.e. no bits left over from previous
lock type.
Originally when vgdisplay encountered an exported VG it issued a
WARNING. Commit d6b1de30 replaced this with an error message
but still exited with success (incorrect). A backtrace was recently
added in commit b193809987.
As vgdisplay already states that the VG is exported in its output,
just drop these messages completely.
dm_stats_create_region is now assigned to DM_1_02_106 by default:
the DM_1_02_104 .exported_symbols file entry was moved into
libdm-stats.c as:
DM_EXPORT_SYMBOL(dm_stats_create_region, 1_02_104)
so delete it from .exported_symbols.DM_1_02_104.
Since commit 797c18d543 some internal symbols
have been exported in shared libraries by mistake because 'local: *' got
lost. Fix the shell script not to compare the whole filename with
'Base'
All cache args could be specified when caching LV
(means converting LV to cached).
When --cachemode arg is given during cache-pool conversion,
store it in the metadata.
https://bugzilla.redhat.com/show_bug.cgi?id=1255184
Since cache-pool actualy keeps info about caching,
display this info for cache-pool LV as well
(matches info for cache LV when cache-pool is asociated with it).
Commit 82a27a8 introduced a change to the symbol versioning macros
that allows a new version of a function to be introduced while
keeping the old behaviour via a versioned symbol export. The new
symbol is listed in the current .exported_symbols.DM_* file and a
default (@@VERSION) binding is created during linking.
This broke the build on RHEL5, RHEL6 and Debian Lenny. This is
because the make version in these distros returns results from the
$(wildcard *) command in a different order to the RHEL7 and F22
versions: this affects the ordering of the generated .export.sym
version script:
RHEL7/F22
for i in ./.exported_symbols.Base ./.exported_symbols.DM_1_02_99
./.exported_symbols.DM_1_02_98 ./.exported_symbols.DM_1_02_97
./.exported_symbols.DM_1_02_106 ./.exported_symbols.DM_1_02_105
./.exported_symbols.DM_1_02_103 ./.exported_symbols.DM_1_02_101
./.exported_symbols.DM_1_02_104 ./.exported_symbols.DM_1_02_100
290: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
*388: 000000000003cfc7 314 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@@DM_1_02_106
391: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
*552: 000000000003cfc7 314 FUNC GLOBAL DEFAULT 12 dm_stats_create_region
944: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
992: 000000000003d101 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
RHEL6:
for i in ./.exported_symbols.Base ./.exported_symbols.DM_1_02_100
./.exported_symbols.DM_1_02_101 ./.exported_symbols.DM_1_02_103
./.exported_symbols.DM_1_02_104 ./.exported_symbols.DM_1_02_105
./.exported_symbols.DM_1_02_106 ./.exported_symbols.DM_1_02_97
./.exported_symbols.DM_1_02_98 ./.exported_symbols.DM_1_02_99; do\
290: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
390: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
*479: 000000000003cfa7 314 FUNC LOCAL DEFAULT 12 dm_stats_create_region
944: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region_v1_02_104
992: 000000000003d0e1 106 FUNC GLOBAL DEFAULT 12 dm_stats_create_region@DM_1_02_104
The F22 build has the correct behaviour (although the sort order is
inconsistent) but on RHEL6 the 1_02_106 symbol file appears after
version 1_02_104 which introduced the original symbol. This causes
the later version of the symbol to lose its version binding and be
reduced to local scope.
If using un-versioned exports of the current version of a symbol
(i.e. exported with the plain symbol name and no macro) and using
the linker script to set the symbol version, the current version
node must appear first in the version script: the un-versioned
symbol will be bound to the first version node found that contains
it.
On RHEL6 and the other older distros the original version of the
dm_stats_create_region() call sorted before the current version
(DM_1_02_104 vs. DM_1_02_106) leading to a subsequent link error for
the later symbol version:
dmsetup.o: In function `_do_stats_create_regions':
/root/src/git/lvm2/tools/dmsetup.c:4658: undefined reference to
`dm_stats_create_region'
Ensure that the ordering of entries in the version script is
consistent to avoid an old implementation shadowing a newer one by
sorting the list of file names before the loop:
$$(echo $(EXPORTED_SYMBOLS) | tr ' ' '\n' | sort -rnt_ -k5 )
This only sorts by patch level but this is sufficient to maintain
the correct order for current version files.
Tested on RHEL5, 6, 7 and F22.
Add support for the kernel precise_timestamps feature. This allows
regions to be created using counters with nanosecond precision.
A new dm_stats method, dm_stats_set_precise_timestamps() causes all
future regions created with this handle to attempt to enable precise
counters.
Fix the version export macros to make it possible to export two
different DM_* versions of a symbol: currently it is only possible for a
DM_* symbol to override a symbol in Base. Attempting to export two
symbols at different DM_* version levels (e.g. DM_1_02_104 and
DM_1_02_106) leads to a linker error due to a duplicate symbol
definition.
This is because the DM_EXPORTED_SYMBOL macro makes each exported symbol
the default (@@VERSION):
__asm__(".symver " #func "_v" #ver ", " #func "@@DM_" #ver )
Fix the macro to use a single '@' for a symbols exported in multiple
versions and rename the macros to DM_EXPORT_*:
DM_EXPORT_SYMBOL(func,ver)
DM_EXPORT_SYMBOL_BASE(func,ver)
For functions that have multiple implementations these macros control
symbol export and versioning.
Function definitions that exist in only one version never need to use
these macros.
Backwards compatible implementations must include a version tag of
the form "_v1_02_104" as a suffix to the function name and use the
macro DM_EXPORT_SYMBOL to export the function and bind it to the
specified version string.
Since versioning is only available when compiling with GCC the entire
compatibility version should be enclosed in '#if defined(__GNUC__)',
for example:
int dm_foo(int bar)
{
return bar;
}
#if defined(__GNUC__)
// Backward compatible dm_foo() version 1.02.104
int dm_foo_v1_02_104(void);
int dm_foo_v1_02_104(void)
{
return 0;
}
DM_EXPORT_SYMBOL(dm_foo,1_02_104)
#endif
A prototype for the compatibility version is required as these
functions must not be declared static.
The DM_EXPORT_SYMBOL_BASE macro is only used to export the base
versions of library symbols prior to the introduction of symbol
versioning: it must never be used for new symbols.
Single messages sent over unix sockets are limited in
size to /proc/sys/net/core/wmem_max, so send the 1MB
debug buffer in smaller chunks to avoid EMSGSIZE.
Also look for EAGAIN and retry sending for a limited
time when the reader is slower than the writer.
Also shift the location of that code so it's the same
as other requests.
With clusters larger than 3 nodes, the 32-byte debug buffer in
cpg_join_callback() is too small to contain all the node IDs, because
32-bit identifiers are generally rendered in 10 decimal digits. No fixed
size is good in all cases, but this is conditionally logged debug info,
so we can simply truncate it. Double the size, nevertheless.
Add a function to test whether the kernel precise_timestamps
feature is available in the current device-mapper driver version.
Presence of precise_timestamps also implies the availability of
latency histograms.
This mainly makes the description text use 80 columns.
There are a few minor adjustments to wording to help
the text layout, and a couple minor improvements to
descriptions.
The unlock call will fail in expected and normal cases,
and should not cause the command to fail. (An actual
unlock in the lock manager should never fail.)
We already use -lm functions in a couple of places (these are
satisfied by gcc built-ins for most builds): add a configure.in
check and explicitly link to -lm.
This reverts commit 70db1d523d.
Since we use 'strncpy' even for case where it exactly matches
the buffer size and \0 is not expected to be added there.
Before printing a commented automatic config value,
print a line describing what it is. Otherwise, the
commented value can look like it's a part of an
example preceding it.
The timerfd guarantees that it will return 8 bytes when a read(2)
is issued (a uint64_t giving the number of timer events during the
call). Check that it does so and log a non-fatal error if the byte
count is not 8.
Several interfaced in libdm-stats return a uint64_t when it is
only used to signal success/failure: change all these uses to
return a simple int instead.
Move code which runtime detects settings for cache_policy
out of config dir to cache seg handling code.
Also mark cache_mode as command profilable setting.
Revert back to already existing behavior which has been slightly
modified by a900d150e4.
At the end however it seem to be equal to change TID right with first
metadata write.
Existing code missed handling for 'unused' thin-pool which would
require to also check empty message list for TID==0.
So with the fix we now again preserve 'active' thin-pool volume
when first thin volume is created - this property was lost and caused
problems in cluster, where the lock was hold, but volume was no longer
active on the node.
Another missing part was the proper support for already increased,
but unfinished TID change.
So going back here with existing logic -
TID is increased with first MDA update.
Code allows start with either same TID or (TID-1).
If there are messages, TID must be lower by 1 for sending,
otherwise messages were already posted.
As cache_policy is evaluated in runtime, we no longer should use
CFG_COMMENTED, but have to switch to CFG_UNDEFINED.
So as long as the value is undefined, it's runtime evaluated.
Once it's set - it's always respected (no runtime fallback).
Also fix version of introduced settings to 2.2.128.
Commit f10ad95 introduced a regression causing the size of regions
passed in on the command line to be truncated to zero. Initialise
the 'this_len' variable to the supplied length to correct this.
Commit f10ad95 introduced a regression in the calculation of the
number of areas in a region created with the --areasize switch:
vg_hex-lv_home: Created new region with 0 area(s) as region ID 1
vg_hex-lv_swap: Created new region with 0 area(s) as region ID 1
Fis this by using the correct region size when calculating the
value.
When dmstats is run with -v or higher enable a per-area reporting
mode for statistics regions. This will output one row per area
(rather than one row per region) and adds additional fields of use
when viewing areas:
area_id - index within the region assigned by libdm-stats
area_start - the start location of the area in the containing
device.
Add a method to retrieve the offset of an area within the
containing region (rather than the offset within the containing
device returned by dm_stats_get_area_start()).
Although users of the library can calculate this themselves it is
better to provide this through a method call to avoid users making
assumptions about the structure of regions and areas.
The dm_stats_get_area_start (and its '_current_' variant) methods
are expected to return the start sector of the area in the
containing device.
Make sure the call adds region->start to the returned value.
Add a '--raw' switch to stats reports that causes us to report the
basic counter values rather than derived metrics for each visible
statistics region.
Add prefixes to all dmsetup report types to allow the 'group_all'
option to be effective:
DR_NAME name_
DR_INFO info_
DR_DEPS deps_
DR_TREE tree_
DR_NAME splitname_
When run with full verbosity dmsetup or dmstats reports will
output a figure that tracks a moving average over a window of the
last two intervals:
Interval #3 time delta: 999991087ns
Interval #3 mean duration: 999907064ns, current err: -8913ns
End interval #3 duration: 999991087ns
Adjusted sample interval duration: 999991087ns
Due to the narrow window this is a very crude estimate and is only
of use to someone debugging or modifying the stats clock: remove
the value and the global variables used to track it.
Anyone with a particular use for this information can construct a
better mean by calculating the value of a greater number of
intervals.
Unlike 'info -c' and 'stats report' the 'dmstats list' subcommand
does its own report processing. This complicates the handling of
the DR_STATS and DR_STATS_META fields and leads to inconsistent
behaviour between the different commands. In particular it causes
'stats list' to segfault when using 'all' field options:
Segmentation fault (core dumped)
Delete _stats_list() entirely and adapt _stats_report so that it
can correctly format a DR_STATS_META-only report request.
This requires passing the subcommand into _report_init() where it
is used in addition to the command name to select the default set
of report fields for the 'list' and 'report' stats subcommands.
With this change both 'list' and 'report' dmstats report will use
the correct report object type and ensure that it is initialised
appropriately for the field selection in use.
Although statistics and meta fields (region and area properties) share
the same object type the state of the handle they expect differs: meta
only expects a dm_stats_list() operation to have been performed whereas
statistics require a fully populated handle.
Distinguish between these requirements by separating the fields into
two distinct report types:
DR_STATS = 32,
DR_STATS_META = 64
The new category is described as "Mapped Device Statistics Region
Information" in the help text.
Make the use of the this_start and this_len variables easier to
follow and clarify the use of zero start and len arguments to
request a whole-device region.
Add a pair of fields to expose the current per-interval duation
estimate. The 'interval' field provides a real value in units of
seconds and the 'interval_ns' field provides the same quantity
expressed as a whole number of nanoseconds.
Do not include bits/time.h as it is an internal libc header file.
A comment at the top of the glibc specific bits/time.h says:
"Never include this file directly; use <time.h> instead."
This fixes the following build error with musl libc:
libdm-timestamp.c:37:23: fatal error: bits/time.h: No such file or directory
---
Compile tested with Alpine Linx (musl libc) and ubuntu 15.04
libdm/libdm-timestamp.c | 1 -
1 file changed, 1 deletion(-)
Introduce enums and global variables to record cleanly which command we
are processing and eliminate the historically inconsistent use of the
shifted argv[0] and fix assorted bugs discovered along the way.
Add dm_report_is_empty() to indicate there is no data awaiting output
and use this to suppress dmsetup report headings when no data is output
so we don't get a stray line saying 'Help' at the end of reporting help.
Define a report type (as the interface requires) so -o all selects
the right fields in splitname. (A fix for stats list will follow.)
Exit immediately if no device is supplied to dmsetup wipe_table instead
of hitting errors later and failing.
Adjust the command name printed in usage/help output to match command
invoked (most of the time).
The '--force' switch is only used by dmstats to allow either
creation or deletion of one or more regions on all devices.
These operations do not carry any risk: just a possible mess of
region IDs to be cleaned up.
Remove the use of '--force' for stats commands and change current
uses to a new '--alldevices' switch.
The region creation message just outputs the new region_id, e.g.:
Created region: 0
This is fine when the device is unambigous (as above) but produces
unhelpful output when creating multiple regions, or regions on
multiple devices:
Created region: 0
Created region: 0
Created region: 1
Created region: 2
Created region: 0
To address this refactor _stats_create_segments() (previously only
used when creating one-region-per-target for --segments) into a
more general _do_stats_create_regions() that can create regions
for each segment, or a single region spanning either the entire
device or a specied start/len range.
This allows us to output all region creation messages from a
single point where both the device name and all information needed
to derive the number of areas is available.
This allows us to log all these facts in the resulting messages:
vg_hex-lv_home: Created new region with 13 area(s) as region ID 0
vg_hex-lv_home: Created new region with 4 area(s) as region ID 1
vg_hex-lv_home: Created new region with 1 area(s) as region ID 2
vg_hex-lv_swap: Created new region with 1 area(s) as region ID 0
vg_hex-lv_root: Created new region with 10 area(s) as region ID 0
luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e: Created new region with 17 area(s) as region ID 0
vg_hex-lv_images: Created new region with 20 area(s) as region ID 0
vg_hex-lv_images: Created new region with 4 area(s) as region ID 1
Don't use cryptic abbreviations and make sure that all values can
be understood by someone not familiar with the clock internals.
Include the current interval number (inverse of the _count) in all
interval update messages and attempt to align interval timestamp
logs for interval counts < 99,999.
If _stats_report fails (e.g. due to an invalid device on the
command line) destroy the _report to prevent stats columns headings
from being displayed.
This also requires a change in main to test the return from
_perform_command_for_all_repeatable_args inside the interval loop
and exit immediately in case of error.
The _update_interval_times() function is called once per reported
object: when shutting down at the end of a run only the first call
should free timestamps. Clear the timestamp pointers after free
and use this to signal to other callers that the clock is already
shut down.
If the Linux timerfd interface to POSIX timers is available at compile
time use it for all report interval timekeeping. This gives more
accurate interval timing when the per-interval processing time is less
than the configured interval and simplifies the timestamp bookkeeping
required to keep accurate time.
For systems without timerfd support fall back to the simple usleep based
timer.
Change logic and naming of some internal API functions.
cache_set_mode() and cache_set_policy() both take segment.
cache mode is now correctly 'masked-in'.
If the passed segment is 'cache' segment - it will automatically
try to find 'defaults' according to profiles if the are NOT
specified on command line or they are NOT already set for cache-pool.
These defaults are never set for cache-pool.
Add code to detect available cache features.
Support policy_mq & policy_smq features which might be disabled.
Introduce global_cache_disabled_features_CFG.
Add new profilable configurables:
allocation/cache_policy
allocation/cache_settings
and mark allocation/cache_pool_chunk_size as profilable as well.
Obsolete allocation/cache_pool_cachemode and
introduce new allocation/cache_mode instead.
Rename DEFAULT_CACHE_POOL_POLICY to DEFAULT_CACHE_POLICY.
Request a transient LV lock from lvmlockd when
converting an LV. If the LV is inactive when
lvconvert is run, the LV lock will be acquired
and then released when the command is done.
If the LV is active, a persistent lock exists
already and the transient lock request does nothing.
This fixes the issue that had been mentioned in the
comment previously.
lvrename should not be done if the LV is active on another host.
This check was mistakenly removed when the code was changed to
use LV uuids in locks rather than LV names.
Since libdm-stats only uses fmemopen'd FILE objects the only way
that a close can fail is corruption of the memory containing the
FILE: check for this case and emit a backtrace if it occurs.
libdm/libdm-stats.c: 338 in _stats_parse_list()
libdm/libdm-stats.c: 341 in _stats_parse_list()
libdm/libdm-stats.c: 481 in _stats_parse_region()
libdm/libdm-stats.c: 487 in _stats_parse_region()
libdm/libdm-stats.c: 487 in _stats_parse_region()
- Calling "fclose" without checking return value
Remove an unneccessary conditional operator and simplify the logic
in _nr_areas:
libdm/libdm-stats.c: 501 in _nr_areas() - Control flow issues (DEADCODE)
The error path of _stats_list frees the task and stats objects:
don't try to branch to it before they have been allocated.
tools/dmsetup.c: 4589 in _stats_help() - Null pointer dereferences (FORWARD_NULL)
Make sure the newly created handle is freed if we are unable to
also create the pool for it.
tools/dmsetup.c: 4255 in _stats_list() - Variable "dms" going out of scope leaks the storage it points to.
Make sure comm is closed in the error path of _program_id_from_proc().
libdm/libdm-stats.c: 98 in dm_stats_create() - Variable "comm" going out of scope leaks the storage it points to.
There's no point testing _report here in _stats_report: it's always
initialised before the function is called and if the check did fail
we'd end up freeing an uninitialized dm_task in the error path.
tools/dmsetup.c: 4389 in _stats_report() - Declaring variable "dmt" without initializer.
The check for other sanlock lockspaces was not checking
that the lockspace type was sanlock, so if dlm lockspaces
were visible, they were wrongly included.
When vgremove is used to remove multiple VGs in one command,
e.g. vgremove foo bar, the first VG (foo) that is removed
may have held the sanlock global lock. In this case,
do not continue removing further VGs (bar) without the
global lock.
Add the libdm-stats module to libdm: this implements a simple interface
for creating, managing and interrogating I/O statistics regions and
areas on device-mapper devices.
The library interface is documented in libdevmapper.h and provides a
'dm_stats' handle that is used to perform statistics operations and
obtain data.
Public methods are provided to create and destroy handles and to list,
create, and destroy statistics regions as well as to obtain and parse
counter data and calculate rate-based metrics.
This commit also adds a 'dmsetup stats' (aka 'dmstats') command with
'clear', 'create', 'delete', 'list', 'print', and 'report' sub-commands.
See the library documentation and the dmstats.8 manual page for detailed
API and command descriptions.
Don't do interval management and external timekeeping for stats in
dm_report: let applications handle this on their own.
Since this has not been included in a release remove it from the
library entirely and handle report timing directly inside dmsetup.
Add a function to print column headings regardless of whether they
have already been output. This will be used by dmstats to issue
periodic reminders of the column headings.
This patch removes a check for RH_HEADINGS_PRINTED from
_report_headings that prevents headings being displayed if the flag
is already set; this check is redundant since the only existing
caller (_output_as_columns()) already tests the flag before
calling the function.
Not releasing objects back to the pool is fine for short-lived
pools since the memory will be freed when dm_pool_destroy() is
called.
Any pool that may be long-lived needs to be more careful to free
objects back to the pool to avoid leaking memory that will not be
reclaimed until the pool is destroyed at process exit time.
The report pool currently leaks each headings line and some row
data.
Although dm_report_output() tries to free the first allocated row
this may end up freeing a later row due to sorting of the row list
while reporting. Store a pointer to the first allocated row from
_do_report_obect() instead and free this at the end of
_output_as_columns(), _output_as_rows(), and dm_report_clear().
Also make sure to call dm_pool_free() for the headings line built
in _report_headings().
When dmstats is introduced it will maintain dm_report objects for
the whole lifetime of the process: without these changes a stats
report could leak around 600k in 10m (exact rate depends on field
selection and data values):
top - 12:11:32 up 4 days, 3:16, 15 users, load average: 0.01, 0.12, 0.14
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6473 root 20 0 130196 3124 2792 S 0.0 0.0 0:00.00 dmstats
top - 12:22:04 up 4 days, 3:26, 15 users, load average: 0.06, 0.11, 0.13
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6498 root 20 0 130836 3712 2752 S 0.0 0.0 0:00.60 dmstats
With this patch no increase in RSS is seen:
top - 13:54:58 up 4 days, 4:59, 15 users, load average: 0.12, 0.14, 0.14
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 130196 2996 2688 S 0.0 0.0 0:00.00 dmstats
top - 14:04:31 up 4 days, 5:09, 15 users, load average: 1.02, 0.67, 0.36
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 130196 2996 2688 S 0.3 0.0 0:00.32 dmstats
This also affects report output for repeating reports in the
DM_REPORT_OUTPUT_COLUMNS_AS_ROWS case; row state is not fully cleared for
the next iteration leading to progressive growth of the heading width:
vg_hex-lv_home:vg_hex-lv_swap:vg_hex-lv_root:luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:vg_hex-lv_images
253:253:253:253:253
2:0:1:4:3
L--w:L--w:L--w:L--w:L--w
1:2:1:1:1
3:1:1:1:2
0:0:0:0:0
LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiv08BCGvF4WsJSqWUDUt7qtf2hEmjtVvo:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiKf7XIiwdAYOJfaGhQe9fu26cTEICGgFS:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiEZj7ZXbmrWDuGhd7vvi88VF0NdTMG8iA:CRYPT-LUKS1-797339213f684c929eb7d0aca4c6ba3e-luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOi2rKredlBPnw2X7v1BiCuEpFo6gaE7BRw
:::::vg_hex-lv_home:vg_hex-lv_swap:vg_hex-lv_root:luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:vg_hex-lv_images
:::::253:253:253:253:253
:::::2:0:1:4:3
:::::L--w:L--w:L--w:L--w:L--w
:::::1:2:1:1:1
:::::3:1:1:1:2
:::::0:0:0:0:0
:::::LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiv08BCGvF4WsJSqWUDUt7qtf2hEmjtVvo:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiKf7XIiwdAYOJfaGhQe9fu26cTEICGgFS:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOiEZj7ZXbmrWDuGhd7vvi88VF0NdTMG8iA:CRYPT-LUKS1-797339213f684c929eb7d0aca4c6ba3e-luks-79733921-3f68-4c92-9eb7-d0aca4c6ba3e:LVM-9t8ITqLZa6AuuyVoz5Olp1KwF9ZDBfOi2rKredlBPnw2X7v1BiCuEpFo6gaE7BRw
This adds the infrastructure, code paths, error reporting,
etc. to handle storage errors, or storage loss, under the
sanlock leases in a VG that is being used. The loss of
storage means sanlock cannot renew its leases, which means
that the host needs to stop using the shared VG before its
leases expire.
This still requires manually shutting down a VG that has
lost lease storage, e.g. unmounting file systems,
deactivating LVs in the VG. The next step is to
automatically use a command like blkdeactivate to do that.
tools/polldaemon.c:465: uninit_use_in_call: Using uninitialized value "id.vg_name" when calling "print_log".
tools/polldaemon.c:465: uninit_use_in_call: Using uninitialized value "id.lv_name" when calling "print_log".
/lib/log/log.c:88: warning[invalidScanfArgType_int]: %llu in format string (no. 2) requires 'unsigned long long *' but the argument type is 'long long *'.
daemons/lvmlockd/lvmlockd-core.c:791: error[uninitstring]: Dangerous usage of 'version' (strncpy doesn't always null-terminate it).
The dlm global lockspace is automatically added when the
first dlm VG lockspace is added. Reverse this by removing
the dlm global lockspace after the last dlm VG lockspace
is removed. (Remove old non-working code that did this
based on an old command that could explicitly add/remove
the dlm global lockspace.)
Whenver reporting field name is registered with libdevmapper and if
the field name contains any number of underscores ('_'), libdm
can now automatically recognize any of its variant without any
underscores used.
For example:
..for underscores in prefixes:
pvs -o pv_name
pvs -o name
pvs -o pvname (newly recognized besides pvname)
..for underscores in the name:
lvs -o cache_mode
lvs -o cachemode
..or even multiple underscores:
pvs -o pv___na___me
It's all variant of the same field name.
No commands set has_subcommands yet.
Move multiple device loop to separate function because we'll
soon want to call it repeatedly.
(Based on patch from bmr.)
Use refresh_filters instead of destroy_filters and init_filters
in refresh_toolcontext fn which deals with cmd->initialized.filters
correctly on refresh.
Just shuffle the items and put them into logical groups so it's
visible at first sight what each group contains - it makes it a bit
easier to make heads and tails of the whole cmd_context monster.
When a command is flagged with NO_METADATA_PROCESSING flag, it means
such command does not process any metadata and hence it doens't require
lvmetad, lvmpolld and it can get away with no locking too. These are
mostly simple commands (like lvmconfig/dumpconfig, version, types,
segtypes and other builtin commands that do not process metadata
in any way).
At first, when lvm command is executed, create toolcontext without
initializing connections (lvmetad,lvmpolld) and without initializing
filters (which depend on connections init). Instead, delay this
initialization until we know we need this. That is, until the
lvm_run_command fn is called in which we know what the actual
command to run is and hence we can avoid any connection, filter
or locking initiliazation for commands that would not make use
of it anyway.
For all the other create_toolcontext calls, we keep the original
behaviour - the filters and connections are initialized together
with the toolcontext.
Make it possible to decide whether we want to initialize connections and
filters together with toolcontext creation.
Add "filters" and "connections" fields to struct
cmd_context_initialized_parts and set these in cmd_context.initialized
instance accordingly.
(For now, all create_toolcontext calls do initialize connections and
filters, we'll change that in subsequent patch appropriately.)
Move original lvmetad and lvmpolld initialization code from
_process_config fn to their own functions _init_lvmetad and
_init_lvmpolld (both covered with single _init_connections fn).
Add struct cmd_context_initialized_parts to wrap up information
about which cmd context pieces are initialized and add variable
of this struct type into struct cmd_context.
Also, move existing "config_initialized" variable that was directly
part of cmd_context into the new cmd_context.initialized wrapper.
We'll be adding more items into the struct cmd_context_initialized_parts
with subsequent patches...
This tries harder to avoid creating duplicate global locks in
sanlock VGs by refusing to create a new sanlock VG with a
global lock if other sanlock VGs exist that may have a gl.
vgsummary information contains provisional VG information
that is obtained without holding the VG lock. This info
can be used to lock the VG, and then read it with vg_read().
After the VG is read properly, the vgsummary info should
be verified.
Add the VG lock_type to the vgsummary. It needs to be
known before the VG can be locked and read.
This is a regression introduced by commit
6c0e44d5a2 which changed
the way dev_cache_get fn works - before this patch, when a
device was not found, it fired a full rescan to correct the
cache. However, the change coming with that commit missed
this full_rescan call, causing the lvmcache to still contain
info about PVs which should be filtered now.
Such situation may have happened by coincidence of using
old persistent cache (/etc/lvm/cache/.cache) which does not
reflect the actual state anymore, a device name/symlink which
now points to a device which should be filtered and a fact we
keep info about usable DM devices in .cache no matter what
the filter setting is.
This bug could be hidden though by changes introduced in
commit f1a000a477 as it
calls full_rescan earlier before this problem is hit.
But we need to fix this anyway for the dev_cache_get
to be correct if we happen to use the same code path
again somewhere sometime.
For example, simple reproducer was (before commit
1a000a477558e157532d5f2cd2f9c9139d4f87c):
- /dev/sda contains a PV header with UUID y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M
- lvm.conf: filter = [ "r|.*|" ]
- rm -f .cache (to start with clean state)
- dmsetup create test --table "0 8388608 linear /dev/sda 0" (8388608 is
just the size of the /dev/sda device I use in the reproducer)
- pvs (this will create .cache file which contains
"/dev/disk/by-id/lvm-pv-uuid-y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M"
as well as "/dev/mapper/test" and the target node "/dev/dm-1" - all the
usable DM mappings (and their symlinks) get into the .cache file even
though the filter "is set to "ignore all" - we do this - so far it's OK)
- dmsetup remove test (so we end up with /dev/disk/by-id/lvm-pv-uuid-...
pointing to the /dev/sda now since it's the underlying device
containing the actual PV header)
- now calling "pvs" with such .cache file and we get:
$ pvs
PV VG Fmt Attr PSize PFree
/dev/disk/by-id/lvm-pv-uuid-y5PzRD-RBAv-7sBx-V3SP-vDmy-DeSq-GUh65M vg lvm2 a-- 4.00g 0
Even though we have set filter = [ "r|.*|" ] in the lvm.conf file!
Moved out from lib/display and a little documentation added.
It's tuned to LVM's requirements historically and its behaviour
might not always be what you would expect.
There is no longer an "enable" option for the global lock,
so remove the bit of code that was checking for it. It
was an optional variation anyway, and not one that was likely
to be used.
Also update the corresponding comment describing global lock
creation.
Stop removing hyphens when = is seen. With an option
like --profile=thin-performance, the hyphen removal
will stop at = and will not remove - after thin.
Stop removing hyphens altogether when a stand alone arg
of -- appears.
Simply running concurrent copies of 'pvscan | true' is enough to make
clvmd freeze: pvscan exits on the EPIPE without first releasing the
global lock.
clvmd notices the client disappear but because the cleanup code that
releases the locks is triggered from within some processing after the
next select() returns, and that processing can 'break' after doing just
one action, it sometimes never releases the locks to other clients.
Move the cleanup code before the select.
Check all fds after select().
Improve some debug messages and warn in the unlikely event that
select() capacity could soon be exceeded.
When there are duplicate global locks, check if the gl
is still enabled each time a gl or vg lock is acquired
in the lockspace. Once one of the duplicates is disabled,
then other hosts will recognize that the issue is resolved
without needing to restart the lockspaces.
Move the DEBUG_MEM decision inside libdevmapper.so instead of exposing
it in libdevmapper.h which causes failures if the binary and library
were compiled with opposite debugging settings.
pvscan autoactivation does not work for lockd VGs because
lock start is needed on a lockd VG before locking can be
done for it. Add a check to skip the attempt at autoactivate
rather than calling it, knowing it will fail.
Add a comment explaining why pvscan --cache works fine for
lockd VGs without locks, and why autoactivate is not done.
. the poll check will eventually call finish which will
write the VG, so an ex VG lock is needed from lvmlockd.
. fix missing unlock on poll error path
. remove the lockd locking while monitoring the progress
of the command, as suggested by the earlier FIXME comment,
as it's not needed.
The CFG_SECTION_NO_CHECK flag can be used to mark a section
and its whole subtree as containing settings where checks
won't be made (lvmconfig --validate).
These are setting where we don't know the names and and type
in advance and they're recognized in runtime. As we don't know
the type and name in advance, we can't do any checks here
of course.
Use this flag with great care as it disables config checks
for the whole config subtree found under such section.
This flag is going to be used by subsequent patches from
Zdenek to support some cache settings...
Recent change to move the polling outside of core lvconvert
code was wrongly using 'lv' and 'vg' structs which can't be
used outside of the core code, which caused seg fault.
Properly isolate all use of lv structs within the core of
the lvconvert code, saving any information necessary,
(esp lvid). After the core of lvconvert is done, use
the saved information to do polling.
FIXME: the need for is_merging_origin and is_merging_origin_thin
in this patch is ugly, and a cleaner way should be found to deal
with that than what is done here.
Also it effectively removed all hacks in _lvconvert_merge_single
performing ugly: VG reread, unlock, polling, lock sequence.
Moreover all polling operations are postponed after all conversions
are finished.
lvm2 (while locking via lvmlockd) should now be able to run with
or without lvmpolld while performing poll operations originating
in lvconvert command.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Comply with the rules we have for log_error and log_warn...
$ pvcreate /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
1 existing signature left on the device.
Aborting pvcreate on /dev/sda1.
$ pvcreate /dev/sda1 --force
WARNING: Failed to get offset of the xfs_external_log signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
libblkid may return the list of signatures found, but it may not
provide offset and size for each signature detected. This may
happen in case signatures are mixed up or there are more, possibly
overlapping, signatures found.
Make lvm commands pass if such situation happens and we're using
--force (or any stronger force method).
For example:
$ pvcreate /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
1 existing signature left on the device.
Aborting pvcreate on /dev/sda1.
$ pvcreate --force /dev/sda1
Failed to get offset of the xfs_external_log signature on /dev/sda1.
Physical volume "/dev/sda1" successfully created
The vgchange/lvchange activation commands read the VG, and
don't write it, so they acquire a shared VG lock from lvmlockd.
When other commands fail to acquire a shared VG lock from
lvmlockd, a warning is printed and they continue without it.
(Without it, the VG metadata they display from lvmetad may
not be up to date.)
vgchange/lvchange -a shouldn't continue without the shared
lock for a couple reasons:
. Usually they will just continue on and fail to acquire the
LV locks for activation, so continuing is pointless.
. More importantly, without the sh VG lock, the VG metadata
used by the command may be stale, and the LV locks shown
in the VG metadata may no longer be current. In the
case of sanlock, this would result in odd, unpredictable
errors when lvmlockd doesn't find the expected lock on
disk. In the case of dlm, the invalid LV lock could be
granted for the non-existing LV.
The solution is to not continue after the shared lock fails,
in the same way that a command fails if an exclusive lock fails.
When lvmlockd is compiled without support for one of the
lock managers (sanlock or dlm), and a command tries to use
one of them, explain that in the error message.
Don't abort test when clvmd processes two VGs concurrently.
CLVMD: ioctl/libdm-iface.c:1940 Internal error: Performing unsafe table load while 3 device(s) are known to be suspended: (253:19)
When lvm is built without lvmlockd support, vgcreate using a
shared lock type would succeed and create a local VG (the
--shared option was effectively ignored). Make it fail.
Fix the same issue when using vgchange to change a VG to a
shared lock type.
Make the error messages consistent.
A segfault was reported when extending an LV with a smaller number of
stripes than originally used. Under unusual circumstances, the cling
detection code could successfully find a match against the excess
stripe positions and think it had finished prematurely leading to an
allocation being pursued with a length of zero.
Rename ix_offset to num_positional_areas and move it to struct
alloc_state so that _is_condition() can obtain access to it.
In _is_condition(), areas_size can no longer be assumed to match the
number of positional slots being filled so check this newly-exposed
num_positional_areas directly instead. If the slot is outside the
range we are trying to fill, just ignore the match for now.
(Also note that the code still only performs cling detection against
the first segment of the LV.)
Replace misleading "not found" in the log message when
devices/preferred_names is set to empty array:
Really not found:
device/dev-cache.c:689 devices/preferred_names not found in config: using built-in preferences
Found, but empty:
config/config.c:1431 Setting devices/preferred_names to preferred_names = [ ]
device/dev-cache.c:689 devices/preferred_names is empty: using built-in preferences
Commit 7e728fe1a1 added a log call
directly in find_config_tree_array when defaults are used.
This patch also adds the log for the value which is found in
existing configuration and for which defaults are not used.
For example:
Defaults used:
config/config.c:1428 devices/scan not found in config: defaulting to scan = [ "/dev" ]
Value defined in configuration used:
config/config.c:1431 Setting devices/scan to scan = [ "/dev", "/mydev", "/abc" ]
This makes the logging consistent with the other find_config_tree_* functions.
Policy name has to be always defined.
Capture it as an internal error before write.
When reading metadata without defined policy name, use default defined policy.
TODO: Unsure, but it might have to be actually always 'mq' in this case.
Keep policy name separate from policy settings and avoid
to mangling and demangling this string from same config tree.
Ensure policy_name is always defined.
Use find_config_tree_array for all config arrays. Also, add
INTERNAL_ERROR in case there should have been at least default
value defined for a setting but it was not returned for some
reason (either config_settings.h misconfiguration or other config
tree error printed by functions called by find_config_tree_array).
Both lock_start filters were being skipped when any lock-opt
values were used. The "auto" lock-opt should cause the
auto_lock_start_list to be used. The lock_start_list should
always be used.
The behavior of lock_start_list/auto_lock_start_list are tested
and verified to behave like volume_list/auto_activation_volume_list.
Since the default was changed to wait for lock-start to finish,
the "wait" and "autowait" lock-opt values are not needed, but a
new "autonowait" is added to the existing "nowait" avoid the
default waiting.
There are two different failure conditions detected in
access_vg_lock_type() that should have different error
messages. This adds another failure flag so the two
cases can be distinguished to avoid printing a misleading
error message.
Require global/{thin,cache}_{check,repair}_options to be always defined.
If not defined directly by user in the configuration and if there's no
concrete default option to use, make "" (empty string) the default one -
it's then clearly visible in the "lvmconfig --type default" (and
generated lvm.conf) and also it makes its handling in the code more
straightforward so we don't need to handle undefined values.
This means, if there are no default values for these settings defined,
we end up with this generated now:
{thin,cache}_{check,repair}_options = [ "" ]
So the value is never undefined and if it is, it's an error.
(The cache_repair_options is actually not used in the code at the moment,
but once the code using this setting is in, it will follow the same logic
as used for thin_repair_options.)
The "exported" state of the VG can be useful with lockd VGs
because the exported state keeps a VG from being used in general.
It's a way to keep a VG protected and out of the way.
Also fix the command flags: ALL_VGS_IS_DEFAULT is not true for
vgimport/vgexport, since they both return errors immediately if
no VG args are specified. LOCKD_VG_SH is not true for vgexport
beause it must use an ex lock to write the VG.
When --nolocking is used (by vgs, lvs, pvs):
. don't use lvmlockd at all (set use_lvmlockd to 0)
. allow lockd VGs to be read
When --readonly is used (by vgs, lvs, pvs, vgdisplay, lvdisplay,
pvdisplay, lvmdiskscan, lvscan, pvscan, vgcfgbackup):
. skip actual lvmlockd locking calls
. allow lockd VGs to be read
. check that only shared gl/vg locks are being requested
(even though the actually locking is being skipped)
. check that no LV locks are requested, because no LVs
should be activated or used in readonly mode
. disable using lvmetad so VGs are read from disk
It is important to note the limited commands that accept
the --nolocking and --readonly options, i.e. no commands
that change/write a VG or change/activate LVs accept these
options, only commands that read VGs.
A new lockd lock needs to be created for the new LV
created by split mirror and split snapshot. Disallow
these options in lockd VGs until that is implemented.
There are at least a couple instances where
the lock_args check does not work correctly,
(listed in the comment), so disable the
NULL check for lock_args until those are
resolved.
log_warn was added recently because no known code used
the given condition, but running pvcreate on an existing
PV uses this case, and should not produce a warning.
Put the change from commit #10d27998b3d2f6100e9e29e83d1d99948c55875f
back so we have working tree again for now. This code needs a bit of
a cleanup to return proper return value to check...
lib/format1/import-export.c:167: var_deref_op: Dereferencing null pointer "vg->lvm1_system_id"
lib/cache/lvmetad.c:1023: var_deref_op: Dereferencing null pointer "this"
daemons/lvmlockd/lvmlockd-core.c:2659: check_after_deref: Null-checking "act" suggests that it may be null, but it has already been dereferenced on all paths leading to the check
/daemons/lvmetad/lvmetad-core.c:1024: check_after_deref: Null-checking "pvmeta" suggests that it may be null, but it has already been dereferenced on all paths leading to the check
If running lvmconf ... --startstopservice --mirrorservice in systemd
environment, handle lvm2-cmirrord accordingly. A typo in the script
caused the lvm2-cmirrord to not start/stop immediately, it was
only enabled/disabled (so the --startstopservice was ignored in this
case).
This prevents 'lvremove vgname' from attempting to remove the
hidden sanlock LV. Only vgremove should remove the hidden
sanlock LV holding the sanlock locks.
tools/polldaemon.c:457: array_null: Comparing an array to null is not useful: "lv->lvid.s"
The lv->lvid.s is never NULL. The check was supposed to be *lv->lvid.s
to check if the string is not empty.
... Using uninitialized value "lockd_state" when calling "lockd_vg"
(even though lockd_vg assigns 0 to the lockd_state, but it looks at
previous state of lockd_state just before that so we need to have
that properly initialized!)
libdm/libdm-report.c:2934: uninit_use_in_call: Using uninitialized value "tm". Field "tm.tm_gmtoff" is uninitialized when calling "_get_final_time".
daemons/lvmlockd/lvmlockctl.c:273: uninit_use_in_call: Using uninitialized element of array "r_name" when calling "format_info_r_action". (just added FIXME as this looks unfinished?)
lib/log/log.c:115: leaked_storage: Variable "st" going out of scope leaks the storage it points to
daemons/lvmpolld/lvmpolld-core.c:573: leaked_storage: Variable "cmdargv" going out of scope leaks the storage it points to
daemons/lvmlockd/lvmlockd-core.c:5341: leaked_handle: Handle variable "fd" going out of scope leaks the handle
daemons/lvmlockd/lvmlockctl.c:575: overwrite_var: Overwriting "able_vg_name" in "able_vg_name = strdup(optarg)" leaks the storage that "able_vg_name" points to
daemons/lvmlockd/lvmlockctl.c:571: overwrite_var: Overwriting "able_vg_name" in "able_vg_name = strdup(optarg)" leaks the storage that "able_vg_name" points to
daemons/lvmlockd/lvmlockctl.c:385: leaked_handle: Handle variable "s" going out of scope leaks the handle
Before, we used general find_config_tree_node function to retrieve
array values. This had a downside where if the node was not found,
we had to insert default values directly in-situ after the
find_config_tree_node call. This way, we had two copies of default
values - one in config_settings.h and the other one directly in the
code where we found out that find_config_tree_node returned NULL and
hence we needed to fall back to defaults.
With separate find_config_tree_array used for array config values,
we keep all the defaults centrally in config_settings.h because
the new find_config_tree_array automatically returns these defaults
if it can't find any value set in the configuration.
This patch just makes the behaviour exactly the same for arrays as
for any other non-array type where we call find_config_tree_<type>
already, hence making the internal interface for handling array
values consistent with the rest of the config types.
if lvm2 is built with debug memory options dm_free() is not
mapped directly to std library's free(). This may cause memory corruption
as a line buffer may get reallocated in getline with realloc.
This is a temporary hotfix. Other debug memory failure needs to
be investigated and explained.
including the allow_override_lock_modes setting.
It was not possible to override default lock modes any longer,
since the command line options had already been removed.
A mechanism will probably be required later that puts part of
this back.
# lvmdbusd relies on command log report to inspect LVM command's execution status
report_command_log=1
# display only outermost LVM shell-related log that lvmdbusd inspects first after LVM command execution (it calls 'lastlog' for more detailed log afterwards if needed)
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.