IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
When reading an info about MDAs from lvmetad, we need to use 64 bit
int to read the value of the offset/size, otherwise the value is
overflows and then it's used throughout!
This is dangerous if we're trying to write such metadata area then,
mostly visible if we're using 2 mdas where the 2nd one is at the end
of the underlying device and hence the value of the mda offset is
high enough to cause problems:
(the offset trimmed to value of 0 instead of 4096m, so we write
at the very start of the disk (or elsewhere if the offset has
some other value!)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 4097m
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
[1] raw/~ # pvchange -u /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" changed
1 physical volume changed / 0 physical volumes not changed
[1] raw/~ # hexdump -n 512 /dev/vg/lvol0
0000000 d43e d2a5 4c20 4d56 2032 5b78 4135 7225
0000010 4e30 3e2a 0001 0000 0000 0000 0000 0000
0000020 0000 0010 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
=======
(the offset overflows to undefined values which is far behind
the end of the disk)
[1] raw/~ # lvcreate -s -l 100%FREE vg --virtualsize 100g
Logical volume "lvol0" created
[1] raw/~ # pvcreate --metadatacopies 2 /dev/vg/lvol0
Physical volume "/dev/vg/lvol0" successfully created
[1] raw/~ # pvchange -u /dev/vg/lvol0
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
/dev/vg/lvol0: lseek 18446744073708503040 failed: Invalid argument
Failed to store physical volume "/dev/vg/lvol0"
0 physical volumes changed / 1 physical volume not changed
Also add -k/--setactivationskip y/n and -K/--ignoreactivationskip
options to lvcreate.
The --setactivationskip y sets the flag in metadata for an LV to
skip the LV during activation. Also, the newly created LV is not
activated.
Thin snapsots have this flag set automatically if not specified
directly by the --setactivationskip y/n option.
The --ignoreactivationskip overrides the activation skip flag set
in metadata for an LV (just for the run of the command - the flag
is not changed in metadata!)
A few examples for the lvcreate with the new options:
(non-thin snap LV => skip flag not set in MDA + LV activated)
raw/~ $ lvcreate -l1 vg
Logical volume "lvol0" created
raw/~ $ lvs -o lv_name,attr vg/lvol0
LV Attr
lvol0 -wi-a----
(non-thin snap LV + -ky => skip flag set in MDA + LV not activated)
raw/~ $ lvcreate -l1 -ky vg
Logical volume "lvol1" created
raw/~ $ lvs -o lv_name,attr vg/lvol1
LV Attr
lvol1 -wi------
(non-thin snap LV + -ky + -K => skip flag set in MDA + LV activated)
raw/~ $ lvcreate -l1 -ky -K vg
Logical volume "lvol2" created
raw/~ $ lvs -o lv_name,attr vg/lvol2
LV Attr
lvol2 -wi-a----
(thin snap LV => skip flag set in MDA (default behaviour) + LV not activated)
raw/~ $ lvcreate -L100M -T vg/pool -V 1T -n thin_lv
Logical volume "thin_lv" created
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg
LV Attr
pool twi-a-tz-
thin_lv Vwi-a-tz-
thin_snap Vwi---tz-
(thin snap LV + -K => skip flag set in MDA (default behaviour) + LV activated)
raw/~ $ lvcreate -s vg/thin_lv -n thin_snap -K
Logical volume "thin_snap" created
raw/~ $ lvs -o name,attr vg/thin_lv
LV Attr
thin_lv Vwi-a-tz-
(thins snap LV + -kn => no skip flag in MDA (default behaviour overridden) + LV activated)
[0] raw/~ # lvcreate -s vg/thin_lv -n thin_snap -kn
Logical volume "thin_snap" created
[0] raw/~ # lvs -o name,attr vg/thin_snap
LV Attr
thin_snap Vwi-a-tz-
If "vgcreate/lvcreate --profile <profile_name>" is used, the profile
name is automatically stored in metadata for making it possible to
load it automatically next time the VG/LV is used.
This is per VG/LV profile loading on demand. The profile itself is saved
in struct volume_group/logical_volume as "profile" field so we can
reference it whenever needed.
A helper type that helps with identification of the configuration source
which makes handling the configuration cascade a bit easier, mainly
removing and adding configuration trees to cascade dynamically.
Currently, the possible types are:
CONFIG_UNDEFINED - configuration is not defined yet (not initialized)
CONFIG_FILE - one file configuration
CONFIG_MERGED_FILES - configuration that is a result of merging more files into one
CONFIG_STRING - configuration string typed on cmd line directly
CONFIG_PROFILE - profile configuration (the new type of configuration, patches will follow...)
Also, generalize existing "remove_overridden_config_tree" to work with
configuration type identification in a cascade. Before, it was just
the CONFIG_STRING we used. Now, we need some more to add in a
cascade (like the CONFIG_PROFILE). So, we have:
struct dm_config_tree *remove_config_tree_by_source(struct cmd_context *cmd, config_source_t source);
config_source_t config_get_source_type(struct dm_config_tree *cft);
... for removing the tree by its source type from the cascade and
simply getting the source type.
In the last update not all code paths have set the archived flag.
If we run in test mode or without archiving enabled - set the bit
as well - so test whether archiving has been called succesfully
will be ok. (in relase fix).
Do not keep multiple archives for the executed command.
Reuse the ALLOCATABLE_PV from pv status for
ARCHIVED_VG vg status. Mark VG with the bit with the
first archivation.
...not the other way round as it was before. This way it makes
more sense as BA use is exceptional and it's useless to
contaminate the log with messages about BA not being found
in metadata.
When vgname has not existed in metadata, it has crashed on double free
in format_instance destroy() - since VG was created, used FID and was
released - which also released FID, so further use was accessing bad
memory.
Fix it for this code path before release_vg() so FID will exists
when _vg_read_file_name() returns NULL.
'lvchange' is used to alter a RAID 1 logical volume's write-mostly and
write-behind characteristics. The '--writemostly' parameter takes a
PV as an argument with an optional trailing character to specify whether
to set ('y'), unset ('n'), or toggle ('t') the value. If no trailing
character is given, it will set the flag.
Synopsis:
lvchange [--writemostly <PV>:{t|y|n}] [--writebehind <count>] vg/lv
Example:
lvchange --writemostly /dev/sdb1:y --writebehind 512 vg/raid1_lv
The last character in the 'lv_attr' field is used to show whether a device
has the WriteMostly flag set. It is signified with a 'w'. If the device
has failed, the 'p'artial flag has priority.
Example ("nosync" raid1 with mismatch_cnt and writemostly):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg Rwi---r-m 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-w 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-- 1 linear 4.00m
Example (raid1 with mismatch_cnt, writemostly - but failed drive):
[~]# lvs -a --segment vg
LV VG Attr #Str Type SSize
raid1 vg rwi---r-p 2 raid1 500.00m
[raid1_rimage_0] vg Iwi---r-- 1 linear 500.00m
[raid1_rimage_1] vg Iwi---r-p 1 linear 500.00m
[raid1_rmeta_0] vg ewi---r-- 1 linear 4.00m
[raid1_rmeta_1] vg ewi---r-p 1 linear 4.00m
A new reportable field has been added for writebehind as well. If
write-behind has not been set or the LV is not RAID1, the field will
be blank.
Example (writebehind is set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r-- 512
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
Example (writebehind is not set):
[~]# lvs -a -o name,attr,writebehind vg
LV Attr WBehind
lv rwi-a-r--
[lv_rimage_0] iwi-aor-w
[lv_rimage_1] iwi-aor--
[lv_rmeta_0] ewi-aor--
[lv_rmeta_1] ewi-aor--
For example, the old call and reference:
find_config_tree_str(cmd, "devices/dir", DEFAULT_DEV_DIR)
...now becomes:
find_config_tree_str(cmd, devices_dir_CFG)
So we're referring to the named configuration ID instead
of passing the configuration path and the default value
is taken from central config definition in config_settings.h
automatically.
Just to prevent accidental and improper use when reading the layout
from disk because of the already existing disk_areas_xl[0] lists
that are variable in size. We can read pv_header_extension only
after we know exactly where the lists end...
The PV header extension information (PV header extension version, flags
and list of Embedding Area locations) is stored just beyond the PV header base.
When calculating the Embedding Area start value (ea_start), the same logic is
used as when calculating the pe_start value for Data Area - the value must
follow exactly the same alignment restrictions for its start value
(the alignment detected automatically or provided via command line using
the --dataalignment and --dataalignmentoffset arguments).
The Embedding Area is placed at the very start of the PV, starting at
ea_start. The Data Area starting at pe_start is placed next. The pe_start is
still properly aligned. Due to the pe_start alignment, it's possible that the
resulting Embedding Area size (ea_size) ends up bigger in size than requested
(but never less than requested).
New tools with PV header extension support will read the extension
if it exists and it's not an error if it does not exist (so old PVs
will still work seamlessly with new tools).
Old tools without PV header extension support will just ignore any
extension.
As for the Embedding Area location information (its start and size),
there are actually two places where this is stored:
- PV header extension
- VG metadata
The VG metadata contains a copy of what's written in the PV header
extension about the Embedding Area location (NULL value is not copied):
physical_volumes {
pv0 {
id = "AkSSRf-difg-fCCZ-NjAN-qP49-1zzg-S0Fd4T"
device = "/dev/sda" # Hint only
status = ["ALLOCATABLE"]
flags = []
dev_size = 262144 # 128 Megabytes
pe_start = 67584
pe_count = 23 # 92 Megabytes
ea_start = 2048
ea_size = 65536 # 32 Megabytes
}
}
The new metadata fields are "ea_start" and "ea_size".
This is mostly useful when restoring the PV by using existing
metadata backups (e.g. pvcreate --restorefile ...).
New tools does not require these two fields to exist in VG metadata,
they're not compulsory. Therefore, reading old VG metadata which doesn't
contain any Embedding Area information will not end up with any kind
of error but only a debug message that the ea_start and ea_size values
were not found.
Old tools just ignore these extra fields in VG metadata.
PV header extension comes just beyond the existing PV header base:
PV header base (existing):
- uuid
- device size
- null-terminated list of Data Areas
- null-terminater list of MetaData Areas
PV header extension:
- extension version
- flags
- null-terminated list of Embedding Areas
This patch also adds "eas" (Embedding Areas) list to lvmcache (lvmcache_info)
and it also adds support for common operations on the list (just like for
already existing "das" - Data Areas list):
- lvmcache_add_ea
- lvmcache_update_eas
- lvmcache_foreach_ea
- lvmcache_del_eas
Also, add ea_start and ea_size to struct physical_volume for processing
PV Embedding Area location throughout the code (currently only one
Embedding Area is supported, though the definition on disk allows for
more if needed in the future...).
Also, define FMT_EAS format flag to mark that the format actually
supports Embedding Areas (currently format-text only).
If zero metadata copies are used, there's no further recalculation of
PV alignment that happens when adding metadata areas to the PV and
which actually calculates the alignment correctly as a matter of fact.
So fix this for "PV without MDA" case as well.
Before this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 8.00m
After this patch:
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 1 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
[1] raw/~ # pvcreate --dataalignment 8m --dataalignmentoffset 4m
--metadatacopies 0 /dev/sda
Physical volume "/dev/sda" successfully created
[1] raw/~ # pvs -o pv_name,pe_start
PV 1st PE
/dev/sda 12.00m
Also, remove a superfluous condition "pv->pe_start < pv->pe_align" in:
if (pe_start == PV_PE_START_CALC && pv->pe_start < pv->pe_align)
pv->pe_start = pv->pe_align ...
This part of the condition is not reachable as with the PV_PE_START_CALC,
we always have pv->pe_start set to 0 from the PV struct initialisation
(...the pv->pe_start value is just being calculated).
Allow restoring metadata with thin pool volumes.
No validation is done for this case within vgcfgrestore tool -
thus incorrect metadata may lead to destruction of pool content.
Use log_warn to print non-fatal warning messages.
Use of log_error would confuse checker for testing
whether proper error has been reported for some real error.
We were using daemon_send_simple until now, but it is no longer adequate, since
we need to manipulate requests in a generic way (adding a validity token to each
request), and the tree-based request interface is much more suitable for this.
Add 3rd daemon return state "unknown" for lookups that are carried out
successfully but don't find the item requested.
Avoid issuing error messages when it's expected that a device that's
being looked up in lvmetad might not be there.
Make sure both hash tables are initialized before _read_sections() call.
Presents no functional change (since PV scan phase was not adding LV hashes),
but makes the code easier to handle mem failing case, and static analyzer is
hapier as well.
Adding at least stack traces with some FIXMEs for cases,
where we might want to do something cleaver - maybe fail command
or give user hints something is not going well ?
For remote_backup is stack probably 'good' enough for now.
Move commod code to destroy orphan VG into free_orphan_vg() function.
Use orphan vgmem for creation of PV lists.
Remove some free_pv_fid() calls (FIXME: check all of them)
FIXME: Check whether we could merge release_vg back again for all VGs.
Basic support to keep info when the LV was created.
Host and time is stored into LV mda section.
FIXME: Current version doesn't support configurable string via lvm.conf
and used fixed version strftime "%Y-%m-%d %T %z".
RAID is not like traditional LVM mirroring. LVM mirroring required failed
devices to be removed or the logical volume would simply hang. RAID arrays can
keep on running with failed devices. In fact, for RAID types other than RAID1,
removing a device would mean substituting an error target or converting to a
lower level RAID (e.g. RAID6 -> RAID5, or RAID4/5 to RAID0). Therefore, rather
than removing a failed device unconditionally and potentially allocating a
replacement, RAID allows the user to "replace" a device with a new one. This
approach is a 1-step solution vs the current 2-step solution.
example> lvconvert --replace <dev_to_remove> vg/lv [possible_replacement_PVs]
'--replace' can be specified more than once.
example> lvconvert --replace /dev/sdb1 --replace /dev/sdc1 vg/lv
Use static buffer instead of stack allocated buffer.
This reduces stack size usage of lvm tool and the
change is very simple.
Since the whole library is not thread safe - it should not
add any new problems - and if there will be some conversion
it's easy to convert this to use some preallocated buffer.
When a PV label write is deferred to a vg_write call (as introduced by a patch
in 2.02.86), the PV is flagged with the internal UNLABELLED_PV flag. However,
when calling vg_archive before vg_write, we still have the PV labelled with the
UNLABELLED_PV flag which was not recognised as a proper flag while exporting
VG metadata:
# vgcreate vg /dev/sda
No physical volume label read from /dev/sda
Metadata inconsistency: Not all flags successfully exported.
Metadata inconsistency: Not all flags successfully exported.
Writing physical volume data to disk "/dev/sda"
Physical volume "/dev/sda" successfully created
Volume group "vg" successfully created
functionality. A number of bugs (copied and pasted all over the code) should
disappear:
- most string lookup based on dm_config_find_node would segfault when
encountering a non-zero integer (the intention there was to print an
error message instead)
- check for required sections in metadata would have been satisfied by
values as well (i.e. not sections)
- encountering a section in place of expected flag value would have
segfaulted (due to assumed but unchecked cn->v != NULL)
leaving behind the LVM-specific parts of the code (convenience wrappers that
handle `struct device` and `struct cmd_context`, basically). A number of
functions have been renamed (in addition to getting a dm_ prefix) -- namely,
all of the config interface now has a dm_config_ prefix.
There's a very high memory usage when calling _pv_analyse_mda_raw (e.g. while
executing pvck) that can end up with "out of memory".
_pv_analyse_mda_raw scans for metadata in the MDA, iteratively increasing the
size to scan with SECTOR_SIZE until we find a probable config section or we're
at the edge of the metadata area. However, when using a memory pool, we're also
iteratively chasing for bigger and bigger mempool chunk which can't be found
and so we're always allocating a new one, consuming more and more memory...
This patch just changes the mempool to direct memory allocation in this
problematic part of the code.
Move the free_vg() to vg.c and replace free_vg with release_vg
and make the _free_vg internal.
Patch is needed for sharing VG in vginfo cache so the release_vg function name
is a better fit here.
Implementation described in doc/lvm2-raid.txt.
Basic support includes:
- ability to create RAID 1/4/5/6 arrays
- ability to delete RAID arrays
- ability to display RAID arrays
Notable missing features (not included in this patch):
- ability to clean-up/repair failures
- ability to convert RAID segment types
- ability to monitor RAID segment types
It's useful to keep the partial flag cached - so just move the call
for vg_mark_partil_lvs() into import_vg_from_config_tree() so it gets
evaluated before it goes through the lvmcache.
This patch should not present any functional change.
Note: It is rather temporal solution - proper place is probably inside the
'read' call back - but needs some more discussion.
For now using this minor hack.
transient error), stemming from the following sequence of events:
1) devices fail IO, triggering repair
2) dmeventd starts fixing up the mirror
3) during the downconversion, a new metadata version is written
--> the devices come back online here
4) the mirror device suspend/resume is called to update DM tables
5) during the suspend/resume cycle, *pre*-commit metadata is read;
however, since the failed devices are now back online, we get back
inconsistent set of precommit metadata and the whole operation fails
The patch relaxes the check that fails in step 5 above, namely by ignoring
inconsistencies coming from PVs that are marked MISSING.
Before, we used vg_write_lock_held call to determnine the way a device is
opened. Unfortunately, this opened many devices in RW mode when it was not
really necessary. With the OPTIONS+="watch" rule used in the udev rules,
this could fire numerous events while closing such devices (and it caused
useless scans from within udev rules in return).
A common bug we hit with this was with the lvremove command which was unable
to remove the LV since it was being opened from within the udev rules. This
patch should minimize such situations (at least with respect to LVM handling
of devices).
Though there's still a possibility someone will open a device 'outside' in
parallel and fire the event based on the watch rule when closing a device
once opened for RW.
Avoid using of already released memory when duplicated MDA is found.
As get_pv_from_vg_by_id() may call lvmcache_label_scan() use the local copy
of the vgname and vgid on the stack as vginfo may dissapear and code was
then accessing garbage in memory.
i.e. pvs /dev/loop0
(when /dev/loop0 and /dev/loop1 has same MDA content)
Invalid read of size 1
at 0x523C986: dm_hash_lookup (hash.c:325)
by 0x440C8C: vginfo_from_vgname (lvmcache.c:399)
by 0x4605C0: _create_vg_text_instance (format-text.c:1882)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
by 0x47FBD8: vg_read_internal (metadata.c:3231)
by 0x477594: get_pv_from_vg_by_id (metadata.c:344)
by 0x45F07A: _get_pv_if_in_vg (format-text.c:1400)
by 0x45F0B9: _populate_pv_fields (format-text.c:1414)
by 0x45F40F: _text_pv_read (format-text.c:1493)
by 0x480431: _pv_read (metadata.c:3500)
by 0x4802B2: pv_read (metadata.c:3462)
Address 0x652ab80 is 0 bytes inside a block of size 4 free'd
at 0x4C2756E: free (vg_replace_malloc.c:366)
by 0x442277: _free_vginfo (lvmcache.c:963)
by 0x44235E: _drop_vginfo (lvmcache.c:992)
by 0x442B23: _lvmcache_update_vgname (lvmcache.c:1165)
by 0x443449: lvmcache_update_vgname_and_id (lvmcache.c:1358)
by 0x443C07: lvmcache_add (lvmcache.c:1492)
by 0x46588C: _text_read (text_label.c:271)
by 0x466A65: label_read (label.c:289)
by 0x4413FC: lvmcache_label_scan (lvmcache.c:635)
by 0x4605AD: _create_vg_text_instance (format-text.c:1881)
by 0x46140D: _text_create_text_instance (format-text.c:2243)
by 0x47EB49: _vg_read (metadata.c:2887)
Add testing script
As code uses strncpy(system_id, NAME_LEN) and doesn't set '\0'
Fix it by always allocating NAME_LEN + 1 buffer size and with zalloc
we always get '\0' as the last byte.
This bug may trigger some unexpected behavior of the string operation
code - depends on the pool allocator.
FIXME: refactor this code to alloc_vg.
Missing free_vg on error_path in lvmcache_get_vg fn. Call destroy_instance
only if the fid is not part of the vg in backup_read_vg fn (otherwise it's
part of the VG we're returning and we definitely don't want to destroy it!).
This is essential for proper format instance ref_count support. We must
use these functions to set the fid everywhere from now on, even the NULL
value!
We'd like to use the fid mempool for text_context that is stored
in the instance (we used cmd mempool before, so the order of
initialisation was not a matter, but now it is since we need to
create the fid mempool first which happens in create_instance fn).
The text_context initialisation is not needed anywhere outside the
create_instance fn so move it there.
Format instances can be created anytime on demand and it contains
metadata area information mostly (at least for now, but in the future,
we may store more things here to update/edit in a PV/VG). In case we
have lots of metadata areas, memory consumption will rise. Using cmd
context mempool is not quite optimal here because it is destroyed too
late. So let's use a separate mempool for format instances.
Reference counting is used because fids could be shared, e.g. each PV
has either a PV-based fid or VG-based fid. If it's VG-based, each PV has
a shared fid with the VG - a reference to VG's fid.
Create new function alloc_vg() to allocate VG structure.
It takes pool_name (for easier debugging).
and also take vg_name to futher simplify code.
Move remainder of _build_vg_from_pds to _pool_vg_read
and use vg memory pool for import functions.
(it's been using smem -> fid mempool -> cmd mempool)
(FIXME: remove mempool parameter for import functions and use vg).
Move remainder of the _build_vg to _format1_vg_read
We allow writing non-orphan PVs only for resize now. The "orphan PV" assert
in pv_write fn uses the "allow_non_orphan" parameter to control this assert.
However, we should find a more elaborate solution so we can remove this
restriction altogether (pv_write together with vg_write is not atomic, we
need to find a safe mechanism so there's an easy revert possible in case of
an error).
Add a small fix that preserves pe_start for lvm1 PVs when being converted.
(this fix needs to be replaced with something more clever, but let's have this working now)
If the PV is already part of the VG (so the pv->fid == vg->fid), it makes no
sense to attach the mdas information from PV to a VG. Instead, we read new
PV metadata information from cache and attach it to the VG fid.
This function also sets a reference to a new VG format instance for all PVs
that are part of the VG so the PV-VG interconnection is consistent after the
change.
Add supporting functions to work with the format instance and metadata area
structures stored within the format instance. Add support for simple indexing
of metadata areas using PV id and mda order (for on-disk PV only for now, we
can extend the indexing even for other mdas if needed - we only need to define
a proper key for the index).
New strategy for memory locking to decrease the number of call to
to un/lock memory when processing critical lvm functions.
Introducing functions for critical section.
Inside the critical section - memory is always locked.
When leaving the critical section, the memory stays locked
until memlock_unlock() is called - this happens with
sync_local_dev_names() and sync_dev_names() function call.
memlock_reset() is needed to reset locking numbers after fork
(polldaemon).
The patch itself is mostly rename:
memlock_inc -> critical_section_inc
memlock_dec -> critical_section_dec
memlock -> critical_section
Daemons (clmvd, dmevent) are using memlock_daemon_inc&dec
(mlockall()) thus they will never release or relock memory they've
already locked memory.
Macros sync_local_dev_names() and sync_dev_names() are functions.
It's better for debugging - and also we do not need to add memlock.h
to locking.h header (for memlock_unlock() prototyp).
Change function import_vg_from_buffer() to import_vg_from_config_tree().
Instead of creating config tree inside the function allow config tree to
be passed as parameter - usable later for caching.
Checking for vg being != NULL in this place is not needed.
Pointer vg is already dereferced in this function above this code line.
Also this internal function _read_pv is always called with valid 'vg' pointer.
As const segment_type or const format_type are never released
use their non-const version and remove const downcast from dm_free calls.
This change fixes many gcc warnings we were getting from them.
To have better control were the config tree could be modified use more
const pointers and very carefully downcast them back to non-const
(for config tree merge).
Set cmd->independent_metadata_areas if metadata/dirs or disk_areas in use.
- Identify and record this state.
Don't skip full scan when independent mdas are present even if memlock is set.
- Clusters and OOM aren't supported, so no problem doing the proper scans.
Avoid revalidating the label cache immediately after scanning.
- A simple optimisation.
Support scanning for a single VG in independent mdas.
- Not used by the fix but I left it in anyway as later patches might use it.
Nicely hidden memory leak in outf macro error path.
This macro is using out_text() and does automagical return_0.
That would leak tag_buffer allocated memory.
As there was same code for tags output - create _out_tags() function.
In other LVM memory structures such as volume_group, the field
used to store flags is called "status", and on-disk fields are called
'flags', so rename the one inside metadata_area to be consistent.
Not only is it more consistent with existing code but is cleaner
to say "the status of this mda is ignored".
Background for this patch - prajnoha pinged me on IRC this morning
about a fix he was working on related to metadataignore when
metadata/dirs was set. I was reviewing my patches from this year
and realized the 'flags' field was probably not the best choice
when I originally did the metadataignore patches.
In certain configurations, we're not under a VG rw lock while trying to write
a new archive file with VG metadata. A common example is using "vgs" while
having the content of backup and archive directories empty. The code scans the
content of these directories and tries to determine the final index that should
be used in archive name. Since we're not under a lock, we can get into a race
while choosing the index which could end up showing errors about not being able
to rename to final archive name. Let's add random number suffix to these archive
file names so we can avoid the race.
For example, when using '--config "backup { ... }"' line, the values from
lvm.conf (or default values) should be overridden. This patch adds
reinitialisation of archive and backup handling on toolcontext refresh
which makes these settings to be applied.
Add "devices/default_data_alignment" to lvm.conf to control the internal
default that LVM2 uses: 0==64k, 1==1MB, 2==2MB, etc.
If --dataalignment (or lvm.conf's "devices/data_alignment") is specified
then it is always used to align the start of the data area. This means
the md_chunk_alignment and data_alignment_detection are disabled if set.
(Same now applies to pvcreate --dataalignmentoffset, the specified value
will be used instead of the result from data_alignment_offset_detection)
set_pe_align() still looks to use the determined default alignment
(based on lvm.conf's default_data_alignment) if the default is a
multiple of the MD or topology detected values.
Pass metadataignore through PV creation / setup paths.
As a result of this cleanup, we can remove the unnecessary setting
of mda_ignore bits inside pvcreate_single(), after call to pv_create.
For now, just set metadataignore to '0' in some places. This is
equivalent to the prior functionality, although the 0 is given
by the caller not hardcoded in _mda_setup() call.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Allow metadataignore flag to be passed in to pvcreate.
Ideally, more refactoring of the mda allocation / initialization
is warranted, but for now, we just add another parameter to 'add_mda'
to take an existing mda ignored flag. We need to do this or pv_write
loses the state of the mda 'ignored' flag before copying and writing
to disk.
Print device name when setting or clearing metadata ignore bit.
Example:
label/label.c:160 /dev/loop2: lvm2 label detected
cache/lvmcache.c:1136 lvmcache: /dev/loop2: now in VG #orphans_lvm2 (#orphans_lvm2)
metadata/metadata.c:4142 Setting mda ignored flag for metadata_locn /dev/loop2.
format_text/text_label.c:318 Skipping mda with ignored flag on device /dev/loop2 at offset 4096
Logging isn't ideal, especially for mda_set_ignore. Ideally we'd
like to display the device name and offset in this case but this
requires a bit more work and a per-format 'mda_description' function
pointer definition (we don't have access to mda_context in
metadata.c).
There's an intermittent failure with vgcfgbackup that seems to have been
introduced with the metadataignore / vgmetadatacopies patchset.
Intermittent failures are often the result of uninitialized data,
so this patch calls zalloc in a few places it might matter.
This patch adds the ability to read/write the vg->mda_copies values
from/to the vg metadata.
If we read the VG metadata and this field does not exist, we set
mda_copies to the default value of 0. Later in the code, we use
this special '0' value to indicate a disable of metadata balancing.
This should preserve existing LVM behavior and ensure metadata balancing
can be turned off should the need arise.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When we are constructing the vg, we may need to adjust the list of
metadata_areas if there are ignored mdas. At label read time, we
do not read the metadata of ignored mdas, and as a result, they do
not get placed on vg->fid->metadata_areas inside _text_create_text_instance
since lvmcache does not have these areas attached to vginfo->infos.
However, when we're checking the pvids inside _vg_read, after having
read another metadata area from another PV, we do have the opportunity
to update the metadata_area and metadata_areas_ignored lists based
on the read metadata_area. We need accurate mda lists for the reporting
functions that count the ignored mdas, as well as general correctness
of mda balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
We implement ignore of an mda at label_read time by checking for
the ignore bit, and then skipping the reading of the vgname and
other information in the metadata. This will have an effect similar
to a PV found with no mdas. Thus, it will look like an orphan in the
cache until we scan the rest of the system and find a PV with
metadata, and the mda will not be on the vg->fid->metadata_areas
list so no read/writes will be done to the metadata area.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a second mda list, metadata_areas_ignored to fid, and a couple
functions, fid_add_mda() and fid_add_mdas() to help manage the lists.
These functions are needed to properly count the ignored mdas and
manage the lists attached to the 'fid' and ultimately the 'vg'.
Ensure metadata_areas_ignored is initialized in other formats, even
if the list is never used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Because of the way mdas are handled internally, where a PV in a VG
has mdas on both info->mdas and vg->fid->metadata_areas list, we
need a location independent copy constructor for struct
metadata_area. Break up the existing format-text specific copy
constructor into a format independent piece and a format dependent
piece.
This function is necessary to properly implement pv_set_mda_ignored().
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
A metadata_area is defined independent of the location. One downside
is that there is no obvious mapping from a pv to an mda. For a PV in
a VG, we need a way to start with a PV and end up with an MDA, if we
are to manage mdas starting with a device/pv. This function provides
us a way to go down the list of PVs on a VG, and identify which ones
match a particular PV.
I'm not entirely happy with this approach, but it does fit into the
existing structures in a reasonable way.
An alternative solution might be to refactor the VG - PV interface such
that mdas are a list tied to a PV. However, this seemed a bit tricky since
a PV does not come into existence until after the list of mdas is
constructed (see _vg_read() - we create a 'fid' and attach mdas to it,
then we go through them and attach pvs).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
We'd like to pass in mda_header to vgname_from_mda(). In order to
do this, we need to call raw_read_mda_header() from text_label.c,
_text_read(), which gets called from the label_read() path, and
peers into the metadata and update vginfo cache. We should check
the disable bit here, and if set, not peer into the vg metadata,
thus reducing the I/O to disk.
In the process, move vgname_from_mda() to layout.h, since the fn
only gets called from format_text code, and we need the mda_header
definition from the private layout.h.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This refactoring moves the device open/close up one level to the caller of
_vg_read_raw_area(). Should be no functional change and facilitate future
changes related to metadata balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
First we add a 'flags' field to the location independent
metadata_area structure, and a MDA_IGNORE flag. The
mda_is_ignored and mda_set_ignored functions are added to
manage the flag. Adding the flag and functions gives a
library interface to ignore metadata areas independent of
the underlying location (disk, file, etc). The location
specific read/write functions must then handle the specifics
of what this flag means to the location.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Adding a flag to the 'rlocn' structure in the mda header of the
text format allows us to flip a bit to ignore an area on disk that
stores the metadata via the text format specific mda_header.
This patch defines the flag and access functions to manage the flag.
Other patches will manage the ignore on a format-independent basis,
by using a flag in the metadata_area structure.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Future patches will make use of a specific flag in the on-disk 'raw_locn'
structure to enable/disable metadata areas, and facilitate metadata
balancing.
Note that 'filler' is always set to '0' (see add_mda() - memset),
so use of this area as a non-zero flags field is a safe way to
provide future code features.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Adding configure.in support for Replicators.
Adding basic lib lvm support for Replicators.
Adding flags REPLICATOR and REPLICATOR_LOG.
Adding segments SEG_REPLICATOR and SEG_REPLICATOR_DEV.
Adding basic methods for handling replicator metadata.
We should write metadata into next position in the ring buffer while calling
vgrename and vgcfgrestore. At this code level (_vg_write_raw), we were not able
to determine if this is a rename or not. If yes, then accompanying VG structure
passed here has a new name set, not the old one.
When looking for a location where to put metadata next, we were given a NULL
value because of failed VG name comparison (in _find_vg_rlocn) between the
name in existing metadata and metadata we're just about to write.
This resets the position in the ring buffer, overwriting any existing metadata
(and also incorrectly updates the cache to "orphan" afterwards).
This patch just adds old_name item in struct volume_group that we can check and use
if necessary and detect renames at lower layers as well.
The same applies for vgcfgrestore, but here we're using a special value of
old_name, an empty string, to disable the check with existing metadata totally.
When moving parts of striped LVs, pvmove wouldn't care about leaving you with
two stripes on the same disk. Now --alloc anywhere is needed for that.
(Tried and gave up on two alternative approaches before the one committed here.)
Small refactor of main places in the code where a pv is added to a
vg into a small function which adds the pv to the list and updates
the vg counts.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Simple refactor to mov code that updates the vg extent counts from a
single pv's counts close to the code that adds a pv to vg->pvs and
updates vg->pv_count. No functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Physical segments were still allocated from global
command context mempool.
This leads to very high memory usage when
activating large VG (vgchange).
(Memory usage was about 2G when >3000LVs).
Fix it by properly using vg->vgmem private pool,
so all the memory is released early.
New memory pool parameter is needed here for pv_split_segment
function.
Also fix the same problem in some minor allocations
(vg description, lv segment split).
The _read_vg uses already hash for PVs to optimise
reading of large VGs and avoiding repeated PV list traversing.
Use the same aproach to speed up parsing VG with many LVs.
Eliminate 'merging_snapshot' from 'struct logical_volume' and just use
'snapshot' for origin lv's reference to the merging snapshot; also set
MERGING in the origin lv's status.
Make 'merging_snapshot' pointer that points from the origin to the
segment that represents the merging snapshot.
Import/export 'merging_store' metadata.
Do not allow creating snapshots while another snapshot is merging.
Snapshot created in this state would certainly contain invalid data.
NOTE: patches at the end of this series will remove 'merging_snapshot'
and will introduce helpful wrappers and cleanups.
At this point they probably do not matter but going forward they
may - depends on future patches for replicator, etc. I think
these probably got missed because they were 'flags' so I changed
the name to 'status' to be consistent. So the on-disk
things 'flags' and the in structure 'status' (bits).
NOTE: WHATS_NEW already has entry for this in current release.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
The physical_volume, volume_group, logical_volume and lv_segment
structures' 'status' member is now uint64_t.
The alignment of these structures was also audited to remove holes. The
movement of some members in 'volume_group' and 'lv_segment' eliminates
holes. The 'physical_volume' structure still has one 4-byte hole after
'pe_size'; the other structures no longer have any holes. Each
structures' size has not changed.
indented metadata lines.
Macro outnl() is using exported out_newline() instead of direct
call f->fn(), that required the visibility of the internal
struct formatter.
This patch is all just cleanup and no other patch depends on it.
Replace explicit dereference and check with vg_is_exported().
Update a few copyrights and remove unnecessary whitespace.
Should be no functional change.
If the pvcreate --dataalignmentoffset option is not specified the start
of a PV's aligned data area will be shifted by the associated
'alignment_offset' exposed in sysfs (unless
devices/data_alignment_offset_detection is disabled in lvm.conf).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Documented which use-cases force the reinstatement of the nuanced
handling of pe_start. As soon as orphan PVs are eliminated much of this
will no longer be a concern ('preserve_pe_start' can be reenabled in
.pv_setup).
Added defensive 'if (pv->pe_align)' check in _text_pv_write()'s pe_start
loop.
If pv_setup was given a non-zero pe_start it would short-circuit
establishing a default pv->pe_align. pv->pe_align=0 would result
in a divide by zero in _mda_setup(). 'vgconvert -M2 $vgname' hit this.
.pv_write still properly preserves pe_start if it was supplied.
Adds pe_align_offset to 'struct physical_volume'; is initialized with
set_pe_align_offset(). After pe_start is established pe_align_offset is
added to it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
areas.
This preserved pe_start would quickly be readjusted to follow the first
mda anyway. An example use-case that hit this code path is: running
pvcreate on an already existing PV _without_ a preceeding pvremove.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Without this fix rounding the end of the first mda to a pe_align
boundary could silently exceed the disk_size.
Final 'if (start1 + mda_size1 > disk_size)' block serves as a safety
net.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Document existing pe_start policy.
Fix issue in _text_pv_setup() where existing pe_start case could have
the pv->pe_start set to pv->pe_align even though pe_start shouldn't ever
change.
vgconvert and pvcreate have a facility to preserve the existing start
of the on-disk data extents, known as pe_start.
They indicate this by passing the existing value to the pvsetup function
which must preserve it.
This patch avoids one particular case where the value could get
changed incorrectly now that the alignment settings are configurable.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This function behaves a little bit different than vg_reduce_single, because
it allowes to remove even the latest pv. This has been done to be consistent
to lvm_vg_create, which creates an empty vg.
removed_pvs has been added to the volume_group struct. vg_reduce adds remove
pvs to this list to be able to commit the changes for the pvs in lvm_vg_comm
in liblvm2app.
Initialize removed_pvs list in format-specific volume_group constructors.
Ideally, we should have a base constructor here that initializes the general
non-format specific members of struct volume_group. But until then, there
are multiple places to initialize these members. Maybe a better patch would
be a base constructor patch for struct volume_group. That is more work
though.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Thomas Woerner <twoerner@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
Currently code uses pv_dev_name() for hash when getting internal
"pvX" name.
This produce corrupted metadata if PVs are missing, pv->dev
is NULL and all these missing devices returns one name
(using "unknown device" for all missing devices as hash key).
link_lv_to_vg and unlink_lv_from_vg are the only functions
for adding/removing logical volume from volume group.
Only these function should manipulate with vg->lvs list.
The snapshot segment (snapshotX) is created twice
during the text metadata segment processing.
This can cause temporary violation of max_lv count.
Simplify the code, snapshot segment is properly initialized
in init_snapshot_seg function now and do not need to be replaced
by vg_add_snapshot call.
The vg_add_snapshot() is now usefull only for adding new
snapshot and it shares the same initialization function.
The snapshot name is always generated, name paramater can be
removed from function call.
The dataalign value must always be aligned according
to MDA area.
The currect code checks if calculated value collides with
MDA area but not if the value is so small that it is
located before MDA starts.
Unfortunatelly there can be also MDA in the end of the device.
The patch adds simple check to avoid this miscalculation.
Patch expects that first MDA always starts on <= pagesize boundary
(this is true for all allowed label sector parameters).
Add lvs origin_size field.
Fix linux configure --enable-debug to exclude -O2.
Still a few rough edges, but hopefully usable now:
lvcreate -s vg1 -L 100M --virtualoriginsize 1T
Run backup of metadata on remote nodes in the
same place like local node - when calling backup().
Introduce backup_locally() which calls only
local backup if needed.
Remote backup is now trigerred by LCK_VG_BACKUP flag
combination (special VG lock).
This lock type will call check_current_backup()
(including backup_locally() call) and updates
metadata on all nodes.
(Patch fixes non-functional remote backup,
current call during VG lock never triggers.)
Since now, all code reading volume group is responsible for releasing
the memory allocated by calling vg_release(vg).
(For simplicity of use, vg_releae can be called for vg == NULL,
the same logic like free(NULL)).
Also providing simple macro for unlocking & releasing in one step,
tools usualy uses this approach.
The global memory pool (cmd->mem) should be used only for global
physical volume operations.
This patch have to be applied with all subsequent patches to complete
memory pool per vg logic.
Using separate memory pool has quite bit memory saving impact when
using large VGs, this is mainly needed when we have to use
preallocated and locked memory (and should not overflow from that
memory space).
if rlocn not defined (there is no metadata area).
In most cases it fails in validate_name(),
unfortunately there are situatuions, when
validate_name is ok and later code fails with
checksum error.
Reproducer:
# dd if=/dev/zero of=/dev/loop0
# pvcreate --metadatasize 637k /dev/loop0
Physical volume "/dev/loop0" successfully created
# pvs /dev/loop0
/dev/loop0: Checksum error
PV VG Fmt Attr PSize PFree
/dev/loop0 lvm2 -- 1.00M 1.00M
Signed-off-by: Milan Broz <mbroz@redhat.com>
-
This patch is not fully tested and leaves some related bugs unfixed.
Intended behaviour of the code now:
pe_start in the lvm2 format PV label header is set only by pvcreate (or
vgconvert -M2) and then preserved in *all* operations thereafter.
In some specialist cases, after the PV is added to a VG, the pe_start
field in the VG metadata may hold a different value and if so, it
overrides the other one for as long as the PV is in such a VG.
Currently, the field storing the size of the data area in the PV label
header always holds 0. As it only has meaning in the context of a
volume group, it is calculated whenever the PV is added to a VG (and can
be derived from extent_size and pe_count in the VG metadata).
Reports the size of the smallest metadata area in a PV or a VG.
Useful to confirm pvcreate --metadatasize or pvmetadatasize setting in
/etc/lvm/lvm.conf file.
NOTE: Actual value in these fields will most always differ from that
given in pvcreate options due to rounding and alignment effects.
Identical argument to previous patch which removed archive_enable() calls.
We add a new parameter to backup_init() which sets the enable value based
on the cmd->default_settings.backup value. This value was used to set
cmd->current_settings.backup, used in the removed backup_enable() call.
_init_backup() calls archive_init(), which originally set 'enabled' to
a hardcoded '1' value. This seems incorrect based on my read of other
areas of the code so here we add a 'enabled' paramter to archive_init().
We pass in cmd->default_settings.archive, which is obtained from the
config tree. Later in create_toolcontext, cmd->current_settings is
set to cmd->default_settings. The archive_enable() call we remove
here was using cmd->current_settings to set the 'archive' enable
value. The final value of cmd->archive_params->enabled should thus
be equivalent to the original code.
Function _text_pv_write doesn't use memory pool but static buffer,
call dm_pool_free in error path in _raw_write_mda_header is wrong.
Move pool free only to path where is the memory pool used.