IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Pass metadataignore through PV creation / setup paths.
As a result of this cleanup, we can remove the unnecessary setting
of mda_ignore bits inside pvcreate_single(), after call to pv_create.
For now, just set metadataignore to '0' in some places. This is
equivalent to the prior functionality, although the 0 is given
by the caller not hardcoded in _mda_setup() call.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
- If a PV contained empty mdas, the auto-recovery code was not kicking in.
- The 'inconsistent' state was getting lost when metadata was cached so
recovery didn't kick in. But leave the behaviour alone when using
precommitted metadata because of a warning in a confusing FIXME.
In my testing, pvs and vgs didn't repair inconsistent metadata like they
used to do. (How many other tools fail similarly now?)
And there should be no need to cache inconsistent metadata because it is
supposed to get repaired under the protection of a write lock immediately it is
discovered.
This code is in need of a redesign based on first principles.
I still see bugs in this code and this commit is risky.
Allow metadataignore flag to be passed in to pvcreate.
Ideally, more refactoring of the mda allocation / initialization
is warranted, but for now, we just add another parameter to 'add_mda'
to take an existing mda ignored flag. We need to do this or pv_write
loses the state of the mda 'ignored' flag before copying and writing
to disk.
Print device name when setting or clearing metadata ignore bit.
Example:
label/label.c:160 /dev/loop2: lvm2 label detected
cache/lvmcache.c:1136 lvmcache: /dev/loop2: now in VG #orphans_lvm2 (#orphans_lvm2)
metadata/metadata.c:4142 Setting mda ignored flag for metadata_locn /dev/loop2.
format_text/text_label.c:318 Skipping mda with ignored flag on device /dev/loop2 at offset 4096
Logging isn't ideal, especially for mda_set_ignore. Ideally we'd
like to display the device name and offset in this case but this
requires a bit more work and a per-format 'mda_description' function
pointer definition (we don't have access to mda_context in
metadata.c).
In preparation to call this from both pvcreate as well as pvchange,
move the guts of metadataignore into a library function.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Allowing an 'all' and 'unmanaged' value is more intuitive, and
provides a simple way for users to get back to original LVM behavior
of metadata written to all PVs in the volume group.
If the user requests "--vgmetadatacopies unmanaged", this instructs
LVM not to manage the ignore bits to achieve a specific number of
metadata copies in the volume group. The user is free to use
"pvchange --metadataignore" to control the mdas on a per-PV basis.
If the user requests "--vgmetadatacopies all", this instructs LVM
to do 2 things: 1) clear all ignore bits, and 2) set the "unmanaged"
policy going forward.
Internally, we use the special MAX_UINT32 value to indicate 'all'.
This 'just' works since it's the largest value possible for the
field and so all 'ignore' bits on all mdas in the VG will get
cleared inside _vg_metadata_balance(). However, after we've
called the _vg_metadata_balance function, we check for the special
'all' value, and if set, we write the "unmanaged" value into the
metadata. As such, the 'all' value is never written to disk.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The check in vg_split_mdas will trigger an error if the 'from' vg
list is empty. However, this might be ok in some instances now
that we have ignored mdas. Relax this check so an error is triggered
only in the case where there's truly no more mdas in the 'from'
vg.
One example of where this makes a difference is with vgreduce.
If we try to vgreduce a PV with un-ignored mdas, this should trigger
the balancing function to un-ignore mdas on another PV in the VG.
However, we don't get to vg_write() before we fail because this
list size check fails, and we see an error message indicating:
"Cannot remove final metadata area ..."
Another example is with vgsplit into a new VG, where the PVs
being moved contain all ignored mdas. We must move the mdas on
fid->metadata_areas_ignored from 'vg_from' to 'vg_to'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The vgextend path calls add_pv_to_vg(). Inside add_pv_to_vg(),
we must ensure we pass the correct mdas list into pv_setup(), as
copies of mdas are placed on the vg->fid list. If we don't place
the mdas on the correct vg->fid list, the various counts may be
incorrect and the metadata balance algorithm will not work when
called from vg_write() path.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Compare the value of the newly added vg_mda_copies field
(--vgmetadatacopies parameter) with the current count of
in-use mdas and ignoring or unignoring mdas as necessary to
get to the target count. Also, as a safety check before
returning, ensure we have at least one mda enabled.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
This patch adds the get and partially implemented set function.
The 'set' function should probably ignore or un-ignore metadata areas
based on new values.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a field to struct volume_group to later implement metadata
balancing:
- mda_copies: target # of non-ignored mdas in the VG; default 0 (do
not control pv 'ignore mdas' bit.
This patch just adds the parameter to the structures with the default
values but does not modify any commands. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Arrange mdas so mdas that are to be ignored come first. This is an
optimization that ensures consistency on disk for the longest period of time.
This was noted by agk in review of the v4 patchset of pvchange-based mda
balance.
Note the following example for an explanation of the background:
Assume the initial state on disk is as follows:
PV0 (v1, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
If we did not sort the list, we would have a commit sequence something like
this:
PV0 (v2, non-ignored)
PV1 (v2, ignored)
PV2 (v2, ignored)
PV3 (v2, non-ignored)
After the commit of PV0's mdas, we'd have an on-disk state like this:
PV0 (v2, non-ignored)
PV1 (v1, non-ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is an inconsistent state of the disk. If the machine fails, the next
time it was brought back up, the auto-correct mechanism in vg_read would
update the metadata on PV1-PV3. However, if possible we try to avoid
inconsistent on-disk states. Clearly, because we did not sort, we have
a greater chance of on-disk inconsistency - from the time the commit of
PV0 is complete until the time PV3 is complete.
We could improve the amount of time the on-disk state is consistent by simply
sorting the commit order as follows:
PV1 (v2, ignored)
PV2 (v2, ignored)
PV0 (v2, non-ignored)
PV3 (v2, non-ignored)
Thus, after the first PV is committed (in this case PV1), on-disk we would
have:
PV0 (v1, non-ignored)
PV1 (v2, ignored)
PV2 (v1, non-ignored)
PV3 (v1, non-ignored)
This is clearly a consistent state. PV1 will be read but the mda will be
ignored. All other PVs contain v1 metadata, and no auto-correct will be
required. In fact, if we commit all PVs with ignored mdas first, we'll
only have an inconsistent state when we start writing non-ignored PVs,
and thus the chances we'll get an inconsistent state on disk is much
less with the sorted method.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When we are constructing the vg, we may need to adjust the list of
metadata_areas if there are ignored mdas. At label read time, we
do not read the metadata of ignored mdas, and as a result, they do
not get placed on vg->fid->metadata_areas inside _text_create_text_instance
since lvmcache does not have these areas attached to vginfo->infos.
However, when we're checking the pvids inside _vg_read, after having
read another metadata area from another PV, we do have the opportunity
to update the metadata_area and metadata_areas_ignored lists based
on the read metadata_area. We need accurate mda lists for the reporting
functions that count the ignored mdas, as well as general correctness
of mda balancing.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
With the addition of ignored mdas, we replace all checks for an empty
mda list with a new function to look for either an empty mda list or
ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a helper function to consolidate checking for an empty mdas list
or ignored mdas. Ignored mdas should behave almost identically to
an empty mda list - the metadata areas should not be read or written
to. This function will make it easier to implement metadata balancing
and easier to track pvs with an empty mda list or ignored mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Define a new pvs field, pv_mda_used_count, and a new vgs field,
vg_mda_used_count to match the existing pv_mda_count and vg_mda_count.
These new fields count the number of mdas that have the 'ignored' bit
clear (they are in use on the PV / VG). Also define various supporting
functions to implement the counting as well as setting the ignored
flag and determining if an mda is ignored. These high level functions
call into the lower level location independent mda ignore functions
defined by earlier patches.
Note that counting ignored mdas in a vg requires traversing both lists
and checking for the ignored bit on the mda. The count of 'ignored'
mdas then is defined by having the bit set, not by which list the mda
is on. The list does determine whether LVM actually does read/write to
the mda, though we must count the bits in order to return accurate numbers
for the various counts. Also, pv_mda_set_ignored must search both vg
lists for ignored mda. If the state changes and needs to be committed
to disk, the ignored mda will be on the non-ignored list.
Note also in pv_mda_set_ignored(), we must properly manage the mda lists.
If we change the ignored state of an mda, we must change any mdas on
vg->fid->metadata_areas that correspond to this pv. Also, we may
need to allocate a copy of the mda, as is done when fid->metadata_areas
is populated from _vg_read(), if we are un-ignoring an ignored mda.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a second mda list, metadata_areas_ignored to fid, and a couple
functions, fid_add_mda() and fid_add_mdas() to help manage the lists.
These functions are needed to properly count the ignored mdas and
manage the lists attached to the 'fid' and ultimately the 'vg'.
Ensure metadata_areas_ignored is initialized in other formats, even
if the list is never used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Because of the way mdas are handled internally, where a PV in a VG
has mdas on both info->mdas and vg->fid->metadata_areas list, we
need a location independent copy constructor for struct
metadata_area. Break up the existing format-text specific copy
constructor into a format independent piece and a format dependent
piece.
This function is necessary to properly implement pv_set_mda_ignored().
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
A metadata_area is defined independent of the location. One downside
is that there is no obvious mapping from a pv to an mda. For a PV in
a VG, we need a way to start with a PV and end up with an MDA, if we
are to manage mdas starting with a device/pv. This function provides
us a way to go down the list of PVs on a VG, and identify which ones
match a particular PV.
I'm not entirely happy with this approach, but it does fit into the
existing structures in a reasonable way.
An alternative solution might be to refactor the VG - PV interface such
that mdas are a list tied to a PV. However, this seemed a bit tricky since
a PV does not come into existence until after the list of mdas is
constructed (see _vg_read() - we create a 'fid' and attach mdas to it,
then we go through them and attach pvs).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
First we add a 'flags' field to the location independent
metadata_area structure, and a MDA_IGNORE flag. The
mda_is_ignored and mda_set_ignored functions are added to
manage the flag. Adding the flag and functions gives a
library interface to ignore metadata areas independent of
the underlying location (disk, file, etc). The location
specific read/write functions must then handle the specifics
of what this flag means to the location.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Some commands start with a pvname, but we'd like to force users to
start with a vg handle to obtain a pv handle. Our best option seems
to be providing a way to look up the vgname from the pvname, and then
require them to use vg_read/vg_open.
In addition to the pvname lookup function, this patch also provides a
lookup by pvid. The lookup by pvid can be used in conjunction with
lvmcache_get_pvids to process all pvs in the system.
The pvid find function first calls lvmcache_vgname_from_pvid, which may
cause the label to be read if it is not in the cache. If the vgname is
returned is an orphan, we then check to see if there are metadata areas,
and if not, we scan every PV on the system by calling scan_vgs_for_pvs().
In most cases we should not need to do this, and by using the info->mdas
count, we avoid calling pv_read() as prior code did. So this patch is a
bit cleaner and should allow us to refactor more of the pv code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
are active mirrors or snapshots.
We don't have the mechanisms in place to change the device-mapper
tables for those targets that have behavioral differences between
cluster and single machine instances. Allowing users to change
the attribute but not changing the target's behavior can lead to
data corruption.
The following bugs are fixed/avoided by this patch:
235123 - vgchange -c [ny] do not change target types when necessary
289331 - RFE: switching from cluster domain to local domain needs to deactivate volume somehow
289541 - when changing from local to cluster, volumes can not appear to be deactivated
Allow lv_remove_with_dependencies() to know the top-level LV that was
requested to be removed (otherwise it recurses and we lose context).
A merging snapshot cannot be removed directly but the associated origin
can be. Disallow removal of a merging snapshot unless the associated
origin is also being removed.
We should write metadata into next position in the ring buffer while calling
vgrename and vgcfgrestore. At this code level (_vg_write_raw), we were not able
to determine if this is a rename or not. If yes, then accompanying VG structure
passed here has a new name set, not the old one.
When looking for a location where to put metadata next, we were given a NULL
value because of failed VG name comparison (in _find_vg_rlocn) between the
name in existing metadata and metadata we're just about to write.
This resets the position in the ring buffer, overwriting any existing metadata
(and also incorrectly updates the cache to "orphan" afterwards).
This patch just adds old_name item in struct volume_group that we can check and use
if necessary and detect renames at lower layers as well.
The same applies for vgcfgrestore, but here we're using a special value of
old_name, an empty string, to disable the check with existing metadata totally.
lvm2app needs a link back to the vg in order to use the vg handle for
memory allocations as well as other things. This patch adds the field
to struct physical_volume, and sets pv->vg when reading a vg from disk or
extending a vg by using the helper function previously added,
add_pvl_to_vgs(). Moves and renames are handled with separate code
inside move_pv() and vgmerge(). Add pv->vg check to vg_validate().
A NULL value in pv->vg signifies membership in the orphan VG.
Note though in the case of pv_read() on a device with metadatacopies == 0,
more devices may need to be read for an authoritative answer.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Now that we have library functions to add/delete a pv from the vg->pvs
list, call them from everywhere.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Add a delete function to manage the vg->pvs list.
NOTE: It may be possible to do further cleanup to these add/del functions
by passing a 'pv' as input instead of 'pv_list'. The pv_list is used for
functions which do allocations (lvcreate) while other places in the code
just manage a list of 'pv' (e.g. import functions, vgextend, etc).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
A user specifying duplicate paths on the cmdline of vgcreate will
get a message similar to the following:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Internal error: Duplicate PV id jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8 detected for /dev/loop3 in vgtest2.
This is caught by vg_validate(), but it would be good to find
this condition earlier in the vgcreate code. add_pv_to_vg()
currently checks by pvname, but does not look for duplcate pvids.
This patch adds the check for duplicate pvids and results in new
error output as follows:
vgcreate vgtest2 /dev/loop3 /dev/loop5
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop5 not /dev/loop3
Found duplicate PV jk1lXsKzwyOKlXq6bhaFFKMQQ06oPgu8: using /dev/loop3 not /dev/loop5
Physical volume '/dev/loop5 (jk1lXs-Kzwy-OKlX-q6bh-aFFK-MQQ0-6oPgu8)' listed more than once.
Unable to add physical volume '/dev/loop5' to volume group 'vgtest2'.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Small refactor of main places in the code where a pv is added to a
vg into a small function which adds the pv to the list and updates
the vg counts.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
In add_pv_to_vg(), we should only add the pv to vg->pvs after all
internal checks have passed. The check for vg->extent_count exeeding
maximum was after we added the pv to the list, so this function could
return a state of vg->pvs that did not reflect other parameters such
as vg->pv_count.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
The function find_peg_by_pe is incredibly inefficient
for Pvs with many segments.
In shiny future there should be binary (or interval) tree
instead of sorted linked list (volunteers?).
Anyway, for now, we can use dirty trick here to optimise this case:
- Allocations are usually applied from the beginning
of PV (we have no alloocation policy which allocates areas
"backwards")
- The only user of find_peg_by_pe is pv_split_segment()
call. In *most* cases it need to split *last* PV segment.
So if we search sorted pv segment list backwards, we
hit the requested segment immediatelly.
This patch applies this tiny change.
(and saves >30% of processing time when >3000LVs segments are on one PV!)
To discourage using this inefficient function from other code,
it is moved to pv_manip.c and used static for now:-)
When we pv_read() a device that has an orphan vgname, we might need to scan
the system to be sure this is true. However, if the PV has mdas, there's
no way possible for it to have an orphan vgname unless it is a true orphan.
Some areas of the code were optimized to take advantage of this fact, while
others were not (we would still do the expensive scan if a device had mdas
but had an orphan VG).
This patch unifies the code so that every place we are operating on such
a PV, we skip the expensive scan if there are mdas.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Petr Rockai <prockai@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
If user try to vgcreate or vgextend non-existent VG,
these messages appears:
# vgcreate xxx /dev/xxx
Internal error: Volume Group xxx was not unlocked
Device /dev/xxx not found (or ignored by filtering).
Unable to add physical volume '/dev/xxx' to volume group 'xxx'.
Internal error: Attempt to unlock unlocked VG xxx.
(the same with existing VG and non-existing PV & vgextend)
# vgextend vg_test /dev/xxx
...
It is caused because code tries to "refresh" cache if
md filter is switched on using cache destroy.
But we can change filters and rescan even without this
machinery now, just use refresh_filters
(and reset md filter afterwards).
(Patch also discovers cache alias bug in vgsplit test,
fix it by using better filter line.)
The kernel's blk_stack_limits() function may flag a device as
'misaligned'. If it does the alignment_offset will be -1.
Update set_pe_align_offset() to accommodate this corner case.
We need to allocate memory for the tag and copy the tag value before we
add it to the list of tags. We could put this inside lvm2app since the
tools keep their memory around until vg_write/vg_commit is called, but
we put it inside the internal library to minimize code in lvm2app.
We need to copy the tag passed in by the caller to ensure the lifetime of
the memory until the {vg|lv} handle is released.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Similar refactoring to vgchange - pull out common parts and put into
library function for reuse. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Pull out common code to be called from tools as well as lvm2app.
Leave archive() at tool level so we can use from vgcreate
as well as vgchange. Should be no functional change.
- add stack macro in vgchange
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
where we should not expose internal VG names/uuids (the ones with "#" prefix )through the
interface. Otherwise, we could end up with library users opening internal VGs which will
initiate locking mechanism that won't be cleaned up properly.
"#orphans_{lvm1, lvm2, pool}" names are treated in a special way, they are truncated first
to "orphans" and this is used as a part of the lock name then (e.g. while calling lvm_vg_open()).
When library user calls lvm_vg_close(), the original name "orphans_{lvm1, lvm2, pool}"
is used directly and therefore no unlock occurs.
We should exclude internal VG names and uuids in the lists provided by lvmcache:
lvmcache_get_vgids() and lvmcache_get_vgnames().
All this seems to do is provide a memory leak so remove it.
The only caller of _alloc_pv() later explicitly sets
pv->vg_name = fmt->orphan_vg_name so clearly this allocation
should be removed. I also saw no where in the code where
strncpy was used to assign pv->vg_name - only direct assignments
and strdup's.
This patch tries to correctly track changes in lvmcache related to commit/revert.
For vg_commit: if there is cached precommitted metadata, after successfull commit
these metadata must be tracked as committed.
For vg_revert: remote nodes must drop precommitted metadata and its flag in lvmcache.
(N.B. Patch do not touch LV locks here in any way.)
All this machinery is needed to properly solve remote node cache invalidaton which
cause several problems recently observed.
The use_precommitted flag indicates, that we want to use precommitted metadata
(used in suspend call to preload table with precommitted data).
But if there are no such data, committed metadata are read but the cache
still contains that precommitted flag.
(The problem is that later possible drop_metadata call will not invalidate
device in cache.)
The wrong precommitted state is stored in on remote nodes during normal
suspend/resume cycle _without_ vg_write/commit.
Use the PRECOMMITTED status flag here instead (which is always set if using
precommited metadata here).
When PV device reappears with old metadata, it is
always updated to new version byt atutomatic metadata
repair.
Remove missing flag if device is empty.
If device contains allocated extents, issue warning that
user must remove volumes and re-add this PV before
manipulating with this volume.
This partially solves bug 547842 when one PV (log) is failed,
dmeventd removes that device and later this device reappears and
is wrongly added into VG marked missing.
The new recovery code first tries to repair LV and then removes failed PV
from VG. It means that during operation there can be VG with PV missing,
and vg_read code handles it like not consistent VG.
We already allows returning "inconsistent" commited metadata,
for mirror repair we need this for precommited too.
(The suspend call prepares precommited metadata to inactive table on
other cluster nodes.)
"Inconsistent" here means - correct metadata, just with some metadata areas
not found (obviously on missing or failed PVs).
The physical_volume, volume_group, logical_volume and lv_segment
structures' 'status' member is now uint64_t.
The alignment of these structures was also audited to remove holes. The
movement of some members in 'volume_group' and 'lv_segment' eliminates
holes. The 'physical_volume' structure still has one 4-byte hole after
'pe_size'; the other structures no longer have any holes. Each
structures' size has not changed.
Rename fill_default_pvcreate_params to pvcreate_params_set_defaults.
Rename pvcreate_validate_restore_params to pvcreate_restore_params_validate.
Rename pvcreate_validate_params to pvcreate_params_validate.
Similar to other vg_set_* functions, we create a vg_set_clustered() function
which does a few checks and sets a flag. This is where we check for
any limitations of clusters.
Split pvcreate_validate_params into recovery and non-recovery parameters.
This is necessary so we can call the non-recovery validate function from
vgextend / vgcreate. Note in the pvcreate tool case, we must call the
recovery validation function first (see treatment of pe_start and --zero),
and that we add a call to fill_default_pvcreate_params before the validation
functions.
We need defaults for pvcreate_params at a higher level - this will
allow us to use a common function from the tools to take defaults,
then fill in any non-defaults from the commandline.
Future patches will refactor vgcreate/vgextend to call this function
if one or more pvcreate parameters are given on the commandline.
Another refactoring for implicit pvcreate support. We need to get
the pvcreate parameters somehow to the vg_extend routine. Options
seemed to be:
1. Attach the parameters to struct volume_group. I personally
did not like this idea in most cases, though one could make an
agrument why it might be ok at least for some of the parameters
(e.g. metadatacopies).
2. Pass them in to the extend routine. This second route seemed
to be the best approach given the constraints.
Future patches will parse the command line and fill in the actual
values for the pvcreate_single call.
Should be no functional change.
Should be no functional change. If this parameter is set to NULL, just fail
the extend if the device is not already a PV. If non-NULL, try pvcreate_single
before failing. Note that pvcreate_single() handles the log_error in case
of failure so we just return 0 if pvcreate_single() fails.
Clean up VG_RESIZEABLE flag by creating vg_is_resizeable().
Update comment - we no longer have ALLOW_RESIZEABLE.
Also use vg_is_exported() in one place missed by earlier patch.
Should be no functional change.
Now that we've split vg_remove_single into two routines, in the first routine
that only manipulates memory, we move the PVs from the vg->pvs list to the
vg->removed_pvs list. Then later, we iterate through this list to write the
removed PVs to disk, which removes them from the volume group and places them
into the internal ORPHAN VG.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
Split vg_remove_single into vg_remove_check (mandatory checks before
vgremove) and vg_remove (do actual remove by committing to disk).
In liblvm, we'd like to provide an consistent API that allows multiple
changes in memory, then let lvm_vg_write() control the commit to disk. In
some cases (for example, lvresize calls fsadm) this may not be possible.
However, since we are using an object model and dividing things into small
operations, the most logical model seems to be the lvm_vg_write model, and
handling the special cases as they arrive. So as best as possible
we move towards this end.
A possible optimization would be to consolidate vg_remove (committing)
code with vgreduce code. A second possible optimization is making vgreduce
of the last device equivalent to vgremove. Today, lvm_vg_reduce fails if
vgreduce is called with the last device, but from an object model perspective
we could view this as equivalent to vgremove and allow it. My gut feel is
we do not want to do this though.
Author: Dave Wysochanski <dwysocha@redhat.com>
Later patches should consolidate the vgremove / vgreduce functions but for
now let's clarify what vg_remove actually does by changing the name.
Author: Dave Wysochanski <dwysocha@redhat.com>
# pvcreate -u udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi /dev/sdc
uuid udwxr7BoKYEeKMr033xK6o4og7F13sGi|��� already in use on "/dev/sdb1"
is now
# pvcreate -u udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi /dev/sdc
uuid udwxr7-BoKY-EeKM-r033-xK6o-4og7-F13sGi already in use on "/dev/sdb1"
Adds 'data_alignment_detection' config option to the devices section of
lvm.conf. If your kernel provides topology information in sysfs (linux
>= 2.6.31) for the Physical Volume, the start of data area will be
aligned on a multiple of the ’minimum_io_size’ or ’optimal_io_size’
exposed in sysfs.
minimum_io_size is used if optimal_io_size is undefined (0). If both
md_chunk_alignment and data_alignment_detection are enabled the result
of data_alignment_detection is used.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pvcreate --dataalignmentoffset option is not specified the start
of a PV's aligned data area will be shifted by the associated
'alignment_offset' exposed in sysfs (unless
devices/data_alignment_offset_detection is disabled in lvm.conf).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Adds pe_align_offset to 'struct physical_volume'; is initialized with
set_pe_align_offset(). After pe_start is established pe_align_offset is
added to it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For now, a simple way to enforce the read/write semantics is to just save the
open mode of the VG. If the caller uses lvm_vg_create, the mode is write.
The caller using lvm_vg_open can use either read or write to open the VG.
Once we have this, we enforce the permissions on each API call and don't allow
a caller to modify a VG that has not been opened properly.
This may be better combined with the locking mode, but I view that as future
cleanup, past this initial release. The intial release should enforce the
basic object semantics though, as described in the lvm.h file.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Adding the ability to get the seqno is important for an application to
determine if something has changed in a VG. Otherwise, the only way to
know is to open the VG with write permission and hold the handle.
This function behaves a little bit different than vg_reduce_single, because
it allowes to remove even the latest pv. This has been done to be consistent
to lvm_vg_create, which creates an empty vg.
removed_pvs has been added to the volume_group struct. vg_reduce adds remove
pvs to this list to be able to commit the changes for the pvs in lvm_vg_comm
in liblvm2app.
Initialize removed_pvs list in format-specific volume_group constructors.
Ideally, we should have a base constructor here that initializes the general
non-format specific members of struct volume_group. But until then, there
are multiple places to initialize these members. Maybe a better patch would
be a base constructor patch for struct volume_group. That is more work
though.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Thomas Woerner <twoerner@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
For liblvm 'get' functions, we should share code with the reporting functions.
This means we need common code to return the values for the fields.
In this patch we refactor a few of the fields needed in liblvm.
Unfortunately, for the simple fields that do derefernces of structure
members (for example, vg_extent_count), we cannot call the common function
from the reporting infrastructure without more refactoring. The reason is
that the dereference of the simple fields is done deep inside the reporting
code (to get the generic "data" pointer), and the display function is a
generic 'size32' function. We can fix these issues later with more
refactoring.
Should be no functional change and the testsuite should cover any possible
regressions. The only fields in the report affected by this patch are:
vg_size, vg_free, and pv_mda_count.
Author: Dave Wysochanski <dwysocha@redhat.com>
The implicit pvcreate require either moving the ORPHAN_VG lock outside
pvcreate_single or somehow having the function know or detect whether
the ORPHAN_VG lock is already held.
Author: Dave Wysochanski <dwysocha@redhat.com>
Passing NULL for pvcreate parameters gives you default parameters for
pvcreate_single.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
In preparation for implicit pvcreate during vgcreate / vgextend,
move bulk of pvcreate logic inside library.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
We must hold the VG_ORPHAN lock until we commit to disk. Otherwise,
we risk a race condition on vgcreate / vgextend. Reverts the following
commit:
commit 72a41480ba
Author: Dave Wysochanski <dwysocha@redhat.com>
Date: Fri Jul 10 20:09:21 2009 +0000
Move orphan lock obtain/release inside vg_extend().
With this change we now have vgcreate/vgextend liblvm functions.
Note that this changes the lock order of the following functions as the
orphan lock is now obtained first. With our policy of non-blocking
second locks, this should not be a problem.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
These messages are unnecessary in the set functions. We check for this
condition and print a message in the vgchange tool but not the library
functions.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
When converting to the new liblvm functions, the vgcreate code path
changed to create a new vg, then set values. As a result of this
change, and the fact that we give a user a message if they try to
set the same value of a VG attribute (extent_size, alloc_policy, etc),
you'll see these 2 extraneous "is already" messages with vgcreate:
tools/lvm vgcreate vg2 /dev/loop2
Physical extent size of VG vg2 is already 4.00 MB
Volume group allocation policy is already normal
Volume group "vg2" successfully created
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Author: Dave Wysochanski <dwysocha@redhat.com>
We provide a lock type that behaves like no_locking, but is not
clustered. Moreover, it also forbids any write locks. This magically (and
consistently) prevents use of clustered VGs, or changing local VGs with
--ignorelockingfailure. As a bonus, we can remove the special hacks in a few
places. Of course, people looking for trouble can always set their locking_type
to 0 to override.
The checks for RESIZEABLE_VG should now be inside the various functions that
have to do such operations.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Remove READ_REQUIRE_RESIZEABLE flag from vgsplit similar to the removal from
vgextend. Move the check inside the functions that actually move pvs from
one vg structure to another. Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
In the future we may export these functions or something like them in liblvm
For now this helps in cleaning up the checks for RESIZEABLE since we can
use the internal library function vg_bad_status_bits.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Move the check for the RESIZEABLE flag inside the vg_extend function.
When we consolidated the vg locking, reading, and status flag checking,
we tied the check for the RESIZEABLE flag to the vg_read() call. The problem
with this is you cannot know what other APIs the application my or may not
call after a vg_read() call. Thus the READ_REQUIRE_RESIZEABLE flag is not
really ideal - ideally we should be checking for this flag on a specific
operation, not inside the vg_read() call. This patch moves one check inside
the library.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
With this change we now have vgcreate/vgextend liblvm functions.
Note that this changes the lock order of the following functions as the
orphan lock is now obtained first. With our policy of non-blocking
second locks, this should not be a problem.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Move the vg orphan lock inside vg_remove_single, now a complete liblvm
function. Note that this changes the order of the locks - originally
VG_ORPHAN was obtained first, then the vgname lock. With the current
policy of non-blocking second locks, this could mean we get a failure
obtaining the orphan lock. In the case of a vg with lvs being removed,
this could result in the lvs being removed but not the vg. Such a
scenario could have happened prior though with a different failure.
Other tools were examined for side-effects, and no major problems
were noted.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Move check for active LVs outside of library function. The vgremove
liblvm function function will fail if there are active LVs. It will
be the application's responsibility to check this condition and remove
the LVs individually before calling vgremove. Note also that we've
duplicated the EXPORTED_VG check in vgremove_single (tools) and
vg_remove_single (library). Duplication seemed the only option here
since we don't want to do the automatic removal of LVs (in the tools)
if the vg is exported, and we still need to protect the library call
from removal if the vg is exported.
We still need to deal with the ORPHAN lock but vg_remove_single is now
very close to our liblvm function.
TODO: Refactor lvremove in a similar way.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
vg_t *vg_create(struct cmd_context *cmd, const char *vg_name);
This is the first step towards the API called to create a VG.
Call vg_lock_newname() inside this function. Use _vg_make_handle()
where possible.
Now we have 2 ways to construct a volume group:
1) vg_read: Used when constructing an existing VG from disks
2) vg_create: Used when constructing a new VG
Both of these interfaces obtain a lock, and return a vg_t *.
The usage of _vg_make_handle() inside vg_create() doesn't fit
perfectly but it's ok for now. Needs some cleanup though and I've
noted "FIXME" in the code.
Add the new vg_create() plus vg 'set' functions for non-default
VG parameters in the following tools:
- vgcreate: Fairly straightforward refactoring. We just moved
vg_lock_newname inside vg_create so we check the return via
vg_read_error.
- vgsplit: The refactoring here is a bit more tricky. Originally
we called vg_lock_newname and depending on the error code, we either
read the existing vg or created the new one. Now vg_create()
calls vg_lock_newname, so we first try to create the VG. If this
fails with FAILED_EXIST, we can then do the vg_read. If the
create succeeds, we check the input parameters and set any new
values on the VG.
TODO in future patches:
1. The VG_ORPHAN lock needs some thought. We may want to treat
this as any other VG, and require the application to obtain a handle
and pass it to other API calls (for example, vg_extend). Or,
we may find that hiding the VG_ORPHAN lock inside other APIs is
the way to go. I thought of placing the VG_ORPHAN lock inside
vg_create() and tying it to the vg handle, but was not certain
this was the right approach.
2. Cleanup error paths. Integrate vg_read_error() with vg_create and
vg_read* error codes and/or the new error APIs.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
NOTE: vg_set_alloc_policy() returns success if you try to set a value that
is already stored. The behavior of vgchange is the same though - it fails.
There is a fixme noted in the code about this inconsistency, which should
be resolved if possible.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
In liblvm, we will reserve the word 'change' to mean an API that
both sets one or more values, and commits to disk. This will be
consistent with the LVM commandline. The existing vg_change_pesize()
function does not commit to disk, but just changes the extent_size
and ensures all internal structures are updated. This logic should
be contained in a function that sets the value.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
It would be nice to have one function that does all the validation
and setting of the VG's pesize. However, currently some checks
are in the higher-level function _vgchange_pesize(), and some
checks are in the lower function vg_change_pesize().
This patch moves most of the higher-level checks inside
vg_change_pesize. In one case a failure return code is
changed from ECMD_FAILED to EINVALID_CMD_LINE.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove unneeded LOCK_NONBLOCKING from vg_read() API and tools that
use it. We no longer need this flag anywhere since we now automatically
set LCK_NONBLOCK inside lock_vol() if vgs_locked().
For further details, see:
commit d52b3fd3fe
Author: Dave Wysochanski <dwysocha@redhat.com>
Date: Wed May 13 13:02:52 2009 +0000
Remove NON_BLOCKING lock flag from tools and set a policy to auto-set.
As a simplification to the tools and further liblvm, this patch pushes
the setting of NON_BLOCKING lock flag inside the lock_vol() call.
The policy we set is if any existing VGs are currently locked, we
set the NON_BLOCKING flag.
At some point it may make sense to add this flag back if we get an
RFE from a liblvm user, but for now let's keep it as simple as
possible.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove READ_CHECK_EXISTENCE and vg_might_exist().
This flag and API is no longer used now that we have a separate
API to check for existence.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Remove unneeded LOCK_KEEP from vg_read() interface.
Update comment to clarify cases where _vg_lock_and_read() may return
with an error but the lock held. Would be nice to make the vg_read()
interface consistent with regards to lock held and error behavior.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Sun May 3 13:06:14 CEST 2009 Petr Rockai <me@mornfall.net>
* Don't segfault in vg_release when vg->cmd is NULL.
Author: Petr Rockai <prockai@redhat.com>
Committer: Dave Wysochanski <dwysocha@redhat.com>
Sun May 3 12:32:30 CEST 2009 Petr Rockai <me@mornfall.net>
* Rework the toollib interface (process_each_*) on top of new vg_read.
Rebased 6/26/09 by Dave W.
- Add skipping message to process_each_lv
- Remove inconsistent_t.
status flag after reading the volume group. Thus, no need to set the flag
in vg_read() or clear it later before calling _vg_bad_status_bits().
Also, add back in the !lockingfailed() as part of the CLUSTERED flag check.
It's unclear why it was removed when the check was moved from
_vg_bad_status_bits() to inside _vg_lock_and_read().
There was an open question about the last check in the 'if' stmt for
lockingfailed() with a previous patch submitted. However, I would
defer that right now as it is a separate item and this patch should
be no functional change by including the !lockingfailed().
Petr acked this patch on 5/26 I just forgot to check it in.
Acked-by: Petr Rockai <prockai@redhat.com>
Various tools need to check for existence of a VG before doing something
(vgsplit, vgrename, vgcreate). Currently we don't have an interface to
check for existence, but the existence check is part of the vg_read* call(s).
This patch is an attempt to pull out some of that functionality into a
separate function, and hopefully simplify our vg_read interface, and
move those patches along.
vg_lock_newname() is only concerned about checking whether a vg exists in
the system. Unfortunately, we cannot just scan the system, but we must first
obtain a lock. Since we are reserving a vgname, we take a WRITE lock on
the vgname. Once obtained, we scan the system to ensure the name does
not exist. The return codes and behavior is in the function header.
You might think of this function as similar to an open() call with
O_CREAT and O_EXCL flags (returns failure with -EEXIST if file already
exists).
NOTE: I think including the word "lock" in the function name is important,
as it clearly states the function obtains a lock and makes the code more
readable, especially when it comes to cleanup / unlocking. The ultimate
function name is somewhat open for debate though so later we may rename.
During vgreduce is failed mirror image replaced with error segment,
this segmant type has always area_count == 0.
Current code expects that there is at least one area with device,
patch fixes it by additional check (fixes segfault during vgreduce).
Also do not calculate readahead in every lv_info call, we only need
to cache PV readahead before activation calls which locks memory.
When we are stacking LV over device, which has for some reason
increased read_ahead (e.g. MD RAID), the read_ahead hint
for libdevmapper is wrong (it is zero).
If the calculated read_ahead hint is zero, patch uses read_ahead of underlying device
(if first segment is PV) when setting DM_READ_AHEAD_MINIMUM_FLAG.
Because we are using dev-cache, it also store this value to cache for future use
(if several LVs are over one PV, BLKRAGET is called only once for underlying device.)
This should fix all the reamining problems with readahead mismatch reported
for DM over MD configurations (and similar cases).
We can temporarily violate max_lv during mirror conversion etc.
(If the operation fails, orphan mirror images are visible to administrator
for manual remove for example. Not that this should ever happen:-)
Force limit only for lvcreate (and vg merge) command.
Patch also adds simple max_lv tests into testsuite
The seg variable is temporary variable for list iterator,
code cannot expect that after iteration it remains NULL
(it contains non-NULL pointer here id list is empty).
Patch fixes first_seg function so it now correctly returns NULL
for empty segment list.
Since now, all code reading volume group is responsible for releasing
the memory allocated by calling vg_release(vg).
(For simplicity of use, vg_releae can be called for vg == NULL,
the same logic like free(NULL)).
Also providing simple macro for unlocking & releasing in one step,
tools usualy uses this approach.
The global memory pool (cmd->mem) should be used only for global
physical volume operations.
This patch have to be applied with all subsequent patches to complete
memory pool per vg logic.
Using separate memory pool has quite bit memory saving impact when
using large VGs, this is mainly needed when we have to use
preallocated and locked memory (and should not overflow from that
memory space).
The all_pvs list, used in vg_read, should make its own private
copy of pv structures, otherwise (when vg will use its own pool)
it will point to released memory pool.
The same applies for get_pvs() call.
Patch adds pv_list copy helper and adds explicit memory pool
parameter into _copy_pv.
(Please note that all these helper functions cannot guarantee that
vg related fields are valid - proper vg read & lock must be used
if it is requested.)
Patch fixes these problems:
- during the snapshot creation process, it needs create 2 LVs,
one is cow, second becomes snapshot.
If the code fails in vg_add_snapshot, code lvcreate will not remove
LV cow volume.
- if max_lv is set and VG contains snapshot, it can happen that
during the activation lv_count is temporarily increased over the limit
and VG metadata are not properly processed
see https://bugzilla.redhat.com/show_bug.cgi?id=490298
- vgcfgrestore alows restore with max_lv set to lower valuer that actual
LV count. This later leads to situation that max_lv is completely ignored.
- vgck doesn't call vg_validate(). It should at least try:-)
Signed-off-by: Milan Broz <mbroz@redhat.com>
This patch is not fully tested and leaves some related bugs unfixed.
Intended behaviour of the code now:
pe_start in the lvm2 format PV label header is set only by pvcreate (or
vgconvert -M2) and then preserved in *all* operations thereafter.
In some specialist cases, after the PV is added to a VG, the pe_start
field in the VG metadata may hold a different value and if so, it
overrides the other one for as long as the PV is in such a VG.
Currently, the field storing the size of the data area in the PV label
header always holds 0. As it only has meaning in the context of a
volume group, it is calculated whenever the PV is added to a VG (and can
be derived from extent_size and pe_count in the VG metadata).