mirror of
git://sourceware.org/git/lvm2.git
synced 2025-01-11 09:18:25 +03:00
Initial design document for LVMetaD, building on the draft from June of last
year, incorporating the outcomes of today's and yesterday's discussions.
parent 7c9f1ae834
commit b904359bac
daemons/lvmetad/DESIGN
Normal file
@ -0,0 +1,186 @@
The design of LVMetaD
=====================

Invocation and setup
--------------------

The daemon should be started automatically by the first LVM command issued on
the system, when needed. The usage of the daemon should be configurable in
lvm.conf, probably with its own section. For example:

    lvmetad {
        enabled = 1 # default
        autostart = 1 # default
        socket = "/path/to/socket" # defaults to /var/run/lvmetad or such
    }

Library integration
-------------------

When a command needs to access metadata, it currently needs to perform a scan
of the physical devices available in the system. This is a potentially
expensive operation, especially if many devices are attached to the system. In
most cases, LVM needs a complete image of the system's PVs to operate
correctly, so all devices need to be read, to at least determine the presence
(and content) of a PV label. Additional IO is done to obtain or write metadata
areas, but this is only marginally related and addressed by Dave's
metadata-balancing work.

The existing scanning code has a cache layer, under lib/cache/lvmcache.[hc].
This layer keeps a textual copy of the metadata for a given volume group, in
format_text form, as a character string. We can plug the lvmetad interface in
at this level: lvmcache_get_vg, which is responsible for looking up metadata
in a local cache, can query lvmetad if the metadata is not available in the
local cache. Without lvmetad, when a VG is not cached yet, this lookup fails
and prompts the caller to perform a scan. With lvmetad enabled, this would
never happen, and the fall-through would only be taken when lvmetad is
disabled, which would lead to the local cache being populated as usual through
a locally executed scan.
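
The lookup order described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: struct vg_sources,
lookup_vg_metadata and the demo_* stubs are hypothetical names, not existing
lvmcache or lvmetad interfaces, and the metadata is passed around as a plain
string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup hook: returns format_text metadata or NULL. */
typedef const char *(*vg_lookup_fn)(const char *vgname);

struct vg_sources {
    vg_lookup_fn local_cache; /* stands in for the existing local cache */
    vg_lookup_fn lvmetad;     /* stands in for a query sent to lvmetad */
    int lvmetad_enabled;      /* mirrors "enabled" in the lvmetad section */
};

/*
 * Returns the metadata text, or NULL to tell the caller that it has to
 * fall back to a locally executed device scan.
 */
const char *lookup_vg_metadata(const struct vg_sources *src,
                               const char *vgname)
{
    const char *text;

    assert(src && vgname);

    /* 1. Local cache first, as today. */
    text = src->local_cache ? src->local_cache(vgname) : NULL;
    if (text)
        return text;

    /* 2. On a miss, ask lvmetad, but only when it is enabled. */
    if (src->lvmetad_enabled && src->lvmetad)
        return src->lvmetad(vgname);

    /* 3. lvmetad disabled: the caller scans and fills the local cache. */
    return NULL;
}

/* Stubs for illustration only. */
const char *demo_cache_miss(const char *vgname)
{
    (void) vgname;
    return NULL;
}

const char *demo_lvmetad(const char *vgname)
{
    return strcmp(vgname, "vg0") == 0 ? "vg0 { seqno = 1 }" : NULL;
}
```

With lvmetad_enabled set to 0, the function behaves like the stand-alone
tools do now: a local miss is reported to the caller, which then scans.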

Therefore, existing stand-alone (i.e. no lvmetad) functionality of the tools
would not be compromised by adding lvmetad. With lvmetad enabled, however,
significant portions of the code would be short-circuited.

Scanning
--------

Initially (at least), lvmetad will not be allowed to read disks: it will
rely on an external program to provide the metadata. In the ideal case, this
will be triggered by udev. The role of lvmetad is then to collect and maintain
an accurate (up to the data it has received) image of the VGs available in the
system. I imagine we could extend the pvscan command (or add a new one, say
lvmetad_client, if pvscan is found to be inappropriate):

    $ pvscan --lvmetad /dev/foo
    $ pvscan --lvmetad --remove /dev/foo

These commands would simply read the label and the MDA (if applicable) from the
given PV and feed that data to the running lvmetad, using
lvmetad_{add,remove}_pv (see lvmetad_client.h).

However, we need to ensure a couple of things here:

1) only LVM commands ever touch PV labels and VG metadata
2) when a device is added or removed, udev fires a rule to notify lvmetad

While the latter is straightforward, there are issues with the first. We
*might* want to invoke the dreaded "watch" udev rule in this case, however it
ends up being implemented. Of course, we can also rely on the sysadmin to be
reasonable and not write over existing LVM metadata without first telling LVM
to let go of the respective device(s).

Even if we simply ignore the problem, metadata writes should fail in these
cases, so the admin should be unable to do substantial damage to the system. If
there were active LVs on top of the vanished PV, they are in trouble no matter
what happens there.

Incremental scan
----------------

There are some new issues arising with the "udev" scan mode. Namely, the
devices of a volume group will be appearing one by one. The behaviour in this
case will be very similar to the current behaviour when devices are missing:
until *all* its physical volumes have been discovered and announced by udev,
the volume group will be in a state with some of its devices flagged as
MISSING_PV. This means that the volume group will be, for most purposes,
read-only until it is complete, and LVs residing on yet-unknown PVs won't
activate without --partial. Under usual circumstances, this is not a problem
and the current code for dealing with MISSING_PVs should be adequate.

However, the code for reading volume groups from disks will need to be adapted,
since it currently does not work incrementally. Such support will need to track
metadata-less PVs that have been encountered so far and to provide a way to
update an existing volume group. When the first PV with metadata of a given VG
is encountered, the VG is created in lvmetad (probably in the form of "struct
volume_group") and it is assigned any previously cached metadata-less PVs it is
referencing. Any PVs that were not yet encountered will be marked as MISSING_PV
in the "struct volume_group". Upon scanning a new PV, if it belongs to any
already-known volume group, this PV is checked for consistency with the already
cached metadata (in case of a mismatch, the VG needs to be recovered or
declared conflicted), and is subsequently unmarked MISSING_PV. Care must be
taken not to unmark MISSING_PV on PVs that have this flag in their persistent
metadata, though.
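
A minimal sketch of the bookkeeping described above, under loud assumptions:
struct vg_image and struct pv_slot are hypothetical stand-ins for whatever
lvmetad ends up storing (the text suggests "struct volume_group"), and the
consistency check is reduced to comparing a metadata sequence number, which
is almost certainly too weak for the real thing.

```c
#include <assert.h>
#include <string.h>

#define MAX_PVS 8

struct pv_slot {
    char uuid[40];
    int missing;            /* not yet announced by a scan */
    int missing_persistent; /* MISSING_PV is set in the stored metadata */
};

struct vg_image {
    int seqno;              /* sequence number of the cached metadata */
    int pv_count;
    struct pv_slot pvs[MAX_PVS];
};

enum scan_result { SCAN_OK, SCAN_NOT_IN_VG, SCAN_CONFLICT };

/*
 * Record that a PV has been scanned. pv_seqno is the sequence number of
 * the metadata found on that PV, or a negative value if it has no MDA.
 */
enum scan_result vg_image_pv_found(struct vg_image *vg, const char *uuid,
                                   int pv_seqno)
{
    int i;

    assert(vg && uuid);

    for (i = 0; i < vg->pv_count; i++) {
        struct pv_slot *pv = &vg->pvs[i];

        if (strcmp(pv->uuid, uuid) != 0)
            continue;

        /* Mismatch: the VG must be recovered or declared conflicted. */
        if (pv_seqno >= 0 && pv_seqno != vg->seqno)
            return SCAN_CONFLICT;

        /* Never clear a flag that comes from the persistent metadata. */
        if (!pv->missing_persistent)
            pv->missing = 0;
        return SCAN_OK;
    }
    return SCAN_NOT_IN_VG;
}

/* The VG is complete once no PV is flagged as missing any more. */
int vg_image_is_complete(const struct vg_image *vg)
{
    int i;

    assert(vg);

    for (i = 0; i < vg->pv_count; i++)
        if (vg->pvs[i].missing)
            return 0;
    return 1;
}
```

Note that a VG containing a PV with a persistent MISSING_PV flag never becomes
complete in this sketch, which matches the behaviour of such VGs today.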

The most problematic aspect of the whole design may be orphan PVs. At any given
point, a metadata-less PV may appear orphaned, if a PV of its VG with metadata
has not been scanned yet. Eventually, we will have to decide that this PV is
really an orphan and enable its usage for creating or extending VGs. In
practice, the decision might be governed by a timeout or assumed immediately --
the former case is a little safer, the latter is probably more transparent. I
am not very keen on using timeouts and we can probably assume that the admin
won't blindly try to re-use devices in a way that would trip up LVM in this
respect. I would be in favour of just assuming that metadata-less PVs with no
known referencing VGs are orphans -- after all, this is the same approach as we
use today. The metadata balancing support may stress this a bit more than the
usual contemporary setups do, though.
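
The "assume immediately" policy favoured above could look roughly like this.
The types and function names are hypothetical and exist only for this sketch;
PVs are identified by plain strings here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of a VG that is already known to lvmetad. */
struct known_vg {
    const char *name;
    const char *const *pv_uuids; /* PVs referenced by the VG metadata */
    int pv_count;
};

/* Returns the name of a known VG referencing the PV, or NULL if none. */
const char *find_referencing_vg(const struct known_vg *vgs, int vg_count,
                                const char *pv_uuid)
{
    int i, j;

    assert(pv_uuid);

    for (i = 0; i < vg_count; i++)
        for (j = 0; j < vgs[i].pv_count; j++)
            if (strcmp(vgs[i].pv_uuids[j], pv_uuid) == 0)
                return vgs[i].name;
    return NULL;
}

/*
 * No timeout: a metadata-less PV is treated as an orphan as soon as no
 * known VG references it. A PV of its VG that carries metadata may still
 * show up later, which is the risk discussed in the text.
 */
int pv_is_assumed_orphan(const struct known_vg *vgs, int vg_count,
                         const char *pv_uuid)
{
    return find_referencing_vg(vgs, vg_count, pv_uuid) == NULL;
}
```

A timeout-based variant would additionally need to remember when each
metadata-less PV was first seen; that is left out here.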

Automatic activation
--------------------

It may also be prudent to provide a command that will block until a volume
group is complete, so that scripts can reliably activate/mount LVs and such. Of
course, some PVs may never appear, so a timeout is necessary. Again, this is
something not handled by current tools, but may become more important in the
future. It probably does not need to be implemented right away though.
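
One way such a blocking command could be structured is a simple poll with a
timeout, sketched below. Everything here is an assumption made for
illustration: the function names are hypothetical, the completeness check is
a callback standing in for a query to lvmetad, and polling was chosen only
because it is the shortest thing to write down (an event from the daemon
would avoid the sleep).

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical check: returns nonzero once the VG is complete. */
typedef int (*vg_complete_fn)(void *ctx);

/*
 * Polls is_complete every poll_ms milliseconds. Returns 1 as soon as the
 * VG is reported complete, or 0 once roughly timeout_ms have passed.
 */
int wait_for_vg_complete(vg_complete_fn is_complete, void *ctx,
                         long timeout_ms, long poll_ms)
{
    struct timespec delay;
    long waited = 0;

    assert(is_complete && poll_ms > 0);

    delay.tv_sec = poll_ms / 1000;
    delay.tv_nsec = (poll_ms % 1000) * 1000000L;

    for (;;) {
        if (is_complete(ctx))
            return 1;
        if (waited >= timeout_ms)
            return 0; /* some PVs may never appear */
        nanosleep(&delay, NULL);
        waited += poll_ms;
    }
}

/* Illustration only: reports completion after a fixed number of polls. */
int demo_complete_after(void *ctx)
{
    int *polls_left = ctx;

    return (*polls_left)-- <= 0;
}
```

A script-facing command would map the return value to its exit status, so
that callers can tell a complete VG from a timeout.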

The other aspect of the progressive VG assembly is automatic activation.
Currently, the only problem with that is that we would like to avoid having
activation code in lvmetad, so we would prefer to fire an event of some sort
and let someone else handle the activation and whatnot.

Cluster support
---------------

When working in a cluster, clvmd integration will be necessary: clvmd will need
to instruct lvmetad to re-read metadata as appropriate due to writes on remote
hosts. Overall, this is not hard, but the devil is in the details. I would
possibly disable lvmetad for clustered volume groups in the first phase and
only proceed when the local mode is robust and well tested.

Protocol & co.
--------------

I expect a simple text-based protocol executed on top of a Unix domain socket
to be the communication interface for lvmetad. Ideally, the requests and
replies will be well-formed "config file" style strings, so we can re-use
existing parsing infrastructure.
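
To make the "config file" style idea concrete, here is a sketch of how a
client might build a request string. The key names ("request", "device") and
the request name used in the usage note are invented for this example; the
actual protocol vocabulary is not defined in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats a request as two key = "value" lines, for example
 *
 *     request = "pv_add"
 *     device = "/dev/foo"
 *
 * Returns 0 on success, or -1 if the buffer is too small or a value
 * contains a double quote (this sketch does no escaping).
 */
int format_request(char *buf, size_t size, const char *request,
                   const char *device)
{
    int n;

    assert(buf && request && device);

    if (strchr(request, '"') || strchr(device, '"'))
        return -1;

    n = snprintf(buf, size, "request = \"%s\"\ndevice = \"%s\"\n",
                 request, device);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

For example, format_request(buf, sizeof(buf), "pv_add", "/dev/foo") fills buf
with the two lines shown in the comment. The resulting string would then be
written to the socket and parsed on the other side with the existing config
parser.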

Since we already have two daemons, I would probably look into factoring out
some common code for daemon-y things, like sockets, communication (including
thread management) and maybe logging, and re-using it in all the daemons
(clvmd, dmeventd and lvmetad). This shared infrastructure should live under
daemons/common, and the existing daemons shall be gradually migrated to the
shared code.

Future extensions
-----------------

The above should basically cover the use of lvmetad as a cache-only
daemon. Writes could still be executed locally, and the new metadata version
can be provided to lvmetad through the socket the usual way. This is fairly
natural and in my opinion reasonable. The lvmetad acts like a cache that will
hold metadata, no more, no less.

Beyond this, there are a couple of things that could be worked on later, when
the above basic design is finished and implemented.

_Metadata writing_: We may want to support writing new metadata through
lvmetad. This may or may not be a better design, but the write itself should be
more or less orthogonal to the rest of the story outlined above.

_Locking_: Other than directing metadata writes through lvmetad, one could
conceivably also track VG/LV locking through the same.

_Clustering_: A deeper integration of lvmetad with clvmd might be possible and
maybe desirable. Since clvmd communicates over the network with other clvmd
instances, this could be extended to metadata exchange between lvmetad
instances, further cutting down scanning costs. This would combine well with
the write-through-lvmetad approach.

Testing
-------

Since (at least bare-bones) lvmetad has no disk interaction and is fed metadata
externally, it should be very amenable to automated testing. We need to provide
a client that can feed arbitrary, synthetic metadata to the daemon and request
the data back, providing reasonable (nearly unit-level) testing infrastructure.