LVM disk reading

Reading disks happens in two phases. The first is a discovery phase,
which determines what's on the disks. The second is a working phase,
which does a particular job for the command.


Phase 1: Discovery
------------------

Read all the disks on the system to find out:
- What are the LVM devices?
- What VGs exist on those devices?

This phase is called "label scan" (although it reads and scans everything,
not just the label). It stores the information it discovers (what LVM
devices exist, and what VGs exist on them) in lvmcache. The devs/VGs info
in lvmcache is the starting point for phase two.


Phase 1 in outline:

For each device:

a. Read the first <N> KB of the device. (N is configurable.)

b. Look for the lvm label_header in the first four sectors.
   If none exists, it's not an lvm device, so quit looking at it.
   (By default, label_header is in the second sector.)

c. Look at the pv_header, which follows the label_header.
   This tells us the location of VG metadata on the device.
   There can be 0, 1 or 2 copies of VG metadata. The first
   is always at the start of the device, the second (if used)
   is at the end.

d. Look at the first mda_header (location came from pv_header
   in the previous step). This is by default in sector 8,
   4096 bytes from the start of the device. This tells us the
   location of the actual VG metadata text.

e. Look at the first copy of the text VG metadata (location came
   from mda_header in the previous step). This is by default
   in sector 9, 4608 bytes from the start of the device.
   The VG metadata is only partially analyzed to create a basic
   summary of the VG.

f. Store an "info" entry in lvmcache for this device,
   indicating that it is an lvm device, and store a "vginfo"
   entry in lvmcache indicating the name of the VG seen
   in the metadata in step e.

g. If the pv_header in step c shows a second mda_header
   location at the end of the device, then read that as
   in step d, and repeat steps e-f for it.

At the end of phase 1, lvmcache will have a list of devices
that belong to LVM, and a list of VG names that exist on
those devices. Each device (info struct) is associated
with the VG (vginfo struct) it is used in.


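Steps a and b, and the default offsets quoted in steps d and e, can be
illustrated with a small Python sketch. This is not the lvm code. It
assumes 512-byte sectors (which is what "sector 8, 4096 bytes" implies)
and assumes the label_header begins with the 8-byte id "LABELONE"; the
function names are made up for the example.

```python
SECTOR_SIZE = 512        # assumption: implied by "sector 8, 4096 bytes"
LABEL_SCAN_SECTORS = 4   # the label_header is looked for in the first four sectors
LABEL_ID = b"LABELONE"   # assumption: 8-byte id at the start of label_header


def sector_offset(sector):
    """Byte offset of a sector from the start of the device."""
    return sector * SECTOR_SIZE


def find_label_sector(buf):
    """Return the first sector that starts with LABEL_ID, or None if there is none."""
    for sector in range(LABEL_SCAN_SECTORS):
        start = sector_offset(sector)
        if buf[start:start + len(LABEL_ID)] == LABEL_ID:
            return sector
    return None


# A fake device image with the label in the second sector (sector 1).
image = bytearray(LABEL_SCAN_SECTORS * SECTOR_SIZE)
image[sector_offset(1):sector_offset(1) + len(LABEL_ID)] = LABEL_ID

print(find_label_sector(bytes(image)))       # sector holding the label
print(sector_offset(8), sector_offset(9))    # default mda_header / metadata text offsets
```

A buffer with no label in the first four sectors returns None, which
corresponds to "not an lvm device, so quit looking at it" in step b.
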
Phase 1 in code:

The most relevant functions are listed for each step in the outline.

lvmcache_label_scan()
label_scan()

. dev_cache_scan()
  choose which devices on the system to look at

. for each dev in dev_cache: bcache prefetch/read

. _process_block() to process data from bcache
  _find_lvm_header() checks if this is an lvm dev by looking at label_header
  _text_read() via ops->read() looks at mda/pv/vg data to populate lvmcache

. _read_mda_header_and_metadata()
  raw_read_mda_header()

. _read_mda_header_and_metadata()
  read_metadata_location()
  text_read_metadata_summary()
  config_file_read_fd()
  _read_vgsummary() via ops->read_vgsummary()

. _text_read(): lvmcache_add()
  [adds this device to list of lvm devices]
  _read_mda_header_and_metadata(): lvmcache_update_vgname_and_id()
  [adds the VG name to list of VGs]


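The result of phase 1 (a list of lvm devices, a list of VG names, and
the association between them) can be pictured with a toy model in
Python. This is only a sketch: ToyLvmcache, add_device() and
set_vgname() are hypothetical names, and the real lvmcache structs hold
far more than this.

```python
class ToyLvmcache:
    """Hypothetical stand-in for lvmcache: devices ("info") linked to VGs ("vginfo")."""

    def __init__(self):
        self.infos = {}     # device name -> VG name (None until metadata is seen)
        self.vginfos = {}   # VG name -> set of device names

    def add_device(self, dev):
        # Step f, first half: remember that dev is an lvm device.
        self.infos.setdefault(dev, None)

    def set_vgname(self, dev, vgname):
        # Step f, second half: associate dev with the VG named in its metadata.
        self.infos[dev] = vgname
        self.vginfos.setdefault(vgname, set()).add(dev)


def label_scan(scanned):
    """scanned maps an lvm device name to the VG name found on it, or None."""
    cache = ToyLvmcache()
    for dev, vgname in scanned.items():
        cache.add_device(dev)
        if vgname is not None:
            cache.set_vgname(dev, vgname)
    return cache


cache = label_scan({"/dev/sda": "vg0", "/dev/sdb": "vg0",
                    "/dev/sdc": "vg1", "/dev/sdd": None})
print(sorted(cache.vginfos))          # VG names for phase 2 to iterate over
print(sorted(cache.vginfos["vg0"]))   # devices associated with vg0
```
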
Phase 2: Work
-------------

This phase carries out the operation requested by the command that was
run.

Whereas the first phase is based on iterating through each device on the
system, this phase is based on iterating through each VG name. The list
of VG names comes from phase 1, which stored the list in lvmcache to be
used by phase 2.

Some commands may need to iterate through all VG names, while others may
need to iterate through just one or two.

This phase includes locking each VG as work is done on it, so that two
commands do not interfere with each other.


Phase 2 in outline:

For each VG name:

a. Lock the VG.

b. Repeat the phase 1 scan steps for each device in this VG.
   The phase 1 information in lvmcache may have changed because no VG lock
   was held during phase 1. So, repeat the phase 1 steps, but only for the
   devices in this VG. N.B. For commands that are just reporting data,
   we skip this step if the data from phase 1 was complete and consistent.

c. Get the list of on-disk metadata locations for this VG.
   Phase 1 created this list in lvmcache to be used here. At this
   point we copy it out of lvmcache. In the simple/common case,
   this is a list of devices in the VG. But, some devices may
   have 0 or 2 metadata locations instead of the default 1, so it
   is not always equal to the list of devices. We want to read
   every copy of the metadata for this VG.

d. For each metadata location on each device in the VG
   (the list from the previous step):

   1) Look at the mda_header. The location of the mda_header was saved
      in the lvmcache info struct by phase 1 (where it came from the
      pv_header). The mda_header tells us where the text VG metadata is
      located.

   2) Look at the text VG metadata. The location came from mda_header
      in the previous step. The VG metadata is fully analyzed and used
      to create an in-memory 'struct volume_group'.

e. Compare the copies of VG metadata that were found in each location.
   If some copies are older, choose the newest one to use, and update
   any older copies.

f. Update details about the devices/VG in lvmcache.

g. Pass the 'vg' struct to the command-specific code to work with.


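Step e can be sketched in Python as follows. The sketch assumes that
each copy of the metadata carries a sequence number (called seqno here)
that increases with every change, so that the newest copy is the one
with the highest number. The function name and the location strings are
made up for the example.

```python
def choose_newest(copies):
    """copies is a list of (location, seqno) pairs, one per metadata copy read.

    Returns the newest seqno and the locations that hold an older copy.
    """
    newest = max(seqno for _location, seqno in copies)
    stale = [location for location, seqno in copies if seqno < newest]
    return newest, stale


copies = [("/dev/sda:mda0", 12), ("/dev/sdb:mda0", 12), ("/dev/sdc:mda0", 11)]
newest, stale = choose_newest(copies)
print(newest, stale)   # the seqno to use, and the locations to update
```

When every copy has the same seqno, the list of locations to update is
empty and nothing needs to be rewritten.
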
Phase 2 in code:

The most relevant functions are listed for each step in the outline.

For each VG name:
process_each_vg()

. vg_read()
  lock_vol()

. vg_read()
  lvmcache_label_rescan_vg() (if needed)
  [insert phase 1 steps for scanning devs, but only devs in this vg]

. vg_read()
  create_instance()
  _text_create_text_instance()
  _create_vg_text_instance()
  lvmcache_fid_add_mdas_vg()
  [Copies mda locations from info->mdas where it was saved
   by phase 1, into fid->metadata_areas_in_use. This is
   the key connection between phase 1 and phase 2.]

. dm_list_iterate_items(mda, &fid->metadata_areas_in_use)

. _vg_read_raw() via ops->vg_read()
  raw_read_mda_header()

. _vg_read_raw()
  text_read_metadata()
  config_file_read_fd()
  _read_vg() via ops->read_vg()

. return the 'vg' struct from vg_read() and use it to do
  command-specific work



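The copy of metadata locations from the per-device info into a per-VG
list (step c of the outline) can be illustrated with a short Python
sketch. It is a simplified model, not the lvm code: the function name,
the dict layout and the offset of the second metadata area are all made
up for the example.

```python
def metadata_locations(vg_devices, mdas_by_device):
    """Return (device, offset) pairs for every metadata area of a VG's devices."""
    locations = []
    for dev in vg_devices:
        for offset in mdas_by_device.get(dev, []):
            locations.append((dev, offset))
    return locations


# Per-device metadata area offsets, as phase 1 would have saved them.
mdas_by_device = {
    "/dev/sda": [4096],               # the default: one area at the start
    "/dev/sdb": [],                   # a PV with no metadata areas
    "/dev/sdc": [4096, 1072693248],   # a second area near the end (made-up offset)
}

locations = metadata_locations(["/dev/sda", "/dev/sdb", "/dev/sdc"], mdas_by_device)
print(len(locations))   # number of metadata copies to read for this VG
```

Three devices yield three locations here only by coincidence: one
device contributes none and another contributes two, which is why the
list of locations is not always equal to the list of devices.
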
Filter i/o
----------

Some filters must be applied before reading a device, and other filters
must be applied after reading a device. In all cases, the filters must be
applied before lvm processes the device, i.e. before it looks for an lvm
label.

1. Some filters need to be applied prior to reading any devices
   because the purpose of the filter is to avoid submitting any
   io on the excluded devices. The regex filter is the primary
   example. Other filters benefit from being applied prior to
   reading devices because they can tell which devices to
   exclude without doing io to the device. An example of this
   is the mpath filter.

2. Some filters need to be applied after reading a device because
   they are based on data/signatures seen on the device.
   The partitioned filter is an example of this; lvm needs to
   read a device to see if it has a partition table before it can
   know whether to exclude the device from further processing.

We apply filters from 1 before reading devices, and we apply filters from
2 after populating bcache, but before processing the device (i.e. before
checking for an lvm label, which is the first step in processing.)

The current implementation of this makes filters return -EAGAIN if they
want to read the device, but bcache data is not yet available. This will
happen when filtering runs prior to populating bcache. In this case the
device is flagged. After bcache is populated, the filters are reapplied
to the flagged devices. The filters which need to look at device content
are now able to get it from bcache. Devices that do not pass filters at
this point are excluded just like devices which were excluded earlier.

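The deferral scheme described above can be sketched in Python. This is a
simplified model and not the lvm filter code: the EAGAIN sentinel, the
two example filters, the dict standing in for bcache and the "PTABLE"
signature are all made up for the example.

```python
EAGAIN = "EAGAIN"   # sentinel standing in for the -EAGAIN return value


def name_filter(dev, cache):
    """Name-based filter: decides without any device data (like a regex filter)."""
    return not dev.startswith("/dev/loop")


def partitioned_filter(dev, cache):
    """Content-based filter: defers until the device's data is in the cache."""
    if dev not in cache:
        return EAGAIN
    return not cache[dev].startswith(b"PTABLE")   # made-up partition signature


def apply_filters(dev, filters, cache):
    """Return False if a filter rejects dev, EAGAIN if one deferred, else True."""
    deferred = False
    for check in filters:
        result = check(dev, cache)
        if result is False:
            return False
        if result == EAGAIN:
            deferred = True
    return EAGAIN if deferred else True


def scan(devices, filters, read_device):
    cache, passed, flagged = {}, [], []
    for dev in devices:                   # pass 1: before any reads
        result = apply_filters(dev, filters, cache)
        if result is True:
            passed.append(dev)
        elif result == EAGAIN:
            flagged.append(dev)           # needs device data, retry later
    for dev in passed + flagged:          # populate the cache (stand-in for bcache)
        cache[dev] = read_device(dev)
    for dev in flagged:                   # pass 2: data is now available
        if apply_filters(dev, filters, cache) is True:
            passed.append(dev)
    return passed


contents = {"/dev/sda": b"PTABLE...", "/dev/sdb": b"LABELONE", "/dev/loop0": b""}
reads = []


def read_device(dev):
    reads.append(dev)
    return contents[dev]


usable = scan(list(contents), [name_filter, partitioned_filter], read_device)
print(usable)   # devices that passed every filter
print(reads)    # devices that were actually read
```

In this run the device rejected by the name-based filter is never read,
and the device carrying the made-up partition signature is read but
excluded on the second pass.
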
(Some filters from 2 can be skipped by consulting udev for the information
instead of reading the device. This is not entirely reliable, so it is
disabled by default with the config setting external_device_info_source.
It may be worthwhile to change the filters to use the udev info as a hint,
or only use udev info for filtering in reporting commands where
inaccuracies are not a big problem.)



I/O Performance
---------------

. 400 loop devices used as PVs
. 40 VGs each with 10 PVs
. each VG has one active LV
. each of the 10 PVs in vg0 has an artificial 100 ms read delay
. read/write/io_submit are system call counts using strace
. old is lvm 2.2.175
. new is lvm 2.2.178 (shortly before)


Command: pvs
------------
old: 0m17.422s
new: 0m0.331s

old: read 7773 write 497
new: read 2807 write 495 io_submit 448


Command: vgs
------------
old: 0m20.383s
new: 0m0.325s

old: read 10684 write 129
new: read 2807 write 129 io_submit 448


Command: vgck vg0
-----------------
old: 0m16.212s
new: 0m1.290s

old: read 6372 write 4
new: read 2807 write 4 io_submit 458


Command: lvcreate -n test -l1 -an vg0
-------------------------------------
old: 0m29.271s
new: 0m1.351s

old: read 6503 write 39
new: read 2808 write 9 io_submit 488


Command: lvremove vg0/test
--------------------------
old: 0m29.262s
new: 0m1.348s

old: read 6502 write 36
new: read 2807 write 6 io_submit 488


io_submit sources
-----------------

vgs:
  reads:
  - 400 (one for each PV)
  - 40 (one for each LV)
  - 8 for other devs on the system

vgck vg0:
  reads:
  - 400 (one for each PV)
  - 40 (one for each LV)
  - 10 (one for each PV in vg0, rescan)
  - 8 for other devs on the system

lvcreate -n test -l1 -an vg0
  reads:
  - 400 (one for each PV)
  - 40 (one for each LV)
  - 10 (one for each PV in vg0, rescan)
  - 8 for other devs on the system
  writes:
  - 10 for metadata (one on each PV in vg0)
  - 10 for precommit (one on each PV in vg0)
  - 10 for commit (one on each PV in vg0)



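The breakdown above adds up to the io_submit counts measured for the new
version. A short Python check of the arithmetic, using only the numbers
from the test setup and the lists above (the variable names are made up
for the example):

```python
PVS = 400          # loop devices used as PVs
LVS = 40           # one active LV in each of the 40 VGs
OTHER_DEVS = 8     # other devs on the system
VG0_PVS = 10       # PVs in vg0

scan_reads = PVS + LVS + OTHER_DEVS    # phase 1 scan of every device
rescan_reads = VG0_PVS                 # phase 2 rescan of the devices in vg0
metadata_writes = 3 * VG0_PVS          # metadata, precommit and commit on each PV

io_submit = {
    "vgs": scan_reads,
    "vgck vg0": scan_reads + rescan_reads,
    "lvcreate -n test -l1 -an vg0": scan_reads + rescan_reads + metadata_writes,
}

for command, count in io_submit.items():
    print(command, count)
```

The totals are 448, 458 and 488, matching the io_submit figures reported
for vgs, vgck vg0 and lvcreate in the measurements.
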
With lvmetad
------------

Command: pvs
------------
old: 0m5.405s
new: 0m1.404s

Command: vgs
------------
old: 0m0.222s
new: 0m0.223s

Command: lvcreate -n test -l1 -an vg0
-------------------------------------
old: 0m10.128s
new: 0m1.137s