IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
sysconf() may also return -1 although rather theoretically.
Default to 4K when such case would happen.
Also in function call it just once and keep as static variable.
Ensure only nonNULL 'du' pointer is dereference altough the comment
to the last assign 'du' pointer already suggest 'NULL' case should not happen.
So just being explicit.
mer du
Some device id types can only be used with specific device major
numbers, so use this restriction to avoid some comparisions.
This is more efficient, and can avoid some incorrect matches.
pvid and vgid are sometimes a null-terminated string, and
other times a 'struct id', and the two types were often
cast between each other. When a struct id was cast to a char
pointer, the resulting string would not necessarily be null
terminated. Casting a null-terminated string id to a
struct id is fine, but is still avoided when possible.
A struct id is: int8_t uuid[ID_LEN]
A string id is: char pvid[ID_LEN + 1]
A convention is introduced to help distinguish them:
- variables and struct fields named "pvid" or "vgid"
should be null-terminated strings.
- variables and struct fields named "pv_id" or "vg_id"
should be struct id's.
- examples:
char pvid[ID_LEN + 1];
char vgid[ID_LEN + 1];
struct id pv_id;
struct id vg_id;
Function names also attempt to follow this convention.
Avoid casting between the two types as much as possible,
with limited exceptions when known to be safe and clearly
commented.
Avoid using variations of strcpy and strcmp, and instead
use memcpy/memcmp with ID_LEN (with similar limited
exceptions possible.)
sysfs-based multipath component detection quit if a
device had multiple holders, and in this case would
fail to detect a device was an mpath component.
related to config settings:
obtain_device_info_from_udev (controls if lvm gets
a list of devices from readdir /dev or from libudev)
external_device_info_source (controls if lvm asks
libudev for device information)
. Make the obtain_device_list_from_udev setting
affect only the choice of readdir /dev vs libudev.
The setting no longer controls if udev is used for
device type checks.
. Change obtain_device_list_from_udev default to 0.
This helps avoid boot timeouts due to slow libudev
queries, avoids reported failures from
udev_enumerate_scan_devices, and avoids delays from
"device not initialized in udev database" errors.
Even without errors, for a system booting with 1024 PVs,
lvm2-pvscan times improve from about 100 sec to 15 sec,
and the pvscan command from about 64 sec to about 4 sec.
. For external_device_info_source="none", remove all
libudev device info queries, and use only lvm
native device info.
. For external_device_info_source="udev", first check
lvm native device info, then check libudev info.
. Remove sleep/retry loop when attempting libudev
queries for device info. udev info will simply
be skipped if it's not immediately available.
. Only set up a libdev connection if it will be used by
obtain_device_list_from_udev/external_device_info_source.
. For native multipath component detection, use
/etc/multipath/wwids. If a device has a wwid
matching an entry in the wwids file, then it's
considered a multipath component. This is
necessary to natively detect multipath
components when the mpath device is not set up.
expands commit d5a06f9a7d
"pvscan: skip indexing devices used by LVs"
The dev cache index is expensive and slow, so limit it
to commands that are used to observe the state of lvm.
The index is only used to print warnings about incorrect
device use by active LVs, e.g. if an LV is using a
multipath component device instead of the multipath
device. Commands that continue to use the index and
print the warnings:
fullreport, lvmdiskscan, vgs, lvs, pvs,
vgdisplay, lvdisplay, pvdisplay,
vgscan, lvscan, pvscan (excluding --cache)
A couple other commands were borrowing the DEV_USED_FOR_LV
flag to just check if a device was actively in use by LVs.
These are converted to the new dev_is_used_by_active_lv().
dev_cache_index_devs() is taking a large amount of time
when there are many PVs. The index keeps track of
devices that are currently in use by active LVs. This
info is used to print warnings for users in some limited
cases.
The checks/warnings that are enabled by the index are not
needed by pvscan --cache, so disable it in this case.
This may be expanded to other cases in future commits.
dev_cache_index_devs should also be improved in another
commit to avoid the extreme delays with many devices.
When adding a device to the devices file with --adddev, lvm
by default chooses the best device ID type for the new device.
The new --deviceidtype option allows the user to override the
built in preference. This is useful if there's a problem with
the default type, or if a secondary type is preferrable.
If the specified deviceidtype does not produce a device ID,
then lvm falls back to the preference it would otherwise use.
error reading dev and no pvid on dev were both
returning 0. make it easier for callers to
know which, if they care.
return 1 if the device could be read, regardless
of whether a pvid was found or not.
set has_pvid=1 if a pvid is found and 0 if no
pvid is found.
This case happens when i.e. we convert LV to another type,
when we change existing LV into a different type - so change
to debug level and avoid confusing users with message about
Device path not match.
We may eventually enhnace caching code to drop cached info
after taking lock and reading VG.
With commit b44db5d1a7
needs to check allocated pointer for failed malloc().
Existing check was actually no checking anything so failing
malloc here would result in segfault (although with very
low chance to ever happen).
Use different 'hint' size for dm_hash_create() call - so
when debug info about hash is printed we can recognize which
hash was in use.
This patch doesn't change actual used size since that is always
rounded to be power of 2 and >=16 - so as such is only a
help to developer.
We could eventually use 'name' arg, but since this would have changed
API and this patchset will be routed to libdm & stable - we will
just use this small trick.
Use 'C' for alphasort - there is no need to use localized and slower
sorting for internal directory scanning.
Ensure on all code paths allocated dirent entries are released.
Optimize full path construction.
The LVM devices file lists devices that lvm can use. The default
file is /etc/lvm/devices/system.devices, and the lvmdevices(8)
command is used to add or remove device entries. If the file
does not exist, or if lvm.conf includes use_devicesfile=0, then
lvm will not use a devices file. When the devices file is in use,
the regex filter is not used, and the filter settings in lvm.conf
or on the command line are ignored.
LVM records devices in the devices file using hardware-specific
IDs, such as the WWID, and attempts to use subsystem-specific
IDs for virtual device types. These device IDs are also written
in the VG metadata. When no hardware or virtual ID is available,
lvm falls back using the unstable device name as the device ID.
When devnames are used, lvm performs extra scanning to find
devices if their devname changes, e.g. after reboot.
When proper device IDs are used, an lvm command will not look
at devices outside the devices file, but when devnames are used
as a fallback, lvm will scan devices outside the devices file
to locate PVs on renamed devices. A config setting
search_for_devnames can be used to control the scanning for
renamed devname entries.
Related to the devices file, the new command option
--devices <devnames> allows a list of devices to be specified for
the command to use, overriding the devices file. The listed
devices act as a sort of devices file in terms of limiting which
devices lvm will see and use. Devices that are not listed will
appear to be missing to the lvm command.
Multiple devices files can be kept in /etc/lvm/devices, which
allows lvm to be used with different sets of devices, e.g.
system devices do not need to be exposed to a specific application,
and the application can use lvm on its own set of devices that are
not exposed to the system. The option --devicesfile <filename> is
used to select the devices file to use with the command. Without
the option set, the default system devices file is used.
Setting --devicesfile "" causes lvm to not use a devices file.
An existing, empty devices file means lvm will see no devices.
The new command vgimportdevices adds PVs from a VG to the devices
file and updates the VG metadata to include the device IDs.
vgimportdevices -a will import all VGs into the system devices file.
LVM commands run by dmeventd not use a devices file by default,
and will look at all devices on the system. A devices file can
be created for dmeventd (/etc/lvm/devices/dmeventd.devices) If
this file exists, lvm commands run by dmeventd will use it.
Internal implementaion:
- device_ids_read - read the devices file
. add struct dev_use (du) to cmd->use_devices for each devices file entry
- dev_cache_scan - get /dev entries
. add struct device (dev) to dev_cache for each device on the system
- device_ids_match - match devices file entries to /dev entries
. match each du on cmd->use_devices to a dev in dev_cache, using device ID
. on match, set du->dev, dev->id, dev->flags MATCHED_USE_ID
- label_scan - read lvm headers and metadata from devices
. filters are applied, those that do not need data from the device
. filter-deviceid skips devs without MATCHED_USE_ID, i.e.
skips /dev entries that are not listed in the devices file
. read lvm label from dev
. filters are applied, those that use data from the device
. read lvm metadata from dev
. add info/vginfo structs for PVs/VGs (info is "lvmcache")
- device_ids_find_renamed_devs - handle devices with unstable devname ID
where devname changed
. this step only needed when devs do not have proper device IDs,
and their dev names change, e.g. after reboot sdb becomes sdc.
. detect incorrect match because PVID in the devices file entry
does not match the PVID found when the device was read above
. undo incorrect match between du and dev above
. search system devices for new location of PVID
. update devices file with new devnames for PVIDs on renamed devices
. label_scan the renamed devs
- continue with command processing
Since lvm2 normally block signals during protected
phase where it does not want to be interrupted.
Support interruptible processing when allowed
in section between sigint_allow() ... sigint_restore())
and let the 'io_getenvents()' finish with EINTR.
When bcache tries to write data to a faulty device,
it may get out of caching blocks and then just busy-loops
on a CPU - so this check protects this by checking
if there is already max_io (~64) errored blocks.
Call _wait_all() which does check whether there is still
some pending IO before sleep. Otherwise it may happen
our submitted IO operations have been already dispatched
and this call then endlessly waits for IO which are all done.
This can be reproduced when device returns quickly errors
on write requests.
Add a "device index" (di) for each device, and use this
in the bcache api to the rest of lvm. This replaces the
file descriptor (fd) in the api. The rest of lvm uses
new functions bcache_set_fd(), bcache_clear_fd(), and
bcache_change_fd() to control which fd bcache uses for
io to a particular device.
. lvm opens a dev and gets and fd.
fd = open(dev);
. lvm passes fd to the bcache layer and gets a di
to use in the bcache api for the dev.
di = bcache_set_fd(fd);
. lvm uses bcache functions, passing di for the dev.
bcache_write_bytes(di, ...), etc.
. bcache translates di to fd to do io.
. lvm closes the device and clears the di/fd bcache state.
close(fd);
bcache_clear_fd(di);
In the bcache layer, a di-to-fd translation table
(int *_fd_table) is added. When bcache needs to
perform io on a di, it uses _fd_table[di].
In the following commit, lvm will make use of the new
bcache_change_fd() function to change the fd that
bcache uses for the dev, without dropping cached blocks.
Switch remaining zero sized struct to flexible arrays to be C99
complient.
These simple rules should apply:
- The incomplete array type must be the last element within the structure.
- There cannot be an array of structures that contain a flexible array member.
- Structures that contain a flexible array member cannot be used as a member of another structure.
- The structure must contain at least one named member in addition to the flexible array member.
Although some of the code pieces should be still improved.
When initiated larger write request, it may have happened, bcache
got out of free chunks - fix the loop, that is supposed to wait
until next free chunk becomes avain available.
It's possible for a dev-cache entry to remain after all
paths for it have been removed, and other parts of the
code expect that a dev always has a name. A better fix
may be to remove a device from dev-cache after all paths
to it have been removed.
dm-integrity stores checksums of the data written to an
LV, and returns an error if data read from the LV does
not match the previously saved checksum. When used on
raid images, dm-raid will correct the error by reading
the block from another image, and the device user sees
no error. The integrity metadata (checksums) are stored
on an internal LV allocated by lvm for each linear image.
The internal LV is allocated on the same PV as the image.
Create a raid LV with an integrity layer over each
raid image (for raid levels 1,4,5,6,10):
lvcreate --type raidN --raidintegrity y [options]
Add an integrity layer to images of an existing raid LV:
lvconvert --raidintegrity y LV
Remove the integrity layer from images of a raid LV:
lvconvert --raidintegrity n LV
Settings
Use --raidintegritymode journal|bitmap (journal is default)
to configure the method used by dm-integrity to ensure
crash consistency.
Initialization
When integrity is added to an LV, the kernel needs to
initialize the integrity metadata/checksums for all blocks
in the LV. The data corruption checking performed by
dm-integrity will only operate on areas of the LV that
are already initialized. The progress of integrity
initialization is reported by the "syncpercent" LV
reporting field (and under the Cpy%Sync lvs column.)
Example: create a raid1 LV with integrity:
$ lvcreate --type raid1 -m1 --raidintegrity y -n rr -L1G foo
Creating integrity metadata LV rr_rimage_0_imeta with size 12.00 MiB.
Logical volume "rr_rimage_0_imeta" created.
Creating integrity metadata LV rr_rimage_1_imeta with size 12.00 MiB.
Logical volume "rr_rimage_1_imeta" created.
Logical volume "rr" created.
$ lvs -a foo
LV VG Attr LSize Origin Cpy%Sync
rr foo rwi-a-r--- 1.00g 4.93
[rr_rimage_0] foo gwi-aor--- 1.00g [rr_rimage_0_iorig] 41.02
[rr_rimage_0_imeta] foo ewi-ao---- 12.00m
[rr_rimage_0_iorig] foo -wi-ao---- 1.00g
[rr_rimage_1] foo gwi-aor--- 1.00g [rr_rimage_1_iorig] 39.45
[rr_rimage_1_imeta] foo ewi-ao---- 12.00m
[rr_rimage_1_iorig] foo -wi-ao---- 1.00g
[rr_rmeta_0] foo ewi-aor--- 4.00m
[rr_rmeta_1] foo ewi-aor--- 4.00m
Do this at two levels, although one would be enough to
fix the problem seen recently:
- Ignore any reported sector size other than 512 of 4096.
If either sector size (physical or logical) is reported
as 512, then use 512. If neither are reported as 512,
and one or the other is reported as 4096, then use 4096.
If neither is reported as either 512 or 4096, then use 512.
- When rounding up a limited write in bcache to be a multiple
of the sector size, check that the resulting write size is
not larger than the bcache block itself. (This shouldn't
happen if the sector size is 512 or 4096.)
An active md device with an end superblock causes lvm to
enable full md component detection. This was being done
within the filter loop instead of before, so the full
filtering of some devs could be missed.
Also incorporate the recently added config setting that
controls the md component detection.
If udev info is missing for a device, (which would indicate
if it's an MD component), then do an end-of-device read to
check if a PV is an MD component. (This is skipped when
using hints since we already know devs in hints are good.)
A new config setting md_component_checks can be used to
disable the additional end-of-device MD checks, or to
always enable end-of-device MD checks.
When both hints and udev info are disabled/unavailable,
the end of PVs will now be scanned by default. If md
devices with end-of-device superblocks are not being
used, the extra I/O overhead can be avoided by setting
md_component_checks="start".
udev_dev_is_md_component and udev_dev_is_mpath_component
are not used for obtaining the device list, but they still
use libudev for device info. When there are problems with
udev, these functions can get stuck. So, use the existing
obtain_device_list_from_udev config setting to also control
whether these "is component" functions are used, which gives
us a way to avoid using libudev entirely when it's causing
problems.
Save the list of PVs in /run/lvm/hints. These hints
are used to reduce scanning in a number of commands
to only the PVs on the system, or only the PVs in a
requested VG (rather than all devices on the system.)
Ensure configure.h is always 1st. included header.
Maybe we could eventually introduce gcc -include option, but for now
this better uses dependency tracking.
Also move _REENTRANT and _GNU_SOURCE into configure.h so it
doesn't need to be present in various source files.
This ensures consistent compilation of headers like stdio.h since
it may produce different declaration.
commit de28637
scan: use full md filter when md 1.0 devices are present
missed the fact that md superblock version 0.90 also puts
metadata at the end of the device, so the full md filter
needs to be used when either 0.90 or 1.0 is present.
fix lseek error check
fix read/write error checks
handle zero return from read and write
don't return an error for short io
fix partial read/write loop
io_setup() for aio may fail if a system has reached the
aio request limit. In this case, fall back to using
sync io. Also, lvm use of aio can be disabled entirely
with config setting global/use_aio=0.
The system limit for aio requests can be seen from
/proc/sys/fs/aio-max-nr
The current usage of aio requests can be seen from
/proc/sys/fs/aio-nr
The system limit for aio requests can be increased by
setting fs.aio-max-nr using sysctl.
Also add last-byte limit to the sync io code.
lvm uses a bcache block size of 128K. A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated. How much depends on
alignments. When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.
This patch is a quick hack to prevent lvm from writing
past the end of the metadata area.
This is the number of concurrent async io requests that
the scan layer will submit to the bcache layer. There
will be an open fd for each of these, so it is best to
keep this well below the default limit for max open files
(1024), otherwise lvm may get EMFILE from open(2) when
there are around 1024 devices to scan on the system.
When lvm2 command is executed in test mode, discard ioctl is skipped.
This may cause even data-loose in case, issuing discard for released
areas was enabled and user 'tested' lvreduce.
udev creates a train wreck of events if we open devices
with RDWR. Until we can fix/disable/scrap udev, work around
this by opening RDONLY and then closing/reopening RDWR when
a write is needed. This invalidates the bcache blocks for
the device before writing so it can trigger unnecessary
rereading.
The md filter can operate in two native modes:
- normal: reads only the start of each device
- full: reads both the start and end of each device
md 1.0 devices place the superblock at the end of the device,
so components of this version will only be identified and
excluded when lvm uses the full md filter.
Previously, the full md filter was only used in commands
that could write to the device. Now, the full md filter
is also applied when there is an md 1.0 device present
on the system. This means the 'pvs' command can avoid
displaying md 1.0 components (at the cost of doubling
the i/o to every device on the system.)
(The md filter can operate in a third mode, using udev,
but this is disabled by default because there have been
problems with reliability of the info returned from udev.)
Remove the io error message from bcache.c since it is not
very useful without the device path.
Make the io error messages from dev_read_bytes/dev_write_bytes
more user friendly.
We have been warning about duplicate devices (and disabling lvmetad)
immediately when the dup was detected (during label_scan). Move the
warnings (and the disabling) to happen later, after label_scan is
finished.
This lets us avoid an unwanted warning message about duplicates
in the special case were md components are eliminated during the
duplicate device resolution.
The device-mapper directory now holds a copy of libdm source. At
the moment this code is identical to libdm. Over time code will
migrate out to appropriate places (see doc/refactoring.txt).
The libdm directory still exists, and contains the source for the
libdevmapper shared library, which we will continue to ship (though
not neccessarily update).
All code using libdm should now use the version in device-mapper.
As we start refactoring the code to break dependencies (see doc/refactoring.txt),
I want us to use full paths in the includes (eg, #include "base/data-struct/list.h").
This makes it more obvious when we're breaking abstraction boundaries, eg, including a file in
metadata/ from base/
md devices using an older superblock version have
superblocks at the end of the md device. For commands
that skip reading the end of devices during filtering,
the md component devs will be scanned, and will appear
as duplicate PVs to the original md device. Remove
these md components from the list of unused duplicate
devices, so they are treated as if they had been
ignored during filtering. This avoids the restrictions
that are placed on using PVs with duplicates.
All these functions are now used as utilities,
e.g. for ioctl (not for io), and need to
open/close the device each time they are called.
(Many of the opens can probably be eliminated by
just using the bcache fd for the ioctl.)
Filters are still applied before any device reading or
the label scan, but any filter checks that want to read
the device are skipped and the device is flagged.
After bcache is populated, but before lvm looks for
devices (i.e. before label scan), the filters are
reapplied to the devices that were flagged above.
The filters will then find the data they need in
bcache.
bcache_invalidate() now returns a bool to indicate success. If fails
if the block is currently held, or the block is dirty and writeback
fails.
Added a bunch of unit tests for the invalidate functions.
Fixed some bugs to do with invalidating errored blocks.
The error handling code wasn't working, but it
appears that just removing it is what we need.
The doesn't really need any different behavior
related to bcache blocks on an io error, it just
wants to know if there was an error.
Create a new dev->bcache_fd that the scanning code owns
and is in charge of opening/closing. This prevents other
parts of lvm code (which do various open/close) from
interfering with the bcache fd. A number of dev_open
and dev_close are removed from the reading path since
the read path now uses the bcache.
With that in place, open(O_EXCL) for pvcreate/pvremove
can then be fixed. That wouldn't work previously because
of other open fds.
New label_scan function populates bcache for each device
on the system.
The two read paths are updated to get data from bcache.
The bcache is not yet used for writing. bcache blocks
for a device are invalidated when the device is written.
With these read errors it's useful to know the reason.
Also avoid to log error just once so we know exactly
how many times we did failing read.
On the other hand reduce repeated log_error() on code 'backtrace'
path and change severity of message to just log_debug() so the
actual read error is printed once for one read.
Actually the removed code is necessary - since not all writes are
getting alligned buffer - older compilers seems to be not able
to create 4K aligned buffers on stack - this the aligning code still
need to be present for write path.
If the data being requested is present in last_[extra_]devbuf,
return that directly instead of reading it from disk again.
Typical LVM2 access patterns request data within two adjacent 4k blocks
so we eliminate some read() system calls by always reading at least 8k.
Callers that read larger amounts of data now get a pointer to read-only
data directly without copying it through an intermediate buffer. This
data is owned by the device layer so the callers no longer free it.
If it obtains the data, it passes it into the supplied callback function
and returns 1. Otherwise the callback receives failed = 1.
Updated config_file_read_fd to use this and similarly return the data
via a callback fn of its own.
Rename dev_read() to dev_read_buf() - the function that reads data
into a supplied buffer.
Introduce a new dev_read() that allocates the buffer it returns and
switch the important users over to this. No caller may change the
returned data. (For now, callers are responsible for freeing it after
use, but later the device layer will take full ownership.)
dev_read_buf() should only be used for tiny buffers or unimportant code
(such as the old disk formats).
The creation of wrapped around metadata - where the start of metadata is
written up to the end of the buffer and the remainder follows back at
the start of the buffer - is now restricted to cases where writing the
metadata in one piece wouldn't fit. This shouldn't happen in 'normal'
usage so let's begin treating the code for this as a special case that
can be ignored when optimising 'normal' cases.
Mark the first metadata area on each text format PV as MDA_PRIMARY.
Pass this information down to the device layer so that when
there are two metadata areas on a block device, we can easily
distinguish two independent streams of I/O.
Introduce enum dev_io_reason to categorise block device I/O
in debug messages so it's obvious what it is for.
DEV_IO_SIGNATURES /* Scanning device signatures */
DEV_IO_LABEL /* LVM PV disk label */
DEV_IO_MDA_HEADER /* Text format metadata area header */
DEV_IO_MDA_CONTENT /* Text format metadata area content */
DEV_IO_FMT1 /* Original LVM1 metadata format */
DEV_IO_POOL /* Pool metadata format */
DEV_IO_LV /* Content written to an LV */
DEV_IO_LOG /* Logging messages */
- Use 'lvmcache' consistently instead of 'metadata cache'
- Always use 5 characters for source line number
- Remember to convert uuids into printable form
- Use <no name> rather than (null) when VG has no name.