mirror of
git://sourceware.org/git/lvm2.git
synced 2025-01-03 05:18:29 +03:00
Introducing asynchronous I/O to LVM
===================================

Issuing I/O asynchronously means instructing the kernel to perform specific
I/O and return immediately without waiting for it to complete. The data
is collected from the kernel later.

Advantages
----------

A1. While waiting for the I/O to happen, the program could perform other
operations.

A2. When LVM is searching for its Physical Volumes, it issues a small amount of
I/O to a large number of disks. If this were issued in parallel, the overall
runtime might be shorter while there should be little effect on the CPU time.

A3. If more than one timeout occurs when accessing any devices, these can be
taken in parallel, again reducing the runtime. This applies globally,
not just while the code is searching for Physical Volumes, so reading,
writing and committing the metadata may occasionally benefit too to some
extent. There are probably also maintenance advantages in using the same
method of I/O throughout the main body of the code.

A4. By introducing a simple callback function mechanism, the conversion can be
performed largely incrementally by first refactoring and continuing to
use synchronous I/O with the callbacks performed immediately. This allows the
callbacks to be introduced without changing the running sequence of the code
initially. Future projects could refactor some of the calling sites to
simplify the code structure and even eliminate some of the nesting.
This allows each part of what might ultimately amount to a large change to be
introduced and tested independently.
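
The incremental step described in A4 could look roughly like the sketch
below. It is not taken from the LVM source: the names io_done_fn,
dev_read_callback, struct label_check and label_check_done are all
hypothetical. The read is still performed synchronously with pread(), and
the callback is invoked before the function returns, so the running
sequence of the caller is unchanged.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: failure flag, caller context, read-only data. */
typedef void (*io_done_fn)(int failed, void *context, const void *data);

/*
 * Read len bytes at offset from fd, then hand the result to the callback.
 * This version performs the read synchronously, so the callback has
 * already run by the time the function returns.  An asynchronous version
 * could keep the same signature and invoke the callback later instead.
 * Returns 1 on success, 0 on failure.
 */
static int dev_read_callback(int fd, off_t offset, size_t len,
			     io_done_fn done, void *context)
{
	char *buf = malloc(len);
	ssize_t got;
	int failed;

	if (!buf) {
		done(1, context, NULL);
		return 0;
	}

	got = pread(fd, buf, len, offset);
	failed = (got < 0 || (size_t) got != len);	/* short read = failure */

	done(failed, context, buf);	/* data is only valid during the call */
	free(buf);

	return !failed;
}

/* Example caller state (hypothetical): records what the callback saw. */
struct label_check {
	int called;
	int failed;
	char first_byte;
};

static void label_check_done(int failed, void *context, const void *data)
{
	struct label_check *lc = context;

	assert(lc);
	lc->called = 1;
	lc->failed = failed;
	if (!failed)
		lc->first_byte = *(const char *) data;
}
```

A caller would pass label_check_done and a pointer to a struct label_check,
then inspect the struct after the call. Because the callback only borrows
the data, a later asynchronous version is free to manage the buffer
differently without changing the callers again.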


Disadvantages
-------------

D1. The resulting code may be more complex with more failure modes to
handle. Mitigate by thorough auditing and testing, rolling out
gradually, and offering a simple switch to revert to the old behaviour.

D2. The Linux asynchronous I/O implementation is less mature than
its synchronous I/O implementation and might expose problems that
depend on the version of the kernel or library used. Fixes or
workarounds for some of these might require kernel changes. For
example, there are suggestions that despite being supposedly async,
there are still cases where system calls can block. There might be
resource dependencies on other processes running on the system that make
it unsuitable for use while any devices are suspended. Mitigation
as for D1.

D3. The error handling within callbacks becomes more complicated.
However, we know that existing call paths can already sometimes discard
errors, sometimes deliberately, sometimes not, so this aspect is in need
of a complete review anyway and the new approach will make the error
handling more transparent. Aim initially for overall behaviour that is
no worse than that of the existing code, then work on improving it
later.

D4. The work will take a few weeks to code and test. This leads to a
significant opportunity cost when compared against other enhancements
that could be achieved in that time. However, the proof-of-concept work
performed while writing this design has satisfied me that the work could
proceed and be committed incrementally as a background task.


Observations regarding LVM's I/O Architecture
---------------------------------------------

H1. All device, metadata and config file I/O is constrained to pass through a
single route in lib/device.

H2. The first step of the analysis was to instrument this code path with
log_debug messages. I/O is split into the following categories:

  "dev signatures",
  "PV labels",
  "VG metadata header",
  "VG metadata content",
  "extra VG metadata header",
  "extra VG metadata content",
  "LVM1 metadata",
  "pool metadata",
  "LV content",
  "logging".

H3. A bounce buffer is used for most I/O.

H4. Most callers finish using the supplied data before any further I/O is
issued. The few that don't could be converted trivially to do so.

H5. There is one stream of I/O per metadata area on each device.

H6. Some reads fall at offsets close to immediately preceding reads, so it's
possible to avoid these by caching one "block" per metadata area I/O stream.

H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
immediate gains from this caching. A larger size might perform worse because
almost all the time the extra data read would not be used, but this can be
re-examined and tuned after the code is in place.
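
To make H6 and H7 concrete, here is a minimal sketch of rounding a request
out to aligned 8k blocks and of checking a one-block cache before issuing a
read. The names BLOCK_SIZE, aligned_window, struct stream_cache and
cache_lookup are assumptions made for illustration and do not come from
lib/device. For example, a 512-byte read at offset 4096 becomes a single
8192-byte read at offset 0, and a following read at offset 4608 can then be
served from the cached block without further I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8192u	/* the 8k minimum aligned read size from H7 */

/* Round an arbitrary request out to whole aligned blocks. */
static void aligned_window(uint64_t offset, size_t len,
			   uint64_t *start, size_t *window_len)
{
	uint64_t end = offset + len;

	assert(start && window_len);

	/* Round the start down and the end up to a block boundary. */
	*start = offset - (offset % BLOCK_SIZE);
	end = (end + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
	*window_len = (size_t) (end - *start);
}

/* One cached block per metadata area I/O stream (hypothetical layout). */
struct stream_cache {
	int valid;
	uint64_t start;			/* aligned device offset of data[] */
	unsigned char data[BLOCK_SIZE];
};

/*
 * Return a read-only pointer into the cached block if the request lies
 * wholly inside it, or NULL if the caller must issue a real read.
 */
static const void *cache_lookup(const struct stream_cache *c,
				uint64_t offset, size_t len)
{
	if (!c->valid || offset < c->start)
		return NULL;

	if (offset + len > c->start + BLOCK_SIZE)
		return NULL;

	return c->data + (offset - c->start);
}
```

Returning a const pointer rather than copying is what would let most
callers do without a bounce buffer, provided they finish with the data
before the next I/O on the same stream, as H4 observes most already do.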


Proposal
--------

P1. Retain the "single I/O path" but offer an asynchronous option.

P2. Eliminate the bounce buffer in most cases by improving alignment.

P3. Reduce the number of reads by always reading a minimum of an aligned
8k block.

P4. Eliminate repeated reads by caching the last block read and changing
the lib/device interface to return a pointer to read-only data within
this block.

P5. Only perform these interface changes for code on the critical path
for now by converting other code sites to use wrappers around the new
interface.

P6. Treat asynchronous I/O as the interface of choice and optimise only
for this case.

P7. Convert the callers on the critical path to pass callback functions
to the device layer. These functions will be called later with the
read-only data, a context pointer and a success/failure indicator.
Where an existing function performs a sequence of I/O, this has the
advantage of breaking up the large function into smaller ones and
wrapping the parameters used into structures. While this might look
rather messy and ad-hoc in the short-term, it's a first step towards
breaking up confusingly long functions into component parts and wrapping
the existing long parameter lists into more appropriate structures and
refactoring these parts of the code.

P8. Limit the resources used by the asynchronous I/O by using two
tunable parameters, one limiting the number of outstanding I/Os issued
and another limiting the total amount of memory used.

P9. Provide a fallback option if asynchronous I/O is unavailable by
sharing the code paths but issuing the I/O synchronously and calling the
callback immediately.

P10. Only allocate the buffer for the I/O at the point where the I/O is
about to be issued.

P11. If the thresholds are exceeded, add the request to a simple queue,
and process it later after some I/O has completed.
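
The interaction between P8, P10 and P11 can be sketched as simple
bookkeeping. Everything below is hypothetical (struct io_request,
struct io_limits, request_submit, request_complete) and only models the
accounting; no real I/O is issued. One assumption beyond the text above is
flagged in the code: a request is always let through when nothing else is
outstanding, so that a request larger than the memory limit cannot stall
forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: a sketch of P8, P10 and P11 only. */
struct io_request {
	size_t len;			/* buffer size needed once issued */
	struct io_request *next;	/* link used while queued */
};

struct io_limits {
	unsigned max_ios;		/* tunable: outstanding I/O count */
	size_t max_bytes;		/* tunable: total buffer memory */
	unsigned ios;			/* currently outstanding */
	size_t bytes;			/* currently accounted for */
	struct io_request *head, *tail;	/* deferred requests, FIFO order */
};

static int fits(const struct io_limits *l, const struct io_request *r)
{
	if (!l->ios)
		return 1;	/* assumption: always allow some progress */

	return l->ios < l->max_ios && l->bytes + r->len <= l->max_bytes;
}

/* Account for a request being issued; its buffer would be allocated here. */
static void issue(struct io_limits *l, const struct io_request *r)
{
	l->ios++;
	l->bytes += r->len;
}

/* Returns 1 if the request was issued now, 0 if it was queued for later. */
static int request_submit(struct io_limits *l, struct io_request *r)
{
	/* Never overtake requests that are already waiting. */
	if (!l->head && fits(l, r)) {
		issue(l, r);
		return 1;
	}

	r->next = NULL;
	if (l->tail)
		l->tail->next = r;
	else
		l->head = r;
	l->tail = r;

	return 0;
}

/*
 * Call when an issued request finishes.  Releases its share of the
 * limits, then issues as many queued requests as now fit.  Returns the
 * number of queued requests that were started.
 */
static unsigned request_complete(struct io_limits *l, struct io_request *r)
{
	unsigned started = 0;

	assert(l->ios > 0 && l->bytes >= r->len);
	l->ios--;
	l->bytes -= r->len;

	while (l->head && fits(l, l->head)) {
		struct io_request *q = l->head;

		l->head = q->next;
		if (!l->head)
			l->tail = NULL;
		issue(l, q);
		started++;
	}

	return started;
}
```

With max_ios set to 2 and max_bytes to 16384, two 8192-byte requests are
issued at once, a third is queued, and completing either of the first two
starts the third. Deferring the allocation to issue() is what keeps queued
requests from counting against the memory limit.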


Future work
-----------

F1. Perform a complete review of the error tracking so that device
failures are handled and reported more cleanly, extending the existing
basic error counting mechanism.

F2. Consider whether some of the nested callbacks can be eliminated,
which would allow for additional simplifications.

F3. Adjust the contents of the ad hoc context structs into more logical
arrangements and use them more widely.

F4. Perform wider refactoring of these areas of code.


Testing considerations
----------------------

T1. The changes touch code on the device path, so a thorough re-test of
the device layer is required. The new code needs a full audit down
through the library layer into the kernel to check that all the error
conditions that are currently implemented (such as EAGAIN) are handled
sensibly. (LVM's I/O layer needs to remain as solid as we can make it.)

T2. The current test suite provides a reasonably broad range of coverage
of this area but is far from comprehensive.


Acceptance criteria
-------------------

A1. The current test suite should pass to the same extent as before the
changes.

A2. When all debugging and logging is disabled, strace -c must show
improvements, e.g. the expected reduction in the number of reads.

A3. Running a range of commands under valgrind must not reveal any
new leaks due to the changes.

A4. All new Coverity reports from the change must be addressed.

A5. CPU time should be similar to that before, as the same work
is being done overall, just in a different order.

A6. Tests need to show improved behaviour in targeted areas. For example,
if several devices are slow and time out, the delays should occur
in parallel and the elapsed time should be less than before.


Release considerations
----------------------

R1. Async I/O should be widely available and largely reliable on Linux
nowadays (even though parts of its interface and implementation remain a
matter of controversy) so we should try to make its use the default
wherever it is supported. If certain types of systems have problems we
should try to detect those cases and disable it automatically there.

R2. Because the implications of an unexpected problem in the new code
could be severe for the people affected, the rollout needs to be gentle
without a deadline to allow us plenty of time to gain confidence in the
new code. Our own testing will only be able to cover a tiny fraction of
the different setups our users have, so we need to look out for problems
caused by this proactively and encourage people to test it on their own
systems and report back. It must go into the tree near the start of a
release cycle rather than at the end to provide time for our confidence
in it to grow.