Introducing asynchronous I/O to LVM
===================================

Issuing I/O asynchronously means instructing the kernel to perform specific
I/O and return immediately without waiting for it to complete. The data is
collected from the kernel later.

Advantages
----------

A1. While waiting for the I/O to happen, the program could perform other
operations.

A2. When LVM is searching for its Physical Volumes, it issues a small amount
of I/O to a large number of disks. If this were issued in parallel the
overall runtime might be shorter while there should be little effect on the
CPU time.

A3. If more than one timeout occurs when accessing any devices, the delays
can be incurred in parallel, again reducing the runtime. This applies
globally, not just while the code is searching for Physical Volumes, so
reading, writing and committing the metadata may occasionally benefit too to
some extent. There are probably also maintenance advantages in using the
same method of I/O throughout the main body of the code.

A4. By introducing a simple callback function mechanism, the conversion can
be performed largely incrementally, by first refactoring and continuing to
use synchronous I/O with the callbacks performed immediately. This allows
the callbacks to be introduced without changing the running sequence of the
code initially. Future projects could refactor some of the calling sites to
simplify the code structure and even eliminate some of the nesting. This
allows each part of what might ultimately amount to a large change to be
introduced and tested independently.

Disadvantages
-------------

D1. The resulting code may be more complex with more failure modes to
handle. Mitigate by thorough auditing and testing, rolling out gradually,
and offering a simple switch to revert to the old behaviour.

D2. The Linux asynchronous I/O implementation is less mature than its
synchronous I/O implementation and might show up problems that depend on the
version of the kernel or library used. Fixes or workarounds for some of
these might require kernel changes. For example, there are suggestions that,
despite being supposedly asynchronous, there are still cases where system
calls can block. There might be resource dependencies on other processes
running on the system that make it unsuitable for use while any devices are
suspended. Mitigation as for D1.

D3. The error handling within callbacks becomes more complicated. However,
we know that existing call paths can already sometimes discard errors,
sometimes deliberately, sometimes not, so this aspect is in need of a
complete review anyway, and the new approach will make the error handling
more transparent. Aim initially for overall behaviour that is no worse than
that of the existing code, then work on improving it later.

D4. The work will take a few weeks to code and test. This leads to a
significant opportunity cost when compared against other enhancements that
could be achieved in that time. However, the proof-of-concept work performed
while writing this design has satisfied me that the work could proceed and
be committed incrementally as a background task.

Observations regarding LVM's I/O Architecture
---------------------------------------------

H1. All device, metadata and config file I/O is constrained to pass through
a single route in lib/device.

H2. The first step of the analysis was to instrument this code path with
log_debug messages. I/O is split into the following categories:

    "dev signatures",
    "PV labels",
    "VG metadata header",
    "VG metadata content",
    "extra VG metadata header",
    "extra VG metadata content",
    "LVM1 metadata",
    "pool metadata",
    "LV content",
    "logging".

H3. A bounce buffer is used for most I/O.

H4. Most callers finish using the supplied data before any further I/O is
issued. The few that don't could be converted trivially to do so.

H5. There is one stream of I/O per metadata area on each device.

H6. Some reads fall at offsets close to immediately preceding reads, so it's
possible to avoid these by caching one "block" per metadata area I/O stream.

H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
immediate gains from this caching. A larger size might perform worse because
almost all the time the extra data read would not be used, but this can be
re-examined and tuned after the code is in place. (A sketch of this caching
approach follows below.)
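The following is a minimal sketch, in C, of the kind of single-block cache
described in H6 and H7: each metadata-area I/O stream keeps the last aligned
8k block it read and satisfies nearby requests from it without issuing more
I/O. The struct and function names (io_stream_cache, cached_read) are purely
illustrative and are not part of the existing lib/device interface.

    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CACHE_BLOCK_SIZE 8192  /* minimum aligned read size from H7 */

    /* One cache entry per metadata-area I/O stream (H5, H6). */
    struct io_stream_cache {
        int fd;                /* device file descriptor */
        uint64_t block_start;  /* device offset of the cached block */
        size_t valid;          /* number of valid bytes in buf */
        char *buf;             /* CACHE_BLOCK_SIZE bytes, suitably aligned */
    };

    /*
     * Read 'len' bytes at 'offset' and return a pointer to read-only data
     * inside the cached block (cf. P4), or NULL on error or if the request
     * crosses a block boundary.
     */
    const void *cached_read(struct io_stream_cache *c, uint64_t offset,
                            size_t len)
    {
        uint64_t block_start = offset & ~((uint64_t) CACHE_BLOCK_SIZE - 1);
        size_t block_offset = offset - block_start;
        ssize_t n;

        if (block_offset + len > CACHE_BLOCK_SIZE)
            return NULL;

        /* Serve the request from the cached block if it covers it (H6). */
        if (c->buf && c->block_start == block_start &&
            block_offset + len <= c->valid)
            return c->buf + block_offset;

        if (!c->buf && posix_memalign((void **) &c->buf, CACHE_BLOCK_SIZE,
                                      CACHE_BLOCK_SIZE))
            return NULL;

        /* Always read a whole aligned block (cf. P3). */
        n = pread(c->fd, c->buf, CACHE_BLOCK_SIZE, (off_t) block_start);
        if (n < 0 || (size_t) n < block_offset + len) {
            c->valid = 0;
            return NULL;
        }

        c->block_start = block_start;
        c->valid = (size_t) n;
        return c->buf + block_offset;
    }

A real implementation would also need to invalidate or update the cached
block after writes and to handle requests that span a block boundary.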
Proposal
--------

P1. Retain the "single I/O path" but offer an asynchronous option.

P2. Eliminate the bounce buffer in most cases by improving alignment.

P3. Reduce the number of reads by always reading a minimum of an aligned 8k
block.

P4. Eliminate repeated reads by caching the last block read and changing the
lib/device interface to return a pointer to read-only data within this
block.

P5. Only perform these interface changes for code on the critical path for
now, by converting other code sites to use wrappers around the new
interface.

P6. Treat asynchronous I/O as the interface of choice and optimise only for
this case.

P7. Convert the callers on the critical path to pass callback functions to
the device layer. These functions will be called later with the read-only
data, a context pointer and a success/failure indicator. Where an existing
function performs a sequence of I/O, this has the advantage of breaking up
the large function into smaller ones and wrapping the parameters used into
structures. While this might look rather messy and ad hoc in the short term,
it's a first step towards breaking up confusingly long functions into
component parts, wrapping the existing long parameter lists into more
appropriate structures and refactoring these parts of the code.

P8. Limit the resources used by the asynchronous I/O by using two tunable
parameters, one limiting the number of outstanding I/Os issued and another
limiting the total amount of memory used.

P9. Provide a fallback option if asynchronous I/O is unavailable by sharing
the code paths but issuing the I/O synchronously and calling the callback
immediately.

P10. Only allocate the buffer for the I/O at the point where the I/O is
about to be issued.

P11. If the thresholds are exceeded, add the request to a simple queue, and
process it later after some I/O has completed. (A sketch covering P7 to P11
follows below.)
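The sketch below, assuming the libaio userspace library, shows one way these
pieces could fit together: the callback interface of P7, the outstanding-I/O
limit of P8, the synchronous fallback of P9, buffer allocation at submission
time (P10) and the deferred-request queue of P11. Every name here
(dev_read_async, process_completions, the limits and so on) is hypothetical
rather than the actual lvm2 code, and error handling is cut to the minimum.

    /* Minimal and illustrative only; link with -laio. */
    #include <libaio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define MAX_INFLIGHT 64  /* P8: tunable limit on outstanding I/Os */
    #define READ_BLOCK 8192  /* P3: minimum aligned read size */

    /* P7: callback gets read-only data, a context pointer, a success flag. */
    typedef void (*io_complete_fn)(const void *data, void *context, int success);

    struct io_request {
        struct iocb iocb;
        int fd;
        long long offset;
        void *buf;               /* allocated just before submission (P10) */
        io_complete_fn complete;
        void *context;
        struct io_request *next; /* P11: queue of deferred requests */
    };

    static io_context_t aio_ctx;
    static int use_aio, inflight;
    static struct io_request *pending_head, *pending_tail;

    void dev_async_init(void)
    {
        use_aio = !io_setup(MAX_INFLIGHT, &aio_ctx); /* else P9 sync fallback */
    }

    static void finish(struct io_request *r, long res)
    {
        r->complete(r->buf, r->context, res == READ_BLOCK);
        free(r->buf);
        free(r);
    }

    static int submit(struct io_request *r)
    {
        struct iocb *iocb = &r->iocb;

        /* P10: allocate the data buffer only when the I/O is issued. */
        if (posix_memalign(&r->buf, READ_BLOCK, READ_BLOCK)) {
            free(r);
            return 0;
        }
        if (use_aio) {
            io_prep_pread(iocb, r->fd, r->buf, READ_BLOCK, r->offset);
            iocb->data = r;
            if (io_submit(aio_ctx, 1, &iocb) == 1) {
                inflight++;
                return 1;
            }
        }
        /* P9: synchronous fallback sharing the same completion path. */
        finish(r, pread(r->fd, r->buf, READ_BLOCK, (off_t) r->offset));
        return 1;
    }

    /* Entry point for callers: defer the request if the limit is reached. */
    int dev_read_async(int fd, long long offset, io_complete_fn cb, void *context)
    {
        struct io_request *r = calloc(1, sizeof *r);

        if (!r)
            return 0;
        r->fd = fd;
        r->offset = offset;
        r->complete = cb;
        r->context = context;
        if (use_aio && inflight >= MAX_INFLIGHT) { /* P11 */
            if (pending_tail)
                pending_tail->next = r;
            else
                pending_head = r;
            pending_tail = r;
            return 1;
        }
        return submit(r);
    }

    /* Reap completions, run the callbacks, then resubmit deferred requests. */
    void process_completions(void)
    {
        struct io_event ev[MAX_INFLIGHT];
        int i, n;

        if (!use_aio || !inflight)
            return;
        n = io_getevents(aio_ctx, 1, MAX_INFLIGHT, ev, NULL);
        for (i = 0; i < n; i++) {
            inflight--;
            finish(ev[i].data, (long) ev[i].res);
        }
        while (pending_head && inflight < MAX_INFLIGHT) {
            struct io_request *r = pending_head;
            if (!(pending_head = r->next))
                pending_tail = NULL;
            submit(r);
        }
    }

The memory limit from P8, per-device error counting (see F1 below) and
integration with the block cache sketched earlier are omitted here but would
be needed in the real code.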
Future work
-----------

F1. Perform a complete review of the error tracking so that device failures
are handled and reported more cleanly, extending the existing basic error
counting mechanism.

F2. Consider whether some of the nested callbacks can be eliminated, which
would allow for additional simplifications.

F3. Adjust the contents of the ad hoc context structs into more logical
arrangements and use them more widely.

F4. Perform wider refactoring of these areas of code.

Testing considerations
----------------------

T1. The changes touch code on the device path, so a thorough re-test of the
device layer is required. The new code needs a full audit down through the
library layer into the kernel to check that all the error conditions that
are currently implemented (such as EAGAIN) are handled sensibly. (LVM's I/O
layer needs to remain as solid as we can make it.)

T2. The current test suite provides a reasonably broad range of coverage of
this area but is far from comprehensive.

Acceptance criteria
-------------------

A1. The current test suite should pass to the same extent as before the
changes.

A2. When all debugging and logging is disabled, strace -c must show
improvements, e.g. the expected reduction in the number of reads.

A3. Running a range of commands under valgrind must not reveal any new leaks
due to the changes.

A4. All new coverity reports from the change must be addressed.

A5. CPU time should be similar to that before, as the same work is being
done overall, just in a different order.

A6. Tests need to show improved behaviour in targeted areas. For example, if
several devices are slow and time out, the delays should occur in parallel
and the elapsed time should be less than before.

Release considerations
----------------------

R1. Async I/O should be widely available and largely reliable on Linux
nowadays (even though parts of its interface and implementation remain a
matter of controversy), so we should try to make its use the default
wherever it is supported. If certain types of systems have problems we
should try to detect those cases and disable it automatically there.

R2. Because the implications of an unexpected problem in the new code could
be severe for the people affected, the roll-out needs to be gentle, without
a deadline, to allow us plenty of time to gain confidence in the new code.
Our own testing will only be able to cover a tiny fraction of the different
setups our users have, so we need to look out for problems caused by this
proactively and encourage people to test it on their own systems and report
back. It must go into the tree near the start of a release cycle rather than
at the end, to provide time for our confidence in it to grow.