xfs: document online file metadata repair code

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This commit is contained in:
Darrick J. Wong 2023-04-11 18:59:49 -07:00
parent 7fb8ccffd3
commit d697887193

View File

@ -2997,3 +2997,157 @@ The allocation group free block list (AGFL) is repaired as follows:
5. The next operation to fix the freelist will right-size the list.
See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
Inode Record Repairs
--------------------
Inode records must be handled carefully, because they have both ondisk records
("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is not
careful to access the ondisk metadata *only* when the ondisk metadata is so
badly damaged that the filesystem cannot load the in-memory representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.
The only repairs that should be made to the ondisk inode buffers are whatever
is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode fork
verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.
Once the in-memory representation is loaded, repair can lock the inode and can
subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are user-controlled
arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts is
more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.
The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.
Quota Record Repairs
--------------------
Similar to inodes, quota records ("dquots") also have both ondisk records and
an in-memory representation, and hence are subject to the same cache coherency
issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.
The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.
Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.
The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.
.. _fscounters:
Freezing to Fix Summary Counters
--------------------------------
Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode indexes,
but this is a slow process, so XFS maintains a copy in the ondisk superblock
that should reflect the ondisk metadata, at least when the filesystem has been
unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
superblock is being committed to disk.
The lazy superblock counter feature introduced in XFS v5 took this even further
by training log recovery to recompute the summary counters from the AG headers,
which eliminated the need for most transactions even to touch the superblock.
The only time XFS commits the summary counters is at filesystem unmount.
To reduce contention even further, the incore counter is implemented as a
percpu counter, which means that each CPU is allocated a batch of blocks from a
global incore counter and can satisfy small allocations from the local batch.
The high-performance nature of the summary counters makes it difficult for
online fsck to check them, since there is no way to quiesce a percpu counter
while the system is running.
Although online fsck can read the filesystem metadata to compute the correct
values of the summary counters, there's no way to hold the value of a percpu
counter stable, so it's quite possible that the counter will be out of date by
the time the walk is complete.
Earlier versions of online scrub would return to userspace with an incomplete
scan flag, but this is not a satisfying outcome for a system administrator.
For repairs, the in-memory counters must be stabilized while walking the
filesystem metadata to get an accurate reading and install it in the percpu
counter.
To satisfy this requirement, online fsck must prevent other programs in the
system from initiating new writes to the filesystem, it must disable background
garbage collection threads, and it must wait for existing writer programs to
exit the kernel.
Once that has been established, scrub can walk the AG free space indexes, the
inode btrees, and the realtime bitmap to compute the correct value of all
four summary counters.
This is very similar to a filesystem freeze, though not all of the pieces are
necessary:
- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
prevent other threads from thawing the filesystem, or other scrub threads
from initiating another fscounters freeze.
- It does not quiesce the log.
With this code in place, it is now possible to pause the filesystem for just
long enough to check and correct the summary counters.
+--------------------------------------------------------------------------+
| **Historical Sidebar**: |
+--------------------------------------------------------------------------+
| The initial implementation used the actual VFS filesystem freeze |
| mechanism to quiesce filesystem activity. |
| With the filesystem frozen, it is possible to resolve the counter values |
| with exact precision, but there are many problems with calling the VFS |
| methods directly: |
| |
| - Other programs can unfreeze the filesystem without our knowledge. |
| This leads to incorrect scan results and incorrect repairs. |
| |
| - Adding an extra lock to prevent others from thawing the filesystem |
| required the addition of a ``->freeze_super`` function to wrap |
| ``freeze_fs()``. |
| This in turn caused other subtle problems because it turns out that |
| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the |
| last reference to the VFS superblock, and any subsequent access |
| becomes a UAF bug! |
| This can happen if the filesystem is unmounted while the underlying |
| block device has frozen the filesystem. |
| This problem could be solved by grabbing extra references to the |
| superblock, but it felt suboptimal given the other inadequacies of |
| this approach. |
| |
| - The log need not be quiesced to check the summary counters, but a VFS |
| freeze initiates one anyway. |
| This adds unnecessary runtime to live fscounter fsck operations. |
| |
| - Quiescing the log means that XFS flushes the (possibly incorrect) |
| counters to disk as part of cleaning the log. |
| |
| - A bug in the VFS meant that freeze could complete even when |
| sync_filesystem fails to flush the filesystem and returns an error. |
| This bug was fixed in Linux 5.17. |
+--------------------------------------------------------------------------+
The proposed patchset is the
`summary counter cleanup
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
series.