Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix ascii-ci problems, then kill it [v2]
Last week, I was fiddling around with the metadump name obfuscation code
while writing a debugger command to generate directories full of names
that all have the same hash value. I had a few questions about how well
all that worked with ascii-ci mode, and discovered a nasty discrepancy
between the kernel and glibc's implementations of the tolower()
function.
I discovered that I could create a directory that is large enough to
require separate leaf index blocks. The hashes stored in the dabtree
use the ascii-ci specific hash function, which uses a library function
to convert the name to lowercase before hashing. If the kernel and C
library's versions of tolower do not behave exactly identically,
xfs_ascii_ci_hashname will not produce the same results for the same
inputs. xfs_repair will deem the leaf information corrupt and rebuild
the directory. After that, lookups in the kernel will fail because the
hash index doesn't work.
The kernel's tolower function will convert extended ascii uppercase
letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g.
a-with-umlaut), whereas glibc's will only do that if you force LANG to
ascii. Tiny embedded libc implementations just plain won't do it at
all, and the result is a mess. Stabilize the behavior of the hash
function by encoding the name transformation function in libxfs, add it
to the selftest, and fix all the userspace tools, none of which handle
this transformation correctly.
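As a rough illustration of why the kernel and the C library must agree
byte-for-byte, here is a minimal userspace sketch of a case-folding dirent
hash in the style of xfs_ascii_ci_hashname(); the fold() helper is a
stand-in, not the libxfs transformation:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define rol32(x, n)     (((x) << (n)) | ((x) >> (32 - (n))))

/* Stand-in byte fold; the real transformation lives in libxfs. */
static uint8_t fold(uint8_t c)
{
        if (c >= 'A' && c <= 'Z')
                return c - 'A' + 'a';
        return c;       /* extended bytes: exact behavior must be pinned down */
}

/*
 * Every byte is folded before being mixed into the hash, so fold() is
 * effectively part of the ondisk format: kernel and userspace must agree
 * for all 256 byte values, not just A-Z.
 */
static uint32_t ci_hashname(const uint8_t *name, size_t len)
{
        uint32_t hash = 0;

        for (size_t i = 0; i < len; i++)
                hash = fold(name[i]) ^ rol32(hash, 7);
        return hash;
}

int main(void)
{
        const char *name = "Motörhead";

        printf("0x%08x\n", ci_hashname((const uint8_t *)name, strlen(name)));
        return 0;
}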
The v1 series generated a /lot/ of discussion, in which several things
became very clear: (1) Linus is not enamored of case folding of any
kind; (2) Dave and Christoph don't seem to agree on whether the feature
is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8
encoded names if those happen to show up; and (4) I don't want to
maintain this mess any longer than I have to. Kill it in 2030.
v2: rename the functions to make it clear we're moving away from the
letters t, o, l, o, w, e, and r; and deprecate the whole feature once
we've fixed the bugs and added tests.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: strengthen rmapbt scrubbing [v24.5]
This series strengthens space allocation record cross referencing by
using AG block bitmaps to compute the difference between space used
according to the rmap records and the primary metadata, and reports
cross-referencing errors for any discrepancies.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: rework online fsck incore bitmap [v24.5]
In this series, we make some changes to the incore bitmap code: First,
we shorten the prefix to 'xbitmap'. Then, we rework some utility
functions for later use by online repair and clarify how the walk
functions are supposed to be used.
Finally, we use all these new pieces to convert the incore bitmap to use
an interval tree instead of linked lists. This lifts the limitation
that callers had to be careful not to set a range that was already set;
and gets us ready for the btree rebuilder functions needing to be able
to set bits in a bitmap and generate maximal contiguous extents for the
set ranges.
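For intuition, a tiny array-backed model of the "set a range that may
already be partially set" semantics the interval-tree conversion provides
(illustrative only, not the xbitmap code):

#include <stdint.h>
#include <stdio.h>

/* Illustrative model: a small set of disjoint [start, last] ranges. */
struct range { uint64_t start, last; };

static struct range set[16];
static int nranges;

/* Set [start, last], absorbing anything it overlaps or touches. */
static void bitmap_set_range(uint64_t start, uint64_t last)
{
        struct range out[16];
        int n = 0;

        for (int i = 0; i < nranges; i++) {
                if (set[i].last + 1 < start || set[i].start > last + 1) {
                        out[n++] = set[i];      /* disjoint: keep as-is */
                } else {
                        /* overlapping or adjacent: merge into the new range */
                        if (set[i].start < start)
                                start = set[i].start;
                        if (set[i].last > last)
                                last = set[i].last;
                }
        }
        out[n].start = start;
        out[n].last = last;
        nranges = n + 1;
        for (int i = 0; i < nranges; i++)
                set[i] = out[i];
}

int main(void)
{
        bitmap_set_range(10, 19);
        bitmap_set_range(15, 30);       /* overlaps the first range: fine */
        for (int i = 0; i < nranges; i++)
                printf("[%llu, %llu]\n", (unsigned long long)set[i].start,
                                (unsigned long long)set[i].last);
        return 0;
}

Because touching ranges are merged at insert time, maximal contiguous
extents fall out of the structure for free.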
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: clean up memory management in xattr scrub [v24.5]
Currently, the extended attribute scrubber uses a single VLA to store
all the context information needed in various parts of the scrubber
code. This includes xattr leaf block space usage bitmaps, and the value
buffer used to check the correctness of remote xattr value block
headers. We try to minimize the insanity through the use of helper
functions, but this is a memory management nightmare. Clean this up by
making the bitmap and value pointers explicit members of struct
xchk_xattr_buf.
Second, strengthen the xattr checking by teaching it to look for overlapping
data structures in the shortform attr data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect mergeable and overlapping btree records [v24.5]
While I was doing differential fuzz analysis between xfs_scrub and
xfs_repair, I noticed that xfs_repair was only partially effective at
detecting btree records that can be merged, and xfs_scrub totally didn't
notice at all.
For every interval btree type except for the bmbt, there should never
exist two adjacent records with adjacent keyspaces because the
blockcount field is always large enough to span the entire keyspace of
the domain. This is because the free space, rmap, and refcount btrees
have a blockcount field large enough to store the maximum AG length, and
there can never be an allocation larger than an AG.
The bmbt is a different story due to its ondisk encoding where the
blockcount is only 21 bits wide. Because AGs can span up to 2^31 blocks
and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
blocks will be expressed as two records of 2^21 length. We don't
opportunistically combine records when doing bmbt operations, which is
why the fsck tools have never complained about this scenario.
Offline repair is partially effective at detecting mergeable records
because I taught it to do that for the rmap and refcount btrees. This
series enhances the free space, rmap, and refcount scrubbers to detect
mergeable records. For the bmbt, it will flag the file as being
eligible for an optimization to shrink the size of the data structure.
The last patch in this set also enhances the rmap scrubber to detect
records that overlap incorrectly. This check is done automatically for
non-overlapping btree types, but we have to do it separately for the
rmapbt because there are constraints on which allocation types are
allowed to overlap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: merge bmap records for faster scrubs [v24.5]
I started looking into performance problems with the data fork scrubber
in generic/333, and noticed a few things that needed improving. First,
due to design reasons, it's possible for file fork btrees to contain
multiple contiguous mappings to the same physical space. Instead of
checking each ondisk mapping individually, it's much faster to combine
them when possible and check the combined mapping because that's fewer
trips through the rmap btree, and we can drop this check-around
behavior that it does when an rmapbt lookup produces a record that
starts before or ends after a particular bmbt mapping.
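A minimal sketch of the merging rule, using a simplified mapping record
(field names illustrative, not struct xfs_bmbt_irec):

#include <stdbool.h>
#include <stdint.h>

/* Simplified file mapping: [fileoff, fileoff + len) mapped to startblock. */
struct mapping {
        uint64_t        fileoff;        /* file offset, in blocks */
        uint64_t        startblock;     /* physical start block */
        uint64_t        len;            /* length in blocks */
        bool            unwritten;      /* unwritten extent state */
};

/*
 * Two adjacent ondisk mappings can be cross-referenced as one combined
 * mapping if they are contiguous in both file space and physical space
 * and carry the same unwritten state.
 */
static bool mappings_can_merge(const struct mapping *a, const struct mapping *b)
{
        return a->fileoff + a->len == b->fileoff &&
               a->startblock + a->len == b->startblock &&
               a->unwritten == b->unwritten;
}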
Second, I noticed that the bmbt scrubber decides to walk every reverse
mapping in the filesystem if the file fork is in btree format. This is
very costly, and only necessary if the inode repair code had to zap a
fork to convince iget to work. Constraining the full-rmap scan to this
one case means we can skip it for normal files, which drives the runtime
of this test from 8 hours down to 45 minutes (observed with realtime
reflink and rebuild-all mode.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix iget/irele usage in online fsck [v24.5]
This patchset fixes a handful of problems relating to how we get and
release incore inodes in the online scrub code. The first patch fixes
how we handle DONTCACHE -- our reasons for setting (or clearing it)
depend entirely on the runtime environment at irele time. Hence we can
refactor iget and irele to use our own wrappers that set that context
appropriately.
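A hedged sketch of what such an irele wrapper might look like (the real
scrub wrapper may differ in detail):

/*
 * Sketch of a scrub-specific irele: whether the inode should stay cached
 * is decided at release time, based on the runtime context.
 */
static void xchk_irele_sketch(struct xfs_scrub *sc, struct xfs_inode *ip)
{
        if (sc->tp) {
                /*
                 * We hold a transaction; keep the inode cached so that
                 * releasing it here cannot kick off teardown work that
                 * would itself need a transaction.
                 */
                spin_lock(&VFS_I(ip)->i_lock);
                VFS_I(ip)->i_state &= ~I_DONTCACHE;
                spin_unlock(&VFS_I(ip)->i_lock);
        } else {
                /* No transaction: let this inode fall out of cache soon. */
                d_mark_dontcache(VFS_I(ip));
        }

        xfs_irele(ip);
}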
The second patch fixes a race between the iget call in the inode core
scrubber and other writer threads that are allocating or freeing inodes
in the same AG by changing the behavior of xchk_iget (and the inode core
scrub setup function) to return either an incore inode or the AGI buffer
so that we can be sure that the inode cannot disappear on us.
The final patch elides MMAPLOCK from scrub paths when possible. It did
not fit anywhere else.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix bugs in parent pointer checking [v24.5]
Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
subdirectory that is being moved from one parent to another. Upon
deeper analysis, I realized that this was the source of a very hard to
trigger false corruption report in the parent pointer checking code.
Now that we've refactored how directory walks work in scrub, we can also
get rid of all the unnecessary and broken locking to make parent pointer
scrubbing work properly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix iget usage in directory scrub [v24.5]
In this series, we fix some problems with how the directory scrubber
grabs child inodes. First, we want to reduce EDEADLOCK returns by
replacing fixed-iteration loops with interruptible trylock loops.
Second, we add UNTRUSTED to the child iget call so that we can detect a
dirent that points to an unallocated inode. Third, we fix a bug where
we weren't checking the inode pointed to by dotdot entries at all.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in rmap btree [v24.5]
Following in the theme of the last two patchsets, this one strengthens
the rmap btree record checking so that scrub can count the number of
space records that map to a given owner and that do not map to a given
owner. This enables us to determine exclusive ownership of space that
can't be shared.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in inode btree [v24.5]
This series continues the corrections for a couple of problems I found
in the inode btree scrubber. The first problem is that we don't
directly check that the inobt records correspond with the
finobt records, and vice versa. The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in refcount btree [v24.5]
The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber. If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree. If
we find any, then the refcount btree is not consistent. This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.
To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records. The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.
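To illustrate the masking idea with a toy key type (not the real btree key
unions), a comparison that honors only the fields selected by a mask might
look like this:

#include <stdint.h>

/* Toy rmap-style key: physical start, then owner/offset tiebreakers. */
struct toy_rmap_key {
        uint32_t        startblock;
        uint64_t        owner;
        uint64_t        offset;
};

/*
 * Compare two keys, but only in the fields selected by @mask (a nonzero
 * field in @mask means "this field participates").  A keyspace scan of
 * the rmapbt for pure space usage passes a mask with only startblock
 * set, so records with different owners still cover the same space.
 */
static int toy_cmp_masked(const struct toy_rmap_key *a,
                          const struct toy_rmap_key *b,
                          const struct toy_rmap_key *mask)
{
        if (mask->startblock && a->startblock != b->startblock)
                return a->startblock < b->startblock ? -1 : 1;
        if (mask->owner && a->owner != b->owner)
                return a->owner < b->owner ? -1 : 1;
        if (mask->offset && a->offset != b->offset)
                return a->offset < b->offset ? -1 : 1;
        return 0;
}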
The final patch closes the scanning gap in the refcountbt scanner.
v23.1: create helpers for the key extraction and comparison functions,
improve documentation, and eliminate the ->mask_key indirect
calls
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: enhance btree key scrubbing [v24.5]
This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix rmap btree key flag handling [v24.5]
This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups. The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.
The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups. The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.
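Conceptually, the record-to-key conversion keeps the flags that define
separate keyspaces and strips the per-record attribute, along these lines
(a sketch with toy flag names, not the ondisk macros):

#include <stdint.h>

/* Toy encoding: flags live in the top bits of the offset field. */
#define TOY_OFF_ATTR_FORK       (1ULL << 63)    /* part of the lookup key */
#define TOY_OFF_BMBT_BLOCK      (1ULL << 62)    /* part of the lookup key */
#define TOY_OFF_UNWRITTEN       (1ULL << 61)    /* per-record attribute only */

/*
 * Deriving a key from a record keeps ATTR/BMBT (they distinguish
 * otherwise identical owner/offset pairs) but strips UNWRITTEN, which
 * describes the record and must never leak into key comparisons.
 */
static inline uint64_t toy_rmap_key_offset(uint64_t rec_offset)
{
        return rec_offset & ~TOY_OFF_UNWRITTEN;
}

/* Key comparisons likewise mask out only UNWRITTEN, never ATTR/BMBT. */
static inline int toy_rmap_offset_cmp(uint64_t a, uint64_t b)
{
        a &= ~TOY_OFF_UNWRITTEN;
        b &= ~TOY_OFF_UNWRITTEN;
        return a < b ? -1 : (a > b ? 1 : 0);
}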
This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock. Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.
A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record. This was what doomed my previous
attempt to correct this problem in 2019.
A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.
With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: hoist scrub record checks into libxfs [v24.5]
There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't. Move these bits into libxfs so that
everyone can benefit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: standardize btree record checking code [v24.5]
While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform. In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.
Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record. For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: drain deferred work items when scrubbing [v24.5]
The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata. In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and propose a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.
This patchset implements that scrub drain. The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running. This last paradigm repeats elsewhere in this megaseries.
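A sketch of the jump-label pattern described above (key, structure, and
hook names are illustrative, not the ones added by the series):

#include <linux/jump_label.h>
#include <linux/atomic.h>
#include <linux/wait.h>

/* Patched out to a plain fall-through unless scrub has armed its hooks. */
DEFINE_STATIC_KEY_FALSE(example_scrub_hooks_on);

struct example_drain {
        atomic_t                pending;        /* intent items in flight */
        wait_queue_head_t       wait;           /* scrub sleeps here */
};

/* Deferred-work side: the counter is always maintained... */
static inline void example_intent_hold(struct example_drain *dr)
{
        atomic_inc(&dr->pending);
}

static inline void example_intent_rele(struct example_drain *dr)
{
        /* ...but the scrub-only wakeup costs almost nothing while idle. */
        if (atomic_dec_and_test(&dr->pending) &&
            static_branch_unlikely(&example_scrub_hooks_on))
                wake_up(&dr->wait);
}

/* Scrub arms the hooks only for the duration of a scrub. */
static inline void example_scrub_hooks_enable(void)
{
        static_branch_inc(&example_scrub_hooks_on);
}

static inline void example_scrub_hooks_disable(void)
{
        static_branch_dec(&example_scrub_hooks_on);
}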
v23.1: make intent items take an active ref to the perag structure and
document why we bump and drop the intent counts when we do
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs_scrub: fix licensing and copyright notices [v24.5]
Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development. The most
egregious parts are the files that are missing license information
completely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: pass perag references around when possible [v24.5]
Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: make intent items take a perag reference [v24.5]
Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
syzbot detected a crash during log recovery:
XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074
CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
print_address_description+0x74/0x340 mm/kasan/report.c:306
print_report+0x107/0x1f0 mm/kasan/report.c:417
kasan_report+0xcd/0x100 mm/kasan/report.c:517
xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
get_tree_bdev+0x400/0x620 fs/super.c:1282
vfs_get_tree+0x88/0x270 fs/super.c:1489
do_new_mount+0x289/0xad0 fs/namespace.c:3145
do_mount fs/namespace.c:3488 [inline]
__do_sys_mount fs/namespace.c:3697 [inline]
__se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
</TASK>
The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF. The ondisk AGF has a higher LSN than the recovered log
item. xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.
Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:
error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
buf_f->blf_len, buf_flags, &bp, NULL);
Skipping the buffer leaves its contents in memory unverified. This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32. This is
impossible so we run off the end of the cursor object and crash.
Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt. It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.
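The shape of that mitigation, sketched as a fragment of the buffer-skip
path (error handling and tracing omitted; details may differ from the
final patch):

	/*
	 * We decided not to replay this buffer log item because the ondisk
	 * LSN is newer.  The write verifier will never run for this buffer,
	 * so run the read verifier now and fail recovery if the ondisk
	 * contents don't pass muster.
	 */
	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
		if (bp->b_ops) {
			bp->b_ops->verify_read(bp);
			error = bp->b_error;
		}
		goto out_release;
	}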
Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
While fuzzing the data fork extent count on a btree-format directory
with xfs/375, I observed the following (excerpted) splat:
XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
------------[ cut here ]------------
WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The cause of this is a race condition in xfs_ilock_data_map_shared,
which performs an unlocked access to the data fork to guess which lock
mode it needs:
Thread 0                            Thread 1

                                    xfs_need_iread_extents
                                    <observe no iext tree>
                                    xfs_ilock(..., ILOCK_EXCL)
                                    xfs_iread_extents
                                    <observe no iext tree>
                                    <check ILOCK_EXCL>
                                    <load bmbt extents into iext>
                                    <notice iext size doesn't
                                     match nextents>
xfs_need_iread_extents
<observe iext tree>
xfs_ilock(..., ILOCK_SHARED)
                                    <tear down iext tree>
                                    xfs_iunlock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
*BOOM*
Fix this race by adding a flag to the xfs_ifork structure to indicate
that we have not yet read in the extent records and changing the
predicate to look at the flag state, not if_height. The memory barrier
ensures that the flag will not be set until the very end of the
function.
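A sketch of the flag-plus-barrier idea (structure and field names
illustrative):

struct example_ifork {
        int             if_height;
        /* nonzero until the extent records have been read into the iext tree */
        uint8_t         if_needextents;
};

static inline bool example_need_iread_extents(struct example_ifork *ifp)
{
        /* acquire: pairs with the release below */
        return smp_load_acquire(&ifp->if_needextents) != 0;
}

/* Called at the very end of extent reading, after the iext tree is built. */
static inline void example_extents_loaded(struct example_ifork *ifp)
{
        /* release: nobody can see the flag clear before the tree exists */
        smp_store_release(&ifp->if_needextents, 0);
}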
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
It just creates unnecessary bot noise these days.
Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In commit fe08cc5044 we reworked the valid superblock version
checks. If it is a V5 filesystem, it is always valid; then we checked
whether the version was less than V4 (reject) and then checked
feature fields in the V4 flags to determine if it was valid.
What we missed was that if the version is not V4 at this point, we
should reject the fs, i.e. the check currently treats V6+
filesystems as if they were V4 filesystems. Fix this.
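A sketch of the shape of the fix in the superblock version check
(simplified fragment):

	/* All V5 filesystems are supported. */
	if (xfs_sb_is_v5(sbp))
		return true;

	/*
	 * Anything that is not exactly V4 at this point (V1-V3, or a
	 * hypothetical V6+) must be rejected, rather than only rejecting
	 * versions below V4.
	 */
	if (XFS_SB_VERSION_NUM(sbp) != XFS_SB_VERSION_4)
		return false;

	/* ... V4 feature flag checks continue below ... */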
cc: stable@vger.kernel.org
Fixes: fe08cc5044 ("xfs: open code sb verifier feature checks")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This feature is a mess -- the hash function has been broken for the
entire 15 years of its existence if you create names with extended ascii
bytes; metadump name obfuscation has silently failed for just as long;
and the feature clashes horribly with the UTF8 encodings that most
systems use today. There is exactly one fstest for this feature.
In other words, this feature is crap. Let's deprecate it now so we can
remove it from the codebase in 2030.
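A sketch of what the mount-time deprecation warning could look like
(wording and helper usage illustrative):

	/* Nag once per mount while the deprecated feature is in use. */
	if (xfs_has_asciici(mp))
		xfs_warn(mp,
	"ascii-ci (case-insensitive) filesystems are deprecated and will be removed in 2030");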
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've made kernel and userspace use the same tolower code for
computing directory index hashes, add that to the selftest code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Back in the old days, the "ascii-ci" feature was created to implement
case-insensitive directory entry lookups for latin1-encoded names and
remove the large overhead of Samba's case-insensitive lookup code. UTF8
names were not allowed, but nobody explicitly wrote in the documentation
that this was only expected to work if the system used latin1 names.
The kernel tolower function was selected to prepare names for hashed
lookups.
There's a major discrepancy in the function that computes directory entry
hashes for filesystems that have ASCII case-insensitive lookups enabled.
The root of this is that the kernel and glibc's tolower implementations
have differing behavior for extended ASCII accented characters. I wrote
a program to spit out characters for which the tolower() return value is
different from the input:
glibc tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z
kernel tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z 192:À 193:Á
194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í
206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù
218:Ú 219:Û 220:Ü 221:Ý 222:Þ
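The lists above can be generated with a probe along these lines (a sketch;
the original program is not part of this commit):

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/*
 * Print every byte value that tolower() maps to something other than
 * itself.  Under glibc the answer depends on the active locale (hence
 * the LANG remark above); wiring the same loop up to the kernel's
 * ctype tables produces the longer list.
 */
int main(void)
{
        setlocale(LC_ALL, "");

        for (int c = 1; c < 256; c++) {
                if (tolower(c) != c)
                        printf("%d:%c ", c, c);
        }
        printf("\n");
        return 0;
}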
This means that the kernel and userspace do not agree on the hash value
for a directory filename that contains those higher values. The hash
values are written into the leaf index block of directories that are
larger than two blocks in size, which means that xfs_repair will flag
these directories as having corrupted hash indexes and rewrite the index
with hash values that the kernel now will not recognize.
Because the ascii-ci feature is not frequently enabled and the kernel
touches filesystems far more frequently than xfs_repair does, fix this
by encoding the kernel's toupper predicate and tolower functions into
libxfs. Give the new functions less provocative names to make it really
obvious that this is a pre-hash name preparation function, and nothing
else. This change makes userspace's behavior consistent with the
kernel.
Found by auditing obfuscate_name in xfs_metadump as part of working on
parent pointers, wondering how it could possibly work correctly with ci
filesystems, writing a test tool to create a directory with
hash-colliding names, and watching xfs_repair flag it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Strengthen the rmap btree record checker a little more by comparing
OWN_REFCBT reverse mappings against the refcount btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Strengthen the rmap btree record checker a little more by comparing
OWN_INOBT reverse mappings against the inode btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Strengthen the rmap btree record checker a little more by comparing
OWN_AG reverse mappings against the free space btrees, the rmap btree,
and the AGFL.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
logs, respectively.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a typechecked bitmap for extents within an AG. Online repair
uses bitmaps to store various different types of numbers, so let's make
it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t)
versus anything else.
In subsequent patches, we're going to use agblock bitmaps to enhance the
rmapbt checker to look for discrepancies between the rmapbt records and
AG metadata block usage.
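The type safety amounts to a thin wrapper, roughly like this (names
illustrative, layered over the xbitmap API discussed earlier):

/* Sketch: a bitmap that can only be indexed by AG block numbers. */
struct example_agb_bitmap {
        struct xbitmap  agbitmap;
};

static inline int
example_agb_bitmap_set(
        struct example_agb_bitmap       *bitmap,
        xfs_agblock_t                   start,
        xfs_extlen_t                    len)
{
        /*
         * A distinct struct type keeps AG block bitmaps from being mixed
         * up with bitmaps that store other kinds of block numbers.
         */
        return xbitmap_set(&bitmap->agbitmap, start, len);
}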
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of coding required to handle the disunion
operation and in the future will make it easier to set bits in arbitrary
order yet later be able to extract maximally sized extents, which we'll
need for rebuilding certain structures. We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.
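In current kernels such a structure can be generated from
interval_tree_generic.h; a sketch of a 64-bit-indexed node (names
illustrative):

#include <linux/interval_tree_generic.h>
#include <linux/rbtree.h>
#include <linux/types.h>

/* One set range [bn_start, bn_last] in an rbtree-backed interval tree. */
struct example_bitmap_node {
        struct rb_node  bn_rbnode;
        u64             bn_start;
        u64             bn_last;
        u64             __bn_subtree_last;      /* maintained by the tree code */
};

/* The index type is u64 regardless of the host word size. */
#define START(node)     ((node)->bn_start)
#define LAST(node)      ((node)->bn_last)

INTERVAL_TREE_DEFINE(struct example_bitmap_node, bn_rbnode, u64,
                __bn_subtree_last, START, LAST, static inline,
                example_bitmap_tree)

The macro generates the insert, remove, and iteration helpers
(example_bitmap_tree_insert() and friends) used to walk set ranges.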
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent. None of the existing callers actually need
that ability anyway, so drop the safe variable.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Local extended attributes store their values within the same leaf block.
There's no header for the values themselves, nor are they separately
checksummed. Hence we can save a bit of time in the attr scrubber by
not wasting time retrieving the values.
Regrettably, shortform attributes do not set XFS_ATTR_LOCAL so this
offers us no advantage there, but at least there are very few attrs in
that case.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the for_each_xbitmap_ macros in favor of proper iterator
functions. We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The free space bitmap is only required if we're going to check the
bestfree space at the end of an xattr leaf block. Therefore, we can
reduce the memory requirements of this scrubber if we can determine that
the xattr is in short format.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Clean up local variable initialization and error returns in xchk_xattr.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Make sure that the records used inside a shortform xattr structure do
not overlap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr,
since we only need to set up the leaf block bitmaps once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
All callers pass XCHK_GFP_FLAGS as the flags argument to
xchk_setup_xattr_buf, so get rid of the argument.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer. This is the start of removing the complex overloaded
memory buffer that is the source of weird memory misuse bugs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace bitmap_and with bitmap_intersects in the xattr leaf block
scrubber, since we only care if there's overlap between the used space
bitmap and the free space bitmap. This means we don't need dstmap any
more, and can thus reduce the memory requirements.
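Roughly the before and after (a sketch; bitmap pointer names
illustrative):

	/* Before (sketch): compute an intersection bitmap just to test it. */
	bitmap_and(ab->dstmap, ab->usedmap, ab->freemap, mapsize);
	if (!bitmap_empty(ab->dstmap, mapsize))
		corrupt = true;

	/* After (sketch): ask the question directly, no third bitmap needed. */
	if (bitmap_intersects(ab->usedmap, ab->freemap, mapsize))
		corrupt = true;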
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Don't shadow the leaf variable here, because it's misleading to have one
place in the codebase where two variables with different types have the
same name.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check that each extended attribute exists in only one namespace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Enhance the rmap scrubber to flag adjacent records that could be merged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The rmap btree scrubber doesn't contain sufficient checking for records
that cannot overlap but do anyway. For the other btrees, this is
enforced by the inorder checks in xchk_btree_rec, but the rmap btree is
special because it allows overlapping records to handle shared data
extents.
Therefore, enhance the rmap btree record check function to compare each
record against the previous one so that we can detect overlapping rmap
records for space allocations that do not allow sharing.
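The core of the check is a comparison of each record against the previous
one, along these lines (a simplified sketch, not the scrub function
itself):

#include <stdbool.h>
#include <stdint.h>

/* Simplified rmap record: [startblock, startblock + blockcount), owned by owner. */
struct toy_rmap {
        uint32_t        startblock;
        uint32_t        blockcount;
        uint64_t        owner;
        bool            shareable;      /* file data that may be reflinked */
};

/*
 * Records arrive in key order, so overlap is detected by comparing each
 * record's start against the end of the previous record.  Overlap is
 * legal only when both records describe shareable file data.
 */
static bool toy_rmap_records_ok(const struct toy_rmap *prev,
                                const struct toy_rmap *cur)
{
        bool overlap = prev->startblock + prev->blockcount > cur->startblock;

        if (!overlap)
                return true;
        return prev->shareable && cur->shareable;
}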
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>