IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This series continues the corrections for a couple of problems I found
in the inode btree scrubber. The first problem is that we don't
directly check the inobt records have a direct correspondence with the
finobt records, and vice versa. The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdowAKCRBKO3ySh0YR
pt2WAQDcHbg0JDyGDcTiSyYqTlT2xxzeaxtMRg75fWYpIRa2dQEAuatGejdp56in
AbH6jSmtS9f4M0wcy5JhHyHzZdZjcgc=
=1G5P
-----END PGP SIGNATURE-----
Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in inode btree [v24.5]
This series continues the corrections for a couple of problems I found
in the inode btree scrubber. The first problem is that we don't
directly check the inobt records have a direct correspondence with the
finobt records, and vice versa. The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber. If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree. If
we find any, then the refcount btree is not consistent. This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.
To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records. The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.
The final patch closes the scanning gap in the refcountbt scanner.
v23.1: create helpers for the key extraction and comparison functions,
improve documentation, and eliminate the ->mask_key indirect
calls
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdogAKCRBKO3ySh0YR
pjDDAQC88qzAvA3j2JP8ZC9mnK89LsYpkOEX2i6HV2m4LWYdWgD/fWdGnp0BFoQj
is+V82X6oRhWi8SRnjOX28Mk8gCdDA8=
=fzga
-----END PGP SIGNATURE-----
Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in refcount btree [v24.5]
The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber. If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree. If
we find any, then the refcount btree is not consistent. This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.
To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records. The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.
The final patch closes the scanning gap in the refcountbt scanner.
v23.1: create helpers for the key extraction and comparison functions,
improve documentation, and eliminate the ->mask_key indirect
calls
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdogAKCRBKO3ySh0YR
pszmAP94RPiNPhUac1Eh6cCOGrt7Cor84dyZmWugqDI+nGo5JAD/Qfy5gFwAK7WV
vQZZPSf/AdLTEEJFA11NVqBW0TukWgE=
=xIq/
-----END PGP SIGNATURE-----
Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: enhance btree key scrubbing [v24.5]
This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups. The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.
The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups. The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.
This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock. Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.
A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record. This was what doomed my previous
attempt to correct this problem in 2019.
A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.
With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdoQAKCRBKO3ySh0YR
prUmAP9WiaLPxeMAnQiQcaZyqyAhaiqbwNoLkDMx0+1+SKDPCwD7BU6tPQpT039i
mrDag3g2x4N7g/e89N29SQp8EDGuQQQ=
=Chkt
-----END PGP SIGNATURE-----
Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix rmap btree key flag handling [v24.5]
This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups. The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.
The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups. The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.
This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock. Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.
A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record. This was what doomed my previous
attempt to correct this problem in 2019.
A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.
With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't. Move these bits into libxfs so that
everyone can benefit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdoAAKCRBKO3ySh0YR
pvbPAP9zGtY7B15ORWk9wcHELUoPgDhNZR39ye7MfxWNCBZJxgD6A8SzZpbZc5Gh
9a1/ImUDZ0ekFnAdx0dVRA+gnrO4Vwo=
=197l
-----END PGP SIGNATURE-----
Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: hoist scrub record checks into libxfs [v24.5]
There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't. Move these bits into libxfs so that
everyone can benefit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform. In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.
Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record. For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnwAKCRBKO3ySh0YR
ppn6AQCOWjqsq7klLAQdvEDm3O8v4k94geKdn4Ruvbptwa2iUQD/WAJ5LwKnEPuQ
+eB5AfzsziMQMNX7DtUwncaDJm1RBgY=
=ys9Z
-----END PGP SIGNATURE-----
Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: standardize btree record checking code [v24.5]
While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform. In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.
Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record. For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata. In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and proposes a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.
This patchset implements that scrub drain. The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running. This last paradigm repeats elsewhere in this megaseries.
v23.1: make intent items take an active ref to the perag structure and
document why we bump and drop the intent counts when we do
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnwAKCRBKO3ySh0YR
poQmAQDAu0YNxoRGok7H/RGfQQHWBReSkLXT9RKGzjWn4G51EQD8DA/CpuqsC3yU
uJ55vGAb8jSCBFJITVF1/i8B9sfpngw=
=Nz0X
-----END PGP SIGNATURE-----
Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: drain deferred work items when scrubbing [v24.5]
The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata. In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and proposes a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.
This patchset implements that scrub drain. The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running. This last paradigm repeats elsewhere in this megaseries.
v23.1: make intent items take an active ref to the perag structure and
document why we bump and drop the intent counts when we do
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development. The most
egregious parts are the files that are missing license information
completely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdngAKCRBKO3ySh0YR
pjGfAP9QqnMMXWlre680TuB1pWXKzHviCdKHlxTeJtYTAycNoAEAiA0JMyahF4kT
p8EEGUvw0cg0RcZA4PUF/vCsxvCJfws=
=NYxO
-----END PGP SIGNATURE-----
Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs_scrub: fix licensing and copyright notices [v24.5]
Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development. The most
egregious parts are the files that are missing license information
completely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnQAKCRBKO3ySh0YR
po/YAPsEFygm4/bQZBtOf0HFmVDtTXYAEujJeiXKbmEqzlMxpQEAhuCqFaTQ+Pnr
zpg1egeIcaw6dNTW4f2slcATaQgG0gM=
=8HsC
-----END PGP SIGNATURE-----
Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: pass perag references around when possible [v24.5]
Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnAAKCRBKO3ySh0YR
poBzAP9+tx/LNTZeLtmjj/d7tVLMm2/f8LPyhDmkF85JWnjknwEAnLQxkqRMfF9i
ah3ACAZ30o+Mp7Qe6tnYVIdOSD2xCAM=
=mRAy
-----END PGP SIGNATURE-----
Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: make intent items take a perag reference [v24.5]
Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
syzbot detected a crash during log recovery:
XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074
CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
print_address_description+0x74/0x340 mm/kasan/report.c:306
print_report+0x107/0x1f0 mm/kasan/report.c:417
kasan_report+0xcd/0x100 mm/kasan/report.c:517
xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
get_tree_bdev+0x400/0x620 fs/super.c:1282
vfs_get_tree+0x88/0x270 fs/super.c:1489
do_new_mount+0x289/0xad0 fs/namespace.c:3145
do_mount fs/namespace.c:3488 [inline]
__do_sys_mount fs/namespace.c:3697 [inline]
__se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
</TASK>
The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF. The ondisk AGF has a higher LSN than the recovered log
item. xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.
Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:
error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
buf_f->blf_len, buf_flags, &bp, NULL);
Skipping the buffer leaves its contents in memory unverified. This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32. This is
impossible so we run off the end of the cursor object and crash.
Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt. It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.
Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
While fuzzing the data fork extent count on a btree-format directory
with xfs/375, I observed the following (excerpted) splat:
XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
------------[ cut here ]------------
WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The cause of this is a race condition in xfs_ilock_data_map_shared,
which performs an unlocked access to the data fork to guess which lock
mode it needs:
Thread 0 Thread 1
xfs_need_iread_extents
<observe no iext tree>
xfs_ilock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
<load bmbt extents into iext>
<notice iext size doesn't
match nextents>
xfs_need_iread_extents
<observe iext tree>
xfs_ilock(..., ILOCK_SHARED)
<tear down iext tree>
xfs_iunlock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
*BOOM*
Fix this race by adding a flag to the xfs_ifork structure to indicate
that we have not yet read in the extent records and changing the
predicate to look at the flag state, not if_height. The memory barrier
ensures that the flag will not be set until the very end of the
function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
It just creates unnecessary bot noise these days.
Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In commit fe08cc5044 we reworked the valid superblock version
checks. If it is a V5 filesystem, it is always valid, then we
checked if the version was less than V4 (reject) and then checked
feature fields in the V4 flags to determine if it was valid.
What we missed was that if the version is not V4 at this point,
we shoudl reject the fs. i.e. the check current treats V6+
filesystems as if it was a v4 filesystem. Fix this.
cc: stable@vger.kernel.org
Fixes: fe08cc5044 ("xfs: open code sb verifier feature checks")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.
Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result. On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state. While we're at it, collapse trivial
helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer. There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents. The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings. We don't
currently check this, so make the refcount scrubber more complete.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
For keyspace fullness scans, we want to be able to mask off the parts of
the key that we don't care about. For most btree types we /do/ want the
full keyspace, but for checking that a given space usage also has a full
complement of rmapbt records (even if different/multiple owners) we need
this masking so that we only track sparseness of rm_startblock, not the
whole keyspace (which is extremely sparse).
Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
pass this through. This third "mask" argument should contain a nonzero
value in each structure field that should be used in the key comparisons
done during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range. Unfortunately, that's not
sufficient for scrub. We want to be able to tell if a range of keyspace
for a btree is devoid of records, is totally mapped to records, or is
somewhere in between. By forcing this to be a boolean, we conflated
sparseness and fullness, which caused scrub to return incorrect results.
Fix the API so that we can tell the caller which of those three is the
current state.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior. We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In commit d47fef9342, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary
because we could use the record index in the btree cursor. This is
incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.
The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling. Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.
Fixes: d47fef9342 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We keep doing these conversions to support btree queries, so refactor
this into a helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In the last patch, we changed the rmapbt code to remove the UNWRITTEN
bit when creating an rmapbt key from an rmapbt record, and we changed
the rmapbt key comparison code to start considering the ATTR and BMBT
flags during lookup. This brought the behavior of the rmapbt
implementation in line with its specification.
However, there may exist filesystems that have the unwritten bit still
set in the rmapbt keys. We should detect these situations and flag the
rmapbt as one that would benefit from optimization. Eventually, online
repair will be able to do something in response to this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Keys for extent interval records in the reverse mapping btree are
supposed to be computed as follows:
(physical block, owner, fork, is_btree, offset)
This provides users the ability to look up a reverse mapping from a file
block mapping record -- start with the physical block; then if there are
multiple records for the same block, move on to the owner; then the
inode fork type; and so on to the file offset.
Unfortunately, the code that creates rmap lookup keys from rmap records
forgot to mask off the record attribute flags, leading to ondisk keys
that look like this:
(physical block, owner, fork, is_btree, unwritten state, offset)
Fortunately, this has all worked ok for the past six years because the
key comparison functions incorrectly ignore the fork/bmbt/unwritten
information that's encoded in the on-disk offset. This means that
lookup comparisons are only done with:
(physical block, owner, offset)
Queries can (theoretically) return incorrect results because of this
omission. On consistent filesystems this isn't an issue because xattr
and bmbt blocks cannot be shared and hence the comparisons succeed
purely on the contents of the rm_startblock field. For the one case
where we support sharing (written data fork blocks) all flag bits are
zero, so the omission in the comparison has no ill effects.
Unfortunately, this bug prevents scrub from detecting incorrect fork and
bmbt flag bits in the rmap btree, so we really do need to fix the
compare code. Old filesystems with the unwritten bit erroneously set in
the rmap key struct will work fine on new kernels since we still ignore
the unwritten bit. New filesystems on older kernels will work fine
since the old kernels never paid attention to the unwritten bit.
A previous version of this patch forgot to keep the (un)written state
flag masked during the comparison and caused a major regression in
5.9.x since unwritten extent conversion can update an rmap record
without requiring key updates.
Note that blocks cannot go directly from data fork to attr fork without
being deallocated and reallocated, nor can they be added to or removed
from a bmbt without a free/alloc cycle, so this should not cause any
regressions.
Found by fuzzing keys[1].attrfork = ones on xfs/371.
Fixes: 4b8ed67794 ("xfs: add rmap btree operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the inobt record alignment checks from xchk_iallocbt_rec into
xfs_inobt_check_irec so that they are applied everywhere.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Similar to what we've just done for the other btrees, create a function
to log corrupt bmbt records and call it whenever we encounter a bad
record in the ondisk btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
For every btree type except for the bmbt, refactor the code that
complains about bad records into a helper and make the ->query_range
helpers call it so that corruptions found via that avenue are logged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
and bubble up corruption reports.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a xfs_rmap_check_irec function to detect corruption in btree
records. Fix all xfs_rmap_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
Change this function to return the code address of a failed conversion
in preparation for the next patch, which standardizes localized record
checking and reporting code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a xfs_refcount_check_irec function to detect corruption in btree
records. Fix all xfs_refcount_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a xfs_inobt_check_irec function to detect corruption in btree
records. Fix all xfs_inobt_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
an incore record, and a xfs_alloc_check_irec function to detect
corruption. Replace all the open-coded logic with calls to the new
helpers and bubble up corruption reports.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.
However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to. This is because the
inobt scrubber will try push the AIL to try to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records. This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.
The solution to this performance slow down is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
To reduce the runtime overhead even further when online fsck isn't
running, use a static branch key to decide if we call wake_up on the
drain. For compilers that support jump labels, the call to wake_up is
replaced by a nop sled when nobody is waiting for intents to drain.
From my initial microbenchmarking, every transition of the static key
between the on and off states takes about 22000ns to complete; this is
paid entirely by the xfs_scrub process. When the static key is off
(which it should be when fsck isn't running), the nop sled adds an
overhead of approximately 0.36ns to runtime code. The post-atomic
lockless waiter check adds about 0.03ns, which is basically free.
For the few compilers that don't support jump labels, runtime code pays
the cost of calling wake_up on an empty waitqueue, which was observed to
be about 30ns. However, most architectures that have sufficient memory
and CPU capacity to run XFS also support jump labels, so this is not
much of a worry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It has been a longstanding convention that online scrub and repair
functions can return -EDEADLOCK to signal that they weren't able to
obtain some necessary resource. When this happens, the scrub framework
is supposed to release all resources attached to the scrub context, set
the TRY_HARDER flag in the scrub context flags, and try again. In this
context, individual scrub functions are supposed to take all the
resources they (incorrectly) speculated were not necessary.
We're about to make it so that the functions that lock and wait for a
filesystem AG can also return EDEADLOCK to signal that we need to try
again with the drain waiters enabled. Therefore, refactor
xfs_scrub_metadata to support this behavior for ->setup() functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain. Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.
When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.
Fix this by adding to the perag structure the count of active intents
and make scrub wait until it has both AG header buffer locks and the
intent counter reaches zero.
One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter. A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata. In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.
Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started. As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count. The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.
A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction. When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it. The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.
In the next patches we'll improve on this facility, but this patch
provides the basic functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new tracepoint so that I can see exactly what and where we failed
the refcount check.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Update the copyright years in the scrub/ source code files. This isn't
required, but it's helpful to remind myself just how long it's taken to
develop this feature.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Fix the spdx tags to match current practice, and update the author
contact information.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
There are a few places in the XFS codebase where a caller has either an
active or a passive reference to a perag structure and wants to give
a passive reference to some other piece of code. Btree cursor creation
and inode walks are good examples of this. Replace the open-coded logic
with a helper to do this.
The new function adds a few safeguards -- it checks that there's at
least one reference to the perag structure passed in, and it records the
refcount bump in the ftrace information. This makes it much easier to
debug perag refcounting problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Give the xfs_refcount_intent a passive reference to the perag structure
data. This reference will be used to enable scrub intent draining
functionality in subsequent patches. Any space being modified by a
refcount intent is already allocated, so we need to be able to operate
even if the AG is being shrunk or offlined.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Give the xfs_rmap_intent a passive reference to the perag structure
data. This reference will be used to enable scrub intent draining
functionality in subsequent patches. The space we're (reverse) mapping
is already allocated, so we need to be able to operate even if the AG is
being shrunk or offlined.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>