IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Lately I've been stress-testing extreme-sized rmap btrees by using the
(new) xfs_db bmap_inflate command to clone bmbt mappings billions of
times and then using xfs_repair to build new rmap and refcount btrees.
This of course is /much/ faster than actually FICLONEing a file billions
of times.
Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
sufficiently large for the test scenario. For a 1TB filesystem (~67
million AG blocks, 4 AGs) the btheight command reports:
$ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
level 0: 4400801200 records, 52390491 blocks
level 1: 52390491 records, 1164234 blocks
level 2: 1164234 records, 25872 blocks
level 3: 25872 records, 575 blocks
level 4: 575 records, 13 blocks
level 5: 13 records, 1 block
6 levels, 53581186 blocks total
The AG is sufficiently large to build this rmap btree. Unfortunately,
m_rmap_maxlevels is 5. Augmenting the loop in the space->height
function to report height, node blocks, and blocks remaining produces
this:
ht 1 node_blocks 45 blockleft 67108863
ht 2 node_blocks 2025 blockleft 67108818
ht 3 node_blocks 91125 blockleft 67106793
ht 4 node_blocks 4100625 blockleft 67015668
final height: 5
The goal of this function is to compute the maximum height btree that
can be stored in the given number of ondisk fsblocks. Starting with the
top level of the tree, each iteration through the loop adds the fanout
factor of the next level down until we run out of blocks. IOWs, maximum
height is achieved by using the smallest fanout factor that can apply
to that level.
However, the loop setup is not correct. Top level btree blocks are
allowed to contain fewer than minrecs items, so the computation is
incorrect because the first time through the loop it should be using a
fanout factor of 2. With this corrected, the above becomes:
ht 1 node_blocks 2 blockleft 67108863
ht 2 node_blocks 90 blockleft 67108861
ht 3 node_blocks 4050 blockleft 67108771
ht 4 node_blocks 182250 blockleft 67104721
ht 5 node_blocks 8201250 blockleft 66922471
final height: 6
Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
- Fix a race condition w.r.t. percpu inode free counters
- Fix a broken error return in xfs_remove
- Print FS UUID at mount/unmount time
- Numerous fixes to the online fsck code
- Fix inode locking inconsistency problems when dealing with realtime
metadata files
- Actually merge pull requests so that we capture the cover letter
contents
- Fix a race between rebuilding VFS inode state and the AIL flushing
inodes that could cause corrupt inodes to be written to the
filesystem
- Fix a data corruption problem resulting from a write() to an
unwritten extent racing with writeback started on behalf of memory
reclaim changing the extent state
- Add debugging knobs so that we can test iomap invalidation
- Fix the blockdev pagecache contents being stale after unmounting the
filesystem, leading to spurious xfs_db errors and corrupt metadumps
- Fix a file mapping corruption bug due to ilock cycling when attaching
dquots to a file during delalloc reservation
- Fix a refcount btree corruption problem due to the refcount
adjustment code not handling MAXREFCOUNT correctly, resulting in
unnecessary record splits
- Fix COW staging extent alloctions not being classified as USERDATA,
which results in filestreams being ignored and possible data
corruption if the allocation was filled from the AGFL and the block
buffer is still being tracked in the AIL
- Fix new duplicated includes
- Fix a race between the dquot shrinker and dquot freeing that could
cause a UAF
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmOSEWsACgkQ+H93GTRK
tOvpsg//Y8pgue8GFwyXq0LYEYb1yjueGIxDGz9SwkfMP9vADsdDpXxquHmes5M+
Q9vMyFnfaizZs2oXD6Nw/+RJMyOa3ZQtNqjxJET5pTIBcWvdjsP9UGW+K+1uN7LT
NsM7lgpxy8RfQFHjvFHpOysxGIpT70n3lz98qlwy1yIGF/EFE52pkKcArGjpIu4A
wBdyL0hIBwXc27zLRahLxfwFaW/I40ka3D40EUYpNnAjE5Sy0YgLlsOCzrxN0UvY
a9dlq+WFJjWDsLp6vr11ruewXAmzYG2m/3RdP2aLbmDHDvo06UkesKkPNhexlClM
kRE/ZImmakqKlAqgtUbkxT06NbIKOxYslbcoOOLDneqb1grTcgk79J7jsMlLLU1s
s1WyPMWR3wb0jjclgGBxd3c1nprdkvJSkBpyEOwIYLhwdPNuwqTwEVsq7TvasRLI
dgals5/J6fBnIeTR7x2YObonQRd4FlkXFv+AVYpGVUJEI02eRgY3i7NJBZWyBKAS
+Gcd1Bq1F387b0FRqq1iVhGD+NpoHHiP84bOQED9R9t0jP1AHj9t47f+Uuvjj2hN
ByT7MpA0nZdbYGKU+rFyKsIvONyLdxyjL+jm6FkmrW+G25fJ1af2yhrVhZQhw7dm
zLb1ntSnXvNTj4OopfKSDD2MPGf+2C/o2XJvAAS501pmsQefKOM=
=plES
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
"The highlight of this is a batch of fixes for the online metadata
checking code as we start the loooong march towards merging online
repair. I aim to merge that in time for the 2023 LTS.
There are also a large number of data corruption and race condition
fixes in this patchset. Most notably fixed are write() calls to
unwritten extents racing with writeback, which required some late(r
than I prefer) code changes to iomap to support the necessary
revalidations. I don't really like iomap changes going in past -rc4,
but Dave and I have been working on it long enough that I chose to
push it for 6.2 anyway.
There are also a number of other subtle problems fixed, including the
log racing with inode writeback to write inodes with incorrect link
count to disk; file data mapping corruptions as a result of incorrect
lock cycling when attaching dquots; refcount metadata corruption if
one actually manages to share a block 2^32 times; and the log
clobbering cow staging extents if they were formerly metadata blocks.
Summary:
- Fix a race condition w.r.t. percpu inode free counters
- Fix a broken error return in xfs_remove
- Print FS UUID at mount/unmount time
- Numerous fixes to the online fsck code
- Fix inode locking inconsistency problems when dealing with realtime
metadata files
- Actually merge pull requests so that we capture the cover letter
contents
- Fix a race between rebuilding VFS inode state and the AIL flushing
inodes that could cause corrupt inodes to be written to the
filesystem
- Fix a data corruption problem resulting from a write() to an
unwritten extent racing with writeback started on behalf of memory
reclaim changing the extent state
- Add debugging knobs so that we can test iomap invalidation
- Fix the blockdev pagecache contents being stale after unmounting
the filesystem, leading to spurious xfs_db errors and corrupt
metadumps
- Fix a file mapping corruption bug due to ilock cycling when
attaching dquots to a file during delalloc reservation
- Fix a refcount btree corruption problem due to the refcount
adjustment code not handling MAXREFCOUNT correctly, resulting in
unnecessary record splits
- Fix COW staging extent alloctions not being classified as USERDATA,
which results in filestreams being ignored and possible data
corruption if the allocation was filled from the AGFL and the block
buffer is still being tracked in the AIL
- Fix new duplicated includes
- Fix a race between the dquot shrinker and dquot freeing that could
cause a UAF"
* tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (50 commits)
xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING
xfs: Remove duplicated include in xfs_iomap.c
xfs: invalidate xfs_bufs when allocating cow extents
xfs: get rid of assert from xfs_btree_islastblock
xfs: estimate post-merge refcounts correctly
xfs: hoist refcount record merge predicates
xfs: fix super block buf log item UAF during force shutdown
xfs: wait iclog complete before tearing down AIL
xfs: attach dquots to inode before reading data/cow fork mappings
xfs: shut up -Wuninitialized in xfsaild_push
xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
xfs: invalidate block device page cache during unmount
xfs: add debug knob to slow down write for fun
xfs: add debug knob to slow down writeback for fun
xfs: drop write error injection is unfixable, remove it
xfs: use iomap_valid method to detect stale cached iomaps
iomap: write iomap validity checks
xfs: xfs_bmap_punch_delalloc_range() should take a byte range
iomap: buffered write failure should not truncate the page cache
xfs,iomap: move delalloc punching to iomap
...
While investigating test failures in xfs/17[1-3] in alwayscow mode, I
noticed through code inspection that xfs_bmap_alloc_userdata isn't
setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
fork. COW staging extents should be flagged as USERDATA, since user
data are persisted to these blocks before being remapped into a file.
This mis-classification has a few impacts on the behavior of the system.
First, the filestreams allocator is supposed to keep allocating from a
chosen AG until it runs out of space in that AG. However, it only does
that for USERDATA allocations, which means that COW allocations aren't
tied to the filestreams AG. Fortunately, few people use filestreams, so
nobody's noticed.
A more serious problem is that xfs_alloc_ag_vextent_small looks for a
buffer to invalidate *if* the USERDATA flag is set and the AG is so full
that the allocation had to come from the AGFL because the cntbt is
empty. The consequences of not invalidating the buffer are severe --
if the AIL incorrectly checkpoints a buffer that is now being used to
store user data, that action will clobber the user's written data.
Fix filestreams and yet another data corruption vector by flagging COW
allocations as USERDATA.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_btree_check_block contains debugging knobs. With XFS_DEBUG setting up,
turn on the debugging knob can trigger the assert of xfs_btree_islastblock,
test script as follows:
while true
do
mount $disk $mountpoint
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null
echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk
sleep 10
umount $mountpoint
done
Kick off fsstress and only *then* turn on the debugging knob. If it
happens that the knob gets turned on after the cntbt lookup succeeds
but before the call to xfs_btree_islastblock, then we *can* end up in
the situation where a previously checked btree block suddenly starts
returning EFSCORRUPTED from xfs_btree_check_block. Kaboom.
Darrick give a very detailed explanation as follows:
Looking back at commit 27d9ee577dcce, I think the point of all this was
to make sure that the cursor has actually performed a lookup, and that
the btree block at whatever level we're asking about is ok.
If the caller hasn't ever done a lookup, the bc_levels array will be
empty, so cur->bc_levels[level].bp pointer will be NULL. The call to
xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is
pointless.
If the caller did a lookup but the lookup failed due to block
corruption, the corresponding cur->bc_levels[level].bp pointer will also
be NULL, and we'll still crash. The "ASSERT(xfs_btree_check_block);"
logic is also unnecessary.
If the cursor level points to an inode root, the block buffer will be
incore, so it had better always be consistent.
If the caller ignores a failed lookup after a successful one and calls
this function, the cursor state is garbage and the assert wouldn't have
tripped anyway. So get rid of the assert.
Fixes: 27d9ee577dcc ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount
metadata corruptions after being run. Specifically, xfs_repair noticed
single-block refcount records that could be combined but had not been.
The root cause of this is improper MAXREFCOUNT edge case handling in
xfs_refcount_merge_extents. When we're trying to find candidates for a
refcount btree record merge, we compute the refcount attribute of the
merged record, but we fail to account for the fact that once a record
hits rc_refcount == MAXREFCOUNT, it is pinned that way forever. Hence
the computed refcount is wrong, and we fail to merge the extents.
Fix this by adjusting the merge predicates to compute the adjusted
refcount correctly.
Fixes: 3172725814f9 ("xfs: adjust refcount of an extent of blocks in refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
Hoist these multiline conditionals into separate static inline helpers
to improve readability and set the stage for corruption fixes that will
be introduced in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
Add a new error injection knob so that we can arbitrarily slow down
pagecache writes to test for race conditions and aberrant reclaim
behavior if the writeback mechanisms are slow to issue writeback. This
will enable functional testing for the ifork sequence counters
introduced in commit 304a68b9c63b ("xfs: use iomap_valid method to
detect stale cached iomaps") that fixes write racing with reclaim
writeback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new error injection knob so that we can arbitrarily slow down
writeback to test for race conditions and aberrant reclaim behavior if
the writeback mechanisms are slow to issue writeback. This will enable
functional testing for the ifork sequence counters introduced in commit
745b3f76d1c8 ("xfs: maintain a sequence count for inode fork
manipulations").
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmOFSzwUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h3djhAAwOf9VeLO7TW/0B1XfE3ktWGiDmEG
ekB8mkB7CAHB9SBq7TZMHjktJIJxY81q5+Iq9qHGiW3asoVbmWvkeRSJgXljhTby
D2KsUIT1NK/X6DhC9FhNjv/Q2GJ0nY6s65RLudUEkelYBFhGMM0kdXX+fZmtZ4yT
T/lRYk/KBFpeQCaGRcFXK55TnB/B9muOI9FyKvh2DNWe6r0Xu3Obb3a9k+snZA9R
EeUpAosDSrXzP4c2w2ovpU2eutUdo4eYTHIzXKGkhktbRhmCRLn4NlxvFCanoe8h
eSS85sb8DHUh2iyaaB8yrJ6LL3MuBytOi24rNBeyd1KAyEtT21+cTUK/QAahzble
pL8l6TA7ZXbhYcbk5uQvFEIAInR+0ffjde61uE14N55awq0Vdrym7C7D2ri60iw6
ts45AVYKYeF61coAbwvmaJyvqvQ0tUlmVZXI4lQzN2O17Lr6004gJFMjDRsXXU7H
eHLUt496Geq39rglw85y8G0vmxxGZ9iIGkeC1kUSSCmlvx/JfuJlbWBgyMGtNRBI
qzv0jmk67Ft1seQSMWQJttxCZs4uOF2gwERYGAF7iUR8F4PGob/N1e2/hpg75G8q
0S8u1N1p8Cv5u/jwybqy8FnSC2MlUZl6SQURaVQDy2DLMKHb4T1diu0jrCbiSPiF
JKfQ7aNQxaEZIJw=
=cv9i
-----END PGP SIGNATURE-----
Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB
xfs, iomap: fix data corruption due to stale cached iomaps
This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: drop write error injection is unfixable, remove it
xfs: use iomap_valid method to detect stale cached iomaps
iomap: write iomap validity checks
xfs: xfs_bmap_punch_delalloc_range() should take a byte range
iomap: buffered write failure should not truncate the page cache
xfs,iomap: move delalloc punching to iomap
xfs: use byte ranges for write cleanup ranges
xfs: punching delalloc extents on write failure is racy
xfs: write page faults in iomap are not buffered writes
With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.
The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, followed by invalidates
the cached data, then punching out the delalloc extent from under
that data.
On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.
IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.
Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t to this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.
And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.
This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.
As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:
xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
or the ->iomap_valid() introduction commit for exact details of the
corruption vector.
The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When lazysbcount is enabled, fsstress and loop mount/unmount test report
the following problems:
XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
This corruption will shutdown the file system and the file system will
no longer be mountable. The following script can reproduce the problem,
but it may take a long time.
#!/bin/bash
device=/dev/sda
testdir=/mnt/test
round=0
function fail()
{
echo "$*"
exit 1
}
mkdir -p $testdir
while [ $round -lt 10000 ]
do
echo "******* round $round ********"
mkfs.xfs -f $device
mount $device $testdir || fail "mount failed!"
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
sleep 4
killall -w fsstress
umount $testdir
xfs_repair -e $device > /dev/null
if [ $? -eq 2 ];then
echo "ERR CODE 2: Dirty log exception during repair."
exit 1
fi
round=$(($round+1))
done
With lazysbcount is enabled, There is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
m_ifree, this will make the m_ifree greater than m_icount. For example,
consider the following sequence and ifreedelta is postive:
CPU0 CPU1
xfs_log_sb xfs_trans_unreserve_and_mod_sb
---------- ------------------------------
percpu_counter_sum(&mp->m_icount)
percpu_counter_add_batch(&mp->m_icount,
idelta, XFS_ICOUNT_BATCH)
percpu_counter_add(&mp->m_ifree, ifreedelta);
percpu_counter_sum(&mp->m_ifree)
After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
that cause the file system shutdown.
When lazysbcount is enabled, we don't need to guarantee that Lazy sb
counters are completely correct, but we do need to guarantee that sb_ifree
<= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
must be satisfied any time that there /cannot/ be other threads allocating
or freeing inode chunks. If the constraint is violated under these
circumstances, sb_i{count,free} (the ondisk superblock inode counters)
maybe incorrect and need to be marked sick at unmount, the count will
be rebuilt on the next mount.
Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag. Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We're supposed to initialize the list head of an object before adding it
to another list. Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the refcount records are at the right side of
the refcount btree. This works because an AG is never allowed to have
more than (1U << 31) blocks in it. If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.
However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.
Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.
The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we have an explicit enum for shared and CoW staging extents, we
can get rid of the old FIND_RCEXT flags. Omit a couple of conversions
that disappear in the next patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount. We'll put this to more use in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've broken out the startblock and shared/cow domain in the
incore refcount extent record structure, update the tracepoints to
report the domain.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents. The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock. We don't allow AGs to have more than 2^31
blocks, so the bit was free.
Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set. This is a weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.
Fuzzing demonstrates practical exploits of this weakness. If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain. This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.
Found by fuzzing keys[2].cowflag = add in xfs/464.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Structure definitions for incore objects do not belong in the ondisk
format header. Move them to the incore types header where they belong.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent. Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.
If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block. If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.
The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner than later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a predicate function to verify that a given agbno/blockcount pair
fit entirely within a single allocation group and don't suffer
mathematical overflows. Refactor the existng open-coded logic; we're
going to add more calls to this function in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
such that the end of the range would not be in the middle of a refcount
record. If this is no longer the case, something is seriously wrong
with the btree. Bail out with a corruption error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor all the open-coded sizeof logic for EFI/EFD log item and log
format structures into common helper functions whose names reflect the
struct names.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Since we're already fixing problems with BUI item copying, we
should fix it everything else.
An extra difficulty here is that the ef[id]_extents arrays are declared
as single-element arrays. This is not the convention for flex arrays in
the modern kernel, and it causes all manner of problems with static
checking tools, since they often cannot tell the difference between a
single element array and a flex array.
So for starters, change those array[1] declarations to array[]
declarations to signal that they are proper flex arrays and adjust all
the "size-1" expressions to fit the new declaration style.
Next, refactor the xfs_efi_copy_format function to handle the copying of
the head and the flex array members separately. While we're at it, fix
a minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_rename can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip
and wip. So we need to increase the inode reservation to match.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
xfs_dir2_isleaf is used to see if the directory is a single-leaf
form directory instead, as commented right above the function.
Besides getting rid of the broken comment, we rearrange the logic by
converting everything over to standard formatting and conventions,
at the same time, to make it easier to understand and self documenting.
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Take a look at the for-loop in xfs_da_grow_inode_int:
======
for(){
nmap = min(XFS_BMAP_MAX_NMAP, count);
...
error = xfs_bmapi_write(...,&mapp[mapi], &nmap);//(..., $1, $2)
...
mapi += nmap;
}
=====
where $1 stands for the start address of the array,
while $2 is used to indicate the size of the array.
The array $1 will advance by $nmap in each iteration after
the allocation of extents.
But the size $2 still remains unchanged, which is determined by
min(XFS_BMAP_MAX_NMAP, count).
It seems that it has forgotten to trim the mapp array after each
iteration, so change it.
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Return the value xfs_dir_cilookup_result() directly instead of storing it
in another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The "%Ld" specifier, which represents long long unsigned,
doesn't meet C language standard, and even more,
it makes people easily mistake with "%ld", which represent
long unsigned. So replace "%Ld" with "lld".
Do the same with "%Lu".
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a
directory entry writes:
/* fs/xfs/libxfs/xfs_trans_resv.c begin */
/*
* For removing a directory entry we can modify:
* the parent directory inode: inode size
* the removed inode: inode size
...
xfs_calc_remove_reservation(
struct xfs_mount *mp)
{
return XFS_DQUOT_LOGRES(mp) +
xfs_calc_iunlink_add_reservation(mp) +
max((xfs_calc_inode_res(mp, 1) +
...
/* fs/xfs/libxfs/xfs_trans_resv.c end */
There has 2 inode size of space to be reserverd, but the actual code
for inode reservation space writes.
There only count for 1 inode size to be reserved in
'xfs_calc_inode_res(mp, 1)', rather than 2.
Signed-off-by: hexiaole <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: remove redundant code citations]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Replace 'the the' with 'the' in the comment.
Signed-off-by: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
I observed the following evidence of a memory leak while running xfs/399
from the xfs fsck test suite (edited for brevity):
XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
------------[ cut here ]------------
WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_ifork_zap_attr+0x7c/0xb0
xfs_iformat_attr_fork+0x86/0x110
xfs_inode_from_disk+0x41d/0x480
xfs_iget+0x389/0xd70
xfs_bulkstat_one_int+0x5b/0x540
xfs_bulkstat_iwalk+0x1e/0x30
xfs_iwalk_ag_recs+0xd1/0x160
xfs_iwalk_run_callbacks+0xb9/0x180
xfs_iwalk_ag+0x1d8/0x2e0
xfs_iwalk+0x141/0x220
xfs_bulkstat+0x105/0x180
xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
xfs_file_ioctl+0xa5f/0xef0
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This newly-added assertion checks that there aren't any incore data
structures hanging off the incore fork when we're trying to reset its
contents. From the call trace, it is evident that iget was trying to
construct an incore inode from the ondisk inode, but the attr fork
verifier failed and we were trying to undo all the memory allocations
that we had done earlier.
The three assertions in xfs_ifork_zap_attr check that the caller has
already called xfs_idestroy_fork, which clearly has not been done here.
As the zap function then zeroes the pointers, we've effectively leaked
the memory.
The shortest change would have been to insert an extra call to
xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
call into _zap_attr, since all other callsites call _idestroy_fork
immediately prior to calling _zap_attr. IOWs, it eliminates one way to
fail.
Note: This change only applies cleanly to 2ed5b09b3e8f, since we just
reworked the attr fork lifetime. However, I think this memory leak has
existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork ->
xfs_iformat_local -> xfs_init_local_fork will allocate
ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
hanging off i_afp. The solution for older kernels I think is to add the
missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.
Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.
Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
These NULL check are no long needed after commit 2ed5b09b3e8f ("xfs:
make inode attribute forks a permanent part of struct xfs_inode").
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The 'ctime', 'mtime', and 'atime' for inode is the type of
'xfs_timestamp_t', which is a 64-bit type:
/* fs/xfs/libxfs/xfs_format.h begin */
typedef __be64 xfs_timestamp_t;
/* fs/xfs/libxfs/xfs_format.h end */
When the 'bigtime' feature is disabled, this 64-bit type is splitted
into two parts of 32-bit, one part is encoded for seconds since
1970-01-01 00:00:00 UTC, the other part is encoded for nanoseconds
above the seconds, this two parts are the type of
'xfs_legacy_timestamp' and the min and max time value of this type are
defined as macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX':
/* fs/xfs/libxfs/xfs_format.h begin */
struct xfs_legacy_timestamp {
__be32 t_sec; /* timestamp seconds */
__be32 t_nsec; /* timestamp nanoseconds */
};
#define XFS_LEGACY_TIME_MIN ((int64_t)S32_MIN)
#define XFS_LEGACY_TIME_MAX ((int64_t)S32_MAX)
/* fs/xfs/libxfs/xfs_format.h end */
/* include/linux/limits.h begin */
#define U32_MAX ((u32)~0U)
#define S32_MAX ((s32)(U32_MAX >> 1))
#define S32_MIN ((s32)(-S32_MAX - 1))
/* include/linux/limits.h end */
'XFS_LEGACY_TIME_MIN' is the min time value of the
'xfs_legacy_timestamp', that is -(2^31) seconds relative to the
1970-01-01 00:00:00 UTC, it can be converted to human-friendly time
value by 'date' command:
/* command begin */
[root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S'
1970-01-01 00:00:00
[root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S'
1901-12-13 20:45:52
[root@~]#
/* command end */
When 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit
nanoseconds counter, with the start time value is the min time value of
'xfs_legacy_timestamp'(start time means the value of 64-bit nanoseconds
counter is 0). We have already caculated the min time value of
'xfs_legacy_timestamp', that is 1901-12-13 20:45:52 UTC, but the comment
for the start time value of inode with 'bigtime' feature enabled writes
the value is 1901-12-31 20:45:52 UTC:
/* fs/xfs/libxfs/xfs_format.h begin */
/*
* XFS Timestamps
* ==============
* When the bigtime feature is enabled, ondisk inode timestamps become an
* unsigned 64-bit nanoseconds counter. This means that the bigtime inode
* timestamp epoch is the start of the classic timestamp range, which is
* Dec 31 20:45:52 UTC 1901. ...
...
*/
/* fs/xfs/libxfs/xfs_format.h end */
That is a typo, and this patch corrects the typo, from 'Dec 31' to
'Dec 13'.
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Xiaole He <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This series fixes a use-after-free bug that syzbot uncovered. The UAF
itself is a result of a race condition between getxattr and removexattr
because callers to getxattr do not necessarily take any sort of locks
before calling into the filesystem.
Although the race condition itself can be fixed through clever use of a
memory barrier, further consideration of the use cases of extended
attributes shows that most files always have at least one attribute, so
we might as well make them permanent.
v2: Minor tweaks suggested by Dave, and convert some more macros to
helper functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmLQRsAACgkQ+H93GTRK
tOseOw/+JdSH2MU2xx+XoE5M/fStzGpw0UsoOqDo8kUPKDt3z12NwuczlL4OAiuw
XFrN/1IAnxBsTD9YxFYbqoCPNi/VR81ajfWV7JqD2B1Joj0aATsxGDdNUYJnxCdU
HMlMqP5o77XvArwkxFbgxYi7UGdAeNwXxqUJcJ8FopQo2lb8+SG6XzpLgGKnyrKT
FRNKXNObplhtzOs/Bv8qYAxJulmmjkktFJXhK2OAUJlIDjFwFY9Wo2T4QuOVe9w+
NXFOYyu0BqWLpDZQkYKWoCnF0GNsUavS8DP6zZYW3qJ6mX/f1jmtfbOLAkHNhlh8
uHy/3k3SeQhKztTqM28sPioe6mdj2xocorDCCVBgGXgWxVF6aWeM/PS4tMTWN/Bg
TWd1egERpeVC0Ymkm0LTCIDNuLqxsknK1G6sxXhwrQ8cw/70Gl08ePwgdyZ6hpD9
Q61iJQofcI7MJX189a2VSbbHRzFnZIA+uVK4oyhmdEkQVKTHgmzwHVP660oAv9bD
Y5hqkWEoyKTaaCsOWRAPVXG3k03lq+TNcaGggZgwFx11Gw4oMEx5hMUztoP54xX4
aEXb1HWcCmMxy8llnFY/82baW98ucwl8FwWF1qhNIPT40mYx9IobDmvkCtNrAoOC
41U7O8CxxPlt7XKxoRhafQOAhzp0ZzuhCdbaFIUENV7pTAJtq5Q=
=W3Ad
-----END PGP SIGNATURE-----
Merge tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.20-mergeB
xfs: make attr forks permanent
This series fixes a use-after-free bug that syzbot uncovered. The UAF
itself is a result of a race condition between getxattr and removexattr
because callers to getxattr do not necessarily take any sort of locks
before calling into the filesystem.
Although the race condition itself can be fixed through clever use of a
memory barrier, further consideration of the use cases of extended
attributes shows that most files always have at least one attribute, so
we might as well make them permanent.
v2: Minor tweaks suggested by Dave, and convert some more macros to
helper functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: replace inode fork size macros with functions
xfs: replace XFS_IFORK_Q with a proper predicate function
xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
xfs: make inode attribute forks a permanent part of struct xfs_inode
xfs: convert XFS_IFORK_PTR to a static inline helper
Current work to merge the XFS inode life cycle with the VFS inode
life cycle is finding some interesting issues. If we have a path
that hits buffer trylocks fairly hard (e.g. a non-blocking
background inode freeing function), we end up hitting massive
contention on the buffer cache hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by creating a lookup fast
path which leverages the rhashtable's ability to do RCU protected
lookups.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLPvngUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h0gTw/9EK1gj31QpurgGziYsL0JFI1Uq33Z
2rB/yTJXzxe+J7cE6B2RYuSj4EK7YI1aZXTRC5De5A8TqbFaNztrigqxNNpm3jh0
T0AbVQoY7XzjbvMHQ0VFPBcJGcVbQypA+rabSlLHfU9zfN3t4EnM+BmuaFqygGZj
1A6ZjkVChmEprGjd16846sgvMqdLa4yJ4/9Jsu5WlI+vPZj9gJX/7Mjc580Zljb5
gg9Cf8ziW78gpHzj3ufSWv2jBcWcMdyHpyCF/fNceROUaxmZKsMUDKcsia9TyQhB
yJXxw9Rnb3F23VJSYMJIcf4+RTd7iqd88GhEEFYxj41gI/jQxqRovlS1ljk2l20R
3i4TUs7yF24sLLQdL8YkJiGCOEvRqPPcNd4xfGwdioRwXwoEqB7L/vYpUheQ8qSZ
Tnn4vmGm+GQHNnQNhxiF8KkAd9gwcUslN36ZJn+h3zjvfgAFQFChsk+3CoFoxsth
BpbFT3lo4Hc6xJBDCp7Z3Gxurxq5fQ2CGYHxCBT4feNkZS5YOLd/Os2hIZVId8XA
jp66ZyELd8zj+CxMp4ZyYqsFETIao13B8KPEqvI2/obEDE6p/++olP8aqKIP1C8d
ASOjxP8KqWEHLe3or4W3m2WSDa5fp1b3G/mjS7r/jDKqIuTMZXYw4CJx1x3rTr4F
nXAnlWoGVq7HjWc=
=8UYp
-----END PGP SIGNATURE-----
Merge tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB
xfs: lockless buffer cache lookups
Current work to merge the XFS inode life cycle with the VFS inode
life cycle is finding some interesting issues. If we have a path
that hits buffer trylocks fairly hard (e.g. a non-blocking
background inode freeing function), we end up hitting massive
contention on the buffer cache hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by creating a lookup fast
path which leverages the rhashtable's ability to do RCU protected
lookups.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: lockless buffer lookup
xfs: remove a superflous hash lookup when inserting new buffers
xfs: reduce the number of atomic when locking a buffer after lookup
xfs: merge xfs_buf_find() and xfs_buf_get_map()
xfs: break up xfs_buf_find() into individual pieces
xfs: rework xfs_buf_incore() API
To facilitate future improvements in inode logging and improving
inode cluster buffer locking order consistency, we need a new
mechanism for defering inode cluster buffer modifications during
unlinked list modifications.
The unlinked inode list buffer locking is complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.
Hence we need a mechanism for defering unlinked inode list updates
until a point where we know that all modifications have been made
and all that remains is to lock and modify the cluster buffers.
We can do this by first observing that we serialise unlinked list
modifications by holding the AGI buffer lock. IOWs, the AGI is going
to be locked until the transaction commits any time we modify the
unlinked list. Hence it doesn't matter when in the unlink
transactions that we actually load, lock and modify the inode
cluster buffer.
We add an in-memory unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations that need to be
done. Essentially all we need to do is record the inodes that need
to have their unlinked list pointer updated in a new log item that
we attached to the transaction.
This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.
Hence if we add a "precommit" hook into xfs_trans_commit()
to run a "precommit" operation on these iunlink log items, we can
delay the locking, modification and logging of the inode cluster
buffer until after all other modifications have been made. The
precommit hook reuires us to sort the items that are going to be run
so that we can lock precommit items in the correct order as we
perform the modifications they describe.
To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.
We can then use the agino pointers for lockless inode cache lookups
to retreive the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here. The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.
This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. Tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.
The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLPvZAUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h2CDBAAj9QH4/XIe8JIx/mKgAGzcNNwQxu8
geBqb5S2ri0oB22pRXKc/3zArw/8zPwcgZF83ChkFrQ6tLn4JGkEEuvIKr3b8k50
2AEghYf8dqCaXRpkdvIGjJtdK54MFZIHv9TYRwHVzBp3WLtrz7uHmKeRf2qeSBMI
DLurzVIbcocMptvHxrZZCpf1ajuVdovXtuw8ExiORZZKLOeF+3xBGztenkfh2BTO
8Kh8qJVSNN41XQ8h87PWyQtmah6JouqURXXGERJcgLbr80pTSw2EBihJvmXUmn8y
qnoT27TCPAMOEDTWe+SHzLOVRLvhN+at/lFWbvas6PwOvDGAwQQtZkv/QyLTSgqD
6Zg9xJeeSgHhHP2kCeLlKmvW1dRptcUzhCWOrQ9Ry+WZnKK5ZenevkaAXAva6ucS
NXwIU1DnWfJ51SHIYQiQIci2g+vF+pnJRQq1DtYUuwtBSWfsmw1uquNZodgbA9Ue
k6hfk4qVua63k+vXsd5gVdCHT+Liw+1ldTInl2GNhT/riNzewO0HY3zmc1aZQyMM
mymHXKVcQJbLpJvwqB5SXq8a37fbpoQDYlycptSF/YxxBhiCKKWuc6q7Tl6Y9VSS
qpSHvh+MkJcP8PYtPjNUcJ9yeXhYJgkv1KK47zkIKzOD9a+zh4SIrfiBUflZbpQq
M9ubXGHVmFqS4Nk=
=59M0
-----END PGP SIGNATURE-----
Merge tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB
xfs: introduce in-memory inode unlink log items
To facilitate future improvements in inode logging and improving
inode cluster buffer locking order consistency, we need a new
mechanism for defering inode cluster buffer modifications during
unlinked list modifications.
The unlinked inode list buffer locking is complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.
Hence we need a mechanism for defering unlinked inode list updates
until a point where we know that all modifications have been made
and all that remains is to lock and modify the cluster buffers.
We can do this by first observing that we serialise unlinked list
modifications by holding the AGI buffer lock. IOWs, the AGI is going
to be locked until the transaction commits any time we modify the
unlinked list. Hence it doesn't matter when in the unlink
transactions that we actually load, lock and modify the inode
cluster buffer.
We add an in-memory unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations that need to be
done. Essentially all we need to do is record the inodes that need
to have their unlinked list pointer updated in a new log item that
we attached to the transaction.
This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.
Hence if we add a "precommit" hook into xfs_trans_commit()
to run a "precommit" operation on these iunlink log items, we can
delay the locking, modification and logging of the inode cluster
buffer until after all other modifications have been made. The
precommit hook reuires us to sort the items that are going to be run
so that we can lock precommit items in the correct order as we
perform the modifications they describe.
To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.
We can then use the agino pointers for lockless inode cache lookups
to retreive the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here. The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.
This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. Tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.
The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: add in-memory iunlink log item
xfs: add log item precommit operation
xfs: combine iunlink inode update functions
xfs: clean up xfs_iunlink_update_inode()
xfs: double link the unlinked inode list
xfs: introduce xfs_iunlink_lookup
xfs: refactor xlog_recover_process_iunlinks()
xfs: track the iunlink list pointer in the xfs_inode
xfs: factor the xfs_iunlink functions
xfs: flush inode gc workqueue before clearing agi bucket
Now we have forwards traversal via the incore inode in place, we now
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.
The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.
This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there is
no races present at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Replace the shouty macros here with typechecked helper functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace this shouty macro with a real C function that has a more
descriptive name.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
attribute fork but i_forkoff is zero. This eliminates the ambiguity
between i_forkoff and i_af.if_present, which should make it easier to
understand the lifetime of attr forks.
While we're at it, remove the if_present checks around calls to
xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
forks that have already been torn down.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Syzkaller reported a UAF bug a while back:
==================================================================
BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
5.15.0-0.30.3-20220406_1406 #3
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
__vfs_getxattr+0xdf/0x13d fs/xattr.c:399
cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
do_truncate+0xc3/0x1e0 fs/open.c:56
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
RIP: 0033:0x7f7ef4bb753d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
</TASK>
Allocated by task 2953:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:254 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3213 [inline]
slab_alloc mm/slub.c:3221 [inline]
kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
kmem_cache_zalloc include/linux/slab.h:711 [inline]
xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_setxattr+0x11b/0x177 fs/xattr.c:180
__vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
__vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
vfs_setxattr+0x154/0x33d fs/xattr.c:301
setxattr+0x216/0x29f fs/xattr.c:575
__do_sys_fsetxattr fs/xattr.c:632 [inline]
__se_sys_fsetxattr fs/xattr.c:621 [inline]
__x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
Freed by task 2949:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:230 [inline]
slab_free_hook mm/slub.c:1700 [inline]
slab_free_freelist_hook mm/slub.c:1726 [inline]
slab_free mm/slub.c:3492 [inline]
kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_removexattr+0x106/0x16a fs/xattr.c:468
cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
security_inode_killpriv+0x54/0xa1 security/security.c:1414
setattr_prepare+0x1a6/0x897 fs/attr.c:146
xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
notify_change+0xae5/0x10a1 fs/attr.c:410
do_truncate+0x134/0x1e0 fs/open.c:64
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
The buggy address belongs to the object at ffff88802cec9188
which belongs to the cache xfs_ifork of size 40
The buggy address is located 20 bytes inside of
40-byte region [ffff88802cec9188, ffff88802cec91b0)
The buggy address belongs to the page:
page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x2cec9
flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
>ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
^
ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
==================================================================
The root cause of this bug is the unlocked access to xfs_inode.i_afp
from the getxattr code paths while trying to determine which ILOCK mode
to use to stabilize the xattr data. Unfortunately, the VFS does not
acquire i_rwsem when vfs_getxattr (or listxattr) call into the
filesystem, which means that getxattr can race with a removexattr that's
tearing down the attr fork and crash:
xfs_attr_set: xfs_attr_get:
xfs_attr_fork_remove: xfs_ilock_attr_map_shared:
xfs_idestroy_fork(ip->i_afp);
kmem_cache_free(xfs_ifork_cache, ip->i_afp);
if (ip->i_afp &&
ip->i_afp = NULL;
xfs_need_iread_extents(ip->i_afp))
<KABOOM>
ip->i_forkoff = 0;
Regrettably, the VFS is much more lax about i_rwsem and getxattr than
is immediately obvious -- not only does it not guarantee that we hold
i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
The getxattr system call won't acquire the lock before calling XFS, but
the file capabilities code calls getxattr with and without i_rwsem held
to determine if the "security.capabilities" xattr is set on the file.
Fixing the VFS locking requires a treewide investigation into every code
path that could touch an xattr and what i_rwsem state it expects or sets
up. That could take years or even prove impossible; fortunately, we
can fix this UAF problem inside XFS.
An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
ensure that i_forkoff is always zeroed before i_afp is set to null and
changed the read paths to use smp_rmb before accessing i_forkoff and
i_afp, which avoided these UAF problems. However, the patch author was
too busy dealing with other problems in the meantime, and by the time he
came back to this issue, the situation had changed a bit.
On a modern system with selinux, each inode will always have at least
one xattr for the selinux label, so it doesn't make much sense to keep
incurring the extra pointer dereference. Furthermore, Allison's
upcoming parent pointer patchset will also cause nearly every inode in
the filesystem to have extended attributes. Therefore, make the inode
attribute fork structure part of struct xfs_inode, at a cost of 40 more
bytes.
This patch adds a clunky if_present field where necessary to maintain
the existing logic of xattr fork null pointer testing in the existing
codebase. The next patch switches the logic over to XFS_IFORK_Q and it
all goes away.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We're about to make this logic do a bit more, so convert the macro to a
static inline function for better typechecking and fewer shouty macros.
No functional changes here.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>