IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Create a bag structure for refcount information that uses the refcount
bag btree defined in the previous patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient and performant format. Recall that the
refcount recordset regenerator computes the new recordset from browsing
the rmap records. Let's say that the rmap records are:
{agbno: 10, length: 40, ...}
{agbno: 11, length: 3, ...}
{agbno: 12, length: 20, ...}
{agbno: 15, length: 1, ...}
It is convenient to have a data structure that could quickly tell us the
refcount for an arbitrary agbno without wasting memory. An array or a
list could do that pretty easily. List suck because of the pointer
overhead. xfarrays are a lot more compact, but we want to minimize
sparse holes in the xfarray to constrain memory usage. Maintaining any
kind of record order isn't needed for correctness, so I created the
"rcbag", which is shorthand for an unordered list of (excerpted) reverse
mappings.
So we add the first rmap to the rcbag, and it looks like:
0: {agbno: 10, length: 40}
The refcount for agbno 10 is 1. Then we move on to block 11, so we add
the second rmap:
0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}
The refcount for agbno 11 is 2. We move on to block 12, so we add the
third:
0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}
2: {agbno: 12, length: 20}
The refcount for agbno 12 and 13 is 3. We move on to block 14, and
remove the second rmap:
0: {agbno: 10, length: 40}
1: NULL
2: {agbno: 12, length: 20}
The refcount for agbno 14 is 2. We move on to block 15, and add the
last rmap. But we don't care where it is and we don't want to expand
the array so we put it in slot 1:
0: {agbno: 10, length: 40}
1: {agbno: 15, length: 1}
2: {agbno: 12, length: 20}
The refcount for block 15 is 3. Notice how order doesn't matter in this
list? That's why repair uses an unordered list, or "bag". The data
structure is not a set because it does not guarantee uniqueness.
That said, adding and removing specific items is now an O(n) operation
because we have no idea where that item might be in the list. Overall,
the runtime is O(n^2) which is bad.
I realized that I could easily refactor the btree code and reimplement
the refcount bag with an xfbtree. Adding and removing is now O(log2 n),
so the runtime is at least O(n log2 n), which is much faster. In the
end, the rcbag becomes a sorted list, but that's merely a detail of the
implementation. The repair code doesn't care.
(Note: That horrible xfs_db bmap_inflate command can be used to exercise
this sort of rcbag insanity by cranking up refcounts quickly.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rebuild the reverse mapping btree from all primary metadata. This first
patch establishes the bare mechanics of finding records and putting
together a new ondisk tree; more complex pieces are needed to make it
work properly.
Link: Documentation/filesystems/xfs-online-fsck-design.rst
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Adapt the generic btree cursor code to be able to create a btree whose
buffers come from a (presumably in-memory) buftarg with a header block
that's specific to in-memory btrees. We'll connect this to other parts
of online scrub in the next patches.
Note that in-memory btrees always have a block size matching the system
memory page size for efficiency reasons. There are also a few things we
need to do to finalize a btree update; that's covered in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Allow the buffer cache to target in-memory files by making it possible
to have a buftarg that maps pages from private shmem files. As the
prevous patch alludes, the in-memory buftarg contains its own cache,
points to a shmem file, and does not point to a block_device.
The next few patches will make it possible to construct an xfs_btree in
pageable memory by using this buftarg.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the same summary counter calculation infrastructure to generate new
values for the in-core summary counters. The difference between the
scrubber and the repairer is that the repairer will freeze the fs during
setup, which means that the values should match exactly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix the file link counts since we just computed the correct ones.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create the necessary scrub code to walk the filesystem's directory tree
so that we can compute file link counts. Similar to quotacheck, we
create an incore shadow array of link count information and then we walk
the filesystem a second time to compare the link counts. We need live
updates to keep the information up to date during the lengthy scan, so
this scrubber remains disabled until the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new trio of scrub functions to check quota counters. While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters. We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Certain types of filesystem metadata can only be checked by scanning
every file in the entire filesystem. Specific examples of this include
quota counts, file link counts, and reverse mappings of file extents.
Directory and parent pointer reconstruction may also fall into this
category. File scanning is much trickier than scanning AG metadata
because we have to take inode locks in the same order as the rest of
[VX]FS, we can't be holding buffer locks when we do that, and scanning
the whole filesystem takes time.
Earlier versions of the online repair patchset relied heavily on
fsfreeze as a means to quiesce the filesystem so that we could take
locks in the proper order without worrying about concurrent updates from
other writers. Reviewers of those patches opined that freezing the
entire fs to check and repair something was not sufficiently better than
unmounting to run fsck offline. I don't agree with that 100%, but the
message was clear: find a way to repair things that minimizes the
quiet period where nobody can write to the filesystem.
Generally, building btree indexes online can be split into two phases: a
collection phase where we compute the records that will be put into the
new btree; and a construction phase, where we construct the physical
btree blocks and persist them. While it's simple to hold resource locks
for the entirety of the two phases to ensure that the new index is
consistent with the rest of the system, we don't need to hold resource
locks during the collection phase if we have a means to receive live
updates of other work going on elsewhere in the system.
The goal of this patch, then, is to enable online fsck to learn about
metadata updates going on in other threads while it constructs a shadow
copy of the metadata records to verify or correct the real metadata. To
minimize the overhead when online fsck isn't running, we use srcu
notifiers because they prioritize fast access to the notifier call chain
(particularly when the chain is empty) at a cost to configuring
notifiers. Online fsck should be relatively infrequent, so this is
acceptable.
The intended usage model is fairly simple. Code that modifies a
metadata structure of interest should declare a xfs_hook_chain structure
in some well defined place, and call xfs_hook_call whenever an update
happens. Online fsck code should define a struct notifier_block and use
xfs_hook_add to attach the block to the chain, along with a function to
be called. This function should synchronize with the fsck scanner to
update whatever in-memory data the scanner is collecting. When
finished, xfs_hook_del removes the notifier from the list and waits for
them all to complete.
Originally, I selected srcu notifiers over blocking notifiers to
implement live hooks because they seemed to have fewer impacts to
scalability. The per-call cost of srcu_notifier_call_chain is higher
(19ns) than blocking_notifier_ (4ns) in the single threaded case, but
blocking notifiers use an rwsem to stabilize the list. Cacheline
bouncing for that rwsem is costly to runtime code when there are a lot
of CPUs running regular filesystem operations. If there are no hooks
installed, this is a total waste of CPU time.
Therefore, I stuck with srcu notifiers, despite trading off single
threaded performance for multithreaded performance. I also wasn't
thrilled with the very high teardown time for srcu notifiers, since the
caller has to wait for the next rcu grace period. This can take a long
time if there are a lot of CPUs.
Then I discovered the jump label implementation of static keys.
Jump labels use kernel code patching to replace a branch with a nop sled
when the key is disabled. IOWs, they can eliminate the overhead of
_call_chain when there are no hooks enabled. This makes blocking
notifiers competitive again -- scrub runs faster because teardown of the
chain is a lot cheaper, and runtime code only pays the rwsem locking
overhead when scrub is actually running.
With jump labels enabled, calls to empty notifier chains are elided from
the call sites when there are no hooks registered, which means that the
overhead is 0.36ns when fsck is not running. This is perfect for most
of the architectures that XFS is expected to run on (e.g. x86, powerpc,
arm64, s390x, riscv).
For architectures that don't support jump labels (e.g. m68k) the runtime
overhead of checking the static key is an atomic counter read. This
isn't great, but it's still cheaper than taking a shared rwsem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.
The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited. This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data. It is critical for the scanning thraad to hold
exclusive access on the inode until after marking the inode visited.
This new code is a separate patch from the patchsets adding callers for
the sake of enabling the author to move patches around his tree with
ease. The intended usage model for this code is roughly:
xchk_iscan_start(iscan, 0, 0);
while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
xfs_ilock(ip, ...);
/* capture inode metadata */
xchk_iscan_mark_visited(iscan, ip);
xfs_iunlock(ip, ...);
xfs_irele(ip);
}
xchk_iscan_stop(iscan);
if (error)
return error;
Hook functions for live updates can then do:
if (xchk_iscan_want_live_update(...))
/* update the captured inode metadata */
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
kmem_alloc() is just a thin wrapper around kmalloc() these days.
Convert everything to use kmalloc() so we can get rid of the
wrapper.
Note: the transaction region allocation in xlog_add_to_transaction()
can be a high order allocation. Converting it to use
kmalloc(__GFP_NOFAIL) results in warnings in the page allocation
code being triggered because the mm subsystem does not want us to
use __GFP_NOFAIL with high order allocations like we've been doing
with the kmem_alloc() wrapper for a couple of decades. Hence this
specific case gets converted to xlog_kvmalloc() rather than
kmalloc() to avoid this issue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Upon a closer inspection of the quota record scrubber, I noticed that
dqiterate wasn't actually walking all possible dquots for the mapped
blocks in the quota file. This is due to xfs_qm_dqget_next skipping all
XFS_IS_DQUOT_UNINITIALIZED dquots.
For a fsck program, we really want to look at all the dquots, even if
all counters and limits in the dquot record are zero. Rewrite the
implementation to do this, as well as switching to an iterator paradigm
to reduce the number of indirect calls.
This enables removal of the old broken dqiterate code from xfs_dquot.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix all the file metadata surrounding the realtime bitmap file, which
includes the rt geometry, file size, forks, and space mappings. The
bitmap contents themselves cannot be fixed without rt rmap, so that will
come later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again. If there /is/ a cached inode,
fix any problems and apply any optimizations that can be solved incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the rmapbt to find inode chunks, query the chunks to compute hole
and free masks, and with that information rebuild the inobt and finobt.
Refer to the case study in
Documentation/filesystems/xfs-online-fsck-design.rst for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move struct xagb_bitmap to its own pair of C and header files per
request of Christoph.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
As for the particular choice of lowspace thresholds and btree block
slack factors -- at this point one could say that the thresholds in
online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
But that's not the entire story, since the offline btree rebuilding
code in xfs_repair was merged as a retroport of the online btree code
in this patchset!
Before xfs_btree_staging.[ch] came along, xfs_repair determined the
slack factor (aka the number of slots to leave unfilled in each new
btree block) via open-coded logic in repair/phase5.c[2]. At that point
the slack factors were arbitrary quantities per btree. The rmapbt
automatically left 10 slots free; everything else left zero.
That had a noticeable effect on performance straight after mounting
because adding records to /any/ btree would result in splits. A few
years ago when this patch was first written, Dave and I decided that
repair should generate btree blocks that were 75% full unless space was
tight, in which case it should try to fill the blocks to nearly full.
We defined tight as ~10% free to avoid repair failures but settled on
3/32 (~9%) to avoid div64.
IOWs, we mostly pulled the thresholds out of thin air. We've been
QAing with those geometry numbers ever since. ;)
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Track the usage, outcomes, and run times of the online fsck code, and
report these values via debugfs. The columns in the file are:
* scrubber name
* number of scrub invocations
* clean objects found
* corruptions found
* optimizations found
* cross referencing failures
* inconsistencies found during cross referencing
* incomplete scrubs
* warnings
* number of time scrub had to retry
* cumulative amount of time spent scrubbing (microseconds)
* number of repair inovcations
* successfully repaired objects
* cumuluative amount of time spent repairing (microseconds)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reaping blocks after a repair is a complicated affair involving a lot of
rmap btree lookups and figuring out if we're going to unmap or free old
metadata blocks that might be crosslinked. Eventually, we will need to
be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
garbage CoW staging extents, and (even later) blocks from btrees rooted
in inodes. This results in a lot of reaping code, so we might as well
split that off while it's easy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
logs, respectively.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory. This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.
Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.
Note: Porting the dotdot checking code is left for a subsequent patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain. Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.
When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.
Fix this by adding to the perag structure the count of active intents
and make scrub wait until it has both AG header buffer locks and the
intent counter reaches zero.
One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter. A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata. In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.
Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started. As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count. The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.
A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction. When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it. The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.
In the next patches we'll improve on this facility, but this patch
provides the basic functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in
ext4 against generic/454. The cause of this test failure was the
unfortunate combination of setting an xattr name containing UTF8 encoded
emoji, an xattr hash function that accepted a char pointer with no
explicit signedness, signed type extension of those chars to an int, and
the 6.2 build tools maintainers deciding to mandate -funsigned-char
across the board. As a result, the ondisk extended attribute structure
written out by 6.1 and 6.2 were not the same.
This discrepancy, in fact, had been noticeable if a filesystem with such
an xattr were moved between any two architectures that don't employ the
same signedness of a raw "char" declaration. The only reason anyone
noticed is that x86 gcc defaults to signed, and no such -funsigned-char
update was made to e2fsprogs, so e2fsck immediately started reporting
data corruption.
After a day and a half of discussing how to handle this use case (xattrs
with bit 7 set anywhere in the name) without breaking existing users,
Linus merged his own patch and didn't tell the maintainer. None of the
ext4 developers realized this until AUTOSEL announced that the commit
had been backported to stable.
In the end, this problem could have been detected much earlier if there
had been any useful tests of hash function(s) in use inside ext4 to make
sure that they always produce the same outputs given the same inputs.
The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's
vulnerable to this problem. However, let's avoid all this drama by
adding our own self test to check that the da hash produces the same
outputs for a static pile of inputs on various platforms. This enables
us to fix any breakage that may result in a controlled fashion. The
buffer and test data are identical to the patches submitted to xfsprogs.
Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/
Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.
If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data. For now all we do is
kill processes with that file mapped into their address spaces, but future
patches could actually do something about corrupt metadata.
After that, the memory failure needs to notify the processes who are using
those files.
Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that we have a clean operation to update the di_next_unlinked
field of inode cluster buffers, we can easily defer this operation
to transaction commit time so we can order the inode cluster buffer
locking consistently.
To do this, we introduce a new in-memory log item to track the
unlinked list item modification that we are going to make. This
follows the same observations as the in-memory double linked list
used to track unlinked inodes in that the inodes on the list are
pinned in memory and cannot go away, and hence we can simply
reference them for the duration of the transaction without needing
to take active references or pin them or look them up.
This allows us to pass the xfs_inode to the transaction commit code
along with the modification to be made, and then order the logged
modifications via the ->iop_sort and ->iop_precommit operations
for the new log item type. As this is an in-memory log item, it
doesn't have formatting, CIL or AIL operational hooks - it exists
purely to run the inode unlink modifications and is then removed
from the transaction item list and freed once the precommit
operation has run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently attributes are modified directly across one or more
transactions. But they are not logged or replayed in the event of an
error. The goal of log attr replay is to enable logging and replaying
of attribute operations using the existing delayed operations
infrastructure. This will later enable the attributes to become part of
larger multi part operations that also must first be recorded to the
log. This is mostly of interest in the scheme of parent pointers which
would need to maintain an attribute containing parent inode information
any time an inode is moved, created, or removed. Parent pointers would
then be of interest to any feature that would need to quickly derive an
inode path from the mount point. Online scrub, nfs lookups and fs grow
or shrink operations are all features that could take advantage of this.
This patch adds two new log item types for setting or removing
attributes as deferred operations. The xfs_attri_log_item will log an
intent to set or remove an attribute. The corresponding
xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
freed once the transaction is done. Both log items use a generic
xfs_attr_log_format structure that contains the attribute name, value,
flags, inode, and an op_flag that indicates if the operations is a set
or remove.
[dchinner: added extra little bits needed for intent whiteouts]
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a generic dispatch structure to delegate recovery of different
log item types into various code modules. This will enable us to move
code specific to a particular log item type out of xfs_log_recover.c and
into the log item source.
The first operation we virtualize is the log item sorting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
I ran into a linker warning in XFS that originates from a mismatch
between libelf, binutils and objtool when certain files in the kernel
are built with "gcc -g":
x86_64-linux-ld: fs/xfs/xfs_trace.o: unable to initialize decompress status for section .debug_info
After some discussion, nobody could identify why xfs sets this flag
here. CONFIG_XFS_DEBUG used to enable lots of unrelated settings, but
now its main purpose is to enable extra consistency checks and assertions
that are unrelated to the debug info.
Remove the Makefile logic to set the flag here. If anyone relies
on the debug info, this can simply be enabled again with the global
CONFIG_DEBUG_INFO option.
Dave Chinner writes:
I'm pretty sure it was needed for the original kgdb integration back
in the early 2000s. That was when SGI used to patch their XFS dev
tree with kgdb and debug symbols were needed by the custom kgdb
modules that were ported across from the Irix kernel debugger.
ISTR that the early kcrash kernel dump analysis tools (again,
originated from the Irix "icrash" kernel dump tools) had custom XFS
debug scripts that needed also the debug info to work correctly...
Which is a long way of saying "we don't need it anymore" instead of
"nobody knows why it was set"... :)
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20200409074130.GD21033@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create an in-core fake root for AG-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem. It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Userspace now has an identical xfs_trans_inode.c which it has already
moved to libxfs/ so do the same move for kernelspace.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create a parallel iwalk implementation and switch quotacheck to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a new iterator function to simplify walking inodes in an XFS
filesystem. This new iterator will replace the existing open-coded
walking that goes on in various places.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Keep all bmap item related code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all rmap item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all the refcount item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all the extree item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_buf structure is basically used as a glorified container for
a memory allocation in the log recovery code. Replace it with a
call to kmem_alloc_large and a simple abstraction to read into or
write from it synchronously using chained bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].
To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.
Having whitespaces after -I does not matter since commit 48f6e3cf5b
("kbuild: do not drop -I without parameter").
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Teach online scrub how to check the filesystem summary counters. We use
the incore delalloc block counter along with the incore AG headers to
compute expected values for fdblocks, icount, and ifree, and then check
that the percpu counter is within a certain threshold of the expected
value. This is done to avoid having to freeze or otherwise lock the
filesystem, which means that we're only checking that the counters are
fairly close, not that they're exactly correct.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>