Commit Graph

8296 Commits

Author SHA1 Message Date
Darrick J. Wong
c473a3320b xfs: stagger the starting AG of scrub iscans to reduce contention
Online directory and parent repairs on parent-pointer equipped
filesystems have shown that starting a large number of parallel iscans
causes a lot of AGI buffer contention.  Try to reduce this by making it
so that iscans scan wrap around the end of the filesystem, and using a
rotor to stagger where each scanner begins.  Surprisingly, this boosts
CPU utilization (on the author's test machines) from effectively
single-threaded to 160%.  Not great, but see the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:46 -08:00
Darrick J. Wong
4e98cc905c xfs: allow scrub to hook metadata updates in other writers
Certain types of filesystem metadata can only be checked by scanning
every file in the entire filesystem.  Specific examples of this include
quota counts, file link counts, and reverse mappings of file extents.
Directory and parent pointer reconstruction may also fall into this
category.  File scanning is much trickier than scanning AG metadata
because we have to take inode locks in the same order as the rest of
[VX]FS, we can't be holding buffer locks when we do that, and scanning
the whole filesystem takes time.

Earlier versions of the online repair patchset relied heavily on
fsfreeze as a means to quiesce the filesystem so that we could take
locks in the proper order without worrying about concurrent updates from
other writers.  Reviewers of those patches opined that freezing the
entire fs to check and repair something was not sufficiently better than
unmounting to run fsck offline.  I don't agree with that 100%, but the
message was clear: find a way to repair things that minimizes the
quiet period where nobody can write to the filesystem.

Generally, building btree indexes online can be split into two phases: a
collection phase where we compute the records that will be put into the
new btree; and a construction phase, where we construct the physical
btree blocks and persist them.  While it's simple to hold resource locks
for the entirety of the two phases to ensure that the new index is
consistent with the rest of the system, we don't need to hold resource
locks during the collection phase if we have a means to receive live
updates of other work going on elsewhere in the system.

The goal of this patch, then, is to enable online fsck to learn about
metadata updates going on in other threads while it constructs a shadow
copy of the metadata records to verify or correct the real metadata.  To
minimize the overhead when online fsck isn't running, we use srcu
notifiers because they prioritize fast access to the notifier call chain
(particularly when the chain is empty) at a cost to configuring
notifiers.  Online fsck should be relatively infrequent, so this is
acceptable.

The intended usage model is fairly simple.  Code that modifies a
metadata structure of interest should declare a xfs_hook_chain structure
in some well defined place, and call xfs_hook_call whenever an update
happens.  Online fsck code should define a struct notifier_block and use
xfs_hook_add to attach the block to the chain, along with a function to
be called.  This function should synchronize with the fsck scanner to
update whatever in-memory data the scanner is collecting.  When
finished, xfs_hook_del removes the notifier from the list and waits for
them all to complete.

Originally, I selected srcu notifiers over blocking notifiers to
implement live hooks because they seemed to have fewer impacts to
scalability.  The per-call cost of srcu_notifier_call_chain is higher
(19ns) than blocking_notifier_ (4ns) in the single threaded case, but
blocking notifiers use an rwsem to stabilize the list.  Cacheline
bouncing for that rwsem is costly to runtime code when there are a lot
of CPUs running regular filesystem operations.  If there are no hooks
installed, this is a total waste of CPU time.

Therefore, I stuck with srcu notifiers, despite trading off single
threaded performance for multithreaded performance.  I also wasn't
thrilled with the very high teardown time for srcu notifiers, since the
caller has to wait for the next rcu grace period.  This can take a long
time if there are a lot of CPUs.

Then I discovered the jump label implementation of static keys.

Jump labels use kernel code patching to replace a branch with a nop sled
when the key is disabled.  IOWs, they can eliminate the overhead of
_call_chain when there are no hooks enabled.  This makes blocking
notifiers competitive again -- scrub runs faster because teardown of the
chain is a lot cheaper, and runtime code only pays the rwsem locking
overhead when scrub is actually running.

With jump labels enabled, calls to empty notifier chains are elided from
the call sites when there are no hooks registered, which means that the
overhead is 0.36ns when fsck is not running.  This is perfect for most
of the architectures that XFS is expected to run on (e.g. x86, powerpc,
arm64, s390x, riscv).

For architectures that don't support jump labels (e.g. m68k) the runtime
overhead of checking the static key is an atomic counter read.  This
isn't great, but it's still cheaper than taking a shared rwsem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:45 -08:00
Darrick J. Wong
8660c7b74a xfs: implement live inode scan for scrub
This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.

The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited.  This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data.  It is critical for the scanning thraad to hold
exclusive access on the inode until after marking the inode visited.

This new code is a separate patch from the patchsets adding callers for
the sake of enabling the author to move patches around his tree with
ease.  The intended usage model for this code is roughly:

	xchk_iscan_start(iscan, 0, 0);
	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
		xfs_ilock(ip, ...);
		/* capture inode metadata */
		xchk_iscan_mark_visited(iscan, ip);
		xfs_iunlock(ip, ...);

		xfs_irele(ip);
	}
	xchk_iscan_stop(iscan);
	if (error)
		return error;

Hook functions for live updates can then do:

	if (xchk_iscan_want_live_update(...))
		/* update the captured inode metadata */

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:45 -08:00
Darrick J. Wong
ae05eb1171 xfs: speed up xfs_iwalk_adjust_start a little bit
Replace the open-coded loop that recomputes freecount with a single call
to a bit weight function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:44 -08:00
Dave Chinner
4b2f459d86 xfs: fix SEEK_HOLE/DATA for regions with active COW extents
A data corruption problem was reported by CoreOS image builders
when using reflink based disk image copies and then converting
them to qcow2 images. The converted images failed the conversion
verification step, and it was isolated down to the fact that
qemu-img uses SEEK_HOLE/SEEK_DATA to find the data it is supposed to
copy.

The reproducer allowed me to isolate the issue down to a region of
the file that had overlapping data and COW fork extents, and the
problem was that the COW fork extent was being reported in it's
entirity by xfs_seek_iomap_begin() and so skipping over the real
data fork extents in that range.

This was somewhat hidden by the fact that 'xfs_bmap -vvp' reported
all the extents correctly, and reading the file completely (i.e. not
using seek to skip holes) would map the file correctly and all the
correct data extents are read. Hence the problem is isolated to just
the xfs_seek_iomap_begin() implementation.

Instrumentation with trace_printk made the problem obvious: we are
passing the wrong length to xfs_trim_extent() in
xfs_seek_iomap_begin(). We are passing the end_fsb, not the
maximum length of the extent we want to trim the map too. Hence the
COW extent map never gets trimmed to the start of the next data fork
extent, and so the seek code treats the entire COW fork extent as
unwritten and skips entirely over the data fork extents in that
range.

Link: https://github.com/coreos/coreos-assembler/issues/3728
Fixes: 60271ab79d ("xfs: fix SEEK_DATA for speculative COW fork preallocation")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 12:31:12 +05:30
Darrick J. Wong
e5a2f47cff xfs: remove xfile_{get,put}_page
These functions aren't used anymore, so get rid of them.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:56 +05:30
Darrick J. Wong
ee13fc6720 xfs: convert xfarray_pagesort to deal with large folios
Convert xfarray_pagesort to handle large folios by introducing a new
xfile_get_folio routine that can return a folio of arbitrary size, and
using heapsort on the full folio.  This also corrects an off-by-one bug
in the calculation of len in xfarray_pagesort that was papered over by
xfarray_want_pagesort.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:55 +05:30
Christoph Hellwig
b2fdfe19df xfs: fix a comment in xfarray.c
xfiles are shmem files, not memfds.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:55 +05:30
Christoph Hellwig
fd3d46e630 xfs: remove xfarray_sortinfo.page_kaddr
Now that xfile pages don't need kmapping, there is no need to cache
the kernel virtual address for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:55 +05:30
Darrick J. Wong
6907e3c00a xfs: add file_{get,put}_folio
Add helper similar to file_{get,set}_page, but which deal with folios
and don't allocate new folio unless explicitly asked to, which map
to shmem_get_folio instead of calling into the aops.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:54 +05:30
Christoph Hellwig
e97d70a573 xfs: use shmem_get_folio in in xfile_load
Switch to using shmem_get_folio in xfile_load instead of using
shmem_read_mapping_page_gfp.  This gets us support for large folios
and also optimized reading from unallocated space, as
shmem_get_folio with SGP_READ won't allocate a page for them just
to zero the content.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:54 +05:30
Christoph Hellwig
fd2634e2dd xfs: use shmem_get_folio in xfile_obj_store
Switch to using shmem_get_folio and manually dirtying the page instead
of abusing aops->write_begin and aops->write_end in xfile_get_page.

This simplifies the code by not doing indirect calls of not actually
exported interfaces that don't really fit the use case very well, and
happens to get us large folio support for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:54 +05:30
Christoph Hellwig
e62e26acc9 xfs: don't allow highmem pages in xfile mappings
XFS is generally used on 64-bit, non-highmem platforms and xfile
mappings are accessed all the time.  Reduce our pain by not allowing
any highmem mappings in the xfile page cache and remove all the kmap
calls for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:53 +05:30
Christoph Hellwig
0e2a24afb9 xfs: don't try to handle non-update pages in xfile_obj_load
shmem_read_mapping_page_gfp always returns an uptodate page or an
ERR_PTR.  Remove the code that tries to handle a non-uptodate page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:53 +05:30
Christoph Hellwig
e47e2e0ba9 xfs: remove the xfile_pread/pwrite APIs
All current and pending xfile users use the xfile_obj_load
and xfile_obj_store API, so make those the actual implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:53 +05:30
Christoph Hellwig
0473635d46 xfs: remove xfile_stat
vfs_getattr is needed to query inode attributes for unknown underlying
file systems.  But shmemfs is well known for users of shmem_file_setup
and shmem_read_mapping_page_gfp that rely on it not needing specific
inode revalidation and having a normal mapping.  Remove the detour
through the getattr method and an extra wrapper, and just read the
inode size and i_bytes directly in the scrub tracing code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:52 +05:30
Christoph Hellwig
a2078df025 xfs: don't modify file and inode flags for shmem files
shmem_file_setup is explicitly intended for a file that can be
fully read and written by kernel users without restrictions.  Don't
poke into internals to change random flags in the file or inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:52 +05:30
Christoph Hellwig
efc9dc0963 xfs: use shmem_kernel_file_setup in xfile_create
shmem_kernel_file_setup is equivalent to shmem_file_setup except that it
already sets the S_PRIVATE flag.  Use it instead of open coding the
logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:52 +05:30
Christoph Hellwig
1b07ea2ab3 xfs: shmem_file_setup can't return NULL
shmem_file_setup always returns a struct file pointer or an ERR_PTR,
so remove the code to check for a NULL return.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:52 +05:30
Christoph Hellwig
b44c0eb8ae xfs: use VM_NORESERVE in xfile_create
xfile_create creates a (potentially large) sparse file.  Pass
VM_NORESERVE to shmem_file_setup to not account for the entire file size
at file creation time.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21 11:36:51 +05:30
Dave Chinner
7d5ba7ca6a xfs: use kvfree in xfs_ioc_getfsmap()
Another incorrect conversion to kfree() instead of kvfree().

Fixes: 4929257613 ("xfs: convert kmem_free() for kvmalloc users to kvfree()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-20 10:38:14 +05:30
Dave Chinner
661723c3bd xfs: use kvfree() in xfs_ioc_attr_list()
Wrongly converted from kmem_free() to kfree().

Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: 4929257613 ("xfs: convert kmem_free() for kvmalloc users to kvfree()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-20 10:38:05 +05:30
Matthew Wilcox (Oracle)
785dd13152 xfs: Remove mrlock wrapper
mrlock was an rwsem wrapper that also recorded whether the lock was
held for read or write.  Now that we can ask the generic code whether
the lock is held for read or write, we can remove this wrapper and use
an rwsem directly.

As the comment says, we can't use lockdep to assert that the ILOCK is
held for write, because we might be in a workqueue, and we aren't able
to tell lockdep that we do in fact own the lock.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19 21:19:33 +05:30
Matthew Wilcox (Oracle)
3fed24fffc xfs: Replace xfs_isilocked with xfs_assert_ilocked
To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't
use the existing ASSERT macro.  Add a new xfs_assert_ilocked() and
convert all the callers.

Fix an apparent bug in xfs_isilocked(): If the caller specifies
XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both
the IOLOCK and the ILOCK are held for write.  xfs_isilocked() only
checked that the ILOCK was held for write.

xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't
defined.  It's a cheap check, so I don't think it's worth defining
it away.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19 21:19:33 +05:30
Christoph Hellwig
49c379d3a7 xfs: use kvfree for buf in xfs_ioc_getbmap
Without this the kernel crashes in kfree for files with a sufficiently
large number of extents.

Fixes: d4c75a1b40 ("xfs: convert remaining kmem_free() to kfree()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19 09:47:31 +05:30
Long Li
e4c3b72a6e xfs: ensure submit buffers on LSN boundaries in error handlers
While performing the IO fault injection test, I caught the following data
corruption report:

 XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
 CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
 Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_corruption_error+0x134/0x150
  xfs_free_ag_extent+0x7d3/0x1130
  __xfs_free_extent+0x201/0x3c0
  xfs_trans_free_extent+0x29b/0xa10
  xfs_extent_free_finish_item+0x2a/0xb0
  xfs_defer_finish_noroll+0x8d1/0x1b40
  xfs_defer_finish+0x21/0x200
  xfs_itruncate_extents_flags+0x1cb/0x650
  xfs_free_eofblocks+0x18f/0x250
  xfs_inactive+0x485/0x570
  xfs_inodegc_worker+0x207/0x530
  process_scheduled_works+0x24a/0xe10
  worker_thread+0x5ac/0xc60
  kthread+0x2cd/0x3c0
  ret_from_fork+0x4a/0x80
  ret_from_fork_asm+0x11/0x20
  </TASK>
 XFS (dm-0): Corruption detected. Unmount and run xfs_repair

After analyzing the disk image, it was found that the corruption was
triggered by the fact that extent was recorded in both inode datafork
and AGF btree blocks. After a long time of reproduction and analysis,
we found that the reason of free sapce btree corruption was that the
AGF btree was not recovered correctly.

Consider the following situation, Checkpoint A and Checkpoint B are in
the same record and share the same start LSN1, buf items of same object
(AGF btree block) is included in both Checkpoint A and Checkpoint B. If
the buf item in Checkpoint A has been recovered and updates metadata LSN
permanently, then the buf item in Checkpoint B cannot be recovered,
because log recovery skips items with a metadata LSN >= the current LSN
of the recovery item. If there is still an inode item in Checkpoint B
that records the Extent X, the Extent X will be recorded in both inode
datafork and AGF btree block after Checkpoint B is recovered. Such
transaction can be seen when allocing enxtent for inode bmap, it record
both the addition of extent to the inode extent list and the removing
extent from the AGF.

  |------------Record (LSN1)------------------|---Record (LSN2)---|
  |-------Checkpoint A----------|----------Checkpoint B-----------|
  |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
  |     Extent X is freed       |     Extent X is allocated       |

After commit 12818d24db ("xfs: rework log recovery to submit buffers
on LSN boundaries") was introduced, we submit buffers on lsn boundaries
during log recovery. The above problem can be avoided under normal paths,
but it's not guaranteed under abnormal paths. Consider the following
process, if an error was encountered after recover buf item in Checkpoint
A and before recover buf item in Checkpoint B, buffers that have been
added to the buffer_list will still be submitted, this violates the
submits rule on lsn boundaries. So buf item in Checkpoint B cannot be
recovered on the next mount due to current lsn of transaction equal to
metadata lsn on disk. The detailed process of the problem is as follows.

First Mount:

  xlog_do_recovery_pass
    error = xlog_recover_process
      xlog_recover_process_data
        xlog_recover_process_ophdr
          xlog_recovery_process_trans
            ...
              /* recover buf item in Checkpoint A */
              xlog_recover_buf_commit_pass2
                xlog_recover_do_reg_buffer
                /* add buffer of agf btree block to buffer_list */
                xfs_buf_delwri_queue(bp, buffer_list)
            ...
            ==> Encounter read IO error and return
    /* submit buffers regardless of error */
    if (!list_empty(&buffer_list))
      xfs_buf_delwri_submit(&buffer_list);

    <buf items of agf btree block in Checkpoint A recovery success>

Second Mount:

  xlog_do_recovery_pass
    error = xlog_recover_process
      xlog_recover_process_data
        xlog_recover_process_ophdr
          xlog_recovery_process_trans
            ...
              /* recover buf item in Checkpoint B */
              xlog_recover_buf_commit_pass2
                /* buffer of agf btree block wouldn't added to
                   buffer_list due to lsn equal to current_lsn */
                if (XFS_LSN_CMP(lsn, current_lsn) >= 0)
                  goto out_release

    <buf items of agf btree block in Checkpoint B wouldn't recovery>

In order to make sure that submits buffers on lsn boundaries in the
abnormal paths, we need to check error status before submit buffers that
have been added from the last record processed. If error status exist,
buffers in the bufffer_list should not be writen to disk.

Canceling the buffers in the buffer_list directly isn't correct, unlike
any other place where write list was canceled, these buffers has been
initialized by xfs_buf_item_init() during recovery and held by buf item,
buf items will not be released in xfs_buf_delwri_cancel(), it's not easy
to solve.

If the filesystem has been shut down, then delwri list submission will
error out all buffers on the list via IO submission/completion and do
all the correct cleanup automatically. So shutting down the filesystem
could prevents buffers in the bufffer_list from being written to disk.

Fixes: 50d5c8d8e9 ("xfs: check LSN ordering for v5 superblocks during recovery")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17 09:34:52 +05:30
Shrikanth Hegde
0164defd0d xfs: remove duplicate ifdefs
when a ifdef is used in the below manner, second one could be considered as
duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

In the xfs code two such patterns were seen. Hence removing these ifdefs.
No functional change is intended here. It only aims to improve code
readability.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17 09:32:32 +05:30
Darrick J. Wong
1149314a16 xfs: disable sparse inode chunk alignment check when there is no alignment
While testing a 64k-blocksize filesystem, I noticed that xfs/709 fails
to rebuild the inode btree with a bunch of "Corruption remains"
messages.  It turns out that when the inode chunk size is smaller than a
single filesystem block, no block alignments constraints are necessary
for inode chunk allocations, and sb_spino_align is zero.  Hence we can
skip the check.

Fixes: dbfbf3bdf6 ("xfs: repair inode btrees")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17 09:18:53 +05:30
Dave Chinner
57b98393b8 xfs: use xfs_defer_alloc a bit more
Noticed by inspection, simple factoring allows the same allocation
routine to be used for both transaction and recovery contexts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:36 +05:30
Dave Chinner
204fae32d5 xfs: clean up remaining GFP_NOFS users
These few remaining GFP_NOFS callers do not need to use GFP_NOFS at
all. They are only called from a non-transactional context or cannot
be accessed from memory reclaim due to other constraints. Hence they
can just use GFP_KERNEL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:35 +05:30
Dave Chinner
c704ecb241 xfs: place the CIL under nofs allocation context
This is core code that needs to run in low memory conditions and
can be triggered from memory reclaim. While it runs in a workqueue,
it really shouldn't be recursing back into the filesystem during
any memory allocation it needs to function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:35 +05:30
Dave Chinner
2c1e31ed5c xfs: place intent recovery under NOFS allocation context
When recovery starts processing intents, all of the initial intent
allocations are done outside of transaction contexts. That means
they need to specifically use GFP_NOFS as we do not want memory
reclaim to attempt to run direct reclaim of filesystem objects while
we have lots of objects added into deferred operations.

Rather than use GFP_NOFS for these specific allocations, just place
the entire intent recovery process under NOFS context and we can
then just use GFP_KERNEL for these allocations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:35 +05:30
Dave Chinner
0b3a76e955 xfs: use GFP_KERNEL in pure transaction contexts
When running in a transaction context, memory allocations are scoped
to GFP_NOFS. Hence we don't need to use GFP_NOFS contexts in pure
transaction context allocations - GFP_KERNEL will automatically get
converted to GFP_NOFS as appropriate.

Go through the code and convert all the obvious GFP_NOFS allocations
in transaction context to use GFP_KERNEL. This further reduces the
explicit use of GFP_NOFS in XFS.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:35 +05:30
Dave Chinner
94a69db236 xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS
In the past we've had problems with lockdep false positives stemming
from inode locking occurring in memory reclaim contexts (e.g. from
superblock shrinkers). Lockdep doesn't know that inodes access from
above memory reclaim cannot be accessed from below memory reclaim
(and vice versa) but there has never been a good solution to solving
this problem with lockdep annotations.

This situation isn't unique to inode locks - buffers are also locked
above and below memory reclaim, and we have to maintain lock
ordering for them - and against inodes - appropriately. IOWs, the
same code paths and locks are taken both above and below memory
reclaim and so we always need to make sure the lock orders are
consistent. We are spared the lockdep problems this might cause
by the fact that semaphores and bit locks aren't covered by lockdep.

In general, this sort of lockdep false positive detection is cause
by code that runs GFP_KERNEL memory allocation with an actively
referenced inode locked. When it is run from a transaction, memory
allocation is automatically GFP_NOFS, so we don't have reclaim
recursion issues. So in the places where we do memory allocation
with inodes locked outside of a transaction, we have explicitly set
them to use GFP_NOFS allocations to prevent lockdep false positives
from being reported if the allocation dips into direct memory
reclaim.

More recently, __GFP_NOLOCKDEP was added to the memory allocation
flags to tell lockdep not to track that particular allocation for
the purposes of reclaim recursion detection. This is a much better
way of preventing false positives - it allows us to use GFP_KERNEL
context outside of transactions, and allows direct memory reclaim to
proceed normally without throwing out false positive deadlock
warnings.

The obvious places that lock inodes and do memory allocation are the
lookup paths and inode extent list initialisation. These occur in
non-transactional GFP_KERNEL contexts, and so can run direct reclaim
and lock inodes.

This patch makes a first path through all the explicit GFP_NOFS
allocations in XFS and converts the obvious ones to GFP_KERNEL |
__GFP_NOLOCKDEP as a first step towards removing explicit GFP_NOFS
allocations from the XFS code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
178231af2b xfs: use an empty transaction for fstrim
We currently use a btree walk in the fstrim code. This requires a
btree cursor and btree cursors are only used inside transactions
except for the fstrim code. This means that all the btree operations
that allocate memory operate in both GFP_KERNEL and GFP_NOFS
contexts.

This causes problems with lockdep being unable to determine the
difference between objects that are safe to lock both above and
below memory reclaim. Free space btree buffers are definitely locked
both above and below reclaim and that means we have to mark all
btree infrastructure allocations with GFP_NOFS to avoid potential
lockdep false positives.

If we wrap this btree walk in an empty cursor, all btree walks are
now done under transaction context and so all allocations inherit
GFP_NOFS context from the tranaction. This enables us to move all
the btree allocations to GFP_KERNEL context and hence help remove
the explicit use of GFP_NOFS in XFS.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
d4c75a1b40 xfs: convert remaining kmem_free() to kfree()
The remaining callers of kmem_free() are freeing heap memory, so
we can convert them directly to kfree() and get rid of kmem_free()
altogether.

This conversion was done with:

$ for f in `git grep -l kmem_free fs/xfs`; do
> sed -i s/kmem_free/kfree/ $f
> done
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
4929257613 xfs: convert kmem_free() for kvmalloc users to kvfree()
Start getting rid of kmem_free() by converting all the cases where
memory can come from vmalloc interfaces to calling kvfree()
directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
afdc115559 xfs: move kmem_to_page()
Move it to the general xfs linux wrapper header file so we can
prepare to remove kmem.h

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
f078d4ea82 xfs: convert kmem_alloc() to kmalloc()
kmem_alloc() is just a thin wrapper around kmalloc() these days.
Convert everything to use kmalloc() so we can get rid of the
wrapper.

Note: the transaction region allocation in xlog_add_to_transaction()
can be a high order allocation. Converting it to use
kmalloc(__GFP_NOFAIL) results in warnings in the page allocation
code being triggered because the mm subsystem does not want us to
use __GFP_NOFAIL with high order allocations like we've been doing
with the kmem_alloc() wrapper for a couple of decades. Hence this
specific case gets converted to xlog_kvmalloc() rather than
kmalloc() to avoid this issue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:34 +05:30
Dave Chinner
10634530f7 xfs: convert kmem_zalloc() to kzalloc()
There's no reason to keep the kmem_zalloc() around anymore, it's
just a thin wrapper around kmalloc(), so lets get rid of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13 18:07:33 +05:30
Darrick J. Wong
881f78f472 xfs: remove conditional building of rt geometry validator functions
I mistakenly turned off CONFIG_XFS_RT in the Kconfig file for arm64
variant of the djwong-wtf git branch.  Unfortunately, it took me a good
hour to figure out that RT wasn't built because this is what got printed
to dmesg:

XFS (sda2): realtime geometry sanity check failed
XFS (sda2): Metadata corruption detected at xfs_sb_read_verify+0x170/0x190 [xfs], xfs_sb block 0x0

Whereas I would have expected:

XFS (sda2): Not built with CONFIG_XFS_RT
XFS (sda2): RT mount failed

The root cause of these problems is the conditional compilation of the
new functions xfs_validate_rtextents and xfs_compute_rextslog that I
introduced in the two commits listed below.  The !RT versions of these
functions return false and 0, respectively, which causes primary
superblock validation to fail, which explains the first message.

Move the two functions to other parts of libxfs that are not
conditionally defined by CONFIG_XFS_RT and remove the broken stubs so
that validation works again.

Fixes: e14293803f ("xfs: don't allow overly small or large realtime volumes")
Fixes: a6a38f309a ("xfs: make rextslog computation consistent with mkfs")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-30 14:04:43 +05:30
Andrey Albershteyn
82ef1a5356 xfs: reset XFS_ATTR_INCOMPLETE filter on node removal
In XFS_DAS_NODE_REMOVE_ATTR case, xfs_attr_mode_remove_attr() sets
filter to XFS_ATTR_INCOMPLETE. The filter is then reset in
xfs_attr_complete_op() if XFS_DA_OP_REPLACE operation is performed.

The filter is not reset though if XFS just removes the attribute
(args->value == NULL) with xfs_attr_defer_remove(). attr code goes
to XFS_DAS_DONE state.

Fix this by always resetting XFS_ATTR_INCOMPLETE filter. The replace
operation already resets this filter in anyway and others are
completed at this step hence don't need it.

Fixes: fdaf1bb3ca ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework")
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-29 13:48:10 +05:30
Dave Chinner
d8d222e09d xfs: read only mounts with fsopen mount API are busted
Recently xfs/513 started failing on my test machines testing "-o
ro,norecovery" mount options. This was being emitted in dmesg:

[ 9906.932724] XFS (pmem0): no-recovery mounts must be read-only.

Turns out, readonly mounts with the fsopen()/fsconfig() mount API
have been busted since day zero. It's only taken 5 years for debian
unstable to start using this "new" mount API, and shortly after this
I noticed xfs/513 had started to fail as per above.

The syscall trace is:

fsopen("xfs", FSOPEN_CLOEXEC)           = 3
mount_setattr(-1, NULL, 0, NULL, 0)     = -1 EINVAL (Invalid argument)
.....
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/pmem0", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "norecovery", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 EINVAL (Invalid argument)
close(3)                                = 0

Showing that the actual mount instantiation (FSCONFIG_CMD_CREATE) is
what threw out the error.

During mount instantiation, we call xfs_fs_validate_params() which
does:

        /* No recovery flag requires a read-only mount */
        if (xfs_has_norecovery(mp) && !xfs_is_readonly(mp)) {
                xfs_warn(mp, "no-recovery mounts must be read-only.");
                return -EINVAL;
        }

and xfs_is_readonly() checks internal mount flags for read only
state. This state is set in xfs_init_fs_context() from the
context superblock flag state:

        /*
         * Copy binary VFS mount flags we are interested in.
         */
        if (fc->sb_flags & SB_RDONLY)
                set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate);

With the old mount API, all of the VFS specific superblock flags
had already been parsed and set before xfs_init_fs_context() is
called, so this all works fine.

However, in the brave new fsopen/fsconfig world,
xfs_init_fs_context() is called from fsopen() context, before any
VFS superblock have been set or parsed. Hence if we use fsopen(),
the internal XFS readonly state is *never set*. Hence anything that
depends on xfs_is_readonly() actually returning true for read only
mounts is broken if fsopen() has been used to mount the filesystem.

Fix this by moving this internal state initialisation to
xfs_fs_fill_super() before we attempt to validate the parameters
that have been set prior to the FSCONFIG_CMD_CREATE call being made.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Fixes: 73e5fff98b ("xfs: switch to use the new mount-api")
cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-22 11:33:57 +05:30
Linus Torvalds
ec2d264ae4 Bug fixes for 6.8:
** Bug fixes
    1. Fix per-inode space accounting bug.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZZ93aAAKCRAH7y4RirJu
 9JWwAQCl8jP5OdseJmCHSdCKrw3p1LGyWys7XIVgBQ/XLhAVhAEAu3dZEGqEOTsj
 Md+13Z/b9cPKSzeihjm9P7WJFJX+rgs=
 =Y5iY
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.8-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Chandan Babu:

 - Fix per-inode space accounting bug

* tag 'xfs-6.8-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix backwards logic in xfs_bmap_alloc_account
2024-01-19 09:57:08 -08:00
Darrick J. Wong
d61b40bf15 xfs: fix backwards logic in xfs_bmap_alloc_account
We're only allocating from the realtime device if the inode is marked
for realtime and we're /not/ allocating into the attr fork.

Fixes: 5864346054 ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-11 10:34:01 +05:30
Linus Torvalds
a05aea98d4 sysctl-6.8-rc1
To help make the move of sysctls out of kernel/sysctl.c not incur a size
 penalty sysctl has been changed to allow us to not require the sentinel, the
 final empty element on the sysctl array. Joel Granados has been doing all this
 work. On the v6.6 kernel we got the major infrastructure changes required to
 support this. For v6.7 we had all arch/ and drivers/ modified to remove
 the sentinel. For v6.8-rc1 we get a few more updates for fs/ directory only.
 The kernel/ directory is left but we'll save that for v6.9-rc1 as those patches
 are still being reviewed. After that we then can expect also the removal of the
 no longer needed check for procname == NULL.
 
 Let us recap the purpose of this work:
 
   - this helps reduce the overall build time size of the kernel and run time
     memory consumed by the kernel by about ~64 bytes per array
   - the extra 64-byte penalty is no longer inncurred now when we move sysctls
     out from kernel/sysctl.c to their own files
 
 Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to see further
 work by Thomas Weißschuh with the constificatin of the struct ctl_table.
 
 Due to Joel Granados's work, and to help bring in new blood, I have suggested
 for him to become a maintainer and he's accepted. So for v6.9-rc1 I look forward
 to seeing him sent you a pull request for further sysctl changes. This also
 removes Iurii Zaikin as a maintainer as he has moved on to other projects and
 has had no time to help at all.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmWdWDESHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinjJAP/jTNNoyzWisvrrvmXqR5txFGLOE+wW6x
 Xv9avuiM+DTHsH/wK8CkXEivwDqYNAZEHU7NEcolS5bJX/ddSRwN9b5aSVlCrUdX
 Ab4rXmpeSCNFp9zNszWJsDuBKIqjvsKw7qGleGtgZ2qAUHbbH30VROLWCggaee50
 wU3icDLdwkasxrcMXy4Sq5dT5wYC4j/QelqBGIkYPT14Arl1im5zqPZ95gmO/s/6
 mdicTAmq+hhAUfUBJBXRKtsvxY6CItxe55Q4fjpncLUJLHUw+VPVNoBKFWJlBwlh
 LO3liKFfakPSkil4/en+/+zuMByd0JBkIzIJa+Kk5kjpbHRhK0RkmU4+Y5G5spWN
 jjLfiv6RxInNaZ8EWQBMfjE95A7PmYDQ4TOH08+OvzdDIi6B0BB5tBGQpG9BnyXk
 YsLg1Uo4CwE/vn1/a9w0rhadjUInvmAryhb/uSJYFz/lmApLm2JUpY3/KstwGetb
 z+HmLstJb24Djkr6pH8DcjhzRBHeWQ5p0b4/6B+v1HqAUuEhdbyw1F2GrDywyF3R
 h/UOAaKLm1+ffdA246o9TejKiDU96qEzzXMaCzPKyestaRZuiyuYEMDhYbvtsMV5
 zIdMJj5HQ+U1KHDv4IN99DEj7+/vjE3f4Sjo+POFpQeQ8/d+fxpFNqXVv449dgnb
 6xEkkxsR0ElM
 =2qBt
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "To help make the move of sysctls out of kernel/sysctl.c not incur a
  size penalty sysctl has been changed to allow us to not require the
  sentinel, the final empty element on the sysctl array. Joel Granados
  has been doing all this work.

  In the v6.6 kernel we got the major infrastructure changes required to
  support this. For v6.7 we had all arch/ and drivers/ modified to
  remove the sentinel. For v6.8-rc1 we get a few more updates for fs/
  directory only.

  The kernel/ directory is left but we'll save that for v6.9-rc1 as
  those patches are still being reviewed. After that we then can expect
  also the removal of the no longer needed check for procname == NULL.

  Let us recap the purpose of this work:

   - this helps reduce the overall build time size of the kernel and run
     time memory consumed by the kernel by about ~64 bytes per array

   - the extra 64-byte penalty is no longer inncurred now when we move
     sysctls out from kernel/sysctl.c to their own files

  Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to
  see further work by Thomas Weißschuh with the constificatin of the
  struct ctl_table.

  Due to Joel Granados's work, and to help bring in new blood, I have
  suggested for him to become a maintainer and he's accepted. So for
  v6.9-rc1 I look forward to seeing him sent you a pull request for
  further sysctl changes. This also removes Iurii Zaikin as a maintainer
  as he has moved on to other projects and has had no time to help at
  all"

* tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: remove struct ctl_path
  sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR
  coda: Remove the now superfluous sentinel elements from ctl_table array
  sysctl: Remove the now superfluous sentinel elements from ctl_table array
  fs: Remove the now superfluous sentinel elements from ctl_table array
  cachefiles: Remove the now superfluous sentinel element from ctl_table array
  sysclt: Clarify the results of selftest run
  sysctl: Add a selftest for handling empty dirs
  sysctl: Fix out of bounds access for empty sysctl registers
  MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl
  MAINTAINERS: remove Iurii Zaikin from proc sysctl
2024-01-10 17:44:36 -08:00
Linus Torvalds
12958e9c4c New code for 6.8:
* New features/functionality
     * Online repair
       * Reserve disk space for online repairs.
       * Fix misinteraction between the AIL and btree bulkloader because of
         which the bulk load fails to queue a buffer for writeback if it
         happens to be on the AIL list.
       * Prevent transaction reservation overflows when reaping blocks during
         online repair.
       * Whenever possible, bulkloader now copies multiple records into a
         block.
       * Support repairing of
         1. Per-AG free space, inode and refcount btrees.
 	2. Ondisk inodes.
 	3. File data and attribute fork mappings.
       * Verify the contents of
         1. Inode and data fork of realtime bitmap file.
 	2. Quota files.
     * Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks about
       a pmem device being removed.
 
   * Bug fixes
     * Fix memory leak of recovered attri intent items.
     * Fix UAF during log intent recovery.
     * Fix realtime geometry integer overflows.
     * Prevent scrub from live locking in xchk_iget.
     * Prevent fs shutdown when removing files during low free disk space.
     * Prevent transaction reservation overflow when extending an RT device.
     * Prevent incorrect warning from being printed when extending a
       filesystem.
     * Fix an off-by-one error in xreap_agextent_binval.
     * Serialize access to perag radix tree during deletion operation.
     * Fix perag memory leak during growfs.
     * Allow allocation of minlen realtime extent when the maximum sized
       realtime free extent is minlen in size.
 
   * Cleanups
     * Remove duplicate boilerplate code spread across functionality associated
       with different log items.
     * Cleanup resblks interfaces.
     * Pass defer ops pointer to defer helpers instead of an enum.
     * Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings.
     * Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size of
       structures and structure member offsets. This is done in order to be
       able to share the code with userspace.
     * Move XFS documentation under a new directory specific to XFS.
     * Do not invoke deferred ops' ->create_done callback if the deferred
       operation does not have an intent item associated with it.
     * Remove duplicate inclusion of header files from scrub/health.c.
     * Refactor Realtime code.
     * Cleanup attr code.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZZJQbwAKCRAH7y4RirJu
 9JjkAP9Zg0QZNmAMsZwvgEBbuF/OnHKl4GmPA5uq0jPmSWCOqAEA0HjlOmuNfQWn
 93fIw6CPbt+9QCluTYBwUisKLIJ/wgA=
 =qmO0
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:
 "New features/functionality:
   - Online repair:
       - Reserve disk space for online repairs
       - Fix misinteraction between the AIL and btree bulkloader because
         of which the bulk load fails to queue a buffer for writeback if
         it happens to be on the AIL list
       - Prevent transaction reservation overflows when reaping blocks
         during online repair
       - Whenever possible, bulkloader now copies multiple records into
         a block
       - Support repairing of
           1. Per-AG free space, inode and refcount btrees
           2. Ondisk inodes
           3. File data and attribute fork mappings
       - Verify the contents of
           1. Inode and data fork of realtime bitmap file
           2. Quota files
   - Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks
     about a pmem device being removed

  Bug fixes:
   - Fix memory leak of recovered attri intent items
   - Fix UAF during log intent recovery
   - Fix realtime geometry integer overflows
   - Prevent scrub from live locking in xchk_iget
   - Prevent fs shutdown when removing files during low free disk space
   - Prevent transaction reservation overflow when extending an RT
     device
   - Prevent incorrect warning from being printed when extending a
     filesystem
   - Fix an off-by-one error in xreap_agextent_binval
   - Serialize access to perag radix tree during deletion operation
   - Fix perag memory leak during growfs
   - Allow allocation of minlen realtime extent when the maximum sized
     realtime free extent is minlen in size

  Cleanups:
   - Remove duplicate boilerplate code spread across functionality
     associated with different log items
   - Cleanup resblks interfaces
   - Pass defer ops pointer to defer helpers instead of an enum
   - Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings
   - Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size
     of structures and structure member offsets. This is done in order
     to be able to share the code with userspace
   - Move XFS documentation under a new directory specific to XFS
   - Do not invoke deferred ops' ->create_done callback if the deferred
     operation does not have an intent item associated with it
   - Remove duplicate inclusion of header files from scrub/health.c
   - Refactor Realtime code
   - Cleanup attr code"

* tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (123 commits)
  xfs: use the op name in trace_xlog_intent_recovery_failed
  xfs: fix a use after free in xfs_defer_finish_recovery
  xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts
  xfs: remove xfs_attr_sf_hdr_t
  xfs: remove struct xfs_attr_shortform
  xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue
  xfs: remove xfs_attr_shortform_lookup
  xfs: simplify xfs_attr_sf_findname
  xfs: move the xfs_attr_sf_lookup tracepoint
  xfs: return if_data from xfs_idata_realloc
  xfs: make if_data a void pointer
  xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc
  xfs: simplify and optimize the RT allocation fallback cascade
  xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc
  xfs: remove XFS_RTMIN/XFS_RTMAX
  xfs: remove rt-wrappers from xfs_format.h
  xfs: factor out a xfs_rtalloc_sumlevel helper
  xfs: tidy up xfs_rtallocate_extent_exact
  xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block
  xfs: reflow the tail end of xfs_rtallocate_extent_block
  ...
2024-01-10 08:45:22 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Linus Torvalds
3f6984e730 vfs-6.8.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc
 osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1
 /wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc=
 =iFAQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs super updates from Christian Brauner:
 "This contains the super work for this cycle including the long-awaited
  series by Jan to make it possible to prevent writing to mounted block
  devices:

   - Writing to mounted devices is dangerous and can lead to filesystem
     corruption as well as crashes. Furthermore syzbot comes with more
     and more involved examples how to corrupt block device under a
     mounted filesystem leading to kernel crashes and reports we can do
     nothing about. Add tracking of writers to each block device and a
     kernel cmdline argument which controls whether other writeable
     opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
     allowed.

     Note that this effectively only prevents modification of the
     particular block device's page cache by other writers. The actual
     device content can still be modified by other means - e.g. by
     issuing direct scsi commands, by doing writes through devices lower
     in the storage stack (e.g. in case loop devices, DM, or MD are
     involved) etc. But blocking direct modifications of the block
     device page cache is enough to give filesystems a chance to perform
     data validation when loading data from the underlying storage and
     thus prevent kernel crashes.

     Syzbot can use this cmdline argument option to avoid uninteresting
     crashes. Also users whose userspace setup does not need writing to
     mounted block devices can set this option for hardening. We expect
     that this will be interesting to quite a few workloads.

     Btrfs is currently opted out of this because they still haven't
     merged patches we require for this to work from three kernel
     releases ago.

   - Reimplement block device freezing and thawing as holder operations
     on the block device.

     This allows us to extend block device freezing to all devices
     associated with a superblock and not just the main device. It also
     allows us to remove get_active_super() and thus another function
     that scans the global list of superblocks.

     Freezing via additional block devices only works if the filesystem
     chooses to use @fs_holder_ops for these additional devices as well.
     That currently only includes ext4 and xfs.

     Earlier releases switched get_tree_bdev() and mount_bdev() to use
     @fs_holder_ops. The remaining nilfs2 open-coded version of
     mount_bdev() has been converted to rely on @fs_holder_ops as well.
     So block device freezing for the main block device will continue to
     work as before.

     There should be no regressions in functionality. The only special
     case is btrfs where block device freezing for the main block device
     never worked because sb->s_bdev isn't set. Block device freezing
     for btrfs can be fixed once they can switch to @fs_holder_ops but
     that can happen whenever they're ready"

* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  block: Fix a memory leak in bdev_open_by_dev()
  super: don't bother with WARN_ON_ONCE()
  super: massage wait event mechanism
  ext4: Block writes to journal device
  xfs: Block writes to log device
  fs: Block writes to mounted block devices
  btrfs: Do not restrict writes to btrfs devices
  block: Add config option to not allow writing to mounted devices
  block: Remove blkdev_get_by_*() functions
  bcachefs: Convert to bdev_open_by_path()
  fs: handle freezing from multiple devices
  fs: remove dead check
  nilfs2: simplify device handling
  fs: streamline thaw_super_locked
  ext4: simplify device handling
  xfs: simplify device handling
  fs: simplify setup_bdev_super() calls
  blkdev: comment fs_holder_ops
  porting: document block device freeze and thaw changes
  fs: remove unused helper
  ...
2024-01-08 10:43:51 -08:00
Christoph Hellwig
bcdfae6ee5 xfs: use the op name in trace_xlog_intent_recovery_failed
Instead of tracing the address of the recovery handler, use the name
in the defer op, similar to other defer ops related tracepoints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29 13:37:05 +05:30