6992 Commits

Author SHA1 Message Date
Dave Chinner
89b1f55a29 xfs: collapse AG selection for inode allocation
xfs_dialloc_select_ag() does a lot of repetitive work. It first
calls xfs_ialloc_ag_select() to select the AG to start allocation
attempts in, which can do up to two entire loops across the perags
that inodes can be allocated in. This is simply checking if there is
spce available to allocate inodes in an AG, and it returns when it
finds the first candidate AG.

xfs_dialloc_select_ag() then does it's own iterative walk across
all the perags locking the AGIs and trying to allocate inodes from
the locked AG. It also doesn't limit the search to mp->m_maxagi,
so it will walk all AGs whether they can allocate inodes or not.

Hence if we are really low on inodes, we could do almost 3 entire
walks across the whole perag range before we find an allocation
group we can allocate inodes in or report ENOSPC.

Because xfs_ialloc_ag_select() returns on the first candidate AG it
finds, we can simply do these checks directly in
xfs_dialloc_select_ag() before we lock and try to allocate inodes.
This reduces the inode allocation pass down to 2 perag sweeps at
most - one for aligned inode cluster allocation and if we can't
allocate full, aligned inode clusters anywhere we'll do another pass
trying to do sparse inode cluster allocation.

This also removes a big chunk of duplicate code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
4268547305 xfs: simplify xfs_dialloc_select_ag() return values
The only caller of xfs_dialloc_select_ag() will always return
-ENOSPC to it's caller if the agbp returned from
xfs_dialloc_select_ag() is NULL. IOWs, failure to find a candidate
AGI we can allocate inodes from is always an ENOSPC condition, so
move this logic up into xfs_dialloc_select_ag() so we can simplify
the return logic in this function.

xfs_dialloc_select_ag() now only ever returns 0 with a locked
agbp, or an error with no agbp.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
50f02fe333 xfs: remove agno from btree cursor
Now that everything passes a perag, the agno is not needed anymore.
Convert all the users to use pag->pag_agno instead and remove the
agno from the cursor. This was largely done as an automated search
and replace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
7b13c51551 xfs: use perag for ialloc btree cursors
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
289d38d22c xfs: convert allocbt cursors to use perags
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
a81a06211f xfs: convert refcount btree cursor to use perags
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
fa9c3c1973 xfs: convert rmap btree cursor to using a perag
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
be9fb17d88 xfs: add a perag to the btree cursor
Which will eventually completely replace the agno in it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02 10:48:24 +10:00
Dave Chinner
58d43a7e32 xfs: pass perags around in fsmap data dev functions
Needs a [from, to] ranged AG walk, and the perag to be stuffed into
the info structure for callouts to use.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
30933120ad xfs: push perags through the ag reservation callouts
We currently pass an agno from the AG reservation functions to the
individual feature accounting functions, which in future may have to
do perag lookups to access per-AG state. Instead, pre-emptively
plumb the perag through from the highest AG reservation layer to the
feature callouts so they won't have to look it up again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02 10:48:24 +10:00
Dave Chinner
45d0662117 xfs: pass perags through to the busy extent code
All of the callers of the busy extent API either have perag
references available to use so we can pass a perag to the busy
extent functions rather than having them have to do unnecessary
lookups.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
7f8d3b3ca6 xfs: convert secondary superblock walk to use perags
Clean up the last external manual AG walk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
6f4118fc64 xfs: convert xfs_iwalk to use perag references
Rather than manually walking the ags and passing agnunbers around,
pass the perag for the AG we are currently working on around in the
iwalk structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
934933c3ee xfs: convert raw ag walks to use for_each_perag
Convert the raw walks to an iterator, pulling the current AG out of
pag->pag_agno instead of the loop iterator variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
f250eedcf7 xfs: make for_each_perag... a first class citizen
for_each_perag_tag() is defined in xfs_icache.c for local use.
Promote this to xfs_ag.h and define equivalent iteration functions
so that we can use them to iterate AGs instead to replace open coded
perag walks and perag lookups.

We also convert as many of the straight forward open coded AG walks
to use these iterators as possible. Anything that is not a direct
conversion to an iterator is ignored and will be updated in future
commits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
07b6403a68 xfs: move perag structure and setup to libxfs/xfs_ag.[ch]
Move the xfs_perag infrastructure to the libxfs files that contain
all the per AG infrastructure. This helps set up for passing perags
around all the code instead of bare agnos with minimal extra
includes for existing files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
61aa005a5b xfs: prepare for moving perag definitions and support to libxfs
The perag structures really need to be defined with the rest of the
AG support infrastructure. The struct xfs_perag and init/teardown
has been placed in xfs_mount.[ch] because there are differences in
the structure between kernel and userspace. Mainly that userspace
doesn't have a lot of the internal stuff that the kernel has for
caches and discard and other such structures.

However, it makes more sense to move this to libxfs than to keep
this separation because we are now moving to use struct perags
everywhere in the code instead of passing raw agnumber_t values
about. Hence we shoudl really move the support infrastructure to
libxfs/xfs_ag.[ch].

To do this without breaking userspace, first we need to rearrange
the structures and code so that all the kernel specific code is
located together. This makes it simple for userspace to ifdef out
the all the parts it does not need, minimising the code differences
between kernel and userspace. The next commit will do the move...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner
9bbafc7191 xfs: move xfs_perag_get/put to xfs_ag.[ch]
They are AG functions, not superblock functions, so move them to the
appropriate location.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Darrick J. Wong
20bd8e63f3 xfs: remove unnecessary shifts
The superblock verifier already validates that (1 << blocklog) ==
blocksize, so use the value directly instead of doing math.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-01 12:53:59 -07:00
Darrick J. Wong
a7bcb147fe xfs: clean up open-coded fs block unit conversions
Replace some open-coded fs block unit conversions with the standard
conversion macro.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-01 12:53:59 -07:00
Allison Henderson
4fd084dbbd xfs: Clean up xfs_attr_node_addname_clear_incomplete
We can use the helper function xfs_attr_node_remove_name to reduce
duplicate code in this function

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:49 -07:00
Allison Henderson
0e6acf29db xfs: Remove xfs_attr_rmtval_set
This function is no longer used, so it is safe to remove

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:49 -07:00
Allison Henderson
8f502a4009 xfs: Add delay ready attr set routines
This patch modifies the attr set routines to be delay ready. This means
they no longer roll or commit transactions, but instead return -EAGAIN
to have the calling routine roll and refresh the transaction.  In this
series, xfs_attr_set_args has become xfs_attr_set_iter, which uses a
state machine like switch to keep track of where it was when EAGAIN was
returned. See xfs_attr.h for a more detailed diagram of the states.

Two new helper functions have been added: xfs_attr_rmtval_find_space and
xfs_attr_rmtval_set_blk.  They provide a subset of logic similar to
xfs_attr_rmtval_set, but they store the current block in the delay attr
context to allow the caller to roll the transaction between allocations.
This helps to simplify and consolidate code used by
xfs_attr_leaf_addname and xfs_attr_node_addname. xfs_attr_set_args has
now become a simple loop to refresh the transaction until the operation
is completed.  Lastly, xfs_attr_rmtval_remove is no longer used, and is
removed.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:49:48 -07:00
Allison Henderson
2b74b03c13 xfs: Add delay ready attr remove routines
This patch modifies the attr remove routines to be delay ready. This
means they no longer roll or commit transactions, but instead return
-EAGAIN to have the calling routine roll and refresh the transaction. In
this series, xfs_attr_remove_args is merged with
xfs_attr_node_removename become a new function, xfs_attr_remove_iter.
This new version uses a sort of state machine like switch to keep track
of where it was when EAGAIN was returned. A new version of
xfs_attr_remove_args consists of a simple loop to refresh the
transaction until the operation is completed. A new XFS_DAC_DEFER_FINISH
flag is used to finish the transaction where ever the existing code used
to.

Calls to xfs_attr_rmtval_remove are replaced with the delay ready
version __xfs_attr_rmtval_remove. We will rename
__xfs_attr_rmtval_remove back to xfs_attr_rmtval_remove when we are
done.

xfs_attr_rmtval_remove itself is still in use by the set routines (used
during a rename).  For reasons of preserving existing function, we
modify xfs_attr_rmtval_remove to call xfs_defer_finish when the flag is
set.  Similar to how xfs_attr_remove_args does here.  Once we transition
the set routines to be delay ready, xfs_attr_rmtval_remove is no longer
used and will be removed.

This patch also adds a new struct xfs_delattr_context, which we will use
to keep track of the current state of an attribute operation. The new
xfs_delattr_state enum is used to track various operations that are in
progress so that we know not to repeat them, and resume where we left
off before EAGAIN was returned to cycle out the transaction. Other
members take the place of local variables that need to retain their
values across multiple function calls.  See xfs_attr.h for a more
detailed diagram of the states.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:47 -07:00
Allison Henderson
3f562d092b xfs: Hoist node transaction handling
This patch basically hoists the node transaction handling around the
leaf code we just hoisted.  This will helps setup this area for the
state machine since the goto is easily replaced with a state since it
ends with a transaction roll.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:46 -07:00
Allison Henderson
83c6e70789 xfs: Hoist xfs_attr_leaf_addname
This patch hoists xfs_attr_leaf_addname into the calling function.  The
goal being to get all the code that will require state management into
the same scope. This isn't particularly aesthetic right away, but it is a
preliminary step to merging in the state machine code.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:49:45 -07:00
Allison Henderson
5d954cc09f xfs: Hoist xfs_attr_node_addname
This patch hoists the later half of xfs_attr_node_addname into
the calling function.  We do this because it is this area that
will need the most state management, and we want to keep such
code in the same scope as much as possible

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:45 -07:00
Allison Henderson
6ca5a4a1f5 xfs: Add helper xfs_attr_node_addname_find_attr
This patch separates the first half of xfs_attr_node_addname into a
helper function xfs_attr_node_addname_find_attr.  It also replaces the
restart goto with an EAGAIN return code driven by a loop in the calling
function.  This looks odd now, but will clean up nicly once we introduce
the state machine.  It will also enable hoisting the last state out of
xfs_attr_node_addname with out having to plumb in a "done" parameter to
know if we need to move to the next state or not.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:44 -07:00
Allison Henderson
f0f7c502c7 xfs: Separate xfs_attr_node_addname and xfs_attr_node_addname_clear_incomplete
This patch separate xfs_attr_node_addname into two functions.  This will
help to make it easier to hoist parts of xfs_attr_node_addname that need
state management

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:43 -07:00
Allison Henderson
6286514b63 xfs: Refactor xfs_attr_set_shortform
This patch is actually the combination of patches from the previous
version (v18).  Initially patch 3 hoisted xfs_attr_set_shortform, and
the next added the helper xfs_attr_set_fmt. xfs_attr_set_fmt is similar
the old xfs_attr_set_shortform. It returns 0 when the attr has been set
and no further action is needed. It returns -EAGAIN when shortform has
been transformed to leaf, and the calling function should proceed the
set the attr in leaf form.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:42 -07:00
Allison Henderson
a8490f699f xfs: Add xfs_attr_node_remove_name
This patch pulls a new helper function xfs_attr_node_remove_name out
of xfs_attr_node_remove_step.  This helps to modularize
xfs_attr_node_remove_step which will help make the delayed attribute
code easier to follow

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:48:41 -07:00
Allison Henderson
4126c06e25 xfs: Reverse apply 72b97ea40d
Originally we added this patch to help modularize the attr code in
preparation for delayed attributes and the state machine it requires.
However, later reviews found that this slightly alters the transaction
handling as the helper function is ambiguous as to whether the
transaction is diry or clean.  This may cause a dirty transaction to be
included in the next roll, where previously it had not.  To preserve the
existing code flow, we reverse apply this commit.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:48:19 -07:00
Dave Chinner
e7d236a6fe xfs: move page freeing into _xfs_buf_free_pages()
Rather than open coding it just before we call
_xfs_buf_free_pages(). Also, rename the function to
xfs_buf_free_pages() as the leading underscore has no useful
meaning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner
02c5117386 xfs: merge _xfs_buf_get_pages()
Only called from one place now, so merge it into
xfs_buf_alloc_pages(). Because page array allocation is dependent on
bp->b_pages being null, always ensure that when the pages array is
freed we always set bp->b_pages to null.

Also convert the page array to use kmalloc() rather than
kmem_alloc() so we can use the gfp flags we've already calculated
for the allocation context instead of hard coding KM_NOFS semantics.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner
c9fa563072 xfs: use alloc_pages_bulk_array() for buffers
Because it's more efficient than allocating pages one at a time in a
loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner
07b5c5add4 xfs: use xfs_buf_alloc_pages for uncached buffers
Use the newly factored out page allocation code. This adds
automatic buffer zeroing for non-read uncached buffers.

This also allows us to greatly simply the error handling in
xfs_buf_get_uncached(). Because xfs_buf_alloc_pages() cleans up
partial allocation failure, we can just call xfs_buf_free() in all
error cases now to clean up after failures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:35 +10:00
Dave Chinner
0a683794ac xfs: split up xfs_buf_allocate_memory
Based on a patch from Christoph Hellwig.

This splits out the heap allocation and page allocation portions of
the buffer memory allocation into two separate helper functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:02 +10:00
Dave Chinner
0fe0bbe00a xfs: bunmapi has unnecessary AG lock ordering issues
large directory block size operations are assert failing because
xfs_bunmapi() is not completely removing fragmented directory blocks
like so:

XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677
....
Call Trace:
 xfs_dir2_shrink_inode+0x1a8/0x210
 xfs_dir2_block_to_sf+0x2ae/0x410
 xfs_dir2_block_removename+0x21a/0x280
 xfs_dir_removename+0x195/0x1d0
 xfs_rename+0xb79/0xc50
 ? avc_has_perm+0x8d/0x1a0
 ? avc_has_perm_noaudit+0x9a/0x120
 xfs_vn_rename+0xdb/0x150
 vfs_rename+0x719/0xb50
 ? __lookup_hash+0x6a/0xa0
 do_renameat2+0x413/0x5e0
 __x64_sys_rename+0x45/0x50
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

We are aborting the bunmapi() pass because of this specific chunk of
code:

                /*
                 * Make sure we don't touch multiple AGF headers out of order
                 * in a single transaction, as that could cause AB-BA deadlocks.
                 */
                if (!wasdel && !isrt) {
                        agno = XFS_FSB_TO_AGNO(mp, del.br_startblock);
                        if (prev_agno != NULLAGNUMBER && prev_agno > agno)
                                break;
                        prev_agno = agno;
                }

This is designed to prevent deadlocks in AGF locking when freeing
multiple extents by ensuring that we only ever lock in increasing
AG number order. Unfortunately, this also violates the "bunmapi will
always succeed" semantic that some high level callers depend on,
such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and
xfs_inactive_symlink_rmt().

This AG lock ordering was introduced back in 2017 to fix deadlocks
triggered by generic/299 as reported here:

https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/

This codebase is old enough that it was before we were defering all
AG based extent freeing from within xfs_bunmapi(). THat is, we never
actually lock AGs in xfs_bunmapi() any more - every non-rt based
extent free is added to the defer ops list, as is all BMBT block
freeing. And RT extents are not RT based, so there's no lock
ordering issues associated with them.

Hence this AGF lock ordering code is both broken and dead. Let's
just remove it so that the large directory block code works reliably
again.

Tested against xfs/538 and generic/299 which is the original test
that exposed the deadlocks that this code fixed.

Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27 08:11:24 -07:00
Dave Chinner
991c2c5980 xfs: btree format inode forks can have zero extents
xfs/538 is assert failing with this trace when testing with
directory block sizes of 64kB:

XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608
....
Call Trace:
 xfs_bmap_btree_to_extents+0x2a9/0x470
 ? kmem_cache_alloc+0xe7/0x220
 __xfs_bunmapi+0x4ca/0xdf0
 xfs_bunmapi+0x1a/0x30
 xfs_dir2_shrink_inode+0x71/0x210
 xfs_dir2_block_to_sf+0x2ae/0x410
 xfs_dir2_block_removename+0x21a/0x280
 xfs_dir_removename+0x195/0x1d0
 xfs_remove+0x244/0x460
 xfs_vn_unlink+0x53/0xa0
 ? selinux_inode_unlink+0x13/0x20
 vfs_unlink+0x117/0x220
 do_unlinkat+0x1a2/0x2d0
 __x64_sys_unlink+0x42/0x60
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

This is a check to ensure that the extents have been read into
memory before we are doing a ifork btree manipulation. This assert
is bogus in the above case.

We have a fragmented directory block that has more extents in it
than can fit in extent format, so the inode data fork is in btree
format. xfs_dir2_shrink_inode() asks to remove all remaining 16
filesystem blocks from the inode so it can convert to short form,
and __xfs_bunmapi() removes all the extents. We now have a data fork
in btree format but have zero extents in the fork. This incorrectly
trips the xfs_need_iread_extents() assert because it assumes that an
empty extent btree means the extent tree has not been read into
memory yet. This is clearly not the case with xfs_bunmapi(), as it
has an explicit call to xfs_iread_extents() in it to pull the
extents into memory before it starts unmapping.

Also, the assert directly after this bogus one is:

	ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE);

Which covers the context in which it is legal to call
xfs_bmap_btree_to_extents just fine. Hence we should just remove the
bogus assert as it is clearly wrong and causes a regression.

The returns the test behaviour to the pre-existing assert failure in
xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to
remove all the extents in the range it was asked to unmap.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27 08:11:24 -07:00
Gustavo A. R. Silva
53004ee78d xfs: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix
the following warnings by replacing /* fall through */ comments,
and its variants, with the new pseudo-keyword macro fallthrough:

fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]

Notice that Clang doesn't recognize /* fall through */ comments as
implicit fall-through markings, so in order to globally enable
-Wimplicit-fallthrough for Clang, these comments need to be
replaced with fallthrough; in the whole codebase.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-26 14:51:26 -05:00
Darrick J. Wong
603f000b15 xfs: validate extsz hints against rt extent size when rtinherit is set
The RTINHERIT bit can be set on a directory so that newly created
regular files will have the REALTIME bit set to store their data on the
realtime volume.  If an extent size hint (and EXTSZINHERIT) are set on
the directory, the hint will also be copied into the new file.

As pointed out in previous patches, for realtime files we require the
extent size hint be an integer multiple of the realtime extent, but we
don't perform the same validation on a directory with both RTINHERIT and
EXTSZINHERIT set, even though the only use-case of that combination is
to propagate extent size hints into new realtime files.  This leads to
inode corruption errors when the bad values are propagated.

Because there may be existing filesystems with such a configuration, we
cannot simply amend the inode verifier to trip on these directories and
call it a day because that will cause previously "working" filesystems
to start throwing errors abruptly.  Note that it's valid to have
directories with rtinherit set even if there is no realtime volume, in
which case the problem does not manifest because rtinherit is ignored if
there's no realtime device; and it's possible that someone set the flag,
crashed, repaired the filesystem (which clears the hint on the realtime
file) and continued.

Therefore, mitigate this issue in several ways: First, if we try to
write out an inode with both rtinherit/extszinherit set and an unaligned
extent size hint, turn off the hint to correct the error.  Second, if
someone tries to misconfigure a directory via the fssetxattr ioctl, fail
the ioctl.  Third, reverify both extent size hint values when we
propagate heritable inode attributes from parent to child, to prevent
misconfigurations from spreading.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-24 18:01:04 -07:00
Darrick J. Wong
6b69e48589 xfs: standardize extent size hint validation
While chasing a bug involving invalid extent size hints being propagated
into newly created realtime files, I noticed that the xfs_ioctl_setattr
checks for the extent size hints weren't the same as the ones now
encoded in libxfs and used for validation in repair and mkfs.

Because the checks in libxfs are more stringent than the ones in the
ioctl, it's possible for a live system to set inode flags that
immediately result in corruption warnings.  Specifically, it's possible
to set an extent size hint on an rtinherit directory without checking if
the hint is aligned to the realtime extent size, which makes no sense
since that combination is used only to seed new realtime files.

Replace the open-coded and inadequate checks with the libxfs verifier
versions and update the code comments a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-24 18:01:04 -07:00
Darrick J. Wong
0f9342513c xfs: check free AG space when making per-AG reservations
The new online shrink code exposed a gap in the per-AG reservation
code, which is that we only return ENOSPC to callers if the entire fs
doesn't have enough free blocks.  Except for debugging mode, the
reservation init code doesn't ever check that there's enough free space
in that AG to cover the reservation.

Not having enough space is not considered an immediate fatal error that
requires filesystem offlining because (a) it's shouldn't be possible to
wind up in that state through normal file operations and (b) even if
one did, freeing data blocks would recover the situation.

However, online shrink now needs to know if shrinking would not leave
enough space so that it can abort the shrink operation.  Hence we need
to promote this assertion into an actual error return.

Observed by running xfs/168 with a 1k block size, though in theory this
could happen with any configuration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-05-24 18:01:04 -07:00
Darrick J. Wong
e3c2b04747 xfs: restore old ioctl definitions
These ioctl definitions in xfs_fs.h are part of the userspace ABI and
were mistakenly removed during the 5.13 merge window.

Fixes: 9fefd5db08ce ("xfs: convert to fileattr")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-20 08:31:22 -07:00
Darrick J. Wong
16c9de54dc xfs: fix deadlock retry tracepoint arguments
sc->ip is the inode that's being scrubbed, which means that it's not set
for scrub types that don't involve inodes.  If one of those scrubbers
(e.g. inode btrees) returns EDEADLOCK, we'll trip over the null pointer.
Fix that by reporting either the file being examined or the file that
was used to call scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-20 08:31:22 -07:00
Darrick J. Wong
676a659b60 xfs: retry allocations when locality-based search fails
If a realtime allocation fails because we can't find a sufficiently
large free extent satisfying locality rules, relax the locality rules
and try again.  This reduces the occurrence of short writes to realtime
files when the write size is large and the free space is fragmented.

This was originally discovered by running generic/186 with the realtime
reflink patchset and a 128k cow extent size hint, but the short write
symptoms can manifest with a 128k extent size hint and no reflink, so
apply the fix now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-05-20 08:28:34 -07:00
Darrick J. Wong
9d5e8492ee xfs: adjust rt allocation minlen when extszhint > rtextsize
xfs_bmap_rtalloc doesn't handle realtime extent files with extent size
hints larger than the rt volume's extent size properly, because
xfs_bmap_extsize_align can adjust the offset/length parameters to try to
fit the extent size hint.

Under these conditions, minlen has to be large enough so that any
allocation returned by xfs_rtallocate_extent will be large enough to
cover at least one of the blocks that the caller asked for.  If the
allocation is too short, bmapi_write will return no mapping for the
requested range, which causes ENOSPC errors in other parts of the
filesystem.

Therefore, adjust minlen upwards to fix this.  This can be found by
running generic/263 (g/127 or g/522) with a realtime extent size hint
that's larger than the rt volume extent size.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-05-16 18:45:03 -07:00
Linus Torvalds
05da1f643f More new code for 5.13-rc1:
- Remove the now unused "io_private" field from struct iomap_ioend, for
   a modest savings in memory allocation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCRdGYACgkQ+H93GTRK
 tOuhBxAAptj7EjQivnQno69kWqbhjOOKZOH40BuwMsufq8XhZkn36rnkt7y3P3B8
 WUeWkrBwhFW83UIg+L4Sb7BSCzVXqjqbnCffi8g9MTyuysuHk+3DlAHX0125x1DI
 z/16F+RVBajD6Ee8D3OIIhJQmFiNw7ERhHHbDuwpc+n4Wown1UwzROTp1S8DIvdJ
 LFGi0JzbE1++vngARkRidLjp2digS8fioyw+dIeTzLG+fSgnb00ZdybE/g/b5ZqQ
 PJH/23GFBlo5AuDhxDuhNOzqqC9ensG+n9hUNdaKzxAiYD5T7WSh7y69f/zmZJE/
 xLNgXE76QNtkjGUzeCil9lQ9muQxUNBDNnpHJim4ILI8YwaNuvVbrNpURGskcCrT
 gT1LsAv+8gbcm+SgYE4gAMIEMZlA+uh8qmz+8pDHSMuHHUr2+EUEkWUTY7ioycOW
 dZgZO1ZKYlXk8vRcvGDwbR1dhmv+jR8hWBHfCLpfLOUE6KRTthA6c4JhwnFpddhM
 cSJPKqZ+uGASuDGK3WuJVIuGlYUPRS3Gyj2X2Eg43T3zTe2wz/sAAkLLC2TkSeGj
 QLZEhq/pp2/PWM2LWujdEAiX8zFBJoJjrlR42egNqk27JQ80fVe9fHZruuCYo5SZ
 ftBDXUJRTahhvW6xFrQcdRyoMG8zlvM8dOjQM38GzkuIFCKp8u8=
 =vlas
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more iomap updates from Darrick Wong:
 "Remove the now unused 'io_private' field from struct iomap_ioend, for
  a modest savings in memory allocation"

* tag 'iomap-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: remove unused private field from ioend
2021-05-06 23:54:12 -07:00
Linus Torvalds
af120709b1 More new code for 5.13:
- Rename the log timestamp struct.
 - Remove broken transaction counter debugging that wasn't working
   correctly on very old filesystems.
 - Various fixes to make pre-lazysbcount filesystems work properly again.
 - Fix a free space accounting problem where we neglected to consider
   free space btree blocks that track metadata reservation space when
   deciding whether or not to allow caller to reserve space for
   a metadata update.
 - Fix incorrect pagecache clearing behavior during FUNSHARE ops.
 - Don't allow log writes if the data device is readonly.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCRa+UACgkQ+H93GTRK
 tOvBLw/+PWgbb/sudVRk51f0bN0NgOBHM/pcW918Xo7TrASjxlRFeJit3TBvKiEi
 JqRdeUe8OPk6bhrCk1o1qo1zqK4BxDgsS6hn9/ruZAvG/Rh9oDyFQ9YTwvwRGCEs
 y8aALdlbrCT+4nQ/ORjWlZjTBuuj4N6sT2U21vtqmVjisFkVPhe5FH/Ntd1IXXOs
 FKVU3pC9SsAiEGWIEH+ZmB6ED1PIqFAqOEPDkP3t2UdN7iV3w1LaLBkYJcCHVZHT
 h2OX2bkmnDEuX2HKyMgJBOBrQtq/ZLunP+rfh8EjoBb7zBzToI6pAhH9dbmTarsM
 nV/lydkpSWdy3DIiANEGUpmIOShL5QRf2qwjEnew23scN52xDazZicPNPvEgU/YD
 EVvtOXbvVCzIs9ft3zMm6zhg3u/u07G7k3e08WO5x6SVe7ys5Z0Do7uESePC+3H+
 n9IdN4+EP6RgNPKTRr1NlIuqTYc7wf63vj27QkBr0e7Q2vtoiquBOzrzWgINL90I
 AvLKrMsniMFBSKLayEhLSWXsm/1VxE2QiYRtfe4igMl4Nfu8dHXwezi4Awv70ibI
 tLf0Fjm2CK+CMP4SFa7hUzwQ29ZRqVE43ghlHqnZQtOVG1avZJ3mipIxXeO+O9pJ
 mOgJfZjud5TfsO2dUar1qr+efzCuZ4a/qfVjPlrh0LHJM2sRK5Y=
 =yoyk
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "Except for the timestamp struct renaming patches, everything else in
  here are bug fixes:

   - Rename the log timestamp struct.

   - Remove broken transaction counter debugging that wasn't working
     correctly on very old filesystems.

   - Various fixes to make pre-lazysbcount filesystems work properly
     again.

   - Fix a free space accounting problem where we neglected to consider
     free space btree blocks that track metadata reservation space when
     deciding whether or not to allow caller to reserve space for a
     metadata update.

   - Fix incorrect pagecache clearing behavior during FUNSHARE ops.

   - Don't allow log writes if the data device is readonly"

* tag 'xfs-5.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't allow log writes if the data device is readonly
  xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_range
  xfs: set aside allocation btree blocks from block reservation
  xfs: introduce in-core global counter of allocbt blocks
  xfs: unconditionally read all AGFs on mounts with perag reservation
  xfs: count free space btree blocks when scrubbing pre-lazysbcount fses
  xfs: update superblock counters correctly for !lazysbcount
  xfs: don't check agf_btreeblks on pre-lazysbcount filesystems
  xfs: remove obsolete AGF counter debugging
  xfs: rename struct xfs_legacy_ictimestamp
  xfs: rename xfs_ictimestamp_t
2021-05-06 23:46:46 -07:00
Brian Foster
6e552494fb iomap: remove unused private field from ioend
The only remaining user of ->io_private is the generic ioend merging
infrastructure. The only user of that is XFS, which no longer sets
->io_private or passes an associated merge callback. Remove the
unused parameter and the ->io_private field.

CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-04 08:54:29 -07:00