IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Various directory functions do not modify their @name parameter,
so mark it const to make that clear. This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
XFS does not reserve quota for directory expansion when renaming
children into a directory. This means that we don't reject the
expansion with EDQUOT when we're at or near a hard limit, which means
that unprivileged userspace can use rename() to exceed quota.
Rename operations don't always expand the target directory, and we allow
a rename to proceed with no space reservation if we don't need to add a
block to the target directory to handle the addition. Moreover, the
unlink operation on the source directory generally does not expand the
directory (you'd have to free a block and then cause a btree split) and
it's probably of little consequence to leave the corner case that
renaming a file out of a directory can increase its size.
As with link and unlink, there is a further bug in that we do not
trigger the blockgc workers to try to clear space when we're out of
quota.
Because rename is its own special tricky animal, we'll patch xfs_rename
directly to reserve quota to the rename transaction. We'll leave
cleaning up the rest of xfs_rename for the metadata directory tree
patchset.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory. This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.
The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.
Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.
To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory. If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode. This should prevent
quota usage overruns with the least restriction in functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYdRCkgAKCRCRxhvAZXjc
olrvAQCdp8LWkT8TauJSl8wmUm3mZhNy+5+fXuCUSwe3PyUtTQEAq4fxm41JpG8u
WCZTrrxVhaXwgUY3aWzzeQnLCZjtEQw=
=woqV
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fs idmapping updates from Christian Brauner:
"This contains the work to enable the idmapping infrastructure to
support idmapped mounts of filesystems mounted with an idmapping.
In addition this contains various cleanups that avoid repeated
open-coding of the same functionality and simplify the code in quite a
few places.
We also finish the renaming of the mapping helpers we started a few
kernel releases back and move them to a dedicated header to not
continue polluting the fs header needlessly with low-level idmapping
helpers. With this series the fs header only contains idmapping
helpers that interact with fs objects.
Currently we only support idmapped mounts for filesystems mounted
without an idmapping themselves. This was a conscious decision
mentioned in multiple places (cf. [1]).
As explained at length in [3] it is perfectly fine to extend support
for idmapped mounts to filesystem's mounted with an idmapping should
the need arise. The need has been there for some time now (cf. [2]).
Before we can port any filesystem that is mountable with an idmapping
to support idmapped mounts in the coming cycles, we need to first
extend the mapping helpers to account for the filesystem's idmapping.
This again, is explained at length in our documentation at [3] and
also in the individual commit messages so here's an overview.
Currently, the low-level mapping helpers implement the remapping
algorithms described in [3] in a simplified manner as we could rely on
the fact that all filesystems supporting idmapped mounts are mounted
without an idmapping.
In contrast, filesystems mounted with an idmapping are very likely to
not use an identity mapping and will instead use a non-identity
mapping. So the translation step from or into the filesystem's
idmapping in the remapping algorithm cannot be skipped for such
filesystems.
Non-idmapped filesystems and filesystems not supporting idmapped
mounts are unaffected by this change as the remapping algorithms can
take the same shortcut as before. If the low-level helpers detect that
they are dealing with an idmapped mount but the underlying filesystem
is mounted without an idmapping we can rely on the previous shortcut
and can continue to skip the translation step from or into the
filesystem's idmapping. And of course, if the low-level helpers detect
that they are not dealing with an idmapped mount they can simply
return the relevant id unchanged; no remapping needs to be performed
at all.
These checks guarantee that only the minimal amount of work is
performed. As before, if idmapped mounts aren't used the low-level
helpers are idempotent and no work is performed at all"
Link: 2ca4dcc490 ("fs/mount_setattr: tighten permission checks") [1]
Link: https://github.com/containers/podman/issues/10374 [2]
Link: Documentations/filesystems/idmappings.rst [3]
Link: a65e58e791 ("fs: document and rename fsid helpers") [4]
* tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: support mapped mounts of mapped filesystems
fs: add i_user_ns() helper
fs: port higher-level mapping helpers
fs: remove unused low-level mapping helpers
fs: use low-level mapping helpers
docs: update mapping documentation
fs: account for filesystem mappings
fs: tweak fsuidgid_has_mapping()
fs: move mapping helpers
fs: add is_idmapped_mnt() helper
Enable the mapped_fs{g,u}id() helpers to support filesystems mounted
with an idmapping. Apart from core mapping helpers that use
mapped_fs{g,u}id() to initialize struct inode's i_{g,u}id fields xfs is
the only place that uses these low-level helpers directly.
The patch only extends the helpers to be able to take the filesystem
idmapping into account. Since we don't actually yet pass the
filesystem's idmapping in no functional changes happen. This will happen
in a final patch.
Link: https://lore.kernel.org/r/20211123114227.3124056-9-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-9-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This ASSERT in xfs_rename is a) incorrect, because
(RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and
b) unnecessary, because actual invalid flag combinations are already
handled at the vfs level in do_renameat2() before we get called.
So, remove it.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use swap() in order to make code cleaner. Issue found by coccinelle.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Remove these typedefs by referencing kmem_cache directly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
- Fix a potential log livelock on busy filesystems when there's so much
work going on that we can't finish a quotaoff before filling up the log
by removing the ability to disable quota accounting.
- Introduce the ability to use per-CPU data structures in XFS so that
we can do a better job of maintaining CPU locality for certain
operations.
- Defer inode inactivation work to per-CPU lists, which will help us
batch that processing. Deletions of large sparse files will *appear*
to run faster, but all that means is that we've moved the work to the
backend.
- Drop the EXPERIMENTAL warnings from the y2038+ support and the inode
btree counters, since it's been nearly a year and no complaints have
come in.
- Remove more of our bespoke kmem* variants in favor of using the
standard Linux calls.
- Prepare for the addition of log incompat features in upcoming cycles
by actually adding code to support this.
- Small cleanups of the xattr code in preparation for landing support
for full logging of extended attribute updates in a future cycle.
- Replace the various log shutdown state and flag code all over xfs
with a single atomic bit flag.
- Fix a serious log recovery bug where log item replay can be skipped
based on the start lsn of a transaction even though the transaction
commit lsn is the key data point for that by enforcing start lsns to
appear in the log in the same order as commit lsns.
- Enable pipelining in the code that pushes log items to disk.
- Drop ->writepage.
- Fix some bugs in GETFSMAP where the last fsmap record reported for a
device could extend beyond the end of the device, and a separate bug
where query keys for one device could be applied to another.
- Don't let GETFSMAP query functions edit their input parameters.
- Small cleanups to the scrub code's handling of perag structures.
- Small cleanups to the incore inode tree walk code.
- Constify btree function parameters that aren't changed, so that there
will never again be confusion about range query functions changing
their input parameters.
- Standardize the format and names of tracepoint data attributes.
- Clean up all the mount state and feature flags to use wrapped bitset
functions instead of inconsistently open-coded flag checks.
- Fix some confusion between xfs_buf hash table key variable vs. block
number.
- Fix a mis-interaction with iomap where we reported shared delalloc
cow fork extents to iomap, which would cause the iomap unshare
operation to return IO errors unnecessarily.
- Fix DONTCACHE behavior.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEnwqcACgkQ+H93GTRK
tOtpZg/9G1RD9oDbVhKJy67bxkeLPX990dUtQFhcVjL3AMMyCJez2PBTqkQY3tL9
WDQveIF0UL5TjP5QUO2/6fncIXBmf5yXtinkfeQwkvkStb/yxs10zlpn2ZDEvJ7H
EUWwkV3cBY6Q+ftJIfXJmNW6eCcaxYs6KFiBwodbcoBxy2dIx6KFBQuqwtxOA97s
ZYfv1mPGOIg6AVJN9oxFWtF36qM8loFDNQeZj1ATfCsP25VNHbQf7YOFnJEnwLOB
rzz2zKQ3lP0hWavA6M2lX+IGymDphngx7qe4lZYcjAsh2BzL0IZf0QmFrXGQKuY/
kD0dWeStM8OHQbqCdkYx4XxcjucvJ7qmIYCtrWdpFqrrrQHygaJW6nI8LgsNTdvb
OPXpPPz58jdGY3ATaRYX/IFmpJExj655ZHUfpkeVGacBTa5KCVDykYKv1eYOfNsk
Aj+bZ4g++bx3dlGFHGsPScRn+hwg5h/+UyQJpAYupuaUsq3rpBhH/bhAJNyPUsYu
ej8LIeAWB3EPLozT4ewop8G0WWDBOe0MlYeO5gQho2AfFZzFInf15cSR62KZqx+v
XTZgITnnp0ND4wzgqAhgdU4USS9z5MtHGvhSkuYejg85R/bKirrwRu2P0n681sHv
UioiIVbXGWSAJqDQicfSjncafS3POIAUmMt4tgmDI33/3mTKwZQ=
=HPJr
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There's a lot in this cycle.
Starting with bug fixes: To avoid livelocks between the logging code
and the quota code, we've disabled the ability of quotaoff to turn off
quota accounting. (Admins can still disable quota enforcement, but
truly turning off accounting requires a remount.) We've tried to do
this in a careful enough way that there shouldn't be any user visible
effects aside from quotaoff no longer randomly hanging the system.
We've also fixed some bugs in runtime log behavior that could trip up
log recovery if (otherwise unrelated) transactions manage to start and
commit concurrently; some bugs in the GETFSMAP ioctl where we would
incorrectly restrict the range of records output if the two xfs
devices are of different sizes; a bug that resulted in fallocate
funshare failing unnecessarily; and broken behavior in the xfs inode
cache when DONTCACHE is in play.
As for new features: we now batch inode inactivations in percpu
background threads, which sharply decreases frontend thread wait time
when performing file deletions and should improve overall directory
tree deletion times. This eliminates both the problem where closing an
unlinked file (especially on a frozen fs) can stall for a long time,
and should also ease complaints about direct reclaim bogging down on
unlinked file cleanup.
Starting with this release, we've enabled pipelining of the XFS log.
On workloads with high rates of metadata updates to different shards
of the filesystem, multiple threads can be used to format committed
log updates into log checkpoints.
Lastly, with this release, two new features have graduated to
supported status: inode btree counters (for faster mounts), and
support for dates beyond Y2038. Expect these to be enabled by default
in a future release of xfsprogs.
Summary:
- Fix a potential log livelock on busy filesystems when there's so
much work going on that we can't finish a quotaoff before filling
up the log by removing the ability to disable quota accounting.
- Introduce the ability to use per-CPU data structures in XFS so that
we can do a better job of maintaining CPU locality for certain
operations.
- Defer inode inactivation work to per-CPU lists, which will help us
batch that processing. Deletions of large sparse files will
*appear* to run faster, but all that means is that we've moved the
work to the backend.
- Drop the EXPERIMENTAL warnings from the y2038+ support and the
inode btree counters, since it's been nearly a year and no
complaints have come in.
- Remove more of our bespoke kmem* variants in favor of using the
standard Linux calls.
- Prepare for the addition of log incompat features in upcoming
cycles by actually adding code to support this.
- Small cleanups of the xattr code in preparation for landing support
for full logging of extended attribute updates in a future cycle.
- Replace the various log shutdown state and flag code all over xfs
with a single atomic bit flag.
- Fix a serious log recovery bug where log item replay can be skipped
based on the start lsn of a transaction even though the transaction
commit lsn is the key data point for that by enforcing start lsns
to appear in the log in the same order as commit lsns.
- Enable pipelining in the code that pushes log items to disk.
- Drop ->writepage.
- Fix some bugs in GETFSMAP where the last fsmap record reported for
a device could extend beyond the end of the device, and a separate
bug where query keys for one device could be applied to another.
- Don't let GETFSMAP query functions edit their input parameters.
- Small cleanups to the scrub code's handling of perag structures.
- Small cleanups to the incore inode tree walk code.
- Constify btree function parameters that aren't changed, so that
there will never again be confusion about range query functions
changing their input parameters.
- Standardize the format and names of tracepoint data attributes.
- Clean up all the mount state and feature flags to use wrapped
bitset functions instead of inconsistently open-coded flag checks.
- Fix some confusion between xfs_buf hash table key variable vs.
block number.
- Fix a mis-interaction with iomap where we reported shared delalloc
cow fork extents to iomap, which would cause the iomap unshare
operation to return IO errors unnecessarily.
- Fix DONTCACHE behavior"
* tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (103 commits)
xfs: fix I_DONTCACHE
xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
xfs: fix perag structure refcounting error when scrub fails
xfs: rename buffer cache index variable b_bn
xfs: convert bp->b_bn references to xfs_buf_daddr()
xfs: introduce xfs_buf_daddr()
xfs: kill xfs_sb_version_has_v3inode()
xfs: introduce xfs_sb_is_v5 helper
xfs: remove unused xfs_sb_version_has wrappers
xfs: convert xfs_sb_version_has checks to use mount features
xfs: convert scrub to use mount-based feature checks
xfs: open code sb verifier feature checks
xfs: convert xfs_fs_geometry to use mount feature checks
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
xfs: convert remaining mount flags to state flags
xfs: convert mount flags to features
xfs: consolidate mount option features in m_features
xfs: replace xfs_sb_version checks with feature flag checks
xfs: reflect sb features in xfs_mount
xfs: rework attr2 feature and mount options
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
=pDBR
-----END PGP SIGNATURE-----
Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.
This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"
* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex
Remove the shouty macro and instead use the inline function that
matches other state/feature check wrapper naming. This conversion
was done with sed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.
Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Replace m_flags feature checks with xfs_has_<feature>() calls and
rework the setup code to set flags in m_features.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.
Large parts of this conversion were done with sed with commands like
this:
for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done
With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.
The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:
$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
If we don't need to inactivate an inode, we can detach the dquots and
move on to reclamation. This isn't strictly required here; it's a
preparation patch for deferred inactivation per reviewer request[1] to
move the creation of xfs_inode_needs_inactivation into a separate
change. Eventually this !need_inactive chunk will turn into the code
path for inodes that skip xfs_inactive and go straight to memory
reclaim.
[1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Convert places in XFS that take MMAPLOCK for two inodes to use helper
VFS provides for it (filemap_invalidate_down_write_two()). Note that
this changes lock ordering for MMAPLOCK from inode number based ordering
to pointer based ordering VFS generally uses.
CC: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
purpose of invalidate_lock is exactly the same. Note that the locking in
__xfs_filemap_fault() slightly changes as filemap_fault() already takes
invalidate_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
CC: <linux-xfs@vger.kernel.org>
CC: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Introduce a new __xfs_rwsem_islocked predicate to encapsulate checking
the state of a rw_semaphore, then refactor xfs_isilocked to use it.
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Suggested-by: Dave Chinner <dchinner@redhat.com>
Suggested-by: Eric Sandeen <sandeen@redhat.com>
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
- Refactor the buffer cache to use bulk page allocation
- Convert agnumber-based AG iteration to walk per-AG structures
- Clean up some unit conversions and other code warts
- Reduce spinlock contention in the directio fastpath
- Collapse all the inode cache walks into a single function
- Remove indirect function calls from the inode cache walk code
- Dramatically reduce the number of cache flushes sent when writing log
buffers
- Preserve inode sickness reports for longer
- Rename xfs_eofblocks since it controls inode cache walks
- Refactor the extended attribute code to prepare it for the addition
of log intent items to make xattrs fully transactional
- A few fixes to earlier large patchsets
- Log recovery fixes so that we don't accidentally mark the log clean
when log intent recovery fails
- Fix some latent SOB errors
- Clean up shutdown messages that get logged to dmesg
- Fix a regression in the online shrink code
- Fix a UAF in the buffer logging code if the fs goes offline
- Fix uninitialized error variables
- Fix a UAF in the CIL when commited log item callbacks race with a
shutdown
- Fix a bug where the CIL could hang trying to push part of the log ring
buffer that hasn't been filled yet
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDXP38ACgkQ+H93GTRK
tOsKzw//eHvEgeyBo7ek06GDsUph2kQVR9AJWE7MNMiBFxlmL8R9H225xJK7Qmcr
YswcyEeDq8cNXbXDA249ueuMb+DxhZPY68hPK5BJ3KsbvL2RZV0lJCbk492l4cgb
IvBJiG/MDo55km83tdr81AlmFYQM7rSQz5MbVogGxxsnp0ul3VpIrJZba8kPRDQ1
mZzH2fdlnE9Ozw/CfvjSgT1pySyFpxNeTRucYXUQil1hL1AGTBw7rGGNnccS090y
u/EawQ4WJ131m8O3+WomUmaGyZFlWvTpHzukKxvrEvZ6AG+HpIhMcbZ5J6nkRTY4
xxhUBG2qNKIcgPmPwAGmx1cylcsOCNKQgp+fko9tAZjEkgT5cbCpqpjGgjNB0RCf
pB0PY6idCFl9hmBpVgMWz2AZ9IsDmK54qufmLtzq/zN8cThzt6A95UUR0rGu5Kd8
CUmmdQTYl0GqlTTszCO2rw1+zRtcasMpBVmeYHDxy00bd1dHLUJ6o8DuXRYTTQti
J/6CZVVD56jieRb+uvrOq4mhiPR2kynciiu1dXdY5kx79kKom6HMBBvtTl8b9kmh
smWihfip7BTpz5vFzcwFmMxFwzW3K4LnDZl7qEGqXDEIHOL+pRWazU2yN3JZRGyd
z4SQMJuER0HTTA0yO09c3/CX9onorhjUIMgQ9U25l1hdyFna0+o=
=08Q9
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Most of the work this cycle has been on refactoring various parts of
the codebase. The biggest non-cleanup changes are (1) reducing the
number of cache flushes sent when writing the log; (2) a substantial
number of log recovery fixes; and (3) I started accepting pull
requests from contributors if the commits in their branches match
what's been sent to the list.
For a week or so I /had/ staged a major cleanup of the logging code
from Dave Chinner, but it exposed so many lurking bugs in other parts
of the logging and log recovery code that I decided to defer that
patchset until we can address those latent bugs.
Larger cleanups this time include walking the incore inode cache (me)
and rework of the extended attribute code (Allison) to prepare it for
adding logged xattr updates (and directory tree parent pointers) in
future releases.
Summary:
- Refactor the buffer cache to use bulk page allocation
- Convert agnumber-based AG iteration to walk per-AG structures
- Clean up some unit conversions and other code warts
- Reduce spinlock contention in the directio fastpath
- Collapse all the inode cache walks into a single function
- Remove indirect function calls from the inode cache walk code
- Dramatically reduce the number of cache flushes sent when writing
log buffers
- Preserve inode sickness reports for longer
- Rename xfs_eofblocks since it controls inode cache walks
- Refactor the extended attribute code to prepare it for the addition
of log intent items to make xattrs fully transactional
- A few fixes to earlier large patchsets
- Log recovery fixes so that we don't accidentally mark the log clean
when log intent recovery fails
- Fix some latent SOB errors
- Clean up shutdown messages that get logged to dmesg
- Fix a regression in the online shrink code
- Fix a UAF in the buffer logging code if the fs goes offline
- Fix uninitialized error variables
- Fix a UAF in the CIL when commited log item callbacks race with a
shutdown
- Fix a bug where the CIL could hang trying to push part of the log
ring buffer that hasn't been filled yet"
* tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (102 commits)
xfs: don't wait on future iclogs when pushing the CIL
xfs: Fix a CIL UAF by getting get rid of the iclog callback lock
xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
xfs: don't nest icloglock inside ic_callback_lock
xfs: Initialize error in xfs_attr_remove_iter
xfs: fix endianness issue in xfs_ag_shrink_space
xfs: remove dead stale buf unpin handling code
xfs: hold buffer across unpin and potential shutdown processing
xfs: force the log offline when log intent item recovery fails
xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes
xfs: shorten the shutdown messages to a single line
xfs: print name of function causing fs shutdown instead of hex pointer
xfs: fix type mismatches in the inode reclaim functions
xfs: separate primary inode selection criteria in xfs_iget_cache_hit
xfs: refactor the inode recycling code
xfs: add iclog state trace events
xfs: xfs_log_force_lsn isn't passed a LSN
xfs: Fix CIL throttle hang when CIL space used going backwards
xfs: journal IO cache flush reductions
xfs: remove need_start_rec parameter from xlog_write()
...
Hi Linus,
Please, pull the following patches that fix many fall-through warnings
when building with Clang 12.0.0 and this[1] change reverted. Notice
that in order to enable -Wimplicit-fallthrough for Clang, such change[1]
is meant to be reverted at some point. So, these patches help to move
in that direction.
Thanks!
[1] commit e2079e93f5 ("kbuild: Do not enable -Wimplicit-fallthrough for clang for now")
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmDaNe8ACgkQRwW0y0cG
2zFfGA/9G1A/Hrf261/P9olyYe2TRBwLnO1tUDREm3qtJ2JdKpf+7EM3VDm+Ue/A
qhNmwp5G7nmp7Nqq8MfbdFjeo/rPS67voXiOfO8b0pU+E4XlOc+B1BXL0BWtnP7b
xvuauklQU6dmCp2u44vsxdBIO6ooR0uQh+7/+1la+mPyEk9mlooQ4lyFcpfA53yt
zxEGrx0tZBrDXghEI1CkHxOaJaX3qhw4EUYvxe8n2L7Dgx+o2djL/G4/SRYH/xoq
MZa8TLyCuR3J0Ph4TfDONhMmf8ZLn+j70xBhewcVfZ1JfvGSVw4DQNN44KZCDnrK
tGsBo5VFksjbmX83LmT8UlqB1rTP4nVQtRmtOPvbQA9kd19yy+Y64Y58FcGU2FHl
PWt3rQJ1JzBo3TtzQoz7HSJCt9QTil4U7hFbNtcp5BbWQfUPkRgpWcL3FOchZbZ6
FnLMqHanw2lrKMzZEoyHvg6G7BT67k3rrFgtd/xGSn8ohtfKXaZBYa9PKrQ0LwuG
o8tQtIX1owj4rbdI1t6Ob4X/tT6Y7DzH8nsF+TsJQ4XeSCD2rURUcYltBMIlEr16
DFj7iWKIrrX80/JRsBXu7a9h8nn5YptxV12SGRq/Cu/2jfRwjDye4IzsCyqMf67n
oEN6YC1XYaEUmKXTnI8Z0CxY0qwSTcNjeH5Ci9jWepinsqD3Jxw=
=Kt2q
-----END PGP SIGNATURE-----
Merge tag 'fallthrough-fixes-clang-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo Silva:
"Fix many fall-through warnings when building with Clang 12.0.0 and
'-Wimplicit-fallthrough' so that we at some point will be able to
enable that warning by default"
* tag 'fallthrough-fixes-clang-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (26 commits)
rxrpc: Fix fall-through warnings for Clang
drm/nouveau/clk: Fix fall-through warnings for Clang
drm/nouveau/therm: Fix fall-through warnings for Clang
drm/nouveau: Fix fall-through warnings for Clang
xfs: Fix fall-through warnings for Clang
xfrm: Fix fall-through warnings for Clang
tipc: Fix fall-through warnings for Clang
sctp: Fix fall-through warnings for Clang
rds: Fix fall-through warnings for Clang
net/packet: Fix fall-through warnings for Clang
net: netrom: Fix fall-through warnings for Clang
ide: Fix fall-through warnings for Clang
hwmon: (max6621) Fix fall-through warnings for Clang
hwmon: (corsair-cpro) Fix fall-through warnings for Clang
firewire: core: Fix fall-through warnings for Clang
braille_console: Fix fall-through warnings for Clang
ipv4: Fix fall-through warnings for Clang
qlcnic: Fix fall-through warnings for Clang
bnxt_en: Fix fall-through warnings for Clang
netxen_nic: Fix fall-through warnings for Clang
...
In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This lead me to xfs_log_force_lsn() and looking at how it works.
xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.
The problem is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.
As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.
This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.
Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This ambitious series aims to cleans up redundant inode walk code in
xfs_icache.c, hide implementation details of the quotaoff dquot release
code, and eliminates indirect function calls from incore inode walks.
The first thing it does is to move all the code that quotaoff calls to
release dquots from all incore inodes into xfs_icache.c. Next, it
separates the goal of an inode walk from the actual radix tree tags that
may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing.
Finally, we split the speculative preallocation (blockgc) and quotaoff
dquot release code paths into separate functions so that we can keep the
implementations cohesive.
Christoph suggested last cycle that we 'simply' change quotaoff not to
allow deactivating quota entirely, but as these cleanups are to enable
one major change in behavior (deferred inode inactivation) I do not want
to add a second behavior change (quotaoff) as a dependency.
To be blunt: Additional cleanups are not in scope for this series.
Next, I made two observations about incore inode radix tree walks --
since there's a 1:1 mapping between the walk goal and the per-inode
processing function passed in, we can use the goal to make a direct call
to the processing function. Furthermore, the only caller to supply a
nonzero iter_flags argument is quotaoff, and there's only one INEW flag.
From that observation, I concluded that it's quite possible to remove
two parameters from the xfs_inode_walk* function signatures -- the
iter_flags, and the execute function pointer. The middle of the series
moves the INEW functionality into the one piece (quotaoff) that wants
it, and removes the indirect calls.
The final observation is that the inode reclaim walk loop is now almost
the same as xfs_inode_walk, so it's silly to maintain two copies. Merge
the reclaim loop code into xfs_inode_walk.
Lastly, refactor the per-ag radix tagging functions since there's
duplicated code that can be consolidated.
This series is a prerequisite for the next two patchsets, since deferred
inode inactivation will add another inode radix tree tag and iterator
function to xfs_inode_walk.
v2: walk the vfs inode list when running quotaoff instead of the radix
tree, then rework the (now completely internal) inode walk function
to take the tag as the main parameter.
v3: merge the reclaim loop into xfs_inode_walk, then consolidate the
radix tree tagging functions
v4: rebase to 5.13-rc4
v5: combine with the quotaoff patchset, reorder functions to minimize
forward declarations, split inode walk goals from radix tree tags
to reduce conceptual confusion
v6: start moving the inode cache code towards the xfs_icwalk prefix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmC5Yv0ACgkQ+H93GTRK
tOv7Fg//Z7cKph0zSg6qsukMEMZxscuNcEBydCW1bu9gSx1NpszDpiGqAiO5ZB3X
wP2XkCqjuatbNGGvkNLHS/M4sbLX3ELogvYmMRvUhDoaSFxT/KKgxvsyNffiCSS7
xRB/rvWRp9MGRpBWPF0ZUxFU6VBzhCrYdMsNhvW95AEup8S/j+NplwoIif0gzaZZ
Q6Fl4Ca9VEBvJQPV+/zkLih19iFItmARJhPHUs4BO1nZv+CzZBFQHg7Ijw7nW92j
eSY68W4LH/IQ5cqm+HrD/+Z6ns0P7J2viewzVymkNEGnuX4a0xrQrzQ8ydRsAxTi
9EDrpIe3MbSI5YjJfmRe8G3LX5p7vBpqc8TeyZdRDMGWkFjT33HPlQNb6WxKLQbA
mjKdfr8AYZR/UQKW/7oZFrJnOoMpYRAQ4Sn/9BAYZQYm7tiLzuZsrEZ7JBwiUA56
XHmlsDDeLzJeKvjmUu8M3H4oh4Nwf5/I2vJwHjueTfhl83uJP04igIXC4rnq56bM
AAAjH9uV11Fo3q0ywAnRtN2HYj8PEJlCMK5CNskILrGeMITsBPGht0SbaA6hDI2h
GYmltKInHzuPhHC9NfyPVrVr3BrmPR5cBsVFESiz5A4E9rbuKmmna6Yk8MFlMyl8
FRIA3zVatJ2qQXtsAcdI8AZzMd7ciYhkAgCqFKxv8qK/qxITHh4=
=Rxdn
-----END PGP SIGNATURE-----
Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
xfs: clean up incore inode walk functions
This ambitious series aims to cleans up redundant inode walk code in
xfs_icache.c, hide implementation details of the quotaoff dquot release
code, and eliminates indirect function calls from incore inode walks.
The first thing it does is to move all the code that quotaoff calls to
release dquots from all incore inodes into xfs_icache.c. Next, it
separates the goal of an inode walk from the actual radix tree tags that
may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing.
Finally, we split the speculative preallocation (blockgc) and quotaoff
dquot release code paths into separate functions so that we can keep the
implementations cohesive.
Christoph suggested last cycle that we 'simply' change quotaoff not to
allow deactivating quota entirely, but as these cleanups are to enable
one major change in behavior (deferred inode inactivation) I do not want
to add a second behavior change (quotaoff) as a dependency.
To be blunt: Additional cleanups are not in scope for this series.
Next, I made two observations about incore inode radix tree walks --
since there's a 1:1 mapping between the walk goal and the per-inode
processing function passed in, we can use the goal to make a direct call
to the processing function. Furthermore, the only caller to supply a
nonzero iter_flags argument is quotaoff, and there's only one INEW flag.
From that observation, I concluded that it's quite possible to remove
two parameters from the xfs_inode_walk* function signatures -- the
iter_flags, and the execute function pointer. The middle of the series
moves the INEW functionality into the one piece (quotaoff) that wants
it, and removes the indirect calls.
The final observation is that the inode reclaim walk loop is now almost
the same as xfs_inode_walk, so it's silly to maintain two copies. Merge
the reclaim loop code into xfs_inode_walk.
Lastly, refactor the per-ag radix tagging functions since there's
duplicated code that can be consolidated.
This series is a prerequisite for the next two patchsets, since deferred
inode inactivation will add another inode radix tree tag and iterator
function to xfs_inode_walk.
v2: walk the vfs inode list when running quotaoff instead of the radix
tree, then rework the (now completely internal) inode walk function
to take the tag as the main parameter.
v3: merge the reclaim loop into xfs_inode_walk, then consolidate the
radix tree tagging functions
v4: rebase to 5.13-rc4
v5: combine with the quotaoff patchset, reorder functions to minimize
forward declarations, split inode walk goals from radix tree tags
to reduce conceptual confusion
v6: start moving the inode cache code towards the xfs_icwalk prefix
* tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: refactor per-AG inode tagging functions
xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag
xfs: pass struct xfs_eofblocks to the inode scan callback
xfs: fix radix tree tag signs
xfs: make the icwalk processing functions clean up the grab state
xfs: clean up inode state flag tests in xfs_blockgc_igrab
xfs: remove indirect calls from xfs_inode_walk{,_ag}
xfs: remove iter_flags parameter from xfs_inode_walk_*
xfs: move xfs_inew_wait call into xfs_dqrele_inode
xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab
xfs: pass the goal of the incore inode walk to xfs_inode_walk()
xfs: rename xfs_inode_walk functions to xfs_icwalk
xfs: move the inode walk functions further down
xfs: detach inode dquots at the end of inactivation
xfs: move the quotaoff dqrele inode walk into xfs_icache.c
[djwong: added variable names to function declarations while fixing
merge conflicts]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
If we want to use active references to the perag to be able to gate
shrink removing AGs and hence perags safely, we've got a fair bit of
work to do actually use perags in all the places we need to.
There's a lot of code that iterates ag numbers and then
looks up perags from that, often multiple times for the same perag
in the one operation. If we want to use reference counted perags for
access control, then we need to convert all these uses to perag
iterators, not agno iterators.
[Patches 1-4]
The first step of this is consolidating all the perag management -
init, free, get, put, etc into a common location. THis is spread all
over the place right now, so move it all into libxfs/xfs_ag.[ch].
This does expose kernel only bits of the perag to libxfs and hence
userspace, so the structures and code is rearranged to minimise the
number of ifdefs that need to be added to the userspace codebase.
The perag iterator in xfs_icache.c is promoted to a first class API
and expanded to the needs of the code as required.
[Patches 5-10]
These are the first basic perag iterator conversions and changes to
pass the perag down the stack from those iterators where
appropriate. A lot of this is obvious, simple changes, though in
some places we stop passing the perag down the stack because the
code enters into an as yet unconverted subsystem that still uses raw
AGs.
[Patches 11-16]
These replace the agno passed in the btree cursor for per-ag btree
operations with a perag that is passed to the cursor init function.
The cursor takes it's own reference to the perag, and the reference
is dropped when the cursor is deleted. Hence we get reference
coverage for the entire time the cursor is active, even if the code
that initialised the cursor drops it's reference before the cursor
or any of it's children (duplicates) have been deleted.
The first patch adds the perag infrastructure for the cursor, the
next four patches convert a btree cursor at a time, and the last
removes the agno from the cursor once it is unused.
[Patches 17-21]
These patches are a demonstration of the simplifications and
cleanups that come from plumbing the perag through interfaces that
select and then operate on a specific AG. In this case the inode
allocation algorithm does up to three walks across all AGs before it
either allocates an inode or fails. Two of these walks are purely
just to select the AG, and even then it doesn't guarantee inode
allocation success so there's a third walk if the selected AG
allocation fails.
These patches collapse the selection and allocation into a single
loop, simplifies the error handling because xfs_dir_ialloc() always
returns ENOSPC if no AG was selected for inode allocation or we fail
to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
wrapper, converts inode allocation to run entirely from a single
perag instance, and then factors xfs_dialloc() into a much, much
simpler loop which is easy to understand.
Hence we end up with the same inode allocation logic, but it only
needs two complete iterations at worst, makes AG selection and
allocation atomic w.r.t. shrink and chops out out over 100 lines of
code from this hot code path.
[Patch 22]
Converts the unlink path to pass perags through it.
There's more conversion work to be done, but this patchset gets
through a large chunk of it in one hit. Most of the iterators are
converted, so once this is solidified we can move on to converting
these to active references for being able to free perags while the
fs is still active.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmC3HUgUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h2yaw/+P0JzpI+6n06Ei00mjgE/Du/WhMLi
0JQ93Grlj+miuGGT9DgGCiRpoZnefhEk+BH6JqoEw1DQ3T5ilmAzrHLUUHSQC3+S
dv85sJduheQ6yHuoO+4MzkaSq6JWKe7E9gZwAsVyBul5aSjdmaJaQdPwYMTXSXo0
5Uqq8ECFkMcaHVNjcBfasgR/fdyWy2Qe4PFTHTHdQpd+DNZ9UXgFKHW2og+1iry/
zDIvdIppJULA09TvVcZuFjd/1NzHQ/fLj5PAzz8GwagB4nz2x3s78Zevmo5yW/jK
3/+50vXa8ldhiHDYGTS3QXvS0xJRyqUyD47eyWOOiojZw735jEvAlCgjX6+0X1HC
k3gCkQLv8l96fRkvUpgnLf/fjrUnlCuNBkm9d1Eq2Tied8dvLDtiEzoC6L05Nqob
yd/nIUb1zwJFa9tsoheHhn0bblTGX1+zP0lbRJBje0LotpNO9DjGX5JoIK4GR7F8
y1VojcdgRI14HlxUnbF3p8wmQByN+M2tnp6GSdv9BA65bjqi05Rj/steFdZHBV6x
wiRs8Yh6BTvMwKgufHhRQHfRahjNHQ/T/vOE+zNbWqemS9wtEUDop+KvPhC36R/k
o/cmr23cF8ESX2eChk7XM4On3VEYpcvp2zSFgrFqZYl6RWOwEis3Htvce3KuSTPp
8Xq70te0gr2DVUU=
=YNzW
-----END PGP SIGNATURE-----
Merge tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2
xfs: initial agnumber -> perag conversions for shrink
If we want to use active references to the perag to be able to gate
shrink removing AGs and hence perags safely, we've got a fair bit of
work to do actually use perags in all the places we need to.
There's a lot of code that iterates ag numbers and then
looks up perags from that, often multiple times for the same perag
in the one operation. If we want to use reference counted perags for
access control, then we need to convert all these uses to perag
iterators, not agno iterators.
[Patches 1-4]
The first step of this is consolidating all the perag management -
init, free, get, put, etc into a common location. THis is spread all
over the place right now, so move it all into libxfs/xfs_ag.[ch].
This does expose kernel only bits of the perag to libxfs and hence
userspace, so the structures and code is rearranged to minimise the
number of ifdefs that need to be added to the userspace codebase.
The perag iterator in xfs_icache.c is promoted to a first class API
and expanded to the needs of the code as required.
[Patches 5-10]
These are the first basic perag iterator conversions and changes to
pass the perag down the stack from those iterators where
appropriate. A lot of this is obvious, simple changes, though in
some places we stop passing the perag down the stack because the
code enters into an as yet unconverted subsystem that still uses raw
AGs.
[Patches 11-16]
These replace the agno passed in the btree cursor for per-ag btree
operations with a perag that is passed to the cursor init function.
The cursor takes it's own reference to the perag, and the reference
is dropped when the cursor is deleted. Hence we get reference
coverage for the entire time the cursor is active, even if the code
that initialised the cursor drops it's reference before the cursor
or any of it's children (duplicates) have been deleted.
The first patch adds the perag infrastructure for the cursor, the
next four patches convert a btree cursor at a time, and the last
removes the agno from the cursor once it is unused.
[Patches 17-21]
These patches are a demonstration of the simplifications and
cleanups that come from plumbing the perag through interfaces that
select and then operate on a specific AG. In this case the inode
allocation algorithm does up to three walks across all AGs before it
either allocates an inode or fails. Two of these walks are purely
just to select the AG, and even then it doesn't guarantee inode
allocation success so there's a third walk if the selected AG
allocation fails.
These patches collapse the selection and allocation into a single
loop, simplifies the error handling because xfs_dir_ialloc() always
returns ENOSPC if no AG was selected for inode allocation or we fail
to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
wrapper, converts inode allocation to run entirely from a single
perag instance, and then factors xfs_dialloc() into a much, much
simpler loop which is easy to understand.
Hence we end up with the same inode allocation logic, but it only
needs two complete iterations at worst, makes AG selection and
allocation atomic w.r.t. shrink and chops out out over 100 lines of
code from this hot code path.
[Patch 22]
Converts the unlink path to pass perags through it.
There's more conversion work to be done, but this patchset gets
through a large chunk of it in one hit. Most of the iterators are
converted, so once this is solidified we can move on to converting
these to active references for being able to free perags while the
fs is still active.
* tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits)
xfs: remove xfs_perag_t
xfs: use perag through unlink processing
xfs: clean up and simplify xfs_dialloc()
xfs: inode allocation can use a single perag instance
xfs: get rid of xfs_dir_ialloc()
xfs: collapse AG selection for inode allocation
xfs: simplify xfs_dialloc_select_ag() return values
xfs: remove agno from btree cursor
xfs: use perag for ialloc btree cursors
xfs: convert allocbt cursors to use perags
xfs: convert refcount btree cursor to use perags
xfs: convert rmap btree cursor to using a perag
xfs: add a perag to the btree cursor
xfs: pass perags around in fsmap data dev functions
xfs: push perags through the ag reservation callouts
xfs: pass perags through to the busy extent code
xfs: convert secondary superblock walk to use perags
xfs: convert xfs_iwalk to use perag references
xfs: convert raw ag walks to use for_each_perag
xfs: make for_each_perag... a first class citizen
...
Once we're done with inactivating an inode, we're finished updating
metadata for that inode. This means that we can detach the dquots at
the end and not have to wait for reclaim to do it for us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Unlinked lists are held in the perag, and freeing of inodes needs to
be passed a perag, too, so look up the perag early in the unlink
processing and use it throughout.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This is just a simple wrapper around the per-ag inode allocation
that doesn't need to exist. The internal mechanism to select and
allocate within an AG does not need to be exposed outside
xfs_ialloc.c, and it being exposed simply makes it harder to follow
the code and simplify it.
This is simplified by internalising xf_dialloc_select_ag() and
xfs_dialloc_ag() into a single xfs_dialloc() function and then
xfs_dir_ialloc() can go away.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
The only caller of xfs_dialloc_select_ag() will always return
-ENOSPC to it's caller if the agbp returned from
xfs_dialloc_select_ag() is NULL. IOWs, failure to find a candidate
AGI we can allocate inodes from is always an ENOSPC condition, so
move this logic up into xfs_dialloc_select_ag() so we can simplify
the return logic in this function.
xfs_dialloc_select_ag() now only ever returns 0 with a locked
agbp, or an error with no agbp.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
They are AG functions, not superblock functions, so move them to the
appropriate location.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In preparation to enable -Wimplicit-fallthrough for Clang, fix
the following warnings by replacing /* fall through */ comments,
and its variants, with the new pseudo-keyword macro fallthrough:
fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Notice that Clang doesn't recognize /* fall through */ comments as
implicit fall-through markings, so in order to globally enable
-Wimplicit-fallthrough for Clang, these comments need to be
replaced with fallthrough; in the whole codebase.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
The RTINHERIT bit can be set on a directory so that newly created
regular files will have the REALTIME bit set to store their data on the
realtime volume. If an extent size hint (and EXTSZINHERIT) are set on
the directory, the hint will also be copied into the new file.
As pointed out in previous patches, for realtime files we require the
extent size hint be an integer multiple of the realtime extent, but we
don't perform the same validation on a directory with both RTINHERIT and
EXTSZINHERIT set, even though the only use-case of that combination is
to propagate extent size hints into new realtime files. This leads to
inode corruption errors when the bad values are propagated.
Because there may be existing filesystems with such a configuration, we
cannot simply amend the inode verifier to trip on these directories and
call it a day because that will cause previously "working" filesystems
to start throwing errors abruptly. Note that it's valid to have
directories with rtinherit set even if there is no realtime volume, in
which case the problem does not manifest because rtinherit is ignored if
there's no realtime device; and it's possible that someone set the flag,
crashed, repaired the filesystem (which clears the hint on the realtime
file) and continued.
Therefore, mitigate this issue in several ways: First, if we try to
write out an inode with both rtinherit/extszinherit set and an unaligned
extent size hint, turn off the hint to correct the error. Second, if
someone tries to misconfigure a directory via the fssetxattr ioctl, fail
the ioctl. Third, reverify both extent size hint values when we
propagate heritable inode attributes from parent to child, to prevent
misconfigurations from spreading.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
- Various minor fixes in online scrub.
- Prevent metadata files from being automatically inactivated.
- Validate btree heights by the computed per-btree limits.
- Don't warn about remounting with deprecated mount options.
- Initialize attr forks at create time if we suspect we're going to need
to store them.
- Reduce memory reallocation workouts in the logging code.
- Fix some theoretical math calculation errors in logged buffers that
span multiple discontig memory ranges but contiguous ondisk regions.
- Speedups in dirty buffer bitmap handling.
- Make type verifier functions more inline-happy to reduce overhead.
- Reduce debug overhead in directory checking code.
- Many many typo fixes.
- Begin to handle the permanent loss of the very end of a filesystem.
- Fold struct xfs_icdinode into xfs_inode.
- Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx ioctl.
- Remove a broken directory block format check from online scrub.
- Fix a bug where we could produce an unnecessarily tall data fork btree
when creating an attr fork.
- Fix scrub and readonly remounts racing.
- Fix a writeback ioend log deadlock problem by dropping the behavior
where we could preallocate a setfilesize transaction.
- Fix some bugs in the new extent count checking code.
- Fix some bugs in the attr fork preallocation code.
- Refactor if_flags out of the incore inode fork data structure.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmB6MFUACgkQ+H93GTRK
tOvigBAAlpzBUXnZVo+U18u0tSHnq5c1zbXYcf5GPhQv9w3n3TlPi3YhK2vgEXlI
TULwsdU+an30oqWkQiVrwQjKPVaTWeWE3K0sA2MlYX9L2CwPPde4x5hwhyppfQxq
mQyu0suWp480ao7vToXAgZ751OdZRtGu8sRQ7rVQ/FVf9K4R8EqpZMEynNry25f+
hpK235hpf4IUC9E1A4pE2hNBSr/LGPIyu1t5sZsfazcNmtpKcauy5R5b8Pdnzo2/
WFa6PoeE8SRIp4OxZY/c/4QUI5cRubJGyoB+kbl0hg69uYIJO+pc+R69yrQPD9Z+
JDW/FktH+Zz4pstFsC+qnSvhRaF2DvXpvXrIldonQ2Z2ByVqbs3r6HzKySlWQ+QE
jU717HApWl/ADI/kVD2IuQnrbU+Q8Ue8thzgQeEpTRWsea2HzPMofNi5FImU2ulw
g4V7PleQWJ6AsLhcpfA46Y+CUAtjTD1Tvj67JpXuWJ+MFTB4hRm3U7zgCtV/0c3T
wBBUybQjDoVA6DDr6CP/9ki1k0BO3wKJGlZMR0bkEsuxXdFNTvHEz5lmueYT/Wxc
D91+oRbna9NpEeIVFGo6lhMIu2t0iYssFdgQKyn1jXrpGXKvOklP8zDjRdPnnQmz
plT2ajlXPIjc6KjOTP2mbVqKs059LuJoYV7gIWwM7CgtFsMIrd8=
=oRKe
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"The notable user-visible addition this cycle is ability to remove
space from the last AG in a filesystem. This is the first of many
changes needed for full-fledged support for shrinking a filesystem.
Still needed are (a) the ability to reorganize files and metadata away
from the end of the fs; (b) the ability to remove entire allocation
groups; (c) shrink support for realtime volumes; and (d) thorough
testing of (a-c).
There are a number of performance improvements in this code drop: Dave
streamlined various parts of the buffer logging code and reduced the
cost of various debugging checks, and added the ability to pre-create
the xattr structures while creating files. Brian eliminated
transaction reservations that were being held across writeback (thus
reducing livelock potential.
Other random pieces: Pavel fixed the repetitve warnings about
deprecated mount options, I fixed online fsck to behave itself when a
readonly remount comes in during scrub, and refactored various other
parts of that code, Christoph contributed a lot of refactoring this
cycle. The xfs_icdinode structure has been absorbed into the (incore)
xfs_inode structure, and the format and flags handling around
xfs_inode_fork structures has been simplified. Chandan provided a
number of fixes for extent count overflow related problems that have
been shaken out by debugging knobs added during 5.12.
Summary:
- Various minor fixes in online scrub.
- Prevent metadata files from being automatically inactivated.
- Validate btree heights by the computed per-btree limits.
- Don't warn about remounting with deprecated mount options.
- Initialize attr forks at create time if we suspect we're going to
need to store them.
- Reduce memory reallocation workouts in the logging code.
- Fix some theoretical math calculation errors in logged buffers that
span multiple discontig memory ranges but contiguous ondisk
regions.
- Speedups in dirty buffer bitmap handling.
- Make type verifier functions more inline-happy to reduce overhead.
- Reduce debug overhead in directory checking code.
- Many many typo fixes.
- Begin to handle the permanent loss of the very end of a filesystem.
- Fold struct xfs_icdinode into xfs_inode.
- Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx
ioctl.
- Remove a broken directory block format check from online scrub.
- Fix a bug where we could produce an unnecessarily tall data fork
btree when creating an attr fork.
- Fix scrub and readonly remounts racing.
- Fix a writeback ioend log deadlock problem by dropping the behavior
where we could preallocate a setfilesize transaction.
- Fix some bugs in the new extent count checking code.
- Fix some bugs in the attr fork preallocation code.
- Refactor if_flags out of the incore inode fork data structure"
* tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits)
xfs: remove xfs_quiesce_attr declaration
xfs: remove XFS_IFEXTENTS
xfs: remove XFS_IFINLINE
xfs: remove XFS_IFBROOT
xfs: only look at the fork format in xfs_idestroy_fork
xfs: simplify xfs_attr_remove_args
xfs: rename and simplify xfs_bmap_one_block
xfs: move the XFS_IFEXTENTS check into xfs_iread_extents
xfs: drop unnecessary setfilesize helper
xfs: drop unused ioend private merge and setfilesize code
xfs: open code ioend needs workqueue helper
xfs: drop submit side trans alloc for append ioends
xfs: fix return of uninitialized value in variable error
xfs: get rid of the ip parameter to xchk_setup_*
xfs: fix scrub and remount-ro protection when running scrub
xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks
xfs: move the xfs_can_free_eofblocks call under the IOLOCK
xfs: precalculate default inode attribute offset
xfs: default attr fork size does not handle device inodes
xfs: inode fork allocation depends on XFS_IFEXTENT flag
...
The in-memory XFS_IFEXTENTS is now only used to check if an inode with
extents still needs the extents to be read into memory before doing
operations that need the extent map. Add a new xfs_need_iread_extents
helper that returns true for btree format forks that do not have any
entries in the in-memory extent btree, and use that instead of checking
the XFS_IFEXTENTS flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Fix the weird split of responsibilities between xfs_can_free_eofblocks
and xfs_free_eofblocks by moving the chunk of code that looks for any
actual post-EOF space mappings from the second function into the first.
This clears the way for deferred inode inactivation to be able to decide
if an inode needs inactivation work before committing the released inode
to the inactivation code paths (vs. marking it for reclaim).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Due to confusion on when the XFS_IFEXTENT needs to be set, the
changes in e6a688c332 ("xfs: initialise attr fork on inode
create") failed to set the flag when initialising the empty
attribute fork at inode creation. Set this flag the same way
xfs_bmap_add_attrfork() does after attry fork allocation.
Fixes: e6a688c332 ("xfs: initialise attr fork on inode create")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
The pitfalls of regression testing on a machine without realising
that selinux was disabled. Only set the attr fork during inode
allocation if the attr feature bits are already set on the
superblock.
Fixes: e6a688c332 ("xfs: initialise attr fork on inode create")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Merge _xfs_dic2xflags into its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Move the crtime field from struct xfs_icdinode into stuct xfs_inode and
remove the now entirely unused struct xfs_icdinode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the flags2
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the flags
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
forkoff field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The i_cowextsize field is only used for v3 inodes, and the i_flushiter
field is only used for v1/v2 inodes. Use a union to pack the inode a
littler better after adding a few missing guards around their usage.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
flushiter field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the
cowextsize field into the containing xfs_inode structure. Also
switch to use the xfs_extlen_t instead of a uint32_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the extsize
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the nblocks
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the on-disk
size field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In preparation of removing the historic icinode struct, move the projid
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs. So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode. The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure di_flags2 is always initialized. We currently get this implicitly
by clearing the dinode core on allocating the in-core inode, but that is
about to go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Split looking up the dinode from xfs_imap_to_bp, which can be
significantly simplified as a result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
s/sytemcall/syscall/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.
This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.
This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:
18.99% [kernel] [k] __pv_queued_spin_lock_slowpath
3.57% [kernel] [k] do_raw_spin_lock
2.51% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.48% [kernel] [k] memcpy
2.34% [kernel] [k] xfs_log_commit_cil
The typical profile resulting from running fsmark on a selinux enabled
filesytem is adds this overhead to the create path:
- 15.30% xfs_init_security
- 15.23% security_inode_init_security
- 13.05% xfs_initxattrs
- 12.94% xfs_attr_set
- 6.75% xfs_bmap_add_attrfork
- 5.51% xfs_trans_commit
- 5.48% __xfs_trans_commit
- 5.35% xfs_log_commit_cil
- 3.86% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.70% xfs_trans_alloc
0.52% xfs_trans_reserve
- 5.41% xfs_attr_set_args
- 5.39% xfs_attr_set_shortform.constprop.0
- 4.46% xfs_trans_commit
- 4.46% __xfs_trans_commit
- 4.33% xfs_log_commit_cil
- 2.74% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
0.60% xfs_inode_item_format
0.90% xfs_attr_try_sf_addname
- 1.99% selinux_inode_init_security
- 1.02% security_sid_to_context_force
- 1.00% security_sid_to_context_core
- 0.92% sidtab_entry_to_string
- 0.90% sidtab_sid2str_get
0.59% sidtab_sid2str_put.part.0
- 0.82% selinux_determine_inode_label
- 0.77% security_transition_sid
0.70% security_compute_sid.part.0
And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.
So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:
- 8.99% xfs_init_security
- 9.00% security_inode_init_security
- 6.43% xfs_initxattrs
- 6.37% xfs_attr_set
- 5.45% xfs_attr_set_args
- 5.42% xfs_attr_set_shortform.constprop.0
- 4.51% xfs_trans_commit
- 4.54% __xfs_trans_commit
- 4.59% xfs_log_commit_cil
- 2.67% _raw_spin_lock
- 3.28% do_raw_spin_lock
3.08% __pv_queued_spin_lock_slowpath
0.66% xfs_inode_item_format
- 0.90% xfs_attr_try_sf_addname
- 0.60% xfs_trans_alloc
- 2.35% selinux_inode_init_security
- 1.25% security_sid_to_context_force
- 1.21% security_sid_to_context_core
- 1.19% sidtab_entry_to_string
- 1.20% sidtab_sid2str_get
- 0.86% sidtab_sid2str_put.part.0
- 0.62% _raw_spin_lock_irqsave
- 0.77% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.84% selinux_determine_inode_label
- 0.83% security_transition_sid
0.86% security_compute_sid.part.0
Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic. This means that they should
never be subjected automatic to post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.
In other words, xfs_inactive() should leave these files alone. Add the
necessary predicate functions to make this happen. This adds a second
layer of prevention for the kinds of fs corruption that was fixed by
commit f4c32e87de. If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.
Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h
Followup-to: f4c32e87de ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Give filesystem two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many
details.
Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.
Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Nowadays, we indirectly use the idmap-aware helper functions in the VFS
to set the initial uid and gid of a file being created. Unfortunately,
we didn't convert the quota code, which means we attach the wrong dquots
to files created on an idmapped mount.
Fixes: f736d93d76 ("xfs: support idmapped mounts")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
For file creation, create a new helper xfs_trans_alloc_icreate that
allocates a transaction and reserves the appropriate amount of quota
against that transction. Replace all the open-coded idioms with a
single call to this helper so that we can contain the retry loops in the
next patchset.
This changes the locking behavior for non-tempfile creation slightly, in
that we now make the quota reservation without holding the directory
ILOCK. While the dquots chosen for inode creation are based on the
directory state at a given point in time, the directory ILOCK was
released as soon as the dquot references are picked up. Hence it was
never necessary to hold the directory ILOCK for the quota reservation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a proper helper so that inode creation calls can reserve quota
with a dedicated function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Enable idmapped mounts for xfs. This basically just means passing down
the user_namespace argument from the VFS methods down to where it is
passed to the relevant helpers.
Note that full-filesystem bulkstat is not supported from inside idmapped
mounts as it is an administrative operation that acts on the whole file
system. The limitation is not applied to the bulkstat single operation
that just operates on a single inode.
Link: https://lore.kernel.org/r/20210121131959.646623-40-christian.brauner@ubuntu.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
XFS always inherits the SGID bit if it is set on the parent inode, while
the generic inode_init_owner does not do this in a few cases where it can
create a possible security problem, see commit 0fa3ecd878
("Fix up non-directory creation in SGID directories") for details.
Switch XFS to use the generic helper for the normal path to fix this,
just keeping the simple field inheritance open coded for the case of the
non-sgid case with the bsdgrpid mount option.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
A rename operation is essentially a directory entry remove operation
from the perspective of parent directory (i.e. src_dp) of rename's
source. Hence the only place where we check for extent count overflow
for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real()
returns -ENOSPC when it detects a possible extent count overflow and in
response, the higher layers of directory handling code do the following:
1. Data/Free blocks: XFS lets these blocks linger until a future remove
operation removes them.
2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf
space and unmaps the last block.
For target_dp, there are two cases depending on whether the destination
directory entry exists or not.
When destination directory entry does not exist (i.e. target_ip ==
NULL), extent count overflow check is performed only when transaction
has a non-zero sized space reservation associated with it. With a
zero-sized space reservation, XFS allows a rename operation to continue
only when the directory has sufficient free space in its data/leaf/free
space blocks to hold the new entry.
When destination directory entry exists (i.e. target_ip != NULL), all
we need to do is change the inode number associated with the already
existing entry. Hence there is no need to perform an extent count
overflow check.
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Directory entry addition can cause the following,
1. Data block can be added/removed.
A new extent can cause extent count to increase by 1.
2. Free disk block can be added/removed.
Same behaviour as described above for Data block.
3. Dabtree blocks.
XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
can be new extents. Hence extent count can increase by
XFS_DA_NODE_MAXDEPTH.
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When overlayfs is running on top of xfs and the user unlinks a file in
the overlay, overlayfs will create a whiteout inode and ask xfs to
"rename" the whiteout file atop the one being unlinked. If the file
being unlinked loses its one nlink, we then have to put the inode on the
unlinked list.
This requires us to grab the AGI buffer of the whiteout inode to take it
off the unlinked list (which is where whiteouts are created) and to grab
the AGI buffer of the file being deleted. If the whiteout was created
in a higher numbered AG than the file being deleted, we'll lock the AGIs
in the wrong order and deadlock.
Therefore, grab all the AGI locks we think we'll need ahead of time, and
in order of increasing AG number per the locking rules.
Reported-by: wenli xie <wlxie7296@gmail.com>
Fixes: 93597ae8da ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This patch explicitly separates free inode chunk allocation and
inode allocation into two individual high level operations.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Get rid of the confusing ialloc_context and failure handling around
xfs_dialloc() by moving xfs_dialloc_roll() into xfs_dialloc().
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
So xfs_ialloc() will only address in-core inode allocation then,
Also, rename xfs_ialloc() to xfs_dir_ialloc_init() in order to
keep everything in xfs_inode.c under the same namespace.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce a helper to make the on-disk inode allocation rolling
logic clearer in preparation of the following cleanup.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Nowadays the only things that the XFS_TRANS_DQ_DIRTY flag seems to do
are indicates the tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}] values
changed and check in xfs_trans_apply_dquot_deltas() and the unreserve
variant xfs_trans_unreserve_and_mod_dquots(). Actually, we also can
use the tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag, that
is to say, we allocate the new tp->t_dqinfo only when the qtrx values
changed, so the tp->t_dqinfo value isn't NULL equals the XFS_TRANS_DQ_DIRTY
flag is set, we only need to check if tp->t_dqinfo == NULL in
xfs_trans_apply_dquot_deltas() and its unreserve variant to determine
whether lock all of the dquots and join them to the transaction.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Refactor all the open-coded validation of file block ranges into a
single helper, and teach the bmap scrubber to check the ranges.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The inode extent truncate path unmaps extents from the inode block
mapping, finishes deferred ops to free the associated extents and
then explicitly rolls the transaction before processing the next
extent. The latter extent roll is spurious as xfs_defer_finish()
always returns a clean transaction and automatically relogs inodes
attached to the transaction (with lock_flags == 0). This can
unnecessarily increase the number of log ticket regrants that occur
during a long running truncate operation. Remove the explicit
transaction roll.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While running generic/042 with -drtinherit=1 set in MKFS_OPTIONS, I
observed that the kernel will gladly set the realtime flag on any file
created on the loopback filesystem even though that filesystem doesn't
actually have a realtime device attached. This leads to verifier
failures and doesn't make any sense, so be smarter about this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Hoist the code that propagates di_flags and di_flags2 from a parent to a
new child into separate functions. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit
counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the
32-bit unix time epoch). This enables us to handle dates up to 2486,
which solves the y2038 problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the buffer retry state machine logic to xfs_buf.c and call it once
from xfs_ioend instead of duplicating it three times for the three kinds
of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
With the recent rework of the inode cluster flushing, we no longer
ever wait on the the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.
Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls to clear operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In the course of some operations, we look up the perag from
the mount multiple times to get or change perag information.
These are often very short pieces of code, so while the
lookup cost is generally low, the cost of the lookup is far
higher than the cost of the operation we are doing on the
perag.
Since we changed buffers to hold references to the perag
they are cached in, many modification contexts already hold
active references to the perag that are held across these
operations. This is especially true for any operation that
is serialised by an allocation group header buffer.
In these cases, we can just use the buffer's reference to
the perag to avoid needing to do lookups to access the
perag. This means that many operations don't need to do
perag lookups at all to access the perag because they've
already looked up objects that own persistent references
and hence can use that reference instead.
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster it flushed. This is resulting in:
- 38.91% 5.33% [kernel] [k] xfs_iflush
- 17.70% xfs_iflush
- 9.93% xfs_inobp_check
4.36% xfs_buf_offset
10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.
The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:
ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.
We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.
We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This we avoid races with inodes that reclaim is
actively referencing or are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....
We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly and so they remove themselves directly from the buffer list
and the AIL rather than having to wait for the buffer to be failed
and callbacks run to be processed correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
with xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward definition any more.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.
This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.
This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.
Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. The means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Once we have inodes pinning the cluster buffer and attached whenever
they are dirty, we no longer have a guarantee that the items are
flush locked when we lock the cluster buffer. Hence we cannot just
walk the buffer log item list and modify the attached inodes.
If the inode is not flush locked, we have to ILOCK it first and then
flush lock it to do all the prerequisite checks needed to avoid
races with other code. This is already handled by
xfs_ifree_get_one_inode(), so rework the inode iteration loop and
function to update all inodes in cache whether they are attached to
the buffer or not.
Note: we also remove the copying of the log item lsn to the
ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
trigger aborts and so flush lsn matching is not needed in IO
completion for processing freed inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Having different io completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.
This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete the all inodes under IO.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
While this is largely a refactor of existing functionality, we
broaden the scope of the flag to beyond where inodes are explicitly
attached because future changes need to know what type of log items
are attached to the buffer. Adding this buffer flag may invoke the
inode iodone callback in cases where it wouldn't have been
previously, but this is not a functional change because the callback
is identical to the normal buffer write iodone callback when inodes
are not attached.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.
e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged. Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.
We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.
However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.
This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.
This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.
Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.
On journal commit dirty stale inodes as are handled by both
buffer and inode log items to run though xfs_istale_done() and
removed from the AIL (buffer log item commit) or the log item will
simply unpin it because the buffer log item will clean it. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.
However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.
Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.
Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_free. It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.
Reproducer for this issue was generic/558 on a v4 filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the double-inode locking helpers to xfs_inode.c since they're not
specific to reflink.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
- Fix a resource leak on an error bailout.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCpAACgkQ+H93GTRK
tOtlog//ZkKRzp72HXCTgGpQj0IjkCjuZlz0F8FpdVhl9lOANaZPoXDbCIAax8q1
67wfDG7p8wl109KZnMuaPPXSC5KlynaWphSs7XMXqgLFXViha31c6U7PSMyxZmBB
674hE9eKnVNjhkMk98MtVV3ShWge9T5yGVXYhQbXMWDx8GCdNd9NEP3qnMcBEaLt
EPl6yoOfdNnKo37ptrt1Qb2NgORDBDDHYPr6SX/xEYDsppDLp8u+k/YGhuoJVtdc
HGR08ryIn6lctvkLbqDxtFzFxIL8Za7AHrBXilgioJYRJ78v7VyCnj1u8eT/axsa
ZUis/sQXjgvSvlsGZQZkyPdtnfhFbzXCeulyQvrMnEheMuz691dljMid3fEBkfmq
SubqE+HDP8aC6Zs9EkV/lEtdTH+EQ2ojZHH9s5oi6qbvilfFxyoPUfIxog+bhqPO
fwl1sL2nb/eQuBF+DeHg4UxP9WzA06Z1q9nZpDjrY224aMOWnrN8TBOKv4FZiRDt
M1l3VXcVsaDbCmbOsCTXdLh0Ap3przjk4hFPOjPJxlTzTNO9rPLhopvuLd+J3quA
fzNNBA4bMSq3IFSg3VEC2U3YgF3anGrt8PuopIwCH8muc9agCs//fI3Y/eI4k9oT
VOUPSxKckZ6SAEhIr7uTyKFzS+yNFBaYN/Y0FqDGnzbf5Bqr9NM=
=C0rZ
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"We've settled down into the bugfix phase; this one fixes a resource
leak on an error bailout path"
* tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
xfs_ifree_cluster() calls xfs_perag_get() at the beginning, but forgets to
call xfs_perag_put() in one failed path.
Add the missed function call to fix it.
Fixes: ce92464c18 ("xfs: make xfs_trans_get_buf return an error code")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Both the data and attr fork have a format that is stored in the legacy
idinode. Move it into the xfs_ifork structure instead, where it uses
up padding.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are there are three extents counters per inode, one for each of
the forks. Two are in the legacy icdinode and one is directly in
struct xfs_inode. Switch to a single counter in the xfs_ifork structure
where it uses up padding at the end of the structure. This simplifies
various bits of code that just wants the number of extents counter and
can now directly dereference it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ifree only need to free inline data in the data fork, as we've
already taken care of the attr fork before (and in fact freed the
fork structure). Just open code the freeing of the inline data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Call the data/attr local fork verifiers as soon as we are ready for them.
This keeps them close to the code setting up the forks, and avoids a
few branches later on. Also open code xfs_inode_verify_forks in the
only remaining caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>