IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The patches in this set adds the ability to repair the target buffer of
a symbolic link, using the same salvage, rebuild, and swap strategy used
everywhere else.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
pjM1AQCEEuX7qTakBtA1UB5UO/xH/MkQhza+GLknMAqngJA6dgD/UcKeuWeU4SaE
oi8G4nxbe2/BUS1Muv0/Y0RU9suVIAE=
=unik
-----END PGP SIGNATURE-----
Merge tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: online repair of symbolic links
The patches in this set adds the ability to repair the target buffer of
a symbolic link, using the same salvage, rebuild, and swap strategy used
everywhere else.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: online repair of symbolic links
xfs: pass the owner to xfs_symlink_write_target
xfs: expose xfs_bmap_local_to_extents for online repair
Orphaned files are defined to be files with nonzero ondisk link count
but no observable parent directory. This series enables online repair
to reparent orphaned files into the filesystem directory tree, and wires
up this reparenting ability into the directory, file link count, and
parent pointer repair functions. This is how we fix files with positive
link count that are not reachable through the directory tree.
This patch will also create the orphanage directory (lost+found) if it
is not present. In contrast to xfs_repair, we follow e2fsck in creating
the lost+found without group or other-owner access to avoid accidental
disclosure of files that were previously hidden by an 0700 directory.
That's silly security, but people have been known to do it.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
pnJ8AP4kTiYucq40QUjYbG9xZGxbRenrgtwmBltcn6Xzm9PUVAD+KrU1GQ5Qm2zW
/Kl5nDRM9zqJgQ5CQiBGuu3puHfnOgw=
=vJ6T
-----END PGP SIGNATURE-----
Merge tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: move orphan files to lost and found
Orphaned files are defined to be files with nonzero ondisk link count
but no observable parent directory. This series enables online repair
to reparent orphaned files into the filesystem directory tree, and wires
up this reparenting ability into the directory, file link count, and
parent pointer repair functions. This is how we fix files with positive
link count that are not reachable through the directory tree.
This patch will also create the orphanage directory (lost+found) if it
is not present. In contrast to xfs_repair, we follow e2fsck in creating
the lost+found without group or other-owner access to avoid accidental
disclosure of files that were previously hidden by an 0700 directory.
That's silly security, but people have been known to do it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: ensure dentry consistency when the orphanage adopts a file
xfs: move files to orphanage instead of letting nlinks drop to zero
xfs: move orphan files to the orphanage
This series employs atomic extent swapping to enable safe reconstruction
of directory data. For now, XFS does not support reverse directory
links (aka parent pointers), so we can only salvage the dirents of a
directory and construct a new structure.
Directory repair therefore consists of five main parts:
First, we walk the existing directory to salvage as many entries as we
can, by adding them as new directory entries to the repair temp dir.
Second, we validate the parent pointer found in the directory. If one
was not found, we scan the entire filesystem looking for a potential
parent.
Third, we use atomic extent swaps to exchange the entire data fork
between the two directories.
Fourth, we reap the old directory blocks as carefully as we can.
To wrap up the directory repair code, we need to add to the regular
filesystem the ability to free all the data fork blocks in a directory.
This does not change anything with normal directories, since they must
still unlink and shrink one entry at a time. However, this will
facilitate freeing of partially-inactivated temporary directories during
log recovery.
The second half of this patchset implements repairs for the dotdot
entries of directories. For now there is only rudimentary support for
this, because there are no directory parent pointers, so the best we can
do is scanning the filesystem and the VFS dcache for answers.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
piAGAP4hhYmBAiSGau5anRwUAFiELV2Irgz8k5oIGeWhg2m/kgEAn00d1rX8buLz
ZQyYs1dSRtBpYs2EIZRTnnM2W4dYLQY=
=2L3x
-----END PGP SIGNATURE-----
Merge tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: online repair of directories
This series employs atomic extent swapping to enable safe reconstruction
of directory data. For now, XFS does not support reverse directory
links (aka parent pointers), so we can only salvage the dirents of a
directory and construct a new structure.
Directory repair therefore consists of five main parts:
First, we walk the existing directory to salvage as many entries as we
can, by adding them as new directory entries to the repair temp dir.
Second, we validate the parent pointer found in the directory. If one
was not found, we scan the entire filesystem looking for a potential
parent.
Third, we use atomic extent swaps to exchange the entire data fork
between the two directories.
Fourth, we reap the old directory blocks as carefully as we can.
To wrap up the directory repair code, we need to add to the regular
filesystem the ability to free all the data fork blocks in a directory.
This does not change anything with normal directories, since they must
still unlink and shrink one entry at a time. However, this will
facilitate freeing of partially-inactivated temporary directories during
log recovery.
The second half of this patchset implements repairs for the dotdot
entries of directories. For now there is only rudimentary support for
this, because there are no directory parent pointers, so the best we can
do is scanning the filesystem and the VFS dcache for answers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: ask the dentry cache if it knows the parent of a directory
xfs: online repair of parent pointers
xfs: scan the filesystem to repair a directory dotdot entry
xfs: online repair of directories
xfs: inactivate directory data blocks
This series adds some logic to the inode scrubbers so that they can
detect and deal with consistency errors between the link count and the
per-inode unlinked list state. The helpers needed to do this are
presented here because they are a prequisite for rebuildng directories,
since we need to get a rebuilt non-empty directory off the unlinked
list.
Note that this patchset does not provide comprehensive reconstruction of
the AGI unlinked list; that is coming in a subsequent patchset.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
pt9pAQDbOMQ/9Y3Iyywkf9jTj9EvXEOpRlFPMd0F4gmtO9rhcAD/cQH9tctLcpeY
DuAHqtmR3o2elpoXZrR1b+mAkS26Twc=
=5oDH
-----END PGP SIGNATURE-----
Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: online repair of inode unlinked state
This series adds some logic to the inode scrubbers so that they can
detect and deal with consistency errors between the link count and the
per-inode unlinked list state. The helpers needed to do this are
presented here because they are a prequisite for rebuildng directories,
since we need to get a rebuilt non-empty directory off the unlinked
list.
Note that this patchset does not provide comprehensive reconstruction of
the AGI unlinked list; that is coming in a subsequent patchset.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: update the unlinked list when repairing link counts
xfs: ensure unlinked list state is consistent with nlink during scrub
This series employs atomic extent swapping to enable safe reconstruction
of extended attribute data attached to a file. Because xattrs do not
have any redundant information to draw off of, we can at best salvage
as much data as we can and build a new structure.
Rebuilding an extended attribute structure consists of these three
steps:
First, we walk the existing attributes to salvage as many of them as we
can, by adding them as new attributes attached to the repair tempfile.
We need to add a new xfile-based data structure to hold blobs of
arbitrary length to stage the xattr names and values.
Second, we write the salvaged attributes to a temporary file, and use
atomic extent swaps to exchange the entire attribute fork between the
two files.
Finally, we reap the old xattr blocks (which are now in the temporary
file) as carefully as we can.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
pgtVAQDEjDtM1TqUn8neJtXqtOPC2FZdLFq6Z1uSzxGWSRi9TwD/fcwgpvIrdF7g
LFrCRk9UUJZRxrK6kGb+RcEtSJwwNwc=
=FVWN
-----END PGP SIGNATURE-----
Merge tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: online repair of extended attributes
This series employs atomic extent swapping to enable safe reconstruction
of extended attribute data attached to a file. Because xattrs do not
have any redundant information to draw off of, we can at best salvage
as much data as we can and build a new structure.
Rebuilding an extended attribute structure consists of these three
steps:
First, we walk the existing attributes to salvage as many of them as we
can, by adding them as new attributes attached to the repair tempfile.
We need to add a new xfile-based data structure to hold blobs of
arbitrary length to stage the xattr names and values.
Second, we write the salvaged attributes to a temporary file, and use
atomic extent swaps to exchange the entire attribute fork between the
two files.
Finally, we reap the old xattr blocks (which are now in the temporary
file) as carefully as we can.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: create an xattr iteration function for scrub
xfs: flag empty xattr leaf blocks for optimization
xfs: scrub should set preen if attr leaf has holes
xfs: repair extended attributes
xfs: use atomic extent swapping to fix user file fork data
xfs: create a blob array data structure
xfs: enable discarding of folios backing an xfile
There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.
The first change is because online repair is designed to use libxfs to
create a replacement dir/xattr structure in a temporary file, and use
atomic extent swapping to commit the corrected structure. To avoid the
performance hit of walking every block of the new structure to rewrite
the owner number before the swap, we instead change libxfs to allow
callers of the dir and xattr code the ability to set an explicit owner
number to be written into the header fields of any new blocks that are
created. For regular operation this will be the directory inode number.
The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UwAKCRBKO3ySh0YR
po6LAQD4r4tzS1xMNW/ynLntzLNiYukkjI8uGpHw1tPpwmskKQD8Dw+bmGQPjUUR
v62p5rUNcinvgxwdJwBcsOERGlVkIww=
=6MPB
-----END PGP SIGNATURE-----
Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: set and validate dir/attr block owners
There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.
The first change is because online repair is designed to use libxfs to
create a replacement dir/xattr structure in a temporary file, and use
atomic extent swapping to commit the corrected structure. To avoid the
performance hit of walking every block of the new structure to rewrite
the owner number before the swap, we instead change libxfs to allow
callers of the dir and xattr code the ability to set an explicit owner
number to be written into the header fields of any new blocks that are
created. For regular operation this will be the directory inode number.
The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: validate explicit directory free block owners
xfs: validate explicit directory block buffer owners
xfs: validate explicit directory data buffer owners
xfs: validate directory leaf buffer owners
xfs: validate dabtree node buffer owners
xfs: validate attr remote value buffer owners
xfs: validate attr leaf buffer owners
xfs: reduce indenting in xfs_attr_node_list
xfs: use the xfs_da_args owner field to set new dir/attr block owner
xfs: add an explicit owner field to xfs_da_args
We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure. To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.
After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore
copy computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
pp/sAQCwODvb3ahgWbNMp6ejewJ1p8NjbmgRbTFckcwWROz4zQEAltyzqZBr6GuC
4E+7LGrh3Q03VX6pqNBGcGelNj20UgM=
=c0Gb
-----END PGP SIGNATURE-----
Merge tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: online repair of realtime summaries
We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure. To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.
After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore
copy computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: online repair of realtime summaries
xfs: teach the tempfile to set up atomic file content exchanges
xfs: support preallocating and copying content into temporary files
As mentioned earlier, the repair strategy for file-based metadata is to
build a new copy in a temporary file and swap the file fork mappings
with the metadata inode. We've built the atomic extent swap facility,
so now we need to build a facility for handling private temporary files.
The first step is to teach the filesystem to ignore the temporary files.
We'll mark them as PRIVATE in the VFS so that the kernel security
modules will leave it alone. The second step is to add the online
repair code the ability to create a temporary file and reap extents from
the temporary file after the extent swap.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
pukxAQCWf6T3FpJHXPHAwc8ANNWAZHufPTn8LH1m2DTKal6rGgEAo0rnqmV5xN/p
WDMMAm5ngW2mDlBJFX8ClE2+1DHsqAc=
=SKWV
-----END PGP SIGNATURE-----
Merge tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: create temporary files for online repair
As mentioned earlier, the repair strategy for file-based metadata is to
build a new copy in a temporary file and swap the file fork mappings
with the metadata inode. We've built the atomic extent swap facility,
so now we need to build a facility for handling private temporary files.
The first step is to teach the filesystem to ignore the temporary files.
We'll mark them as PRIVATE in the VFS so that the kernel security
modules will leave it alone. The second step is to add the online
repair code the ability to create a temporary file and reap extents from
the temporary file after the extent swap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: add the ability to reap entire inode forks
xfs: refactor live buffer invalidation for repairs
xfs: create temporary files and directories for online repair
xfs: hide private inodes from bulkstat and handle functions
This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.
This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes. A
successful call completion guarantees that the new contents will be seen
even if the system fails.
The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically. The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.
User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file. If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas. Note that application software must quiesce writes to the file
while it stages an atomic update. This will be addressed by a
subsequent series.
This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file. For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.
However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed. For file access mediated
via read and write, fanotify or inotify are probably sufficient. For
mmaped files, that may not be fast enough.
Here is the proposed manual page:
IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
NAME
ioctl_xfs_exchange_range - exchange the contents of parts of
two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>
int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct xfs_ex‐
change_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges.
Exchanges are atomic with regards to concurrent file opera‐
tions. Implementations must guarantee that readers see either
the old contents or the new contents in their entirety, even if
the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_exchange_range {
__s32 file1_fd;
__u32 pad;
__u64 file1_offset;
__u64 file2_offset;
__u64 length;
__u64 flags;
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges must be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
XFS_EXCHANGE_RANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)).
XFS_EXCHANGE_RANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHANGE_RANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block. For files on the realtime device, this is
the realtime extent size. This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHANGE_RANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem ex‐
change the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call. In all
cases, application software must coordinate updates to the file
because the exchange is performed unconditionally.
The first is a data storage program that wants to commit non-
contiguous updates to a file atomically and coordinates write
access to that file. This can be done by creating a temporary
file, calling FICLONE(2) to share the contents, and staging the
updates into the temporary file. The FULL_FILES flag is recom‐
mended for this purpose. The temporary file can be deleted or
punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
struct xfs_exchange_range args = {
.file1_fd = temp_fd,
.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
The second is a software-defined storage host (e.g. a disk
jukebox) which implements an atomic scatter-gather write com‐
mand. Provided the exported disk's logical block size matches
the file's allocation unit size, this can be done by creating a
temporary file and writing the data at the appropriate offsets.
It is recommended that the temporary file be truncated to the
size of the regular file before any writes are staged to the
temporary file to avoid issues with zeroing during EOF exten‐
sion. Use this call with the FILE1_WRITTEN flag to exchange
only the file allocation units involved in the emulated de‐
vice's write command. The temporary file should be truncated
or punched out completely before being reused to stage another
write.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
int blksz;
fstat(fd, &sb);
blksz = sb.st_blksize;
/* land scatter gather writes between 100fsb and 500fsb */
pwrite(temp_fd, data1, blksz * 2, blksz * 100);
pwrite(temp_fd, data2, blksz * 20, blksz * 480);
pwrite(temp_fd, data3, blksz * 7, blksz * 257);
/* commit the entire update */
struct xfs_exchange_range args = {
.file1_fd = temp_fd,
.file1_offset = blksz * 100,
.file2_offset = blksz * 100,
.length = blksz * 400,
.flags = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
XFS_EXCHANGE_RANGE_FILE1_DSYNC,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down. Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.
Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years. It is also not
the RWF_ATOMIC patchset that has been shared. This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.
As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata. The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.
Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges. This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode. If this
completes successfully, the new contents can be committed atomically
into the inode being repaired. This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.
For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
pmYQAQCGwoAev/oRzIJrZmbpzNaU9w7XEPF+tW3vJSX6tlxG+wD8DIi4kTAplu/9
i860EFqZp5MuwHyGVDCac0owigtt6wk=
=Lsls
-----END PGP SIGNATURE-----
Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: atomic file content exchanges
This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.
This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes. A
successful call completion guarantees that the new contents will be seen
even if the system fails.
The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically. The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.
User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file. If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas. Note that application software must quiesce writes to the file
while it stages an atomic update. This will be addressed by a
subsequent series.
This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file. For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.
However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed. For file access mediated
via read and write, fanotify or inotify are probably sufficient. For
mmaped files, that may not be fast enough.
Here is the proposed manual page:
IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
NAME
ioctl_xfs_exchange_range - exchange the contents of parts of
two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>
int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct xfs_ex‐
change_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges.
Exchanges are atomic with regards to concurrent file opera‐
tions. Implementations must guarantee that readers see either
the old contents or the new contents in their entirety, even if
the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_exchange_range {
__s32 file1_fd;
__u32 pad;
__u64 file1_offset;
__u64 file2_offset;
__u64 length;
__u64 flags;
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges must be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
XFS_EXCHANGE_RANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)).
XFS_EXCHANGE_RANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHANGE_RANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block. For files on the realtime device, this is
the realtime extent size. This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHANGE_RANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem ex‐
change the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call. In all
cases, application software must coordinate updates to the file
because the exchange is performed unconditionally.
The first is a data storage program that wants to commit non-
contiguous updates to a file atomically and coordinates write
access to that file. This can be done by creating a temporary
file, calling FICLONE(2) to share the contents, and staging the
updates into the temporary file. The FULL_FILES flag is recom‐
mended for this purpose. The temporary file can be deleted or
punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
struct xfs_exchange_range args = {
.file1_fd = temp_fd,
.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
The second is a software-defined storage host (e.g. a disk
jukebox) which implements an atomic scatter-gather write com‐
mand. Provided the exported disk's logical block size matches
the file's allocation unit size, this can be done by creating a
temporary file and writing the data at the appropriate offsets.
It is recommended that the temporary file be truncated to the
size of the regular file before any writes are staged to the
temporary file to avoid issues with zeroing during EOF exten‐
sion. Use this call with the FILE1_WRITTEN flag to exchange
only the file allocation units involved in the emulated de‐
vice's write command. The temporary file should be truncated
or punched out completely before being reused to stage another
write.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
int blksz;
fstat(fd, &sb);
blksz = sb.st_blksize;
/* land scatter gather writes between 100fsb and 500fsb */
pwrite(temp_fd, data1, blksz * 2, blksz * 100);
pwrite(temp_fd, data2, blksz * 20, blksz * 480);
pwrite(temp_fd, data3, blksz * 7, blksz * 257);
/* commit the entire update */
struct xfs_exchange_range args = {
.file1_fd = temp_fd,
.file1_offset = blksz * 100,
.file2_offset = blksz * 100,
.length = blksz * 400,
.flags = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
XFS_EXCHANGE_RANGE_FILE1_DSYNC,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down. Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.
Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years. It is also not
the RWF_ATOMIC patchset that has been shared. This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.
As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata. The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.
Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges. This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode. If this
completes successfully, the new contents can be committed atomically
into the inode being repaired. This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.
For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: enable logged file mapping exchange feature
docs: update swapext -> exchmaps language
xfs: capture inode generation numbers in the ondisk exchmaps log item
xfs: support non-power-of-two rtextsize with exchange-range
xfs: make file range exchange support realtime files
xfs: condense symbolic links after a mapping exchange operation
xfs: condense directories after a mapping exchange operation
xfs: condense extended attributes after a mapping exchange operation
xfs: add error injection to test file mapping exchange recovery
xfs: bind together the front and back ends of the file range exchange code
xfs: create deferred log items for file mapping exchanges
xfs: introduce a file mapping exchange log intent item
xfs: create a incompat flag for atomic file mapping exchanges
xfs: introduce new file range exchange ioctl
vfs: export remap and write check helpers
This series applies various cleanups and refactorings to file IO
handling code ahead of the main series to implement atomic file content
exchanges.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
pginAQCcZq5bFYJGYj4UInUOfDDjuq8R9Rl8DGDhnTFb8FxAUAD+Jsol2/wQduII
Bly/+Hegen2BoayNuf3iZG7RYkJWygo=
=tHoT
-----END PGP SIGNATURE-----
Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: refactorings for atomic file content exchanges
This series applies various cleanups and refactorings to file IO
handling code ahead of the main series to implement atomic file content
exchanges.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: constify xfs_bmap_is_written_extent
xfs: refactor non-power-of-two alignment checks
xfs: hoist multi-fsb allocation unit detection to a helper
xfs: create a new helper to return a file's allocation unit
xfs: declare xfs_file.c symbols in xfs_file.h
xfs: move xfs_iops.c declarations out of xfs_inode.h
xfs: move inode lease breaking functions to xfs_inode.c
This patchset improves the performance of log incompat feature bit
handling by making a few changes to how the filesystem handles them.
First, we now only clear the bits during a clean unmount to reduce calls
to the (expensive) upgrade function to once per bit per mount. Second,
we now only allow incompat feature upgrades for sysadmins or if the
sysadmin explicitly allows it via mount option. Currently the only log
incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so
there should be no user visible impact to this change.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
pharAQDYPe2dmXxQ28fUC1GMy38DLCt0f/XXLkhUI0yeC+mbcQD/aLWU5MCy5TGN
OF3HQwrgL9+SZWwhqRbu8hPG7ZLBWQQ=
=yieh
-----END PGP SIGNATURE-----
Merge tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
xfs: improve log incompat feature handling
This patchset improves the performance of log incompat feature bit
handling by making a few changes to how the filesystem handles them.
First, we now only clear the bits during a clean unmount to reduce calls
to the (expensive) upgrade function to once per bit per mount. Second,
we now only allow incompat feature upgrades for sysadmins or if the
sysadmin explicitly allows it via mount option. Currently the only log
incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so
there should be no user visible impact to this change.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: only clear log incompat flags at clean unmount
xfs: fix error bailout in xrep_abt_build_new_trees
xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory
xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
xfs: pass xfs_buf lookup flags to xfs_*read_agi
If a symbolic link target looks bad, try to sift through the rubble to
find as much of the target buffer that we can, and stage a new target
(short or remote format as needed) in a temporary file and use the
atomic extent swapping mechanism to commit the results. In the worst
case, we replace the target with an overly long filename that cannot
possibly resolve.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When the orphanage adopts a file, that file becomes a child of the
orphanage. The dentry cache may have entries for the orphanage
directory and the name we've chosen, so (1) make sure we abort if the
dcache has a positive entry because something's not right; and (2)
invalidate and purge negative dentries if the adoption goes through.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Require callers of xfs_symlink_write_target to pass the owner number
explicitly. This sets us up for online repair to be able to write a
remote symlink target to sc->tempip with sc->ip's inumber in the block
heaader.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we encounter an inode with a nonzero link count but zero observed
links, move it to the orphanage.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Allow online repair to call xfs_bmap_local_to_extents and add a void *
argument at the end so that online repair can pass its own context.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
It's possible that the dentry cache can tell us the parent of a
directory. Therefore, when repairing directory dot dot entries, query
the dcache as a last resort before scanning the entire filesystem.
A reviewer asks:
"How high is the chance that we actually have a valid dcache entry for a
file in a corrupted directory?"
There's a decent chance of this actually working. Say you have a
1000-block directory foo, and block 980 gets corrupted. Let's further
suppose that block 0 has a correct entry for ".." and "bar". If someone
accesses /mnt/foo/bar, that will cause the dcache to create a dentry
from /mnt to /mnt/foo whose d_parent points back to /mnt. If you then
want to rebuild the directory, XFS can obtain the parent from the dcache
without needing to wander into parent pointers or scan the filesystem to
find /mnt's connection to foo.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're repairing a directory structure or fixing the dotdot entry of
a subdirectory, it's possible that we won't ever find a parent for the
subdirectory. When this is the case, move it to the orphanage, aka
/lost+found.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach the online repair code to fix parent pointers for directories.
For now, this means correcting the dotdot entry of an existing directory
that is otherwise consistent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach the online directory repair code to scan the filesystem so that we
can set the dotdot entry when we're rebuilding a directory. This
involves dropping ILOCK on the directory that we're repairing, which
means that the VFS can sneak in and tell us to update dotdot at any
time. Deal with these races by using a dirent hook to absorb dotdot
updates, and be careful not to check the scan results until after we've
retaken the ILOCK.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're repairing the link counts of a file, we must ensure either
that the file has zero link count and is on the unlinked list; or that
it has nonzero link count and is not on the unlinked list.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If a directory looks like it's in bad shape, try to sift through the
rubble to find whatever directory entries we can, scan the directory
tree for the parent (if needed), stage the new directory contents in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach inode inactivation to delete all the incore buffers backing a
directory. In normal runtime this should never happen because the VFS
forbids rmdir on a non-empty directory.
In the next patch, online directory repair stands up a new directory,
exchanges it with the broken directory, and then drops the private
temporary directory. If we cancel the repair just prior to exchanging
the directory contents, the new directory will need to be torn down.
Note: If we commit the repair, reaping will take care of all the ondisk
space allocations and incore buffers for the old corrupt directory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a streamlined function to walk a file's xattrs, without all the
cursor management stuff in the regular listxattr.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we have the means to tell if an inode is on an unlinked inode
list or not, we can check that an inode with zero link count is on the
unlinked list; and an inode that has nonzero link count is not on that
list. Make repair clean things up too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Empty xattr leaf blocks at offset zero are a waste of space but
otherwise harmless. If we encounter one, flag it as an opportunity for
optimization.
If we encounter empty attr leaf blocks anywhere else in the attr fork,
that's corruption.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, stage a new attribute structure in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Build on the code that was recently added to the temporary repair file
code so that we can atomically switch the contents of any file fork,
even if the fork is in local format. The upcoming functions to repair
xattrs, directories, and symlinks will need that capability.
Repair can lock out access to these user files by holding IOLOCK_EXCL on
these user files. Therefore, it is safe to drop the ILOCK of both the
file being repaired and the tempfile being used for staging, and cancel
the scrub transaction. We do this so that we can reuse the resource
estimation and transaction allocation functions used by a regular file
exchange operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata. For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new xfile function to discard the page cache that's backing
part of an xfile. The next patch wil use this to drop parts of an xfile
that aren't needed anymore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Port the existing directory freespace block header checking function to
accept an owner number instead of an xfs_inode, then update the
callsites to use xfs_da_args.owner when possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Port the existing directory block header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Port the existing directory data header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a leaf block header checking function to validate the owner field
of xattr leaf blocks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reduce the indentation here so that we can add some things in the next
patch without going over the column limits.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're creating leaf, data, freespace, or dabtree blocks for
directories and xattrs, use the explicit owner field (instead of the
xfs_inode) to set the owner field. This will enable online repair to
construct replacement data structures in a temporary file without having
to change the owner fields prior to swapping the new and old structures.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add an explicit owner field to xfs_da_args, which will make it easier
for online fsck to set the owner field of the temporary directory and
xattr structures that it builds to repair damaged metadata.
Note: I hopefully found all the xfs_da_args definitions by looking for
automatic stack variable declarations and xfs_da_args.dp assignments:
git grep -E '(args.*dp =|struct xfs_da_args[[:space:]]*[a-z0-9][a-z0-9]*)'
Note that callers of xfs_attr_{get,set,change} can set the owner to zero
(or leave it unset) to have the default set to args->dp.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Repair the realtime summary data by constructing a new rtsummary file in
the scrub temporary file, then atomically swapping the contents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.
IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create some new routines to exchange the contents of a temporary file
created to stage a repair with another ondisk file. This will be used
by the realtime summary repair function to commit atomically the new
rtsummary data, which will be staged in the tempfile.
The rest of XFS coordinates access to the realtime metadata inodes
solely through the ILOCK. For repair to hold its exclusive access to
the realtime summary file, it has to allocate a single large transaction
and roll it repeatedly throughout the repair while holding the ILOCK.
In turn, this means that for now there's only a partial file mapping
exchange implementation for the temporary file because we can only work
within an existing transaction.
For now, the only tempswap functions needed here are to estimate the
resource requirements of the exchange, reserve more space/quota to an
existing transaction, and kick off the actual exchange. The rest will
be added in a later patch in preparation for repairing xattrs and
directories.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create the routines we need to preallocate space in a temporary ondisk
file and then copy the contents of an xfile into the tempfile. The
upcoming rtsummary repair feature will construct the contents of a
realtime summary file in memory, after which it will want to copy all
that into the ondisk temporary file before atomically committing the new
rtsummary contents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers. Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach the online repair code how to create temporary files or
directories. These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We're about to start adding functionality that uses internal inodes that
are private to XFS. What this means is that userspace should never be
able to access any information about these files, and should not be able
to open these files by handle.
To prevent users from ever finding the file or mis-interactions with the
security apparatus, set S_PRIVATE on the inode. Don't allow bulkstat,
open-by-handle, or linking of S_PRIVATE files into the directory tree.
This should keep private inodes actually private.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features
that we will permit when mounting a filesystem. This turns on support
for the file range exchange feature.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Start reworking the atomic swapext design documentation to refer to its
new file contents/mapping exchange name.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>