linux

iv/linux

Author	SHA1	Message	Date
Dave Chinner	2c03d9560e	xfs: fix CIL sparse lock context warnings Sparse reports: fs/xfs/xfs_log_cil.c:1127:1: warning: context imbalance in 'xlog_cil_push_work' - different lock contexts for basic block fs/xfs/xfs_log_cil.c:1380:1: warning: context imbalance in 'xlog_cil_push_background' - wrong count at exit fs/xfs/xfs_log_cil.c:1623:9: warning: context imbalance in 'xlog_cil_commit' - unexpected unlock xlog_cil_push_background() has a locking annotation for an rw_sem. Sparse does not track lock contexts for rw_sems, so the annotation generates false warnings. Remove the annotation. xlog_wait_on_iclog() drops the log->l_ic_loglock. The function has a sparse annotation, but the prototype in xfs_log_priv.h does not. Hence the warning from xlog_cil_push_work() which calls xlog_wait_on_iclog(). Add the missing annotation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-20 20:23:59 +05:30
Chandan Babu R	9cb5f15d88	xfs: retain ILOCK during directory updates [v13.2 16/16] This series changes the directory update code to retain the ILOCK on all files involved in a rename until the end of the operation. The upcoming parent pointers patchset applies parent pointers in a separate chained update from the actual directory update, which is why it is now necessary to keep the ILOCK instead of dropping it after the first transaction in the chain. As a side effect, we no longer need to hold the IOLOCK during an rmapbt scan of inodes to serialize the scan with ongoing directory updates. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: unlock new repair tempfiles after creation xfs: don't pick up IOLOCK during rmapbt repair scan xfs: Hold inode locks in xfs_rename xfs: Hold inode locks in xfs_trans_alloc_dir xfs: Hold inode locks in xfs_ialloc xfs: Increase XFS_QM_TRANS_MAXDQS to 5 xfs: Increase XFS_DEFER_OPS_NR_INODES to 5	2024-04-16 12:53:08 +05:30
Chandan Babu R	f910defd38	xfs: design documentation for online fsck, part 2 [v13.2 15/16] This series updates the design documentation for online fsck to reflect the final design of the parent pointers feature as well as the implementation of online fsck for the new metadata. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: docs: describe xfs directory tree online fsck docs: update offline parent pointer repair strategy docs: update online directory and parent pointer repair sections docs: update the parent pointers documentation to the final version	2024-04-16 12:50:19 +05:30
Chandan Babu R	6ad1b91470	xfs: less heavy locks during fstrim [v30.3 14/16] Congratulations! You have made it to the final patchset of the main online fsck feature! This patchset fixes some stalling behavior that I observed when running FITRIM against large flash-based filesystems with very heavily fragmented free space data. In summary -- the current fstrim implementation optimizes for trimming the largest free extents first, and holds the AGF lock for the duration of the operation. This is great if fstrim is being run as a foreground process by a sysadmin. For xfs_scrub, however, this isn't so good -- we don't really want to block on one huge kernel call while reporting no progress information. We don't want to hold the AGF so long that background processes stall. These problems are easily fixable by issuing smaller FITRIM calls, but there's still the problem of walking the entire cntbt. To solve that second problem, we introduce a new sub-AG FITRIM implementation. To solve the first problem, make it relax the AGF periodically. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'discard-relax-locks-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'discard-relax-locks-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix performance problems when fstrimming a subset of a fragmented AG	2024-04-16 12:46:28 +05:30
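A rough sketch of the "smaller FITRIM calls" idea described in the message above: split one large byte range into bounded sub-ranges, so that each call stays short and the caller can report progress between calls. The `trim_chunk` type and the `split_trim_range` helper are hypothetical names chosen for illustration. This is not the xfs_scrub or kernel implementation, and it issues no ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one bounded trim request. */
struct trim_chunk {
    uint64_t start; /* first byte of the sub-range */
    uint64_t len;   /* number of bytes in the sub-range */
};

/*
 * Split [start, start + len) into pieces of at most chunk_len bytes.
 * Writes up to max_chunks entries into out and returns how many were
 * written. The caller is assumed to pass a range that does not wrap
 * around the 64-bit address space.
 */
static size_t split_trim_range(uint64_t start, uint64_t len,
                               uint64_t chunk_len,
                               struct trim_chunk *out, size_t max_chunks)
{
    size_t n = 0;

    assert(out != NULL);
    if (chunk_len == 0)
        return 0; /* a zero-sized chunk would never make progress */

    while (len > 0 && n < max_chunks) {
        uint64_t this_len = len < chunk_len ? len : chunk_len;

        out[n].start = start;
        out[n].len = this_len;
        start += this_len;
        len -= this_len;
        n++;
    }
    return n;
}
```

A caller would issue one trim request per returned chunk and could print progress after each one.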
Chandan Babu R	9ba8e658d8	xfs: inode-related repair fixes [v30.3 13/16] While doing QA of the online fsck code, I made a few observations: first, nobody was checking that the di_onlink field is actually zero; second, that allocating a temporary file for repairs can fail (and thus bring down the entire fs) if the inode cluster is corrupt; third, that file link counts do not pin at ~0U to prevent integer overflows; and fourth, that the x{chk,rep}_metadata_inode_fork functions should be subclassing the main scrub context, not modifying the parent's setup willy-nilly. This scattered patchset fixes those four problems. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'inode-repair-improvements-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'inode-repair-improvements-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype xfs: pin inodes that would otherwise overflow link count xfs: try to avoid allocating from sick inode clusters xfs: check unused nlink fields in the ondisk inode	2024-04-16 12:42:17 +05:30
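To illustrate what "pinning" a link count at ~0U means, here is a minimal sketch of saturating link-count arithmetic. The names `NLINK_PINNED`, `nlink_inc`, and `nlink_dec` are hypothetical and the policy shown is an assumption for illustration, not a copy of the kernel code: once the counter reaches the maximum value it is treated as stuck, so it can neither wrap around to zero nor be decremented back into a range where it would be wrong.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: a link count that has saturated. */
#define NLINK_PINNED UINT32_MAX

static_assert(NLINK_PINNED == 0xFFFFFFFFu, "sentinel is ~0U for 32 bits");

/* Add one link; a pinned count stays pinned instead of wrapping to 0. */
static uint32_t nlink_inc(uint32_t nlink)
{
    if (nlink == NLINK_PINNED)
        return NLINK_PINNED;
    return nlink + 1; /* reaching NLINK_PINNED here pins the count */
}

/*
 * Drop one link. A pinned count is never decremented because the true
 * number of links is no longer known; zero is left alone as well.
 */
static uint32_t nlink_dec(uint32_t nlink)
{
    if (nlink == NLINK_PINNED || nlink == 0)
        return nlink;
    return nlink - 1;
}
```

The design choice is that a pinned inode errs on the side of never being freed, which is safer than a count that overflows to zero while directory entries still point at the file.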
Chandan Babu R	1eef01250d	xfs: online fsck of iunlink buckets [v30.3 12/16] This series enhances the AGI scrub code to check the unlinked inode bucket lists for errors, and fixes them if necessary. Now that iunlink pointer updates are virtual log items, we can batch updates pretty efficiently in the logging code. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-iunlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-iunlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: repair AGI unlinked inode bucket lists xfs: hoist AGI repair context to a heap object xfs: check AGI unlinked inode buckets	2024-04-16 12:38:25 +05:30
Chandan Babu R	0313dd8fac	xfs: online repair of symbolic links [v30.3 11/16] The patches in this set add the ability to repair the target buffer of a symbolic link, using the same salvage, rebuild, and swap strategy used everywhere else. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: online repair of symbolic links xfs: pass the owner to xfs_symlink_write_target xfs: expose xfs_bmap_local_to_extents for online repair	2024-04-16 12:28:57 +05:30
Chandan Babu R	067d3f7100	xfs: move orphan files to lost and found [v30.3 10/16] Orphaned files are defined to be files with nonzero ondisk link count but no observable parent directory. This series enables online repair to reparent orphaned files into the filesystem directory tree, and wires up this reparenting ability into the directory, file link count, and parent pointer repair functions. This is how we fix files with positive link count that are not reachable through the directory tree. This patch will also create the orphanage directory (lost+found) if it is not present. In contrast to xfs_repair, we follow e2fsck in creating the lost+found without group or other-owner access to avoid accidental disclosure of files that were previously hidden by an 0700 directory. That's silly security, but people have been known to do it. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: ensure dentry consistency when the orphanage adopts a file xfs: move files to orphanage instead of letting nlinks drop to zero xfs: move orphan files to the orphanage	2024-04-16 12:18:18 +05:30
Chandan Babu R	9e6b93b727	xfs: online repair of directories [v30.3 09/16] This series employs atomic extent swapping to enable safe reconstruction of directory data. For now, XFS does not support reverse directory links (aka parent pointers), so we can only salvage the dirents of a directory and construct a new structure. Directory repair therefore consists of five main parts: First, we walk the existing directory to salvage as many entries as we can, by adding them as new directory entries to the repair temp dir. Second, we validate the parent pointer found in the directory. If one was not found, we scan the entire filesystem looking for a potential parent. Third, we use atomic extent swaps to exchange the entire data fork between the two directories. Fourth, we reap the old directory blocks as carefully as we can. To wrap up the directory repair code, we need to add to the regular filesystem the ability to free all the data fork blocks in a directory. This does not change anything with normal directories, since they must still unlink and shrink one entry at a time. However, this will facilitate freeing of partially-inactivated temporary directories during log recovery. The second half of this patchset implements repairs for the dotdot entries of directories. For now there is only rudimentary support for this, because there are no directory parent pointers, so the best we can do is scanning the filesystem and the VFS dcache for answers. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: ask the dentry cache if it knows the parent of a directory xfs: online repair of parent pointers xfs: scan the filesystem to repair a directory dotdot entry xfs: online repair of directories xfs: inactivate directory data blocks	2024-04-16 12:01:06 +05:30
Chandan Babu R	902603bfa1	xfs: online repair of inode unlinked state [v30.3 08/16] This series adds some logic to the inode scrubbers so that they can detect and deal with consistency errors between the link count and the per-inode unlinked list state. The helpers needed to do this are presented here because they are a prerequisite for rebuilding directories, since we need to get a rebuilt non-empty directory off the unlinked list. Note that this patchset does not provide comprehensive reconstruction of the AGI unlinked list; that is coming in a subsequent patchset. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: update the unlinked list when repairing link counts xfs: ensure unlinked list state is consistent with nlink during scrub	2024-04-16 11:57:10 +05:30
Chandan Babu R	5f3e951186	xfs: online repair of extended attributes [v30.3 07/16] This series employs atomic extent swapping to enable safe reconstruction of extended attribute data attached to a file. Because xattrs do not have any redundant information to draw off of, we can at best salvage as much data as we can and build a new structure. Rebuilding an extended attribute structure consists of these three steps: First, we walk the existing attributes to salvage as many of them as we can, by adding them as new attributes attached to the repair tempfile. We need to add a new xfile-based data structure to hold blobs of arbitrary length to stage the xattr names and values. Second, we write the salvaged attributes to a temporary file, and use atomic extent swaps to exchange the entire attribute fork between the two files. Finally, we reap the old xattr blocks (which are now in the temporary file) as carefully as we can. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: create an xattr iteration function for scrub xfs: flag empty xattr leaf blocks for optimization xfs: scrub should set preen if attr leaf has holes xfs: repair extended attributes xfs: use atomic extent swapping to fix user file fork data xfs: create a blob array data structure xfs: enable discarding of folios backing an xfile	2024-04-16 11:53:09 +05:30
Chandan Babu R	fb1f7c662c	xfs: set and validate dir/attr block owners [v30.3 06/16] There are a couple of significant changes that need to be made to the directory and xattr code before we can support online repairs of those data structures. The first change is because online repair is designed to use libxfs to create a replacement dir/xattr structure in a temporary file, and use atomic extent swapping to commit the corrected structure. To avoid the performance hit of walking every block of the new structure to rewrite the owner number before the swap, we instead change libxfs to allow callers of the dir and xattr code the ability to set an explicit owner number to be written into the header fields of any new blocks that are created. For regular operation this will be the directory inode number. The second change is to update the dir/xattr code to actually check the owner number in each block that is read off the disk, since we don't currently do that. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: validate explicit directory free block owners xfs: validate explicit directory block buffer owners xfs: validate explicit directory data buffer owners xfs: validate directory leaf buffer owners xfs: validate dabtree node buffer owners xfs: validate attr remote value buffer owners xfs: validate attr leaf buffer owners xfs: reduce indenting in xfs_attr_node_list xfs: use the xfs_da_args owner field to set new dir/attr block owner xfs: add an explicit owner field to xfs_da_args	2024-04-16 11:48:46 +05:30
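The two changes described above (stamp an explicit owner into new blocks, then check the owner on every read) can be sketched as follows. All names here (`blk_hdr`, `da_args_sketch`, `init_new_block`, `block_owner_ok`) are hypothetical stand-ins for illustration. The only detail taken from the patch list is that the call arguments carry an explicit owner field; the real structures and checks in libxfs differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk block header that records its owning inode. */
struct blk_hdr {
    uint32_t magic;
    uint64_t owner;
};

/* Hypothetical call arguments: the file being written, plus an owner. */
struct da_args_sketch {
    uint64_t inumber; /* inode whose fork the block is written into */
    uint64_t owner;   /* inode number stamped into new block headers */
};

/* New blocks take the explicit owner, not the inode being written to. */
static void init_new_block(struct blk_hdr *hdr, uint32_t magic,
                           const struct da_args_sketch *args)
{
    assert(hdr != NULL && args != NULL);
    hdr->magic = magic;
    hdr->owner = args->owner;
}

/* Read-side check: reject a block whose header names another owner. */
static bool block_owner_ok(const struct blk_hdr *hdr,
                           const struct da_args_sketch *args)
{
    return hdr->owner == args->owner;
}
```

In regular operation `owner` equals `inumber`. During a repair, the blocks are built inside a temporary file but stamped with the damaged file's inode number, so no header needs rewriting after the extent swap.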
Chandan Babu R	8b309acd10	xfs: online repair of realtime summaries [v30.3 05/16] We now have all the infrastructure we need to repair file metadata. We'll begin with the realtime summary file, because it is the least complex data structure. To support this we need to add three more pieces to the temporary file code from the previous patchset -- preallocating space in the temp file, formatting metadata into that space and writing the blocks to disk, and swapping the fork mappings atomically. After that, the actual reconstruction of the realtime summary information is pretty simple, since we can simply write the incore copy computed by the rtsummary scrubber to the temporary file, swap the contents, and reap the old blocks. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: online repair of realtime summaries xfs: teach the tempfile to set up atomic file content exchanges xfs: support preallocating and copying content into temporary files	2024-04-16 11:45:33 +05:30
Chandan Babu R	783c51708b	xfs: create temporary files for online repair [v30.3 04/16] As mentioned earlier, the repair strategy for file-based metadata is to build a new copy in a temporary file and swap the file fork mappings with the metadata inode. We've built the atomic extent swap facility, so now we need to build a facility for handling private temporary files. The first step is to teach the filesystem to ignore the temporary files. We'll mark them as PRIVATE in the VFS so that the kernel security modules will leave them alone. The second step is to add to the online repair code the ability to create a temporary file and reap extents from the temporary file after the extent swap. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: add the ability to reap entire inode forks xfs: refactor live buffer invalidation for repairs xfs: create temporary files and directories for online repair xfs: hide private inodes from bulkstat and handle functions	2024-04-16 11:40:13 +05:30
Chandan Babu R	22d5a8e52d	xfs: atomic file content exchanges [v30.3 03/16] This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange ranges of bytes between two files atomically. This new functionality enables data storage programs to stage and commit file updates such that reader programs will see either the old contents or the new contents in their entirety, with no chance of torn writes. A successful call completion guarantees that the new contents will be seen even if the system fails. The ability to exchange file fork mappings between files in this manner is critical to supporting online filesystem repair, which is built upon the strategy of constructing a clean copy of a damaged structure and committing the new structure into the metadata file atomically. The ioctls exist to facilitate testing of the new functionality and to enable future application program designs. User programs will be able to update files atomically by opening an O_TMPFILE, reflinking the source file to it, making whatever updates they want to make, and exchange the relevant ranges of the temp file with the original file. If the updates are aligned with the file block size, a new (since v2) flag provides for exchanging only the written areas. Note that application software must quiesce writes to the file while it stages an atomic update. This will be addressed by a subsequent series. This mechanism solves the clunkiness of two existing atomic file update mechanisms: for O_TRUNC + rewrite, this eliminates the brief period where other programs can see an empty file. For create tempfile + rename, the need to copy file attributes and extended attributes for each file update is eliminated. However, this method introduces its own awkwardness -- any program initiating an exchange now needs to have a way to signal to other programs that the file contents have changed. For file access mediated via read and write, fanotify or inotify are probably sufficient. 
For mmapped files, that may not be fast enough. Here is the proposed manual page: IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2) NAME ioctl_xfs_exchange_range - exchange the contents of parts of two files SYNOPSIS #include <sys/ioctl.h> #include <xfs/xfs_fs.h> int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct xfs_exchange_range *arg); DESCRIPTION Given a range of bytes in a first file file1_fd and a second range of bytes in a second file file2_fd, this ioctl(2) exchanges the contents of the two ranges. Exchanges are atomic with regard to concurrent file operations. Implementations must guarantee that readers see either the old contents or the new contents in their entirety, even if the system fails. The system call parameters are conveyed in structures of the following form: struct xfs_exchange_range { __s32 file1_fd; __u32 pad; __u64 file1_offset; __u64 file2_offset; __u64 length; __u64 flags; }; The field pad must be zero. The fields file1_fd, file1_offset, and length define the first range of bytes to be exchanged. The fields file2_fd, file2_offset, and length define the second range of bytes to be exchanged. Both files must be from the same filesystem mount. If the two file descriptors represent the same file, the byte ranges must not overlap. Most disk-based filesystems require that the starts of both ranges be aligned to the file block size. If this is the case, the ends of the ranges must also be so aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set. The field flags controls the behavior of the exchange operation. XFS_EXCHANGE_RANGE_TO_EOF Ignore the length parameter. All bytes in file1_fd from file1_offset to EOF are moved to file2_fd, and file2's size is set to (file2_offset+(file1_length-file1_offset)). Meanwhile, all bytes in file2 from file2_offset to EOF are moved to file1 and file1's size is set to (file1_offset+(file2_length-file2_offset)). 
XFS_EXCHANGE_RANGE_DSYNC Ensure that all modified in-core data in both file ranges and all metadata updates pertaining to the exchange operation are flushed to persistent storage before the call returns. Opening either file descriptor with O_SYNC or O_DSYNC will have the same effect. XFS_EXCHANGE_RANGE_FILE1_WRITTEN Only exchange sub-ranges of file1_fd that are known to contain data written by application software. Each sub-range may be expanded (both upwards and downwards) to align with the file allocation unit. For files on the data device, this is one filesystem block. For files on the realtime device, this is the realtime extent size. This facility can be used to implement fast atomic scatter-gather writes of any complexity for software-defined storage targets if all writes are aligned to the file allocation unit. XFS_EXCHANGE_RANGE_DRY_RUN Check the parameters and the feasibility of the operation, but do not change anything. RETURN VALUE On error, -1 is returned, and errno is set to indicate the error. ERRORS Error codes can be one of, but are not limited to, the following: EBADF file1_fd is not open for reading and writing or is open for append-only writes; or file2_fd is not open for reading and writing or is open for append-only writes. EINVAL The parameters are not correct for these files. This error can also appear if either file descriptor represents a device, FIFO, or socket. Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block sizes of both files. EIO An I/O error occurred. EISDIR One of the files is a directory. ENOMEM The kernel was unable to allocate sufficient memory to perform the operation. ENOSPC There is not enough free space in the filesystem to exchange the contents safely. EOPNOTSUPP The filesystem does not support exchanging bytes between the two files. EPERM file1_fd or file2_fd are immutable. ETXTBSY One of the files is a swap file. EUCLEAN The filesystem is corrupt. 
EXDEV file1_fd and file2_fd are not on the same mounted filesystem. CONFORMING TO This API is XFS-specific. USE CASES Several use cases are imagined for this system call. In all cases, application software must coordinate updates to the file because the exchange is performed unconditionally. The first is a data storage program that wants to commit non-contiguous updates to a file atomically and coordinates write access to that file. This can be done by creating a temporary file, calling FICLONE(2) to share the contents, and staging the updates into the temporary file. The FULL_FILES flag is recommended for this purpose. The temporary file can be deleted or punched out afterwards. An example program might look like this: int fd = open("/some/file", O_RDWR); int temp_fd = open("/some", O_TMPFILE | O_RDWR); ioctl(temp_fd, FICLONE, fd); /* append 1MB of records */ lseek(temp_fd, 0, SEEK_END); write(temp_fd, data1, 1000000); /* update record index */ pwrite(temp_fd, data1, 600, 98765); pwrite(temp_fd, data2, 320, 54321); pwrite(temp_fd, data2, 15, 0); /* commit the entire update */ struct xfs_exchange_range args = { .file1_fd = temp_fd, .flags = XFS_EXCHANGE_RANGE_TO_EOF, }; ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args); The second is a software-defined storage host (e.g. a disk jukebox) which implements an atomic scatter-gather write command. Provided the exported disk's logical block size matches the file's allocation unit size, this can be done by creating a temporary file and writing the data at the appropriate offsets. It is recommended that the temporary file be truncated to the size of the regular file before any writes are staged to the temporary file to avoid issues with zeroing during EOF extension. Use this call with the FILE1_WRITTEN flag to exchange only the file allocation units involved in the emulated device's write command. The temporary file should be truncated or punched out completely before being reused to stage another write. 
An example program might look like this: int fd = open("/some/file", O_RDWR); int temp_fd = open("/some", O_TMPFILE | O_RDWR); struct stat sb; int blksz; fstat(fd, &sb); blksz = sb.st_blksize; /* land scatter gather writes between 100fsb and 500fsb */ pwrite(temp_fd, data1, blksz * 2, blksz * 100); pwrite(temp_fd, data2, blksz * 20, blksz * 480); pwrite(temp_fd, data3, blksz * 7, blksz * 257); /* commit the entire update */ struct xfs_exchange_range args = { .file1_fd = temp_fd, .file1_offset = blksz * 100, .file2_offset = blksz * 100, .length = blksz * 400, .flags = XFS_EXCHANGE_RANGE_FILE1_WRITTEN | XFS_EXCHANGE_RANGE_DSYNC, }; ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args); NOTES Some filesystems may limit the amount of data or the number of extents that can be exchanged in a single call. SEE ALSO ioctl(2) XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2) The reference implementation in XFS creates a new log incompat feature and log intent items to track high level progress of swapping ranges of two files and finish interrupted work if the system goes down. Sample code can be found in the corresponding changes to xfs_io to exercise the use case mentioned above. Note that this function is /not/ the O_DIRECT atomic untorn file writes concept that has also been floating around for years. It is also not the RWF_ATOMIC patchset that has been shared. This RFC is constructed entirely in software, which means that there are no limitations other than the general filesystem limits. As a side note, the original motivation behind the kernel functionality is online repair of file-based metadata. The atomic file content exchange is implemented as an atomic exchange of file fork mappings, which means that we can implement online reconstruction of extended attributes and directories by building a new one in another inode and exchanging the contents. Subsequent patchsets adapt the online filesystem repair code to use atomic file exchanges. 
This enables repair functions to construct a clean copy of a directory, xattr information, symbolic links, realtime bitmaps, and realtime summary information in a temporary inode. If this completes successfully, the new contents can be committed atomically into the inode being repaired. This is essential to avoid making corruption problems worse if the system goes down in the middle of running repair. For userspace, this series also includes the userspace pieces needed to test the new functionality, and a sample implementation of atomic file updates. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR pmYQAQCGwoAev/oRzIJrZmbpzNaU9w7XEPF+tW3vJSX6tlxG+wD8DIi4kTAplu/9 i860EFqZp5MuwHyGVDCac0owigtt6wk= =Lsls -----END PGP SIGNATURE----- Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: atomic file content exchanges Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: enable logged file mapping exchange feature docs: update swapext -> exchmaps language xfs: capture inode generation numbers in the ondisk exchmaps log item xfs: support non-power-of-two rtextsize with exchange-range xfs: make file range exchange support realtime files xfs: condense symbolic links after a mapping exchange operation xfs: condense directories after a mapping exchange operation xfs: condense extended attributes after a mapping exchange operation xfs: add error injection to test file mapping exchange recovery xfs: bind together the front and back ends of the file range exchange code xfs: create deferred log items for file mapping exchanges xfs: introduce a file mapping exchange log intent item xfs: create a incompat flag for atomic file mapping exchanges xfs: introduce new file range exchange ioctl vfs: export remap and write check helpers	2024-04-16 11:25:09 +05:30
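The XFS_EXCHANGE_RANGE_TO_EOF size rule quoted in the manual page of commit 22d5a8e52d can be worked through as plain arithmetic. The following is a minimal sketch only: predict_to_eof_sizes and struct eof_exchange_sizes are hypothetical names that do not exist in any XFS header, and the requirement that each offset lie within its file is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, not part of any XFS header. */
struct eof_exchange_sizes {
	uint64_t file1_size;
	uint64_t file2_size;
};

/*
 * Predict both file sizes after an exchange with
 * XFS_EXCHANGE_RANGE_TO_EOF, using the formulas from the manual page.
 */
static struct eof_exchange_sizes
predict_to_eof_sizes(uint64_t file1_length, uint64_t file1_offset,
		     uint64_t file2_length, uint64_t file2_offset)
{
	struct eof_exchange_sizes out;

	/* Assumption of this sketch: each offset lies within its file. */
	assert(file1_offset <= file1_length);
	assert(file2_offset <= file2_length);

	/* file2 receives file1's tail, and file1 receives file2's tail. */
	out.file2_size = file2_offset + (file1_length - file1_offset);
	out.file1_size = file1_offset + (file2_length - file2_offset);
	return out;
}
```

With both offsets at zero, as in the first example program of the manual page, the two files simply trade sizes.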
Chandan Babu R	4ec2e3c167	xfs: refactorings for atomic file content exchanges [v30.3 02/16] This series applies various cleanups and refactorings to file IO handling code ahead of the main series to implement atomic file content exchanges. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR pginAQCcZq5bFYJGYj4UInUOfDDjuq8R9Rl8DGDhnTFb8FxAUAD+Jsol2/wQduII Bly/+Hegen2BoayNuf3iZG7RYkJWygo= =tHoT -----END PGP SIGNATURE----- Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: refactorings for atomic file content exchanges This series applies various cleanups and refactorings to file IO handling code ahead of the main series to implement atomic file content exchanges. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: constify xfs_bmap_is_written_extent xfs: refactor non-power-of-two alignment checks xfs: hoist multi-fsb allocation unit detection to a helper xfs: create a new helper to return a file's allocation unit xfs: declare xfs_file.c symbols in xfs_file.h xfs: move xfs_iops.c declarations out of xfs_inode.h xfs: move inode lease breaking functions to xfs_inode.c	2024-04-16 11:17:28 +05:30
Chandan Babu R	ebe0f798e1	xfs: improve log incompat feature handling [v30.3 01/16] This patchset improves the performance of log incompat feature bit handling by making a few changes to how the filesystem handles them. First, we now only clear the bits during a clean unmount to reduce calls to the (expensive) upgrade function to once per bit per mount. Second, we now only allow incompat feature upgrades for sysadmins or if the sysadmin explicitly allows it via mount option. Currently the only log incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so there should be no user visible impact to this change. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR pharAQDYPe2dmXxQ28fUC1GMy38DLCt0f/XXLkhUI0yeC+mbcQD/aLWU5MCy5TGN OF3HQwrgL9+SZWwhqRbu8hPG7ZLBWQQ= =yieh -----END PGP SIGNATURE----- Merge tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: improve log incompat feature handling This patchset improves the performance of log incompat feature bit handling by making a few changes to how the filesystem handles them. First, we now only clear the bits during a clean unmount to reduce calls to the (expensive) upgrade function to once per bit per mount. Second, we now only allow incompat feature upgrades for sysadmins or if the sysadmin explicitly allows it via mount option. Currently the only log incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so there should be no user visible impact to this change. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: only clear log incompat flags at clean unmount xfs: fix error bailout in xrep_abt_build_new_trees xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode xfs: pass xfs_buf lookup flags to xfs_*read_agi	2024-04-16 11:09:58 +05:30
Darrick J. Wong	df76047147	xfs: unlock new repair tempfiles after creation After creation, drop the ILOCK on temporary files that have been created to stage a repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:03 -07:00
Darrick J. Wong	34ef5e17d5	xfs: don't pick up IOLOCK during rmapbt repair scan Now that we've fixed the directory operations to hold the ILOCK until they're finished with rmapbt updates for directory shape changes, we no longer need to take this lock when scanning directories for rmapbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:02 -07:00
Allison Henderson	69291726ca	xfs: Hold inode locks in xfs_rename Modify xfs_rename to hold all inode locks across a rename operation. We will need this later when we add parent pointers. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:02 -07:00
Allison Henderson	bd5562111d	xfs: Hold inode locks in xfs_trans_alloc_dir Modify xfs_trans_alloc_dir to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:02 -07:00
Allison Henderson	267979b4ce	xfs: Hold inode locks in xfs_ialloc Modify xfs_ialloc to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: hold the parent ilocked across transaction rolls too] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:02 -07:00
Darrick J. Wong	67bdcd4999	docs: describe xfs directory tree online fsck I've added a scrubber that checks the directory tree structure and fixes it; describe this in the design documentation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:01 -07:00
Allison Henderson	f103df7635	xfs: Increase XFS_QM_TRANS_MAXDQS to 5 With parent pointers enabled, a rename operation can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip and wip. This causes their dquots to be attached to the transaction chain, so we need to increase XFS_QM_TRANS_MAXDQS. This patch also adds a helper function xfs_dqlockn to lock an arbitrary number of dquots. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:01 -07:00
Darrick J. Wong	c91fe20e5a	docs: update offline parent pointer repair strategy Now update how xfs_repair checks and repairs parent pointer info. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:01 -07:00
Allison Henderson	7560c937b4	xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 Renames that generate parent pointer updates can join up to 5 inodes locked in sorted order. So we need to increase the number of defer ops inodes and relock them in the same way. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: have one sorting function] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:01 -07:00
Darrick J. Wong	b0ffe661fa	xfs: fix performance problems when fstrimming a subset of a fragmented AG On a 10TB filesystem where the free space in each AG is heavily fragmented, I noticed some very high runtimes on a FITRIM call for the entire filesystem. xfs_scrub likes to report progress information on each phase of the scrub, which means that a strace for the entire filesystem: ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> shows that scrub is uncommunicative for the entire duration. Reducing the size of the FITRIM requests to a single AG at a time produces lower times for each individual call, but even this isn't quite acceptable, because the time between progress reports is still very high: Strace for the first 4x 1TB AGs looks like (2): ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> I then had the idea to limit the length parameter of each call to a smallish amount (~11GB) so that we could report progress relatively quickly, but much to my surprise, each FITRIM call still took ~68 seconds! Unfortunately, the by-length fstrim implementation handles this poorly because it walks the entire free space by length index (cntbt), which is a very inefficient way to walk a subset of the blocks of an AG. Therefore, create a second implementation that will walk the bnobt and perform the trims in block number order. This implementation avoids the worst problems of the original code, though it lacks the desirable attribute of freeing the biggest chunks first. On the other hand, this second implementation makes it much easier to constrain the system call latency, and makes it much easier to report fstrim progress to anyone who's running xfs_scrub. 
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2024-04-15 14:59:00 -07:00
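The caller-side splitting of one large FITRIM request into smaller ones, as described in commit b0ffe661fa, can be sketched as follows. This is an illustration under stated assumptions, not xfs_scrub's code: struct trim_req, next_trim_req and trim_req_count are hypothetical names, the struct only mirrors the three fields visible in the strace output (a real caller would fill struct fstrim_range from <linux/fs.h> and issue the ioctl), and the sketch says nothing about the kernel-side bnobt walk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/len/minlen triple seen in strace. */
struct trim_req {
	uint64_t start;
	uint64_t len;
	uint64_t minlen;
};

/* Build the request covering [pos, end), capped at chunk bytes. */
static struct trim_req
next_trim_req(uint64_t pos, uint64_t end, uint64_t chunk)
{
	struct trim_req req = { .start = pos, .len = 0, .minlen = 0 };

	assert(chunk > 0 && pos <= end);
	req.len = (end - pos < chunk) ? end - pos : chunk;
	return req;
}

/* Number of requests needed to cover total bytes in chunk-sized pieces. */
static uint64_t
trim_req_count(uint64_t total, uint64_t chunk)
{
	assert(chunk > 0);
	return total / chunk + (total % chunk != 0);
}
```

A caller would advance pos by req.len after each request and report progress between calls. For the 10TB filesystem in the commit message, 1TB chunks give ten requests.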
Darrick J. Wong	1a5f6e08d4	xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype When a file-based metadata structure is being scrubbed in xchk_metadata_inode_subtype, we should create an entirely new scrub context so that each scrubber doesn't trip over another's buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:00 -07:00
Darrick J. Wong	5220727ce8	docs: update online directory and parent pointer repair sections Update the case studies of online directory and parent pointer reconstruction to reflect what they actually do in the final version. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:00 -07:00
Darrick J. Wong	d85fe250f2	docs: update the parent pointers documentation to the final version Now that we've decided on the ondisk format of parent pointers, update the documentation to reflect that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:59:00 -07:00
Darrick J. Wong	5f204051d9	xfs: pin inodes that would otherwise overflow link count The VFS inc_nlink function does not explicitly check for integer overflows in the i_nlink field. Instead, it checks the link count against s_max_links in the vfs_{link,create,rename} functions. XFS sets the maximum link count to 2.1 billion, so integer overflows should not be a problem. However, it's possible that online repair could find that a file has more than four billion links, particularly if the link count got corrupted while creating hardlinks to the file. The di_nlinkv2 field is not large enough to store a value larger than 2^32, so we ought to define a magic pin value of ~0U which means that the inode never gets deleted. This will prevent a UAF error if the repair finds this situation and users begin deleting links to the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:59 -07:00
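The pinning idea in commit 5f204051d9 amounts to a saturating counter: once the link count reaches the magic value it no longer moves, so the inode can never be freed by later unlinks. The sketch below illustrates that idea only and is not the kernel's implementation. NLINK_PINNED, bump_nlink and drop_nlink are names invented for this example; the commit message states only that the pin value is ~0U.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative name; equals ~0U for a 32-bit unsigned link count. */
#define NLINK_PINNED	UINT32_MAX

/* Add a link. A pinned count stays pinned instead of wrapping to zero. */
static uint32_t
bump_nlink(uint32_t nlink)
{
	if (nlink == NLINK_PINNED)
		return NLINK_PINNED;
	return nlink + 1;	/* reaching UINT32_MAX pins the inode */
}

/* Remove a link. A pinned count never decreases, so it never hits zero. */
static uint32_t
drop_nlink(uint32_t nlink)
{
	if (nlink == NLINK_PINNED)
		return NLINK_PINNED;
	assert(nlink > 0);
	return nlink - 1;
}
```

Because a pinned count is never decremented, deleting links to such a file cannot drive the count to zero while other links still exist, which is the use-after-free the commit message is guarding against.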
Darrick J. Wong	2935213a68	xfs: try to avoid allocating from sick inode clusters I noticed that xfs/413 and xfs/375 occasionally failed while fuzzing core.mode of an inode. The root cause of these problems is that the field we fuzzed (core.mode or core.magic, typically) causes the entire inode cluster buffer verification to fail, which affects several inodes at once. The repair process tries to create either a /lost+found or a temporary repair file, but regrettably it picks the same inode cluster that we just corrupted, with the result that repair triggers the demise of the filesystem. Try to avoid this by making the inode allocation path detect when the perag health status indicates that someone has found bad inode cluster buffers, and try to read the inode cluster buffer. If the cluster buffer fails the verifiers, try another AG. This isn't foolproof and can result in premature ENOSPC, but that might be better than shutting down. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:59 -07:00
Darrick J. Wong	40cb8613d6	xfs: check unused nlink fields in the ondisk inode v2/v3 inodes use di_nlink and not di_onlink; and v1 inodes use di_onlink and not di_nlink. Whichever field is not in use, make sure its contents are zero, and teach xfs_scrub to fix that if it is. This clears a bunch of missing scrub failure errors in xfs/385 for core.onlink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:59 -07:00
Darrick J. Wong	ab97f4b1c0	xfs: repair AGI unlinked inode bucket lists Teach the AGI repair code to rebuild the unlinked buckets and lists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:58 -07:00
Darrick J. Wong	2651923d8d	xfs: online repair of symbolic links If a symbolic link target looks bad, try to sift through the rubble to find as much of the target buffer as we can, and stage a new target (short or remote format as needed) in a temporary file and use the atomic extent swapping mechanism to commit the results. In the worst case, we replace the target with an overly long filename that cannot possibly resolve. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:58 -07:00
Darrick J. Wong	5b57257025	xfs: hoist AGI repair context to a heap object Save ~460 bytes of stack space by moving all the repair context to a heap object. We're going to add even more context data in the next patch, which is why we really need to do this now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:58 -07:00
Darrick J. Wong	10d587ecb7	xfs: check AGI unlinked inode buckets Look for corruptions in the AGI unlinked bucket chains. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:58 -07:00
Darrick J. Wong	73597e3e42	xfs: ensure dentry consistency when the orphanage adopts a file When the orphanage adopts a file, that file becomes a child of the orphanage. The dentry cache may have entries for the orphanage directory and the name we've chosen, so (1) make sure we abort if the dcache has a positive entry because something's not right; and (2) invalidate and purge negative dentries if the adoption goes through. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:57 -07:00
Darrick J. Wong	ea8214c319	xfs: pass the owner to xfs_symlink_write_target Require callers of xfs_symlink_write_target to pass the owner number explicitly. This sets us up for online repair to be able to write a remote symlink target to sc->tempip with sc->ip's inumber in the block header. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:57 -07:00
Darrick J. Wong	e6c9e75fbe	xfs: move files to orphanage instead of letting nlinks drop to zero If we encounter an inode with a nonzero link count but zero observed links, move it to the orphanage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:57 -07:00
Darrick J. Wong	ef744be416	xfs: expose xfs_bmap_local_to_extents for online repair Allow online repair to call xfs_bmap_local_to_extents and add a void * argument at the end so that online repair can pass its own context. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:57 -07:00
Darrick J. Wong	34c9382c12	xfs: ask the dentry cache if it knows the parent of a directory It's possible that the dentry cache can tell us the parent of a directory. Therefore, when repairing directory dot dot entries, query the dcache as a last resort before scanning the entire filesystem. A reviewer asks: "How high is the chance that we actually have a valid dcache entry for a file in a corrupted directory?" There's a decent chance of this actually working. Say you have a 1000-block directory foo, and block 980 gets corrupted. Let's further suppose that block 0 has a correct entry for ".." and "bar". If someone accesses /mnt/foo/bar, that will cause the dcache to create a dentry from /mnt to /mnt/foo whose d_parent points back to /mnt. If you then want to rebuild the directory, XFS can obtain the parent from the dcache without needing to wander into parent pointers or scan the filesystem to find /mnt's connection to foo. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:56 -07:00
Darrick J. Wong	1e58a8ccf2	xfs: move orphan files to the orphanage When we're repairing a directory structure or fixing the dotdot entry of a subdirectory, it's possible that we won't ever find a parent for the subdirectory. When this is the case, move it to the orphanage, aka /lost+found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:56 -07:00
Darrick J. Wong	cc22edab9e	xfs: online repair of parent pointers Teach the online repair code to fix parent pointers for directories. For now, this means correcting the dotdot entry of an existing directory that is otherwise consistent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:56 -07:00
Darrick J. Wong	a07b455762	xfs: scan the filesystem to repair a directory dotdot entry Teach the online directory repair code to scan the filesystem so that we can set the dotdot entry when we're rebuilding a directory. This involves dropping ILOCK on the directory that we're repairing, which means that the VFS can sneak in and tell us to update dotdot at any time. Deal with these races by using a dirent hook to absorb dotdot updates, and be careful not to check the scan results until after we've retaken the ILOCK. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:56 -07:00
Darrick J. Wong	669dfe883c	xfs: update the unlinked list when repairing link counts When we're repairing the link counts of a file, we must ensure either that the file has zero link count and is on the unlinked list; or that it has nonzero link count and is not on the unlinked list. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:55 -07:00
Darrick J. Wong	b1991ee3e7	xfs: online repair of directories If a directory looks like it's in bad shape, try to sift through the rubble to find whatever directory entries we can, scan the directory tree for the parent (if needed), stage the new directory contents in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:55 -07:00
Darrick J. Wong	8d81082a8c	xfs: inactivate directory data blocks Teach inode inactivation to delete all the incore buffers backing a directory. In normal runtime this should never happen because the VFS forbids rmdir on a non-empty directory. In the next patch, online directory repair stands up a new directory, exchanges it with the broken directory, and then drops the private temporary directory. If we cancel the repair just prior to exchanging the directory contents, the new directory will need to be torn down. Note: If we commit the repair, reaping will take care of all the ondisk space allocations and incore buffers for the old corrupt directory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:55 -07:00
Darrick J. Wong	6c631e79e7	xfs: create an xattr iteration function for scrub Create a streamlined function to walk a file's xattrs, without all the cursor management stuff in the regular listxattr. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:54 -07:00
Darrick J. Wong	e921533ef1	xfs: ensure unlinked list state is consistent with nlink during scrub Now that we have the means to tell if an inode is on an unlinked inode list or not, we can check that an inode with zero link count is on the unlinked list; and an inode that has nonzero link count is not on that list. Make repair clean things up too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:54 -07:00
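
The invariant described in e921533ef1 (and enforced during repair by 669dfe883c) can be sketched in a few lines of C. This is only an illustration of the rule stated in the commit message, not the kernel's code: `struct demo_inode`, `enum demo_verdict`, and `demo_check_unlinked` are hypothetical names invented here, and the real scrubber works on incore XFS inodes and AGI bucket lists rather than a boolean flag.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for an inode. The kernel's real
 * structures are different; only the two facts the rule needs are kept.
 */
struct demo_inode {
	uint32_t nlink;            /* link count recorded in the inode */
	bool     on_unlinked_list; /* is the inode on an unlinked list? */
};

/* Hypothetical result codes for the consistency check. */
enum demo_verdict {
	DEMO_OK,                    /* state is consistent */
	DEMO_ZERO_NLINK_NOT_LISTED, /* nlink == 0 but not on the list */
	DEMO_LINKED_BUT_LISTED      /* nlink != 0 but still on the list */
};

/*
 * Apply the rule from the commit message: an inode with zero link count
 * must be on the unlinked list, and an inode with nonzero link count
 * must not be on it.
 */
static enum demo_verdict demo_check_unlinked(const struct demo_inode *ip)
{
	if (ip->nlink == 0 && !ip->on_unlinked_list)
		return DEMO_ZERO_NLINK_NOT_LISTED;
	if (ip->nlink != 0 && ip->on_unlinked_list)
		return DEMO_LINKED_BUT_LISTED;
	return DEMO_OK;
}
```

A caller would fill in a `struct demo_inode` from whatever it observed and treat any verdict other than `DEMO_OK` as a corruption to flag or repair.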

1265856 Commits