linux

iv/linux

Author	SHA1	Message	Date
Matthew Wilcox (Oracle)	9872f4de14	ceph: Convert from invalidatepage to invalidate_folio Mostly a straightforward conversion. Delete the pointer from the debugging output as this has no value. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	895586eb68	btrfs: Convert from invalidatepage to invalidate_folio A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	fcf227daed	afs: Convert invalidatepage to invalidate_folio We know the page is in the page cache, not the swap cache. If we ever support folios larger than 2GB, afs_invalidate_dirty() will need to be fixed, but that's a larger project. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	f6bc6fb88c	afs: Convert directory aops to invalidate_folio Use folio->index instead of folio_index() because there's no way we're writing a page from the swapcache to a directory. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	040cdd4bf9	9p: Convert to invalidate_folio This is a trivial conversion. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	5660a8630d	fs: Remove noop_invalidatepage() We used to have to use noop_invalidatepage() to prevent block_invalidatepage() from being called, but that behaviour is now gone. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	7ba13abbd3	fs: Turn block_invalidatepage into block_invalidate_folio Remove special-casing of a NULL invalidatepage, since there is no more block_invalidatepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	d82354f6b0	iomap: Remove iomap_invalidatepage() Use iomap_invalidate_folio() in all the iomap-based filesystems and rename the iomap_invalidatepage tracepoint. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	020df9baea	ext4: Use folio_invalidate() Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	a628304ebe	ceph: Use folio_invalidate() Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	8e1dec8eb8	btrfs: Use folio_invalidate() Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)	2e7e80f7e7	fs: Convert is_partially_uptodate to folios Since the uptodate property is maintained on a per-folio basis, the is_partially_uptodate method should also take a folio. Fix the types at the same time so it's clear that it returns true/false and takes the count in bytes, not blocks. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-14 15:43:17 -04:00
Matthew Wilcox (Oracle)	4495a96c4c	fs/remap_range: Pass the file pointer to read_mapping_folio() We have the struct file in generic_remap_file_range_prep() already; we just need to pass it around instead of the inode. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-14 15:43:16 -04:00
Matthew Wilcox (Oracle)	1241ebeca3	iomap: Fix iomap_invalidatepage tracepoint This tracepoint is defined to take an offset in the file, not an offset in the folio. Fixes: 1ac994525b9d ("iomap: Remove pgoff from tracepoints") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs	2022-03-14 15:43:16 -04:00
Darrick J. Wong	744e6c8ada	xfs: constify xfs_name_dotdot The symbol xfs_name_dotdot is a global variable that the xfs codebase uses here and there to look up directory dotdot entries. Currently it's a non-const variable, which means that it's a mutable global variable. So far nobody's abused this to cause problems, but let's use the compiler to enforce that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2022-03-14 10:23:17 -07:00
Darrick J. Wong	996b2329b2	xfs: constify the name argument to various directory functions Various directory functions do not modify their @name parameter, so mark it const to make that clear. This will enable us to mark the global xfs_name_dotdot variable as const to prevent mischief. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2022-03-14 10:23:17 -07:00
Darrick J. Wong	41667260bc	xfs: reserve quota for target dir expansion when renaming files XFS does not reserve quota for directory expansion when renaming children into a directory. This means that we don't reject the expansion with EDQUOT when we're at or near a hard limit, which means that unprivileged userspace can use rename() to exceed quota. Rename operations don't always expand the target directory, and we allow a rename to proceed with no space reservation if we don't need to add a block to the target directory to handle the addition. Moreover, the unlink operation on the source directory generally does not expand the directory (you'd have to free a block and then cause a btree split) and it's probably of little consequence to leave the corner case that renaming a file out of a directory can increase its size. As with link and unlink, there is a further bug in that we do not trigger the blockgc workers to try to clear space when we're out of quota. Because rename is its own special tricky animal, we'll patch xfs_rename directly to reserve quota to the rename transaction. We'll leave cleaning up the rest of xfs_rename for the metadata directory tree patchset. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2022-03-14 10:23:17 -07:00
Darrick J. Wong	871b9316e7	xfs: reserve quota for dir expansion when linking/unlinking files XFS does not reserve quota for directory expansion when linking or unlinking children from a directory. This means that we don't reject the expansion with EDQUOT when we're at or near a hard limit, which means that unprivileged userspace can use link()/unlink() to exceed quota. The fix for this is nuanced -- link operations don't always expand the directory, and we allow a link to proceed with no space reservation if we don't need to add a block to the directory to handle the addition. Unlink operations generally do not expand the directory (you'd have to free a block and then cause a btree split) and we can defer the directory block freeing if there is no space reservation. Moreover, there is a further bug in that we do not trigger the blockgc workers to try to clear space when we're out of quota. To fix both cases, create a new xfs_trans_alloc_dir function that allocates the transaction, locks and joins the inodes, and reserves quota for the directory. If there isn't sufficient space or quota, we'll switch the caller to reservationless mode. This should prevent quota usage overruns with the least restriction in functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2022-03-14 10:23:17 -07:00
Darrick J. Wong	dd3b015dd8	xfs: refactor user/group quota chown in xfs_setattr_nonsize Combine if tests to reduce the indentation levels of the quota chown calls in xfs_setattr_nonsize. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org>	2022-03-14 10:23:17 -07:00
Darrick J. Wong	e014f37db1	xfs: use setattr_copy to set vfs inode attributes Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid revocation isn't consistent with btrfs[1] or ext4. Those two filesystems use the VFS function setattr_copy to convey certain attributes from struct iattr into the VFS inode structure. Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to decide if it should clear setgid and setuid on a file attribute update. This is a second symptom of the problem that Filipe noticed. XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode, xfs_setattr_nonsize, and xfs_setattr_time. Regrettably, setattr_copy is /not/ a simple copy function; it contains additional logic to clear the setgid bit when setting the mode, and XFS' version no longer matches. The VFS implements its own setuid/setgid stripping logic, which establishes consistent behavior. It's a tad unfortunate that it's scattered across notify_change, should_remove_suid, and setattr_copy but XFS should really follow the Linux VFS. Adapt XFS to use the VFS functions and get rid of the old functions. [1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/ [2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/ Fixes: 7fa294c8991c ("userns: Allow chown and setgid preservation") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org>	2022-03-14 10:23:16 -07:00
Nikolay Borisov	d3e2996707	btrfs: zoned: put block group after final usage It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:54 +01:00
Dongliang Mu	79c9234ba5	btrfs: don't access possibly stale fs_info data in device_list_add Syzbot reported a possible use-after-free in printing information in device_list_add. Very similar with the bug fixed by commit 0697d9a61099 ("btrfs: don't access possibly stale fs_info data for printing duplicate device"), but this time the use occurs in btrfs_info_in_rcu. Call Trace: kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 btrfs_printk+0x395/0x425 fs/btrfs/super.c:244 device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957 btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387 btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by modifying device->fs_info to NULL too. Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:54 +01:00
Niels Dossche	bf7bd725b0	btrfs: add lockdep_assert_held to need_preemptive_reclaim In a previous patch ("btrfs: extend locking to all space_info members accesses") the locking for the space_info members was extended in btrfs_preempt_reclaim_metadata_space because not all the member accesses that needed locks were actually locked (bytes_pinned et al). It was then suggested to also add a call to lockdep_assert_held to need_preemptive_reclaim. This function also works with space_info members. As of now, it has only two call sites which both hold the lock. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Qu Wenruo	3777369ff1	btrfs: verify the tranisd of the to-be-written dirty extent buffer [BUG] There is a bug report that a bitflip in the transid part of an extent buffer makes btrfs to reject certain tree blocks: BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22 [CAUSE] Note the failed transid check, hex(262166) = 0x40016, while hex(22) = 0x16. It's an obvious bitflip. Furthermore, the reporter also confirmed the bitflip is from the hardware, so it's a real hardware caused bitflip, and such problem can not be detected by the existing tree-checker framework. As tree-checker can only verify the content inside one tree block, while generation of a tree block can only be verified against its parent. So such problem remain undetected. [FIX] Although tree-checker can not verify it at write-time, we still have a quick (but not the most accurate) way to catch such obvious corruption. Function csum_one_extent_buffer() is called before we submit metadata write. Thus it means, all the extent buffer passed in should be dirty tree blocks, and should be newer than last committed transaction. Using that we can catch the above bitflip. Although it's not a perfect solution, as if the corrupted generation is higher than the correct value, we have no way to catch it at all. Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/ CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@sus,ree.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Qu Wenruo	9a4ffa1bd6	btrfs: unify the error handling of btrfs_read_buffer() There is one oddball error handling of btrfs_read_buffer(): ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key); if (!ret) { *eb_ret = tmp; return 0; } free_extent_buffer(tmp); btrfs_release_path(p); return -EIO; While all other call sites check the error first. Unify the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Qu Wenruo	4eb150d612	btrfs: unify the error handling pattern for read_tree_block() We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. / } else if (!extent_buffer_uptodate(eb)) { / * Different error handling here * Normally also ended up with return or goto out; / } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { / * Handling error here / return eb or goto out; } if (!extent_buffer_uptodate(eb)) { / * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	8f8aa4c7a9	btrfs: factor out do_free_extent_accounting helper __btrfs_free_extent() does all of the hard work of updating the extent ref items, and then at the end if we dropped the extent completely it does the cleanup accounting work. We're going to only want to do that work for metadata with extent tree v2, so extract this bit into its own helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	5b2a54bb7c	btrfs: remove last_ref from the extent freeing code This is a remnant of the work I did for qgroups a long time ago to only run for a block when we had dropped the last ref. We haven't done that for years, but the code remains. Drop this remnant. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	3466670558	btrfs: add a alloc_reserved_extent helper We duplicate this logic for both data and metadata, at this point we've already done our type specific extent root operations, this is just doing the accounting and removing the space from the free space tree. Extract this common logic out into a helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	b3c958a369	btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block Switch this to an ASSERT() and return the error in the normal case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Filipe Manana	313ab75399	btrfs: add and use helper for unlinking inode during log replay During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Niels Dossche	06bae87663	btrfs: extend locking to all space_info members accesses bytes_pinned is always accessed under space_info->lock, except in btrfs_preempt_reclaim_metadata_space, however the other members are accessed under that lock. The reserved member of the rsv's are also partially accessed under a lock and partially not. Move all these accesses into the same lock to ensure consistency. This could potentially race and lead to a flush instead of a commit but it's not a big problem as it's only for preemptive flush. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Naohiro Aota	ca5e4ea0be	btrfs: zoned: mark relocation as writing There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	9f5710bbfd	fs: allow cross-vfsmount reflink/dedupe Currently we disallow reflink and dedupe if the two files aren't on the same vfsmount. However we really only need to disallow it if they're not on the same super block. It is very common for btrfs to have a main subvolume that is mounted and then different subvolumes mounted at different locations. It's allowed to reflink between these volumes, but the vfsmount check disallows this. Instead fix dedupe to check for the same superblock, and simply remove the vfsmount check for reflink as it already does the superblock check. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:53 +01:00
Josef Bacik	ae460f058e	btrfs: remove the cross file system checks from remap The sb check is already done in do_clone_file_range, and the mnt check (which will hopefully go away in a subsequent patch) is done in ioctl_file_clone(). Remove the check in our code and put an ASSERT() to make sure it doesn't change underneath us. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Josef Bacik	7eefae6bb1	btrfs: pass btrfs_fs_info to btrfs_recover_relocation We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Josef Bacik	33c4418499	btrfs: pass btrfs_fs_info for deleting snapshots and cleaner We're passing a root around here, but we only really need the fs_info, so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead, and then fix up all the callers appropriately. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Sweet Tea Dorminy	c067da8781	btrfs: add filesystems state details to error messages When a filesystem goes read-only due to an error, multiple errors tend to be reported, some of which are knock-on failures. Logging fs_states, in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the first error from subsequent messages which may only exist due to an error state. Under the new format, most initial errors will look like: `BTRFS: error (device loop0) in ...` while subsequent errors will begin with: `error (device loop0: state E) in ...` An initial transaction abort error will look like `error (device loop0: state A) in ...` and subsequent messages will contain `(device loop0: state EA) in ...` In addition to the error states we can also print other states that are temporary, like remounting, device replace, or indicate a global state that may affect functionality. Now implemented: E - filesystem error detected A - transaction aborted L - log tree errors M - remounting in progress R - device replace in progress C - data checksums not verified (mounted with ignoredatacsums) Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	b2d9f2dc01	btrfs: deal with unexpected extent type during reflinking Smatch complains about a possible dereference of a pointer that was not initialized: CC [M] fs/btrfs/reflink.o CHECK fs/btrfs/reflink.c fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'. This is because we are not dealing with the case where the type of a file extent has an unexpected value (not regular, not prealloc and not inline), in which case the transaction handle pointer is not initialized. Such unexpected type should be impossible, except in case of some memory corruption caused either by bad hardware or some software bug causing something like a buffer overrun. So ASSERT that if the extent type is neither regular nor prealloc, then it must be inline. Bail out with -EUCLEAN and a warning in case it is not. This silences smatch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	1f4613cdbe	btrfs: fix unexpected error path when reflinking an inline extent When reflinking an inline extent, we assert that its file offset is 0 and that its uncompressed length is not greater than the sector size. We then return an error if one of those conditions is not satisfied. However we use a return statement, which results in returning from btrfs_clone() without freeing the path and buffer that were allocated before, as well as not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination inode. Fix that by jumping to the 'out' label instead, and also add a WARN_ON() for each condition so that in case assertions are disabled, we get to known which of the unexpected conditions triggered the error. Fixes: a61e1e0df9f321 ("Btrfs: simplify inline extent handling when doing reflinks") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	23e3337faf	btrfs: reset last_reflink_trans after fsyncing inode When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit 40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree") and commit e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")). We also need to make sure a full fsync will copy all old file extent items it finds in modified leaves, because they might have been copied from some other inode. However once we fsync an inode, we don't need to keep paying the price of that extra special care in future fsyncs done in the same transaction, unless the inode is used for another reflink operation or the full sync flag is set on it (truncate, failure to allocate extent maps for holes, and other exceptional and infrequent cases). So after we fsync an inode reset its last_unlink_trans to zero. In case another reflink happens, we continue to update the last_reflink_trans of the inode, just as before. Also set last_reflink_trans to the generation of the last transaction that modified the inode whenever we need to set the full sync flag on the inode, just like when we need to load an inode from disk after eviction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	96acb3753e	btrfs: voluntarily relinquish cpu when doing a full fsync Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	5b7ce5e287	btrfs: hold on to less memory when logging checksums during full fsync When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Filipe Manana	7f30c07288	btrfs: stop copying old file extents when doing a full fsync When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit ea7036de0d36c4 ("btrfs: fix fsync failure and transaction abort after writes to prealloc extents")). In that particular case, we had large files that had random writes and were often truncated, which made the next fsync be a full sync. $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G # FILE_SIZE=$((512 * 1024 * 1024)) # 512M mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT # Create a file with many extents. Use direct IO to make it faster # to create the file - using buffered IO we would have to fsync # after each write (terribly slow). echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..." xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar # Commit the transaction, so every extent after this is from an # old generation. sync # Now rewrite only a few extents, which are all far spread apart from # each other (e.g. 1G / 32M = 32 extents). # After this only a few extents have a new generation, while all other # ones have an old generation. echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..." for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null done # Fsync, the inode logged in full sync mode since it was never fsynced # before. echo "Fsyncing file..." xfs_io -c "fsync" $MNT/foobar umount $MNT And the following bpftrace program was running when executing the test script: $ cat bpf-script.sh #!/usr/bin/bpftrace k:btrfs_log_inode { @start_log_inode[tid] = nsecs; } kr:btrfs_log_inode /@start_log_inode[tid]/ { @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000; delete(@start_log_inode[tid]); } k:btrfs_sync_log { @start_sync_log[tid] = nsecs; } kr:btrfs_sync_log /@start_sync_log[tid]/ { $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000; printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]); printf("btrfs_sync_log() took %llu us\n", $sync_log_dur); delete(@start_sync_log[tid]); delete(@log_inode_dur[tid]); exit(); } With 512M test file, before this patch: btrfs_log_inode() took 15218 us btrfs_sync_log() took 1328 us Log tree has 17 leaves and 1 node, its total size is 294912 bytes. With 512M test file, after this patch: btrfs_log_inode() took 14760 us btrfs_sync_log() took 588 us Log tree has a single leaf, its total size is 16K. With 1G test file, before this patch: btrfs_log_inode() took 27301 us btrfs_sync_log() took 1767 us Log tree has 33 leaves and 1 node, its total size is 557056 bytes. With 1G test file, after this patch: btrfs_log_inode() took 26166 us btrfs_sync_log() took 593 us Log tree has a single leaf, its total size is 16K With 2G test file, before this patch: btrfs_log_inode() took 50892 us btrfs_sync_log() took 3127 us Log tree has 65 leaves and 1 node, its total size is 1081344 bytes. With 2G test file, after this patch: btrfs_log_inode() took 50126 us btrfs_sync_log() took 586 us Log tree has a single leaf, its total size is 16K. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Josef Bacik	8cbc3001a3	btrfs: do not clean up repair bio if submit fails The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Josef Bacik	510671d2d8	btrfs: do not try to repair bio that has no mirror set If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:52 +01:00
Josef Bacik	f9f15de85d	btrfs: do not double complete bio on errors during compressed reads I hit some weird panics while fixing up the error handling from btrfs_lookup_bio_sums(). Turns out the compression path will complete the bio we use if we set up any of the compression bios and then return an error, and then btrfs_submit_data_bio() will also call bio_endio() on the bio. Fix this by making btrfs_submit_compressed_read() responsible for calling bio_endio() on the bio if there are any errors. Currently it was only doing it if we created the compression bios, otherwise it was depending on btrfs_submit_data_bio() to do the right thing. This creates the above problem, so fix up btrfs_submit_compressed_read() to always call bio_endio() in case of an error, and then simply return from btrfs_submit_data_bio() if we had to call btrfs_submit_compressed_read(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00
Josef Bacik	606f82e797	btrfs: track compressed bio errors as blk_status_t Right now we just have a binary "errors" flag, so any error we get on the compressed bio's gets translated to EIO. This isn't necessarily a bad thing, but if we get an ENOMEM it may be nice to know that's what happened instead of an EIO. Track our errors as a blk_status_t, and do the appropriate setting of the errors accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00
Josef Bacik	e14bfdb5a1	btrfs: remove the bio argument from finish_compressed_bio_read This bio is usually one of the compressed bio's, and we don't actually need it in this function, so remove the argument and stop passing it around. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00
Josef Bacik	b0bbc8a3d4	btrfs: check correct bio in finish_compressed_bio_read Commit c09abff87f90 ("btrfs: cloned bios must not be iterated by bio_for_each_segment_all") added ASSERT()'s to make sure we weren't calling bio_for_each_segment_all() on a RAID5/6 bio. However it was checking the bio that the compression code passed in, not the cb->orig_bio that we actually iterate over, so adjust this ASSERT() to check the correct bio. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00

... 2 3 4 5 6 ...

75304 Commits