IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 4ca8e03cf2bfaeef7c85939fa1ea0c749cd116ab upstream.
If we do fast tree logging we increment a counter on the current
transaction for every ordered extent we need to wait for. This means we
expect the transaction to still be there when we clear pending on the
ordered extent. However if we happen to abort the transaction and clean
it up, there could be no running transaction, and thus we'll trip the
"ASSERT(trans)" check. This is obviously incorrect, and the code
properly deals with the case that the transaction doesn't exist. Fix
this ASSERT() to only fire if there's no trans and we don't have
BTRFS_FS_ERROR() set on the file system.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee34a82e890a7babb5585daf1a6dd7d4d1cf142a upstream.
During the ino lookup ioctl we can end up calling btrfs_iget() to get an
inode reference while we are holding on a root's btree. If btrfs_iget()
needs to lookup the inode from the root's btree, because it's not
currently loaded in memory, then it will need to lock another or the
same path in the same root btree. This may result in a deadlock and
trigger the following lockdep splat:
WARNING: possible circular locking dependency detected
6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted
------------------------------------------------------
syz-executor277/5012 is trying to acquire lock:
ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
but task is already holding lock:
ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302
btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955
btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline]
btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338
btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline]
open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494
btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154
btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
fc_mount fs/namespace.c:1112 [inline]
vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142
btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
do_new_mount+0x28f/0xae0 fs/namespace.c:3335
do_mount fs/namespace.c:3675 [inline]
__do_sys_mount fs/namespace.c:3884 [inline]
__se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-tree-01){++++}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(btrfs-tree-00);
lock(btrfs-tree-01);
lock(btrfs-tree-00);
rlock(btrfs-tree-01);
*** DEADLOCK ***
1 lock held by syz-executor277/5012:
#0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
stack backtrace:
CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0bec94ea39
Fix this simply by releasing the path before calling btrfs_iget() as at
point we don't need the path anymore.
Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/
Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5e0e879926c1ce7e1f5e0dfaacaf2d105f7d8a05 upstream.
[BUG]
After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap"), the DEBUG section of btree_dirty_folio() would
no longer compile.
[CAUSE]
If DEBUG is defined, we would do extra checks for btree_dirty_folio(),
mostly to make sure the range we marked dirty has an extent buffer and
that extent buffer is dirty.
For subpage, we need to iterate through all the extent buffers covered
by that page range, and make sure they all matches the criteria.
However commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap") changes how we store the bitmap, we pack all the
16 bits bitmaps into a larger bitmap, which would save some space.
This means we no longer have btrfs_subpage::dirty_bitmap, instead the
dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a
length of btrfs_subpage_info::bitmap_nr_bits.
[FIX]
Although I'm not sure if it still makes sense to maintain such code, at
least let it compile.
This patch would let us test the bits one by one through the bitmaps.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6bfe3959b0e7a526f5c64747801a8613f002f05a ]
The function btrfs_validate_super() should verify the metadata_uuid in
the provided superblock argument. Because, all its callers expect it to
do that.
Such as in the following stacks:
write_all_supers()
sb = fs_info->super_for_commit;
btrfs_validate_write_super(.., sb)
btrfs_validate_super(.., sb, ..)
scrub_one_super()
btrfs_validate_super(.., sb, ..)
And
check_dev_super()
btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::metadata_uuid
instead. Fix this using the correct metadata_uuid in the superblock
argument.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4844c3664a72d36cc79752cb651c78860b14c240 ]
In some cases, we need to read the FSID from the superblock when the
metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
add a helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 6bfe3959b0e7 ("btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7f72f50547b7af4ddf985b07fc56600a4deba281 ]
[BUG]
Syzbot reported several warning triggered inside
lookup_inline_extent_backref().
[CAUSE]
As usual, the reproducer doesn't reliably trigger locally here, but at
least we know the WARN_ON() is triggered when an inline backref can not
be found, and it can only be triggered when @insert is true. (I.e.
inserting a new inline backref, which means the backref should already
exist)
[ENHANCEMENT]
After the WARN_ON(), dump all the parameters and the extent tree
leaf to help debug.
Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d167aa76dc0683828588c25767da07fb549e4f48 upstream.
The function btrfs_validate_super() should verify the fsid in the provided
superblock argument. Because, all its callers expect it to do that.
Such as in the following stack:
write_all_supers()
sb = fs_info->super_for_commit;
btrfs_validate_write_super(.., sb)
btrfs_validate_super(.., sb, ..)
scrub_one_super()
btrfs_validate_super(.., sb, ..)
And
check_dev_super()
btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::fsid instead,
which is not correct. Fix this using the correct fsid in the superblock
argument.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5b135b382a360f4c87cf8896d1465b0b07f10cb0 upstream.
Now that, we can re-enable metadata over-commit. As we moved the activation
from the reservation time to the write time, we no longer need to ensure
all the reserved bytes is properly activated.
Without the metadata over-commit, it suffers from lower performance because
it needs to flush the delalloc items more often and allocate more block
groups. Re-enabling metadata over-commit will solve the issue.
Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7f1326cc24e22b38afc3acd328480a1183f9e79 upstream.
One of the CI runs triggered the following panic
assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
------------[ cut here ]------------
kernel BUG at fs/btrfs/subpage.c:229!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : btrfs_subpage_assert+0xbc/0xf0
lr : btrfs_subpage_assert+0xbc/0xf0
sp : ffff800093213720
x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000
x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff
x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880
x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff
x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028
x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000
x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c
x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8
x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f
Call trace:
btrfs_subpage_assert+0xbc/0xf0
btrfs_subpage_set_dirty+0x38/0xa0
btrfs_page_set_dirty+0x58/0x88
relocate_one_page+0x204/0x5f0
relocate_file_extent_cluster+0x11c/0x180
relocate_data_extent+0xd0/0xf8
relocate_block_group+0x3d0/0x4e8
btrfs_relocate_block_group+0x2d8/0x490
btrfs_relocate_chunk+0x54/0x1a8
btrfs_balance+0x7f4/0x1150
btrfs_ioctl+0x10f0/0x20b8
__arm64_sys_ioctl+0x120/0x11d8
invoke_syscall.constprop.0+0x80/0xd8
do_el0_svc+0x6c/0x158
el0_svc+0x50/0x1b0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x194/0x198
Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000)
This is the same problem outlined in 17b17fcd6d44 ("btrfs:
set_page_extent_mapped after read_folio in btrfs_cont_expand") , and the
fix is the same. I originally looked for the same pattern elsewhere in
our code, but mistakenly skipped over this code because I saw the page
cache readahead before we set_page_extent_mapped, not realizing that
this was only in the !page case, that we can still end up with a
!uptodate page and then do the btrfs_read_folio further down.
The fix here is the same as the above mentioned patch, move the
set_page_extent_mapped call to after the btrfs_read_folio() block to
make sure that we have the subpage blocksize stuff setup properly before
using the page.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4490e803e1fe9fab8db5025e44e23b55df54078b upstream.
When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
running transaction we end up creating one. This goes against the purpose
of TRANS_JOIN_NOSTART which is to join a running transaction if its state
is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
-ENOENT error and don't start a new transaction. So fix this to not create
a new transaction if there's no running transaction at or below that
state.
CC: stable@vger.kernel.org # 4.14+
Fixes: a6d155d2e363 ("Btrfs: fix deadlock between fiemap and transaction commits")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e28b02118b94e42be3355458a2406c6861e2dd32 upstream.
If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.
When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a6496849671a5bc9218ecec25a983253b34351b1 upstream.
btrfs_start_transaction reserves metadata space of the PERTRANS type
before it identifies a transaction to start/join. This allows flushing
when reserving that space without a deadlock. However, it results in a
race which temporarily breaks qgroup rsv accounting.
T1 T2
start_transaction
do_stuff
start_transaction
qgroup_reserve_meta_pertrans
commit_transaction
qgroup_free_meta_all_pertrans
hit an error starting txn
goto reserve_fail
qgroup_free_meta_pertrans (already freed!)
The basic issue is that there is nothing preventing another commit from
committing before start_transaction finishes (in fact sometimes we
intentionally wait for it) so any error path that frees the reserve is
at risk of this race.
While this exact space was getting freed anyway, and it's not a huge
deal to double free it (just a warning, the free code catches this), it
can result in incorrectly freeing some other pertrans reservation in
this same reservation, which could then lead to spuriously granting
reservations we might not have the space for. Therefore, I do believe it
is worth fixing.
To fix it, use the existing prealloc->pertrans conversion mechanism.
When we first reserve the space, we reserve prealloc space and only when
we are sure we have a transaction do we convert it to pertrans. This way
any racing commits do not blow away our reservation, but we still get a
pertrans reservation that is freed when _this_ transaction gets committed.
This issue can be reproduced by running generic/269 with either qgroups
or squotas enabled via mkfs on the scratch device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 332581bde2a419d5f12a93a1cdc2856af649a3cc upstream.
When multiple writes happen at once, we may need to sacrifice a currently
active block group to be zone finished for a new allocation. We choose a
block group with the least free space left, and zone finish it.
To do the finishing, we need to send IOs for already allocated region
and wait for them and on-going IOs. Otherwise, these IOs fail because the
zone is already finished at the time the IO reach a device.
However, if a block group dedicated to the data relocation is zone
finished, there is a chance that finishing it before an ongoing write IO
reaches the device. That is because there is timing gap between an
allocation is done (block_group->reservations == 0, as pre-allocation is
done) and an ordered extent is created when the relocation IO starts.
Thus, if we finish the zone between them, we can fail the IOs.
We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
avoid the zone finishing. Because, the data_reloc_bg may already switch to
a new block group, while there are still ongoing write IOs to the old
data_reloc_bg.
So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
indicate there is a data relocation allocation and/or ongoing write to the
block group. The bit is set on allocation and cleared in end_io function of
the last IO for the currently allocated region.
To change the timing of the bit setting also solves the issue that the bit
being left even after there is no IO going on. With the current code, if
the data_reloc_bg switches after the last IO to the current data_reloc_bg,
the bit is set at this timing and there is no one clearing that bit. As a
result, that block group is kept unallocatable for anything.
Fixes: 343d8a30851c ("btrfs: zoned: prevent allocation from previous data relocation BG")
Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 29eefa6d0d07e185f7bfe9576f91e6dba98189c2 upstream.
Pausing and canceling balance can race to interrupt balance lead to BUG_ON
panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance
does not take this race scenario into account.
However, the race condition has no other side effects. We can fix that.
Reproducing it with panic trace like this:
kernel BUG at fs/btrfs/volumes.c:4618!
RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0
Call Trace:
<TASK>
? do_nanosleep+0x60/0x120
? hrtimer_nanosleep+0xb7/0x1a0
? sched_core_clone_cookie+0x70/0x70
btrfs_ioctl_balance_ctl+0x55/0x70
btrfs_ioctl+0xa46/0xd20
__x64_sys_ioctl+0x7d/0xa0
do_syscall_64+0x38/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Race scenario as follows:
> mutex_unlock(&fs_info->balance_mutex);
> --------------------
> .......issue pause and cancel req in another thread
> --------------------
> ret = __btrfs_balance(fs_info);
>
> mutex_lock(&fs_info->balance_mutex);
> if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
> btrfs_info(fs_info, "balance: paused");
> btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
> }
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c962098ca4af146f2625ed64399926a098752c9c upstream.
In production we were seeing a variety of WARN_ON()'s in the extent_map
code, specifically in btrfs_drop_extent_map_range() when we have to call
add_extent_mapping() for our second split.
Consider the following extent map layout
PINNED
[0 16K) [32K, 48K)
and then we call btrfs_drop_extent_map_range for [0, 36K), with
skip_pinned == true. The initial loop will have
start = 0
end = 36K
len = 36K
we will find the [0, 16k) extent, but since we are pinned we will skip
it, which has this code
start = em_end;
if (end != (u64)-1)
len = start + len - em_end;
em_end here is 16K, so now the values are
start = 16K
len = 16K + 36K - 16K = 36K
len should instead be 20K. This is a problem when we find the next
extent at [32K, 48K), we need to split this extent to leave [36K, 48k),
however the code for the split looks like this
split->start = start + len;
split->len = em_end - (start + len);
In this case we have
em_end = 48K
split->start = 16K + 36K // this should be 16K + 20K
split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K
and now we have an invalid extent_map in the tree that potentially
overlaps other entries in the extent map. Even in the non-overlapping
case we will have split->start set improperly, which will cause problems
with any block related calculations.
We don't actually need len in this loop, we can simply use end as our
end point, and only adjust start up when we find a pinned extent we need
to skip.
Adjust the logic to do this, which keeps us from inserting an invalid
extent map.
We only skip_pinned in the relocation case, so this is relatively rare,
except in the case where you are running relocation a lot, which can
happen with auto relocation on.
Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0657b20c5a76c938612f8409735a8830d257866e ]
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 961f5b8bf48a463ac5fe5b13143426d79eb41817 ]
In zoned mode the sequential status of zone can be also tracked in the
runtime flags of block group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d7764ff58b4b45c39eb03f2c74a819c1a88fa7b ]
We already have flags in block group to track various status bits,
convert needs_free_space as well and reduce size of btrfs_block_group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a9f189716cf15913c453299d72f69c51a9b0f86b ]
An unused block group is easy to remove to free up space and should be
reclaimed fast. Such block group can often already be a target of the
reclaim process. As we check list_empty(&bg->bg_list), we keep it in the
reclaim list. That block group is never reclaimed until the file system
is filled e.g. up to 75%.
Instead, we can move unused block group to the unused list and delete it
fast.
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 92fb94b69c6accf1e49fff699640fa0ce03dc910 upstream.
We set cache_block_group_error if btrfs_cache_block_group() returns an
error, this is because we could end up not finding space to allocate and
mistakenly return -ENOSPC, and which could then abort the transaction
with the incorrect errno, and in the case of ENOSPC result in a
WARN_ON() that will trip up tests like generic/475.
However there's the case where multiple threads can be racing, one
thread gets the proper error, and the other thread doesn't actually call
btrfs_cache_block_group(), it instead sees ->cached ==
BTRFS_CACHE_ERROR. Again the result is the same, we fail to allocate
our space and return -ENOSPC. Instead we need to set
cache_block_group_error to -EIO in this case to make sure that if we do
not make our allocation we get the appropriate error returned back to
the caller.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ebcd021c92b8e4b904552e4d87283032100796d upstream.
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its
subvolume tree.
[CAUSE]
After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume
trees, we just COW the involved tree block, no need to create a reloc
tree since those tree blocks won't be shared with other trees.
Only subvolumes tree can share tree blocks with other trees (thus they
have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted
on-disk data can trigger that ASSERT().
[FIX]
Besides the dedicated fix and the graceful exit, also let tree-checker to
check such root keys, to make sure reloc trees can only exist for subvolumes.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 05d7ce504545f7874529701664c90814ca645c5d upstream.
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
[CAUSE]
The root cause of the triggered ASSERT() is we can have a race between
quota tree creation and relocation.
This leads us to create a duplicated quota tree in the
btrfs_read_fs_root() path, and since it's treated as fs tree, it would
have ROOT_SHAREABLE flag, causing us to create a reloc tree for it.
The bug itself is fixed by a dedicated patch for it, but this already
taught us the ASSERT() is not something straightforward for
developers.
[ENHANCEMENT]
Instead of using an ASSERT(), let's handle it gracefully and output
extra info about the mismatch reloc roots to help debug.
Also with the above ASSERT() removed, we can trigger ASSERT(0)s inside
merge_reloc_roots() later.
Also replace those ASSERT(0)s with WARN_ON()s.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12b2d64e591652a2d97dd3afa2b062ca7a4ba352 upstream.
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false. The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end. But we've
already added cur_alloc_size to start before the jump, so there might no
range be left from the newly incremented start to end. Move the check for
'start < end' so that it is reached by also for the !extent_reserved case.
CC: stable@vger.kernel.org # 6.1+
Fixes: a315e68f6e8b ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit effa24f689ce0948f68c754991a445a8d697d3a8 upstream.
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").
Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc1f91b9231a28fba333f931a031bf776bc6ef0e upstream.
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d8ccbd21918fd7fa6ce3226cffc22c444228e8ad upstream.
At add_new_free_space() we have these BUG_ON()'s that are there to deal
with any failure to add free space to the in memory free space cache.
Such failures are mostly -ENOMEM that should be very rare. However there's
no need to have these BUG_ON()'s, we can just return any error to the
caller and all callers and their upper call chain are already dealing with
errors.
So just make add_new_free_space() return any errors, while removing the
BUG_ON()'s, and returning the total amount of added free space to an
optional u64 pointer argument.
Reported-by: syzbot+3ba856e07b7127889d8c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000e9cb8305ff4e8327@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b28ff3a7d7e97456fd86b68d24caa32e1cfa7064 upstream.
btrfs_attach_transaction_barrier() is used to get a handle pointing to the
current running transaction if the transaction has not started its commit
yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit
has started, then we wait for the transaction to commit and finish before
returning - however we completely ignore if the transaction was aborted
due to some error during its commit, we simply return ERR_PT(-ENOENT),
which makes the caller assume everything is fine and no errors happened.
This could make an fsync return success (0) to user space when in fact we
had a transaction abort and the target inode changes were therefore not
persisted.
Fix this by checking for the return value from btrfs_wait_for_commit(),
and if it returned an error, return it back to the caller.
Fixes: d4edf39bd5db ("Btrfs: fix uncompleted transaction")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bf7ecbe9875061bf3fce1883e3b26b77f847d1e8 upstream.
At btrfs_wait_for_commit() we wait for a transaction to finish and then
always return 0 (success) without checking if it was aborted, in which
case the transaction didn't happen due to some critical error. Fix this
by checking if the transaction was aborted.
Fixes: 462045928bda ("Btrfs: add START_SYNC, WAIT_SYNC ioctls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8dbfc14fc736eb701089aff09645c3d4ad3decb1 upstream.
When using the block group tree feature, this tree is a critical tree just
like the extent, csum and free space trees, and just like them it uses the
delayed refs block reserve.
So take into account the block group tree, and its current size, when
calculating the size for the global reserve.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8a4a0b2a3eaf75ca8854f856ef29690c12b2f531 ]
If we disable quotas while we have a relocation of a metadata block group
that has extents belonging to the quota root, we can cause the relocation
to fail with -ENOENT. This is because relocation builds backref nodes for
extents of the quota root and later needs to walk the backrefs and access
the quota root - however if in between a task disables quotas, it results
in deleting the quota root from the root tree (with btrfs_del_root(),
called from btrfs_quota_disable().
This can be sporadically triggered by test case btrfs/255 from fstests:
$ ./check btrfs/255
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
# --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000
# +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100
# @@ -1,2 +1,4 @@
# QA output created by 255
# +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
# +There may be more info in syslog - try dmesg | tail
# Silence is golden
# ...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff)
Ran: btrfs/255
Failures: btrfs/255
Failed 1 of 1 tests
To fix this make the quota disable operation take the cleaner mutex, as
relocation of a block group also takes this mutex. This is also what we
do when deleting a subvolume/snapshot, we take the cleaner mutex in the
cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
at btrfs_drop_snapshot() while under the protection of the cleaner mutex.
Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4e7de35eb7d1a1d4f2dda15f39fbedd4798a0b8d ]
The mirror_num_ret is allowed to be NULL, although it has to be set when
smap is set. Unfortunately that is not a well enough specifiable
invariant for static type checkers, so add a NULL check to make sure they
are fine.
Fixes: 03793cbbc80f ("btrfs: add fast path for single device io in __btrfs_map_block")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit b19c98f237cd76981aaded52c258ce93f7daa8cb upstream.
Syzbot reported a panic that looks like this:
assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.c:259!
RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259
Call Trace:
<TASK>
btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline]
btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline]
btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The reproducer is running a balance and a cancel or pause in parallel.
The way balance finishes is a bit wonky, if we were paused we need to
save the balance_ctl in the fs_info, but clear it otherwise and cleanup.
However we rely on the return values being specific errors, or having a
cancel request or no pause request. If balance completes and returns 0,
but we have a pause or cancel request we won't do the appropriate
cleanup, and then the next time we try to start a balance we'll trip
this ASSERT.
The error handling is just wrong here, we always want to clean up,
unless we got -ECANCELLED and we set the appropriate pause flag in the
exclusive op. With this patch the reproducer ran for an hour without
tripping, previously it would trip in less than a few minutes.
Reported-by: syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f1a07c2b4e2c473ec322b8b9ece071b8c88a3512 upstream.
At exclude_super_stripes(), if we happen to find a block group that has
super blocks mapped to it and we are on a zoned filesystem, we error out
as this is not supposed to happen, indicating either a bug or maybe some
memory corruption for example. However we are exiting the function without
freeing the memory allocated for the logical address of the super blocks.
Fix this by freeing the logical address.
Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 17b17fcd6d446b95904a6929c40012ee7f0afc0c upstream.
While trying to get the subpage blocksize tests running, I hit the
following panic on generic/476
assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
kernel BUG at fs/btrfs/subpage.c:229!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : btrfs_subpage_assert+0xbc/0xf0
lr : btrfs_subpage_assert+0xbc/0xf0
Call trace:
btrfs_subpage_assert+0xbc/0xf0
btrfs_subpage_clear_checked+0x38/0xc0
btrfs_page_clear_checked+0x48/0x98
btrfs_truncate_block+0x5d0/0x6a8
btrfs_cont_expand+0x5c/0x528
btrfs_write_check.isra.0+0xf8/0x150
btrfs_buffered_write+0xb4/0x760
btrfs_do_write_iter+0x2f8/0x4b0
btrfs_file_write_iter+0x1c/0x30
do_iter_readv_writev+0xc8/0x158
do_iter_write+0x9c/0x210
vfs_iter_write+0x24/0x40
iter_file_splice_write+0x224/0x390
direct_splice_actor+0x38/0x68
splice_direct_to_actor+0x12c/0x260
do_splice_direct+0x90/0xe8
generic_copy_file_range+0x50/0x90
vfs_copy_file_range+0x29c/0x470
__arm64_sys_copy_file_range+0xcc/0x498
invoke_syscall.constprop.0+0x80/0xd8
do_el0_svc+0x6c/0x168
el0_svc+0x50/0x1b0
el0t_64_sync_handler+0x114/0x120
el0t_64_sync+0x194/0x198
This happens because during btrfs_cont_expand we'll get a page, set it
as mapped, and if it's not Uptodate we'll read it. However between the
read and re-locking the page we could have called release_folio() on the
page, but left the page in the file mapping. release_folio() can clear
the page private, and thus further down we blow up when we go to modify
the subpage bits.
Fix this by putting the set_page_extent_mapped() after the read. This
is safe because read_folio() will call set_page_extent_mapped() before
it does the read, and then if we clear page private but leave it on the
mapping we're completely safe re-setting set_page_extent_mapped(). With
this patch I can now run generic/476 without panicing.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 40b0a749388517de244643c09bdbb98f7dcb6ef1 upstream.
At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to
record a tree mod log root insertion operation, do a transaction abort
instead. There's really no need for the BUG_ON(), we can properly
release all resources in this context and turn the filesystem to RO mode
and in an error state instead.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ede600e497b1461d06d22a7d17703d9096868bc3 upstream.
At split_node(), if we fail to log the tree mod log copy operation, we
return without unlocking the split extent buffer we just allocated and
without decrementing the reference we own on it. Fix this by unlocking
it and decrementing the ref count before returning.
Fixes: 5de865eebb83 ("Btrfs: fix tree mod logging")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e27180994383b7c741ad87749db01e4989a02ba upstream.
The reclaim process can temporarily fail. For example, if the space is
getting tight, it fails to make the block group read-only. If there are no
further writes on that block group, the block group will never get back to
the reclaim list, and the BG never gets reclaimed. In a certain workload,
we can leave many such block groups never reclaimed.
So, let's get it back to the list and give it a chance to be reclaimed.
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1a1b0e729d227f9f758f7b5f1c997e874e94156e upstream.
The block group tree was not present among the lockdep classes. We could
get potentially lockdep warnings but so far none has been seen, also
because block-group-tree is a relatively new feature.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 93463ff7b54626f8276c0bd3d3f968fbf8d5d380 upstream.
When a filesystem is read-only, we cannot reclaim a block group as it
cannot rewrite the data. Just bail out in that case.
Note that it can drop block groups in this case. As we did
sb_start_write(), read-only filesystem means we got a fatal error and
forced read-only. There is no chance to reclaim them again.
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3ed01616bad6c7e3de196676b542ae3df8058592 upstream.
The reclaiming process only starts after the filesystem volumes are
allocated to a certain level (75% by default). Thus, the list of
reclaiming target block groups can build up so huge at the time the
reclaim process kicks in. On a test run, there were over 1000 BGs in the
reclaim list.
As the reclaim involves rewriting the data, it takes really long time to
reclaim the BGs. While the reclaim is running, btrfs_delete_unused_bgs()
won't proceed because the reclaim side is holding
fs_info->reclaim_bgs_lock. As a result, we will have a large number of
unused BGs kept in the unused list. On my test run, I got 1057 unused BGs.
Since deleting a block group is relatively easy and fast work, we can call
btrfs_delete_unused_bgs() while it reclaims BGs, to avoid building up
unused BGs.
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 160fe8f6fdb13da6111677be6263e5d65e875987 upstream.
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.
`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.
This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.
Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.
This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.
In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).
[770.648] ------------[ cut here ]------------
[770.648] BTRFS: Transaction aborted (error -22)
[770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
[770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
[770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
[770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
[770.648] CFAR: c000000000135110 IRQMASK: 0
GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
[770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
[770.648] Call Trace:
[770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
[770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
[770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
[770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
[770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
[770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
[770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
[770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
[770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
[770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
[770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
[770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
[770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
[770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
[770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
[770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
[770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
[770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
[770.648] --- interrupt: c00 at 0x7fff94126800
[770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
[770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
[770.648] IRQMASK: 0
GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
[770.648] NIP [00007fff94126800] 0x7fff94126800
[770.648] LR [0000000107e0b594] 0x107e0b594
[770.648] --- interrupt: c00
[770.648] Instruction dump:
[770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
[770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
[770.648] ---[ end trace 0000000000000000 ]---
[770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
[770.648] BTRFS info (device dm-2: state EA): forced readonly
[770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
[770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
[770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
[770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 39020d8abc7ec62c4de9b260e3d10d4a1c2478ce ]
At balance_level(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release
all resources in this context, and we have to abort because other future
tree searches that use the tree mod log (btrfs_search_old_slot()) may get
inconsistent results if other operations modify the tree after that
failure and before the tree mod log based search.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8fd9f4232d8152c650fd15127f533a0f6d0a4b2b ]
This fixes the following warning reported by gcc 10.2.1 under x86_64:
../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’:
../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
6211 | ret = insert_dir_log_key(trans, log, path, key.objectid,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6212 | first_dir_index, last_dir_index);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here
6161 | u64 last_range_start;
| ^~~~~~~~~~~~~~~~
This might be a false positive fixed in later compiler versions but we
want to have it fixed.
Reported-by: k2ci <kernel-bot@kylinos.cn>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit deccae40e4b30f98837e44225194d80c8baf2233 upstream.
Commit 619104ba453ad0 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter. We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.
With strict always false, this test fails:
btrfs subvol create swappy
chattr +C swappy
fallocate -l1G swappy/swapfile
chmod 600 swappy/swapfile
mkswap swappy/swapfile
btrfs subvol snap swappy swapsnap
btrfs subvol del -C swapsnap
btrfs fi sync /
sync;sync;sync
swapon swappy/swapfile
The fix is to just use args->strict, and everyone except swapfile
activation is passing false.
Fixes: 619104ba453ad0 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7833b865953c8e62abc76a3261c04132b2fb69de upstream.
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.
This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e51784a ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.
Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e51784a ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 745806fb4554f334e6406fa82b328562aa48f08f upstream.
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.
The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0
BTRFS error (device loop0: state C): failed to load root csum
assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3664!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
load_global_roots fs/btrfs/disk-io.c:2501 [inline]
btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
legacy_get_tree+0xea/0x180 fs/fs_context.c:610
vfs_get_tree+0x88/0x270 fs/super.c:1530
fc_mount fs/namespace.c:1043 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884
[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.
So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.
And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().
But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup roots slot remaining.
So in that case we can have conflicting roots in the rb tree, and
triggering the ASSERT() crash.
[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.
To make further debugging easier, also add two explicit error messages:
- Error message for conflicting global roots
- Error message when using backup roots slot
Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>