IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.
When doing an incremental send, if we have a new inode that happens to
have the same number that an old directory inode had in the base snapshot
and that old directory has a pending rmdir operation, we end up computing
a wrong path for the new inode, causing the receiver to fail.
Example reproducer:
$ cat test-send-rmdir.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
mkdir $MNT/dir
touch $MNT/dir/file1
touch $MNT/dir/file2
touch $MNT/dir/file3
# Filesystem looks like:
#
# . (ino 256)
# |----- dir/ (ino 257)
# |----- file1 (ino 258)
# |----- file2 (ino 259)
# |----- file3 (ino 260)
#
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Now remove our directory and all its files.
rm -fr $MNT/dir
# Unmount the filesystem and mount it again. This is to ensure that
# the next inode that is created ends up with the same inode number
# that our directory "dir" had, 257, which is the first free "objectid"
# available after mounting again the filesystem.
umount $MNT
mount $DEV $MNT
# Now create a new file (it could be a directory as well).
touch $MNT/newfile
# Filesystem now looks like:
#
# . (ino 256)
# |----- newfile (ino 257)
#
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
# Now unmount the filesystem, create a new one, mount it and try to apply
# both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT
umount $MNT
When running the test, the receive operation for the incremental stream
fails:
$ ./test-send-rmdir.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: chown o257-9-0 failed: No such file or directory
So fix this by tracking directories that have a pending rmdir by inode
number and generation number, instead of only inode number.
A test case for fstests follows soon.
Reported-by: Massimo B. <massimo.b@gmx.net>
Tested-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0697d9a610998b8bdee6b2390836cb2391d8fd1a upstream.
Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().
At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.
==================================================================
BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1d6/0x29e lib/dump_stack.c:118
print_address_description+0x66/0x620 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report+0x132/0x1d0 mm/kasan/report.c:530
btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44840a
RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
Allocated by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
kmalloc_node include/linux/slab.h:577 [inline]
kvmalloc_node+0x81/0x110 mm/util.c:574
kvmalloc include/linux/mm.h:757 [inline]
kvzalloc include/linux/mm.h:765 [inline]
btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x113/0x200 mm/slab.c:3756
deactivate_locked_super+0xa7/0xf0 fs/super.c:335
btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff8880878e0000
which belongs to the cache kmalloc-16k of size 16384
The buggy address is located 1704 bytes inside of
16384-byte region [ffff8880878e0000, ffff8880878e4000)
The buggy address belongs to the page:
page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.
Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.
There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a1fbc6750e212c5675a4e48d7f51d44607eb8756 upstream.
On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.
CC: stable@vger.kernel.org # 4.4+
Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ define the variable instead of repeating the shift ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cf89af146b7e62af55470cf5f3ec3c56ec144a5e upstream.
If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image. Fail the mount as this needs a closer
look what is actually wrong.
As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:
WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
panic+0x347/0x7c0 kernel/panic.c:231
__warn.cold+0x20/0x46 kernel/panic.c:600
report_bug+0x1bd/0x210 lib/bug.c:198
handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
btrfs_fill_super fs/btrfs/super.c:1316 [inline]
btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 468600c6ec28613b756193c5f780aac062f1acdf upstream.
There is one error handling path that does not free ref, which may cause
a minor memory leak.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0607eb1d452d45c5ac4c745a9e9e0d95152ea9d0 ]
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f96d6960abbc52e26ad124e69e6815283d3e1674 upstream.
The error message for inode transid is the same as for inode generation,
which makes us unable to detect the real problem.
Reported-by: Tyler Richmond <t.d.richmond@gmail.com>
Fixes: 496245cac57e ("btrfs: tree-checker: Verify inode item")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.19: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 496245cac57e26d8b738d85c7a29cf9a47610f3f upstream.
There is a report in kernel bugzilla about mismatch file type in dir
item and inode item.
This inspires us to check inode mode in inode item.
This patch will check the following members:
- inode key objectid
Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
- inode key offset
Should be 0
- inode item generation
- inode item transid
No newer than sb generation + 1.
The +1 is for log tree.
- inode item mode
No unknown bits.
No invalid S_IF* bit.
NOTE: S_IFMT check is not enough, need to check every know type.
- inode item nlink
Dir should have no more link than 1.
- inode item flags
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 80e46cf22ba0bcb57b39c7c3b52961ab3a0fd5f2 upstream.
Btrfs-progs already have a comprehensive type checker, to ensure there
is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk
profile bits.
Do the same work for kernel.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8bb177d18f114358a57d8ae7e206861b48b8b4de upstream.
[BUG]
The following script will cause false alert on devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
echo "!!! FAILED !!!"
exit 1
}
for ((i = 0; i < 4096; i++)); do
btrfs dev add -f $dev2 $mnt || _fail
btrfs dev del $dev1 $mnt || _fail
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid. But we can have devid holes just like above
script.
So the check for devid is incorrect and could cause false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e13346 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.19: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ab4ba2e133463c702b37242560d7fabedd2dc750 upstream.
[BUG]
For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then
kernel will just panic:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
#PF error: [normal kernel read fault]
PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
Call Trace:
open_ctree+0x160d/0x2149
btrfs_mount_root+0x5b2/0x680
[CAUSE]
If device extent verification finds a deivce with 0 total_bytes, then it
assumes it's a seed dummy, then search for seed devices.
But in this case, there is no seed device at all, causing NULL pointer.
[FIX]
Since this is caused by fuzzed image, let's go the tree-check way, just
add a new verification for device item.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 075cb3c78fe7976c9f29ca1fa23f9728634ecefc upstream.
Since we have btrfs_check_chunk_valid() in tree-checker, let's do
chunk item verification in tree-checker too.
Since the tree-checker is run at endio time, if one chunk leaf fails
chunk verification, we can still retry the other copy, making btrfs more
robust to fuzzed image as we may still get a good chunk item.
Also since we have done chunk verification in tree block read time, skip
the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading
chunk items from leaf.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bf871c3b43b1dcc3f2a076ff39a8f1ce7959d958 upstream.
To follow the standard behavior of tree-checker.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Cherry-picked for 4.19 to ease backporting later fixes]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f114024376bceb1c0f61a7bad4a72a0f978767af upstream.
Old error message would be something like:
BTRFS error (device dm-3): invalid chunk num_stipres: 0
New error message would be:
Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0
Or
Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0
And for certain error message, also output expected value.
The error message levels are changed from error to critical.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Cherry-picked for 4.19 to ease backporting later fixes]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 82fc28fbedbb59642f05215db3b0ef4eb91aa31d upstream.
By function, chunk item verification is more suitable to be done inside
tree-checker.
So move btrfs_check_chunk_valid() to tree-checker.c and export it.
And since it's now moved to tree-checker, also add a better comment for
what this function is doing.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Cherry-picked for 4.19 to ease backporting later fixes]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b3ff8f1d380e65dddd772542aa9bff6c86bf715a upstream.
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwd up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
This got its generation bumped.
- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.
- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.
- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.
The detailed sequence looks like this:
- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.
- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.
- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages
- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.
- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() get called.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages
- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_iot_tree.
- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.
- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is not tracked by neither of them, it's
still dirty.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
- Filesystem unmount
Since all cleanup is assumed to be done, all workqueus are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.
This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.
[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error. This is the last safe net, just in case other cleanup
haven't caught catch it.
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 4.19: fs_info variable already exists in
btree_write_cache_pages()]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18dfa7117a3f379862dcd3f67cadd678013bb9dd upstream.
The lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the callers needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:
[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
This is a regression introduced in the 5.2 kernel.
Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e3c25136adfb293d517e17f761d3b8a43a8fc22 upstream.
This function needs some extra checks on locked pages and eb. For error
handling we need to unlock locked pages and the eb.
There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.
Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio(). So there
shouldn't be any problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2b952eea813b1f7e7d4b9782271acd91625b9bb9 upstream.
In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.
Then instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3065976b045f77a910809fa7699f99a1e7c0dbbb upstream.
Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.
If __extent_writepage() fails, we do cleanup, and return error without
submitting the possible corrupted or half-baked bio.
If __extent_writepage() successes, then we call flush_write_bio() and
return the result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 42ffb0bf584ae5b6b38f72259af1e0ee417ac77f upstream.
There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows
PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
I used drgn to find the respective pages we were stuck on
page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following
crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}
Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).
Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on
bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")
which validated what I suspected.
The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.
Fixes: b293f02e1423 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 860473714cbe7fbedcf92bfe3eb6d69fae8c74ff. That
has an incorrect upstream commit reference, and was modified in a way
that conflicts with some older fixes. We can cleanly cherry-pick the
upstream commit *after* those fixes.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f4340622e02261fae599e3da936ff4808b418173 upstream.
We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().
Move the BUG_ON() one level up to all its callers.
This patch will introduce temporary variable, @flush_ret to keep code
change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Cherry-picked for 4.19 to ease backporting later fixes]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb58eb9e167d087cc518f7a71c3c00f1671958da upstream.
There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio(). Both of them are pretty small, just move
them to kill the forward declaration.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Cherry-picked for 4.19 to ease backporting later fixes]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 83bc1560e02e25c6439341352024ebe8488f4fbd upstream.
If we fail to find suitable zones for a new readahead extent, we end up
leaving a stale pointer in the global readahead extents radix tree
(fs_info->reada_tree), which can trigger the following trace later on:
[13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
[13367.696802] #PF: supervisor read access in kernel mode
[13367.697249] #PF: error_code(0x0000) - not-present page
[13367.697721] PGD 0 P4D 0
[13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G W 5.9.0-rc6-btrfs-next-69 #1
[13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
[13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
[13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
[13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
[13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
[13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
[13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
[13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
[13367.704880] FS: 00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
[13367.705438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
[13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[13367.707686] Call Trace:
[13367.708246] ? ___slab_alloc+0x395/0x740
[13367.708820] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.709383] lock_acquire+0xb1/0x480
[13367.709955] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.710537] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.711097] ? rcu_read_lock_sched_held+0x5d/0x90
[13367.711659] ? kmem_cache_alloc_trace+0x8d2/0x990
[13367.712221] ? lock_acquired+0x33b/0x470
[13367.712784] _raw_spin_lock+0x34/0x80
[13367.713356] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.713966] reada_add_block+0xe0/0xee0 [btrfs]
[13367.714529] ? btrfs_root_node+0x15/0x1f0 [btrfs]
[13367.715077] btrfs_reada_add+0x117/0x170 [btrfs]
[13367.715620] scrub_stripe+0x21e/0x10d0 [btrfs]
[13367.716141] ? kvm_sched_clock_read+0x5/0x10
[13367.716657] ? __lock_acquire+0x41e/0x3970
[13367.717184] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.717697] ? find_held_lock+0x32/0x90
[13367.718254] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.718773] ? lock_acquired+0x33b/0x470
[13367.719278] ? scrub_chunk+0xcd/0x140 [btrfs]
[13367.719786] scrub_chunk+0xcd/0x140 [btrfs]
[13367.720291] scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
[13367.720787] ? finish_wait+0x90/0x90
[13367.721281] btrfs_scrub_dev+0x1ee/0x620 [btrfs]
[13367.721762] ? rcu_read_lock_any_held+0x8e/0xb0
[13367.722235] ? preempt_count_add+0x49/0xa0
[13367.722710] ? __sb_start_write+0x19b/0x290
[13367.723192] btrfs_ioctl+0x7f5/0x36f0 [btrfs]
[13367.723660] ? __fget_files+0x101/0x1d0
[13367.724118] ? find_held_lock+0x32/0x90
[13367.724559] ? __fget_files+0x101/0x1d0
[13367.724982] ? __x64_sys_ioctl+0x83/0xb0
[13367.725399] __x64_sys_ioctl+0x83/0xb0
[13367.725802] do_syscall_64+0x33/0x80
[13367.726188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[13367.726574] RIP: 0033:0x7fe1df7add87
[13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
[13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
[13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
[13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
[13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
[13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
[13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
[13367.732638] CR2: 00000000000000b0
[13367.733166] ---[ end trace d298b6805556acd9 ]---
What happens is the following:
1) At reada_find_extent() we don't find any existing readahead extent for
the metadata extent starting at logical address X;
2) So we proceed to create a new one. We then call btrfs_map_block() to get
information about which stripes contain extent X;
3) After that we iterate over the stripes and create only one zone for the
readahead extent - only one because reada_find_zone() returned NULL for
all iterations except for one, either because a memory allocation failed
or it couldn't find the block group of the extent (it may have just been
deleted);
4) We then add the new readahead extent to the readahead extents radix
tree at fs_info->reada_tree;
5) Then we iterate over each zone of the new readahead extent, and find
that the device used for that zone no longer exists, because it was
removed or it was the source device of a device replace operation.
Since this left 'have_zone' set to 0, after finishing the loop we jump
to the 'error' label, call kfree() on the new readahead extent and
return without removing it from the radix tree at fs_info->reada_tree;
6) Any future call to reada_find_extent() for the logical address X will
find the stale pointer in the readahead extents radix tree, increment
its reference counter, which can trigger the use-after-free right
away or return it to the caller reada_add_block() that results in the
use-after-free of the example trace above.
So fix this by making sure we delete the readahead extent from the radix
tree if we fail to setup zones for it (when 'have_zone = 0').
Fixes: 319450211842ba ("btrfs: reada: bypass adding extent when all zone failed")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 572c83acdcdafeb04e70aa46be1fa539310be20c upstream.
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:
$ cat /proc/2143497/stack
[<0>] __btrfs_tree_lock+0x108/0x250
[<0>] lock_extent_buffer_for_io+0x35e/0x3a0
[<0>] btree_write_cache_pages+0x15a/0x3b0
[<0>] do_writepages+0x28/0xb0
[<0>] __writeback_single_inode+0x54/0x5c0
[<0>] writeback_sb_inodes+0x1e8/0x510
[<0>] wb_writeback+0xcc/0x440
[<0>] wb_workfn+0xd7/0x650
[<0>] process_one_work+0x236/0x560
[<0>] worker_thread+0x55/0x3c0
[<0>] kthread+0x13a/0x150
[<0>] ret_from_fork+0x1f/0x30
This is because we got an error while COWing a block, specifically here
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
ret = btrfs_reloc_cow_block(trans, root, buf, cow);
if (ret) {
btrfs_abort_transaction(trans, ret);
return ret;
}
}
[16402.241552] BTRFS: Transaction aborted (error -2)
[16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
[16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
[16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
[16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
[16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
[16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
[16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
[16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
[16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
[16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
[16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
[16402.257867] Call Trace:
[16402.258072] btrfs_cow_block+0x109/0x230
[16402.258356] btrfs_search_slot+0x530/0x9d0
[16402.258655] btrfs_lookup_file_extent+0x37/0x40
[16402.259155] __btrfs_drop_extents+0x13c/0xd60
[16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
[16402.259949] btrfs_replace_file_extents+0x190/0x820
[16402.260873] btrfs_clone+0x9ae/0xc00
[16402.261139] btrfs_extent_same_range+0x66/0x90
[16402.261771] btrfs_remap_file_range+0x353/0x3b1
[16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
[16402.262821] vfs_dedupe_file_range+0x189/0x220
[16402.263150] do_vfs_ioctl+0x552/0x700
[16402.263662] __x64_sys_ioctl+0x62/0xb0
[16402.264023] do_syscall_64+0x33/0x40
[16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16402.264862] RIP: 0033:0x7f542c7d15cb
[16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
[16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
[16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
[16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
[16402.271498] irq event stamp: 0
[16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
[16402.274338] ---[ end trace 737874a5a41a8236 ]---
[16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.276179] BTRFS info (device dm-9): forced readonly
[16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
[16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode. This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.
Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8eb2fd00153a3a96a19c62ac9c6d48c2efebe5e8 upstream.
btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.
[1] https://patchwork.kernel.org/patch/9757891/#20529627
Fixes: 818e010bf9d0 ("btrfs: replace opencoded kvzalloc with the helper")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c2b4e0347067396ceb3ae929d6888c81d610259 upstream.
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/f1
touch /mnt/sdi/f2
mkdir /mnt/sdi/d1
mkdir /mnt/sdi/d1/d2
# Filesystem looks like:
#
# . (ino 256)
# |----- f1 (ino 257)
# |----- f2 (ino 258)
# |----- d1/ (ino 259)
# |----- d2/ (ino 260)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now do a series of changes such that:
#
# *) inode 258 has one new hardlink and the previous name changed
#
# *) both names conflict with the old names of two other inodes:
#
# 1) the new name "d1" conflicts with the old name of inode 259,
# under directory inode 256 (root)
#
# 2) the new name "d2" conflicts with the old name of inode 260
# under directory inode 259
#
# *) inodes 259 and 260 now have the old names of inode 258
#
# *) inode 257 is now located under inode 260 - an inode with a number
# smaller than the inode (258) for which we created a second hard
# link and swapped its names with inodes 259 and 260
#
ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
# Swap d1 and f2.
mv /mnt/sdi/d1 /mnt/sdi/tmp
mv /mnt/sdi/f2 /mnt/sdi/d1
mv /mnt/sdi/tmp /mnt/sdi/f2
# Swap d2 and f2_link
mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
# Filesystem now looks like:
#
# . (ino 256)
# |----- d1 (ino 258)
# |----- f2/ (ino 259)
# |----- f2_link/ (ino 260)
# | |----- f1 (ino 257)
# |
# |----- d2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When executed the receive of the incremental stream fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
This happens because:
1) When processing inode 257 we end up computing the name for inode 259
because it is an ancestor in the send snapshot, and at that point it
still has its old name, "d1", from the parent snapshot because inode
259 was not yet processed. We then cache that name, which is valid
until we start processing inode 259 (or set the progress to 260 after
processing its references);
2) Later we start processing inode 258 and collecting all its new
references into the list sctx->new_refs. The first reference in the
list happens to be the reference for name "d1" while the reference for
name "d2" is next (the last element of the list).
We compute the full path "d1/d2" for this second reference and store
it in the reference (its ->full_path member). The path used for the
new parent directory was "d1" and not "f2" because inode 259, the
new parent, was not yet processed;
3) When we start processing the new references at process_recorded_refs()
we start with the first reference in the list, for the new name "d1".
Because there is a conflicting inode that was not yet processed, which
is directory inode 259, we orphanize it, renaming it from "d1" to
"o259-6-0";
4) Then we start processing the new reference for name "d2", and we
realize it conflicts with the reference of inode 260 in the parent
snapshot. So we issue an orphanization operation for inode 260 by
emitting a rename operation with a destination path of "o260-6-0"
and a source path of "d1/d2" - this source path is the value we
stored in the reference earlier at step 2), corresponding to the
->full_path member of the reference, however that path is no longer
valid due to the orphanization of the directory inode 259 in step 3).
This makes the receiver fail since the path does not exists, it should
have been "o259-6-0/d2".
Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb56f02f26fe23798edb1b2175707419b28c752a upstream.
Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.
Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.
So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.
The stack trace produced when the soft lockup happens is the following:
[10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
[10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
[10480.284915] irq event stamp: 29646366
[10480.285987] hardirqs last enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
[10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
[10480.290856] softirqs last enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
[10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
[10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
[10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
[10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
[10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
[10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
[10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
[10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
[10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
[10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
[10480.321104] FS: 00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
[10480.323559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
[10480.327259] Call Trace:
[10480.328286] ? overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.329784] __kmalloc+0x831/0xa20
[10480.331009] ? btrfs_get_32+0xb0/0x1d0 [btrfs]
[10480.332464] overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.333948] log_dir_items+0x2ee/0x570 [btrfs]
[10480.335413] log_directory_changes+0x82/0xd0 [btrfs]
[10480.336926] btrfs_log_inode+0xc9b/0xda0 [btrfs]
[10480.338374] ? init_once+0x20/0x20 [btrfs]
[10480.339711] btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
[10480.341257] ? dget_parent+0x97/0x2e0
[10480.342480] btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
[10480.343977] btrfs_sync_file+0x24b/0x5e0 [btrfs]
[10480.345381] do_fsync+0x38/0x70
[10480.346483] __x64_sys_fsync+0x10/0x20
[10480.347703] do_syscall_64+0x2d/0x70
[10480.348891] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10480.350444] RIP: 0033:0x7fa1dc80970b
[10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
[10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
[10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
[10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
[10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
[10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
CC: stable@vger.kernel.org # 4.4+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79dae17d8d44b2d15779e332180080af45df5352 upstream.
Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.
The current message is too long, so drop the unnecessary UUID and add
process name and PID.
While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent
CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b4c5d8fdfff3e2b6c4fa4a5043e8946dff500f8c upstream.
For delayed inode facility, qgroup metadata is reserved for it, and
later freed.
However we're freeing more bytes than we reserved.
In btrfs_delayed_inode_reserve_metadata():
num_bytes = btrfs_calc_metadata_size(fs_info, 1);
...
ret = btrfs_qgroup_reserve_meta_prealloc(root,
fs_info->nodesize, true);
...
if (!ret) {
node->bytes_reserved = num_bytes;
But in btrfs_delayed_inode_release_metadata():
if (qgroup_free)
btrfs_qgroup_free_meta_prealloc(node->root,
node->bytes_reserved);
else
btrfs_qgroup_convert_reserved_meta(node->root,
node->bytes_reserved);
This means, we're always releasing more qgroup metadata rsv than we have
reserved.
This won't trigger selftest warning, as btrfs qgroup metadata rsv has
extra protection against cases like quota enabled half-way.
But we still need to fix this problem any way.
This patch will use the same num_bytes for qgroup metadata rsv so we
could handle it correctly.
Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c6a5d954950c5031444173ad2195efc163afcac9 ]
If you replace a seed device in a sprouted fs, it appears to have
successfully replaced the seed device, but if you look closely, it
didn't. Here is an example.
$ mkfs.btrfs /dev/sda
$ btrfstune -S1 /dev/sda
$ mount /dev/sda /btrfs
$ btrfs device add /dev/sdb /btrfs
$ umount /btrfs
$ btrfs device scan --forget
$ mount -o device=/dev/sda /dev/sdb /btrfs
$ btrfs replace start -f /dev/sda /dev/sdc /btrfs
$ echo $?
0
BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished
$ btrfs fi show
Label: none uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
Total devices 2 FS bytes used 256.00KiB
devid 1 size 3.00GiB used 520.00MiB path /dev/sdc
devid 2 size 3.00GiB used 896.00MiB path /dev/sdb
Label: none uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
Total devices 1 FS bytes used 128.00KiB
devid 1 size 3.00GiB used 536.00MiB path /dev/sda
So as per the replace start command and kernel log replace was successful.
Now let's try to clean mount.
$ umount /btrfs
$ btrfs device scan --forget
$ mount -o device=/dev/sdc /dev/sdb /btrfs
mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
[ 636.157517] BTRFS error (device sdc): failed to read chunk tree: -2
[ 636.180177] BTRFS error (device sdc): open_ctree failed
That's because per dev items it is still looking for the original seed
device.
$ btrfs inspect-internal dump-tree -d /dev/sdb
item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
devid 1 total_bytes 3221225472 bytes_used 545259520
io_align 4096 io_width 4096 sector_size 4096 type 0
generation 6 start_offset 0 dev_group 0
seek_speed 0 bandwidth 0
uuid 59368f50-9af2-4b17-91da-8a783cc418d4 <--- seed uuid
fsid 10bd3202-0415-43af-96a8-d5409f310a7e <--- seed fsid
item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
devid 2 total_bytes 3221225472 bytes_used 939524096
io_align 4096 io_width 4096 sector_size 4096 type 0
generation 0 start_offset 0 dev_group 0
seek_speed 0 bandwidth 0
uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a <- sprout uuid
fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid
But the replaced target has the following uuid+fsid in its superblock
which doesn't match with the expected uuid+fsid in its devitem.
$ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid'
generation 20
dev_item.uuid 59368f50-9af2-4b17-91da-8a783cc418d4
dev_item.fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match]
dev_item.devid 1
So if you provide the original seed device the mount shall be
successful. Which so long happening in the test case btrfs/163.
$ btrfs device scan --forget
$ mount -o device=/dev/sda /dev/sdb /btrfs
Fix in this patch:
If a seed is not sprouted then there is no replacement of it, because of
its read-only filesystem with a read-only device. Similarly, in the case
of a sprouted filesystem, the seed device is still read only. So, mark
it as you can't replace a seed device, you can only add a new device and
then delete the seed device. If replace is attempted then returns
-EINVAL.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ]
[BUG]
When running tests like generic/013 on test device with btrfs quota
enabled, it can normally lead to data leak, detected at unmount time:
BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
------------[ cut here ]------------
WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace caf08beafeca2392 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
[CAUSE]
In the offending case, the offending operations are:
2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
The following sequence of events could happen after the writev():
CPU1 (writeback) | CPU2 (truncate)
-----------------------------------------------------------------
btrfs_writepages() |
|- extent_write_cache_pages() |
|- Got page for 1003520 |
| 1003520 is Dirty, no writeback |
| So (!clear_page_dirty_for_io()) |
| gets called for it |
|- Now page 1003520 is Clean. |
| | btrfs_setattr()
| | |- btrfs_setsize()
| | |- truncate_setsize()
| | New i_size is 18388
|- __extent_writepage() |
| |- page_offset() > i_size |
|- btrfs_invalidatepage() |
|- Page is clean, so no qgroup |
callback executed
This means, the qgroup reserved data space is not properly released in
btrfs_invalidatepage() as the page is Clean.
[FIX]
Instead of checking the dirty bit of a page, call
btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
io_tree, not bound to page status, thus we won't cause double freeing
anyway.
Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7c09c03091ac562ddca2b393e5d65c1d37da79f1 ]
Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
forced read-only. This is not a transaction abort and the filesystem is
otherwise ok, so the error should be just propagated to the callers.
This is caused by unnecessary call to btrfs_handle_fs_error for all
errors, except EAGAIN. This does not make sense as the standard
transaction abort mechanism is in btrfs_drop_snapshot so all relevant
failures are handled.
Originally in commit cb1b69f4508a ("Btrfs: forced readonly when
btrfs_drop_snapshot() fails") there was no return value at all, so the
btrfs_std_error made some sense but once the error handling and
propagation has been implemented we don't need it anymore.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1c78544eaa4660096aeb6a57ec82b42cdb3bfe5a upstream.
When faulting in the pages for the user supplied buffer for the search
ioctl, we are passing only the base address of the buffer to the function
fault_in_pages_writeable(). This means that after the first iteration of
the while loop that searches for leaves, when we have a non-zero offset,
stored in 'sk_offset', we try to fault in a wrong page range.
So fix this by adding the offset in 'sk_offset' to the base address of the
user supplied buffer when calling fault_in_pages_writeable().
Several users have reported that the applications compsize and bees have
started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix
potential deadlock in the search ioctl") was added to stable trees, and
these applications make heavy use of the search ioctls. This fixes their
issues.
Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
Link: https://github.com/kilobyte/compsize/issues/34
Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl")
CC: stable@vger.kernel.org # 4.4+
Tested-by: A L <mail@lechevalier.se>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fccc0007b8dc952c6bc0805cdf842eb8ea06a639 upstream.
Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc2+ #1 Tainted: G W
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc_trace+0x3a/0x1a0
btrfs_alloc_device+0x43/0x210
add_missing_dev+0x20/0x90
read_one_chunk+0x301/0x430
btrfs_read_sys_array+0x17b/0x1b0
open_ctree+0xa62/0x1896
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x379
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
vfs_rmdir.part.0+0x149/0x160
do_rmdir+0x136/0x1a0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&fs_info->chunk_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x12d/0x150
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa4/0x3d0
? btrfs_evict_inode+0x11e/0x500
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x46/0x60
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters. So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.
Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea57788eb76dc81f6003245427356a1dcd0ac524 upstream.
[BUG]
A completely sane converted fs will cause kernel warning at balance
time:
[ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data
[ 1563.358078] BTRFS info (device sda7): found 11722 extents
[ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2
[ 1563.358280] item 0 key (7984947200 169 0) itemoff 16250 itemsize 33
[ 1563.358281] extent refs 1 gen 90 flags 2
[ 1563.358282] ref#0: tree block backref root 4
[ 1563.358285] item 1 key (7985602560 169 0) itemoff 16217 itemsize 33
[ 1563.358286] extent refs 1 gen 93 flags 258
[ 1563.358287] ref#0: shared block backref parent 7985602560
[ 1563.358288] (parent 7985602560 is NOT ALIGNED to nodesize 16384)
[ 1563.358290] item 2 key (7985635328 169 0) itemoff 16184 itemsize 33
...
[ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182
[ 1563.358996] ------------[ cut here ]------------
[ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766
Then with transaction abort, and obviously failed to balance the fs.
[CAUSE]
That mentioned inline ref type 182 is completely sane, it's
BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to
believe it's invalid.
Commit 64ecdb647ddb ("Btrfs: add one more sanity check for shared ref
type") introduced extra checks for backref type.
One of the requirement is, parent bytenr must be aligned to node size,
which is not correct.
One example is like this:
0 1G 1G+4K 2G 2G+4K
| |///////////////////|//| <- A chunk starts at 1G+4K
| | <- A tree block get reserved at bytenr 1G+4K
Then we have a valid tree block at bytenr 1G+4K, but not aligned to
nodesize (16K).
Such chunk is not ideal, but current kernel can handle it pretty well.
We may warn about such tree block in the future, but should not reject
them.
[FIX]
Change the alignment requirement from node size alignment to sector size
alignment.
Also, to make our lives a little easier, also output @iref when
btrfs_get_extent_inline_ref_type() failed, so we can locate the item
easier.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475
Fixes: 64ecdb647ddb ("Btrfs: add one more sanity check for shared ref type")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a48b73eca4ceb9b8a4b97f290a065335dbcd8a04 ]
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
------------------------------------------------------
compsize/11122 is trying to acquire lock:
ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
but task is already holding lock:
ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-fs-00){++++}-{3:3}:
down_write_nested+0x3b/0x70
__btrfs_tree_lock+0x24/0x120
btrfs_search_slot+0x756/0x990
btrfs_lookup_inode+0x3a/0xb4
__btrfs_update_delayed_inode+0x93/0x270
btrfs_async_run_delayed_root+0x168/0x230
btrfs_work_helper+0xd4/0x570
process_one_work+0x2ad/0x5f0
worker_thread+0x3a/0x3d0
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #1 (&delayed_node->mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_delayed_update_inode+0x50/0x440
btrfs_update_inode+0x8a/0xf0
btrfs_dirty_inode+0x5b/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&mm->mmap_lock#2){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-fs-00);
lock(&delayed_node->mutex);
lock(btrfs-fs-00);
lock(&mm->mmap_lock#2);
*** DEADLOCK ***
1 lock held by compsize/11122:
#0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __might_fault+0x3e/0x90
? find_held_lock+0x72/0x90
__might_fault+0x68/0x90
? __might_fault+0x3e/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
? btrfs_search_forward+0x2a6/0x360
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
? __do_sys_newfstat+0x5a/0x70
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d3beaa253fd6fa40b8b18a216398e6e5376a9d21 ]
These are special extent buffers that get rewound in order to lookup
the state of the tree at a specific point in time. As such they do not
go through the normal initialization paths that set their lockdep class,
so handle them appropriately when they are created and before they are
locked.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 24cee18a1c1d7c731ea5987e0c99daea22ae7f4a ]
When a rewound buffer is created it already has a ref count of 1 and the
dummy flag set. Then another ref is taken bumping the count to 2.
Finally when this buffer is released from btrfs_release_path the extra
reference is decremented by the special handling code in
free_extent_buffer.
However, this special code is in fact redundant sinca ref count of 1 is
still correct since the buffer is only accessed via btrfs_path struct.
This paves the way forward of removing the special handling in
free_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c122e2a0c515cfb3f3a9cefb5dad4cb62109c78 ]
get_old_root used used only by btrfs_search_old_slot to initialise the
path structure. The old root is always a cloned buffer (either via alloc
dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
from allocation, 1 from extent_buffer_get call in get_old_root.
This latter explicit ref count acquire operation is in fact unnecessary
since the semantic is such that the newly allocated buffer is handed
over to the btrfs_path for lifetime management. Considering this just
remove the extra extent_buffer_get in get_old_root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9771a5cf937129307d9f58922d60484d58ababe7 upstream.
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-uuid-00){++++}-{3:3}:
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_uuid_tree_add+0x89/0x2d0
btrfs_uuid_scan_kthread+0x330/0x390
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-uuid-00);
lock(btrfs-root-00);
lock(btrfs-uuid-00);
lock(btrfs-root-00);
*** DEADLOCK ***
1 lock held by btrfs-uuid/7955:
#0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __btrfs_tree_read_lock+0x39/0x180
? btrfs_root_node+0x1c/0x1d0
down_read_nested+0x3e/0x140
? __btrfs_tree_read_lock+0x39/0x180
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
? btree_readpage+0x20/0x20
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots. The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.
It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop. So simply move
the path release before we add our entry to the uuid tree.
This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fb2fecbad50964b9f27a3b182e74e437b40753ef ]
With my new locking code dbench is so much faster that I tripped over a
transaction abort from ENOSPC. This turned out to be because
btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
function sets err on error, and returns err. So instead of properly
marking the inode as needing a full commit, we were returning -ENOSPC
and aborting in __btrfs_unlink_inode. Fix this by checking the proper
variable so that we return the correct thing in the case of ENOSPC.
The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
can return -ENOENT if the dir item isn't in the tree log (which would
happen if we hadn't fsync'ed this guy). We actually handle that case in
__btrfs_unlink_inode, so it's an expected error to get back.
Fixes: 4a500fd178c8 ("Btrfs: Metadata ENOSPC handling for tree log")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note and comment about ENOENT ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit bbc37d6e475eee8ffa2156ec813efc6bbb43c06d upstream.
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group A is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
unreferenced object 0xffff9bbf009fa600 (size 512):
comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
hex dump (first 32 bytes):
00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=...
80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=...
backtrace:
[<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
[<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
[<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
[<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
[<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
[<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
[<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
[<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
[<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
[<0000000043ace2c9>] do_syscall_64+0x33/0x80
[<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 282dd7d7718444679b046b769d872b188818ca35 upstream.
Currently a user can set mount "-o compress" which will set the
compression algorithm to zlib, and use the default compress level for
zlib (3):
relatime,compress=zlib:3,space_cache
If the user remounts the fs using "-o compress=lzo", then the old
compress_level is used:
relatime,compress=lzo:3,space_cache
But lzo does not expose any tunable compression level. The same happens
if we set any compress argument with different level, also with zstd.
Fix this by resetting the compress_level when compress=lzo is
specified. With the fix applied, lzo is shown without compress level:
relatime,compress=lzo,space_cache
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a7f8b1c2ac21bf081b41264c9cfd6260dffa6246 ]
The incoming qgroup reserved space timing will move the data reservation
to ordered extent completely.
However in btrfs_punch_hole_lock_range() will call
btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the
range.
In current stage it's OK, but if we're making ordered extents handle the
reserved space, then btrfs_punch_hole_lock_range() can clear the
QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup
reserved space leakage.
So here change the timing to make reserve data space after
btrfs_punch_hole_lock_range().
The new timing is fine for either current code or the new code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>