IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit c57dd1f2f6a7cd1bb61802344f59ccdc5278c983 upstream
[BUG]
The following script can lead to tons of beyond device boundary access:
mkfs.btrfs -f $dev -b 10G
mount $dev $mnt
trimfs $mnt
btrfs filesystem resize 1:-1G $mnt
trimfs $mnt
[CAUSE]
Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.
So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.
This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.
Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.
[FIX]
This patch will fix the problem in two ways:
- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
This is the root fix
- Add extra safety check when trimming free device extents
We check and warn if the returned range is already beyond current
device.
Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: adjust context and use extent_io.h]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0697d9a610998b8bdee6b2390836cb2391d8fd1a upstream.
Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().
At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.
==================================================================
BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1d6/0x29e lib/dump_stack.c:118
print_address_description+0x66/0x620 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report+0x132/0x1d0 mm/kasan/report.c:530
btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44840a
RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
Allocated by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
kmalloc_node include/linux/slab.h:577 [inline]
kvmalloc_node+0x81/0x110 mm/util.c:574
kvmalloc include/linux/mm.h:757 [inline]
kvzalloc include/linux/mm.h:765 [inline]
btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x113/0x200 mm/slab.c:3756
deactivate_locked_super+0xa7/0xf0 fs/super.c:335
btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff8880878e0000
which belongs to the cache kmalloc-16k of size 16384
The buggy address is located 1704 bytes inside of
16384-byte region [ffff8880878e0000, ffff8880878e4000)
The buggy address belongs to the page:
page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.
Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.
There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cf89af146b7e62af55470cf5f3ec3c56ec144a5e upstream.
If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image. Fail the mount as this needs a closer
look what is actually wrong.
As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:
WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
panic+0x347/0x7c0 kernel/panic.c:231
__warn.cold+0x20/0x46 kernel/panic.c:600
report_bug+0x1bd/0x210 lib/bug.c:198
handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
btrfs_fill_super fs/btrfs/super.c:1316 [inline]
btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 66d204a16c94f24ad08290a7663ab67e7fc04e82 upstream.
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:
[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.
After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:
>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156
Which corresponds to:
int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)
So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:
$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()
for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)
$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$
So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.
Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:
$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'
return ret
def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)
$ drgn dump_block_groups.py
block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6
block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6
block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6
block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6
block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6
block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6
block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$
So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.
So after figuring that out it became obvious why the hang happens:
1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;
2) Task B starts a device replace with /dev/sdd as the source device;
3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;
4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.
It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.
As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;
4) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;
5) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().
Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().
At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.
This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.
This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.
For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:
1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;
2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;
3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;
4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();
5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().
And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.
While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.
So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.
This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79dae17d8d44b2d15779e332180080af45df5352 upstream.
Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.
The current message is too long, so drop the unnecessary UUID and add
process name and PID.
While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent
CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6b7faadd985c990324b5b5bd18cc4ba5c395eb65 upstream.
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2d974619a77f106f3d1341686dea95c0eaad601f upstream.
The old code goes:
offset = logical - em->start;
length = min_t(u64, em->len - offset, length);
Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:
offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);
Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fccc0007b8dc952c6bc0805cdf842eb8ea06a639 upstream.
Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc2+ #1 Tainted: G W
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc_trace+0x3a/0x1a0
btrfs_alloc_device+0x43/0x210
add_missing_dev+0x20/0x90
read_one_chunk+0x301/0x430
btrfs_read_sys_array+0x17b/0x1b0
open_ctree+0xa62/0x1896
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x379
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
vfs_rmdir.part.0+0x149/0x160
do_rmdir+0x136/0x1a0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&fs_info->chunk_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x12d/0x150
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa4/0x3d0
? btrfs_evict_inode+0x11e/0x500
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x46/0x60
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters. So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.
Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9771a5cf937129307d9f58922d60484d58ababe7 upstream.
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-uuid-00){++++}-{3:3}:
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_uuid_tree_add+0x89/0x2d0
btrfs_uuid_scan_kthread+0x330/0x390
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-uuid-00);
lock(btrfs-root-00);
lock(btrfs-uuid-00);
lock(btrfs-root-00);
*** DEADLOCK ***
1 lock held by btrfs-uuid/7955:
#0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __btrfs_tree_read_lock+0x39/0x180
? btrfs_root_node+0x1c/0x1d0
down_read_nested+0x3e/0x140
? __btrfs_tree_read_lock+0x39/0x180
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
? btree_readpage+0x20/0x20
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots. The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.
It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop. So simply move
the path release before we add our entry to the uuid tree.
This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 44d354abf33e92a5e73b965c84caf5a5d5e58a0b upstream.
Since most metadata reservation calls can return -EINTR when get
interrupted by fatal signal, we need to review the all the metadata
reservation call sites.
In relocation code, the metadata reservation happens in the following
sites:
- btrfs_block_rsv_refill() in merge_reloc_root()
merge_reloc_root() is a pretty critical section, we don't want to be
interrupted by signal, so change the flush status to
BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
Since such change can be ENPSPC-prone, also shrink the amount of
metadata to reserve least amount avoid deadly ENOSPC there.
- btrfs_block_rsv_refill() in reserve_metadata_space()
It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
by signal.
- btrfs_block_rsv_refill() in prepare_to_relocate()
- btrfs_block_rsv_add() in prepare_to_relocate()
- btrfs_block_rsv_refill() in relocate_block_group()
- btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
- btrfs_start_transaction() in relocate_block_group()
- btrfs_start_transaction() in create_reloc_inode()
Can be interrupted by fatal signal and we can handle it easily.
For these call sites, just catch the -EINTR value in btrfs_balance()
and count them as canceled.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 01d01caf19ff7c537527d352d169c4368375c0a1 upstream.
We are currently getting this lockdep splat in btrfs/161:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G E
------------------------------------------------------
mount/678048 is trying to acquire lock:
ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
but task is already holding lock:
ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x8b/0x8f0
btrfs_init_new_device+0x2d2/0x1240 [btrfs]
btrfs_ioctl+0x1de/0x2d20 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
__mutex_lock+0x8b/0x8f0
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
*** DEADLOCK ***
3 locks held by mount/678048:
#0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
#2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
stack backtrace:
CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? asm_sysvec_apic_timer_interrupt+0x12/0x20
lock_acquire+0xab/0x360
? clone_fs_devices+0x4d/0x170 [btrfs]
__mutex_lock+0x8b/0x8f0
? clone_fs_devices+0x4d/0x170 [btrfs]
? rcu_read_lock_sched_held+0x52/0x60
? cpumask_next+0x16/0x20
? module_assert_mutex_or_preempt+0x14/0x40
? __module_address+0x28/0xf0
? clone_fs_devices+0x4d/0x170 [btrfs]
? static_obj+0x4f/0x60
? lockdep_init_map_waits+0x43/0x200
? clone_fs_devices+0x4d/0x170 [btrfs]
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
? super_setup_bdi_name+0x79/0xd0
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? vfs_parse_fs_string+0x84/0xb0
? rcu_read_lock_sched_held+0x52/0x60
? kfree+0x2b5/0x310
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? rcu_read_lock_sched_held+0x52/0x60
? legacy_get_tree+0x30/0x50
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
? memdup_user+0x4e/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex. The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem. We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.
An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18c850fdc5a801bad4977b0f1723761d42267e45 upstream.
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3502a8c0dc1bd4b4970b59b06e348f22a1c05581 upstream.
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 48cfa61b58a1fee0bc49eef04f8ccf31493b7cdd upstream.
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream.
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number. For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.
But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].
[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid
Consider a working filesystem with raid1:
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
$ mount /dev/sda /mnt-raid1
$ umount /mnt-raid1
While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt-single
$ btrfs dev add -f /dev/sda /mnt-single
Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.
$ mount -o degraded /dev/sdb /mnt-raid1
[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
kernel: BTRFS error (device sdb): failed to read devices
kernel: BTRFS error (device sdb): open_ctree failed
Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream.
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.
By having an alien device in fs_devices, we have two issues so far
1. missing device does not not show as missing in the userland
2. degraded mount will fail
Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).
A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.
And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.
This patch fixes both issues above by removing the alien device.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1362089d2ad7e20d16371b39d3c11990d4ec23e4 ]
Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.
In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a69976bc69308aa475d0ba3b8b3efd1d013c0460 ]
We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.
For example:
BTRFS info (device sdc): device stats zeroed by btrfs(9223)
Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 05840710149c7d1a78ea85a2db5723f706e97d8f upstream.
There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.
Fixes: cc5de4e70256 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b35cf1f0bf1f2b0b193093338414b9bd63b29015 upstream.
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a0cac0ec961f0d42828eeef196ac2246a2f07659 ]
Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.
However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.
While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fdecf ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the command:
btrfs balance start -dconvert=single,soft .
on a Raspberry Pi produces the following kernel message:
BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.
Fix by open-coding the check using u64 variables.
Tested by completing the original balance command on several Raspberry
Pis.
Fixes: 818255feece6 ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:
# btrfs balance start -f -dconvert=single $mnt
ERROR: error during balancing '/mnt/btrfs': Invalid argument
# dmesg -t | tail
validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
BTRFS error (device dm-3): balance: invalid convert data profile single
[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.
Thus we fail at that (profile & ~allowed) check.
This regression is caused by commit 081db89b13cb ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.
[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13cb ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The other dev stats functions are already there and the helpers are not
used by anything else.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the fake ENOMEM return error code to the actual error in
clone_fs_devices().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a corrupted tree, if search for next devid finds the device with
devid = -1, then report the error -EUCLEAN back to the parent function
to fail gracefully.
The tree checker will not catch this in case the devids are created
using the following script:
umount /btrfs
dev1=/dev/sdb
dev2=/dev/sdc
mkfs.btrfs -fq -dsingle -msingle $dev1
mount $dev1 /btrfs
_fail()
{
echo $1
exit 1
}
while true; do
btrfs dev add -f $dev2 /btrfs || _fail "add failed"
btrfs dev del $dev1 /btrfs || _fail "del failed"
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
With output:
BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
BTRFS warning (device sdb): Skipping commit of aborted transaction.
BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ add script and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_get_chunk_map() never returns NULL, it returns error pointers.
Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the function btrfs_init_dev_stats() goto out is not needed, because the
alloc has failed. So just return -ENOMEM.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
%found_key is not used, drop it since it hasn't been used since the
beginning in 733f4fbbc108 ("Btrfs: read device stats on mount, write
modified ones during commit").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_reset_dev_stats() is a small helper function to reset devices stat
values, and is used only once, instead just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is prep work for moving all of the block group cache code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Test case btrfs/156 fails since commit 302167c50b32 ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.
[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().
This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate
Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
used space, we can relocate
This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.
btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.
So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.
1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M 129M 257M 385M 513M 550M
|///////| |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.
The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.
The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.
However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.
[FIX]
The fix here is to remove btrfs_can_relocate() completely.
If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.
If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 6df9a95e6339 ("Btrfs: make the chunk allocator completely
tree lockless") we search commit root of device tree to avoid deadlock.
This introduced a safety feature, find_free_dev_extent_start() won't
use dev extents which just get freed in current transaction.
This safety feature makes sure we won't allocate new block group using
just freed dev extents to break CoW.
However, this feature also makes find_free_dev_extent_start() not
reliable reporting free device space. Just add such comment to make
later viewer careful about this behavior.
This behavior makes one caller, btrfs_can_relocate() unreliable
determining the device free space.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only used locally in find_free_dev_extent(), no
external callers.
So unexport it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/btrfs/volumes.c: In function __btrfs_map_block:
fs/btrfs/volumes.c:6023:6: warning:
variable offset set but not used [-Wunused-but-set-variable]
It is not used any more since commit 343abd1c0ca9 ("btrfs: Use
btrfs_get_io_geometry appropriately")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:
[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---
These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().
We have this split raid kobject setup since commit 75cb379d263521
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:
1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.
2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.
Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:
1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().
2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.
So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().
Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406d0a9
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference
on a extent_map, but on normal operation it does not drop this reference
anymore.
This leads to excessive kmemleak reports.
Always call free_extent_map(), not just in the error case.
Fixes: 5f1411265e16 ("btrfs: Introduce btrfs_io_geometry infrastructure")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.
This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.
Signed-off-by: David Sterba <dsterba@suse.com>