IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Btrfs device probing code needs adaptation so that it works when writes
are restricted to its mounted devices. Since btrfs maintainer wants to
merge these changes through btrfs tree and there are review bandwidth
issues with that, let's not block all other filesystems and just not
restrict writes to btrfs devices for now.
CC: <linux-btrfs@vger.kernel.org>
CC: David Sterba <dsterba@suse.com>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Chris Mason <clm@fb.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231101174325.10596-4-jack@suse.cz
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
[BUG]
Syzbot reported a regression that after commit 6ed05643ddb1 ("btrfs:
create qgroup earlier in snapshot creation") we can trigger transaction
abort during snapshot creation:
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Modules linked in:
CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Call Trace:
<TASK>
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967
btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440
create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041
__btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294
btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321
btrfs_ioctl+0xbbf/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f2f791127b9
</TASK>
[CAUSE]
The error number is -EEXIST, which can happen for qgroup if there is
already an existing qgroup and then we're trying to create a snapshot
for it.
[FIX]
In that case, we can continue creating the snapshot, although it may
lead to qgroup inconsistency, it's not so critical to abort the current
transaction.
So in this case, we can just ignore the non-critical errors, mostly -EEXIST
(there is already a qgroup).
Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com
Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that ntfs2btrfs had a bug that it can lead to
transaction abort and the filesystem flips to read-only.
[CAUSE]
For inline backref items, kernel has a strict requirement for their
ordered, they must follow the following rules:
- All btrfs_extent_inline_ref::type should be in an ascending order
- Within the same type, the items should follow a descending order by
their sequence number
For EXTENT_DATA_REF type, the sequence number is result from
hash_extent_data_ref().
For other types, their sequence numbers are
btrfs_extent_inline_ref::offset.
Thus if there is any code not following above rules, the resulted
inline backrefs can prevent the kernel to locate the needed inline
backref and lead to transaction abort.
[FIX]
Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
ability to detect such problems.
For kernel, let's be more noisy and be more specific about the order, so
that the next time kernel hits such problem we would reject it in the
first place, without leading to transaction abort.
Link: https://github.com/kdave/btrfs-progs/pull/622
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVSO50ACgkQxWXV+ddt
WDuiyg/7BZviFAyiQMAzpA319qRJZ+EemfTdF/k69q4axGYuvqVdXKnpOV44AR4I
dKcHLOPpDZIxsh8lFytkm1UEAHptw1v7A64c+gcdjGK0tAA7aKbw/1nmNysowT23
L0v2+34hkBUfG8A3uVgOwL1rjItEX5Fl54slpVsazSqlEbKrqC4MGNjmqdp3IOeC
qfXTgkvkXmm8s8NyoJybKewM9Aw0tmK0jkAFHA+2sgcZPYKXjqWGv9KUOsXnCx5o
3kPWIRT1sj4q2qzrgP14Q12O6qPLZ2/0oTBhi6nhj8+N1yiH+USS5zBITegF+w2n
leQeVHtyBYHlPYQSQlCIZy7+10gkePvs+JmoAuL8YFISnGYnvOZqCeArlV7cnNI3
CQt7ZBER5Dqw78Y756usUhpYrLWa9kOpcPVRmjJ/R62+TY1FkkyY7irETbn5EGjI
NlhEa4PMYeYpAOccoxWEm9tIiiVD1abURhVBdn3Znfcb1Sv/lrGBlo9DYGFCxbBh
xU1JP7sly8w0aPLqCbn1X3VY8dXp+CeYz4FQabHjQA/zr9lF08/pRYj3haAbYAyH
0KphXurwz/YqY+LmRg7SbQ/KMgBAiBV8Qk9JyNvdvaQbnYnq7CWdpoHcpZu3mvpb
HLGoXew58kZaSfxLHlcT5wwYlbq0rooXRstuFg2+BBcOFOMCQfw=
=GM+1
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential overflow in returned value from SEARCH_TREE_V2
ioctl on 32bit architecture
- zoned mode fixes:
- drop unnecessary write pointer check for RAID0/RAID1/RAID10
profiles, now it works because of raid-stripe-tree
- wait for finishing the zone when direct IO needs a new
allocation
- simple quota fixes:
- pass correct owning root pointer when cleaning up an
aborted transaction
- fix leaking some structures when processing delayed refs
- change key type number of BTRFS_EXTENT_OWNER_REF_KEY,
reorder it before inline refs that are supposed to be
sorted, keeping the original number would complicate a lot
of things; this change needs an updated version of
btrfs-progs to work and filesystems need to be recreated
- fix error pointer dereference after failure to allocate fs
devices
- fix race between accounting qgroup extents and removing a
qgroup
* tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: make OWNER_REF_KEY type value smallest among inline refs
btrfs: fix qgroup record leaks when using simple quotas
btrfs: fix race between accounting qgroup extents and removing a qgroup
btrfs: fix error pointer dereference after failure to allocate fs devices
btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
btrfs: get correct owning_root when dropping snapshot
btrfs: zoned: wait for data BG to be finished on direct IO allocation
btrfs: zoned: drop no longer valid write pointer check
btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
btrfs: use u64 for buffer sizes in the tree search ioctls
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.
Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.
Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing qgroup accounting for an extent, we take the spinlock
fs_info->qgroup_lock and then add qgroups to the local list (iterator)
named "qgroups". These qgroups are found in the fs_info->qgroup_tree
rbtree. After we're done, we unlock fs_info->qgroup_lock and then call
qgroup_iterator_nested_clean(), which will iterate over all the qgroups
added to the local list "qgroups" and then delete them from the list.
Deleting a qgroup from the list can however result in a use-after-free
if a qgroup remove operation happens after we unlock fs_info->qgroup_lock
and before or while we are at qgroup_iterator_nested_clean().
Fix this by calling qgroup_iterator_nested_clean() while still holding
the lock fs_info->qgroup_lock - we don't need it under the 'out' label
since before taking the lock the "qgroups" list is always empty. This
guarantees safety because btrfs_remove_qgroup() takes that lock before
removing a qgroup from the rbtree fs_info->qgroup_tree.
This was reported by syzbot with the following stack traces:
BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48
CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Workqueue: btrfs-qgroup-rescan btrfs_work_helper
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x163/0x540 mm/kasan/report.c:475
kasan_report+0x175/0x1b0 mm/kasan/report.c:588
__list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
__list_del_entry_valid include/linux/list.h:124 [inline]
__list_del_entry include/linux/list.h:215 [inline]
list_del_init include/linux/list.h:287 [inline]
qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline]
btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883
qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline]
btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604
btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703
worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
</TASK>
Allocated by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209
btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Freed by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1800 [inline]
slab_free_freelist_hook mm/slub.c:1826 [inline]
slab_free mm/slub.c:3809 [inline]
__kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787
btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
Second to last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
The buggy address belongs to the object at ffff888027e42000
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 176 bytes inside of
freed 512-byte region [ffff888027e42000, ffff888027e42200)
The buggy address belongs to the physical page:
page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40
head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536
prep_new_page mm/page_alloc.c:1543 [inline]
get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170
__alloc_pages+0x255/0x670 mm/page_alloc.c:4426
alloc_slab_page+0x6a/0x160 mm/slub.c:1870
allocate_slab mm/slub.c:2017 [inline]
new_slab+0x84/0x2f0 mm/slub.c:2070
___slab_alloc+0xc85/0x1310 mm/slub.c:3223
__slab_alloc mm/slub.c:3322 [inline]
__slab_alloc_node mm/slub.c:3375 [inline]
slab_alloc_node mm/slub.c:3468 [inline]
__kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517
kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670
do_dentry_open+0x8fd/0x1590 fs/open.c:948
do_open fs/namei.c:3622 [inline]
path_openat+0x2845/0x3280 fs/namei.c:3779
do_filp_open+0x234/0x490 fs/namei.c:3809
do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1136 [inline]
free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312
free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405
discard_slab mm/slub.c:2116 [inline]
__unfreeze_partials+0x1dc/0x220 mm/slub.c:2655
put_cpu_partial+0x17b/0x250 mm/slub.c:2731
__slab_free+0x2b6/0x390 mm/slub.c:3679
qlink_free mm/kasan/quarantine.c:166 [inline]
qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185
kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292
__kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762
slab_alloc_node mm/slub.c:3478 [inline]
slab_alloc mm/slub.c:3486 [inline]
__kmem_cache_alloc_lru mm/slub.c:3493 [inline]
kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502
getname_flags+0xbc/0x4f0 fs/namei.c:140
do_sys_openat2+0xd2/0x1d0 fs/open.c:1434
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Memory state around the buggy address:
ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com
Fixes: dce28769a33a ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()")
CC: stable@vger.kernel.org # 6.6
Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At device_list_add() we allocate a btrfs_fs_devices structure and then
before checking if the allocation failed (pointer is ERR_PTR(-ENOMEM)),
we dereference the error pointer in a memcpy() argument if the feature
BTRFS_FEATURE_INCOMPAT_METADATA_UUID is enabled.
Fix this by checking for an allocation error before trying the memcpy().
Fixes: f7361d8c3fc3 ("btrfs: sipmlify uuid parameters of alloc_fs_devices()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a compilation warning reported on commit ae76d8e3e135 ("btrfs:
scrub: fix grouping of read IO"), where gcc (14.0.0 20231022 experimental)
is reporting the following uninitialized variable:
fs/btrfs/scrub.c: In function ‘scrub_simple_mirror.isra’:
fs/btrfs/scrub.c:2075:29: error: ‘found_logical’ may be used uninitialized [-Werror=maybe-uninitialized[https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmaybe-uninitialized]]
2075 | cur_logical = found_logical + BTRFS_STRIPE_LEN;
fs/btrfs/scrub.c:2040:21: note: ‘found_logical’ was declared here
2040 | u64 found_logical;
| ^~~~~~~~~~~~~
[CAUSE]
This is a false alert, as @found_logical is passed as parameter
@found_logical_ret of function queue_scrub_stripe().
As long as queue_scrub_stripe() returned 0, we would update
@found_logical_ret. And if queue_scrub_stripe() returned >0 or <0, the
caller would not utilized @found_logical, thus there should be nothing
wrong.
Although the triggering gcc is still experimental, it looks like the
extra check on "if (found_logical_ret)" can sometimes confuse the
compiler.
Meanwhile the only caller of queue_scrub_stripe() is always passing a
valid pointer, there is no need for such check at all.
[FIX]
Although the report itself is a false alert, we can still make it more
explicit by:
- Replace the check for @found_logical_ret with ASSERT()
- Initialize @found_logical to U64_MAX
- Add one extra ASSERT() to make sure @found_logical got updated
Link: https://lore.kernel.org/linux-btrfs/87fs1x1p93.fsf@gentoo.org/
Fixes: ae76d8e3e135 ("btrfs: scrub: fix grouping of read IO")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dave reported a bug where we were aborting the transaction while trying
to cleanup the squota reservation for an extent.
This turned out to be because we're doing btrfs_header_owner(next) in
do_walk_down when we decide to free the block. However in this code
block we haven't explicitly read next, so it could be stale. We would
then get whatever garbage happened to be in the pages at this point.
The commit that introduced that is "btrfs: track owning root in
btrfs_ref".
Fix this by saving the owner_root when we do the
btrfs_lookup_extent_info(). We always do this in do_walk_down, it is
how we make the decision of whether or not to delete the block. This is
cheap because we've already done the extent item lookup at this point,
so it's straightforward to just grab the owner root as well.
Then we can use this when deleting the metadata block without needing to
force a read of the extent buffer to find the owner.
This fixes the problem that Dave reported.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.
$ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
--rw=write --ioengine=libaio --iodepth=128 --direct=1
fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
...
This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.
btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.
This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a check of the write pointer vs the zone size to reject an invalid
write pointer. However, as of now, we have RAID0/RAID10 on the zoned
mode, we can have a block group whose size is larger than the zone size.
As an equivalent check against the block group's zone_capacity is already
there, we can just drop this invalid check.
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's more obvious to return a literal zero instead of "return ret;".
Plus Smatch complains that ret could be uninitialized if the
ordered_extent->bioc_list list is empty and this silences that warning.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the tree search v2 ioctl we use the type size_t, which is an unsigned
long, to track the buffer size in the local variable 'buf_size'. An
unsigned long is 32 bits wide on a 32 bits architecture. The buffer size
defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later
try to copy the local variable 'buf_size' to the argument struct, when
the search returns -EOVERFLOW, we copy only 32 bits which will be a
problem on big endian systems.
Fix this by using a u64 type for the buffer sizes, not only at
btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain
so that we can use the u64 at btrfs_ioctl_tree_search_v2().
Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
=9TVx
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
=2PXg
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner:
"The 's_xattr' field of 'struct super_block' currently requires a
mutable table of 'struct xattr_handler' entries (although each handler
itself is const). However, no code in vfs actually modifies the
tables.
This changes the type of 's_xattr' to allow const tables, and modifies
existing file systems to move their tables to .rodata. This is
desirable because these tables contain entries with function pointers
in them; moving them to .rodata makes it considerably less likely to
be modified accidentally or maliciously at runtime"
* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
const_structs.checkpatch: add xattr_handler
net: move sockfs_xattr_handlers to .rodata
shmem: move shmem_xattr_handlers to .rodata
overlayfs: move xattr tables to .rodata
xfs: move xfs_xattr_handlers to .rodata
ubifs: move ubifs_xattr_handlers to .rodata
squashfs: move squashfs_xattr_handlers to .rodata
smb: move cifs_xattr_handlers to .rodata
reiserfs: move reiserfs_xattr_handlers to .rodata
orangefs: move orangefs_xattr_handlers to .rodata
ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
ntfs3: move ntfs_xattr_handlers to .rodata
nfs: move nfs4_xattr_handlers to .rodata
kernfs: move kernfs_xattr_handlers to .rodata
jfs: move jfs_xattr_handlers to .rodata
jffs2: move jffs2_xattr_handlers to .rodata
hfsplus: move hfsplus_xattr_handlers to .rodata
hfs: move hfs_xattr_handlers to .rodata
gfs2: move gfs2_xattr_handlers_max to .rodata
fuse: move fuse_xattr_handlers to .rodata
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc
otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF
5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ=
=E1kV
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs superblock updates from Christian Brauner:
"This contains the work to make block device opening functions return a
struct bdev_handle instead of just a struct block_device. The same
struct bdev_handle is then also passed to block device closing
functions.
This allows us to propagate context from opening to closing a block
device without having to modify all users everytime.
Sidenote, in the future we might even want to try and have block
device opening functions return a struct file directly but that's a
series on top of this.
These are further preparatory changes to be able to count writable
opens and blocking writes to mounted block devices. That's a separate
piece of work for next cycle and for that we absolutely need the
changes to btrfs that have been quietly dropped somehow.
Originally the series contained a patch that removed the old
blkdev_*() helpers. But since this would've caused needles churn in
-next for bcachefs we ended up delaying it.
The second piece of work addresses one of the major annoyances about
the work last cycle, namely that we required dropping s_umount
whenever we used the superblock and fs_holder_ops for a block device.
The reason for that requirement had been that in some codepaths
s_umount could've been taken under disk->open_mutex (that's always
been the case, at least theoretically). For example, on surprise block
device removal or media change. And opening and closing block devices
required grabbing disk->open_mutex as well.
So we did the work and went through the block layer and fixed all
those places so that s_umount is never taken under disk->open_mutex.
This means no more brittle games where we yield and reacquire s_umount
during block device opening and closing and no more requirements where
block devices need to be closed. Filesystems don't need to care about
this.
There's a bunch of other follow-up work such as moving block device
freezing and thawing to holder operations which makes it work for all
block devices and not just the main block device just as we did for
surprise removal. But that is for next cycle.
Tested with fstests for all major fses, blktests, LTP"
* tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
porting: update locking requirements
fs: assert that open_mutex isn't held over holder ops
block: assert that we're not holding open_mutex over blk_report_disk_dead
block: move bdev_mark_dead out of disk_check_media_change
block: WARN_ON_ONCE() when we remove active partitions
block: simplify bdev_del_partition()
fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
jfs: fix log->bdev_handle null ptr deref in lbmStartIO
bcache: Fixup error handling in register_cache()
xfs: Convert to bdev_open_by_path()
reiserfs: Convert to bdev_open_by_dev/path()
ocfs2: Convert to use bdev_open_by_dev()
nfs/blocklayout: Convert to use bdev_open_by_dev/path()
jfs: Convert to bdev_open_by_dev()
f2fs: Convert to bdev_open_by_dev/path()
ext4: Convert to bdev_open_by_dev()
erofs: Convert to use bdev_open_by_path()
btrfs: Convert to bdev_open_by_path()
fs: Convert to bdev_open_by_dev()
mm/swap: Convert to use bdev_open_by_dev()
...
Convert btrfs to use bdev_open_by_path() and pass the handle around. We
also drop the holder from struct btrfs_device as it is now not needed
anymore.
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU2lLEACgkQxWXV+ddt
WDvCThAApe+zMNdEhQ/cgrvfzP/X91Q53PXQsdVsrujPyUV8eEV4oJzEwVbJhRdw
3ukIQtvyAMNiWhEBhOQRwxjuUoTCApGAeEEEl1cWWEqQ7G2/2LS4+bcWzgQ3Vu32
dzYL37ddsfe4n7OgfnymtMrnv7kge0XbAlY3GbavaDccZDQDqcD5wSAOyOhfIsH7
kcu4sA5Fi44wVSfAJX1Dms+wXfsmQu/sd3c9Gcyce9Hpy1cEW3vWbApLBE4K0aKX
/JHTdmkAJ20a4APQsfGH+UymyuZgr8d2eGmL9rVYKhT/c+Dow0lNAWYkvGf/MawM
CX3GdP6f6ZOR/anCPZ8nqZCE5AoFykGazvpCCSrvCOpU7o7GqxbAQkWWFcMp1FHW
9TFrj81WK18DeCfCNw7lR3sdMy/2o2nnSUAw3DFY4n/3Lek7FUmrBTHvXlWDot7T
TM9CzYGF840QhL5s5SMYS09YmeI0I34L7HJAi/+qli48SooGuL9RZ29TmzHIX69Y
2bgpS64j06p/AGEnfHAcT1LbpiFCPmO5cpXKv/t40GL5QO5d4WV698ysDGoPYUPO
8CPL85Y8cao56KGJLyOroGz0P1bo+RdNe5bN6xJJoTRn1Y9oUA+bQSnN8x9iuunF
9QZrAIHzNyDcRGzoqgDW+3bivOvIus/Dto/u1P3ap68kP2HTVsY=
=gOyi
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix for a problem with snapshot of a newly created subvolume
that can lead to inconsistent data under some circumstances. Kernel
6.5 added a performance optimization to skip transaction commit for
subvolume creation but this could end up with newer data on disk but
not linked to other structures.
The fix itself is an added condition, the rest of the patch is a
parameter added to several functions"
* tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:
1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
address 36007936, for the leaf/root of a new subvolume that has an ID
of 291. We mark the extent buffer as dirty, and at this point the
subvolume tree has a single node/leaf which is also its root (level 0);
2) We no longer commit the transaction used to create the subvolume at
create_subvol(). We used to, but that was recently removed in
commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
create");
3) The transaction used to create the subvolume has an ID of 33, so the
extent buffer 36007936 has a generation of 33;
4) Several updates happen to subvolume 291 during transaction 33, several
files created and its tree height changes from 0 to 1, so we end up with
a new root at level 1 and the extent buffer 36007936 is now a leaf of
that new root node, which is extent buffer 36048896.
The commit root remains as 36007936, since we are still at transaction
33;
5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
we end up at transaction.c:create_pending_snapshot(), in the critical
section of a transaction commit.
There we COW the root of subvolume 291, which is extent buffer 36048896.
The COW operation returns extent buffer 36048896, since there's no need
to COW because the extent buffer was created in this transaction and it
was not written yet.
The we call btrfs_copy_root() against the root node 36048896. During
this operation we allocate a new extent buffer to turn into the root
node of the snapshot, copy the contents of the root node 36048896 into
this snapshot root extent buffer, set the owner to 292 (the ID of the
snapshot), etc, and then we call btrfs_inc_ref(). This will create a
delayed reference for each leaf pointed by the root node with a
reference root of 292 - this includes a reference for the leaf
36007936.
After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.
Then we call btrfs_insert_dir_item(), to create the directory entry in
in the tree of subvolume 291 that points to the snapshot. This ends up
needing to modify leaf 36007936 to insert the respective directory
items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
we need to COW the leaf. We end up at btrfs_force_cow_block() and then
at update_ref_for_cow().
At update_ref_for_cow() we call btrfs_block_can_be_shared() which
returns false, despite the fact the leaf 36007936 is shared - the
subvolume's root and the snapshot's root point to that leaf. The
reason that it incorrectly returns false is because the commit root
of the subvolume is extent buffer 36007936 - it was the initial root
of the subvolume when we created it. So btrfs_block_can_be_shared()
which has the following logic:
int btrfs_block_can_be_shared(struct btrfs_root *root,
struct extent_buffer *buf)
{
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
buf != root->node && buf != root->commit_root &&
(btrfs_header_generation(buf) <=
btrfs_root_last_snapshot(&root->root_item) ||
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
return 1;
return 0;
}
Returns false (0) since 'buf' (extent buffer 36007936) matches the
root's commit root.
As a result, at update_ref_for_cow(), we don't check for the number
of references for extent buffer 36007936, we just assume it's not
shared and therefore that it has only 1 reference, so we set the local
variable 'refs' to 1.
Later on, in the final if-else statement at update_ref_for_cow():
static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf,
struct extent_buffer *cow,
int *last_ref)
{
(...)
if (refs > 1) {
(...)
} else {
(...)
btrfs_clear_buffer_dirty(trans, buf);
*last_ref = 1;
}
}
So we mark the extent buffer 36007936 as not dirty, and as a result
we don't write it to disk later in the transaction commit, despite the
fact that the snapshot's root points to it.
Attempting to access the leaf or dumping the tree for example shows
that the extent buffer was not written:
$ btrfs inspect-internal dump-tree -t 292 /dev/sdb
btrfs-progs v6.2.2
file tree key (292 ROOT_ITEM 33)
node 36110336 level 1 items 2 free space 119 generation 33 owner 292
node 36110336 flags 0x1(WRITTEN) backref revision 1
checksum stored a8103e3e
checksum calced a8103e3e
fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
key (256 INODE_ITEM 0) block 36007936 gen 33
key (257 EXTENT_DATA 0) block 36052992 gen 33
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
total bytes 107374182400
bytes used 38572032
uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
The respective on disk region is full of zeroes as the device was
trimmed at mkfs time.
Obviously 'btrfs check' also detects and complains about this:
$ btrfs check /dev/sdb
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
generation: 33 (33)
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
owner ref check failed [36007936 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
The following tree block(s) is corrupted in tree 292:
tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
root 292 root dir 256 not found
ERROR: errors found in fs roots
found 38572032 bytes used, error(s) found
total csum bytes: 16048
total tree bytes: 1265664
total fs tree bytes: 1118208
total extent tree bytes: 65536
btree space waste bytes: 562598
file data blocks allocated: 65978368
referenced 36569088
Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.
This can be reproduced with the following script:
$ cat test.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
# Use a filesystem with a 64K node size so that we have the same node
# size on every machine regardless of its page size (on x86_64 default
# node size is 16K due to the 4K page size, while on PPC it's 64K by
# default). This way we can make sure we are able to create a btree for
# the subvolume with a height of 2.
mkfs.btrfs -f -n 64K $DEV
mount $DEV $MNT
btrfs subvolume create $MNT/subvol
# Create a few empty files on the subvolume, this bumps its btree
# height to 2 (root node at level 1 and 2 leaves).
for ((i = 1; i <= 300; i++)); do
echo -n > $MNT/subvol/file_$i
done
btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap
umount $DEV
btrfs check $DEV
Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):
$ ./test.sh
Create subvolume '/mnt/sdi/subvol'
Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
owner ref check failed [30539776 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
Wrong generation of child node/leaf, wanted: 5, have: 7
root 257 root dir 256 not found
ERROR: errors found in fs roots
found 917504 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 851968
total fs tree bytes: 393216
total extent tree bytes: 65536
btree space waste bytes: 736550
file data blocks allocated: 0
referenced 0
A test case for fstests will follow soon.
Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUuntEACgkQxWXV+ddt
WDssdQ/9Fo6tN+MCH5ISAcvsW6WvBUWT62MrDnzawh96QhUf3NYf9sjME7QqwHHv
w60SDiqRlAd5UzxdIPC4Qa/6GVZZh2yFLzew3l8Fh6anxhjO5argdsfx1Wv4ADk/
FHI8zs6EZiTlk0JmEnHNclliZfaDutBRQiPL+HZx4+FCrJweS5U/4Jpg7vdfp/tp
eWdJ51pDM8iyqGTsP7a7/VaL5wLoJhbdD9wYgupZUhvY6g2tCZ71/hNiWdbKtCK8
EyQxXiAlc+k1UflOx6Xip1HLIh6HmKwxntXxRy+yj4IvJ3PhI+KS5Nqdl35TszN9
6y9MRo3oCU+2y89Yay4HZZb6DLxcAi6VwpyswnntodFQ+ICXEw7ZaNi3rSO+FCO8
KxfhLniMD5gflRP4gy+o9iZxgVQ75nmiPgBt53r+sAKZ7lv86x84DJ/ZUqL8EV0e
OJhxdzhoT0Ks8OstIuE87fgzUCjqMcgAavxcn1psKBC6/JY9v6OneA8qauSswkKs
P+diJIqZHHOBQVKFedqdIrDU6AstivSBq0ToPBslbBlcy97EO4IRoiMIw+QgHPYn
CHsPHtooBmxPyw+4HTFuzY1NIrSeUFYxTDAs9p5kMPmltkVAlLPcrpGZVya9tjds
l/YuwY2f0C9Q1pjcAc9FcN8Y5kLRCYNEWMl0M1VpC22KgjRN6r0=
=GrNu
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix a bug in chunk size decision that could lead to suboptimal
placement and filling patterns"
* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.
After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:
1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was
1 GiB for non-zoned data chunks and not configurable.
2. btrfs_create_chunk calls gather_device_info which consumes
and produces more fields of chunk_ctl.
3. gather_device_info multiplies ctl->max_stripe_len by
ctl->dev_stripes (which is 1 in all cases except dup)
and calls find_free_dev_extent with that number as num_bytes.
4. find_free_dev_extent locates the first dev_extent hole on
a device which is at least as large as num_bytes. With default
max_chunk_size from f6fca3917b4d, it finds the first hole which is
longer than 10 GiB, or the largest hole if that hole is shorter
than 10 GiB. This is different from the pre-f6fca3917b4d
behavior, where num_bytes is 1 GiB, and find_free_dev_extent
may choose a different hole.
5. gather_device_info repeats step 4 with all devices to find
the first or largest dev_extent hole that can be allocated on
each device.
6. gather_device_info sorts the device list by the hole size
on each device, using total unallocated space on each device to
break ties, then returns to btrfs_create_chunk with the list.
7. btrfs_create_chunk calls decide_stripe_size_regular.
8. decide_stripe_size_regular finds the largest stripe_len that
fits across the first nr_devs device dev_extent holes that were
found by gather_device_info (and satisfies other constraints
on stripe_len that are not relevant here).
9. decide_stripe_size_regular caps the length of the stripe it
computed at 1 GiB. This cap appeared in 5da431b71d4b to correct
one of the other regressions introduced in f6fca3917b4d.
10. btrfs_create_chunk creates a new chunk with the above
computed size and number of devices.
At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.
Consider a filesystem using raid1 profile with 3 devices. After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
Device 2: [__________] (10 GiB contig unallocated)
Device 3: [__________] (10 GiB contig unallocated)
Before f6fca3917b4d, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):
[after 0 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [__________] (10 GiB)
Device 3: [__________] (10 GiB)
[after 1 chunk allocation]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
Device 2: [12] [_________] (9 GiB)
Device 3: [__________] (10 GiB)
[after 2 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [12] [_________] (9 GiB)
Device 3: [13] [_________] (9 GiB)
[after 3 chunk allocations]
Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
Device 2: [12] [12] [________] (8 GiB)
Device 3: [13] [_________] (9 GiB)
[...]
[after 12 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 13 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 14 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
[after 15 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
This allocates all of the space with no waste. The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space. This keeps
usable space on each device equal so they can all be filled completely.
After f6fca3917b4d, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:
[after 1 chunk allocation]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [_________] (9 GiB)
Device 3: [23] [_________] (9 GiB)
[after 2 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [________] (8 GiB)
Device 3: [23] [23] [________] (8 GiB)
[after 3 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [_______] (7 GiB)
Device 3: [23] [23] [23] [_______] (7 GiB)
[...]
[after 9 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 10 chunk allocations]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 11 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
No further allocations are possible, with 8 GiB wasted (4 GiB of data
space). The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.
Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b4d, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.
Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.
To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
While researching the background of the earlier commits, I found that an
identical fix was already proposed at:
https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
The previous review missed one detail: ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect. ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.
Signed-off-by: David Sterba <dsterba@suse.com>
During log syncing, when we start updating the log root tree we compute
an index value, stored in variable 'index2', once we lock the log root
tree's mutex. This value depends on the log root's log_transid. And
shortly after we compute again the same value for 'index2' - the value
is exactly the same since we haven't released the mutex and therefore
the log_transid of the log root is the same as before.
This second 'index2' computation became pointless after commit
a93e01682e28 ("btrfs: remove no longer needed use of log_writers for the
log root tree"). So remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable dirty is initialized with a value that is never read, it
is being re-assigned later on. Remove the redundant initialization.
Cleans up clang scan build warning:
fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds sysfs objects to indicate temp_fsid feature support and
its status.
/sys/fs/btrfs/features/temp_fsid
/sys/fs/btrfs/<UUID>/temp_fsid
For example:
Consider two cloned and mounted devices.
$ blkid /dev/sdc[1-2]
/dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
/dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
One gets actual fsid, and the other gets the temp_fsid when
mounted.
$ btrfs filesystem show -m
Label: none uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5
Total devices 1 FS bytes used 54.14MiB
devid 1 size 300.00MiB used 144.00MiB path /dev/sdc1
Label: none uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a
Total devices 1 FS bytes used 54.14MiB
devid 1 size 300.00MiB used 144.00MiB path /dev/sdc2
Their sysfs as below.
$ cat /sys/fs/btrfs/features/temp_fsid
0
$ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid
0
$ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid
1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device addition operation will transform the cloned temp-fsid mounted
device into a multi-device filesystem. Therefore, it is marked as
unsupported.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A seed device is an integral component of the sprout device, which
functions as a multi-device filesystem. Therefore, temp-fsid feature
is not supported.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the comment to explain the relationship between temp_fsid, fsid,
and metadata_uuid.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, if we get an error when updating the log root, we
check first if the log root tree context is in a log context list, and if
so it deletes from the log root tree context from the list. This check
however is pointless because at this moment the context is always in a
list, he have just added it to a context list. The check became pointless
after commit a93e01682e28 ("btrfs: remove no longer needed use of
log_writers for the log root tree"). So remove this now pointless empty
list check.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the comment for the lock named "lock" in struct btrfs_inode because
it does not mention that the fields "delalloc_bytes", "defrag_bytes",
"csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
by that lock.
Also add a comment on top of each field protected by this lock to mention
that the lock protects them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
now that fs_info->last_trans_committed is read using READ_ONCE(), with the
helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
with the helper btrfs_set_last_trans_committed().
This barrier was introduced in 2011, by commit a4abeea41adf ("Btrfs: kill
trans_mutex"), but even back then it was not correct since the writer side
(in btrfs_commit_transaction()), did not issue a pairing memory barrier
after it updated fs_info->last_trans_committed.
So remove this barrier.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.
In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.
However there are other readers that are neither holding the lock nor
holding a transaction handle open:
1) When reading an inode from disk, at btrfs_read_locked_inode();
2) When reading the generation to expose it to sysfs, at
btrfs_generation_show();
3) Early in the fsync path, at skip_inode_logging();
4) When creating a hole at btrfs_cont_expand(), during write paths,
truncate and reflinking;
5) In the fs_info ioctl (btrfs_ioctl_fs_info());
6) While mounting the filesystem, in the open_ctree() path. In these
cases it's safe to directly read fs_info->generation as no one
can concurrently start a transaction and update fs_info->generation.
In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.
Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Guilherme's previous work [1] aimed at the mounting of cloned devices
using a superblock flag SINGLE_DEV during mkfs.
[1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/
Building upon this work, here is in memory only approach. As it mounts
we determine if the same fsid is already mounted if then we generate a
random temp fsid which shall be used the mount, in memory only not
written to the disk. We distinguish devices by devt.
Example:
$ fallocate -l 300m ./disk1.img
$ mkfs.btrfs -f ./disk1.img
$ cp ./disk1.img ./disk2.img
$ cp ./disk1.img ./disk3.img
$ mount -o loop ./disk1.img /btrfs
$ mount -o ./disk2.img /btrfs1
$ mount -o ./disk3.img /btrfs2
$ btrfs fi show -m
Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop0
Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop1
Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop2
Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for adding support to mount multiple single-disk
btrfs filesystems with the same FSID, wrap find_fsid() into
find_fsid_by_disk().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.
So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.
So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous commit created a hole in struct btrfs_inode, we can move
outstanding_extents there. This reduces size by 8 bytes from 1120 to
1112 on a release config.
Signed-off-by: David Sterba <dsterba@suse.com>
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.
Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.
Signed-off-by: David Sterba <dsterba@suse.com>
A user reported some unpleasant behavior with very small file systems.
The reproducer is this
$ mkfs.btrfs -f -m single -b 8g /dev/vdb
$ mount /dev/vdb /mnt/test
$ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
This will result in usage that looks like this
Overall:
Device size: 8.00GiB
Device allocated: 8.00GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Device slack: 2.00GiB
Used: 5.47GiB
Free (estimated): 2.52GiB (min: 2.52GiB)
Free (statfs, df): 0.00B
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 5.50MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
/dev/vdb 7.99GiB
Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
/dev/vdb 8.00MiB
System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
/dev/vdb 4.00MiB
Unallocated:
/dev/vdb 1.00MiB
As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.
On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here. Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data. This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.
Address this by adjusting the overcommit logic. Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation. This will allow us to stop overcommitting before we
potentially lose this space to a data allocation. With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My overcommit patch exposed a bug with btrfs/177 [1]. The problem here is
that when we grow the device we're not adding to ->free_chunk_space, so
subsequent allocations can cause ->free_chunk_space to wrap, which
causes problems in can_overcommit because we add this to ->total_bytes,
which causes the counter to wrap and gives us an unexpected ENOSPC.
Fix this by properly updating ->free_chunk_space with the new available
space in btrfs_grow_device.
[1] First version of the fix:
https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two bugs in how we adjust ->free_chunk_space in
btrfs_shrink_device. First we're removing the entire diff between
new_size and old_size from ->free_chunk_space. This only works if we're
reducing the free area, which we could potentially not be. So adjust
the math to only subtract the diff in the free space from
->free_chunk_space.
Additionally in the error case we're unconditionally adding the diff
back into ->free_chunk_space, which we need to only do if this device is
writeable.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, at find_first_extent_bit(), when we are given a cached extent
state that happens to have its end offset match the desired range start,
we find the next extent state using that cached state, with next_state()
calls, and then return it.
We then try to cache that next state by calling cache_state_if_flags(),
but that will not cache the state because we haven't reset *cached_state
to NULL, so we end up with the cached_state unchanged, and if the caller
is iterating over extent states in the io tree, its next call to
find_first_extent_bit() will not use the current cached state as its end
offset does not match the minimum start range offset, therefore the cached
state is reset and we have to search the rbtree to find the next suitable
extent state record.
So fix this by resetting the cached state to NULL (and dropping our ref
on it) when we have a suitable cached state and we found a next state by
using next_state() starting from the cached state. This makes use cases
of calling find_first_extent_bit() to go over all ranges in the io tree
to do a single rbtree full search, only on the first call, and the next
calls will just do next_state() (rb_next() wrapper) calls, which is more
efficient.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.
So use extent_io_tree_release() instead of clear_extent_bits().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>