IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 4eb4e85c4f818491efc67e9373aa16b123c3f522 upstream.
If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from
swap) or btrfs_relocate_chunk systematically fails (from lack of
space), then this worker becomes an infinite loop.
At the very least, this strands the cleaner thread, but can also result
in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the
reclaim_bgs_lock mutex is not contended.
I believe the best long term fix is to manage reclaim via work queue,
where we queue up a relocation on the triggering condition and re-queue
on failure. In the meantime, this is an easy fix to apply to avoid the
immediate pain.
Fixes: 7e2718099438 ("btrfs: reinsert BGs failed to reclaim")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fb33eb2ef0d88e75564983ef057b44c5b7e4fded ]
Qgroup extent records are created when delayed ref heads are created and
then released after accounting extents at btrfs_qgroup_account_extents(),
called during the transaction commit path.
If a transaction is aborted we free the qgroup records by calling
btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(),
unless we don't have delayed references. We are incorrectly assuming
that no delayed references means we don't have qgroup extents records.
We can currently have no delayed references because we ran them all
during a transaction commit and the transaction was aborted after that
due to some error in the commit path.
So fix this by ensuring we btrfs_qgroup_destroy_extent_records() at
btrfs_destroy_delayed_refs() even if we don't have any delayed references.
Reported-by: syzbot+0fecc032fa134afd49df@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000004e7f980619f91835@google.com/
Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9af503d91298c3f2945e73703f0e00995be08c30 upstream.
The previous patch that replaced BUG_ON by error handling forgot to
unlock the mutex in the error path.
Link: https://lore.kernel.org/all/Zh%2fHpAGFqa7YAFuM@duo.ucw.cz
Reported-by: Pavel Machek <pavel@denx.de>
Fixes: 7411055db5ce ("btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()")
CC: stable@vger.kernel.org
Reviewed-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ff09b6b8c2fb6b3edda4ffaa173153a40653067 upstream.
When compiling with gcc version 14.0.0 20231220 (experimental)
and W=1, I've noticed the following warning:
fs/btrfs/send.c: In function 'btrfs_ioctl_send':
fs/btrfs/send.c:8208:44: warning: 'kvcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
8208 | sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
| ^
Since 'n' and 'size' arguments of 'kvcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the result
and so this is not a bug. But it's still worth to fix it.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6e68de0bb0ed59e0554a0c15ede7308c47351e2d ]
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3c6f0c5ecc8910d4ffb0dfe85609ebc0c91c8f34 ]
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2f1aeab9fca1a5f583be1add175d1ee95c213cfa ]
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.
In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.
Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.
Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
instrument_copy_to_user include/linux/instrumented.h:114 [inline]
_copy_to_user+0xbc/0x110 lib/usercopy.c:40
copy_to_user include/linux/uaccess.h:191 [inline]
btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was created at:
__kmalloc_large_node+0x231/0x370 mm/slub.c:3921
__do_kmalloc_node mm/slub.c:3954 [inline]
__kmalloc_node+0xb07/0x1060 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
kvmalloc_node+0xc0/0x2d0 mm/util.c:634
kvmalloc include/linux/slab.h:766 [inline]
init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Bytes 40-65535 of 65536 are uninitialized
Memory access of size 65536 starts at ffff888045a40000
This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.
Fix this by using kvzalloc() which zeroes out the memory on allocation.
CC: stable@vger.kernel.org # 4.14+
Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 71537e35c324ea6fbd68377a4f26bb93a831ae35 ]
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.
Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 141fb8cd206ace23c02cd2791c6da52c1d77d42a upstream.
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.
Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.
Fixes: 8287475a2055 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3c6ee34c6f9cd12802326da26631232a61743501 ]
Change BUG_ON to proper error handling if building the path buffer
fails. The pointers are not printed so we don't accidentally leak kernel
addresses.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 26b66d1d366a375745755ca7365f67110bbf6bd5 ]
The get_parent handler looks up a parent of a given dentry, this can be
either a subvolume or a directory. The search is set up with offset -1
but it's never expected to find such item, as it would break allowed
range of inode number or a root id. This means it's a corruption (ext4
also returns this error code).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7411055db5ce64f836aaffd422396af0075fdc99 ]
The unhandled case in btrfs_relocate_sys_chunks() loop is a corruption,
as it could be caused only by two impossible conditions:
- at first the search key is set up to look for a chunk tree item, with
offset -1, this is an inexact search and the key->offset will contain
the correct offset upon a successful search, a valid chunk tree item
cannot have an offset -1
- after first successful search, the found_key corresponds to a chunk
item, the offset is decremented by 1 before the next loop, it's
impossible to find a chunk item there due to alignment and size
constraints
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 74098a989b9c3370f768140b7783a7aaec2759b3 upstream.
At the moment scrub_supers() doesn't grab the super block's location via
the zoned device aware btrfs_sb_log_location() but via btrfs_sb_offset().
This leads to checksum errors on 'scrub' as we're not accessing the
correct location of the super block.
So use btrfs_sb_log_location() for getting the super blocks location on
scrub.
Reported-by: WA AM <waautomata@gmail.com>
Link: http://lore.kernel.org/linux-btrfs/CANU2Z0EvUzfYxczLgGUiREoMndE9WdQnbaawV5Fv5gNXptPUKw@mail.gmail.com
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ae6bd7f9b46a29af52ebfac25d395757e2031d0d ]
At contains_pending_extent() the value of the end offset of a chunk we
found in the device's allocation state io tree is inclusive, so when
we calculate the length we pass to the in_range() macro, we must sum
1 to the expression "physical_end - physical_offset".
In practice the wrong calculation should be harmless as chunks sizes
are never 1 byte and we should never have 1 byte ranges of unallocated
space. Nevertheless fix the wrong calculation.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9845664b9ee47ce7ee7ea93caf47d39a9d4552c4 upstream.
There's a syzbot report that device name buffers passed to device
replace are not properly checked for string termination which could lead
to a read out of bounds in getname_kernel().
Add a helper that validates both source and target device name buffers.
For devid as the source initialize the buffer to empty string in case
something tries to read it later.
This was originally analyzed and fixed in a different way by Edward Adam
Davis (see links).
Link: https://lore.kernel.org/linux-btrfs/000000000000d1a1d1060cc9c5e7@google.com/
Link: https://lore.kernel.org/linux-btrfs/tencent_44CA0665C9836EF9EEC80CB9E7E206DF5206@qq.com/
CC: stable@vger.kernel.org # 4.19+
CC: Edward Adam Davis <eadavis@qq.com>
Reported-and-tested-by: syzbot+33f23b49ac24f986c9e8@syzkaller.appspotmail.com
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit efcfcbc6a36195c42d98e0ee697baba36da94dc8 ]
The implementation of XXHASH is now CPU only but still fast enough to be
considered for the synchronous checksumming, like non-generic crc32c.
A userspace benchmark comparing it to various implementations (patched
hash-speedtest from btrfs-progs):
Block size: 4096
Iterations: 1000000
Implementation: builtin
Units: CPU cycles
NULL-NOP: cycles: 73384294, cycles/i 73
NULL-MEMCPY: cycles: 228033868, cycles/i 228, 61664.320 MiB/s
CRC32C-ref: cycles: 24758559416, cycles/i 24758, 567.950 MiB/s
CRC32C-NI: cycles: 1194350470, cycles/i 1194, 11773.433 MiB/s
CRC32C-ADLERSW: cycles: 6150186216, cycles/i 6150, 2286.372 MiB/s
CRC32C-ADLERHW: cycles: 626979180, cycles/i 626, 22427.453 MiB/s
CRC32C-PCL: cycles: 466746732, cycles/i 466, 30126.699 MiB/s
XXHASH: cycles: 860656400, cycles/i 860, 16338.188 MiB/s
Comparing purely software implementation (ref), current outdated
accelerated using crc32q instruction (NI), optimized implementations by
M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775)
and the best one that was taken from kernel using the PCLMULQDQ
instruction (PCL).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f884a9f9e59206a2d41f265e7e403f080d10b493 upstream.
When some ioctl flags are checked we return EOPNOTSUPP, like for
BTRFS_SCRUB_SUPPORTED_FLAGS, BTRFS_SUBVOL_CREATE_ARGS_MASK or fallocate
modes. The EINVAL is supposed to be for a supported but invalid
values or combination of options. Fix that when checking send flags so
it's consistent with the rest.
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5rryOLzp3EKq8RTbjMHMHeaJubfpsVLF6H4qJnKCUR1w@mail.gmail.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a8df35619948bd8363d330c20a90c9a7fbff28c0 upstream.
If a subvolume still exists, forbid deleting its qgroup 0/subvolid.
This behavior generally leads to incorrect behavior in squotas and
doesn't have a legitimate purpose.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e03ee2fe873eb68c1f9ba5112fee70303ebf9dfb upstream.
[BUG]
There is a syzbot crash, triggered by the ASSERT() during subvolume
creation:
assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:1319!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60
<TASK>
btrfs_get_new_fs_root+0xd3/0xf0
create_subvol+0xd02/0x1650
btrfs_mksubvol+0xe95/0x12b0
__btrfs_ioctl_snap_create+0x2f9/0x4f0
btrfs_ioctl_snap_create+0x16b/0x200
btrfs_ioctl+0x35f0/0x5cf0
__x64_sys_ioctl+0x19d/0x210
do_syscall_64+0x3f/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
---[ end trace 0000000000000000 ]---
[CAUSE]
During create_subvol(), after inserting root item for the newly created
subvolume, we would trigger btrfs_get_new_fs_root() to get the
btrfs_root of that subvolume.
The idea here is, we have preallocated an anonymous device number for
the subvolume, thus we can assign it to the new subvolume.
But there is really nothing preventing things like backref walk to read
the new subvolume.
If that happens before we call btrfs_get_new_fs_root(), the subvolume
would be read out, with a new anonymous device number assigned already.
In that case, we would trigger ASSERT(), as we really expect no one to
read out that subvolume (which is not yet accessible from the fs).
But things like backref walk is still possible to trigger the read on
the subvolume.
Thus our assumption on the ASSERT() is not correct in the first place.
[FIX]
Fix it by removing the ASSERT(), and just free the @anon_dev, reset it
to 0, and continue.
If the subvolume tree is read out by something else, it should have
already get a new anon_dev assigned thus we only need to free the
preallocated one.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0c309d66dacddf8ce939b891d9ead4a8e21ad6f0 upstream.
Creating a qgroup 0/subvolid leads to various races and it isn't
helpful, because you can't specify a subvol id when creating a subvol,
so you can't be sure it will be the right one. Any requirements on the
automatic subvol can be gratified by using a higher level qgroup and the
inheritance parameters of subvol creation.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2c7d2a230237e7c43fa067d695937b7e484bb92a ]
This adds the initial definition of the EXTENT_TREE_V2 incompat feature
flag. This also hides the support behind CONFIG_BTRFS_DEBUG.
THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
DEVELOPER OR A TESTER.
The format is in flux and will be added in stages, any fs will need to
be re-made between updates to the format.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 7081929ab257 ("btrfs: don't abort filesystem when attempting to snapshot deleted subvolume")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 173431b274a9a54fc10b273b46e67f46bcf62d2e upstream.
Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
This is not really to enhance fuzzing tests, but as a preparation for
future expansion on btrfs_ioctl_defrag_range_args.
In the future we're going to add new members, allowing more fine tuning
for btrfs defrag. Without the -ENONOTSUPP error, there would be no way
to detect if the kernel supports those new defrag features.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a208b3f132b48e1f94f620024e66fea635925877 upstream.
There's a warning in btrfs_issue_discard() when the range is not aligned
to 512 bytes, originally added in 4d89d377bbb0 ("btrfs:
btrfs_issue_discard ensure offset/length are aligned to sector
boundaries"). We can't do sub-sector writes anyway so the adjustment is
the only thing that we can do and the warning is unnecessary.
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3324d0547861b16cf436d54abba7052e0c8aa9de upstream.
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8e7f82deb0c0386a03b62e30082574347f8b57d5 upstream.
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
are not holding the directory's inode locked, and this can result in later
attempting to add two entries to the directory with the same index number,
resulting in a transaction abort, with -EEXIST (-17), when inserting the
second delayed dir index. This results in a trace like the following:
Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
Sep 11 22:34:59 myhostname kernel: FS: 00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
Sep 11 22:34:59 myhostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
Sep 11 22:34:59 myhostname kernel: Call Trace:
Sep 11 22:34:59 myhostname kernel: <TASK>
Sep 11 22:34:59 myhostname kernel: ? die+0x36/0x90
Sep 11 22:34:59 myhostname kernel: ? do_trap+0xda/0x100
Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: ? do_error_trap+0x6a/0x90
Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: ? exc_invalid_op+0x50/0x70
Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: ? asm_exc_invalid_op+0x1a/0x20
Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260
Sep 11 22:34:59 myhostname kernel: btrfs_insert_dir_item+0x200/0x280
Sep 11 22:34:59 myhostname kernel: btrfs_add_link+0xab/0x4f0
Sep 11 22:34:59 myhostname kernel: ? ktime_get_real_ts64+0x47/0xe0
Sep 11 22:34:59 myhostname kernel: btrfs_create_new_inode+0x7cd/0xa80
Sep 11 22:34:59 myhostname kernel: btrfs_symlink+0x190/0x4d0
Sep 11 22:34:59 myhostname kernel: ? schedule+0x5e/0xd0
Sep 11 22:34:59 myhostname kernel: ? __d_lookup+0x7e/0xc0
Sep 11 22:34:59 myhostname kernel: vfs_symlink+0x148/0x1e0
Sep 11 22:34:59 myhostname kernel: do_symlinkat+0x130/0x140
Sep 11 22:34:59 myhostname kernel: __x64_sys_symlinkat+0x3d/0x50
Sep 11 22:34:59 myhostname kernel: do_syscall_64+0x5d/0x90
Sep 11 22:34:59 myhostname kernel: ? syscall_exit_to_user_mode+0x2b/0x40
Sep 11 22:34:59 myhostname kernel: ? do_syscall_64+0x6c/0x90
Sep 11 22:34:59 myhostname kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
The race leading to the problem happens like this:
1) Directory inode X is loaded into memory, its ->index_cnt field is
initialized to (u64)-1 (at btrfs_alloc_inode());
2) Task A is adding a new file to directory X, holding its vfs inode lock,
and calls btrfs_set_inode_index() to get an index number for the entry.
Because the inode's index_cnt field is set to (u64)-1 it calls
btrfs_inode_delayed_dir_index_count() which fails because no dir index
entries were added yet to the delayed inode and then it calls
btrfs_set_inode_index_count(). This functions finds the last dir index
key and then sets index_cnt to that index value + 1. It found that the
last index key has an offset of 100. However before it assigns a value
of 101 to index_cnt...
3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
and sees index_cnt still with a value of (u64)-1. Because of that it
calls btrfs_inode_delayed_dir_index_count() which fails since no dir
index entries were added to the delayed inode yet, and then it also
calls btrfs_set_inode_index_count(). This also finds that the last
index key has an offset of 100, and before it assigns the value 101
to the index_cnt field of inode X...
4) Task A assigns a value of 101 to index_cnt. And then the code flow
goes to btrfs_set_inode_index() where it increments index_cnt from
101 to 102. Task A then creates a delayed dir index entry with a
sequence number of 101 and adds it to the delayed inode;
5) Task B assigns 101 to the index_cnt field of inode X;
6) At some later point when someone tries to add a new entry to the
directory, btrfs_set_inode_index() will return 101 again and shortly
after an attempt to add another delayed dir index key with index
number 101 will fail with -EEXIST resulting in a transaction abort.
Fix this by locking the inode at btrfs_get_dir_last_index(), which is only
only used when opening a directory or attempting to lseek on it.
Reported-by: ken <ken@bllue.org>
Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Eugeniu Rosca <eugeniu.rosca@bosch.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e60aa5da14d01fed8411202dbe4adf6c44bd2a57 upstream.
When opening a directory we find what's the index of its last entry and
then store it in the directory's file handle private data (struct
btrfs_file_private::last_index), so that in the case new directory entries
are added to a directory after an opendir(3) call we don't end up in an
infinite loop (see commit 9b378f6ad48c ("btrfs: fix infinite directory
reads")) when calling readdir(3).
However once rewinddir(3) is called, POSIX states [1] that any new
directory entries added after the previous opendir(3) call, must be
returned by subsequent calls to readdir(3):
"The rewinddir() function shall reset the position of the directory
stream to which dirp refers to the beginning of the directory.
It shall also cause the directory stream to refer to the current
state of the corresponding directory, as a call to opendir() would
have done."
We currently don't refresh the last_index field of the struct
btrfs_file_private associated to the directory, so after a rewinddir(3)
we are not returning any new entries added after the opendir(3) call.
Fix this by finding the current last index of the directory when llseek
is called against the directory.
This can be reproduced by the following C program provided by Ian Johnson:
#include <dirent.h>
#include <stdio.h>
int main(void) {
DIR *dir = opendir("test");
FILE *file;
file = fopen("test/1", "w");
fwrite("1", 1, 1, file);
fclose(file);
file = fopen("test/2", "w");
fwrite("2", 1, 1, file);
fclose(file);
rewinddir(dir);
struct dirent *entry;
while ((entry = readdir(dir))) {
printf("%s\n", entry->d_name);
}
closedir(dir);
return 0;
}
Reported-by: Ian Johnson <ian@ianjohnson.dev>
Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Eugeniu Rosca <eugeniu.rosca@bosch.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 357950361cbc6d54fb68ed878265c647384684ae upstream.
When opening a directory for reading it, we set the last index where we
stop iteration to the value in struct btrfs_inode::index_cnt. That value
does not match the index of the most recently added directory entry but
it's instead the index number that will be assigned the next directory
entry.
This means that if after the call to opendir(3) new directory entries are
added, a readdir(3) call will return the first new directory entry. This
is fine because POSIX says the following [1]:
"If a file is removed from or added to the directory after the most
recent call to opendir() or rewinddir(), whether a subsequent call to
readdir() returns an entry for that file is unspecified."
For example for the test script from commit 9b378f6ad48c ("btrfs: fix
infinite directory reads"), where we have 2000 files in a directory, ext4
doesn't return any new directory entry after opendir(3), while xfs returns
the first 13 new directory entries added after the opendir(3) call.
If we move to a shorter example with an empty directory when opendir(3) is
called, and 2 files added to the directory after the opendir(3) call, then
readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
files (but in a different order). A test program for this, reported by
Ian Johnson, is the following:
#include <dirent.h>
#include <stdio.h>
int main(void) {
DIR *dir = opendir("test");
FILE *file;
file = fopen("test/1", "w");
fwrite("1", 1, 1, file);
fclose(file);
file = fopen("test/2", "w");
fwrite("2", 1, 1, file);
fclose(file);
struct dirent *entry;
while ((entry = readdir(dir))) {
printf("%s\n", entry->d_name);
}
closedir(dir);
return 0;
}
To make this less odd, change the behaviour to never return new entries
that were added after the opendir(3) call. This is done by setting the
last_index field of the struct btrfs_file_private attached to the
directory's file handle with a value matching btrfs_inode::index_cnt
minus 1, since that value always matches the index of the next new
directory entry and not the index of the most recently added entry.
[1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html
Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Eugeniu Rosca <eugeniu.rosca@bosch.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b378f6ad48cfa195ed868db9123c09ee7ec5ea2 upstream.
The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.
The following C program and test script reproduce the problem:
$ cat /mnt/readdir_prog.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
DIR *dir = opendir(".");
struct dirent *dd;
while ((dd = readdir(dir))) {
printf("%s\n", dd->d_name);
rename(dd->d_name, "TEMPFILE");
rename("TEMPFILE", dd->d_name);
}
closedir(dir);
}
$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
echo -n > $MNT/testdir/file_$i
done
cd $MNT/testdir
/mnt/readdir_prog
cd /mnt
umount $MNT
This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.
So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.
Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 5.15
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ Resolve a conflict due to member changes in 96d89923fa94 ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Eugeniu Rosca <eugeniu.rosca@bosch.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2b0122aaa800b021e36027d7f29e206f87c761d6 upstream.
The value set as scrub_speed_max accepts size with suffixes
(k/m/g/t/p/e) but we should still validate it for trailing characters,
similar to what we do with chunk_size_store.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a8892fd71933126ebae3d60aec5918d4dceaae76 upstream.
Our btrfs subvolume snapshot <source> <destination> utility enforces
that <source> is the root of the subvolume, however this isn't enforced
in the kernel. Update the kernel to also enforce this limitation to
avoid problems with other users of this ioctl that don't have the
appropriate checks in place.
Reported-by: Martin Michaelis <code@mgjm.de>
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit a480eb262b178e2baffa758b9cd5bde2d8470287 which is
commit 2db313205f8b96eea467691917138d646bb50aef upstream.
As pointed out by many, the disk_super structure is NOT initialized
before it is dereferenced in the function
fs/btrfs/disk-io.c:open_ctree() that this commit adds, so something went
wrong here.
Revert it for now until it gets straightened out.
Link: https://lore.kernel.org/r/5b0eb360-3765-40e1-854a-9da6d97eb405@roeck-us.net
Link: https://lore.kernel.org/r/20231209172836.GA2154579@dev-arch.thelio-3990X
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Anand Jain <anand.jain@oracle.com>
Cc: Qu Wenruo <wqu@suse.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7d410d5efe04e42a6cd959bfe6d59d559fdf8b25 upstream.
When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity
checks to verify we found a chunk map and that map found covers the
logical address the caller passed in. However the messages aren't very
clear in the sense that don't mention the issue is with a chunk map and
one of them prints the 'length' argument as if it were the end offset of
the requested range (while the in the string format we use %llu-%llu
which suggests a range, and the second %llu-%llu is actually a range for
the chunk map). So improve these two details in the error messages.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0ac1d13a55eb37d398b63e6ff6db4a09a2c9128c upstream.
kernel_write() requires the caller to ensure that the file is writable.
Let's do that directly after looking up the ->send_fd.
We don't need a separate bailout path because the "out" path already
does fput() if ->send_filp is non-NULL.
This has no security impact for two reasons:
- the ioctl requires CAP_SYS_ADMIN
- __kernel_write() bails out on read-only files - but only since 5.8,
see commit a01ac27be472 ("fs: check FMODE_WRITE in __kernel_write")
Reported-and-tested-by: syzbot+12e098239d20385264d3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=12e098239d20385264d3
Fixes: 31db9f7c23fb ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5fba5a571858ce2d787fdaf55814e42725bfa895 upstream.
At btrfs_get_chunk_map() we get the extent map for the chunk that contains
the given logical address stored in the 'logical' argument. Then we do
sanity checks to verify the extent map contains the logical address. One
of these checks verifies if the extent map covers a range with an end
offset behind the target logical address - however this check has an
off-by-one error since it will consider an extent map whose start offset
plus its length matches the target logical address as inclusive, while
the fact is that the last byte it covers is behind the target logical
address (by 1).
So fix this condition by using '<=' rather than '<' when comparing the
extent map's "start + length" against the target logical address.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f91192cd68591c6b037da345bc9fcd5e50540358 upstream.
In btrfs_ref_tree_mod(), when !parent 're' was allocated through
kmalloc(). In the following code, if an error occurs, the execution will
be redirected to 'out' or 'out_unlock' and the function will be exited.
However, on some of the paths, 're' are not deallocated and may lead to
memory leaks.
For example: lookup_block_entry() for 'be' returns NULL, the out label
will be invoked. During that flow ref and 'ra' are freed but not 're',
which can potentially lead to a memory leak.
CC: stable@vger.kernel.org # 5.10+
Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f
Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2db313205f8b96eea467691917138d646bb50aef upstream.
There is a feature request to add dmesg output when unmounting a btrfs.
There are several alternative methods to do the same thing, but with
their own problems:
- Use eBPF to watch btrfs_put_super()/open_ctree()
Not end user friendly, they have to dip their head into the source
code.
- Watch for directory /sys/fs/<uuid>/
This is way more simple, but still requires some simple device -> uuid
lookups. And a script needs to use inotify to watch /sys/fs/.
Compared to all these, directly outputting the information into dmesg
would be the most simple one, with both device and UUID included.
And since we're here, also add the output when mounting a filesystem for
the first time for parity. A more fine grained monitoring of subvolume
mounts should be done by another layer, like audit.
Now mounting a btrfs with all default mkfs options would look like this:
[81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
[81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm
[81.908258] BTRFS info (device dm-8): using free space tree
[81.912644] BTRFS info (device dm-8): auto enabling async discard
[81.913277] BTRFS info (device dm-8): checking UUID tree
[91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
CC: stable@vger.kernel.org # 5.4+
Link: https://github.com/kdave/btrfs-progs/issues/689
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 11aeb97b45ad2e0040cbb2a589bc403152526345 upstream.
We have a random schedule_timeout() if the current transaction is
committing, which seems to be a holdover from the original delalloc
reservation code.
Remove this, we have the proper flushing stuff, we shouldn't be hoping
for random timing things to make everything work. This just induces
latency for no reason.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dec96fc2dcb59723e041416b8dc53e011b4bfc2e ]
In the tree search v2 ioctl we use the type size_t, which is an unsigned
long, to track the buffer size in the local variable 'buf_size'. An
unsigned long is 32 bits wide on a 32 bits architecture. The buffer size
defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later
try to copy the local variable 'buf_size' to the argument struct, when
the search returns -EOVERFLOW, we copy only 32 bits which will be a
problem on big endian systems.
Fix this by using a u64 type for the buffer sizes, not only at
btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain
so that we can use the u64 at btrfs_ioctl_tree_search_v2().
Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e36f94914021e58ee88a8856c7fdf35adf9c7ee1 ]
At btrfs_realloc_node() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger two WARN_ON(). This however is a
critical problem, highly unexpected and if it happens it's most likely due
to a bug, so we should error out and turn the fs into error state so that
such issue is much more easily noticed if it's triggered.
The problem is critical because in btrfs_realloc_node() we COW tree blocks,
and using such stale transaction will lead to not persisting the extent
buffers used for the COW operations, as allocating tree block adds the
range of the respective extent buffers to the ->dirty_pages iotree of the
transaction, and a stale transaction, in the unlocked state or higher,
will not flush dirty extent buffers anymore, therefore resulting in not
persisting the tree block and resource leaks (not cleaning the dirty_pages
iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a2caab29884397e583d09be6546259a83ebfbdb1 ]
At btrfs_cow_block() we check if the block being COWed belongs to a root
that is being deleted and if so we log an error message. However this is
an unexpected case and it indicates a bug somewhere, so we should return
an error and abort the transaction. So change this in the following ways:
1) Abort the transaction with -EUCLEAN, so that if the issue ever happens
it can easily be noticed;
2) Change the logged message level from error to critical, and change the
message itself to print the block's logical address and the ID of the
root;
3) Return -EUCLEAN to the caller;
4) As this is an unexpected scenario, that should never happen, mark the
check as unlikely, allowing the compiler to potentially generate better
code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 48774f3bf8b4dd3b1a0e155825c9ce48483db14c ]
At btrfs_cow_block() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger a WARN with a message and a stack
trace. This however is a critical problem, highly unexpected and if it
happens it's most likely due to a bug, so we should error out and turn the
fs into error state so that such issue is much more easily noticed if it's
triggered.
The problem is critical because using such stale transaction will lead to
not persisting the extent buffer used for the COW operation, as allocating
a tree block adds the range of the respective extent buffer to the
->dirty_pages iotree of the transaction, and a stale transaction, in the
unlocked state or higher, will not flush dirty extent buffers anymore,
therefore resulting in not persisting the tree block and resource leaks
(not cleaning the dirty_pages iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9147b9ded499d9853bdf0e9804b7eaa99c4429ed ]
Jens reported the following warnings from -Wmaybe-uninitialized recent
Linus' branch.
In file included from ./include/asm-generic/rwonce.h:26,
from ./arch/arm64/include/asm/rwonce.h:71,
from ./include/linux/compiler.h:246,
from ./include/linux/export.h:5,
from ./include/linux/linkage.h:7,
from ./include/linux/kernel.h:17,
from fs/btrfs/ioctl.c:6:
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10:
./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here
2981 | struct btrfs_ioctl_space_args space_args;
| ^~~~~~~~~~
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10:
./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here
4341 | struct btrfs_ioctl_send_args_32 args32;
| ^~~~~~
This was due to his config options and having KASAN turned on,
which adds some extra checks around copy_from_user(), which then
triggered the -Wmaybe-uninitialized checker for these cases.
Fix the warnings by initializing the different structs we're copying
into.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b4c639f699349880b7918b861e1bd360442ec450 ]
Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this
fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~
The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1bf76df3fee56d6637718e267f7c34ed70d0c7dc ]
When running a delayed tree reference, if we find a ref count different
from 1, we return -EIO. This isn't an IO error, as it indicates either a
bug in the delayed refs code or a memory corruption, so change the error
code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is
not expected to ever happen, and change the error message to print the
tree block's bytenr without the parenthesis (and there was a missing space
between the 'block' word and the opening parenthesis), for consistency as
that's the style we used everywhere else.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 58bfe2ccec5f9f137b41dd38f335290dcc13cd5c upstream.
A user reported some issues with smaller file systems that get very
full. While investigating this issue I noticed that df wasn't showing
100% full, despite having 0 chunk space and having < 1MiB of available
metadata space.
This turns out to be an overflow issue, we're doing:
total_available_metadata_space - SZ_4M < global_block_rsv_size
to determine if there's not enough space to make metadata allocations,
which overflows if total_available_metadata_space is < 4M. Fix this by
checking to see if our available space is greater than the 4M threshold.
This makes df properly report 100% usage on the file system.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 91bfe3104b8db0310f76f2dcb6aacef24c889366 ]
If we fail to add a delayed dir index item because there's already another
item with the same index number, we print an error message (and then BUG).
However that message isn't very helpful to debug anything because we don't
know what's the index number and what are the values of index counters in
the inode and its delayed inode (index_cnt fields of struct btrfs_inode
and struct btrfs_delayed_node).
So update the error message to include the index number and counters.
We actually had a recent case where this issue was hit by a syzbot report
(see the link below).
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>